Synthetic Data in Rare Disease Conversations

Synthetic data is showing up more often in rare disease conversations. That’s not surprising.

When real-world data is sparse, fragmented, or slow to access, synthetic data can be a useful additional signal. It can help with early method development, simulation, and pressure-testing edge cases that barely appear in real datasets.

But it’s still an additional signal, not a replacement for real-world data.

In many rare disease use cases, a patient data ecosystem already exists: claims, EMR, registries, and chart review, sometimes linked together over years. These datasets are imperfect, but they carry real care pathways, real practice variation, and real operational context.

Synthetic data doesn’t plug into that ecosystem easily.

Synthetic data is especially hard to use for:

– longitudinal outcomes tied to real utilization
– regulatory or evidence generation workflows
– questions where patient history, sequencing, and care transitions matter
– decisions that depend on how data is actually captured, coded, and acted on

There are also practical questions around acquisition and use. Who generates the synthetic data, based on what source distributions, and with what assumptions baked in? Those choices matter more than the volume.

From a data strategy perspective, the most realistic role for synthetic data today is supportive. It extends thinking. It accelerates exploration. It helps scarce real-world data go further.

The work is in knowing where the boundary between supporting real-world data and substituting for it lies. Where have you seen synthetic data add value without breaking alignment with existing real-world data systems?