Why Synthetic Survey Data Isn't Real Data

Last updated: 10 Mar 2026

Synthetic Data Isn't Really Data

The AI hype cycle is in full swing, and market research hasn't been spared. Vendors are pitching synthetic survey data, meaning AI-generated datasets built from large language model responses, as a faster, cheaper alternative to asking real people real questions. It sounds compelling. It isn't.

In one of Sawtooth's most-attended webinars ever, Chris Chapman, former researcher at Google, Microsoft, and Amazon, and founder of the Quantitative User Experience Association, made a rigorous, three-part case for why synthetic survey data doesn't hold up. He argued that the concept, rather than the technology, is fundamentally broken.

Here's why it's worth understanding before accepting the next vendor's pitch at face value.

What Is Synthetic Survey Data, Exactly?

Before diving into the critique, it helps to define the thing being critiqued. Synthetic survey data, in this case, means survey responses obtained from a large language model (LLM) — like ChatGPT or similar systems — that are then used in place of, or alongside, responses from actual human participants. The resulting synthetic dataset is then treated as a replacement for real survey data.

The typical workflow looks like this:

  1. Start with existing human data: census data, a prior study, or a small collected sample.
  2. Build demographic "digital twins": prompt templates that describe a persona based on that human data.
  3. Have the LLM "take" the survey: feed each twin's details into the model and collect its responses.
  4. Compare AI responses to human responses, and use any similarity as evidence that the approach works.
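
To make that workflow concrete, here is a minimal sketch of steps 2 and 3 in Python. It illustrates the general pattern rather than any specific vendor's pipeline: the persona fields, the prompt wording, and the query_llm placeholder are all assumptions standing in for whatever model API a given tool would actually call.

```python
# Illustrative sketch of the "digital twin" workflow described above.
# query_llm is a placeholder for a real model API call; the personas and
# question are hypothetical examples, not data from any actual study.

personas = [
    {"age": 34, "gender": "female", "region": "Midwest", "income": "50-75k"},
    {"age": 61, "gender": "male", "region": "South", "income": "25-50k"},
]

QUESTION = "How likely are you to purchase this product in the next 6 months? (1-5)"

def build_prompt(persona: dict, question: str) -> str:
    """Turn a persona description plus a survey question into an LLM prompt."""
    profile = ", ".join(f"{k}: {v}" for k, v in persona.items())
    return (
        f"You are a survey respondent with this profile: {profile}.\n"
        f"Answer the following question as that person would.\n{question}"
    )

def query_llm(prompt: str) -> str:
    """Placeholder for a call to a hosted LLM; returns a canned answer here."""
    return "3"

# Step 3: have the model "take" the survey for each digital twin.
synthetic_responses = [query_llm(build_prompt(p, QUESTION)) for p in personas]
print(synthetic_responses)
```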

Vendors often point to cases where these results look similar as proof the approach is valid. Chapman's argument is that this proof doesn't prove what people think it does.

There's No Scientific Reason to Expect Synthetic Datasets to Work

The first crack Chapman finds in the synthetic data case is one most researchers haven't fully reckoned with: there is no theoretical basis for assuming LLM responses will resemble human survey responses.

When researchers ask people a survey question, Chapman notes, they're following a well-understood process — identify a business question, reach a defined population, collect responses. It generalizes. There's something knowable about why it works.

With synthetic data, the process involves compiling billions of documents, training a model, building digital twins, and crafting prompt templates — all in the hope that this wildly different path produces something comparable to asking actual people. In Chapman's view, there's no a priori reason to believe it will.

"There's no theoretical reason it should work, and we should assume that it is presumptively wrong for interesting and especially novel and actionable questions." — Chris Chapman, Founder, Quantitative User Experience Association

Chapman offered a bracing sanity check. If LLMs could reliably predict human preferences, savvy researchers would use them to trade stocks. The vendors building these systems would be making fortunes on their own models rather than selling access to them. The fact that they aren't tells us something important about what those systems can and can't actually do.

Standard Statistical Tests Don't Apply — And Vendor Studies Often Rely on Them

This is where Chapman's argument gets more technical, and arguably more important.

When researchers compare LLM outputs to human data, they typically reach for familiar statistical tools: chi-square tests, p-values, mean absolute error. Vendor studies often present these comparisons between synthetic datasets and human samples as validation. Chapman's contention is that these tools rest on assumptions that simply don't hold for LLM-generated responses.

The Statistics Require Conditions That LLMs Can't Meet

According to Chapman, statistical inference requires random sampling from a defined, enumerable population. In principle, one could enumerate the world's 8.3 billion people and draw a random sample. There's no equivalent for LLMs. A model can be prompted many times, but those outputs aren't random samples from any definable population; they're products of training data, prompt wording, and model architecture.

The null hypothesis in these comparisons is typically "no difference between human and LLM responses." Chapman points out that the sources are already known to be different: different stimuli, different cognitive mechanisms, different motivations. Assuming no difference isn't a neutral starting point. It's presumptively wrong before a single data point is collected.
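
To make the point concrete, the sketch below runs a standard chi-square comparison between an invented human response distribution and an invented set of repeated LLM outputs. Both sets of counts are made up for illustration; the takeaway is simply that the test returns a statistic and a p-value whether or not the "sample" in the second row satisfies the random-sampling assumptions the test presumes.

```python
# Hypothetical comparison of human vs. LLM response counts on a 5-point scale.
# The counts are invented; the test runs regardless of whether the "LLM sample"
# meets the random-sampling assumptions that give the p-value its meaning.
from scipy.stats import chi2_contingency

human_counts = [120, 310, 250, 220, 100]  # responses 1..5 from a real sample
llm_counts   = [ 60, 180, 420, 250,  90]  # responses 1..5 from repeated prompts

stat, p_value, dof, _ = chi2_contingency([human_counts, llm_counts])
print(f"chi-square = {stat:.1f}, p = {p_value:.4f}, dof = {dof}")
# The formula executes either way; validity of the inference is a separate question.
```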

What the Empirical Data Actually Shows

Chapman pulled real data from a Morris and Verasight study examining LLM responses to presidential approval questions, perhaps the most heavily polled survey topic in the world. The findings were striking:

  • One model's net approval measure was off by 30 points compared to human responses.
  • Mean absolute error across response categories reached 22 points (see the sketch after this list for how these metrics are computed).
  • Models from the same company disagreed with each other dramatically — one gave "strongly disapprove" at 60%, another at 19%, for the identical prompt.
  • Models performed worst on underrepresented demographic groups — the exact populations vendors often claim LLMs can better reach.
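
For readers less familiar with those two metrics, here is a small worked example with invented percentages (not the figures from the study): net approval is the approve share minus the disapprove share, and mean absolute error averages the absolute gaps between the human and LLM shares across the response categories.

```python
# Worked example of the two metrics cited above, using invented percentages
# rather than the actual figures from the study.

human = {"strongly approve": 15, "approve": 25, "disapprove": 25, "strongly disapprove": 35}
llm   = {"strongly approve": 30, "approve": 30, "disapprove": 25, "strongly disapprove": 15}

def net_approval(shares: dict) -> float:
    """Approve share minus disapprove share, in percentage points."""
    approve = shares["strongly approve"] + shares["approve"]
    disapprove = shares["strongly disapprove"] + shares["disapprove"]
    return approve - disapprove

def mean_absolute_error(a: dict, b: dict) -> float:
    """Average absolute gap between two response distributions, per category."""
    return sum(abs(a[k] - b[k]) for k in a) / len(a)

print(net_approval(human), net_approval(llm))  # -20 vs. +20: a 40-point gap
print(mean_absolute_error(human, llm))         # average gap per category
```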

"If these models don't perform well in this space, there's no reason we would expect them to perform better in our typical UX and marketing cases — where we might have a novel product or novel topic, where there's not prior training data." — Chris Chapman, Founder, Quantitative User Experience Association

For Chapman, the math still runs on these comparisons — the formulas work. But without the underlying conditions for valid statistical inference, the output is math without meaning. It's possible to compute the difference between two numbers that have no business being compared.

Surveys Aren't About Measuring a "True Score" — And LLMs Can't Fake What Isn't There

Chapman's third argument goes deeper than statistics. It's about what surveys are actually for.

He identifies a tempting but misleading model embedded in much of the synthetic data conversation: that somewhere in a respondent's head lives a "true" value — a real degree of brand liking, a real purchase intent — and the researcher's job is to build a precise enough instrument to extract it. If that were true, maybe an AI could approximate it without involving actual humans.

But Chapman argues this model of survey research is wrong from the start.

"The opposite of a true score on a survey is not a false score. Actual responses do not live in people's heads. People do not go around with a liking of a brand. Rather, they construct the response of what they want to say about my brand when I ask them." — Chris Chapman, Founder, Quantitative User Experience Association

In Chapman's framing, responses aren't stored values waiting to be retrieved. They're constructed in the moment — shaped by question wording, the respondent's mood, their reason for taking the survey, and countless contextual factors. Change any of those, and the response may shift meaningfully. That's not noise to be minimized. That's how human opinion actually works.

This, Chapman argues, reframes the whole enterprise. A well-designed survey isn't a measurement instrument pointed at a fixed target. It's a designed interaction — crafted to align respondent motivations with the business question at hand. LLMs don't have motivations. They don't have business context. They're trained to produce outputs that humans find plausible and satisfying — which, ironically, makes them more likely to validate whatever idea they're asked about, whether it deserves it or not.

What About the Common Arguments in Favor of Synthetic Data?

Chapman addressed several rebuttals directly. Researchers in this space will encounter all of them.

  • Claim: "It speeds up research."
    The problem: Speed only helps if the answers are right. There's no basis for assuming speed and correctness here.
  • Claim: "LLMs can set priors for Bayesian models."
    The problem: This confuses information with math. The model will run, but that doesn't mean the priors represent real information.
  • Claim: "Use it to pre-test surveys."
    The problem: Random responses do this better, since they'll hit every branch of your survey without introducing bias (a minimal sketch follows this list).
  • Claim: "The technology will keep improving."
    The problem: Most technologies don't survive long enough to improve. Survivorship bias makes progress look inevitable in hindsight.
  • Claim: "I'll be left behind if I don't adopt it."
    The problem: That's an organizational question, not a research one. It deserves to be evaluated differently.
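
On the pre-testing point, the value of random responses is easy to illustrate. The sketch below is a generic example, not a feature of any particular survey platform: uniform random answers exercise every branch of a simple skip-logic questionnaire during testing without encoding any opinion at all.

```python
# Generic illustration of pre-testing a survey with random responses: uniform
# random answers exercise every branch of the skip logic without introducing
# any systematic opinion. The questions and skip logic are hypothetical.
import random

def simulate_respondent(rng: random.Random) -> dict:
    answers = {}
    answers["owns_product"] = rng.choice(["yes", "no"])
    if answers["owns_product"] == "yes":
        # Branch shown only to owners.
        answers["satisfaction"] = rng.randint(1, 5)
    else:
        # Branch shown only to non-owners.
        answers["purchase_intent"] = rng.randint(1, 5)
    return answers

rng = random.Random(42)
test_runs = [simulate_respondent(rng) for _ in range(1000)]

# Confirm both branches were reached during the pre-test.
print(sum("satisfaction" in r for r in test_runs), "owner paths")
print(sum("purchase_intent" in r for r in test_runs), "non-owner paths")
```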

What Should Researchers Do Instead?

The answer isn't to do nothing. Chapman's argument isn't that research is hard, so give up. It's that LLMs bypass the core purpose of surveys: learning from people.

Working with a small sample? Chapman recommends supplementing with qualitative research — interviews, open-ends, mixed methods — rather than padding with AI-generated responses.

Need stakeholder buy-in before fieldwork closes? Constructing hypothetical outcome scenarios based on prior research is a more defensible approach. Walking stakeholders through "what if they love it / what if they don't" is more useful than an LLM guess dressed up as data.

Feeling pressure to use AI in the research stack? There are legitimate uses of AI in research workflows — cleaning data, coding open-ends, automating reporting. Replacing respondents isn't one of them.

"Instead of hoping that LLMs can magically take surveys for us, we need to rely on survey science — iterating, learning from people in multiple modes, such as qualitative as well as quantitative research." — Chris Chapman, Founder, Quantitative User Experience Association

In Chapman's view, the standard for good research isn't whether the math runs. It's whether the insights support better decisions. Synthetic data can't clear that bar when there's no way to know — for any given project — whether it's right or wrong.

The Bottom Line

According to Chapman, synthetic survey data has three compounding problems:

  • No theoretical basis — there's no reason to expect a synthetic dataset generated by an LLM to simulate human survey responses.
  • No valid statistics — the tools used to validate it rely on assumptions that don't apply to LLM outputs.
  • No alignment with what surveys do — surveys aren't measuring fixed "true scores." They're designed conversations with real people.

Not all three need to land for the argument to hold; any one of them is enough to warrant serious skepticism.

Real research means asking real people. The infrastructure to do that well — thoughtful survey design, appropriate sampling, mixed methods — is exactly what Sawtooth is built to support. Explore Sawtooth's platform at discover.sawtoothsoftware.com, or watch the full webinar for the complete argument, including the empirical data and code references Chapman walked through.