AI Agents in Online Surveys: Is Your Respondent Still Human?
- Yuan Ren

If you run online surveys today, there is a good chance your respondents come from an online recruitment platform. These platforms now sit behind a large share of behavioral research, shaping academic work in psychology, public health, economics, and political science. They also support consequential commercial decisions in product design and public communication.
But the rise of large language models (LLMs) has made an old concern feel newly urgent: are the people answering these surveys actually who they appear to be? Researchers now worry about two related problems. One is inattentive or dishonest humans using LLM assistance (Rilla et al., 2025). The other is automated agents generating synthetic responses at scale. A recent study by Westwood (2025) brought this second possibility into sharp focus by showing that an LLM-powered synthetic respondent could maintain a coherent persona, produce internally consistent response patterns, and evade a wide range of standard quality checks. That finding intensified debate about whether such checks can still be assumed to distinguish human respondents from sophisticated automation (Westwood, 2025).
Capability Is Not the Same as Prevalence
Andrew Gordon and his colleagues (2026) build their paper on one central distinction: proving that an AI agent can complete a survey is not the same as proving that agents are already operating at scale inside the panel ecosystems researchers actually use. In their terms, the difference is between “Simple Agentic Survey Completion” and “Ecosystem-Embedded Autonomous Agentic Survey Completion.” The first means an agent can get through a survey once. The second means it can repeatedly access real survey platforms, maintain a coherent identity, and survive within the infrastructure over time. That second version is the one that would matter most for online research, and it is also much harder to demonstrate. This is why the study focuses on two practical questions: how common are AI agents across online sample platforms, and how much does human data quality vary across different market segments?
To persist inside a real panel infrastructure, an agent would need more than good language output. It would need access to surveys, incentives to operate at scale, coordination across many attempts, and the ability to maintain a coherent identity over time. Direct first-party panels typically defend against agents through persistent profiling, identity verification, and longitudinal behavioral monitoring, making it much harder for an agent to persist than to pass a single survey instance. Marketplace aggregators create a different kind of barrier because their supply sources are decentralized. Combined with relatively modest survey compensation (Hornuf and Vrankar, 2022), these structural features may limit the economic incentives for large-scale agent deployment.
The study also keeps autonomous agents separate from LLM-assisted human responding (Rilla et al., 2025). That distinction matters. A human using ChatGPT to help answer an open-ended question may create measurement problems, but that is not the same thing as an autonomous agent completing entire surveys at scale. Previous studies, such as Chen et al. (2026), have used behavioral markers like typing cadence or reverse shibboleths to estimate agent prevalence. The problem is that those markers can also capture LLM-assisted humans, not just truly autonomous agents.

How the Study Tested AI Agent Prevalence
So how did the researchers test whether this threat was actually appearing in real samples? They recruited 5,200 respondents from 10 platforms across three major segments of the online market: direct first-party panels, including Prolific, CloudResearch Connect, Verasight, and Amazon Mechanical Turk; hybrid managed networks, including Prodege and Dynata; and marketplace platforms, including Qualtrics Panels, Prime Panels, PureSpectrum, and Cint.
Before running the main study, the team validated its detection method in a pilot study. They tested an automated environment detection method against 125 submissions from ChatGPT, Claude, Gemini, Perplexity, and custom agent pipelines. The primary detection method used the Prolific Authenticity Checker, which monitors signatures of automated scripts, emulators, and AI-driven browser tools through JavaScript instrumentation. It also uses dynamic behavioral analysis to distinguish human engagement from programmatically simulated events. In pilot testing, the method achieved perfect discrimination, with AUC = 1.000. It identified all 125 agent submissions and produced zero false positives among 124 human respondents.
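To make the pilot numbers concrete, here is a minimal Python sketch of how sensitivity, specificity, and AUC follow from the reported counts. It assumes each submission carries a simple binary flag (1 = flagged, 0 = not flagged); the article does not describe the checker's actual scoring, so the function names and the binary encoding are illustrative assumptions only.

```python
def confusion_metrics(true_pos, false_neg, false_pos, true_neg):
    """Return sensitivity and specificity from raw confusion-matrix counts."""
    return {
        "sensitivity": true_pos / (true_pos + false_neg),
        "specificity": true_neg / (true_neg + false_pos),
    }


def auc_from_scores(agent_scores, human_scores):
    """AUC as the share of agent/human pairs ranked correctly (ties count half)."""
    wins = 0.0
    for a in agent_scores:
        for h in human_scores:
            if a > h:
                wins += 1.0
            elif a == h:
                wins += 0.5
    return wins / (len(agent_scores) * len(human_scores))


# Pilot counts reported in the article: all 125 agent submissions flagged,
# zero of 124 human respondents flagged.
pilot = confusion_metrics(true_pos=125, false_neg=0, false_pos=0, true_neg=124)

# Assumed binary encoding of the detection flag (not the tool's real output).
agent_flags = [1] * 125
human_flags = [0] * 124
pilot_auc = auc_from_scores(agent_flags, human_flags)

print(pilot)      # {'sensitivity': 1.0, 'specificity': 1.0}
print(pilot_auc)  # 1.0
```

Under this encoding, perfect separation of the two groups is exactly what an AUC of 1.000 means: every agent submission ranks above every human submission.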
The researchers also used a secondary battery of six behavioral indicators. These included mouse movement counts, flagged if fewer than 189 movements were recorded; reCAPTCHA v3 scores, flagged if below 0.7; fraud risk classifications from Research Defender; and three cognitive trap items designed to exploit LLM reasoning failures (Affonso, 2026).
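The secondary battery can be pictured as a set of simple per-respondent rules. The sketch below uses the two thresholds stated above (189 mouse movements, a reCAPTCHA v3 score of 0.7); everything else is an assumption for illustration, including the `Respondent` fields, the boolean summary of the Research Defender classification, and the way trap items are encoded. The article does not say how the six indicators were combined, so this code only lists which ones fire.

```python
from dataclasses import dataclass
from typing import List, Tuple

MOUSE_MOVEMENT_MIN = 189  # flagged if fewer movements were recorded
RECAPTCHA_MIN = 0.7       # flagged if the reCAPTCHA v3 score is below this


@dataclass
class Respondent:
    mouse_movements: int
    recaptcha_score: float
    # Hypothetical boolean summary of the Research Defender fraud classification.
    fraud_risk_flagged: bool
    # Hypothetical encoding: True means the respondent fell for that trap item.
    trap_items_triggered: Tuple[bool, bool, bool]


def secondary_flags(r: Respondent) -> List[str]:
    """Return the names of the secondary indicators that fire for one respondent."""
    flags = []
    if r.mouse_movements < MOUSE_MOVEMENT_MIN:
        flags.append("low_mouse_movement")
    if r.recaptcha_score < RECAPTCHA_MIN:
        flags.append("low_recaptcha")
    if r.fraud_risk_flagged:
        flags.append("fraud_risk")
    for i, triggered in enumerate(r.trap_items_triggered, start=1):
        if triggered:
            flags.append(f"trap_item_{i}")
    return flags


# Made-up respondent for illustration only.
example = Respondent(
    mouse_movements=42,
    recaptcha_score=0.9,
    fraud_risk_flagged=False,
    trap_items_triggered=(False, True, False),
)
print(secondary_flags(example))  # ['low_mouse_movement', 'trap_item_2']
```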
What the Results Actually Showed
The headline result is surprisingly restrained: outside MTurk, possible AI agent detections were rare. Across the full sample of 5,200 respondents, the primary detection flag identified only 2.37%, or 123 individuals, as potential agents. And those detections were concentrated almost entirely on Amazon MTurk.
MTurk was the exception. Its filtered sample, even with a high approval-rate requirement, showed an 11.2% primary detection rate; the unfiltered sample reached 16%. In contrast, all other platforms had primary detection rates below 2%, with most below 1%. Filtered samples from CloudResearch Connect and Prolific showed rates of 0% and 0.2%, respectively. Once MTurk was excluded, detection rates did not significantly differ between direct platforms at 0.45%, hybrid platforms at 0.38%, and marketplace platforms at 0.12%. These findings provide no compelling evidence of meaningful AI agent infiltration in the sampling environments examined in this study, and they are consistent with the idea that platform-level structural barriers may limit large-scale deployment.
But the most interesting part was not just how many responses were flagged. It was what those flagged responses looked like. To understand this, the researchers deployed “veridical agents,” meaning real LLM agents such as ChatGPT and Perplexity, as a benchmark. Genuine LLM agents performed at or above human levels on nearly every measure, including comprehension, honesty, engagement, and open-ended response quality. The flagged respondents in the actual sample looked very different. They performed significantly worse than both humans and veridical agents, especially in open-ended writing quality (d = -1.32), within-subject consistency (d = -1.06), and between-subject consistency (d = -1.58).
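For readers less familiar with the effect sizes quoted above, Cohen's d expresses a difference between two group means in units of standard deviation. The sketch below uses the common pooled-standard-deviation form; the article does not state which variant the authors used, so treat that choice as an assumption. The scores are made up purely to show the arithmetic and are not study data.

```python
from math import sqrt
from statistics import mean, variance


def cohens_d(group_a, group_b):
    """Cohen's d with a pooled standard deviation (group_a minus group_b)."""
    n_a, n_b = len(group_a), len(group_b)
    pooled_var = (
        (n_a - 1) * variance(group_a) + (n_b - 1) * variance(group_b)
    ) / (n_a + n_b - 2)
    return (mean(group_a) - mean(group_b)) / sqrt(pooled_var)


# Invented scores for illustration only.
flagged_scores = [2.0, 3.0, 4.0]
human_scores = [4.0, 5.0, 6.0]

d = cohens_d(flagged_scores, human_scores)
print(d)  # -2.0
```

A negative d means the first group scored lower than the second. Values beyond about -0.8 are conventionally read as large, so the reported gaps of -1.06 to -1.58 are substantial.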
That profile matters. Poor performance, fast completion times, and low-quality open-ended answers are hard to reconcile with modern LLM behavior. They look much more like traditional scripted bots (Varol et al., 2017). For that reason, the elevated detection rates on MTurk likely reflect the resurfacing of a well-documented, pre-existing bot problem (Chmielewski and Kucker, 2020; Kennedy et al., 2020), rather than a novel LLM-agent invasion. The data indicate that while sophisticated LLM agents are technically capable of completing surveys, they were not prevalent in the platforms examined here.

If AI agents were not the dominant problem in this study, what was?
The answer was much less futuristic: human respondent quality. The researchers assessed human data quality across seven behavioral dimensions: attention, measured through four embedded checks; honesty, measured through a fictitious brand task; comprehension, measured through instructional recall; engagement, measured through active window focus; within-subject consistency; between-subject consistency, measured through a Pictionary rating task; and open-ended writing quality.
The pattern was hard to miss. Direct panels performed best, with an average quality score of 0.84. Hybrid platforms followed at 0.72, and marketplace platforms came in lower at 0.68. This hierarchy was especially clear for comprehension, attention, and open-ended writing quality. The four highest-scoring platforms were all direct panels: filtered and unfiltered Connect and Prolific. MTurk samples, by contrast, fell below all marketplace platforms.
The authors’ interpretation is that this is not just about one good or bad vendor. It reflects the structure of the recruitment market itself. Direct panels maintain proprietary respondent pools, persistent identity verification, and longitudinal monitoring. Respondents also have incentives to maintain high-quality participation records within a single managed ecosystem.
The metadata supports this interpretation. Direct panel respondents had a median of only 1 survey attempt in the previous 24 hours. Hybrid respondents had a median of 12, and marketplace respondents had a median of 13. In plain terms, some platforms appear to rely on respondents who are taking many surveys in a short period of time. That kind of high-volume survey activity is not an ideal condition for careful, thoughtful responding.
Device type also mattered. In direct panels, 75.3% of respondents used desktops, compared with only 29.1% in marketplaces. Desktop users consistently outperformed mobile users on tasks requiring sustained cognitive effort, such as comprehension, attention, and writing. This does not mean mobile respondents are inherently poor respondents. It does suggest that platform composition, including the devices respondents use, can shape the quality of the data researchers receive.

The study also examined LLM assistance among human respondents. By analyzing linguistic markers in open-ended responses, the researchers found that LLM use among humans was slightly higher in direct panels at 2.2% than in hybrid platforms at 1.0% or marketplace platforms at 1.1%. This pattern was driven mainly by unfiltered samples from Connect at 4.1% and Prolific at 3.0%. Absent mitigation strategies such as blocking copy-paste, rates at this level are consistent with prior research (Veselovsky et al., 2023; Westwood and Frederick, 2026). Cheating behavior, measured as switching tabs during general knowledge questions, was low across platforms at around 4–5%.
The cost results make the platform-quality problem even harder to ignore. The study used the “Cost Per Quality Response” (CPQR) metric to show how much a usable, high-quality response actually costs. PureSpectrum looked cheap at first glance, with a raw cost of $1.73 per respondent. But once the authors applied a 90% quality threshold, its CPQR was $24.47 because only 7.1% of respondents met the standard. Direct panels such as CloudResearch Connect and Prolific were more cost-efficient at high quality thresholds despite higher nominal costs. At the 90% threshold, the average CPQR for direct platforms was $8.26, compared with $74.43 for marketplace platforms. In other words, choosing a platform based only on sticker price can badly underestimate the true cost of valid data.
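The PureSpectrum figures suggest a simple reading of CPQR: the raw cost per respondent divided by the share of respondents who clear the quality threshold. That formula is an assumption inferred from the numbers quoted here, not a definition taken from the paper, and the function name is illustrative.

```python
def cost_per_quality_response(cost_per_respondent, pass_rate):
    """Assumed CPQR: raw cost divided by the share of respondents meeting the threshold."""
    if not 0 < pass_rate <= 1:
        raise ValueError("pass_rate must be in (0, 1]")
    return cost_per_respondent / pass_rate


# PureSpectrum figures quoted in the article: $1.73 raw cost, 7.1% passing at the 90% threshold.
cpqr = cost_per_quality_response(1.73, 0.071)
print(round(cpqr, 2))  # 24.37

# Pass rate implied by the reported CPQR of $24.47.
implied_pass_rate = 1.73 / 24.47
print(round(implied_pass_rate * 100, 1))  # 7.1
```

The result lands close to, though not exactly on, the reported $24.47. The small gap would be expected if the 7.1% pass rate is itself a rounded figure, since the pass rate implied by $24.47 also rounds to 7.1%.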
The Takeaway for Online Survey Research
In this study, AI agents were rare outside MTurk. The bigger problem was more ordinary, and more actionable: human data quality varied sharply by platform type. The takeaway is simple: before worrying only about whether your respondents are AI, look carefully at where your human respondents are coming from.
References:
Affonso, F. M. (2026). Brief Commentary: A Framework for Detecting AI Agents in Online Research. Journal of Consumer Research. https://doi.org/10.1093/jcr/ucag006
Chen, S., Urminsky, O., Zhang, G., Walatka, R., Fernandez, K., Low, A., Bogard, J., & Fox, C. R. (2026). Estimating the threat of AI-agent responding across online survey platforms. PsyArXiv. https://doi.org/10.31234/osf.io/xcg26_v1
Chmielewski, M., & Kucker, S. C. (2020). An MTurk Crisis? Shifts in Data Quality and the Impact on Study Results. Social Psychological & Personality Science, 11(4), 464–473. https://doi.org/10.1177/1948550619875149
Gordon, A., Rothschild, D., Affonso, F. M., Sulik, J., Hauser, D., Pepin, K., & Jones, S. (2026). AI agent prevalence and data quality across multiple online sample providers. PsyArXiv. https://doi.org/10.31234/osf.io/pvdjr_v2
Hornuf, L., & Vrankar, D. (2022). Hourly Wages in Crowdworking: A Meta-Analysis. Business & Information Systems Engineering, 64(5), 553–573. https://doi.org/10.1007/s12599-022-00769-5
Kennedy, R., Clifford, S., Burleigh, T., Waggoner, P. D., Jewell, R., & Winter, N. J. G. (2020). The shape of and solutions to the MTurk quality crisis. Political Science Research and Methods, 8(4), 614–629. https://doi.org/10.1017/psrm.2020.6
Rilla, R., Werner, T., Yakura, H., Rahwan, I., & Nussberger, A.-M. (2025). Recognising, Anticipating, and Mitigating LLM Pollution of Online Behavioural Research. arXiv.
Varol, O., Ferrara, E., Davis, C. A., Menczer, F., & Flammini, A. (2017). Online Human-Bot Interactions: Detection, Estimation, and Characterization. arXiv.
Veselovsky, V., Ribeiro, M. H., Cozzolino, P., Gordon, A., Rothschild, D., & West, R. (2023). Prevalence and prevention of large language model use in crowd work. arXiv.
Westwood, S. J. (2025). The potential existential threat of large language models to online survey research. Proceedings of the National Academy of Sciences, 122(47), e2518075122. https://doi.org/10.1073/pnas.2518075122
Westwood, S. J., & Frederick, S. (2026). Reply to Van der Stigchel et al.: Empirical evidence that AI survey contamination is real and substantial. Proceedings of the National Academy of Sciences, 123(8), e2537420123. https://doi.org/10.1073/pnas.2537420123



