When bad data looks good

AI-assisted fraud is blending into survey data. Learn how tech-enabled respondents bypass traditional cleaning—and what research teams must do to detect it.

In the most recent issue of Quirk’s magazine, Rep Data’s Steven Snell and Vignesh Krishnan examine how AI-assisted and tech-enabled fraud now blends into online survey datasets.

The article opens with a practical question: What happens when fraudulent data no longer looks fraudulent? As Krishnan and Snell explain, traditional fraud patterns—speeders, contradictory responses, one-word open ends—have given way to coordinated, technically sophisticated activity designed to resemble legitimate respondents.

Fraud that “looks good” in the data

A central point in their article, “When bad data looks good,” is that AI tools and developer frameworks reduce the effort required to generate coherent, passable survey responses. The authors describe how bad actors use emulators, manipulated device signals, web proxies and automation frameworks to enter surveys at scale.

Rather than focusing only on response content, the article highlights how technical manipulation occurs at the device and browser level. When evaluation rests primarily on what respondents say, rather than how they enter and complete the survey, sophisticated fraud can remain embedded in the dataset.

Comparative evidence across six sample sources

For the piece, Krishnan and Snell report findings from a comparative study of six online consumer sample providers. Using Rep Data’s Research Defender digital fingerprinting and holding survey design constant, they observed tech-enabled fraud rates ranging from 14% to 20% across sources. The article further notes that when additional markers such as duplicate entrants and hyperactive respondents were included, total recommended fraud blocks ranged from 25% to 42%.

When fraud passes traditional cleaning

A second study described in “When bad data looks good” tests whether conventional data cleaning can independently identify fraudulent respondents. In this case, 33% of entrants were flagged as suspicious by Research Defender but were intentionally allowed into the survey.

After applying machine learning-powered, human-supervised data cleaning, 27% of respondents were removed for inattention or poor quality. The overlap analysis presented in the article shows that only 50% of respondents were both qualified and attentive. Twenty-three percent of respondents had been flagged as fraud at entry yet showed none of the traditional quality markers; in other words, nearly 70% of the fraud identified at the front end (23 of the 33 percentage points flagged at entry) blended into the dataset during cleaning.
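To illustrate this kind of overlap analysis, here is a minimal sketch. The DataFrame and column names are hypothetical, not the study's actual data; it simply shows how entry-level fraud flags can be cross-tabulated against post-field cleaning to find the share of fraud that cleaning alone would have kept.

```python
import pandas as pd

def overlap_summary(df: pd.DataFrame) -> dict:
    """Summarize how entry-level fraud flags overlap with post-field cleaning.

    Expects two boolean columns (names are illustrative):
      fraud_at_entry  - flagged by digital fingerprinting before the survey
      failed_cleaning - removed by the post-field quality cleaning pass
    """
    fraud = df["fraud_at_entry"]
    failed = df["failed_cleaning"]
    return {
        "flagged_at_entry": fraud.mean(),                            # ~33% in the article
        "removed_in_cleaning": failed.mean(),                        # ~27%
        "qualified_and_attentive": (~fraud & ~failed).mean(),        # ~50%
        # Entry-flagged respondents that cleaning alone would have retained:
        "fraud_missed_by_cleaning": (fraud & ~failed).mean() / fraud.mean(),  # ~70%
    }
```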

Paradata as a detection layer

Effective detection depends on paradata (information about how responses are generated). The article references indicators such as manipulated WebRTC configurations, automation tools like Selenium and Playwright, device signals associated with LLM usage, programmatic typing behavior and high-frequency survey participation. The authors show that these technical signals provide stronger evidence of coordinated fraud than response content alone.
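As a concrete illustration of scoring such signals, here is a short sketch. The field names, thresholds and scoring are hypothetical and are not Research Defender's implementation; they stand in for the kinds of paradata the article describes.

```python
from dataclasses import dataclass

@dataclass
class Paradata:
    webdriver_flag: bool       # browser reports navigator.webdriver (Selenium/Playwright)
    webrtc_ip_mismatch: bool   # WebRTC-reported IP differs from the connection IP (proxying)
    median_keypress_ms: float  # typing cadence in open-ended responses
    surveys_last_24h: int      # participation frequency across the ecosystem

def paradata_risk(p: Paradata) -> int:
    """Count independent technical signals of coordinated, tech-enabled entry."""
    signals = [
        p.webdriver_flag,            # browser driven by an automation framework
        p.webrtc_ip_mismatch,        # manipulated network configuration
        p.median_keypress_ms < 15,   # programmatic typing, faster than human cadence
        p.surveys_last_24h > 50,     # hyperactive participation
    ]
    return sum(signals)
```

A respondent accumulating several of these signals can be blocked at entry even when their answers read as coherent and on-topic, which is the gap that content-only cleaning leaves open.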

Why the article matters for research teams

Throughout “When bad data looks good,” Krishnan and Snell return to the operational impact of undetected fraud, which introduces bias into areas such as brand ratings, health and policy research, and political polling. They argue the industry needs tech-forward defenses that target automation frameworks, LLM usage signals and identity spoofing at the point of entry, supported by ecosystem-level monitoring.

For a detailed review of the methodology, figures and referenced academic research, read the full article in Quirk’s magazine.