LLMs are great at giving the expected shape of an answer

Sometimes a shape is only a mirage.

Jun 11, 2026 · 3 min read

This article from epidemiologist Adam Kucharski describes a neat sanity check he conducted on Microsoft Copilot. In the first experiment, he simulated 2,000 free-text responses and labelled them “US”. Next, he copied and pasted these responses and labelled the copies “UK”. He then randomized the order and passed the combined dataset to Copilot to analyze. In the second experiment, he simulated 200 free-text responses but copied and pasted them five times, assigning five different country labels to the otherwise identical datasets before again passing them to Copilot.

In both experiments, Copilot returned a deep analysis of the differences in how participants from each country responded to the prompts. The only problem, of course, is that the responses were identical across all countries: there were no actual differences to describe.
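For anyone who wants to try a similar check, here is a minimal sketch of how the first experiment's dataset could be put together. The original post is not quoted here on how the responses were simulated, so the sentence fragments and the `build_duplicated_dataset` helper below are my own hypothetical stand-ins; only the design (same responses, different labels, shuffled order) comes from the description above.

```python
import random


def build_duplicated_dataset(responses, labels, seed=0):
    """Copy the same responses under every label, then shuffle the rows."""
    rows = [(label, text) for label in labels for text in responses]
    random.Random(seed).shuffle(rows)
    return rows


# Hypothetical stand-in for the simulated free-text responses.
_rng = random.Random(42)
_OPENERS = ["I think", "Honestly,", "In my experience"]
_TOPICS = ["the service was fine", "costs are too high", "staff were helpful"]
responses = [
    f"{_rng.choice(_OPENERS)} {_rng.choice(_TOPICS)}." for _ in range(2000)
]

# First experiment: 2,000 responses labelled "US", the same 2,000 labelled "UK".
dataset = build_duplicated_dataset(responses, ["US", "UK"])
```

The second experiment would be the same call with 200 responses and a list of five labels. By construction, any difference between groups that an analysis reports on this data is invented.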

Instead, Copilot leaned on cultural stereotypes to give the expected shape of an answer. This should not be surprising; indeed, the only thing LLMs can do is give the expected shape of an answer. The surprising thing is that these answer-shaped responses are correct and useful as often as they are.

Of course, some will object to the use of Microsoft Copilot, a famously weak and outdated model. But Copilot is a widely deployed enterprise tool, and so for many users it will be their primary exposure to AI in a work context. In this experiment, Copilot was asked, using default settings, to perform a task it is explicitly advertised as being capable of doing. Defaults are powerful, and such an insidious failure mode is sure to cause harm, especially in the hands of ordinary users who lack the understanding and intuition to detect the common failure modes of these AI systems.

This particular trick would probably be caught by thinking models through tool use. However, in a real dataset where there happened to be no particular difference between groups, you might still see the same ersatz reasoning and recourse to stereotypes even among more sophisticated models. The results would be no less fluent or convincing than if the models were describing a real effect present in the data.
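To make the tool-use point concrete, here is a sketch of the kind of check a model with code execution could run before writing a word of analysis. This is my own illustration, not something taken from the original experiment, and `groups_are_identical` is a hypothetical helper name. It assumes the data arrives as `(label, text)` pairs.

```python
from collections import Counter, defaultdict


def groups_are_identical(rows):
    """Return True if every label has exactly the same multiset of responses."""
    by_label = defaultdict(Counter)
    for label, text in rows:
        by_label[label][text] += 1
    counters = list(by_label.values())
    # Compare every group's response counts against the first group's.
    return all(c == counters[0] for c in counters[1:])


duplicated = [("US", "fine"), ("UK", "fine"), ("US", "too slow"), ("UK", "too slow")]
different = [("US", "fine"), ("UK", "fine"), ("US", "too slow"), ("UK", "great")]

print(groups_are_identical(duplicated))  # True
print(groups_are_identical(different))   # False
```

Note that this only catches exact duplication. It does nothing for the harder case described above, where the groups in a real dataset differ in wording but not in any meaningful way.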

A recent paper by Asadi et al. described similar behaviour in another context: multimodal LLMs can produce convincing image descriptions and reasoning traces even when no image is provided. The authors call this “mirage reasoning”. The paper shows that models can score highly on visual and medical benchmarks without visual input, including topping a chest X-ray QA benchmark.

A commenter on the original post described a potential real-world manifestation of this phenomenon: Ground News. If you listen to a lot of podcasts, you’ve surely heard it advertised. This service uses AI to summarize and compare countless news stories from across the globe, including describing the slant of each source and of the articles it publishes. According to the commenter (I have not verified any of these claims):

Ground News is an actual service that uses LLMs to do content analysis and it suffers from the same issues. You can test it yourself in the context of their bias analysis. You can provide it with a list of identical articles and tell it that some articles are from left-wing, center, and right-wing sources, and it will tell you something like that the left-wing sources are more concerned with social issues, center sources are straight-forward and to the point, and right-wing sources are concerned with the economic ramifications.

While LLMs are incredibly useful tools, I do sometimes feel like the responses I get are just fitting the shape of the kind of conversation I am having, and that the particulars of what I am saying don’t matter. Sort of like a very fancy ELIZA.