Anthropic's statistical analysis skill doesn't get statistical significance quite right · ↗ github.com

Feb 6, 2026 · 2 min read

Anthropic’s new statistical analysis skill demonstrates a common misunderstanding of statistical significance:

Statistical significance means the difference is unlikely due to chance.

But this phrasing isn’t quite right. The p-value in Null Hypothesis Significance Testing is not about the probability the results are “due to chance”; it is the probability—under the null hypothesis and the model assumptions—of observing results at least as extreme as the ones we obtained. In other words, the p-value summarizes how compatible the data are with the null, given our modelling choices. What it does not tell you is the probability that the null hypothesis is true.

Statistician Andrew Gelman gave a good definition for statistical significance in a 2015 blog post:

A mathematical technique to measure the strength of evidence from a single study. Statistical significance is conventionally declared when the p-value is less than 0.05. The p-value is the probability of seeing a result as strong as observed or greater, under the null hypothesis (which is commonly the hypothesis that there is no effect). Thus, the smaller the p-value, the less consistent are the data with the null hypothesis under this measure.

As some of the commenters in this blog post observe, simply being able to parrot a technically accurate definition of a p-value does not necessarily make us better at applying statistical significance in practice. It is certainly true that statistical significance is widely misused in scientific publishing as a threshold to distinguish signal from noise (or to be fancy, a “lexicographic decision rule”), which is why some scientists have argued that we should abandon it as the default statistical paradigm for research.

…

The CIA World Factbook has been memory holed · ↗ simonwillison.net

Feb 5, 2026

Another staple of my childhood is gone, this time the CIA’s World Factbook. I have fond memories of consulting the World Factbook for school projects in my elementary school computer lab. But as of yesterday, the entire publication along with all of its archives have been suddenly and unceremoniously wiped from the agency’s website. At least archives of the website are still available on the Internet Archive, with complete zip files up to 2020 and Wayback Machine snapshots thereafter.

Guinea worm one step closer to eradication · ↗ www.cartercenter.org

Feb 4, 2026 · 2 min read

Only 10 cases of guinea worm were reported in 2025, down from an estimated 3.5 million cases per year when the elimination campaign began four decades ago. The disease is an ancient one, believed by some to be the “fiery serpents” that beset the ancient Israelites in The Book of Numbers. It is treated by carefully wrapping the parasite around a small stick as it painfully emerges over the course of weeks. This may be the inspiration for the Staff of Asclepius (⚕), the predominant symbol of medicine showing a a serpent wrapped around a rod.

When I was studying mathematical modelling of infectious diseases at the University of Ottawa in the mid 2010s, the question was whether Jimmy Carter would outlive the guinea worm. Tragically, he did not, but his life’s work helped to prevent an estimated 100 million cases of the disabling disease and made him a hero in global health.

While we are within spitting distance of zero cases in humans, true eradication will be more difficult due to significant animal reservoirs of the disease. The press release notes nearly 700 reported cases in animals across six countries (and who knows how many unreported cases). These non-human reservoirs pose a significant barrier to true eradication, since the disease must die out not only in human populations but also in wildlife.

…

msgvault: A personal email archive and search system to watch · ↗ wesmckinney.com

Feb 3, 2026

Here’s a new project to watch if you are interested in taking control of your email: msgvault. The tool provides a local, searchable version of all of your Gmail messages and attachments, backed by SQLite and DuckDB.

The author, Wes McKinney, says he may add support for other email services in the future, as well as WhatsApp, iMessage, and SMS. I’ll probably look into it for myself once the project matures a little. Although given that it stores everything in a single giant database file, it won’t fit into my standard backup strategy of versioned, incremental backups. Still, it could be a nice step forward in regaining control over my email archives.

^{Hat tip to j4mie on HackerNews.}

The Divergent Association Task, a measure for creativity · ↗ www.pnas.org

Feb 2, 2026

The Divergent Association Task is a short, simple test introduced in 2021 claiming to measure creativity. Taking only a minute and a half, it asks participants to “generate 10 nouns that are as different from each other as possible in all meanings and uses of the words”.

Although the instructions say to “avoid specialized vocabulary (e.g., no technical terms)”, I imagine you might score higher if you’ve just finished cramming wordlists for the GRE. Researchers have used this test to compare human and AI creativity (though the use of GPT-4 in this article with a January 2026 publication date speaks to the incompatibility of AI research with traditional publication timelines).

A/B testing for advertising is not randomized · ↗ flovv.github.io

Feb 1, 2026

Florian Teschner writes about a recent paper from Bögershausen, Oertzen, & Bock arguing that online ad platforms like Facebook and Google misrepresent the meaning of “A/B testing” for ad campaigns. In A/B testing, we might assume the platform is randomly assigning users to see ad A or ad B, in an attempt to get a clean causal interpretation about which ad is more likely to drive a click (or whatever outcome you’re tracking).

But according to the paper, this is usually not what is happening. Instead, the platform optimizes delivery for each ad independently, steering each one toward the users most likely to click it. In other words, the two ads may be shown to different groups of users, and differences in click-through rates may be attributable to who is seeing the ad, as opposed to the overall appeal of the ad. Ad platforms convert A/B tests from simple randomized experiments into murky observational comparisons. For example, an ad may appear to do better because it happened to be shown disproportionately to a group with a high click-through rate, not because it presents a more compelling overall message. Advertisers get the warm glow of “experimentally backed” marketing without the assurances of randomization.

Total electoral wipeout · ↗ en.wikipedia.org

Jan 31, 2026

The 2002 Turkish general election is the canonical example of total electoral wipeout. Every party holding seats in the previous legislature was completely wiped out. Of the two parties that won seats in the 2002 election, the one that formed government didn’t even exist at the time of the previous election (current president Erdoğan’s AK Party, formed in 2001). Of note, it wasn’t a complete changing of the guard: one of the three independent members from the 1999 parliament won his seat again in 2002 (Mehmet Ağar), though it seems he took over as leader of one of the wiped-out parties shortly after the election.

_{Hat tip to kynakwado2 on Twitter.}

Twyman's law · ↗ en.wikipedia.org

Jan 30, 2026

From Wikipedia:

Twyman’s law states that “Any figure that looks interesting or different is usually wrong”

A bit different from that oft-quoted line attributed to Isaac Asimov:

The most exciting phrase in science is not ‘Eureka!’ but ‘that’s funny’

But Twyman’s law is much truer in my experience. Surprising results are usually a signal that something is screwy with my data, my assumptions, or my pipeline.

_{Hat tip to DJ Rich on Twitter.}

Remember that a lot of numbers are fake · ↗ davidoks.blog

Jan 29, 2026 · 2 min read

David Oks wrote an essay reminding us that in many countries, even the most basic statistic—the population—is often shockingly uncertain or even outright fabricated. It’s a good reminder that many of the numbers we rely on for international comparisons, like crime rates and economic indices, are similarly troubled by incompatible definitions, uneven measurement, and varying degrees of manipulation. Ask Google what the population of Afghanistan is, and it will happily show you an annual timeline of population since 1960, but the tidiness of the chart belies the murkiness of the estimate.

One of the drawbacks of easily accessible international datasets from organizations like the World Bank and Our World in Data is that they paper over the huge differences among the underlying source datasets. Ultimately, you end up with one number from each country and the implication that they are all pointing to a single construct. This makes it far too easy to draw confident comparisons between countries that simply aren’t measuring the same thing. Without being forced to assemble these datasets yourself, it’s difficult to appreciate how messy it is to measure “the same thing” across different places (or even to measure the same thing over time within one place).

When evaluating a statistical claim, it’s always worth asking where the numbers come from and how they were measured. It’s easy to take figures at face value, especially when they’re rarely presented with any explicit uncertainty, which may be large. This goes double for more esoteric constructs like freedom scores or corruption indices, which often show up in social media posts cheerleading (or doom-mongering) one country over another. I remember one slickly produced video uncritically comparing COVID-19 statistics between Australia and Niger on the basis that they have the same population (do they?). Niger is one of the poorest and youngest countries in the world, and differences in demographics and health infrastructure alone invalidate any straightforward comparison with a wealthy Western country.

…

Welcome to Big Muddy

Jan 28, 2026

Hi, I’m Jean-Paul R. Soucy, a data scientist working in healthcare in Montreal, Canada. Welcome to Big Muddy, my spin on a Simon Willison-style links-and-notes blog. Here I collect and share things I’m learning across technology, science, politics, and whatever else catches my interest. You’ll find interesting links, brief write-ups, quick experiments, and the occasional deep dive.