The triumph of the data raccoons

Apr 3, 2026 · 2 min read

My PhD co-supervisor at the University of Toronto, Dr. David Fisman, liked to use the term “data raccoon” to describe the work of using messy, incomplete, hard-to-work-with data to do serious research. Or, as he described it in testimony to the Canadian House of Commons in May 2020 (emphasis mine):

I’ll tell you, my group at University of Toronto call ourselves “data raccoons”, because we’ve sort of managed to thrive for about 15 years on data that most people regard as garbage, so it’s sort of a bit of the normal state of affairs for us with public health data analysis.

It’s an unmistakably Toronto metaphor—the city isn’t called the raccoon capital of the world for nothing!

It occurred to me recently that data raccoons have basically taken over the world. The basis of the AI revolution is vast quantities of text dredged from the Internet, none of which was written for its final purpose of training the deus ex machina. Arguably the most important dataset for training LLMs has been Common Crawl, a mostly uncurated archive of web crawls that has been running since 2007. According to a 2024 Mozilla report, Common Crawl was used in two-thirds of the LLMs developed in the formative period between 2019 and 2023, and it accounted for 80% of the tokens in OpenAI’s GPT-3. Unsurprisingly, the Common Crawl Foundation has received financial support from AI companies in recent years, all while being accused of helping those same companies train their models on paywalled articles.

That is how the data raccoons won: what once looked like a scrappy epidemiological habit now looks like the dominant epistemology of the age.

Two mischievous raccoons perched on a ledge by a telephone pole.