The triumph of the data raccoons

Apr 3, 2026 · 2 min read

My PhD co-supervisor at the University of Toronto, Dr. David Fisman, liked to use the term “data raccoon” to describe the work of using messy, incomplete, hard-to-work-with data to do serious research. Or, as he described it in testimony to the Canadian House of Commons in May 2020 (emphasis mine):

I’ll tell you, my group at University of Toronto call ourselves “data raccoons”, because we’ve sort of managed to thrive for about 15 years on data that most people regard as garbage, so it’s sort of a bit of the normal state of affairs for us with public health data analysis.

It’s an unmistakably Toronto metaphor—the city isn’t called the raccoon capital of the world for nothing!

But now the data raccoons have gone and taken over the world. The basis of the AI revolution is vast quantities of text dredged from the Internet, almost none of it written for the purpose of training the deus ex machina.

Arguably the most important dataset for training LLMs has been Common Crawl, a mostly uncurated archive of the web that has been running since 2007. According to a 2024 Mozilla report, Common Crawl was used in two-thirds of the LLMs developed in the formative period between 2019 and 2023, and it accounted for 80% of the tokens used to train OpenAI's GPT-3. Unsurprisingly, the Common Crawl Foundation has received financial support from AI companies in recent years, while also facing accusations that it helped those same companies train their models on paywalled articles.

The raccoon used to be a mascot for making do. Now it’s how we’re building the future.

[Image: Two mischievous raccoons perched on a ledge by a telephone pole.]