<?xml version="1.0" encoding="utf-8" standalone="yes"?><rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom"><channel><title>Open Data on Big Muddy</title><link>https://muddy.jprs.me/tags/open-data/</link><description>Recent content in Open Data on Big Muddy</description><generator>Hugo</generator><language>en-US</language><lastBuildDate>Fri, 03 Apr 2026 23:40:00 -0400</lastBuildDate><atom:link href="https://muddy.jprs.me/tags/open-data/index.xml" rel="self" type="application/rss+xml"/><item><title>The triumph of the data raccoons</title><link>https://muddy.jprs.me/notes/2026-04-03-the-triumph-of-the-data-raccoons/</link><pubDate>Fri, 03 Apr 2026 23:40:00 -0400</pubDate><guid>https://muddy.jprs.me/notes/2026-04-03-the-triumph-of-the-data-raccoons/</guid><description>&lt;p&gt;My PhD co-supervisor at the University of Toronto, Dr. David Fisman, liked to use the term &amp;ldquo;data raccoon&amp;rdquo; to describe the work of using messy, incomplete, hard-to-work-with data to do serious research. Or, as he described it in &lt;a href="https://www.ourcommons.ca/documentviewer/en/43-1/HESA/meeting-22/evidence#Int-10851517"&gt;testimony&lt;/a&gt; to the Canadian House of Commons in May 2020 (emphasis mine):&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;I&amp;rsquo;ll tell you, my group at University of Toronto call ourselves &amp;ldquo;data raccoons&amp;rdquo;, because we&amp;rsquo;ve sort of managed to thrive for about 15 years on &lt;strong&gt;data that most people regard as garbage&lt;/strong&gt;, so it&amp;rsquo;s sort of a bit of the normal state of affairs for us with public health data analysis.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;It&amp;rsquo;s an unmistakably Toronto metaphor—the city isn&amp;rsquo;t called the &lt;a href="https://www.npr.org/2018/09/16/647599627/theres-no-stopping-toronto-s-uber-raccoon"&gt;raccoon capital of the world&lt;/a&gt; for nothing!&lt;/p&gt;
&lt;p&gt;It occurred to me recently that data raccoons have basically taken over the world. The basis of the AI revolution is vast quantities of text dredged from the Internet, none of which was written for its final purpose of training the &lt;em&gt;deus ex machina&lt;/em&gt;. Arguably the most important dataset for training LLMs has been &lt;a href="https://en.wikipedia.org/wiki/Common_Crawl_Foundation"&gt;Common Crawl&lt;/a&gt;, a mostly uncurated crawl of the Internet that has been running since 2007. According to a &lt;a href="https://www.mozillafoundation.org/en/research/library/generative-ai-training-data/common-crawl/"&gt;Mozilla report&lt;/a&gt; from 2024, Common Crawl was used in two-thirds of LLMs developed in the formative period between 2019 and 2023, and the archive also comprised 80% of the tokens in OpenAI&amp;rsquo;s GPT-3. Unsurprisingly, the Common Crawl Foundation has received &lt;a href="https://archive.is/NS9MI"&gt;financial support&lt;/a&gt; from AI companies in recent years, all the while being accused of abetting these same companies in training their models on paywalled articles.&lt;/p&gt;</description></item><item><title>How SARS-CoV-2 variants get named on GitHub</title><link>https://muddy.jprs.me/links/2026-03-26-how-sars-cov-2-variants-get-named-on-github/</link><pubDate>Thu, 26 Mar 2026 07:00:00 -0400</pubDate><guid>https://muddy.jprs.me/links/2026-03-26-how-sars-cov-2-variants-get-named-on-github/</guid><description>&lt;p&gt;Bioinformatics has long been an unusually collaborative and transparent field, with genomes, protein structures, and other complex biological data habitually deposited into open databases during the course of research. The situation was no different at the outset of the COVID-19 pandemic, when a small group of scientists developed the &lt;a href="https://en.wikipedia.org/wiki/Phylogenetic_Assignment_of_Named_Global_Outbreak_Lineages"&gt;Pango nomenclature&lt;/a&gt; for classifying variants of the SARS-CoV-2 virus.
Outside of the handful of Greek-letter names assigned by the World Health Organization to &amp;ldquo;variants of concern&amp;rdquo;, the Pango nomenclature is the standard for tracking the evolution of the SARS-CoV-2 virus. You may recall names such as B.1.1.7 (Alpha, the UK variant), B.1.351 (Beta, the South African variant), and P.1 (Gamma, the Brazilian variant). A complete list of active Pango lineages is available &lt;a href="https://cov-lineages.org/lineage_list.html"&gt;here&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;By &lt;a href="https://github.com/cov-lineages/pango-designation/issues/1"&gt;August 2020&lt;/a&gt;, the work of defining new lineages of SARS-CoV-2 had moved to &lt;a href="https://github.com/cov-lineages/pango-designation"&gt;GitHub&lt;/a&gt;, where the scientific process could happen in a transparent and collaborative way. New lineages are defined through proposals submitted as GitHub issues. In &lt;a href="https://github.com/cov-lineages/pango-designation/issues/1988"&gt;May 2023&lt;/a&gt;, a second &lt;a href="https://github.com/sars-cov-2-variants/lineage-proposals"&gt;GitHub repository&lt;/a&gt; was opened to move discussions of smaller or less clear-cut lineages out of the main repository. These discussions can be promoted to the main repository, as this &lt;a href="https://github.com/sars-cov-2-variants/lineage-proposals/issues/2199"&gt;issue tracking LP.8.1 sub-lineages&lt;/a&gt; was in &lt;a href="https://github.com/cov-lineages/pango-designation/issues/2978"&gt;May 2025&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;The work of defining new lineages of SARS-CoV-2 goes on to this day in the GitHub repository, as the virus continues to mutate and evolve. And bioinformatics remains a shining beacon of open science for the rest of us to learn from.&lt;/p&gt;</description></item><item><title>Vandalism of OpenStreetMap</title><link>https://muddy.jprs.me/links/2026-03-23-vandalism-of-openstreetmap/</link><pubDate>Mon, 23 Mar 2026 17:26:00 -0400</pubDate><guid>https://muddy.jprs.me/links/2026-03-23-vandalism-of-openstreetmap/</guid><description>&lt;p&gt;&lt;a href="https://www.openstreetmap.org/"&gt;OpenStreetMap&lt;/a&gt; (OSM) is an open, community-driven map database powering countless apps and services and used by organizations including Amazon, Apple, Microsoft, Uber, Mapbox, and Wikimedia. In short, it is foundational infrastructure for the web. For regions with active communities (particularly in Europe), OSM is often noted for the superiority of its data on features such as cycling routes, hiking trails, and footpaths.&lt;/p&gt;
&lt;p&gt;The &lt;a href="https://en.wikipedia.org/w/index.php?title=OpenStreetMap&amp;amp;oldid=1340732815#quality_assurance"&gt;Wikipedia article for OpenStreetMap&lt;/a&gt; documents several instances of data vandalism, which OSM is vulnerable to as a crowdsourced project. Three incidents stood out:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;&lt;a href="https://www.wired.com/2012/01/osm-google-accusation/"&gt;In 2012&lt;/a&gt;, Google fired two &amp;ldquo;rogue contractors&amp;rdquo; for vandalizing the OSM database, intentionally adding false data such as reversing the direction of one-way streets.&lt;/li&gt;
&lt;li&gt;&lt;a href="https://arstechnica.com/information-technology/2018/08/data-vandal-changes-name-of-new-york-city-to-jewtropolis-across-multiple-apps/"&gt;In 2018&lt;/a&gt;, a vandal made several viciously antisemitic edits to place names around New York City. While quickly reverted at the source, these changes nonetheless propagated into downstream applications pulling data from Mapbox, such as Zillow, Snapchat, Citibike, and Wikipedia.&lt;/li&gt;
&lt;li&gt;Users of the mobile game &lt;a href="https://www.mdpi.com/2220-9964/9/4/197"&gt;Pokémon GO&lt;/a&gt; regularly vandalize the OSM database underlying the game to gain a gameplay advantage, although the authors of the research article on this subject note this vandalism tends to be transitory rather than sustained.&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;Side note: I was amused to note how strong Google&amp;rsquo;s regional results bias is for &amp;ldquo;OSM&amp;rdquo;—the entire first page is taken up by results related to the Orchestre symphonique de Montréal.&lt;/p&gt;</description></item><item><title>Properly the work of federal public health agencies</title><link>https://muddy.jprs.me/links/2026-03-22-properly-the-work-of-federal-public-health-agencies/</link><pubDate>Sun, 22 Mar 2026 23:38:00 -0400</pubDate><guid>https://muddy.jprs.me/links/2026-03-22-properly-the-work-of-federal-public-health-agencies/</guid><description>&lt;p&gt;One of the reasons I started this blog was to have a place to put down posts and articles that have lodged themselves in my brain. The wind-down announcement of the &lt;a href="https://en.wikipedia.org/wiki/COVID_Tracking_Project"&gt;COVID Tracking Project&lt;/a&gt;, a volunteer-led COVID-19 data tracking collaboration, is one such article.&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;But the work itself—compiling, cleaning, standardizing, and making sense of COVID-19 data from 56 individual states and territories—&lt;em&gt;is properly the work of federal public health agencies&lt;/em&gt;.&lt;/strong&gt; Not only because these efforts are a governmental responsibility—which they are—but because federal teams have access to far more comprehensive data than we do, and can mandate compliance with at least some standards and requirements.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;After one year of work, the COVID Tracking Project decided to quit collecting data on COVID-19 in the United States, because they recognized that the work of compiling a comparable, national-level dataset was the responsibility of federal government agencies.&lt;/p&gt;
&lt;p&gt;As someone who co-led the &lt;a href="https://opencovid.ca/"&gt;COVID-19 Canada Open Data Working Group&lt;/a&gt;, which curated &lt;a href="https://github.com/ccodwg/Covid19Canada"&gt;COVID-19&lt;/a&gt; &lt;a href="https://github.com/ccodwg/CovidTimelineCanada"&gt;data&lt;/a&gt; for Canada until the end of 2023, I think about this article a lot. It&amp;rsquo;s a good read, and it speaks to how essential open data was to filling in the gaps in the national and international understanding of the COVID-19 pandemic.&lt;/p&gt;</description></item><item><title>Fight club at the bird feeder</title><link>https://muddy.jprs.me/links/2026-03-20-fight-club-at-the-bird-feeder/</link><pubDate>Fri, 20 Mar 2026 07:00:00 -0400</pubDate><guid>https://muddy.jprs.me/links/2026-03-20-fight-club-at-the-bird-feeder/</guid><description>&lt;p&gt;&lt;em&gt;Alternate title: Blue Jay brutally feeder mogs Tufted Titmouse&lt;/em&gt;&lt;/p&gt;
&lt;p&gt;&lt;img src="https://muddy.jprs.me/media/20260319-214037.png" alt="Network showing dominance hierarchy among 13 common feeder birds; the Blue Jay wins against 10 species and loses to 3"&gt;&lt;/p&gt;
&lt;p&gt;From the Cornell Lab of Ornithology, a pretty neat article about dominance hierarchies at the bird feeder, based on over 7,600 observations collected by citizen scientists contributing to &lt;a href="https://feederwatch.org/"&gt;Project FeederWatch&lt;/a&gt;. Essentially, bird watchers reported instances in which one bird species successfully displaced another at the feeder, and the researchers used this network of comparisons to build a dominance hierarchy. Using the information contained within the network, you can even compare species that are rarely observed together. Not all dominance patterns are linear, however, as the article reports:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;A separate analysis uncovered some dominance triangles in which three birds had one-to-one relationships independent of each other, like a game of birdy rock-paper-scissors. For example, the House Finch dominates the Purple Finch, and the Purple Finch dominates the Dark-eyed Junco, but the junco dominates House Finch.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;The full paper is here: &lt;a href="https://doi.org/10.1093/beheco/arx108"&gt;Fighting over food unites the birds of North America in a continental dominance hierarchy&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;This work is reminiscent of &lt;a href="https://www.cochrane.org/authors/handbooks-and-manuals/handbook/current/chapter-11"&gt;network meta-analysis&lt;/a&gt;, in which three or more interventions (e.g., drugs) are compared using both direct and indirect evidence. For example, if there are studies comparing drug A versus drug B and drug B versus drug C, we can infer the comparison between drug A and drug C, even if no study has ever directly compared them.&lt;/p&gt;</description></item><item><title>geoBoundaries: An open database of political administrative boundaries</title><link>https://muddy.jprs.me/links/2026-03-13-geoboundaries-an-open-database-of-political-administrative-boundaries/</link><pubDate>Fri, 13 Mar 2026 17:05:00 -0400</pubDate><guid>https://muddy.jprs.me/links/2026-03-13-geoboundaries-an-open-database-of-political-administrative-boundaries/</guid><description>&lt;p&gt;Today I discovered geoBoundaries, a &lt;a href="https://creativecommons.org/licenses/by/4.0/"&gt;CC BY 4.0&lt;/a&gt;-licensed database of political administrative boundaries covering the entire world. It is notable for its high level of detail, going from ADM0 (country), ADM1 (states/provinces), ADM2 (counties/departments or municipalities), to ADM3 (municipalities or sub-municipalities) for many countries. My go-to source for world map files is &lt;a href="https://www.naturalearthdata.com/"&gt;Natural Earth&lt;/a&gt;, which is limited to ADM0 and ADM1 but is in the public domain. Natural Earth also includes some physical geography like water and bathymetry, while geoBoundaries is focused solely on political administrative boundaries. Both datasets deal with disputed boundaries, which is an endless source of tension in the &lt;a href="https://github.com/nvkelso/natural-earth-vector/issues"&gt;Natural Earth GitHub&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;An R package for retrieving data from geoBoundaries, &lt;a href="https://dieghernan.github.io/202602_geobounds/"&gt;&lt;code&gt;geobounds&lt;/code&gt;&lt;/a&gt;, was released in February. A similar package for Natural Earth, &lt;a href="https://github.com/ropensci/rnaturalearth"&gt;&lt;code&gt;rnaturalearth&lt;/code&gt;&lt;/a&gt;, has long been maintained by rOpenSci.&lt;/p&gt;</description></item><item><title>What will the paper of the future look like?</title><link>https://muddy.jprs.me/links/2026-03-10-what-will-the-paper-of-the-future-look-like/</link><pubDate>Tue, 10 Mar 2026 23:48:00 -0400</pubDate><guid>https://muddy.jprs.me/links/2026-03-10-what-will-the-paper-of-the-future-look-like/</guid><description>&lt;p&gt;I am sharing today a short blog post by the Institute for Replication: &amp;ldquo;What will the paper of the future look like?&amp;rdquo;&lt;/p&gt;
&lt;p&gt;In short: research looking more like &lt;a href="https://www.youtube.com/watch?v=zwRdO9_GGhY"&gt;software development&lt;/a&gt; (as presaged by Prof. Richard McElreath, author of the excellent &lt;em&gt;Statistical Rethinking&lt;/em&gt;), with the ability to reuse common material, formalize results, and remix analyses built into the pipeline.&lt;/p&gt;</description></item><item><title>Open By Default: A database of access to information requests to the Canadian government</title><link>https://muddy.jprs.me/links/2026-03-07-open-by-default-a-database-of-access-to-information-requests-to-the-canadian-government/</link><pubDate>Sat, 07 Mar 2026 14:32:00 -0500</pubDate><guid>https://muddy.jprs.me/links/2026-03-07-open-by-default-a-database-of-access-to-information-requests-to-the-canadian-government/</guid><description>&lt;p&gt;In Canada, any person or corporation in the country can make a request for general records to any agency of the federal government through the &lt;em&gt;Access to Information Act&lt;/em&gt; (the equivalent in the United States is the &lt;em&gt;Freedom of Information Act&lt;/em&gt;). The government provides a &lt;a href="https://open.canada.ca/en/search/ati"&gt;searchable database&lt;/a&gt; of completed requests, but includes only a summary of the request and the number of pages of responsive material. The actual documents turned over are not included. However, completed request packages may be informally re-requested, and should you do so, someone from the relevant agency will (usually) send them to you eventually.&lt;/p&gt;
&lt;p&gt;This re-request process has its limits. It can take weeks or months for the documents to be sent, and the database itself only goes back to January 2020 (they used to delete records older than two years, but stopped doing this some time after 2020). Occasionally, the documents never arrive at all, and all you can do is submit the re-request again or open a formal access to information request (which will cost you $5).&lt;/p&gt;
&lt;p&gt;Making it easier to access completed access to information requests is why the Investigative Journalism Foundation built &lt;a href="https://theijf.org/open-by-default"&gt;Open By Default&lt;/a&gt;, &amp;ldquo;the biggest database of internal government documents never before made publicly accessible&amp;rdquo;. It includes documents from completed access to information requests &lt;a href="https://theijf.org/open-by-default-methodology"&gt;obtained&lt;/a&gt; using both automated (presumably the re-request form) and manual processes (donations from trusted partners, particularly of documents from before the online re-request form was available). The files are cleaned and OCRed into one beautiful, searchable database.&lt;/p&gt;</description></item><item><title>US Medicaid data gets DOGE'd</title><link>https://muddy.jprs.me/links/2026-02-14-us-medicaid-data-gets-doge-d/</link><pubDate>Sat, 14 Feb 2026 10:29:00 -0500</pubDate><guid>https://muddy.jprs.me/links/2026-02-14-us-medicaid-data-gets-doge-d/</guid><description>&lt;p&gt;The US Health and Human Services DOGE team (I guess DOGE still exists in some form) just released a new aggregated, provider-level Medicaid claims database covering January 2018 through December 2024. With this dataset, you can track the monthly claims for each procedure (by HCPCS Code) and provider over time.&lt;/p&gt;
&lt;p&gt;Even if the &lt;a href="https://www.axios.com/2026/02/14/elon-musk-doge-medicaid-fraud-hhs-database"&gt;framing around this dataset&amp;rsquo;s release is partisan&lt;/a&gt;—tied to allegations of Medicaid fraud in Minnesota—it is a genuine advance in transparency for the US&amp;rsquo;s third-largest spending program. No doubt this accomplishment required a lot of work on the backend to harmonize countless fragmented datasets into one tidy schema. These data were difficult to access before, and now they are free for anyone to use. Journalists, policy researchers, and companies working in the US healthcare sector will benefit the most, but every taxpayer benefits from added transparency about where their tax dollars go.&lt;/p&gt;
&lt;p&gt;I would say there is the potential for these data to be misused to spark witch hunts, but this is more or less the stated purpose of this data release. Per Elon Musk: &amp;ldquo;Medicaid data has been open sourced, so the level of fraud is easy to identify.&amp;rdquo; If you go on &lt;a href="https://x.com/DOGE_HHS/status/2022370909211021376"&gt;Twitter&lt;/a&gt;, you will find that several people have already plugged the dataset into Claude Code and trumpeted their ASCII tables of providers flagged for potential fraud. Inevitably, some of the providers subjected to public scrutiny for their unusual billing patterns will have perfectly innocent explanations. But if &lt;a href="https://x.com/charlesornstein/status/2022395807484514409"&gt;ProPublica is excited&lt;/a&gt; about the release of this new dataset, then so am I.&lt;/p&gt;</description></item><item><title>An end-to-end AI pipeline for policy evaluation papers</title><link>https://muddy.jprs.me/links/2026-02-12-an-end-to-end-ai-pipeline-for-policy-evaluation-papers/</link><pubDate>Thu, 12 Feb 2026 19:11:00 -0500</pubDate><guid>https://muddy.jprs.me/links/2026-02-12-an-end-to-end-ai-pipeline-for-policy-evaluation-papers/</guid><description>&lt;p&gt;Prof. David Yanagizawa-Drott from the Social Catalyst Lab at the University of Zurich has launched Project APE (Autonomous Policy Evaluation), an end-to-end AI pipeline to generate policy evaluation papers. The vast majority of policies around the world are never rigorously evaluated, so it would certainly be useful if we were able to do so in an automated fashion.&lt;/p&gt;
&lt;p&gt;Claude Code is the heart of the project, but other models are used to review the outputs and provide journal-style referee reports. All the coding is done in R (though Python is called in some scripts). Currently, Gemini 3 Flash acts as the judge, comparing the generated papers against published research in top economics journals:&lt;/p&gt;
&lt;blockquote&gt;
&lt;ul&gt;
&lt;li&gt;Blind comparison: An LLM judge compares two papers without knowing which is AI-generated&lt;/li&gt;
&lt;li&gt;Position swapping: Each pair is judged twice with paper order swapped to control for bias&lt;/li&gt;
&lt;li&gt;TrueSkill ratings: Papers accumulate skill ratings that update after each match&lt;/li&gt;
&lt;/ul&gt;
&lt;/blockquote&gt;
&lt;p&gt;The project&amp;rsquo;s home page lists the AI&amp;rsquo;s current &amp;ldquo;win rate&amp;rdquo; at 3.5% in head-to-head matchups against human-written papers.&lt;/p&gt;
&lt;p&gt;Prof. Yanagizawa-Drott says &amp;ldquo;Currently it requires at a minimum some initial human input for each paper,&amp;rdquo; although he does not specify exactly what that input is. If we look at the &lt;a href="https://github.com/SocialCatalystLab/ape-papers/blob/main/apep_0264/v1/initialization.md"&gt;&lt;code&gt;initialization.md&lt;/code&gt;&lt;/a&gt; file found in each paper&amp;rsquo;s directory, we see the following questions with user-provided answers:&lt;/p&gt;
&lt;blockquote&gt;
&lt;ol&gt;
&lt;li&gt;Policy domain: What policy area interests you?&lt;/li&gt;
&lt;li&gt;Method: Which identification method?&lt;/li&gt;
&lt;li&gt;Data era: Modern or historical data?&lt;/li&gt;
&lt;li&gt;API keys: Did you configure data API keys?&lt;/li&gt;
&lt;li&gt;External review: Include external model reviews?&lt;/li&gt;
&lt;li&gt;Risk appetite: Exploration vs exploitation?&lt;/li&gt;
&lt;li&gt;Other preferences: Any other preferences or constraints?&lt;/li&gt;
&lt;/ol&gt;
&lt;/blockquote&gt;
&lt;p&gt;The code, reviews, manuscript, and even the results of the initial idea generation process are all available on &lt;a href="https://github.com/SocialCatalystLab/ape-papers"&gt;GitHub&lt;/a&gt;. Their immediate goal is to generate a sample of 1,000 papers and run human evaluations on them (at time of posting, there are 264 papers in the GitHub repository).&lt;/p&gt;</description></item><item><title>The case for sharing clinical trial data</title><link>https://muddy.jprs.me/links/2026-02-10-the-case-for-sharing-clinical-trial-data/</link><pubDate>Tue, 10 Feb 2026 19:39:00 -0500</pubDate><guid>https://muddy.jprs.me/links/2026-02-10-the-case-for-sharing-clinical-trial-data/</guid><description>&lt;p&gt;Saloni Dattani of the excellent &lt;a href="https://worksinprogress.co/"&gt;&lt;em&gt;Works in Progress&lt;/em&gt;&lt;/a&gt; magazine (and formerly of &lt;a href="https://ourworldindata.org/"&gt;Our World in Data&lt;/a&gt;) launched a new Substack today called &lt;em&gt;The Clinical Trials Abundance blog&lt;/em&gt;. The first post is on the case for sharing clinical trial data. We have been gradually moving toward mandatory reporting of &lt;a href="https://www.npr.org/sections/health-shots/2019/10/11/769348119/canadas-decision-to-make-public-more-clinical-trial-data-puts-pressure-on-fda"&gt;clinical trial &lt;em&gt;results&lt;/em&gt;&lt;/a&gt; (though enforcement is &lt;a href="https://www.trialstracker.net/"&gt;another&lt;/a&gt; &lt;a href="https://fdaaa.trialstracker.net/"&gt;question&lt;/a&gt;), but sharing data would be one step further. Even though clinical trials rely on the trust (and often money) of the public, it can be very difficult to gain access to the raw results, even if journal article authors claim they are &amp;ldquo;available upon request&amp;rdquo;. 
A norm of clinical trial data sharing would not only increase confidence in published results but also aid future drug development, reduce expensive redundancy, and improve meta-analyses (which are often forced to rely on heterogeneous summary measures).&lt;/p&gt;</description></item></channel></rss>