Data on Big Muddy

Open By Default: A database of access to information requests to the Canadian government

Sat, 07 Mar 2026 14:32:00 -0500

In Canada, any person or corporation in the country can make a request for general records to any agency of the federal government through the Access to Information Act (the equivalent in the United States is the Freedom of Information Act). The government provides a searchable database of completed requests, but it includes only a summary of each request and the number of pages of responsive material. The actual documents turned over are not included. However, completed request packages may be informally re-requested, and should you do so, someone from the relevant agency will (usually) send them to you eventually.

This re-request process has its limits. It can take weeks or months for the documents to be sent, and the database itself only goes back to January 2020 (they used to delete records older than two years, but stopped doing this some time after 2020). Occasionally, they will never send the documents at all, and all you can do is either re-request them or open a formal access to information request (which will cost you $5).

Making completed access to information requests easier to obtain is why the Investigative Journalism Foundation built Open By Default, “the biggest database of internal government documents never before made publicly accessible”. It includes documents from completed access to information requests obtained using both automated (presumably the re-request form) and manual processes (donations from trusted partners, particularly of documents from before the online re-request form was available). The files are cleaned and OCRed into one beautiful, searchable database.

The case for sharing clinical trial data

Tue, 10 Feb 2026 19:39:00 -0500

Saloni Dattani of the excellent Works in Progress magazine (and formerly of Our World in Data) launched a new Substack today called The Clinical Trials Abundance blog. The first post is on the case for sharing clinical trial data. We have been gradually moving toward mandatory reporting of clinical trial results (though enforcement is another question), but sharing data would be one step further. Even though clinical trials rely on the trust (and often money) of the public, it can be very difficult to gain access to the raw results, even if journal article authors claim they are “available upon request”. A norm of clinical trial data sharing would not only increase the confidence in published results but also aid future drug development, reduce expensive redundancy, and improve meta-analyses (which are often forced to rely on heterogeneous summary measures).

The CIA World Factbook has been memory holed

Thu, 05 Feb 2026 16:37:00 -0500

Another staple of my childhood is gone, this time the CIA’s World Factbook. I have fond memories of consulting the World Factbook for school projects in my elementary school computer lab. But as of yesterday, the entire publication, along with all of its archives, has been suddenly and unceremoniously wiped from the agency’s website. At least archives of the website are still available on the Internet Archive, with complete zip files up to 2020 and Wayback Machine snapshots thereafter.

Twyman's law

Fri, 30 Jan 2026 19:25:00 -0500

From Wikipedia:

Twyman’s law states that “Any figure that looks interesting or different is usually wrong”

A bit different from that oft-quoted line attributed to Isaac Asimov:

The most exciting phrase in science is not ‘Eureka!’ but ‘that’s funny’

But Twyman’s law is much truer in my experience. Surprising results are usually a signal that something is screwy with my data, my assumptions, or my pipeline.

Hat tip to DJ Rich on Twitter.

Remember that a lot of numbers are fake

Thu, 29 Jan 2026 23:20:00 -0500

David Oks wrote an essay reminding us that in many countries, even the most basic statistic—the population—is often shockingly uncertain or even outright fabricated. It’s a good reminder that many of the numbers we rely on for international comparisons, like crime rates and economic indices, are similarly troubled by incompatible definitions, uneven measurement, and varying degrees of manipulation. Ask Google what the population of Afghanistan is, and it will happily show you an annual timeline of population since 1960, but the tidiness of the chart belies the murkiness of the estimate.

One of the drawbacks of easily accessible international datasets from organizations like the World Bank and Our World in Data is that they paper over the huge differences among the underlying source datasets. Ultimately, you end up with one number from each country and the implication that they are all pointing to a single construct. This makes it far too easy to draw confident comparisons between countries that simply aren’t measuring the same thing. Without being forced to assemble these datasets yourself, it’s difficult to appreciate how messy it is to measure “the same thing” across different places (or even to measure the same thing over time within one place).

When evaluating a statistical claim, it’s always worth asking where the numbers come from and how they were measured. It’s easy to take figures at face value, especially when they’re rarely presented with any explicit uncertainty, which may be large. This goes double for more esoteric constructs like freedom scores or corruption indices, which often show up in social media posts cheerleading (or doom-mongering) one country over another. I remember one slickly produced video uncritically comparing COVID-19 statistics between Australia and Niger on the basis that they have the same population (do they?). Niger is one of the poorest and youngest countries in the world, and differences in demographics and health infrastructure alone invalidate any straightforward comparison with a wealthy Western country.
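
To make the point about unstated uncertainty concrete, here is a minimal Python sketch of how doubt about a population figure carries through to a per-capita rate. Every number in it is made up for illustration, the 20% error margin is an assumption rather than an estimate for any real country, and the function names are my own.

```python
def rate_per_100k(events: int, population: float) -> float:
    """Events per 100,000 people."""
    return events / population * 100_000


def rate_bounds(events: int, population: float, rel_error: float) -> tuple[float, float]:
    """Lowest and highest rate if the true population lies within
    +/- rel_error of the reported figure."""
    low_pop = population * (1 - rel_error)
    high_pop = population * (1 + rel_error)
    # A larger population gives a smaller rate, so the bounds swap.
    return rate_per_100k(events, high_pop), rate_per_100k(events, low_pop)


# Hypothetical numbers, for illustration only.
events = 5_000
reported_population = 25_000_000

point = rate_per_100k(events, reported_population)
low, high = rate_bounds(events, reported_population, rel_error=0.20)

print(f"point estimate: {point:.1f} per 100k")
print(f"plausible range: {low:.1f} to {high:.1f} per 100k")
```

This only accounts for uncertainty in the denominator. In practice the event count is often at least as shaky as the population, so a real comparison would be looser still.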