<?xml version="1.0" encoding="utf-8" standalone="yes"?><rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom"><channel><title>Statistics on Big Muddy</title><link>https://muddy.jprs.me/tags/statistics/</link><description>Recent content in Statistics on Big Muddy</description><generator>Hugo</generator><language>en-US</language><lastBuildDate>Sun, 12 Apr 2026 22:55:00 -0400</lastBuildDate><atom:link href="https://muddy.jprs.me/tags/statistics/index.xml" rel="self" type="application/rss+xml"/><item><title>Adjusting for recalled past vote in political polling</title><link>https://muddy.jprs.me/links/2026-04-12-adjusting-for-recalled-past-vote-in-political-polling/</link><pubDate>Sun, 12 Apr 2026 22:55:00 -0400</pubDate><guid>https://muddy.jprs.me/links/2026-04-12-adjusting-for-recalled-past-vote-in-political-polling/</guid><description>&lt;p&gt;The founder of Abacus Data, a Canadian polling firm, &lt;a href="https://x.com/DavidColetto/status/2043026334046392624"&gt;dropped&lt;/a&gt; kind of an interesting URL yesterday: &lt;a href="https://abacus-weighting.com/"&gt;abacus-weighting.com&lt;/a&gt;. It&amp;rsquo;s a advertisement in the form of a case study on why Abacus weights their political polls on past vote. It fits perfectly with the theme of &lt;a href="https://muddy.jprs.me/links/2026-04-11-how-do-pollsters-get-different-results-from-the-same-data/"&gt;yesterday&amp;rsquo;s post&lt;/a&gt; on how pollster&amp;rsquo;s get different results from the same data (the answer is they weight the raw data differently).&lt;/p&gt;
&lt;p&gt;If you follow Nate Silver (or American political polling in general), you probably know that pollsters &lt;a href="https://www.nbcnews.com/politics/2024-election/-polls-missed-decisive-slice-trump-voters-2024-rcna182488"&gt;undercounted Trump support&lt;/a&gt; in all three elections where he was on the ballot. What I learned from this post is that support for the Conservative Party of Canada has been underestimated in the firm&amp;rsquo;s polling data in every wave of every election since 2011:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;In &lt;strong&gt;every single wave&lt;/strong&gt;, across every single election cycle, Conservative voters are underrepresented in our demographically weighted sample relative to their actual share of the vote. Not in most waves. Not in some elections. In every case we can observe.&lt;/p&gt;
&lt;/blockquote&gt;
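&lt;p&gt;Mechanically, weighting on recalled past vote is a post-stratification step: group respondents by the party they say they voted for last time, then scale each group so its share of the sample matches the actual result. A minimal sketch with invented shares (a real procedure would also need to handle non-voters, faulty recall, and interactions with the demographic weights):&lt;/p&gt;
&lt;pre&gt;&lt;code class="language-python"&gt;# Toy past-vote weighting: scale each recalled-vote group so its
# sample share matches the actual result of the last election.
# All shares below are invented for illustration.
recalled_share = {"CPC": 0.28, "LPC": 0.36, "NDP": 0.18, "Other": 0.18}
actual_result  = {"CPC": 0.34, "LPC": 0.33, "NDP": 0.18, "Other": 0.15}

weights = {p: actual_result[p] / recalled_share[p] for p in recalled_share}
print(weights)  # CPC ~1.21, LPC ~0.92, NDP 1.0, Other ~0.83
# Each respondent then counts weights[recalled_vote] instead of 1.0
# when current vote intention is tabulated.
&lt;/code&gt;&lt;/pre&gt;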
&lt;p&gt;Weighting for recalled past vote improves the estimate in every case, sometimes dramatically so:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;In every election, past vote weighting moved our Conservative estimates upward and our Liberal estimates downward — consistently in the direction of the actual result. The 2021 election shows the most dramatic correction: a 7-point improvement in our Conservative estimate.&lt;/p&gt;
&lt;/blockquote&gt;</description></item><item><title>How do pollsters get different results from the same data?</title><link>https://muddy.jprs.me/links/2026-04-11-how-do-pollsters-get-different-results-from-the-same-data/</link><pubDate>Sat, 11 Apr 2026 22:36:00 -0400</pubDate><guid>https://muddy.jprs.me/links/2026-04-11-how-do-pollsters-get-different-results-from-the-same-data/</guid><description>&lt;p&gt;Nate Silver linked to this throwback article from 2016 in &lt;em&gt;The New York Times&lt;/em&gt; in his recent post on &lt;a href="https://www.natesilver.net/p/ai-polls-are-fake-polls"&gt;fake AI polls&lt;/a&gt;, which &lt;a href="https://muddy.jprs.me/links/2026-04-07-what-is-a-public-opinion-poll-without-the-public/"&gt;I also wrote about a few days ago&lt;/a&gt;. The article, entitled &amp;ldquo;We Gave Four Good Pollsters the Same Raw Data. They Had Four Different Results.&amp;rdquo;, is a good reminder that modern polling diverges very far from the theoretical ideal of a &lt;a href="https://en.wikipedia.org/wiki/Simple_random_sample"&gt;simple random sample&lt;/a&gt;. Even after deciding on a methodology to sample participants and collecting the data, a lot of work goes into interpreting raw poll responses to give us top-line polling numbers. Every pollster needs to figure out how to weight the responses they get, since poll response rates are abysmal and variable across different demographic groups. As in the example given in this piece, these choices can result in large differences in those top-line numbers: from +4 Clinton to +1 Trump, all from the same raw data!&lt;/p&gt;
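&lt;p&gt;To make the mechanism concrete, here is a toy version (invented numbers, not any of the four pollsters&amp;rsquo; actual procedures): the same raw responses, weighted toward two different assumed electorates, yield two different toplines.&lt;/p&gt;
&lt;pre&gt;&lt;code class="language-python"&gt;# Same raw responses, two different weighting schemes.
# (group, candidate) response counts from a made-up raw sample.
raw = {("college", "A"): 220, ("college", "B"): 180,
       ("no_college", "A"): 90, ("no_college", "B"): 110}

def topline(target_share):
    """Weight each education group to a target share of the electorate."""
    n = sum(raw.values())
    sample_share = {g: sum(v for (grp, _c), v in raw.items() if grp == g) / n
                    for g in target_share}
    w = {g: target_share[g] / sample_share[g] for g in target_share}
    votes = {c: sum(v * w[g] for (g, cand), v in raw.items() if cand == c)
             for c in ("A", "B")}
    total = sum(votes.values())
    return {c: round(100 * v / total, 1) for c, v in votes.items()}

# Pollster 1 assumes a more college-educated electorate than pollster 2:
print(topline({"college": 0.45, "no_college": 0.55}))  # A 49.5, B 50.5
print(topline({"college": 0.30, "no_college": 0.70}))  # A 48.0, B 52.0
&lt;/code&gt;&lt;/pre&gt;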
&lt;p&gt;For an interesting follow-up: &amp;ldquo;&lt;a href="https://www.natesilver.net/p/polling-is-becoming-more-of-an-art"&gt;Polling is becoming more of an art than a science&lt;/a&gt;&amp;rdquo;, also on Nate Silver&amp;rsquo;s Substack.&lt;/p&gt;</description></item><item><title>AI makes it easier to generate fake papers, too</title><link>https://muddy.jprs.me/links/2026-04-08-ai-makes-it-easier-to-generate-fake-papers-too/</link><pubDate>Wed, 08 Apr 2026 20:09:00 -0400</pubDate><guid>https://muddy.jprs.me/links/2026-04-08-ai-makes-it-easier-to-generate-fake-papers-too/</guid><description>&lt;p&gt;Here&amp;rsquo;s a fun project from Tyler Vigen, creator of the famous &lt;a href="https://tylervigen.com/spurious-correlations"&gt;Spurious Correlations&lt;/a&gt; page (which has been cited as a cautionary tale in many a science class). Using his database of real but spurious correlations (created by calculating the Pearson correlation coefficient &lt;em&gt;r&lt;/em&gt; between pairs drawn from a very large collection of variables and picking out the hits), he used AI to create amusing fake manuscripts expounding on these statistical flukes as if they were real research questions.&lt;/p&gt;
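&lt;p&gt;The dredging recipe itself is only a few lines. Here&amp;rsquo;s a sketch of the idea (independent random walks instead of the real data series, so every &amp;ldquo;hit&amp;rdquo; is a fluke by construction):&lt;/p&gt;
&lt;pre&gt;&lt;code class="language-python"&gt;import numpy as np

# Dredge for spurious correlations: correlate many unrelated series
# and keep the hits. Random walks stand in for real annual series.
rng = np.random.default_rng(0)
n_series, n_years = 500, 12
series = rng.normal(size=(n_series, n_years)).cumsum(axis=1)

r = np.corrcoef(series)                  # pairwise Pearson r
i, j = np.triu_indices(n_series, k=1)    # each pair counted once
hits = np.abs(r[i, j]) &amp;gt;= 0.95
print(f"{hits.sum()} of {hits.size} pairs with |r| at or above 0.95")
&lt;/code&gt;&lt;/pre&gt;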
&lt;p&gt;These papers were generated in January 2024, and as &lt;a href="https://muddy.jprs.me/links/2026-02-12-an-end-to-end-ai-pipeline-for-policy-evaluation-papers/"&gt;previously discussed&lt;/a&gt; on this blog, the pipeline for end-to-end paper generation has come a long way in two years. I have no doubt Tyler could make these papers sound much more convincing using today&amp;rsquo;s models, though of course his goal here is to make you laugh (and think), not to trick you. But there will surely be many scholars adopting this data dredging strategy to generate &amp;ldquo;real&amp;rdquo; papers, contributing to a deluge of papers &lt;a href="https://muddy.jprs.me/links/2026-03-03-the-productivity-shock-coming-to-academic-publishing/"&gt;flooding the academic publishing system&lt;/a&gt;.&lt;/p&gt;</description></item><item><title>Using Claude Code for cross-package statistical audits</title><link>https://muddy.jprs.me/links/2026-03-15-using-claude-claude-for-cross-package-statistical-audits/</link><pubDate>Sun, 15 Mar 2026 22:49:00 -0400</pubDate><guid>https://muddy.jprs.me/links/2026-03-15-using-claude-claude-for-cross-package-statistical-audits/</guid><description>&lt;p&gt;Economist Scott Cunningham shared an important example of why we should always report the statistical package and version used in our analyses: he used Claude Code to produce six versions of the exact same analysis using six different packages in R, Python, and Stata. In a &lt;a href="https://en.wikipedia.org/wiki/Difference_in_differences"&gt;difference-in-differences&lt;/a&gt; analysis of the effect of mental health hospital closures on homicides, using the standard &lt;a href="https://bcallaway11.github.io/did/articles/multi-period-did.html"&gt;Callaway and Sant’Anna estimator&lt;/a&gt; (for DiD with multiple time periods), he got very different results for some model specifications.&lt;/p&gt;
&lt;p&gt;Since the specifications and the data were identical across packages, he was able to trace the divergences to how each package handled problems with &lt;a href="https://www.tandfonline.com/doi/full/10.1080/00273171.2011.568786#d1e368"&gt;propensity score&lt;/a&gt; weights. The packages were not necessarily transparent about these issues. If you were not comparing results across packages, or carefully checking propensity score diagnostics, you might never realize how precarious your results were.&lt;/p&gt;
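&lt;p&gt;That kind of check is easy to automate. A generic sketch of an overlap diagnostic that would surface the problem (my own illustration, not Prof. Cunningham&amp;rsquo;s code): flag propensity scores near 0 or 1, where inverse-probability weights blow up and packages quietly diverge in what they do.&lt;/p&gt;
&lt;pre&gt;&lt;code class="language-python"&gt;import numpy as np

def check_ps_overlap(ps, treated, eps=0.01):
    """Flag propensity scores near 0 or 1, where inverse-probability
    weights explode (packages may trim, clip, drop, or silently keep
    these units, and they do not all make the same choice)."""
    ps = np.asarray(ps, dtype=float)
    treated = np.asarray(treated, dtype=bool)
    extreme = (ps &amp;lt; eps) | (ps &amp;gt; 1.0 - eps)
    weights = np.where(treated, 1.0 / ps, 1.0 / (1.0 - ps))
    print(f"{extreme.sum()} of {ps.size} units outside [{eps}, {1.0 - eps}]; "
          f"largest weight = {weights.max():.1f}")
    return extreme

rng = np.random.default_rng(3)
ps = rng.beta(0.3, 0.3, size=500)   # lumpy scores, many near 0 and 1
check_ps_overlap(ps, treated=rng.random(500) &amp;lt; 0.5)
&lt;/code&gt;&lt;/pre&gt;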
&lt;p&gt;Prof. Cunningham closes with the following advice:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;The fifth point, and the broader point, is that this kind of cross-package, cross-language audit is exactly what Claude Code should be used for. Why? Because this is a task that is time-intensive, high-value, and brutally easy to get wrong. But just one mismatched diagnostic across languages invalidates the entire comparison; even something as simple as sample size values differing across specifications would flag it. This is both easy and not easy — but it is not the work humans should be doing by hand given how easy it would be to even get that much wrong.&lt;/p&gt;
&lt;/blockquote&gt;</description></item><item><title>In the multiverse of forking paths</title><link>https://muddy.jprs.me/links/2026-02-16-in-the-multiverse-of-forking-paths/</link><pubDate>Mon, 16 Feb 2026 22:49:00 -0500</pubDate><guid>https://muddy.jprs.me/links/2026-02-16-in-the-multiverse-of-forking-paths/</guid><description>&lt;p&gt;&lt;img src="https://muddy.jprs.me/media/stark-strange-one.jpg" alt="A scene from Avengers: Infinity War, where Tony Stark asks Dr. Strange, “How many did we win?” Strange replies, “One.”"&gt;&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;STRANGE: I went forward in time to view alternate modelling decisions, to see all the possible outcomes of the coming analysis.&lt;br&gt;
STAR-LORD: How many did you see?&lt;br&gt;
STRANGE: 14,000,605.&lt;br&gt;
STARK: In how many did we achieve statistical significance?&lt;br&gt;
STRANGE: One.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;Prof. Jessica Hullman recently wrote a piece on Andrew Gelman&amp;rsquo;s blog discussing the use of &amp;lsquo;multiverse analysis&amp;rsquo;, i.e., examining the results of the many slightly different decisions we could have made when constructing a model. This problem is commonly known as the &lt;a href="https://en.wikipedia.org/wiki/Forking_paths_problem"&gt;garden of forking paths&lt;/a&gt;—during an analysis, a researcher is forced to make many small, sometimes arbitrary decisions that can lead to a different result if another researcher tries to independently replicate the analysis. While usually an innocent and inevitable part of the modelling process, these &amp;lsquo;researcher degrees of freedom&amp;rsquo; can also be manipulated to produce a desired result.&lt;/p&gt;
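&lt;p&gt;The mechanics are easy to sketch: enumerate every combination of defensible choices and fit the model once per &amp;lsquo;universe&amp;rsquo;. A toy outline (the decision points here are hypothetical, and the hard part in practice is deciding which universes are actually reasonable):&lt;/p&gt;
&lt;pre&gt;&lt;code class="language-python"&gt;from itertools import product

# Hypothetical modelling decisions, each individually defensible.
choices = {
    "outcome":    ["raw_score", "log_score"],
    "exclusions": ["none", "drop_outliers"],
    "covariates": ["age", "age+gender", "age+gender+income"],
    "estimator":  ["ols", "robust"],
}

def fit_universe(universe):
    """Placeholder for fitting the model implied by one set of
    choices and returning its estimate and p-value."""
    return {"universe": universe, "estimate": None, "p": None}

# 2 * 2 * 3 * 2 = 24 universes; the point is to report the whole
# distribution of estimates, not the one specification that 'worked'.
multiverse = [fit_universe(dict(zip(choices, combo)))
              for combo in product(*choices.values())]
print(len(multiverse))  # 24
&lt;/code&gt;&lt;/pre&gt;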
&lt;p&gt;Prof. Hullman points out that multiverse analysis will only become more salient as AI coding tools such as Claude Code make it easier than ever to iterate on how we model our research questions.&lt;/p&gt;
&lt;p&gt;Her longer paper with Julia M. Rohrer and Andrew Gelman, &amp;ldquo;What&amp;rsquo;s a multiverse good for anyway?&amp;rdquo;, is available &lt;a href="https://osf.io/preprints/psyarxiv/37g29_v1"&gt;here&lt;/a&gt;.&lt;/p&gt;</description></item><item><title>There is only one statistical test</title><link>https://muddy.jprs.me/links/2026-02-11-there-is-only-one-statistical-test/</link><pubDate>Wed, 11 Feb 2026 23:58:00 -0500</pubDate><guid>https://muddy.jprs.me/links/2026-02-11-there-is-only-one-statistical-test/</guid><description>&lt;p&gt;A classic article by computer scientist Allen Downey on why there is only one statistical test: compute a test statistic from your observed data, simulate many datasets under the null hypothesis, and finally approximate a p-value as the fraction of simulated test statistics that meet or exceed the one from your observed data.&lt;/p&gt;
&lt;p&gt;&lt;img src="https://muddy.jprs.me/media/one-statistical-test.png" alt="Diagram illustrating a single hypothesis-testing workflow: observed data are converted into a test statistic (effect δ∗); a null model H0 generates many simulated datasets to form the distribution of δ under H0; the p-value is the tail area of that distribution beyond δ∗."&gt;&lt;/p&gt;
&lt;p&gt;Downey suggests using general simulation methods over the canon of rigid, inflexible tests invented when computation was difficult and expensive.&lt;/p&gt;
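&lt;p&gt;As a minimal sketch, here is the recipe instantiated as a permutation test for a difference in group means (one of many test statistics and null models you could slot into the same skeleton):&lt;/p&gt;
&lt;pre&gt;&lt;code class="language-python"&gt;import numpy as np

def perm_test_diff_means(a, b, n_sim=10_000, seed=None):
    """Downey's recipe: test statistic, simulated null, tail fraction."""
    rng = np.random.default_rng(seed)
    observed = abs(np.mean(a) - np.mean(b))   # 1. test statistic
    pooled = np.concatenate([a, b])
    sims = np.empty(n_sim)
    for k in range(n_sim):                    # 2. simulate the null by
        rng.shuffle(pooled)                   #    shuffling group labels
        sims[k] = abs(pooled[:len(a)].mean() - pooled[len(a):].mean())
    return np.mean(sims &amp;gt;= observed)      # 3. p-value = tail fraction

rng = np.random.default_rng(1)
a = rng.normal(0.0, 1.0, size=50)
b = rng.normal(0.4, 1.0, size=50)
print(perm_test_diff_means(a, b, seed=2))
&lt;/code&gt;&lt;/pre&gt;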
&lt;p&gt;&lt;sup&gt;Hat tip to Ryan Briggs on &lt;a href="https://x.com/ryancbriggs/status/2021614352328003642"&gt;Twitter&lt;/a&gt;.&lt;/sup&gt;&lt;/p&gt;</description></item><item><title>Anthropic's statistical analysis skill doesn't get statistical significance quite right</title><link>https://muddy.jprs.me/links/2026-02-06-anthropic-s-statistical-analysis-skill-doesn-t-get-statistical-significance-quite-right/</link><pubDate>Fri, 06 Feb 2026 19:30:00 -0500</pubDate><guid>https://muddy.jprs.me/links/2026-02-06-anthropic-s-statistical-analysis-skill-doesn-t-get-statistical-significance-quite-right/</guid><description>&lt;p&gt;Anthropic&amp;rsquo;s new statistical analysis skill demonstrates a common misunderstanding of statistical significance:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;Statistical significance means the difference is unlikely due to chance.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;But this phrasing isn&amp;rsquo;t quite right. The p-value in Null Hypothesis Significance Testing is not about the probability the results are &amp;ldquo;due to chance&amp;rdquo;; it is the probability—under the null hypothesis and the model assumptions—of observing results at least as extreme as the ones we obtained. In other words, the p-value summarizes how compatible the data are with the null, given our modelling choices. What it does not tell you is the probability that the null hypothesis is true.&lt;/p&gt;
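&lt;p&gt;The gap between the two statements is easy to demonstrate by simulation. In the invented setup below (80% of tested hypotheses truly null, modest effect sizes), the null is true for roughly a third of the results that clear p &amp;lt; 0.05, nowhere near 5%:&lt;/p&gt;
&lt;pre&gt;&lt;code class="language-python"&gt;import numpy as np

# If the p-value told us P(null is true), then among results with
# p &amp;lt; 0.05 the null would hold about 5% of the time. It doesn't.
# Invented setup: 80% of hypotheses are truly null; real effects
# have standardized size 0.3 with n = 50 per arm.
rng = np.random.default_rng(42)
n = 100_000
se = (2 / 50) ** 0.5                     # SE of a two-sample mean diff
null_true = rng.random(n) &amp;lt; 0.8
true_diff = np.where(null_true, 0.0, 0.3)
z = (true_diff + rng.normal(0.0, se, n)) / se
significant = np.abs(z) &amp;gt; 1.96        # two-sided p &amp;lt; 0.05

print(f"P(null true | significant) = {null_true[significant].mean():.2f}")
# About 0.38 under these assumptions, not 0.05.
&lt;/code&gt;&lt;/pre&gt;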
&lt;p&gt;Statistician Andrew Gelman gave a good definition for statistical significance in a 2015 &lt;a href="https://statmodeling.stat.columbia.edu/2015/07/21/a-bad-definition-of-statistical-significance-from-the-u-s-department-of-health-and-human-services-effective-health-care-program/"&gt;blog post&lt;/a&gt;:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;A mathematical technique to measure the strength of evidence from a single study. Statistical significance is conventionally declared when the p-value is less than 0.05. The p-value is the probability of seeing a result as strong as observed or greater, under the &lt;em&gt;null hypothesis&lt;/em&gt; (which is commonly the hypothesis that there is no effect). Thus, the smaller the p-value, the less consistent are the data with the null hypothesis under this measure.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;As some of the commenters in this blog post observe, simply being able to parrot a technically accurate definition of a p-value does not necessarily make us better at applying statistical significance in practice. It is certainly true that statistical significance is widely misused in scientific publishing as a threshold to distinguish signal from noise (or to be fancy, a &amp;ldquo;lexicographic decision rule&amp;rdquo;), which is why &lt;a href="https://sites.stat.columbia.edu/gelman/research/published/abandon.pdf"&gt;some scientists have argued that we should abandon it as the default statistical paradigm for research&lt;/a&gt;.&lt;/p&gt;</description></item><item><title>A/B testing for advertising is not randomized</title><link>https://muddy.jprs.me/links/2026-02-01-a-b-testing-for-advertising-is-not-randomized/</link><pubDate>Sun, 01 Feb 2026 23:09:00 -0500</pubDate><guid>https://muddy.jprs.me/links/2026-02-01-a-b-testing-for-advertising-is-not-randomized/</guid><description>&lt;p&gt;Florian Teschner writes about a &lt;a href="https://www.sciencedirect.com/science/article/pii/S0167811624001149"&gt;recent paper&lt;/a&gt; from Bögershausen, Oertzen, &amp;amp; Bock arguing that online ad platforms like Facebook and Google misrepresent the meaning of &amp;ldquo;A/B testing&amp;rdquo; for ad campaigns. In A/B testing, we might assume the platform is randomly assigning users to see ad A or ad B, in an attempt to get a clean causal interpretation of which ad is more likely to drive a click (or whatever outcome you’re tracking).&lt;/p&gt;
&lt;p&gt;But according to the paper, this is usually not what is happening. Instead, the platform optimizes delivery for each ad independently, &lt;em&gt;steering each one toward the users most likely to click it&lt;/em&gt;. In other words, the two ads may be shown to different groups of users, and differences in click-through rates may be attributable to who is seeing the ad, as opposed to the overall appeal of the ad. Ad platforms convert A/B tests from simple randomized experiments into murky observational comparisons. For example, an ad may appear to do better because it happened to be shown disproportionately to a group with a high click-through rate, not because it presents a more compelling overall message. Advertisers get the warm glow of &amp;ldquo;experimentally backed&amp;rdquo; marketing without the assurances of randomization.&lt;/p&gt;</description></item><item><title>Twyman's law</title><link>https://muddy.jprs.me/links/2026-01-30-twyman-s-law/</link><pubDate>Fri, 30 Jan 2026 19:25:00 -0500</pubDate><guid>https://muddy.jprs.me/links/2026-01-30-twyman-s-law/</guid><description>&lt;p&gt;From Wikipedia:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Twyman&amp;rsquo;s law&lt;/strong&gt; states that &amp;ldquo;Any figure that looks interesting or different is usually wrong&amp;rdquo;.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;A bit different from that &lt;a href="https://quoteinvestigator.com/2015/03/02/eureka-funny/"&gt;oft-quoted line attributed to Isaac Asimov&lt;/a&gt;:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;The most exciting phrase in science is not ‘Eureka!’ but ‘that&amp;rsquo;s funny’&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;But Twyman&amp;rsquo;s law is much truer in my experience. Surprising results are usually a signal that something is screwy with my data, my assumptions, or my pipeline.&lt;/p&gt;
&lt;p&gt;&lt;sup&gt;Hat tip to DJ Rich on &lt;a href="https://x.com/DuaneJRich/status/2016919349135888562"&gt;Twitter&lt;/a&gt;.&lt;/sup&gt;&lt;/p&gt;</description></item></channel></rss>