<?xml version="1.0" encoding="utf-8" standalone="yes"?><rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom"><channel><title>Science on Big Muddy</title><link>https://muddy.jprs.me/tags/science/</link><description>Recent content in Science on Big Muddy</description><generator>Hugo</generator><language>en-US</language><lastBuildDate>Tue, 07 Apr 2026 18:27:00 -0400</lastBuildDate><atom:link href="https://muddy.jprs.me/tags/science/index.xml" rel="self" type="application/rss+xml"/><item><title>What is a public opinion poll without the public?</title><link>https://muddy.jprs.me/links/2026-04-07-what-is-a-public-opinion-poll-without-the-public/</link><pubDate>Tue, 07 Apr 2026 18:27:00 -0400</pubDate><guid>https://muddy.jprs.me/links/2026-04-07-what-is-a-public-opinion-poll-without-the-public/</guid><description>&lt;p&gt;A few days ago, two professors (Leif Weatherby and Benjamin Recht) published an opinion piece in the &lt;em&gt;New York Times&lt;/em&gt; calling attention to an Axios &lt;a href="https://www.axios.com/2026/03/19/olivia-walton-heartland-forward-maternal-health"&gt;story&lt;/a&gt; on maternal health that used invented polling results:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;A recent Axios story on maternal health policy referred to “findings” that a majority of people trusted their doctors and nurses. On the surface, there’s nothing unusual about that. What wasn’t originally mentioned, however, was that these findings were made up.&lt;/p&gt;
&lt;p&gt;Clicking through the links revealed (as did a subsequent editor’s note and clarification by Axios) that the public opinion poll was a computer simulation run by the artificial intelligence start-up Aaru. No people were involved in the creation of these opinions.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;The piece goes on to argue that this so-called &amp;ldquo;silicon sampling&amp;rdquo; is seductive because good public opinion polling is expensive, hard to do, and still prone to bias. But this shortcut magnifies the problem of bias rather than solving it.&lt;/p&gt;
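&lt;p&gt;To make the idea concrete, here is a minimal sketch of what &amp;ldquo;silicon sampling&amp;rdquo; amounts to in practice. The &lt;code&gt;ask_llm&lt;/code&gt; helper, the personas, and the question are all hypothetical placeholders, not Aaru&amp;rsquo;s actual pipeline:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;import random

# Hypothetical stand-in for a language model API call. A real
# implementation would send the prompt to an LLM and parse the
# reply; here we just return a canned answer.
def ask_llm(prompt):
    return random.choice(["agree", "disagree"])

# Synthetic "respondents": demographic sketches, not people.
personas = [
    "a 34-year-old nurse in Ohio",
    "a 61-year-old retired farmer in Kansas",
    "a 27-year-old graduate student in California",
]

question = "Do you trust your doctors and nurses?"

# Poll the simulated panel and tally the "opinions".
answers = [
    ask_llm(f"You are {p}. {question} Answer agree or disagree.")
    for p in personas
]
share = answers.count("agree") / len(answers)
print(f"{share:.0%} of simulated respondents agree")
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;No sampling frame, no nonresponse, no people: the &amp;ldquo;public&amp;rdquo; is whatever distribution of personas you chose to write down.&lt;/p&gt;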
&lt;p&gt;I&amp;rsquo;ve read a little bit about this strategy of using LLM-generated survey participants in social science research, in a series of posts (mostly by Prof. Jessica Hullman) over on &lt;a href="https://statmodeling.stat.columbia.edu/"&gt;Andrew Gelman&amp;rsquo;s blog&lt;/a&gt;:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href="https://statmodeling.stat.columbia.edu/2025/12/19/validating-language-models-as-study-participants-how-its-being-done-why-it-fails-and-what-works-instead/"&gt;Validating language models as study participants: How it’s being done, why it fails, and what works instead&lt;/a&gt; (2025-12-19)&lt;/li&gt;
&lt;li&gt;&lt;a href="https://statmodeling.stat.columbia.edu/2025/08/26/survey-statistics-thomas-lumley-writes-about-interviewing-your-laptop/"&gt;Survey Statistics: Thomas Lumley writes about Interviewing your Laptop&lt;/a&gt; (2025-08-26)&lt;/li&gt;
&lt;li&gt;&lt;a href="https://statmodeling.stat.columbia.edu/2025/08/15/when-does-it-make-sense-to-talk-about-llms-having-beliefs/"&gt;When does it make sense to talk about LLMs having beliefs?&lt;/a&gt; (2025-08-15)&lt;/li&gt;
&lt;li&gt;&lt;a href="https://statmodeling.stat.columbia.edu/2025/06/12/better-and-worse-ways-to-mix-human-and-llm-responses-in-behavioral-research-but-you-still-have-to-figure-what-youre-measuring/"&gt;Better and worse ways to mix human and LLM responses in behavioral research (but you still have to figure what you’re measuring)&lt;/a&gt; (2025-06-12)&lt;/li&gt;
&lt;li&gt;&lt;a href="https://statmodeling.stat.columbia.edu/2025/05/29/llms-as-behavioral-study-participants/"&gt;LLMs as behavioral study participants&lt;/a&gt; (2025-05-29)&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Silicon sampling seems moderately interesting from a research perspective, but I can&amp;rsquo;t help but agree with the &lt;em&gt;New York Times&lt;/em&gt; opinion piece authors that this will be ruinous for the already waning trust in public opinion polling. If you didn&amp;rsquo;t bother to ask the public, then why should the public care what you &amp;ldquo;find&amp;rdquo;? I think there is probably a lot of utility in using LLM samples to aid in designing and validating surveys, though.&lt;/p&gt;</description></item><item><title>How SARS-CoV-2 variants get named on GitHub</title><link>https://muddy.jprs.me/links/2026-03-26-how-sars-cov-2-variants-get-named-on-github/</link><pubDate>Thu, 26 Mar 2026 07:00:00 -0400</pubDate><guid>https://muddy.jprs.me/links/2026-03-26-how-sars-cov-2-variants-get-named-on-github/</guid><description>&lt;p&gt;Bioinformatics has long been an unusually collaborative and transparent field, with genomes, protein structures, and other complex biological data habitually deposited into open databases during the course of research. The situation was no different at the outset of the COVID-19 pandemic, when a small group of scientists developed the &lt;a href="https://en.wikipedia.org/wiki/Phylogenetic_Assignment_of_Named_Global_Outbreak_Lineages"&gt;Pango nomenclature&lt;/a&gt; for classifying variants of the SARS-CoV-2 virus. Aside from the handful of Greek-letter names the World Health Organization assigned to &amp;ldquo;variants of concern,&amp;rdquo; the Pango nomenclature is the standard for tracking the virus&amp;rsquo;s evolution. You may recall names such as B.1.1.7 (Alpha or the UK variant), B.1.351 (Beta or the South African variant), and P.1 (Gamma or the Brazilian variant). You can see a complete list of active SARS-CoV-2 lineages using the Pango nomenclature &lt;a href="https://cov-lineages.org/lineage_list.html"&gt;here&lt;/a&gt;.&lt;/p&gt;
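&lt;p&gt;The names themselves encode ancestry: each dot-separated suffix designates a child of the lineage to its left, so B.1.1.7 descends from B.1.1, which descends from B.1, and so on. A minimal sketch of walking that hierarchy (it ignores Pango&amp;rsquo;s alias system, which maps names like BA onto B.1.1.529 to keep them short):&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;def pango_ancestors(lineage):
    """Return the ancestors of an unaliased Pango lineage name.

    Ignores Pango's alias system (BA stands for B.1.1.529, etc.),
    so it should only be applied to unaliased names like B.1.1.7.
    """
    parts = lineage.split(".")
    # Drop one trailing component at a time: B.1.1.7, B.1.1, B.1, B
    return [".".join(parts[:i]) for i in range(len(parts) - 1, 0, -1)]

print(pango_ancestors("B.1.1.7"))  # ['B.1.1', 'B.1', 'B']
&lt;/code&gt;&lt;/pre&gt;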
&lt;p&gt;By &lt;a href="https://github.com/cov-lineages/pango-designation/issues/1"&gt;August 2020&lt;/a&gt;, the work of defining new lineages of SARS-CoV-2 had moved to &lt;a href="https://github.com/cov-lineages/pango-designation"&gt;GitHub&lt;/a&gt;, where the scientific process could happen in a transparent and collaborative way. New lineages are proposed and debated as GitHub issues. In &lt;a href="https://github.com/cov-lineages/pango-designation/issues/1988"&gt;May 2023&lt;/a&gt;, a second &lt;a href="https://github.com/sars-cov-2-variants/lineage-proposals"&gt;GitHub repository&lt;/a&gt; was opened to move discussions of smaller or less clear lineages out of the main repository. These discussions can be promoted to the main repository, as this &lt;a href="https://github.com/sars-cov-2-variants/lineage-proposals/issues/2199"&gt;issue tracking LP.8.1 sub-lineages&lt;/a&gt; was in &lt;a href="https://github.com/cov-lineages/pango-designation/issues/2978"&gt;May 2025&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;That work continues to this day on GitHub, as the virus keeps mutating and evolving. And bioinformatics continues to be a shining beacon of open science for the rest of us to learn from.&lt;/p&gt;</description></item><item><title>Fight club at the bird feeder</title><link>https://muddy.jprs.me/links/2026-03-20-fight-club-at-the-bird-feeder/</link><pubDate>Fri, 20 Mar 2026 07:00:00 -0400</pubDate><guid>https://muddy.jprs.me/links/2026-03-20-fight-club-at-the-bird-feeder/</guid><description>&lt;p&gt;&lt;em&gt;Alternate title: Blue Jay brutally feeder mogs Tufted Titmouse&lt;/em&gt;&lt;/p&gt;
&lt;p&gt;&lt;img src="https://muddy.jprs.me/media/20260319-214037.png" alt="Network showing dominance hierarchy among 13 common feeder birds; the Blue Jay wins against 10 species and loses to 3"&gt;&lt;/p&gt;
&lt;p&gt;From the Cornell Lab of Ornithology, a pretty neat article about dominance hierarchies at the bird feeder using over 7,600 observations collected by citizen scientists contributing to &lt;a href="https://feederwatch.org/"&gt;Project FeederWatch&lt;/a&gt;. Essentially, bird watchers reported instances in which one bird species successfully displaced another at the feeder, and the researchers used this network of interactions to build a dominance hierarchy. By using information contained within the network, you can even compare species that are rarely observed together. Not all dominance patterns are linear, however, as the article reports:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;A separate analysis uncovered some dominance triangles in which three birds had one-to-one relationships independent of each other, like a game of birdy rock-paper-scissors. For example, the House Finch dominates the Purple Finch, and the Purple Finch dominates the Dark-eyed Junco, but the junco dominates House Finch.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;The full paper is here: &lt;a href="https://doi.org/10.1093/beheco/arx108"&gt;Fighting over food unites the birds of North America in a continental dominance hierarchy&lt;/a&gt;.&lt;/p&gt;
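&lt;p&gt;The trick that lets rarely co-occurring species be ranked is indirect comparison through the network. Here is a toy sketch using a simple Bradley-Terry model; this is an illustration of the idea, not the paper&amp;rsquo;s actual method, and the win counts are invented:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# Toy Bradley-Terry fit: jay beats titmouse, titmouse beats junco,
# and the model still ranks jay vs. junco, which never meet directly.
wins = {
    ("Blue Jay", "Tufted Titmouse"): 9,
    ("Tufted Titmouse", "Blue Jay"): 1,
    ("Tufted Titmouse", "Dark-eyed Junco"): 7,
    ("Dark-eyed Junco", "Tufted Titmouse"): 3,
}
species = ["Blue Jay", "Tufted Titmouse", "Dark-eyed Junco"]
strength = {s: 1.0 for s in species}

# Standard MM update: new strength is total wins divided by the sum,
# over opponents, of games played against them over the pair's
# combined strength. Normalizing each round fixes the overall scale.
for _ in range(200):
    new = {}
    for s in species:
        w = sum(n for (a, b), n in wins.items() if a == s)
        denom = sum(
            n / (strength[s] + strength[o])
            for (a, b), n in wins.items()
            if s in (a, b)
            for o in [b if a == s else a]
        )
        new[s] = w / denom
    total = sum(new.values())
    strength = {s: v / total for s, v in new.items()}

for s in sorted(species, key=strength.get, reverse=True):
    print(f"{s}: {strength[s]:.2f}")
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The Blue Jay and the junco never meet in the data, yet the fitted strengths still order them. The same indirect-comparison logic drives the network meta-analysis mentioned below.&lt;/p&gt;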
&lt;p&gt;This work is reminiscent of &lt;a href="https://www.cochrane.org/authors/handbooks-and-manuals/handbook/current/chapter-11"&gt;network meta-analysis&lt;/a&gt;, in which three or more interventions (e.g., drugs) are compared using both direct and indirect evidence. For example, if there are studies comparing drug A versus drug B and drug B versus drug C, we can infer the comparison between drug A and drug C, even if no study has ever directly compared them.&lt;/p&gt;</description></item><item><title>More on vibe researching</title><link>https://muddy.jprs.me/links/2026-02-13-more-on-vibe-researching/</link><pubDate>Fri, 13 Feb 2026 23:49:00 -0500</pubDate><guid>https://muddy.jprs.me/links/2026-02-13-more-on-vibe-researching/</guid><description>&lt;p&gt;To follow on &lt;a href="https://muddy.jprs.me/links/2026-02-12-an-end-to-end-ai-pipeline-for-policy-evaluation-papers/"&gt;yesterday&amp;rsquo;s post&lt;/a&gt; on AI-produced research, here is a reflection on &amp;ldquo;vibe researching&amp;rdquo; from Prof. Joshua Gans of the University of Toronto&amp;rsquo;s Rotman School of Management. Since the release of the first &amp;ldquo;reasoning&amp;rdquo; models in late 2024, he has gone all in on experimenting with AI-first research.&lt;/p&gt;
&lt;p&gt;One of the key takeaways is that he found himself pursuing low-quality ideas to completion more often, precisely because the cost of continuing to pursue a questionable idea has fallen. Sycophancy is a problem, too. With an AI cheerleader, it is easy to convince yourself you have a result when you do not.&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;Those ideas were all fine but not high quality, and what is worse, I didn’t realise that they weren’t that significant until external referees said so. I didn’t realise it because they were reasonably hard to do, and I was happy to have solved them.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;I will note that (human) peer reviewers cannot be the levee that stops the flood of middling AI research: the system of uncompensated labour that undergirds all of academic publishing is already strained to bursting, as every editor desperate to find referees for a paper will tell you.&lt;/p&gt;
&lt;p&gt;Prof. Gans concludes that his year-long experiment in &amp;ldquo;vibe researching&amp;rdquo; was a failure, despite producing many working papers and publishing a handful of them.&lt;/p&gt;</description></item><item><title>An end-to-end AI pipeline for policy evaluation papers</title><link>https://muddy.jprs.me/links/2026-02-12-an-end-to-end-ai-pipeline-for-policy-evaluation-papers/</link><pubDate>Thu, 12 Feb 2026 19:11:00 -0500</pubDate><guid>https://muddy.jprs.me/links/2026-02-12-an-end-to-end-ai-pipeline-for-policy-evaluation-papers/</guid><description>&lt;p&gt;Prof. David Yanagizawa-Drott from the Social Catalyst Lab at the University of Zurich has launched Project APE (Autonomous Policy Evaluation), an end-to-end AI pipeline to generate policy evaluation papers. The vast majority of policies around the world are never rigorously evaluated, so it would certainly be useful if we were able to do so in an automated fashion.&lt;/p&gt;
&lt;p&gt;Claude Code is the heart of the project, but other models are used to review the outputs and provide journal-style referee reports. All the coding is done in R (though Python is called in some scripts). Currently, Gemini 3 Flash serves as the judge, comparing the AI-generated papers against published research in top economics journals:&lt;/p&gt;
&lt;blockquote&gt;
&lt;ul&gt;
&lt;li&gt;Blind comparison: An LLM judge compares two papers without knowing which is AI-generated&lt;/li&gt;
&lt;li&gt;Position swapping: Each pair is judged twice with paper order swapped to control for bias&lt;/li&gt;
&lt;li&gt;TrueSkill ratings: Papers accumulate skill ratings that update after each match&lt;/li&gt;
&lt;/ul&gt;
&lt;/blockquote&gt;
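&lt;p&gt;A rough sketch of how such a tournament loop could work, with position-swapped judging feeding the &lt;code&gt;trueskill&lt;/code&gt; Python package (the &lt;code&gt;llm_judge&lt;/code&gt; stub is a hypothetical placeholder, and none of this is Project APE&amp;rsquo;s actual code):&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;import random
import trueskill

# Hypothetical stand-in for an LLM judge call: returns the winner.
# A real judge would send both papers to a model and parse its pick.
def llm_judge(paper_a, paper_b):
    return random.choice([paper_a, paper_b])

ratings = {"ai_paper": trueskill.Rating(), "human_paper": trueskill.Rating()}

def judged_match(a, b):
    # Position swapping: judge each pair twice with the order reversed,
    # so a judge that favors whichever paper appears first is averaged out.
    first = llm_judge(a, b)
    second = llm_judge(b, a)
    if first == second:  # agreement: record a win and a loss
        winner, loser = first, b if first == a else a
        ratings[winner], ratings[loser] = trueskill.rate_1vs1(
            ratings[winner], ratings[loser]
        )
    else:  # judge flipped with position: call it a draw
        ratings[a], ratings[b] = trueskill.rate_1vs1(
            ratings[a], ratings[b], drawn=True
        )

judged_match("ai_paper", "human_paper")
print(ratings)
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Treating a position-dependent verdict as a draw is one simple way to cash out the &amp;ldquo;control for bias&amp;rdquo; step; averaging the two verdicts into a fractional win would be another.&lt;/p&gt;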
&lt;p&gt;The project&amp;rsquo;s home page lists the AI&amp;rsquo;s current &amp;ldquo;win rate&amp;rdquo; at 3.5% in head-to-head matchups against human-written papers.&lt;/p&gt;
&lt;p&gt;Prof. Yanagizawa-Drott says &amp;ldquo;Currently it requires at a minimum some initial human input for each paper,&amp;rdquo; although he does not specify exactly what. If we look at the &lt;a href="https://github.com/SocialCatalystLab/ape-papers/blob/main/apep_0264/v1/initialization.md"&gt;&lt;code&gt;initialization.json&lt;/code&gt;&lt;/a&gt; file found in each paper&amp;rsquo;s directory, we see the following questions with user-provided inputs (a mocked-up example follows the list):&lt;/p&gt;
&lt;blockquote&gt;
&lt;ol&gt;
&lt;li&gt;Policy domain: What policy area interests you?&lt;/li&gt;
&lt;li&gt;Method: Which identification method?&lt;/li&gt;
&lt;li&gt;Data era: Modern or historical data?&lt;/li&gt;
&lt;li&gt;API keys: Did you configure data API keys?&lt;/li&gt;
&lt;li&gt;External review: Include external model reviews?&lt;/li&gt;
&lt;li&gt;Risk appetite: Exploration vs exploitation?&lt;/li&gt;
&lt;li&gt;Other preferences: Any other preferences or constraints?&lt;/li&gt;
&lt;/ol&gt;
&lt;/blockquote&gt;
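&lt;p&gt;Purely as an illustration of the shape of that step, here is what a set of answers might look like; the field names and values below are invented, not the project&amp;rsquo;s actual schema:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# Invented example answers to the seven initialization questions.
# The real per-paper files in the ape-papers repo record the actual
# inputs; nothing here is copied from them.
initialization = {
    "policy_domain": "minimum wage",                 # 1. policy area
    "method": "difference-in-differences",           # 2. identification method
    "data_era": "modern",                            # 3. modern or historical
    "api_keys_configured": True,                     # 4. data API keys
    "external_review": True,                         # 5. external model reviews
    "risk_appetite": "exploration",                  # 6. exploration vs. exploitation
    "other_preferences": "US state-level policies",  # 7. free-form constraints
}
&lt;/code&gt;&lt;/pre&gt;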
&lt;p&gt;The code, reviews, manuscripts, and even the results of the initial idea-generation process are all available on &lt;a href="https://github.com/SocialCatalystLab/ape-papers"&gt;GitHub&lt;/a&gt;. Their immediate goal is to generate a sample of 1,000 papers and run human evaluations on them (at the time of posting, there are 264 papers in the GitHub repository).&lt;/p&gt;</description></item></channel></rss>