<?xml version="1.0" encoding="utf-8" standalone="yes"?><rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom"><channel><title>Ai on Big Muddy</title><link>https://muddy.jprs.me/tags/ai/</link><description>Recent content in Ai on Big Muddy</description><generator>Hugo</generator><language>en-US</language><lastBuildDate>Fri, 10 Apr 2026 18:27:00 -0400</lastBuildDate><atom:link href="https://muddy.jprs.me/tags/ai/index.xml" rel="self" type="application/rss+xml"/><item><title>Scientists invent a fake disease, AI picks it up, other scientists cite it</title><link>https://muddy.jprs.me/links/2026-04-10-scientists-invent-a-fake-disease-ai-picks-it-up-other-scientists-cite-it/</link><pubDate>Fri, 10 Apr 2026 18:27:00 -0400</pubDate><guid>https://muddy.jprs.me/links/2026-04-10-scientists-invent-a-fake-disease-ai-picks-it-up-other-scientists-cite-it/</guid><description>&lt;p&gt;A somewhat disturbing bit of reporting from &lt;em&gt;Nature&lt;/em&gt; tells the story of bixonimania, a fake eye disease invented by Swedish medical researcher Almira Osmanovic Thunström and her team. She seeded the idea for the fake disease in a series of ridiculous, joke-filled blog posts and preprints in mid-2024.&lt;/p&gt;
&lt;p&gt;Because AI can be overly credulous with its sourcing (how often do Google&amp;rsquo;s AI answers confidently cite random Reddit posts for the bulk of an answer?), the disease got picked up as an &amp;ldquo;emerging term&amp;rdquo; by the leading chatbots. The preprints even got cited a handful of times in real publications, which is further evidence that scientists don&amp;rsquo;t read the papers they cite (I guess the modern equivalent of copying citations from other papers is having AI dredge the literature for you).&lt;/p&gt;
&lt;p&gt;I can see AI agents being exploited by those pushing dubious medical diagnoses to flood the Internet and preprint servers with articles aimed at convincing LLMs of the validity of their positions. That is, if the agents aren&amp;rsquo;t too busy &lt;a href="https://theshamblog.com/an-ai-agent-published-a-hit-piece-on-me/"&gt;spinning up websites to defame&lt;/a&gt; those who incur their wrath.&lt;/p&gt;</description></item><item><title>A data point against the idea that AI will freeze/homogenize culture</title><link>https://muddy.jprs.me/links/2026-04-09-a-data-point-against-the-idea-that-ai-will-freeze-homogenize-culture/</link><pubDate>Thu, 09 Apr 2026 07:00:00 -0400</pubDate><guid>https://muddy.jprs.me/links/2026-04-09-a-data-point-against-the-idea-that-ai-will-freeze-homogenize-culture/</guid><description>&lt;p&gt;Here&amp;rsquo;s an interesting figure and accompanying passage from this 2023 preprint entitled &amp;ldquo;Machine Culture&amp;rdquo;:&lt;/p&gt;
&lt;p&gt;&lt;img src="https://muddy.jprs.me/media/go-performance.png" alt="Chart of professional Go players’ decision quality over time. Performance is mostly flat from 1950 to 2015, then increases steeply after AlphaGo’s 2016 match with Lee Sedol, highlighted by a shaded region on the right."&gt;&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;The innovations generated by AlphaGo and AlphaGo Zero soon entered human culture, as shown by research comparing human gameplay before and after the algorithms’ introduction. The decision quality, as measured by an open-source variant of AlphaGo Zero, showed very little improvement in human gameplay from 1950 to 2016, followed by a sudden improvement after the introduction of AlphaGo in March 2016. However, this improvement wasn’t solely due to humans adopting strategies developed by AlphaGo. It also reflected an unexpected shift, wherein humans started developing moves that were qualitatively distinct both from previous human moves and from the novel moves introduced by AlphaGo. In summary, AlphaGo served as an early, quantifiable exemplar of machine culture, generating novel cultural variations through genuine, nonhuman innovation. This was followed by a major transition into an even broader range of traits as the result of humans building on the previous discoveries made by machines. As the methods underpinning AlphaGo have been generalized to other games and extended to scientific problems, we anticipate a continued infusion of machine-generated discoveries across diverse domains of human culture.&lt;/p&gt;</description></item><item><title>AI makes it easier to generate fake papers, too</title><link>https://muddy.jprs.me/links/2026-04-08-ai-makes-it-easier-to-generate-fake-papers-too/</link><pubDate>Wed, 08 Apr 2026 20:09:00 -0400</pubDate><guid>https://muddy.jprs.me/links/2026-04-08-ai-makes-it-easier-to-generate-fake-papers-too/</guid><description>&lt;p&gt;Here&amp;rsquo;s a fun project from Tyler Vigen, creator of the famous &lt;a href="https://tylervigen.com/spurious-correlations"&gt;Spurious Correlations&lt;/a&gt; page (which has been cited as a cautionary tale in many a science class). Using his database of real but spurious correlations (created by calculating the Pearson correlation coefficient &lt;em&gt;r&lt;/em&gt; between a very large number of variables and picking out the hits), he used AI to create amusing fake manuscripts expounding on these statistical flukes as if they were real research questions.&lt;/p&gt;
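&lt;p&gt;The dredging recipe is simple enough to sketch in a few lines of Python. This is my toy illustration, not Tyler&amp;rsquo;s actual code: generate a pile of unrelated series, compute Pearson&amp;rsquo;s &lt;em&gt;r&lt;/em&gt; for every pair, and keep whatever looks impressive:&lt;/p&gt;
&lt;pre&gt;&lt;code class="language-python"&gt;import numpy as np
import pandas as pd
from itertools import combinations

# Toy stand-in for the database: 200 unrelated yearly series.
rng = np.random.default_rng(42)
years = pd.RangeIndex(2000, 2022)
series = pd.DataFrame(rng.normal(size=(len(years), 200)), index=years)

# Dredge: Pearson's r for every pair, keeping the impressive-looking hits.
hits = []
for a, b in combinations(series.columns, 2):
    r = series[a].corr(series[b])  # Pearson by default
    if abs(r) &amp;gt; 0.7:
        hits.append((a, b, r))

print(len(hits), "spurious 'discoveries' out of", 200 * 199 // 2, "pairs")
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;With enough variables, the multiple comparisons problem guarantees a steady supply of &amp;ldquo;significant&amp;rdquo; flukes to write up.&lt;/p&gt;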
&lt;p&gt;These papers were generated in January 2024, and as &lt;a href="https://muddy.jprs.me/links/2026-02-12-an-end-to-end-ai-pipeline-for-policy-evaluation-papers/"&gt;previously discussed&lt;/a&gt; on this blog, the pipeline for end-to-end paper generation has come a long way in two years. I have no doubt Tyler could make these papers sound much more convincing using today&amp;rsquo;s models, though of course his goal here is to make you laugh (and think), not to trick you. But there will surely be many scholars adopting this data dredging strategy to generate &amp;ldquo;real&amp;rdquo; papers, contributing to a deluge of papers &lt;a href="https://muddy.jprs.me/links/2026-03-03-the-productivity-shock-coming-to-academic-publishing/"&gt;flooding the academic publishing system&lt;/a&gt;.&lt;/p&gt;</description></item><item><title>What is a public opinion poll without the public?</title><link>https://muddy.jprs.me/links/2026-04-07-what-is-a-public-opinion-poll-without-the-public/</link><pubDate>Tue, 07 Apr 2026 18:27:00 -0400</pubDate><guid>https://muddy.jprs.me/links/2026-04-07-what-is-a-public-opinion-poll-without-the-public/</guid><description>&lt;p&gt;A few days ago, two professors (Leif Weatherby and Benjamin Recht) published an opinion piece in the &lt;em&gt;New York Times&lt;/em&gt; calling attention to Axios publishing a &lt;a href="https://www.axios.com/2026/03/19/olivia-walton-heartland-forward-maternal-health"&gt;story&lt;/a&gt; on maternal health using invented polling results:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;A recent Axios story on maternal health policy referred to “findings” that a majority of people trusted their doctors and nurses. On the surface, there’s nothing unusual about that. What wasn’t originally mentioned, however, was that these findings were made up.&lt;/p&gt;
&lt;p&gt;Clicking through the links revealed (as did a subsequent editor’s note and clarification by Axios) that the public opinion poll was a computer simulation run by the artificial intelligence start-up Aaru. No people were involved in the creation of these opinions.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;The piece goes on to argue that this so-called &amp;ldquo;silicon sampling&amp;rdquo; is seductive because good public opinion polling is expensive, hard to do, and still prone to bias. But this shortcut magnifies the problem of bias rather than solving it.&lt;/p&gt;
&lt;p&gt;I&amp;rsquo;ve read a little bit about this strategy of using LLM-generated survey participants in the context of social science research in a series of posts (mostly from Prof. Jessica Hullman) over on &lt;a href="https://statmodeling.stat.columbia.edu/"&gt;Andrew Gelman&amp;rsquo;s blog&lt;/a&gt;:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href="https://statmodeling.stat.columbia.edu/2025/12/19/validating-language-models-as-study-participants-how-its-being-done-why-it-fails-and-what-works-instead/"&gt;Validating language models as study participants: How it’s being done, why it fails, and what works instead&lt;/a&gt; (2025-12-19)&lt;/li&gt;
&lt;li&gt;&lt;a href="https://statmodeling.stat.columbia.edu/2025/08/26/survey-statistics-thomas-lumley-writes-about-interviewing-your-laptop/"&gt;Survey Statistics: Thomas Lumley writes about Interviewing your Laptop&lt;/a&gt; (2025-08-26)&lt;/li&gt;
&lt;li&gt;&lt;a href="https://statmodeling.stat.columbia.edu/2025/08/15/when-does-it-make-sense-to-talk-about-llms-having-beliefs/"&gt;When does it make sense to talk about LLMs having beliefs?&lt;/a&gt; (2025-08-15)&lt;/li&gt;
&lt;li&gt;&lt;a href="https://statmodeling.stat.columbia.edu/2025/06/12/better-and-worse-ways-to-mix-human-and-llm-responses-in-behavioral-research-but-you-still-have-to-figure-what-youre-measuring/"&gt;Better and worse ways to mix human and LLM responses in behavioral research (but you still have to figure what you’re measuring)&lt;/a&gt; (2025-06-12)&lt;/li&gt;
&lt;li&gt;&lt;a href="https://statmodeling.stat.columbia.edu/2025/05/29/llms-as-behavioral-study-participants/"&gt;LLMs as behavioral study participants&lt;/a&gt; (2025-05-29)&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Silicon sampling seems moderately interesting from a research perspective, but I can&amp;rsquo;t help but agree with the &lt;em&gt;New York Times&lt;/em&gt; opinion piece authors that this will be ruinous for the already waning trust in public opinion polling. If you didn&amp;rsquo;t bother to ask the public, then why should the public care what you &amp;ldquo;find&amp;rdquo;? I think there is probably a lot of utility in using LLM samples to aid in designing and validating surveys, though.&lt;/p&gt;</description></item><item><title>The definition of "agent"</title><link>https://muddy.jprs.me/notes/2026-04-04-the-definition-of-agent/</link><pubDate>Sat, 04 Apr 2026 23:59:00 -0400</pubDate><guid>https://muddy.jprs.me/notes/2026-04-04-the-definition-of-agent/</guid><description>&lt;p&gt;An interesting exchange between &lt;a href="https://x.com/gvanrossum/status/2039045160156426463"&gt;Guido van Rossum&lt;/a&gt; and &lt;a href="https://x.com/karpathy/status/2039054981719089202"&gt;Andrej Karpathy&lt;/a&gt; a few days ago on Twitter:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Guido van Rossum:&lt;/strong&gt;
I think I finally understand what an agent is. It&amp;rsquo;s a prompt (or several), skills, and tools. Did I get this right?&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Andrej Karpathy&lt;/strong&gt;:
LLM = CPU (data: tokens not bytes, dynamics: statistical and vague not deterministic and precise)
Agent = operating system kernel&lt;/p&gt;
&lt;/blockquote&gt;</description></item><item><title>The triumph of the data raccoons</title><link>https://muddy.jprs.me/notes/2026-04-03-the-triumph-of-the-data-raccoons/</link><pubDate>Fri, 03 Apr 2026 23:40:00 -0400</pubDate><guid>https://muddy.jprs.me/notes/2026-04-03-the-triumph-of-the-data-raccoons/</guid><description>&lt;p&gt;My PhD co-supervisor at the University of Toronto, Dr. David Fisman, liked to use the term &amp;ldquo;data raccoon&amp;rdquo; to describe the work of using messy, incomplete, hard-to-work-with data to do serious research. Or, as he described it in &lt;a href="https://www.ourcommons.ca/documentviewer/en/43-1/HESA/meeting-22/evidence#Int-10851517"&gt;testimony&lt;/a&gt; to the Canadian House of Commons in May 2020 (emphasis mine):&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;I&amp;rsquo;ll tell you, my group at University of Toronto call ourselves &amp;ldquo;data raccoons&amp;rdquo;, because we&amp;rsquo;ve sort of managed to thrive for about 15 years on &lt;strong&gt;data that most people regard as garbage&lt;/strong&gt;, so it&amp;rsquo;s sort of a bit of the normal state of affairs for us with public health data analysis.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;It&amp;rsquo;s an unmistakably Toronto metaphor—the city isn&amp;rsquo;t called the &lt;a href="https://www.npr.org/2018/09/16/647599627/theres-no-stopping-toronto-s-uber-raccoon"&gt;raccoon capital of the world&lt;/a&gt; for nothing!&lt;/p&gt;
&lt;p&gt;It occurred to me recently that data raccoons have basically taken over the world. The basis of the AI revolution is vast quantities of text dredged from the Internet, none of which was written for its final purpose of training the &lt;em&gt;deus ex machina&lt;/em&gt;. Arguably the most important dataset for training LLMs has been &lt;a href="https://en.wikipedia.org/wiki/Common_Crawl_Foundation"&gt;Common Crawl&lt;/a&gt;, a mostly uncurated snapshot of the Internet that has been running since 2007. According to a &lt;a href="https://www.mozillafoundation.org/en/research/library/generative-ai-training-data/common-crawl/"&gt;Mozilla report&lt;/a&gt; from 2024, Common Crawl was used in two thirds of LLMs developed in the formative period between 2019 and 2023, and the archive also comprised 80% of tokens in OpenAI&amp;rsquo;s GPT-3. Unsurprisingly, the Common Crawl Foundation has received &lt;a href="https://archive.is/NS9MI"&gt;financial support&lt;/a&gt; from AI companies in recent years, all the while being accused of abetting these same companies to train their models on paywalled articles.&lt;/p&gt;</description></item><item><title>Testing ZeroClaw, Part 2.5: ZeroClaw is alive!</title><link>https://muddy.jprs.me/notes/2026-04-01-testing-zeroclaw-part-2-5-zeroclaw-is-alive/</link><pubDate>Wed, 01 Apr 2026 23:59:00 -0400</pubDate><guid>https://muddy.jprs.me/notes/2026-04-01-testing-zeroclaw-part-2-5-zeroclaw-is-alive/</guid><description>&lt;p&gt;&lt;a href="https://muddy.jprs.me/notes/2026-03-31-testing-zeroclaw-part-2-zeroclaw-is-dead/"&gt;Yesterday&lt;/a&gt;, I wrote about how the &lt;a href="https://github.com/zeroclaw-labs/zeroclaw"&gt;ZeroClaw GitHub repository&lt;/a&gt; had been down for two days with little explanation. Earlier today, the project provided a little more information on &lt;a href="https://x.com/zeroclawlabs/status/2039419021494264055"&gt;Twitter&lt;/a&gt;:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;They flagged our org which is why we’re down. Code is safe and we’re still working, just waiting for @github&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;Since March 30 (the day after their repo started 404ing), the project has been promising a blog post to explain the situation. As of now, &lt;a href="https://www.zeroclawlabs.ai/blog"&gt;that post is available&lt;/a&gt;:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;Over the past few days, a maintainer used aggressive AI automation to review and merge PRs:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Merges went through that shouldn&amp;rsquo;t have.&lt;/li&gt;
&lt;li&gt;In the process of trying to undo the damage, the maintainer&amp;rsquo;s GitHub account was flagged, which triggered enforcement actions on the ZeroClaw org itself.&lt;/li&gt;
&lt;li&gt;That maintainer has been removed from the project.&lt;/li&gt;
&lt;/ul&gt;
&lt;/blockquote&gt;
&lt;p&gt;This sounds strikingly similar to the incident that &lt;a href="https://web.archive.org/web/20260331214457/https://t.me/ZeroClawLabs/11"&gt;occurred about a month ago&lt;/a&gt;, which I also mentioned in yesterday&amp;rsquo;s post:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;Earlier today, during routine maintenance, the visibility of the ‎`zeroclaw-labs/zeroclaw` repository was accidentally changed from public to private and was later restored to public.&lt;/p&gt;
&lt;p&gt;After reviewing the GitHub API audit logs and collecting detailed feedback from our engineers, we confirmed that the incident was caused by improper use of an AI agent tool during maintenance.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;Obviously, the use of agentic workflows in open source projects is an emerging field where best practices have not yet been established. The case of ZeroClaw should be a warning to other projects to keep human review in the loop, or at least to limit the autonomy of agents when a project has numerous contributors. As they say in their blog post:&lt;/p&gt;</description></item><item><title>Testing ZeroClaw, Part 2: ZeroClaw is dead?</title><link>https://muddy.jprs.me/notes/2026-03-31-testing-zeroclaw-part-2-zeroclaw-is-dead/</link><pubDate>Tue, 31 Mar 2026 20:57:00 -0400</pubDate><guid>https://muddy.jprs.me/notes/2026-03-31-testing-zeroclaw-part-2-zeroclaw-is-dead/</guid><description>&lt;p&gt;&lt;a href="https://muddy.jprs.me/notes/2026-03-02-testing-zeroclaw-part-1-setup/"&gt;Earlier this month&lt;/a&gt;, I wrote about setting up one of the many lightweight OpenClaw alternatives, namely ZeroClaw. I had some issues with initial setup, but I got to the point where I could talk with my bot over Telegram.&lt;/p&gt;
&lt;p&gt;Some of my initial enthusiasm for ZeroClaw was dampened by the divergence between the docs and the features available in the release build. The release build was quite out of date due to the breakneck pace of development. In the week or two following my initial setup, the release build pipeline was broken, so even when they released a new tag, there were no new precompiled binaries available. Being forced to compile the Rust binary yourself kind of goes against the project&amp;rsquo;s philosophy of ultra-low resource consumption.&lt;/p&gt;
&lt;p&gt;They eventually fixed the release pipeline and I started casually working on a system where I could send notes and ideas for blog posts to my bot through Telegram and have it turn them into structured Markdown files.&lt;/p&gt;
&lt;p&gt;But two days ago (March 29), I noticed that the ZeroClaw &lt;a href="https://github.com/zeroclaw-labs/zeroclaw"&gt;GitHub repo&lt;/a&gt; was 404ing. On the same day, the project posted the following on &lt;a href="https://x.com/zeroclawlabs/status/2038407120312299524"&gt;Twitter&lt;/a&gt;:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;Our GitHub repo is currently returning a 404 for some users. We&amp;rsquo;re aware and actively investigating. The repo is public and all code is safe.&lt;/p&gt;</description></item><item><title>How to avoid cognitive surrender to AI</title><link>https://muddy.jprs.me/links/2026-03-29-how-to-avoid-cognitive-surrender-to-ai/</link><pubDate>Sun, 29 Mar 2026 22:59:00 -0400</pubDate><guid>https://muddy.jprs.me/links/2026-03-29-how-to-avoid-cognitive-surrender-to-ai/</guid><description>&lt;p&gt;I am sharing a thoughtful article today from Alex Panetta&amp;rsquo;s &lt;em&gt;A.I. For You&lt;/em&gt; on avoiding over-reliance on AI: &amp;ldquo;cognitive debt&amp;rdquo;, &amp;ldquo;epistemic debt&amp;rdquo;, or &amp;ldquo;cognitive surrender&amp;rdquo;.&lt;/p&gt;
&lt;p&gt;A particularly interesting nugget regarding the &amp;ldquo;&lt;a href="https://www.brainonllm.com/"&gt;Your Brain on ChatGPT&lt;/a&gt;&amp;rdquo; article from the MIT Media Lab (yes, &lt;a href="https://en.wikipedia.org/wiki/MIT_Media_Lab#Connections_to_Jeffrey_Epstein"&gt;that&lt;/a&gt; MIT Media Lab):&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;The paper is even written to get LLMs to read it carefully. The paper carries instructions telling LLMs which section to read first, which appears to be a clever way to force relevant context atop the context window, as LLMs tend to best remember the beginning and end of conversations — not the middle.&lt;/p&gt;
&lt;/blockquote&gt;</description></item><item><title>Will AI help Canadian police counter a tsunami of fraud?</title><link>https://muddy.jprs.me/links/2026-03-24-will-ai-help-canadian-police-counter-a-tsunami-of-fraud/</link><pubDate>Tue, 24 Mar 2026 21:48:00 -0400</pubDate><guid>https://muddy.jprs.me/links/2026-03-24-will-ai-help-canadian-police-counter-a-tsunami-of-fraud/</guid><description>&lt;p&gt;Zak Vescera, writing for the &lt;em&gt;Investigative Journalism Foundation&lt;/em&gt;, observes that fraud cases reported to Canadian police have more than doubled between 2013 and 2024:&lt;/p&gt;
&lt;p&gt;&lt;img src="https://muddy.jprs.me/media/20260324-212544.png" alt="Line chart titled “A tsunami of fraud.” From 2013 to 2024, reported fraud cases in Canada rise sharply from about 80,000 to nearly 180,000, with the steepest increase after 2022. Over the same period, cases cleared by police decline from about 25,000 to under 20,000. The widening gap highlights that fraud reports are surging while police are resolving fewer cases. Source: Statistics Canada."&gt;&lt;/p&gt;
&lt;p&gt;At the same time, the number of cases cleared by Canadian police has fallen. In 2013, the ratio between reported cases and cleared cases was about 3:1; by 2024, this ratio was over 9.5:1.&lt;/p&gt;
&lt;p&gt;The vast majority of fraud cases go unsolved. This is unsurprising given that many are perpetrated over the Internet by individuals overseas and involve methods of sending money that are difficult to recover, such as crypto, gift cards, and physical transfers of cash.&lt;/p&gt;
&lt;p&gt;In response, the National Cybercrime Coordination Centre (NC3) of the RCMP—Canada&amp;rsquo;s national police service—has built a case management system and data portal they hope will eventually be adopted by all Canadian police forces. According to the article, this system is aimed at improving coordination, data sharing, and analysis. The platform will also host a set of AI tools, though the RCMP is vague on the details and on which tools are currently implemented. The article gives a few examples: OCR allowing victims to scan gift cards used in fraud rather than typing numbers manually, a tool to classify reports to help police target their investigative resources, and a report generator to simplify data sharing when investigations go international.&lt;/p&gt;</description></item><item><title>Using Claude Code for cross-package statistical audits</title><link>https://muddy.jprs.me/links/2026-03-15-using-claude-claude-for-cross-package-statistical-audits/</link><pubDate>Sun, 15 Mar 2026 22:49:00 -0400</pubDate><guid>https://muddy.jprs.me/links/2026-03-15-using-claude-claude-for-cross-package-statistical-audits/</guid><description>&lt;p&gt;Economist Scott Cunningham shared an important example of why we should always report the statistical package and version used in our analyses, as he used Claude Code to produce six versions of the exact same analysis using six different packages in R, Python, and Stata. In a &lt;a href="https://en.wikipedia.org/wiki/Difference_in_differences"&gt;difference-in-differences&lt;/a&gt; analysis of the effect of mental health hospital closures on homicide using the standard &lt;a href="https://bcallaway11.github.io/did/articles/multi-period-did.html"&gt;Callaway and Sant’Anna estimator&lt;/a&gt; (for DiD with multiple time periods), he got very different results for some model specifications.&lt;/p&gt;
&lt;p&gt;Since the specifications and the data were identical between packages, he discovered that the divergences occurred because of how the packages handled problems with &lt;a href="https://www.tandfonline.com/doi/full/10.1080/00273171.2011.568786#d1e368"&gt;propensity score&lt;/a&gt; weights. The packages were not necessarily transparent about issues with these weights. If you were not running multiple analyses and comparing results across packages, or else carefully checking propensity score diagnostics, you might never have realized how precarious your results were.&lt;/p&gt;
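&lt;p&gt;The auditing pattern itself is generic and easy to sketch. Here&amp;rsquo;s a minimal Python harness; the wrapper script names are hypothetical, and each is assumed to run the identical Callaway and Sant&amp;rsquo;Anna specification in one package and print its headline numbers as JSON:&lt;/p&gt;
&lt;pre&gt;&lt;code class="language-python"&gt;import json
import subprocess

# Hypothetical wrapper scripts: each runs the SAME specification on the SAME
# data in one package and prints {"att": ..., "se": ..., "n": ...} as JSON.
implementations = {
    "R/did": ["Rscript", "cs_did.R"],
    "Stata/csdid": ["sh", "run_csdid_stata.sh"],
    "Python": ["python", "cs_did.py"],
}

def run(cmd):
    out = subprocess.run(cmd, capture_output=True, text=True, check=True)
    return json.loads(out.stdout)

results = {name: run(cmd) for name, cmd in implementations.items()}

# Audit: identical data and specification should give (near-)identical answers.
baseline = results["R/did"]
for name, res in results.items():
    if res["n"] != baseline["n"]:
        print(name, "sample size mismatch:", res["n"], "vs", baseline["n"])
    if abs(res["att"] - baseline["att"]) &amp;gt; 0.1 * baseline["se"]:
        print(name, "ATT diverges:", res["att"], "vs", baseline["att"])
&lt;/code&gt;&lt;/pre&gt;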
&lt;p&gt;Prof. Cunningham closes with the following advice:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;The fifth point, and the broader point, is that this kind of cross-package, cross-language audit is exactly what Claude Code should be used for. Why? Because this is a task that is time-intensive, high-value, and brutally easy to get wrong. But just one mismatched diagnostic across languages invalidates the entire comparison, even something as simple as sample size values differing across specifications, would flag it. This is both easy and not easy — but it is not the work humans should be doing by hand given how easy it would be to even get that much wrong.&lt;/p&gt;</description></item><item><title>The other half of the ATM–bank teller story</title><link>https://muddy.jprs.me/links/2026-03-11-the-other-half-of-the-atm-bank-teller-story/</link><pubDate>Wed, 11 Mar 2026 23:49:00 -0400</pubDate><guid>https://muddy.jprs.me/links/2026-03-11-the-other-half-of-the-atm-bank-teller-story/</guid><description>&lt;p&gt;David Oks had a great post yesterday on the classic parable of how the adoption of ATMs did not lead to the predicted job losses among bank tellers. In fact, the opposite occurred: the number of bank tellers rose. I heard this story recounted several times in early discussions I had about the anticipated effect of AI on labour. I think I first heard it from &lt;a href="https://fee.org/articles/how-ai-and-looser-labor-laws-will-create-jobs/"&gt;Ryan Khurana&lt;/a&gt;. More recently it has been trotted out by US Vice President JD Vance.&lt;/p&gt;
&lt;p&gt;The problem with this story is that the key statistic quoted alongside it, namely that there are more bank tellers than ever before, is no longer true. The famous graph supporting this assertion stops in 2010, and with good reason: the number of bank tellers has sharply fallen since then.&lt;/p&gt;
&lt;p&gt;I think I had come across this fact before, this second half of the famous ATM–bank teller story, but it wasn&amp;rsquo;t until I read David Oks&amp;rsquo;s post that I understood the reason behind it. Quite simply, mobile banking ate physical banks. The ATM didn&amp;rsquo;t reduce the demand for bank tellers; it simply changed the kind of labour they did inside the bank. The iPhone made it so we didn&amp;rsquo;t need to go to the bank at all. It changed the paradigm. Explained this way, it seems obvious. Many new banks (including my own) do not have physical locations and never did.&lt;/p&gt;</description></item><item><title>Editors hate this one weird trick</title><link>https://muddy.jprs.me/notes/2026-03-05-editors-hate-this-one-weird-trick/</link><pubDate>Thu, 05 Mar 2026 20:05:00 -0500</pubDate><guid>https://muddy.jprs.me/notes/2026-03-05-editors-hate-this-one-weird-trick/</guid><description>&lt;p&gt;Given my &lt;a href="https://muddy.jprs.me/links/2026-03-03-the-productivity-shock-coming-to-academic-publishing/"&gt;recent&lt;/a&gt; &lt;a href="https://muddy.jprs.me/notes/2026-02-26-these-academic-journal-ai-policies-aren-t-going-to-last/"&gt;posts&lt;/a&gt; on AI in academic publishing, I just wanted to share this joke from Prof. Arthur Spirling on &lt;a href="https://x.com/arthur_spirling/status/2029006543765520471"&gt;Twitter&lt;/a&gt;:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;Actually you cant run my paper through Claude to desk reject it because Claude is a regular coauthor of mine. Conflict of interest. Checkmate, editors&lt;/p&gt;
&lt;/blockquote&gt;</description></item><item><title>The productivity shock coming to academic publishing</title><link>https://muddy.jprs.me/links/2026-03-03-the-productivity-shock-coming-to-academic-publishing/</link><pubDate>Tue, 03 Mar 2026 19:33:00 -0500</pubDate><guid>https://muddy.jprs.me/links/2026-03-03-the-productivity-shock-coming-to-academic-publishing/</guid><description>&lt;p&gt;Today, I wanted to share this piece from economist Scott Cunningham (Baylor University), who wrote about how AI is widening the gap between research and publishing. Or, in economics terms (emphasis mine):&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;But what happens when the same &lt;strong&gt;productivity shock&lt;/strong&gt; hits a system where the bottleneck was never really production in the first place, but rather was a hierarchical journal structure that depended immensely on editor time, skill, discretion and voluntary workers with the same talents called referees for screening quality deemed sufficient for publication?&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;The post mentions the Autonomous Policy Evaluation project—the end-to-end AI paper pipeline I &lt;a href="https://muddy.jprs.me/links/2026-02-12-an-end-to-end-ai-pipeline-for-policy-evaluation-papers/"&gt;wrote about a few weeks ago&lt;/a&gt;—and discusses the likely consequences of this flood of AI-generated papers. Assuming the number of publication slots in reputable journals is relatively fixed, AI-generated papers should add a very large amount of mass to the left side of the paper quality distribution. Acceptance rates will plummet and journals may rely on other signals of quality (name recognition, pedigree, institution) to thin the herd before actually reviewing content. As always, the rich get richer!&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;But this is imperfect, not to mention unfair, and so desk rejection gets noisier: good papers get killed by tired editors and marginally lower quality papers slip through to referees. It’s a cascading failure: volume breaks editors, broken editing wastes referees, wasted referees slow science.&lt;/p&gt;</description></item><item><title>Testing ZeroClaw, Part 1: Setup</title><link>https://muddy.jprs.me/notes/2026-03-02-testing-zeroclaw-part-1-setup/</link><pubDate>Mon, 02 Mar 2026 19:15:00 -0500</pubDate><guid>https://muddy.jprs.me/notes/2026-03-02-testing-zeroclaw-part-1-setup/</guid><description>&lt;p&gt;As mentioned &lt;a href="https://muddy.jprs.me/links/2026-02-24-comparing-the-claw-like-agent-ecosystem/"&gt;last week&lt;/a&gt;, I&amp;rsquo;ve been meaning to test out a personal agent from the Claw-like ecosystem. I settled on &lt;a href="https://github.com/zeroclaw-labs/zeroclaw"&gt;ZeroClaw&lt;/a&gt;, a popular and lightweight OpenClaw alternative that should run well on my Raspberry Pi 4 4GB.&lt;/p&gt;
&lt;p&gt;I wanted to harden my setup as much as possible and opted to run everything in Docker. I started with the &lt;a href="https://github.com/zeroclaw-labs/zeroclaw/blob/main/docker-compose.yml"&gt;official Docker compose file&lt;/a&gt; and added my OpenRouter key. I brought up the pre-built container image and tried sending the basic &amp;ldquo;Hello&amp;rdquo; message to the agent using the CLI. However, I got an error because the automatically generated config file defaulted to a version of Claude Sonnet 4 that wasn&amp;rsquo;t available on OpenRouter. I switched to &lt;code&gt;claude-sonnet-4.6&lt;/code&gt; and then &lt;code&gt;gpt-oss-20b&lt;/code&gt; (for &lt;em&gt;much&lt;/em&gt; cheaper testing).&lt;/p&gt;
&lt;p&gt;The ZeroClaw web gateway was a bit of a mess. Of the features I tried, only memory management and the basic status dashboard worked. Trying to talk to the agent through the web interface would give me a black screen (&lt;a href="https://www.reddit.com/r/ZeroClaw/comments/1rc525q/zeroclaw_web_issues/"&gt;here&amp;rsquo;s someone complaining about the same error&lt;/a&gt;). I&amp;rsquo;m still being charged for the tokens, though! The cost tracker always displayed zero, even as I sent CLI and Telegram messages (more on that soon). The configuration editor gave me an error and so did the diagnostics tool.&lt;/p&gt;
&lt;p&gt;The project docs/wiki were helpful for figuring things out, but development is running so far ahead of releases that a bunch of the features referred to aren&amp;rsquo;t available in the current stable version (v0.1.7, from last week). This includes getting and setting specific config options from the CLI and resetting the gateway pairing token. To use these features, you have to compile it yourself.&lt;/p&gt;</description></item><item><title>Some examples of just-build-things-ism</title><link>https://muddy.jprs.me/notes/2026-03-01-some-examples-of-just-build-things-ism/</link><pubDate>Sun, 01 Mar 2026 11:58:00 -0500</pubDate><guid>https://muddy.jprs.me/notes/2026-03-01-some-examples-of-just-build-things-ism/</guid><description>&lt;p&gt;The best mantra to come out of the AI era is: &amp;ldquo;You can just build things&amp;rdquo;. (So good OpenAI ripped it off for their &lt;a href="https://www.theverge.com/ai-artificial-intelligence/875550/openais-super-bowl-ad-claims-you-can-just-build-things-with-codex"&gt;Super Bowl ad&lt;/a&gt;.)&lt;/p&gt;
&lt;p&gt;I&amp;rsquo;ve been pretty inspired to see how many people are now building all kinds of incredible tools thanks to advances in AI coding agents, even if they have no previous background in coding (see &lt;a href="https://muddy.jprs.me/links/2026-02-22-oral-texts/"&gt;my post on Havelock.AI from a few days ago&lt;/a&gt;).&lt;/p&gt;
&lt;p&gt;Here are a few more examples I&amp;rsquo;ve been following:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Canadian journalist Alex Panetta writes about his AI-augmented workflow at &lt;a href="https://alexpanetta.substack.com/"&gt;&lt;em&gt;A.I. For You&lt;/em&gt;&lt;/a&gt;. I first came across his work with his debut article &amp;ldquo;&lt;a href="https://alexpanetta.substack.com/p/i-killed-my-doomscrolling-habit-with"&gt;I killed my doomscrolling habit with AI. You can too&lt;/a&gt;&amp;rdquo;. In it, he explains how to vibe code an automated, personalized daily news digest. I&amp;rsquo;ve tried to build something for myself but I haven&amp;rsquo;t gotten it quite right yet. A great follow for big news consumers.&lt;/li&gt;
&lt;li&gt;Economics professor Scott Cunningham, author of the great textbook &lt;a href="https://mixtape.scunning.com/"&gt;&lt;em&gt;Causal Inference: The Mixtape&lt;/em&gt;&lt;/a&gt;, has a &lt;a href="https://x.com/causalinf/status/2026760090947145830"&gt;presentation explaining&lt;/a&gt; how to encourage AI adoption among academic faculty. This starts with faculty &lt;em&gt;experiencing&lt;/em&gt; a killer use case for AI, which he suggests is building slide decks. He shares his tools/agent skills for this use case and more on &lt;a href="https://github.com/scunning1975/MixtapeTools"&gt;GitHub&lt;/a&gt;.&lt;/li&gt;
&lt;li&gt;Another economist, Chris Blattman, built a website to share the &lt;a href="https://x.com/cblatts/status/2027018464670491065"&gt;productivity tools&lt;/a&gt; he developed with Claude Code. He provides a tutorial and code on &lt;a href="https://claudeblattman.com/"&gt;Claude Blattman&lt;/a&gt;.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;And of course, &lt;a href="https://tools.simonwillison.net/"&gt;Simon Willison&lt;/a&gt; has been building and sharing tools habitually for years now.&lt;/p&gt;</description></item><item><title>These academic journal AI policies aren't going to last</title><link>https://muddy.jprs.me/notes/2026-02-26-these-academic-journal-ai-policies-aren-t-going-to-last/</link><pubDate>Thu, 26 Feb 2026 16:51:00 -0500</pubDate><guid>https://muddy.jprs.me/notes/2026-02-26-these-academic-journal-ai-policies-aren-t-going-to-last/</guid><description>&lt;p&gt;I recently came across the following policy on the &lt;a href="https://spectrumjournal.ca/index.php/spectrum/about/submissions"&gt;submission page&lt;/a&gt; of an academic journal:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Use of Artificial Intelligence (AI) tools&lt;/strong&gt;: One of the goals of &lt;em&gt;Spectrum&lt;/em&gt; is to stimulate critical thinking and skill development among authors and reviewers alike. &lt;em&gt;Spectrum&lt;/em&gt; discourages the submission of content generated by artificial intelligence (AI)-assisted technologies (such as chatGPT and similar tools). This includes tools that generate text, data, images, figures, or other materials, as well as tools that are used to summarize and synthesize sources. Authors should be aware that such tools are vulnerable to factual inaccuracies, biases, and logical fallacies, and may pose risks to privacy, confidentiality, and copyright.&lt;/p&gt;
&lt;p&gt;If authors choose to submit work created with the assistance of AI tools, such use &lt;strong&gt;must be disclosed&lt;/strong&gt; and described in the submission. The disclosure must include: 1) what system was used, 2) who used it, 3) the time/date of the use, 4) the prompt(s) used to generate the content, and 5) the content in the submission that resulted from use of AI tools. The output from the AI system should also be submitted as supplementary material. Authors must accept full responsibility for the accuracy and integrity of the submission. AI systems do not meet the criteria for authorship, and should not be listed as a co-author.&lt;/p&gt;</description></item><item><title>Agentic engineering patterns</title><link>https://muddy.jprs.me/links/2026-02-25-agentic-engineering-patterns/</link><pubDate>Wed, 25 Feb 2026 16:15:00 -0500</pubDate><guid>https://muddy.jprs.me/links/2026-02-25-agentic-engineering-patterns/</guid><description>&lt;p&gt;Simon Willison is building a library of posts covering best practices for using agentic coding tools like Claude Code and OpenAI&amp;rsquo;s Codex. The existing articles cover test-driven development (red/green—ensure tests fail before the change and succeed after it) and AI-assisted code walkthroughs.&lt;/p&gt;</description></item><item><title>Comparing the Claw-like agent ecosystem</title><link>https://muddy.jprs.me/links/2026-02-24-comparing-the-claw-like-agent-ecosystem/</link><pubDate>Tue, 24 Feb 2026 22:44:00 -0500</pubDate><guid>https://muddy.jprs.me/links/2026-02-24-comparing-the-claw-like-agent-ecosystem/</guid><description>&lt;p&gt;Chrys Bader has created ClawCharts to track the popularity and growth of OpenClaw and its growing number of competitors.&lt;/p&gt;
&lt;p&gt;I have an unused Raspberry Pi 4 4GB that I&amp;rsquo;ve been meaning to test one of these Claw-like personal agents on (locked down to prevent the security nightmare scenarios we&amp;rsquo;ve seen play out since OpenClaw took off).&lt;/p&gt;
&lt;p&gt;OpenClaw is a bit of a resource hog (which is why so many people are running out to buy Mac Minis), so I&amp;rsquo;ve been looking at the list of lightweight competitors. There is no obvious reason to prefer one over the other, so I&amp;rsquo;ll probably go with the fast-growing &lt;a href="https://github.com/zeroclaw-labs/zeroclaw"&gt;ZeroClaw&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;ZeroClaw offers OAuth connectors for OpenAI and Anthropic subscription plans, but presently neither company is clear on whether this usage is permissible or not. Anthropic recently blew up the OpenClaw community by &lt;a href="https://web.archive.org/web/20260221222303/https://code.claude.com/docs/en/legal-and-compliance#authentication-and-credential-use"&gt;updating their docs&lt;/a&gt; to specifically ban using OAuth outside of Claude Code. An Anthropic employee &lt;a href="https://thenewstack.io/anthropic-agent-sdk-confusion/"&gt;partially walked this back on Twitter&lt;/a&gt;, but there is still no clear statement whether this use case is permitted. Regarding the use of OAuth from OpenAI for OpenClaw (specifically, GPT Codex), Peter Steinberger, creator of OpenClaw, &lt;a href="https://x.com/steipete/status/2024182608746217953"&gt;stated on Twitter&lt;/a&gt;: &amp;ldquo;that already works, OAI publicly said that&amp;rdquo;. No one can seem to find this public statement, but it&amp;rsquo;s worth noting that Steinberger himself is now an OpenAI employee. So, will you get banned for using your ChatGPT Plus/Pro or Claude Pro/Max subscriptions with OpenClaw? Nobody knows.&lt;/p&gt;</description></item><item><title>LLMs automate the erosion of online anonymity</title><link>https://muddy.jprs.me/links/2026-02-23-llms-automate-the-erosion-of-online-anonymity/</link><pubDate>Mon, 23 Feb 2026 22:37:00 -0500</pubDate><guid>https://muddy.jprs.me/links/2026-02-23-llms-automate-the-erosion-of-online-anonymity/</guid><description>&lt;p&gt;Economist Florian Ederer &lt;a href="https://x.com/florianederer/status/2025978347809915056"&gt;linked a new preprint&lt;/a&gt; describing the creation of an automated LLM-based pipeline for linking anonymous users across datasets based on unstructured text written by or about them. Prof Ederer is himself famous for &lt;a href="https://florianederer.github.io/ejmr.pdf"&gt;unmasking the IP addresses of users&lt;/a&gt; of the infamous (but influential) Economics Job Market Rumors message board, exploiting a flaw in how usernames were assigned to anonymous posters. For platforms not encoding a user&amp;rsquo;s IP address in their &amp;ldquo;anonymous&amp;rdquo; username, the LLM-based approach involves:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Extracting structured features from free text&lt;/li&gt;
&lt;li&gt;Encoding extracted features to embeddings to compare to candidate profiles&lt;/li&gt;
&lt;li&gt;Reasoning using all available context to identify the most likely match among top candidates&lt;/li&gt;
&lt;li&gt;Calibrating match quality by asking the LLM to report its confidence (the embedding step is sketched after this list)&lt;/li&gt;
&lt;/ul&gt;
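&lt;p&gt;The embedding-and-ranking step is easy to sketch in Python. Here the &lt;code&gt;embed()&lt;/code&gt; function is a deliberately crude stand-in (a hashed bag of words) for whatever embedding model the pipeline actually uses:&lt;/p&gt;
&lt;pre&gt;&lt;code class="language-python"&gt;import numpy as np

def embed(text, dim=256):
    """Toy stand-in for a real text-embedding model: hashed bag of words."""
    v = np.zeros(dim)
    for tok in text.lower().split():
        v[hash(tok) % dim] += 1.0
    norm = np.linalg.norm(v)
    return v / norm if norm else v

def top_candidates(extracted_features, profiles, k=5):
    """Rank candidate profiles by cosine similarity to the extracted features."""
    q = embed(extracted_features)
    scores = {user: float(q @ embed(text)) for user, text in profiles.items()}
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)[:k]

profiles = {
    "user_a": "Assistant professor, labor economics, job market 2019, Boston",
    "user_b": "Macro PhD student in Europe, hobbyist cyclist",
}
print(top_candidates("labor economist in Boston, graduated around 2019", profiles))
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The short list this produces is what would get handed to the LLM for the final reasoning and confidence-reporting steps.&lt;/p&gt;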
&lt;p&gt;I guess it&amp;rsquo;s only a matter of time before someone uses this strategy to unmask Reviewer 2. (Currently this is only possible if Reviewer 2 insists you cite all of the work of the brilliant Dr. X.)&lt;/p&gt;</description></item><item><title>Oral texts</title><link>https://muddy.jprs.me/links/2026-02-22-oral-texts/</link><pubDate>Sun, 22 Feb 2026 13:18:00 -0500</pubDate><guid>https://muddy.jprs.me/links/2026-02-22-oral-texts/</guid><description>&lt;p&gt;A major intellectual current in the post-social media age is the rediscovery of media theorists like Marshall McLuhan, Walter Ong, and Neil Postman, whose works seem incredibly prescient in the age of the Internet and the instantaneous and omnipresent mass communication it enables.&lt;/p&gt;
&lt;p&gt;A particular sub-current of this trend is the &lt;a href="https://lindynewsletter.beehiiv.com/p/the-return-of-oral-culture"&gt;return to orality&lt;/a&gt;, a culture rooted in the spoken rather than the written word. Indeed, the vast majority of human history is defined by oral culture, and the world&amp;rsquo;s brief sojourn in the written tradition may have finally ended thanks to the Internet.&lt;/p&gt;
&lt;p&gt;One of the most impressive projects to come out of this domain is &lt;a href="https://havelock.ai/"&gt;Havelock.AI&lt;/a&gt;, a tool created by journalist Joe Weisenthal and entirely vibe coded with Claude. The tool analyzes text to give an &amp;ldquo;orality score&amp;rdquo; with supporting analysis. For example, qualified assertions are considered literate, whereas categorical statements are considered oral. The tool defines &lt;a href="https://havelock.ai/methodology"&gt;68 oral/literate markers&lt;/a&gt; based on the framework of Walter Ong. It really is an impressive tool that I recommend checking out.&lt;/p&gt;
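&lt;p&gt;To give a flavour of marker-based scoring, here&amp;rsquo;s a crude Python sketch. The markers below are invented examples in the spirit of Ong&amp;rsquo;s framework, not Havelock.AI&amp;rsquo;s actual 68:&lt;/p&gt;
&lt;pre&gt;&lt;code class="language-python"&gt;import re

# Invented example markers, loosely in the spirit of Ong's framework.
ORAL = [r"\b(always|never|everyone|nobody)\b",        # categorical statements
        r"\b(you know|folks|let me tell you)\b"]      # direct audience address
LITERATE = [r"\b(arguably|perhaps|tends to)\b",       # epistemic hedges
            r"\b(however|moreover|nevertheless)\b"]   # discursive connectives

def orality_score(text):
    """Share of matched markers that are oral: 0 = literate, 1 = oral."""
    oral = sum(len(re.findall(p, text, re.I)) for p in ORAL)
    lit = sum(len(re.findall(p, text, re.I)) for p in LITERATE)
    return oral / (oral + lit) if oral + lit else 0.5

print(orality_score("Everyone knows this, folks. Nobody disagrees."))  # oral
print(orality_score("This is arguably true; however, it depends."))    # literate
&lt;/code&gt;&lt;/pre&gt;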
&lt;p&gt;I plugged a few of my old articles into the tool and apparently my writing is very much rooted in the written tradition! (This post also scores as strongly literate.)&lt;/p&gt;
&lt;p&gt;&lt;img src="https://muddy.jprs.me/media/havelock-ai-2025-02-22.png" alt="Output from Havelock.AI for this post, referencing the use of a technical term, an epistemic hedge, and an institutional subject as markers of the written tradition"&gt;&lt;/p&gt;</description></item><item><title>Democratizing voice cloning scams</title><link>https://muddy.jprs.me/links/2026-02-18-democratizing-voice-cloning-scams/</link><pubDate>Wed, 18 Feb 2026 22:26:00 -0500</pubDate><guid>https://muddy.jprs.me/links/2026-02-18-democratizing-voice-cloning-scams/</guid><description>&lt;p&gt;Jamie Pine has launched Voicebox, a new voice cloning studio built upon the open weight &lt;a href="https://github.com/QwenLM/Qwen3-TTS"&gt;Qwen3-TTS&lt;/a&gt; model. The project is positioned as a free, local alternative to the well-known ElevenLabs voice generator. A &lt;a href="https://voicebox.sh/"&gt;short demo video&lt;/a&gt; is available.&lt;/p&gt;
&lt;p&gt;Obviously, there are legitimate uses for voice cloning technology. But in practice, this will be used to enable AI impersonation scams and spam on a massive scale. The GitHub page for this release isn&amp;rsquo;t exactly encouraging on this front. &lt;a href="https://github.com/jamiepine/voicebox/blob/eb2cd861b19baa16720fd31747071c187c054bc5/README.md"&gt;Demo screenshots&lt;/a&gt; show voice clones of YouTuber Linus Tech Tips, Minecraft creator Markus &amp;ldquo;Notch&amp;rdquo; Persson, and deceased streamer twomad.&lt;/p&gt;
&lt;p&gt;Make sure you have a secret passphrase set up with your family, since your voice is no longer uniquely your own.&lt;/p&gt;</description></item><item><title>Don't let AI do your thinking for you</title><link>https://muddy.jprs.me/links/2026-02-17-don-t-let-ai-do-your-thinking-for-you/</link><pubDate>Tue, 17 Feb 2026 21:11:00 -0500</pubDate><guid>https://muddy.jprs.me/links/2026-02-17-don-t-let-ai-do-your-thinking-for-you/</guid><description>&lt;p&gt;Here&amp;rsquo;s a thought-provoking article from Harry Law on &amp;ldquo;The last temptation of Claude&amp;rdquo;—the urge to outsource all of your thinking to AI (and remember, writing is thinking).&lt;/p&gt;
&lt;p&gt;A common theme in the AI commentary I&amp;rsquo;ve been reading lately is the growing importance of taste. AI is sending the cost of creating &amp;ldquo;content&amp;rdquo; (articles, analyses, video, etc.) to zero, even as the attention to consume it all remains fixed. If we want to keep living in a world where AI serves us, we need—more than ever—the discernment to choose the questions worth asking.&lt;/p&gt;
&lt;p&gt;As I put it in my Globe and Mail op-ed on AI and journalism a few years ago:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;AI won’t replace the sort of journalism that holds power accountable, but it could certainly enhance it. After all, you can teach a machine to spot patterns, but you can’t force it to care about your community.&lt;/p&gt;
&lt;/blockquote&gt;</description></item><item><title>In the multiverse of forking paths</title><link>https://muddy.jprs.me/links/2026-02-16-in-the-multiverse-of-forking-paths/</link><pubDate>Mon, 16 Feb 2026 22:49:00 -0500</pubDate><guid>https://muddy.jprs.me/links/2026-02-16-in-the-multiverse-of-forking-paths/</guid><description>&lt;p&gt;&lt;img src="https://muddy.jprs.me/media/stark-strange-one.jpg" alt="A scene from Avengers: Infinity War, where Tony Stark asks Dr. Strange, “How many did we win?” Strange replies, “One.”"&gt;&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;STRANGE: I went forward in time to view alternate modelling decisions, to see all the possible outcomes of the coming analysis.&lt;br&gt;
STAR-LORD: How many did you see?&lt;br&gt;
STRANGE: 14,000,605.&lt;br&gt;
STARK: In how many did we achieve statistical significance?&lt;br&gt;
STRANGE: One.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;Prof. Jessica Hullman recently wrote a piece on Andrew Gelman&amp;rsquo;s blog discussing the use of &amp;lsquo;multiverse analysis&amp;rsquo;, i.e., seeing the results of the many slightly different decisions we could have made when constructing a model. This problem is commonly known as the &lt;a href="https://en.wikipedia.org/wiki/Forking_paths_problem"&gt;garden of forking paths&lt;/a&gt;—during an analysis, a researcher is forced to make many small, sometimes arbitrary decisions that can lead to a different result if another researcher tries to independently replicate the analysis. While usually an innocent and inevitable part of the modelling process, these &amp;lsquo;researcher degrees of freedom&amp;rsquo; can also be manipulated to produce a desired result.&lt;/p&gt;
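&lt;p&gt;A toy multiverse is easy to simulate. The Python sketch below (mine, not from the post) fits the same regression on pure noise under a small grid of defensible-looking choices and counts how many &amp;ldquo;universes&amp;rdquo; cross the significance threshold:&lt;/p&gt;
&lt;pre&gt;&lt;code class="language-python"&gt;import numpy as np
from itertools import product
from scipy import stats

rng = np.random.default_rng(1)
n = 200
x = rng.normal(size=n)
y = rng.normal(size=n)              # pure noise: no true effect anywhere
age = rng.uniform(18, 80, size=n)

hits, total = 0, 0
# Forking paths: outlier rule x subgroup choice, each individually defensible.
for clip, subgroup in product([2.0, 2.5, 3.0], ["all", "young", "old"]):
    keep = np.abs(x) &amp;lt;= clip
    if subgroup == "young":
        keep = np.logical_and(keep, age &amp;lt; 40)
    elif subgroup == "old":
        keep = np.logical_and(keep, age &amp;gt;= 40)
    p = stats.linregress(x[keep], y[keep]).pvalue
    total += 1
    if p &amp;lt; 0.05:
        hits += 1

print(hits, "of", total, "universes reached p &amp;lt; 0.05")
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Even with no true effect anywhere, a persistent analyst will often find at least one universe that &amp;ldquo;works&amp;rdquo;.&lt;/p&gt;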
&lt;p&gt;Prof. Hullman points out that multiverse analysis will only become more salient as AI coding tools such as Claude Code make it easier than ever to iterate on how we model our research questions.&lt;/p&gt;
&lt;p&gt;Her longer paper with Julia M. Rohrer and Andrew Gelman, &amp;ldquo;What&amp;rsquo;s a multiverse good for anyway?&amp;rdquo; is available &lt;a href="https://osf.io/preprints/psyarxiv/37g29_v1"&gt;here&lt;/a&gt;.&lt;/p&gt;</description></item><item><title>More on vibe researching</title><link>https://muddy.jprs.me/links/2026-02-13-more-on-vibe-researching/</link><pubDate>Fri, 13 Feb 2026 23:49:00 -0500</pubDate><guid>https://muddy.jprs.me/links/2026-02-13-more-on-vibe-researching/</guid><description>&lt;p&gt;To follow on &lt;a href="https://muddy.jprs.me/links/2026-02-12-an-end-to-end-ai-pipeline-for-policy-evaluation-papers/"&gt;yesterday&amp;rsquo;s post&lt;/a&gt; on AI-produced research, here is a reflection on &amp;ldquo;vibe researching&amp;rdquo; from Prof. Joshua Gans of the University of Toronto&amp;rsquo;s Rotman School of Management. Since the release of the first &amp;ldquo;reasoning&amp;rdquo; models in late 2024, he has gone all in on experimenting with AI-first research.&lt;/p&gt;
&lt;p&gt;One of the key takeaways is that he found himself pursuing low quality ideas to completion more often, precisely because the cost of choosing to continue to pursue a questionable idea has been lowered. Sycophancy is a problem, too. With an AI cheerleader, it is easy to convince yourself you have a result when you do not.&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;Those ideas were all fine but not high quality, and what is worse, I didn’t realise that they weren’t that significant until external referees said so. I didn’t realise it because they were reasonably hard to do, and I was happy to have solved them.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;I will note that (human) peer reviewers cannot be the levee that stops the flood of middling AI research: the system of uncompensated labour that undergirds all of academic publishing is already strained to bursting, as every editor desperate to find referees for a paper will tell you.&lt;/p&gt;
&lt;p&gt;Prof. Gans concludes that his year-long experiment in &amp;ldquo;vibe researching&amp;rdquo; was a failure, despite producing many working papers and publishing a handful of them:&lt;/p&gt;</description></item><item><title>An end-to-end AI pipeline for policy evaluation papers</title><link>https://muddy.jprs.me/links/2026-02-12-an-end-to-end-ai-pipeline-for-policy-evaluation-papers/</link><pubDate>Thu, 12 Feb 2026 19:11:00 -0500</pubDate><guid>https://muddy.jprs.me/links/2026-02-12-an-end-to-end-ai-pipeline-for-policy-evaluation-papers/</guid><description>&lt;p&gt;Prof. David Yanagizawa-Drott from the Social Catalyst Lab at the University of Zurich has launched Project APE (Autonomous Policy Evaluation), an end-to-end AI pipeline to generate policy evaluation papers. The vast majority of policies around the world are never rigorously evaluated, so it would certainly be useful if we were able to do so in an automated fashion.&lt;/p&gt;
&lt;p&gt;Claude Code is the heart of the project, but other models are used to review the outputs and provide journal-style referee reports. All the coding is done in R (though Python is called in some scripts). Currently, judging is done by Gemini 3 Flash, which compares the AI papers against published research in top economics journals (the mechanics are sketched after the quote):&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;Blind comparison: An LLM judge compares two papers without knowing which is AI-generated
Position swapping: Each pair is judged twice with paper order swapped to control for bias
TrueSkill ratings: Papers accumulate skill ratings that update after each match&lt;/p&gt;
&lt;/blockquote&gt;
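&lt;p&gt;Mechanically, that judging loop might look something like the following sketch. This is my reconstruction from the description above, using the Python &lt;code&gt;trueskill&lt;/code&gt; package and a random stand-in for the Gemini judge:&lt;/p&gt;
&lt;pre&gt;&lt;code class="language-python"&gt;import random
from trueskill import Rating, rate_1vs1  # pip install trueskill

ratings = {}  # paper id -&amp;gt; Rating

def judge(first, second):
    """Stand-in for the blind LLM judge; returns the id of the winning paper."""
    return random.choice([first, second])

def blind_match(paper_a, paper_b):
    # Position swapping: each pair is judged twice with the order reversed,
    # so a judge that favours whichever paper it sees first cannot bias ratings.
    for first, second in [(paper_a, paper_b), (paper_b, paper_a)]:
        winner = judge(first, second)
        loser = second if winner == first else first
        ratings.setdefault(winner, Rating())
        ratings.setdefault(loser, Rating())
        ratings[winner], ratings[loser] = rate_1vs1(ratings[winner], ratings[loser])

blind_match("apep_0264", "human_paper_042")
print(ratings)
&lt;/code&gt;&lt;/pre&gt;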
&lt;p&gt;The project&amp;rsquo;s home page lists the AI&amp;rsquo;s current &amp;ldquo;win rate&amp;rdquo; at 3.5% in head-to-head matchups against human-written papers.&lt;/p&gt;
&lt;p&gt;Prof. Yanagizawa-Drott says &amp;ldquo;Currently it requires at a minimum some initial human input for each paper,&amp;rdquo; although he does not specify exactly what. If we look at the &lt;a href="https://github.com/SocialCatalystLab/ape-papers/blob/main/apep_0264/v1/initialization.md"&gt;&lt;code&gt;initialization.md&lt;/code&gt;&lt;/a&gt; file found in each paper&amp;rsquo;s directory, we see the following questions with user-provided inputs:&lt;/p&gt;
&lt;blockquote&gt;
&lt;ol&gt;
&lt;li&gt;Policy domain: What policy area interests you?&lt;/li&gt;
&lt;li&gt;Method: Which identification method?&lt;/li&gt;
&lt;li&gt;Data era: Modern or historical data?&lt;/li&gt;
&lt;li&gt;API keys: Did you configure data API keys?&lt;/li&gt;
&lt;li&gt;External review: Include external model reviews?&lt;/li&gt;
&lt;li&gt;Risk appetite: Exploration vs exploitation?&lt;/li&gt;
&lt;li&gt;Other preferences: Any other preferences or constraints?&lt;/li&gt;
&lt;/ol&gt;
&lt;/blockquote&gt;
&lt;p&gt;The code, reviews, manuscript, and even the results of the initial idea generation process are all available on &lt;a href="https://github.com/SocialCatalystLab/ape-papers"&gt;GitHub&lt;/a&gt;. Their immediate goal is to generate a sample of 1,000 papers and run human evaluations on them (at time of posting, there are 264 papers in the GitHub repository).&lt;/p&gt;</description></item><item><title>Why a Canadian news site just launched an AI publishing tool</title><link>https://muddy.jprs.me/links/2026-02-09-why-a-canadian-news-site-just-launched-an-ai-publishing-tool/</link><pubDate>Mon, 09 Feb 2026 19:49:00 -0500</pubDate><guid>https://muddy.jprs.me/links/2026-02-09-why-a-canadian-news-site-just-launched-an-ai-publishing-tool/</guid><description>&lt;p&gt;It&amp;rsquo;s no secret that Canadian journalism (like journalism everywhere) is in trouble. Newsrooms face a steady stream of layoffs despite a couple hundred million Canadian dollars of direct and indirect &lt;a href="https://macdonaldlaurier.ca/government-subsidies-for-canadas-media-were-supposed-to-be-temporary-but-they-keep-on-growing-and-could-be-here-to-stay-dave-snow-in-the-hub/"&gt;government subsidies&lt;/a&gt; every year. The vast majority of outlets eligible for these subsidies take advantage of them, and combined they can &lt;a href="https://macdonaldlaurier.ca/government-subsidies-for-canadas-media-were-supposed-to-be-temporary-but-they-keep-on-growing-and-could-be-here-to-stay-dave-snow-in-the-hub/"&gt;subsidize half of a journalist&amp;rsquo;s salary&lt;/a&gt;. News organizations are desperate to diversify their revenue streams.&lt;/p&gt;
&lt;p&gt;&lt;a href="thehub.ca/2025/03/28/rudyard-griffiths-and-sean-speer-the-hub-is-receiving-over-60000-from-the-government-and-donating-it-all-to-charity-will-the-rest-of-canadas-subsidized-media-disclose-what-theyre-gettin/"&gt;&lt;em&gt;The Hub&lt;/em&gt;&lt;/a&gt; is a right-leaning publication launched in 2021 with a focus on policy and politics. Notably, the outlet &lt;a href="https://macdonaldlaurier.ca/the-ottawa-declaration-on-canadian-journalism/"&gt;declines&lt;/a&gt; or &lt;a href="https://thehub.ca/2025/03/28/rudyard-griffiths-and-sean-speer-the-hub-is-receiving-over-60000-from-the-government-and-donating-it-all-to-charity-will-the-rest-of-canadas-subsidized-media-disclose-what-theyre-gettin/"&gt;donates&lt;/a&gt; their subsidies, citing a valid concern that the scale of such subsidies &lt;a href="https://thehub.ca/2024/07/08/deepdive-government-funding-of-the-news-industry-is-eroding-canadians-trust-in-the-media/"&gt;threaten the perceived trustworthiness and independence&lt;/a&gt; of the media.&lt;/p&gt;
&lt;p&gt;In late January 2026, &lt;em&gt;The Hub&lt;/em&gt; &lt;a href="https://thehub.ca/2026/01/28/why-we-are-launching-newsbox-for-the-hubs-paid-subscribers/"&gt;launched NewsBox&lt;/a&gt;, an AI-powered publishing tool. NewsBox aims to make it easier for creators to transform their content (written, audio, or video) into other formats, such as speeches, essays, or talking points, while maintaining the author&amp;rsquo;s distinct voice. You can see examples of the tool&amp;rsquo;s output on new articles in &lt;em&gt;The Hub&lt;/em&gt;, each of which is accompanied by an AI-generated summary and list of quotes at the top of the page. There is also a &amp;ldquo;Hub AI&amp;rdquo; chatbot in the sidebar of every article.&lt;/p&gt;
&lt;p&gt;The app very much uses &lt;em&gt;The Hub&lt;/em&gt;&amp;rsquo;s branding, prominently featuring the outlet’s co-creators, who also created NewsBox. While their pitch talks about preserving creators&amp;rsquo; voices to avoid the &amp;ldquo;soulless prose&amp;rdquo; and &amp;ldquo;slop&amp;rdquo; outputted by ChatGPT and similar tools, I have to wonder if tighter integration of AI into the news and opinion side of the operation will &lt;a href="https://reutersinstitute.politics.ox.ac.uk/generative-ai-and-news-report-2025-how-people-think-about-ais-role-journalism-and-society"&gt;raise its own issues with trust&lt;/a&gt;. &lt;em&gt;The Hub&lt;/em&gt; has always been fairly tech-friendly, including a &lt;a href="https://thehub.ca/2023/12/20/marc-edge-canadas-news-media-need-a-plan-and-some-help-to-find-a-way-forward/"&gt;longstanding&lt;/a&gt; &lt;a href="https://thehub.ca/category/meta/"&gt;sponsorship&lt;/a&gt; by Meta.&lt;/p&gt;</description></item><item><title>Anthropic's statistical analysis skill doesn't get statistical significance quite right</title><link>https://muddy.jprs.me/links/2026-02-06-anthropic-s-statistical-analysis-skill-doesn-t-get-statistical-significance-quite-right/</link><pubDate>Fri, 06 Feb 2026 19:30:00 -0500</pubDate><guid>https://muddy.jprs.me/links/2026-02-06-anthropic-s-statistical-analysis-skill-doesn-t-get-statistical-significance-quite-right/</guid><description>&lt;p&gt;Anthropic&amp;rsquo;s new statistical analysis skill demonstrates a common misunderstanding of statistical significance:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;Statistical significance means the difference is unlikely due to chance.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;But this phrasing isn&amp;rsquo;t quite right. The p-value in Null Hypothesis Significance Testing is not about the probability the results are &amp;ldquo;due to chance&amp;rdquo;; it is the probability—under the null hypothesis and the model assumptions—of observing results at least as extreme as the ones we obtained. In other words, the p-value summarizes how compatible the data are with the null, given our modelling choices. What it does not tell you is the probability that the null hypothesis is true.&lt;/p&gt;
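&lt;p&gt;The definition is easy to demonstrate with a quick permutation simulation (a toy example of mine, not anything from Anthropic&amp;rsquo;s skill): the p-value is just the share of null-world results at least as extreme as the one observed:&lt;/p&gt;
&lt;pre&gt;&lt;code class="language-python"&gt;import numpy as np

rng = np.random.default_rng(0)

# One "study": the observed difference in means between two groups of 50.
a = rng.normal(0.0, 1.0, 50)
b = rng.normal(0.3, 1.0, 50)   # this toy world has a true effect of 0.3
observed = abs(a.mean() - b.mean())

# The null world: no group difference. Shuffle the labels and recompute.
pooled = np.concatenate([a, b])
null_stats = []
for _ in range(10_000):
    rng.shuffle(pooled)
    null_stats.append(abs(pooled[:50].mean() - pooled[50:].mean()))

# p-value: probability, under the null, of a result at least this extreme.
p = np.mean(np.array(null_stats) &amp;gt;= observed)
print("permutation p-value:", p)
&lt;/code&gt;&lt;/pre&gt;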
&lt;p&gt;Statistician Andrew Gelman gave a good definition for statistical significance in a 2015 &lt;a href="https://statmodeling.stat.columbia.edu/2015/07/21/a-bad-definition-of-statistical-significance-from-the-u-s-department-of-health-and-human-services-effective-health-care-program/"&gt;blog post&lt;/a&gt;:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;A mathematical technique to measure the strength of evidence from a single study. Statistical significance is conventionally declared when the p-value is less than 0.05. The p-value is the probability of seeing a result as strong as observed or greater, under the &lt;em&gt;null hypothesis&lt;/em&gt; (which is commonly the hypothesis that there is no effect). Thus, the smaller the p-value, the less consistent are the data with the null hypothesis under this measure.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;As some of the commenters on that blog post observe, simply being able to parrot a technically accurate definition of a p-value does not necessarily make us better at applying statistical significance in practice. It is certainly true that statistical significance is widely misused in scientific publishing as a threshold to distinguish signal from noise (or to be fancy, a &amp;ldquo;lexicographic decision rule&amp;rdquo;), which is why &lt;a href="https://sites.stat.columbia.edu/gelman/research/published/abandon.pdf"&gt;some scientists have argued that we should abandon it as the default statistical paradigm for research&lt;/a&gt;.&lt;/p&gt;</description></item></channel></rss>