Python on Big Muddy

Opt out of very new Python package versions with uv

Sat, 28 Mar 2026 08:43:00 -0400

In light of several recent Python package compromises (litellm, telnyx), here is a useful tip from Hacker News commenter mil22:

For those using uv, you can at least partially protect yourself against such attacks by adding this to your pyproject.toml:

[tool.uv]

exclude-newer = "7 days"

or this to your ~/.config/uv/uv.toml:

exclude-newer = "7 days"

This will prevent uv picking up any package version released within the last 7 days, hopefully allowing enough time for the community to detect any malware and yank the package version before you install it.
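To make the idea concrete, here is a minimal Python sketch of the kind of filtering a "minimum age" rule implies. It is not uv's resolver: the `releases` mapping, the function name `newest_allowed`, and the upload dates are all made up for illustration, and it simply picks the most recently uploaded version that is old enough.

```python
from datetime import datetime, timedelta, timezone


def newest_allowed(releases, min_age_days, now):
    """Return the most recently uploaded version that is at least
    `min_age_days` old at time `now`, or None if nothing qualifies.

    `releases` maps version strings to timezone-aware upload datetimes.
    This is an illustrative simplification, not uv's actual algorithm.
    """
    cutoff = now - timedelta(days=min_age_days)
    # Keep only versions uploaded before the cutoff.
    eligible = [(uploaded, version)
                for version, uploaded in releases.items()
                if uploaded < cutoff]
    if not eligible:
        return None
    # max() compares upload times first, so this is the newest eligible upload.
    return max(eligible)[1]


# Hypothetical upload history for an imaginary package.
example_releases = {
    "1.0.0": datetime(2026, 1, 10, tzinfo=timezone.utc),
    "1.1.0": datetime(2026, 3, 15, tzinfo=timezone.utc),
    "1.1.1": datetime(2026, 3, 26, tzinfo=timezone.utc),  # only 2 days old
}
example_now = datetime(2026, 3, 28, tzinfo=timezone.utc)

if __name__ == "__main__":
    print(newest_allowed(example_releases, 7, example_now))
```

With a 7-day window, the two-day-old `1.1.1` is skipped and the sketch falls back to `1.1.0`.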

Commenter notatallshaw follows up with how to achieve similar behaviour in *pip*:

Pip maintainer here, to do this in pip (26.0+) now you have to manually calculate the date, e.g. --uploaded-prior-to="$(date -u -d '3 days ago' '+%Y-%m-%dT%H:%M:%SZ')"

In pip 26.1 (release scheduled for April 2026), it will support the day ISO-8601 duration format, which uv also supports, so you will be able to do --uploaded-prior-to=P3D, or via env vars or config files, as all pip options can be set in either.
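The `date -u -d '3 days ago'` call in the quoted command relies on GNU `date`. If that is not available, the same timestamp can be computed in Python. The sketch below only builds the string; the helper name `uploaded_prior_to` is made up here, and the pip option itself is taken from the comment above.

```python
from datetime import datetime, timedelta, timezone


def uploaded_prior_to(days, now=None):
    """Return a UTC timestamp `days` days before `now`, formatted as
    YYYY-MM-DDTHH:MM:SSZ to match the shell command quoted above.

    `now` should be timezone-aware; it defaults to the current UTC time.
    """
    if now is None:
        now = datetime.now(timezone.utc)
    # Normalise to UTC so the trailing "Z" is accurate.
    cutoff = now.astimezone(timezone.utc) - timedelta(days=days)
    return cutoff.strftime("%Y-%m-%dT%H:%M:%SZ")


if __name__ == "__main__":
    # Print an option string that could be pasted into a pip command line.
    print(f"--uploaded-prior-to={uploaded_prior_to(3)}")
```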

Using Claude Code for cross-package statistical audits

Sun, 15 Mar 2026 22:49:00 -0400

Economist Scott Cunningham shared an important example of why we should always report the statistical package and version used in our analyses: he used Claude Code to produce six versions of the exact same analysis using six different packages in R, Python, and Stata. In a difference-in-differences analysis of the effect of mental health hospital closures on homicide using the standard Callaway and Sant’Anna estimator (for DiD with multiple time periods), he got very different results for some model specifications.

Because the specifications and the data were identical across packages, he traced the divergences to how the packages handled problems with propensity score weights. Packages were not necessarily transparent about issues with these weights. If you were not running multiple analyses and comparing results across packages, or else carefully checking propensity score diagnostics, you might never have realized how precarious your results were.
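As a rough idea of what a propensity score diagnostic might look like, here is a small Python sketch. It is not taken from Prof. Cunningham's analysis or from any of the six packages: the function name, the 0.01 threshold, and the example scores are all assumptions chosen for illustration. It counts estimated scores that sit very close to 0 or 1 and reports the largest odds weight p/(1-p), which grows without bound as a score approaches 1.

```python
import math


def propensity_diagnostics(scores, eps=0.01):
    """Summarise how close estimated propensity scores come to 0 or 1.

    `scores` is a list of estimated probabilities of treatment.
    `eps` is an arbitrary illustrative threshold, not a recommended value.
    """
    if not scores:
        raise ValueError("scores must not be empty")
    near_zero = sum(1 for p in scores if p < eps)
    near_one = sum(1 for p in scores if p > 1 - eps)
    # Odds weights p / (1 - p) explode as p approaches 1.
    max_odds_weight = max(
        (p / (1 - p) if p < 1 else math.inf) for p in scores
    )
    return {
        "n": len(scores),
        "near_zero": near_zero,
        "near_one": near_one,
        "max_odds_weight": max_odds_weight,
    }


if __name__ == "__main__":
    # Made-up scores: two of them are uncomfortably close to the boundaries.
    print(propensity_diagnostics([0.2, 0.5, 0.995, 0.003]))
```

A single observation with a score of 0.995 already receives a weight of roughly 199, so it can dominate an estimate; whether a package trims, caps, or silently keeps such weights is the sort of difference described above.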

Prof. Cunningham closes with the following advice:

The fifth point, and the broader point, is that this kind of cross-package, cross-language audit is exactly what Claude Code should be used for. Why? Because this is a task that is time-intensive, high-value, and brutally easy to get wrong. But just one mismatched diagnostic across languages invalidates the entire comparison, even something as simple as sample size values differing across specifications, would flag it. This is both easy and not easy — but it is not the work humans should be doing by hand given how easy it would be to even get that much wrong.

An end-to-end AI pipeline for policy evaluation papers

Thu, 12 Feb 2026 19:11:00 -0500

Prof. David Yanagizawa-Drott from the Social Catalyst Lab at the University of Zurich has launched Project APE (Autonomous Policy Evaluation), an end-to-end AI pipeline to generate policy evaluation papers. The vast majority of policies around the world are never rigorously evaluated, so it would certainly be useful if we were able to do so in an automated fashion.

Claude Code is the heart of the project, but other models are used to review the outputs and provide journal-style referee reports. All the coding is done in R (though Python is called in some scripts). Currently, Gemini 3 Flash acts as the judge, comparing the generated papers against published research in top economics journals:

Blind comparison: An LLM judge compares two papers without knowing which is AI-generated

Position swapping: Each pair is judged twice with paper order swapped to control for bias

TrueSkill ratings: Papers accumulate skill ratings that update after each match
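The position-swapping step can be sketched in a few lines of Python. This is only an illustration of the idea, not Project APE's code: the `judge` callable stands in for the LLM call, and treating disagreement between the two orderings as a tie is an assumption made here. The rating update is left out entirely.

```python
def judge_with_swap(paper_a, paper_b, judge):
    """Judge a pair twice with the presentation order swapped.

    `judge(first, second)` is a hypothetical callable that returns 0 if the
    first-shown paper wins and 1 if the second-shown paper wins. It receives
    only the paper texts, so it cannot tell which one is AI-generated.
    Returns "a", "b", or "tie" (when the two orderings disagree).
    """
    a_wins_shown_first = judge(paper_a, paper_b) == 0
    a_wins_shown_second = judge(paper_b, paper_a) == 1
    if a_wins_shown_first and a_wins_shown_second:
        return "a"
    if not a_wins_shown_first and not a_wins_shown_second:
        return "b"
    # The verdict flipped with the order, which suggests position bias.
    return "tie"


def always_first(first, second):
    """Stub judge with pure position bias: always prefers the first paper."""
    return 0


def longer_wins(first, second):
    """Stub judge that ignores position and prefers the longer text."""
    return 0 if len(first) >= len(second) else 1


if __name__ == "__main__":
    print(judge_with_swap("long paper text", "short", always_first))
    print(judge_with_swap("long paper text", "short", longer_wins))
```

A judge that always picks whichever paper it sees first produces contradictory verdicts across the two orderings, so the swap exposes the bias instead of letting it decide the match.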

The project’s home page lists the AI’s current “win rate” at 3.5% in head-to-head matchups against human-written papers.

Prof. Yanagizawa-Drott says “Currently it requires at a minimum some initial human input for each paper,” although he does not specify exactly what. If we look at initialization.json that can be found in each paper’s directory, we see the following questions with user-provided inputs:

Policy domain: What policy area interests you?

Method: Which identification method?

Data era: Modern or historical data?

API keys: Did you configure data API keys?

External review: Include external model reviews?

Risk appetite: Exploration vs exploitation?

Other preferences: Any other preferences or constraints?

The code, reviews, manuscript, and even the results of the initial idea generation process are all available on GitHub. Their immediate goal is to generate a sample of 1,000 papers and run human evaluations on them (at time of posting, there are 264 papers in the GitHub repository).