<?xml version="1.0" encoding="utf-8" standalone="yes"?><rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom"><channel><title>Claude on Big Muddy</title><link>https://muddy.jprs.me/tags/claude/</link><description>Recent content in Claude on Big Muddy</description><generator>Hugo</generator><language>en-US</language><lastBuildDate>Sun, 15 Mar 2026 22:49:00 -0400</lastBuildDate><atom:link href="https://muddy.jprs.me/tags/claude/index.xml" rel="self" type="application/rss+xml"/><item><title>Using Claude Code for cross-package statistical audits</title><link>https://muddy.jprs.me/links/2026-03-15-using-claude-claude-for-cross-package-statistical-audits/</link><pubDate>Sun, 15 Mar 2026 22:49:00 -0400</pubDate><guid>https://muddy.jprs.me/links/2026-03-15-using-claude-claude-for-cross-package-statistical-audits/</guid><description>&lt;p&gt;Economist Scott Cunningham shared an important example of why we should always report the statistical package and version used in our analyses: he used Claude Code to produce six versions of the exact same analysis using six different packages in R, Python, and Stata. In a &lt;a href="https://en.wikipedia.org/wiki/Difference_in_differences"&gt;difference-in-differences&lt;/a&gt; analysis of the effect of mental health hospital closures on homicide using the standard &lt;a href="https://bcallaway11.github.io/did/articles/multi-period-did.html"&gt;Callaway and Sant’Anna estimator&lt;/a&gt; (for DiD with multiple time periods), he got very different results for some model specifications.&lt;/p&gt;
&lt;p&gt;Since the specifications and the data were identical between packages, he discovered the divergences occurred due to how the packages handled problems with &lt;a href="https://www.tandfonline.com/doi/full/10.1080/00273171.2011.568786#d1e368"&gt;propensity score&lt;/a&gt; weights. Packages were not necessarily transparent about issues with these weights. If you were not running multiple analyses and comparing results across packages, or else carefully checking propensity score diagnostics, you might never have realized how precarious your results were.&lt;/p&gt;
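&lt;p&gt;The failure mode can be illustrated with a toy inverse-propensity-weighting diagnostic (a hypothetical Python sketch, not the internals of any of the packages he tested): propensity scores near 0 or 1 produce extreme weights, and it is up to each package whether to trim them, drop them, or silently carry them through.&lt;/p&gt;

```python
import numpy as np

def ipw_diagnostics(pscores, trim=0.01):
    """Flag extreme inverse-propensity weights before trusting an estimate.

    pscores: estimated propensity scores for the comparison group.
    trim: scores outside [trim, 1 - trim] are considered problematic.
    """
    pscores = np.asarray(pscores, dtype=float)
    # Inverse-propensity weights blow up as the score approaches 1.
    weights = pscores / (1.0 - pscores)
    # Boolean mask of observations with near-degenerate scores.
    extreme = np.greater(pscores, 1.0 - trim) | np.less(pscores, trim)
    return {
        "n_extreme": int(extreme.sum()),
        "max_weight": float(weights.max()),
        "share_of_total_weight": float(weights[extreme].sum() / weights.sum()),
    }
```

&lt;p&gt;With an input like &lt;code&gt;[0.5, 0.5, 0.999]&lt;/code&gt;, this reports a single observation carrying nearly all of the total weight: exactly the kind of silent instability that, per Cunningham, the packages handle differently and do not always surface.&lt;/p&gt;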
&lt;p&gt;Prof. Cunningham closes with the following advice:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;The fifth point, and the broader point, is that this kind of cross-package, cross-language audit is exactly what Claude Code should be used for. Why? Because this is a task that is time-intensive, high-value, and brutally easy to get wrong. Just one mismatched diagnostic across languages invalidates the entire comparison; even something as simple as sample sizes differing across specifications would flag it. This is both easy and not easy, but it is not work humans should be doing by hand given how easy it would be to get even that much wrong.&lt;/p&gt;
&lt;/blockquote&gt;</description></item><item><title>Editors hate this one weird trick</title><link>https://muddy.jprs.me/notes/2026-03-05-editors-hate-this-one-weird-trick/</link><pubDate>Thu, 05 Mar 2026 20:05:00 -0500</pubDate><guid>https://muddy.jprs.me/notes/2026-03-05-editors-hate-this-one-weird-trick/</guid><description>&lt;p&gt;Given my &lt;a href="https://muddy.jprs.me/links/2026-03-03-the-productivity-shock-coming-to-academic-publishing/"&gt;recent&lt;/a&gt; &lt;a href="https://muddy.jprs.me/notes/2026-02-26-these-academic-journal-ai-policies-aren-t-going-to-last/"&gt;posts&lt;/a&gt; on AI in academic publishing, I just wanted to share this joke from Prof. Arthur Spirling on &lt;a href="https://x.com/arthur_spirling/status/2029006543765520471"&gt;Twitter&lt;/a&gt;:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;Actually you cant run my paper through Claude to desk reject it because Claude is a regular coauthor of mine. Conflict of interest. Checkmate, editors&lt;/p&gt;
&lt;/blockquote&gt;</description></item><item><title>An end-to-end AI pipeline for policy evaluation papers</title><link>https://muddy.jprs.me/links/2026-02-12-an-end-to-end-ai-pipeline-for-policy-evaluation-papers/</link><pubDate>Thu, 12 Feb 2026 19:11:00 -0500</pubDate><guid>https://muddy.jprs.me/links/2026-02-12-an-end-to-end-ai-pipeline-for-policy-evaluation-papers/</guid><description>&lt;p&gt;Prof. David Yanagizawa-Drott from the Social Catalyst Lab at the University of Zurich has launched Project APE (Autonomous Policy Evaluation), an end-to-end AI pipeline to generate policy evaluation papers. The vast majority of policies around the world are never rigorously evaluated, so it would certainly be useful if we were able to do so in an automated fashion.&lt;/p&gt;
&lt;p&gt;Claude Code is the heart of the project, but other models review the outputs and provide journal-style referee reports. All the coding is done in R (though Python is called in some scripts). Currently, Gemini 3 Flash serves as the judge, comparing generated papers against published research in top economics journals:&lt;/p&gt;
&lt;blockquote&gt;
&lt;ul&gt;
&lt;li&gt;Blind comparison: An LLM judge compares two papers without knowing which is AI-generated&lt;/li&gt;
&lt;li&gt;Position swapping: Each pair is judged twice with paper order swapped to control for bias&lt;/li&gt;
&lt;li&gt;TrueSkill ratings: Papers accumulate skill ratings that update after each match&lt;/li&gt;
&lt;/ul&gt;
&lt;/blockquote&gt;
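&lt;p&gt;The protocol above can be sketched as a small rating loop. This is a deliberate simplification: an Elo-style update and an abstract &lt;code&gt;judge&lt;/code&gt; callable stand in for TrueSkill and the LLM judge, but the blind, order-swapped structure is the same.&lt;/p&gt;

```python
def expected(r_a, r_b):
    # Elo-style expected score for paper a against paper b.
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400.0))

def rate_pair(ratings, a, b, judge, k=32.0):
    """Judge papers a and b twice, swapping order to control for position
    bias, then update both ratings (Elo-style stand-in for TrueSkill)."""
    score_a = 0.0
    # Two blind passes, (a, b) and (b, a); judge returns the winner's id.
    for first, second in ((a, b), (b, a)):
        winner = judge(first, second)
        score_a += 1.0 if winner == a else 0.0
    score_a /= 2.0  # fraction of the two matches won by a
    e = expected(ratings[a], ratings[b])
    # Zero-sum update: a gains exactly what b loses.
    ratings[a] += k * (score_a - e)
    ratings[b] += k * ((1.0 - score_a) - (1.0 - e))
```

&lt;p&gt;Each matchup nudges the AI paper&amp;rsquo;s rating up or down relative to its human-written opponent, which is how the &amp;ldquo;win rate&amp;rdquo; figure below accumulates.&lt;/p&gt;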
&lt;p&gt;The project&amp;rsquo;s home page lists the AI&amp;rsquo;s current &amp;ldquo;win rate&amp;rdquo; at 3.5% in head-to-head matchups against human-written papers.&lt;/p&gt;
&lt;p&gt;Prof. Yanagizawa-Drott says &amp;ldquo;Currently it requires at a minimum some initial human input for each paper,&amp;rdquo; although he does not specify exactly what that input involves. If we look at the &lt;a href="https://github.com/SocialCatalystLab/ape-papers/blob/main/apep_0264/v1/initialization.md"&gt;&lt;code&gt;initialization.md&lt;/code&gt;&lt;/a&gt; file found in each paper&amp;rsquo;s directory, we see the following questions with user-provided inputs:&lt;/p&gt;
&lt;blockquote&gt;
&lt;ol&gt;
&lt;li&gt;Policy domain: What policy area interests you?&lt;/li&gt;
&lt;li&gt;Method: Which identification method?&lt;/li&gt;
&lt;li&gt;Data era: Modern or historical data?&lt;/li&gt;
&lt;li&gt;API keys: Did you configure data API keys?&lt;/li&gt;
&lt;li&gt;External review: Include external model reviews?&lt;/li&gt;
&lt;li&gt;Risk appetite: Exploration vs exploitation?&lt;/li&gt;
&lt;li&gt;Other preferences: Any other preferences or constraints?&lt;/li&gt;
&lt;/ol&gt;
&lt;/blockquote&gt;
&lt;p&gt;The code, reviews, manuscript, and even the results of the initial idea generation process are all available on &lt;a href="https://github.com/SocialCatalystLab/ape-papers"&gt;GitHub&lt;/a&gt;. Their immediate goal is to generate a sample of 1,000 papers and run human evaluations on them (at time of posting, there are 264 papers in the GitHub repository).&lt;/p&gt;</description></item><item><title>Anthropic's statistical analysis skill doesn't get statistical significance quite right</title><link>https://muddy.jprs.me/links/2026-02-06-anthropic-s-statistical-analysis-skill-doesn-t-get-statistical-significance-quite-right/</link><pubDate>Fri, 06 Feb 2026 19:30:00 -0500</pubDate><guid>https://muddy.jprs.me/links/2026-02-06-anthropic-s-statistical-analysis-skill-doesn-t-get-statistical-significance-quite-right/</guid><description>&lt;p&gt;Anthropic&amp;rsquo;s new statistical analysis skill demonstrates a common misunderstanding of statistical significance:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;Statistical significance means the difference is unlikely due to chance.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;But this phrasing isn&amp;rsquo;t quite right. The p-value in Null Hypothesis Significance Testing is not about the probability the results are &amp;ldquo;due to chance&amp;rdquo;; it is the probability—under the null hypothesis and the model assumptions—of observing results at least as extreme as the ones we obtained. In other words, the p-value summarizes how compatible the data are with the null, given our modeling choices. What it does not tell you is the probability that the null hypothesis is true.&lt;/p&gt;
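&lt;p&gt;As a toy illustration (a brute-force simulation, not part of Anthropic&amp;rsquo;s skill), the definition can be checked directly: simulate the null hypothesis many times and count how often the simulated result is at least as extreme as the observed one.&lt;/p&gt;

```python
import numpy as np

def simulated_p_value(observed_diff, n_per_group, null_sd=1.0,
                      n_sims=20000, seed=0):
    """Approximate a two-sided p-value by simulation: the probability,
    assuming no true difference between groups, of a difference in group
    means at least as extreme as the one observed."""
    rng = np.random.default_rng(seed)
    # Under the null, both groups draw from the same distribution.
    a = rng.normal(0.0, null_sd, size=(n_sims, n_per_group)).mean(axis=1)
    b = rng.normal(0.0, null_sd, size=(n_sims, n_per_group)).mean(axis=1)
    null_diffs = a - b
    # Fraction of null draws at least as extreme as the observed difference.
    return float(np.mean(np.greater_equal(np.abs(null_diffs),
                                          abs(observed_diff))))
```

&lt;p&gt;Note what the function conditions on: the null and the distributional assumptions are baked into the simulation, and the output says nothing about the probability that the null itself is true.&lt;/p&gt;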
&lt;p&gt;Statistician Andrew Gelman gave a good definition for statistical significance in a 2015 &lt;a href="https://statmodeling.stat.columbia.edu/2015/07/21/a-bad-definition-of-statistical-significance-from-the-u-s-department-of-health-and-human-services-effective-health-care-program/"&gt;blog post&lt;/a&gt;:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;A mathematical technique to measure the strength of evidence from a single study. Statistical significance is conventionally declared when the p-value is less than 0.05. The p-value is the probability of seeing a result as strong as observed or greater, under the &lt;em&gt;null hypothesis&lt;/em&gt; (which is commonly the hypothesis that there is no effect). Thus, the smaller the p-value, the less consistent are the data with the null hypothesis under this measure.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;As some of the commenters on Gelman&amp;rsquo;s post observe, simply being able to parrot a technically accurate definition of a p-value does not necessarily make us better at applying statistical significance in practice. It is certainly true that statistical significance is widely misused in scientific publishing as a threshold to distinguish signal from noise (or to be fancy, a &amp;ldquo;lexicographic decision rule&amp;rdquo;), which is why &lt;a href="https://sites.stat.columbia.edu/gelman/research/published/abandon.pdf"&gt;some scientists have argued that we should abandon it as the default statistical paradigm for research&lt;/a&gt;.&lt;/p&gt;</description></item></channel></rss>