Anthropic's statistical analysis skill doesn't get statistical significance quite right (github.com)
Anthropic’s new statistical analysis skill demonstrates a common misunderstanding of statistical significance:
“Statistical significance means the difference is unlikely due to chance.”
But this phrasing isn’t quite right. The p-value in Null Hypothesis Significance Testing is not about the probability the results are “due to chance”; it is the probability—under the null hypothesis and the model assumptions—of observing results at least as extreme as the ones we obtained. In other words, the p-value summarizes how compatible the data are with the null, given our modelling choices. What it does not tell you is the probability that the null hypothesis is true.
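To see why the p-value and the probability that the null is true are different quantities, here is a minimal sketch using Bayes' rule. All the numbers are hypothetical assumptions chosen for illustration (the share of tested hypotheses where the null really holds, the significance threshold, and the power of the test); none of them come from the Anthropic file. The function name `prob_null_given_significant` is also made up for this example.

```python
def prob_null_given_significant(prior_null: float, alpha: float, power: float) -> float:
    """Bayes' rule: P(null is true | result is significant at level alpha).

    prior_null: assumed share of tested hypotheses where the null is true
    alpha:      significance threshold, i.e. P(significant | null true)
    power:      P(significant | null false)
    """
    false_positives = prior_null * alpha          # null true, yet significant
    true_positives = (1 - prior_null) * power     # null false and significant
    return false_positives / (false_positives + true_positives)


# Hypothetical scenario: 90% of tested effects are truly null,
# the threshold is 0.05, and the test has 80% power.
share = prob_null_given_significant(prior_null=0.9, alpha=0.05, power=0.8)
print(f"P(null true | significant) = {share:.2f}")  # prints 0.36
```

Under these assumed inputs, about a third of the "significant" results come from true nulls, even though every one of them has p < 0.05. Change the assumed prior or power and the answer moves, which is the point: the p-value alone cannot supply it.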
Statistician Andrew Gelman gave a good definition for statistical significance in a 2015 blog post:
“A mathematical technique to measure the strength of evidence from a single study. Statistical significance is conventionally declared when the p-value is less than 0.05. The p-value is the probability of seeing a result as strong as observed or greater, under the null hypothesis (which is commonly the hypothesis that there is no effect). Thus, the smaller the p-value, the less consistent are the data with the null hypothesis under this measure.”
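That definition can be made concrete with a small worked example. The sketch below computes an exact two-sided p-value for a hypothetical coin-flip experiment (15 heads in 20 flips, with a null hypothesis of a fair coin). The data, the function name `binomial_p_value`, and the choice to measure "as strong as observed or greater" by distance from the expected count are all assumptions made for illustration.

```python
from math import comb


def binomial_p_value(heads: int, flips: int, p_null: float = 0.5) -> float:
    """Exact two-sided p-value for a binomial experiment.

    Sums the null-hypothesis probability of every outcome whose head count
    is at least as far from the expected count as the observed one.
    """
    expected = flips * p_null
    observed_distance = abs(heads - expected)
    total = 0.0
    for k in range(flips + 1):
        if abs(k - expected) >= observed_distance:
            # Probability of exactly k heads if the null is true.
            total += comb(flips, k) * p_null**k * (1 - p_null) ** (flips - k)
    return total


# Hypothetical data: 15 heads in 20 flips, null hypothesis = fair coin.
p = binomial_p_value(15, 20)
print(f"p-value = {p:.4f}")  # prints p-value = 0.0414
```

Every probability in the sum is computed assuming the coin is fair. The result says that a fair coin would produce a split this lopsided about 4% of the time; it does not say there is a 4% chance the coin is fair.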
As some of the commenters on that blog post observe, simply being able to parrot a technically accurate definition of a p-value does not necessarily make us better at applying statistical significance in practice. It is certainly true that statistical significance is widely misused in scientific publishing as a threshold to distinguish signal from noise (or, to be fancy, as a “lexicographic decision rule”), which is why some scientists have argued that we should abandon it as the default statistical paradigm for research.
In any case, the whole Anthropic Markdown file is worth a read to see how Claude approaches presenting uncertainty and avoiding common biases in statistical analysis.
Hat tip to Quentin André on Twitter.