Prof. David Yanagizawa-Drott from the Social Catalyst Lab at the University of Zurich has launched Project APE (Autonomous Policy Evaluation), an end-to-end AI pipeline for generating policy evaluation papers. The vast majority of policies around the world are never rigorously evaluated, so being able to evaluate them in an automated fashion would certainly be useful.

Claude Code is the heart of the project, but other models are used to review the outputs and provide journal-style referee reports. All the coding is done in R (though Python is called in some scripts). Judging is currently done by Gemini 3 Flash, which compares the AI-generated papers against published research in top economics journals:

  - Blind comparison: An LLM judge compares two papers without knowing which is AI-generated.
  - Position swapping: Each pair is judged twice with the paper order swapped, to control for position bias.
  - TrueSkill ratings: Papers accumulate skill ratings that update after each match.
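To make the mechanics concrete, here is a minimal sketch of what such a judging loop could look like, using the Python trueskill package. The judge_prefers_first() function is a stand-in for the LLM judge call, and none of this is Project APE's actual code (which is written in R); it only illustrates the blind comparison, position swapping, and rating updates described above.

```python
# Minimal sketch (not Project APE's code) of blind pairwise judging with
# position swapping and TrueSkill rating updates.
import random
import trueskill

env = trueskill.TrueSkill(draw_probability=0.0)  # assume the judge never declares a tie

def judge_prefers_first(text_a: str, text_b: str) -> bool:
    """Stand-in for the LLM judge; in the real pipeline this would be a model call."""
    return random.random() < 0.5

def blind_match(paper_a: dict, paper_b: dict, ratings: dict) -> None:
    """Judge one pair twice, swapping presentation order to control for position bias."""
    for first, second in [(paper_a, paper_b), (paper_b, paper_a)]:
        if judge_prefers_first(first["text"], second["text"]):
            winner, loser = first, second
        else:
            winner, loser = second, first
        # Ratings accumulate across matches: the winner's rating rises, the loser's falls.
        ratings[winner["id"]], ratings[loser["id"]] = env.rate_1vs1(
            ratings[winner["id"]], ratings[loser["id"]]
        )

papers = [{"id": "ai_0001", "text": "..."}, {"id": "human_0001", "text": "..."}]
ratings = {p["id"]: env.create_rating() for p in papers}
blind_match(papers[0], papers[1], ratings)
print(ratings)
```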

The project’s home page lists the AI’s current “win rate” at 3.5% in head-to-head matchups against human-written papers.

Prof. Yanagizawa-Drott says “Currently it requires at a minimum some initial human input for each paper,” although he does not specify exactly what. If we look at the initialization.json file found in each paper’s directory, we see the following questions with user-provided answers (a hypothetical sketch of such a file follows the list):

  1. Policy domain: What policy area interests you?
  2. Method: Which identification method?
  3. Data era: Modern or historical data?
  4. API keys: Did you configure data API keys?
  5. External review: Include external model reviews?
  6. Risk appetite: Exploration vs exploitation?
  7. Other preferences: Any other preferences or constraints?
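
For illustration only, here is one plausible way those seven answers could be stored. The field names and values below are guesses, not the project's actual schema.

```python
# Hypothetical illustration of how the seven answers might be stored in
# initialization.json; field names and values are guesses, not Project APE's schema.
import json

init = {
    "policy_domain": "education",              # 1. What policy area interests you?
    "method": "difference-in-differences",     # 2. Which identification method?
    "data_era": "modern",                      # 3. Modern or historical data?
    "api_keys_configured": True,               # 4. Did you configure data API keys?
    "external_review": True,                   # 5. Include external model reviews?
    "risk_appetite": "exploration",            # 6. Exploration vs exploitation?
    "other_preferences": "",                   # 7. Any other preferences or constraints?
}

with open("initialization.json", "w") as f:
    json.dump(init, f, indent=2)
```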

The code, reviews, manuscript, and even the results of the initial idea-generation process are all available on GitHub. The project’s immediate goal is to generate a sample of 1,000 papers and run human evaluations on them (at the time of posting, there are 264 papers in the GitHub repository).

More in this Twitter thread from Prof. Yanagizawa-Drott. An interesting suggestion from a commenter there: journals could use a process like this to generate a “benchmark” paper to compare new submissions against.