CodeNames is a party card game where players need to find creative word associations. It has some properties that make it compelling as a testbed for scalable oversight experiments:
- It should be easy for language models to learn.
- The computational complexity for generating a clue is much larger than for finding an issue with a clue, which is higher still than evaluating an issue.
- It's easy to procedurally generate many games.
- It's easy to simulate overseers with various kinds of flaws, or artificially limit the oversight budget.
This project aims to expand on the theory of predicting whether a scalable oversight technique will robustly succeed for some problem domain and overseer, and then test out the theory with many small experiments.
For more detail, see the paper draft (targetted at AAAI 2025).
- the original Debate paper
- Redwood Research's post on meta-level adversarial evaluations
(Not really kept up to date, for full details check out the project doc)
- Generate decent clues with GPT-4
- Fine tune an open source LLM on the GPT-4 clues
- Get GPT-3.5 to play as the receiver of the clue (ground truth clue reward)
- POC of improving clue quality with DPO
- Implement "RLHF" using a simulated overseer
- Implement Debate/Critiques (they look pretty much the same here)
- Insert a variety of flaws into the overseer
- Add a meta-level adversarial incentive
- Generate a huge matrix of experiment parameters and run all of them