CodeNames Oversight

CodeNames is a party card game where players need to find creative word associations. It has some properties that make it compelling as a testbed for scalable oversight experiments:

It should be easy for language models to learn.
The computational complexity for generating a clue is much larger than for finding an issue with a clue, which is higher still than evaluating an issue.
It's easy to procedurally generate many games.
It's easy to simulate overseers with various kinds of flaws, or artificially limit the oversight budget.

This project aims to expand on the theory of predicting whether a scalable oversight technique will robustly succeed for some problem domain and overseer, and then test out the theory with many small experiments.

For more detail, see the paper draft (targetted at AAAI 2025).

Relevant background

the original Debate paper
Redwood Research's post on meta-level adversarial evaluations

Rough Roadmap

(Not really kept up to date, for full details check out the project doc)

Generate decent clues with GPT-4
Fine tune an open source LLM on the GPT-4 clues
Get GPT-3.5 to play as the receiver of the clue (ground truth clue reward)
POC of improving clue quality with DPO
Implement "RLHF" using a simulated overseer
Implement Debate/Critiques (they look pretty much the same here)
Insert a variety of flaws into the overseer
Add a meta-level adversarial incentive
Generate a huge matrix of experiment parameters and run all of them

Name		Name	Last commit message	Last commit date
Latest commit History 123 Commits
codenames_oversight		codenames_oversight
data		data
experiments		experiments
jobs		jobs
models		models
paper		paper
playground		playground
results-explorer		results-explorer
test		test
.gitignore		.gitignore
README.md		README.md
game-words.txt		game-words.txt
plot_experiment.ipynb		plot_experiment.ipynb
plot_oracle_by_optimization.ipynb		plot_oracle_by_optimization.ipynb
requirements.txt		requirements.txt
ruff.toml		ruff.toml
training_plots.ipynb		training_plots.ipynb

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

CodeNames Oversight

Relevant background

Rough Roadmap

About

Releases

Packages

Contributors 2

Languages

Crazytieguy/codenames-oversight

Folders and files

Latest commit

History

Repository files navigation

CodeNames Oversight

Relevant background

Rough Roadmap

About

Resources

Stars

Watchers

Forks

Releases

Packages 0

Contributors 2

Languages

Packages