Evaluating Claude models on Bongard problems. The problems are drawn from the On-Line Encyclopedia of Bongard Problems (OEBP). A Bongard problem presents two groups of six diagrams; the task is to articulate the rule that separates the left group from the right.
You can visualize results in the Streamlit app here.
Future directions:
- Investigate model descriptions of abstract images, independent of the Bongard task; image perception, rather than the abstract reasoning itself, currently seems to be the bottleneck.
- Add an LLM evaluation of model responses: give an LLM the reference solution and a model response, and ask it to determine whether the response is correct (a minimal sketch appears after this list).
- Evaluate models beyond Haiku and Opus (including future releases that target diagrams).
- Do more prompt engineering.
- Evaluate via classification rather than description: given five left images and five right images, ask the model to place a new image in the correct group (see the second sketch after this list).
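
One way to implement the LLM grading idea above is a text-only judge call. Here is a minimal sketch using the `anthropic` Python SDK; the prompt wording, the `grade_response` helper, and the model string are illustrative assumptions, not part of this repo:

```python
import anthropic

client = anthropic.Anthropic()  # assumes ANTHROPIC_API_KEY is set in the environment

# Hypothetical judge prompt; the exact wording would need tuning.
GRADER_PROMPT = """You are grading an answer to a Bongard problem.
Reference solution: {solution}
Model response: {response}
Does the response express the same rule as the reference solution?
Answer with a single word: CORRECT or INCORRECT."""


def grade_response(solution: str, response: str,
                   model: str = "claude-3-opus-20240229") -> bool:
    """Ask an LLM judge whether a response matches the reference solution."""
    message = client.messages.create(
        model=model,
        max_tokens=10,
        messages=[{
            "role": "user",
            "content": GRADER_PROMPT.format(solution=solution, response=response),
        }],
    )
    return message.content[0].text.strip().upper().startswith("CORRECT")
```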
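
The classification variant can be posed as a single multimodal message that interleaves text labels with base64-encoded images. A sketch under the same assumptions; `_image_block` and `classify` are hypothetical helpers and the prompt is untested:

```python
import base64
import anthropic

client = anthropic.Anthropic()


def _image_block(path: str) -> dict:
    """Encode a PNG file as an Anthropic image content block."""
    with open(path, "rb") as f:
        data = base64.standard_b64encode(f.read()).decode("utf-8")
    return {"type": "image",
            "source": {"type": "base64", "media_type": "image/png", "data": data}}


def classify(left_paths: list[str], right_paths: list[str], query_path: str,
             model: str = "claude-3-opus-20240229") -> str:
    """Show five left and five right panels, then ask which group a new image belongs to."""
    content = [{"type": "text", "text": "Here are the LEFT images of a Bongard problem:"}]
    content += [_image_block(p) for p in left_paths]
    content.append({"type": "text", "text": "Here are the RIGHT images:"})
    content += [_image_block(p) for p in right_paths]
    content.append({"type": "text", "text": "Here is a new image:"})
    content.append(_image_block(query_path))
    content.append({"type": "text",
                    "text": "Does the new image belong on the LEFT or the RIGHT? "
                            "Answer with one word."})
    message = client.messages.create(
        model=model,
        max_tokens=10,
        messages=[{"role": "user", "content": content}],
    )
    return message.content[0].text.strip().upper()
```

Posing the task as binary classification sidesteps grading free-form descriptions, so accuracy can be computed directly from the one-word answer.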
As far as I can tell, no one has evaluated modern multimodal models on this exact task, but there is some related work:
- On the Measure of Intelligence, which introduces the Abstraction and Reasoning Corpus (ARC)
- Bongard-LOGO: A New Benchmark for Human-Level Concept Learning and Reasoning
- Bongard-OpenWorld: Few-Shot Reasoning for Free-Form Visual Concepts in the Real World
- Neural networks for abstraction and reasoning: Towards broad generalization in machines
- Using Program Synthesis and Inductive Logic Programming to solve Bongard Problems, which uses DreamCoder
- D5, which poses an analogous task for text: describing the difference between two corpora in natural language
Resources on Bongard problems: