Evaluating Claude models on Bongard problems. The problems are drawn from the On-Line Encyclopedia of Bongard Problems (OEBP). A Bongard problem presents two groups of six diagrams; the task is to articulate the rule that separates the left group from the right.
You can visualize results in the Streamlit app here.
Future directions:
- Investigate model descriptions of abstract images, independent of the Bongard task; image perception, rather than the abstract reasoning itself, currently seems to be the bottleneck.
- Add an LLM evaluation of model responses: give an LLM the reference solution and a model response, and ask it to determine whether the response is correct (a minimal sketch appears after this list).
- Evaluate models beyond Haiku and Opus (including future releases that target diagrams).
- Do more prompt engineering.
- Evaluate via classification rather than description: given five left images and five right images, ask the model to place a new image in the correct group (see the second sketch after this list).
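
One way to implement the LLM grading idea above is a text-only judge call. Here is a minimal sketch using the `anthropic` Python SDK; the prompt wording, the `grade_response` helper, and the model string are illustrative assumptions, not part of this repo:

```python
import anthropic

client = anthropic.Anthropic()  # assumes ANTHROPIC_API_KEY is set in the environment

# Hypothetical judge prompt; the exact wording would need tuning.
GRADER_PROMPT = """You are grading an answer to a Bongard problem.
Reference solution: {solution}
Model response: {response}
Does the response express the same rule as the reference solution?
Answer with a single word: CORRECT or INCORRECT."""


def grade_response(solution: str, response: str,
                   model: str = "claude-3-opus-20240229") -> bool:
    """Ask an LLM judge whether a response matches the reference solution."""
    message = client.messages.create(
        model=model,
        max_tokens=10,
        messages=[{
            "role": "user",
            "content": GRADER_PROMPT.format(solution=solution, response=response),
        }],
    )
    return message.content[0].text.strip().upper().startswith("CORRECT")
```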
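
The classification variant can be posed as a single multimodal message that interleaves text labels with base64-encoded images. A sketch under the same assumptions; `_image_block` and `classify` are hypothetical helpers and the prompt is untested:

```python
import base64
import anthropic

client = anthropic.Anthropic()


def _image_block(path: str) -> dict:
    """Encode a PNG file as an Anthropic image content block."""
    with open(path, "rb") as f:
        data = base64.standard_b64encode(f.read()).decode("utf-8")
    return {"type": "image",
            "source": {"type": "base64", "media_type": "image/png", "data": data}}


def classify(left_paths: list[str], right_paths: list[str], query_path: str,
             model: str = "claude-3-opus-20240229") -> str:
    """Show five left and five right panels, then ask which group a new image belongs to."""
    content = [{"type": "text", "text": "Here are the LEFT images of a Bongard problem:"}]
    content += [_image_block(p) for p in left_paths]
    content.append({"type": "text", "text": "Here are the RIGHT images:"})
    content += [_image_block(p) for p in right_paths]
    content.append({"type": "text", "text": "Here is a new image:"})
    content.append(_image_block(query_path))
    content.append({"type": "text",
                    "text": "Does the new image belong on the LEFT or the RIGHT? "
                            "Answer with one word."})
    message = client.messages.create(
        model=model,
        max_tokens=10,
        messages=[{"role": "user", "content": content}],
    )
    return message.content[0].text.strip().upper()
```

Posing the task as binary classification sidesteps grading free-form descriptions, so accuracy can be computed directly from the one-word answer.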
As far as I can tell, no one has evaluated modern multimodal models on this exact task, but there is some related work:
- On the Measure of Intelligence, which introduces the Abstraction and Reasoning Corpus (ARC)
- Bongard-LOGO: A New Benchmark for Human-Level Concept Learning and Reasoning
- Bongard-OpenWorld: Few-Shot Reasoning for Free-Form Visual Concepts in the Real World
- Neural networks for abstraction and reasoning: Towards broad generalization in machines
- Using Program Synthesis and Inductive Logic Programming to solve Bongard Problems, which uses DreamCoder
- D5, which poses an analogous task for text: describing the difference between two corpora in natural language
Resources on Bongard problems: