Large Multimodal Models (LMMs) are prone to hallucinations, where their outputs are not grounded in the provided multimodal context, leading to unreliable or incorrect answers. This project aimed to reduce hallucinations in vision-language models by implementing two recent techniques: Fact-RLHF and feedback-guided self-revision. Following Fact-RLHF, we fine-tuned the vision-language model BLIP2 using supervised instruction tuning and DPO training. Additionally, we enhanced the revision process in feedback-guided self-revision by incorporating factual information, inspired by Fact-RLHF. The effectiveness of our methods was demonstrated through evaluations on the MMHal-Bench and POPE benchmarks.
- DPO training: Generated BLIP2 response pairs at different sampling temperatures, created preference labels with Qwen2-VL, and trained with `DPOTrainer` from the TRL library (sketched below).
- SFT training: Instruction-tuned BLIP2 for 1 epoch on samples from the LLaVA-Instruct dataset using `SFTTrainer` from TRL (sketched below).
- Self-revision: Performed iterative refinement of BLIP2 and LLaVA-OneVision responses for up to 3 iterations, conditioning each revision on the previous responses and factual information (sketched below).
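A rough sketch of the DPO stage is below. The `qwen_judge` call and the `image_question_pairs` iterable are placeholders (not part of this repo's code), the model checkpoint and TRL arguments are assumptions, and the BLIP2-specific multimodal handling inside `DPOTrainer` is simplified.

```python
from datasets import Dataset
from transformers import Blip2ForConditionalGeneration, Blip2Processor
from trl import DPOConfig, DPOTrainer

processor = Blip2Processor.from_pretrained("Salesforce/blip2-opt-2.7b")
model = Blip2ForConditionalGeneration.from_pretrained("Salesforce/blip2-opt-2.7b")

def sample_answer(image, question, temperature):
    # One BLIP2 answer sampled at the given temperature.
    inputs = processor(images=image, text=question, return_tensors="pt").to(model.device)
    out = model.generate(**inputs, do_sample=True, temperature=temperature, max_new_tokens=64)
    return processor.decode(out[0], skip_special_tokens=True)

pairs = []
for image, question in image_question_pairs:  # assumed iterable of (PIL image, question string)
    answer_a = sample_answer(image, question, temperature=0.2)
    answer_b = sample_answer(image, question, temperature=1.0)
    # qwen_judge is a placeholder for the Qwen2-VL preference call; assume it returns "A" or "B".
    if qwen_judge(image, question, answer_a, answer_b) == "A":
        chosen, rejected = answer_a, answer_b
    else:
        chosen, rejected = answer_b, answer_a
    pairs.append({"prompt": question, "chosen": chosen, "rejected": rejected})

trainer = DPOTrainer(
    model=model,
    args=DPOConfig(output_dir="blip2-dpo", per_device_train_batch_size=2),
    train_dataset=Dataset.from_list(pairs),
    processing_class=processor.tokenizer,  # older TRL releases take tokenizer= instead
)
trainer.train()
```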
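The SFT stage, roughly. The dataset id and record layout are assumptions based on the public LLaVA-Instruct release, and image conditioning is omitted from this text-only sketch.

```python
from datasets import load_dataset
from transformers import Blip2ForConditionalGeneration, Blip2Processor
from trl import SFTConfig, SFTTrainer

processor = Blip2Processor.from_pretrained("Salesforce/blip2-opt-2.7b")
model = Blip2ForConditionalGeneration.from_pretrained("Salesforce/blip2-opt-2.7b")

# Assumed record layout: LLaVA-Instruct stores turns as
# [{"from": "human", "value": ...}, {"from": "gpt", "value": ...}, ...].
raw = load_dataset("liuhaotian/LLaVA-Instruct-150K", split="train")

def to_text(example):
    turns = example["conversations"]
    return {"text": f"Question: {turns[0]['value']}\nAnswer: {turns[1]['value']}"}

trainer = SFTTrainer(
    model=model,
    args=SFTConfig(output_dir="blip2-sft", num_train_epochs=1),
    train_dataset=raw.map(to_text),
    processing_class=processor.tokenizer,  # older TRL releases take tokenizer= instead
)
trainer.train()
```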
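The self-revision step is conceptually an iterated generate-critique-rewrite loop. In the sketch below, `generate` and `critique` are placeholder callables standing in for BLIP2 / LLaVA-OneVision inference and the feedback model, and `facts` stands for whatever factual information about the image is available.

```python
MAX_ITERATIONS = 3

def self_revise(image, question, facts, generate, critique):
    """Iteratively rewrite an answer using critic feedback and factual hints.

    generate(image, prompt) -> answer string (BLIP2 or LLaVA-OneVision inference).
    critique(image, question, answer) -> feedback string, or None if the answer looks grounded.
    """
    answer = generate(image, question)
    for _ in range(MAX_ITERATIONS):
        feedback = critique(image, question, answer)
        if feedback is None:  # critic found no ungrounded claims, stop early
            break
        revision_prompt = (
            f"Question: {question}\n"
            f"Previous answer: {answer}\n"
            f"Feedback: {feedback}\n"
            f"Facts about the image: {facts}\n"
            "Rewrite the answer so that every claim is supported by the image and the facts."
        )
        answer = generate(image, revision_prompt)
    return answer
```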
Evaluation performed on POPE and MMHal-Bench.
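Since POPE consists of yes/no object-presence questions, scoring reduces to binary classification metrics. The record format below (a gold `label` field plus the model's `answer`) is an assumption about how predictions are stored, not the project's actual evaluation script.

```python
def pope_scores(records):
    """Compute accuracy, precision, recall, and F1 for yes/no POPE predictions."""
    tp = fp = tn = fn = 0
    for r in records:
        pred_yes = r["answer"].strip().lower().startswith("yes")
        gold_yes = r["label"].strip().lower() == "yes"
        if pred_yes and gold_yes:
            tp += 1
        elif pred_yes and not gold_yes:
            fp += 1
        elif not pred_yes and gold_yes:
            fn += 1
        else:
            tn += 1
    total = tp + tn + fp + fn
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"accuracy": (tp + tn) / total, "precision": precision, "recall": recall, "f1": f1}
```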
See report.
- Sadman Sakib
- Danyal Maqbool
- Rishika Ahuja
- Muhammad Musa
- Apoorva Mittal