Multimodal Alignment

Large Multimodal Models (LMMs) are prone to hallucinations: outputs that are not grounded in the provided multimodal context and therefore unreliable or incorrect. This project aimed to reduce hallucinations in vision-language models by implementing two recent techniques, Fact-RLHF and feedback-guided self-revision. Following Fact-RLHF, we fine-tuned the vision-language model BLIP2 with supervised instruction tuning and DPO training. We also enhanced the feedback-guided self-revision process by incorporating factual information into each revision step, inspired by Fact-RLHF. The effectiveness of our methods was demonstrated through evaluations on the MMHal-Bench and POPE benchmarks.

Methods

  • DPO training: Generated BLIP2 response pairs at different sampling temperatures, created preference labels with Qwen2-VL, and trained with DPOTrainer from the TRL library (first sketch below).
  • SFT training: Instruction-tuned BLIP2 for 1 epoch on samples from the LLaVa-Instruct dataset using SFTTrainer from TRL (second sketch below).
  • Self-revision: Iteratively refined BLIP2 and LLaVa Onevision responses for up to 3 iterations, conditioning each revision on the previous response and factual information (third sketch below).
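
A minimal sketch of the DPO step, assuming a recent TRL release: BLIP2 samples two answers per question at different temperatures, the pair is labeled (the Qwen2-VL judge step is not shown), and DPOTrainer is run on the resulting prompt/chosen/rejected data. Model names, temperatures, hyperparameters, and the text-only stand-in for BLIP2's language model are illustrative, not the exact training setup.

```python
# Sketch of the DPO preference-pair pipeline: sample two BLIP2 answers at
# different temperatures, label the pair (judge step not shown), then train
# with TRL's DPOTrainer on prompt/chosen/rejected data.
import torch
from datasets import Dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          Blip2ForConditionalGeneration, Blip2Processor)
from trl import DPOConfig, DPOTrainer

processor = Blip2Processor.from_pretrained("Salesforce/blip2-opt-2.7b")
blip2 = Blip2ForConditionalGeneration.from_pretrained(
    "Salesforce/blip2-opt-2.7b", torch_dtype=torch.float16, device_map="auto"
)

def sample_pair(image, prompt):
    """Generate two candidate answers: a conservative one and a more diverse one."""
    inputs = processor(images=image, text=prompt, return_tensors="pt").to(blip2.device, torch.float16)
    conservative = blip2.generate(**inputs, do_sample=True, temperature=0.2, max_new_tokens=64)
    diverse = blip2.generate(**inputs, do_sample=True, temperature=1.0, max_new_tokens=64)
    return (processor.batch_decode(conservative, skip_special_tokens=True)[0].strip(),
            processor.batch_decode(diverse, skip_special_tokens=True)[0].strip())

# After Qwen2-VL picks the less hallucinated answer of each pair, the data is
# arranged in the prompt/chosen/rejected format that DPOTrainer expects.
pref_data = Dataset.from_dict({
    "prompt":   ["Question: What objects are on the table? Answer:"],
    "chosen":   ["A red mug."],
    "rejected": ["A red mug and a laptop."],  # contains a hallucinated object
})

# Text-only stand-in for BLIP2's language model; wiring the full multimodal
# model into DPOTrainer needs extra plumbing that is omitted here.
lm = AutoModelForCausalLM.from_pretrained("facebook/opt-2.7b")
tokenizer = AutoTokenizer.from_pretrained("facebook/opt-2.7b")

trainer = DPOTrainer(
    model=lm,
    args=DPOConfig(output_dir="blip2-dpo", per_device_train_batch_size=1, beta=0.1),
    train_dataset=pref_data,
    processing_class=tokenizer,
)
trainer.train()
```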
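
A corresponding sketch of the SFT step. The single training example stands in for LLaVa-Instruct samples, and the text-only OPT backbone stands in for the full BLIP2 stack; names and hyperparameters are illustrative.

```python
# Sketch of the 1-epoch instruction-tuning step with TRL's SFTTrainer.
from datasets import Dataset
from trl import SFTConfig, SFTTrainer

# Each example is a single instruction-following text; in practice these are
# built from LLaVa-Instruct conversations paired with the corresponding images.
sft_data = Dataset.from_dict({
    "text": [
        "Question: Describe the image in one sentence. "
        "Answer: A dog is sleeping on a grey couch.",
    ]
})

trainer = SFTTrainer(
    model="facebook/opt-2.7b",  # text-only stand-in for BLIP2's language model
    train_dataset=sft_data,
    args=SFTConfig(output_dir="blip2-sft", num_train_epochs=1, per_device_train_batch_size=1),
)
trainer.train()
```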
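
A sketch of the self-revision loop for BLIP2 (the LLaVa Onevision variant follows the same pattern). The prompt wording and the source of the factual information (e.g. ground-truth captions or detected objects) are assumptions for illustration.

```python
# Sketch of feedback-guided self-revision: BLIP2 answers, then the answer is
# revised up to 3 times with the previous response and factual information
# injected into the prompt.
import torch
from transformers import Blip2ForConditionalGeneration, Blip2Processor

processor = Blip2Processor.from_pretrained("Salesforce/blip2-opt-2.7b")
model = Blip2ForConditionalGeneration.from_pretrained(
    "Salesforce/blip2-opt-2.7b", torch_dtype=torch.float16, device_map="auto"
)

def generate(image, prompt, max_new_tokens=64):
    inputs = processor(images=image, text=prompt, return_tensors="pt").to(model.device, torch.float16)
    out = model.generate(**inputs, max_new_tokens=max_new_tokens)
    return processor.batch_decode(out, skip_special_tokens=True)[0].strip()

def self_revise(image, question, facts, num_rounds=3):
    """Iteratively refine the answer, feeding back the previous response and facts."""
    answer = generate(image, f"Question: {question} Answer:")
    for _ in range(num_rounds):
        revision_prompt = (
            f"Question: {question} "
            f"Previous answer: {answer} "
            f"Facts about the image: {facts} "
            "Revise the previous answer so it only mentions things supported by the facts. Answer:"
        )
        revised = generate(image, revision_prompt)
        if revised == answer:  # stop early if the answer has converged
            break
        answer = revised
    return answer
```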

Results

Evaluation was performed on the POPE and MMHal-Bench benchmarks.
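
POPE poses yes/no questions about object presence, so its metrics reduce to binary classification. A minimal sketch of computing them from collected model answers; the answer-parsing rule and variable names are illustrative.

```python
# POPE-style scoring: each question asks whether an object is in the image,
# the model answers yes/no, and accuracy / precision / recall / F1 are
# computed over the binary predictions.
def pope_scores(predictions, labels):
    """predictions/labels are equal-length lists of 'yes'/'no' strings."""
    pred = [p.strip().lower().startswith("yes") for p in predictions]
    gold = [l.strip().lower() == "yes" for l in labels]

    tp = sum(p and g for p, g in zip(pred, gold))
    fp = sum(p and not g for p, g in zip(pred, gold))
    fn = sum(not p and g for p, g in zip(pred, gold))

    accuracy = sum(p == g for p, g in zip(pred, gold)) / len(gold)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}
```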

See report.

Contributors