Multimodal Alignment

Large Multimodal Models (LMMs) are prone to hallucinations: outputs that are not grounded in the provided multimodal context and therefore unreliable or incorrect. This project aimed to reduce hallucinations in vision-language models by implementing two recent techniques, Fact-RLHF and feedback-guided self-revision. Following Fact-RLHF, we fine-tuned the vision-language model BLIP2 with supervised instruction tuning and DPO training. We also enhanced the feedback-guided self-revision process by incorporating factual information into each revision step, inspired by Fact-RLHF. The effectiveness of our methods was demonstrated through evaluations on the MMHal-Bench and POPE benchmarks.

Methods

  • DPO training: Generated BLIP2 response pairs at different sampling temperatures, created preference labels with Qwen2-VL, and trained with DPOTrainer from the TRL library (first sketch below).
  • SFT training: Instruction-tuned BLIP2 for 1 epoch on samples from the LLaVa-Instruct dataset using SFTTrainer from TRL (second sketch below).
  • Self-revision: Iteratively refined BLIP2 and LLaVa Onevision responses for up to 3 iterations, conditioning each revision on the previous response and factual information (third sketch below).
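
A minimal sketch of the DPO step, assuming a recent TRL release: BLIP2 samples two answers per question at different temperatures, the pair is labeled (the Qwen2-VL judge step is not shown), and DPOTrainer is run on the resulting prompt/chosen/rejected data. Model names, temperatures, hyperparameters, and the text-only stand-in for BLIP2's language model are illustrative, not the exact training setup.

```python
# Sketch of the DPO preference-pair pipeline: sample two BLIP2 answers at
# different temperatures, label the pair (judge step not shown), then train
# with TRL's DPOTrainer on prompt/chosen/rejected data.
import torch
from datasets import Dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          Blip2ForConditionalGeneration, Blip2Processor)
from trl import DPOConfig, DPOTrainer

processor = Blip2Processor.from_pretrained("Salesforce/blip2-opt-2.7b")
blip2 = Blip2ForConditionalGeneration.from_pretrained(
    "Salesforce/blip2-opt-2.7b", torch_dtype=torch.float16, device_map="auto"
)

def sample_pair(image, prompt):
    """Generate two candidate answers: a conservative one and a more diverse one."""
    inputs = processor(images=image, text=prompt, return_tensors="pt").to(blip2.device, torch.float16)
    conservative = blip2.generate(**inputs, do_sample=True, temperature=0.2, max_new_tokens=64)
    diverse = blip2.generate(**inputs, do_sample=True, temperature=1.0, max_new_tokens=64)
    return (processor.batch_decode(conservative, skip_special_tokens=True)[0].strip(),
            processor.batch_decode(diverse, skip_special_tokens=True)[0].strip())

# After Qwen2-VL picks the less hallucinated answer of each pair, the data is
# arranged in the prompt/chosen/rejected format that DPOTrainer expects.
pref_data = Dataset.from_dict({
    "prompt":   ["Question: What objects are on the table? Answer:"],
    "chosen":   ["A red mug."],
    "rejected": ["A red mug and a laptop."],  # contains a hallucinated object
})

# Text-only stand-in for BLIP2's language model; wiring the full multimodal
# model into DPOTrainer needs extra plumbing that is omitted here.
lm = AutoModelForCausalLM.from_pretrained("facebook/opt-2.7b")
tokenizer = AutoTokenizer.from_pretrained("facebook/opt-2.7b")

trainer = DPOTrainer(
    model=lm,
    args=DPOConfig(output_dir="blip2-dpo", per_device_train_batch_size=1, beta=0.1),
    train_dataset=pref_data,
    processing_class=tokenizer,
)
trainer.train()
```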
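
A corresponding sketch of the SFT step. The single training example stands in for LLaVa-Instruct samples, and the text-only OPT backbone stands in for the full BLIP2 stack; names and hyperparameters are illustrative.

```python
# Sketch of the 1-epoch instruction-tuning step with TRL's SFTTrainer.
from datasets import Dataset
from trl import SFTConfig, SFTTrainer

# Each example is a single instruction-following text; in practice these are
# built from LLaVa-Instruct conversations paired with the corresponding images.
sft_data = Dataset.from_dict({
    "text": [
        "Question: Describe the image in one sentence. "
        "Answer: A dog is sleeping on a grey couch.",
    ]
})

trainer = SFTTrainer(
    model="facebook/opt-2.7b",  # text-only stand-in for BLIP2's language model
    train_dataset=sft_data,
    args=SFTConfig(output_dir="blip2-sft", num_train_epochs=1, per_device_train_batch_size=1),
)
trainer.train()
```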
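
A sketch of the self-revision loop for BLIP2 (the LLaVa Onevision variant follows the same pattern). The prompt wording and the source of the factual information (e.g. ground-truth captions or detected objects) are assumptions for illustration.

```python
# Sketch of feedback-guided self-revision: BLIP2 answers, then the answer is
# revised up to 3 times with the previous response and factual information
# injected into the prompt.
import torch
from transformers import Blip2ForConditionalGeneration, Blip2Processor

processor = Blip2Processor.from_pretrained("Salesforce/blip2-opt-2.7b")
model = Blip2ForConditionalGeneration.from_pretrained(
    "Salesforce/blip2-opt-2.7b", torch_dtype=torch.float16, device_map="auto"
)

def generate(image, prompt, max_new_tokens=64):
    inputs = processor(images=image, text=prompt, return_tensors="pt").to(model.device, torch.float16)
    out = model.generate(**inputs, max_new_tokens=max_new_tokens)
    return processor.batch_decode(out, skip_special_tokens=True)[0].strip()

def self_revise(image, question, facts, num_rounds=3):
    """Iteratively refine the answer, feeding back the previous response and facts."""
    answer = generate(image, f"Question: {question} Answer:")
    for _ in range(num_rounds):
        revision_prompt = (
            f"Question: {question} "
            f"Previous answer: {answer} "
            f"Facts about the image: {facts} "
            "Revise the previous answer so it only mentions things supported by the facts. Answer:"
        )
        revised = generate(image, revision_prompt)
        if revised == answer:  # stop early if the answer has converged
            break
        answer = revised
    return answer
```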

Results

Evaluation was performed on the POPE and MMHal-Bench benchmarks.
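
POPE poses yes/no questions about object presence, so its metrics reduce to binary classification. A minimal sketch of computing them from collected model answers; the answer-parsing rule and variable names are illustrative.

```python
# POPE-style scoring: each question asks whether an object is in the image,
# the model answers yes/no, and accuracy / precision / recall / F1 are
# computed over the binary predictions.
def pope_scores(predictions, labels):
    """predictions/labels are equal-length lists of 'yes'/'no' strings."""
    pred = [p.strip().lower().startswith("yes") for p in predictions]
    gold = [l.strip().lower() == "yes" for l in labels]

    tp = sum(p and g for p, g in zip(pred, gold))
    fp = sum(p and not g for p, g in zip(pred, gold))
    fn = sum(not p and g for p, g in zip(pred, gold))

    accuracy = sum(p == g for p, g in zip(pred, gold)) / len(gold)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}
```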

See report.

Contributors