Large Vision-Language Models (LVLMs) have shown strong performance across multimodal tasks. However, they often produce hallucinations, i.e., text that is inconsistent with the visual input, because of their limited ability to verify information across different regions of the image. To address this, we propose Multi-Region Fusion Decoding (MRFD), a training-free decoding method that improves factual grounding by modeling inter-region consistency. MRFD identifies salient regions using cross-attention, generates an initial response for each, and computes reliability weights based on the Jensen-Shannon Divergence (JSD) among the responses. These weights then guide a consistency-aware fusion of per-region predictions, using region-aware prompts inspired by Chain-of-Thought reasoning. Experiments across multiple LVLMs and benchmarks show that MRFD significantly reduces hallucinations and improves response factuality without requiring any model updates.
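Below is a minimal sketch of the consistency-aware fusion step described above, assuming per-region next-token distributions have already been produced with region-aware prompts. The function names, the softmax-over-negative-JSD weighting, and the temperature `tau` are illustrative assumptions, not the paper's exact formulation.

```python
import numpy as np

def js_divergence(p, q, eps=1e-12):
    """Jensen-Shannon Divergence between two discrete distributions."""
    p = np.asarray(p, dtype=np.float64) + eps
    q = np.asarray(q, dtype=np.float64) + eps
    p, q = p / p.sum(), q / q.sum()
    m = 0.5 * (p + q)
    kl = lambda a, b: np.sum(a * np.log(a / b))
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

def fuse_region_predictions(region_probs, tau=1.0):
    """Fuse per-region next-token distributions with consistency-aware weights.

    region_probs: (R, V) array, one vocabulary distribution per region.
    A region that disagrees with the others (high average JSD) receives a
    lower reliability weight; weights are a softmax over -JSD / tau.
    """
    region_probs = np.asarray(region_probs, dtype=np.float64)
    R = region_probs.shape[0]
    # Average pairwise JSD of each region against all other regions.
    avg_jsd = np.array([
        np.mean([js_divergence(region_probs[i], region_probs[j])
                 for j in range(R) if j != i])
        for i in range(R)
    ])
    weights = np.exp(-avg_jsd / tau)
    weights /= weights.sum()
    fused = (weights[:, None] * region_probs).sum(axis=0)
    return fused / fused.sum(), weights

# Toy example: three regions over a 5-token vocabulary; the outlier
# region (last row) is down-weighted in the fused distribution.
probs = np.array([
    [0.70, 0.10, 0.10, 0.05, 0.05],
    [0.65, 0.15, 0.10, 0.05, 0.05],
    [0.05, 0.05, 0.10, 0.10, 0.70],
])
fused, w = fuse_region_predictions(probs)
print("weights:", np.round(w, 3))
print("fused:", np.round(fused, 3))
```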
@article{ge2025mrfd,
title={{MRFD}: Multi-Region Fusion Decoding with Self-Consistency for Mitigating Hallucinations in {LVLMs}},
author={Ge, Haonan and Wang, Yiwei and Yang, Ming-Hsuan and Cai, Yujun},
journal={arXiv preprint arXiv:2508.10264},
year={2025}
}