

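1. School of Software Engineering, Xi'an Jiaotong University, Xi'an, Shaanxi 710049, China
2. School of Cyber Science and Engineering, Xi'an Jiaotong University, Xi'an, Shaanxi 710049, China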
Received: 10 June 2025
Accepted: 1 December 2025
Published: 25 December 2025
LIU Shuai, CHEN Da, PAN Yi-heng, et al. Self-Alignment Multimodal LLMs: Mitigating Hallucinations via Instruction Factuality and Audio Assistance[J]. Acta Electronica Sinica, 2025, 53(12): 4560-4574. DOI:10.12263/DZXB.20250502
Reinforcement learning from human feedback (RLHF) aligns model outputs with human preferences and has been widely used to mitigate hallucination in multimodal large language models (MLLMs) in practical applications. Among RLHF approaches, direct preference optimization (DPO) avoids explicit reward modeling, which makes improving the reliability and usability of MLLMs more stable and efficient, and it has attracted broad attention from academia and industry. DPO training still faces several challenges, however: distribution shift in the training data and insufficient discrimination of instruction factuality during preference data construction can both exacerbate hallucination. In addition, existing methods underuse the audio that accompanies multi-image data such as video, even though audio is an effective complementary signal for visual understanding and has the potential to alleviate hallucination.

To address these problems, this paper proposes a self-alignment training framework based on instruction factuality assessment and audio assistance (Instruction Factuality and Audio Assistance, IFAA). The framework generates high-quality preference data through four core modules to suppress hallucination in MLLMs: (1) style-consistent response sampling, which reduces data distribution shift in DPO training; (2) a long-response segmentation strategy, which improves the accuracy of the model's self-judgment; (3) an instruction factuality assessment module, which constructs preference data with a stronger factual basis; and (4) an audio-aided understanding module, which improves preference data quality by fusing audio information. DPO training on the resulting data then improves the model's reliability. The paper also introduces a confidence balance point selection mechanism based on the receiver operating characteristic (ROC) curve to mitigate overconfidence in MLLMs.

Experiments on five mainstream MLLM evaluation benchmarks verify the effectiveness and generalization ability of the framework. For the Large Language and Vision Assistant (LLaVA) 1.5 model, optimization with IFAA reduces the sentence-level hallucination rate on the Object Hallucination Benchmark (Object HalBench) by 43.1% and the instance-level hallucination rate by 37.3%. Transfer experiments on other recent models further show that preference data constructed with IFAA generalizes well and significantly reduces the hallucination rates of different models. These results confirm the applicability of the framework across models and provide a new, effective approach to hallucination mitigation in MLLMs.
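The abstract does not say how the ROC-based confidence balance point is computed. As a rough illustration only, the sketch below assumes one common criterion, Youden's J statistic (the threshold that maximizes TPR minus FPR), which may differ from the authors' actual mechanism. The function names and the sample confidences and labels are hypothetical and are not taken from the paper.

```python
from typing import List, Tuple


def roc_points(scores: List[float], labels: List[int]) -> List[Tuple[float, float, float]]:
    """Return (threshold, FPR, TPR) for each distinct score used as a threshold.

    A response is predicted "factual" when its confidence score >= threshold.
    """
    positives = sum(labels)
    negatives = len(labels) - positives
    if positives == 0 or negatives == 0:
        raise ValueError("need at least one positive and one negative label")
    points = []
    for threshold in sorted(set(scores), reverse=True):
        tp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 1)
        fp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 0)
        points.append((threshold, fp / negatives, tp / positives))
    return points


def balance_threshold(scores: List[float], labels: List[int]) -> float:
    """Pick the threshold maximizing Youden's J = TPR - FPR.

    ASSUMPTION: Youden's J stands in for the paper's unspecified
    "confidence balance point" criterion.
    """
    best = max(roc_points(scores, labels), key=lambda p: p[2] - p[1])
    return best[0]


# Hypothetical self-assessed confidences and labels (1 = factual, 0 = hallucinated).
confidences = [0.95, 0.90, 0.85, 0.80, 0.70, 0.60, 0.40, 0.30]
is_factual = [1, 1, 0, 1, 1, 0, 0, 0]
threshold = balance_threshold(confidences, is_factual)
print(threshold)
```

Under this assumed criterion, responses with confidence at or above the selected threshold would be treated as factual when building preference pairs. A model that is systematically overconfident would push the selected threshold upward, since high scores alone would no longer separate factual from hallucinated responses.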