Construction and Analysis of Cross-Modal General Feature Space Driven by Prior Information

SUN Jing; SU Jian-bo

doi:10.12263/DZXB.20250022

Academic paper | Updated: 2025-12-27

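- Construction and Analysis of Cross-Modal General Feature Space Driven by Prior Information
- Acta Electronica Sinica, 2025, Vol. 53, No. 8, pp. 2614-2623
- Affiliation:
  School of Automation and Perception, Shanghai Jiao Tong University, Shanghai 200240
- About the authors:
  SUN Jing, female, born in September 2000 in Hebei Province. Master's student in the Department of Automation, Shanghai Jiao Tong University. Her main research interest is pattern recognition. E-mail: sunjing4231@sjtu.edu.cn
  SU Jian-bo, male, born in November 1969 in Jiangsu Province. Professor in the Department of Automation, Shanghai Jiao Tong University. His main research interests include machine vision, machine learning and human-computer interaction, multi-sensor information fusion, and intelligent robots. E-mail: jbsu@sjtu.edu.cn
- Funding:
  Special Project of the Ministry of Industry and Information Technology (0747-2461SCCZA302)
- DOI: 10.12263/DZXB.20250022
  CLC number: TP391.4
- Received: 2025-01-06; Accepted: 2025-05-02; Published in print: 2025-08-25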
SUN Jing, SU Jian-bo. Construction and Analysis of Cross-Modal General Feature Space Driven by Prior Information[J]. Acta Electronica Sinica, 2025, 53(08): 2614-2623. DOI: 10.12263/DZXB.20250022

Abstract

Face recognition and voiceprint recognition are two core biometric technologies for identity verification and are widely applied across many scenarios. However, research on the correlation between the features of these two modalities remains relatively limited. This study explores the commonality between voice and facial features. Unlike existing studies, which seek solutions directly from the way feature correspondence is realized, this work starts from the characteristics of identity features and actively constructs a general feature space from an accurate representation of identity information. The distance relationships among identity features in the face recognition task are introduced as prior information, so that identity-related relationships are preserved on top of the feature correspondence method. During voiceprint feature extraction, parameters pre-trained on speech recognition tasks are adjusted so that the model better represents identity information. Experimental results show that, under the same feature correspondence method, using a speech Transformer model as the voiceprint extractor yields a significant improvement on the verification task over a time-delay network. In addition, the method has low data requirements, needs no additionally trained classifier, and achieves performance on the verification task close to that of existing methods. Future work could further incorporate prior knowledge of voiceprint features to improve the performance of cross-modal feature matching.
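
The idea of using face-side distance relationships as prior information can be illustrated with a small sketch. The abstract does not give the authors' actual loss, so everything below is an assumption made for illustration: the function names, the mean-distance normalization, and the squared-error penalty are hypothetical choices in the spirit of relational distillation, not the method of the paper. The sketch measures how far the pairwise distance structure of voice embeddings deviates from that of the face embeddings of the same identities.

```python
import math
from itertools import combinations


def pairwise_distances(embeddings):
    """Euclidean distance for every unordered pair of embeddings."""
    return [math.dist(a, b) for a, b in combinations(embeddings, 2)]


def normalized(distances):
    """Divide by the mean distance so both modalities share one scale."""
    mean = sum(distances) / len(distances)
    return [d / mean for d in distances] if mean > 0 else list(distances)


def distance_prior_loss(face_embeddings, voice_embeddings):
    """Mean squared gap between normalized face and voice distance structures.

    Assumes row i of both inputs belongs to the same identity and that
    at least two identities are given (hypothetical interface).
    """
    face = normalized(pairwise_distances(face_embeddings))
    voice = normalized(pairwise_distances(voice_embeddings))
    return sum((f - v) ** 2 for f, v in zip(face, voice)) / len(face)


# Toy data (hypothetical): three identities with 2-D embeddings.
faces = [(0.0, 0.0), (1.0, 0.0), (0.0, 2.0)]
voices_scaled = [(0.0, 0.0), (3.0, 0.0), (0.0, 6.0)]    # same geometry, larger scale
voices_swapped = [(0.0, 0.0), (0.0, 6.0), (3.0, 0.0)]   # identities 2 and 3 swapped

loss_scaled = distance_prior_loss(faces, voices_scaled)
loss_swapped = distance_prior_loss(faces, voices_swapped)
```

In this toy setting the loss is essentially zero when the voice embeddings reproduce the face geometry up to scale, and clearly positive when two identities are swapped. In practice such a term would be added to a feature correspondence objective as a regularizer; how the paper combines the two is not stated in the abstract.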
