Improved ConvMixer and Focal Loss with Dynamic Weight for Audio-Visual Emotion Recognition

SHI Shuo; QIN Jia-jun; YU Yang; HAO Xiao-ke

doi:10.12263/DZXB.20221042


PAPERS | Updated: 2025-12-08

- ACTA ELECTRONICA SINICA Vol. 52, Issue 8, Pages: 2824-2835(2024)
- Author affiliation:
  
  School of Artificial Intelligence and Data Science, Hebei University of Technology, Tianjin 300401, China
- Funding:
  
  National Natural Science Foundation of China (61806071; 62102129); Natural Science Foundation of Hebei Province (F2020202025; F2021202030)
- DOI: 10.12263/DZXB.20221042
  CLC: TP391.4
- Received: 09 September 2022
  
  Revised: 05 May 2023
  
  Published: 25 August 2024
SHI Shuo, QIN Jia-jun, YU Yang, et al. Improved ConvMixer and Focal Loss with Dynamic Weight for Audio-Visual Emotion Recognition[J]. Acta Electronica Sinica, 2024, 52(08): 2824-2835. DOI: 10.12263/DZXB.20221042

Abstract

Audio-visual bimodal emotion recognition is a research hotspot in affective computing. Current emotion recognition methods cannot extract local and global video features simultaneously, fuse multimodal data only in simple ways, and use loss functions that cannot focus on misclassified samples during model optimization, all of which leads to low recognition accuracy. This paper proposes an audio-visual emotion recognition method based on an improved ConvMixer and a focal loss function with dynamic weights. Spatial and temporal adjacency matrices replace the depthwise separable convolution in ConvMixer to extract global and local features in the spatial and temporal domains of the video. A cross-modal temporal attention module with a symmetric structure captures the temporal correlation between modalities and improves feature fusion. The focal loss with dynamic weights is computed from the confusion matrix, so that the contribution of misclassified samples to the loss is increased differentially when optimizing the model parameters. Experimental results on public datasets show that the proposed method extracts representative features, effectively optimizes the network structure, and improves emotion recognition accuracy.
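The abstract does not give the formula for the dynamically weighted focal loss, so the following is only a minimal sketch of the general idea under stated assumptions. It assumes the per-class weight is one plus that class's misclassification rate read from a running confusion matrix, and that this weight scales the standard focal loss term. The function names, the weighting rule, and the `gamma` default are illustrative choices and are not taken from the paper.

```python
import math

def class_weights_from_confusion(confusion, base=1.0):
    """Assumed rule: weight_k = base + misclassification rate of true class k.

    confusion[i][j] counts samples of true class i predicted as class j.
    """
    weights = []
    for k, row in enumerate(confusion):
        total = sum(row)
        error_rate = 0.0 if total == 0 else 1.0 - row[k] / total
        weights.append(base + error_rate)
    return weights

def update_confusion(confusion, probs, labels):
    """Add one batch of predictions (argmax of probs) to the confusion matrix."""
    for p, y in zip(probs, labels):
        pred = max(range(len(p)), key=p.__getitem__)
        confusion[y][pred] += 1
    return confusion

def dynamic_focal_loss(probs, labels, weights, gamma=2.0, eps=1e-12):
    """Batch mean of -w[y] * (1 - p_y)^gamma * log(p_y).

    probs: per-sample class probabilities (e.g. softmax outputs).
    labels: integer true-class indices.
    """
    losses = []
    for p, y in zip(probs, labels):
        p_y = min(max(p[y], eps), 1.0)  # clamp to avoid log(0)
        losses.append(-weights[y] * (1.0 - p_y) ** gamma * math.log(p_y))
    return sum(losses) / len(losses)

# Hypothetical two-class example: class 1 is confused more often than class 0,
# so it receives the larger weight.
confusion = [[8, 2], [5, 5]]
weights = class_weights_from_confusion(confusion)
loss = dynamic_focal_loss([[0.7, 0.3], [0.6, 0.4]], [0, 1], weights)
```

In a training loop under these assumptions, `update_confusion` would be called after each batch or epoch and the weights recomputed, so that classes the model currently confuses contribute more to the loss. With `gamma=0` and all weights equal to 1 the expression reduces to ordinary cross-entropy.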


