Micro-Motion Excitation and Time Perception for Lip Reading

MA Jin-lin; LÜ Xin; MA Zi-ping; GUO Zhao-wei; LÜ Ke

doi:10.12263/DZXB.20230888


PAPERS | Updated: 2025-12-11

- ACTA ELECTRONICA SINICA Vol. 52, Issue 11, Pages: 3657-3668(2024)
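- Author affiliations:
  1. School of Computer Science and Engineering, North Minzu University, Yinchuan, Ningxia 750021
  2. School of Mathematics and Information Science, North Minzu University, Yinchuan, Ningxia 750021
  3. School of Computer and Communication Engineering, University of Chinese Academy of Sciences, Beijing 100049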
- Funding: Natural Science Foundation of Ningxia (2023AAC03264); Basic Scientific Research in Central Universities of North Minzu University (2021KJCX09)
- DOI: 10.12263/DZXB.20230888
  CLC: TP391
- Received: 22 September 2023; Revised: 3 June 2024; Published: 25 November 2024
MA Jin-lin, LÜ Xin, MA Zi-ping, et al. Micro-Motion Excitation and Time Perception for Lip Reading[J]. Acta Electronica Sinica, 2024, 52(11): 3657-3668. DOI: 10.12263/DZXB.20230888 (Chinese title: 微运动激励与时间感知的唇语识别方法)

Abstract

Temporal information and subtle lip changes are crucial for lip reading. However, existing lip-reading methods cannot accurately capture temporal information or attend to subtle movements. In response, we propose DMT-GhostNet, a lip-reading method that attends to minor lip variations and enhances temporal information. First, we introduce the decoupled spatio-temporal enhancement block (DSTE), which decouples a single 3D convolution into the temporal domain and the spatial domain. Second, based on motion excitation (ME) and the Ghost bottleneck block, we propose the micro-motion bottleneck (M-Ghost) to capture subtle lip motions. Finally, we propose a time perception module, the transformer multi-scale temporal convolution network (TransMS-TCN), which focuses on important temporal sequences and restricts irrelevant information from flowing into the MS-TCN. Experimental results show that DMT-GhostNet achieves an accuracy of 89.21% on the LRW dataset, an improvement of 3.91% over mainstream ResNet-based methods, while reducing the parameter count by nearly 6 M. These results indicate that DMT-GhostNet makes better use of temporal information and focuses on lip details, significantly improving lip-reading performance.
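As a rough illustration of two ideas named in the abstract, the sketch below counts the weights saved when a single 3D convolution is split into a spatial and a temporal convolution, and computes a toy frame-difference gate in the spirit of motion excitation. It is not the authors' DSTE or M-Ghost implementation: the function names, the channel and kernel sizes, and the sigmoid gating of the mean absolute frame difference are all assumptions chosen for the example.

```python
import math


def conv3d_params(c_in, c_out, kt, kh, kw):
    """Weight count of a full 3D convolution (bias omitted)."""
    return c_in * c_out * kt * kh * kw


def decoupled_params(c_in, c_mid, c_out, kt, kh, kw):
    """Weight count when the 3D kernel is split into a spatial
    1 x kh x kw convolution followed by a temporal kt x 1 x 1 convolution.
    c_mid is a hypothetical intermediate channel width."""
    spatial = c_in * c_mid * kh * kw
    temporal = c_mid * c_out * kt
    return spatial + temporal


def motion_gate(frames):
    """Toy motion gate: one value per frame, computed as the sigmoid of the
    mean absolute difference to the next frame. Each frame is a flat list
    of feature values; this simplification is an assumption, not the paper's ME."""
    gates = []
    for t in range(len(frames) - 1):
        current, following = frames[t], frames[t + 1]
        diff = sum(abs(b - a) for a, b in zip(current, following)) / len(current)
        gates.append(1.0 / (1.0 + math.exp(-diff)))
    # The last frame has no successor, so its difference is taken as zero.
    gates.append(0.5)
    return gates


if __name__ == "__main__":
    print(conv3d_params(64, 64, 3, 3, 3))
    print(decoupled_params(64, 64, 64, 3, 3, 3))
    print(motion_gate([[0.0, 0.0], [0.0, 0.0], [2.0, 2.0]]))
```

With the illustrative 64-channel, 3 x 3 x 3 setting, the decoupled form needs fewer than half the weights of the full 3D kernel, and the gate is larger for the frame that precedes a change than for a static frame. The saving depends on the channel widths: with very few input channels the split can cost more weights than the full kernel.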
