A Method for Private Video Action Recognition Based on Visual Audio Complementary and Semantic Clarity

LI Ze-chao; FU Xiao-de; PAN Li-yong; YAN Rui; TANG Jin-hui

doi:10.12263/DZXB.20230971

Academic paper | Updated: 2025-12-24

- A Method for Private Video Action Recognition Based on Visual Audio Complementary and Semantic Clarity
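- Acta Electronica Sinica, 2024, Vol. 52, No. 7, pp. 2170-2182
- Author affiliation:
  School of Computer Science and Engineering, Nanjing University of Science and Technology, Nanjing, Jiangsu 210094, China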
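- About the authors:
  LI Ze-chao, male, born in May 1985, a native of Kaifeng, Henan. Professor at the School of Computer Science and Engineering, Nanjing University of Science and Technology. His main research interest is large-scale multimedia analysis. Chinese Institute of Electronics member No. E190031283M. E-mail: zechao.li@njust.edu.cn
  FU Xiao-de, male, born in January 2000, a native of Jingzhou, Hubei. Master's student at Nanjing University of Science and Technology. His main research interests are object detection and action recognition. E-mail: fxd0122@njust.edu.cn
  PAN Li-yong, male, born in November 1997, a native of Ma'anshan, Anhui. Master's student at Nanjing University of Science and Technology. His main research interest is multimodal action recognition. E-mail: 120106010751@uestc.edu.cn
  YAN Rui, male, born in January 1995, a native of Nanjing, Jiangsu. Currently an assistant researcher at Nanjing University. His main research interest is video understanding. E-mail: ruiyan@nju.edu.cn
  TANG Jin-hui, male, born in February 1981, a native of Danyang, Jiangsu. Professor at the School of Computer Science and Engineering, Nanjing University of Science and Technology. His main research interests include multimedia analysis and computer vision. Chinese Institute of Electronics member No. E190031289M. E-mail: jinhuitang@njust.edu.cn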
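- Funding:
  National Natural Science Foundation of China (U20B2064; U21B2043); Science and Technology Innovation 2030 "New Generation Artificial Intelligence" Major Project (2022ZD0118802)
- DOI: 10.12263/DZXB.20230971
  CLC number: TP39
- Received: 2023-10-19; Revised: 2024-05-29; Published in print: 2024-07-25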
Cite as: LI Ze-chao, FU Xiao-de, PAN Li-yong, et al. A Method for Private Video Action Recognition Based on Visual Audio Complementary and Semantic Clarity[J]. Acta Electronica Sinica, 2024, 52(07): 2170-2182. DOI: 10.12263/DZXB.20230971

Abstract

Video privacy protection is an important challenge facing society today, and blurring videos is a key means of protecting people's privacy. Because blurred videos inherently lack visual information, mainstream video action recognition algorithms do not achieve satisfactory results on them. A blurred video, however, is a multimodal medium: besides visual information it carries rich audio information, and from the perspective of human cognition audio is also an important source of information. This paper proposes a privacy video action recognition method based on multimodal fusion that recognizes human actions without infringing on users' privacy. Specifically, an audio-visual feature fusion module integrates audio feature maps into the visual modality, fusing the deep semantic information of the two modalities. In addition, the model introduces clear video frames as labels that supervise the parameter updates of the action recognition network during training, providing the privacy video action recognition network with clear semantic information. Extensive ablation and comparison experiments on multiple privacy behavior datasets verify the effectiveness of the proposed method.
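
The abstract does not specify the architecture of the fusion module or the exact form of the clear-frame supervision, so the following is only a minimal sketch under stated assumptions, not the authors' implementation. It assumes, hypothetically, that fusion is a channel-wise sigmoid gate computed from an audio feature vector and applied residually to a visual feature map, and that clear-frame supervision is an auxiliary mean-squared-error term added to the classification loss with weight `lam`. All function names, shapes, and parameters here are illustrative.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def fuse_audio_into_visual(visual, audio, weights):
    """Gate each visual channel with a scalar derived from the audio vector.

    visual:  list of C channels, each a flat list of spatial activations
    audio:   list of D audio features
    weights: C x D projection (hypothetical learned parameters)
    """
    fused = []
    for channel, w_row in zip(visual, weights):
        gate = sigmoid(sum(w * a for w, a in zip(w_row, audio)))
        # Residual form (an assumption): keep the visual signal and add an
        # audio-modulated copy of it.
        fused.append([v + gate * v for v in channel])
    return fused


def cross_entropy(logits, label):
    """Numerically stable softmax cross-entropy for one sample."""
    m = max(logits)
    log_sum = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum - logits[label]


def mse(pred, target):
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)


def training_loss(logits, label, predicted_frame, clear_frame, lam=0.5):
    """Classification loss plus an assumed clear-frame auxiliary term.

    The clear frame is used only as a training target; at inference the
    network would see only the blurred video and its audio.
    """
    return cross_entropy(logits, label) + lam * mse(predicted_frame, clear_frame)


# Toy usage: 2 visual channels with 2 spatial positions, 2 audio features.
visual = [[1.0, 2.0], [3.0, 4.0]]
audio = [0.0, 0.0]  # zero audio input gives a gate of sigmoid(0) = 0.5
weights = [[0.1, 0.2], [0.3, 0.4]]
fused = fuse_audio_into_visual(visual, audio, weights)
loss = training_loss([0.0, 0.0], 0, [1.0, 2.0], [1.0, 4.0], lam=0.5)
```

In this sketch the audio branch can only rescale visual channels; a cross-attention or concatenation-based fusion would be an equally plausible reading of the abstract.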
