Special Video Recognition Based on Semantic Embedding Learning

WU Xiao-yu; PU Yu-jiang; WANG Sheng-jin; LIU Zi-hao

doi:10.12263/DZXB.20220601

PAPERS | Updated: 2025-12-08

- ACTA ELECTRONICA SINICA Vol. 51, Issue 11, Pages: 3225-3237(2023)
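- Author affiliations:
  
  1. School of Information and Communication Engineering, Communication University of China, Beijing 100024
  2. State Key Laboratory of Media Convergence and Communication (Communication University of China), Beijing 100024
  3. Department of Electronic Engineering, Tsinghua University, Beijing 100084
- Funding:
  
  National Natural Science Foundation of China (61801441)
- DOI: 10.12263/DZXB.20220601
  CLC: TP391.4
- Received: 23 May 2022
  
  Revised: 16 August 2022
  
  Published: 25 November 2023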
WU Xiao-yu, PU Yu-jiang, WANG Sheng-jin, et al. Special Video Recognition Based on Semantic Embedding Learning[J]. ACTA ELECTRONICA SINICA, 2023, 51(11): 3225-3237. DOI: 10.12263/DZXB.20220601.

Abstract

The dissemination of violent videos, a special type of video, has become one of the hidden dangers facing the Internet environment, and intelligent recognition technology for such videos is of great significance for maintaining Internet content security. Because collection sources are diverse, the distribution of violent videos usually shows large intra-class variance and small inter-class variance, and common violence recognition models struggle to adapt to complex and variable violent scenarios. Meanwhile, the word "violence" itself has highly abstract semantics, which makes learning a generic semantic representation of violence from limited data a major difficulty. To address these problems, we present a novel multimodal violent video recognition model based on semantic embedding learning. The model consists of three main parts. (1) Multimodal feature extraction. Considering that videos have multimodal properties, we use three different deep neural networks to extract feature representations of three modalities, i.e., appearance, motion, and audio. (2) Multimodal feature fusion. To obtain a robust, general video representation, we design a lightweight multimodal feature fusion module, referred to as MEFM (Multimodal Efficient Fusion Module). The module includes two parts, common space mapping and multimodal feature interaction, which let the multimodal features interact fully while effectively suppressing interference between the information of different modalities. (3) Semantic embedding learning. To accommodate violence datasets with different data distributions, we propose a multi-task learning method based on semantic embedding. It builds a semantic center of violence by introducing a center loss and uses a cosine embedding loss to aggregate violent samples toward the center while dispersing non-violent samples, forming a semantically discriminative feature representation. This enhances the generalization ability of the model and reduces the interference of data noise. Experiments on three publicly available datasets, VSD2015, Violent Flows, and RWF-2000, demonstrate that the proposed violent video recognition model achieves competitive results, improving on existing methods by 4.79%, 0.81%, and 1.5%, respectively.
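The semantic embedding step described in part (3) can be illustrated with a minimal sketch. It is not the authors' implementation: it uses the standard textbook forms of the center loss (half squared Euclidean distance) and the cosine embedding loss, and it omits the classification loss that a multi-task setup would also include. The weight `lam`, the `margin`, the choice to apply the center term to violent samples only, and the toy 2-D features are all assumptions made for illustration.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def center_loss(feature, center):
    """Half squared Euclidean distance from a feature to the center."""
    return 0.5 * sum((f - c) ** 2 for f, c in zip(feature, center))

def cosine_embedding_loss(feature, center, is_violent, margin=0.0):
    """Violent samples are pulled toward the center (1 - cos);
    non-violent samples are penalized only if cos exceeds the margin."""
    sim = cosine(feature, center)
    if is_violent:
        return 1.0 - sim
    return max(0.0, sim - margin)

def semantic_embedding_loss(features, labels, center, lam=0.5, margin=0.0):
    """Batch-averaged sum of both terms; `lam` is a hypothetical weight."""
    total = 0.0
    for feature, is_violent in zip(features, labels):
        total += cosine_embedding_loss(feature, center, is_violent, margin)
        if is_violent:
            # Assumption: the center term applies to violent samples only.
            total += lam * center_loss(feature, center)
    return total / len(features)

# Toy example: one violent sample aligned with the center,
# one orthogonal non-violent sample, one partly aligned non-violent sample.
center = [1.0, 0.0]
features = [[2.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
labels = [True, False, False]
loss = semantic_embedding_loss(features, labels, center)
print(loss)
```

In this toy batch, the orthogonal non-violent sample contributes nothing, while the partly aligned one is penalized in proportion to its cosine similarity with the center, which is the dispersing effect the abstract describes. In a trained model the center would be a learned parameter rather than a fixed vector.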
