Research on 3D Multi-Branch Aggregated Lightweight Network Video Action Recognition Algorithm
HU Zheng-ping1,2, DIAO Peng-cheng1, ZHANG Rui-xue1, LI Shu-fang1, ZHAO Meng-yao1
1. School of Information Science and Engineering, Yanshan University, Qinhuangdao, Hebei 066004, China;
2. Hebei Key Laboratory of Information Transmission and Signal Processing, Yanshan University, Qinhuangdao, Hebei 066004, China
Abstract: To build a video action recognition model that runs at the speed of a 2D neural network while retaining the performance of a 3D neural network, a 3D multi-branch aggregation lightweight network action recognition algorithm is proposed. First, the network is divided into multiple branches using grouped convolution. Second, to promote information exchange between branches, a multiplexer module with an information aggregation function is added. Finally, an adaptive attention mechanism is introduced to redirect channel and spatio-temporal information. Experiments show that the algorithm reaches 96.2% accuracy on the UCF101 dataset and 74.7% accuracy on the HMDB51 dataset, at a computational cost of 11.5 GFLOPs in both cases. Compared with other action recognition algorithms, it improves the efficiency of the video recognition network and shows certain advantages in recognition speed and accuracy.
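The saving that comes from splitting a layer into branches with grouped convolution can be illustrated with a short calculation. The sketch below is not the authors' implementation: the helper names `conv3d_params` and `conv3d_macs`, the channel width of 256, the 3×3×3 kernel, the 8×14×14 output volume, and the choice of 32 groups are all assumptions picked for illustration. It only shows that a 3D convolution with g groups needs 1/g of the weights and multiply-accumulates of its dense counterpart.

```python
def conv3d_params(c_in, c_out, kernel, groups=1):
    """Weight count (bias excluded) of a 3D convolution split into `groups` branches."""
    if c_in % groups or c_out % groups:
        raise ValueError("channel counts must be divisible by groups")
    kt, kh, kw = kernel
    # Each output channel only sees c_in / groups input channels.
    return (c_in // groups) * c_out * kt * kh * kw


def conv3d_macs(c_in, c_out, kernel, out_shape, groups=1):
    """Multiply-accumulates for one forward pass over an output volume (T, H, W)."""
    t, h, w = out_shape
    return conv3d_params(c_in, c_out, kernel, groups) * t * h * w


# Hypothetical layer: 256 -> 256 channels, 3x3x3 kernel, 8x14x14 output volume.
kernel = (3, 3, 3)
out_shape = (8, 14, 14)

dense = conv3d_params(256, 256, kernel)
grouped = conv3d_params(256, 256, kernel, groups=32)
dense_macs = conv3d_macs(256, 256, kernel, out_shape)
grouped_macs = conv3d_macs(256, 256, kernel, out_shape, groups=32)

print(dense, grouped, dense // grouped)   # 1769472 55296 32
print(dense_macs // grouped_macs)         # 32
```

Because the branches produced this way do not share channels, a grouped layer on its own cannot mix information across branches, which is the gap the multiplexer module in the abstract is meant to address.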