An Improved VLAD Coding Method Based on Fusion Feature in Action Recognition

LUO Hui-lan, WANG Chan-juan

ACTA ELECTRONICA SINICA ›› 2019, Vol. 47 ›› Issue (1) : 49-58. DOI: 10.3969/j.issn.0372-2112.2019.01.007

Abstract

A novel coding method, IVLAD (Improved Vector of Locally Aggregated Descriptors), based on feature fusion is proposed in this paper and achieves good performance in action recognition. To address the problem that a single feature descriptor cannot express spatial information well, location information is mapped into the feature space and jointly encoded to obtain the video representation vector. To overcome the deficiency of traditional VLAD methods, which consider only the distances between features and cluster centers, the distance between each cluster and its most similar feature is also used in the coding stage. Finally, concatenating the video representation vector with itself is proposed to raise the vector dimension and further improve recognition accuracy. In addition, the influences of the visual dictionary size, the location dictionary size, and the normalization method on recognition accuracy are studied. Experimental results on two large datasets, UCF101 and HMDB51, show that the proposed method outperforms the traditional VLAD method.
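To make the encoding concrete, below is a minimal sketch of the pipeline the abstract describes; it is not the authors' implementation. Standard VLAD assigns each local descriptor x to its nearest visual word c_k and accumulates the residuals v_k = Σ_{x : NN(x)=c_k} (x − c_k). The sketch layers the abstract's three modifications on top of that baseline; the fusion weight `alpha`, the function names, the normalization order, and the use of scikit-learn's k-means are all assumptions made for illustration.

```python
# A minimal sketch of the coding pipeline described in the abstract, NOT the
# authors' implementation: the fusion weight `alpha`, the function names, and
# the use of scikit-learn k-means are illustrative assumptions.
import numpy as np
from sklearn.cluster import KMeans

def vlad_encode(descriptors, centers):
    """Standard VLAD: accumulate residuals between each descriptor and its
    nearest cluster center; returns the (k, d) residual matrix and the
    full descriptor-to-center distance matrix."""
    dists = np.linalg.norm(descriptors[:, None, :] - centers[None, :, :], axis=2)
    assign = dists.argmin(axis=1)                 # nearest center per descriptor
    v = np.zeros_like(centers)
    for i, a in enumerate(assign):
        v[a] += descriptors[i] - centers[a]
    return v, dists

def ivlad_encode(descriptors, locations, k_visual=256, alpha=0.5):
    """Sketch of the improved coding from the abstract:
    (1) fuse location information into the feature space,
    (2) additionally aggregate, per cluster, the residual of the single
        descriptor most similar to that cluster,
    (3) self-concatenate the final vector to raise its dimension."""
    # (1) location fusion: append the weighted spatial coordinates
    fused = np.hstack([descriptors, alpha * locations])
    # the visual dictionary; in practice it is learned once on training data
    centers = KMeans(n_clusters=k_visual, n_init=4).fit(fused).cluster_centers_
    v, dists = vlad_encode(fused, centers)
    # (2) per-cluster residual of the most similar descriptor
    nearest = dists.argmin(axis=0)                # best descriptor per center
    v += fused[nearest] - centers
    # intra-normalization, flatten, then global L2 normalization
    v /= np.linalg.norm(v, axis=1, keepdims=True) + 1e-12
    v = v.ravel()
    v /= np.linalg.norm(v) + 1e-12
    # (3) self-concatenation doubles the dimension
    return np.concatenate([v, v])
```

Note that the visual dictionary is fit inside the function here only to keep the sketch self-contained; in practice it would be learned once from training descriptors and reused when encoding each video.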

Key words

action recognition / position information / concatenation / representation vector

Cite this article

LUO Hui-lan, WANG Chan-juan. An Improved VLAD Coding Method Based on Fusion Feature in Action Recognition[J]. Acta Electronica Sinica, 2019, 47(1): 49-58. https://doi.org/10.3969/j.issn.0372-2112.2019.01.007


Funding

National Natural Science Foundation of China (No. 61862031, No. 61462035); Natural Science Foundation of Jiangxi Province, Study on Self-Deep Learning Model of Learning Visual Representation (No. 20171BAB202014)