A Survey on Deep Predictive Learning Based on Unlabeled Videos

PAN Min-ting; WANG Yun-bo; ZHU Xiang-ming; GAO Si-yu; LONG Ming-sheng; YANG Xiao-kang

doi:10.12263/DZXB.20211209


CROSS-DISCIPLINARY INNOVATIONS OF MACHINE LEARNING | Updated: 2025-12-08

- ACTA ELECTRONICA SINICA Vol. 50, Issue 4, Pages: 869-886(2022)
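- Author affiliations:
  
  1. Institute of Artificial Intelligence and MoE Key Laboratory of Artificial Intelligence, Shanghai Jiao Tong University, Shanghai 201109, China
  2. School of Software, Tsinghua University, Beijing 100084, China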
- DOI: 10.12263/DZXB.20211209
  CLC: TP389.1
- Received: 01 September 2021, Revised: 17 February 2022, Published: 25 April 2022

PAN Min-ting, WANG Yun-bo, ZHU Xiang-ming, et al. A Survey on Deep Predictive Learning Based on Unlabeled Videos[J]. ACTA ELECTRONICA SINICA, 2022, 50(04): 869-886. DOI: 10.12263/DZXB.20211209.


Abstract

Deep predictive learning based on video data (hereinafter "deep predictive learning") is a research direction at the intersection of deep learning, computer vision, and reinforcement learning. It is a key component of intelligent prediction and decision-making systems in weather forecasting, autonomous driving, robotic visual control, and other scenarios, and has become an active research area in machine learning in recent years. Deep predictive learning follows the self-supervised learning paradigm: it derives supervisory signals from the unlabeled video data itself and learns the underlying spatiotemporal patterns. In this paper, we review existing deep learning research on video prediction in detail. First, we summarize the research scope and cross-disciplinary application fields of deep predictive learning. Second, we present the datasets and evaluation metrics commonly used in video prediction research. Third, we categorize and compare current mainstream deep predictive learning models from three perspectives: video prediction in observation space, video prediction in state space, and model-based visual decision-making. Finally, we discuss the open problems in deep predictive learning and outline future research trends.


