Acta Electronica Sinica ›› 2022, Vol. 50 ›› Issue (4): 869-886. DOI: 10.12263/DZXB.20211209
Special topics: Machine Learning Cross-Disciplinary Innovation; Long-Abstract Papers
PAN Min-ting1, WANG Yun-bo1, ZHU Xiang-ming1, GAO Si-yu1, LONG Ming-sheng2, YANG Xiao-kang1
Received: 2021-09-01
Revised: 2022-02-17
Online: 2022-04-25
Published: 2022-04-25
Abstract: Deep predictive learning from video data (hereafter "deep predictive learning") sits at the intersection of deep learning, computer vision, and reinforcement learning. It is a key component of intelligent prediction and decision-making systems in scenarios such as weather forecasting, autonomous driving, and robotic visual control, and has become an active research area in machine learning in recent years. Deep predictive learning follows the self-supervised learning paradigm: it mines supervision signals from unlabeled video data and learns representations of the latent spatiotemporal patterns. This paper surveys existing research on deep-learning-based video prediction in detail. First, it delineates the research scope of deep predictive learning and its cross-disciplinary applications. Second, it summarizes the datasets and evaluation metrics commonly used in video prediction research. It then categorizes and compares mainstream deep predictive learning models from three perspectives: video prediction in the observation space, video prediction in state spaces, and model-based visual decision making. Finally, it analyzes open problems in deep predictive learning and discusses future research trends.
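The self-supervised paradigm described in the abstract can be made concrete with a small sketch: the only "labels" are future frames sliced from the unlabeled video itself. The function name and tensor shapes below are illustrative choices, not taken from any surveyed model.

```python
import numpy as np

def make_selfsup_pairs(video, context=4):
    """Slice an unlabeled video of shape (T, H, W) into (past frames,
    next frame) training pairs: the future frame itself is the label."""
    inputs, targets = [], []
    for t in range(video.shape[0] - context):
        inputs.append(video[t:t + context])   # observation window
        targets.append(video[t + context])    # self-supervision signal
    return np.stack(inputs), np.stack(targets)

# 10 unlabeled 8x8 frames yield 6 supervised training pairs
video = np.random.rand(10, 8, 8)
X, y = make_selfsup_pairs(video, context=4)
```

A predictive model is then trained to map `X` to `y`, e.g. with an L2 loss; no human annotation enters the loop.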
PAN Min-ting, WANG Yun-bo, ZHU Xiang-ming, GAO Si-yu, LONG Ming-sheng, YANG Xiao-kang. A Survey on Deep Predictive Learning Based on Unlabeled Videos[J]. Acta Electronica Sinica, 2022, 50(4): 869-886.
Category | Method | Year | Architecture | Loss | Datasets | Metrics
---|---|---|---|---|---|---
Feed-forward CNN-based prediction models | BeyondMSE | 2016 | Multi-scale CNN | L1, Adv, GDL | Sports1M, UCF101 | PSNR, SSIM
 | Vukotic et al. | 2016 | ED | L2 | KTH | MSE
 | DFN | 2016 | ED | Cross-entropy | Moving MNIST | Binary cross-entropy
 | Jin et al. | 2020 | ED, GAN | L2, Adv, GDL | KTH, KITTI, BAIR, Caltech Pedestrian | PSNR, SSIM, LPIPS, FVD
 | FutureGAN | 2018 | 3D-CNN, GAN | WGAN-GP | Moving MNIST, KTH | MSE, SSIM
Recurrent iterative prediction models based on RNNs | Ranzato et al. | 2014 | rCNN | Cross-entropy | UCF101 | MSE
 | Srivastava et al. | 2015 | LSTM | Cross-entropy | UCF101, Moving MNIST | Cross-entropy, squared loss
 | Shi et al. | 2015 | ConvLSTM | Cross-entropy | Moving MNIST, Radar Echo | Cross-entropy, MSE
 | TrajGRU | 2017 | Trajectory GRU | Weighted L2 | Moving MNIST++, HKO-7 | CSI, HSS, B-MSE
 | PredRNN | 2017 | ST-LSTM | L2 | Moving MNIST, KTH, Radar Echo | MSE, PSNR, SSIM
 | fRNN | 2018 | bGRU | L1 | Moving MNIST, KTH, UCF101 | MSE, PSNR, SSIM
 | E3D-LSTM | 2019 | E3D-LSTM | L1, L2 | Moving MNIST, KTH, TaxiBJ | MSE, PSNR, SSIM
 | MotionRNN | 2021 | ConvGRU | L1, L2, GDL | Human3.6M, Moving MNIST | MSE, SSIM, PSNR
 | PredRNN-V2 | 2021 | ST-LSTM | L2 | Moving MNIST, KTH, Radar Echo, Traffic4Cast, BAIR Pushing | MSE, PSNR, SSIM, LPIPS
Generative deep prediction networks | VideoGAN | 2016 | 3D-CNN, GAN | Adv | Two million videos from Flickr | Amazon Mechanical Turk
 | TGAN | 2017 | GAN | Adv | Moving MNIST, UCF101 | IS (Inception Score)
 | VideoFlow | 2020 | Glow | Log-likelihood | BAIR Pushing | FVD, PSNR, SSIM
Table 1 Comparison of video prediction models in the observation space (ED: image encoder-decoder architecture; Adv: adversarial loss)
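Two quantities recurring in the table, the PSNR metric and the gradient difference loss (GDL) introduced with BeyondMSE, are simple enough to sketch directly. This is a minimal numpy version for intuition, not the authors' implementation.

```python
import numpy as np

def psnr(pred, target, data_range=1.0):
    """Peak signal-to-noise ratio in dB; higher means closer frames."""
    mse = np.mean((pred - target) ** 2)
    return 10.0 * np.log10(data_range ** 2 / mse)

def gdl(pred, target):
    """Gradient difference loss: compare image gradients instead of raw
    pixels, penalizing the blur that a pure L2 objective tolerates."""
    dx = np.abs(np.abs(np.diff(pred, axis=1)) - np.abs(np.diff(target, axis=1)))
    dy = np.abs(np.abs(np.diff(pred, axis=0)) - np.abs(np.diff(target, axis=0)))
    return dx.mean() + dy.mean()
```

Note how the two disagree: a uniformly shifted prediction has poor PSNR but zero GDL, which is why GDL is used as an auxiliary term next to L1/L2 rather than alone.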
Category | Method | Year | Architecture | Loss | Datasets | Metrics
---|---|---|---|---|---|---
Deep prediction models in semantic state spaces | Villegas et al. | 2017 | LSTM, ED | L2, Adv | Penn Action, Human3.6M | PSNR
 | VDA | 2017 | CNN | L2 | Block towers | MSE
 | Struct-VRNN | 2019 | RNN | L2, KL | Basketball, Human3.6M | FVD, SSIM, PSNR
 | Bodla et al. | 2021 | RNN, GAN | L1, Adv | UMD-HOI | PSNR, SSIM, LPIPS
 | Bei et al. | 2021 | MLP, RNN | L1, cross-entropy, KL | Cityscapes, KITTI | PSNR, SSIM, LPIPS
Latent state-space learning and spatiotemporal probabilistic prediction | SV2P | 2018 | CDNA | L1, L2, KL | BAIR Robot Pushing, Human3.6M, Robotic Pushing | PSNR, SSIM
 | SVG | 2018 | LSTM, ED | L2, KL | KTH, BAIR Robot Pushing | PSNR, SSIM
 | SAVP | 2018 | GAN, VAE | L1, KL | KTH, BAIR Robot Pushing | PSNR, SSIM
 | Gur et al. | 2020 | GAN, VAE | L2, KL, Adv | UCF101, YouTube 8M | FID
 | GHVAE | 2021 | VAE | ELBO, KL | RoboNet, Human3.6M, KITTI, Cityscapes | FVD, SSIM, LPIPS
Long-term trend modeling in state spaces | EPVA | 2018 | LSTM, ED | L2, Adv | Toy Dataset, Human3.6M | SSIM
 | GCP | 2020 | LSTM, ED | Cross-entropy | Human3.6M | PSNR, SSIM
 | Lee et al. | 2021 | ConvLSTM, ED | L2, L1 | Moving MNIST, KTH, Human3.6M | MSE, PSNR, SSIM, LPIPS
Spatiotemporal representation disentanglement in predictive learning | DrNet | 2017 | LSTM, ED | L2, CE, Adv | MNIST, KTH | IS, PSNR, SSIM
 | DDPAE | 2018 | VAE | ELBO | Moving MNIST | BCE, MSE
 | RNN-EM | 2018 | RNN-ED | KL | Bouncing balls | BCE
 | PhyDNet | 2020 | ConvLSTM, ED | L2 | Moving MNIST, TaxiBJ, Human3.6M | MSE, MAE, SSIM
Table 2 Comparison of video prediction models based on deep network state spaces (ED: image encoder-decoder architecture; Adv: adversarial loss)
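Most of the stochastic models in the table (SV2P, SVG, SAVP, GHVAE) share one ingredient: a KL divergence between diagonal Gaussian latent distributions, optimized alongside a reconstruction term. A minimal sketch of that term and the reparameterization trick follows; the function names are illustrative, not from any cited codebase.

```python
import numpy as np

def kl_diag_gauss(mu_q, logvar_q, mu_p, logvar_p):
    """KL( N(mu_q, var_q) || N(mu_p, var_p) ) for diagonal Gaussians,
    the latent regularizer in VAE-style video predictors."""
    var_q, var_p = np.exp(logvar_q), np.exp(logvar_p)
    return 0.5 * np.sum(
        logvar_p - logvar_q + (var_q + (mu_q - mu_p) ** 2) / var_p - 1.0)

def reparameterize(mu, logvar, rng):
    """z = mu + sigma * eps keeps sampling differentiable in a real model."""
    return mu + np.exp(0.5 * logvar) * rng.standard_normal(mu.shape)
```

In SVG-style models the prior parameters (`mu_p`, `logvar_p`) come from a learned recurrent network over past frames, so the KL term teaches the prior to anticipate the posterior.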
Category | Method | Year | Architecture | Video prediction (world model) loss | Datasets
---|---|---|---|---|---
Model-based visual decision making in high-dimensional observation spaces | Visual MPC | 2018 | ConvLSTM | L2 | \
 | PoliNet | 2019 | ED | L1, L2 | Stanford 2D-3D-S
Model-based visual decision making in low-dimensional semantic spaces | O2P2 | 2019 | CNN | L2, perceptual loss | \
 | SMORL | 2020 | CNN | KL | MuJoCo, Multiworld
Model-based visual decision making in low-dimensional latent spaces | PlaNet | 2019 | RSSM | KL | DeepMind Control Suite
 | Dreamer | 2019 | RSSM | KL, contrastive loss | DeepMind Control Suite
Table 3 Comparison of visual decision-making methods based on predictive models (ED: image encoder-decoder architecture)
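The Visual MPC and PlaNet rows above share one control loop: sample candidate action sequences, roll each through the learned prediction model, and refit the sampling distribution to the lowest-cost elites. Below is a toy one-dimensional cross-entropy-method sketch; the `dynamics` and `cost` functions here are stand-ins for a learned world model, not part of any cited system.

```python
import numpy as np

def cem_plan(dynamics, cost, state, horizon=5, pop=64, elites=8, iters=3):
    """Cross-entropy-method planning over a (learned) dynamics model."""
    rng = np.random.default_rng(0)
    mu, sigma = np.zeros(horizon), np.ones(horizon)
    for _ in range(iters):
        actions = rng.normal(mu, sigma, size=(pop, horizon))
        costs = []
        for seq in actions:                # roll out each candidate plan
            s, c = state, 0.0
            for a in seq:
                s = dynamics(s, a)
                c += cost(s)
            costs.append(c)
        elite = actions[np.argsort(costs)[:elites]]
        mu, sigma = elite.mean(0), elite.std(0) + 1e-6
    return mu  # in MPC, execute mu[0], observe, then replan

# toy problem: state drifts by 0.1*action; goal is state = 1.0
plan = cem_plan(lambda s, a: s + 0.1 * a, lambda s: (s - 1.0) ** 2, state=0.0)
```

The table's three categories differ mainly in where `dynamics` runs: over predicted pixels (Visual MPC), over semantic object states (O2P2), or over an RSSM latent (PlaNet, Dreamer).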
Category | Dataset | Year | #Videos | Real/Synthetic | Resolution | Annotations | Example works
---|---|---|---|---|---|---|---
Synthetic datasets for predictive learning | Bouncing balls | 2008 | 4,000 | Synthetic | / | / | PGN
 | Moving MNIST | 2015 | / | Real + synthetic | 64×64 | / | DFN, ConvLSTM
 | Block towers | 2016 | 12,781 | Real + synthetic | / | / | PhysNet
Human action video datasets | KTH | 2004 | 2,391 | Real | 160×120 | Action | PredRNN
 | Weizmann | 2005 | 90 | Real | 180×144 | Action | MCnet
 | HMDB-51 | 2011 | 6,766 | Real | / | Action, viewpoint | Behrmann et al.
 | UCF101 | 2012 | 13,320 | Real | 320×240 | Action | BeyondMSE
 | Penn Action | 2013 | 2,326 | Real | 480×270 | Action, pose, viewpoint | Villegas et al.
 | Human3.6M | 2014 | / | Real + synthetic | 1000×1000 | Action, depth maps, pose, 2D boxes, 3D scan masks | MIM, Struct-VRNN
 | Sports1M | 2014 | 1,133,158 | Real | / | Motion | BeyondMSE
Urban traffic heat-map and driving video datasets | Caltech Pedestrian | 2009 | 137 | Real | 640×480 | Pedestrian bounding boxes | Jin et al.
 | KITTI | 2013 | 151 | Real | 1392×512 | Odometry | Jin et al.
 | Cityscapes | 2016 | 50 | Real | 2048×1024 | Semantic, instance labels | PEARL
 | TaxiBJ | 2019 | / | Real | 32×32 | / | MIM
 | Traffic4Cast | 2019 | / | Real | 495×436 | / | PredRNN-V2
Robotic visual prediction datasets | Robot Pushing | 2016 | 57,000 | Real | 640×512 | Robot arm pose | SV2P
 | BAIR Pushing | 2017 | 45,000 | Real | 64×64 | Robot arm pose | PredRNN-V2
 | RoboNet | 2019 | 161,000 | Real | / | Robot arm pose | GHVAE
Table 4 Commonly used video prediction datasets
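Synthetic sets such as Moving MNIST stay popular because clips with exactly known dynamics can be generated on the fly. Below is a Moving-MNIST-flavored sketch with a bouncing square in place of a digit; the frame size, patch size, and speed are arbitrary illustrative choices.

```python
import numpy as np

def bouncing_patch_video(frames=20, size=64, patch=8, seed=0):
    """Generate one synthetic clip: a bright patch bouncing off the
    frame borders, a stand-in for a Moving MNIST digit."""
    rng = np.random.default_rng(seed)
    x, y = rng.integers(0, size - patch, 2)   # random start position
    vx, vy = rng.choice([-2, 2], 2)           # constant speed, random sign
    video = np.zeros((frames, size, size), dtype=np.float32)
    for t in range(frames):
        video[t, y:y + patch, x:x + patch] = 1.0
        if not 0 <= x + vx <= size - patch:   # bounce off left/right walls
            vx = -vx
        if not 0 <= y + vy <= size - patch:   # bounce off top/bottom walls
            vy = -vy
        x, y = x + vx, y + vy
    return video
```

Because the generator is deterministic given the seed, ground-truth future frames are available for free, which is exactly what makes such data convenient for benchmarking prediction models.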
[1] SHI X J, CHEN Z R, WANG H, et al. Convolutional LSTM network: A machine learning approach for precipitation nowcasting[C]//Proceedings of the Advances in Neural Information Processing Systems. Montreal: MIT Press, 2015: 802-810.
[2] CHANDRA R, BHATTACHARYA U, BERA A, et al. TraPHic: Trajectory prediction in dense and heterogeneous traffic using weighted interactions[C]//Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. Long Beach: IEEE, 2019: 8483-8492.
[3] CASTREJON L, BALLAS N, COURVILLE A. Improved conditional VRNNs for video prediction[C]//Proceedings of the IEEE/CVF International Conference on Computer Vision. Seoul: IEEE, 2019: 7608-7617.
[4] ZHANG J, ZHENG Y, QI D, et al. DNN-based prediction model for spatio-temporal data[C]//Proceedings of the 24th ACM SIGSPATIAL International Conference on Advances in Geographic Information Systems. Burlingame, California: Association for Computing Machinery, 2016: 1-4.
[5] EBERT F, FINN C, LEE A X, et al. Self-supervised visual planning with temporal skip connections[C]//Proceedings of the 1st Annual Conference on Robot Learning. California: PMLR, 2017: 344-356.
[6] HA D, SCHMIDHUBER J. World models[EB/OL]. (2018-05-27)[2021-09-01].
[7] HAFNER D, LILLICRAP T, BA J, et al. Dream to control: Learning behaviors by latent imagination[EB/OL]. (2019-11-03)[2021-09-01].
[8] WANG Y, LIU B, WU J, et al. DualSMC: Tunneling differentiable filtering and planning under continuous POMDPs[C]//Proceedings of the Twenty-Ninth International Joint Conference on Artificial Intelligence. Yokohama: arXiv, 2020: 4190-4198.
[9] LECUN Y, BOTTOU L. Gradient-based learning applied to document recognition[J]. Proceedings of the IEEE, 1998, 86(11): 2278-2324.
[10] JAIN V, MURRAY J F, ROTH F, et al. Supervised learning of image restoration with convolutional networks[C]//Proceedings of the 11th International Conference on Computer Vision. Rio de Janeiro: IEEE, 2007: 1-8.
[11] MATHIEU M, COUPRIE C, LECUN Y. Deep multi-scale video prediction beyond mean square error[EB/OL]. (2015-11-17)[2021-09-01].
[12] OH J, GUO X X, LEE H, et al. Action-conditional video prediction using deep networks in Atari games[C]//Proceedings of the Advances in Neural Information Processing Systems. Montreal: MIT Press, 2015: 2863-2871.
[13] VUKOTIC V, PINTEA S L, RAYMOND C, et al. One-step time-dependent future video frame prediction with a convolutional encoder-decoder neural network[C]//International Conference on Image Analysis and Processing. Cham: Springer, 2017: 140-151.
[14] JIA X, DE BRABANDERE B, TUYTELAARS T, et al. Dynamic filter networks[C]//Proceedings of the Advances in Neural Information Processing Systems. Barcelona: arXiv, 2016: 667-675.
[15] XUE T, WU J, BOUMAN K L, et al. Visual dynamics: Stochastic future generation via layered cross convolutional networks[J]. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2018, 41(9): 2236-2250.
[16] XU J, NI B, LI Z, et al. Structure preserving video prediction[C]//Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. Salt Lake City: IEEE, 2018: 1460-1469.
[17] JADERBERG M, SIMONYAN K, ZISSERMAN A. Spatial transformer networks[C]//Proceedings of the Advances in Neural Information Processing Systems. Montreal: MIT Press, 2015: 2017-2025.
[18] JIN B, HU Y, TANG Q, et al. Exploring spatial-temporal multi-frequency analysis for high-fidelity and temporal-consistency video prediction[C]//Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. Seattle: IEEE, 2020: 4554-4563.
[19] TRAN D, BOURDEV L, FERGUS R, et al. Learning spatiotemporal features with 3D convolutional networks[C]//Proceedings of the IEEE International Conference on Computer Vision. Santiago: IEEE, 2015: 4489-4497.
[20] KARPATHY A, TODERICI G, SHETTY S, et al. Large-scale video classification with convolutional neural networks[C]//Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. Columbus: IEEE, 2014: 1725-1732.
[21] XU H, DAS A, SAENKO K. R-C3D: Region convolutional 3D network for temporal activity detection[C]//Proceedings of the IEEE International Conference on Computer Vision. Venice: IEEE, 2017: 5783-5792.
[22] AIGNER S, KORNER M. FutureGAN: Anticipating the future frames of video sequences using spatio-temporal 3D convolutions in progressively growing GANs[EB/OL]. (2018-10-02)[2021-09-01].
[23] WANG Y, JIANG L, YANG M H, et al. Eidetic 3D LSTM: A model for video prediction and beyond[C]//International Conference on Learning Representations. New Orleans: OpenReview, 2018: 1-14.
[24] BEHRMANN N, GALL J, NOROOZI M. Unsupervised video representation learning by bidirectional feature prediction[C]//Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision. Waikoloa: IEEE, 2021: 1670-1679.
[25] VONDRICK C, PIRSIAVASH H, TORRALBA A. Generating videos with scene dynamics[C]//Proceedings of the Advances in Neural Information Processing Systems. Barcelona: arXiv, 2016: 613-621.
[26] HAN T, XIE W, ZISSERMAN A. Video representation learning by dense predictive coding[C]//Proceedings of the IEEE/CVF International Conference on Computer Vision Workshops. Seoul: IEEE, 2019: 1483-1492.
[27] GRAVES A. Generating sequences with recurrent neural networks[EB/OL]. arXiv preprint arXiv:1308.0850, 2013.
[28] RANZATO M A, SZLAM A, BRUNA J, et al. Video (language) modeling: A baseline for generative models of natural videos[EB/OL]. (2014-12-20)[2021-09-01].
[29] SUTSKEVER I, VINYALS O, LE Q V. Sequence to sequence learning with neural networks[C]//Proceedings of the Advances in Neural Information Processing Systems. Montreal: MIT Press, 2014: 3104-3112.
[30] SRIVASTAVA N, MANSIMOV E, SALAKHUDINOV R. Unsupervised learning of video representations using LSTMs[C]//Proceedings of the 32nd International Conference on Machine Learning. Lille: JMLR, 2015: 843-852.
[31] SHI X J, CHEN Z R, WANG H, et al. Convolutional LSTM network: A machine learning approach for precipitation nowcasting[C]//Proceedings of the Advances in Neural Information Processing Systems. Montreal: MIT Press, 2015: 802-810.
[32] SHI X J, GAO Z H, LAUSEN L, et al. Deep learning for precipitation nowcasting: A benchmark and a new model[EB/OL]. (2017-06-12)[2021-09-01].
[33] BALLAS N, YAO L, PAL C, et al. Delving deeper into convolutional networks for learning video representations[EB/OL]. (2015-11-19)[2021-09-01].
[34] WANG Y, LONG M, WANG J, et al. PredRNN: Recurrent neural networks for predictive learning using spatiotemporal LSTMs[C]//Proceedings of the Advances in Neural Information Processing Systems. Long Beach: MIT Press, 2017: 879-888.
[35] OLIU M, SELVA J, ESCALERA S. Folded recurrent neural networks for future video prediction[C]//Proceedings of the European Conference on Computer Vision. Munich: Springer, 2018: 716-731.
[36] WU H X, YAO Z Y, LONG M S, et al. MotionRNN: A flexible model for video prediction with spacetime-varying motions[EB/OL]. (2018-03-03)[2021-09-01].
[37] GOODFELLOW I, POUGET-ABADIE J, MIRZA M, et al. Generative adversarial nets[C]//Proceedings of the Advances in Neural Information Processing Systems. Montreal: MIT Press, 2014: 2672-2680.
[38] MIRZA M, OSINDERO S. Conditional generative adversarial nets[EB/OL]. (2014-11-06)[2021-09-01].
[39] VILLEGAS R, YANG J, HONG S, et al. Decomposing motion and content for natural video sequence prediction[EB/OL]. (2017-06-25)[2021-09-01]. arXiv preprint arXiv:1706.08033.
[40] SAITO M, MATSUMOTO E, SAITO S. Temporal generative adversarial nets with singular value clipping[C]//Proceedings of the IEEE International Conference on Computer Vision. Venice: IEEE, 2017: 2830-2839.
[41] VILLEGAS R, ERHAN D, LEE H. Hierarchical long-term video prediction without supervision[C]//Proceedings of the International Conference on Machine Learning. Stockholm: PMLR, 2018: 6038-6046.
[42] JIN B, HU Y, TANG Q, et al. Exploring spatial-temporal multi-frequency analysis for high-fidelity and temporal-consistency video prediction[C]//Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. Seattle: IEEE, 2020: 4554-4563.
[43] HENAFF M, ZHAO J, LECUN Y. Prediction under uncertainty with error-encoding networks[EB/OL]. (2017-11-24)[2021-09-01].
[44] DENTON E, FERGUS R. Stochastic video generation with a learned prior[C]//International Conference on Machine Learning. Stockholm: PMLR, 2018: 1174-1183.
[45] KUMAR M, BABAEIZADEH M, ERHAN D, et al. VideoFlow: A conditional flow-based model for stochastic video generation[EB/OL]. (2017-11-24)[2021-09-01].
[46] KINGMA D P, DHARIWAL P. Glow: Generative flow with invertible 1x1 convolutions[EB/OL]. (2018-07-09)[2021-09-01].
[47] PRENGER R, VALLE R, CATANZARO B. WaveGlow: A flow-based generative network for speech synthesis[C]//Proceedings of the ICASSP 2019-2019 IEEE International Conference on Acoustics, Speech and Signal Processing. Brighton: IEEE, 2019: 3617-3621.
[48] BAYER J, OSENDORFER C. Learning stochastic recurrent networks[EB/OL]. (2014-11-27)[2021-09-01].
[49] FRACCARO M, SONDERBY S K, PAQUET U, et al. Sequential neural models with stochastic layers[EB/OL]. (2016-05-24)[2021-09-01].
[50] KRISHNAN R, SHALIT U, SONTAG D. Structured inference networks for nonlinear state space models[C]//Proceedings of the AAAI Conference on Artificial Intelligence. San Francisco: AAAI, 2017: 2101-2109.
[51] HAFNER D, LILLICRAP T, FISCHER I, et al. Learning latent dynamics for planning from pixels[C]//International Conference on Machine Learning. Long Beach: JMLR, 2019: 2555-2565.
[52] ASAHARA A, MARUYAMA K, SATO A, et al. Pedestrian-movement prediction based on mixed Markov-chain model[C]//Proceedings of the 19th ACM SIGSPATIAL International Conference on Advances in Geographic Information Systems. Chicago: ACM, 2011: 25-33.
[53] MATHEW W, RAPOSO R, MARTINS B. Predicting future locations with hidden Markov models[C]//Proceedings of the 2012 ACM Conference on Ubiquitous Computing. Pittsburgh: ACM, 2012: 911-918.
[54] MURPHY K P. Machine Learning: A Probabilistic Perspective[M]. Cambridge: MIT Press, 2012.
[55] LONG J, SHELHAMER E, DARRELL T. Fully convolutional networks for semantic segmentation[C]//Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. Boston: IEEE, 2015: 3431-3440.
[56] WANG W, YU R, HUANG Q, et al. SGPN: Similarity group proposal network for 3D point cloud instance segmentation[C]//Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. Salt Lake City: IEEE, 2018: 2569-2578.
[57] JIN X, LI X, XIAO H, et al. Video scene parsing with predictive feature learning[C]//Proceedings of the IEEE International Conference on Computer Vision. Venice: IEEE, 2017: 5580-5588.
[58] WU Y, GAO R, PARK J, et al. Future video synthesis with object motion prediction[C]//Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. Virtual Conference: IEEE, 2020: 5539-5548.
[59] BEI X, YANG Y, SOATTO S. Learning semantic-aware dynamics for video prediction[C]//Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. Virtual Conference: IEEE, 2021: 902-912.
[60] PATWARDHAN K A, SAPIRO G, BERTALMIO M. Video inpainting under constrained camera motion[J]. IEEE Transactions on Image Processing, 2007, 16(2): 545-553.
[61] XU R, LI X, ZHOU B, et al. Deep flow-guided video inpainting[C]//Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. Long Beach: IEEE, 2019: 3723-3732.
[62] WU J, LU E, KOHLI P, et al. Learning to see physics via visual de-animation[C]//Proceedings of the Advances in Neural Information Processing Systems. Long Beach: MIT Press, 2017: 153-164.
[63] WANG Y, WU H, ZHANG J, et al. PredRNN: A recurrent neural network for spatiotemporal predictive learning[EB/OL]. (2021-03-17)[2021-09-01].
[64] GULRAJANI I, AHMED F, ARJOVSKY M, et al. Improved training of Wasserstein GANs[EB/OL]. (2017-03-31)[2021-09-01].
[65] HENDERSON P, LAMPERT C H. Unsupervised object-centric video generation and decomposition in 3D[EB/OL]. (2020-07-07)[2021-09-01].
[66] VILLEGAS R, YANG J, ZOU Y, et al. Learning to generate long-term future via hierarchical prediction[C]//Proceedings of the 34th International Conference on Machine Learning. Sydney: JMLR, 2017: 3560-3569.
[67] WALKER J, MARINO K, GUPTA A, et al. The pose knows: Video forecasting by generating pose futures[C]//Proceedings of the IEEE International Conference on Computer Vision. Venice: IEEE, 2017: 3332-3341.
[68] MINDERER M, SUN C, VILLEGAS R, et al. Unsupervised learning of object structure and dynamics from videos[EB/OL]. (2019-06-19)[2021-09-01].
[69] BODLA N, SHRIVASTAVA G, CHELLAPPA R, et al. Hierarchical video prediction using relational layouts for human-object interactions[C]//Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. Virtual Conference: IEEE, 2021: 12146-12155.
[70] KINGMA D P, WELLING M. Auto-encoding variational Bayes[EB/OL]. (2013-12-20)[2021-09-01].
[71] REZENDE D J, MOHAMED S, WIERSTRA D. Stochastic backpropagation and approximate inference in deep generative models[C]//International Conference on Machine Learning. Beijing: PMLR, 2014: 1278-1286.
[72] CHUNG J, KASTNER K, DINH L, et al. A recurrent latent variable model for sequential data[C]//Proceedings of the Advances in Neural Information Processing Systems. Montreal: MIT Press, 2015: 2980-2988.
[73] BABAEIZADEH M, FINN C, ERHAN D, et al. Stochastic variational video prediction[EB/OL]. (2017-10-30)[2021-09-01].
[74] LEE A X, ZHANG R, EBERT F, et al. Stochastic adversarial video prediction[EB/OL]. (2018-04-04)[2021-09-01].
[75] WU B, NAIR S, MARTIN-MARTIN R, et al. Greedy hierarchical variational autoencoders for large-scale video prediction[C]//Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. Virtual Conference: IEEE, 2021: 2318-2328.
[76] GUR S, BENAIM S, WOLF L. Hierarchical patch VAE-GAN: Generating diverse videos from a single sample[EB/OL]. (2020-06-22)[2021-09-01].
[77] SONDERBY C K, RAIKO T, MAALOE L, et al. Ladder variational autoencoders[C]//Proceedings of the Advances in Neural Information Processing Systems. Barcelona: IEEE, 2016: 3738-3746.
[78] WANG Y, WU J, LONG M, et al. Probabilistic video prediction from noisy data with a posterior confidence[C]//Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. Virtual Conference: IEEE, 2020: 10830-10839.
[79] PERTSCH K, RYBKIN O, EBERT F, et al. Long-horizon visual planning with goal-conditioned hierarchical predictors[C]//Proceedings of the Advances in Neural Information Processing Systems. Virtual Conference: Curran Associates, 2020: 17321-17333.
[80] KIM T, AHN S, BENGIO Y. Variational temporal abstraction[C]//Proceedings of the Advances in Neural Information Processing Systems. Vancouver: MIT Press, 2019: 11570-11579.
[81] GRAVES A, WAYNE G, DANIHELKA I. Neural Turing machines[EB/OL]. (2014-10-20)[2021-09-01].
[82] LEE S, KIM H G, CHOI D H, et al. Video prediction recalling long-term motion context via memory alignment learning[C]//Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. Nashville: IEEE, 2021: 3054-3063.
[83] DENTON E, BIRODKAR V. Unsupervised learning of disentangled representations from video[EB/OL]. (2017-05-31)[2021-09-01].
[84] GUEN V L, THOME N. Disentangling physical dynamics from unknown factors for unsupervised video prediction[C]//Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. Seattle: IEEE, 2020: 11474-11484.
[85] HSIEH J T, LIU B, HUANG D A, et al. Learning to decompose and disentangle representations for video prediction[EB/OL]. (2018-06-11)[2021-09-01].
[86] VAN STEENKISTE S, CHANG M, GREFF K, et al. Relational neural expectation maximization: Unsupervised discovery of objects and their interactions[EB/OL]. (2018-02-28)[2021-09-01].
[87] ZABLOTSKAIA P, DOMINICI E A, SIGAL L, et al. Unsupervised video decomposition using spatio-temporal iterative inference[EB/OL]. (2020-06-25)[2021-09-01].
[88] GREFF K, KAUFMAN R L, KABRA R, et al. Multi-object representation learning with iterative variational inference[C]//Proceedings of the International Conference on Machine Learning. Long Beach: PMLR, 2019: 2424-2433.
[89] FINN C, GOODFELLOW I, LEVINE S. Unsupervised learning for physical interaction through video prediction[C]//Proceedings of the Advances in Neural Information Processing Systems. Barcelona: MIT Press, 2016: 64-72.
[90] GUPTA A, KEMBHAVI A, DAVIS L S. Observing human-object interactions: Using spatial and functional compatibility for recognition[J]. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2009, 31(10): 1775-1789.
[91] DREHER C R G, WACHTER M, ASFOUR T. Learning object-action relations from bimanual human demonstration using graph networks[J]. IEEE Robotics and Automation Letters, 2019, 5(1): 187-194.
[92] SCHMIDHUBER J, HUBER R. Learning to generate artificial fovea trajectories for target detection[J]. International Journal of Neural Systems, 1991, 2: 125-134.
[93] CHIAPPA S, RACANIERE S, WIERSTRA D, et al. Recurrent environment simulators[EB/OL]. (2017-04-07)[2021-09-01].
[94] BABAEIZADEH M, SAFFAR M T, HAFNER D, et al. Models, pixels, and rewards: Evaluating design trade-offs in visual model-based reinforcement learning[EB/OL]. (2020-12-08)[2021-09-01].
[95] EBERT F, FINN C, DASARI S, et al. Visual foresight: Model-based deep reinforcement learning for vision-based robotic control[EB/OL]. (2018-12-03)[2021-09-01].
[96] HIROSE N, XIA F, MARTIN-MARTIN R, et al. Deep visual MPC-policy learning for navigation[J]. IEEE Robotics and Automation Letters, 2019, 4(4): 3184-3191.
[97] WANG Y, LIU B, WU J, et al. DualSMC: Tunneling differentiable filtering and planning under continuous POMDPs[C]//Proceedings of the Twenty-Ninth International Joint Conference on Artificial Intelligence. Virtual Conference: AAAI, 2020: 4190-4198.
[98] JANNER M, LEVINE S, FREEMAN W T, et al. Reasoning about physical interactions with object-oriented prediction and planning[EB/OL]. (2018-12-28)[2021-09-01].
[99] ZADAIANCHUK A, SEITZER M, MARTIUS G. Self-supervised visual reinforcement learning with object-centric representations[EB/OL]. (2020-11-29)[2021-09-01].
[100] SUTSKEVER I, HINTON G E, TAYLOR G W. The recurrent temporal restricted Boltzmann machine[C]//Proceedings of the Advances in Neural Information Processing Systems. Vancouver, British Columbia: MIT Press, 2009: 1601-1608.
[101] LERER A, GROSS S, FERGUS R. Learning physical intuition of block towers by example[C]//Proceedings of the 32nd International Conference on Machine Learning. New York: JMLR, 2016: 430-438.
[102] SCHULDT C, LAPTEV I, CAPUTO B. Recognizing human actions: A local SVM approach[C]//Proceedings of the 17th International Conference on Pattern Recognition. Cambridge: IEEE, 2004: 32-36.
[103] GORELICK L, BLANK M, SHECHTMAN E, et al. Actions as space-time shapes[J]. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2007, 29(12): 2247-2253.
[104] KUEHNE H, JHUANG H, GARROTE E, et al. HMDB: A large video database for human motion recognition[C]//Proceedings of the International Conference on Computer Vision. Barcelona: IEEE, 2011: 2556-2563.
[105] SOOMRO K, ZAMIR A R, SHAH M. UCF101: A dataset of 101 human action classes from videos in the wild[EB/OL]. (2012-12-03)[2021-09-01].
[106] ZHANG W, ZHU M, DERPANIS K G. From actemes to action: A strongly-supervised representation for detailed action understanding[C]//Proceedings of the IEEE International Conference on Computer Vision. Sydney: IEEE, 2013: 2248-2255.
[107] IONESCU C, PAPAVA D, OLARU V, et al. Human3.6M: Large scale datasets and predictive methods for 3D human sensing in natural environments[J]. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2013, 36(7): 1325-1339.
[108] DOLLAR P, WOJEK C, SCHIELE B, et al. Pedestrian detection: An evaluation of the state of the art[J]. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2011, 34(4): 743-761.
[109] GEIGER A, LENZ P, STILLER C, et al. Vision meets robotics: The KITTI dataset[J]. The International Journal of Robotics Research, 2013, 32(11): 1231-1237.
[110] CORDTS M, OMRAN M, RAMOS S, et al. The Cityscapes dataset for semantic urban scene understanding[C]//Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. Las Vegas: IEEE, 2016: 3213-3223.
[111] WANG Y, ZHANG J, ZHU H, et al. Memory in memory: A predictive neural network for learning higher-order non-stationarity from spatiotemporal dynamics[C]//Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. Long Beach: IEEE, 2019: 9154-9162.
[112] MARTIN H, HONG Y, BUCHER D, et al. Traffic4cast-Traffic Map Movie Forecasting-Team MIE-Lab[EB/OL]. (2019-10-27)[2021-09-01].
[113] EBERT F, FINN C, LEE A X, et al. Self-supervised visual planning with temporal skip connections[C]//Proceedings of the 1st Annual Conference on Robot Learning. Mountain View, California: PMLR, 2017: 344-356.
[114] DASARI S, EBERT F, TIAN S, et al. RoboNet: Large-scale multi-robot learning[EB/OL]. (2019-10-24)[2021-09-01].
[115] KWON Y H, PARK M G. Predicting future frames using retrospective cycle GAN[C]//Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. Long Beach: IEEE, 2019: 1811-1820.
[116] HO Y H, CHO C Y, PENG W H, et al. SME-Net: Sparse motion estimation for parametric video prediction through reinforcement learning[C]//Proceedings of the IEEE/CVF International Conference on Computer Vision. Long Beach: IEEE, 2019: 10462-10470.
[117] LIANG X, LEE L, DAI W, et al. Dual motion GAN for future-flow embedded video prediction[C]//Proceedings of the IEEE International Conference on Computer Vision. Venice: IEEE, 2017: 1744-1752.
[118] XU J, NI B, YANG X. Video prediction via selective sampling[C]//Proceedings of the Advances in Neural Information Processing Systems. Montreal: MIT Press, 2018: 1712-1722.
[119] VILLEGAS R, PATHAK A, KANNAN H, et al. High fidelity video prediction with large stochastic recurrent neural networks[C]//Proceedings of the Advances in Neural Information Processing Systems. Vancouver: MIT Press, 2019: 81-91.
[120] LOTTER W, KREIMAN G, COX D. Unsupervised learning of visual structure using predictive generative networks[EB/OL]. (2015-11-19)[2021-09-01].
[121] MICHALSKI V, MEMISEVIC R, KONDA K. Modeling deep temporal dependencies with recurrent grammar cells[C]//Proceedings of the Advances in Neural Information Processing Systems. Quebec: MIT Press, 2014: 1925-1933.
[122] ARMENI I, SAX S, ZAMIR A R, et al. Joint 2D-3D-semantic data for indoor scene understanding[EB/OL]. (2017-02-03)[2021-09-01].
[123] CHANG A, DAI A, FUNKHOUSER T, et al. Matterport3D: Learning from RGB-D data in indoor environments[EB/OL]. (2017-09-18)[2021-09-01].