Acta Electronica Sinica ›› 2021, Vol. 49 ›› Issue (10): 2048-2060. DOI: 10.12263/DZXB.20200669

Overview of Image Captions Based on Deep Learning

SHI Yi-le1, YANG Wen-zhong2, DU Hui-xiang1, WANG Li-hua1, WANG Ting1, LI Shan-shan1

Received: 2020-07-08
Revised: 2020-09-13
Online: 2021-11-29
Published: 2021-10-25

Abstract:
Image captioning aims to extract features from an image, feed them into a language generation model, and output a description of the image, thereby tackling intelligent image understanding, a problem at the intersection of natural language processing and computer vision in artificial intelligence. This paper surveys and analyzes representative image captioning papers published between 2015 and 2020 and, taking the core technique as the classification criterion, divides image captioning into five broad categories: approaches based on the Encoder-Decoder framework, on attention mechanisms, on reinforcement learning, on generative adversarial networks, and on newly fused datasets. Three models, NIC, Hard-Attention, and NeuralTalk, are evaluated on the real-world MS-COCO dataset, and their performance is compared using the average BLEU-1, BLEU-2, BLEU-3, and BLEU-4 scores. Finally, the paper outlines future development trends in image captioning and points out the challenges it will face and research directions worth deeper exploration.
SHI Yi-le, YANG Wen-zhong, DU Hui-xiang, WANG Li-hua, WANG Ting, LI Shan-shan. Overview of Image Captions Based on Deep Learning[J]. Acta Electronica Sinica, 2021, 49(10): 2048-2060.
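The abstract summarizes the basic Encoder-Decoder pipeline shared by the surveyed models: a CNN encodes the image into features, and a recurrent language model decodes those features into a caption word by word. The PyTorch sketch below illustrates that pipeline in its simplest NIC-like form; the backbone choice, layer sizes, and class names are illustrative assumptions rather than the exact configuration of any model in the tables.

```python
# Minimal CNN-encoder + LSTM-decoder captioning sketch (illustrative only;
# backbone and hyperparameters are assumptions, not the original NIC setup).
import torch
import torch.nn as nn
import torchvision.models as models

class Encoder(nn.Module):
    """CNN encoder: maps an image to a single feature vector."""
    def __init__(self, embed_dim=512):
        super().__init__()
        resnet = models.resnet50(weights=None)  # load pretrained weights in practice
        self.backbone = nn.Sequential(*list(resnet.children())[:-1])  # drop the final fc layer
        self.fc = nn.Linear(resnet.fc.in_features, embed_dim)

    def forward(self, images):                    # images: (B, 3, H, W)
        feats = self.backbone(images).flatten(1)  # (B, 2048)
        return self.fc(feats)                     # (B, embed_dim)

class Decoder(nn.Module):
    """LSTM decoder: generates caption logits conditioned on the image feature."""
    def __init__(self, vocab_size, embed_dim=512, hidden_dim=512):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def forward(self, img_feat, captions):        # captions: (B, T) token ids
        tokens = self.embed(captions)             # (B, T, embed_dim)
        # Feed the image feature as the first "word", as in Show-and-Tell.
        inputs = torch.cat([img_feat.unsqueeze(1), tokens], dim=1)
        hidden, _ = self.lstm(inputs)             # (B, T+1, hidden_dim)
        return self.out(hidden)                   # logits over the vocabulary

# Usage (hypothetical sizes): logits = Decoder(10000)(Encoder()(images), captions)
```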
Table 1  Comparison of scores for different models and datasets

| Technique | Model | Flickr8k | Flickr30k | MS-COCO |
|---|---|---|---|---|
| End-to-end (pioneering) | NIC | — | — | B4:27.7 M:23.7 C:85.5 |
| End-to-end (pioneering) | BRNN | B1:57.9 B2:38 B3:24.5 B4:16 | B1:57 B2:37 B3:24 B4:15.7 | B1:62 B2:45 B3:32 B4:23 M:19 C:66 |
| End-to-end (encoder) | Up-Down | — | — | B1:80.2 B2:64.1 B3:49.1 B4:36.9 M:27.6 R:57.1 C:117.9 S:21.5 |
| End-to-end (encoder) | VS-LSTM | — | B1:78.9 B2:63.4 B3:48.1 B4:36.3 M:27.3 C:120.8 | B1:79 B2:63 B3:48 B4:35.9 M:27 R:56.5 C:116 |
| End-to-end (decoder) | CNN+Attn | — | — | B1:71.5 B2:54.5 B3:40.8 B4:30.4 M:24.6 R:52.5 C:91 |
| End-to-end (decoder) | NBT | — | B1:69.0 B4:27.1 M:21.7 C:57.5 S:15.6 | B1:75.5 B4:34.7 M:27.1 C:107.2 S:20.1 |
| End-to-end (decoder) | Entity-aware | — | — | B1:25.5 B2:14.9 B3:8.0 B4:4.7 M:11.0 R:21.1 C:29.9 F:39.7 H:0.87 |
| End-to-end (decoder) | GroupCap | — | — | B1:72.9 B2:56.5 B3:42.5 B4:31.6 M:25.8 R:54.5 C:101.9 |
| End-to-end (decoder) | one hot+Glove | — | — | F(bottle):29.6 F(bus):74 F(couch):38 F(pizza):68 F(average):55.66 M:23 |
| End-to-end (decoder) | SemStyle-coco | — | — | B1:0.653 B4:0.238 M:0.219 C:0.769 S:0.157 C:0.003 L:6.905 G:6.691 |
| End-to-end (decoder) | Stack-Cap | — | — | B1:93.2 B2:86.1 B3:76.0 B4:64.6 M:35.6 R:70.6 C:118.3 S:20.9 |
| Attention mechanism | Up-Down Attention | — | — | B1:95.2 B2:88.8 B3:79.4 B4:68.5 M:36.7 R:72.4 C:120.5 S:71.5 |
| Attention mechanism | Adaptive | — | B1:0.677 B2:0.494 B3:0.354 B4:0.251 M:0.204 C:0.531 | B1:0.74 B2:0.58 B3:0.44 B4:0.33 M:0.27 C:1.085 |
| Attention mechanism | Hard-Attention | B1:67 B2:45 B3:31 B4:21 M:20 | B1:66 B2:43 B3:29 B4:19 M:18.4 | B1:71.8 B2:50.4 B3:35.7 B4:25.0 M:23.04 |
| Attention mechanism | SimNet | — | S:0.160 C:0.585 M:0.221 R:0.489 B4:0.251 | S:0.220 C:1.135 M:0.283 R:0.564 B4:0.332 |
| Attention mechanism | Att-KB+LSTM | — | — | B1:0.8 B2:0.64 B3:0.5 B4:0.4 M:0.28 C:1.07 P:9.6 (COCO-QA: Acc:67.66 WUPS@0.9:75.76 WUPS@0.0:93.63) |
| Generative adversarial network | G+MLE | — | B3:37 B4:30 M:21 R:47 C:76 S: E-NGAN:46 E-GAN:43 | B3:0.4 B4:0.3 M:0.2 R:0.5 C:1.0 S:0.2 E-NGAN:0.4 E-GAN:0.42 |
| Generative adversarial network | Tgt | — | — | B1:53.4 B2:39.8 B3:30.7 B4:24.5 R:50.7 M:23.2 |
| Generative adversarial network | Base+CL | — | — | B1:75.5 B2:59.8 B3:46.0 B4:35.3 R:55.9 M:27.1 C:114.2 |
| Generative adversarial network | Adv-samp | — | — | M:27.2 S:18.7 S(color):10.1 S(Attribute):8.5 S(object):34.5 S(relation):4.9 S(count):2.5 |
| Reinforcement learning | Full-model | — | — | B1:71.3 B2:53.9 B3:40.3 B4:30.4 R:52.5 M:25.1 C:93.7 |
| Reinforcement learning | B-SCST | — | — | B1:72.7 B4:29.6 M:22.6 R:50.6 C:67.0 S:16.4 |
| Reinforcement learning | PG-SPIDEr | — | — | B1:74.3 B2:57.8 B3:43.3 B4:32.2 R:54.4 M:25.1 C:100.0 |
| New/fused datasets | MTTSNet | — | — | — |
| New/fused datasets | M4C-Captioner | — | — | B4:18.9 M:19.8 R:43.2 S:12.8 C:81.0 H:3.0 |
| New/fused datasets | Transformer+OA | — | — | B4:6.30 R:21.7 C:54.4 Named entities (P:24.6 R:22.2) |
| New/fused datasets | Unsupervised | — | — | B1:58.9 B2:40.3 B3:27 B4:18.6 R:43.1 M:17.9 C:54.9 S:11.1 |

Metric abbreviations: B: BLEU, M: METEOR, C: CIDEr, S: SPICE, R: ROUGE, A: Accuracy, H: Human, F: F1, C: CLF, L: LM, G: GRULM. A dash (—) indicates that no result was reported on that dataset.
Table 2  Advantages and disadvantages of different models

| Technique | Model | Year | Advantages | Disadvantages |
|---|---|---|---|---|
| End-to-end (pioneering) | NIC | 2016 | Simple model design that still achieves fairly good results | Does not consider relations among fine-grained image features |
| End-to-end (pioneering) | BRNN | 2016 | The multimodal RNN outperforms retrieval baselines at both the full-frame and region level | Generated captions describe only regions rather than the whole image |
| End-to-end (encoder) | Up-Down | 2018 | Can consider all information related to an object at once | More training parameters noticeably increase the training cost |
| End-to-end (encoder) | VS-LSTM | 2018 | Introduces state perturbation to explore suitable words among frequent and less frequent vocabulary | The uniform grid output blurs object boundaries, leading to misrecognition of objects |
| End-to-end (decoder) | one hot+Glove | 2017 | Achieves comparable scores without requiring a huge dataset | — |
| End-to-end (decoder) | CNN+Attn | 2018 | Training time per parameter is better than LSTM-based methods | Cannot take sentence context into account |
| End-to-end (decoder) | NBT | 2018 | Weights text differently to extract more critical information without increasing model computation | Caption length in the training set is limited to fewer than 17 words |
| End-to-end (decoder) | Entity-aware | 2018 | Fine-grained named slots allow specific information to appear in the generated caption | Errors in the relations of filled entities and in the generated templates |
| End-to-end (decoder) | GroupCap | 2018 | Better captures and more accurately interprets the relevance among images in a group | Every parameter update requires a pass over the entire dataset |
| End-to-end (decoder) | SemStyle-coco | 2018 | Can generate captions with a particular linguistic style | Requires a large amount of external data and heavy processing to find text related to the images |
| End-to-end (decoder) | Stack-Cap | 2018 | Produces increasingly fine-grained descriptions | Obtaining descriptions via conventional greedy decoding increases computation |
| Attention mechanism | Hard-Attention | 2015 | First use of attention in image captioning to improve caption quality | Assigns weights to image regions rather rigidly |
| Attention mechanism | Att-KB+LSTM | 2016 | Can capture high-level abstract information about the image | A small vocabulary leaves some words inexpressible |
| Attention mechanism | Adaptive | 2017 | Varies spatial attention over time to decide when to look and how much to look | Related words with similar meanings can receive very different probabilities |
| Attention mechanism | Up-Down Attention | 2018 | Candidate attention regions are processed together with object-related visual concepts at the same location | The model may over-rely on sentence context, which misleads image recognition |
| Attention mechanism | SimNet | 2018 | A stepwise merging mechanism makes generated captions both detailed and comprehensive | With many objects the model cannot capture foreground-background relations |
| Generative adversarial network | G+MLE | 2017 | First use of a conditional GAN to improve the diversity of generated captions | The training order and number of iterations of the generator and discriminator are hard to control |
| Generative adversarial network | Base+CL | 2017 | Images of the same class yield distinct yet semantically similar descriptions | With only positive samples performance improves slightly; with only negative samples it drops sharply |
| Generative adversarial network | Adv-samp | 2017 | Reduces the ambiguity inherent in the captioning task | — |
| Generative adversarial network | Tgt | 2018 | Different generative models can produce semantically similar sentences, showing strong transferability | Cannot generate target captions in the passive voice |
| Reinforcement learning | state-of-the-art | 2017 | The model is more robust and stays stable under policy errors | The confidence provided by the sequential word generator considers only local information |
| Reinforcement learning | PG-SPIDEr | 2017 | Incorporates into reinforcement learning a metric designed specifically for evaluating image captions | Training converges somewhat slowly |
| Reinforcement learning | Full-model | 2017 | Best performance with the policy network as a local guide and the value network as a global, look-ahead guide | Beam size affects the SL baseline relatively strongly, while the model itself is not very sensitive to it |
| Reinforcement learning | B-SCST | 2020 | First Bayesian variant of a policy-gradient reinforcement-learning training technique | Softmax probabilities can be overconfident and are not a good measure of predictive confidence |
| New/fused datasets | MTTSNet | 2019 | The relational captions it builds are advantageous in diversity and informativeness | Cannot be applied to scene-graph generation or VRD models |
| New/fused datasets | Unsupervised | 2019 | First use of unsupervised learning for image captioning | Conditioning the adversarial network on image regions and text changes the results little |
| New/fused datasets | M4C-Captioner | 2020 | Generates impressive captions on both the new dataset and COCO images | Requires generating longer sentences with many switches between OCR tokens and vocabulary tokens |
| New/fused datasets | Transformer+OA | 2020 | Can generate proper nouns that appear in the surrounding context even when they never appear in any caption | A gap remains between generated and human captions in TTR and length |
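Several models in Table 2 (Hard-Attention, Adaptive, Up-Down Attention, SimNet) differ mainly in how attention weights over image regions are computed at each decoding step. The following sketch shows a generic additive soft-attention module of the kind these approaches build on; the dimensions and names are assumptions, not any single paper's implementation.

```python
# Generic soft attention over image region features (illustrative sketch).
import torch
import torch.nn as nn

class SoftAttention(nn.Module):
    """Scores each region feature against the decoder state and returns a weighted context vector."""
    def __init__(self, feat_dim, hidden_dim, attn_dim=256):
        super().__init__()
        self.proj_feat = nn.Linear(feat_dim, attn_dim)
        self.proj_hidden = nn.Linear(hidden_dim, attn_dim)
        self.score = nn.Linear(attn_dim, 1)

    def forward(self, regions, hidden):
        # regions: (B, K, feat_dim) region features; hidden: (B, hidden_dim) decoder state
        e = self.score(torch.tanh(self.proj_feat(regions) +
                                  self.proj_hidden(hidden).unsqueeze(1)))  # (B, K, 1)
        alpha = torch.softmax(e, dim=1)           # attention weights over the K regions
        context = (alpha * regions).sum(dim=1)    # (B, feat_dim) weighted region feature
        return context, alpha.squeeze(-1)
```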
Table 3  Splits of the commonly used datasets (number of images)

| Dataset | Training set | Test set | Validation set |
|---|---|---|---|
| MS-COCO2017 | 118287 | 50694 | 5000 |
| MS-COCO2014 | 82783 | 40775 | 5000 |
| Flickr30k | 28000 | 1000 | 1000 |
| Flickr8k | 6000 | 1000 | 1000 |
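Table 3 lists the image counts of the commonly used captioning datasets. For readers reproducing the comparison, MS-COCO caption data can be loaded directly with torchvision's CocoCaptions dataset (which requires pycocotools); the directory and annotation-file paths below are placeholders to be adapted to the local COCO2014 or COCO2017 layout.

```python
# Hedged example: loading MS-COCO captions with torchvision (paths are placeholders).
import torchvision.datasets as dset
import torchvision.transforms as T

coco_train = dset.CocoCaptions(
    root="coco/train2014",                               # directory containing the images
    annFile="coco/annotations/captions_train2014.json",  # caption annotation file
    transform=T.Compose([T.Resize((224, 224)), T.ToTensor()]),
)

image, captions = coco_train[0]  # an image tensor and its human reference captions
print(len(coco_train), len(captions))
```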
Table 4  Commonly used evaluation metrics

| Original task | Metric | Function |
|---|---|---|
| Machine translation | BLEU | Measures how many n-grams of the candidate text co-occur in the reference texts |
| Machine translation | METEOR | Matches specific sequences, taking paraphrases, synonyms, stems, and affixes into account |
| Automatic summarization | ROUGE | Computes the recall of the generated result against the references |
| Image captioning | CIDEr | Computes the cosine similarity between reference and generated captions |
| Image captioning | SPICE | Computes an F-score over the objects, attributes, and semantic relations in the generated caption |
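Table 4 summarizes the metrics that appear throughout Table 1. As a concrete reference for the BLEU-1 to BLEU-4 scores used in the paper's experiments, the following self-contained sketch computes corpus-level BLEU-n (clipped n-gram precision combined with a brevity penalty); it follows the standard definition and is not the exact evaluation script used by the authors.

```python
# Corpus-level BLEU-n sketch: clipped n-gram precision with a brevity penalty.
from collections import Counter
from math import exp, log

def ngrams(tokens, n):
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def bleu(candidates, references_list, max_n=4):
    """candidates: list of token lists; references_list: list of lists of reference token lists."""
    log_precisions = []
    for n in range(1, max_n + 1):
        match, total = 0, 0
        for cand, refs in zip(candidates, references_list):
            cand_counts = ngrams(cand, n)
            max_ref = Counter()  # maximum count of each n-gram over the references
            for ref in refs:
                for g, c in ngrams(ref, n).items():
                    max_ref[g] = max(max_ref[g], c)
            match += sum(min(c, max_ref[g]) for g, c in cand_counts.items())
            total += sum(cand_counts.values())
        log_precisions.append(log(match / total) if match else float("-inf"))
    # Brevity penalty: total candidate length vs. total closest-reference length.
    c = sum(len(cand) for cand in candidates)
    r = sum(min((abs(len(ref) - len(cand)), len(ref)) for ref in refs)[1]
            for cand, refs in zip(candidates, references_list))
    bp = 1.0 if c > r else exp(1 - r / c)
    return bp * exp(sum(log_precisions) / max_n)

# Example: one candidate caption scored against two references (BLEU-1).
cand = ["a", "dog", "runs", "on", "the", "grass"]
refs = [["a", "dog", "is", "running", "on", "grass"],
        ["the", "dog", "runs", "in", "the", "field"]]
print(round(bleu([cand], [refs], max_n=1), 3))
```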
1 | QuanY, LiZ X, ZhangC L, et al. Fusing deep dilated convolutions network and light-weight network for object detection[J]. Acta Electronica Sinica, 2020, 48(2): 390-397. (in Chinese) |
2 | LiuY, LiuH Y, FanJ L, et al. A survey of research and application of small object detection based on deep learning[J]. Acta Electronica Sinica, 2020, 48(3): 590-601. (in Chinese) |
3 | Yang K L. A brief review of the development and recent work in image captioning (2010-2018)[EB/OL]. , 2018-12-01. (in Chinese) |
4 | VinyalsO, ToshevA, BengioS, et al. Show and tell: Lessons learned from the 2015 MSCOCO image captioning challenge[J]. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2017, 39(4): 652-663. |
5 | TanX, RenY, HeD, et al. Multilingual neural machine translation with knowledge distillation[EB/OL]. , 2019. |
6 | 亲历者. Implementation and testing of the show_and_tell code: batch training[EB/OL]. , 2018-09-07. (in Chinese) |
7 | KarpathyA, LiF F. Deep visual-semantic alignments for generating image descriptions[J]. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2017, 39(4): 664-676. |
8 | SimonyanK, ZissermanA. Very deep convolutional networks for large-scale image recognition[EB/OL]. , 2014. |
9 | FangH, GuptaS, IandolaF, et al. From captions to visual concepts and back[A]. 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR)[C]. Boston, MA, USA: IEEE, 2015. 1473-1482. |
10 | LiN, ChenZ. Image captioning with visual-semantic LSTM[A]. Proceedings of the Twenty-Seventh International Joint Conference on Artificial Intelligence[C]. Shanghai, China: IJCAI, 2018. 793-799. |
11 | AndersonP, HeX D, BuehlerC, et al. Bottom-up and top-down attention for image captioning and visual question answering[A]. 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition[C]. Salt Lake City, UT, USA: IEEE, 2018. 6077-6086. |
12 | FeiZ C. Better understanding hierarchical visual relationship for image caption[EB/OL]. , 2019. |
13 | LeeK H, PalangiH, ChenX, et al. Learning visual relation priors for image-text matching and image captioning with neural scene graph generators[EB/OL]. , 2019. |
14 | YaoT, PanY W, LiY H, et al. Hierarchy parsing for image captioning[A]. 2019 IEEE/CVF International Conference on Computer Vision (ICCV)[C]. Seoul, Korea (South): IEEE, 2019. 2621-2629. |
15 | HeS, LiaoW T, TavakoliH R, et al. Image captioning through image transformer[A]. Computer Vision-ACCV2020[M]. Cham, Switzerland: Springer International Publishing, 2021. 153-169. |
16 | ZhangH B, JiangZ L, XiongQ P, et al. Image attribute annotation via a modified effective range based gene selection and cross-modal semantics mining[J]. Acta Electronica Sinica, 2020, 48(4): 790-799. (in Chinese) |
17 | ChenF H, JiR R, SunX S, et al. GroupCap: group-based image captioning with structured relevance and diversity constraints[A]. 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition[C]. Salt Lake City, UT, USA: IEEE, 2018. 1345-1353. |
18 | PasunuruR, BansalM. Multi-task video captioning with video and entailment generation[A]. Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)[C]. Stroudsburg, PA, USA: Association for Computational Linguistics, 2017. 1273-1283. |
19 | ZhouL W, PalangiH, ZhangL, et al. Unified vision-language pre-training for image captioning and VQA[J]. Proceedings of the AAAI Conference on Artificial Intelligence, 2020, 34(7): 13041-13049. |
20 | WangY F, LinZ, ShenX H, et al. Skeleton key: Image captioning by skeleton-attribute decomposition[A]. 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR)[C]. Honolulu, HI, USA: IEEE, 2017. 7378-7387. |
21 | LuJ S, YangJ W, BatraD, et al. Neural baby talk[A]. 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition[C]. Salt Lake City, UT, USA: IEEE, 2018. 7219-7228. |
22 | MathewsA, XieL X, HeX M. SemStyle: learning to generate stylised image captions using unaligned text[A]. 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition[C]. Salt Lake City, UT, USA: IEEE, 2018. 8591-8600. |
23 | WangL, BaiZ C, ZhangY H, et al. Show, recall, and tell: Image captioning with recall mechanism[J]. Proceedings of the AAAI Conference on Artificial Intelligence, 2020, 34(7): 12176-12183. |
24 | NguyenG, JunT J, TranT, et al. ContCap: A comprehensive framework for continual image captioning[EB/OL]. , 2019. |
25 | GuJ X, CaiJ F, WangG, et al. Stack-captioning: Coarse-to-fine learning for image captioning[EB/OL]. , 2018. |
26 | FeiZ C. Fast image caption generation with position alignment[EB/OL]. , 2019. |
27 | ThapliyalA V, SoricutR. Cross-modal language generation using pivot stabilization for Web-scale language coverage[A]. Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics[C]. Stroudsburg, PA, USA: Association for Computational Linguistics, 2020. 160-170. |
28 | XiaQ L, HuangH Y, DuanN, et al. XGPT: cross-modal generative pre-training for image captioning[EB/OL]. , 2020. |
29 | YaoT, PanY, LiY,et al. Incorporating copying mechanism in image captioning for learning novel objects[A]. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition[C]. New York, USA: IEEE, 2017. 6580-6588. |
30 | AnejaJ, DeshpandeA, SchwingA G. Convolutional image captioning[A]. 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition[C]. Salt Lake City, UT, USA, IEEE, 2018: 5561-5570. |
31 | NikolausM, AbdouM, LammM, et al. Compositional generalization in image captioning[A]. Proceedings of the 23rd Conference on Computational Natural Language Learning (CoNLL)[C]. Stroudsburg, PA, USA: Association for Computational Linguistics, 2019. 87-98. |
32 | LuD, WhiteheadS, HuangL F, et al. Entity-aware image caption generation[A]. Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing[C]. Stroudsburg, PA, USA: Association for Computational Linguistics, 2018. 4013-4023. |
33 | SammaniF, ElsayedM. Look and modify: Modification networks for image captioning[EB/OL]. , 2019. |
34 | XuK, BaJ, KirosR, et al. Show, attend and tell: Neural image caption generation with visual attention[EB/OL]. , 2015. |
35 | LuJ S, XiongC M, ParikhD, et al. Knowing when to look: Adaptive attention via a visual sentinel for image captioning[A]. 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR)[C]. Honolulu, HI, USA: IEEE, 2017. 3242-3250. |
36 | HuangL, WangW M, XiaY X, et al. Adaptively aligned image captioning via adaptive attention time[EB/OL]. , 2019. |
37 | GoelA, FernandoB, NguyenT S, et al. Learning to caption images with two-stream attention and sentence auto-encoder[EB/OL]. , 2019. |
38 | GuoL T, LiuJ, ZhuX X, et al. Normalized and geometry-aware self-attention network for image captioning[A]. 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)[C]. Seattle, WA, USA: IEEE, 2020. 10324-10333. |
39 | PanY W, YaoT, LiY H, et al. X-linear attention networks for image captioning[A]. 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)[C]. Seattle, WA, USA: IEEE, 2020. 10968-10977. |
40 | WuQ, ShenC H, LiuL Q, et al. What value do explicit high level concepts have in vision to language problems? [A]. 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR)[C]. Las Vegas, NV, USA: IEEE, 2016. 203-212. |
41 | SunJ M, LapuschkinS, SamekW, et al. Understanding image captioning models beyond visualizing attention[EB/OL]. , 2020. |
42 | LiuF L, RenX C, LiuY X, et al. simNet: stepwise image-topic merging network for generating detailed and comprehensive image captions[A]. Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing[C]. Stroudsburg, PA, USA: Association for Computational Linguistics, 2018. 137-149. |
43 | RanzatoM, ChopraS, AuliM, et al. Sequence level training with recurrent neural networks[EB/OL]. , 2015. |
44 | LiuS Q, ZhuZ H, YeN, et al. Improved image captioning via policy gradient optimization of SPIDEr[A]. 2017 IEEE International Conference on Computer Vision (ICCV)[C]. Venice, Italy: IEEE, 2017. 873-881. |
45 | RenZ, WangX Y, ZhangN, et al. Deep reinforcement learning-based image captioning with embedding reward[A]. 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR)[C]. Honolulu, HI, USA: IEEE, 2017. 1151-1159. |
46 | SeoP H, SharmaP, LevinboimT, et al. Reinforcing an image caption generator using off-line human feedback[J]. Proceedings of the AAAI Conference on Artificial Intelligence, 2020, 34(3): 2693-2700. |
47 | RennieS J, MarcheretE, MrouehY, et al. Self-critical sequence training for image captioning[A]. 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR)[C]. Honolulu, HI, USA: IEEE, 2017. 1179-1195. |
48 | PasunuruR, BansalM. Reinforced video captioning with entailment rewards[A]. Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing[C]. Stroudsburg, PA, USA: Association for Computational Linguistics, 2017. 979-985. |
49 | DaiB, FidlerS, UrtasunR, et al. Towards diverse and natural image descriptions via a conditional GAN[A]. 2017 IEEE International Conference on Computer Vision (ICCV)[C]. Venice, Italy: IEEE, 2017. 2989-2998. |
50 | ShettyR, RohrbachM, HendricksL A, et al. Speaking the same language: Matching machine to human captions by adversarial training[A]. 2017 IEEE International Conference on Computer Vision (ICCV)[C]. Venice, Italy: IEEE, 2017. 4155-4164. |
51 | LiuB C, SongK P, ZhuY Z, et al. TIME: text and image mutual-translation adversarial networks[EB/OL]. , 2020. |
52 | LiN N, ChenZ Z. Learning compact reward for image captioning[EB/OL]. , 2020. |
53 | ChenH G, ZhangH, ChenP Y, et al. Attacking visual language grounding with adversarial examples: A case study on neural image captioning[A]. Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)[C]. Stroudsburg, PA, USA: Association for Computational Linguistics, 2018. 2587-2597. |
54 | ShekharR, PezzelleS, KlimovichY, et al. FOIL it! Find One mismatch between Image and Language caption[A]. Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)[C]. Stroudsburg, PA, USA: Association for Computational Linguistics, 2017. 255-265. |
55 | DaiB, LinD H. Contrastive learning for image captioning[EB/OL]. , 2017. |
56 | FengY, MaL, LiuW, et al. Unsupervised image captioning[A]. 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)[C]. Long Beach, CA, USA: IEEE, 2019. 4120-4129. |
57 | BhargavaS, ForsythD. Exposing and correcting the gender bias in image captioning datasets and models[EB/OL]. , 2019. |
58 | ShusterK, HumeauS, HuH X, et al. Engaging image captioning via personality[A]. 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)[C]. Long Beach, CA, USA: IEEE, 2019. 12508-12518. |
59 | KimD J, ChoiJ, OhT H, et al. Dense relational captioning: Triple-stream networks for relationship-based captioning[A]. 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)[C]. Long Beach, CA, USA: IEEE, 2019. 6264-6273. |
60 | BitenA F, GomezL, RusiñolM, et al. Good news, everyone! context driven entity-aware captioning for news images[A]. 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)[C]. Long Beach, CA, USA: IEEE, 2019. 12458-12467. |
61 | GuoL T, LiuJ, YaoP, et al. MSCap: multi-style image captioning with unpaired stylized text[A]. 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)[C]. Long Beach, CA, USA: IEEE, 2019. 4199-4208. |
62 | SidorovO, HuR H, RohrbachM, et al. TextCaps: A dataset for image captioning with reading comprehension[A]. Computer Vision - ECCV2020[M]. Cham, Switzerland: Springer International Publishing, 2020. 742-758. |
63 | TranA, MathewsA, XieL X. Transform and tell: Entity-aware news image captioning[EB/OL]. , 2020. |
64 | LevinboimT, ThapliyalA V, SharmaP, et al. Quality estimation for image captions based on large-scale human evaluations[A]. Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies[C]. Stroudsburg, PA, USA: Association for Computational Linguistics, 2021. 3157-3166. |
65 | XieH Y, SherborneT, KuhnleA, et al. Going beneath the surface: Evaluating image captioning for grammaticality, truthfulness and diversity[EB/OL]. , 2019. |