Acta Electronica Sinica ›› 2021, Vol. 49 ›› Issue (10): 2020-2031. DOI: 10.12263/DZXB.20201191

• Research Paper •

Research on Automatic Annotation Algorithm for Character-level Oracle-Bone Images Based on Anchor Points

Xian-jin SHI1,2, Shuang CAO1, Chong-sheng ZHANG1, Yue-feng TAO1, Ling-ling LÜ3, Xia-jiong SHEN1

  1. School of Computer and Information Engineering, Laboratory of the Yellow River Cultural Heritage, Henan University, Kaifeng, Henan 475004, China
  2. Henan Electrochemical Education Center, Zhengzhou, Henan 450004, China
  3. School of Electric Power, North China University of Water Conservancy and Hydropower, Zhengzhou, Henan 450045, China
  • Received: 2020-10-26 Revised: 2021-09-29 Online: 2021-10-25 Published: 2021-10-25
  • About the authors: SHI Xian-jin, male, born in December 1973 in Shangshui, Henan. He is currently a Ph.D. candidate at Henan University and a senior engineer. His main research interests are computational Oracle-Bone studies and educational big data analysis. E-mail: shixj@henu.edu.cn
    CAO Shuang, female, born in February 1993 in Shangqiu, Henan. She received her master's degree from Henan University in 2021. Her main research interests are generative adversarial networks, imbalanced learning, and computational Oracle-Bone studies. E-mail: scao@henu.edu.cn
    ZHANG Chong-sheng (corresponding author), male, born in September 1982 in Nanyang, Henan. He is currently a professor and doctoral supervisor at Henan University. His main research interests are big data analysis and deep learning. E-mail: chongsheng.zhang@yahoo.com

Abstract:

Oracle-Bone inscriptions are the earliest systematic Chinese writing and the earliest mature Chinese characters known today. Their study is of great significance for historical research and cultural heritage. With current technology, however, character-level annotation of Oracle-Bone character images can only be done manually by senior Oracle-Bone experts, which is labor-intensive and inefficient. To address this problem, and building on the Oracle-Bone character image recognition model from previous work, this paper proposes an automatic annotation algorithm for Oracle-Bone character images. Following a "divide into columns first, then segment" strategy, the algorithm first assigns each character image on an Oracle-Bone rubbing to a specific column. It then takes anchor Oracle-Bone characters as reference points and uses spatial neighbor relations to find the character image that corresponds to each character in the original inscription text, thereby annotating the character images automatically. The annotated images are then added to the sample dataset, and the recognition model is retrained on the augmented dataset (a 6- to 10-fold increase), which helps improve the accuracy of deep-learning-based Oracle-Bone character recognition. The number of samples is thus increased substantially at low cost, saving experts considerable time and effort.

Key words: Oracle-Bone inscriptions, image annotation, data augmentation, anchor point, spatial neighbor, pattern recognition
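
The two steps named in the abstract (column assignment, then anchor-based neighbor matching) can be illustrated with a minimal sketch. This is not the paper's implementation: the abstract does not give the actual column-splitting rule or neighbor criterion, so the names `Box`, `group_into_columns`, `label_column`, and the threshold `max_dx` are hypothetical. The sketch further assumes vertical columns, one anchor per column whose character and position in the text are already known from a recognizer, and no missed or spurious detections.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class Box:
    """Bounding box of one detected character image, in pixels (hypothetical type)."""
    x: float  # left edge
    y: float  # top edge
    w: float
    h: float

    @property
    def cx(self) -> float:
        return self.x + self.w / 2

    @property
    def cy(self) -> float:
        return self.y + self.h / 2


def group_into_columns(boxes: List[Box], max_dx: float) -> List[List[int]]:
    """Step 1 (assumed rule): cluster boxes into columns by horizontal centre.

    Boxes are visited in order of x-centre; a box joins the current column if
    its x-centre is within max_dx of that column's mean x-centre, otherwise it
    starts a new column. Each column is returned as box indices, top to bottom.
    """
    order = sorted(range(len(boxes)), key=lambda i: boxes[i].cx)
    columns: List[List[int]] = []
    for i in order:
        if columns:
            current = columns[-1]
            mean_cx = sum(boxes[j].cx for j in current) / len(current)
            if abs(boxes[i].cx - mean_cx) <= max_dx:
                current.append(i)
                continue
        columns.append([i])
    for col in columns:
        col.sort(key=lambda i: boxes[i].cy)
    return columns


def label_column(column: List[int], text: str,
                 anchor_box: int, anchor_pos: int) -> Dict[int, str]:
    """Step 2 (assumed rule): propagate labels outward from one anchor.

    column     -- box indices of one column, top to bottom
    text       -- transcribed characters for that column, in reading order
    anchor_box -- index of the box already identified by the recognizer
    anchor_pos -- position of the anchor character within text
    A box k places below the anchor receives text[anchor_pos + k]; boxes whose
    offset falls outside the text are left unlabeled.
    """
    anchor_rank = column.index(anchor_box)
    labels: Dict[int, str] = {}
    for rank, box_id in enumerate(column):
        pos = anchor_pos + (rank - anchor_rank)
        if 0 <= pos < len(text):
            labels[box_id] = text[pos]
    return labels


# Toy layout: two columns; placeholder letters stand in for Oracle-Bone characters.
boxes = [Box(95, 10, 20, 20), Box(12, 15, 20, 20), Box(98, 60, 20, 20),
         Box(10, 65, 20, 20), Box(96, 110, 20, 20)]
columns = group_into_columns(boxes, max_dx=15)
print(columns)  # [[1, 3], [0, 2, 4]]
print(label_column(columns[1], "ABC", anchor_box=2, anchor_pos=1))
# {0: 'A', 2: 'B', 4: 'C'}
```

In this toy run the box with index 2 plays the role of the anchor: once its character and text position are fixed, its upper and lower neighbors in the same column inherit the adjacent characters of the text. A real rubbing would also need handling for damaged or missing characters, which this sketch deliberately leaves out.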

CLC number: