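1. College of Biomedical Engineering and Instrument Science, Zhejiang University, Hangzhou, Zhejiang 310027
2. College of Teacher Education, Zhejiang Normal University, Jinhua, Zhejiang 321004
3. Zhejiang University-University of Illinois at Urbana-Champaign Institute, Haining, Zhejiang 314499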
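WANG Hai-bo (corresponding author), male, born in July 1995 in Hefei, Anhui. Master's student at the College of Biomedical Engineering and Instrument Science, Zhejiang University. His main research interests are natural language processing, machine translation, and artificial intelligence. E-mail: wanghaibo111@zju.edu.cn
YU Li-li, female, born in October 1994 in Hefei, Anhui. Doctoral student at the College of Teacher Education, Zhejiang Normal University. Her main research interests are linguistics, multilingual translation, and natural language processing. E-mail: yulili@zjnu.edu.cn
WANG Hong-wei, male, born in November 1981 in Qiqihar, Heilongjiang. He received his Ph.D. from the University of Cambridge, UK, and is now a tenured professor and doctoral supervisor at the Zhejiang University-University of Illinois at Urbana-Champaign Institute. His main research interests are artificial intelligence, knowledge graphs, industrial big data, and fault diagnosis. E-mail: hongweiwang@zju.edu.cn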
Published online: 2023-04-07
WANG Hai-bo, YU Li-li, WANG Hong-wei. Research on the Evaluation of Token Imbalance Degree in NMT Corpus[J/OL]. Acta Electronica Sinica, 2023, 1-10.
Token imbalance is a common phenomenon in neural machine translation (NMT) corpora. Evaluating the token imbalance degree of an NMT corpus is important for improving both corpus quality and translation performance. To address the shortcomings of existing token imbalance measurements in their algorithms and in their tokenization scope, this paper proposes the Dispersion of Token Distribution (DTD) algorithm for computing the token imbalance degree and widens the tokenization scope, evaluating corpora at three granularities: character, subword, and word. Experimental results show that the proposed algorithm improves considerably on previous work in accuracy, validity, and robustness. The token imbalance degree of a corpus differs greatly across tokenization granularities: it is highest at character granularity, lower at subword granularity, and lowest at word granularity.
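The abstract does not give the DTD formula, so it cannot be reproduced here. As a rough illustration of what "measuring token imbalance at different granularities" means, the sketch below counts tokens at character and word granularity and scores each distribution with a stand-in measure, one minus normalized Shannon entropy. The function names `token_counts` and `imbalance`, the toy corpus, and the choice of measure are all assumptions made for illustration and are not taken from the paper.

```python
import math
from collections import Counter


def token_counts(sentences, granularity):
    """Count tokens at 'char' or 'word' granularity (whitespace-split words)."""
    counts = Counter()
    for sentence in sentences:
        if granularity == "char":
            tokens = [ch for ch in sentence if not ch.isspace()]
        elif granularity == "word":
            tokens = sentence.split()
        else:
            raise ValueError(f"unsupported granularity: {granularity}")
        counts.update(tokens)
    return counts


def imbalance(counts):
    """Stand-in imbalance score: 1 - H(p) / log(V), in the range [0, 1].

    0 means every token type is equally frequent; values near 1 mean a few
    types dominate. This is NOT the paper's DTD formula.
    """
    total = sum(counts.values())
    vocab = len(counts)
    if vocab <= 1:
        return 0.0
    entropy = -sum((c / total) * math.log(c / total) for c in counts.values())
    return 1.0 - entropy / math.log(vocab)


# Hypothetical toy corpus, used only to exercise the functions.
corpus = ["the cat sat on the mat", "the dog sat on the log"]
for granularity in ("char", "word"):
    score = imbalance(token_counts(corpus, granularity))
    print(granularity, round(score, 4))
```

Subword granularity is left out because it needs a trained segmenter (for example BPE); once the sentences are segmented, the resulting tokens can be counted and scored in the same way. Scores on a toy corpus like this one say nothing about the ordering reported in the abstract.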