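1. College of Biomedical Engineering and Instrument Science, Zhejiang University, Hangzhou, Zhejiang 310027
2. College of Teacher Education, Zhejiang Normal University, Jinhua, Zhejiang 321004
3. Zhejiang University-University of Illinois at Urbana-Champaign Institute, Haining, Zhejiang 314499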
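WANG Hai-bo, male, born in 1995, a native of Hefei, Anhui. Master's student at the College of Biomedical Engineering and Instrument Science, Zhejiang University. His main research interests are natural language processing, machine translation, and artificial intelligence. E-mail: wanghaibo111@zju.edu.cn
YU Li-li, female, born in 1994, a native of Hefei, Anhui. Doctoral student at the College of Teacher Education, Zhejiang Normal University. Her main research interests are linguistics, multilingual translation, and natural language processing. E-mail: yulili@zjnu.edu.cn
WANG Hong-wei (corresponding author), male, born in 1981, a native of Qiqihar, Heilongjiang. He received his Ph.D. from the University of Cambridge, UK, and is currently a tenured professor and doctoral supervisor at the Zhejiang University-University of Illinois at Urbana-Champaign Institute. His main research interests are artificial intelligence, knowledge graphs, industrial big data, and fault diagnosis. E-mail: hongweiwang@zju.edu.cn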
Received: 2021-10-09
Revised: 2022-05-10
Published in print: 2023-10-25
WANG Hai-bo, YU Li-li, WANG Hong-wei. Research on Evaluation of Token Imbalance Degree in NMT Corpus[J]. ACTA ELECTRONICA SINICA, 2023, 51(10): 2884-2893. DOI: 10.12263/DZXB.20211369.
Token imbalance is pervasive in neural machine translation (NMT) corpora, and measuring its degree matters for improving both corpus quality and translation performance. To address the shortcomings of existing token-imbalance measures, both in the algorithm and in the range of segmentation granularities they cover, this paper proposes the Dispersion of Token Distribution (DTD) algorithm for computing the token imbalance degree, and extends the evaluation to three segmentation granularities: character, subword, and word. Experimental results show that the proposed algorithm improves substantially on previous work in accuracy, validity, and robustness. The token imbalance degree of a corpus varies greatly with segmentation granularity: it is highest at the character level, lower at the subword level, and lowest at the word level.
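As a rough illustration of what measuring token imbalance at different granularities involves, the sketch below computes a generic imbalance score for character and word tokens: half the L1 distance between the empirical token distribution and a uniform distribution over the observed vocabulary. This is not the DTD algorithm, whose definition is not given in the abstract. The function names `imbalance`, `char_tokens`, and `word_tokens` and the sample sentence are hypothetical, and subword segmentation is omitted because it requires a trained segmenter.

```python
from collections import Counter


def imbalance(tokens):
    """Half the L1 distance between the empirical token distribution
    and a uniform distribution over the observed token types.
    Illustrative stand-in only; NOT the paper's DTD algorithm."""
    counts = Counter(tokens)
    total = sum(counts.values())
    if total == 0:
        return 0.0  # empty input: treat as perfectly balanced
    k = len(counts)  # number of distinct token types
    return 0.5 * sum(abs(c / total - 1.0 / k) for c in counts.values())


def char_tokens(text):
    """Character-granularity tokens, ignoring whitespace."""
    return [ch for ch in text if not ch.isspace()]


def word_tokens(text):
    """Word-granularity tokens, split on whitespace."""
    return text.split()


if __name__ == "__main__":
    sample = "the cat sat on the mat and the dog sat on the log"
    print("char:", imbalance(char_tokens(sample)))
    print("word:", imbalance(word_tokens(sample)))
```

Under this stand-in measure, a score of 0 means every observed token type is equally frequent, and larger values indicate that a few types account for most occurrences. Results on a toy sentence say nothing about the ordering reported for real corpora.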