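1. School of Information Science and Engineering, Yanshan University, Qinhuangdao, Hebei 066004
2. Hebei Key Laboratory of Virtual Technology and System Integration, Qinhuangdao, Hebei 066004
3. Hebei Construction Material Vocational and Technical College, Qinhuangdao, Hebei 066000
4. Research Center for Big Data and Social Computing, Hebei University of Science and Technology, Shijiazhuang, Hebei 050018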
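MENG Wei-lun (孟伟伦), male, was born in April 1998 in Hengshui, Hebei Province. He is a Ph.D. student at the School of Information Science and Engineering, Yanshan University. His main research interest is natural language processing.
GUO Jing-feng (郭景峰), male, was born in February 1962 in Harbin, Heilongjiang Province. He is a professor and doctoral supervisor in the Department of Computer Science, School of Information Science and Engineering, Yanshan University, and has published more than 150 academic papers in China and abroad. Email: jfguo@ysu.edu.cn
XING Ke-xuan (邢珂萱), female, was born in December 1998 in Daqing, Heilongjiang Province. She is a graduate student at the School of Information Science and Engineering, Yanshan University. Her main research interest is natural language processing.
WEI Ning (魏宁), male, was born in January 1995 in Zaozhuang, Shandong Province. He is a Ph.D. student at the School of Information Science and Engineering, Yanshan University. His main research interests are recommender systems and data mining.
WANG Qiao-suo (王巧梭), female, was born in May 1965 in Shijiazhuang, Hebei Province. She is a senior engineer and senior laboratory technician at Hebei Construction Material Vocational and Technical College. Her main research interests are computational data analysis and basic applications.
LIU Bin (刘滨), male, was born in November 1975 in Shijiazhuang, Hebei Province. He is a professor and master's supervisor at the Research Center for Big Data and Social Computing, Hebei University of Science and Technology, and has published more than 100 academic papers in China and abroad.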
Received: 2023-06-08,
Revised: 2024-01-17,
Published in print: 2024-06-25
MENG Wei-lun, GUO Jing-feng, XING Ke-xuan, et al. A Chinese Medical Named Entity Recognition Method Based on Glyph Features[J]. Acta Electronica Sinica, 2024, 52(06): 1945-1954. DOI:10.12263/DZXB.20230516
As the first key step in medical information extraction, medical named entity recognition aims to extract medical entities from unstructured text such as electronic medical records and Chinese drug package inserts. Most current work on Chinese medical named entity recognition obtains text representation vectors by fine-tuning pre-trained models and then uses feature engineering to improve performance in the medical domain. Most of these models derive from models that perform well on general-purpose datasets and do not take the linguistic characteristics of Chinese medical datasets into account. A statistical analysis of multiple medical datasets shows that some types of medical entities share glyph-level traits: for example, most Chinese characters denoting diseases contain “疒”, and most characters denoting body organs contain “月”. To address these problems, this paper proposes a Chinese medical named entity recognition method based on glyph features. The method improves the accuracy and generalization ability of the model by fusing glyph vectors into the text representation vectors and by further exploiting the negative samples in the dataset. Experimental results on multiple public Chinese medical datasets show that the method outperforms the other models compared, and ablation experiments show that both fusing glyph features and learning from negative samples are effective for this task.
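The kind of glyph statistic mentioned in the abstract can be illustrated with a minimal Python sketch. It is not the authors' analysis code: the character-to-radical table `RADICAL_OF`, the function `radical_share`, the entity type labels, and the toy entity list are all hypothetical and exist only to show how the share of characters carrying a given radical could be counted per entity type.

```python
from collections import Counter, defaultdict

# Hypothetical, hand-written character -> radical table (illustration only).
# A real analysis would need a complete character decomposition dictionary.
RADICAL_OF = {
    "病": "疒", "疼": "疒", "痛": "疒", "癌": "疒", "症": "疒",
    "肝": "月", "肺": "月", "胃": "月", "肾": "月", "脑": "月",
}


def radical_share(entities):
    """Return {entity_type: {radical: share of that type's characters}}.

    `entities` is an iterable of (text, entity_type) pairs. Characters that
    are missing from RADICAL_OF are counted under the key None.
    """
    counts = defaultdict(Counter)
    for text, etype in entities:
        for ch in text:
            counts[etype][RADICAL_OF.get(ch)] += 1

    shares = {}
    for etype, counter in counts.items():
        total = sum(counter.values())
        shares[etype] = {rad: n / total for rad, n in counter.items()}
    return shares


# Toy annotated entities, made up for this sketch (not from the paper's datasets).
toy_entities = [
    ("癌症", "disease"),
    ("肝病", "disease"),
    ("肺", "body"),
    ("肾", "body"),
]

shares = radical_share(toy_entities)
for etype, dist in shares.items():
    print(etype, dist)
```

On a real corpus, a table like this would be computed over all annotated entity spans, and a radical that is strongly concentrated in one entity type would be a candidate signal for a glyph feature.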