Acta Electronica Sinica ›› 2020, Vol. 48 ›› Issue (2): 279-284. DOI: 10.3969/j.issn.0372-2112.2020.02.008

Special topics: Natural Language Processing Technology; Natural Language Processing: Technology and Applications

• Research Paper •

High Utility Neural Networks for Text Classification

WU Yu-jia, LI Jing, SONG Cheng-fang, CHANG Jun

  1. School of Computer Science, Wuhan University, Wuhan, Hubei 430072, China
  • Received: 2018-10-08 Revised: 2019-10-22 Online: 2020-02-25 Published: 2020-02-25
    • Corresponding author:
    • LI Jing
    • About the author:
    • WU Yu-jia, male, born in November 1986 in Guangshui, Hubei. Ph.D. candidate in computer application technology at Wuhan University. Research interests: data mining, natural language processing, deep learning. E-mail: wuyujia@whu.edu.cn
    • Supported by:
    • National Key Basic Research Program of China (973 Program) (No.2012CB719905); National Natural Science Foundation of China (No.41201404); Fundamental Research Funds for the Central Universities (No.2042015gf0009)

Abstract: Existing deep-learning-based text classification methods do not consider the importance of text features or the associations between them, which reduces classification accuracy. To address this problem, this paper proposes a text classification model based on high utility neural networks (HUNN) that can effectively represent the importance of text features and their associations. A mining high utility itemsets (MHUI) algorithm is used to obtain the importance and the co-occurrence frequency of each feature in the dataset; the co-occurrence frequency reflects, to some extent, the associations between features. MHUI serves as the mining layer of HUNN and mines, for each class, the text features with high importance and strong association. These features are then used as the input to the neural network, and a convolutional layer further refines them into higher-level text features with stronger class-discriminative power, which improves classification accuracy. In experiments on six public benchmark datasets, the proposed model outperformed five baselines: convolutional neural networks (CNN), recurrent neural networks (RNN), recurrent convolutional neural networks (RCNN), fast text classifier (FAST), and hierarchical attention networks (HAN).

Key words: data mining, association rule, high utility itemset, natural language processing, text classification, neural networks

CLC Number:
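
The abstract describes the mining layer only at a high level, so the following is a minimal sketch of the general idea of high utility itemset mining applied to text, not the authors' implementation. It assumes the standard utility definition from the itemset mining literature: the utility of a term set in a document is the sum of count × weight over its terms, counted only when every term of the set occurs in that document. The function names, the toy documents, the per-term weights, and the brute-force enumeration are all hypothetical choices made for illustration.

```python
from collections import Counter
from itertools import combinations


def itemset_utility(itemset, docs, weights):
    """Sum count * weight over the itemset's terms, for documents containing all of them."""
    total = 0.0
    for doc in docs:
        counts = Counter(doc)
        if all(term in counts for term in itemset):
            total += sum(counts[term] * weights[term] for term in itemset)
    return total


def high_utility_itemsets(docs, weights, min_util, max_size=2):
    """Brute-force search for term sets whose utility reaches min_util (toy scale only)."""
    vocab = sorted({term for doc in docs for term in doc})
    result = {}
    for size in range(1, max_size + 1):
        for itemset in combinations(vocab, size):
            utility = itemset_utility(itemset, docs, weights)
            if utility >= min_util:
                result[itemset] = utility
    return result


# Hypothetical toy documents, all taken from a single class.
docs = [
    ["goal", "match", "goal", "team"],
    ["team", "match", "coach"],
    ["goal", "team"],
]
# Hypothetical per-term importance weights (assumed, not from the paper).
weights = {"goal": 3.0, "match": 1.0, "team": 2.0, "coach": 1.0}

selected = high_utility_itemsets(docs, weights, min_util=10.0)
print(selected)  # {('goal', 'team'): 13.0}
```

In this toy run neither "goal" (utility 9) nor "team" (utility 6) reaches the threshold alone, but the co-occurring pair does (utility 13). This is the sense in which utility combines term importance with co-occurrence. In a pipeline like the one the abstract outlines, the selected terms would then be passed on as input features to the convolutional layer.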