
Collections

Natural Language Processing Technology
  • HUANG Ming-xuan
    Acta Electronica Sinica. 2021, 49(7): 1305-1313. https://doi.org/10.12263/DZXB.20200654

    To address the problems of query topic drift and word mismatch in natural language processing, an algorithm for association pattern mining and rule expansion based on the CSC (Copulas-based Support and Confidence) framework is proposed. Association patterns derived from statistical analysis are fused with word embeddings carrying contextual semantic information, yielding a pseudo-relevance feedback query expansion model that combines association pattern mining with word embedding learning. In this model, rule expansion terms are mined from the pseudo-relevance feedback document set, while word vectors are obtained by training word embeddings on the initial document set. The vector similarity between each rule expansion term and the original query is then computed, and the terms whose similarity is not lower than a threshold are retained as the final expansion terms. Experimental results show that the proposed expansion model effectively alleviates query topic drift and word mismatch and improves retrieval performance. Compared with existing query expansion methods based on association patterns and word embeddings, the average improvement in MAP (Mean Average Precision) achieved by the proposed model reaches 17.52%. The model is particularly effective for short queries, and the proposed mining method can also be applied to other text mining tasks and recommendation systems to improve their performance.
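    The central filtering step described above — keeping only rule expansion terms whose embedding similarity to the original query meets a threshold — can be illustrated with a minimal sketch. The embedding dictionary, threshold value, and helper names below are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def cosine(u, v):
    # Cosine similarity between two dense word vectors.
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def query_vector(query_terms, embeddings):
    # Represent the original query as the mean of its term vectors.
    vecs = [embeddings[t] for t in query_terms if t in embeddings]
    return np.mean(vecs, axis=0)

def filter_expansion_terms(candidates, query_terms, embeddings, threshold=0.5):
    # Keep only candidate rule expansion terms whose similarity to the query
    # is not lower than the threshold (the "final expansion terms").
    q_vec = query_vector(query_terms, embeddings)
    return [t for t in candidates
            if t in embeddings and cosine(embeddings[t], q_vec) >= threshold]

# Toy example with hypothetical 3-dimensional embeddings.
emb = {"retrieval": np.array([0.9, 0.1, 0.0]),
       "search":    np.array([0.8, 0.2, 0.1]),
       "banana":    np.array([0.0, 0.1, 0.9])}
print(filter_expansion_terms(["search", "banana"], ["retrieval"], emb))  # ['search']
```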

  • ZHANG Yu, LIU Kai-feng, ZHANG Quan-xin, WANG Yan-ge, GAO Kai-long
    Acta Electronica Sinica. 2021, 49(6): 1059-1067. https://doi.org/10.12263/DZXB.20200134
    At present, most research on news classification focuses on English, and traditional machine learning methods extract local text-block features incompletely when processing long texts. To address the lack of a dedicated term set for Chinese news classification, a vocabulary suitable for Chinese text classification is built through a data indexing method, and text features are constructed in combination with word2vec pre-trained word vectors. To address incomplete feature extraction, the effects of different convolution and pooling operations on classification results are studied by improving the structure of the classical convolutional neural network model. To improve the precision of Chinese news text classification, this paper proposes and implements a combined-convolution neural network model and designs an effective method for model regularization and optimization. Experimental results show that the combined-convolution neural network model reaches a precision of 93.69% on Chinese news text classification, which is 6.34% and 1.19% higher than the best traditional machine learning method and the classical convolutional neural network model, respectively, and it also outperforms the comparison models in recall and F-measure.
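    A multi-branch convolutional text classifier in the spirit of the description above — several convolution kernels of different widths over pre-trained word vectors, pooled and concatenated — can be sketched as follows. The kernel sizes, dimensions, and class count are assumptions for illustration, not the paper's exact combined-convolution architecture.

```python
import torch
import torch.nn as nn

class CombinedConvTextClassifier(nn.Module):
    # Sketch: parallel 1-D convolutions with different kernel widths over
    # word embeddings (which would be initialized from word2vec), each branch
    # globally max-pooled, then concatenated for classification.
    def __init__(self, vocab_size=30000, embed_dim=300, num_classes=10,
                 kernel_sizes=(2, 3, 4), num_filters=100, dropout=0.5):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim)  # load pre-trained vectors here
        self.convs = nn.ModuleList(
            [nn.Conv1d(embed_dim, num_filters, k) for k in kernel_sizes])
        self.dropout = nn.Dropout(dropout)
        self.fc = nn.Linear(num_filters * len(kernel_sizes), num_classes)

    def forward(self, token_ids):
        # token_ids: (batch, seq_len)
        x = self.embedding(token_ids).transpose(1, 2)        # (batch, embed_dim, seq_len)
        pooled = [torch.relu(conv(x)).max(dim=2).values       # global max pooling per branch
                  for conv in self.convs]
        features = self.dropout(torch.cat(pooled, dim=1))
        return self.fc(features)                               # class logits

model = CombinedConvTextClassifier()
logits = model(torch.randint(0, 30000, (8, 120)))              # batch of 8 news items
print(logits.shape)                                            # torch.Size([8, 10])
```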
  • YANG Qi-meng, YU Long, TIAN Sheng-wei, Aishan Wumaier
    Acta Electronica Sinica. 2020, 48(6): 1077-1083. https://doi.org/10.3969/j.issn.0372-2112.2020.06.005
    Deep neural network models for Uyghur personal pronoun resolution learn semantic information for the current anaphora chain but ignore the long-term effects of individual anaphora chain recognition decisions. This paper proposes a Uyghur personal pronoun anaphora resolution method based on deep reinforcement learning. The method formulates anaphora resolution as a sequential decision process in a reinforcement learning environment and effectively uses antecedent information from the previous state to evaluate the current personal pronoun-candidate antecedent pairs. An overall reward signal is used as the optimization objective, which is more effective than directly using a loss-function heuristic to optimize each individual decision. Experiments on a Uyghur dataset show that the method reaches an F value of 85.80% on the personal pronoun resolution task, indicating that the deep reinforcement learning model can significantly improve resolution performance.
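    The idea of optimizing whole resolution trajectories with an overall reward, rather than each decision separately, can be sketched with a simple REINFORCE-style loop. The pair-feature representation, document structure, and reward function below are placeholders assumed for illustration, not the paper's model.

```python
import torch
import torch.nn as nn

class AntecedentPolicy(nn.Module):
    # Sketch: score pronoun-candidate antecedent pairs; a softmax over the
    # scores gives a stochastic policy over candidate antecedents.
    def __init__(self, feat_dim=32, hidden=64):
        super().__init__()
        self.scorer = nn.Sequential(nn.Linear(feat_dim, hidden), nn.ReLU(),
                                    nn.Linear(hidden, 1))

    def forward(self, pair_feats):                 # (num_candidates, feat_dim)
        return torch.softmax(self.scorer(pair_feats).squeeze(-1), dim=0)

def run_episode(policy, document, optimizer):
    # One document is one episode: choose an antecedent for each pronoun in turn,
    # then reinforce the whole trajectory with a single overall reward
    # (e.g. resolution quality on the document) instead of per-decision losses.
    log_probs, choices = [], []
    for pair_feats in document["pronoun_pair_features"]:   # list of (n_i, feat_dim) tensors
        probs = policy(pair_feats)
        dist = torch.distributions.Categorical(probs)
        action = dist.sample()
        log_probs.append(dist.log_prob(action))
        choices.append(action.item())
    reward = document["reward_fn"](choices)                # overall reward signal (assumed)
    loss = -reward * torch.stack(log_probs).sum()          # REINFORCE objective
    optimizer.zero_grad(); loss.backward(); optimizer.step()
    return reward
```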
  • HUANG Ming-xuan, JIANG Cao-qing
    Acta Electronica Sinica. 2020, 48(3): 568-576. https://doi.org/10.3969/j.issn.0372-2112.2020.03.021
    To ameliorate the long-standing problems of topic drift and word mismatch in natural language processing applications, this paper first proposes a method for computing weighted itemset support and a pruning method based on item weight sorting (IWS). A weighted association rule mining algorithm for query expansion is then presented based on IWS, and expansion models such as rule antecedent and consequent hybrid expansion (RACHE), rule consequent expansion (RCE), and rule antecedent expansion (RAE) are discussed. Finally, a cross-language query expansion (CLQE) algorithm is put forward based on IWS mining. The algorithm uses the new support measure and the pruning method to mine weighted association rules, and extracts high-quality expansion terms from the rules according to the expansion models in order to carry out CLQE. A comparison with existing CLQE algorithms based on weighted association rule mining shows that the proposed algorithm can effectively mitigate query topic drift and word mismatch and can be used in information retrieval across languages to improve retrieval performance. Among the proposed expansion models, RCE achieves the best retrieval performance, while RACHE performs worse than RAE and RCE. The support measure is more effective for the RCE algorithm, whereas confidence enables RAE and RACHE to obtain their best retrieval results. Moreover, the proposed mining method can be used in text mining, business data mining, and recommendation systems to improve their performance.
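    The general shape of weighted itemset mining referred to above can be sketched briefly. The weighted support formula here (average item weight times relative frequency) and the naive enumeration are common simplifications assumed for illustration; the paper's IWS-based support definition and pruning strategy may differ.

```python
from itertools import combinations

def weighted_support(itemset, transactions, weights):
    # Assumed form: (average item weight of the itemset) x (relative frequency).
    freq = sum(1 for t in transactions if itemset <= t) / len(transactions)
    avg_w = sum(weights[i] for i in itemset) / len(itemset)
    return avg_w * freq

def mine_weighted_itemsets(transactions, weights, min_wsup, max_size=3):
    # Naive enumeration of candidate itemsets above a weighted-support threshold;
    # sorting items by weight only hints at IWS pruning, which would additionally
    # discard low-weight extensions early.
    items = sorted(weights, key=weights.get, reverse=True)
    frequent = {}
    for size in range(1, max_size + 1):
        for cand in combinations(items, size):
            s = frozenset(cand)
            wsup = weighted_support(s, transactions, weights)
            if wsup >= min_wsup:
                frequent[s] = wsup
    return frequent

transactions = [frozenset(t) for t in
                [{"query", "expansion"}, {"query", "retrieval"},
                 {"query", "expansion", "retrieval"}]]
weights = {"query": 0.9, "expansion": 0.7, "retrieval": 0.6}
print(mine_weighted_itemsets(transactions, weights, min_wsup=0.4))
```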
  • WU Yu-jia, LI Jing, SONG Cheng-fang, CHANG Jun
    Acta Electronica Sinica. 2020, 48(2): 279-284. https://doi.org/10.3969/j.issn.0372-2112.2020.02.008
    Existing deep learning based text classification methods do not consider the importance of text features or the associations between them, although such associations can affect classification accuracy. To address this problem, this study proposes a text classification framework based on high utility neural networks (HUNN), which can effectively mine the importance of text features and the associations between them. Mining high utility itemsets (MHUI) from databases is an emerging topic in data mining; it can mine the importance and the co-occurrence frequency of each feature in a dataset, and the co-occurrence frequency reflects the associations between text features. MHUI is used as the mining layer of HUNN to extract, for each class, text features with strong importance and strong associations, which are then fed into the neural network. High-level features with strong categorical representation ability are subsequently obtained through the convolution layer, improving classification accuracy. Experimental results show that the proposed model performs significantly better on six public datasets than convolutional neural networks (CNN), recurrent neural networks (RNN), recurrent convolutional neural networks (RCNN), the fast text classifier (FAST), and hierarchical attention networks (HAN).
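    The utility computation that underlies high utility itemset mining can be illustrated in a few lines. The formula follows the standard HUIM definition (per-transaction item count times external utility, summed over supporting transactions); restricting selection to single features and the threshold value are simplifying assumptions, not the HUNN mining layer itself.

```python
def itemset_utility(itemset, transactions, external_utility):
    # Standard HUIM utility: for every transaction containing the itemset, add
    # (count of each item in the transaction) x (that item's external utility).
    total = 0
    for trans in transactions:                  # trans: {item: count}
        if itemset <= trans.keys():
            total += sum(trans[i] * external_utility[i] for i in itemset)
    return total

def select_high_utility_features(transactions, external_utility, min_util):
    # Keep single features whose utility meets the threshold; in a HUNN-style
    # pipeline these would feed the network's input layer (1-itemsets only here).
    items = {i for t in transactions for i in t}
    return {i for i in items
            if itemset_utility({i}, transactions, external_utility) >= min_util}

docs = [{"market": 2, "stock": 1}, {"market": 1, "goal": 3}, {"stock": 2, "market": 1}]
importance = {"market": 0.9, "stock": 0.8, "goal": 0.4}   # external utility, e.g. term importance
print(select_high_utility_features(docs, importance, min_util=2.0))  # {'market', 'stock'}
```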
  • ZHANG Zhi-chang, ZENG Yang-yang, PANG Ya-li
    Acta Electronica Sinica. 2020, 48(11): 2162-2169. https://doi.org/10.3969/j.issn.0372-2112.2020.11.010
    Recognizing textual entailment aims to infer the logical relationship between two given sentences. In this paper, we incorporate deep semantic information of sentences into the Transformer encoder by constructing an SRL-Attention fusion module, which effectively improves the ability of the self-attention mechanism to capture sentence semantics. Furthermore, to cope with the small scale and high noise of the dataset, we use a large-scale pre-trained language model to improve recognition performance on the small-scale dataset. Experimental results show that our model reaches an accuracy of 80.28% on CNLI, the Chinese textual entailment recognition evaluation corpus released at the 17th China National Conference on Computational Linguistics.
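    One generic way to inject semantic role information into a Transformer encoder is to fuse per-token semantic role label (SRL) embeddings with token embeddings before self-attention. The sketch below assumes a learned linear fusion and small dimensions for illustration; it is not the paper's SRL-Attention module.

```python
import torch
import torch.nn as nn

class SRLFusedEncoder(nn.Module):
    # Sketch: concatenate token and SRL embeddings, project back to d_model,
    # and feed the fused representation to a standard Transformer encoder.
    def __init__(self, vocab_size=21128, num_roles=30, d_model=256, nhead=8):
        super().__init__()
        self.tok_embed = nn.Embedding(vocab_size, d_model)
        self.srl_embed = nn.Embedding(num_roles, d_model)
        self.fuse = nn.Linear(2 * d_model, d_model)           # learned fusion (assumed form)
        layer = nn.TransformerEncoderLayer(d_model, nhead, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)

    def forward(self, token_ids, srl_ids):
        # token_ids, srl_ids: (batch, seq_len); srl_ids are per-token role labels.
        fused = self.fuse(torch.cat([self.tok_embed(token_ids),
                                     self.srl_embed(srl_ids)], dim=-1))
        return self.encoder(fused)                             # (batch, seq_len, d_model)

model = SRLFusedEncoder()
out = model(torch.randint(0, 21128, (4, 32)), torch.randint(0, 30, (4, 32)))
print(out.shape)                                               # torch.Size([4, 32, 256])
```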
  • MA Hui-fang, LIU Wen, LI Zhi-xin, LIN Xiang-hong
    Acta Electronica Sinica. 2019, 47(6): 1331-1336. https://doi.org/10.3969/j.issn.0372-2112.2019.06.021
    Text similarity measures play a vital role in text-related applications such as social networks, text mining, and natural language processing. Short texts are typically severely sparse and high-dimensional, while traditional short text similarity calculation tends to ignore category information. A coupled distance discrimination and strong classification features based approach for short text similarity calculation, CDDCF, is presented. On the one hand, the co-occurrence distance between terms within each text is used to determine co-occurrence distance correlation, from which the weight of each term is derived and the intra- and inter-relations between words are established; this captures the coupled distance discrimination similarity of short texts. On the other hand, strong classification features are extracted from labeled texts, and the similarity between two short texts is measured by the number of shared strong discrimination features occurring in the same context. Finally, distance discrimination and strong classification features are unified into a joint framework to measure short text similarity. Experimental results show that CDDCF outperforms baseline algorithms in both the quality and the efficiency of similarity computation.
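    A toy illustration of deriving term weights from within-text co-occurrence distances is given below. The particular weighting (inverse mean token distance within a window) and the weighted-overlap similarity are assumptions made for illustration, not CDDCF's exact definitions.

```python
from collections import defaultdict

def cooccurrence_distance_weights(tokens, window=5):
    # Assumed scheme: terms that co-occur closer to other terms (small average
    # token distance within a window) receive higher weights.
    dist_sum = defaultdict(float)
    dist_cnt = defaultdict(int)
    for i, w1 in enumerate(tokens):
        for j in range(i + 1, min(i + 1 + window, len(tokens))):
            d = j - i
            for term in (w1, tokens[j]):
                dist_sum[term] += d
                dist_cnt[term] += 1
    return {t: dist_cnt[t] / dist_sum[t] for t in dist_cnt}   # inverse mean distance

def short_text_similarity(tokens_a, tokens_b, window=5):
    # Weighted overlap between two short texts using the distance-based weights.
    wa = cooccurrence_distance_weights(tokens_a, window)
    wb = cooccurrence_distance_weights(tokens_b, window)
    return sum(min(wa[t], wb[t]) for t in set(wa) & set(wb))

print(short_text_similarity("cheap flight ticket deal".split(),
                            "flight ticket discount".split()))
```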
  • Lü Pin, YU Wen-bing, WANG Xin, JI Chun-lei, ZHOU Xi-min
    Acta Electronica Sinica. 2019, 47(10): 2228-2234. https://doi.org/10.3969/j.issn.0372-2112.2019.10.026
    Toxic comment detection is important for preventing the negative impact of social media platforms on users, and it is one of the significant fields of natural language processing. To address the unstable accuracy of individual classifiers and the low accuracy of boosting ensemble models in toxic comment detection, a stacked generalization method with heterogeneous classifiers is proposed. The method transforms the multi-label toxic comment classification problem into binary classification using deep recurrent neural networks, which keeps the model accuracy stable. A GRU (Gated Recurrent Unit) network and NB-SVM (Naïve Bayes-Support Vector Machine) are used as individual classifiers in the stacked generalization so as to exploit their differences in model structure and classification bias, with the goal of improving overall accuracy. Experimental results on the Wikipedia toxic comment dataset show that the proposed method outperforms the boosting ensemble, indicating that stacked generalization of heterogeneous classifiers is feasible and effective for toxic comment detection.
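    The stacked generalization pattern itself is straightforward to sketch: out-of-fold predictions from heterogeneous base classifiers become the features of a meta-learner. The base models below (TF-IDF logistic regression and multinomial naive Bayes) are simple stand-ins assumed for illustration in place of the paper's GRU and NB-SVM, and the toy corpus is invented.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict
from sklearn.naive_bayes import MultinomialNB

# Toy corpus; the real data would be Wikipedia toxic comments.
texts = ["you are awful", "great point, thanks", "shut up idiot",
         "nice explanation", "what a stupid take", "I appreciate the help"]
labels = np.array([1, 0, 1, 0, 1, 0])            # 1 = toxic, 0 = non-toxic

X = TfidfVectorizer().fit_transform(texts)

# Level-0: two heterogeneous base classifiers (stand-ins for GRU and NB-SVM).
base_models = [LogisticRegression(max_iter=1000), MultinomialNB()]

# Out-of-fold probability predictions become level-1 meta-features,
# which avoids leaking training labels into the meta-learner.
meta_features = np.column_stack([
    cross_val_predict(m, X, labels, cv=3, method="predict_proba")[:, 1]
    for m in base_models])

# Level-1 meta-learner combines the base predictions (stacked generalization).
meta_learner = LogisticRegression().fit(meta_features, labels)
print(meta_learner.predict(meta_features))
```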