Combining Coupled Distance Discrimination and Strong Classification Features for Short Text Similarity Calculation
MA Hui-fang (1,2,3), LIU Wen (1), LI Zhi-xin (3), LIN Xiang-hong (1)
1. College of Computer Science and Engineering, Northwest Normal University, Lanzhou, Gansu 730000, China;
2. Guangxi Key Laboratory of Trusted Software, Guilin University of Electronic Technology, Guilin, Guangxi 541004, China;
3. Guangxi Key Lab of Multi-source Information Mining and Security, Guangxi Normal University, Guilin, Guangxi 541004, China
Abstract: Text similarity measures play a vital role in text-related applications such as social networks, text mining, and natural language processing. Short texts are severely sparse and high-dimensional, and traditional short text similarity calculations typically ignore category information. This paper presents CDDCF, an approach to short text similarity calculation that combines coupled distance discrimination with strong classification features. On the one hand, the co-occurrence distance between terms in each text is used to determine the co-occurrence distance correlation, from which the weight of each term is derived and the intra- and inter-relations between words are established; this captures the coupled distance discrimination similarity between short texts. On the other hand, strong classification features are extracted from labeled texts, and the similarity between two short texts is measured by the number of strong classification features they share in the same context. Finally, distance discrimination and strong classification features are unified in a joint framework to measure the similarity of short texts. Experimental results show that CDDCF outperforms the baseline algorithms in terms of both performance and efficiency of similarity computation.
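The abstract does not give the formulas, so the following is only a minimal Python sketch of the two-part idea under stated assumptions, not the CDDCF definitions themselves. Assumed for illustration: term correlation is the inverse of the positional distance between two terms, a term is a strong classification feature when at least a `purity` fraction of the labeled texts containing it share one class, shared strong features are scored with a Jaccard ratio, and the two similarities are mixed linearly with a hypothetical parameter `alpha`. All function names are hypothetical.

```python
from collections import defaultdict
from itertools import combinations
from math import sqrt


def term_weights(tokens):
    """Weight each term by its summed inverse co-occurrence distance (assumed form)."""
    weights = defaultdict(float)
    for (i, a), (j, b) in combinations(enumerate(tokens), 2):
        if a != b:
            closeness = 1.0 / (j - i)  # assumption: nearer terms are more related
            weights[a] += closeness
            weights[b] += closeness
    total = sum(weights.values())
    return {t: w / total for t, w in weights.items()} if total else {}


def cosine(u, v):
    """Cosine similarity between two sparse term-weight vectors."""
    dot = sum(u[t] * v[t] for t in u.keys() & v.keys())
    norm = sqrt(sum(x * x for x in u.values())) * sqrt(sum(x * x for x in v.values()))
    return dot / norm if norm else 0.0


def strong_features(labeled_texts, purity=0.8):
    """Map term -> class for terms that occur mostly in one class (assumed criterion)."""
    counts = defaultdict(lambda: defaultdict(int))
    for tokens, label in labeled_texts:
        for t in set(tokens):
            counts[t][label] += 1
    strong = {}
    for t, by_class in counts.items():
        label, top = max(by_class.items(), key=lambda kv: kv[1])
        if top / sum(by_class.values()) >= purity:
            strong[t] = label
    return strong


def strong_similarity(a, b, strong):
    """Jaccard ratio over the strong features of the two texts (assumed scoring)."""
    sa = {t for t in a if t in strong}
    sb = {t for t in b if t in strong}
    union = sa | sb
    return len(sa & sb) / len(union) if union else 0.0


def similarity(a, b, strong, alpha=0.5):
    """Linear mix of the distance-based and strong-feature similarities (assumed)."""
    distance_part = cosine(term_weights(a), term_weights(b))
    return alpha * distance_part + (1 - alpha) * strong_similarity(a, b, strong)


# Toy labeled corpus, invented purely for illustration.
labeled = [
    (["neural", "network", "training"], "ml"),
    (["neural", "model", "training"], "ml"),
    (["stock", "market", "price"], "finance"),
    (["market", "price", "index"], "finance"),
]
strong = strong_features(labeled)
score = similarity(["neural", "network", "training"],
                   ["neural", "model", "training"], strong)
```

In this sketch `alpha` trades off the unsupervised distance signal against the supervised category signal; how the paper actually weights the two parts is not stated in the abstract.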
MA Hui-fang, LIU Wen, LI Zhi-xin, LIN Xiang-hong. Combining Coupled Distance Discrimination and Strong Classification Features for Short Text Similarity Calculation[J]. Acta Electronica Sinica, 2019, 47(6): 1331-1336. (in Chinese)