Keyphrase Extraction Using Sequential Patterns Mining Algorithm with One-Off and General Gaps Condition
LIU Hui-ting1,2, LIU Zhi-zhong1,2, WANG Li-li1,2, WU Xin-dong3,4
1. Key Laboratory of Intelligent Computing and Signal Processing of the Ministry of Education, Anhui University, Hefei, Anhui 230601, China;
2. School of Computer Science and Technology, Anhui University, Hefei, Anhui 230601, China;
3. School of Computer Science and Information Engineering, Hefei University of Technology, Hefei, Anhui 230601, China;
4. School of Computing and Informatics, University of Louisiana at Lafayette, Lafayette 70503, USA
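Summary: This paper proposes a supervised keyphrase extraction algorithm, KEING (Keyphrase Extraction using sequentIal patterns with oNe-off and General gaps condition). First, each document is treated as a sequence database, and the SPING (Sequential Patterns mIning with oNe-off and General gaps condition) algorithm is used to capture the relations between words and their many variant forms; candidate keyphrases are then described by statistical features of the mined patterns. Next, a naive Bayes classification algorithm is trained on a large amount of labeled training data to build a classifier. Finally, the classifier identifies keyphrases in test documents. Experiments verify the completeness of SPING and the effectiveness of KEING.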
Abstract: Keyphrases summarize a document, and high-quality keyphrases are important for text summarization, reading, and indexing. However, most keyphrase extraction studies place strict limits on the form of patterns and cannot capture the semantic relations between words and phrases, so they fail to extract keyphrases autonomously. This paper proposes KEING, a keyphrase extraction algorithm that uses sequential pattern mining with the one-off condition and general gaps. By taking the one-off condition and general gaps into account, SPING (Sequential Patterns mIning with oNe-off and General gaps condition) captures semantic relations between words and phrases more effectively. KEING therefore obtains effective candidate keyphrases and computes their features. A supervised machine learning method is then trained on these features to build a classification model, which is used to extract keyphrases. Experimental results demonstrate that KEING can effectively extract high-quality keyphrases.
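To make the one-off and gap conditions named in the abstract concrete, the following minimal Python sketch counts occurrences of a word pattern in a tokenized document. It is an illustration under stated assumptions, not the SPING algorithm itself: the function name `one_off_occurrences`, the parameters `min_gap` and `max_gap`, and the greedy left-to-right search are all hypothetical choices, and gaps are restricted here to non-negative values for simplicity.

```python
from typing import List, Tuple


def one_off_occurrences(
    seq: List[str], pattern: List[str], min_gap: int, max_gap: int
) -> List[Tuple[int, ...]]:
    """Greedily collect occurrences of `pattern` in `seq`.

    Assumptions (illustrative only, not taken from the paper):
    - between two consecutive pattern items, at least `min_gap` and at
      most `max_gap` tokens may be skipped;
    - one-off condition: each position of `seq` is used by at most one
      occurrence.
    """
    if not pattern:
        return []

    used = [False] * len(seq)
    found: List[Tuple[int, ...]] = []

    def extend(prefix: List[int], k: int):
        # Try to match pattern[k:] after the last matched position.
        if k == len(pattern):
            return prefix
        last = prefix[-1]
        lo = last + 1 + min_gap
        hi = min(len(seq), last + 2 + max_gap)
        for pos in range(lo, hi):
            if not used[pos] and seq[pos] == pattern[k]:
                result = extend(prefix + [pos], k + 1)
                if result is not None:
                    return result
        return None

    for start, token in enumerate(seq):
        if used[start] or token != pattern[0]:
            continue
        occurrence = extend([start], 1)
        if occurrence is not None:
            # Reserve these positions so no later occurrence reuses them.
            for pos in occurrence:
                used[pos] = True
            found.append(tuple(occurrence))
    return found


doc = (
    "keyphrase extraction uses sequential pattern mining "
    "for keyphrase candidate extraction"
).split()
matches = one_off_occurrences(doc, ["keyphrase", "extraction"], 0, 1)
print(matches)  # [(0, 1), (7, 9)]
```

Under these assumptions, the number of returned occurrences can serve as a pattern's support, one example of the kind of statistical feature a classifier could be trained on. The one-off condition is visible in a case such as `one_off_occurrences(["a", "a", "b"], ["a", "b"], 0, 1)`, which yields a single occurrence because the only `b` can be used once.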