Acta Electronica Sinica ›› 2019, Vol. 47 ›› Issue (5): 1086-1093. DOI: 10.3969/j.issn.0372-2112.2019.05.016

• Academic Paper •

Short Text Online Clustering Based on Incremental Robust Nonnegative Matrix Factorization

HE Chao-bo1, TANG Yong2, ZHANG Qiong3, LIU Shuang-yin1, LIU Hai2

  1. School of Information Science and Technology, Zhongkai University of Agriculture and Engineering, Guangzhou, Guangdong 510225, China;
  2. School of Computer, South China Normal University, Guangzhou, Guangdong 510631, China;
  3. School of Data and Computer Science, Sun Yat-sen University, Guangzhou, Guangdong 510006, China
  • Received: 2018-05-31 Revised: 2018-08-19 Online: 2019-05-25 Published: 2019-05-25
  • Corresponding author: TANG Yong
  • About the author: HE Chao-bo, male, born in 1981 in Heyuan, Guangdong, is an associate professor at Zhongkai University of Agriculture and Engineering. His main research interests are data mining, machine learning, and big data technology. E-mail: hechaobo@foxmail.com
  • Funding:
    National Natural Science Foundation of China (No.61772211); Science and Technology Planning Project of Guangdong Province (No.2017A040405057, No.2017A030303074, No.2016A030303058); Science and Technology Planning Project of Guangzhou (No.201807010043)

Abstract: Clustering the large volume of short texts generated by social media has significant practical value. However, short texts tend to be noisy, fast-growing, and massive in volume, so most existing short text clustering algorithms cannot process them effectively. To address this problem, we propose STOCIRNMF, a short text online clustering algorithm based on incremental robust nonnegative matrix factorization. STOCIRNMF builds its short text clustering model on nonnegative matrix factorization (NMF), uses the l2,1 norm in the objective function of the model to improve robustness, and applies incremental iterative update rules to cluster short texts online. Experiments on real Sohu news title and Weibo short text datasets show that STOCIRNMF not only achieves better clustering performance than existing representative algorithms but also detects microblog topics online effectively.

Key words: short text clustering, robust nonnegative matrix factorization, online clustering, l2,1 norm, incremental iterative update rules

CLC Number:
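The abstract names two ingredients, an l2,1-norm objective and incremental updates, without giving the update rules. The following is a minimal sketch of the general idea only, not the STOCIRNMF rules from the paper. It assumes a term-by-text matrix X (for example TF-IDF, one column per short text), takes the l2,1 norm as the sum of the Euclidean norms of the residual columns, and uses a standard multiplicative-update scheme for that objective. The names `l21_norm`, `robust_nmf`, and `assign_new_texts` are hypothetical, and the incremental step is simplified to keep the basis fixed while only the newly arrived texts are assigned.

```python
import numpy as np


def l21_norm(A):
    """l2,1 norm (assumed convention): sum of Euclidean norms of the columns."""
    return float(np.sum(np.linalg.norm(A, axis=0)))


def robust_nmf(X, k, n_iter=200, eps=1e-9, seed=0):
    """Approximate X (terms x texts) by F @ G with F, G >= 0.

    Multiplicative updates that target ||X - F G||_{2,1}. Each text gets a
    weight inversely proportional to its residual norm, so noisy texts
    influence the basis F less than they would under a squared error.
    """
    rng = np.random.default_rng(seed)
    n_terms, n_texts = X.shape
    F = rng.random((n_terms, k)) + eps
    G = rng.random((k, n_texts)) + eps
    for _ in range(n_iter):
        # Per-text weights d_i = 1 / ||x_i - F g_i||_2.
        d = 1.0 / (np.linalg.norm(X - F @ G, axis=0) + eps)
        XD = X * d  # same as X @ diag(d): scales column i by d_i
        GD = G * d  # same as G @ diag(d)
        F *= (XD @ G.T) / (F @ (GD @ G.T) + eps)
        G *= (F.T @ XD) / (F.T @ F @ GD + eps)
    return F, G


def assign_new_texts(F, X_new, n_iter=100, eps=1e-9, seed=0):
    """Simplified incremental step (assumption): keep the basis F fixed and
    fit coefficients only for the new columns. With F fixed, each column is
    an independent nonnegative least-squares problem."""
    rng = np.random.default_rng(seed)
    G_new = rng.random((F.shape[1], X_new.shape[1])) + eps
    FtX = F.T @ X_new
    FtF = F.T @ F
    for _ in range(n_iter):
        G_new *= FtX / (FtF @ G_new + eps)
    # Cluster label of each new text = index of its largest coefficient.
    return G_new, G_new.argmax(axis=0)


# Toy data: 6 terms x 6 texts, two groups of texts using disjoint vocabulary.
X = np.array([
    [2, 3, 2, 0, 0, 0],
    [1, 2, 1, 0, 0, 0],
    [3, 1, 2, 0, 0, 0],
    [0, 0, 0, 2, 1, 3],
    [0, 0, 0, 1, 2, 2],
    [0, 0, 0, 3, 2, 1],
], dtype=float)
F, G = robust_nmf(X, k=2)

# Two newly arrived texts, one per vocabulary group.
X_new = np.array([
    [2, 0], [1, 0], [2, 0],
    [0, 1], [0, 2], [0, 3],
], dtype=float)
G_new, labels_new = assign_new_texts(F, X_new)
```

Keeping F fixed makes the per-batch cost independent of how many texts were seen earlier, which is the property an online method needs. A fuller incremental scheme would also refresh F from accumulated statistics, which this sketch omits.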