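1. School of Computer Science and Engineering, Xi'an University of Technology, Xi'an, Shaanxi 710048
2. Shaanxi Key Laboratory for Network Computing and Security Technology, Xi'an, Shaanxi 710048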
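MENG Hai-ning, female, born in 1979 in Wuhai, Inner Mongolia. She is an associate professor and master's supervisor at the School of Computer Science and Engineering, Xi'an University of Technology. Her main research interests are data mining algorithms and reliability modeling. E-mail: hnmeng@xaut.edu.cn
FENG Kai, male, born in 1997 in Xilingol League, Inner Mongolia. He is a master's student at the School of Computer Science and Engineering, Xi'an University of Technology. His main research interest is data mining algorithms. E-mail: wang389331557@163.com
ZHU Lei, male, born in 1983 in Xianyang, Shaanxi. He is a lecturer at the School of Computer Science and Engineering, Xi'an University of Technology. His main research interests are data mining algorithms and natural language processing. E-mail: leizhu@xaut.edu.cn
ZHANG Bei-bei, male, born in 1978 in Yan'an, Shaanxi. He is a lecturer at the School of Computer Science and Engineering, Xi'an University of Technology. His main research interests are data mining algorithms and big data processing technology. E-mail: bbzhang115@hotmail.com
TONG Xin-yu, male, born in 1996 in Xi'an, Shaanxi. He is a master's student at the School of Computer Science and Engineering, Xi'an University of Technology. His main research interest is data mining algorithms. E-mail: tongxinyu@stu.xaut.edu.cn
HEI Xin-hong, male, born in 1976 in Yan'an, Shaanxi. He is a professor and doctoral supervisor at the School of Computer Science and Engineering, Xi'an University of Technology. His main research interests are machine learning and safety assessment. E-mail: heixinhong@xaut.edu.cn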
Received: 2020-11-09
Revised: 2021-06-11
Published in print: 2021-09-25
MENG Hai-ning, FENG Kai, ZHU Lei, et al. Short-Text Clustering Algorithm Based on Laplacian Graph[J]. ACTA ELECTRONICA SINICA, 2021, 49(09): 1716-1723. DOI: 10.12263/DZXB.20201266.
A Laplacian graph clustering algorithm based on word-frequency processing is presented to address the high dimensionality and feature sparsity of short-text data. First, the term frequency-inverse document frequency (TF-IDF) method is used to map the short-text dataset into the text vector space, yielding a word-frequency weight matrix. Second, the dimension of the word-frequency weight matrix is reduced by exploiting the graph clustering property of the Laplacian matrix. Then, because the eigenvalues of the Laplacian matrix reflect the degree of text similarity, the eigenvectors corresponding to the first K eigenvalues are selected as the initial cluster centers, which reduces the number of iterations in the clustering process. Experiments on the SSC, 20 News Group, and Microblog PCU datasets show that, compared with traditional clustering algorithms, the Laplacian graph clustering algorithm not only yields better clustering results and faster convergence, but is also less affected by noise points and is robust.
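The pipeline described in the abstract (TF-IDF weighting, Laplacian-based dimension reduction, then clustering) can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not give the exact formulas, so the sketch assumes a smoothed IDF, a cosine-similarity graph, the unnormalized Laplacian L = D - W, and the K eigenvectors with the smallest eigenvalues as the embedding. The farthest-point initialization in `kmeans` is a deterministic stand-in for the paper's eigenvector-based choice of initial centers, and all function names and the toy corpus are hypothetical.

```python
import numpy as np


def tfidf_matrix(docs):
    """Return a row-normalized TF-IDF matrix (documents x terms)."""
    tokens = [doc.lower().split() for doc in docs]
    vocab = sorted({t for doc in tokens for t in doc})
    index = {t: j for j, t in enumerate(vocab)}
    tf = np.zeros((len(docs), len(vocab)))
    for i, doc in enumerate(tokens):
        for t in doc:
            tf[i, index[t]] += 1.0
    df = (tf > 0).sum(axis=0)
    # Smoothed IDF: an assumed variant, the source does not state the formula.
    idf = np.log((1.0 + len(docs)) / (1.0 + df)) + 1.0
    weights = tf * idf
    norms = np.linalg.norm(weights, axis=1, keepdims=True)
    return weights / np.where(norms == 0.0, 1.0, norms)


def spectral_embedding(weights, k):
    """Embed documents using k eigenvectors of the graph Laplacian L = D - W."""
    sim = weights @ weights.T           # cosine similarity (rows are unit length)
    np.fill_diagonal(sim, 0.0)          # drop self-loops
    laplacian = np.diag(sim.sum(axis=1)) - sim
    _, eigvecs = np.linalg.eigh(laplacian)  # eigenvalues come back in ascending order
    return eigvecs[:, :k]               # assumption: "first K" means the K smallest


def kmeans(points, k, n_iter=100):
    """Lloyd's algorithm with deterministic farthest-point initialization."""
    centers = [points[0]]
    while len(centers) < k:
        dists = np.min([np.linalg.norm(points - c, axis=1) for c in centers], axis=0)
        centers.append(points[int(np.argmax(dists))])
    centers = np.array(centers)
    for _ in range(n_iter):
        dist = np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=2)
        labels = dist.argmin(axis=1)
        new_centers = np.array([
            points[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
            for j in range(k)
        ])
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return labels


def cluster_short_texts(docs, k):
    """TF-IDF -> Laplacian eigenvectors -> K-means on the reduced representation."""
    return kmeans(spectral_embedding(tfidf_matrix(docs), k), k)


# Hypothetical toy corpus: two topics with disjoint vocabularies.
docs = [
    "cat dog pet", "dog pet food", "cat pet food",
    "stock market price", "market price trade", "stock trade price",
]
labels = cluster_short_texts(docs, k=2)
print(labels)
```

On this toy corpus the two topics share no terms, so the similarity graph splits into two components and the first three documents should receive one label and the last three the other. Real short-text data would need proper tokenization and a sparse eigen-solver rather than the dense `np.linalg.eigh` used here.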