Acta Electronica Sinica (电子学报), 2016, Vol. 44, Issue 10: 2466-2470. DOI: 10.3969/j.issn.0372-2112.2016.10.026

• Research Paper •

Hot Element Extraction from Online Social Network Text Streams Based on AC-Trie

黄九鸣, 吴泉源, 张圣栋, 贾焰, 刘东, 周斌   

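  1. School of Computer, National University of Defense Technology, Changsha, Hunan 410073, China
  • Received: 2015-02-15; Revised: 2015-08-14; Published: 2016-10-25
    • About the authors:
    • Huang Jiuming (黄九鸣), male, born in 1981 in Anxi, Fujian. Ph.D., assistant researcher at the National University of Defense Technology of the Chinese People's Liberation Army. Research interests: Web mining, big data, distributed computing, and social network analysis. E-mail: jiuming.huang@qq.com; Wu Quanyuan (吴泉源), male, born in 1942 in Shanghai. Professor and doctoral supervisor at the National University of Defense Technology of the Chinese People's Liberation Army. Research interests: artificial intelligence and distributed computing.
    • Funding:
    • National Key Basic Research Program of China (973 Program) (No. 2013CB329601); National Natural Science Foundation of China (No. 61502517)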

Mining Hot Phrases on Social Network Text Streams Based on AC-Trie

HUANG Jiu-ming, WU Quan-yuan, ZHANG Sheng-dong, JIA Yan, LIU Dong, ZHOU Bin   

  1. School of Computer, National University of Defense Technology, Changsha, Hunan 410073, China
  • Received: 2015-02-15; Revised: 2015-08-14; Online: 2016-10-25; Published: 2016-10-25
    • Supported by:
    • National Program on Key Basic Research Project of China (973 Program) (No. 2013CB329601); National Natural Science Foundation of China (No. 61502517)

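Abstract (translated from the Chinese):

Hot phrases in online social network text streams reflect the hot topics and breaking events hidden in those streams. This paper proposes a hot phrase mining technique that requires no word segmentation and supports multiple hotness measure functions. Candidate phrases are first obtained by sampling a typical time period of the text stream and are used to build an AC-Trie prefix tree. Based on this tree, the subsequent text stream is scanned in a single pass, and the historical occurrence frequency of each candidate phrase is recorded on the corresponding Trie node, which supports a variety of hotness measures based on historical frequency. In addition, to detect new hot phrases promptly while reducing the number of AC-Trie constructions, the method analyzes the frequency of missed phrases at each Trie node to determine dynamically when to update the candidate phrases. Experiments on a Sina Weibo dataset confirm that the method is effective (89% precision) and efficient (time and space overhead only 2% of the baseline algorithm).

Keywords: text stream, hot phrase, AC-Trie, text mining, online social network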

Abstract:

Hot phrases in social network text streams can reflect latent hot topics and breaking events. This paper proposes a hot phrase mining technique that requires no word segmentation and supports multiple hotness measures. We first build an AC-Trie from candidate phrases sampled from the text stream. Using this AC-Trie, we scan the subsequent stream in a single pass and record the historical occurrence frequency of each phrase on the corresponding Trie node. Because hot phrases evolve, the AC-Trie must be rebuilt from new samples of the text stream; we trigger the rebuild dynamically by estimating the occurrence frequency of missed phrases. Experiments on Sina micro-blog data show that our approach is effective (precision of 89%) and efficient (time and space overhead only 2% of the naïve approach).

Key words: text stream, hot phrase, AC-Trie, text mining, micro-blog
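
The single-pass counting step described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the class `ACTrie`, the method `count_window`, the function `burst_score`, and the sample phrases are all hypothetical, and the hotness measure (latest count divided by the earlier average plus one) is only one assumed example of a frequency-based measure. Matching is done character by character with a standard Aho-Corasick automaton, so no word segmentation is needed.

```python
from collections import defaultdict, deque


class ACTrie:
    """Aho-Corasick trie that keeps a per-window count history for each phrase."""

    def __init__(self, phrases):
        self.children = [{}]      # node id -> {char: child id}; node 0 is the root
        self.fail = [0]           # failure link of each node
        self.phrase_at = [None]   # phrase ending exactly at this node, if any
        self.out_link = [0]       # nearest suffix node that ends a phrase (0 = none)
        self.history = {}         # phrase -> list of counts, one per scanned window
        for phrase in phrases:
            if phrase:            # skip empty strings
                self._insert(phrase)
        self._build_links()

    def _insert(self, phrase):
        node = 0
        for ch in phrase:
            if ch not in self.children[node]:
                self.children.append({})
                self.fail.append(0)
                self.phrase_at.append(None)
                self.out_link.append(0)
                self.children[node][ch] = len(self.children) - 1
            node = self.children[node][ch]
        self.phrase_at[node] = phrase
        self.history.setdefault(phrase, [])

    def _build_links(self):
        # Breadth-first traversal; depth-1 nodes keep the root as failure link.
        queue = deque(self.children[0].values())
        while queue:
            node = queue.popleft()
            for ch, child in self.children[node].items():
                f = self.fail[node]
                while f and ch not in self.children[f]:
                    f = self.fail[f]
                fc = self.children[f].get(ch, 0)
                self.fail[child] = fc
                self.out_link[child] = fc if self.phrase_at[fc] is not None else self.out_link[fc]
                queue.append(child)

    def count_window(self, texts):
        """Scan one time window of texts once and append the counts to the history."""
        counts = defaultdict(int)
        for text in texts:
            node = 0
            for ch in text:
                while node and ch not in self.children[node]:
                    node = self.fail[node]
                node = self.children[node].get(ch, 0)
                # Report the phrase at this node plus every phrase that is a suffix of it.
                hit = node if self.phrase_at[node] is not None else self.out_link[node]
                while hit:
                    counts[self.phrase_at[hit]] += 1
                    hit = self.out_link[hit]
        for phrase, series in self.history.items():
            series.append(counts.get(phrase, 0))


def burst_score(series):
    """Assumed hotness measure: latest count relative to the earlier average."""
    if not series:
        return 0.0
    *earlier, latest = series
    baseline = sum(earlier) / len(earlier) if earlier else 0.0
    return latest / (baseline + 1.0)


if __name__ == "__main__":
    trie = ACTrie(["quake", "earthquake", "stocks"])
    trie.count_window(["stocks rise today", "stocks closed"])
    trie.count_window(["earthquake hits", "quake quake", "stocks"])
    ranked = sorted(trie.history, key=lambda p: burst_score(trie.history[p]), reverse=True)
    print(ranked)
```

Because the count history lives alongside the trie, a different frequency-based measure can be swapped in for `burst_score` without rescanning the stream. The dynamic rebuild of the candidate set described in the abstract is not covered by this sketch.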

CLC number: