Acta Electronica Sinica ›› 2019, Vol. 47 ›› Issue (9): 1919-1928. DOI: 10.3969/j.issn.0372-2112.2019.09.015

• Research Paper •

Microblog Bursty Events Detection Method Based on Multiple Word Features

ZHANG Yang-sen1,3, DUAN Yu-xiang1, WANG Jian1, WU Yun-fang2

  1. Institute of Intelligent Information Processing, Beijing Information Science and Technology University, Beijing 100101, China;
  2. Institute of Computational Linguistics, Peking University, Beijing 100871, China;
  3. Beijing Laboratory of National Economic Security Early-warning Engineering, Beijing 100044, China
  • Received: 2018-08-13 Revised: 2018-11-20 Online: 2019-09-25 Published: 2019-09-25
  • About the authors:
    ZHANG Yang-sen, male, born in June 1962 in Linyi, Shanxi; postdoctoral researcher and professor; research interests: Chinese information processing and artificial intelligence. E-mail: zhangyangsen@163.com
    DUAN Yu-xiang, male, born in March 1992 in Taiyuan, Shanxi; master's student; research interests: Chinese information processing and bursty event detection. E-mail: duanyx5173@163.com
    WANG Jian, male, born in May 1993 in Wenzhou, Zhejiang; master's degree; research interests: Chinese information processing and information security. E-mail: 455858538@qq.com
    WU Yun-fang, female, born in March 1973 in Shanxi; Ph.D., associate professor; research interests: semantic computing and intelligent question answering. E-mail: wuyf@pku.edu.cn
  • Supported by:
    National Natural Science Foundation of China (No. 61772081); Science and Technology Innovation Service Capacity Building - Research Base Construction - Beijing Laboratory - Beijing Laboratory of National Economic Security Early-warning Engineering Project (No. PXM2018_014224_000010)

Abstract: In recent years, bursty events of many kinds have occurred frequently across various fields, affecting social stability and development. This paper proposes a microblog bursty event detection model based on multiple word features. The model detects bursty events in massive microblog data, helping decision-makers monitor microblogs and guide public opinion so as to minimize the harm such events cause to society. First, the microblog data are sliced into time windows according to their timestamps, and the word frequency, topic tag (hashtag) and word frequency growth rate features of each word are computed within each window. Then, D-S evidence theory and the analytic hierarchy process are used to determine the weight of each feature, and the weighted features are fused into a bursty feature value for each word. Words with large bursty feature values are selected to form the bursty feature word set, and a coupling degree matrix of this word set is constructed from co-occurrence degree and binding tightness. Finally, the coupling degree matrix is fed to an agglomerative hierarchical clustering algorithm, which generates a binary tree whose leaf nodes are the bursty words; a binary tree pruning algorithm based on internal similarity then partitions the clustering result, yielding the bursty events of the corresponding time window. Experimental results show that the model performs best when the intra-cluster similarity threshold is 1.1, with a precision of 0.8462, a recall of 0.8684 and an F-measure of 0.8571, which demonstrates the effectiveness of the proposed method.
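The first two stages summarized in the abstract (per-window word features, then weighted fusion into a bursty feature value) can be illustrated with a minimal Python sketch. Everything specific here is an assumption made for illustration: the `#word#` hashtag convention, the add-one smoothing of the growth rate, the max-scaling of each feature, and above all the fixed `weights`, which merely stand in for the weights the paper derives with D-S evidence theory and the analytic hierarchy process. Time slicing is assumed to have been done already, so the input is two consecutive windows of tokenized posts.

```python
from collections import Counter


def count_words(docs):
    """Count word and hashtag occurrences in one time window.

    Each document is a list of tokens. A token written as '#word#' is counted
    both as a hashtag hit and as an ordinary occurrence of 'word'
    (an assumption of this sketch).
    """
    freq, tags = Counter(), Counter()
    for doc in docs:
        for tok in doc:
            if len(tok) > 2 and tok.startswith("#") and tok.endswith("#"):
                tok = tok[1:-1]
                tags[tok] += 1
            freq[tok] += 1
    return freq, tags


def word_features(prev_docs, curr_docs):
    """Return {word: (frequency, hashtag, growth)}, each feature scaled to [0, 1]."""
    prev_freq, _ = count_words(prev_docs)
    freq, tags = count_words(curr_docs)
    if not freq:
        return {}
    # Growth relative to the previous window, add-one smoothed; declines clip to 0.
    growth = {w: max((f - prev_freq[w]) / (prev_freq[w] + 1), 0.0)
              for w, f in freq.items()}
    max_f = max(freq.values())
    max_t = max(tags.values(), default=0) or 1
    max_g = max(growth.values()) or 1.0
    return {w: (freq[w] / max_f, tags[w] / max_t, growth[w] / max_g) for w in freq}


def bursty_words(features, weights=(0.3, 0.3, 0.4), top_k=2):
    """Fuse the three features with fixed weights and return the top_k words.

    The weights are placeholders; the paper derives them with D-S evidence
    theory and the analytic hierarchy process, which this sketch does not do.
    """
    scores = {w: sum(wt * x for wt, x in zip(weights, v))
              for w, v in features.items()}
    ranked = sorted(scores, key=lambda w: (-scores[w], w))
    return ranked[:top_k], scores


# Two toy windows: "quake" is new, frequent and hashtagged in the current one.
previous = [["weather", "nice"], ["weather", "today"]]
current = [["#quake#", "quake", "city"], ["quake", "help"], ["weather", "quake"]]
top, scores = bursty_words(word_features(previous, current))
print(top)
```

In this toy input, "quake" ranks first because it leads on all three features at once, while "weather" ranks last because its frequency fell relative to the previous window.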

Key words: microblog, bursty events, bursty feature words, D-S evidence theory, hierarchical agglomerative clustering
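The clustering stage described in the abstract (agglomerative clustering over a coupling degree matrix, producing a binary tree that is then pruned by internal similarity) might look roughly like the sketch below. The abstract does not give the linkage rule or the exact definition of internal similarity, so this sketch assumes average linkage and takes internal similarity to be the mean pairwise coupling among a subtree's words. The toy matrix and the threshold of 0.5 are hypothetical; the paper's reported threshold of 1.1 suggests its similarity measure is on a different scale.

```python
from itertools import combinations


def leaves(node):
    """Leaf indices under a tree node (leaf = int, internal node = (left, right))."""
    return [node] if isinstance(node, int) else leaves(node[0]) + leaves(node[1])


def mean_coupling(group_a, group_b, coupling):
    """Average coupling between two groups of word indices (average linkage)."""
    total = sum(coupling[i][j] for i in group_a for j in group_b)
    return total / (len(group_a) * len(group_b))


def build_tree(coupling):
    """Merge the two most strongly coupled clusters until one binary tree remains."""
    nodes = list(range(len(coupling)))
    while len(nodes) > 1:
        a, b = max(
            combinations(nodes, 2),
            key=lambda p: mean_coupling(leaves(p[0]), leaves(p[1]), coupling),
        )
        nodes = [n for n in nodes if n not in (a, b)] + [(a, b)]
    return nodes[0]


def internal_similarity(node, coupling):
    """Mean pairwise coupling among the words under a node (assumed definition)."""
    pairs = list(combinations(leaves(node), 2))
    return sum(coupling[i][j] for i, j in pairs) / len(pairs) if pairs else 0.0


def prune(node, coupling, threshold):
    """Cut top-down: keep a subtree as one event once it is cohesive enough."""
    if isinstance(node, int):
        return [[node]]
    if internal_similarity(node, coupling) >= threshold:
        return [sorted(leaves(node))]
    return prune(node[0], coupling, threshold) + prune(node[1], coupling, threshold)


# Hypothetical bursty words and a symmetric toy coupling matrix.
words = ["quake", "rescue", "concert", "ticket"]
coupling = [
    [0.0, 0.9, 0.1, 0.1],
    [0.9, 0.0, 0.1, 0.2],
    [0.1, 0.1, 0.0, 0.8],
    [0.1, 0.2, 0.8, 0.0],
]
tree = build_tree(coupling)
events = [[words[i] for i in group] for group in prune(tree, coupling, 0.5)]
print(events)
```

Building the full tree first and cutting it afterwards means the threshold can be tuned without re-running the clustering. That is presumably what makes a threshold sweep like the one reported in the abstract cheap to run.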

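As a quick consistency check on the three scores reported in the abstract, the snippet below recomputes the F value from the other two. It assumes that the abstract's "F value" is the balanced F-measure (the harmonic mean of precision and recall) and that the first score is precision; the abstract does not state the formula.

```python
def f_measure(precision, recall):
    """Balanced F-measure: the harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


reported_f = 0.8571
f = f_measure(0.8462, 0.8684)  # scores as reported in the abstract
print(round(f, 4), abs(f - reported_f) < 5e-4)
```

Under that assumption the recomputed value agrees with the reported 0.8571 to within the rounding of the two inputs.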