LOU Zheng-zheng, YANG Chen, YE Yang-dong. An IB Algorithm Based on Data Selection Model[J]. Acta Electronica Sinica, 2014, 42(9): 1839-1846. DOI: 10.3969/j.issn.0372-2112.2014.09.027.
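Data objects differ in how clearly they exhibit pattern characteristics, and this causes problems for data analysis with the IB (Information Bottleneck) method. To address this, the paper defines a data selection model based on a clarity factor, which lets the IB method select from a data set the objects whose pattern characteristics are relatively clear and analyze their patterns, and it proposes the DSIB (Data Selection Information Bottleneck) algorithm. DSIB uses the information loss incurred during data compression as the criterion for judging whether a data object's pattern characteristics are clear, and it optimizes the DSIB objective function with a sequential draw-and-merge strategy that selects while it learns. Experimental results show that, as the data selection standard is raised, DSIB improves the precision of data analysis while sacrificing only a small amount of recall; compared with data analysis algorithms that perform no selection, DSIB better identifies the patterns inherent in the data.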
Abstract
In the original IB (Information Bottleneck) algorithms, all the data points are employed to learn the cluster patterns. However, in many real-world applications, some data show clear coherent behavior and can be summarized well, while other data show only a weak tendency to belong to any particular pattern. For such situations, this paper proposes a DSIB (Data Selection Information Bottleneck) algorithm, which can select data points with clear coherent behavior and find their corresponding cluster patterns. To realize this goal, the DSIB algorithm takes as its data selection criterion the information loss generated when a data point is compressed into one of the clusters. The DSIB algorithm adopts a sequential draw-and-merge procedure to select the data and learn the cluster patterns. This learning process can take full account of each datum's natural pattern. Experimental results show that, as the data selection criterion becomes stricter, the DSIB algorithm improves the clustering precision at only a small expense in recall. In our evaluation, the DSIB algorithm is found to be consistently superior to all the other clustering methods we examine.
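
The selection idea described above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract gives neither the DSIB objective nor the selection rule, so the sketch assumes the merge cost used in the standard sequential IB algorithm, (p(x) + p(t)) · JS[p(y|x), p(y|t)], and a hypothetical fixed `threshold` on that cost. The function names `merge_cost` and `select_and_assign` are likewise assumptions, and the clusters are held fixed for a single pass rather than updated after each merge.

```python
import math

def kl(p, q):
    """KL divergence D(p || q) in bits; terms with p_i == 0 contribute 0."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def merge_cost(px, py_x, pt, py_t):
    """Information loss of merging point x into cluster t, using the
    standard sequential-IB cost (assumed here, not taken from the paper):
    (p(x) + p(t)) * JS[p(y|x), p(y|t)], with JS weights proportional
    to p(x) and p(t)."""
    w = px + pt
    a, b = px / w, pt / w
    mix = [a * u + b * v for u, v in zip(py_x, py_t)]
    return w * (a * kl(py_x, mix) + b * kl(py_t, mix))

def select_and_assign(points, clusters, threshold):
    """One selection pass over fixed clusters.

    points:   list of (p(x), p(y|x)) pairs
    clusters: list of (p(t), p(y|t)) pairs
    Returns, per point, the index of the cheapest cluster, or None when
    even the cheapest merge loses more information than `threshold`
    (a hypothetical rejection rule standing in for DSIB's criterion).
    """
    labels = []
    for px, py_x in points:
        costs = [merge_cost(px, py_x, pt, py_t) for pt, py_t in clusters]
        best = min(range(len(costs)), key=costs.__getitem__)
        labels.append(best if costs[best] <= threshold else None)
    return labels
```

Under these assumptions, a point whose feature distribution sits midway between two clusters incurs a high loss for every candidate merge and is left unselected, while lowering `threshold` rejects more points, which mirrors the precision and recall trade-off reported in the abstract.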