电子学报 ›› 2018, Vol. 46 ›› Issue (8): 1804-1814.DOI: 10.3969/j.issn.0372-2112.2018.08.002

• 学术论文 • 上一篇    下一篇

基于差集的高效用项集挖掘方法

黄坤1, 吴玉佳2, 李晶2   

  1. 1. 中国舰船研究设计中心, 湖北武汉 430064;
    2. 武汉大学计算机学院, 湖北武汉 430072
  • 收稿日期:2016-11-26 修回日期:2017-06-30 出版日期:2018-08-25
    • 作者简介:
    • 黄坤 男,1979年10月生于湖北孝感,中国舰船研究设计中心高级工程师,研究方向电子信息系统.E-mail:hkcfan@163.com;吴玉佳 男,1986年11月生于湖北广水.武汉大学计算机学院博士生.研究方向为数据挖掘与机器学习.E-mail:wuyujia@whu.edu.cn
    • 基金资助:
    • 国家自然科学基金 (No.61303046)

Mining High Utility Itemsets Using Diffsets

HUANG Kun1, WU Yu-jia2, LI Jing2   

  1. 1. China Ship Development and Design Center, Wuhan, Hubei 430072, China;
    2. School of Computer, Wuhan University, Wuhan, Hubei 430072, China
  • Received:2016-11-26 Revised:2017-06-30 Online:2018-08-25 Published:2018-08-25

摘要: 高效用项集挖掘已成为关联规则中的一个热点研究问题.一些基于垂直结构的算法已用来挖掘高效用项集,此类算法的主要优点是将项集的事务和效用信息存储到效用列表中.在求一个项集的超集所在事务可以通过对它的子集进行一次交集运算得到.这种算法在稀疏数据集中非常的有效.但在稠密数据集中存在一个问题,即列表中存储的事务太多,在计算用于剪枝的效用上界时,需要耗费大量的存储空间,同时也影响运行速度.并且在现有的算法中,缺乏针对稠密数据集的高效用项集挖掘算法,往往需要设置很高的最小效用阈值,影响算法的运行效率.针对此问题,提出一个新的算法D-HUI (mining High Utility Itemsets using Diffsets)以及一个新的数据结构—项集列表,首次在高效用项集挖掘中引入差集的概念.利用事务的差集求项集的效用上界,减少计算量以及存储空间,从而提高算法的运行效率.实验结果表明,提出的算法在稠密数据集中,执行速度更快,内存消耗更少.

关键词: 关联规则, 高效用项集, 稠密数据集, 垂直结构, 差集

Abstract: High utility itemsets mining (HUIM) has become an emerging topic in association rules.Some algorithms based on vertical data structure have been used for mining high utility itemsets(HUIs),and the main advantage of the algorithms are to maintain transaction and utility information of itemsets in some utility lists(ULs).The transactions of superset of an itemsets can be calculated by its subset doing an intersection.These algorithms are very effective in sparse datasets.However,in the dense datasets,a problem is that:too many transactions maintained in ULs,not only required a lot of memory space,but also affected the runtime when computing the upper bound of utility in order to prune search space.Few of existing HUIM algorithm focused on dense datasets and it often need to set a high minimum threshold utility which affect the running efficiency of the algorithm.To solve this problem,propose a new algorithm D-HUI(mining High Utility Itemsets using Diffsets) and a new data structure,namely Itemset Lists (ILs).Introduce the concept of diffsets in the HUIM.Calculate upper bound of utility by using diffsets of transaction for pruning search space.The runtime and memory consumption are reduced,and the running efficiency of the algorithm is improved.Experimental results show that the proposed algorithm in the dense datasets outperforms state-of-the-art algorithms in terms of both running time and memory consumption.

Key words: association rules, high utility itemsets, dense datasets, vertical structure, diffsets

中图分类号: