电子学报 ›› 2020, Vol. 48 ›› Issue (2): 321-332.DOI: 10.3969/j.issn.0372-2112.2020.02.015

• 学术论文 • 上一篇    下一篇

面向分类的流特征在线特征选择算法

尤殿龙1,2, 郭松1,2, 赵春慧1,2, 原福永1,2, 申利民1,2, 陈真1,2   

  1. 1. 燕山大学信息科学与工程学院, 河北秦皇岛 066004;
    2. 河北省计算机虚拟技术与系统集成重点实验室, 河北秦皇岛 066004
  • 收稿日期:2019-06-10 修回日期:2019-10-17 出版日期:2020-02-25 发布日期:2020-02-25
  • 通讯作者: 原福永
  • 作者简介:尤殿龙 男,1981年1月生,内蒙古赤峰人,博士,副教授.主要研究方向:流特征选择和因果发现,图信号处理,数据挖掘、人工智能、知识工程.E-mail:youdianlong@sina.com;郭松 男,1994年5月生,河北邯郸人,硕士研究生.主要研究方向:流特征选择,人工智能.E-mail:guosongysu@163.com;赵春慧 女,1994年4月生,山西阳泉人,硕士研究生.主要研究方向:数据挖掘,推荐系统.E-mail:zch635134004@163.com
  • 基金资助:
    国家自然科学基金(No.61772450);中国博士后科学基金(No.2018M631764);河北省自然科学基金(No.F2019203287,No.F2017203307);河北省科技计划项目(No.17210701D)河北省博士后科研项目(No.B2018003009);河北省教育厅科学研究计划项目(No.KCJSX2017028)燕山大学基础研究专项课题(No.16SKY011);燕山大学博士基金(No.BL18003)

Online Feature Selection with Streaming Features for Classification

YOU Dian-long1,2, GUO Song1,2, ZHAO Chun-hui1,2, YUAN Fu-yong1,2, SHEN Li-min1,2, CHEN Zhen1,2   

  1. 1. School of Information Science and Engineering, Yanshan University, Qinhuangdao, Hebei 066004, China;
    2. The Key Laboratory for Computer Virtual Technology and System Integration of Hebei Province, Qinhuangdao, Hebei 066004, China
  • Received:2019-06-10 Revised:2019-10-17 Online:2020-02-25 Published:2020-02-25

摘要: 在线流特征选择通过实时过滤无关特征和冗余特征,实现流特征空间降维.针对已有算法,如Alpha-investing分类精度低、SAOLA选择特征数多和OSFS在低冗余高相关数据集下运行时间长的问题,提出了一种面向分类的流特征在线特征选择算法——OSFIC.算法运用四层过滤框架,通过无条件独立过滤不相关新特征、单条件下互信息过滤冗余新特征和候选特征集合中的部分冗余特征,最后通过多条件独立过滤候选特征集中的剩余冗余特征,最终得到分类标签的近似马尔可夫毯.为了分析OSFIC的性能,选择了NIPS 2003和Causality Workbench中的数据集,从预测精度、特征数量、运行时间和AUC方面与已有基准算法进行比较.实验表明,OSFIC平均分类精度比Alpha-investing提升4.41%.在保证精度的前提下,平均特征数量比SAOLA减少41.9%,运行时间比OSFS减少91.59%.最后,在真实的应用场景下验证了OSFIC的有效性.

关键词: 在线特征选择, 流特征, 互信息, 条件独立, 近似马尔可夫毯

Abstract: Online streaming feature selection achieves stream feature space dimensionality reduction by filtering irrelevant features and redundant features in real time.Existing works, such as Alpha-investing and Online Streaming Feature Selection (OSFS), have been proposed to serve this purpose, but they have drawbacks, including low prediction accuracy and high running time if the streaming features exhibit characteristics such as low redundancy and high relevance.We propose a novel classification-oriented online feature selection algorithm for streaming features, named OSFIC.OSFIC uses a four-layer filtering framework to filter irrelevant new features by null-conditional independence, filter redundant new features and redundant features in a candidate feature set by a single-conditional mutual information, and finally filter the remaining redundancy in the candidate feature set by multi-conditional independence.The approximate Markov blanket of the classify label is finally obtained.To analyze the performance of the algorithm, we selected the datasets in NIPS 2003 and Causality Workbench to compare prediction accuracy, number of selected features, runtime, and AUC with existing state-of-the-art algorithms.Experiments show that the average classification accuracy of OSFIC is 4.41% higher than that of Alpha-investing.Under the premise of high precision, the average number of features is 41.9% lower than SAOLA, and the runtime is 91.59% lower than OSFS.Finally, the efficiency of OSFIC is verified in real scenarios.

Key words: online feature selection, streaming feature, mutual information, conditional independence, approximate Markov blanket

中图分类号: