A Novel Spam Categorization Algorithm Based on Active Learning Method and Negative Selection Algorithm

HU Xiao-juan; LIU Lei; QIU Ning-jia

doi:10.3969/j.issn.0372-2112.2018.01.028


Research article | Updated: 2025-07-16

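- A Novel Spam Categorization Algorithm Based on Active Learning Method and Negative Selection Algorithm
- Acta Electronica Sinica, 2018, Vol. 46, No. 1, pp. 203-209
- Author affiliations:
  
  1. College of Computer Science and Technology, Jilin University, Changchun, Jilin 130012
  2. School of Computer Science and Technology, Changchun University of Science and Technology, Changchun, Jilin 130022
- Funding:
  
  Natural Science Foundation of Jilin Province (No. 20150101054JC); Jilin Province Postdoctoral Research Funding Project (No. 40301919); Key Science and Technology Research Project of the Jilin Province Science and Technology Development Plan (No. 20150204036GX); China Postdoctoral Science Foundation (No. 2016M591482)
- DOI: 10.3969/j.issn.0372-2112.2018.01.028
  CLC number: TP391
- Published online: 2018-01-25
  
  Print publication: 2018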

HU Xiao-juan, LIU Lei, QIU Ning-jia. A Novel Spam Categorization Algorithm Based on Active Learning Method and Negative Selection Algorithm[J]. Acta Electronica Sinica, 2018, 46(1): 203-209. DOI: 10.3969/j.issn.0372-2112.2018.01.028.

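Abstract (translated from the Chinese original)

To address the spam that is now rampant on the Internet, this paper combines the active learning method with the negative selection algorithm to propose a two-class text classification method: the active negative learning algorithm. A two-way interest set is built from a small number of user labels, and the self anomaly detection mechanism of the negative selection algorithm is used to improve the sampling strategy in active learning. The two-way interest set serves as the detectors, the newly added sample set serves as the self set, and the two are matched for anomalies. The proposed algorithm is compared on six commonly used email corpora with the quick online spam identification method, the semi-supervised collaborative classification algorithm with enhanced difference, the spam filtering method, the multilevel spam filtering algorithm based on artificial immunity, and the online active multi-field learning method. The results show that the proposed algorithm achieves higher precision, recall, and classification accuracy, with a lower user labeling burden. Converting users' personal preferences into two-way interest features helps improve the algorithm's classification ability, and selecting features of unknown category through anomaly detection matching effectively reduces the user labeling burden.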

Abstract

A two-class text categorization method, the active learning negative selection text categorization (ALNSTC) algorithm, based on the active learning (AL) method and the negative selection (NS) algorithm, is proposed to address the problem of spam proliferation. A positive user interest set and a negative user interest set are established from a small number of labeled samples, and the sampling engine (SE) of the AL method is improved by the self anomaly detection mechanism of the NS algorithm. The two-way user interest sets are used as detectors, and a new sample set is employed as the self-set; the two are matched using Hamming match rules. Classifying each sample set can also update the two user interest sets. The proposed algorithm is tested on six common spam corpora and compared with five other state-of-the-art spam classification methods: the quick online spam identification (QOSI) method, the semi-supervised collaboration classification algorithm with enhanced difference (DSCC), the dynamic web spam filtering (WSF2) method, the multilevel spam filtering algorithm based on artificial immunity (MSFA-AI), and the integrated multi-field learning (MFL) method. The evaluation metrics include precision, recall, ROC curve, categorization running time, and the number of labeled spam samples. The results show that the proposed method achieves better precision, recall, and classification accuracy, and reduces the number of spam samples that must be labeled manually. Converting user preferences into positive and negative user interest sets helps enhance the classification capacity of the algorithm. In addition, the number of user labels is reduced when features of unknown category are obtained through the anomaly detection mechanism.
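
The matching step described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the binary feature encoding, the form of the Hamming match rule (distance at most `max_distance`), the decision to query the user when a sample matches neither or both interest sets, and every name below (`classify`, `process_batch`, `ask_user`) are assumptions made purely for illustration.

```python
from typing import Callable, List, Optional, Tuple

Bits = Tuple[int, ...]  # assumed encoding: one 0/1 flag per feature


def hamming_distance(a: Bits, b: Bits) -> int:
    """Count the positions at which two equal-length bit vectors differ."""
    if len(a) != len(b):
        raise ValueError("vectors must have equal length")
    return sum(x != y for x, y in zip(a, b))


def matches(detector: Bits, sample: Bits, max_distance: int) -> bool:
    """Assumed Hamming match rule: match when the distance is small enough."""
    return hamming_distance(detector, sample) <= max_distance


def classify(sample: Bits, positive: List[Bits], negative: List[Bits],
             max_distance: int) -> Optional[str]:
    """Return 'ham', 'spam', or None when the sample should go to the user."""
    hit_pos = any(matches(d, sample, max_distance) for d in positive)
    hit_neg = any(matches(d, sample, max_distance) for d in negative)
    if hit_pos and not hit_neg:
        return "ham"
    if hit_neg and not hit_pos:
        return "spam"
    return None  # unmatched or conflicting: treated as unknown category


def process_batch(samples: List[Bits], positive: List[Bits],
                  negative: List[Bits], max_distance: int,
                  ask_user: Callable[[Bits], str]) -> Tuple[List[str], int]:
    """Label a batch, querying the user only for unknown samples.

    User-labeled samples are appended to the matching interest set
    (an assumption about how the sets are updated).
    """
    labels: List[str] = []
    queries = 0
    for sample in samples:
        label = classify(sample, positive, negative, max_distance)
        if label is None:
            label = ask_user(sample)
            queries += 1
            (positive if label == "ham" else negative).append(sample)
        labels.append(label)
    return labels, queries
```

In this sketch the user is consulted only for samples that no detector resolves, which is one simple way to picture how anomaly matching could lower the labeling burden; the threshold `max_distance` would need tuning on real data.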
