Acta Electronica Sinica ›› 2022, Vol. 50 ›› Issue (1): 250-256. DOI: 10.12263/DZXB.20200619

Special topic: Extended-abstract papers

• Research Communications •

Novel Botnet DGA Domain Detection Method Based on Character Level Sliding Window and Deep Residual Network

LIU Xiao-yang1, LIU Jia-miao1, LIU Chao1, ZHANG Yi-hao2

  1. School of Computer Science and Engineering, Chongqing University of Technology, Chongqing 400054, China
  2. School of Artificial Intelligence, Chongqing University of Technology, Chongqing 401135, China
  • Received: 2020-06-28 Revised: 2021-02-20 Online: 2022-01-25 Published: 2022-01-25
    • About the authors:
    • LIU Xiao-yang, male, born in 1980 in Anqing, Anhui. Postdoctoral researcher. He is a professor and master's supervisor at the School of Computer Science and Engineering, Chongqing University of Technology. His research interests include social network analysis, artificial intelligence, network security, and data mining. E-mail: lxy3103@163.com
      LIU Jia-miao (corresponding author), male, born in 1994 in Yubei, Chongqing. He is a master's student at the School of Computer Science and Engineering, Chongqing University of Technology. His research interests include network security, malicious traffic detection, and domain name analysis. E-mail: jiamiaoliu@126.com
    • Supported by:
    • The National Social Science Fund of China (17XXW004)

Abstract:

This paper proposes SW-DRN (Sliding Window-Depth Residual Network), a deep residual network based on character-level sliding windows, and is the first to apply lightweight depthwise separable convolution to DGA (Domain Generation Algorithm) domain name detection in botnets. Because it uses depthwise separable convolution, SW-DRN has about 56% fewer parameters than it would with standard convolution, which improves detection efficiency. Data are collected from two different sources and named Real-Dataset and Gen-Dataset. SW-DRN is compared on both datasets with DGA domain name detection models proposed by previous researchers. In the binary DGA domain name classification task, SW-DRN achieves F-Scores of 99.23% and 97.81% on the two datasets, respectively. It also performs well in the multi-class task on DGA families with few samples and on DGA domain names whose strings are easily confused: compared with existing DGA domain name classification models, it improves the overall F-Score by 1.23% and 1.01%, strengthening discrimination between DGA domain name families. The model is also tested on domain names produced by generative adversarial networks and identifies them effectively.
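To give a feel for where a parameter saving of this kind comes from, the sketch below counts the weights of a single 1-D convolution layer in its standard and depthwise separable forms. The kernel sizes and channel counts are illustrative assumptions, not values from the paper, and biases are ignored. The figure of about 56% reported above refers to the authors' model and is not reproduced here.

```python
def standard_conv1d_params(kernel_size: int, c_in: int, c_out: int) -> int:
    """Weights in a standard 1-D convolution layer (bias ignored)."""
    return kernel_size * c_in * c_out


def separable_conv1d_params(kernel_size: int, c_in: int, c_out: int) -> int:
    """Weights in a depthwise separable 1-D convolution layer (bias ignored)."""
    depthwise = kernel_size * c_in  # one k-wide filter per input channel
    pointwise = c_in * c_out        # 1x1 convolution that mixes channels
    return depthwise + pointwise


def reduction(kernel_size: int, c_in: int, c_out: int) -> float:
    """Fraction of weights saved by the separable form for one layer."""
    std = standard_conv1d_params(kernel_size, c_in, c_out)
    sep = separable_conv1d_params(kernel_size, c_in, c_out)
    return 1.0 - sep / std


if __name__ == "__main__":
    # Hypothetical layer sizes, chosen only for illustration.
    for k, c_in, c_out in [(3, 128, 128), (5, 64, 128)]:
        saved = reduction(k, c_in, c_out)
        print(f"k={k}, c_in={c_in}, c_out={c_out}: {saved:.1%} fewer weights")
```

For a single layer the ratio of separable to standard weights is 1/c_out + 1/k, so the saving grows with the kernel size and the number of output channels. A whole-model figure also depends on the layers that are not convolutions, such as embedding and dense layers.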

Extended Abstract
Network security threats are intensifying: botnets made up of large numbers of compromised hosts (bots) threaten the stability of the entire Internet, and botnet control relies on connections established through DNS resolution of DGA (Domain Generation Algorithm) domain names. Existing DGA domain name detection models, however, perform inadequately. In multi-class tasks they have low recognition rates for DGA families with few samples, and they struggle with highly random, easily confused DGA domain names. To address these problems, this paper proposes SW-DRN (Sliding Window-Depth Residual Network), a deep residual network model based on character-level sliding windows. First, to make the experiments more reliable, data are collected from two different sources: Dataset 01 consists of real DGA domain names collected from the Internet, and Dataset 02 is synthesized by DGA algorithms. Second, the collected domain name data are under-sampled to mitigate class imbalance, and each domain name string is represented as a character-level encoding vector. These vectors are then fed in turn into the SW-DRN model, which uses an embedding layer to compress the dimension of the character encoding vectors and so reduce the computational burden. Next, sliding windows of multiple sizes perceive different feature maps, which are merged into a variable-length deep residual neural network for complex, abstract feature extraction. Finally, SW-DRN is compared on the datasets with DGA domain name detection models proposed by previous researchers. The experimental results show that SW-DRN achieves F-Scores of 99.19% and 97.71% in the binary DGA domain name classification task. It also achieves strong results in the multi-class task on small-sample DGA families and on highly random, easily confused domain name strings.
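The preprocessing steps described above (under-sampling, character-level encoding, and windows of several sizes) can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the alphabet, the padding length `max_len`, the per-class cap `cap`, and the example domains are all hypothetical, and how the windows are realised inside SW-DRN is not specified in this abstract.

```python
from __future__ import annotations

import random
import string
from collections import defaultdict

# Assumed alphabet: lowercase letters, digits, and "-", ".", "_".
ALPHABET = string.ascii_lowercase + string.digits + "-._"
CHAR_TO_ID = {ch: i + 1 for i, ch in enumerate(ALPHABET)}  # 0 is reserved for padding
UNKNOWN_ID = len(ALPHABET) + 1  # any character outside the assumed alphabet


def encode_domain(domain: str, max_len: int = 64) -> list[int]:
    """Map a domain string to a fixed-length list of character ids."""
    ids = [CHAR_TO_ID.get(ch, UNKNOWN_ID) for ch in domain.lower()[:max_len]]
    return ids + [0] * (max_len - len(ids))  # right-pad with zeros


def undersample(samples: list[tuple[str, str]], cap: int, seed: int = 0) -> list[tuple[str, str]]:
    """Keep at most `cap` (domain, label) pairs per label, chosen at random."""
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for domain, label in samples:
        by_label[label].append((domain, label))
    kept = []
    for group in by_label.values():
        kept.extend(rng.sample(group, cap) if len(group) > cap else group)
    return kept


def sliding_windows(ids: list[int], size: int) -> list[list[int]]:
    """All contiguous windows of `size` ids, moving one position at a time."""
    return [ids[i:i + size] for i in range(len(ids) - size + 1)]


if __name__ == "__main__":
    # Hypothetical samples: one large class and one small DGA family.
    data = [(f"site{i}.com", "benign") for i in range(6)]
    data += [("xjwqkzpv.net", "family_a"), ("qzkvhwxp.net", "family_a")]
    balanced = undersample(data, cap=3)
    for domain, label in balanced:
        ids = encode_domain(domain, max_len=16)
        windows = {n: len(sliding_windows(ids, n)) for n in (2, 3, 4)}
        print(label, domain, ids, windows)
```

Capping only the over-represented classes leaves small families untouched, which is one simple way to reduce imbalance without discarding scarce samples.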
Compared with existing DGA domain name classification models, SW-DRN improves the macro F-Score by 3.34% and 4.8%, which greatly improves discrimination between DGA domain name families. SW-DRN therefore not only improves DGA domain name classification performance but also offers a new approach to identifying small-sample DGA domain name families.
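The macro F-Score averages the per-class scores with equal weight, which is why it is sensitive to small families. The sketch below computes it alongside the micro-averaged score on toy labels. The label names and predictions are hypothetical and unrelated to the paper's data, and the abstract does not state exactly how the authors computed their scores.

```python
from collections import Counter


def per_class_f1(y_true, y_pred):
    """F1 for every label: 2*TP / (2*TP + FP + FN)."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for t, p in zip(y_true, y_pred):
        if t == p:
            tp[t] += 1
        else:
            fp[p] += 1  # predicted label gains a false positive
            fn[t] += 1  # true label gains a false negative
    scores = {}
    for label in set(y_true) | set(y_pred):
        denom = 2 * tp[label] + fp[label] + fn[label]
        scores[label] = 2 * tp[label] / denom if denom else 0.0
    return scores


def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1, so small classes count as much as large ones."""
    scores = per_class_f1(y_true, y_pred)
    return sum(scores.values()) / len(scores)


def micro_f1(y_true, y_pred):
    """Micro-averaged F1; for single-label classification this equals accuracy."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


if __name__ == "__main__":
    # Toy data: a large family and a small one (hypothetical labels).
    y_true = ["family_a"] * 8 + ["family_b"] * 2
    y_pred = ["family_a"] * 8 + ["family_a", "family_b"]
    print("per-class:", per_class_f1(y_true, y_pred))
    print("macro F1:", round(macro_f1(y_true, y_pred), 4))
    print("micro F1:", round(micro_f1(y_true, y_pred), 4))
```

In this toy case a single error on the small family lowers the macro score much more than the micro score, which illustrates why macro F-Score is a natural measure for small-sample family recognition.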

Key words: domain generation algorithm, character-level vector, residual network, depthwise separable convolution

CLC number: