Acta Electronica Sinica ›› 2020, Vol. 48 ›› Issue (10): 1983-1992. DOI: 10.3969/j.issn.0372-2112.2020.10.016

• Research Paper •

基于生成对抗网络的差分隐私数据发布方法

方晨1, 郭渊博1, 王娜1, 甄帅辉1,2, 唐国栋3   

  1. 信息工程大学, 河南郑州 450001;
    2. 中国人民解放军93808部队, 甘肃兰州 730000;
    3. 中国人民解放军75775部队, 广东广州 510000
  • 收稿日期:2019-12-10 修回日期:2020-04-16 出版日期:2020-10-25
    • 通讯作者:
    • 郭渊博
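    • Author biographies:
    • FANG Chen (方晨), male, born in 1993 in Anqing, Anhui. PhD candidate at the Strategic Support Force Information Engineering University. Research interest: privacy and security in machine learning. E-mail: 17756230629@163.com
      WANG Na (王娜), female, born in 1970 in Zhengzhou, Henan. Associate professor at the Strategic Support Force Information Engineering University. Research interest: network and information security.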
    • 基金资助:
    • 国家自然科学基金 (No.61501515,No.61601515); 信息保障技术重点实验室开放基金 (No.KJ-15-108)

Differential Private Data Publishing Method Based on Generative Adversarial Network

FANG Chen1, GUO Yuan-bo1, WANG Na1, ZHEN Shuai-hui1,2, TANG Guo-dong3   

  1. Information Engineering University, Zhengzhou, Henan 450001, China;
    2. Unit 93808, Lanzhou, Gansu 730000, China;
    3. Unit 75775, Guangzhou, Guangdong 510000, China
  • Received:2019-12-10 Revised:2020-04-16 Online:2020-10-25 Published:2020-10-25
    • Corresponding author:
    • GUO Yuan-bo
    • Supported by:
    • National Natural Science Foundation of China (No.61501515, No.61601515); Open Fund of Key Laboratory of Information Assurance Technology (No.KJ-15-108)

摘要: 机器学习的飞速发展使其成为数据挖掘领域最有效的工具之一,但算法的训练过程往往需要大量的用户数据,给用户带来了极大的隐私泄漏风险.由于数据统计特征的复杂性及语义丰富性,传统隐私数据发布方法往往需要对原始数据进行过度清洗,导致数据可用性低而难以再适用于数据挖掘任务.为此,提出了一种基于生成对抗网络(Generative Adversarial Network,GAN)的差分隐私数据发布方法,通过在GAN模型训练的梯度上添加精心设计的噪声来实现差分隐私,确保GAN可无限量生成符合源数据统计特性且不泄露隐私的合成数据.针对现有同类方法合成数据质量低、模型收敛缓慢等问题,设计多种优化策略来灵活调整隐私预算分配并减小总体噪声规模,同时从理论上证明了合成数据严格满足差分隐私特性.在公开数据集上与现有方法进行实验对比,结果表明本方法能够更高效地生成质量更高的隐私保护数据,适用于多种数据分析任务.

关键词: 差分隐私, 生成对抗网络, 隐私数据发布, 合成数据, 数据挖掘

Abstract: The rapid development of machine learning has made it one of the most effective tools in data mining. However, training these algorithms often requires large amounts of user data, which exposes users to a serious risk of privacy leakage. Because the data have complex statistical characteristics and rich semantics, traditional private data publishing methods usually over-sanitize the original data, leaving it with low utility and unsuitable for data mining tasks. To address this, a differentially private data publishing method based on the generative adversarial network (GAN) is proposed. Differential privacy is achieved by adding carefully designed noise to the gradients during GAN training, so that the GAN can generate an unlimited amount of synthetic data that conforms to the statistical characteristics of the source data without leaking private information. To address the low quality of synthetic data and the slow model convergence of existing methods of this kind, several optimization strategies are designed to flexibly adjust the privacy budget allocation and reduce the overall noise scale. Moreover, it is proved theoretically that the synthetic data strictly satisfies differential privacy. Experimental comparisons with existing methods on public datasets show that the proposed method generates higher-quality privacy-preserving data more efficiently and is suitable for a variety of data analysis tasks.

Key words: differential privacy, generative adversarial network, private data publishing, synthetic data, data mining
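
The abstract describes achieving differential privacy by adding noise to the gradients during GAN training. The sketch below illustrates only the generic form of that idea: clip each per-example gradient, sum, and add Gaussian noise scaled to the clipping bound. It is not the paper's noise design or its budget allocation strategy, and the names `clip_gradient`, `noisy_average_gradient`, `clip_norm`, and `noise_multiplier` are hypothetical, chosen for illustration.

```python
import math
import random


def clip_gradient(grad, clip_norm):
    """Scale grad so that its L2 norm is at most clip_norm."""
    norm = math.sqrt(sum(g * g for g in grad))
    scale = min(1.0, clip_norm / norm) if norm > 0 else 1.0
    return [g * scale for g in grad]


def noisy_average_gradient(per_example_grads, clip_norm, noise_multiplier, rng):
    """Clip each per-example gradient, sum, add Gaussian noise, then average.

    Assumption: the noise standard deviation is noise_multiplier * clip_norm,
    the usual Gaussian-mechanism calibration to the clipped sensitivity.
    """
    dim = len(per_example_grads[0])
    total = [0.0] * dim
    for grad in per_example_grads:
        clipped = clip_gradient(grad, clip_norm)
        for i in range(dim):
            total[i] += clipped[i]
    sigma = noise_multiplier * clip_norm
    batch_size = len(per_example_grads)
    # Independent noise is drawn for every coordinate of the summed gradient.
    return [(t + rng.gauss(0.0, sigma)) / batch_size for t in total]


if __name__ == "__main__":
    rng = random.Random(0)  # fixed seed so the demo is repeatable
    grads = [[3.0, 4.0], [0.3, 0.4], [-6.0, 8.0]]  # toy per-example gradients
    print(noisy_average_gradient(grads, clip_norm=1.0, noise_multiplier=1.1, rng=rng))
```

Clipping bounds how much any single record can change the summed gradient, which is what lets the added noise be calibrated to a fixed sensitivity. A larger `noise_multiplier` gives stronger privacy at the cost of noisier updates.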

CLC number: