电子学报 ›› 2016, Vol. 44 ›› Issue (4): 906-912.DOI: 10.3969/j.issn.0372-2112.2016.04.022

• 学术论文 • 上一篇    下一篇

基于LSH的高维大数据k近邻搜索算法

王忠伟, 陈叶芳, 钱江波, 陈华辉   

  1. 宁波大学信息科学与工程学院, 浙江宁波 315211
  • 收稿日期:2014-12-24 修回日期:2015-09-20 出版日期:2016-04-25
    • 通讯作者:
    • 陈叶芳
    • 作者简介:
    • 王忠伟 男,1988年出生于河南商丘.宁波大学信息科学与工程学院硕士研究生,主要研究方向为大数据、数据挖掘. E-mail:huanfengwei@163.com
    • 基金资助:
    • 国家自然科学基金 (No.61472194,No.61572266); 浙江省自然科学基金 (No.LY13F020040); 宁波市自然科学基金 (No.2014A610023); "信息与通信工程"浙江省重中之重学科开放基金

LSH-Based Algorithm for k Nearest Neighbor Search on Big Data

WANG Zhong-wei, CHEN Ye-fang, QIAN Jiang-bo, CHEN Hua-hui   

  1. Faculty of Electrical Engineering and Computer Science, Ningbo University, Ningbo, Zhejiang 315211, China
  • Received:2014-12-24 Revised:2015-09-20 Online:2016-04-25 Published:2016-04-25
    • Supported by:
    • National Natural Science Foundation of China (No.61472194, No.61572266); National Natural Science Foundation of Zhejiang Province,  China (No.LY13F020040); Ningbo Natural Science Fund (No.2014A610023); Open Fund for Information and Communication Engineering --Top Priority Discipline of Zhejiang Province

摘要:

局部敏感哈希(LSH)及其变体是解决高维数据k近邻(kNN)搜索的有效算法.但是,随着数据规模的日趋庞大,传统的集中式LSH算法结构已经不能够满足大数据时代的需求.本文分析传统LSH方案的不足之处,拓展AND-OR结构,提出通过索引而不比较原始数据直接实现高维大数据k近邻搜索算法C2SLSH.理论分析和实验证明,C2SLSH在分布式平台下具有稳定的可扩展性,在保证同等精确率的情况下,处理速度大约是现有方法的3倍.

关键词: 高维数据k近邻, 局部敏感哈希, MapReduce, 冲突计数排序

Abstract:

The locality sensitive hashing (LSH) and its variants are efficient algorithms to solve the k nearest neighbor (kNN) search problems on high-dimensional data.However,with the increase of large data size,the traditional centralized LSH algorithms cannot meet the challenge of the big data era.Based on a new AND-OR construction,this paper proposes an algorithm (called C2SLSH) for the k nearest neighbor search on big data.Different to the traditional algorithms,the C2SLSH can directly get the results from an index without having to compare the original data.The theoretical analysis and experimental results show that the algorithm has stable scalability on a distributed platform.Furthermore,it is faster than the conventional methods for about three times with the same accuracy rate.

Key words: kNN, LSH, MapReduce, collision counting sorting

中图分类号: