电子学报 ›› 2021, Vol. 49 ›› Issue (10): 2002-2011.DOI: 10.12263/DZXB.20190919

• 学术论文 • 上一篇    下一篇

DisHAP:基于层次亲和聚类的分布式大图划分算法

柳菁, 李琪   

  1. 绍兴文理学院计算机科学与工程系,浙江 绍兴 312000
  • 收稿日期:2019-08-12 修回日期:2021-07-26 出版日期:2021-10-25 发布日期:2021-10-25
  • 作者简介:柳 菁 女,1979年生,浙江绍兴人.2009年于同济大学获计算机工程硕士学位,现任绍兴文理学院计算机科学与技术专业实验师,研究方向为智能信息处理等.
    李 琪(通信作者) 男,1987年生,江苏淮安人.2019年毕业于重庆大学获软件工程博士学位,现任绍兴文理学院计算机科学与技术专业讲师,研究方向为复杂网络、知识图谱等.E-mail:liqi0713@foxmail.com
  • 基金资助:
    国家自然科学基金青年科学基金(62002226)

DisHAP: A Distributed Partition Algorithm for Large Scale Graphs Based on Hierarchical Affinity Clustering

Jing LIU, Qi LI   

  1. Department of Computer Science and Engineering,Shaoxing University,Shaoxing,Zhejiang 312000,China
  • Received:2019-08-12 Revised:2021-07-26 Online:2021-10-25 Published:2021-10-25

摘要:

平衡图划分是改善并行图计算性能的关键.一个良好的划分算法应保证划分后的子图在负载均衡的前提下,减少子图之间的交互边(切割边)规模,从而减少网络通信.对此,本文设计一种基于层次亲和聚类的分布式大图划分算法(DisHAP).该算法采用亲和聚类的思想,将图初始划分为规模相等的k个子图;再将结果映射成顶点序列,以线性嵌入顺序处理节点,通过局部交换策略优化割边率;最后将DisHAP应用在MapReduce框架中,使用多种真实及理论图数据,与现有的大图划分算法做比较分析.以Twitter图为例,划分2,4,8,16,32个子区,相较于现有的大图划分算法(LDG,BLP,Spinner,Fennel,ParMetis及PSA-MIR算法),割边率减少1.7%~30.2%,说明了该算法的优越性.同时该算法具有良好的可扩展性,划分的子区数量及图的规模对划分时间具有较低的影响.

关键词: 分布式大图划分, 层次聚类, 局部优化, 分布式图计算, 平衡划分

Abstract:

It is the key to improve the performance of graph calculation by improving the efficiency of the graph partitioning algorithm and reducing the communication edge scale between the subgraphs. Due to the limited memory capacity of single computing node, it is difficult to meet the partitioning requirements of large-scale graphs in terms of partitioning efficiency and partitioning quality. In this paper, a distributed graph partitioning algorithm based on hierarchical affinity clustering for massive scale graphs is designed, called DisHAP. It uses the Boruvka minimum spanning tree algorithm to balance cluster the graph under the constraint condition of the input graph according to the vertex similarity, and partitions the graph into k subgraph (partitions) with equal size. In order to optimize the fractiom of edges cut between these large subgraphs, we map the initial partition results to the vertex sequence and cut into a large number of subsections, randomly select the sub slice pairs in the neighbor subgraph sequence, and migrate the vertices according to the mutual exchange and the single vertex positive profit. Thus, the optimization problem of large data volume is converted to a large number of small problems to solve. The algorithm is applied to the MapReduce framework to effectively improve the efficiency of the algorithm. Finally, we use various actual and theoretical graph data to compare with existing graph partitioning algorithms to verify the effectiveness of the DisHAP algorithm. Taking the graph Twitter as an example, it is partitioned into 2,4,8,16,32 partitions. Compared to LDG, BLP, Spinner, Fennel, ParMetis and PSA-MIR algorithms, the fraction of edges cut is reduced by 30.2%, 29.4%, 10.2%, 7.8%, 1.7%, and 3.3% respectively.

Key words: distributed large-scale graph partitioning, hierarchical clustering, local optimization, distributed graph computation, balanced partitioning

中图分类号: