基于优先填补策略的Spark数据均衡分区方法

何玉林; 吴东彤; Philippe Fournier-Viger; 黄哲学

doi:10.12263/DZXB.20240094

您当前的位置：

首页 >

文章列表页 >

基于优先填补策略的Spark数据均衡分区方法

学术论文 | 更新时间：2026-05-07

- 基于优先填补策略的Spark数据均衡分区方法
- First Filling Strategy-Based Partitioning Method to Balance Data in Spark
- 电子学报 2024年52卷第10期页码：3322-3335
- 作者机构：
  
  1.人工智能与数字经济广东省实验室（深圳），广东深圳 518107
  2.深圳大学计算机与软件学院，广东深圳 518060
- 作者简介：
  
  [ "何玉林男，1982年4月出生于河北衡水.现为人工智能与数字经济广东省实验室（深圳）研究员、高级工程师.主要研究方向为大数据系统计算技术、多样本统计分析理论与方法、数据挖掘与机器学习算法及应用.中国电子学会会员编号：E190029878M.E-mail: yulinhe@gml.ac.cn" ]
  [ "吴东彤女，1999年9月出生于广东汕头.现为人工智能与数字经济广东省实验室（深圳）硕士研究生.主要研究方向为大数据智能处理与分析技术.E-mail: 2210273127@email.szu.edu.cn" ]
  [ "Philippe Fournier-Viger 男，1980年8月出生于加拿大蒙特利尔.现为深圳大学计算机与软件学院特聘教授、博士生导师.主要研究方向为数据挖掘、人工智能、知识表示和推理、认知模型建构等.E-mail: philfv@szu.edu.cn" ]
  [ "黄哲学男，1959年07月出生于黑龙江哈尔滨.现为深圳大学计算机与软件学院特聘教授、博士生导师.主要研究方向为数据挖掘、机器学习、大数据系统计算技术.E-mail: zx.huang@szu.edu.cn" ]
- 基金信息：
  
  深圳市科技重大专项项目(202302D074);广东省自然科学基金面上项目(2023A1515011667);深圳市基础研究面上项目(JCYJ20210324093609026);广东省基础与应用基础研究基金粤深联合基金重点项目(2023B1515120020)
- DOI：10.12263/DZXB.20240094
  中图分类号： TP311
- 收稿：2024-01-22，
  
  修回：2024-07-17，
  
  纸质出版：2024-10-25
- 稿件说明：
移动端阅览
何玉林, 吴东彤, Philippe Fournier-Viger, 等. 基于优先填补策略的Spark数据均衡分区方法[J]. 电子学报, 2024, 52(10): 3322-3335.

HE Yu-lin, WU Dong-tong, Fournier-Viger Philippe, et al. First Filling Strategy-Based Partitioning Method to Balance Data in Spark[J]. Acta Electronica Sinica, 2024, 52(10): 3322-3335.
何玉林, 吴东彤, Philippe Fournier-Viger, 等. 基于优先填补策略的Spark数据均衡分区方法[J]. 电子学报, 2024, 52(10): 3322-3335. DOI：10.12263/DZXB.20240094

HE Yu-lin, WU Dong-tong, Fournier-Viger Philippe, et al. First Filling Strategy-Based Partitioning Method to Balance Data in Spark[J]. Acta Electronica Sinica, 2024, 52(10): 3322-3335. DOI：10.12263/DZXB.20240094

摘要

Spark作为基于内存计算的分布式大数据处理框架，运行速度快且通用性强.在任务计算过程中，Spark的默认分区器HashPartitioner在处理倾斜数据时，容易产生各个分区数据量不平衡的情况，导致资源利用率低且运行效率差.现存的Spark均衡分区改进方法，例如多阶段分区、迁移分区和采样分区等，大多存在尺度把控难、通信开销成本高、对采样过度依赖等缺陷.为改善上述问题，本文提出了一种基于优先填补策略的分区方法，同时考虑了样本数据和非样本数据的分配，以便实现对全部数据的均衡分区.该方法在对数据采样并根据样本信息估算出每个键的权值后，将键按照权值大小降序排列，依次将键在满足分区容忍度的条件下分配到前面的分区中，为未被采样的键预留后面的分区空间，以获得针对样本数据的分区方案.Spark根据分区方案对样本中出现的键对应的数据进行分区，没有出现的键对应的数据则直接映射到可分配的最后一个分区中.实验结果表明，新分区方法能够有效实现Spark数据的均衡分区，在美国运输统计局发布的真实航空数据集上，基于该方法设计的优先填补分区器的总运行时间比HashPartitioner平均缩短了15.3%，比现有的均衡数据分区器和哈希键值重分配分区器分别平均缩短了38.7%和30.2%.

Abstract

Spark is a distributed big data processing framework based on in-memory computing

which has the advantages of fast running speed and strong versatility. When conducting the computation task

Spark’s default partitioner HashPartitioner is easy to generate data skewing among partitions. It results in low resource utilization and poor operating efficiency. Most of the existing Spark balanced partitioning methods

such as multi-stage partitioning

migration partitioning

and sampling partitioning

have defects of scale control difficulty

high communication overhead

and excessive sampling dependence. In order to solve the above-mentioned problems

we propose a partitioning method based on first filling strategy

which considers the allocations of sample data and non-sample data at the same time

so as to achieve a balanced data partitioning. After sampling the data and estimating the weight of each key according to the sample information

the keys are sorted in descending order according to the weights. The keys are in turn assigned to the previous partitions if their additions can satisfy the partition tolerance

and the space of the last partition is reserved for the keys that are not sampled

so as to obtain the partitioning plan for the sample data. Spark partitions the data corresponding to the keys that appear in the sample according to the partitioning plan

and the data of other keys that do not appear is directly allocated to the last data partition available. The experimental results show that the new method can effectively achieve balanced partitioning for Spark data. On the real datasets from Bureau of Transportation Statistics

compared with HashPartitioner

the total running time of first filling partitioner (FFP)

designed based on the proposed method

is shortened by 15.3% on average. In addition

FFP’s total running time is on average 38.7% shorter than balanced Spark data partitioner and 30.2% shorter than hash based key reassigning partitioner.

关键词

Keywords

references

KATAL A , WAZID M , GOUDAR R H . Big data: Issues, challenges, tools and Good practices [C ] // 2013 Sixth International Conference on Contemporary Computing (IC3) . Piscataway : IEEE , 2013 : 404 - 409 .

DITTRICH J , QUIANÉ-RUIZ J A . Efficient big data processing in Hadoop MapReduce [J ] . Proceedings of the VLDB Endowment , 2012 , 5 ( 12 ): 2014 - 2015 .

ZAHARIA M , CHOWDHURY M , FRANKLIN M J , et al . Spark: Cluster computing with working sets [C ] // 2nd USENIX Workshop on Hot Topics in Cloud Computing (HotCloud) . Berkeley : USENIX Association , 2010 : 1 - 7 .

CARBONE P , KATSIFODIMOS A , EWEN S , et al . Apache flink: Stream and batch processing in a single engine [J ] . Bulletin of the IEEE Computer Society Technical Committee on Data Engineering , 2015 , 38 ( 4 ): 28 - 38 .

SHVACHKO K , KUANG H R , RADIA S , et al . The hadoop distributed file system [C ] // 2010 IEEE 26th Symposium on Mass Storage Systems and Technologies (MSST) . Piscataway : IEEE , 2010 : 1 - 10 .

DEAN J , GHEMAWAT S . MapReduce: Simplified data processing on large clusters [J ] . Communications of the ACM , 2008 , 51 ( 1 ): 107 - 113 .

GEETHA J , HARSHIT N G . Implementation and performance comparison of partitioning techniques in apache spark [C ] // 2019 10th International Conference on Computing, Communication and Networking Technologies (ICCCNT) . Piscataway : IEEE , 2019 : 1 - 5 .

BAERT Q , CARON A C , MORGE M , et al . An adaptive multi-agent system for task reallocation in a MapReduce job [J ] . Journal of Parallel and Distributed Computing , 2021 , 153 : 75 - 88 .

BERLIŃSKA J , DROZDOWSKI M . Comparing load-balancing algorithms for MapReduce under Zipfian data skews [J ] . Parallel Computing , 2018 , 72 : 14 - 28 .

梁俊杰 , 何利民 . 基于MapReduce的数据倾斜连接算法 [J ] . 计算机科学 , 2016 , 43 ( 9 ): 27 - 31 .

LIANG J J , HE L M . Join algorithm in skewed datasets based on MapReduce [J ] . Computer Science , 2016 , 43 ( 9 ): 27 - 31 . (in Chinese)

LU W , CHEN L , WANG L Q , et al . NPIY: A novel partitioner for improving mapreduce performance [J ] . Journal of Visual Languages & Computing , 2018 , 46 : 1 - 11 .

陶永才 , 丁雷道 , 石磊 , 等 . MapReduce在线抽样分区负载均衡研究 [J ] . 小型微型计算机系统 , 2017 , 38 ( 2 ): 238 - 242 .

TAO Y C , DING L D , SHI L , et al . Research on MapReduce on-line load balancing based on sample partition [J ] . Journal of Chinese Computer Systems , 2017 , 38 ( 2 ): 238 - 242 . (in Chinese)

CHEN Q , YAO J Y , XIAO Z . LIBRA: Lightweight data skew mitigation in MapReduce [J ] . IEEE Transactions on Parallel and Distributed Systems , 2015 , 26 ( 9 ): 2520 - 2533 .

SINGH B , VERMA H K . IMSM: An interval migration based approach for skew mitigation in MapReduce [J ] . Recent Advances in Computer Science and Communications , 2021 , 14 ( 1 ): 71 - 81 .

王卓 , 陈群 , 李战怀 , 等 . 基于增量式分区策略的 MapReduce 数据均衡方法 [J ] . 计算机学报 , 2016 , 39 ( 1 ): 19 - 35 .

WANG Z , CHEN Q , LI Z H , et al . An incremental partitioning strategy for data balance on MapReduce [J ] . Chinese Journal of Computers , 2016 , 39 ( 1 ): 19 - 35 . (in Chinese)

ALAMRO S , LAN T , SUBRAMANIAM S . Forseti: Dynamic chunk-level reshaping for data processing on heterogeneous clusters [J ] . Journal of Parallel and Distributed Computing , 2023 , 171 : 14 - 23 .

侯震梅 , 杨玉莹 . 分布式数据流数据倾斜均衡方法研究 [J ] . 长春大学学报(自然科学版) , 2020 , 30 ( 5 ): 11 - 20 .

HOU Z M , YANG Y Y . Research on data skew equalization method for distributed data streams [J ] . Journal of Changchun University , 2020 , 30 ( 5 ): 11 - 20 . (in Chinese)

KWON Y , BALAZINSKA M , HOWE B , et al . SkewTune: Mitigating skew in mapreduce applications [C ] // Proceedings of the 2012 ACM SIGMOD International Conference on Management of Data . New York : ACM , 2012 : 25 - 36 .

HOU X F , ASHWIN K T K , THOMAS J P , et al . Dynamic workload balancing for Hadoop MapReduce [C ] // 2014 IEEE Fourth International Conference on Big Data and Cloud Computing (BDCloud) . Piscataway : IEEE , 2014 : 56 - 62 .

TANG Z , ZHANG X S , LI K L , et al . An intermediate data placement algorithm for load balancing in Spark computing environment [J ] . Future Generation Computer Systems , 2018 , 78 : 287 - 301 .

卞琛 , 于炯 , 修位蓉 , 等 . 基于迭代填充的内存计算框架分区映射算法 [J ] . 计算机应用 , 2017 , 37 ( 3 ): 647 - 653 .

BIAN C , YU J , XIU W R , et al . Partitioning and mapping algorithm for in-memory computing framework based on iterative filling [J ] . Journal of Computer Applications , 2017 , 37 ( 3 ): 647 - 653 . (in Chinese)

张元鸣 , 蒋建波 , 陆佳炜 , 等 . 面向MapReduce的迭代式数据均衡分区策略 [J ] . 计算机学报 , 2019 , 42 ( 8 ): 1873 - 1885 .

ZHANG Y M , JIANG J B , LU J W , et al . An iterative data partitioning strategy for MapReduce [J ] . Chinese Journal of Computers , 2019 , 42 ( 8 ): 1873 - 1885 . (in Chinese)

HE Z Y , HUANG Q L , LI Z F , et al . Handling data skew for aggregation in spark SQL using task stealing [J ] . International Journal of Parallel Programming , 2020 , 48 ( 6 ): 941 - 956 .

SHEN Y J , XIONG J , JIANG D J . SrSpark: Skew-resilient spark based on adaptive parallel processing [C ] // 2020 IEEE 26th International Conference on Parallel and Distributed Systems (ICPADS) . Piscataway : IEEE , 2020 : 466 - 475 .

YU J D , CHEN H P , HU F . SASM: Improving spark performance with Adaptive Skew Mitigation [C ] // 2015 IEEE International Conference on Progress in Informatics and Computing (PIC) . Piscataway : IEEE , 2015 : 102 - 107 .

HUANG Z C , WEI W G , XIE G Y . Load balancing mechanism based on linear regression partition prediction in spark [J ] . Journal of Physics: Conference Series , 2020 , 1575 ( 1 ): 012109 .

WANG S Z , JIA Z T , WANG W L . Research on optimization of data balancing partition algorithm based on spark platform [M ] // Lecture Notes in Computer Science . Cham : Springer International Publishing , 2021 : 3 - 13 .

SONG A B , PENG B W , QIU J Y , et al . BSDP: A novel balanced spark data partitioner [C ] // 2021 IEEE 27th International Conference on Parallel and Distributed Systems (ICPADS) . Piscataway : IEEE , 2021 : 556 - 566 .

GUO W X , HUANG C J , TIAN W H . Handling data skew at reduce stage in Spark by ReducePartition [J ] . Concurrency and Computation: Practice and Experience , 2020 , 32 ( 9 ): e5637 .

ZVARA Z , SZABÓ P G N , LÓRÁNT B B , et al . System-aware dynamic partitioning for batch and streaming workloads [C ] // Proceedings of the 14th IEEE/ACM International Conference on Utility and Cloud Computing (UCC) . New York : ACM , 2021 : 1 - 10 .

阎逸飞 , 王智立 , 邱雪松 , 等 . Spark环境下基于数据倾斜模型的Shuffle分区优化方案 [J ] . 北京邮电大学学报 , 2020 , 43 ( 2 ): 116 - 121 .

YAN Y F , WANG Z L , QIU X S , et al . A shuffle partition optimization scheme based on data skew model in spark [J ] . Journal of Beijing University of Posts and Telecommunications , 2020 , 43 ( 2 ): 116 - 121 . (in Chinese)

TANG Z , LV W , LI K L , et al . An intermediate data partition algorithm for skew mitigation in Spark computing environment [J ] . IEEE Transactions on Cloud Computing , 2021 , 9 ( 2 ): 461 - 474 .

SHI X J , QIAN Y Q . An algorithm of data skew in spark based on partition [C ] // 2020 International Conference on Computers, Information Processing and Advanced Education (CIPAE) . Piscataway : IEEE , 2020 : 217 - 222 .

LI C L , CAI Q Q , LUO Y L . Data balancing-based intermediate data partitioning and check point-based cache recovery in Spark environment [J ] . The Journal of Supercomputing , 2022 , 78 ( 3 ): 3561 - 3604 .

ZAHARIA M , CHOWDHURY M , DAS T , et al . Resilient distributed datasets: A fault-tolerant abstraction for in-memory cluster computing [C ] // Proceedings of the 9th USENIX Symposium on Networked Systems Design and Implementation (NSDI) . Berkeley : USENIX Association , 2012 : 15 - 28 .

卞琛 , 于炯 , 英昌甜 , 等 . 并行计算框架Spark的自适应缓存管理策略 [J ] . 电子学报 , 2017 , 45 ( 2 ): 278 - 284 .

BIAN C , YU J , YING C T , et al . Self-adaptive strategy for cache management in spark [J ] . Acta Electronica Sinica , 2017 , 45 ( 2 ): 278 - 284 . (in Chinese)

ILGEN B , KARAOGLAN B . Investigation of Zipf's ‘law-of-meaning'on Turkish corpora [C ] // 2007 22nd international symposium on computer and information sciences (ISCIS) . Piscataway : IEEE , 2007 : 1 - 6 .

LIN J . The curse of Zipf and limits to parallelization: A look at the stragglers problem in MapReduce [C ] // Proceedings of the 7th Workshop on Large-Scale Distributed Systems for Information Retrieval (LSDS-IR) , Boston : CEUR-WS , 2009 : 57 - 62 .

AHMAD F M . Statistical universals of language: Mathematical chance vs. human choice [J ] . Technometrics , 2022 , 64 ( 3 ): 432 - 433 .

United States Department of Transportation . Bureau Of Transportation Statistics (BTS) [DS/OL ] . ( 2024-04-24 ) [ 2024-04-24 ] . https://www.transtats.bts.gov https://www.transtats.bts.gov .

智慧来 , 张丽 , 李金海 . 旁观者视角下粒的多层次描述 [J ] . 电子学报 , 2022 , 50 ( 11 ): 2568 - 2574 .

ZHI H L , ZHANG L , LI J H . Multi-level description of granules from an outsider's perspective [J ] . Acta Electronica Sinica , 2022 , 50 ( 11 ): 2568 - 2574 . (in Chinese)

浏览量

下载量

CSCD

文章被引用时，请邮件提醒。

提交

工具集

关联资源

自适应的Spark数据均衡分区方法

正则化超限学习机的最大分划广义交替方向乘子法

基于MapReduce的增广动态Skyline查询处理方法