Adaptive Data Balanced Partitioner in Spark

HE Yu-lin; WU Dong-tong; HUANG Zhe-xue

doi:10.12263/DZXB.20250348

您当前的位置：

首页 >

文章列表页 >

Adaptive Data Balanced Partitioner in Spark

PAPERS | 更新时间：2025-12-27

- Adaptive Data Balanced Partitioner in Spark
- ACTA ELECTRONICA SINICA Vol. 53, Issue 8, Pages: 2764-2778(2025)
- 作者机构：
  
  1.人工智能与数字经济广东省实验室（深圳），广东深圳 518107
  2.深圳大学计算机与软件学院，广东深圳 518060
- 作者简介：
- 基金信息：
  
  Natural Science Foundation of Guangdong Province(2023A1515011667);Major Science and Technology Special Project of Shenzhen(KJZD20230923114809020)
- DOI：10.12263/DZXB.20250348
  CLC： TP311;
- Received：30 April 2025，
  
  Accepted：18 August 2025，
  
  Published：25 August 2025
- 稿件说明：
移动端阅览
何玉林, 吴东彤, 黄哲学. 自适应的Spark数据均衡分区方法[J]. 电子学报, 2025, 53(08): 2764-2778.

HE Yu-lin, WU Dong-tong, HUANG Zhe-xue. Adaptive Data Balanced Partitioner in Spark[J]. Acta Electronica Sinica, 2025, 53(08): 2764-2778.
何玉林, 吴东彤, 黄哲学. 自适应的Spark数据均衡分区方法[J]. 电子学报, 2025, 53(08): 2764-2778. DOI：10.12263/DZXB.20250348

HE Yu-lin, WU Dong-tong, HUANG Zhe-xue. Adaptive Data Balanced Partitioner in Spark[J]. Acta Electronica Sinica, 2025, 53(08): 2764-2778. DOI：10.12263/DZXB.20250348

摘要

Spark作为通用的计算引擎，以其简单、快速、可扩展的优势，被广泛地应用于大数据的处理和分析中.然而，Spark默认采用哈希分区或范围分区对数据进行划分，导致其在处理键倾斜分布的数据时，常常出现各分区数据量严重不均衡的问题.诸多优化方法被提出，如迁移分区、贪心分区、反馈分区等，但往往存在数据传输量大、额外计算成本高、运行时间长等问题.为更好地缓解键倾斜分布问题带来的影响，本文提出了一种自适应的Spark数据均衡分区方法.该方法引入了奖惩思想对数据分区过程进行适当调控，同时对于数据量较大的键进行分割，使得各个分区的数据量相对均衡.该方法首先对数据采样并预估键权重.其次，按照键权重对样本数据降序排列，确保所有分区都有初始数据.再次，根据奖惩分配策略，自适应地更新各个分区的分配概率，并将待分配的键指向分配概率最高的分区.对于超过分区容量的键的数据，则分割为多个部分且指向不同分区.在所有样本数据分配完成后，获得自适应分区方案.在实际分区时，对于样本中出现的键对应的数据按照自适应分区方案进行分配；对于未出现的键对应的数据，则按照哈希方法进行分区.最后，通过实验验证，基于新方法设计的自适应均衡分区器（Adaptive Data Balanced Partitioner，ADBP）能够有效缓解键倾斜的负面影响.在真实数据集上，ADBP的WordCount程序总运行时间比自带分区器Hash、Range分别平均缩短了1.51%、29.90%，比现有基于学习自动机的自适应哈希分区器（Learning Automata Hash Partitioner，LAHP）、对倾斜的中间数据块进行拆分合并（Splitting and Combination algorithm for skew Intermediate Data block，SCID）算法、粗粒度放置和细粒度放置（Fined-Coarse Grained Intermediate Data Placement，FCGIDP）算法分别平均缩短了8.12%、21.64%、19.62%.

Abstract

As an open-source computing engine

due to its simplicity

speed and scalability

Spark is widely used in the field of big data processing and analysis. Spark defaults to using hash partitioning or range partitioning to partition data. It often results in severe imbalances in data volume between partitions when processing data with skewed key distributions. Many optimization methods have been proposed

such as migration partitioning

greedy partitioning

feedback partitioning

etc.

but often have problems such as large data transmission

high extra computing cost

and long running time. In order to better alleviate the impact of key skew distribution problem

this paper proposes an adaptive Spark data balanced partitioning method

which introduces the idea of reward and punishment to properly regulate the data partitioning process. At the same time

the key with big data volume is properly divided to make the data amount of each partition relatively balanced. After sampling the data and estimating the key weights

the sample data are sorted in descending order according to the key weights

so that all partitions have initial data. Then according to the reward and punishment allocation strategy

the allocation probability of each partition is adaptively updated and the keys to be allocated are directed to the partition with the highest probability. The adaptive data partitioning scheme was obtained after all sample data were allocated. In actual partitioning

the data of keys that appear in the sample are allocated according to the adaptive data partitioning scheme

while the data of keys that do not appear are partitioned according to the hash method. The experimental results show that the adaptive data balanced partitioner (ADBP) designed with the new data partitioning method can effectively alleviate the negative impact of key skew. On real data sets

the total running time of WordCount program of ADBP is averagely 1.51% and 29.90% shorter than Spark’s own partitioners

i.e.

HashPartitioner and RangePartitioner

and averagely 8.12%

21.64% and 19.62% shorter than the existing partitioners learning automata hash partitioner (LAHP)

splitting and combination algorithm for skew intermediate data block (SCID) and fined-coarse grained intermediate data placement (FCGIDP) respectively.

关键词

Keywords

references

DITTRICH J , QUIANÉ-RUIZ J A . Efficient big data processing in Hadoop MapReduce [J ] . Proceedings of the VLDB Endowment , 2012 , 5 ( 12 ): 2014 - 2015 .

ZAHARIA M , CHOWDHURY M , FRANKLIN M J , et al . Spark: Cluster computing with working sets [C ] // Proceedings of the 2nd USENIX Workshop on Hot Topics in Cloud Computing (Hot Cloud) . Berkeley : USENIX Association , 2010 : 1 - 7 .

CARBONE P , KATSIFODIMOS A , EWEN S , et al . Apache Flink: Stream and batch processing in a single engine [J ] . The Bulletin of the Technical Committee on Data Engineering , 2015 , 38 ( 4 ): 28 - 38 .

HAN Z J , ZHANG Y J . Spark: A big data processing platform based on memory computing [C ] // Proceedings of the 7th International Symposium on Parallel Architectures, Algorithms and Programming . Piscataway : IEEE , 2016 : 172 - 176 .

VERMA A , MANSURI A H , JAIN N . Big data management processing with Hadoop MapReduce and spark technology: A comparison [C ] // 2016 Symposium on Colossal Data Analysis and Networking . Piscataway : IEEE , 2016 : 1 - 4 .

SHREE R , CHOUDHURY T , GUPTA S C , et al . KAFKA: The modern platform for data management and analysis in big data domain [C ] // Proceedings of the 2nd International Conference on Telecommunication and Networks . Piscataway : IEEE , 2018 : 1 - 5 .

ZHAI M Y , SONG A B , QIU J Y , et al . Query optimization approach with shuffle intermediate cache layer for spark SQL [C ] // 2019 IEEE 38th International Performance Computing and Communications Conference . Piscataway : IEEE , 2020 : 1 - 6 .

YENDURI L K . Performance evaluation of apache hadoop, spark, and flink for batch processing of big data: A comparative analysis [C ] // Proceedings of the 3rd International Conference on Electrical, Electronics, Information and Communication Technologies . Piscataway : IEEE , 2024 : 1 - 5 .

AHN H , KIM H , YOU W . Performance study of spark on YARN cluster using HiBench [C ] // 2018 IEEE International Conference on Consumer Electronics - Asia . Piscataway : IEEE , 2018 : 206 - 212 .

张占峰 , 王文礼 , 耿珊珊 , 等 . Spark数据倾斜问题研究 [J ] . 河北省科学院学报 , 2020 , 37 ( 1 ): 1 - 7 .

ZHANG Z F , WANG W L , GENG S S , et al . Research on data skew of spark [J ] . Journal of the Hebei Academy of Sciences , 2020 , 37 ( 1 ): 1 - 7 . (in Chinese)

HE Z Y , LI Z F , PENG X S , et al . D S2 : Handling data skew using data stealings over high-speed networks [C ] // 2021 IEEE 37th International Conference on Data Engineering . Piscataway : IEEE , 2021: 1865 - 1870 .

何玉林 , 吴东彤 , Fournier-Viger Philippe , 等 . 基于优先填补策略的Spark数据均衡分区方法 [J ] . 电子学报 , 2024 , 52 ( 10 ): 3322 - 3335 .

HE Y L , WU D T , FOURNIERVIGER P , et al . First filling strategy-based partitioning method to balance data in spark [J ] . Acta Electronica Sinica , 2024 , 52 ( 10 ): 3322 - 3335 . (in Chinese)

HE Z Y , HUANG Q L , LI Z F , et al . Handling data skew for aggregation in spark SQL using task stealing [J ] . International Journal of Parallel Programming , 2020 , 48 ( 6 ): 941 - 956 .

SHEN Y J , XIONG J , JIANG D J . SrSpark: Skew-resilient spark based on adaptive parallel processing [C ] // 2020 IEEE 26th International Conference on Parallel and Distributed Systems . Piscataway : IEEE , 2021 : 466 - 475 .

YU J D , CHEN H P , HU F . SASM: Improving spark performance with Adaptive Skew Mitigation [C ] // 2015 IEEE International Conference on Progress in Informatics and Computing . Piscataway : IEEE , 2016 : 102 - 107 .

HUANG Z C , WEI W G , XIE G Y . Load balancing mechanism based on linear regression partition prediction in spark [J ] . Journal of Physics: Conference Series , 2020 , 1575 ( 1 ): 012109 .

侯震梅 , 杨玉莹 . 分布式数据流数据倾斜均衡方法研究 [J ] . 长春大学学报(自然科学版) , 2020 , 30 ( 5 ): 11 - 20 .

HOU Z M , YANG Y Y . Research on data skew equalization method for distributed data streams [J ] . Journal of Changchun University , 2020 , 30 ( 5 ): 11 - 20 . (in Chinese)

周舜杰 . 分布式流连接系统负载均衡策略研究 [D ] . 武汉 : 华中科技大学 , 2019 .

ZHOU S J . Distributed Stream Join System Load Balance Strategy Studies [D ] . Wuhan : Huazhong University of Science and Technology , 2019 . (in Chinese)

卞琛 , 修位蓉 , 于炯 . 异构Spark集群数据倾斜修正调度策略 [J ] . 计算机工程与科学 , 2022 , 44 ( 4 ): 620 - 630 .

BIAN C , XIU W R , YU J . A data skew correction scheduling strategy of heterogeneous Spark cluster [J ] . Computer Engineering & Science , 2022 , 44 ( 4 ): 620 - 630 . (in Chinese)

TANG Z , ZHANG X S , LI K L , et al . An intermediate data placement algorithm for load balancing in Spark computing environment [J ] . Future Generation Computer Systems , 2018 , 78 : 287 - 301 .

WANG S Z , JIA Z T , WANG W L . Research on optimization of data balancing partition algorithm based on spark platform [C ] // Artificial Intelligence and Security . Cham : Springer , 2021 : 3 - 13 .

SONG A B , PENG B W , QIU J Y , et al . BSDP: A novel balanced spark data partitioner [C ] // 2021 IEEE 27th International Conference on Parallel and Distributed Systems . Piscataway : IEEE , 2022 : 556 - 566 .

GUO W X , HUANG C J , TIAN W H . Handling data skew at reduce stage in Spark by ReducePartition [J ] . Concurrency and Computation: Practice and Experience , 2020 , 32 ( 9 ): e5637 .

LI C L , ZHANG Y , LUO Y L . Intermediate data placement and cache replacement strategy under Spark platform [J ] . Journal of Parallel and Distributed Computing , 2022 , 163 : 114 - 135 .

杨迪 , 赵家伟 , 王鹏 , 等 . 面向负载均衡的动态均衡分区策略 [J ] . 计算机应用与软件 , 2024 , 41 ( 8 ): 46 - 52 .

YANG D , ZHAO J W , WANG P , et al . Dynamic balancing partition strategy for load balancing [J ] . Computer Applications and Software , 2024 , 41 ( 8 ): 46 - 52 . (in Chinese)

KUMAR R , AGRAWAL N , TAPASWI S . Improved load balancing and partitioning model for cloud networks [C ] // 2024 OITS International Conference on Information Technology . Piscataway : IEEE , 2024 : 803 - 808 .

SUN D W , ZHANG C L , GAO S , et al . An adaptive load balancing strategy for stateful join operator in skewed data stream environments [J ] . Future Generation Computer Systems , 2024 , 152 : 138 - 151 .

HE C L , HUANG Y , WANG C Y , et al . Dynamic data partitioning strategy based on heterogeneous flink cluster [C ] // Proceedings of the 5th International Conference on Artificial Intelligence and Big Data . Piscataway : IEEE , 2022 : 355 - 360 .

IRANDOOST M A , RAHMANI A M , SETAYESHI S . A novel algorithm for handling reducer side data skew in MapReduce based on a learning automata game [J ] . Information Sciences , 2019 , 501 : 662 - 679 .

刘寒梅 , 韩宏莹 . 基于反馈调度的MapReduce负载均衡分区算法研究 [J ] . 信息通信 , 2015 , 28 ( 10 ): 41 - 42 .

LIU H M , HAN H Y . Research on MapReduce load balancing partition algorithm based on feedback scheduling [J ] . Information & Communications , 2015 , 28 ( 10 ): 41 - 42 . (in Chinese)

YAN J T . Fuzzy-based balanced partitioning under capacity and size-tolerance constraints in distributed quantum circuits [J ] . IEEE Transactions on Quantum Engineering , 2023 , 4 : 5100115 .

WANG H Y , YE X D . Research on optimization strategy of data skew problem based on hive partitioning mechanism [C ] // Proceedings of the 5th International Conference on Decision Science & Management . Piscataway : IEEE , 2023 : 73 - 77 .

阎逸飞 , 王智立 , 邱雪松 , 等 . Spark环境下基于数据倾斜模型的Shuffle分区优化方案 [J ] . 北京邮电大学学报 , 2020 , 43 ( 2 ): 116 - 121 .

YAN Y F , WANG Z L , QIU X S , et al . A shuffle partition optimization scheme based on data skew model in spark [J ] . Journal of Beijing University of Posts and Telecommunications , 2020 , 43 ( 2 ): 116 - 121 . (in Chinese)

ZVARA Z , SZABÓ P G N , LÓRÁNT B B , et al . System-aware dynamic partitioning for batch and streaming workloads [C ] // Proceedings of the 14th IEEE/ACM International Conference on Utility and Cloud Computing . New York : ACM , 2021 : 1 - 10 .

王磊 . 基于Spark异构集群的低成本任务调度策略和数据倾斜优化研究 [D ] . 重庆 : 重庆邮电大学 , 2022 .

WANG L . Research on Low Cost Task Scheduling Strategy and Data Skew Optimization Based on Spark Heterogeneous Cluster [D ] . Chongqing : Chongqing University of Posts and Telecommunications , 2022 . (in Chinese)

CHEN Q , YAO J Y , XIAO Z . LIBRA: Lightweight data skew mitigation in MapReduce [J ] . IEEE Transactions on Parallel and Distributed Systems , 2015 , 26 ( 9 ): 2520 - 2533 .

FU Z M , TANG Z , YANG L , et al . ImRP: A predictive partition method for data skew alleviation in spark streaming environment [J ] . Parallel Computing , 2020 , 100 : 102699 .

DASH M , NG W . Efficient reservoir sampling for transactional data streams [C ] // Proceedings of the 6th IEEE International Conference on Data Mining - Workshops . New York : ACM , 2006 : 662 - 666 .

LIN J . The curse of zipf and limits to parallelization: A look at the stragglers problem in MapReduce [C ] // Proceedings of the 7th Workshop on Large-Scale Distributed Systems for Information Retrieval (ISDS-IR) . Boston : CEUR-WS , 2009 : 57 - 62 .

United States Department of Transportation . Bureau of transportation statistics (BTS) [DB/OL ] . ( 2024-04-24 ) [ 2024-04-24 ] . https://www.transtats.bts.gov https://www.transtats.bts.gov .

PONNUSAMY S , GUPTA P . Scalable data partitioning techniques for distributed data processing in cloud environments: A review [J ] . IEEE Access , 2024 , 12 : 26735 - 26746 .

赵二虎 , 吴济文 , 肖思莹 , 等 . 嵌入式异构智能计算系统并行多流水线设计 [J ] . 电子学报 , 2023 , 51 ( 11 ): 3354 - 3364 .

ZHAO E H , WU J W , XIAO S Y , et al . Parallel multi pipeline design of embedded heterogeneous AI computing systems [J ] . Acta Electronica Sinica , 2023 , 51 ( 11 ): 3354 - 3364 . (in Chinese)

Views

下载量

CSCD

Alert me when the article has been cited

提交

Tools

Publicity Resources

First Filling Strategy-Based Partitioning Method to Balance Data in Spark

Self-Adaptive Strategy for Cache Management in Spark

DATC-MC: Distributed Approximate Triangle Counting Based on Markov Chain in Graph Streaming

Related Author

HUANG Zhe-xue

WU Dong-tong

HE Yu-lin

Philippe Fournier-Viger

XIU Wei-rong

YING Chang-tian

YU Jiong

BIAN Chen

Related Institution

Guangdong Laboratory of Artificial Intelligence and Digital Economy

School of Information Science and Engineering, Xinjiang University

School of Information and Engineering, Urumqi Vocational University

School of Information Science and Engineering Xinjiang University Urumqi Xinjiang China

School of Information and Engineering Urumqi Vocational University Urumqi Xinjiang China

⁰