PASSING: A Hybrid Parameter Synchronization Strategy in Distributed Machine Learning

YU Xiao-shan; GU Hua-xi; ZHOU Zhao-xing; WANG Jia-kun

doi:10.12263/DZXB.20250207

PAPERS | Updated: 2025-12-27

- ACTA ELECTRONICA SINICA Vol. 53, Issue 8, Pages: 2636-2648 (2025)
- Author affiliation: School of Telecommunications Engineering, Xidian University, Xi'an, Shaanxi 710000
- DOI: 10.12263/DZXB.20250207
  CLC: TP393
- Received: 20 March 2025
  Accepted: 18 August 2025
  Published: 25 August 2025

YU Xiao-shan, GU Hua-xi, ZHOU Zhao-xing, et al. PASSING: A Hybrid Parameter Synchronization Strategy in Distributed Machine Learning[J]. Acta Electronica Sinica, 2025, 53(08): 2636-2648.

Abstract

With the explosive growth in the parameter counts of machine learning models and the scale of training datasets, a single computing node can no longer meet the computational demands of large artificial intelligence (AI) models. Distributed machine learning systems have become the primary platform for model training, shortening training time through parallel training across tens of thousands of devices. Data parallelism is a widely used parallel framework for distributed training: it partitions the training data across computing nodes, which then train the model collaboratively through periodic parameter synchronization. Because the nodes must transmit a large amount of data to complete parameter synchronization before each iteration, communication becomes the key factor limiting computational efficiency. Classic parameter synchronization strategies suffer from either a large number of communication rounds or congestion on the receiver's link, while strategies based on in-network aggregation are constrained by the limited computing and storage capabilities of switches and by congestion at server output ports. To address these problems, this paper proposes PASSING (hybrid Parameter Synchronization Strategy with In-host and In-network Aggregation), a hybrid parameter synchronization strategy. PASSING first synchronizes model parameters locally within a server or rack, and then sends the locally aggregated parameters to programmable switches, which complete the global parameter synchronization. This approach both preserves efficient communication among the small group of computing nodes within a host and reduces the computational and communication load on the switches. We built a testbed from multi-GPU (Graphics Processing Unit) servers and programmable switches and deployed PASSING on it. Experimental results show that PASSING improves training performance by up to 65.25% compared with the traditional parameter server algorithm, effectively accelerating distributed training.
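The two-stage idea described in the abstract, local aggregation inside each host followed by one global aggregation step, can be illustrated with a minimal Python sketch. This is not the authors' implementation: the function names (`aggregate`, `flat_sync`, `hybrid_sync`) are hypothetical, plain Python lists stand in for GPU gradient buffers, and an ordinary function call stands in for the programmable switch. It only shows that pre-aggregating per host yields the same averaged gradient while reducing the number of flows that reach the global aggregation point.

```python
from typing import List, Tuple

Vector = List[float]


def aggregate(vectors: List[Vector]) -> Vector:
    """Element-wise sum of equally sized gradient vectors."""
    return [sum(vals) for vals in zip(*vectors)]


def flat_sync(hosts: List[List[Vector]]) -> Tuple[Vector, int]:
    """Baseline (assumed): every GPU sends its gradient to one aggregator.

    Returns the averaged gradient and the number of flows that
    converge on the aggregation point.
    """
    all_grads = [grad for host in hosts for grad in host]
    total = aggregate(all_grads)
    n = len(all_grads)
    return [x / n for x in total], n


def hybrid_sync(hosts: List[List[Vector]]) -> Tuple[Vector, int]:
    """Two-stage sketch: in-host sum first, then one flow per host."""
    local_sums = [aggregate(host) for host in hosts]  # stage 1: inside each host
    total = aggregate(local_sums)                     # stage 2: global aggregation
    n = sum(len(host) for host in hosts)              # total number of workers
    return [x / n for x in total], len(local_sums)


# Toy setup (illustrative only): 2 hosts with 4 GPUs each,
# each GPU holding a 2-element gradient.
hosts = [[[float(h * 4 + g), 1.0] for g in range(4)] for h in range(2)]

flat_avg, flat_flows = flat_sync(hosts)
hybrid_avg, hybrid_flows = hybrid_sync(hosts)

print(flat_avg, flat_flows)      # [3.5, 1.0] 8
print(hybrid_avg, hybrid_flows)  # [3.5, 1.0] 2
```

In this toy setup both paths produce the same averaged gradient, but the global aggregation step receives two flows instead of eight, which is the kind of load reduction on the switch side that the abstract attributes to local pre-aggregation.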

