Resource-Efficient Configuration Method for Large Model Inference Loads Based on vGPU Performance Interference Awareness

ZHANG Hu; SUN Ming-hui; LIU Yang; DAI Hong-jun; WANG Ji-bin; ZHANG You-li

doi:10.12263/DZXB.20250468

Large Models and the Internet | Updated: 2026-02-10

- Resource-Efficient Configuration Method for Large Model Inference Loads Based on vGPU Performance Interference Awareness
- Acta Electronica Sinica, 2025, Vol. 53, No. 11, pp. 3836-3851
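- Author affiliations:
  1. School of Integrated Circuits, Shandong University, Jinan, Shandong 250101, China
  2. Key Laboratory of Computing Power Network and Information Security, Ministry of Education, Shandong Computer Science Center (National Supercomputer Center in Jinan), Qilu University of Technology (Shandong Academy of Sciences), Jinan, Shandong 250013, China
  3. Shandong Provincial Key Laboratory of Computing Power Internet and Service Computing, Shandong Fundamental Research Center for Computer Science, Jinan, Shandong 250103, China
  4. Beijing University of Posts and Telecommunications, Beijing 100876, China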
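- About the authors:
  ZHANG Hu, male, born in January 1987 in Jinan, Shandong Province. He is an associate researcher at the Shandong Computer Science Center (National Supercomputer Center in Jinan). His research interests include large model inference optimization and the computing power Internet. E-mail: zhanghu@sdas.org
  SUN Ming-hui, male, born in June 1997 in Tai'an, Shandong Province. He is an engineer at the Shandong Computer Science Center (National Supercomputer Center in Jinan). His research interest is computing resource scheduling. E-mail: sunminghui9999@163.com
  LIU Yang, male, born in June 1984 in Harbin, Heilongjiang Province. He is a professor at Beijing University of Posts and Telecommunications. His research interests include computing chips and networks. Chinese Institute of Electronics membership number: E190022238M. E-mail: liu.yang@bupt.edu.cn
  DAI Hong-jun, male, born in May 1981 in Tai'an, Shandong Province. He is a professor at Shandong University. His research interest is computer architecture. E-mail: dahogn@sdu.edu.cn
  WANG Ji-bin, male, born in May 1984 in Linyi, Shandong Province. He is a researcher at the Shandong Computer Science Center (National Supercomputer Center in Jinan). His research interest is virtual resource scheduling. E-mail: wangjb@sdas.org
  ZHANG You-li, female, born in April 2000 in Heze, Shandong Province. She is a master's student at Qilu University of Technology (Shandong Academy of Sciences). Her research interest is task scheduling algorithms in heterogeneous computing environments. E-mail: ylzhang0319@163.com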
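- Funding:
  National Key Research and Development Program of China (2024YFB2906605); Key Research and Development Program of Shandong Province (2024CXGC010113); Major Innovation Project of the Science-Education-Industry Integration Pilot Program, Qilu University of Technology (Shandong Academy of Sciences) (2024ZDZX08)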
- DOI: 10.12263/DZXB.20250468
  CLC number: TP391
- Received: 2025-06-01; Accepted: 2025-11-14; Print publication: 2025-11-25

ZHANG Hu, SUN Ming-hui, LIU Yang, et al. Resource-Efficient Configuration Method for Large Model Inference Loads Based on vGPU Performance Interference Awareness[J]. Acta Electronica Sinica, 2025, 53(11): 3836-3851. DOI：10.12263/DZXB.20250468

Abstract

The rapid evolution of artificial intelligence (AI) has propelled the large-scale application of open-source large language models (LLMs) across diverse scenarios. However, as the performance of an individual graphics processing unit (GPU) increases, GPU resources are often left idle when serving inference workloads for small- to medium-sized LLMs, leading to insufficient overall computing utilization. To improve GPU efficiency in data centers, spatial-temporal sharing or virtual GPU (vGPU) technologies are widely adopted for resource multiplexing. Notably, vGPU has emerged as the mainstream solution for providing GPU services to multi-tenant and multi-task environments, owing to its fine-grained resource partitioning and security isolation. Nevertheless, GPU resource sharing inevitably introduces performance interference among workloads, particularly given the dynamic and bursty resource demands of LLM inference. Neglecting such interference can significantly increase inference latency and even trigger service level objective (SLO) violations, thereby compromising the stability and user experience of LLM services. To address this challenge, this paper proposes an efficient resource provisioning method for LLM inference workloads based on vGPU performance interference awareness. First, we construct a multi-dimensional performance characterization dataset through large-scale concurrent inference experiments, covering LLMs of different parameter sizes, different workload co-location combinations, and different load intensities. On this basis, we establish a lightweight performance interference prediction model that incorporates inference model features, hardware specifications, and system monitoring metrics. This model estimates key performance indicators accurately while meeting the real-time requirements of resource provisioning decisions. Leveraging this prediction model, we further design a constraint-optimization-based economic resource allocation algorithm. Its objective function minimizes the amount of GPU resources allocated, and its constraints require that inference latency stay within the SLO threshold and that throughput meet business demand. The algorithm optimizes GPU resource allocation under these workload quality constraints by dynamically adjusting the vGPU partition ratio of each workload. We evaluate the proposed method in a mixed workload environment comprising six typical LLMs in two categories. The experiments are conducted on NVIDIA A100 and RTX6000 platforms with the HAMi vGPU solution and are benchmarked against traditional GPU provisioning strategies. Experimental results show that the proposed method reduces GPU resource cost by more than 20% compared with mainstream schemes while strictly adhering to SLO constraints. These findings validate the effectiveness and economic viability of the approach in LLM inference scenarios, providing technical support for data centers to improve GPU utilization, reduce AI service deployment costs, and facilitate the large-scale adoption of open-source LLMs.
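The constrained formulation summarized in the abstract (minimize the total GPU share, subject to latency within the SLO and throughput at or above demand) can be sketched roughly as follows. This is only an illustration under stated assumptions: the interference formulas, the `alpha` coefficient, and all function and field names are hypothetical stand-ins for the paper's learned prediction model, and exhaustive search over coarse share steps replaces whatever solver the authors actually use.

```python
from itertools import product

def predict_latency_ms(base_ms, share, neighbor_share, alpha=0.5):
    # Hypothetical toy model: latency grows as the share shrinks and as
    # co-located neighbors occupy more of the GPU.
    return base_ms / share * (1.0 + alpha * neighbor_share)

def predict_throughput_rps(base_rps, share, neighbor_share, alpha=0.5):
    # Hypothetical toy model: throughput scales with the share and is
    # degraded by neighbor interference.
    return base_rps * share / (1.0 + alpha * neighbor_share)

def cheapest_allocation(workloads, steps=10):
    """Return per-workload GPU shares with the smallest total that satisfy
    every latency and throughput constraint, or None if none exists."""
    best, best_total = None, None
    # Shares are enumerated as integer multiples of 1/steps.
    for combo in product(range(1, steps + 1), repeat=len(workloads)):
        total = sum(combo)
        if total > steps:  # cannot allocate more than one whole GPU
            continue
        if best_total is not None and total >= best_total:
            continue  # not cheaper than the best feasible plan so far
        feasible = True
        for units, w in zip(combo, workloads):
            share = units / steps
            neighbor = (total - units) / steps
            latency = predict_latency_ms(w["base_ms"], share, neighbor)
            throughput = predict_throughput_rps(w["base_rps"], share, neighbor)
            if latency > w["slo_ms"] or throughput < w["demand_rps"]:
                feasible = False
                break
        if feasible:
            best, best_total = [u / steps for u in combo], total
    return best

# Two co-located workloads with made-up profiles.
demo = [
    {"base_ms": 40, "base_rps": 50, "slo_ms": 200, "demand_rps": 10},
    {"base_ms": 80, "base_rps": 30, "slo_ms": 400, "demand_rps": 5},
]
plan = cheapest_allocation(demo)
print(plan)
```

In this toy setting each workload ends up with more than the share it would need in isolation, because the neighbor's presence inflates its predicted latency. That is the effect an interference-aware configuration has to account for.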


