Non-Uniform Memory Access Optimization on Multi-GPU Systems: Research Progress and Prospect

LI Chen; LIU Chang; GE Yi-xuan; GUO Yang

doi:10.12263/DZXB.20231198

SURVEYS AND REVIEWS | Updated: 2025-12-11

- ACTA ELECTRONICA SINICA Vol. 52, Issue 5, Pages: 1783-1800 (2024)
- Author affiliation: College of Computer Science and Technology, National University of Defense Technology, Changsha, Hunan 410073, China
- Funding: National Natural Science Foundation of China (62202478); NUDT Innovation Science Foundation (23-ZZCX-JDZ-12)
- DOI: 10.12263/DZXB.20231198
  CLC: TP393
- Received: 25 December 2023; Revised: 3 April 2024; Published: 25 May 2024
LI Chen, LIU Chang, GE Yi-xuan, et al. Non-Uniform Memory Access Optimization on Multi-GPU Systems: Research Progress and Prospect[J]. Acta Electronica Sinica, 2024, 52(05): 1783-1800. DOI: 10.12263/DZXB.20231198

Abstract

As transistor scaling slows, it has become increasingly difficult to improve the performance of a single GPU (Graphics Processing Unit). Multi-GPU systems have therefore become the main means of improving GPU system performance. However, due to the constraints of off-chip physical design, the bandwidth imbalance between processors in multi-GPU systems leads to non-uniform memory access (NUMA) problems, which seriously degrade the performance of multi-GPU systems. To reduce the performance loss caused by NUMA, this paper first analyzes its causes and compares existing solutions. For NUMA in different dimensions, it then summarizes optimization schemes along two directions: reducing remote access traffic and improving remote access performance. Finally, weighing the advantages and disadvantages of these schemes, it proposes future directions for NUMA optimization in multi-GPU systems.
