

College of Computer Science and Technology, National University of Defense Technology, Changsha, Hunan 410073, China
Received: 25 December 2023
Revised: 3 April 2024
Published: 25 May 2024
LI Chen, LIU Chang, GE Yi-xuan, et al. Non-Uniform Memory Access Optimization on Multi-GPU Systems: Research Progress and Prospect[J]. Acta Electronica Sinica, 2024, 52(05): 1783-1800. DOI:10.12263/DZXB.20231198
As transistor scaling slows, improving the performance of a single GPU (Graphics Processing Unit) has become increasingly difficult, so multi-GPU systems have become the main means of raising GPU system performance. However, owing to the constraints of off-chip physical design, the bandwidth imbalance between processors in a multi-GPU system gives rise to the non-uniform memory access (NUMA) problem, which seriously degrades multi-GPU performance. To reduce the performance loss caused by non-uniform memory access, this paper first analyzes why it arises and compares the existing solutions. For non-uniform memory access along different dimensions, the paper then summarizes the optimization schemes from two directions: reducing remote access traffic and improving remote access performance. Finally, drawing on the advantages and disadvantages of these schemes, it proposes future directions for non-uniform memory access optimization in multi-GPU systems.