[1] 巨涛,朱正东,董小社.异构众核系统及其编程模型与性能优化技术研究综述[J].电子学报,2015,43(1):111-119. JU Tao,ZHU Zheng-dong,DONG Xiao-she.The feature,programming model and performance optimization strategy of heterogeneous many-core system:a review[J].Acta Electronica Sinica,2015,43(1):111-119.(in Chinese)
[2] NVIDIA Corporation.CUDA C programming guide[OL].https://docs.nvidia.com/cuda/archive/8.0/,2018.
[3] 张珩,张立波,武延军.基于Multi-GPU平台的大规模图数据处理[J].计算机研究与发展,2018,55(2):273-288. ZHANG Heng,ZHANG Li-bo,WU Yan-jun.Large-scale graph processing on multi-GPU platforms[J].Journal of Computer Research and Development,2018,55(2):273-288.(in Chinese)
[4] ZHU Z,XU S,TANG J,et al.GraphVite:A high-performance CPU-GPU hybrid system for node embedding[A].Proceedings of The World Wide Web Conference[C].New York:ACM,2019.2494-2504.
[5] GONG L,ZHANG C,DUAN L,et al.Nonrigid image registration using spatially region-weighted correlation ratio and GPU-acceleration[J].IEEE Journal of Biomedical and Health Informatics,2018,23(2):766-778.
[6] CHAPUIS G,EIDENBENZ S,SANTHI N.GPU performance prediction through parallel discrete event simulation and common sense[A].Proceedings of the 9th EAI International Conference on Performance Evaluation Methodologies and Tools[C].Brussels:ICST (Institute for Computer Sciences,Social-Informatics and Telecommunications Engineering),2016.204-211.
[7] 冯晓,戴紫彬,蔡路亭,等.基于Amdahl定律扩展的多核处理器性能模型研究[J].电子学报,2017,45(6):1424-1430. FENG Xiao,DAI Zi-bin,CAI Lu-ting,et al.Performance model for multicore processor based on extended Amdahl’s law[J].Acta Electronica Sinica,2017,45(6):1424-1430.(in Chinese)
[8] 郑祯,翟季冬,李焱,等.基于CUPTI接口的典型GPU程序负载特征分析[J].计算机研究与发展,2016,53(6):1249-1262. ZHENG Zhen,ZHAI Ji-dong,LI Yan,et al.Workload analysis for typical GPU programs using CUPTI interface[J].Journal of Computer Research and Development,2016,53(6):1249-1262.(in Chinese)
[9] PARK I K,SINGHAL N,LEE M H,et al.Design and performance evaluation of image processing algorithms on GPUs[J].IEEE Transactions on Parallel and Distributed Systems,2010,22(1):91-104.
[10] CUI Z,LIANG Y,RUPNOW K,et al.An accurate GPU performance model for effective control flow divergence optimization[A].Proceedings of the 2012 IEEE 26th International Parallel and Distributed Processing Symposium[C].Piscataway:IEEE,2012.83-94.
[11] ZHOU K,TAN G,ZHANG X,et al.A performance analysis framework for exploiting GPU microarchitectural capability[A].Proceedings of the International Conference on Supercomputing[C].New York:ACM,2017.15.
[12] BALDINI I,FINK S J,ALTMAN E.Predicting GPU performance from CPU runs using machine learning[A].Proceedings of the 2014 IEEE 26th International Symposium on Computer Architecture and High Performance Computing[C].Piscataway:IEEE,2014.254-261.
[13] ARDALANI N,LESTOURGEON C,SANKARALINGAM K,et al.Cross-architecture performance prediction (XAPP) using CPU code to predict GPU performance[A].Proceedings of the 48th International Symposium on Microarchitecture[C].New York:ACM,2015.725-737.
[14] O’NEAL K,BRISK P,ABOUSAMRA A,et al.GPU performance estimation using software rasterization and machine learning[J].ACM Transactions on Embedded Computing Systems (TECS),2017,16(5s):148.
[15] LOUBOUTIN M,LANGE M,HERRMANN F J,et al.Performance prediction of finite-difference solvers for different computer architectures[J].Computers & Geosciences,2017,105:148-157.
[16] LYM S,LEE D,O’CONNOR M,et al.DeLTA:GPU performance model for deep learning applications with in-depth memory system traffic analysis[A].Proceedings of the 2019 IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS)[C].Piscataway:IEEE,2019.293-303.
[17] GU J,LIU H,ZHOU Y,et al.DeepProf:performance analysis for deep learning applications via mining GPU execution patterns[J].arXiv preprint,2017,arXiv:1707.03750.
[18] O’NEAL K,BRISK P.Predictive modeling for CPU,GPU,and FPGA performance and power consumption:a survey[A].Proceedings of the 2018 IEEE Computer Society Annual Symposium on VLSI (ISVLSI)[C].Piscataway:IEEE,2018.763-768.
[19] KONSTANTINIDIS E,COTRONIS Y.A quantitative roofline model for GPU kernel performance estimation using micro-benchmarks and hardware metric profiling[J].Journal of Parallel and Distributed Computing,2017,107:37-56.
[20] 黄品丰,赵荣彩,姚远,等.面向异构多核处理器的并行代价模型[J].计算机应用,2013,33(6):1544-1547. HUANG Pin-feng,ZHAO Rong-cai,YAO Yuan,et al.Parallel cost model for heterogeneous multi-core processors[J].Journal of Computer Applications,2013,33(6):1544-1547.(in Chinese)
[21] NVIDIA Corporation.CUDA occupancy calculator[OL].http://developer.download.nvidia.com/compute/cuda/CUDA_Occupancy_calculator.xls,2018.
[22] LUEBKE D.Introduction to CUDA parallel programming[OL].https://cn.udacity.com/course/intro-to-parallel-programming-cs344,2018.(in Chinese)
[23] NVIDIA Corporation.NVIDIA visual profiler[OL].https://developer.nvidia.com/nvidia-visual-profiler,2018.