

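1. School of Computer Science and Information Engineering, Hefei University of Technology, Hefei, Anhui 230009, China
2. School of Microelectronics, Hefei University of Technology, Hefei, Anhui 230009, China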
Received: 21 January 2022
Revised: 15 July 2022
Published: 25 March 2024
OUYANG Yi-ming, WANG Qi, TANG Fei-yang, et al. MRNDA: A Multicast Mechanism for Resource-Constrained NoC-Based Deep Neural Network Accelerators[J]. Acta Electronica Sinica, 2024, 52(03): 872-884. DOI: 10.12263/DZXB.20220106
Network-on-Chip (NoC) has been widely used in multiprocessor systems. In recent years, NoC-based deep neural network (DNN) accelerators have been proposed, which use an NoC to connect neural computing devices. Such designs greatly reduce the accelerator's off-chip memory accesses and thereby reduce its classification latency and power consumption. However, with a traditional unicast NoC, the large number of one-to-many packets significantly increases the communication latency of the accelerator. Moreover, current DNNs are often very large, whereas the number of cores in an NoC is limited. This paper therefore proposes MRNDA, a multicast mechanism for resource-constrained NoC-based DNN accelerators. The scheme uses a limited number of processor elements (PEs) to compute large DNN models and uses a special tree-based multicast acceleration network to reduce the communication latency of the accelerator. Simulation results show that, compared with the baseline, the proposed multicast mechanism reduces the classification latency of the accelerator by up to 86.7% and the communication latency by up to 88.8%, while its router area and power consumption are only 9.5% and 10.3% of those of the baseline router, respectively.
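To illustrate why one-to-many traffic is costly on a unicast NoC, the following is a minimal sketch, not the MRNDA mechanism itself. It assumes a 2D mesh with dimension-ordered (XY) routing and compares the total number of link traversals when one source sends a separate unicast packet to each destination against a tree-style multicast in which each shared link is traversed only once. The function names, the 4x4 mesh size, and the use of link traversals as a proxy for traffic are all assumptions made for illustration; they do not come from the paper.

```python
from typing import List, Set, Tuple

Node = Tuple[int, int]    # (x, y) coordinate of a router in the mesh
Link = Tuple[Node, Node]  # directed link between two adjacent routers


def xy_path(src: Node, dst: Node) -> List[Link]:
    """Links traversed by dimension-ordered routing (X first, then Y)."""
    links: List[Link] = []
    x, y = src
    while x != dst[0]:
        next_x = x + (1 if dst[0] > x else -1)
        links.append(((x, y), (next_x, y)))
        x = next_x
    while y != dst[1]:
        next_y = y + (1 if dst[1] > y else -1)
        links.append(((x, y), (x, next_y)))
        y = next_y
    return links


def unicast_link_traversals(src: Node, dsts: List[Node]) -> int:
    """One packet per destination: every path is paid for in full."""
    return sum(len(xy_path(src, d)) for d in dsts)


def tree_multicast_link_traversals(src: Node, dsts: List[Node]) -> int:
    """Hypothetical tree multicast: the union of the XY paths forms a tree,
    and each link in that tree carries the packet exactly once."""
    used: Set[Link] = set()
    for d in dsts:
        used.update(xy_path(src, d))
    return len(used)


# Illustrative case: router (0, 0) broadcasts to all other routers of a 4x4 mesh.
source = (0, 0)
destinations = [(x, y) for x in range(4) for y in range(4) if (x, y) != source]
print(unicast_link_traversals(source, destinations))         # 48
print(tree_multicast_link_traversals(source, destinations))  # 15
```

In this toy case the unicast scheme needs 48 link traversals while the shared tree needs 15, which gives a rough intuition for the traffic that multicast can remove. It says nothing about the latency figures reported above, which come from the authors' simulations.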