College of Computer, National University of Defense Technology, Changsha, Hunan 410073
Published online: 2016-02-25
Published in print: 2016
CHEN Hai-yan, YANG Chao, LIU Sheng, et al. An Efficient SIMD Parallel Memory Structure for Radix-2 FFT Computation[J]. Acta Electronica Sinica, 2016, 44(2): 241-246. DOI: 10.3969/j.issn.0372-2112.2016.02.001.
As SIMD (Single Instruction, Multiple Data) digital signal processors (DSPs) integrate more and more processing units on chip, the flexibility and bandwidth efficiency of parallel memory access increasingly determine the performance achieved in practice. This paper analyzes in detail the memory access problems that the parallel radix-2 FFT (Fast Fourier Transform) algorithm faces on a general SIMD DSP. It uses simple XOR logic on part of the address bits to perform SIMD parallel memory access address translation, which makes the parallel memory accesses of FFT computation conflict-free. It also proposes several vector memory access instructions with special shuffle modes, which completely eliminate the extra shuffle instructions that radix-2 FFT otherwise requires on a SIMD architecture. Finally, the structure is applied to optimize the vector memory (VM) of YHFT-Matrix2, a 16-way SIMD digital signal processor. Test results show that the optimized VM achieves fully pipelined, conflict-free parallel memory access for FFT computation and 100% utilization of parallel memory access bandwidth, at the cost of an 18% increase in hardware (area) overhead. Compared with the design before optimization, FFTs of different sizes achieve speedups of 1.32 to 2.66.
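The abstract does not give the exact address translation used in YHFT-Matrix2, so the following is only a minimal sketch of the general idea behind XOR-based bank mapping. It assumes 16 memory banks (one per SIMD lane), word addressing, and a hypothetical mapping `xor_bank` that XORs successive 4-bit address fields; the function names and the strided access pattern are illustrative choices, not taken from the paper. Power-of-two strides are used because radix-2 FFT butterflies pair elements at power-of-two distances.

```python
"""Illustrative XOR-based bank mapping for a 16-bank parallel memory.

Assumption: this is NOT the paper's mapping, only a generic example of the
technique the abstract names (XOR logic on address bits).
"""

NUM_BANKS = 16             # matches the 16-way SIMD width in the abstract
BANK_BITS = 4              # log2(NUM_BANKS)
BANK_MASK = NUM_BANKS - 1


def linear_bank(addr):
    """Plain low-order interleaving: bank = addr mod 16."""
    return addr & BANK_MASK


def xor_bank(addr):
    """Hypothetical XOR mapping: XOR together successive 4-bit address fields."""
    bank = 0
    while addr:
        bank ^= addr & BANK_MASK
        addr >>= BANK_BITS
    return bank


def strided_addresses(base, stride):
    """Word addresses touched by one 16-lane access with a fixed stride."""
    return [base + lane * stride for lane in range(NUM_BANKS)]


def is_conflict_free(addresses, bank_fn):
    """True when every lane lands in a different bank."""
    return len({bank_fn(a) for a in addresses}) == len(addresses)


if __name__ == "__main__":
    # Compare both mappings for power-of-two strides 1, 2, 4, ..., 256.
    for log2_stride in range(9):
        stride = 1 << log2_stride
        addrs = strided_addresses(0, stride)
        print(stride,
              is_conflict_free(addrs, linear_bank),
              is_conflict_free(addrs, xor_bank))
```

Under these assumptions, every power-of-two stride from 1 to 256 starting at address 0 maps its 16 lanes to 16 distinct banks with `xor_bank`, while `linear_bank` is conflict-free only for stride 1. A real design would typically XOR only a fixed subset of address bits, which keeps the translation logic small.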