PAPERS
BIE Meng-ni, LI Wei, FU Qiu-xing, CHEN Tao, DU Yi-ran, NAN Long-mei
During the rapid evolution of post-quantum cryptography, considering the needs for flexibility and efficiency, we proposed a parallel reconfigurable sampling accelerator for various lattice-based post-quantum cryptographic algorithms. We analyzed seven sampling processes involved in lattice-based post-quantum cryptography and proposed seven efficient parallel implementation models for these samplings, respectively, based on mathematical derivations. Then we extracted four common operational logics from these models. Using these four common operational logics as the core, we introduced data rearrangement to limit the effective bit width of operation data, which improved the acceptance rate of rejection sampling and eliminates the complex modular reduction operations in finite field operations. Then we proposed a high energy-efficient reconfigurable parallel sampling algorithm. To enhance the hardware implementation efficiency of the sampling algorithm, we adopted the butterfly transform network to complete the parallel splitting, merging, and lookup of data with any effective bit width within a single clock cycle, efficiently realizing the parallelization of the algorithm’s pre- and post-processing, and constructed a parameterized parallel reconfigurable sampling accelerator architecture model. Aiming for high energy efficiency, combined with logic synthesis experimental results, we determined the optimal parallel degree parameters of the architecture model and proposed a parallel reconfigurable sampling accelerator with a data bandwidth of 1 024 bits. Experimental results showed that, using a 40 nm CMOS process library, and performing post-simulation under the ss, 125 ℃ process corner conditions, the circuit's highest operating frequency can reach 667 MHz, with an average power consumption of 0.54W. Completing a 256-point uniform sampling requires 6 ns, completing a 256-point rejection sampling with a rejection value less than 216 on average only takes 22.5 ns, completing a 256-point binary sampling within 8 bits requires 18 ns, completing a 509-point simple ternary sampling requires 36 ns, completing a 701-point non-negative correlated ternary sampling requires 124.5 ns, completing a 509-point fixed-weight ternary sampling requires 11.18 , and completing a discrete Gaussian sampling in the Falcon algorithm once requires 3 ns. Compared with existing research, the sampler proposed in we reduce the energy consumption value for a uniform-rejection sampling by about 30.23%, and the energy consumption value for a binary sampling by about 31.6%.