HUANG Guang-yuan, HUANG Rong, ZHOU Shu-bo, JIANG Xue-qin
Available online: 2025-03-04
The attention mechanism and its variants have been widely applied to image inpainting. They divide a corrupted image into complete and missing regions, and capture long-range contextual information only within the complete regions to fill in the missing regions. As the area of the missing regions grows, the features available in the complete regions shrink, which limits the performance of attention mechanisms and leads to suboptimal inpainting results. To extend the context range of the attention mechanism, we employ a vector-quantized codebook to learn visual atoms. These visual atoms, which describe the structure and texture of image patches, constitute external features for image inpainting and thus complement the internal features of the image. On this basis, we propose a dual-stream attention image inpainting method that interacts and fuses internal and external features. Building on the internal and external information sources, we design an internal mask attention module and an internal-external cross attention module. Together, these two modules form a dual-stream attention that enables interaction among internal features and between internal and external features, generating internal- and external-source inpainting features. The internal mask attention uses a mask to shield against interference from missing-region features; it captures contextual information exclusively within the complete regions, thereby generating internal-source inpainting features. The internal-external cross attention relates internal features to the external features composed of visual atoms by computing their similarity, thereby generating external-source inpainting features. In addition, we design a controllable feature fusion module that generates spatial weight maps from the correlation between the internal- and external-source inpainting features; these weight maps fuse the two streams by element-wise weighting. Extensive experiments on the Places2, FFHQ, and Paris StreetView datasets demonstrate that the proposed method achieves average improvements of 3.45%, 1.34%, 13.91%, 13.64%, and 16.92% in PSNR, SSIM, L1, LPIPS, and FID, respectively, over state-of-the-art methods. Visualization results further demonstrate that both the internal features and the external features composed of visual atoms are beneficial for repairing corrupted images.
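To make the first step concrete, below is a minimal PyTorch-style sketch of a vector-quantized codebook whose entries play the role of visual atoms. The class name VQAtomCodebook, the dimensions, and the standard VQ-VAE-style straight-through lookup are our assumptions for illustration; the paper's actual codebook design may differ.

```python
import torch
import torch.nn as nn

class VQAtomCodebook(nn.Module):
    """Vector-quantized codebook whose entries serve as 'visual atoms'.

    Minimal sketch: nearest-neighbor lookup with a straight-through
    estimator, as in standard VQ-VAE. Names and sizes are illustrative.
    """
    def __init__(self, num_atoms: int = 512, dim: int = 256):
        super().__init__()
        self.atoms = nn.Embedding(num_atoms, dim)

    def forward(self, z: torch.Tensor):
        # z: (B, N, C) patch features; quantize each token to its nearest atom
        d = torch.cdist(z, self.atoms.weight[None])   # (B, N, num_atoms)
        idx = d.argmin(dim=-1)                        # nearest-atom indices
        z_q = self.atoms(idx)                         # quantized (atom) features
        # Straight-through estimator so gradients still reach the encoder
        return z + (z_q - z).detach(), idx
```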
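The two attention streams can be sketched as follows, again under assumed shapes and projections rather than the paper's exact design: the internal mask attention masks out missing-region keys so context is gathered only from complete regions, while the internal-external cross attention takes queries from internal features and keys/values from the visual atoms.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class InternalMaskAttention(nn.Module):
    """Self-attention restricted to complete regions via a binary validity mask."""
    def __init__(self, dim: int):
        super().__init__()
        self.q, self.k, self.v = nn.Linear(dim, dim), nn.Linear(dim, dim), nn.Linear(dim, dim)
        self.scale = dim ** -0.5

    def forward(self, x: torch.Tensor, valid_mask: torch.Tensor):
        # x: (B, N, C) flattened tokens; valid_mask: (B, N), 1 = complete region
        # (assumes each image has at least one valid token)
        attn = (self.q(x) @ self.k(x).transpose(-2, -1)) * self.scale      # (B, N, N)
        # Shield missing-region keys so context comes only from complete regions
        attn = attn.masked_fill(valid_mask[:, None, :] == 0, float('-inf'))
        return F.softmax(attn, dim=-1) @ self.v(x)   # internal-source inpainting features

class InternalExternalCrossAttention(nn.Module):
    """Cross-attention: queries from internal features, keys/values from visual atoms."""
    def __init__(self, dim: int, num_atoms: int = 512):
        super().__init__()
        self.atoms = nn.Parameter(torch.randn(num_atoms, dim))  # learned visual atoms
        self.q, self.k, self.v = nn.Linear(dim, dim), nn.Linear(dim, dim), nn.Linear(dim, dim)
        self.scale = dim ** -0.5

    def forward(self, x: torch.Tensor):
        # Similarity between internal tokens (B, N, C) and external atoms (M, C)
        attn = (self.q(x) @ self.k(self.atoms).transpose(-2, -1)) * self.scale  # (B, N, M)
        return F.softmax(attn, dim=-1) @ self.v(self.atoms)  # external-source inpainting features
```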
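Finally, a sketch of the controllable feature fusion: a per-pixel weight map is predicted from the two streams and their channel-wise correlation, then used to blend them element-wise. The 1x1-convolution gate and cosine-similarity correlation are our guesses at "controllable" fusion, not the paper's confirmed implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ControllableFeatureFusion(nn.Module):
    """Fuses internal- and external-source features via a spatial weight map."""
    def __init__(self, dim: int):
        super().__init__()
        # Input: both streams plus their 1-channel correlation map
        self.gate = nn.Conv2d(2 * dim + 1, 1, kernel_size=1)

    def forward(self, f_int: torch.Tensor, f_ext: torch.Tensor):
        # f_int, f_ext: (B, C, H, W) internal- and external-source inpainting features
        # Per-pixel correlation between the streams (cosine similarity over channels)
        corr = F.cosine_similarity(f_int, f_ext, dim=1).unsqueeze(1)       # (B, 1, H, W)
        w = torch.sigmoid(self.gate(torch.cat([f_int, f_ext, corr], dim=1)))
        return w * f_int + (1.0 - w) * f_ext   # element-wise weighted fusion
```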