To better reveal the latent semantic correlation between image and text,
this paper proposes a cross-media retrieval method that fuses two levels of similarity,
which constructs two subnets to handle global and local features respectively, so as to obtain better semantic matching between image and text. The image representation is divided into the whole image and a set of image regions,
and the text representation is likewise divided into the whole sentence and its words. A two-level alignment method is designed to match the global and local representations of image and text,
and the two similarities are fused to learn the complete cross-media representation. Experimental results on the MSCOCO and Flickr30K datasets show that the proposed method matches image and text semantics more accurately
and outperforms many state-of-the-art cross-media retrieval methods.
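
As an illustration of the two-level idea only, the following minimal sketch computes a global similarity between a whole-image vector and a whole-sentence vector, a local similarity between region vectors and word vectors, and a weighted fusion of the two. The abstract does not specify the subnets, the alignment rule, or the fusion rule, so everything here is an assumption: cosine similarity as the matching score, max-over-regions averaging as the local alignment, the weight `alpha`, and all function names are hypothetical and are not the paper's implementation.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def global_similarity(image_vec, sentence_vec):
    """Global level: whole image against whole sentence."""
    return cosine(image_vec, sentence_vec)


def local_similarity(region_vecs, word_vecs):
    """Local level: image regions against words.

    Assumed alignment rule (not taken from the paper): each word is
    matched to its most similar region, and the scores are averaged.
    """
    if not region_vecs or not word_vecs:
        return 0.0
    best = [max(cosine(w, r) for r in region_vecs) for w in word_vecs]
    return sum(best) / len(best)


def fused_similarity(image_vec, region_vecs, sentence_vec, word_vecs, alpha=0.5):
    """Weighted sum of the two levels; alpha is a hypothetical fusion weight."""
    g = global_similarity(image_vec, sentence_vec)
    loc = local_similarity(region_vecs, word_vecs)
    return alpha * g + (1.0 - alpha) * loc


# Toy vectors standing in for learned features.
image_vec = [1.0, 0.0]
sentence_vec = [0.0, 1.0]
region_vecs = [[1.0, 0.0], [0.0, 1.0]]
word_vecs = [[0.0, 2.0], [3.0, 0.0]]

score = fused_similarity(image_vec, region_vecs, sentence_vec, word_vecs, alpha=0.25)
```

In this toy case the global vectors are orthogonal while every word has a perfectly matching region, so the fused score is carried entirely by the local term. This is the kind of case where relying on a single level of similarity would miss the match.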