Abstract
Due to the high similarity of fine-grained image subclasses, small inter-class changes and large intra-class changes are caused, which leads to the difficulty of fine-grained image classification task. However, existing convolutional neural networks have been unable to effectively solve this problem. Aiming at the above-mentioned fine-grained image classification problem, this paper proposes a multi-scale and multi-level ViT model. First, through data augmentation techniques, the accuracy of fine-grained image classification can be effectively improved. Secondly, the small-scale input and large-scale input of the model make the input image have more feature ex-pressions. The subsequent multi-layeredness effectively utilizes the results of the previous layer of ViT, so that the data of the previous layer can be more effectively used in the next layer of ViT. Finally, cross-attention allows the results of two scale inputs to be fused in a reasonable way. The proposed model is competitive with current mainstream state-of-the-art methods on multiple datasets.