Preprint Article, Version 1. Preserved in Portico. This version is not peer-reviewed.

Optimizing Mobile Vision Transformers for Land Cover Classification

Version 1 : Received: 2 October 2023 / Approved: 3 October 2023 / Online: 3 October 2023 (10:39:42 CEST)

A peer-reviewed article of this Preprint also exists.

Rozario, P., Gadgil, R., Gomes, R., Lee, J., Keller, P., Sipos, G., ... & Rudolph, J. (2023). Optimizing Mobile Vision Transformers for Land Cover Classification.

Abstract

Classifying images that contain various land-cover classes is essential in Remote Sensing and Geographic Information Systems (GIS) for efficient and sustainable land-use estimation, and it supports other tasks such as object detection, localization, and segmentation. Deep Learning (DL) techniques have shown tremendous potential in the GIS domain. While Convolutional Neural Networks (CNNs) have dominated most of the image analysis domain, a newer architecture, the transformer, has proved to be a unifying solution for several AI-based processing pipelines. Vision Transformers (ViTs), a variant of transformers, can achieve accuracy comparable to, and in some cases better than, that of a CNN. However, they suffer from a significant drawback: a very large number of trainable parameters. In this research, we explore several modifications to vision transformer architectures, especially MobileViT, that reduce the parameter count while boosting accuracy. To verify our proposed approach, these new architectures are trained on four land-cover datasets: AID, EuroSAT, UC-Merced, and WHU-RS19. Experiments reveal that a combination of lightweight convolutional layers, including ShuffleNet, together with depthwise separable convolutions and average pooling can reduce the trainable parameters by 17.85% and yet achieve higher accuracy than the base MobileViT. It is also observed that combining convolution layers with multi-headed self-attention layers in MobileViT variants captures local and global features better than the standalone ViT architecture, which utilizes almost 95% more parameters than the proposed MobileViT variant.
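To illustrate why depthwise separable convolutions lower the trainable parameter count, the following minimal sketch compares the weight count of a standard convolution with that of its depthwise separable counterpart. The channel sizes (64 in, 128 out) and the 3x3 kernel are hypothetical values chosen for illustration and are not taken from the paper; the helper names are likewise assumptions. The per-layer saving it computes is not the 17.85% whole-model figure reported above, which depends on the full architecture.

```python
def conv_params(c_in, c_out, k, bias=False):
    """Weight count of a standard k x k convolution."""
    return k * k * c_in * c_out + (c_out if bias else 0)


def depthwise_separable_params(c_in, c_out, k, bias=False):
    """Weight count of a k x k depthwise convolution followed by a 1 x 1 pointwise one."""
    depthwise = k * k * c_in + (c_in if bias else 0)   # one k x k filter per input channel
    pointwise = c_in * c_out + (c_out if bias else 0)  # 1 x 1 convolution mixes channels
    return depthwise + pointwise


def percent_reduction(baseline, reduced):
    """Relative saving of `reduced` with respect to `baseline`, in percent."""
    return 100.0 * (baseline - reduced) / baseline


# Hypothetical layer shape, chosen only for illustration.
standard = conv_params(64, 128, 3)                  # 3*3*64*128 = 73728
separable = depthwise_separable_params(64, 128, 3)  # 3*3*64 + 64*128 = 8768
reduction = percent_reduction(standard, separable)

print(f"standard: {standard}, separable: {separable}, saving: {reduction:.2f}%")
```

The saving for a single layer is large because the depthwise step applies one spatial filter per channel and leaves channel mixing to the inexpensive 1x1 pointwise step. Across a full network the overall reduction is smaller, since only some layers are replaced.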

Keywords

vision transformers; MobileViT; ShuffleNet; CNN; land cover classification

Subject

Environmental and Earth Sciences, Remote Sensing

Copyright: This is an open access article distributed under the Creative Commons Attribution License which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
