remote sensing; scene classification; transductive inference; vision-language models; zero-shot; Software; Signal Processing; Electrical and Electronic Engineering
Abstract :
[en] Vision-Language Models for remote sensing have shown great promise thanks to their extensive pretraining. However, their conventional use in zero-shot scene classification still involves dividing large images into patches and making independent predictions, i.e., inductive inference, thereby limiting their effectiveness by ignoring valuable contextual information. Our approach tackles this issue by utilizing initial predictions based on text prompting and patch affinity relationships from the image encoder to enhance zero-shot capabilities through transductive inference, all without the need for supervision and at a minor computational cost. Experiments on 10 remote sensing datasets with state-of-the-art Vision-Language Models demonstrate significant accuracy improvements over inductive zero-shot classification. Our source code is publicly available on GitHub: https://github.com/elkhouryk/RS-TransCLIP.
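For illustration only, the minimal sketch below outlines the general idea described in the abstract: per-patch zero-shot predictions obtained from class text prompts are refined by propagating them over a patch affinity matrix built from the image-encoder embeddings. It is not the authors' RS-TransCLIP implementation (see the GitHub repository for that); the embeddings are random stand-ins for encoder outputs, and the function name, mixing weight alpha, iteration count, and temperature tau are illustrative assumptions.

import torch

def transductive_zero_shot(patch_feats, text_feats, alpha=0.5, iters=10, tau=0.01):
    """Illustrative transductive refinement of zero-shot predictions.

    patch_feats: (N, D) L2-normalized image-encoder embeddings, one per patch.
    text_feats:  (C, D) L2-normalized text embeddings, one per class prompt.
    Returns an (N, C) tensor of refined class probabilities.
    """
    # Inductive zero-shot step: cosine similarity between each patch and each class prompt.
    logits = patch_feats @ text_feats.t() / tau            # (N, C)
    probs = logits.softmax(dim=-1)                         # initial per-patch predictions

    # Patch affinity from the image encoder: similar patches should receive similar labels.
    affinity = (patch_feats @ patch_feats.t()).clamp(min=0)   # (N, N), non-negative
    affinity.fill_diagonal_(0)
    affinity = affinity / affinity.sum(dim=-1, keepdim=True).clamp(min=1e-8)  # row-stochastic

    # Simple label-propagation-style smoothing as a stand-in for transductive inference.
    refined = probs.clone()
    for _ in range(iters):
        refined = alpha * (affinity @ refined) + (1 - alpha) * probs
    return refined

if __name__ == "__main__":
    # Toy usage with random embeddings standing in for CLIP-like encoder outputs.
    N, C, D = 256, 10, 512
    patches = torch.nn.functional.normalize(torch.randn(N, D), dim=-1)
    prompts = torch.nn.functional.normalize(torch.randn(C, D), dim=-1)
    preds = transductive_zero_shot(patches, prompts).argmax(dim=-1)
    print(preds.shape)  # torch.Size([256])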
Disciplines :
Computer science
Author, co-author :
El Khoury, Karim; UCLouvain, Belgium
Zanella, Maxime ; Université de Mons - UMONS > Faculté Polytechnique > Service Informatique, Logiciel et Intelligence artificielle ; UCLouvain, Belgium
Gérin, Benoît; UCLouvain, Belgium
Godelaine, Tiffanie; UCLouvain, Belgium
Macq, Benoît; UCLouvain, Belgium
Mahmoudi, Saïd ; Université de Mons - UMONS > Faculté Polytechnique > Service Informatique, Logiciel et Intelligence artificielle
De Vleeschouwer, Christophe; UCLouvain, Belgium
Ayed, Ismail Ben; ÉTS Montreal, Canada
Language :
English
Title :
Enhancing Remote Sensing Vision-Language Models for Zero-Shot Scene Classification
Publication date :
01 January 2025
Event name :
ICASSP 2025 - 2025 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
Event place :
Hyderabad, India
Event date :
06-04-2025 to 11-04-2025
Audience :
International
Main work title :
2025 IEEE International Conference on Acoustics, Speech, and Signal Processing, ICASSP 2025 - Proceedings
Editor :
Rao, Bhaskar D
Publisher :
Institute of Electrical and Electronics Engineers Inc.
Research unit :
F114 - Informatique, Logiciel et Intelligence artificielle
Research institute :
R300 - Institut de Recherche en Technologies de l'Information et Sciences de l'Informatique
Funders :
IEEE Signal Processing Society
Funding text :
M.Z. and B.G. are funded by the Walloon Region under grant No. 2010235 (ARIAC by DIGITALWALLONIA4.AI). T.G. is funded by MedReSyst, part of the Walloon Region and EU-Wallonie 2021-2027 program.