Transformer models have demonstrated remarkable, emergent capabilities in the natural language processing domain, and they appear to be bounded only by the availability of large training datasets. Such datasets can be obtained tractably because natural language models are pre-trained with self-supervision in the form of token masking. He et al. and Cao et al. recently demonstrated the power of this masking technique in the vision domain, showing that masked autoencoders (MAEs) are scalable self-supervised learners for vision transformer models. Feichtenhofer et al. extended these techniques to video, showing that masked autoencoders are scalable spatiotemporal learners as well. To the best of our knowledge, these techniques have been evaluated only on ground-level, object-centric imagery and video. Extending them to remote or overhead imagery presents two significant challenges. First, objects of interest are small relative to the typical mask patch size. Second, the frames are not object-centered. In this study, we explore whether modern self-supervised pre-training techniques such as masked autoencoding extend well to overhead wide area motion imagery (WAMI) data. We argue that MAE-style pre-training is well suited to WAMI data given the typical object size in this domain as well as the ability to leverage strong global spatial context. To this end, we conduct a comprehensive exploration of different patch sizes and masking ratios on the popular WAMI dataset, WPAFB 2009. We find that domain-specific adjustments to these pre-training techniques yield downstream performance improvements on computer vision tasks, including object detection.
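To make the two swept hyperparameters concrete, below is a minimal PyTorch sketch of MAE-style random patch masking, in the spirit of He et al.; the function name, tensor shapes, and default `mask_ratio` are illustrative assumptions, not the authors' exact implementation.

```python
import torch

def random_masking(patches: torch.Tensor, mask_ratio: float = 0.75):
    """MAE-style random masking of patch tokens (illustrative sketch).

    patches: (batch, num_patches, dim) tensor of patch embeddings.
    Returns the visible subset fed to the encoder, a binary mask
    (0 = kept, 1 = masked), and indices to restore patch order.
    """
    b, n, d = patches.shape
    n_keep = int(n * (1 - mask_ratio))

    # Per-sample random permutation; keep the first n_keep patches.
    noise = torch.rand(b, n, device=patches.device)
    ids_shuffle = torch.argsort(noise, dim=1)
    ids_restore = torch.argsort(ids_shuffle, dim=1)

    ids_keep = ids_shuffle[:, :n_keep]
    visible = torch.gather(patches, 1, ids_keep.unsqueeze(-1).expand(-1, -1, d))

    # Binary mask in the original patch ordering.
    mask = torch.ones(b, n, device=patches.device)
    mask[:, :n_keep] = 0
    mask = torch.gather(mask, 1, ids_restore)
    return visible, mask, ids_restore
```

For example, a 224x224 chip tokenized with 8x8 patches yields 784 tokens, so at a 0.75 masking ratio only 196 patches are visible to the encoder; shrinking the patch size relative to the small objects typical of WAMI is exactly the kind of domain-specific adjustment the study explores.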