Transformer models have demonstrated remarkable, emergent capabilities in the natural language processing domain, and they appear to be bounded only by the availability of large training datasets. Such datasets can be obtained tractably because natural language models are pre-trained with self-supervision in the form of token masking. He et al. and Cao et al. recently demonstrated the power of this masking technique in the vision domain, showing that masked autoencoders (MAEs) are scalable self-supervised learners for vision transformer models. Feichtenhofer et al. extended these techniques to video, showing that masked autoencoders are scalable spatiotemporal learners as well. To the best of our knowledge, these techniques have been evaluated only on ground-level, object-centric imagery and video. Extending them to remote or overhead imagery presents two significant challenges. First, objects of interest are small relative to the typical mask patch size. Second, the frames are not object-centered. In this study, we explore whether modern self-supervised pre-training techniques such as masked autoencoding extend well to overhead wide area motion imagery (WAMI) data. We argue that MAE-style pre-training is well suited to WAMI data given the typical object size in this domain as well as the ability to leverage strong global spatial context. To this end, we conduct a comprehensive exploration of different patch sizes and masking ratios on the popular WAMI dataset, WPAFB 2009. We find that domain-specific adjustments to these pre-training techniques yield downstream performance improvements on computer vision tasks, including object detection.
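To make the two swept hyperparameters concrete, below is a minimal PyTorch sketch of MAE-style random patch masking, in the spirit of He et al.; the function name, tensor shapes, and default `mask_ratio` are illustrative assumptions, not the authors' exact implementation.

```python
import torch

def random_masking(patches: torch.Tensor, mask_ratio: float = 0.75):
    """MAE-style random masking of patch tokens (illustrative sketch).

    patches: (batch, num_patches, dim) tensor of patch embeddings.
    Returns the visible subset fed to the encoder, a binary mask
    (0 = kept, 1 = masked), and indices to restore patch order.
    """
    b, n, d = patches.shape
    n_keep = int(n * (1 - mask_ratio))

    # Per-sample random permutation; keep the first n_keep patches.
    noise = torch.rand(b, n, device=patches.device)
    ids_shuffle = torch.argsort(noise, dim=1)
    ids_restore = torch.argsort(ids_shuffle, dim=1)

    ids_keep = ids_shuffle[:, :n_keep]
    visible = torch.gather(patches, 1, ids_keep.unsqueeze(-1).expand(-1, -1, d))

    # Binary mask in the original patch ordering.
    mask = torch.ones(b, n, device=patches.device)
    mask[:, :n_keep] = 0
    mask = torch.gather(mask, 1, ids_restore)
    return visible, mask, ids_restore
```

For example, a 224x224 chip tokenized with 8x8 patches yields 784 tokens, so at a 0.75 masking ratio only 196 patches are visible to the encoder; shrinking the patch size relative to the small objects typical of WAMI is exactly the kind of domain-specific adjustment the study explores.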