Aerial video recognition is challenging for several reasons. Prior work on action recognition is constrained by the unavailability of ground-truth object detection bounding boxes, which inhibits the application of localization models, and by computational budgets that preclude expensive space-time self-attention. Optical flow and pretrained models for detecting the human actor performing the action also perform poorly because of domain gap issues. Our contributions are as follows:
1. We present a frequency-domain space-time attention method that captures long-range space-time dependencies by emulating the weighted outer product in the frequency domain (a minimal sketch follows below).
2. We present a frequency-based object-background disentanglement method that inherently separates the moving human actor from the background.
3. We present a mathematical model for static salient regions and an identity loss function to learn the disentangled features in a differentiable manner.
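As a purely illustrative aid (not the authors' implementation), the sketch below shows how a weighted product in the frequency domain can mix information across all space-time tokens: by the convolution theorem, pointwise multiplication of spectra corresponds to a circular convolution over the token axis, so every output token receives a global space-time receptive field at far lower cost than quadratic self-attention. The function name `freq_spacetime_mixing`, the token layout, and the learned complex weight `w` are assumptions made for this example.

```python
import numpy as np

def freq_spacetime_mixing(x, w):
    """Hypothetical sketch of frequency-domain space-time token mixing.

    x : (n_tokens, d) real array of flattened space-time tokens.
    w : (n_tokens // 2 + 1, d) complex array of learned per-frequency weights
        (matching the rfft output length).

    Pointwise multiplication in the frequency domain corresponds to a
    circular convolution over the token axis (convolution theorem), so each
    output token aggregates information from all tokens -- a cheap stand-in
    for the quadratic outer-product attention described in the abstract.
    """
    X = np.fft.rfft(x, axis=0)                     # spectrum over the token axis
    Y = X * w                                      # weighted product in frequency domain
    return np.fft.irfft(Y, n=x.shape[0], axis=0)   # back to the token domain

# Toy usage: 8 space-time tokens with 4 channels each.
rng = np.random.default_rng(0)
tokens = rng.standard_normal((8, 4))
weights = rng.standard_normal((5, 4)) + 1j * rng.standard_normal((5, 4))
mixed = freq_spacetime_mixing(tokens, weights)
print(mixed.shape)  # (8, 4)
```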