Proceedings Article | 19 May 2016
KEYWORDS: Video surveillance, Video, Video processing, Data modeling, Visual process modeling, Multimedia, Feature extraction, RGB color model, Statistical modeling, Visualization
Event detection faces an emergent need to process large video collections rapidly; whether in surveillance footage or unconstrained web videos, automatically recognizing high-level, complex events remains a challenging task. Because existing methods are complex, computationally demanding, and often difficult to replicate, we designed a simple system that is fast, effective, and carries minimal memory and storage overhead. Our system is clearly described, modular, replicable on any desktop machine, and demonstrated with extensive experiments, backed by insightful analysis of different Convolutional Neural Networks (CNNs), both stand-alone and fused with others. With a large corpus of unconstrained, real-world video data, we examine the usefulness of different CNN models as feature extractors for modeling high-level events, i.e., pre-trained CNNs that differ in architecture, training data, and number of outputs. For each CNN, we extract features from frames sampled at 1 fps from all training exemplars and train a one-vs-rest SVM for each event. To represent videos, frame-level features were aggregated using a variety of techniques; the best was to max-pool within predetermined shot boundaries, then average-pool the shot descriptors to form the final video-level descriptor. This extensive analysis yielded several insights into using pre-trained CNNs as off-the-shelf feature extractors for event detection. Fusing the SVMs of different CNNs showed that some combinations are complementary. We conclude that no single CNN works best for all events, as some events are more object-driven while others are more scene-based; our top performance resulted from learning event-dependent weights for the different CNNs.
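To make the pipeline concrete, the sketches below illustrate its main steps under stated assumptions; all function and variable names are illustrative, not taken from the paper. The first step is sampling frames at 1 fps from each video. A minimal sketch using OpenCV (the paper does not name its decoding tool):

    import cv2

    def sample_frames(path, fps_out=1.0):
        """Sample frames from a video at roughly fps_out frames per second."""
        cap = cv2.VideoCapture(path)
        # Fall back to fps_out if the container reports no frame rate.
        native_fps = cap.get(cv2.CAP_PROP_FPS) or fps_out
        step = max(int(round(native_fps / fps_out)), 1)
        frames, idx = [], 0
        while True:
            ok, frame = cap.read()
            if not ok:
                break
            if idx % step == 0:  # keep every step-th decoded frame
                frames.append(frame)
            idx += 1
        cap.release()
        return frames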
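The best-performing aggregation described above (max-pool within shots, then average-pool across shots) can be sketched as follows, assuming per-frame CNN features are stacked in a NumPy array and shot boundaries are given as frame indices:

    import numpy as np

    def video_descriptor(frame_feats, shot_boundaries):
        """Max-pool frame features within each shot, then average-pool
        the shot descriptors into one video-level descriptor.

        frame_feats: (num_frames, dim) array of per-frame CNN features
        shot_boundaries: sorted frame indices where a new shot begins
        """
        shots = np.split(frame_feats, shot_boundaries)
        shot_desc = [s.max(axis=0) for s in shots if len(s) > 0]
        return np.mean(shot_desc, axis=0)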
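Training one one-vs-rest SVM per event on these descriptors might look like the sketch below, here using scikit-learn's LinearSVC; the regularization setting is an assumption, as the paper does not report its SVM hyperparameters:

    import numpy as np
    from sklearn.svm import LinearSVC

    def train_event_svms(X, video_events, event_list):
        """Train one one-vs-rest linear SVM per event.

        X: (num_videos, dim) video-level descriptors
        video_events: event label of each video, aligned with X
        """
        svms = {}
        for event in event_list:
            # Positive label for videos of this event, negative otherwise.
            y = np.array([1 if e == event else 0 for e in video_events])
            clf = LinearSVC(C=1.0)  # C=1.0 is a placeholder, not from the paper
            clf.fit(X, y)
            svms[event] = clf
        return svms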
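Finally, late fusion with event-dependent weights amounts to a weighted sum of the per-CNN SVM scores. The paper does not state how the weights are learned; one plausible choice, shown here purely as an assumption, is a grid search on held-out data maximizing average precision for each event:

    import numpy as np
    from sklearn.metrics import average_precision_score

    def learn_weight(scores_a, scores_b, labels):
        """Grid-search a mixing weight for two CNNs' decision scores on
        validation data (criterion and grid are assumptions)."""
        best_w, best_ap = 0.0, -1.0
        for w in np.linspace(0.0, 1.0, 11):
            ap = average_precision_score(labels, w * scores_a + (1 - w) * scores_b)
            if ap > best_ap:
                best_w, best_ap = w, ap
        return best_w

    def fused_score(scores_by_cnn, weights):
        """Combine one video's per-CNN scores for one event
        using that event's learned weights."""
        return sum(weights[c] * s for c, s in scores_by_cnn.items())

Because the weights are learned per event, an object-driven event can lean on an object-centric CNN while a scene-based event leans on a scene-centric one, matching the paper's conclusion that no single CNN works best for all events.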