Learning bag of spatio-temporal features for human interaction recognition

Khadidja Nour el houda Slimani; Yannick Benezeth; Feryel Souami

doi:10.1117/12.2559268

31 January 2020

Proceedings Volume 11433, Twelfth International Conference on Machine Vision (ICMV 2019); 1143302 (2020) https://doi.org/10.1117/12.2559268
Event: Twelfth International Conference on Machine Vision, 2019, Amsterdam, Netherlands

Abstract

The Bag of Visual Words (BoVW) model has achieved impressive performance on human activity recognition. However, because this method ignores the spatiotemporal distribution of visual words, it is extremely difficult to capture the high-level semantic meaning behind video features, which prevents localization of the interactions within a video. In this paper, we propose a supervised learning framework that automatically recognizes high-level human interactions based on a bag of spatiotemporal visual features. First, a representative baseline keyframe that captures the major body parts of the interacting persons is selected, and the bounding boxes containing the persons are extracted to parse the poses of all persons in the interaction. On this keyframe, features are detected for each interacting person by combining edge features with Maximally Stable Extremal Regions (MSER) features, and are then tracked backward and forward over the entire video sequence. From the feature tracks, 3D XYT spatiotemporal volumes are generated for each interacting target. The K-means algorithm is then used to build a codebook of visual features to represent a given interaction, and the interaction is represented by the sum of the occurrence frequencies of visual words between persons. Extensive experimental evaluations on the UT-Interaction dataset demonstrate the strength of our method in recognizing ongoing interactions from videos with a simple implementation.
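The codebook and histogram steps described in the abstract can be illustrated with a minimal, dependency-free sketch. It is not the authors' implementation: the function names (`kmeans`, `bovw_histogram`, `interaction_descriptor`), the toy 2D descriptors standing in for features drawn from the 3D XYT volumes, the normalization of each histogram, and the reading of "sum of the frequency occurrence of visual words between persons" as an element-wise sum of the two persons' histograms are all assumptions made for illustration.

```python
import random


def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def nearest(point, centroids):
    """Index of the centroid (visual word) closest to `point`."""
    return min(range(len(centroids)), key=lambda i: sq_dist(point, centroids[i]))


def kmeans(points, k, iters=20, seed=0):
    """Plain Lloyd's K-means; returns k centroids that act as the codebook."""
    rng = random.Random(seed)
    centroids = [list(p) for p in rng.sample(points, k)]
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            clusters[nearest(p, centroids)].append(p)
        for i, members in enumerate(clusters):
            if members:  # an empty cluster keeps its previous centroid
                centroids[i] = [sum(c) / len(members) for c in zip(*members)]
    return centroids


def bovw_histogram(descriptors, centroids):
    """Normalized frequency of each visual word among the descriptors."""
    counts = [0] * len(centroids)
    for d in descriptors:
        counts[nearest(d, centroids)] += 1
    total = len(descriptors)
    if total == 0:
        return [0.0] * len(centroids)
    return [c / total for c in counts]


def interaction_descriptor(desc_a, desc_b, centroids):
    """Assumed combination rule: element-wise sum of both persons' histograms."""
    hist_a = bovw_histogram(desc_a, centroids)
    hist_b = bovw_histogram(desc_b, centroids)
    return [a + b for a, b in zip(hist_a, hist_b)]


# Toy descriptors (hypothetical); real ones would come from the tracked features.
person_a = [(0.0, 0.1), (0.2, 0.0), (5.0, 5.1)]
person_b = [(5.2, 4.9), (4.8, 5.0), (0.1, 0.1)]
codebook = kmeans(person_a + person_b, k=2)
print(interaction_descriptor(person_a, person_b, codebook))
```

The resulting vector has one entry per visual word and could be passed to any supervised classifier; the abstract does not specify which classifier the authors use.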

Citation

Khadidja Nour el houda Slimani, Yannick Benezeth, and Feryel Souami "Learning bag of spatio-temporal features for human interaction recognition", Proc. SPIE 11433, Twelfth International Conference on Machine Vision (ICMV 2019), 1143302 (31 January 2020); https://doi.org/10.1117/12.2559268
