2013 IEEE International Conference on Computer Vision Workshops
Detection and recognition of collective human activities are important modules of any system devoted to high-level social behavior analysis. In this paper, we present a novel semantic-based spatio-temporal descriptor that can cope with several interacting people at different scales and with multiple activities in a video. Our descriptor is suitable for modelling human motion interaction in crowded environments, the scenario most difficult to analyse because of occlusions. In particular, we extend the Poselet detector approach by defining a descriptor based on Poselet activation patterns over time, named TPOS. We show that this descriptor can effectively tackle complex real scenarios, making it possible to detect humans in the scene, to localize human activities in space-time, and to perform collective group activity recognition in a joint manner, reaching state-of-the-art results.
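To make the idea of a descriptor built from Poselet activation patterns over time more concrete, the sketch below accumulates per-frame activation scores into a small temporal histogram. It is only an illustration under stated assumptions, not the TPOS definition from the paper: the function name `temporal_activation_descriptor`, the `(frame_index, poselet_id, score)` input format, the uniform temporal binning, and the L1 normalization are all hypothetical choices made for this example.

```python
def temporal_activation_descriptor(detections, num_poselets, num_frames, num_bins):
    """Hypothetical sketch: histogram of poselet activations over time.

    detections   -- iterable of (frame_index, poselet_id, score) tuples
                    (assumed format, not taken from the paper)
    num_poselets -- number of distinct poselet types
    num_frames   -- length of the video window in frames
    num_bins     -- number of uniform temporal bins

    Returns a flat list of length num_bins * num_poselets, L1-normalized.
    """
    if num_poselets <= 0 or num_frames <= 0 or num_bins <= 0:
        raise ValueError("num_poselets, num_frames and num_bins must be positive")

    # One row per temporal bin, one column per poselet type.
    hist = [[0.0] * num_poselets for _ in range(num_bins)]

    for frame, poselet_id, score in detections:
        # Skip activations that fall outside the window or the poselet vocabulary.
        if not (0 <= frame < num_frames and 0 <= poselet_id < num_poselets):
            continue
        # Map the frame index to a temporal bin.
        bin_index = min(frame * num_bins // num_frames, num_bins - 1)
        hist[bin_index][poselet_id] += score

    # Flatten bin by bin, then L1-normalize so windows of different
    # activation density remain comparable.
    flat = [value for row in hist for value in row]
    total = sum(flat)
    if total > 0:
        flat = [value / total for value in flat]
    return flat


example = temporal_activation_descriptor(
    [(0, 0, 1.0), (1, 1, 1.0), (3, 0, 2.0)],
    num_poselets=2, num_frames=4, num_bins=2,
)
print(example)
```

A vector of this kind could be fed to any standard classifier; the spatial layout of activations and the handling of multiple scales, which the abstract emphasizes, are deliberately left out of this sketch.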