Abstract:
Human action recognition (HAR) has emerged as a core research domain
for video understanding and analysis, thus attracting many researchers.
Although significant results have been achieved in simple scenarios, HAR
is still a challenging task due to issues associated with view independence,
occlusion and inter-class variation observed in realistic scenarios. In previous
research efforts, the classical Bag of Words (BoW) approach, along
with its variations, has been widely used. In this dissertation, we propose
a novel feature representation approach for representing actions in
complex and realistic scenarios. We also present an approach to handle
the inter- and intra-class variation challenge present in human action
recognition.
The primary focus of this research is to enhance the existing strengths of
the BoW approach, such as view independence, scale invariance and occlusion
handling. The proposed Bag of Expressions (BoE) includes an independent
pair of neighbors for building expressions; therefore, it is tolerant to
occlusion and capable of handling view independence to some extent
in realistic scenarios. We apply a class-specific visual words extraction
approach to establish a relationship between the extracted visual
words in both the space and time dimensions.
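
As an illustration only, the following Python sketch shows one way class-specific visual words could be extracted, assuming local spatio-temporal descriptors have already been computed for each training class; the function names and the words_per_class parameter are illustrative and are not taken from the dissertation.

import numpy as np
from sklearn.cluster import KMeans

def build_class_specific_vocabulary(descriptors_by_class, words_per_class=200, seed=0):
    """Cluster the local spatio-temporal descriptors of each action class
    separately, so every class contributes its own set of visual words.
    descriptors_by_class maps a class label to an (N_c x D) descriptor array.
    NOTE: words_per_class is an illustrative placeholder, not a value from
    the dissertation."""
    vocabulary = {}
    for label, descriptors in descriptors_by_class.items():
        kmeans = KMeans(n_clusters=words_per_class, n_init=10, random_state=seed)
        kmeans.fit(descriptors)
        vocabulary[label] = kmeans.cluster_centers_  # class-specific visual words
    return vocabulary

def assign_visual_words(descriptors, class_vocabulary):
    """Assign each descriptor to its nearest visual word within one class vocabulary."""
    distances = np.linalg.norm(descriptors[:, None, :] - class_vocabulary[None, :, :], axis=2)
    return np.argmin(distances, axis=1)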
To improve the classical BoW, we propose a Dynamic Spatio-Temporal Bag
of Expressions (D-STBoE) model for human action recognition that retains
the strengths of the classical bag of visual words approach. To handle
inter-class variation, we use a class-specific visual word representation
for generating visual expressions. Expressions are formed based on the
density of a spatiotemporal cube built around each visual word, since
constructing neighborhoods with a fixed number of neighbors would include
irrelevant information and make a visual expression less discriminative
in scenarios with occlusion and changing viewpoints. The proposed approach
thus makes our model more robust to the occlusion and changing viewpoint
challenges present in realistic scenarios. Comprehensive experiments on
publicly available datasets show that the proposed approach outperforms
existing state-of-the-art human action recognition approaches.
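
To make the idea of density-driven expression formation concrete, the minimal sketch below pairs each visual-word occurrence with whatever neighbors fall inside a spatio-temporal cube centred on it, rather than forcing a fixed number of neighbors; the cube radii, data layout and pair-based expression encoding are assumptions made for illustration and do not reflect the exact D-STBoE implementation.

import numpy as np

def form_expressions(points, labels, cube_radius=(20, 20, 10)):
    """Pair each visual-word occurrence with the neighbors that lie inside a
    spatio-temporal cube centred on it. points is an (N x 3) array of
    (x, y, t) locations and labels holds the visual-word index of each
    occurrence. Occurrences with no neighbor in the cube produce no
    expression, so the number of expressions follows the local density of
    visual words rather than a fixed neighbor count.
    NOTE: the cube radii are illustrative placeholders, not values from
    the dissertation."""
    rx, ry, rt = cube_radius
    expressions = []
    for i, (x, y, t) in enumerate(points):
        inside = (np.abs(points[:, 0] - x) <= rx) & \
                 (np.abs(points[:, 1] - y) <= ry) & \
                 (np.abs(points[:, 2] - t) <= rt)
        inside[i] = False  # exclude the centre occurrence itself
        for j in np.where(inside)[0]:
            # an expression here is an ordered pair of co-occurring visual words
            expressions.append((labels[i], labels[j]))
    return expressions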