PASTIC Dspace Repository

Human Action Recognition and Localization in Videos

dc.contributor.author Ullah, Javid
dc.date.accessioned 2019-09-27T07:19:35Z
dc.date.accessioned 2020-04-11T15:38:55Z
dc.date.available 2020-04-11T15:38:55Z
dc.date.issued 2019
dc.identifier.govdoc 17923
dc.identifier.uri http://142.54.178.187:9060/xmlui/handle/123456789/5217
dc.description.abstract Human action localization and recognition in videos is one of the most active and widely studied research areas in computer vision. This thesis addresses two main questions: when and where is an action performed in the video, and what type of action is performed? "When and where" localizes the action temporally and spatially in time-series visual data, while "what type" determines the action category/class. The output of action localization is a sub-volume containing the action of interest. Action localization is more challenging than action classification because it requires extracting a specific part of the spatio-temporal volume of the visual data. We address the problem of automatically extracting foreground objects in videos and then determining the category of action performed in the localized region. In the last decade, some proposed methods have addressed simultaneous recognition and localization of actions: action recognition answers the question "What type of action is performed in the video?", while action localization answers "Where in the video is it performed?". Such methods are termed action detection, or action localization and recognition. Human action recognition and localization is strongly motivated by a wide range of applications in various fields of computer vision, such as human perceptual segmentation, tracking humans in a video sequence, recovering body structure, medical diagnosis, monitoring human activities in security-sensitive areas such as airports, buildings (universities, hospitals, schools) and border crossings, and recognizing the daily activities of the elderly (related to elderly health issues). It is one of the hardest problems due to enormous variations in visual data: appearance of actors, motion patterns, changes in camera viewpoint, illumination variations, moving and cluttered backgrounds, occlusions of actors, intra- and inter-class variations, noise, moving cameras and the extensive amount of visual data available. Local-feature-based action recognition methods have been extensively studied in the last two decades, yet these systems have numerous limitations and remain far from real-time operation. Every phase of such a system matters for the next: the success and accuracy of local-feature-based methods depend on accurate encoding of the visual data, i.e. the feature extraction method, dimensionality reduction of the extracted features and their compact representation, localization of the action, and training of a learning model (classifier) to classify the action sequences. The main parts of the system are therefore: (1) feature extraction, (2) feature representation, (3) localization of the region of interest, and (4) classification of the action video. We first study, evaluate and compare well-known, prominent state-of-the-art approaches proposed for action recognition and localization; the existing methods proposed for action localization are typically complex and computationally expensive. We propose a novel saliency map computation based on local and global features that fills the gap between the two feature types and hence provides promising results very efficiently for salient object detection (an illustrative sketch of this fusion idea follows the abstract). The motion features are then fused with the detected salient object to extract the moving object in each frame of the sequence.
Object proposal algorithms normally use computationally expensive segmentation methods to extract different non-overlapping objects/regions in a frame. Our proposed methods instead exploit a very limited spatio-temporal neighborhood to extract a compact action region based on the compensated motion information (sketched below). Finally, a classifier is trained on the local features to recognize/label the action sequence. We evaluate two types of learning model: the extreme learning machine (ELM) and deep neural networks (DNNs). The ELM is fast, while computationally intensive classifiers such as DNNs produce comparatively better action recognition accuracy (an illustrative ELM sketch is also given below). The experimental evaluation reveals that our local-feature-based human action recognition and localization system improves on existing systems in several respects, such as computational complexity and performance. We conclude that the proposed algorithms obtain better or very similar action recognition and localization performance/accuracy compared to state-of-the-art approaches on realistic, unconstrained and challenging human action recognition and localization datasets such as KTH, MSR-II, JHMDB21 and UCF Sports. In addition, a number of segmentation datasets, such as MOViCs, I2R, SegTrack v1 & 2, ObMiC and Wallflowers, are used to evaluate the localization effectiveness of the proposed algorithms. Although the approaches proposed in this thesis obtain promising and impressive results compared to prominent state-of-the-art methods, further research and investigation are required to obtain enhanced or comparable results on the more challenging, realistic videos encountered in practice. Future directions are discussed in the conclusions and future work section of the thesis. en_US
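
The following is a minimal, illustrative sketch of the local/global/motion saliency fusion idea mentioned in the abstract. The specific cues chosen here (Laplacian local contrast, intensity-rarity as the global cue, plain frame differencing) and the equal-weight fusion are assumptions made for illustration only; they are not the thesis algorithm.

    import cv2
    import numpy as np

    def saliency_map(frame, prev_frame):
        # Hedged sketch, not the thesis method: fuse a local-contrast cue,
        # a global rarity cue and a motion cue into one saliency map.
        gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY).astype(np.float32)
        prev_gray = cv2.cvtColor(prev_frame, cv2.COLOR_BGR2GRAY).astype(np.float32)

        # Local cue: Laplacian magnitude approximates local contrast.
        local = np.abs(cv2.Laplacian(gray, cv2.CV_32F))

        # Global cue: pixels with rare intensity values over the whole
        # frame are treated as more salient.
        hist, _ = np.histogram(gray, bins=256, range=(0, 256), density=True)
        rarity = 1.0 - hist[np.clip(gray.astype(np.int32), 0, 255)]

        # Motion cue: absolute frame difference (assumes a static camera;
        # the thesis uses compensated motion instead).
        motion = np.abs(gray - prev_gray)

        def norm(m):
            return (m - m.min()) / (m.max() - m.min() + 1e-8)

        # Equal-weight fusion of the three normalized cues.
        return (norm(local) + norm(rarity) + norm(motion)) / 3.0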
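The compact action-region extraction can be illustrated in a similar spirit, under the simplifying assumption that per-frame binary foreground masks are already available from the saliency step. The sketch below just takes the tight spatio-temporal bounding box over a short window; it stands in for, and is not, the thesis's compensated-motion method.

    import numpy as np

    def action_subvolume(masks):
        # Hedged sketch: given binary foreground masks of shape (T, H, W)
        # for a short temporal window, return the tight spatio-temporal
        # bounding box (the action sub-volume) covering all foreground.
        masks = np.asarray(masks, dtype=bool)
        t, y, x = np.nonzero(masks)
        if t.size == 0:
            return None  # no foreground detected in the window
        return (t.min(), t.max(), y.min(), y.max(), x.min(), x.max())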
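Finally, the ELM classifier named in the abstract has a standard generic form: a fixed random hidden layer followed by a closed-form least-squares readout. The NumPy-only sketch below shows that generic form; the hyperparameters and sigmoid activation are assumptions, not the thesis implementation.

    import numpy as np

    class ELM:
        # Generic extreme learning machine: a fixed random projection
        # followed by a closed-form least-squares readout.
        def __init__(self, n_features, n_hidden, n_classes, seed=0):
            rng = np.random.default_rng(seed)
            self.W = rng.standard_normal((n_features, n_hidden))
            self.b = rng.standard_normal(n_hidden)
            self.n_classes = n_classes
            self.beta = None

        def _hidden(self, X):
            # Sigmoid activation of the fixed random projection.
            return 1.0 / (1.0 + np.exp(-(X @ self.W + self.b)))

        def fit(self, X, y):
            # One-hot targets T, then solve H @ beta = T by pseudoinverse.
            H = self._hidden(X)
            T = np.eye(self.n_classes)[y]
            self.beta = np.linalg.pinv(H) @ T
            return self

        def predict(self, X):
            return np.argmax(self._hidden(X) @ self.beta, axis=1)

This closed form is why the abstract can describe the ELM as fast: its only learned parameters are solved in a single pseudoinverse, whereas a DNN requires iterative gradient-based training.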
dc.description.sponsorship Higher Education Commission, Pakistan en_US
dc.language.iso en_US en_US
dc.publisher National University of Computer and Emerging Sciences, Islamabad en_US
dc.subject Computer Science en_US
dc.title Human Action Recognition and Localization in Videos en_US
dc.type Thesis en_US

