PASTIC Dspace Repository

SPEECH SEGREGATION FROM MULTI-STREAM AUDITORY ENVIRONMENTS


dc.contributor.author Khan, Muhammad Jamil
dc.date.accessioned 2018-02-16T04:35:06Z
dc.date.accessioned 2020-04-09T16:51:01Z
dc.date.available 2020-04-09T16:51:01Z
dc.date.issued 2016
dc.identifier.uri http://142.54.178.187:9060/xmlui/handle/123456789/3171
dc.description.abstract In audio source separation, the cocktail party problem is a typical example: segregating a particular signal from an audio mixture while filtering out the others. This problem has been investigated for decades. In daily life, a plethora of sources add their acoustic patterns to the environment, and segregating human speech from the resulting audio mixture is challenging. The challenge becomes harder in a monaural (single-microphone) setup, which essentially eliminates the spatial cues of the target source. Human listeners possess an incredible capability to segregate a specific sound source from a complex mixture; even a single ear suffices in a complex auditory scene. This process of separating out a target sound source is called auditory scene analysis (ASA). ASA has recently received profound interest from many audio researchers; however, emulating the same functionality on a computer is both imperative and challenging. A number of applications need an effective system with near human-like ability to segregate auditory signals, and many challenges in monaural speech segregation have yet to be handled by existing computational auditory scene analysis (CASA) systems. This research work presents a systematic, in-depth effort in evolving a CASA framework for monaural speech segregation. In the first stage, peripheral analysis is performed to model an ASA-inspired time-frequency representation called the cochleagram. In the second stage, the system extracts ASA cues such as the fundamental frequency (F0), spectral peaks, and onsets and offsets. The cochleagram is further mapped to eight discrete clusters. These clusters become the foundation for producing morphed cochleagram versions, which are processed one by one with a rough estimate of the fundamental frequency (F0) and the spectral peaks to iteratively stabilize and improve pitch estimation. The system classifies speech and non-speech interference based on the improved pitch estimate, the spectral peaks, and underlying ASA features such as harmonicity and onsets/offsets, and generates an ideal binary mask for target speech segregation. Finally, the target speech source is resynthesized from the masked time-frequency units of the cochleagram. Systematic evaluation shows that the proposed system for voiced speech segregation produces better results and is able to identify the majority of the time-frequency (T-F) units in the cochleagram needed for target speech separation. The proposed system produces significantly better results when compared with existing standard voiced speech segregation techniques. en_US
dc.description.sponsorship Higher Education Commission, Pakistan en_US
dc.language.iso en en_US
dc.publisher University of Engineering and Technology, Taxila, Pakistan en_US
dc.subject Applied Sciences en_US
dc.title SPEECH SEGREGATION FROM MULTI-STREAM AUDITORY ENVIRONMENTS en_US
dc.type Thesis en_US
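
Note on the segregation stages described in the abstract: the final steps (ideal binary mask generation over time-frequency units and resynthesis of the masked units) can be illustrated with a minimal sketch. The thesis builds its T-F representation from a gammatone-filterbank cochleagram and estimates the mask from pitch, spectral peaks, and ASA cues; the sketch below instead substitutes an STFT for the cochleagram and uses the textbook oracle ideal-binary-mask criterion (local target-to-interference energy ratio above a threshold), so it is an assumed simplification rather than the thesis's method. The function name ideal_binary_mask and the parameters nperseg and lc_db are illustrative, not taken from the thesis.

import numpy as np
from scipy.signal import stft, istft

def ideal_binary_mask(target, interference, fs, nperseg=512, lc_db=0.0):
    # Compute T-F representations of the target and the interference.
    # (The thesis uses a gammatone cochleagram; an STFT stands in here.)
    _, _, S_t = stft(target, fs, nperseg=nperseg)
    _, _, S_i = stft(interference, fs, nperseg=nperseg)

    # Keep a T-F unit when the local target-to-interference energy ratio
    # exceeds the local criterion lc_db (in dB): the ideal binary mask.
    snr_db = 10.0 * np.log10((np.abs(S_t) ** 2 + 1e-12) /
                             (np.abs(S_i) ** 2 + 1e-12))
    mask = (snr_db > lc_db).astype(float)

    # Apply the mask to the mixture's T-F units and resynthesize the
    # segregated target by inverting the masked representation.
    _, _, S_mix = stft(target + interference, fs, nperseg=nperseg)
    _, segregated = istft(S_mix * mask, fs, nperseg=nperseg)
    return mask, segregated

In the thesis's pipeline the mask is not computed from a known clean target; it is estimated from the improved pitch track, spectral peaks, and harmonicity and onset/offset cues. The oracle mask above only illustrates what the masking and resynthesis stages do with the T-F units once the mask is available.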

