PASTIC Dspace Repository

SPEECH SEGREGATION FROM MULTI-STREAM AUDITORY ENVIRONMENTS


dc.contributor.author Khan, Muhammad Jamil
dc.date.accessioned 2018-02-16T04:35:06Z
dc.date.accessioned 2020-04-09T16:51:01Z
dc.date.available 2020-04-09T16:51:01Z
dc.date.issued 2016
dc.identifier.uri http://142.54.178.187:9060/xmlui/handle/123456789/3171
dc.description.abstract In audio source separation, the cocktail party problem is a typical example: segregating a particular signal from an audio mixture while filtering out the others. This problem has been investigated for decades. In daily life, a plethora of sources add their acoustic patterns to the environment, and segregating human speech from the resulting audio mixture is challenging. The challenge becomes harder in a monaural (single-microphone) setup, which essentially eliminates the spatial cues of the target source. Human listeners possess an incredible capability to segregate a specific sound source from a complex mixture; even a single ear suffices in a complex auditory scene. This process of separating out a target sound source is called auditory scene analysis (ASA). ASA has recently received profound interest from many audio researchers; however, emulating the same functionality on a computer is both imperative and challenging. A number of applications need an effective system with near human-like ability to segregate auditory signals, and many challenges in monaural speech segregation have yet to be handled by existing computational auditory scene analysis (CASA) systems. This research work presents a systematic, in-depth effort in evolving a CASA framework for monaural speech segregation. In the first stage, peripheral analysis is performed to model an ASA-inspired time-frequency representation called the cochleagram. In the second stage, the system extracts ASA cues such as the fundamental frequency (F0), spectral peaks, and onsets and offsets. The cochleagram is further mapped to eight discrete clusters. These clusters become the foundation for producing morphed cochleagram versions, which are processed one by one with a rough estimate of the fundamental frequency (F0) and the spectral peaks to iteratively stabilize and improve pitch estimation. The system classifies speech and non-speech interference based on the improved pitch estimate, the spectral peaks, and underlying ASA features such as harmonicity and onsets/offsets, and generates an ideal binary mask for target speech segregation. Finally, the target speech source is resynthesized from the masked time-frequency units of the cochleagram. Systematic evaluation shows that the proposed system for voiced speech segregation produces better results and is able to identify the majority of the time-frequency (T-F) units in the cochleagram needed for target speech separation. The proposed system produces significantly better results when compared with existing standard voiced speech segregation techniques. en_US
dc.description.sponsorship Higher Education Commission, Pakistan en_US
dc.language.iso en en_US
dc.publisher University of Engineering and Technology, Taxila, Pakistan en_US
dc.subject Applied Sciences en_US
dc.title SPEECH SEGREGATION FROM MULTI-STREAM AUDITORY ENVIRONMENTS en_US
dc.type Thesis en_US
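
Note on the segregation stages described in the abstract: the final steps (ideal binary mask generation over time-frequency units and resynthesis of the masked units) can be illustrated with a minimal sketch. The thesis builds its T-F representation from a gammatone-filterbank cochleagram and estimates the mask from pitch, spectral peaks, and ASA cues; the sketch below instead substitutes an STFT for the cochleagram and uses the textbook oracle ideal-binary-mask criterion (local target-to-interference energy ratio above a threshold), so it is an assumed simplification rather than the thesis's method. The function name ideal_binary_mask and the parameters nperseg and lc_db are illustrative, not taken from the thesis.

import numpy as np
from scipy.signal import stft, istft

def ideal_binary_mask(target, interference, fs, nperseg=512, lc_db=0.0):
    # Compute T-F representations of the target and the interference.
    # (The thesis uses a gammatone cochleagram; an STFT stands in here.)
    _, _, S_t = stft(target, fs, nperseg=nperseg)
    _, _, S_i = stft(interference, fs, nperseg=nperseg)

    # Keep a T-F unit when the local target-to-interference energy ratio
    # exceeds the local criterion lc_db (in dB): the ideal binary mask.
    snr_db = 10.0 * np.log10((np.abs(S_t) ** 2 + 1e-12) /
                             (np.abs(S_i) ** 2 + 1e-12))
    mask = (snr_db > lc_db).astype(float)

    # Apply the mask to the mixture's T-F units and resynthesize the
    # segregated target by inverting the masked representation.
    _, _, S_mix = stft(target + interference, fs, nperseg=nperseg)
    _, segregated = istft(S_mix * mask, fs, nperseg=nperseg)
    return mask, segregated

In the thesis's pipeline the mask is not computed from a known clean target; it is estimated from the improved pitch track, spectral peaks, and harmonicity and onset/offset cues. The oracle mask above only illustrates what the masking and resynthesis stages do with the T-F units once the mask is available.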

