Automatic Segmentation of Speech

Baig, Mirza Muhammad Ali

URI: http://142.54.178.187:9060/xmlui/handle/123456789/5242

Date: 2018

Abstract:

Time-aligned, labeled speech at the sub-word level is required to develop spoken language technology components. Determining the time boundaries of sub-word units of speech and labeling them is the speech segmentation problem. Manual labeling by humans is considered the most accurate approach, but it requires a significant amount of time when a large amount of speech has to be processed. The evidence that human labelers rely on comes from knowledge of acoustic phonetics and, at the most basic level, from spectrogram-based techniques. The hypothesis that computers can also segment speech automatically if they use the same evidence as human experts leads toward time-effective automatic speech segmentation. In this thesis, unsupervised automatic time-alignment of speech at the sub-word level is carried out using the information that spectrograms carry. The speech spectrogram engineered in this thesis contains no information about vocal excitation and captures only the dynamics of the vocal tract. The resulting novel feature, which uses both forward and inverse characteristics of the vocal tract (FICV), is found suitable for the segmentation problem. In addition, a framework has been developed to evaluate the suitability of a feature extraction technique for the speech segmentation task. Speech segmentation is carried out on an indigenously developed Classical Arabic (CA) dataset, making this the first scheme of its kind for CA, which is an under-resourced language in speech technology. The FICV-based speech segmentation scheme is compared with standard unsupervised and supervised techniques and shown to be significantly better than both in terms of error rates and alignment accuracies. A reduction of 12.29% in error rates is achieved with the FICV-based feature compared with the standard unsupervised technique.
Supervised segmentation requires a basic sub-word-level recognizer that labels and aligns speech; for this purpose, a Hidden Markov Model (HMM) based speech recognizer is trained. Acoustic modeling is carried out using a discriminative technique, which yields recognition accuracies up to 4% better than the non-discriminative technique. The thesis also verifies that using manually labeled data to train the acoustic models can further improve recognition accuracies by 3-4%. The thesis details the experimental steps, which can also serve as a guideline for developing an automatic speech recognizer for CA.
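
The abstract reports error rates and alignment accuracies without defining them here. As an illustration only, the sketch below shows one common way to score detected segment boundaries against manual labels: a reference boundary counts as a hit if a distinct detected boundary lies within a fixed tolerance. The 20 ms tolerance, the function names, and the sample boundary times are assumptions made for this example and are not taken from the thesis.

```python
from typing import Dict, Sequence


def boundary_hits(reference_ms: Sequence[int],
                  detected_ms: Sequence[int],
                  tolerance_ms: int = 20) -> int:
    """Count reference boundaries matched one-to-one by a detected
    boundary within +/- tolerance_ms (greedy match on sorted times)."""
    ref = sorted(reference_ms)
    det = sorted(detected_ms)
    hits = 0
    j = 0
    for r in ref:
        # Skip detected boundaries that are too early to match r.
        while j < len(det) and det[j] < r - tolerance_ms:
            j += 1
        if j < len(det) and abs(det[j] - r) <= tolerance_ms:
            hits += 1
            j += 1  # each detected boundary may be used only once
    return hits


def alignment_scores(reference_ms: Sequence[int],
                     detected_ms: Sequence[int],
                     tolerance_ms: int = 20) -> Dict[str, float]:
    """Summarize hits, missed and spurious boundaries, and hit rate."""
    hits = boundary_hits(reference_ms, detected_ms, tolerance_ms)
    accuracy = hits / len(reference_ms) if reference_ms else 0.0
    return {
        "hits": hits,
        "missed": len(reference_ms) - hits,     # reference with no match
        "spurious": len(detected_ms) - hits,    # detections with no match
        "accuracy": accuracy,
    }


# Hypothetical boundary times in milliseconds (not thesis data).
manual = [100, 250, 400, 620]
automatic = [110, 260, 450, 610, 800]
scores = alignment_scores(manual, automatic)
print(scores)
```

With these made-up times, three of the four manual boundaries are matched within 20 ms, one is missed, and two detections are spurious. Times are kept as integer milliseconds so that comparisons at the tolerance edge are not affected by floating-point rounding.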
