Abstract:
Time-aligned and labeled speech at the sub-word level is required to develop spoken language
technology components. Determining the time boundaries of sub-word units of speech and labeling
them is the speech segmentation problem. Manual labeling by humans is considered the most
accurate approach, but it requires a significant amount of time when a large amount of speech has
to be processed. The evidence that human labelers employ is based on knowledge of acoustic
phonetics and, at a very basic level, on spectrogram-based techniques. The hypothesis that
computers can also segment speech automatically if they use the evidence that human experts rely
on leads us towards time-efficient automatic speech segmentation.
In this thesis, unsupervised automatic time-alignment of speech at the sub-word level is carried
out based on the information that spectrograms carry. The speech spectrogram
engineered in this thesis carries no information about vocal excitation and captures only the
dynamics of the vocal tract. The novel feature, which utilizes both forward and inverse
characteristics of vocal tract (FICV), is found suitable for the segmentation problem.
Additionally, a framework has been developed to evaluate the suitability of a feature extraction
technique for the speech segmentation task.
In the thesis, speech segmentation is carried out on an indigenously developed Classical Arabic
(CA) dataset and is therefore the first scheme of its kind for CA, which is an under-resourced
language in speech technology. The performance of the FICV-based speech segmentation scheme is
compared with standard unsupervised and supervised techniques and shown to be significantly
better in terms of both error rates and alignment accuracy. A reduction of 12.29% in error
rate is achieved with the FICV-based feature when compared with the standard unsupervised technique.
Carrying out supervised segmentation requires a basic sub-word-level recognizer that labels
and aligns speech. For this purpose, a Hidden Markov Model (HMM) based speech recognizer
is trained. The acoustic modeling is carried out using a discriminative technique, which improves
recognition accuracy by up to 4% over the non-discriminative technique. The thesis also verifies
that using manually labeled data for training acoustic models can further improve recognition
accuracy by 3-4%. In this regard, the thesis details the experimental steps, which can also
serve as a guideline for developing an automatic speech recognizer for CA.