Abstract:
Biological sequence comparison is fundamental to extracting information
that is valuable in applications such as protein structure prediction,
structural similarity prediction, phylogenetic analysis, homology detection,
function prediction and the discovery of evolutionary relationships. Besides
biologists, numerous researchers, including mathematicians, statisticians and
computer scientists, have been drawn to sequence analysis because
of its role in many important applications. Protein classification
has been one of the major areas of research in recent years. Despite
technological advances, classifying proteins accurately remains a major challenge.
In this work, we first introduce an ant-inspired data mining approach
to the protein classification problem in order to investigate the
effectiveness of a rule-based approach. A supervised classification mechanism,
combined with data mining concepts, establishes compact and efficient rules
for classifying proteins into their correct families.
Towards biological sequence analysis, we propose ASIF, a novel algorithm
that consists of an alignment algorithm, ASIFALIGN, and a mathematical
model (dASIF) that quantifies the sequence alignment. The proposed
approach is based on intra-residue distance and a plausible (unbiased)
penalty factor. The approach is tested on a standard dataset of DNA sequences
and produces reliable and robust sequence dissimilarities/similarities.
Moreover, the proposed approach is used to construct phylogenetic
trees. Phylogenetic trees constructed by our approach outperform those
produced by other methods.
In addition, the proposed approach is applied to the protein secondary
structure classification problem. A dataset of twelve secondary structures is
used to validate, for classification purposes, the distance matrix generated
by the new alignment algorithm and mathematical model. The results
produced by the new scoring model are encouraging and
indicate the reliability of our approach.
Our approach not only provides a solid foundation for its applications but
also performs the fundamental task of dissimilarity/similarity calculation
at a reasonable computational complexity. The results reveal the
significance of our approach and provide a basis for the proposed model
to be adopted for other biological applications such as protein function
prediction, homology detection and protein fold recognition.

I would like to dedicate this thesis to My Father (a strong and gentle
soul who taught me to trust in ALLAH, believe in hard work and rest
assured of the best of results), My Mother (late) (for being my first
mentor and a true guide in the shape of her beautiful memories and love),
and My Brothers, Sisters and Family (for supporting and encouraging me
throughout my studies and research).