Abstract:
Biological sequences consist of A C G and T in a DNA structure and contain vital
information of living organisms. This information is used in many applications such as drug
design, microarray analysis and phylogenetic trees. Advances in computing technologies,
specifically Next Generation Sequencing technologies have increased genomic data at a rapid
rate. The increase in genomic data presents significant research challenges in bioinformatics,
such as sequence alignment, short read error correction, phylogenetic inference etc. Various
tools and algorithms have been proposed for phylogenetic inference. Early algorithms used
sequential programs to solve the problem of phylogenetic inference. Improvements were gained
in terms of tree accuracy and execution time, however; the programs were still slow, and
improvements were needed to infer correct phylogeny in short times. This challenge introduced
parallel and distributed processing to the field of bioinformatics. Many tools and programs have
been developed based on parallel and distributed computing.
This thesis presents algorithmic solutions for phylogenetic inference. Solutions include
‘PhyloDoop’ and ‘SeqCompress’ algorithms. PhyloDoop algorithm is used for inference of
phylogenetic trees. The algorithm is based on Maximum Likelihood method, implemented on
Hadoop Map/Reduce framework. PhyloDoop is based on clusters i.e. divides the input
alignment to clusters, builds trees for each cluster, merges and optimizes all sub-trees and the
final tree is also optimized. PhyloDoop is compared to well-known algorithms both on real and
simulated datasets. Experiments on real datasets were performed to test likelihood values,
execution time, and speedup in distributed environment. The results show better accuracy as
compared to other algorithms on most of the datasets. Execution time is also short on most
datasets. The proposed algorithm yields better speed up on large datasets. Simulated datasets
were used to measure topological accuracy. PhyloDoop is topologically accurate on most
datasets with short execution time in comparison to other algorithms. SeqCompress is used to
compress DNA sequences in order to reduce memory requirements and execution time.
Impressive results are shown in comparison to other algorithms. These results show a gap for
efficient usage of compression techniques to infer correct phylogeny with low memory
requirements as well as execution time.