Hadoop-Based Algorithm for Inference of Phylogenetic Trees Using Maximum Likelihood Method

Sardaraz, Muhammad

DSpace Home
→
Engineering and Technology
→
Thesis
→
View Item

Hadoop-Based Algorithm for Inference of Phylogenetic Trees Using Maximum Likelihood Method

Sardaraz, Muhammad

URI: http://142.54.178.187:9060/xmlui/handle/123456789/4883

Date: 2016

Abstract:

Biological sequences consist of A C G and T in a DNA structure and contain vital information of living organisms. This information is used in many applications such as drug design, microarray analysis and phylogenetic trees. Advances in computing technologies, specifically Next Generation Sequencing technologies have increased genomic data at a rapid rate. The increase in genomic data presents significant research challenges in bioinformatics, such as sequence alignment, short read error correction, phylogenetic inference etc. Various tools and algorithms have been proposed for phylogenetic inference. Early algorithms used sequential programs to solve the problem of phylogenetic inference. Improvements were gained in terms of tree accuracy and execution time, however; the programs were still slow, and improvements were needed to infer correct phylogeny in short times. This challenge introduced parallel and distributed processing to the field of bioinformatics. Many tools and programs have been developed based on parallel and distributed computing. This thesis presents algorithmic solutions for phylogenetic inference. Solutions include ‘PhyloDoop’ and ‘SeqCompress’ algorithms. PhyloDoop algorithm is used for inference of phylogenetic trees. The algorithm is based on Maximum Likelihood method, implemented on Hadoop Map/Reduce framework. PhyloDoop is based on clusters i.e. divides the input alignment to clusters, builds trees for each cluster, merges and optimizes all sub-trees and the final tree is also optimized. PhyloDoop is compared to well-known algorithms both on real and simulated datasets. Experiments on real datasets were performed to test likelihood values, execution time, and speedup in distributed environment. The results show better accuracy as compared to other algorithms on most of the datasets. Execution time is also short on most datasets. The proposed algorithm yields better speed up on large datasets. Simulated datasets were used to measure topological accuracy. PhyloDoop is topologically accurate on most datasets with short execution time in comparison to other algorithms. SeqCompress is used to compress DNA sequences in order to reduce memory requirements and execution time. Impressive results are shown in comparison to other algorithms. These results show a gap for efficient usage of compression techniques to infer correct phylogeny with low memory requirements as well as execution time.

Show full item record