Abstract:
Biological sequences in DNA are composed of the nucleotides A, C, G, and T and contain vital
information about living organisms. The development of computing technologies, especially
next-generation sequencing (NGS) technologies, has increased genomic data at a rapid rate. The increase in genomic data presents
significant research challenges in bioinformatics, such as sequence alignment, short-read error
correction, and phylogenetic inference. Next-generation high-throughput sequencing
technologies have opened new and thought-provoking research opportunities. In particular,
next-generation sequencers produce a massive amount of short-read data in a single run.
However, these short reads are more susceptible to errors than reads obtained from
shotgun sequencing. Therefore, there is a pressing need for faster and
more accurate statistical and computational tools to analyze these data. This research presents a
novel and robust algorithm, called HaShRECA, for short-read error correction in genome sequences.
The algorithm is based on a probabilistic model that analyzes potential errors in
reads, and it uses the Hadoop MapReduce framework to speed up the computation.
Experimental results show that HaShRECA is more accurate, and more time- and space-efficient,
than previous algorithms.