Abstract:
Implementing fault tolerant scheduling in a computational grid is a challenging task.
Proactive and reactive fault tolerant scheduling techniques are commonly used in
grids. Proactive approaches address the conditions that give rise to faults.
Reactive approaches are activated after failures have been identified. In contrast to
existing fault tolerant techniques, we present a novel, hybrid, dynamic, and adaptive
fault tolerant technique that combines proactive and reactive approaches. The proactive
fault tolerant orchestrator applies the proactive approach, filtering resources on the
basis of vicinity, availability and reliability. Existing fault tolerance techniques do not
distinguish among resources during selection, whereas the proposed algorithm prefers
local resources, which lowers communication costs and reduces the tendency towards
failures. To identify highly available resources, the model incorporates a new
parameter based on availability time, computed from the mean time between
availability and the mean time between unavailability.
Node reliability is an indispensable consideration, and the proposed system computes
it using factors such as the success or failure ratio of jobs and the types of failures
encountered. The proposed model also employs an optimal resource identification
algorithm that helps select optimal resources during job execution. The list
of reliable and optimal grid nodes identified by the proactive fault tolerant orchestrator
is passed to the reactive fault tolerant orchestrator. A failure detector and a predictor
are the two components that work under the reactive fault tolerant orchestrator and
cater for network, prediction and temperature based hardware failures. Push and pull
models are also applied to detect errors in an efficient and timely manner. Hardware
failures are predicted on the basis of device temperature, and these predictions are
used to control the checkpoint intensity. Reducing the number of checkpoints based on
device temperature provides several performance benefits in terms of communication
cost and execution time. The performance of the proposed model is validated using the GridSim
toolkit. Compared with contemporary techniques, the experimental results demonstrate
the efficiency and effectiveness of the proposed model with respect to several
performance metrics, including execution time, throughput, waiting time, turnaround
time, number of checkpoints and energy consumption.
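
The proactive filtering step described above can be illustrated with a minimal sketch. This is not the proposed algorithm. It assumes that each node already carries an availability score in [0, 1] (the formula based on mean time between availability and unavailability is not reproduced here), that reliability is simply the fraction of past jobs that succeeded, and that vicinity is a local/remote flag. The names Node, reliability and filter_nodes, and both thresholds, are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Node:
    name: str
    is_local: bool        # vicinity: True if the node is at the submitting site (assumed flag)
    availability: float   # precomputed score in [0, 1]; the paper's formula is not reproduced
    jobs_succeeded: int
    jobs_failed: int


def reliability(node: Node) -> float:
    """Assumed reliability measure: fraction of past jobs that succeeded."""
    total = node.jobs_succeeded + node.jobs_failed
    return node.jobs_succeeded / total if total else 0.0


def filter_nodes(nodes, min_availability=0.8, min_reliability=0.9):
    """Keep nodes that pass both (hypothetical) thresholds, local nodes first."""
    kept = [
        n for n in nodes
        if n.availability >= min_availability and reliability(n) >= min_reliability
    ]
    # Local nodes sort before remote ones; ties are broken by descending reliability.
    return sorted(kept, key=lambda n: (not n.is_local, -reliability(n)))


if __name__ == "__main__":
    candidates = [
        Node("remote-a", False, 0.95, 99, 1),
        Node("local-b", True, 0.90, 95, 5),
        Node("local-c", True, 0.50, 100, 0),   # rejected: low availability
        Node("remote-d", False, 0.99, 10, 10), # rejected: low reliability
    ]
    for node in filter_nodes(candidates):
        print(node.name, round(reliability(node), 2))
```

In this sketch the filtered, ordered list plays the role of the list handed to the reactive orchestrator. A local node is ranked ahead of a more reliable remote one, which reflects the stated preference for local resources.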