Fault Tolerant Message Passing
Principal investigator: Nehal Desai, CCS-1
Co-investigators: Dean Prichard, CCS-1; Andrey Mirtchovski, CCS-1
Abstract
Recent trends in the use of clusters for large scale distributed scientific calculations have moved away from moderate size clusters of large SMPs toward very large clusters of smaller SMPs. With larger numbers of nodes and more complex network topologies, the issue of reliability becomes paramount. The exponential dependence of reliability on the size of the cluster leads to a rapid decrease in the mean time between failures of the cluster, which will render application level reliability schemes such as checkpoint/restart ineffective. In this study, we propose to implement a message passing system which utilizes transparent process mirroring to build a fault tolerant scientific computing system. Our goal is to increase the usefulness of clusters for scientific computing by greatly increasing their reliability. This work will leverage existing work in our group on building message passing systems for a number of operating systems.
Scientific and Technical Impact of the Proposed Work
This proposal will address the issue of increasing the reliability of commodity clusters for scientific computing. With the increasing popularity of large scale Beowulf type clusters for scientific computing come new challenges. In addition to the standard challenges of building a well performing and scalable cluster, there is the issue of reliability. Distributed scientific applications are unique among computer applications because they are extremely intolerant of system failures. That is, they require all the resources of a computer (e.g. CPU, network, storage, etc.) at all times. Therefore, in order for a scientific calculation to complete properly, system reliability over long periods of time is a must. Before we begin any discussion of reliability, we need to define a few terms:
- Fault tolerance is the ability of a parallel or distributed system to recover from component failures without loss of performance.
- High availability (HA) systems provide users with nearly uninterrupted service, possibly at a reduced level during component failures. The goal of high availability systems is minimal service interruption, not fault tolerance.
- Continuously available (CA) systems provide users with uninterrupted service and are only minimally affected by component or system failures.
In the context
of scientific applications, reliability is equivalent to the level of service provided
by a CA system.
Currently, most scientific applications rely on checkpointing to provide reliability, but as the size of clusters grows the effectiveness of the checkpoint-restart paradigm will diminish, because for large clusters the mean time between failures (MTBF) of any component may approach the time required to write the data files needed to restart the calculation. The situation is easily illustrated by a simple calculation. Consider the case of a compute node with a 10000 hour (~1.15 years) mean time between failures. With 10000 nodes, the probability of all the nodes staying up for 1 hour is only 0.5, i.e. the mean time to failure for the full system is roughly 1 hour. Functionally, the probability that the full system survives a given run time decreases exponentially with the number of nodes.
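The arithmetic behind this estimate is easy to check. The short Python sketch below (illustrative only; it simply evaluates the exponential failure model with independent node failures and is not part of the proposed system) reproduces the numbers:

    # Back-of-the-envelope reliability estimate for a 10000-node cluster
    # whose nodes each have a 10000 hour MTBF (independent exponential failures).
    import math

    MTBF = 10000.0   # per-node mean time between failures, in hours
    n = 10000        # number of nodes
    t = 1.0          # job run time, in hours

    b = math.log(2) / MTBF        # rate chosen so that P(node survives MTBF) = 1/2
    p = 1.0 - math.exp(-b * t)    # probability a single node fails within t hours
    q = (1.0 - p) ** n            # probability that no node fails within t hours

    print("P(single node fails in 1 hour) = %.2e" % p)        # ~6.9e-05
    print("P(all 10000 nodes stay up for 1 hour) = %.3f" % q)  # ~0.500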
The MTBF is the most generous metric that can be used to estimate system reliability. In reality, experiences with large-scale clusters such as ASCI Blue Mountain and ASCI Q show that even a temporary interruption of any hardware or software component of the cluster environment will likely cause a scientific application to terminate. One of the ironies of large-scale clusters is that as the computing power of the cluster increases, the ability to run large scale distributed scientific codes to completion actually decreases.
For upcoming tera-scale clusters, the need to incorporate reliability into the design of a cluster becomes imperative. For current large-scale scientific computing clusters, such as the ASCI Q machine, the estimated MTBF for the full system is on the order of a few hours, compared to roughly a day for previous systems such as ASCI Blue Mountain/Pacific. If current trends continue, the next generation of large-scale clusters will have an MTBF on the order of a few minutes. For such clusters the checkpoint-restart methods in current use will be of little help. Machines of this scale will require new ways of dealing with this inherent unreliability. This study proposes to develop techniques to increase the reliability, and therefore the usability, of large-scale clusters.
Background
Increasing the
reliability of computer systems is not a new issue and has been addressed in a
number of different contexts (telecommunications, storage, financial systems,
air traffic control, etc.) using a plethora of different techniques [1,2,6].
Before we address techniques for increased reliability, we must presume a fault model. Faults can be thought of as occurring at various levels in a cluster. In this study we assume faults occur at the operational or functional level. Operationally, there are two sources of unreliability in a computer system: the nodes (CPUs) and the network. Node faults occur for a number of reasons, including hardware and software failures, and manifest themselves in a variety of ways, from producing incorrect results to total failure. The other source of unreliability is network faults, which result from a loss of networking hardware or software. Network fault tolerance, given its current commercial importance, has been studied extensively, resulting in a number of distinct methodologies [3]. In this study, we build on and extend previous work in message passing by implementing a fault-tolerant routing network.
In addition to the source of the faults, fault models make assumptions about the way in which faults happen or are distributed. In the current study, we assume a random fault model, in which components fail independently. This assumption greatly circumscribes the research, but allows for the introduction of various analytic modeling techniques. In the later stages of this project we hope to remove this assumption and incorporate more complex models into the fault tolerance framework.
One of the most promising methods for dealing with node faults is process mirroring. In process mirroring, processes used in the calculation are replicated, or mirrored. The original and replicated processes are managed as a single unit in order to maintain identical computation and process state information. Periodically, information is exchanged between the replicated processes, allowing a comparison of the calculation and the detection of faulted nodes. The heuristics and algorithms used to determine which information is exchanged, and how to arbitrate the different failure scenarios, are key to the effectiveness of process mirroring. Process mirroring has been available for commercial purposes such as database and enterprise resource planning systems for many years. However, to date no one has effectively applied this technique to distributed scientific computing.
Process units are of two types: simple and replicated. Simple process units are single non-replicated processes. Replicated units are multiple processes that are addressed as a single unit. In Figure 1, processes 0, 1 and 3 are examples of simple processes, while process 2 is an example of a replicated process.
Figure
1: Example of simple (left) and replicated (right) process units.
To understand the impact that process mirroring has on fault tolerance we need to examine some basic probability theory. The probability that a single node has failed by time t is given by the exponential distribution:

p(t) = 1 - e^(-bt)

where b can be calculated from the MTBF as:

b = -log(1/2)/MTBF = log(2)/MTBF
Considering node failures to be independent events, the probability that no node fails (i.e. that the calculation succeeds) is:

q(t) = (1 - p(t))^n

where n is the number of nodes.
With process mirroring, the calculation fails only if both processes in some replicated pair fail, so the probability of success becomes:

q(t) = (1 - p(t)^2)^n

where n is the number of replicated process units.
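As a concrete illustration, the following Python sketch (same assumptions as the sketch above: independent, exponentially distributed node failures) evaluates both expressions for the MTBF = 10000 hour, n = 10000 example used throughout this proposal:

    # Success probability with and without process mirroring,
    # assuming independent exponential node failures.
    import math

    def p_fail(t, mtbf):
        # Probability that a single node fails within t hours.
        return 1.0 - math.exp(-math.log(2) / mtbf * t)

    def q_plain(t, n, mtbf):
        # Probability that none of n unmirrored processes fails within t hours.
        return (1.0 - p_fail(t, mtbf)) ** n

    def q_mirrored(t, n, mtbf):
        # Probability that no mirrored pair (of n pairs) loses both members.
        return (1.0 - p_fail(t, mtbf) ** 2) ** n

    for t in (1, 2, 4, 8):
        print(t, round(q_plain(t, 10000, 10000.0), 3),
              round(q_mirrored(t, 10000, 10000.0), 6))

For a one hour run this gives approximately 0.5 without mirroring and better than 0.9999 with mirroring, which is the improvement quoted in Phase 1 below.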
Figure 2 shows the example of MTBF = 10000 hrs and n = 10000 process units. The left hand panel shows the probability of success as a function of time without process pairs, while the right hand panel shows the same results with process pairs. It is clear from the figure that replicating processes leads to a huge improvement in reliability.
Figure 2: Probability of job success as a function of job run time; the left panel is without process replication, the right panel is with process replication.
Research objectives and goals
The goal of this study is to increase the usefulness of scientific computing clusters by increasing their reliability. Because of the exponential dependence of the probability of failure on the number of nodes, increasing system reliability is the key to using commodity clusters to run tera-scale physics simulations. This study will focus on using replicated processes and their extensions to improve overall machine reliability. Increasing the reliability of clusters means users can spend more time doing science rather than marshaling the checkpoint-restart process.
We propose to implement a clustering infrastructure that utilizes process mirroring and network resiliency to build a fault tolerant, continuously available computer cluster. This infrastructure will consist of several components, including:
- The message passing layer, which controls the flow of data between processes. Since message passing is the primary mechanism for exchanging information in distributed/parallel scientific computing, much of the reliability will be implemented at this level.
- Autonomous network and node monitoring, to determine current hardware and software status.
The message passing component will leverage our previous work in this area [3,5]. Much of the original message passing work in our group was driven by a re-examination of current models of scientific computation and questions as to whether those models will allow scientists to fully exploit the ever increasing power of commodity components. One of the results of these explorations was the adoption of the Plan 9 operating system [7] as a prototyping/research environment. The Plan 9 operating system is unique, and we believe it holds great potential for cluster research. Plan 9 is appealing for a number of reasons:
- Plan 9 is a distributed operating system with a global file system and a novel security model.
- In Plan 9 all system devices are represented as files, so both local and remote resources can be easily controlled and accessed. This allows for the rapid prototyping of new ideas.
- Support for commodity hardware and standard network protocols.
- Operating system support for checkpointing and process migration.
- Similar syntax to System V UNIX, allowing easy porting of applications between Plan 9 and traditional Unix systems.
R & D approach and expected results
Phase 1
This phase of the study will involve adding transparent support for process pairs to our research messaging system. From a research perspective, the objective is not simply to implement process mirroring, but to develop a general framework by:
1. Categorizing the different failure scenarios.
2. Determining which scenarios are most likely and should be addressed.
3. Determining the minimal (and sufficient) set of information that needs to be exchanged between mirrored processes to allow for detection of the various failure scenarios (see the sketch after this list).
4. Constructing algorithms/heuristics that use this information to avoid the faulty or dead nodes.
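As a purely illustrative sketch of item 3, the Python fragment below shows one hypothetical form such an exchange could take: a heartbeat counter plus a checksum of local calculation state, with a simple classification of the pair's health. All names, fields and thresholds here are placeholders; determining the actual content of the exchange and the arbitration rules is precisely the research objective of this phase.

    # Hypothetical sketch of the information a mirrored process pair might
    # exchange and how it could be used to classify the pair's health.
    import hashlib
    import time
    from dataclasses import dataclass

    @dataclass
    class MirrorStatus:
        step: int          # iteration or message sequence number
        state_digest: str  # checksum of locally held calculation state
        timestamp: float   # wall-clock time when the report was produced

    def make_status(step, state_bytes):
        return MirrorStatus(step, hashlib.sha1(state_bytes).hexdigest(), time.time())

    def check_pair(a, b, timeout=5.0):
        # Classify a mirrored pair from its two most recent status reports.
        if abs(a.timestamp - b.timestamp) > timeout:
            return "suspect-dead"   # one side stopped reporting: fail over to its mirror
        if a.step == b.step and a.state_digest != b.state_digest:
            return "diverged"       # same step, different state: possible silent fault
        return "ok"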
In addition, there are a number of practical considerations the framework (and its accompanying infrastructure) must address: the availability of resources and the operational needs of users. In some modes of operation users require only a high degree of certainty about the accuracy of the calculation, while in other modes they require a high degree of certainty of job completion.
In Phase 1, all calculation processes are mirrored. Adding complete process mirroring to an existing distributed calculation results in an increase in the number of resources required to run a simulation. These additional resource requirements must be weighed against the need to successfully run a simulation (e.g. for mission critical applications like those in the ASCI program, the choice is obvious). Figure 3 shows the additional nodes and network connections needed to implement process replication for a simple grid communication pattern.
This increase in resource usage has a large impact on the probability of a distributed calculation running to completion. For example, in the case described above, the addition of process replication increases the probability of the job continuing to run after 1 hour from 0.5 to greater than 0.99.
Figure 3: An example of the extra network
connections and mirror nodes required for process replication.
In addition to the initial fault tolerant infrastructure, we will develop a test suite of representative applications which can be used to demonstrate the effectiveness of the process replication and the fault tolerant message routing schemes.
Phase 2
While process replication greatly increases the fault tolerance of a distributed computing system, the overhead of mirroring every process may be too high for some applications. In this phase of the study we will consider replication of only some of the processes.
If f is the fraction of nodes that have process pairs, then the probability of success is given by:

q(t) = (1 - p(t)^2)^(nf) * (1 - p(t))^(n(1-f))
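The following Python sketch (again assuming independent exponential node failures, with n = 10000 processes and a per-node MTBF of 10000 hours) evaluates this expression over a range of replicated fractions f and reproduces the trend shown in Figure 4:

    # Success probability when only a fraction f of the n processes are mirrored.
    import math

    def q_partial(t, n, mtbf, f):
        p = 1.0 - math.exp(-math.log(2) / mtbf * t)   # single-node failure probability
        return (1.0 - p ** 2) ** (n * f) * (1.0 - p) ** (n * (1.0 - f))

    for f in (0.0, 0.5, 0.9, 1.0):
        print(f, [round(q_partial(t, 10000, 10000.0, f), 3) for t in (1, 2, 4, 8)])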
Figure 4 shows the probability of success as a function of the fraction of replicated pairs for run times of 1, 2, 4 and 8 hours. One can see that the longer the run time, the more important it becomes to replicate all processes, because of the high probability of failure for the un-replicated nodes.
Figure 4: Probability of job success as a function of the fraction of nodes that are replicated. The solid line is for a run time of 1 hour, dotted is for 2 hours, dashed is for 4 hours, and dot-dashed is for an 8 hour run time.
We also propose to use information obtained from system monitoring [4] and the history of node behavior to estimate which nodes are more likely to fail, in order to determine which processes will be mirrored. As with Phase 1, we will demonstrate the software we develop using a test suite of applications on clusters in the ACL and document the results in a series of reports.
References:
[1] R. Birman, Spinglass: Adaptive Probabilistic Tools for Advanced Networks. http://www.cs.cornell.edu/Info/Projects/Spinglass/
[2] J. Bruck, R. Cypher, and C.-T. Ho, Fault-Tolerant Meshes of Hypercubes with Minimal Numbers of Spares. IEEE Transactions on Computers, 43(9):1089-1104, 1991.
[3] R. L. Graham, M. W. Sulalski, L. D. Risinger, D. J. Daniel, N. N. Desai, M. Nochumsen, and L.-L. Chen, Performance Analysis of the LA-MPI Communication Library. Los Alamos National Laboratory Technical Report LA-UR-02-1004.
[4] R. Minnich and K. Reid, Supermon: High Performance Monitoring for Linux Clusters. The Fifth Annual Linux Showcase and Conference, Oakland, CA, November 5-10, 2001.
[5] A. A. Mirtchovski, D. Prichard, and N. N. Desai, Using the Plan 9 Operating System in Cluster Computing for Scientific Applications. Los Alamos National Laboratory Technical Report. Submitted to Cluster 2002.
[6] P. T. Murray, R. A. Fleming, P. D. Harry, and P. A. Vickers, Somersault: Enabling Fault-Tolerant Distributed Software Systems. Hewlett-Packard Labs Technical Report HP-98-81, 1998.
[7] R. Pike, D. Presotto, S. Dorward, B. Flandrena, K. Thompson, H. Trickey, and P. Winterbottom, Plan 9 from Bell Labs. Computing Systems, 8(3):221-254, 1995.
Curriculum Vitae
Nehal Desai
Education:
Ph.D., Mechanical Engineering, North Carolina State University. Work on multiphasic fluid mechanics.
Relevant Work Experience:
Los Alamos National Lab (Oct. 2000-Present)
Technical Staff Member, CCS-1. Member of the Cluster Research and LA-MPI (Los Alamos Message Passing Interface) teams.
IBM (June 1999-Oct. 2000)
Research Programmer, Advanced Middleware Group. Worked on several IBM products, including Personal Communications Manager (PCOMM) and IBM WebSphere.
SGI (April 1998-June 1999)
System Engineer for the ASCI Blue Mountain system at Los Alamos National Laboratory. Part of the Blue Mountain integration team.