B.Robic, J.Silc, R.Trobec.
Reliability and throughput improvement in massively parallel systems.
Proc. 6th Int'l Conf. Parallel Computing ParCo'97,
Bonn, Germany, Sep. 16-19, 1997.
This paper presents a technique for the efficient use of massively parallel
systems. We assume that the systems are homogeneous, with a large number of
simple computing nodes interconnected in a regular pattern. Because the number
of nodes is large, we assume that some of the nodes may be faulty,
usually forming fault-clusters. The technique is based on:
(i) efficient heuristics for program mapping and resource allocation
for improving system utilization and throughput, and
(ii) run-time local diagnosis for locating fault-clusters.
In particular, we combine our recent original solutions to three problems
concerning massively parallel systems: local run-time diagnosis [3],
fault-tolerant mapping of arbitrary programs [1], and run-time resource
allocation [2].
The combined technique offers both a low expected execution time for parallel
programs and high system reliability.
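
To make the notion of a fault-cluster concrete, the following minimal sketch groups the faulty nodes of a small 2D mesh into clusters of adjacent faulty nodes. It is only an illustration under stated assumptions, not the diagnosis procedure of [3]: the mesh topology, the 4-neighbour adjacency rule, the boolean fault map, and the function name fault_clusters are all hypothetical choices made for this example, and the fault map is assumed to be already known rather than obtained by run-time testing.

```python
from collections import deque


def fault_clusters(faulty):
    """Group faulty nodes of a 2D mesh into 4-connected clusters.

    `faulty` is a list of rows; faulty[r][c] is True when node (r, c)
    is faulty. This representation is an assumption of the sketch.
    """
    rows = len(faulty)
    cols = len(faulty[0]) if rows else 0
    seen = [[False] * cols for _ in range(rows)]
    clusters = []
    for r in range(rows):
        for c in range(cols):
            if not faulty[r][c] or seen[r][c]:
                continue
            # Breadth-first search over adjacent faulty nodes.
            seen[r][c] = True
            queue = deque([(r, c)])
            cluster = []
            while queue:
                y, x = queue.popleft()
                cluster.append((y, x))
                for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
                    inside = 0 <= ny < rows and 0 <= nx < cols
                    if inside and faulty[ny][nx] and not seen[ny][nx]:
                        seen[ny][nx] = True
                        queue.append((ny, nx))
            clusters.append(sorted(cluster))
    return clusters


# Hypothetical 3 x 4 mesh with two separate groups of faulty nodes.
mesh = [
    [False, True, True, False],
    [False, True, False, False],
    [False, False, False, True],
]
clusters = fault_clusters(mesh)
```

A mapping or allocation step could then treat each returned cluster as a region to avoid; how the techniques of [1] and [2] actually use this information is not specified here.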
Throughput and reliability are improved due to the following results of
various optimizations performed during mapping, allocation, and diagnosis:
[1] B.Robic, J.Silc. Fault tolerant mapping onto VLSI/WSI processor arrays. In Proc. 20th EUROMICRO Conf., pages 697--703, September 1994.
[2] J.Silc, B.Robic. Dynamic program allocation on the mesh-connected parallel architecture. In E.H.D'Hollander et al. (eds.), Parallel Computing: State-of-the-Art and Perspectives, pages 701--704, Elsevier Science Publishing, 1996.
[3] R.Trobec, I.Jerebic. Local diagnosis in massively parallel systems. Parallel Comput., 23(6):721--731, June 1997.