B.Robic, J.Silc, R.Trobec.
Reliability and throughput improvement in massively parallel systems.
Proc. 6th Int'l Conf. Parallel Computing ParCo'97, Bonn, Germany, Sep. 16-19, 1997.

This paper presents a technique for the efficient use of massively parallel systems. We assume that the systems are homogeneous, with a large number of simple computing nodes interconnected in a regular pattern. Because the number of nodes is large, we assume that some nodes may be faulty, with the faults usually forming fault-clusters. The technique is based on (i) efficient heuristics for program mapping and resource allocation, which improve system utilization and throughput, and (ii) run-time local diagnosis, which locates fault-clusters.
In particular, we combine our recent solutions to three problems concerning massively parallel systems: local run-time diagnosis [3], fault-tolerant mapping of arbitrary programs [1], and run-time resource allocation [2].
The combined technique offers both a low expected execution time for parallel programs and high system reliability. Throughput and reliability improve owing to the following results of optimizations performed during mapping, allocation, and diagnosis:

minimized resource requirements of programs in terms of computing nodes and links,
reduced program waiting and execution times,
reduced system fragmentation,
dynamic reconfiguration after fault detection,
parallel diagnosis that can be carried out in idle parts of the system concurrently with application programs,
no hardware redundancy required for diagnosis,
detection of non-permanent faults.

[1] B.Robic, J.Silc. Fault tolerant mapping onto VLSI/WSI processor arrays. In Proc. 20th EUROMICRO Conf., pages 697--703, September 1994.

[2] J.Silc, B.Robic. Dynamic program allocation on the mesh-connected parallel architecture. In E.H.D'Hollander et al. (eds.), Parallel Computing: State-of-the-Art and Perspectives, pages 701--704, Elsevier Science Publishing, 1996.

[3] R.Trobec, I.Jerebic. Local diagnosis in massively parallel systems. Parallel Comput., 23(6):721--731, June 1997.