Failure Recovery

LINQ to HPC is designed to run on clusters of commodity hardware, so you should anticipate that one or more of the DSC nodes may fail during execution, with or without warning. The LINQ to HPC graph manager monitors the status of all executing vertices, in part to detect transient or permanent failures. Each vertex must report back to the LINQ to HPC graph manager within a set timeout period. If a vertex fails to report within that period, the LINQ to HPC graph manager assumes that the vertex has failed or that the computer running it has crashed, and it initiates recovery procedures.
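The timeout check described above can be illustrated with a minimal sketch. This is not the LINQ to HPC implementation; the class name HeartbeatMonitor, its methods, and the 30-second timeout are all hypothetical, chosen only to show the idea of presuming a vertex failed once it has been silent for too long.

```python
from dataclasses import dataclass, field


@dataclass
class HeartbeatMonitor:
    """Hypothetical stand-in for the graph manager's status tracking."""

    timeout_seconds: float  # assumed value; the text does not state the real timeout
    last_report: dict = field(default_factory=dict)

    def report(self, vertex_id: str, now: float) -> None:
        # A vertex reports in; remember when it was last heard from.
        self.last_report[vertex_id] = now

    def failed_vertices(self, now: float) -> list:
        # Any vertex silent for longer than the timeout is presumed failed.
        return sorted(
            vertex_id
            for vertex_id, reported_at in self.last_report.items()
            if now - reported_at > self.timeout_seconds
        )


monitor = HeartbeatMonitor(timeout_seconds=30.0)
monitor.report("vertex-a", now=0.0)
monitor.report("vertex-b", now=0.0)
monitor.report("vertex-a", now=25.0)  # vertex-a reports again; vertex-b stays silent

suspects = monitor.failed_vertices(now=40.0)
print(suspects)
```

Times are passed in explicitly rather than read from a clock, which keeps the sketch deterministic. Note that the monitor cannot tell a crashed computer from a vertex that is merely slow to report; both look the same once the timeout expires.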

LINQ to HPC vertices are assumed to be deterministic, and the LINQ to HPC graph is acyclic, so failure recovery is relatively straightforward: running a failed vertex again on the same input produces the same output. When a vertex fails, the LINQ to HPC graph manager reexecutes the failed vertex, perhaps on a different computer. If the failure was due to a read error, the LINQ to HPC graph manager also marks the upstream vertex as "Failed," and reexecutes that vertex as well. The system will retry a vertex six times before causing the job to fail.
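The recovery policy above can be sketched as a small retry loop. This is an illustration under stated assumptions, not the graph manager's actual code: run_with_recovery, ReadError, VertexFailure, and JobFailed are hypothetical names, vertices are modeled as plain zero-argument functions, and "retry six times" is read here as six retries after the first attempt.

```python
MAX_RETRIES = 6  # from the text; interpreted as six retries after the first attempt


class VertexFailure(Exception):
    """Hypothetical: any failure of a vertex."""


class ReadError(VertexFailure):
    """Hypothetical: the vertex could not read its upstream vertex's output."""


class JobFailed(Exception):
    """Hypothetical: a vertex exhausted its retries, so the job fails."""


def run_with_recovery(vertex, upstream=None, max_retries=MAX_RETRIES):
    """Run a vertex, reexecuting it (and its upstream vertex on read errors)."""
    for _ in range(1 + max_retries):
        try:
            return vertex()
        except ReadError:
            # The input is missing or unreadable: treat the upstream vertex
            # as failed and reexecute it before trying this vertex again.
            if upstream is not None:
                upstream()
        except VertexFailure:
            # Any other failure: simply try the vertex again.
            pass
    raise JobFailed(f"vertex still failing after {max_retries} retries")


store = {}  # stands in for intermediate output passed between vertices
calls = {"upstream": 0, "downstream": 0}


def upstream_vertex():
    calls["upstream"] += 1
    store["partition"] = [1, 2, 3]


def downstream_vertex():
    calls["downstream"] += 1
    if "partition" not in store:
        raise ReadError("input partition is missing")
    return sum(store["partition"])


result = run_with_recovery(downstream_vertex, upstream_vertex)
print(result, calls)
```

The sketch depends on the two properties named in the text. Because vertices are deterministic, reexecuting the upstream vertex regenerates exactly the input that was lost, and because the graph is acyclic, walking back upstream always terminates.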