CS Colloquium | March 16, 2022

System Resilience: Amplify Failures, Detect, or Both?

Ganesh Gopalakrishnan (ACM Distinguished Speaker)
Professor, University of Utah

Stevenson Hall 1300
12:00 PM - 12:50 PM

As we cram billions of transistors onto a chip and build computers with thousands of such chips, the probability that bits of system state are transiently corrupted by system noise or high-energy particle strikes goes up. Such "soft errors" are exacerbated by manufacturing variability, which is higher at smaller lithographies.

Many types of software-based error detectors have been proposed to detect these soft errors and trigger recomputation from state checkpoints. Unfortunately, most of these detection schemes introduce unacceptable computational overheads and also have unacceptably high false-positive rates.

In one line of work, we have improved this situation by focusing on applications such as stencils. In this domain, we guarantee near-100% detection through rigorous floating-point error analysis based on affine arithmetic. We also reduce overheads by having each detector deployment cover multiple steps of the stencil application.
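
To make the idea of one detector check covering several stencil steps concrete, here is a minimal sketch. It is not the affine-arithmetic analysis described in the talk. It swaps in a much simpler technique: a conservation checksum for a periodic 3-point stencil whose weights sum to 1, compared against a crude, hand-picked rounding tolerance. All names (`stencil_step`, `flip_bit`, `run_window`, `window_is_clean`) and the safety factor of 8 are assumptions made for illustration.

```python
import math
import struct
import sys

# Weights are exactly representable in binary and sum to 1, so in exact
# arithmetic the sum of the grid is unchanged by every stencil step.
WEIGHTS = (0.25, 0.5, 0.25)


def stencil_step(u):
    """Apply one step of a periodic 3-point averaging stencil."""
    n = len(u)
    a, b, c = WEIGHTS
    # u[i - 1] wraps around for i == 0 via Python's negative indexing.
    return [a * u[i - 1] + b * u[i] + c * u[(i + 1) % n] for i in range(n)]


def flip_bit(x, bit):
    """Return x with one bit of its 64-bit IEEE 754 encoding flipped."""
    (bits,) = struct.unpack("<Q", struct.pack("<d", x))
    (y,) = struct.unpack("<d", struct.pack("<Q", bits ^ (1 << bit)))
    return y


def run_window(u, steps, fault=None):
    """Run `steps` stencil steps; optionally inject one simulated soft error.

    fault is a hypothetical (step, index, bit) triple: after that step,
    the given bit of u[index] is flipped.
    """
    for s in range(steps):
        u = stencil_step(u)
        if fault is not None and fault[0] == s:
            u[fault[1]] = flip_bit(u[fault[1]], fault[2])
    return u


def window_is_clean(u_before, u_after, steps):
    """Single check covering a whole window of `steps` stencil steps.

    The tolerance is a crude, conservative allowance for accumulated
    rounding error (the factor 8 is an assumption, not a derived bound).
    """
    n = len(u_before)
    max_abs = max(abs(x) for x in u_before)
    tol = 8 * steps * n * sys.float_info.epsilon * max_abs
    drift = abs(math.fsum(u_after) - math.fsum(u_before))
    return drift <= tol


if __name__ == "__main__":
    n, steps = 64, 10
    u0 = [1.0 + 0.5 * math.sin(2 * math.pi * i / n) for i in range(n)]
    print(window_is_clean(u0, run_window(u0, steps), steps))
    print(window_is_clean(u0, run_window(u0, steps, fault=(3, 5, 51)), steps))
```

In this sketch the detector runs once per ten steps rather than once per step, which is where the overhead saving comes from. A flip in a low-order mantissa bit can fall below the tolerance and go unflagged; a tighter, rigorously derived error bound is what narrows that gap.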

The main take-away message is that system resilience solutions developed with attention to higher accuracy and lower overheads may prove to be the inevitable safety net that designers rely on as they attempt to reduce energy consumption while Moore's law comes to an end.