CS Colloquium | September 13, 2022

Beyond Restart: Checkpointing for the Exascale Era

Rebecca Hartman-Baker
User Engagement Group Leader, National Energy Research Scientific Computing Center

Stevenson Hall 1300
12:00 PM - 12:50 PM

The National Energy Research Scientific Computing Center (NERSC) is the mission high-performance computing center for the United States Department of Energy Office of Science (DOE-SC). NERSC operates cutting-edge, large-scale supercomputers for users performing simulations and data analysis in science areas of interest to DOE-SC. Experimental data analysis, in which data from large scientific instruments such as telescopes or light sources is computationally analyzed, is a growing portion of the compute workload at NERSC. Many of these scientific instruments increasingly require real-time or fast computing turnaround. NERSC's challenge is to support this urgent workload within its existing operations. Checkpoint/restart (C/R) can enable this new workload pattern and address other operational challenges, such as resiliency. In this presentation, we describe our efforts to create a robust offering of checkpointing approaches and to simplify and incentivize their use, in order to optimize the user experience, minimize wasted cycles, and maximize compute-system utilization.