Skip to main content
CS Colloquium | March 30, 2017

Orchestrating Chaos: Applying Database Research In The Wild

Peter Alvaro, University of California, Santa Cruz

Stevenson Hall 1300
12:00 PM - 12:50 PM

Large-scale distributed systems must be built to anticipate and mitigate a variety of hardware and software failures. In order to build confidence that fault-tolerant systems are correctly implemented, an increasing number of large-scale sites regularly run failure drills in which faults are deliberately injected in production or staging systems. While fault injection infrastructures are becoming relatively mature, existing approaches either explore the combinatorial space of potential failures randomly or exploit the “hunches” of domain experts to guide the search. Random strategies waste resources testing “uninteresting” faults, while programmer-guided approaches are only as good as the intuition of a programmer and only scale with human effort. The subject of this talk covers intuition, experience and research directions related to lineage-driven fault injection (LDFI), a novel approach to automating failure testing. LDFI utilizes existing tracing or logging infrastructures to work backwards from good outcomes, identifying redundant computations that allow it to aggressively prune the space of faults that must be explored via fault injection. I will describe LDFI’s theoretical roots in the database research notion of provenance, present early results from the field, and present opportunities for near- and long-term future research.