Metastable Failures in Distributed Systems

charap.co · mjb · 2 years ago
Distributed systems often fail spectacularly and unpredictably. In a metastable failure, the culprit is a sustaining effect that keeps the system in a bad or failed state even after the initial trigger is removed. To understand why triggers and sustaining effects matter, we need to look at how most distributed systems are deployed. These systems often run in what we call a metastable vulnerable state: stable under normal conditions, but vulnerable to failure if a strong enough trigger disturbs that stability.

Because metastable failures affect real systems, we need a better understanding of the problem and the processes involved in order to develop better coping and prevention strategies. Improving our ability to predict and avoid metastable failures also translates directly into efficiency gains, because it lets us operate systems closer to their natural performance limits. As the industry builds ever larger systems and pushes them to run as cheaply as possible, we need a proper understanding of how these critical systems fail at scale so that we can keep improving their reliability.
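To make the interplay of trigger and sustaining effect concrete, here is a minimal toy simulation. It is a sketch under stated assumptions, not a model taken from the original post: the trigger is a temporary capacity drop, the sustaining effect is clients retrying every failed request, and an overloaded server is assumed to waste work so that its useful throughput falls as `capacity * capacity / offered`. The function name, parameters, and all numbers are hypothetical.

```python
def simulate(trigger_ticks, ticks=60, new_load=80.0, capacity=100.0,
             degraded_capacity=60.0, trigger_start=10):
    """Return the number of failed requests in each tick.

    Assumed toy model (not from the original post):
    - new_load fresh requests arrive every tick;
    - every failed request is retried on the next tick (sustaining effect);
    - the trigger lowers capacity for trigger_ticks ticks, then goes away;
    - under overload, goodput drops to capacity**2 / offered, standing in
      for work wasted on requests that have already timed out.
    """
    retries = 0.0
    failed_per_tick = []
    for t in range(ticks):
        in_trigger = trigger_start <= t < trigger_start + trigger_ticks
        cap = degraded_capacity if in_trigger else capacity
        offered = new_load + retries
        if offered <= cap:
            goodput = offered              # healthy: everything is served
        else:
            goodput = cap * cap / offered  # overloaded: useful work shrinks
        failed = offered - goodput
        retries = failed                   # failures come back as retries
        failed_per_tick.append(failed)
    return failed_per_tick


if __name__ == "__main__":
    for label, run in [
        ("no trigger", simulate(0)),
        ("1-tick trigger", simulate(1)),
        ("3-tick trigger", simulate(3)),
        ("3-tick trigger, lower load", simulate(3, new_load=50.0)),
    ]:
        print(f"{label}: failures in final tick = {run[-1]:.1f}")
```

In this sketch a one-tick trigger causes a brief burst of failures and the system then recovers, while a three-tick trigger pushes the retry load past the point of no return: failures keep growing long after full capacity is restored. Running the same three-tick trigger at a lower base load produces no failures at all, which illustrates the trade-off between operating close to the performance limit and staying out of the vulnerable state.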