July 9, 2012
Stress Failures Versus Decay Failures
Two distinct categories of system failure require different responses - stress failures from acute overload and decay failures from gradual erosion.
Two Ways Things Break
Systems fail in two fundamentally different ways, and mixing them up produces wrong diagnoses and wrong remedies.
A stress failure occurs when a system experiences more load than it was designed to handle. The bridge bears too much weight. The server receives more requests than it can process. The person takes on more commitments than they can fulfill. The failure is caused by the excess load, and the fix involves either reducing the load or increasing the capacity.
A decay failure occurs when a system that was adequate degrades over time until it can no longer perform its function. The bridge rusts until a normal load causes failure. The server's hardware degrades until routine requests fail. The person's skills atrophy until they can no longer perform work that was previously routine. The failure is caused by the degradation, and the fix involves either reversing the decay or replacing the degraded components.
These failures look similar at the moment of collapse - the bridge falls, the server crashes, the person fails to deliver - but they have different histories and require different responses.
Diagnosing the Type
The diagnostic question is: was this system performing adequately before the failure? If it was, something changed to cause the failure. If the change was an increase in load, you are looking at a stress failure. If load stayed within its normal range, you are probably looking at a decay failure.
In organizational contexts, this distinction matters enormously. A team that was meeting its commitments and then missed a deadline when workload spiked has a stress failure. The response is to address the workload: reduce scope, add resources, extend timeline.
A team that has been gradually missing small commitments, producing lower-quality work, and struggling with coordination over an extended period has a decay failure. Adding more resources in response to a decay failure is often ineffective or counterproductive - it treats the symptom (insufficient capacity) while the decay continues unaddressed.
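The diagnostic question can be written down as a small decision rule. The following is only a sketch under stated assumptions: the function name, the 1.25 spike ratio, and the "inadequate from the start" branch are illustrative choices, not part of the argument above, and real diagnosis needs judgment about what counts as normal load.

```python
def diagnose_failure(was_adequate_before, baseline_load, load_at_failure,
                     spike_ratio=1.25):
    """Classify a failure as stress or decay using the diagnostic question.

    spike_ratio is an assumed cutoff: load above baseline * spike_ratio
    is treated as a clear load increase.
    """
    if not was_adequate_before:
        # Not covered by the two categories: the system never met its load.
        return "inadequate from the start"
    if load_at_failure > baseline_load * spike_ratio:
        # Load clearly rose, so the excess load is the likely cause.
        return "stress"
    # Load was normal, so the system itself probably got weaker.
    return "decay (probable)"


# A team that handled 10 tasks a week and failed when handed 18:
print(diagnose_failure(True, baseline_load=10, load_at_failure=18))
# A team that failed at its usual 10 tasks a week:
print(diagnose_failure(True, baseline_load=10, load_at_failure=10))
```

The "probable" in the second result is deliberate. The absence of a load spike points toward decay but does not prove it.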
Decay Is Invisible Until Sudden
Decay failures are particularly insidious because they are invisible until they become sudden. The process of decay is slow and gradual. Each day's degradation is too small to notice or attribute to a specific cause. The accumulated degradation over months or years produces a dramatic failure event that looks sudden but was actually long in development.
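A toy simulation makes the arithmetic concrete. It assumes, purely for illustration, that capacity shrinks by a fixed fraction each day while load stays constant; real decay is rarely this regular, and the numbers are invented.

```python
def days_until_failure(capacity, load, daily_decay):
    """Count the days until a decaying capacity drops below a constant load.

    Assumes capacity shrinks by the same fraction (daily_decay) every day.
    """
    if not 0 < daily_decay < 1:
        raise ValueError("daily_decay must be between 0 and 1")
    if load <= 0:
        raise ValueError("load must be positive")
    days = 0
    while capacity >= load:
        capacity *= 1 - daily_decay  # one more day of unnoticed erosion
        days += 1
    return days


# Capacity 100, routine load 70, losing half a percent of capacity per day.
print(days_until_failure(capacity=100, load=70, daily_decay=0.005))
```

With these numbers the system carries its routine load for 71 days and fails on day 72. No single day removes more than half a unit of capacity, yet the failure arrives on an otherwise ordinary day with no unusual load to blame.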
This mismatch between the gradual cause and the sudden effect produces systematic misdiagnosis. When something fails dramatically, we look for dramatic causes. We search for the specific event that caused the failure. We find one - or invent one - and design responses around it. But if the actual cause was accumulated decay, addressing the apparent triggering event does nothing about the underlying erosion.
The post-mortem process is particularly prone to this error. Post-mortems are good at identifying the immediate precipitating cause of a failure. They are poor at tracing accumulated decay that made the system vulnerable to failure long before the precipitating cause appeared.
Maintenance as Decay Prevention
The practical implication of decay failure analysis is that maintenance - active, ongoing effort to prevent degradation - is not optional. It is a necessary part of any system's operating costs.
Maintenance is systematically undervalued in most contexts. It is invisible when done well: the thing that was maintained simply does not fail, and nobody notices. It competes with new development for resources. It demands attention for things that are not currently urgent.
The organizations and individuals who manage decay failures well typically have explicit maintenance practices: skills that are kept current rather than allowed to atrophy, processes that are regularly reviewed and updated, systems that are actively monitored rather than trusted to continue working.
The organizations and individuals who do not have maintenance practices tend to experience the distinctive pattern of decay failure: everything seems fine until suddenly it is not, and the failure is much worse than it would have been if the decay had been caught and addressed earlier.
Resilience Design
The ultimate lesson is about resilience design. A resilient system is not one that never fails. It is one where failures are caught early when they are small (before stress failures become catastrophic and before decay progresses to critical failure), and where the system can be repaired and restored when failures do occur.
This requires both stress monitoring (knowing when load is approaching design limits) and decay monitoring (tracking degradation over time, not just current performance). The latter is harder and more often neglected, which is why decay failures surprise us more often than stress failures.
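The two kinds of monitoring can be sketched side by side. This is a minimal illustration, not a recommended implementation: the function names, the 0.8 utilization threshold, and the use of a least-squares slope as the decay signal are all assumptions made for the example. The decay check expects a performance metric sampled at regular intervals under comparable load.

```python
def stress_alert(load, design_limit, threshold=0.8):
    """Stress monitoring: is current load approaching the design limit?"""
    return load / design_limit >= threshold


def trend_slope(values):
    """Least-squares slope of evenly spaced samples (change per period)."""
    n = len(values)
    if n < 2:
        raise ValueError("need at least two samples")
    x_mean = (n - 1) / 2
    y_mean = sum(values) / n
    num = sum((i - x_mean) * (v - y_mean) for i, v in enumerate(values))
    den = sum((i - x_mean) ** 2 for i in range(n))
    return num / den


def decay_alert(values, max_drop_per_period):
    """Decay monitoring: is the metric trending down faster than allowed?"""
    return trend_slope(values) <= -max_drop_per_period


# Stress check looks at a single moment.
print(stress_alert(load=850, design_limit=1000))
# Decay check looks at history: throughput measured once a month.
print(decay_alert([100, 98, 96, 94], max_drop_per_period=1.0))
```

The difference in inputs is the point. The stress check needs only the current reading, while the decay check needs a stored history, which is one reason it is harder to set up and easier to neglect.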
A good maintenance philosophy understands that decay is the default state of complex systems. Preventing it requires active, ongoing attention - not just response when something visibly breaks.