Why Do Internet Services Fail, and What Can Be Done About It?

The number and popularity of large-scale Internet services such as Google, MSN, and Yahoo! have grown significantly in recent years. Such services are poised to increase further in importance as they become the repository for data in ubiquitous computing systems and the platform upon which new global-scale services and applications are built. These services' large scale and need for 24x7 operation have led their designers to incorporate a number of techniques for achieving high availability. Nonetheless, failures still occur. Although the architects and operators of these services might see such problems as failures on their part, these system failures provide important lessons for the systems community about why large-scale systems fail, and what techniques could prevent failures. In an attempt to answer the question "Why do Internet services fail, and what can be done about it?" we have studied over a hundred post-mortem reports of user-visible failures from three large-scale Internet services. In this paper we
  • identify which service components are most failure-prone and have the highest Time to Repair (TTR), so that service operators and researchers can know what areas most need improvement;
  • discuss in detail several instructive failure case studies;
  • examine the applicability of a number of failure mitigation techniques to the actual failures we studied; and
  • highlight the need for improved operator tools and systems, collection of industry-wide failure data, and creation of service-level benchmarks.
Read the paper here.

Ron - the problem management guy