Why Do Internet Services Fail, and What Can Be Done About It?
The number and popularity of large-scale Internet
services such as Google, MSN, and Yahoo! have grown significantly in
recent years. Such services are poised to increase further in importance
as they become the repository for data in ubiquitous computing systems
and the platform upon which new global-scale services and applications
are built. These services' large scale and need for 24x7 operation have
led their designers to incorporate a number of techniques for achieving
high availability. Nonetheless, failures still occur. Although the architects and operators of
these services might see such problems as failures on their part, these
system failures provide important lessons for the systems community
about why large-scale systems fail, and what techniques could prevent
failures. In an attempt to answer the question "Why do Internet services
fail, and what can be done about it?" we have studied over a hundred
post-mortem reports of user-visible failures from three large-scale
Internet services. In this paper we
- identify which service components are most failure-prone and have the highest Time to Repair (TTR), so that service operators and researchers can know what areas most need improvement;
- discuss in detail several instructive failure case studies;
- examine the applicability of a number of failure mitigation techniques to the actual failures we studied; and
- highlight the need for improved operator tools and systems, collection of industry-wide failure data, and creation of service-level benchmarks.
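As a rough illustration of the first point, the sketch below tallies failures per component and averages their TTR. It is only a minimal example under stated assumptions: the record layout, the `summarize` helper, the component labels, and every number are hypothetical and are not taken from the paper or from any service's post-mortem data.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical post-mortem records: (component, hours to repair).
# All values are invented for illustration only.
records = [
    ("front-end", 2.0),
    ("front-end", 4.0),
    ("back-end", 1.0),
    ("network", 6.0),
]


def summarize(records):
    """Return {component: (failure_count, mean_ttr_hours)}."""
    ttrs = defaultdict(list)
    for component, ttr_hours in records:
        ttrs[component].append(ttr_hours)
    return {c: (len(v), mean(v)) for c, v in ttrs.items()}


summary = summarize(records)

# Component with the most recorded failures.
most_failure_prone = max(summary, key=lambda c: summary[c][0])

# Component with the highest mean Time to Repair.
slowest_to_repair = max(summary, key=lambda c: summary[c][1])
```

Keeping failure count and mean TTR as separate rankings matters because, as in this toy data, the component that fails most often need not be the one that takes longest to repair.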
Read the paper here.