Skip to main content

Simple Testing Can Prevent Most Critical Failures: An Analysis of Production Failures in Distributed Data-Intensive Systems

Large, production quality distributed systems still fail periodically, and do so sometimes catastrophically, where most or all users experience an outage or data loss. We present the result of a comprehensive study investigating 198 randomly selected, user-reported failures that occurred on Cassandra, HBase, Hadoop Distributed File System (HDFS), Hadoop MapReduce, and Redis, with the goal of understanding how one or multiple faults eventually evolve into a user-visible failure. We found that from a testing point of view, almost all failures require only 3 or fewer nodes to reproduce, which is good news considering that these services typically run on a very large number of nodes. However, multiple inputs are needed to trigger the failures with the order between them being important. Finally, we found the error logs of these systems typically contain sufficient data on both the errors and the input events that triggered the failure, enabling the diagnose and the reproduction of the production failures.  We found the majority of catastrophic failures could easily have been prevented by performing simple testing on error handling code – the last line of defense – even without an understanding of the software design. We extracted three simple rules from the bugs that have lead to some of the catastrophic failures, and developed a static checker, Aspirator, capable of locating these bugs. Over 30% of the catastrophic failures would have been prevented had Aspirator been used and the identified bugs fixed. RunningAspirator on the code of 9 distributed systems located 143 bugs and bad practices that have been fixed or confirmed by the developers.

View the presentation here or read the full paper here.

Ron - the problem management guy

Comments

Popular posts from this blog

LDWin: Link Discovery for Windows

LDWin supports the following methods of link discovery: CDP - Cisco Discovery Protocol LLDP - Link Layer Discovery Protocol Download LDWin from here.

easywall - Web interface for easy use of the IPTables firewall on Linux systems written in Python3.

Firewalls are becoming increasingly important in today’s world. Hackers and automated scripts are constantly trying to invade your system and use it for Bitcoin mining, botnets or other things. To prevent these attacks, you can use a firewall on your system. IPTables is the strongest firewall in Linux because it can filter packets in the kernel before they reach the application. Using IPTables is not very easy for Linux beginners. We have created easywall - the simple IPTables web interface . The focus of the software is on easy installation and use. Access this neat software over on github: easywall

Latest: updatethreatblock.sh

#!/bin/bash # # usage updatethreatblock.sh <configuration file> # eg: updatethreatblock.sh /etc/ipset-threatblock/ipset-threatblock.conf # function exists() { command -v "$1" >/dev/null 2>&1 ; } if [[ -z "$1" ]]; then   echo "Error: please specify a configuration file, e.g. $0 /etc/ipset-threatblock/ipset-threatblock.conf"   exit 1 fi # shellcheck source=ipset-threatblock.conf if ! source "$1"; then   echo "Error: can't load configuration file $1"   exit 1 fi if ! exists curl && exists egrep && exists grep && exists ipset && exists iptables && exists sed && exists sort && exists wc ; then   echo >&2 "Error: searching PATH fails to find executables among: curl egrep grep ipset iptables sed sort wc"   exit 1 fi DO_OPTIMIZE_CIDR=no if exists iprange && [[ ${OPTIMIZE_CIDR:-yes} != no ]]; then   DO_OPTIMIZE_CIDR=yes fi if [[ ! -d $(dirname &q