
How SD-WAN improves Mean Time To Repair: WHILE outage { CASE detect(); diagnose(); resolve(); i++ }

 

The important consideration when working on improving Mean Time to Repair (MTTR) is to understand the time in between. MTTR is usually taken to mean the interval between an outage occurring at a specific time and the link coming back online at another time, but a meaningful conversation about it requires more information about what happens within that interval. In the context of software-defined wide area networks (SD-WAN), the comparison to make is how an SD-WAN deployment would have improved MTTR over that of a legacy wide area network (WAN) installation.

Based on risk mitigations and industry norms, ISPs often contract SLAs based on these MTTRs. A poorly managed MTTR can result in heavy penalties, or in additional costs when excessive repair times are corrected by adding resources (either headcount or automated service tools), which might not be optimal. Another negative consequence is customer churn.

Incident life cycle

To understand the times involved in MTTR we need to fully understand all the steps that happen from outage to repair, which in ITIL terms is often referred to as the incident life cycle. Here are the steps at a high level:

  • Outage occurs;
  • The outage is detected either by human notification or automated systems such as Network Management Systems;
  • A process of diagnosis occurs whereby resources determine the outage causation and repair process. During this step, a number of tools can potentially assist. Causation can be immediate (visual), intermediate (underlying) or root (underpinning);
  • Typically, when the underlying causation is determined, a repair can be initiated;
  • If appropriate, a workaround might be available to temporarily return the link/connectivity to service as a short-term alternative, while normal operations are restored at a later stage;
  • The link is ready for repair when diagnosis is complete, the repair process determined and any logistics such as delivery of spare parts/components completed;
  • The components that have caused the outage are then repaired and this includes restoring the required configuration for normal operations; and
  • The link starts operating normally again when traffic starts flowing again over the link in a manner similar to before the outage.
Programmatically this would be:
WHILE outage {
step; i++
}
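
The loop above can be made concrete by timing each step of the incident life cycle. The following is a minimal sketch only: the milestone names, the four-phase split (which folds workaround and logistics into the repair phase) and the sample timestamps are assumptions made for illustration, not data from a real deployment.

```python
from datetime import datetime, timedelta

# Hypothetical phases: (phase name, starting milestone, ending milestone).
PHASES = [
    ("detect", "outage", "detected"),
    ("diagnose", "detected", "diagnosed"),
    ("repair", "diagnosed", "repaired"),
    ("restore", "repaired", "restored"),
]


def phase_durations(incident):
    """Return the time spent in each phase of a single incident."""
    return {name: incident[end] - incident[start] for name, start, end in PHASES}


def mean_time_to_repair(incidents):
    """Average the outage-to-restored time over all incidents."""
    total = sum((i["restored"] - i["outage"] for i in incidents), timedelta())
    return total / len(incidents)


# Illustrative sample data: one slow and one fast incident.
incidents = [
    {
        "outage": datetime(2024, 1, 1, 8, 0),
        "detected": datetime(2024, 1, 1, 8, 20),
        "diagnosed": datetime(2024, 1, 1, 9, 30),
        "repaired": datetime(2024, 1, 1, 11, 30),
        "restored": datetime(2024, 1, 1, 12, 0),
    },
    {
        "outage": datetime(2024, 1, 2, 10, 0),
        "detected": datetime(2024, 1, 2, 10, 1),
        "diagnosed": datetime(2024, 1, 2, 10, 16),
        "repaired": datetime(2024, 1, 2, 11, 46),
        "restored": datetime(2024, 1, 2, 12, 0),
    },
]

for number, incident in enumerate(incidents, start=1):
    print(number, phase_durations(incident))
print("MTTR:", mean_time_to_repair(incidents))
```

Breaking the total into phases shows where the time actually goes, which is the "time in between" that matters when comparing an SD-WAN deployment with a legacy WAN.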

The SD-WAN architecture inherently improves the MTTR in a number of ways. The connectivity is controlled and managed from concentrators located in data centres. Thus, unlike a legacy distributed wide area network, any link outage is immediately detected by the concentrators without requiring a remote polling system.
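
One way a concentrator could detect an outage without polling is to watch for missed keepalives from each remote link. The sketch below assumes such a keepalive scheme; the `LinkMonitor` class, the three-second timeout and the link names are hypothetical and do not describe any particular SD-WAN product.

```python
class LinkMonitor:
    """Tracks the last keepalive seen per link (illustrative only)."""

    def __init__(self, timeout_s=3.0):
        # Assumed timeout: a link is considered down after this much silence.
        self.timeout_s = timeout_s
        self.last_seen = {}

    def keepalive(self, link_id, now):
        """Record that a keepalive arrived from link_id at time `now` (seconds)."""
        self.last_seen[link_id] = now

    def down_links(self, now):
        """Return the links whose last keepalive is older than the timeout."""
        return sorted(
            link for link, seen in self.last_seen.items()
            if now - seen > self.timeout_s
        )


monitor = LinkMonitor(timeout_s=3.0)
monitor.keepalive("branch-a", now=0.0)
monitor.keepalive("branch-b", now=0.0)
monitor.keepalive("branch-a", now=2.5)  # branch-b has gone silent
print(monitor.down_links(now=4.0))
```

Because the remote sites send the keepalives, the detection delay is bounded by the timeout rather than by the interval of a central polling cycle.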

Configuration

The setup and configuration of an SD-WAN is simple at an administrative level. There are no reams of text to copy and paste via telnet/ssh sessions. The diagnosis is immediately partitioned between the lower transport protocol levels and the higher connectivity protocol levels. SD-WAN makes this diagnosis immediately apparent, and there is no extended finger pointing between layer 2 and layer 3, which so often befalls legacy wide area network deployments.
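
The partitioning of a fault between the transport level and the connectivity level can be sketched as a simple decision. This is an illustration under assumptions: the two boolean inputs and the `classify_fault` name are hypothetical, standing in for whatever health signals a given SD-WAN platform exposes.

```python
def classify_fault(underlay_reachable, overlay_tunnel_up):
    """Attribute a fault to the transport (underlay) or connectivity (overlay) level."""
    if not underlay_reachable:
        # The last mile itself is unreachable, so the tunnel state is irrelevant.
        return "transport"
    if not overlay_tunnel_up:
        # The transport works but the SD-WAN tunnel on top of it does not.
        return "overlay"
    return "healthy"


print(classify_fault(underlay_reachable=False, overlay_tunnel_up=False))
print(classify_fault(underlay_reachable=True, overlay_tunnel_up=False))
```

Checking the lower level first means each fault is attributed to exactly one level, which is what removes the argument over where the problem lies.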

Logistics

Logistics and spare parts are common to both SD-WAN and legacy wide area network deployments and are not necessarily better optimised in either scenario. However, since SD-WAN hardware is more likely to be built on white box rather than proprietary hardware, there is a potential improvement in overall parts availability. Another benefit of SD-WAN is that the diagnosis and management ability of the product set is more up to date, which will result in a greater first-visit resolution rate for rolling wheels. One of the biggest curses of current legacy WAN installations is the disproportionate number of second visits required by rolling wheels due to component mismatches. Some of these installations have been in the field for years, and new stock often does not interoperate with what is in the field.

Automation

The restoration of the link is highly optimised and automated within SD-WAN. This is a result of the simple provisioning mechanism used to initially deploy SD-WAN, which is also leveraged to restore service. The device automatically connects to the concentrator, downloads the configuration and service is restored. In a legacy environment there is often a process involving laptops with specialised cables, remote session consoles over 3G such as TeamViewer, and the cursed cut and paste required with legacy consoles. The skill level required of remote hands in SD-WAN is thus lower, and such hands are therefore more readily available.

SD-WAN links are often deployed with multiple paths and mediums. Given this inherent ability, a workaround is more readily available in SD-WAN deployments than with legacy WAN installations. In many situations, SD-WAN protects overall availability because, when more than one last mile is in place, it is unlikely that they are all suffering from outages simultaneously!
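
The effect of multiple last miles on availability can be estimated with a short calculation. The sketch below assumes that the last miles fail independently of each other, which is optimistic if they share a duct, exchange or power source; the 99% figure is an illustrative input, not a measured value.

```python
from math import prod

HOURS_PER_YEAR = 8760


def combined_availability(availabilities):
    """Availability of a site that is up while at least one last mile is up.

    Assumes independent failures: the site is down only when every
    last mile is down at the same time.
    """
    return 1 - prod(1 - a for a in availabilities)


def annual_downtime_hours(availability):
    """Expected hours of downtime per year at the given availability."""
    return (1 - availability) * HOURS_PER_YEAR


single = 0.99                                  # one last mile (assumed figure)
dual = combined_availability([0.99, 0.99])     # two independent last miles
print(f"single: {single:.4f}, {annual_downtime_hours(single):.2f} h/year down")
print(f"dual:   {dual:.4f}, {annual_downtime_hours(dual):.2f} h/year down")
```

Under these assumptions, one 99% link is down for roughly 87.6 hours a year, while two such links are both down for less than one hour a year.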

At a basic and practical level, SD-WAN improves MTTR. Any contributions and comments are welcome.

Ronald works connecting Internet inhabiting things at Fusion Broadband.

Ronald Bartels on twitter

