
How SD-WAN improves Mean Time To Repair: WHILE outage { CASE detect(); diagnose(); resolve(); i++ }


The important consideration when working on improving Mean Time to Repair (MTTR) is to understand the time in between. MTTR is usually taken to mean an outage occurring at one time and the link coming back online at another, but a meaningful conversation about it requires knowing what happens between those two points. In the context of software-defined wide area networks (SD-WAN), the comparison to make is how an SD-WAN deployment would have improved MTTR over a legacy wide area network (WAN) installation.

Based on risk mitigations and industry norms, ISPs often contract SLAs based on these MTTRs. A poorly managed MTTR can result in heavy penalties, or in additional costs when excessive repair times are corrected by adding resources (either headcount or automated service tools) in a way that might not be optimal. Another negative consequence is customer churn.

Incident life cycle

To understand the times involved in MTTR we need to fully understand all the steps that happen from outage to repair, which in ITIL terms is often referred to as the incident life cycle. Here are the steps at a high level:

  • Outage occurs;
  • The outage is detected either by human notification or automated systems such as Network Management Systems;
  • A process of diagnosis occurs whereby resources determine the outage causation and repair process. During this step, a number of tools can potentially assist. Causation can be immediate (visual), intermediate (underlying) or root (underpinning);
  • Typically, once the underlying causation is determined, a repair can be initiated;
  • If appropriate, a workaround might be available to temporarily return the link/connectivity to service as a short-term alternative, with normal operations restored at a later stage;
  • The link is ready for repair when diagnosis is complete, the repair process determined and any logistics such as delivery of spare parts/components completed;
  • The components that have caused the outage are then repaired and this includes restoring the required configuration for normal operations; and
  • The link starts operating normally again when traffic starts flowing again over the link in a manner similar to before the outage.
Programmatically this would be:
WHILE outage {
    step; i++
}
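
To make "the time in between" concrete, here is a minimal Python sketch that splits a single outage into the time spent in each life cycle step and then averages the total across incidents. The `Incident` class, its field names and the example timestamps are assumptions made purely for illustration; they do not come from any particular service desk tool.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class Incident:
    """Timestamps for one outage (field names are illustrative assumptions)."""
    occurred: datetime
    detected: datetime
    diagnosed: datetime
    repaired: datetime
    restored: datetime

    def phases(self) -> dict[str, timedelta]:
        """Time spent in each step of the incident life cycle."""
        return {
            "detect": self.detected - self.occurred,
            "diagnose": self.diagnosed - self.detected,
            "repair": self.repaired - self.diagnosed,
            "restore": self.restored - self.repaired,
        }

    def time_to_repair(self) -> timedelta:
        """Total time from the outage occurring to normal operation."""
        return self.restored - self.occurred


def mean_time_to_repair(incidents: list[Incident]) -> timedelta:
    """Average the total outage duration over a list of incidents."""
    total = sum((i.time_to_repair() for i in incidents), timedelta())
    return total / len(incidents)


# Made-up example: a four hour outage, of which only two hours is repair work.
base = datetime(2024, 1, 1, 9, 0)
example = Incident(
    occurred=base,
    detected=base + timedelta(minutes=20),
    diagnosed=base + timedelta(minutes=80),
    repaired=base + timedelta(minutes=200),
    restored=base + timedelta(minutes=240),
)
```

Recording the intermediate timestamps, and not only the start and end, is what allows each step to be trended and improved separately.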

The SD-WAN architecture inherently improves the MTTR in a number of ways. The connectivity is controlled and managed from concentrators located in data centres. Thus, unlike in a legacy distributed wide area network, any link outage is detected immediately by the concentrators, with no need for a remote polling system.
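
As a rough illustration of why the detection step shrinks, the sketch below estimates the average delay before an outage is noticed. It assumes a simple model that is not taken from any specific product: checks happen at a fixed interval, the outage starts at a random moment within that interval, and the link is declared down after a set number of consecutive missed checks. The intervals used (a five minute NMS poll versus a one second tunnel keepalive) are assumed values, not measurements.

```python
def expected_detection_delay(interval_s: float, misses_required: int = 1) -> float:
    """Average seconds between an outage starting and it being declared.

    Assumes the outage begins at a uniformly random point within a check
    interval, so the first missed check arrives after interval_s / 2 on
    average, and each further required miss adds one full interval.
    Per-check timeouts are ignored for simplicity.
    """
    if interval_s <= 0 or misses_required < 1:
        raise ValueError("interval must be positive and misses_required >= 1")
    return interval_s / 2 + (misses_required - 1) * interval_s


# Assumed figures for illustration only.
legacy_delay = expected_detection_delay(interval_s=300, misses_required=3)
sdwan_delay = expected_detection_delay(interval_s=1, misses_required=3)
```

Under these assumptions the polled network takes 750 seconds on average to declare the outage, against 2.5 seconds for the keepalive-based detection.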


The setup and configuration of an SD-WAN is simple at an administrative level. There are no reams of text to copy and paste via telnet/ssh sessions. The diagnosis is immediately partitioned between the lower transport protocol levels and the higher connectivity protocol levels. SD-WAN makes this diagnosis immediately apparent, and there is none of the extended finger pointing between layer 2 and layer 3 which so often befalls legacy wide area network deployments.
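
The partitioning can be pictured as a very small decision function. This is a hypothetical sketch: the two inputs (`underlay_reachable` and `tunnel_established`) and the `Layer` names are assumptions for illustration, not the API of any SD-WAN product.

```python
from enum import Enum


class Layer(Enum):
    HEALTHY = "healthy"
    TRANSPORT = "transport"  # last mile or underlay problem
    OVERLAY = "overlay"      # tunnel or SD-WAN configuration problem


def classify_outage(underlay_reachable: bool, tunnel_established: bool) -> Layer:
    """Decide which side of the partition a fault sits on.

    If the underlying transport cannot carry packets to the concentrator,
    the fault is in the transport. If it can, but the tunnel is not up,
    the fault is in the higher connectivity layer.
    """
    if not underlay_reachable:
        return Layer.TRANSPORT
    if not tunnel_established:
        return Layer.OVERLAY
    return Layer.HEALTHY
```

The point of the sketch is that the first question asked in diagnosis has a yes or no answer, which tells the service desk straight away whether to call the last mile provider or to look at the SD-WAN configuration.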


Logistics and spare parts are common to SD-WAN and legacy wide area network deployments and are not necessarily better optimised in either scenario. However, since SD-WAN hardware is more likely to be built on white box rather than proprietary hardware, there is a potential improvement in overall parts availability. Another benefit of SD-WAN is that the diagnosis and management capability of the product set is more up to date, which should result in a greater rate of first-visit resolutions by rolling wheels. One of the biggest curses of current legacy WAN installations is the disproportionate number of second visits required by rolling wheels due to component mismatches. Some of these installations have been in the field for years, and new stock often does not interoperate with what is in the field.


The restoration of the link is highly optimised and automated within SD-WAN. This is a result of the simple provisioning mechanism used to deploy SD-WAN initially, which is also leveraged to restore service. The device automatically connects to the concentrator, downloads the configuration and service is restored. In a legacy environment there is often a process involving laptops with specialised cables, remote console sessions over 3G using tools such as TeamViewer, and the cursed cut and paste required by legacy consoles. The skill level required of remote hands in SD-WAN is thus lower, and such hands are therefore more readily available.

SD-WAN links are often deployed with multiple paths and mediums. Given this inherent ability, a workaround is more readily available in SD-WAN deployments than in legacy WAN installations. In many situations, SD-WAN protects overall availability because, when more than one last mile is in place, it is unlikely that they are all suffering outages simultaneously!
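
The effect of a second last mile can be put into numbers with the standard parallel availability formula, A = 1 - (1 - A1)(1 - A2)... The sketch below assumes that the links fail independently of each other, which is an assumption and will not hold if both share a duct, a pole or a power supply. The 99% figure per link is an assumed value for illustration.

```python
HOURS_PER_YEAR = 8760


def combined_availability(link_availabilities: list[float]) -> float:
    """Availability of a site that is up when at least one link is up.

    Assumes link failures are independent, so the probability that every
    link is down at once is the product of the individual unavailabilities.
    """
    if not link_availabilities:
        raise ValueError("at least one link is required")
    all_down = 1.0
    for availability in link_availabilities:
        if not 0.0 <= availability <= 1.0:
            raise ValueError("availability must be between 0 and 1")
        all_down *= 1.0 - availability
    return 1.0 - all_down


def expected_downtime_hours(availability: float) -> float:
    """Expected hours of outage per year at a given availability."""
    return (1.0 - availability) * HOURS_PER_YEAR


# Assumed figures: two last miles, each available 99% of the time.
single = 0.99
dual = combined_availability([0.99, 0.99])
```

With these assumed figures a single link is down for roughly 87.6 hours a year, while the pair is down together for under an hour a year.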

At a basic and practical level, SD-WAN improves MTTR. Any contributions and comments are welcome.

Ronald works connecting Internet inhabiting things at Fusion Broadband.

Ronald Bartels on twitter

