
Prestik, Scotch tape and barbed wire data centres

 



I have encountered many Prestik, Scotch tape and barbed wire data centres. They are called this because these seem to be the only tools the techies have to keep things up and running. You know you are in one because you can visually recognize the dirt, spaghetti cabling and graffiti around the place. Management denies any request for anything better, yet still expects a quality service. The motive is usually to satisfy budget constraints, with operational service delivery in the data centre low on the list of priorities. The result is obvious: major incidents that cause significant productivity losses and even financial damage to the companies involved.

The symptoms of a Prestik, Scotch tape and barbed wire data centre are:

  • Limited documentation and non-existent standard operating procedures;
  • No visible emergency procedures;
  • Poor labeling;
  • Continuous change; and
  • Headless chickens.

This is often made worse by the division of responsibilities in the data centre between the service delivery and operational departments. Invariably, the two protagonists end up pointing fingers at each other.

Water leaks are not uncommon in data centres. Most often they are due to air handling units icing up or piping perishing or failing. Other culprits include a relocated kitchen with a burst geyser, an ablution facility in an inappropriate location, or a burst mains a large distance away whose flood finds its way into the data centre because it is located in the basement. Workable and tested flood detection is a must, and emergency plans must be in place for when it happens. If you don't have a sump pump in your storeroom, you definitely won't have time to order one off eBay and have it delivered before you are floating out of the data centre like Noah, using one of your server cabinets as an ark.
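
As a rough illustration of the kind of flood detection that should be wired into monitoring, here is a minimal sketch in Python. The sensor names, the poll interval and the alerting hook are hypothetical placeholders; a real deployment would read whatever leak-detection hardware the facility actually has and push alerts to the channel the on-call techies actually watch.

```python
import time

# Placeholder sensor list and poll interval - real names would come from the
# facility's leak-detection layout.
SENSORS = ["under-floor-row-A", "chiller-plant", "basement-sump"]
POLL_INTERVAL_SECONDS = 30

def read_leak_sensor(sensor_id: str) -> bool:
    """Hypothetical read of a leak sensor.

    In practice this would poll a GPIO input, a Modbus register or an OID on
    the leak-detection panel; stubbed to 'dry' here.
    """
    return False

def raise_alarm(message: str) -> None:
    # Placeholder: push to whatever channel is monitored around the clock
    # (SMS gateway, e-mail, NMS trap, chat webhook).
    print(f"ALARM: {message}")

def monitor() -> None:
    while True:
        for sensor in SENSORS:
            try:
                if read_leak_sensor(sensor):
                    raise_alarm(f"Water detected at {sensor} - start the flood procedure")
            except Exception as exc:
                # A sensor that cannot be read is itself an incident.
                raise_alarm(f"Leak sensor {sensor} unreadable: {exc}")
        time.sleep(POLL_INTERVAL_SECONDS)

if __name__ == "__main__":
    monitor()
```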

Power outages and rolling blackouts are also a common occurrence. It is naive to assume the backup power will work without testing, and even more naive to conduct a test of a few minutes when the reality will be an outage of a few hours. The reason a full test of four to eight hours is required can be illustrated with this example. Modern air handling units use variable speed drives, which also makes them more efficient. The controllers in these units take inputs from thermostats and air flow sensors, which allows them to operate at optimal levels. Air handling units are not connected to the UPS but directly to utility power, which is backed up by generators; if they were connected to the UPS, they would invariably drain the reserve power too rapidly.

In a power failure the air handling units will therefore power cycle, as there is a latency before the generators start and can sustain a load. Because of the nature of the controllers, the air handling unit controllers which manage the sensors must be on UPS. Secondly, these controllers need to be programmed to delay the start of each air handling unit once power is available. If all the units start at the same time, the combined inrush load will be too high and the start will fail. A different time-delayed start for each unit lets each one draw its startup spike at a different interval, preventing the high load all at once (much the same way Command Module pilot Mattingly staged the power-up sequence in the movie Apollo 13). A modern data centre will start heating up within 5 minutes, and a critical shutdown will be required within 45 minutes. A 5 minute power test will never highlight the kinks in the system, thus a full test is always required.
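
As a back-of-the-envelope illustration of the staggered start idea, the sketch below assigns each air handling unit its own delay so that the startup spikes never coincide. The unit names, the 30 second stagger and the power figures are made-up values; the real delays are programmed into each unit's controller to suit its measured inrush and the generator's rated capacity.

```python
from dataclasses import dataclass

# Illustrative figures only - real values come from the unit nameplates and
# the generator's rated capacity.
GENERATOR_CAPACITY_KW = 400
STAGGER_SECONDS = 30          # gap between successive unit starts

@dataclass
class AirHandlingUnit:
    name: str
    inrush_kw: float          # startup power spike
    running_kw: float         # steady-state draw once spinning

UNITS = [
    AirHandlingUnit("AHU-1", inrush_kw=120, running_kw=40),
    AirHandlingUnit("AHU-2", inrush_kw=120, running_kw=40),
    AirHandlingUnit("AHU-3", inrush_kw=120, running_kw=40),
    AirHandlingUnit("AHU-4", inrush_kw=120, running_kw=40),
]

def staggered_start_schedule(units):
    """Assign each unit a start delay and sanity-check the worst-case load.

    At each start, the load is the inrush of the unit starting plus the
    steady-state draw of every unit already running.
    """
    schedule = []
    for i, unit in enumerate(units):
        delay = i * STAGGER_SECONDS
        peak = unit.inrush_kw + sum(u.running_kw for u in units[:i])
        if peak > GENERATOR_CAPACITY_KW:
            raise ValueError(f"{unit.name} start at +{delay}s would hit {peak} kW, "
                             f"over the {GENERATOR_CAPACITY_KW} kW generator limit")
        schedule.append((unit.name, delay, peak))
    return schedule

if __name__ == "__main__":
    for name, delay, peak in staggered_start_schedule(UNITS):
        print(f"{name}: start at +{delay}s, worst-case load {peak} kW")
```

With these made-up numbers, starting all four units together would demand 480 kW against a 400 kW generator and fail; staggered, the worst moment is 240 kW. That is exactly the kind of kink only a long, full test will expose.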

IT guys like to view network status using management tools and web-based dashboards. Facility guys walk around with clipboards and eyeball the equipment. Both methods have merit, but doing only one has none. It always fascinates me why consultants install UPSs worth a million bucks and overlook the web or Ethernet management module for the unit. I have seen a data centre fail because no-one knew the generator was not charging the UPS and no web module was installed.
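
To show what that overlooked web or Ethernet module buys you, here is a minimal sketch that polls a UPS network card for its remaining battery charge, assuming the card exposes the standard UPS-MIB (RFC 1628) over SNMP and that Net-SNMP's snmpget is installed. The host name, community string and alert threshold are placeholders.

```python
import subprocess

UPS_HOST = "ups-1.example.net"         # placeholder: address of the UPS network module
COMMUNITY = "public"                   # placeholder: SNMP v2c community string
CHARGE_OID = "1.3.6.1.2.1.33.1.2.4.0"  # upsEstimatedChargeRemaining (UPS-MIB, RFC 1628)
ALERT_THRESHOLD_PERCENT = 80           # a battery stuck below this deserves a look

def battery_charge_percent() -> int:
    """Fetch the UPS's estimated charge remaining via Net-SNMP's snmpget."""
    output = subprocess.check_output(
        ["snmpget", "-v2c", "-c", COMMUNITY, UPS_HOST, CHARGE_OID],
        text=True,
    )
    # Typical output ends in "= INTEGER: 100"; keep the number after the last colon.
    return int(output.strip().rsplit(":", 1)[-1])

if __name__ == "__main__":
    charge = battery_charge_percent()
    if charge < ALERT_THRESHOLD_PERCENT:
        print(f"WARNING: UPS charge at {charge}% - check the charger and generator feed")
    else:
        print(f"UPS charge healthy at {charge}%")
```

The point is not the script itself but that the module makes the UPS's state visible to the same monitoring the IT guys already watch, so a dead charger is spotted before the batteries are actually needed.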

When a major incident happens and large amounts of moola are flushed down the toilet, it is too late to say that the data centre should have been built with more than Prestik, Scotch tape and barbed wire.

Ronald Bartels works at Fusion Broadband and is driving SD-WAN adoption in South Africa.

This article was originally published over at LinkedIn: Prestik, Scotch tape and barbed wire data centres
