
A checklist for troubleshooting network problems (22 things to catch)

 

  1. Assumptions! What is really wrong? Is it the network that is being blamed for something else? Fully describe and detail the issue. The mere act of writing it down often clarifies matters.
  2. Kick the tyres and do a visual inspection. With smartphones readily available, take pictures. I once went to a factory where there was a problem. Upon inspection, the network equipment was covered in pigeon pooh! The chassis had rusted and the PCBs were being affected by the stuff. No wonder there was a problem. Another example involved radio links, where it is difficult to remotely troubleshoot alignment errors. (I can recall when a heavy storm blew some radio links out of alignment. Until we climbed onto the roof we never realised how strong the wind really was that day!)
  3. Cabling. Is the cable actually plugged in? Is it plugged into the correct location? Wear and tear on cabling can also not be discounted. As a minimum, invest in a decent cable tester. Check for power cable runs that are parallel to network cables. Check for dust on fibre optic connectors. (Start off by reseating all the patch cables.)
  4. Check the auto negotiation settings. Many problems are a result of switch or host misconfiguration. Tip: auto is best! Surprisingly, this is the biggest current problem in networks. A decent network management tool can detect these mismatches, and even discovery protocols like LLDP and CDP provide visibility. (There is a quick sketch for checking this after the checklist.)
  5. Are packets being dropped? The next biggest problem is often the misinterpretation of bandwidth usage. As an example, a 2 Mb/s WAN link cannot sustain a load of more than 2 Mb/s. Simple??? Often a techie will look at an hourly usage graph and say that because the graph shows a peak of no more than 1 Mb/s there is no problem. WRONG. Data is bursty, which means that a load greater than the available bandwidth for even a few seconds will drop packets and impact applications. The dropping of packets above the available rate is called policing. It is mitigated by shaping. It is crucial to understand that shaping needs to be implemented at a level below the policing rate or else there is no benefit, as packets will still be dropped. (The short illustration after the checklist shows how an averaged graph hides these bursts.)
  6. Check the network drivers. Most of the network drivers bundled with operating systems are not optimal! Visit the NIC (Network Interface Card) manufacturer's website and update.
  7. Walk through the configuration. Are the IP addresses correct? Are the subnets correct? Is the right VLAN being used? Is the gateway correct?
  8. Changes. Compare and determine differences. Firewall rule changes are often prime candidates for review. (And don't discount desktop firewalls!) When reviewing changes consider: What: conditions, activity, equipment; When: schedule, occurrence, status; Where: local, environment; How: practice, actions, procedures; Who: personnel, supervision. Review the network documentation. Is what is written there reflected in reality?
  9. Power! Often network equipment does not start up correctly after a power outage, or is adversely affected by brownouts. It might be prudent to restart the equipment to ensure a clean start up.
  10. Refer to those Release Notes. Somewhere in the world someone has had the same problem as you. Download and read the latest release notes for your network equipment.
  11. Black holes. It is amazing how common black holes really are in networks, and it is usually down to incorrect MTU settings. I can recall a mad day of scrambling around attempting to troubleshoot network connectivity issues before finally narrowing it down to a WAN compression device that was messing with the MTU. Be sure to check all the appliances and network devices along the communications path and check the MTU. As more tunnelled networks are deployed this issue will occur more often. (I was recently phoned by a pal who had the issue on one of his customer's networks between some old 3Com kit and a Cisco WAN. Everyone had gone down the wrong path in trying to troubleshoot the issue before I suggested he check the MTU. Voila! Problem solved.) A quick way to probe for this is shown in the sketch after the checklist.
  12. Sniff. Wireshark's powerful features make it the tool of choice for network troubleshooting. Load the software and capture a copy of the packets involved in the problem. This forms the basis of any extended analysis.
  13. Are the routing tables correct? Check with "show ip route".
  14. Is the bandwidth being saturated? FTP and email are bandwidth killers and the usual suspects.
  15. Spanning tree. Spanning tree must be set up in a deterministic fashion and not left at defaults. And hubs in a switched network are disasters in waiting. Also make sure a techie hasn't left a SPAN port enabled and then reallocated it later.
  16. QoS settings. Have the correct bandwidth allocations been made and are they correct end to end?
  17. Hacking and pseudo hacking. You have hackers and then those that pretend to be hackers. Those vulnerability scans often cause more trouble than the problems they are trying to prevent. Death by firing squad at dawn is the only punishment for Infosec folk doing vulnerability scans across a WAN link, especially during the day.
  18. Service provider finger pointing. Never trust a carrier or service provider when their lips are moving. I should know...
  19. Name resolution. Is name resolution working correctly? (A quick check is sketched after the checklist.)
  20. Complexity. Often network engineers try to show the worth of their big pay packages by designing complex networks. The true worth of a good design is whether it is normalised and taken down to its simplest form. A simple network is less likely to go titsup.
  21. Broadcasts. In many cases too many nodes are installed in a single VLAN or broadcast domain. Has the LAN been correctly structured and designed?
  22. Pre-empt the issue. Fundamentally, this requires a good network configuration management tool and continuous reviews. Is this being done proactively?
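
For point 4, here is a minimal Python sketch that reports the speed, duplex and auto-negotiation state of a NIC so mismatches stand out. It assumes a Linux host with ethtool installed; the interface names are only examples.

```python
# Report speed, duplex and auto-negotiation per NIC (Linux + ethtool assumed).
import subprocess

def link_settings(interface: str) -> dict:
    """Parse ethtool output for one interface into a small dict."""
    output = subprocess.run(
        ["ethtool", interface], capture_output=True, text=True, check=True
    ).stdout
    wanted = ("Speed:", "Duplex:", "Auto-negotiation:")
    settings = {}
    for line in output.splitlines():
        line = line.strip()
        for key in wanted:
            if line.startswith(key):
                settings[key.rstrip(":")] = line.split(":", 1)[1].strip()
    return settings

if __name__ == "__main__":
    for nic in ("eth0", "eth1"):  # example interface names
        try:
            print(nic, link_settings(nic))
        except (subprocess.CalledProcessError, FileNotFoundError):
            print(nic, "- ethtool failed or interface not present")
```

If one end reports half duplex while the other is full, that is the mismatch to chase.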
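
For point 5, the snippet below uses a made-up one-minute traffic profile to show how an averaged graph can sit comfortably under a 2 Mb/s line rate while a ten-second burst above it is policed away.

```python
# Illustration only: averaging hides bursts that a 2 Mb/s policer drops.
# The per-second traffic profile is invented for the example.
LINK_BPS = 2_000_000  # 2 Mb/s WAN link

# Offered load in bits for each second of one minute: quiet, then a burst.
per_second = [200_000] * 50 + [5_000_000] * 10  # ten seconds at 5 Mb/s

average_bps = sum(per_second) / len(per_second)
seconds_policed = sum(1 for bits in per_second if bits > LINK_BPS)
bits_dropped = sum(max(0, bits - LINK_BPS) for bits in per_second)

print(f"average load   : {average_bps / 1e6:.2f} Mb/s")  # looks healthy on a graph
print(f"seconds policed: {seconds_policed}")
print(f"traffic dropped: {bits_dropped / 1e6:.1f} Mb")
```

The average works out at 1 Mb/s, which looks fine on an hourly graph, yet ten seconds of traffic were policed and applications felt it.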
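
For point 11, a minimal sketch of an MTU black hole probe. It assumes the Linux ping syntax (-M do sets the don't-fragment bit); the target address is a placeholder, and the payload sizes correspond to 1500, 1480 and 1400 byte packets once the 28 bytes of IP and ICMP headers are added.

```python
# Probe for an MTU black hole by pinging with the don't-fragment bit set.
# Assumes Linux ping; payload + 28 bytes of IP/ICMP headers = packet size.
import subprocess

def df_ping(host: str, payload: int) -> bool:
    """Return True if a don't-fragment ping of this payload size gets through."""
    result = subprocess.run(
        ["ping", "-M", "do", "-c", "1", "-s", str(payload), host],
        capture_output=True, text=True,
    )
    return result.returncode == 0

if __name__ == "__main__":
    host = "192.0.2.1"  # placeholder, substitute the far end of the path
    for payload in (1472, 1452, 1372):
        ok = df_ping(host, payload)
        print(f"{payload + 28} byte packet: {'passed' if ok else 'dropped'}")
```

If the 1500 byte packet is dropped while the smaller ones pass, something along the path has a reduced MTU or is eating the ICMP "fragmentation needed" messages.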
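
For point 19, a quick forward resolution check to run from the affected host; the hostnames are placeholders for whatever the application actually uses.

```python
# Confirm forward name resolution from the affected host.
import socket

for name in ("intranet.example.com", "mail.example.com"):  # placeholder names
    try:
        addresses = sorted({info[4][0] for info in socket.getaddrinfo(name, None)})
        print(f"{name} -> {', '.join(addresses)}")
    except socket.gaierror as err:
        print(f"{name} -> resolution failed ({err})")
```

If resolution fails here but works from another segment, look at the DNS servers being handed out to that segment.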

Ronald Bartels provides SDWAN solutions via Fusion Broadband.

This article was originally published over at LinkedIn: A checklist for troubleshooting network problems (22 things to catch)

 
