Troubleshooting steps


Introduction

Troubleshooting network issues is a core skill of every network engineer, yet we rarely think of it as a skill. We don't study or train it deliberately; we just pick up experience from our daily routine or follow the company workflow. In this post I talk about troubleshooting as a formal process and try to formalize some basic notions. I hope it will be helpful.

Of course, it depends on the situation and business constraints, but when we try to resolve an issue we should follow these steps:

Preparing -> Information-gathering -> Isolating -> Resolving -> Escalating

Let's look at each step.

Preparing

Every network has infrastructure tools (monitoring, inventory, etc.), but we should continuously improve them, keep them up to date, and try to develop and integrate new ones. This stack of tools is our source of truth. With it, we can easily fetch the full set of information before, during, and after a problem. It's an enormous topic, but without these tools we can't troubleshoot our network successfully.

Mandatory tools:

  • Syslog (at least a simple Syslog server; something like the Elastic stack is good to have)
  • Alarm management system (e.g. Zabbix)
  • Statistics collector (e.g. Zabbix, Cacti, etc.)
  • Inventory tool (e.g. Netbox)
  • AAA system (e.g. TACACS)
  • CLI log collector (many terminal tools have this feature)
  • NTP
  • Topology tools and maps
  • Configuration collector
  • Demarcation and mirror points (it's very important to be able to mirror real traffic or measure some traffic characteristics)
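
As an illustration of what a configuration collector makes possible, here is a minimal sketch that compares two stored configuration snapshots with Python's standard `difflib`. The snapshots and the `config_diff` helper are made up for the example; a real collector would read the configurations from its own storage.

```python
import difflib


def config_diff(old_config: str, new_config: str) -> list[str]:
    """Return a unified diff between two configuration snapshots."""
    return list(
        difflib.unified_diff(
            old_config.splitlines(),
            new_config.splitlines(),
            fromfile="before",
            tofile="after",
            lineterm="",
        )
    )


# Hypothetical snapshots, not taken from any real device.
before = "interface eth0\n mtu 1500\n no shutdown\n"
after = "interface eth0\n mtu 9000\n no shutdown\n"

for line in config_diff(before, after):
    print(line)
```
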

Another useful tool is a local knowledge base. Some possible parts of a knowledge base are:
  • network design documentation (HLD, LLD, drafts)
  • hardware and software documentation from vendors
  • software release notes
  • database of previous issue descriptions
  • vendor support contacts

Information-gathering

What information should we fetch from our tools when an issue occurs?

  • logs
  • alarms
  • statistics
  • special tech-support files from network nodes (this depends on the vendor equipment)
  • various show and debug commands (of course we can't collect all outputs; this starts once we have isolated the problem and is discussed in the following sections)
  • traffic dumps (if possible and applicable)
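
To make the log-gathering step concrete, here is a small sketch that keeps only the log lines close to the time of the incident. It assumes every line starts with an ISO 8601 timestamp followed by a space, which is only one of many possible formats; the sample messages and the `logs_around` helper are invented for the example. Comparing timestamps from different devices like this only makes sense if their clocks are synchronized, which is why NTP is on the list of mandatory tools.

```python
from datetime import datetime, timedelta


def logs_around(lines, incident_time, window_minutes=10):
    """Keep log lines whose timestamp is within +/- window of the incident.

    Assumes each line starts with an ISO 8601 timestamp followed by a space.
    Lines whose first field does not parse as a timestamp are skipped.
    """
    window = timedelta(minutes=window_minutes)
    selected = []
    for line in lines:
        stamp, _, _rest = line.partition(" ")
        try:
            when = datetime.fromisoformat(stamp)
        except ValueError:
            continue
        if abs(when - incident_time) <= window:
            selected.append(line)
    return selected


# Invented sample lines, for illustration only.
sample_logs = [
    "2024-05-01T09:40:00 sw1 interface Gi0/1 changed state to up",
    "2024-05-01T10:02:11 sw1 interface Gi0/1 changed state to down",
    "2024-05-01T10:03:45 r1 routing neighbor 10.0.0.2 went down",
    "not a log line",
]
incident = datetime(2024, 5, 1, 10, 0, 0)

for line in logs_around(sample_logs, incident):
    print(line)
```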

The other mandatory thing for our investigation is answers to the following questions:

  • When did the problem start?
  • What is the service impact?
  • What are the symptoms?
  • Were there any operations being performed around the time the problem started?
  • And other related questions.

I'm sure that if we have all this information, the problem is half solved.

Isolating

The next step is to isolate the problem. As I see it, there are two types (or two steps) of isolation:

  • isolation by network segment. At the first moment of an issue we sometimes don't know which segment of our network is affected, so we have to find it.
  • isolation by protocol stack. The main technique is bottom-up: from the physical layer to the application layer.
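
The bottom-up technique can be sketched as an ordered list of checks that stops at the first failing layer. This is only an illustration: the lambdas below are placeholders standing in for real probes, and `first_failing_layer` is a hypothetical helper name.

```python
from typing import Callable, Optional

# Each check is a (layer name, callable) pair; the callable returns True if
# the layer looks healthy. The lambdas used below are placeholders.
Check = tuple[str, Callable[[], bool]]


def first_failing_layer(checks: list[Check]) -> Optional[str]:
    """Run checks from the lowest layer up; return the first layer that fails."""
    for layer, check in checks:
        if not check():
            return layer
    return None


checks = [
    ("physical", lambda: True),      # e.g. port is up, no FCS errors
    ("data link", lambda: True),     # e.g. MAC table is stable, LACP is up
    ("network", lambda: False),      # e.g. a route to the destination exists
    ("transport", lambda: True),     # e.g. the TCP port answers
    ("application", lambda: True),   # e.g. the service responds
]

print(first_failing_layer(checks))  # network
```

Stopping at the first failure matters because problems at a lower layer usually make the results of higher-layer checks meaningless.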

It depends on the situation, but usually this step is simple and we know exactly which node or network segment has the problem (e.g. from a monitoring system). Sometimes it's not obvious, though, and we have to investigate the different types of information collected in the second step. We should pay attention to alarms, spikes in statistics, unusual events in logs, recent configuration changes, etc. If we have enough information, half of the job is done. Nothing special, just preparation :)
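
As a rough illustration of looking for spikes in statistics, the sketch below flags samples that are far above the median of the series. The threshold factor and the sample data are arbitrary assumptions, and real monitoring systems use more careful baselines.

```python
from statistics import median


def find_spikes(samples, factor=3.0):
    """Return indexes of samples greater than `factor` times the median.

    A deliberately simple heuristic; `factor` is an arbitrary starting point.
    """
    if not samples:
        return []
    baseline = median(samples)
    return [i for i, value in enumerate(samples) if value > factor * baseline]


# Hypothetical per-minute input error counts for one interface.
errors_per_minute = [2, 3, 1, 2, 40, 38, 2, 3]

print(find_spikes(errors_per_minute))  # [4, 5]
```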

OK, we've found a network device or a couple of devices with problems, but we don't understand the root cause. What's the next step?

Isolation by protocol stack

Every protocol and stack layer has an enormous variety of problems, and I'm not going to dive deep into every protocol or layer; that's impossible. Instead, I will give a list of the most common problems with my comments. These are just examples. Real-life problems are complex and often involve interaction between different protocols as well as hardware and software issues.

L1/L2

Problem: Port down, FCS errors (and any other errors)
Possible causes:
  • cables connected to wrong ports
  • problems with media (SFP, fiber)
  • hardware problem

Problem: L2 MTU issue
Possible causes:
  • configuration
  • third-party devices in the middle

Problem: Unstable MAC or ARP table, excessive broadcast traffic
Possible causes:
  • configuration
  • L2 loop (check the configuration or state of the loop-prevention protocol)
  • faulty client device

Problem: L2 protocol control plane issues (e.g. STP, LACP, LLDP)
Possible causes:
  • configuration
  • L2 filters
  • third-party devices in the middle
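
For the MTU case above, one common way to find where large packets stop passing is to send pings with the DF (don't fragment) bit set and vary the size. The arithmetic is simple: an IPv4 header without options is 20 bytes and an ICMP header is 8 bytes, so the largest echo payload for a given MTU is the MTU minus 28. The sketch below shows this calculation and a binary search over packet sizes; `probe` is a hypothetical callable standing in for a real DF-bit ping, and the search bounds are assumptions.

```python
IPV4_HEADER = 20  # bytes, header without options
ICMP_HEADER = 8   # bytes


def ping_payload_for_mtu(mtu: int) -> int:
    """Largest ICMP echo payload that fits in one IPv4 packet of size `mtu`."""
    return mtu - IPV4_HEADER - ICMP_HEADER


def find_path_mtu(probe, low: int = 576, high: int = 9216) -> int:
    """Binary-search the largest packet size for which `probe(size)` is True.

    `probe` is a hypothetical callable that sends a packet of `size` bytes
    with the DF bit set and reports whether it got through. The search
    assumes that a packet of `low` bytes always passes.
    """
    while low < high:
        mid = (low + high + 1) // 2
        if probe(mid):
            low = mid
        else:
            high = mid - 1
    return low


print(ping_payload_for_mtu(1500))                # 1472
print(find_path_mtu(lambda size: size <= 1400))  # 1400 on this simulated path
```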


L3 

Problem: Network is unreachable (it's the basic problem all the time :))
Possible causes:
  • no route in the routing table
  • incorrect next-hop resolution
  • routing loop
  • L3 filters

Problem: Routing protocol adjacency failure
Possible causes:
  • MTU mismatch
  • interface type mismatch
  • authentication
  • router-id isn't unique
  • area/level/capability mismatch

Problem: Suboptimal routing or routing loops
Possible causes:
  • incorrect routing policy
  • manual protocol preference configuration
  • incorrect static routes
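
To illustrate the "no route in the routing table" case, here is a minimal longest-prefix-match lookup built on Python's standard `ipaddress` module. The routing table is a made-up dictionary of prefix to next hop and `lookup` is a hypothetical helper; a real router also takes protocol preference and metrics into account.

```python
import ipaddress


def lookup(routes, destination):
    """Return (prefix, next_hop) of the longest matching route, or None."""
    address = ipaddress.ip_address(destination)
    best = None
    for prefix, next_hop in routes.items():
        network = ipaddress.ip_network(prefix)
        if address in network:
            if best is None or network.prefixlen > best[0].prefixlen:
                best = (network, next_hop)
    if best is None:
        return None
    return str(best[0]), best[1]


# Hypothetical routing table: prefix -> next hop.
routes = {
    "10.0.0.0/8": "192.0.2.1",
    "10.1.0.0/16": "192.0.2.2",
    "10.1.2.0/24": "192.0.2.3",
}

print(lookup(routes, "10.1.2.7"))    # ('10.1.2.0/24', '192.0.2.3')
print(lookup(routes, "172.16.0.1"))  # None: no route, network is unreachable
```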


Resolving

OK, we've found the root cause of our problem. What's the next step? Resolving. It may be an obvious step, but I will give some advice and a short plan.

Resolving pipeline:

  • Test your action in a lab environment if possible
  • Resolve the problem (make one change at a time!)
  • Verify your changes afterwards (check the same sources as in the second step: statistics, logs, alarms, CLI outputs, etc.)
  • Document your changes
  • Finally, try to improve the company workflow. Think about how to prevent the problem in the future.
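
The verification step can be sketched as a comparison of two snapshots taken before and after a single change. The counter names and values below are invented, and `regressions` is a hypothetical helper; in practice the snapshots would come from the same statistics, alarm, and CLI sources used during information-gathering.

```python
def regressions(before, after, tolerance=0):
    """Return counters that grew by more than `tolerance` after the change."""
    worse = {}
    for name, old_value in before.items():
        # A counter missing from the second snapshot is treated as unchanged.
        new_value = after.get(name, old_value)
        if new_value - old_value > tolerance:
            worse[name] = (old_value, new_value)
    return worse


# Hypothetical snapshots taken before and after one change.
before = {"fcs_errors": 120, "input_drops": 15, "active_alarms": 2}
after = {"fcs_errors": 120, "input_drops": 40, "active_alarms": 1}

print(regressions(before, after))  # {'input_drops': (15, 40)}
```

Making one change at a time is what keeps this comparison meaningful: any regression can be attributed to that change.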

This short plan helps to resolve the problem carefully and to reduce the probability of the same problem recurring. And again, everything depends on the situation and the severity of the problem.

Escalating

It's normal when you or your team/level can't resolve an issue on your own. For this purpose, every company should have an escalation plan. If you decide to escalate the problem to the next level, you should provide complete information:

  • short description of the issue
  • first-look information (again from our sources: logs, alarms, statistics, etc.)
  • changes (if any were made)
  • your point of view and ideas
  • remote access prepared (if applicable)
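
One possible way to make sure nothing from this list is forgotten is a small template that is filled in before escalating. The sketch below is only an illustration; the `EscalationReport` class and its field names are assumptions, not a standard format.

```python
from dataclasses import dataclass, field


@dataclass
class EscalationReport:
    description: str
    first_look: list[str] = field(default_factory=list)  # logs, alarms, statistics
    changes: list[str] = field(default_factory=list)     # changes already made
    ideas: list[str] = field(default_factory=list)       # your point of view
    remote_access_ready: bool = False

    def render(self) -> str:
        """Format the report as plain text for a ticket or e-mail."""
        def section(title, items):
            body = [f"  - {item}" for item in items] or ["  - none"]
            return [f"{title}:"] + body

        lines = [f"Issue: {self.description}"]
        lines += section("First-look information", self.first_look)
        lines += section("Changes made", self.changes)
        lines += section("Ideas", self.ideas)
        ready = "yes" if self.remote_access_ready else "no"
        lines.append(f"Remote access ready: {ready}")
        return "\n".join(lines)


# Invented example values.
report = EscalationReport(
    description="Customers behind sw1 lose connectivity every few minutes",
    first_look=["port Gi0/1 flaps in syslog", "FCS errors growing on Gi0/1"],
    changes=["replaced SFP on Gi0/1"],
    ideas=["suspect damaged fiber"],
    remote_access_ready=True,
)
print(report.render())
```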

Conclusion

Network engineers deal with network issues almost every day. Some problems are simple but time-consuming. Others are very complex and require a huge amount of effort. How can we mitigate this? How can we reduce the number of issues and the time spent on investigation?

We should continuously improve all aspects of network maintenance:

  • infrastructure tools (preparing step)
  • company workflow (information-gathering and escalating steps)
  • engineers' experience and theoretical knowledge (isolating and resolving steps)

