Troubleshooting steps
Introduction
Troubleshooting network issues is a core skill for every network engineer, yet we rarely think about it as a formal process. We don't study or train this skill deliberately; we just accumulate experience from our daily routine or follow the company workflow. In this post I will try to formalize some basic notions. I hope it will be helpful.
Of course, the details depend on the situation and business constraints, but when we try to resolve an issue we should follow these steps:
Preparing -> Information gathering -> Isolating -> Resolving -> Escalating
Let's look at each step.
Preparing
Every network has infrastructure tools (monitoring, inventory, etc.), but we should continuously improve them, keep them up to date, and try to develop and integrate new ones. This stack of tools is our source of truth: with it, we can easily fetch all the information we need before, during, and after a problem. It's an enormous topic, but without these tools we can't successfully troubleshoot our network. A small example of pulling data from one of these tools follows the list below.
Mandatory tools:
- Syslog (at least a simple syslog server; even better, something like the Elastic Stack)
- Alarm management system (e.g. Zabbix)
- Statistics collector (e.g. Zabbix, Cacti, etc.)
- Inventory tool (e.g. Netbox)
- AAA system (e.g. TACACS)
- CLI log collector (many terminal tools have this feature)
- NTP (consistent timestamps make it possible to correlate events across devices)
- Topology tools and maps
- Configuration collector
- Demarcation and mirror points (it's very important to be able to mirror real traffic or measure traffic characteristics)
- Network design documentation (HLD, LLD, drafts)
- Hardware and software documentation from vendors
- Software release notes
- A database of previous issue descriptions
- Vendor support contacts
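As an illustration, here is a minimal sketch of pulling device records from Netbox with the pynetbox client. The URL, token, device name, and site are placeholders, not values from any real setup:

```python
# Minimal sketch: fetch device facts from Netbox with pynetbox
# (https://github.com/netbox-community/pynetbox). URL/token are placeholders.

import pynetbox

nb = pynetbox.api("https://netbox.example.com", token="YOUR_API_TOKEN")

# During an incident, quickly pull the record for a suspected device
device = nb.dcim.devices.get(name="core-sw-01")   # hypothetical device name
if device:
    print(device.name, device.device_type.model, device.site.name)

# ...or list everything at one site to scope the affected segment
for dev in nb.dcim.devices.filter(site="dc1"):    # hypothetical site slug
    print(dev.name, dev.status)
```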
Information gathering
What information should we fetch from our tools when an issue occurs? A short collection sketch follows the list.
- Logs
- Alarms
- Statistics
- Special tech-support files from network nodes (this depends on the vendor's equipment)
- Various show commands and debug output (of course, we can't collect every output; this comes into play once we start isolating the problem, as discussed below)
- Traffic dumps (if possible and applicable)
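One common way to grab a consistent set of CLI outputs during an incident is to script the collection. Below is a minimal sketch using the Netmiko library; the device parameters and command list are placeholders to adapt to your platform:

```python
# Minimal sketch: collect a fixed set of show commands with Netmiko
# (https://github.com/ktbyers/netmiko). Credentials/commands are placeholders.

from datetime import datetime
from netmiko import ConnectHandler

device = {
    "device_type": "cisco_ios",   # hypothetical platform
    "host": "10.0.0.1",
    "username": "admin",
    "password": "secret",
}

commands = ["show logging", "show interfaces", "show ip route summary"]

with ConnectHandler(**device) as conn:
    stamp = datetime.now().strftime("%Y%m%d-%H%M%S")
    with open(f"{device['host']}-{stamp}.txt", "w") as f:
        for cmd in commands:
            f.write(f"### {cmd}\n")
            f.write(conn.send_command(cmd) + "\n\n")
```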
The other mandatory input for our investigation is answers to the following questions:
- When did the problem start?
- What is the service impact?
- What are the symptoms?
- Were there any operations being performed around the time the problem started?
- And other related questions.
I'm sure that once we have all this information, the problem is half resolved.
Isolating
The next step is to isolate the problem. There are two types, or two stages, of isolation:
- Isolation by network segment. At the first moment of an issue, we often don't know which segment of the network is affected, and we have to find that place.
- Isolation by protocol stack. The main technique is bottom-up: from the physical layer to the application layer (a minimal sketch follows this list).
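Here is a minimal sketch of the bottom-up idea, assuming a Linux host; the target hostname and port are hypothetical. Real checks would run from and against the suspected devices, but the ordering is the point: stop at the lowest failing layer.

```python
# Bottom-up isolation sketch: link state (L1/L2), IP reachability (L3),
# TCP connectivity (L4). Assumes Linux; target host/port are placeholders.

import socket
import subprocess

TARGET = "app.example.net"   # hypothetical service under investigation
PORT = 443

def link_up(iface: str = "eth0") -> bool:
    """L1/L2: is the local interface operationally up?"""
    state = open(f"/sys/class/net/{iface}/operstate").read().strip()
    return state == "up"

def ip_reachable(host: str) -> bool:
    """L3: does ICMP echo get through?"""
    return subprocess.run(["ping", "-c", "3", "-W", "2", host],
                          capture_output=True).returncode == 0

def tcp_open(host: str, port: int) -> bool:
    """L4: does a TCP handshake complete?"""
    try:
        with socket.create_connection((host, port), timeout=3):
            return True
    except OSError:
        return False

for name, check in [("link", link_up),
                    ("ip", lambda: ip_reachable(TARGET)),
                    ("tcp", lambda: tcp_open(TARGET, PORT))]:
    ok = check()
    print(f"{name:>4}: {'OK' if ok else 'FAIL'}")
    if not ok:
        break   # bottom-up: stop at the lowest failing layer
```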
It depends on the situation, but usually this step is simple and we already know which node or network segment has the problem (e.g. from the monitoring system). Sometimes, though, it's not obvious, and we have to investigate the different kinds of information collected in the second step, paying attention to alarms, spikes in statistics, unusual events in logs, recent configuration changes, etc. If we have enough information, half of the job is done. Nothing special - just preparation :)
OK, we've found a network device or a couple of devices with problems, but we don't yet understand the root cause. What's the next step?
Isolation by protocol stack
Every protocol and layer of the stack has an enormous variety of problems, and I'm not going to deep-dive into each one - that's impossible. Instead, I'll list some of the most common and frequent problems with my comments. These are just examples; real-life problems are complex and often involve interaction between different protocols, hardware, and software issues.
L1/L2
- Cable and optics faults (CRC/input errors, link flaps)
- Duplex or speed mismatch
- VLAN misconfiguration (missing or mismatched VLANs on trunks)
- Spanning-tree problems (loops, constant topology changes)
- MAC address flapping
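Many of these show up first in the interface counters. As a minimal sketch (assuming a Cisco-IOS-style "show interfaces" capture; adjust the patterns for your vendor), here is one way to flag interfaces with CRC errors:

```python
# Flag interfaces with non-zero CRC errors in a "show interfaces" capture.
# The output format below is assumed (Cisco-IOS-style).

import re

def crc_errors(show_interfaces: str) -> dict[str, int]:
    """Return {interface: CRC error count} parsed from CLI output."""
    errors = {}
    current = None
    for line in show_interfaces.splitlines():
        m = re.match(r"^(\S+) is (up|down|administratively down)", line)
        if m:
            current = m.group(1)
        m = re.search(r"(\d+) input errors, (\d+) CRC", line)
        if m and current:
            errors[current] = int(m.group(2))
    return errors

sample = """\
GigabitEthernet0/1 is up, line protocol is up
     123 input errors, 118 CRC, 0 frame, 0 overrun, 0 ignored
GigabitEthernet0/2 is up, line protocol is up
     0 input errors, 0 CRC, 0 frame, 0 overrun, 0 ignored
"""

for iface, crc in crc_errors(sample).items():
    if crc:
        print(f"{iface}: {crc} CRC errors - check cabling/optics")
```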
L3
- Routing protocol neighbor flaps (OSPF adjacencies, BGP sessions)
- Missing or incorrect routes (redistribution or filtering mistakes)
- MTU mismatch
- Asymmetric routing
- ACLs or firewall rules dropping traffic
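A similar quick check works at L3. A minimal sketch, assuming a Cisco-IOS-style "show ip bgp summary" capture, where the last column is a prefix count only for Established sessions:

```python
# Spot down BGP neighbors in a "show ip bgp summary" capture (format assumed:
# the State/PfxRcd column is numeric only when the session is Established).

sample = """\
Neighbor        V    AS MsgRcvd MsgSent   TblVer  InQ OutQ Up/Down  State/PfxRcd
10.0.0.2        4 65001   12345   12340       10    0    0 5d02h         120
10.0.0.6        4 65002     321     318       10    0    0 00:02:11  Active
"""

for line in sample.splitlines():
    parts = line.split()
    if parts and parts[0][0].isdigit():   # neighbor rows start with an IP
        neighbor, state = parts[0], parts[-1]
        if not state.isdigit():           # non-numeric => not Established
            print(f"BGP neighbor {neighbor} is down (state {state})")
```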
Resolving
OK, we've found the root cause of our problem. What's the next step? Resolving it. This step may seem obvious, but I'll give some advice and comments about it, and provide a short plan.
Resolving pipeline:
- Test your change in a lab environment if possible
- Resolve the problem (make one change at a time!)
- Post-verify your changes (check the same sources as in the second step: statistics, logs, alarms, CLI outputs, etc.; a verification sketch follows this list)
- Document your changes
- Finally, try to improve the company workflow and think about how to prevent the problem in the future
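One simple way to post-verify is to snapshot the same outputs before and after the change and diff them, so any unintended side effect stands out. A minimal sketch; the file names are hypothetical:

```python
# Diff pre/post-change snapshots of the same CLI output (file names are
# placeholders for captures taken before and after the change).

import difflib
from pathlib import Path

def diff_outputs(before_file: str, after_file: str) -> str:
    before = Path(before_file).read_text().splitlines()
    after = Path(after_file).read_text().splitlines()
    return "\n".join(difflib.unified_diff(
        before, after, fromfile=before_file, tofile=after_file, lineterm=""))

# e.g. snapshots of "show ip route summary" taken before/after the change
print(diff_outputs("pre_change_routes.txt", "post_change_routes.txt"))
```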
This short plan helps us resolve the problem carefully and decreases the probability of the same problem recurring. And again - everything depends on the situation and the problem's severity.
Escalating
It's a normal situation when you can't resolve an issue yourself or within your team/level, and for this purpose every company should have an escalation plan. If you decide to escalate the problem to the next level, you should provide the full set of information (a small template sketch follows this list):
- Short description of the issue
- First-look information (again, our sources: logs, alarms, statistics, etc.)
- Changes (if any were made)
- Your point of view and ideas
- Remote access ready to hand over (if applicable)
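To keep escalations consistent, it can help to fill a fixed structure. A minimal sketch; the field names and sample values are my own, purely illustrative:

```python
# A sketch of an escalation summary template (field names are not a standard;
# all sample values are invented for illustration).

from dataclasses import dataclass, field

@dataclass
class Escalation:
    description: str                 # short description of the issue
    started_at: str                  # when the problem started
    impact: str                      # service impact
    first_look: list[str] = field(default_factory=list)  # logs/alarms/stats
    changes: list[str] = field(default_factory=list)     # changes made so far
    hypothesis: str = ""             # your point of view and ideas

ticket = Escalation(
    description="Intermittent packet loss between DC1 and DC2",
    started_at="2024-05-14 09:30 UTC",
    impact="Elevated latency for replication traffic",
    first_look=["syslog: link flaps on core1 Te0/1", "Zabbix: CRC spike"],
    changes=["Replaced optic on core1 Te0/1 (no effect)"],
    hypothesis="Possible dirty fiber on the DC1-DC2 span",
)
print(ticket)
```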
Conclusion
Network engineers deal with network issues almost every day. Some problems are simple but still take time away; others are very complex and require a huge amount of effort. How can we mitigate this? How can we reduce both the number of issues and the time spent investigating them?
We should continuously improve all aspects of network maintenance:
- Infrastructure tools (the preparing step)
- Company workflow (the information-gathering and escalating steps)
- Engineers' experience and theoretical knowledge (the isolating and resolving steps)