Junos Monitoring and Troubleshooting
DAY ONE:
JUNOS MONITORING AND TROUBLESHOOTING
This Day One book advocates a process for monitoring and troubleshooting your network. The goal is to give you an idea of what to look for before ever typing a show command, so by the book's end, you should know not only what to look for, but where to look.
Day One: Junos Monitoring and Troubleshooting shows you how to identify the root causes
of a variety of problems and advocates a common approach to isolate the problems with
a best practice set of questions and tests. Moreover, it includes the instrumentation to
assist in root cause identification and the configuration know-how to solve both common and severe problems before they ever begin.
This Day One book for configuring SRX Series with J-Web makes configuring, troubleshooting,
and maintaining the SRX Series devices a breeze for any user who is new to the wonderful world
of Junos, or who just likes to use its GUI interface rather than the CLI.
Alpana Nangpal, Security Engineer, Bravo Health
IT'S DAY ONE AND YOU HAVE A JOB TO DO, SO LEARN HOW TO:
- Anticipate the causes and locations of network problems before ever logging in
  to a device.
- Develop a standard monitoring and troubleshooting template providing technicians
  and monitoring systems with all they need to operate your network.
- Utilize the OSI model for quick and effective troubleshooting across different
  protocols and technologies.
- Use the power of Junos to monitor device and network health and reduce network
  downtime.
- Develop your own test for checking the suitability of a network fix.
Juniper Networks Day One books provide just the information you need to know on day one. That's
because they are written by subject matter experts who specialize in getting networks up and
running. Visit www.juniper.net/dayone to peruse the complete library.
Published by Juniper Networks Books
ISBN 978-1-936779-04-8
Junos Fundamentals
A solid understanding of the topology, traffic flows, and protocols
used on your network.
A familiarity with the Junos CLI.
Experience with network monitoring protocols such as syslog and
SNMP.
An understanding of Network Management Systems, what they
do, and how they do it.
Awareness of the OSI model and how it applies to network
protocols and elements.
Anticipate the causes and locations of network problems before
ever logging in to a device.
Develop a standard monitoring and troubleshooting template
Utilize the OSI model for quick and effective troubleshooting
across different protocols and technologies.
Use the power of Junos to monitor device and network health and
reduce network downtime.
Develop your own test for checking the suitability of a network
fix.
NOTE We'd like to hear your comments and critiques. Please send us your
About Junos
Junos is a reliable, high-performance network operating system for
routing, switching, and security. It reduces the time necessary to deploy
new services and decreases network operation costs by up to 41%.
Junos offers secure programming interfaces and the Junos SDK for
developing applications that can unlock more value from the network.
One operating system: Reduces time and effort to plan, deploy, and
operate network infrastructure.
www.oreilly.com.
Chapter 1
Root Cause Identification
The Fix Test
This Book's Network
Summary
[Figures: four troubleshooting decision trees, one per Problem Scope: a single user to a single destination, a single user to all destinations, all users to a single destination, and all users to all destinations. Each tree maps Traffic Type (all protocols or some protocols) and Problem Consistency (constant or sporadic) to a Troubleshooting Focus, such as the source user, the destination, the service provider, a firewall, circuit or route oscillation, an aggregation point problem, a circuit down, or a network-wide outage.]
TIP These figures help you to identify the scope and possible cause of a
network outage. By using this information in conjunction with the Fix
Test that follows, you should be able to more quickly isolate problems
and restore service to your customers.
The scope of an outage can mean many things to many people. Some
people may declare it's an apocalypse if their primary account is occasionally
slow to download a single website, while others might raise an
alarm only when their entire network is down. What you should look for
with this question is an objective view of the problem, absent emotion.
This is the most important initial aspect of an outage to understand.
How Many Distinct Source Networks are Affected?
What Destinations are Involved?
You should then be able to determine if the problem is at the source, the
destination, or is something larger. If a single user (or network) is
reporting problems reaching everything, you should focus your efforts on
the network elements close to the source. If many people are reporting
problems reaching a single destination (or network), the problem is likely close
to the destination, or is potentially the result of a problem at a network
interconnect such as a peering point. If you can't isolate the
problem to either the sources or the destination, the event is probably
network-wide and it's probably time to hit the emergency button.
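The source and destination reasoning above can be written down as a small decision helper. The following Python sketch is only an illustration: the function name, the return strings, and the handling of the one-source, one-destination case are assumptions, not something this book defines.

```python
def troubleshooting_focus(source_count: int, destination_count: int) -> str:
    """Suggest where to start looking, given how many distinct source and
    destination networks are affected (hypothetical helper)."""
    if source_count < 1 or destination_count < 1:
        raise ValueError("need at least one affected source and destination")
    single_source = source_count == 1
    single_destination = destination_count == 1
    if single_source and single_destination:
        # Assumption: the text does not cover this case, so report both ends.
        return "source or destination"
    if single_source:
        # One user or network has trouble reaching everything.
        return "source"
    if single_destination:
        # Many users have trouble reaching one place.
        return "destination or interconnect"
    # Neither end can be isolated.
    return "network-wide"


print(troubleshooting_focus(source_count=1, destination_count=12))
print(troubleshooting_focus(source_count=40, destination_count=1))
```

Feeding the counts from your first few trouble reports into a helper like this keeps the initial triage objective, which is the point of the scope question.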
Who Reported the Problem First?
Answering this question can help you understand at which OSI model
network layer the problem is happening. Total loss of connectivity
usually indicates the problems are at Layer 3 or perhaps a circuit is
down. Layer 2 problems are generally protocol agnostic, but rarely
cause a complete outage. Upper layer (Layers 4, 5, 6, and 7) problems
are often caused by firewall issues.
The answer to this question should allow you to identify not only the
area where you should focus your effort, but also the device type. Layer
2 problems typically mean you should focus on Ethernet switches or
Layer 2 errors on the routers and end-host ports. If it's a Layer 3
problem, you need to check the routers and the IP stack on the end hosts.
Is the Problem Constant or Sporadic?
[Figure: This book's network topology. The diagram shows the Chicago, IL and Boston, MA sites, Internet connections labeled Provider 1 and Provider 2, and an IP VPN cloud. Devices shown include M10i, M7i, and MX240 routers, SRX240 firewalls, J4300 routers, and EX3200 and EX4200 switches (a TOR switch and POD Switches #1 through #4). Link labels include DS3, OC12, T1, GigE, FastE, and 100m L2 VPN, IPSec VPN, GRE, and IP VPN connections. Access ports are grouped by department: Sales, HR, Support, Engineering, and Payroll.]
Physical Design
The physical design depicted in this topology represents a fairly simple
enterprise network supporting two main sites (Chicago and Boston),
two remote sites (Peoria, IL and Reykjavik, Iceland), and a small
datacenter in the Chicago site. The main goals of this physical design
were to provide all possible redundancy, while allowing for scaling.
The core of the network features two M10i routers, four M7i routers,
and four MX240 Ethernet aggregation routers. The M10i
routers provide primary connectivity between the two main sites and
terminate WAN connections to the remote offices over a variety of
technologies. In each main site, one M7i provides connectivity to the
Internet, which will later be protected by a pair of SRX firewalls (right
now the SRX devices are not doing any filtering), and the other M7i provides
redundant connectivity to the opposite main site. Finally, the MX240
routers aggregate the closet EX Series Ethernet switches in each main
site and serve as a Layer 3 boundary between the datacenter and the
rest of the network. In the remote sites, J Series routers serve as the
gateway routers and an aggregation point for the EX Series switches.
Additionally, an acquired company has been connected to the core M7i
routers in a method similar to that used for the remote offices. This site
has a slightly different architecture. This design provides redundancy
(at the chassis level) and allows for scaling, as the modular design
and chassis selections allow for increased bandwidth and additional
edge aggregation devices without the need for expensive hardware
replacement within the core.
Logical Design
Like the physical design, the logical design was built to provide
redundancy while allowing for fast convergence and an easy path for
future deployments such as MPLS, multicast, and IPv6.
Satellite Sites
You may have also noticed that the remote offices are connected not
only by traditional WAN circuit technologies, but also by logical
connections providing pseudo-wires using IPSec and Layer 2 VPN
technologies. From the perspective of the rest of the network, these
connections are the same as any other physical media, but since these
are logical connections, there is an impact on monitoring and troubleshooting.
IGP
The main IGP in this design will be OSPF. OSPF runs in a single area
(area 0) on the MX and M series routers. The J Series routers in the
remote offices also run OSPF, but the two OSPF domains are separate
and the acquired site has historically run RIP. OSPF is used because of
its relative ease of configuration, its convergence characteristics, its
support of MPLS and its familiarity for our operations teams. IS-IS
would work equally well. The decision to choose OSPF or IS-IS often
comes down to comfort level and experience with the protocol. When
all other factors are equal, familiarity is a perfectly valid basis for
choosing a protocol.
BGP
BGP supports various functions in this network. As the network is
multi-homed to the Internet, external BGP (eBGP) is run with the
service providers. For this case, AS-path prepending and local-preference
are used to influence routing decisions such that the Chicago
Internet connection is preferred over the Boston connection. Internally,
all M, MX, and J Series routers run internal BGP (iBGP) in a
full mesh. The remote offices and the acquired company redistribute
their local IGP routes into BGP and BGP into their local IGPs. eBGP is
also used with a third service provider, which provides an MPLS IP
VPN service for redundant connectivity to the Iceland site.
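A full iBGP mesh grows quickly, which matters when you decide what to monitor. The short Python sketch below counts the sessions in a full mesh. It assumes only the ten core routers listed in the physical design (two M10i, four M7i, four MX240); the number of J Series routers is not stated here, so they are left out, and the function name is purely illustrative.

```python
def ibgp_full_mesh_sessions(router_count: int) -> int:
    """Number of iBGP sessions when every router peers with every other router."""
    if router_count < 0:
        raise ValueError("router_count cannot be negative")
    return router_count * (router_count - 1) // 2


# Assumption: only the core routers from the physical design are counted.
core_routers = 2 + 4 + 4  # M10i + M7i + MX240
print(ibgp_full_mesh_sessions(core_routers))
```

Each J Series router added to the mesh adds one more session per existing router, so the session count is worth recalculating whenever a site is added.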
Summary
The goal of creating a monitoring and troubleshooting process is to
give you an idea of what to look for before ever typing a show command. It should give you a head start not only in where to look, but in
what to look for. It also allows you to preemptively contact additional
personnel and, if necessary, event management groups to begin any
triage and notification protocols. The additional personnel should
speed resolution of the problem.
TIP Front-line support can better handle issues when they are aware of
them before the phone rings, and their feedback to operations and
engineering can assist in isolating and resolving the problem.
Chapter 2
Putting the Fix Test to Work
Traffic Engineering and Overutilization (abbr.)
The Fix Test
Summary
Once you have identified the root cause of a problem (Chapter 1), the
next step is to resolve it. Two types of fixes are discussed in this
chapter: short-term fixes and long-term fixes.
These are intentionally generic terms, but it will be demonstrated that
short-term fixes, and yes, at times, even hacks, are acceptable resolutions
as long as they meet our book's key Fix Test criteria:
- The fix does not cause other problems.
- The fix survives a reboot.
- The fix is well communicated.
- The fix is operationally understandable.
- And, the fix is replaced with a long-term fix in a reasonable
  amount of time.
Assuming these requirements are met, a short-term fix is completely
acceptable. The main goal of any fix (short-term or long-term) should
always be the quick restoration of services.
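If you track fixes in a ticketing or change system, the five Fix Test criteria above can be recorded as a simple checklist. The Python sketch below is one possible shape for such a record; the class and field names are assumptions made for illustration and do not come from any Juniper tool.

```python
from dataclasses import dataclass, fields


@dataclass
class FixTest:
    """One yes/no answer per Fix Test criterion (hypothetical record)."""
    causes_no_other_problems: bool
    survives_reboot: bool
    well_communicated: bool
    operationally_understandable: bool
    long_term_fix_planned: bool

    def failed_criteria(self) -> list[str]:
        # Names of the criteria that were answered "no".
        return [f.name for f in fields(self) if not getattr(self, f.name)]

    def passes(self) -> bool:
        return not self.failed_criteria()


short_term_fix = FixTest(True, True, True, True, long_term_fix_planned=False)
print(short_term_fix.passes(), short_term_fix.failed_criteria())
```

Listing the failed criteria, rather than returning a single pass or fail, tells the next engineer exactly what still has to happen before the short-term fix is acceptable.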
Looking at the bolded output above, you can see that the outbound
connection to the provider is near line rate. The CLI output is a
snapshot, so running this command several times is recommended to
get a better understanding of the true traffic situation. And if you use
your network management system to look at Boston, you would see
something akin to Figure 2-2.
Using the Junos CLI, the show interfaces command will display the
input and output rates for the Boston connection. While a network
management system is the right tool to actively monitor the network
and alert on errors, nothing can replace the CLI for immediate, specific
information gathering. The combination of these two tool-sets
provides for the quickest isolation and remediation of network issues.
ps@boston-edge-1> show interfaces so-3/3/2
Physical interface: so-3/3/2, Enabled, Physical link is Up
  Interface index: 167, SNMP ifIndex: 151
  Description: Connection to isp-2
  Link-level type: PPP, MTU: 4474, Clocking: Internal, SONET mode, Speed: OC12,
  Loopback: None, FCS: 16, Payload scrambler: Enabled
  Device flags   : Present Running
  Interface flags: Point-To-Point SNMP-Traps Internal: 0x4000
  Link flags     : Keepalives
  Keepalive settings: Interval 10 seconds, Up-count 1, Down-count 3
  Keepalive: Input: 0 (never), Output: 0 (never)
  LCP state: Down
  NCP state: inet: Not-configured, inet6: Not-configured, iso: Not-configured, mpls: Not-configured
  CHAP state: Closed
  PAP state: Closed
  CoS queues     : 8 supported, 4 maximum usable queues
  Last flapped   : 2009-10-14 07:03:58 PDT (2d 03:37 ago)
  Input rate     : 1207 bps (6 pps)
  Output rate    : 2943 bps (17 pps)
  SONET alarms   : None
  SONET defects  : None
Looking at the same bit rates inbound and outbound (bolded above)
on the Boston provider circuit, you see that it is nearly empty.
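Because the CLI output is a snapshot, it helps to capture the rate lines in a form you can compare across several runs. The Python sketch below pulls the input and output rates out of saved show interfaces text. It assumes each rate sits on a single line in the form "Input rate : N bps (M pps)"; the function name and the sample text are illustrative.

```python
import re

# Matches lines such as "  Input rate     : 1207 bps (6 pps)".
RATE_LINE = re.compile(
    r"^\s*(Input|Output) rate\s*:\s*(\d+) bps \((\d+) pps\)", re.MULTILINE
)


def parse_rates(show_interfaces_text: str) -> dict[str, dict[str, int]]:
    """Return {'input': {'bps': ..., 'pps': ...}, 'output': {...}}."""
    rates = {}
    for direction, bps, pps in RATE_LINE.findall(show_interfaces_text):
        rates[direction.lower()] = {"bps": int(bps), "pps": int(pps)}
    return rates


sample = """\
  Last flapped   : 2009-10-14 07:03:58 PDT (2d 03:37 ago)
  Input rate     : 1207 bps (6 pps)
  Output rate    : 2943 bps (17 pps)
  SONET alarms   : None
"""
print(parse_rates(sample))
```

Running the command a few times and comparing the parsed values gives a rough trend without waiting for the next polling cycle of your network management system.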
You could (and should) request an upgrade of your peering capacity
with your primary provider, but that can take weeks. You need a
short-term solution to this overutilization problem. Since there is
another egress point in Boston to your secondary service provider, you
could change your routing policy, forcing some outbound traffic to use
the backup link and alleviating the overutilization of the Chicago circuit.
Use the show bgp summary command on your Boston edge router to see
that you are not currently selecting any routes from your Boston
service provider.
ps@boston-edge-1> show bgp summary
Groups: 2 Peers: 3 Down peers: 0
Table          Tot Paths  Act Paths Suppressed    History Damp State    Pending
inet.0             13986       6993          0          0          0          0
Peer               AS      InPkt     OutPkt    OutQ   Flaps Last Up/Dwn State|#Active/Received/Damped...
10.25.30.1         10       7013       7112       0       0       13:07 6993/6993/0          0/0/0
10.25.30.3         10         28       7111       0       0       12:41 0/0/0                0/0/0
18.32.16.102      107       7006       6277       0       0        8:10 0/6993/0             0/0/0
The output of show bgp summary shows that AS 107 is sending 6993
routes, but none of these routes are active on boston-edge-1. This is
shown in the field that is currently displaying 0/6993/0. The first
value is active routes, the second is received routes and the last shows
the number of dampened routes. To review the policy applied to this
BGP session, use the show configuration protocols bgp group
[group-name] command:
ps@boston-edge-1> show configuration protocols bgp group ebgp-as-107
type external;
import as-107-in-secondary;
export aggregate-out;
peer-as 107;
neighbor 18.32.16.102;
ps@boston-edge-1> show configuration policy-options policy-statement as-107-in-secondary
term localpref-50 {
    then {
        local-preference 50;
    }
}
When you review your peer configuration and applied import policy,
you can see why you're having problems. The local-preference on all
routes from your Boston service provider is set to 50, causing the
Chicago service provider to act as the preferred egress point for all
traffic destined for the Internet. Local-preference is the most significant
criterion in the BGP route selection process. The higher the value, the
more preferred the route is. The default value is 100, so a setting of 50
makes any route learned from this BGP peer less preferred than
the same route learned from the primary provider, whose routes are
getting the default local-preference.
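To see why a local-preference of 50 loses even when the AS path is shorter, it can help to model the comparison. The Python sketch below is a deliberately simplified model that compares only local-preference and then AS-path length; the real selection process has more steps, and the class and function names are assumptions for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BgpPath:
    learned_from: str
    as_path: tuple[int, ...]
    local_pref: int = 100  # default local-preference


def preferred(paths: list[BgpPath]) -> BgpPath:
    # Higher local-preference wins; a shorter AS path only breaks ties.
    return max(paths, key=lambda p: (p.local_pref, -len(p.as_path)))


via_chicago = BgpPath("10.25.30.1", as_path=(44, 107))  # default of 100
via_boston = BgpPath("18.32.16.102", as_path=(107,), local_pref=50)

print(preferred([via_chicago, via_boston]).learned_from)
```

In this model the Chicago path wins despite its longer AS path, and raising the Boston path's local-preference above 100 is enough to reverse the outcome.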
Let's confirm. Using the show
tion for Boston:
This output confirms our hypothesis. The asterisk in the output indicates
the selected route, and you can see that the selected route is the
route learned through our Chicago peering point, which can be quickly
identified because the first hop in the AS path is 44, the service provider
in Chicago. It's selected because of the value configured for the
local-preference. To allow some traffic to prefer the Boston egress point, you
need to update your policy to match on some routes and set them to a
higher local-preference than Chicago. There are no per-prefix traffic
statistics, so you should modify your policy, check your interface
statistics, and then repeat the cycle, tweaking the policy until you are happy
with the traffic levels.
Aim for reducing the egress utilization in Chicago to 70%, which
should mean a drop of 300 to 400 megabits/second. An easy way to
begin is to configure a local-preference for the Boston service provider
for routes that originate from their AS or from their customers.
First, you know your provider sets a BGP community of 107:100 on all
customer routes because they document this in their peering policies, which
are posted on their website (a common method for network operators
to distribute this information), so use this information to develop the
policy. Next, let's add a new term that matches on this community
and sets the local-preference to 120. That should force the Boston
peering point to become preferred for those routes. Let's use
8.32.80.0/23 as an example to see if your change had the desired effect:
ps@boston-edge-1> show route 8.32.80.0/23 detail
inet.0: 7013 destinations, 14004 routes (7013 active, 0 holddown, 0 hidden)
8.32.80.0/23 (2 entries, 1 announced)
        *BGP    Preference: 170/-101
                Next hop type: Indirect
                Next-hop reference count: 20970
                Source: 10.25.30.1
                Next hop type: Router, Next hop index: 488
                Next hop: 192.168.14.1 via ge-0/0/1.0, selected
                Protocol next hop: 10.25.30.1
                Indirect next hop: 8e04000 131070
                State: <Active Int Ext>
                Local AS:    10 Peer AS:    10
                Age: 1  Metric2: 1
                Task: BGP_10.10.25.30.1+179
                Announcement bits (2): 0-KRT 4-Resolve tree 1
                AS path: 44 107 I
                Communities: 107:100
                Localpref: 100
         BGP
Before the change, we prefer the Chicago exit point and we see that this is
due to local-preference.
To make the change, create a community named as-107-customers,
which includes 107:100, and use that community in a newly inserted
term. Our final configuration appears as follows:
ps@boston-edge-1> show configuration policy-options
policy-statement as-107-in-secondary {
    term localpref-50 {
        then {
            local-preference 50;
        }
    }
    term localpref-120 {
        from community as-107-customers;
        then {
            local-preference 120;
        }
    }
}
community as-107-customers members 107:100;
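The order of the two terms matters. Neither term has a terminating action such as accept or reject, so evaluation continues from the first term to the second, and the last matching term decides the local-preference. The Python sketch below models only that behavior for this one policy; it is a simplified illustration, not a Junos policy engine, and all of its names are assumptions.

```python
from typing import Callable, NamedTuple


class Term(NamedTuple):
    name: str
    matches: Callable[[frozenset], bool]
    local_preference: int


AS_107_CUSTOMERS = "107:100"  # member of community as-107-customers

AS_107_IN_SECONDARY = [
    # No "from" clause, so this term matches every route.
    Term("localpref-50", lambda communities: True, 50),
    # Matches only routes tagged with the provider's customer community.
    Term("localpref-120", lambda communities: AS_107_CUSTOMERS in communities, 120),
]


def resulting_local_pref(communities, terms=AS_107_IN_SECONDARY, default=100):
    local_pref = default
    for term in terms:
        if term.matches(frozenset(communities)):
            # No terminating action: record the value and keep evaluating.
            local_pref = term.local_preference
    return local_pref


print(resulting_local_pref({"107:100"}), resulting_local_pref(set()))
```

If the catch-all term came last instead of first, it would overwrite the higher value and every route would end up at 50 again, so it is worth checking term order whenever you insert a new term.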
To confirm the change has had the desired effect, once again issue a show
route command on the example prefix, 8.32.80.0/23:
ps@boston-edge-1> show route 8.32.80.0/23 detail
inet.0: 7013 destinations, 13008 routes (7013 active, 0 holddown, 0 hidden)
8.32.80.0/23 (1 entry, 1 announced)
        *BGP    Preference: 170/-121
                Next hop type: Router, Next hop index: 486
                Next-hop reference count: 8991
                Source: 18.32.16.102
                Next hop: 18.32.16.102 via so-3/3/2.0, selected
                State: <Active Ext>
                Local AS:    10 Peer AS:   107
                Age: 35:27
                Task: BGP_107.18.32.16.102+59002
                Announcement bits (3): 0-KRT 3-BGP RT Background 4-Resolve tree 1
                AS path: 107 I
                Communities: 107:100
                Localpref: 120
                Router ID: 172.17.0.3
You can also issue the show bgp summary command to verify that the
Boston router is now selecting AS 107 for some prefixes. Before, this
command showed 0/6993/0, but now you see that 1256 routes are
active from this peer (and 1256 fewer routes are active from the Chicago
peer).
ps@boston-edge-1> show bgp summary
Groups: 2 Peers: 3 Down peers: 0
Table          Tot Paths  Act Paths Suppressed    History Damp State    Pending
inet.0             13986       6993          0          0          0          0
Peer               AS      InPkt     OutPkt    OutQ   Flaps Last Up/Dwn State|#Active/Received/Damped...
10.25.30.1         10       7013       7112       0       0       13:07 5737/6993/0          0/0/0
10.25.30.3         10         28       7111       0       0       12:41 0/0/0                0/0/0
18.32.16.102      107       7006       6277       0       0        8:10 1256/6993/0          0/0/0
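The Active/Received/Damped field is easy to compare before and after a policy change once it is split into numbers. The Python sketch below does that for the values quoted in this chapter; the function name is illustrative and the input is assumed to be exactly three slash-separated integers.

```python
def parse_route_counts(field: str) -> dict[str, int]:
    """Split a field such as '1256/6993/0' into active, received, and damped counts."""
    active, received, damped = (int(part) for part in field.split("/"))
    return {"active": active, "received": received, "damped": damped}


before = parse_route_counts("0/6993/0")
after = parse_route_counts("1256/6993/0")
moved = after["active"] - before["active"]
print(f"{moved} routes now prefer AS 107")
```

The same subtraction on the Chicago peer's field (6993 active before, 5737 after) should give the matching decrease, which is a quick cross-check that no routes went missing.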
                Localpref: 100
                Router ID: 10.25.30.1
         BGP    Preference: 170/-51
                Next hop type: Router, Next hop index: 486
                Next-hop reference count: 8991
                Source: 18.32.16.102
                Next hop: 18.32.16.102 via so-3/3/2.0, selected
                State: <Ext>
                Inactive reason: Local Preference
                Local AS:    10 Peer AS:   107
                Age: 1:20:22
                Task: BGP_107.18.32.16.102+59002
                AS path: 107 44 I
                Localpref: 50
                Router ID: 172.17.0.3
If you open a second CLI session to the router, you could use the monitor
interface so-3/3/2 command to monitor real-time traffic on boston-edge-1:
Time: 16:11:23
Delay: 0/0/17
These commands show that the Boston edge router is now sending
approximately 300 megabits per second over its SONET link to the service
provider. Running a similar command on the Chicago edge router,
you can see the drop in traffic. Let's check the Gigabit Ethernet
interface one last time.
ps@chicago-edge-1> show interfaces ge-0/0/9
Physical interface: ge-0/0/9, Enabled, Physical link is Up
  Interface index: 137, SNMP ifIndex: 118
  Description: Connection to isp-1
  Link-level type: Ethernet, MTU: 1514, Speed: 1000mbps, MAC-REWRITE Error: None,
  Loopback: Disabled, Source filtering: Disabled, Flow control: Enabled,
  Auto-negotiation: Enabled, Remote fault: Online
  Device flags   : Present Running
  Interface flags: SNMP-Traps Internal: 0x4000
  Link flags     : None
  CoS queues     : 8 supported, 8 maximum usable queues
  Current address: 00:19:e2:25:b0:09, Hardware address: 00:19:e2:25:b0:09
  Last flapped   : 2009-10-13 13:51:19 PDT (2d 20:44 ago)
  Input rate     : 3172844311 bps (10682977 pps)
  Output rate    : 6739379368 bps (23159379 pps)
  Active alarms  : None
  Active defects : None
The bolded line shows that the output traffic levels on the Chicago
provider connection have dropped to ~670 megabits, which meets the
goal of reducing outbound traffic to 70% utilization.
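The utilization check described above is simple arithmetic, and it can be handy to script it when you compare many links. The following is a minimal Python sketch, not part of the original procedure. The function names are hypothetical, and the figures (roughly 670 Mbps on a 1000 Mbps link, with a 70% goal) are taken from the text.

```python
def utilization_percent(rate_bps: float, link_speed_bps: float) -> float:
    """Return the load on a link as a percentage of its line rate."""
    if link_speed_bps <= 0:
        raise ValueError("link speed must be positive")
    return 100.0 * rate_bps / link_speed_bps


def meets_target(rate_bps: float, link_speed_bps: float, target_percent: float) -> bool:
    """True when the measured rate is at or below the utilization goal."""
    return utilization_percent(rate_bps, link_speed_bps) <= target_percent


if __name__ == "__main__":
    # ~670 Mbps outbound on a Gigabit Ethernet link, checked against a 70% goal.
    rate = 670e6
    speed = 1000e6
    print(f"{utilization_percent(rate, speed):.1f}% utilized")
    print("goal met" if meets_target(rate, speed, 70.0) else "goal missed")
```

Feeding the function the rates your NMS collects at peak times gives you a quick pass or fail against whatever utilization goal you set.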
None that are apparent. Using your NMS, you should monitor
the Boston circuit daily to ensure it does not become over-utilized at peak
traffic times.
Does the Fix Survive a Reboot?
Yes.
Is the Fix Well Communicated?
Summary
While your specific traffic engineering problems and issues will always
be different from this chapter's example, the purpose was to illustrate a
network outage and show how to apply The Fix Test to it. Traffic
engineering is always a good example to showcase because when it
goes sour everyone knows. A simple set of rules will help in your
approach to and effectiveness with troubleshooting.
Listen to your network users, but factor in their emotions. Minor or
localized network problems may appear worse to some users, depending on the impact.
Apply a set formula for examining a problem. A consistent approach
yields consistent results. An example of such a formula is:
What is the scope of the problem?
How many distinct source networks are affected?
What destinations are involved?
Who reported the problem first?
What type(s) of traffic is affected?
Is the problem constant or sporadic?
Test your theories and always confirm from another source. A combination of instrumentation and practical tests should prove that your fix
worked.
Short-term fixes lead to long-term resolutions. Your primary objective
is to restore service in an operationally supportable way and often this
involves short-term fixes.
Allow yourself to improve your formula as you go. If you consistently
use an evolving formula, your results should always improve.
Chapter 3
CLI Instrumentation
Environmental Commands
Chassis Commands
Request Support Information
Summary
Environmental Commands
Most of the environmental instrumentation commands can be found at
the chassis hierarchy level:
ps@dunkel-re0> show chassis ?
Possible completions:
  alarms                  Show alarm status
  craft-interface         Show craft interface status
  environment             Show component status and temperature, cooling system speeds
  ethernet-switch         Show Ethernet switch information
  fabric                  Show internal fabric management state
  firmware                Show firmware and operating system version for components
  fpc                     Show Flexible PIC Concentrator status
  hardware                Show installed hardware components
  location                Show physical location of chassis
  mac-addresses           Show media access control addresses
  pic                     Show Physical Interface Card state, type, and uptime
  routing-engine          Show Routing Engine status
  sibs                    Show Switch Interface Board status
  synchronization         Show clock synchronization information
  temperature-thresholds  Show chassis temperature threshold settings
A quick way to assess the status of the chassis is to issue the show
chassis alarms command:
ps@dunkel-re0> show chassis alarms
1 alarms currently active
Alarm time
Class Description
2010-01-19 11:47:35 PST Major PEM 3 Not OK
Here our chassis seems to be having a problem with power entry module
(PEM) number 3. This usually indicates a power source problem.
Either the cable is unplugged or there is a problem with the circuit
breaker. To better understand the problem, issue the show chassis
environment pem command:
[show chassis environment output: 15 fan entries, each reporting "Spinning at normal speed"]
You can see how the show chassis environment command provides
information on the status of the power entry modules, temperatures,
and fan operation. Temperature alarms are also displayed here (and
can also be displayed with the show chassis alarms command) and are
often launched by site problems such as cooling system failures,
incorrect rack and system layouts, or fan failures (which would also be
shown in the output of the show chassis environment and show
chassis alarms commands).
TIP Fan failures and PEM failures that are not caused by bad cables or
Chassis Commands
The other main chassis-level concerns include the status of the Routing
Engine(s), FPCs (Flexible PIC Concentrators), and PICs (Physical
Interface Cards). The show chassis routing-engine, show chassis
fpc, and show chassis fpc pic-status commands display this
information.
Here's a sample of the output from the show chassis routing-engine
command when issued on a router with a single Routing Engine:
ps@doppelbock> show chassis routing-engine
Routing Engine status:
    Temperature                 32 degrees C / 89 degrees F
    CPU temperature             32 degrees C / 89 degrees F
    DRAM                       512 MB
    Memory utilization          36 percent
    CPU utilization:
      User                       2 percent
      Background                 0 percent
      Kernel                     4 percent
      Interrupt                  0 percent
      Idle                      94 percent
    Model                          RE-2.0
    Serial ID                      c40000078cf97701
    Start time                     2010-01-12 05:56:58 EST
    Uptime                         7 days, 21 hours, 4 seconds
    Load averages:                 1 minute   5 minute  15 minute
                                       0.08       0.02       0.01
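  Slot 1:
    Current state                  Backup
    Election priority              Backup
    Temperature                 43 degrees C / 109 degrees F
    CPU temperature             47 degrees C / 116 degrees F
    DRAM                      3584 MB
    Memory utilization           8 percent
    CPU utilization:
      User                       0 percent
      Background                 0 percent
      Kernel                     0 percent
      Interrupt                  0 percent
      Idle                     100 percent
    Model                          RE-A-2000
    Serial ID                      1000699981
    Start time                     2010-01-19 11:28:08 PST
    Uptime                         23 hours, 41 minutes, 28 seconds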
The output shown here for a router equipped with dual Routing Engines is
what you would expect when it is operating normally. If the Current
state for a Routing Engine is displayed as Present (as shown below),
you might need to investigate it further:
ps@dunkel-re0> show chassis routing-engine
Routing Engine status:
  Slot 0:
    Current state                  Master
    Election priority              Master
    Temperature                 45 degrees C / 113 degrees F
    CPU temperature             52 degrees C / 125 degrees F
    DRAM                      3584 MB
    Memory utilization           9 percent
    CPU utilization:
      User                       0 percent
      Background                 0 percent
      Kernel                     3 percent
      Interrupt                  1 percent
      Idle                      97 percent
    Model                          RE-A-2000
    Serial ID                      1000702757
    Start time                     2010-01-19 11:42:50 PST
    Uptime                         1 day, 39 minutes, 22 seconds
    Load averages:                 1 minute   5 minute  15 minute
                                       0.01       0.01       0.02
Routing Engine status:
  Slot 1:
    Current state                  Present
chassis pic fpc-slot [slot-number] pic-slot [pic-slot number] offline command. Check that
the PIC is offline by issuing the show chassis pic fpc-slot [slot-number] pic-slot [pic-slot number] command. Then bring the PIC
back online by issuing the request chassis pic fpc-slot [slot-number] pic-slot [pic-slot number] online command:
ps@dunkel-re0> request chassis pic fpc-slot 4 pic-slot 0 offline
fpc 4 pic 0 offline initiated, use show chassis fpc pic-status 4 to verify
ps@dunkel-re0> show chassis pic fpc-slot 4 pic-slot 0
FPC slot 4, PIC slot 0 information:
  State                    Offline
ps@dunkel-re0> request chassis pic fpc-slot 4 pic-slot 0 online
fpc 4 pic 0 online initiated, use show chassis fpc pic-status 4 to verify
ps@dunkel-re0> show chassis pic fpc-slot 4 pic-slot 0
FPC slot 4, PIC slot 0 information:
  Type                     4x CHDS3 IQ
  State                    Online
  PIC version              2.7
  Uptime                   1 second
If the problem persists, you might try reseating the FPC or PIC as
discussed in the routing engine troubleshooting section. If that does not
resolve the problem, open a case with JTAC.
If you need to open a case for a routing engine, FPC, or PIC problem, be
sure to include the output from the request support information
command and a copy of the messages and chassisd log files (stored in
the /var/log directory) in the new JTAC case.
Issue the request support information command and redirect the output
to a file:
ps@dunkel-re0> request support information | save rsi-dunkel-01202010.log
Wrote 7679 lines of output to rsi-dunkel-01202010.log
Log Files
It is also quite useful for JTAC to have a copy of the messages and
chassisd log files, which exist in the /var/log directory on the router.
While an SFTP session is open with the router, you can copy those files
as well:
sftp> cd /var/log
sftp> get messages
Fetching /var/log/messages to messages
/var/log/messages                   100%   32KB  31.7KB/s   00:00
sftp> get chassisd
Fetching /var/log/chassisd to chassisd
/var/log/chassisd                   100% 1967KB   1.9MB/s   00:01
You now have all of the information you need to open a JTAC case.
Juniper's support teams can help you with the remaining troubleshooting and, if necessary, create an order for replacement hardware.
Summary
Network management suites provide an excellent method for actively
monitoring a network, polling for specific values, and to some degree
isolating problems, but nothing can replace the ability to efficiently
navigate the CLI. Most real-time troubleshooting, diagnosis, and
resolution steps involve using the CLI, which makes a solid understanding of the CLI all the more important.
There are entire books about the Junos CLI and its potential to
monitor and troubleshoot specific issues. This chapter introduced a
few key commands and the rest of this booklet will explore the CLI's
ability to examine a device's inner workings.
MORE? Need more on Junos? See the other booklets in this Day One series,
Chapter 4
System Monitoring and
Troubleshooting
Syslog
SNMP Polling
SNMP Traps
Summary
Before this booklet gets you looking at connectivity and protocols, you
must first ensure your system is functioning properly. This chapter
walks you through some of the basic syslog, SNMP polling, SNMP
traps, and CLI instrumentation so you can assess the operation of your
system. This chapter also acts as the basis for later monitoring and
troubleshooting discussions.
Syslog
Administrators should collect system logs, change logs, and interactive
commands so that they can use the resulting syslog files to
troubleshoot and monitor the network. These logs also help to correlate
network events with configuration changes. A syslog file can also
provide the syslog server with enough information to appropriately
alarm, notify, and identify the root cause of problems where and when
possible.
Juniper offers both local logging, which writes messages to files in the
/var/log directory located either on the hard disk drive or the compact
flash disk, depending on the platform, and remote logging, where
syslog messages are sent to a remote syslog server.
BEST PRACTICE To limit the amount of writing done to the local hard drive or compact
Syslog messages are tagged with a severity level. The severity hierarchy, from least important to most important, is as follows:
Debug
Info
Notice
Warning
Error
Critical
Alert
Emergency
NOTE Junos supports all syslog severities but debug.
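To make the ordering concrete, here is a minimal Python sketch of how a severity threshold filters messages. It assumes the standard syslog numeric codes (0 for emergency through 7 for debug), where a lower number means a more important message. The names SEVERITY_CODE and is_logged are hypothetical and are not Junos functions.

```python
# Standard syslog severity codes: a lower number is more severe.
SEVERITY_CODE = {
    "emergency": 0,
    "alert": 1,
    "critical": 2,
    "error": 3,
    "warning": 4,
    "notice": 5,
    "info": 6,
    "debug": 7,
}


def is_logged(message_severity: str, configured_severity: str) -> bool:
    """True when a message is at least as severe as the configured level."""
    return SEVERITY_CODE[message_severity] <= SEVERITY_CODE[configured_severity]


if __name__ == "__main__":
    # With a threshold of notice, errors pass and info messages do not.
    for severity in ("error", "notice", "info"):
        print(severity, is_logged(severity, "notice"))
```

Choosing a threshold is therefore a trade-off: a less severe level such as info captures more detail, but also writes and sends more messages.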
The syslog server now receives notice level logs, change-logs, and
interactive commands, and they are sent with a facility override of
local3.
The syslog server should be receiving all the syslog information you
could need to troubleshoot most issues, including Layer 1 problems.
But remember our commitment to confirm all troubleshooting steps
with a second network opinion? Let's run a Fix Test by triggering a link
transition and then monitor our syslog servers to see if they receive
notice.
Disabling the SONET link on so-3/3/2:
ps@dunkel-re0> configure
Entering configuration mode
[edit]
ps@dunkel-re0# set interfaces so-3/3/2 disable
[edit]
ps@dunkel-re0# commit
commit complete
Resulting syslog message:
Jan 19 15:48:41 172.19.110.171 mib2d[4549]: SNMP_TRAP_LINK_DOWN: ifIndex 151,
ifAdminStatus down(2), ifOperStatus down(2), ifName so-3/3/2
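If your syslog server feeds a script rather than a person, a message like the one above can be picked apart with a regular expression. This is a minimal Python sketch that assumes the message arrives on a single line in exactly the format shown. The pattern and the parse_link_down function are illustrative, not part of any Juniper tool.

```python
import re

# Matches the body of the SNMP_TRAP_LINK_DOWN message shown above.
LINK_DOWN = re.compile(
    r"SNMP_TRAP_LINK_DOWN: ifIndex (?P<ifindex>\d+),\s+"
    r"ifAdminStatus (?P<admin>\w+)\(\d+\),\s+"
    r"ifOperStatus (?P<oper>\w+)\(\d+\),\s+"
    r"ifName (?P<ifname>\S+)"
)


def parse_link_down(line: str):
    """Return the link-down fields as a dict, or None if the line differs."""
    match = LINK_DOWN.search(line)
    if match is None:
        return None
    return {
        "ifindex": int(match["ifindex"]),
        "admin": match["admin"],
        "oper": match["oper"],
        "ifname": match["ifname"],
    }


if __name__ == "__main__":
    sample = (
        "Jan 19 15:48:41 172.19.110.171 mib2d[4549]: SNMP_TRAP_LINK_DOWN: "
        "ifIndex 151, ifAdminStatus down(2), ifOperStatus down(2), "
        "ifName so-3/3/2"
    )
    print(parse_link_down(sample))
```

An administrative status of down tells you the interface was disabled deliberately, which is exactly what the Fix Test did here.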
SNMP Polling
Now that syslog is configured, let's move on to the Simple Network
Management Protocol (SNMP). SNMP can be used either to passively
respond to queries or to actively send SNMP messages, called traps.
NOTE SNMP is disabled by default on Juniper Networks routers.
First, let's configure the router to allow SNMP polling from a certain
address, and an SNMP community of tr4pp15t (an SNMP community
is a simple password to authenticate the entity making the query):
ps@dunkel-re0> configure
Entering configuration mode
[edit]
ps@dunkel-re0# set snmp community tr4pp15t authorization read-only
[edit]
ps@dunkel-re0# set snmp community tr4pp15t clients 172.19.110.10/32
[edit]
ps@dunkel-re0# show snmp
community tr4pp15t {
    authorization read-only;
    clients {
        172.19.110.10/32;
    }
}
[edit]
ps@dunkel-re0# show | compare
[edit snmp]
+  community tr4pp15t {
+      authorization read-only;
+      clients {
+          172.19.110.10/32;
+      }
+  }
[edit]
ps@dunkel-re0# commit
commit complete
The query timed out, which means our router is only listening to SNMP
queries from 172.19.110.10.
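The effect of the clients statement can be pictured as a simple membership test on the source address of each query. The Python sketch below is only an illustration of that logic, under the assumption that the router compares the source against each configured prefix. The names ALLOWED_CLIENTS and is_allowed are hypothetical.

```python
import ipaddress

# The single client prefix configured for the community in this chapter.
ALLOWED_CLIENTS = [ipaddress.ip_network("172.19.110.10/32")]


def is_allowed(source: str) -> bool:
    """True when the query's source address falls inside a client prefix."""
    address = ipaddress.ip_address(source)
    return any(address in network for network in ALLOWED_CLIENTS)


if __name__ == "__main__":
    for source in ("172.19.110.10", "172.19.110.11"):
        print(source, "answered" if is_allowed(source) else "ignored (query times out)")
```

From the NMS side a rejected query simply looks like a timeout, which is why a timeout from the wrong address is the expected result.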
There are numerous SNMP objects that can be queried for information. Many of these objects are dependent on what services, protocols,
and configurations are enabled.
BEST PRACTICE Juniper recommends that you configure the following base objects to
Hardware Inventory
To poll the device for its hardware inventory, you can use the snmpwalk
utility, which walks the SNMP tree starting at the value provided:
The hardware inventory OID is .1.3.6.1.4.1.2636.3.1.13.1.5
nms-1> snmpwalk -v2c -c tr4pp15t dunkel .1.3.6.1.4.1.2636.3.1.13.1.5
SNMPv2-SMI::enterprises.2636.3.1.13.1.5.1.1.0.0 = STRING: midplane
SNMPv2-SMI::enterprises.2636.3.1.13.1.5.2.1.0.0 = STRING: PEM 0
SNMPv2-SMI::enterprises.2636.3.1.13.1.5.2.2.0.0 = STRING: PEM 1
SNMPv2-SMI::enterprises.2636.3.1.13.1.5.2.3.0.0 = STRING: PEM 2
SNMPv2-SMI::enterprises.2636.3.1.13.1.5.2.4.0.0 = STRING: PEM 3
SNMPv2-SMI::enterprises.2636.3.1.13.1.5.4.1.1.0 = STRING: Top Left Front fan
SNMPv2-SMI::enterprises.2636.3.1.13.1.5.4.1.2.0 = STRING: Top Right Rear fan
SNMPv2-SMI::enterprises.2636.3.1.13.1.5.4.1.3.0 = STRING: Top Right Front fan
SNMPv2-SMI::enterprises.2636.3.1.13.1.5.4.1.4.0 = STRING: Top Left Rear fan
SNMPv2-SMI::enterprises.2636.3.1.13.1.5.4.2.1.0 = STRING: Bottom Left Front fan
SNMPv2-SMI::enterprises.2636.3.1.13.1.5.4.2.2.0 = STRING: Bottom Right Rear fan
SNMPv2-SMI::enterprises.2636.3.1.13.1.5.4.2.3.0 = STRING: Bottom Right Front fan
SNMPv2-SMI::enterprises.2636.3.1.13.1.5.4.2.4.0 = STRING: Bottom Left Rear fan
SNMPv2-SMI::enterprises.2636.3.1.13.1.5.4.3.1.0 = STRING: Rear Fan 1 (TOP)
SNMPv2-SMI::enterprises.2636.3.1.13.1.5.4.3.2.0 = STRING: Rear Fan 2
SNMPv2-SMI::enterprises.2636.3.1.13.1.5.4.3.3.0 = STRING: Rear Fan 3
SNMPv2-SMI::enterprises.2636.3.1.13.1.5.4.3.4.0 = STRING: Rear Fan 4
SNMPv2-SMI::enterprises.2636.3.1.13.1.5.4.3.5.0 = STRING: Rear Fan 5
SNMPv2-SMI::enterprises.2636.3.1.13.1.5.4.3.6.0 = STRING: Rear Fan 6
SNMPv2-SMI::enterprises.2636.3.1.13.1.5.4.3.7.0 = STRING: Rear Fan 7 (Bottom)
SNMPv2-SMI::enterprises.2636.3.1.13.1.5.7.1.0.0 = STRING: FPC: M320 E2-FPC Type 3 @ 0/*/*
SNMPv2-SMI::enterprises.2636.3.1.13.1.5.7.2.0.0 = STRING: FPC: M320 E2-FPC Type 3 @ 1/*/*
SNMPv2-SMI::enterprises.2636.3.1.13.1.5.7.3.0.0 = STRING: FPC: M320 E2-FPC Type 2 @ 2/*/*
SNMPv2-SMI::enterprises.2636.3.1.13.1.5.7.4.0.0 = STRING: FPC: M320 E2-FPC Type 1 @ 3/*/*
SNMPv2-SMI::enterprises.2636.3.1.13.1.5.7.5.0.0 = STRING: FPC: M320 E2-FPC Type 1 @ 4/*/*
SNMPv2-SMI::enterprises.2636.3.1.13.1.5.8.1.1.0 = STRING: PIC: 10x 1GE(LAN), 1000 BASE @ 0/0/*
SNMPv2-SMI::enterprises.2636.3.1.13.1.5.8.1.2.0 = STRING: PIC: 10x 1GE(LAN), 1000 BASE @ 0/1/*
SNMPv2-SMI::enterprises.2636.3.1.13.1.5.8.2.1.0 = STRING: PIC: 4x OC-48 SONET @ 1/0/*
SNMPv2-SMI::enterprises.2636.3.1.13.1.5.8.2.2.0 = STRING: PIC: 8x 1GE(TYPE3), IQ2 @ 1/1/*
SNMPv2-SMI::enterprises.2636.3.1.13.1.5.8.3.1.0 = STRING: PIC: 4x OC-12 SONET, SMIR @ 2/0/*
SNMPv2-SMI::enterprises.2636.3.1.13.1.5.8.3.2.0 = STRING: PIC: 2x OC-12 ATM-II IQ, MM @ 2/1/*
SNMPv2-SMI::enterprises.2636.3.1.13.1.5.8.4.1.0 = STRING: PIC: 1x OC-12 SONET, SMIR @ 3/0/*
SNMPv2-SMI::enterprises.2636.3.1.13.1.5.8.4.2.0 = STRING: PIC: 1x OC-12 ATM-II IQ, MM @ 3/1/*
SNMPv2-SMI::enterprises.2636.3.1.13.1.5.8.4.3.0 = STRING: PIC: 4x OC-3 SONET, SMIR @ 3/2/*
SNMPv2-SMI::enterprises.2636.3.1.13.1.5.8.4.4.0 = STRING: PIC: 4x OC-3 SONET, MM @ 3/3/*
SNMPv2-SMI::enterprises.2636.3.1.13.1.5.8.5.1.0 = STRING: PIC: 4x CHDS3 IQ @ 4/0/*
SNMPv2-SMI::enterprises.2636.3.1.13.1.5.8.5.3.0 = STRING: PIC: 1x CHOC12 IQ SONET, SMIR @ 4/2/*
SNMPv2-SMI::enterprises.2636.3.1.13.1.5.8.5.4.0 = STRING: PIC: 4x OC-3 SONET, SMIR @ 4/3/*
SNMPv2-SMI::enterprises.2636.3.1.13.1.5.9.1.0.0 = STRING: Routing Engine 0
SNMPv2-SMI::enterprises.2636.3.1.13.1.5.9.2.0.0 = STRING: Routing Engine 1
SNMPv2-SMI::enterprises.2636.3.1.13.1.5.10.1.1.0 = STRING: FPM GBUS
SNMPv2-SMI::enterprises.2636.3.1.13.1.5.10.1.2.0 = STRING: FPM Display
SNMPv2-SMI::enterprises.2636.3.1.13.1.5.12.1.0.0 = STRING: CB 0
SNMPv2-SMI::enterprises.2636.3.1.13.1.5.12.2.0.0 = STRING: CB 1
SNMPv2-SMI::enterprises.2636.3.1.13.1.5.15.1.0.0 = STRING: SIB 0
SNMPv2-SMI::enterprises.2636.3.1.13.1.5.15.2.0.0 = STRING: SIB 1
SNMPv2-SMI::enterprises.2636.3.1.13.1.5.15.3.0.0 = STRING: SIB 2
SNMPv2-SMI::enterprises.2636.3.1.13.1.5.15.4.0.0 = STRING: SIB 3
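An inventory walk like this is easier to work with once it is loaded into a lookup table. The following Python sketch parses the text output of the walk, assuming each entry sits on a single line in the format shown. The four trailing numbers of each OID are used as the key. ROW and parse_inventory are hypothetical names, and the script does not query the router itself.

```python
import re

# One inventory row: four trailing index numbers, then the description.
ROW = re.compile(
    r"^SNMPv2-SMI::enterprises\.2636\.3\.1\.13\.1\.5\."
    r"(\d+)\.(\d+)\.(\d+)\.(\d+) = STRING: (.+)$"
)


def parse_inventory(text: str) -> dict:
    """Map each (index, index, index, index) tuple to its description."""
    inventory = {}
    for line in text.splitlines():
        match = ROW.match(line.strip())
        if match:
            index = tuple(int(part) for part in match.groups()[:4])
            inventory[index] = match.group(5)
    return inventory


if __name__ == "__main__":
    walk = """\
SNMPv2-SMI::enterprises.2636.3.1.13.1.5.2.4.0.0 = STRING: PEM 3
SNMPv2-SMI::enterprises.2636.3.1.13.1.5.9.1.0.0 = STRING: Routing Engine 0
SNMPv2-SMI::enterprises.2636.3.1.13.1.5.9.2.0.0 = STRING: Routing Engine 1
"""
    for index, description in parse_inventory(walk).items():
        print(index, description)
```

The same index appears at the end of the per-component OIDs queried later in this chapter (for example, 9.1.0.0 for Routing Engine 0), so the table lets you translate those values back to a component name.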
Examples:
nms-1> snmpwalk -v2c -c tr4pp15t dunkel .1.3.6.1.4.1.2636.3.1.13.1.8.9.1.0.0
SNMPv2-SMI::enterprises.2636.3.1.13.1.8.9.1.0.0 = Gauge32: 3
nms-1> snmpwalk -v2c -c tr4pp15t dunkel .1.3.6.1.4.1.2636.3.1.13.1.8.9.2.0.0
SNMPv2-SMI::enterprises.2636.3.1.13.1.8.9.2.0.0 = Gauge32: 0
    DRAM                      3584 MB
    Memory utilization           8 percent
    CPU utilization:
      User                       0 percent
      Background                 0 percent
      Kernel                     0 percent
      Interrupt                  0 percent
      Idle                     100 percent
    Model                          RE-A-2000
    Serial ID                      1000699981
    Start time                     2010-01-19 11:28:08 PST
    Uptime                         22 hours, 4 minutes, 52 seconds
As shown, the idle values reported by the CLI agree with the utilization
values returned by SNMP. The utilization value and the idle value should
correlate, adding up to 100%:
Routing-engine 0 - 3% + 97% = 100%
Routing-engine 1 - 0% + 100% = 100%
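The same cross-check is easy to automate. Here is a minimal Python sketch that compares an SNMP utilization value with a CLI idle value. The one-point tolerance is an assumption, added because the two readings are taken at slightly different moments, and cpu_matches_idle is a hypothetical name.

```python
def cpu_matches_idle(utilization_percent: int, idle_percent: int, tolerance: int = 1) -> bool:
    """True when utilization plus idle is within `tolerance` of 100 percent."""
    return abs((utilization_percent + idle_percent) - 100) <= tolerance


if __name__ == "__main__":
    # (SNMP utilization, CLI idle) pairs for the two Routing Engines above.
    readings = {"Routing-engine 0": (3, 97), "Routing-engine 1": (0, 100)}
    for name, (utilization, idle) in readings.items():
        verdict = "consistent" if cpu_matches_idle(utilization, idle) else "investigate"
        print(f"{name}: {utilization}% + {idle}% -> {verdict}")
```

A pair that fails this check does not necessarily indicate a fault, but it is a good prompt to poll again and compare the two sources a second time.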
Examples:
nms-1> snmpwalk -v2c -c tr4pp15t dunkel .1.3.6.1.4.1.2636.3.1.13.1.11.9.1.0.0
SNMPv2-SMI::enterprises.2636.3.1.13.1.11.9.1.0.0 = Gauge32: 9
nms-1> snmpwalk -v2c -c tr4pp15t dunkel .1.3.6.1.4.1.2636.3.1.13.1.11.9.2.0.0
SNMPv2-SMI::enterprises.2636.3.1.13.1.11.9.2.0.0 = Gauge32: 8
This indicates that routing-engine 0 is running at 9% memory utilization while routing-engine 1 is running at 8% memory utilization.
Again, we always confirm, so let's run the show chassis routing-engine command on the router:
ps@dunkel-re0> show chassis routing-engine
Routing Engine status:
  Slot 0:
    Current state                  Master
    Election priority              Master
    Temperature                 45 degrees C / 113 degrees F
    CPU temperature             52 degrees C / 125 degrees F
    DRAM                      3584 MB
    Memory utilization           9 percent
    CPU utilization:
      User                       0 percent
      Background                 0 percent
      Kernel                     3 percent
      Interrupt                  0 percent
      Idle                      97 percent
    Model                          RE-A-2000
    Serial ID                      1000702757
    Start time                     2010-01-19 11:42:50 PST
    Uptime                         21 hours, 50 minutes, 15 seconds
    Load averages:                 1 minute   5 minute  15 minute
                                       0.00       0.03       0.02
Routing Engine status:
  Slot 1:
    Current state                  Backup
    Election priority              Backup
    Temperature                 43 degrees C / 109 degrees F
    CPU temperature             47 degrees C / 116 degrees F
    DRAM                      3584 MB
    Memory utilization           8 percent
    CPU utilization:
      User                       0 percent
      Background                 0 percent
      Kernel                     0 percent
      Interrupt                  0 percent
      Idle                     100 percent
    Model                          RE-A-2000
    Serial ID                      1000699981
    Start time                     2010-01-19 11:28:08 PST
    Uptime                         22 hours, 4 minutes, 52 seconds
[Temperature threshold table: per-component threshold values under the column headings Normal, Bad fan, and Red alarm; row labels not recoverable]
chassis routing-
      User                       0 percent
      Background                 0 percent
      Kernel                     2 percent
      Interrupt                  0 percent
      Idle                      97 percent
    Model                          RE-A-2000
    Serial ID                      1000702757
    Start time                     2010-01-19 11:42:50 PST
    Uptime                         22 hours, 2 minutes, 35 seconds
    Load averages:                 1 minute   5 minute  15 minute
                                       0.00       0.00       0.00
Routing Engine status:
  Slot 1:
    Current state                  Backup
    Election priority              Backup
    Temperature                 43 degrees C / 109 degrees F
    CPU temperature             47 degrees C / 116 degrees F
    DRAM                      3584 MB
    Memory utilization           8 percent
    CPU utilization:
      User                       0 percent
      Background                 0 percent
      Kernel                     0 percent
      Interrupt                  0 percent
      Idle                     100 percent
    Model                          RE-A-2000
    Serial ID                      1000699981
    Start time                     2010-01-19 11:28:08 PST
    Uptime                         22 hours, 17 minutes, 14 seconds
The temperature values returned from the SNMP query shown here
match our CLI output.
SNMP Traps
In addition to polling, SNMP can also actively send traps under certain
circumstances. This is extremely important to any NMS, as it
allows the network devices to proactively tell a monitoring system that
there is an issue or that an event has occurred. Properly configured
SNMP trapping, in conjunction with syslog messages, should provide
an NMS with all that it needs to effectively monitor a network.
An SNMP trap configuration is similar to the SNMP polling configuration implemented in the previous step, except that an additional SNMP
community must be configured and an SNMP server must also be
Okay, let's configure traps for the authentication, chassis, configuration, link, routing, sonet-alarms, and startup categories:
[edit]
ps@dunkel-re0#
[edit]
ps@dunkel-re0#
[edit]
ps@dunkel-re0#
[edit]
ps@dunkel-re0#
[edit]
ps@dunkel-re0#
[edit]
ps@dunkel-re0#
[edit]
ps@dunkel-re0#
Issue the show snmp command (in configuration mode) to display the
SNMP configuration with the added changes:
[edit]
ps@dunkel-re0# show snmp
community tr4pp15t {
    authorization read-only;
    clients {
        172.19.110.10/32;
    }
}
trap-group tr4pp15t {
    categories {
        authentication;
        chassis;
        link;
        routing;
        startup;
        configuration;
        sonet-alarms;
    }
    targets {
        172.19.110.10;
    }
}
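If you manage many routers, you may want to generate the trap-group statements rather than type them. The Python sketch below builds set commands that would produce the trap-group stanza displayed above. It is a reconstruction from that stanza, not a record of the exact commands entered in this chapter, and trap_group_commands is a hypothetical helper.

```python
def trap_group_commands(group: str, categories: list, targets: list) -> list:
    """Build Junos set commands for one SNMP trap group."""
    commands = [f"set snmp trap-group {group} categories {name}" for name in categories]
    commands += [f"set snmp trap-group {group} targets {address}" for address in targets]
    return commands


if __name__ == "__main__":
    categories = [
        "authentication", "chassis", "link", "routing",
        "startup", "configuration", "sonet-alarms",
    ]
    # Group name and target address are taken from the configuration above.
    for command in trap_group_commands("tr4pp15t", categories, ["172.19.110.10"]):
        print(command)
```

Paste the generated lines in configuration mode, then review them with show | compare before you commit.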
Summary
Half of running a well-operated network is implementing a network
management system and providing that NMS with the logs and
traps it needs to effectively identify and report problems. This is the
proactive portion of monitoring and troubleshooting and is an invaluable asset for any well-run network. The complement to an NMS, as
you will see in the following chapters, is active troubleshooting (and
monitoring) through the Junos CLI.
Chapter 5
Layer 1 and Layer 2 Monitoring
and Troubleshooting
Important Interface Commands
Layer 1 Monitoring and Troubleshooting
SONET Alarms and Defects
Layer 2 Monitoring and Troubleshooting
Summary
  SONET PHY:            Seconds        Count  State
    PLL Lock                  0            0  OK
    PHY Light                 0            0  OK
  SONET section:
    BIP-B1                    0            0
    SEF                       0            0  OK
    LOS                       0            0  OK
    LOF                       0            0  OK
    ES-S                      0
    SES-S                     0
    SEFS-S                    0
  SONET line:
    BIP-B2                    0            0
    REI-L                     0            0
    RDI-L                     0            0  OK
    AIS-L                     0            0  OK
    BERR-SF                   0            0  OK
    BERR-SD                   0            0  OK
    ES-L                      0
    SES-L                     0
    UAS-L                     0
    ES-LFE                    0
    SES-LFE                   0
    UAS-LFE                   0
  SONET path:
    BIP-B3                    0            0
    REI-P                     0            0
    LOP-P                     0            0  OK
    AIS-P                     0            0  OK
    RDI-P                     0            0  OK
    UNEQ-P                    0            0  OK
    PLM-P                     0            0  OK
    ES-P                      0
    SES-P                     0
    UAS-P                     0
    ES-PFE                    0
    SES-PFE                   0
    UAS-PFE                   0
  Received SONET overhead:
    F1      : 0x00, J0      : 0x00, K1      : 0x00, K2      : 0x00
    S1      : 0x00, C2      : 0xcf, C2(cmp) : 0xcf, F2      : 0x00
    Z3      : 0x00, Z4      : 0x00, S1(cmp) : 0x00
  Transmitted SONET overhead:
    F1      : 0x00, J0      : 0x01, K1      : 0x00, K2      : 0x00
    S1      : 0x00, C2      : 0xcf, F2      : 0x00, Z3      : 0x00
    Z4      : 0x00
Received path trace: pilsener-re0 so-2/2/0
70 69 6c 73 65 6e 65 72 2d 72 65 30 20 73 6f 2d pilsener-re0 so-
32 2f 32 2f 30 00 00 00 00 00 00 00 00 00 00 00 2/2/0...........
00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
00 00 00 00 00 00 00 00 00 00 00 00 00 00 0d 00 ................
dunkel-re0 so-2/
0/0.............
................
................
Logical interface so-2/0/0.0 (Index 66) (SNMP ifIndex 179) (Generation 134)
Flags: Point-To-Point SNMP-Traps 0x4000 Encapsulation: PPP
Protocol inet, MTU: 4470, Generation: 141, Route table: 0
Addresses, Flags: Is-Preferred Is-Primary
Destination: 18.32.74.0/30, Local: 18.32.74.1, Broadcast: 18.32.74.3,
Generation: 144
Let's examine this output in depth, layer by layer, so you'll know what
to look for in your own show interfaces extensive or monitor interface command.
Major elements of monitoring and troubleshooting Layer 1 and Layer
2 technologies include an understanding of how the protocol monitors,
how errors can impact the operation of the connection, and how those
errors are displayed to the operator.
SONET Mode
Juniper Networks SONET PICs can operate in either SONET or SDH
mode.
SONET (Synchronous Optical Networking) and SDH (Synchronous
Digital Hierarchy) are two closely related methods for multiplexing
bit streams over a fiber optic cable. SONET is specified in Telcordia
Technologies Generic Requirements document GR-253-CORE, while SDH
is specified by the ITU-T. SDH was developed later and can be viewed
as a superset of SONET. Because SDH is used globally and SONET is used
mainly within North America, SDH is considered the standard.
Payload scrambling, bolded in the show command output, is a common culprit in a malfunctioning SONET connection. Like many
SONET parameters, payload-scrambling must agree between the two
ends of a circuit, and a conflict between the two sides causes SONET
errors.
Certain transport standards, such as the Telcordia GR-253 standard,
require a certain density of ones (as opposed to zeroes) in the digital
stream. The main purpose of this requirement is for timing recovery.
The percent of ones needed for a T1, for example, is 12.5 percent, or
one bit in every byte.
Payload scrambling is an encoding algorithm that ensures this
requirement is met, and it is enabled by default on the SONET
interfaces of Juniper Networks routers. Troubleshooting a payload-scrambling
mismatch can be tricky. A payload-scrambling mismatch does not cause SONET
layer issues (such as defects) on either side of the connection, but the side
with payload scrambling enabled logs input errors, which a Juniper Networks
router presents as input giants. The side of the connection without payload
scrambling configured displays framing errors.
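Because these settings must agree on both ends, a quick side-by-side comparison often finds the problem faster than reading error counters. The Python sketch below compares two sets of link settings that you have already collected from each router. The dictionary keys (fcs, payload_scrambler) and the function name are hypothetical labels chosen for this illustration.

```python
def find_mismatches(local: dict, remote: dict) -> list:
    """Return the sorted names of settings that differ between two link ends."""
    names = local.keys() | remote.keys()
    return sorted(name for name in names if local.get(name) != remote.get(name))


if __name__ == "__main__":
    # Example values only: one end has payload scrambling turned off.
    local_end = {"fcs": 32, "payload_scrambler": True}
    remote_end = {"fcs": 32, "payload_scrambler": False}
    for name in find_mismatches(local_end, remote_end):
        print(f"mismatch: {name} local={local_end.get(name)} remote={remote_end.get(name)}")
```

A setting that is present on one side and missing on the other is reported as a mismatch too, which catches values that were left at their defaults on only one router.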
BEST PRACTICE As with FCS, it is advisable to simply check that both sides are config-
Many different SONET errors can trigger input errors and while their
causes vary, they are indicative of a problem and their cause should be
examined and corrected. Framing errors, runts, and giants are typically
due to misconfiguration. Framing errors can be caused by an FCS
mismatch (16 on one side and 32 on the other, for example) or payload-scrambling mismatches.
Runts occur when a received packet is smaller than the minimum
frame size (64 bytes on SONET). These are the exception as they are
often caused by cabling or connection problems.
Giants are the result of a received packet being larger than the maximum frame size (16KB). These can be caused by payload-scrambling
mismatches as described above, but can also be caused by other
conditions. If input giants continue and configuration mismatches have
been ruled out, the provider of the circuit should be contacted to assist
in resolving the problem.
Using Figure 5-1 as a reference, the following sections list the most
common SONET errors, their possible causes, and some recommendations on where to start troubleshooting.
Loss of Signal (LOS)
The RDI is the complement to the AIS and is sent upstream when an
error is detected. Like AIS, there are path and line versions of this
signal.
n Remote defect indication line (RDI-L): Sent upstream to a peer
LTE when an AIS-L or low-level defects are detected.
n Remote defect indication path (RDI-P): Sent upstream to a peer
PTE when a defect in the signal, typically an AIS-P, is detected.
Remote Error Indication (REI)
Bit error rate alarms are declared when the number of BIP-B2 errors
hits a certain threshold. These error counters are shown in the output
of the show interfaces extensive command as previously shown.
Depending on the threshold, there are two types of BER alarms. In
both cases, the interface is taken down.
Bit error rate-signal degrade (BERR-SD) is declared when a bit error
rate of 10^-6 is reached.
Bit error rate-signal failure (BERR-SF) is declared when a bit error rate
of 10^-3 is reached.
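The two thresholds can be expressed as a small classification routine. This Python sketch uses the 10^-6 and 10^-3 values stated above and treats the bit error rate as errored bits divided by total bits. That is a simplification, since BIP-B2 counts parity violations rather than individual bit errors, and the function names are hypothetical.

```python
BERR_SD_THRESHOLD = 1e-6  # signal degrade threshold, as stated in the text
BERR_SF_THRESHOLD = 1e-3  # signal failure threshold, as stated in the text


def bit_error_rate(errored_bits: int, total_bits: int) -> float:
    """Return errored bits as a fraction of all bits observed."""
    if total_bits <= 0:
        raise ValueError("total_bits must be positive")
    return errored_bits / total_bits


def classify_ber(ber: float) -> str:
    """Name the alarm, if any, that a given bit error rate would declare."""
    if ber >= BERR_SF_THRESHOLD:
        return "BERR-SF"
    if ber >= BERR_SD_THRESHOLD:
        return "BERR-SD"
    return "OK"


if __name__ == "__main__":
    # Example counts only: errored bits out of one billion bits observed.
    for errors in (1, 5_000, 2_000_000):
        ber = bit_error_rate(errors, 1_000_000_000)
        print(f"{errors} errors in 1e9 bits -> {classify_ber(ber)}")
```

The signal failure test comes first because any rate that crosses 10^-3 has also crossed 10^-6, and you want the more serious alarm reported.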
Bit errors can be caused by any of the following:
n Degrading optical fiber
n Optical transmitter or receiver problems
n Dirty fiber-optic connector
n Clocking issues
n Too much attenuation in the optical signal
Note that BIP-B1 and BIP-B3 are not used in the BER alarm calculations.
Payload Label Mismatch (PLM)
The PLL alarm occurs when the PLL cannot lock on to a timing device,
and indicates a possible hardware or network timing problem. The
The PPP related information is bolded above and shows the important
aspects of its operation. LCP (Link Control Protocol) and NCP
(Network Control Protocol) are protocols within the PPP suite that
manage the parameters and operation of the link and network functions of PPP. LCP is responsible for negotiating link parameters (packet
transmission size) and validates that the link is acceptable (based on a
set of criteria).
PPP cannot continue unless LCP has moved to an open state. After LCP
negotiates, NCP utilizes Layer 3 protocol modules to negotiate parameters for a multitude of Layer 3 protocols. In this way, PPP does not
need to be modified to support new Layer 3 protocols. The NCP
module that concerns us in this case is IPCP, the Internet Protocol
Control Protocol. IPCP negotiates IP layer parameters such as the
compression mechanism, IP addressing (for example, in dynamic
subscriber environments), and other values. The NCP module for
a given Layer 3 protocol (IPCP for IP) must also be open before any packet of that type can be transmitted.
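The order of these dependencies is worth keeping in mind when you read PPP output, and it can be summarized in a few lines of code. The Python sketch below is a simplified model of the gating described above, not the Junos implementation. The state strings and the can_send function are assumptions chosen for the illustration.

```python
def can_send(lcp_state: str, auth_ok: bool, ncp_states: dict, ncp_name: str) -> bool:
    """True when LCP, authentication, and the named NCP all allow traffic."""
    if lcp_state != "Opened":
        return False  # nothing proceeds until LCP is open
    if not auth_ok:
        return False  # authentication (for example CHAP) must succeed next
    return ncp_states.get(ncp_name) == "Opened"


if __name__ == "__main__":
    # LCP still negotiating: IP traffic is blocked regardless of IPCP.
    print(can_send("Creq-sent", False, {"IPCP": "Closed"}, "IPCP"))
    # Everything open: IP packets may flow.
    print(can_send("Opened", True, {"IPCP": "Opened"}, "IPCP"))
    # IPCP open but no IPv6CP session: IPv6 packets are still blocked.
    print(can_send("Opened", True, {"IPCP": "Opened"}, "IPv6CP"))
```

When troubleshooting, work through the checks in the same order: an LCP or authentication problem explains a closed IPCP, but not the other way around.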
NOTE Other examples of NCP modules are IPv6CP for IP version 6, IPxCP
The ping attempt and interface output show that LCP is down, which
explains why traffic is not passing over this link. The configuration
shows that CHAP authentication is used, so this is a good place to start
troubleshooting.
ps@dunkel-re0> show configuration interfaces so-1/2/3
description Connection to maibock;
unit 0 {
    ppp-options {
        chap {
            default-chap-secret $9$PfF/9A0OIcF3hreKx7ik.539BIcM87cyvL; ## SECRET-DATA
        }
    }
    family inet {
        address 10.33.18.6/30;
    }
}
The fastest way to confirm that the problem is not CHAP related is to
reenter the password (ppp-password in our case) on both sides of
the connection. However, to show the debugging capabilities of Junos,
traceoptions are used to diagnose the problem. Junos traceoptions
serve to provide debugging information for a given protocol or
function. Generally, protocol traceoptions are configured at the
hierarchy level corresponding to that protocol. In the case of PPP,
traceoptions are configured at the [edit protocols ppp] hierarchy. There
are two main configuration parameters for traceoptions, the file, to
which the system logs the debugging messages, and flags, which specify
the types of information to be logged. For this example, the file is
called ppp-log.txt. The following shows the configuration options
for PPP traceoptions:
[edit]
ps@dunkel-re0# set protocols ppp traceoptions flag ?
Possible completions:
  access               Trace access code
  address-pool         Trace address pool code
  all                  Trace all areas of code
  auth                 Trace authentication code
  chap                 Trace CHAP code
  ci                   Trace ci code
  config               Trace configuration code
  ifdb                 Trace interface database code
  lcp                  Trace LCP state machine code
  memory               Trace memory management code
  message              Trace message processing code
  mlppp                Trace MLPPP code
  ncp                  Trace NCP state machine code
  pap                  Trace PAP code
  ppp                  Trace PPP protocol processing code
  radius               Trace RADIUS processing code
  redundancy           Trace redundancy code
  rtsock               Trace routing socket code
  session              Trace session management code
  signal               Trace signal handling code
  timer                Trace timer code
  ui                   Trace user interface code
There are many traceoptions flags for PPP, but if you suspect CHAP to
be the problem, that's the best place to start. PPP traceoptions also
offer a level parameter, which allows you to limit the output to a
certain severity. For this example, you use a level of all to start with,
leaving the traceoptions configuration as follows:
ps@dunkel-re0> show configuration protocols ppp
traceoptions {
file ppp-log.txt;
level all;
flag chap;
}
***
so-1/2/3.0: CHAP - Stopping protocol timer
so-1/2/3.0: CHAP - Starting authentication
so-1/2/3.0: CHAP - End authen(0x8231004): FAILURE
The important line in the log is the failure error. This shows that
authentication is failing and that the password should be reset on both
sides.
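If you are watching a busy trace file, it can help to pull out only the lines that report the end of a CHAP exchange. The following is a minimal Python sketch, assuming the trace lines look like the three shown above; the function name chap_results and the regular expression are illustrative and not part of Junos:

```python
import re

# Matches trace lines such as:
#   so-1/2/3.0: CHAP - End authen(0x8231004): FAILURE
# The pattern is an assumption based on the sample log lines in this chapter.
CHAP_RESULT = re.compile(
    r"(?P<ifl>\S+): CHAP - End authen\([^)]*\): (?P<result>[A-Z]+)"
)


def chap_results(lines):
    """Return (interface, result) pairs for each CHAP end-of-authentication line."""
    results = []
    for line in lines:
        match = CHAP_RESULT.search(line)
        if match:
            results.append((match.group("ifl"), match.group("result")))
    return results


sample = [
    "so-1/2/3.0: CHAP - Stopping protocol timer",
    "so-1/2/3.0: CHAP - Starting authentication",
    "so-1/2/3.0: CHAP - End authen(0x8231004): FAILURE",
]
print(chap_results(sample))
```

Only the final sample line carries a result, so the other CHAP messages are ignored.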
Junos also has several CLI commands to monitor the status of PPP. The
most useful commands are show ppp interface [interface name]
extensive and show ppp summary. Below are example outputs of these
commands in the current down state:
ps@dunkel-re0> show ppp interface so-1/2/3 extensive
Sessions for interface so-1/2/3
Session so-1/2/3.0, Type: PPP, Phase: Establish
LCP
State: Creq-sent
Last started: 2010-04-12 16:16:16 PDT
Last completed: 2010-04-12 16:16:14 PDT
Negotiated options:
Authentication protocol: CHAP, Authentication algorithm: MD5,
Magic number: 2543706641, MRU: 4470
Authentication: CHAP
State: Closed
Last started: 2010-04-12 16:16:14 PDT
Last completed: 2010-04-12 16:13:26 PDT
IPCP
State: Closed
Last started: 2010-04-12 16:13:26 PDT
Last completed: 2010-04-12 16:13:26 PDT
Negotiated options:
Local address: 10.33.18.6, Remote address: 10.33.18.4, Primary DNS: 0.0.0.0,
Secondary DNS: 0.0.0.0
ps@dunkel-re0> show ppp summary
Interface        Session type    Session phase    Session flags
so-1/2/3.0       PPP             Authenticate
12 16:13:26 so-1/2/3.0: CHAP
12 16:13:26 so-1/2/3.0: CHAP
12 16:13:26 so-1/2/3.0: CHAP
12 16:13:26 so-1/2/3.0: CHAP
Note the SUCCESS message. Following this, you can once again ping
over the circuit, and the CLI instrumentation validates that PPP is now
functioning properly.
ps@dunkel-re0> show ppp interface so-1/2/3 extensive
Sessions for interface so-1/2/3
Session so-1/2/3.0, Type: PPP, Phase: Network
LCP
State: Opened
Last started: 2010-04-12 16:19:41 PDT
Last completed: 2010-04-12 16:19:41 PDT
Negotiated options:
Authentication protocol: CHAP, Authentication algorithm: MD5,
Magic number: 2544040945, MRU: 4470
Authentication: CHAP
State: Success
Last started: 2010-04-12 16:19:41 PDT
Last completed: 2010-04-12 16:19:41 PDT
IPCP
State: Opened
Last started: 2010-04-12 16:19:41 PDT
Last completed: 2010-04-12 16:19:41 PDT
Negotiated options:
Local address: 10.33.18.6, Remote address: 10.33.18.5, Primary DNS: 0.0.0.0,
Secondary DNS: 0.0.0.0
ps@dunkel-re0> show ppp summary
Interface        Session type    Session phase    Session flags
so-1/2/3.0       PPP             Network
BEST PRACTICE Note that the states for LCP and IPCP are now Opened and show
ppp summary shows the session phase as Network. Before you end
your monitoring session, you should remove the traceoptions configuration used for PPP debugging. While traceoptions do not impact the
router, leaving them in place creates unnecessary configuration and trace files.
To remove them, issue the delete protocols ppp traceoptions command in
configuration mode:
ps@dunkel-re0> configure
Entering configuration mode
[edit]
ps@dunkel-re0# delete protocols ppp traceoptions
[edit]
ps@dunkel-re0# commit and-quit
re0:
configuration check succeeds
re1:
commit complete
re0:
commit complete
Exiting configuration mode
You can stop monitoring of the ppp-log.txt file using the monitor stop
command. Note that monitoring is bound to your terminal and stops
automatically when you log out:
ps@dunkel-re0> monitor stop
Summary
This chapter used SONET and PPP to summarize how to monitor and
troubleshoot Layer 1 and 2 alarms and errors. SONET was used
because it is the most complex Layer 1 technology commonly used on
Juniper Networks devices, and PPP was used because of its popularity
and interoperability. However, the same approach can be used for any
other Layer 1 or 2 technology, including Ethernet.
That approach, be it for SONET or Ethernet or any other Layer 1 or 2
technology, is:
n Use the instrumentation and information afforded by the protocol.
n This information is contained within the output of the show
interfaces extensive command.
n While output from show interfaces extensive provides a snapshot, the monitor interface command provides information in
real time, updating its display of an interface's most important
statistics, warnings, errors, and attributes.
The following is an example of what the monitor command can
display on a UNIX terminal:
dunkel-re0                    Seconds: 36                    Time: 10:32:10
                                                             Delay: 0/0/19
BEST PRACTICE Ensure that Layer 1 settings such as FCS, payload-scrambling (both
SONET and SDH), speed, flow-control, and duplex-mode (all Ethernet) and Layer 2 settings such as authentication agree on both sides of the link. Also, understand the role your device plays in the network (this is more important
in SONET networks than Ethernet networks).
Chapter 6
Layer 3 Monitoring
RIP
IS-IS
OSPF
BGP
Summary
So far, this Day One booklet has discussed the monitoring and troubleshooting of Layer 1 and Layer 2 protocols. Layer 1 and Layer 2 issues
are typically isolated to contained systems whose configuration is only
significant locally.
For Layer 3 monitoring and troubleshooting, it is important to understand that an IP network is an interconnected, mutually dependent
system in which outages and changes can have a global impact.
This chapter discusses how to effectively monitor a Layer 3 IP network. As with Layer 1 and Layer 2 monitoring, most operators rely
on syslog, SNMP, and monitoring systems to identify problems. Layer
3 network monitoring is considerably more complicated than monitoring at Layer 1
and Layer 2 because there is more data to process and there
are more possible points of failure. The interconnected nature of an IP
network also makes it more difficult to quickly identify root causes.
On a Layer 3 IP network, monitoring includes:
n Logging configurations associated with commonly used protocols
to ensure that the NMS systems are receiving the appropriate logs.
n Logs and SNMP traps and some of the important, protocol-related
SNMP objects.
n Useful operational mode commands for commonly deployed IP
protocols, including RIP, IS-IS, OSPF, and BGP.
n The routing protocol process (RPD), which, among other things,
is responsible for logging and SNMP traps for routing protocol-related events.
NOTE Each protocol has its own set of logs and traps that are created for
events specific to the protocol. Some of these logs and traps are
automatic and some require configuration. For example, BGP will not
log neighbor state changes without configuring the log-updown statement.
RIP
Since RIP is a distance-vector protocol, it does not form adjacencies
like a link-state protocol such as OSPF or IS-IS. RIP advertisements are
simply multicast on the network and any interested router can process
the updates. Because of this passive learning, there is no neighbor state,
and as such no RIP related logs when RIP routers on a network
segment come up or go down.
Destination      Send    Receive    In
Address          Mode    Mode       Met
-----------      ----    -------    ---
224.0.0.9        mcast   both         1
For this example, the router is configured to send and receive RIP
updates on interface so-2/0/0, which can be confirmed in the configuration (don't forget to always get a second opinion):
ps@dunkel-re0> show configuration protocols rip
group core-interfaces {
neighbor so-2/0/0.0;
}
The configuration tells us that the router is doing what it was asked to
do. It would be useful to see if the router is sending and receiving RIP
updates to get a better understanding of its operation, and it just so
happens that there is a command for that, show rip statistics:
ps@dunkel-re0> show rip statistics
RIPv2 info: port 520; holddown 120s.
    rts learned  rts held down  rqsts dropped  resps dropped
              0              0              0              0
so-2/0/0.0: 1 routes learned; 0 routes advertised; timeout 180s; update interval 30s
Counter                         Total   Last 5 min  Last minute
-------                   -----------  -----------  -----------
Updates Sent                        0            0            0
Triggered Updates Sent              0            0            0
Responses Sent                      0            0            0
Bad Messages                        0            0            0
RIPv1 Updates Received              0            0            0
RIPv1 Bad Route Entries             0            0            0
RIPv1 Updates Ignored               0            0            0
RIPv2 Updates Received              7            7            3
RIPv2 Bad Route Entries             0            0            0
RIPv2 Updates Ignored               0            0            0
Authentication Failures             0            0            0
RIP Requests Received               0            0            0
RIP Requests Ignored                0            0            0
IS-IS
Since IS-IS is a link-state protocol, it has a richer set of monitoring
tools. This section discusses the logging, SNMP traps, and instrumentation tools that can be used on the router to monitor its IS-IS operation.
Logging
IS-IS logs information based on the finite state machine of the adjacency process. These logs are important to monitoring the network for
IP problems because your IGP is either directly related to network
reachability (if all of your routes are kept within your IGP), or indirectly related to connectivity by providing next-hop reachability for
BGP and as the underlying protocol for MPLS.
Additionally, IGP down events may (and likely do) trigger reconvergence within your network, which may lead to the overutilization of
other links or a suboptimal route. In any event, it is important to know
When the adjacency first comes up, the router logs an ADJUP syslog
message:
Jan 22 15:15:50 172.19.110.171 rpd[4550]: RPD_ISIS_ADJUP: IS-IS new L2 adjacency to
pilsener-re0 on so-2/0/0.0
If the router on the other side of the connection suffers a soft failure in
which the routing-engine goes down but the link stays up (causing a
loss of keepalives) the following message is logged:
Jan 22 15:31:25 172.19.110.171 rpd[4550]: RPD_ISIS_ADJDOWN: IS-IS lost L2 adjacency to
pilsener-re0 on so-2/0/0.0, reason: Aged Out
And finally, in the event that there is a level mismatch in the configuration, you see the following message:
Jan 22 15:32:33 172.19.110.171 rpd[4550]: RPD_ISIS_ADJDOWN: IS-IS lost L2 adjacency to
pilsener-re0 on so-2/0/0.0, reason: Level Mismatch
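If your NMS or a script consumes these syslog messages, the fields worth extracting are the event name, the level, the neighbor, the interface, and (for down events) the reason. The following Python sketch assumes each message arrives as a single line in the format shown above; the function name parse_isis_adjacency and the regular expression are illustrative only:

```python
import re

# Pattern inferred from the RPD_ISIS_ADJUP / RPD_ISIS_ADJDOWN samples in this
# section; other Junos releases may word these messages differently.
ISIS_ADJ = re.compile(
    r"(?P<event>RPD_ISIS_ADJ(?:UP|DOWN)): IS-IS (?:new|lost) (?P<level>L[12]) "
    r"adjacency to (?P<neighbor>\S+) on (?P<interface>[^\s,]+)"
    r"(?:, reason: (?P<reason>.+))?"
)


def parse_isis_adjacency(message):
    """Return a dict of adjacency fields, or None if the line is not an IS-IS adjacency event."""
    match = ISIS_ADJ.search(message)
    return match.groupdict() if match else None


down = ("Jan 22 15:31:25 172.19.110.171 rpd[4550]: RPD_ISIS_ADJDOWN: IS-IS lost "
        "L2 adjacency to pilsener-re0 on so-2/0/0.0, reason: Aged Out")
up = ("Jan 22 15:15:50 172.19.110.171 rpd[4550]: RPD_ISIS_ADJUP: IS-IS new "
      "L2 adjacency to pilsener-re0 on so-2/0/0.0")
print(parse_isis_adjacency(down))
print(parse_isis_adjacency(up))
```

An up event carries no reason, so that field comes back as None.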
SNMP Traps
Junos can also be configured to send SNMP traps when an IS-IS state
change occurs. By default, Junos does not send a trap when this
happens, so you need to configure an event-option to ensure that traps
are sent. Event-options allow you to specify an action to take when a
particular local log event occurs, as shown in the following event-options configuration:
ps@dunkel-re0> show configuration event-options
policy isisNbrStateChange {
events [ rpd_isis_adjdown rpd_isis_adjup ];
then {
raise-trap;
}
}
Level mismatch:
dunkel-fxp0.pslab.juniper.net [UDP: [172.19.110.171]:62313]: Trap , DISMAN-EVENT-MIB::sysUpTimeInstance = Timeticks: (27438490) 3 days, 4:13:04.90,
SNMPv2-MIB::snmpTrapOID.0 = OID: SNMPv2-SMI::enterprises.2636.4.12.0.1,
SNMPv2-SMI::enterprises.2636.3.35.1.1.1.2.7 = STRING: RPD_ISIS_ADJDOWN,
SNMPv2-SMI::enterprises.2636.3.35.1.1.1.3.7 = Hex-STRING: 07 DA 01 17 00 00 1C
00 2B 00 00 , SNMPv2-SMI::enterprises.2636.3.35.1.1.1.4.7 = INTEGER: 6,
SNMPv2-SMI::enterprises.2636.3.35.1.1.1.5.7 = INTEGER: 4,
SNMPv2-SMI::enterprises.2636.3.35.1.1.1.6.7 = Gauge32: 4550,
SNMPv2-SMI::enterprises.2636.3.35.1.1.1.7.7 = STRING: rpd,
SNMPv2-SMI::enterprises.2636.3.35.1.1.1.8.7 = STRING: dunkel-re0,
SNMPv2-SMI::enterprises.2636.3.35.1.1.1.9.7 = STRING: RPD_ISIS_ADJDOWN:
IS-IS lost L2 adjacency to pilsener-re0 on so-2/0/0.0, reason: Level Mismatch,
SNMPv2-MIB::snmpTrapEnterprise.0 = OID: SNMPv2-SMI::enterprises.2636.1.1.1.2.9
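The Hex-STRING in the trap above is the least readable field. Assuming it follows the standard SNMPv2-TC DateAndTime layout (two octets of year, then month, day, hour, minute, second, and deci-second, optionally followed by a UTC offset), a short Python sketch can turn it into a timestamp. The function name decode_date_and_time is hypothetical, and the assumption about the field's type should be confirmed against the Juniper syslog MIB:

```python
from datetime import datetime, timedelta, timezone


def decode_date_and_time(hex_string):
    """Decode an SNMPv2-TC DateAndTime value given as space-separated hex octets."""
    octets = bytes.fromhex(hex_string.replace(" ", ""))
    if len(octets) not in (8, 11):
        raise ValueError("DateAndTime must be 8 or 11 octets")
    year = int.from_bytes(octets[0:2], "big")
    month, day, hour, minute, second, deci = octets[2:8]
    tz = None
    if len(octets) == 11:
        # Octet 9 is an ASCII '+' or '-', followed by offset hours and minutes.
        sign = 1 if chr(octets[8]) == "+" else -1
        tz = timezone(sign * timedelta(hours=octets[9], minutes=octets[10]))
    # One deci-second is 100,000 microseconds.
    return datetime(year, month, day, hour, minute, second, deci * 100_000, tzinfo=tz)


print(decode_date_and_time("07 DA 01 17 00 00 1C 00 2B 00 00").isoformat())
```

Under that assumption, the value from the trap decodes to a date in January 2010 with a zero UTC offset.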
There are many reasons why an adjacency may not form but some of
the common causes include:
n Interface not configured for family ISO
n Missing or duplicate NSAP address (in Junos, the NSAP address is
configured as the family iso address on the lo0.0 interface)
n Level mismatch
n Area mismatch
n Authentication type or key failure
In each of these cases, confirm that the configuration matches on both
sides of the link.
Now that there is an adjacency, there should be some useful information in our IS-IS database. IS-IS maintains one database for Level 1
information and another for Level 2 information. As Level 1 has been
disabled, you should only see the Level 2 database. View the IS-IS
database by using the detail output option of the show isis database
command:
ps@dunkel-re0> show isis database detail
IS-IS level 1 link-state database:

IS-IS level 2 link-state database:

pilsener-re0.00-00 Sequence: 0x7, Checksum: 0xd28e, Lifetime: 1155 secs
   IS neighbor: pilsener-re0.00 is advertised by dunkel-re0 below
It might also be useful to monitor IS-IS for its overall operation. The
show isis overview and show isis statistics commands are useful
for IS-IS protocol monitoring. Much of the show isis overview output
is based on the configuration:
ps@dunkel-re0> show isis overview
Instance: master
Router ID: 10.200.7.1
Adjacency holddown: enabled
Maximum Areas: 3
LSP life time: 1200
Attached bit evaluation: enabled
SPF delay: 200 msec, SPF holddown: 5000 msec, SPF rapid runs: 3
IPv4 is enabled, IPv6 is enabled
Traffic engineering: enabled
Restart: Enabled
Restart duration: 210 sec
Helper mode: Enabled
Level 1
Internal route preference: 15
External route preference: 160
Wide metrics are enabled, Narrow metrics are enabled
Level 2
Internal route preference: 18
External route preference: 165
Wide metrics are enabled, Narrow metrics are enabled
This output shows the IP router-id, which can be configured under the
[edit routing-options] hierarchy level and defaults to the IP address
of the lo0.0 interface or, if a lo0.0 interface does not exist, the lowest
numbered IP interface.
Since the lo0.0 interface has been configured with IP address
10.200.7.1, the router has selected this address as our router ID. You
also see some default values for the LSP lifetime, attached bit evaluation, SPF options, MPLS traffic-engineering support, graceful-restart
parameters, as well as level specific configurations including route
preferences and metric styles. These are all configurable parameters.
ALERT! Understand what the default values should be in your network. Unless
Drops       Sent     Rexmit
    0         13          0
    0          7          0
    0         90          0
    0          1          0
    0          0          0
    0        111
This output shows that there have been 3 SPF runs, with the most
recent due to a new adjacency. As SPF is not constantly running, it
appears the IS-IS network is stable.
OSPF
From an operational perspective, OSPF and IS-IS are very similar. Both
form relationships with neighbors using a series of hello messages and a
finite state machine (FSM). Both create and synchronize link-state
databases and both derive routing information from these databases.
Logging
Just as logging is important to IS-IS, it is also important when running
OSPF. The following shows some of the OSPF logging available.
Neighbor up:
Jan 28 09:52:34 172.19.110.171 rpd[4550]: RPD_OSPF_NBRUP: OSPF neighbor 18.32.74.2
(so-2/0/0.0) state changed from Init to ExStart due to 2WayRcvd (event reason: neighbor
detected this router)
Jan 28 09:52:34 172.19.110.171 rpd[4550]: RPD_OSPF_NBRUP: OSPF neighbor 18.32.74.2
(so-2/0/0.0) state changed from Exchange to Full due to ExchangeDone (event reason: DBD
exchange of slave completed)
Link-down:
Jan 28 09:54:38 172.19.110.171 rpd[4550]: RPD_OSPF_NBRDOWN: OSPF neighbor 18.32.74.2
(so-2/0/0.0) state changed from Full to Down due to KillNbr (event reason: interface
went down)
SNMP Traps
Unlike IS-IS, Junos by default sends SNMP traps when there is an
adjacency change for OSPF. As with IS-IS, let's show these different
traps.
Neighbor up:
dunkel-re0.juniper.net [UDP: [172.19.110.171]:62313]: Trap , DISMAN-EVENT-MIB::sysUpTimeInstance = Timeticks: (77178300) 8 days, 22:23:03.00,
SNMPv2-MIB::snmpTrapOID.0 = OID:
SNMPv2-SMI::mib-2.14.16.2.2,
SNMPv2-SMI::mib-2.14.1.1.0 = IpAddress: 10.200.7.1,
SNMPv2-SMI::mib-2.14.10.1.1.18.32.74.2.0 = IpAddress: 18.32.74.2,
SNMPv2-SMI::mib-2.14.10.1.2.18.32.74.2.0 = INTEGER: 0,
SNMPv2-SMI::mib-2.14.10.1.3.18.32.74.2.0 = IpAddress: 10.200.7.2,
SNMPv2-SMI::mib-2.14.10.1.6.18.32.74.2.0 = INTEGER: 8,
SNMPv2-MIB::snmpTrapEnterprise.0 = OID:
SNMPv2-SMI::enterprises.2636.1.1.1.2.9
SNMP traps are admittedly cryptic (and there are open source tools
such as snmptt to help decode them). Nonetheless, the information
you need is in there. In the above trap, the important portion is the
value of the object mib-2.14.10.1.6.[neighbor address].0, which is
ospfNbrState in the OSPF MIB. Each potential value means:
1 - down
2 - attempt
3 - init
4 - twoWay
5 - exchangeStart
6 - exchange
7 - loading
8 - full
Based on the information from this SNMP trap, the OSPF neighbor is
in state 8 (full).
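The mapping above lends itself to a small lookup table. The following Python sketch pulls the ospfNbrState varbind out of the text of a trap and translates the integer; the function name ospf_neighbor_state is hypothetical, and the parsing assumes the trap is rendered as text in the same style as the examples in this section:

```python
import re

# Integer-to-name mapping taken from the list above.
OSPF_NBR_STATE = {
    1: "down", 2: "attempt", 3: "init", 4: "twoWay",
    5: "exchangeStart", 6: "exchange", 7: "loading", 8: "full",
}

# ospfNbrState varbind: mib-2.14.10.1.6.<neighbor address>.<index> = INTEGER: <n>
NBR_STATE_VARBIND = re.compile(
    r"mib-2\.14\.10\.1\.6\.(?P<index>[\d.]+) = INTEGER: (?P<state>\d+)"
)


def ospf_neighbor_state(trap_text):
    """Return (neighbor_address, state_name) from trap text, or None if absent."""
    match = NBR_STATE_VARBIND.search(trap_text)
    if not match:
        return None
    # The first four components of the index are the neighbor address.
    address = ".".join(match.group("index").split(".")[:4])
    state = OSPF_NBR_STATE.get(int(match.group("state")), "unknown")
    return address, state


print(ospf_neighbor_state("SNMPv2-SMI::mib-2.14.10.1.6.18.32.74.2.0 = INTEGER: 8,"))
```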
Link-down:
dunkel-re0.juniper.net [UDP: [172.19.110.171]:62313]: Trap , DISMAN-EVENT-MIB::sysUpTimeInstance = Timeticks: (77303649) 8 days, 22:43:56.49,
SNMPv2-MIB::snmpTrapOID.0 = OID:
SNMPv2-SMI::mib-2.14.16.2.2,
SNMPv2-SMI::mib-2.14.1.1.0 = IpAddress: 10.200.7.1,
SNMPv2-SMI::mib-2.14.10.1.1.18.32.74.2.0 = IpAddress: 18.32.74.2,
SNMPv2-SMI::mib-2.14.10.1.2.18.32.74.2.0 = INTEGER: 0,
SNMPv2-SMI::mib-2.14.10.1.3.18.32.74.2.0 = IpAddress: 10.200.7.2,
SNMPv2-SMI::mib-2.14.10.1.6.18.32.74.2.0 = INTEGER: 1,
SNMPv2-MIB::snmpTrapEnterprise.0 = OID:
SNMPv2-SMI::enterprises.2636.1.1.1.2.9
For this example, the trap indicates that neighbor 18.32.74.2 transitioned to state 1 (down). Received at the same time was a link-down
trap that helped us to understand the reason for the OSPF down trap:
10.200.7.1: Link Down Trap (0) Uptime: 8 days, 22:43:56.49, IF-MIB::ifIndex.179 =
INTEGER: 179, IF-MIB::ifAdminStatus.179 = INTEGER: up(1), IF-MIB::ifOperStatus.179 =
INTEGER: down(2), IF-MIB::ifName.179 = STRING: so-2/0/0.0
SNMP Polling
All OSPF object identifiers (OIDs) fall under the OSPF management
information base (MIB), which has an OID of 1.3.6.1.2.1.14. While
there are many OSPF objects, the most commonly monitored object is
the OSPF neighbor state (ospfNbrState). This OID uses the format
1.3.6.1.2.1.14.10.1.6.[neighbor address]. Note that this is the neighbor address and not the neighbor's router ID, as you could have
multiple OSPF adjacencies to the same neighbor.
nms-1> snmpwalk -v2c -c tr4pp15t cartman 1.3.6.1.2.1.14.10.1.6
SNMPv2-SMI::mib-2.14.10.1.6.18.32.74.2.0 = INTEGER: 8
The value of this object follows the same mapping as discussed previously with 8 being mapped as full. So the adjacency with the neighbor at 18.32.74.2 is in a full state.
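When a router has many neighbors, it is usually more useful to list only those that are not full. The following Python sketch filters the text output of a walk like the one above; the function name neighbors_not_full is hypothetical, and the second sample row (18.32.74.6 in state 3) is invented purely to show a neighbor that is not full:

```python
def neighbors_not_full(walk_output):
    """Return {neighbor_address: state_value} for ospfNbrState rows whose value is not 8 (full)."""
    prefix = "mib-2.14.10.1.6."
    problems = {}
    for line in walk_output.splitlines():
        oid, sep, value = line.partition(" = INTEGER: ")
        if not sep or prefix not in oid:
            continue  # not an ospfNbrState row
        index = oid.split(prefix, 1)[1].split(".")
        address = ".".join(index[:4])  # first four index components: neighbor address
        state = int(value.strip())
        if state != 8:
            problems[address] = state
    return problems


walk = (
    "SNMPv2-SMI::mib-2.14.10.1.6.18.32.74.2.0 = INTEGER: 8\n"
    "SNMPv2-SMI::mib-2.14.10.1.6.18.32.74.6.0 = INTEGER: 3"  # hypothetical row
)
print(neighbors_not_full(walk))
```

An empty result means every polled adjacency is in the full state.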
State     ID               Pri  Dead
Full      10.200.7.2       128    39

State     ID               Pri  Dead
Init      10.200.7.2       128    34

State     ID               Pri  Dead
2Way      10.200.7.2       128    37

State     ID               Pri  Dead
Full      10.200.7.2       128    35
For all of these cases, confirm that the configuration matches on both
sides of the link.
Now that there is an OSPF neighborship in full state, there is some
useful information in the OSPF database. Like the IS-IS protocol's
database separation (based on levels), OSPF creates separate databases
for each area the router is connected to. For this example, the router is
only connected to area 0, so there's only information in the area 0 database as the following shows:
Here, routers Dunkel and Pilsener are both advertising type 1 (router)
LSAs, which include information on the connections of each router.
Since both routers only have 1 OSPF connection (to each other), their
router LSAs look similar. Both include their own router-ids as well as
the connection between them. With information in the databases, the
routers can derive routing information. Dunkel has learned 10.200.7.2
(Pilsener's loopback) through OSPF and has installed a route over the
OSPF connection to that destination.
You can identify installed routes by those denoted with a plus sign or
an asterisk, as displayed in the output of the show route protocol
ospf command:
ps@dunkel-re0> show route protocol ospf
inet.0: 14 destinations, 17 routes (14 active, 0 holddown, 0 hidden)
+ = Active Route, - = Last Active, * = Both
10.200.7.2/32
18.32.74.0/30
224.0.0.5/32
You can also see that there is an OSPF route to 18.32.74.0/30, which is
the network connecting Dunkel to Pilsener. The reason this route is not
installed is that Dunkel has a directly connected (read: better) route to
this network. Let's show its route:
*[Direct/0] 00:15:06
> via so-2/0/0.0
[OSPF/10] 00:15:05, metric 1
> via so-2/0/0.0
ospf
           Total               Last 5 seconds
     Sent    Received        Sent    Received
        7           5           0           0
        3           2           0           0
        0           0           0           0
        4           3           0           0
        2           3           0           0

retransmitted      :      0, last 5 seconds :      0
flooded            :      2, last 5 seconds :      0
flooded high-prio  :      0, last 5 seconds :      0
retransmitted      :      2, last 5 seconds :      0
transmitted to nbr :      0, last 5 seconds :      0
requested          :      0, last 5 seconds :      0
acknowledged       :      2, last 5 seconds :      0

0
0
0
0
Receive errors:
None
BGP
BGP behaves like a distance-vector protocol, but it also forms neighbor
relationships with peers. There is a wealth of syslogging, SNMP
polling and trapping, and instrumentation options to monitor the state
and operation of BGP.
Let's first review some of the syslog messages associated with BGP.
Logging
BGP syslog messages most commonly pertain to the state of a BGP
session. On Juniper Networks routers, you must configure BGP to log
session state changes by issuing the set protocols bgp log-updown
command. This command can also be used at the BGP group or
neighbor level, but is most useful at the protocol level.
Here, a simple internal BGP session which establishes peering between
Dunkel and Pilsener is shown. Also shown are the resulting syslog
messages of the session going up and going down:
Neighbor up:
Jan 29 11:47:44 172.19.110.171 rpd[4550]: RPD_BGP_NEIGHBOR_STATE_CHANGED: BGP peer
10.200.7.2 (Internal AS 1) changed state from OpenConfirm to Established (event
RecvKeepAlive)
Neighbor down due to peer not being configured on the remote side:
Jan 29 11:50:38 172.19.110.171 rpd[4550]: bgp_read_v4_message: NOTIFICATION received
from 10.200.7.2 (Internal AS 1): code 6 (Cease) subcode 3 (Peer Unconfigured)
Jan 29 11:50:38 172.19.110.171 rpd[4550]: RPD_BGP_NEIGHBOR_STATE_CHANGED: BGP peer
10.200.7.2 (Internal AS 1) changed state from Established to Idle (event RecvNotify)
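For alerting purposes, the transition that matters most is a session leaving the Established state. The following Python sketch extracts the peer, the old and new states, and the triggering event from a state-change message; it assumes the message arrives on a single line in the format shown above, and the function name parse_bgp_state_change is illustrative:

```python
import re

# Pattern inferred from the RPD_BGP_NEIGHBOR_STATE_CHANGED samples above.
BGP_STATE_CHANGE = re.compile(
    r"RPD_BGP_NEIGHBOR_STATE_CHANGED: BGP peer (?P<peer>\S+) "
    r"\((?P<type>\w+) AS (?P<asn>\d+)\) changed state from "
    r"(?P<old>\w+) to (?P<new>\w+) \(event (?P<event>\w+)\)"
)


def parse_bgp_state_change(message):
    """Return the parsed fields plus a 'went_down' flag, or None if no match."""
    match = BGP_STATE_CHANGE.search(message)
    if not match:
        return None
    fields = match.groupdict()
    fields["went_down"] = fields["old"] == "Established" and fields["new"] != "Established"
    return fields


msg = ("Jan 29 11:50:38 172.19.110.171 rpd[4550]: RPD_BGP_NEIGHBOR_STATE_CHANGED: "
       "BGP peer 10.200.7.2 (Internal AS 1) changed state from Established to Idle "
       "(event RecvNotify)")
print(parse_bgp_state_change(msg))
```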
SNMP Traps
BGP creates SNMP traps for neighbor state changes and these are
contained in the mib-2.15.3.1.2.[neighbor address] object. The state
is communicated using an integer with the following mappings:
1 - idle
2 - connect
3 - active
4 - opensent
5 - openconfirm
6 - established
The following shows some examples of BGP traps based on neighbor
state change, focusing on our BGP peer at 10.200.7.2.
Neighbor up:
cartman-fxp0.pslab.juniper.net [UDP: [172.19.110.171]:62313]: Trap , DISMAN-EVENT-MIB::sysUpTimeInstance = Timeticks: (87163316) 10 days, 2:07:13.16,
SNMPv2-MIB::snmpTrapOID.0 = OID:
SNMPv2-SMI::mib-2.15.7.1,
SNMPv2-SMI::mib-2.15.3.1.14.10.200.7.2 = Hex-STRING: 04 00 ,
SNMPv2-SMI::mib-2.15.3.1.2.10.200.7.2 = INTEGER: 6,
SNMPv2-MIB::snmpTrapEnterprise.0 = OID:
SNMPv2-SMI::enterprises.2636.1.1.1.2.9
Neighbor down:
cartman-re0.juniper.net [UDP: [172.19.110.171]:62313]: Trap , DISMAN-EVENT-MIB::sysUpTimeInstance = Timeticks: (87321153) 10 days, 2:33:31.53,
SNMPv2-MIB::snmpTrapOID.0 = OID:
SNMPv2-SMI::mib-2.15.7.2,
SNMPv2-SMI::mib-2.15.3.1.14.10.200.7.2 = Hex-STRING: 06 07 ,
SNMPv2-SMI::mib-2.15.3.1.2.10.200.7.2 = INTEGER: 1,
SNMPv2-MIB::snmpTrapEnterprise.0 = OID:
SNMPv2-SMI::enterprises.2636.1.1.1.2.9
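Both traps also carry a two-octet Hex-STRING in the mib-2.15.3.1.14.[neighbor address] object, which in the standard BGP4-MIB is bgpPeerLastError: the first octet is the error code and the second is the subcode of the last NOTIFICATION seen on the session. The following Python sketch decodes it using the error code names from the BGP specification (RFC 4271); the function name decode_last_error is hypothetical, and subcodes are returned as raw numbers rather than interpreted:

```python
# NOTIFICATION error codes as defined in RFC 4271.
BGP_ERROR_CODES = {
    1: "Message Header Error",
    2: "OPEN Message Error",
    3: "UPDATE Message Error",
    4: "Hold Timer Expired",
    5: "Finite State Machine Error",
    6: "Cease",
}


def decode_last_error(hex_string):
    """Return (code, subcode, code_name) for a two-octet bgpPeerLastError value."""
    octets = bytes.fromhex(hex_string.replace(" ", ""))
    if len(octets) != 2:
        raise ValueError("bgpPeerLastError must be exactly two octets")
    code, subcode = octets
    return code, subcode, BGP_ERROR_CODES.get(code, "Unknown")


print(decode_last_error("04 00 "))  # value from the neighbor-up trap
print(decode_last_error("06 07 "))  # value from the neighbor-down trap
```

A value of 00 00 would mean that no error has been recorded for the peer.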
To gain useful information on each peer, you must first get the peer's
SNMP index. The router creates a BGP peer index table, and this
Through SNMP, you can see that seven total routes are received, one of
which is accepted and six of which have been rejected, from the peer
with index 0 (10.200.7.2). Use the show bgp summary command to
confirm:
ps@dunkel-re0> show bgp summary
Groups: 1 Peers: 1 Down peers: 0
Table          Tot Paths  Act Paths Suppressed    History Damp State    Pending
inet.0                 7          1          0          0          0          0
Peer             AS      InPkt     OutPkt    OutQ   Flaps Last Up/Dwn State|#Active/Received/Damped...
10.200.7.2        1        130        123       0      11       53:55 1/7/0                0/0/0
Here, seven routes are learned from 10.200.7.2 and one has been
accepted. If the session was not established, its state would be displayed in place of the number of prefixes active/received/dampened,
as shown with our session to 10.200.7.3. Also shown is that both
peers are in AS 1 and that the session to 10.200.7.2 has been up for
1:33:38 and the session to 10.200.7.3 has been down for nearly two
minutes. Understanding the length of time a session has been up or
down can help correlate the state change to network events or
configurations.
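Because the last column of show bgp summary holds either prefix counts or a state name, a script that reads it has to handle both cases. The following Python sketch does so for a single field such as 1/7/0; the function name parse_state_field is hypothetical, and the sketch assumes the active/received/damped layout shown above:

```python
def parse_state_field(field):
    """Interpret the State|#Active/Received/Damped field of show bgp summary."""
    parts = field.split("/")
    if len(parts) == 3 and all(part.isdigit() for part in parts):
        active, received, damped = (int(part) for part in parts)
        return {"established": True, "active": active,
                "received": received, "damped": damped}
    # Anything else is treated as a session state name such as Active or Idle.
    return {"established": False, "state": field}


print(parse_state_field("1/7/0"))
print(parse_state_field("Active"))
```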
The show bgp neighbor command provides additional information
for each BGP peer. A particular peer can be specified with this command (for example, show bgp neighbor 10.200.7.2) to display
information specific to that peer; omitting this option displays
information for all peers:
ps@dunkel-re0> show bgp neighbor 10.200.7.2
Peer: 10.200.7.2+53667 AS 1
Local: 10.200.7.1+179 AS 1
Type: Internal
State: Established
Flags: <Sync>
Last State: OpenConfirm Last Event: RecvKeepAlive
Last Error: Cease
Options: <Preference LocalAddress LogUpDown PeerAS Refresh>
Local Address: 10.200.7.1 Holdtime: 90 Preference: 170
Number of flaps: 11
Last flap event: RecvNotify
Error: Hold Timer Expired Error Sent: 1 Recv: 0
Error: Cease Sent: 1 Recv: 10
Peer ID: 10.200.7.2
Local ID: 10.200.7.1
Active Holdtime: 90
Keepalive Interval: 30
Peer index: 0
BFD: disabled, down
Octets 5730
Octets 5308
10.200.7.2/32
This output is consistent with the show bgp neighbor command in that
five routes are being advertised to our peer at 10.200.7.2. It also shows
that the routes have a local preference of 100 (the default) and that the
AS path is I, meaning Internal, which means that the local AS is
originating the route.
The show route receive-protocol bgp [neighbor] command displays
the routes being received from our neighbor. You would expect to see
seven routes based on the previous output of show bgp neighbor:
0 hidden)
AS path
I
I
I
I
I
I
I
This makes sense and aligns not only with expectations based on the
output from show bgp neighbor, but also based on the output from
show route protocol bgp: received seven routes and using one (the
route to 112.74.9.0/30).
Summary
Each protocol has its own set of logs and traps that are created for
events specific to the protocol. Some of these logs and traps require
configuration.
Whenever you monitor at Layer 3, it is extremely important to
remember that an IP network is an interconnected system and that it is
dependent on lower layers in the OSI model. Collecting system logs
and SNMP traps is instrumental in allowing both technicians and
NMS systems to quickly identify the root cause of problems.
Problems experienced at Layer 3 can easily be caused by problems at
Layers 1 and 2. This reinforces the message that developing a troubleshooting strategy which begins by narrowing down the likely causes
before you even issue a single command, following the OSI model, and
effectively utilizing protocol instrumentation is key to quickly isolating
and resolving network issues.
Chapter 7
Layer 3 Troubleshooting
Outage Types
Troubleshooting Packet Loss
Troubleshooting Routing Loops
Troubleshooting Circuit Overutilization
Troubleshooting Route Oscillation
Troubleshooting Latency
Summary
This chapter on Layer 3 troubleshooting attempts to convey a methodology and philosophy for troubleshooting an IP network, and thus
returns to guidelines presented in Chapter 1, if only to establish a
process that you can apply over and over again.
It is nearly impossible to describe every IP problem you may encounter,
so instead, this chapter presents effective ways of approaching network
problems that can lead to quick problem isolation and resolution.
Outage Types
Let's begin by categorizing the different types of IP outages:
n Packet Loss: Packet loss is the failure to deliver a packet to a
destination and can be caused by many different problems. The
most common culprits are a down circuit, lack of (correct) routing
information, routing loops, and over-utilized circuits (with or
without class-of-service). Security devices also frequently cause
(intentional) dropped packets, but firewalls are outside the scope
of this book.
n Latency: Latency is a delay in the delivery of packets, which can be
caused by suboptimal routing or class-of-service queuing. Jitter, a
related problem, is a variance in latency and can be problematic in
voice and video over IP environments. Jitter is often caused by
class-of-service (or lack thereof).
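To make the difference between latency and jitter concrete, the following Python sketch computes both from a list of round-trip times. It uses one simple definition of jitter (the mean absolute difference between consecutive samples); other definitions exist, and the sample values are invented for illustration:

```python
from statistics import mean


def latency_and_jitter(rtts_ms):
    """Return (mean latency, mean absolute change between consecutive samples) in ms."""
    if len(rtts_ms) < 2:
        raise ValueError("need at least two samples to estimate jitter")
    deltas = [abs(later - earlier) for earlier, later in zip(rtts_ms, rtts_ms[1:])]
    return mean(rtts_ms), mean(deltas)


# Hypothetical round-trip times in milliseconds.
samples = [10.0, 14.0, 12.0]
latency, jitter = latency_and_jitter(samples)
print(f"latency {latency} ms, jitter {jitter} ms")
```

Two paths can have the same average latency and very different jitter, which is why voice and video deployments usually track both.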
NOTE One important point is that nothing is more important when trouble-
Circuit Outages
A complete outage is sometimes the simplest type of outage to resolve.
When troubleshooting a complete outage, isolation is the most
important aspect, since the resolution likely involves a configuration
change, hardware restart or swap, or a call to the Telco vendor.
The most useful tool for troubleshooting this type of outage is traceroute. Traceroute sends a series of packets towards a destination with
an incrementing time-to-live (TTL) value starting at 1. When a router
receives an IP packet, it is required to decrement the TTL. When the
TTL reaches zero, the router must send an ICMP time-exceeded
message. This ICMP message provides the sending router with the IP
address of that particular hop.
Since this process is repeated by every hop in the path, the sending
router learns the IP address of every hop in the path. This information
can be invaluable to an operator attempting to identify the root cause
of an outage.
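The TTL mechanism described above can be modeled in a few lines. The following Python sketch is a simulation only (it sends no packets): each probe starts with a TTL one higher than the last, every transit hop decrements it, and the hop where it reaches zero is recorded as the responder. The function name simulate_traceroute is hypothetical, and the path is a made-up list of hop addresses whose last entry is the destination:

```python
def simulate_traceroute(path, max_hops=30):
    """Return the addresses that would answer successive probes along path."""
    responders = []
    destination = path[-1]
    for initial_ttl in range(1, max_hops + 1):
        ttl = initial_ttl
        for hop in path:
            if hop == destination:
                responders.append(hop)  # the destination answers the probe itself
                return responders
            ttl -= 1                    # a transit router decrements the TTL
            if ttl == 0:
                responders.append(hop)  # TTL expired: time-exceeded from this hop
                break
    return responders


# Hypothetical three-hop path; the last address is the destination.
path = ["192.0.2.1", "198.51.100.1", "203.0.113.10"]
print(simulate_traceroute(path))
```

Lowering max_hops shows why a trace stops short of the destination when the hop limit is smaller than the path length.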
NOTE A common misconception is that the last responding hop in a traceroute
is the cause of the problem. If it responds, it means that your host
has reachability to that router and that router has reachability back to
your host.
The problem generally lies between the last responsive hop and the
first non-responsive hop or on the first non-responsive hop. The main
point is that using this single command, you can immediately discover
where you need to focus your effort.
Consider the network shown in Figure 7-1:
[Figure 7-1: network diagram showing the Corporate LAN, routers dunkel and pilsener (J2300), the Internet, and the default routes]
This network is comprised of two routers within the site, one (dunkel,
our old friend) aggregating our corporate LAN and one (pilsener, a
favorite beverage) connecting to the service provider. Pilsener has a
static default route with a next-hop to the service provider and is distributing this route into our site's IGP, OSPF. This provides dunkel with
a default route with pilsener as the next-hop. The servers and hosts on
the LAN have static default routes with dunkel as the next-hop.
Let's begin by viewing a traceroute to a destination on the Internet
when all systems are working as expected:
server% traceroute 4.2.2.1
traceroute to 4.2.2.1 (4.2.2.1), 30 hops max, 40 byte packets
1 18.32.75.1 (18.32.75.1) 2.617 ms 1.690 ms 2.851 ms Dunkel
2 18.32.74.6 (18.32.74.6) 3.386 ms 3.370 ms 5.570 ms Pilsener
3 4.10.33.2 (4.10.33.2) 13.513 ms 3.905 ms 5.060 ms Service provider hop 1
4 4.1.18.21 (4.1.18.21) 3.778 ms 5.237 ms 5.413 ms Service provider hop 2
5 4.2.2.1 (4.2.2.1) 10.876 ms 12.568 ms 5.991 ms Destination
The most common point of failure in this network is the link to the
service provider, as shown in Figure 7-2. Let's simulate this outage and
repeat our traceroute.
[Figure 7-2: the same network, with the link from pilsener to the service provider down]
In the scenario shown in Figure 7-2, dunkel no longer has a route to
the destination, so the show route command returns nothing:
ps@dunkel> show route 4.2.2.1
ps@dunkel>
Because pilsener's static default route to the service provider
disappears when the link goes down, it no longer distributes this route into
OSPF, which means dunkel no longer has a route to 4.2.2.1. So dunkel
responds with a destination host unreachable error message, which
is indicated by the !H characters in the final line of our traceroute:
server% traceroute 4.2.2.1
traceroute to 4.2.2.1 (4.2.2.1), 30 hops max, 40 byte packets
1 18.32.75.1 (18.32.75.1) 1.983 ms 2.440 ms 2.414 ms
2 18.32.75.1 (18.32.75.1) 2.883 ms !H 4.136 ms !H 3.799 ms !H
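Spotting the unreachable markers can also be scripted. The sketch below is illustrative only: the function name is hypothetical, and it assumes the usual traceroute convention of printing !H, !N, or !P after a probe time when a host, network, or protocol unreachable message comes back.

```python
# Annotations many traceroute implementations print after a probe time
# when an ICMP destination-unreachable message was received.
UNREACHABLE_FLAGS = {
    "!H": "host unreachable",
    "!N": "network unreachable",
    "!P": "protocol unreachable",
}


def unreachable_hops(output):
    """Return (hop_number, address, [reasons]) for hops that reported unreachables."""
    results = []
    for line in output.splitlines():
        fields = line.split()
        if len(fields) < 2 or not fields[0].isdigit():
            continue  # not a hop line
        flags = sorted({f for f in fields[2:] if f in UNREACHABLE_FLAGS})
        if flags:
            reasons = [UNREACHABLE_FLAGS[f] for f in flags]
            results.append((int(fields[0]), fields[1], reasons))
    return results


TRACE_UNREACHABLE = """traceroute to 4.2.2.1 (4.2.2.1), 30 hops max, 40 byte packets
 1 18.32.75.1 (18.32.75.1) 1.983 ms 2.440 ms 2.414 ms
 2 18.32.75.1 (18.32.75.1) 2.883 ms !H 4.136 ms !H 3.799 ms !H
"""

print(unreachable_hops(TRACE_UNREACHABLE))
# [(2, '18.32.75.1', ['host unreachable'])]
```

The address in the result is the router that sent the unreachable message, which here is dunkel, the router that lost its default route.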
Figures 7-3 & 7-4 Ethernet Link and Outage to the Service Provider
The following shows how the traceroute might appear given this type
of failure:
server% traceroute 4.2.2.1
traceroute to 4.2.2.1 (4.2.2.1), 30 hops max, 40 byte packets
1 18.32.75.1 (18.32.75.1) 2.891 ms 0.594 ms 1.595 ms
2 18.32.74.6 (18.32.74.6) 2.425 ms 2.544 ms 2.642 ms
3 * * *
The trace has made it to Pilsener. Because our service provider
Ethernet connection terminates on a switch (which is up) rather than
on the service provider router (which is not), the connection itself stays
up. This means Pilsener's static default route remains valid and it
continues to distribute the default route into OSPF:
*[Static/5] 00:00:33
> to 192.168.14.10 via fe-0/0/1.0
Since the traceroute is starring out between Pilsener and the next-hop
(the service provider), Pilsener is a good place to begin our
investigation. The next steps would be to log into Pilsener, issue the show
route command shown above, and attempt to ping the remote side of
the connection (192.168.14.10). And, when that fails, it's time to
contact the service provider.
This failure may be harder for an NMS to catch because there is
no link-down event. The only way a monitoring system could catch
this type of failure is if it were also performing some type of probing
for performance management and connectivity assurance. Many NMS
platforms, such as Nagios and WhatsUp Gold, can ping-monitor
different destinations for this purpose.
ps@router-3> ping 192.168.14.10
PING 192.168.14.10 (192.168.14.10): 56 data bytes
^C
--- 192.168.14.10 ping statistics ---
5 packets transmitted, 0 packets received, 100% packet loss
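A probe-based monitor ultimately reduces ping output like this to a pass or fail decision. Here is a minimal sketch of that step in Python. The function names and the loss threshold are assumptions made for illustration and are not taken from Nagios, WhatsUp Gold, or Junos.

```python
import re

# Matches the summary line printed by ping, for example:
# "5 packets transmitted, 0 packets received, 100% packet loss"
STATS_RE = re.compile(
    r"(\d+) packets transmitted, (\d+) (?:packets )?received, "
    r"(\d+(?:\.\d+)?)% packet loss"
)


def ping_loss(output):
    """Return (sent, received, loss_percent), or None if no summary line is found."""
    match = STATS_RE.search(output)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2)), float(match.group(3))


def probe_failed(output, max_loss_percent=0.0):
    """Treat a missing summary or loss above the (assumed) threshold as a failure."""
    stats = ping_loss(output)
    return stats is None or stats[2] > max_loss_percent


PING_OUTPUT = """PING 192.168.14.10 (192.168.14.10): 56 data bytes
--- 192.168.14.10 ping statistics ---
5 packets transmitted, 0 packets received, 100% packet loss
"""

print(ping_loss(PING_OUTPUT))     # (5, 0, 100.0)
print(probe_failed(PING_OUTPUT))  # True
```

A real monitor would run the probe on a schedule and alert only after several consecutive failures, which is a policy choice outside this sketch.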
[Figure: dunkel, pilsener, and altbier, with Corporate LANs 18.32.76/24 and 18.32.77/24 and default routes toward the Internet]
One outage that can lead to a routing loop in this scenario is if Altbier's
link to the 18.32.76.0/24 network fails. This causes a routing loop
because Pilsener has no knowledge of the outage on Altbier and
continues to use the static route for the 18.32.76.0/23 network. When
the packet reaches Altbier, Altbier does not have a route to the
18.32.76.0/24 network because its interface to that network is down.
The next best route it has is the default route towards Pilsener, which
then sends the packet right back to Altbier because of its static route.
server% traceroute 18.32.76.7
traceroute to 18.32.76.7 (18.32.76.7), 30 hops max, 40 byte packets
1 18.32.75.1 (18.32.75.1) 6.156 ms 2.181 ms 1.534 ms dunkel
2 18.32.74.6 (18.32.74.6) 9.631 ms 10.610 ms 3.273 ms pilsener
3 18.32.74.62 (18.32.74.62) 3.315 ms 3.728 ms 6.280 ms altbier
4 18.32.74.61 (18.32.74.61) 4.833 ms 8.704 ms 6.481 ms pilsener
5 18.32.74.62 (18.32.74.62) 7.148 ms 7.928 ms 3.979 ms altbier
6 18.32.74.61 (18.32.74.61) 3.779 ms 4.372 ms 3.427 ms pilsener
7 18.32.74.62 (18.32.74.62) 4.701 ms 4.005 ms 9.300 ms altbier
8 18.32.74.61 (18.32.74.61) 7.323 ms 7.616 ms 2.357 ms pilsener
9 18.32.74.62 (18.32.74.62) 3.373 ms 3.322 ms 10.979 ms altbier
10 18.32.74.61 (18.32.74.61) 3.315 ms 10.498 ms 4.453 ms pilsener
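The signature of a loop in the output above is the same pair of addresses alternating hop after hop. A short Python sketch can flag that pattern in saved traceroute text. The function name is hypothetical and the parsing assumes the hop-number-first layout shown here.

```python
def repeated_addresses(output):
    """Return addresses that answer at more than one hop, in first-repeat order."""
    seen = set()
    repeated = []
    for line in output.splitlines():
        fields = line.split()
        if len(fields) < 2 or not fields[0].isdigit() or fields[1] == "*":
            continue  # header line, blank line, or a hop whose first probe was lost
        address = fields[1]
        if address in seen and address not in repeated:
            repeated.append(address)
        seen.add(address)
    return repeated


LOOP_TRACE = """traceroute to 18.32.76.7 (18.32.76.7), 30 hops max, 40 byte packets
 1 18.32.75.1 (18.32.75.1) 6.156 ms 2.181 ms 1.534 ms
 2 18.32.74.6 (18.32.74.6) 9.631 ms 10.610 ms 3.273 ms
 3 18.32.74.62 (18.32.74.62) 3.315 ms 3.728 ms 6.280 ms
 4 18.32.74.61 (18.32.74.61) 4.833 ms 8.704 ms 6.481 ms
 5 18.32.74.62 (18.32.74.62) 7.148 ms 7.928 ms 3.979 ms
 6 18.32.74.61 (18.32.74.61) 3.779 ms 4.372 ms 3.427 ms
"""

print(repeated_addresses(LOOP_TRACE))  # ['18.32.74.62', '18.32.74.61']
```

Two addresses that keep repeating point at the two routers forwarding the packet to each other. A single repeated address accompanied by !H is not a loop, so the result is a hint to investigate rather than a verdict.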
[Figure: the same three-router network, with an External BGP session between pilsener and the service provider]
Over this session between Pilsener and our service provider, let's
announce an aggregate route for the site's network (18.32.72.0/21)
and receive the full Internet routing table, as follows:
ps@pilsener-re0> show bgp summary
Groups: 1 Peers: 1 Down peers: 0
Table          Tot Paths  Act Paths Suppressed
inet.0             10007      10007          0
Peer             AS      InPkt     OutPkt  Received/Accepted/Damped...
4.10.33.2      3356      10019         29  11:14 10007/10007/10007/0 0/0/0/0
Troubleshooting Latency
Latency is the amount of time it takes for a packet to get from a sender
to the receiver. The root cause of a latency problem can be hard to
isolate. This is because latency can be inconsistent, can be
limited to certain types of traffic, and may not be easily reproducible.
An understanding of the topology of your network, the protocols used,
the current state of your network, and any features enabled (such as
class-of-service) can help you to resolve a latency problem (and loss, as
discussed later).
Latency problems tend to cause the most trouble for real-time traffic
applications, especially voice and video. Users that are reporting a
problem may not even know that latency is the cause. The problem
report may simply state that there are gaps in calls or artifacts in video.
With normal traffic, such as HTTP, a latent packet isn't a big deal as
the receiver simply waits until it receives the next packet and this is
usually imperceptible to the user. However, a packet that is too latent
in a voice or video stream means that packet is lost as far as the receiver
is concerned.
The first step in troubleshooting a latency problem is to identify
whether or not the traffic is taking the optimal path. This obviously
requires an in-depth knowledge of the network as well as any current
outages that would cause sub-optimal routing.
For the example network, traceroute shows that the traffic is taking the
optimal path through our network: Dunkel, Pilsener, and then our
service provider.
The next step then is to try to identify the network hop that is inducing
the latency. The following traceroute shows that the hop between
routers 2 and 3 is causing significant latency:
server% traceroute 4.2.2.1
traceroute to 4.2.2.1 (4.2.2.1), 30 hops max, 40 byte packets
1 18.32.75.1 (18.32.75.1) 4.435 ms 3.117 ms 3.413 ms Dunkel
2 18.32.74.6 (18.32.74.6) 4.935 ms 12.434 ms 2.826 ms Pilsener
3 4.10.33.2 (4.10.33.2) 13.513 ms 3.905 ms 5.060 ms Service provider hop 1
4 4.1.18.21 (4.1.18.21) 3.778 ms 5.237 ms 5.413 ms Service provider hop 2
5 4.2.2.1 (4.2.2.1) 128.269 ms 137.346 ms 133.977 ms
Now you know where to further investigate. At this point, the problem
could either be ingress or egress queuing on router 2 or router 3. So the
first thing needed is to check the link from router 2 to router 3, as it
appears that there is significant packet queuing.
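Locating the hop that adds latency amounts to comparing round-trip times between consecutive hops. The Python sketch below shows one way to do that. The function names are hypothetical, the use of the median is an assumption chosen to damp a single slow probe, and the sample round-trip times for hop 3 are invented for illustration rather than taken from a real trace.

```python
import re
from statistics import median

RTT_RE = re.compile(r"(\d+(?:\.\d+)?) ms")


def hop_medians(output):
    """Return a list of (hop_number, median_rtt_ms) for hops that answered."""
    medians = []
    for line in output.splitlines():
        fields = line.split()
        if not fields or not fields[0].isdigit():
            continue  # not a hop line
        rtts = [float(value) for value in RTT_RE.findall(line)]
        if rtts:
            medians.append((int(fields[0]), median(rtts)))
    return medians


def largest_jump(output):
    """Return (from_hop, to_hop, added_ms) for the biggest increase between hops."""
    hops = hop_medians(output)
    best = None
    for (prev_hop, prev_rtt), (hop, rtt) in zip(hops, hops[1:]):
        delta = rtt - prev_rtt
        if best is None or delta > best[2]:
            best = (prev_hop, hop, delta)
    return best


# Hypothetical sample: hops 1 and 2 are fast, hop 3 is slow.
LATENCY_TRACE = """traceroute to 4.2.2.1 (4.2.2.1), 30 hops max, 40 byte packets
 1 18.32.75.1 (18.32.75.1) 4.435 ms 3.117 ms 3.413 ms
 2 18.32.74.6 (18.32.74.6) 4.935 ms 12.434 ms 2.826 ms
 3 4.10.33.2 (4.10.33.2) 121.5 ms 130.2 ms 126.9 ms
"""

from_hop, to_hop, added_ms = largest_jump(LATENCY_TRACE)
print(from_hop, to_hop, round(added_ms, 1))  # 2 3 122.0
```

Routers may answer traceroute probes at low priority, so a jump at a single hop that does not persist at later hops is usually not a forwarding problem.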
Queuing implies high utilization on this link, since packets do not
queue on an uncongested link, so check the interface on
router 2 that handles traffic to router 3 by issuing a show interfaces
command:
ps@pilsener> show interfaces fe-0/0/1
Physical interface: fe-0/0/1, Enabled, Physical link is Up
Interface index: 128, SNMP ifIndex: 61
Link-level type: Ethernet, MTU: 1518, Speed: 100mbps, Loopback: Disabled,
Source filtering: Disabled, Flow control: Enabled
Device flags : Present Running
Interface flags: SNMP-Traps Internal: 0x4000
CoS queues     : 4 supported, 4 maximum usable queues
Current address: 00:90:69:6b:14:00, Hardware address: 00:90:69:6b:14:00
Last flapped   : 2010-01-12 06:00:14 EST (2w5d 22:24 ago)
Input rate     : 71441794 bps (46877 pps)
Output rate    : 96542771 bps (63598 pps)
Active alarms : None
Active defects : None
Logical interface fe-0/0/1.0 (Index 67) (SNMP ifIndex 56)
Description: Connection to service provider 1
Flags: SNMP-Traps 0x4000 VLAN-Tag [ 0x8100.5 ] Encapsulation: ENET2
Input packets : 0
Output packets: 0
Protocol inet, MTU: 1500
Flags: None
Addresses, Flags: Is-Preferred Is-Primary
Destination: 4.10.33.0/30, Local: 4.10.33.1/30, Broadcast: 4.10.33.3
Logical interface fe-0/0/1.32767 (Index 68) (SNMP ifIndex 58)
Flags: SNMP-Traps 0x4000 VLAN-Tag [ 0x0000.0 ] Encapsulation: ENET2
Input packets : 0
Output packets: 0
It appears that this link is probably overutilized. Verify this by checking the queue statistics. By default, Juniper Networks routers enable
two queues. One is a network-control queue, which services the
routing protocol and other control plane traffic. It is allocated 5% of
the bandwidth and 5% of the buffer space. The other queue is a
best-effort queue for all other traffic, which uses the remaining 95% of
the bandwidth and 95% of the buffer space.
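The utilization implied by the show interfaces output, and the default 95/5 split described above, can be checked with a few lines of arithmetic. This Python sketch uses the rates and the 100 Mbps speed from that output. The function names are hypothetical.

```python
def utilization_percent(rate_bps, speed_bps):
    """Link utilization as a percentage of the interface speed."""
    return 100.0 * rate_bps / speed_bps


def queue_share_bps(speed_bps, percent):
    """Bandwidth allocated to a queue given its configured percentage."""
    return speed_bps * percent // 100


SPEED_BPS = 100_000_000   # Speed: 100mbps
OUTPUT_BPS = 96_542_771   # Output rate from show interfaces
INPUT_BPS = 71_441_794    # Input rate from show interfaces

print(round(utilization_percent(OUTPUT_BPS, SPEED_BPS), 1))  # 96.5
print(round(utilization_percent(INPUT_BPS, SPEED_BPS), 1))   # 71.4
print(queue_share_bps(SPEED_BPS, 95))  # 95000000 (best-effort)
print(queue_share_bps(SPEED_BPS, 5))   # 5000000 (network-control)
```

An output rate of roughly 96.5% of line rate leaves almost no headroom, which is consistent with packets queuing on this interface.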
Summary
While Layer 3 monitoring and troubleshooting is fundamentally
different from Layer 1 and Layer 2, the same methodology and
approach can be used to isolate and resolve Layer 3 problems. Using a
consistent, logical approach and applying the Fix Test described in this
book should allow you to quickly diagnose and implement supportable short-term fixes.
Whenever troubleshooting at Layer 3, remember that it is an
interconnected system and that some routers in your network act based on the
actions of another router. Route-tagging and class-of-service markings
are great examples of this. Nothing can replace experience with your
network, the protocols used on it, and the way in which it operates
under normal conditions. However, a managed approach to
operating your network, combined with a sound understanding of
the way in which protocols function and of how Junos features and
instrumentation can assist you, makes the task significantly easier.