
SAA Troubleshooting


Troubleshooting Network Degradation with SAA
Rujipars Aungkhanawin
Alcatel-Lucent
March 25, 2013

Benefit of SAA (Service Assurance Agent)
Early notification of degradation
Raises an alarm when network degradation occurs, before it is detected by
control-plane mechanisms (BFD, routing protocols, link/hardware status changes).

Little benefit from SAA in a total outage
Other alarms already report such problems, e.g. routing-protocol neighbor
status changes and port/interface status changes.

SAA Configuration Review

GUI Configuration in 5620-SAM


Configuration Generated on NE
Script Files
CRON Configuration

Naming Convention
SAA Name Example
ICMP Ping Test Name
I_BASD03_01/1/10-BPBI13_02/2/20
I_TASD03_01/1/10-TPBI13_02/2/20
ETH-CFM Test Name
E_BASD03_01/1/10-BPBI13_02/2/20
E_TASD03_01/1/10-TPBI13_02/2/20
I_TASD03_01/L002-TPBI13_02/L001
I_TASD03_01/L001-TPNCA1_02/1/12
Note: Don't use LAG for TUC testing.
01/1/10 = Card/MDA/Port number
L001 = LAG 1
I = ICMP ping
E = ETH-CFM
B = BFKT
T = TUC
ASD03 = Node name (tested node)
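The naming convention is regular enough to decode mechanically, which helps when many SAA alarms have to be sorted by node or port. The Python sketch below is a hypothetical helper (parse_saa_name, NAME_RE and SaaName are illustrative names, not part of 5620-SAM or the NE CLI). It assumes names of the form <type>_<region><node>_<card/MDA/port>-<region><node>_<card/MDA/port>, as in the examples above, and deliberately rejects LAG-based names.

```python
import re
from typing import NamedTuple

# Letter codes taken from the legend above.
TEST_TYPES = {"I": "icmp-ping", "E": "eth-cfm"}
REGIONS = {"B": "BFKT", "T": "TUC"}

# <type>_<region><node>_<card/MDA/port>-<region><node>_<card/MDA/port>
# LAG-based names (e.g. ..._01/L002-...) are intentionally not matched.
NAME_RE = re.compile(
    r"^(?P<type>[IE])_"
    r"(?P<src_region>[BT])(?P<src_node>[A-Z0-9]+)_(?P<src_port>\d+/\d+/\d+)-"
    r"(?P<dst_region>[BT])(?P<dst_node>[A-Z0-9]+)_(?P<dst_port>\d+/\d+/\d+)$"
)


class SaaName(NamedTuple):
    test_type: str
    src_region: str
    src_node: str
    src_port: str
    dst_region: str
    dst_node: str
    dst_port: str


def normalize_port(token: str) -> str:
    """Drop zero padding: '04/2/12' -> '4/2/12', '02/1/05' -> '2/1/5'."""
    return "/".join(str(int(part)) for part in token.split("/"))


def parse_saa_name(name: str) -> SaaName:
    """Split an SAA test name into its parts, or raise ValueError."""
    m = NAME_RE.match(name)
    if m is None:
        raise ValueError(f"name does not follow the convention: {name!r}")
    return SaaName(
        test_type=TEST_TYPES[m["type"]],
        src_region=REGIONS[m["src_region"]],
        src_node=m["src_node"],
        src_port=normalize_port(m["src_port"]),
        dst_region=REGIONS[m["dst_region"]],
        dst_node=m["dst_node"],
        dst_port=normalize_port(m["dst_port"]),
    )


if __name__ == "__main__":
    print(parse_saa_name("I_TPNCA7W_04/2/12-TPKG04_02/1/05"))
```

Zero padding is stripped from the ports so the result can be compared directly with CLI output such as "Network 4/2/12".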

Node Name & CLLI
CSN & WiFi Gateway
PNCABKCA17W
Use one word
Use four words
PNCA7
PTN, RCU, RN
MSC replace codes
CWT = MTG
HAM = PBI
SRW = BRK

ICMP Ping SAA Test Configuration

SAA Configuration in 5620-SAM


Under the menu Tools/Service Test Manager (STM)

ICMP Ping Properties General
- Name and Description are the same
- System address of the node generating the test (not the IP address used for the ICMP ping)
- Target IP address for the ICMP ping
- Egress interface for the ICMP ping

ICMP Ping Properties Parameters
- 1000 ping packets per test
- Interval has no meaning for rapid ping
- Time-out: the time after which a ping packet is considered to have received no response
- Rapid ping: generates 100 packets per second
- QoS marking for the generated ping packets: Network Control, in-profile
- On-demand execution
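As a rough check of what these parameters imply, the sketch below works out how long one test spends sending probes and what the loss threshold used later means as a percentage of the probes sent. It assumes the rapid rate of 100 packets per second stated above and ignores time-outs, so the send time is only a lower bound; the function names are illustrative.

```python
# Figures from the parameter slides: 1000 probes per test, rapid ping at
# about 100 probes per second, loss threshold of 10 probes per test.
def rapid_ping_send_time_s(count: int, packets_per_second: int = 100) -> float:
    """Time needed just to send all probes; time-outs can only add to this."""
    return count / packets_per_second


def threshold_as_percent(threshold_packets: int, count: int) -> float:
    """Express a packet-loss threshold as a percentage of the probes sent."""
    return threshold_packets * 100 / count


if __name__ == "__main__":
    print(rapid_ping_send_time_s(1000))    # seconds of sending per test
    print(threshold_as_percent(10, 1000))  # percent loss at the threshold
```

With these numbers, one test takes at least 10 seconds of sending, and a threshold of 10 lost packets corresponds to 1% loss.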

ICMP Ping Properties Result Configuration
- The test is considered failed if the number of timed-out packets exceeds 10 in a test.
- An SNMP trap is generated only when the test fails, not every time the test runs.

ICMP Ping Properties Threshold
- An alarm is generated when the number of lost packets reaches a defined threshold.
- The alarm is cleared when the number of lost packets falls below the threshold.

ICMP Ping Properties Threshold (Contd)
- A threshold-crossing alarm is generated when 10 or more packets are lost in one test.
- Time since the last threshold crossing occurred
- Number of lost packets when the last threshold crossing occurred
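The raise/clear behaviour described above can be modelled as a small state machine. This is a hypothetical Python sketch of the slides' description (raise when a test loses the threshold number of packets or more, clear when a later test loses fewer), not the NE's actual threshold implementation; the class and method names are made up for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class LossThresholdAlarm:
    """Toy model of the rising-threshold behaviour described on the slides.

    Assumption: the alarm is raised when a test loses `rising_threshold`
    or more packets and cleared by the first later test that loses fewer.
    """
    rising_threshold: int
    active: bool = False
    last_value: Optional[int] = None  # packet loss at the last crossing
    last_run: Optional[int] = None    # test run number of the last crossing

    def update(self, run: int, lost_packets: int) -> Optional[str]:
        """Feed one test result; return 'raised', 'cleared' or None."""
        if not self.active and lost_packets >= self.rising_threshold:
            self.active = True
            self.last_value, self.last_run = lost_packets, run
            return "raised"
        if self.active and lost_packets < self.rising_threshold:
            self.active = False
            return "cleared"
        return None


if __name__ == "__main__":
    alarm = LossThresholdAlarm(rising_threshold=10)
    for run, lost in [(1395, 0), (1396, 14), (1397, 0)]:
        print(run, lost, alarm.update(run, lost))
```

The last_value and last_run fields correspond to the "Value" and "Run #" columns that show saa reports for the last threshold crossing.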

SAA Test Configuration Created on NE

*A:HAMMBK0900M>config>saa# info
----------------------------------------------
        test "I_THAM09_01/1/01-THAM05_02/1/12" owner "sas:0:1:1:e"
            description "I_THAM09_01/1/01-THAM05_02/1/12"
            type
                icmp-ping 10.197.254.44 rapid size 1500 source 10.197.170.18 next-hop 10.197.170.17 count 1000
            exit
            trap-gen
                test-fail-enable
            exit
            loss-event rising-threshold 10
            no shutdown
        exit
        test "I_THAM09_02/1/01-THAM05_03/2/05" owner "sas:0:2:2:e"
            description "I_THAM09_02/1/01-THAM05_03/2/05"
            type
                icmp-ping 10.197.254.44 rapid size 1500 source 10.197.170.238 next-hop 10.197.170.237 count 1000
            exit
            trap-gen
                test-fail-enable
            exit
            loss-event rising-threshold 10
            no shutdown
        exit
----------------------------------------------
Note: The next-hop parameter was added through the CLI after the configuration had been created by the Service Test Manager GUI.

CRON Configuration (Test Scheduling)

- script: defines a script file to be scheduled by CRON (through an action)
- action: binds a script to a result location; it is referred to by the CRON schedule
*A:HAMMBK0900M>config>cron# info
----------------------------------------------
        script "SAA-Icmp-ping"
            location "cf3:/cron-script/SAA-Icmp-ping.txt"
            no shutdown
        exit
        script "Deleted-SAA-result"
            location "cf3:/cron-script/Deleted-SAA-result.txt"
            no shutdown
        exit
        action "SAA-Icmp-ping"
            results "cf3:/cron-result/SAA-Icmp-ping"
            script "SAA-Icmp-ping"
            no shutdown
        exit
        action "Deleted-SAA-result"
            results "cf3:/cron-result/Deleted-SAA-result"
            script "Deleted-SAA-result"
            no shutdown
        exit

CRON Configuration (Test Scheduling) (Contd)

- schedule: defines when to execute the action
        schedule "SAA-Icmp-ping"
            description "SAA-Icmp-ping"
            action "SAA-Icmp-ping"
            type calendar
            day-of-month all
            hour all
            minute 1 30
            month all
            weekday all
            no shutdown
        exit
        schedule "Deleted-SAA-result"
            description "Deleted-SAA-result"
            action "Deleted-SAA-result"
            type calendar
            day-of-month all
            hour 4
            minute 45
            month all
            weekday all
            no shutdown
        exit
----------------------------------------------
- SAA-Icmp-ping runs the SAA tests at minutes 1 and 30 of every hour, every day. Its result logs are deleted by the Deleted-SAA-result script.
- Deleted-SAA-result runs at 04:45 every day to delete the result logs produced by the SAA test schedule. Its own result log is, in turn, deleted by the SAA-Icmp-ping script.
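To see exactly when a calendar schedule such as "minute 1 30 / hour all" fires, the sketch below enumerates the matching minutes in a time window. It is a hypothetical Python model, not the NE's CRON implementation, and it assumes day-of-month, month and weekday are all set to "all", as in both schedules above.

```python
from datetime import datetime, timedelta
from typing import Iterator, Optional, Set


def calendar_runs(start: datetime, end: datetime, minutes: Set[int],
                  hours: Optional[Set[int]] = None) -> Iterator[datetime]:
    """Yield each whole minute in [start, end) that a calendar schedule matches.

    hours=None stands for "hour all". Day-of-month, month and weekday are
    assumed to be "all", as in both schedules above.
    """
    t = start.replace(second=0, microsecond=0)
    if t < start:  # start was mid-minute: begin at the next whole minute
        t += timedelta(minutes=1)
    while t < end:
        if t.minute in minutes and (hours is None or t.hour in hours):
            yield t
        t += timedelta(minutes=1)


if __name__ == "__main__":
    day = datetime(2013, 3, 29)
    saa_runs = list(calendar_runs(day, day + timedelta(days=1), minutes={1, 30}))
    cleanup = list(calendar_runs(day, day + timedelta(days=1), minutes={45}, hours={4}))
    print(len(saa_runs), saa_runs[:2], cleanup)
```

Over one day, the SAA-Icmp-ping schedule fires 48 times and Deleted-SAA-result fires once, at 04:45.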

Script Files
Two script files are used: SAA-Icmp-ping.txt and Deleted-SAA-result.txt.
*A:HAMMBK0900M>file cf3:\cron-script\ # dir

Volume in drive cf3 on slot A has no label.
Volume in drive cf3 on slot A is formatted as FAT32.

Directory of cf3:\cron-script\

02/20/2013  10:12a      <DIR>          .
03/26/2013  11:40a      <DIR>          ..
02/18/2013  02:51p                  38 Deleted-SAA-result.txt
02/20/2013  10:12a                 195 SAA-Icmp-ping.txt
               2 File(s)                    233 bytes.
               2 Dir(s)              1601929216 bytes free.

*A:HAMMBK0900M>file cf3:\cron-script\ # type SAA-Icmp-ping.txt
File: SAA-Icmp-ping.txt
-------------------------------------------------------------------------------
exit all
oam saa I_THAM09_01/1/01-THAM05_02/1/12 owner sas:0:1:1:e start
oam saa I_THAM09_02/1/01-THAM05_03/2/05 owner sas:0:2:2:e start
file delete cf3:/cron-result/Deleted*.* force
===============================================================================
- The oam saa ... start lines run the SAA tests one by one.
- The file delete line deletes the result file generated by the Deleted-SAA-result script.

*A:HAMMBK0900M>file cf3:\cron-script\ # type Deleted-SAA-result.txt
File: Deleted-SAA-result.txt
-------------------------------------------------------------------------------
file delete cron-result\SAA*.* force
===============================================================================
- Deletes the result logs generated by the SAA tests. This script also creates its own result log, which is deleted by the SAA-Icmp-ping script above.
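When SAA tests are added or renamed, the script file has to be edited to match. The hypothetical helper below generates script text in the same shape as SAA-Icmp-ping.txt: exit all, one oam saa ... start line per test, then the cleanup of the Deleted-SAA-result log. It quotes the test names as SAA-ETH-CFM.txt does and uses \n line endings; copying the result to cf3:/cron-script/ is left out of the sketch.

```python
from typing import Iterable, Tuple


def build_saa_script(tests: Iterable[Tuple[str, str]],
                     cleanup_glob: str = "cf3:/cron-result/Deleted*.*") -> str:
    """Return the text of a CRON script that starts each SAA test in turn.

    tests: (test name, owner) pairs, e.g. as shown by `show saa`.
    The last line removes the result log left by the Deleted-SAA-result job.
    """
    lines = ["exit all"]
    for name, owner in tests:
        # Names contain '/' and '-', so quote them as SAA-ETH-CFM.txt does.
        lines.append(f'oam saa "{name}" owner "{owner}" start')
    lines.append(f"file delete {cleanup_glob} force")
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    print(build_saa_script([
        ("I_THAM09_01/1/01-THAM05_02/1/12", "sas:0:1:1:e"),
        ("I_THAM09_02/1/01-THAM05_03/2/05", "sas:0:2:2:e"),
    ]), end="")
```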

SAA Working Status on NE


*A:HAMMBK0900M# show saa "I_THAM09_0
"I_THAM09_01/1/01-THAM05_02/1/12" "I_THAM09_02/1/01-THAM05_03/2/05"
*A:HAMMBK0900M# show saa "I_THAM09_01/1/01-THAM05_02/1/12" owner
owner <test-owner>
"sas:0:1:1:e"

Type the command and part of the SAA name,
then press TAB for auto-completion/hints.
Don't forget to add the owner parameter.

*A:HAMMBK0900M# show saa "I_THAM09_01/1/01-THAM05_02/1/12" owner "sas:0:1:1:e"

===============================================================================
SAA Test Information
===============================================================================
Test name                    : I_THAM09_01/1/01-THAM05_02/1/12
Owner name                   : sas:0:1:1:e
Description                  : I_THAM09_01/1/01-THAM05_02/1/12
Accounting policy            : None
Continuous                   : No
Administrative status        : Enabled
Test type                    : icmp-ping 10.197.254.44 rapid size 1500 source
                               10.197.170.18 next-hop 10.197.170.17 count 1000
Trap generation              : test-fail-enable test-fail-threshold 1
Test runs since last clear   : 1642
Number of failed test runs   : 12
Last test result             : Success
-------------------------------------------------------------------------------
Threshold
Type        Direction Threshold  Value      Last Event          Run #
-------------------------------------------------------------------------------
<snipped>
Loss-rt     Rising    10         15         03/24/2013 09:30:21 1534
            Falling   None       None       Never               None
===============================================================================
- The ping can be executed manually with these same parameters (use ping in the CLI instead of icmp-ping).
- Test runs since last clear, Number of failed test runs and Last test result are the interesting test results.
- The Loss-rt Rising row shows the number of lost packets and the date/time when the threshold was exceeded.
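When show saa output is collected for many tests, the "Field : value" lines can be parsed and the equivalent manual ping command derived from Test type, as the note above suggests. A minimal Python sketch, assuming the column-aligned layout shown above with indented continuation lines; parse_saa_info and manual_ping_command are hypothetical names, and only part of the output is used as sample input.

```python
from typing import Dict

SAMPLE = """\
Test name                    : I_THAM09_01/1/01-THAM05_02/1/12
Owner name                   : sas:0:1:1:e
Test type                    : icmp-ping 10.197.254.44 rapid size 1500 source
                               10.197.170.18 next-hop 10.197.170.17 count 1000
Test runs since last clear   : 1642
Number of failed test runs   : 12
Last test result             : Success
"""


def parse_saa_info(text: str) -> Dict[str, str]:
    """Collect 'Field : value' lines; indented lines continue the previous value."""
    info: Dict[str, str] = {}
    last_key = None
    for line in text.splitlines():
        key, sep, value = line.partition(" : ")
        if sep and key.strip():
            last_key = key.strip()
            info[last_key] = value.strip()
        elif last_key is not None and line[:1].isspace() and line.strip():
            info[last_key] += " " + line.strip()
    return info


def manual_ping_command(info: Dict[str, str]) -> str:
    """Turn a stored icmp-ping test type into the equivalent CLI ping."""
    test_type = info["Test type"]
    prefix = "icmp-ping "
    if not test_type.startswith(prefix):
        raise ValueError("not an icmp-ping test")
    return "ping " + test_type[len(prefix):]


info = parse_saa_info(SAMPLE)

if __name__ == "__main__":
    print(manual_ping_command(info))
```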

SAA Working Status on NE (Contd)

*A:HAMMBK0900M# show saa "I_THAM09_01/1/01-THAM05_02/1/12" owner "sas:0:1:1:e"
<snipped>
Loss-rt     Rising    10         15         03/24/2013 09:30:21 1534
            Falling   None       None       Never               None
===============================================================================
Test Run: 1643
Total number of attempts: 1000
Number of requests that failed to be sent out: 0
Number of responses that were received: 1000
Number of requests that did not receive any response: 0
Total number of failures: 0, Percentage: 0
 (in ms)        Min          Max          Average      Jitter
Outbound  :     0.000        0.000        0.000        0.000
Inbound   :     0.000        0.000        0.000        0.000
Roundtrip :     0.434        2.00         0.563        0.135
Per test packet:
  Sequence     Outbound      Inbound    RoundTrip  Result
         1        0.000        0.000        0.464  Response Received
         2        0.000        0.000        0.462  Response Received
         3        0.000        0.000        0.464  Response Received
<snipped>
       997        0.000        0.000        0.534  Response Received
       998        0.000        0.000        0.570  Response Received
       999        0.000        0.000        0.531  Response Received
      1000        0.000        0.000        0.557  Response Received

- Test Run: the number of test runs; it should increase with every scheduled run (twice an hour with the minute 1 30 schedule).
- Roundtrip: summary of the round-trip times.
- Per test packet: round-trip time of each test packet; these results are kept for 3 tests (3000 packets in total).
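The per-run counters can likewise be extracted from captured output to compute the loss percentage of the last run. A minimal sketch, assuming each counter sits on its own "Label: number" line as shown above; the function names are illustrative and comparing the result with a threshold is left to the caller.

```python
import re
from typing import Dict

RUN_SUMMARY = """\
Test Run: 1643
Total number of attempts: 1000
Number of requests that failed to be sent out: 0
Number of responses that were received: 1000
Number of requests that did not receive any response: 0
Total number of failures: 0, Percentage: 0
"""

# "Label: 123" at the start of a line; the label is letters and spaces only.
COUNTER_RE = re.compile(r"^([A-Za-z][A-Za-z ]*):\s*(\d+)")


def parse_run_summary(text: str) -> Dict[str, int]:
    """Map each counter label to its integer value (first number on the line)."""
    counters: Dict[str, int] = {}
    for line in text.splitlines():
        m = COUNTER_RE.match(line.strip())
        if m:
            counters[m.group(1)] = int(m.group(2))
    return counters


def loss_percent(counters: Dict[str, int]) -> float:
    """Failures as a percentage of attempts for one test run."""
    return counters["Total number of failures"] * 100 / counters["Total number of attempts"]


if __name__ == "__main__":
    c = parse_run_summary(RUN_SUMMARY)
    print(c["Test Run"], loss_percent(c))
```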

ETH-CFM SAA Test Configuration (Ethernet Connectivity Fault Management)

SAA Configuration in 5620-SAM


Under the menu Tools/Service Test Manager (STM)
In this example, the list is filtered to display only ETH-CFM tests.

ETH-CFM Properties General
- Name and Description are the same
- System address of the node generating the test
- Target MAC address

ETH-CFM Properties Test Parameters & Results Configuration
- Number of Loopback (LB) messages sent in a test
- The NE raises an SNMP trap when at least 1 message is lost in a test

ETH-CFM Properties Threshold
- An alarm is generated when the number of lost packets reaches a defined threshold.
- The alarm is cleared when the number of lost packets falls below the threshold.

ETH-CFM Properties Threshold (Contd)
- A threshold-crossing alarm is generated when 3 or more packets are lost in one test.
- Time since the last threshold crossing occurred
- Number of lost packets when the last threshold crossing occurred

SAA Test Configuration Created on NE

*A:EKCCBK0200M# /configure saa
*A:EKCCBK0200M>config>saa# info
----------------------------------------------
<snipped>
        test "E_TEKC02_02/1/03-TEKC23_01/2/02" owner "sas:0:1:1:k"
            description "E_TEKC02_02/1/03-TEKC23_01/2/02"
            type
                eth-cfm-loopback 00:00:00:00:00:01 mep 102 domain 1 association 2 size 1500 count 10 timeout 1 interval 1
            exit
            trap-gen
                test-fail-enable
            exit
            loss-event rising-threshold 3
            no shutdown
        exit
----------------------------------------------
*A:EKCCBK0200M>config>saa# /configure eth-cfm
*A:EKCCBK0200M>config>eth-cfm# info
----------------------------------------------
        domain 1 format none level 3
            association 2 format string name "SAA.EKC02.EKC23"
                bridge-identifier 720000472
                exit
            exit
        exit
----------------------------------------------

Key information: MEP ID = 102, MD Index = 1, MA Index = 2
Association 2 is configured under service ID 720000472.

SAA Test Configuration Created on NE (Contd)

*A:EKCCBK0200M# show eth-cfm cfm-stack-table
===============================================================================
CFM Stack Table Defect Legend:
R = Rdi, M = MacStatus, C = RemoteCCM, E = ErrorCCM, X = XconCCM, A = AisRx
===============================================================================
CFM SAP Stack Table
===============================================================================
Sap              Lvl Dir  Md-index   Ma-index MepId  Mac-address       Defect
-------------------------------------------------------------------------------
2/1/3:624.0        3 Up          1          2   102  00:00:00:00:00:02 -----
===============================================================================
<snipped>
*A:EKCCBK0200M# /configure service vpls 720000472 sap 2/1/3:624.0
*A:EKCCBK0200M>config>service>vpls>sap# info
----------------------------------------------
            eth-cfm
                mep 102 domain 1 association 2 direction up
                    mac-address 00:00:00:00:00:02
                    no shutdown
                exit
            exit
----------------------------------------------

In the stack table, find the row where
MEP ID = 102, MD Index = 1 and MA Index = 2;
this gives SAP ID = 2/1/3:624.0.
Then view the configuration of the known service ID and SAP.
Following these steps, all the configuration related to the ETH-CFM test can be traced.

ETH-CFM Configuration on Peer

To check the peer, use the MD Index, MA Index and MEP ID obtained in the previous check.
*A:EKCCBK2309W# show eth-cfm cfm-stack-table
===============================================================================
CFM Stack Table Defect Legend:
R = Rdi, M = MacStatus, C = RemoteCCM, E = ErrorCCM, X = XconCCM, A = AisRx
<snipped>
===============================================================================
CFM SDP Stack Table
===============================================================================
Sdp              Lvl Dir  Md-index   Ma-index MepId  Mac-address       Defect
-------------------------------------------------------------------------------
12010:720000469    3 Down        1          1   101  00:00:00:00:00:01 -----
12021:720000472    3 Down        1          2   101  00:00:00:00:00:01 -----
12061:720000472    3 Down        1          3   101  00:00:00:00:00:01 -----
===============================================================================
*A:EKCCBK2309W# /configure service vpls 720000472 spoke-sdp 12021:720000472
*A:EKCCBK2309W>config>service>vpls>spoke-sdp# info
----------------------------------------------
            eth-cfm
                mep 101 domain 1 association 2 direction down
                    mac-address 00:00:00:00:00:01
                    no shutdown
                exit
            exit
----------------------------------------------
*A:EKCCBK2309W>config>service>vpls>spoke-sdp#

In the stack table, find the row where
MD Index = 1 and MA Index = 2 (the local MEP ID is 102);
this gives SDP ID = 12021:720000472. The peer MEP ID there is 101, with MAC address
00:00:00:00:00:01, which is the target MAC address of the loopback test.
Then view the configuration of the known service ID and SDP ID.
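The stack-table lookup used on both nodes (match MD index and MA index, and optionally MEP ID, to find the SAP or SDP binding) can be sketched in Python as below. It assumes the column layout shown above, with exactly eight whitespace-separated fields per data row; StackRow, parse_stack_rows and find_meps are hypothetical names, and the sample tables are copied from the outputs above.

```python
from typing import List, NamedTuple, Optional

LOCAL_SAP_TABLE = """\
Sap              Lvl Dir  Md-index   Ma-index MepId  Mac-address       Defect
-------------------------------------------------------------------------------
2/1/3:624.0        3 Up          1          2   102  00:00:00:00:00:02 -----
"""

PEER_SDP_TABLE = """\
Sdp              Lvl Dir  Md-index   Ma-index MepId  Mac-address       Defect
-------------------------------------------------------------------------------
12010:720000469    3 Down        1          1   101  00:00:00:00:00:01 -----
12021:720000472    3 Down        1          2   101  00:00:00:00:00:01 -----
12061:720000472    3 Down        1          3   101  00:00:00:00:00:01 -----
"""


class StackRow(NamedTuple):
    binding: str    # SAP ID, or SDP binding as sdp-id:vc-id
    level: int
    direction: str
    md_index: int
    ma_index: int
    mep_id: int
    mac: str
    defect: str


def parse_stack_rows(text: str) -> List[StackRow]:
    """Keep data rows only: eight fields, numeric level, Up/Down direction."""
    rows = []
    for line in text.splitlines():
        f = line.split()
        if len(f) == 8 and f[1].isdigit() and f[2] in ("Up", "Down"):
            rows.append(StackRow(f[0], int(f[1]), f[2], int(f[3]), int(f[4]),
                                 int(f[5]), f[6], f[7]))
    return rows


def find_meps(rows: List[StackRow], md_index: int, ma_index: int,
              mep_id: Optional[int] = None) -> List[StackRow]:
    """Rows in the given MD/MA, optionally restricted to one MEP ID."""
    return [r for r in rows
            if r.md_index == md_index and r.ma_index == ma_index
            and (mep_id is None or r.mep_id == mep_id)]


if __name__ == "__main__":
    print(find_meps(parse_stack_rows(LOCAL_SAP_TABLE), 1, 2, 102))
    print(find_meps(parse_stack_rows(PEER_SDP_TABLE), 1, 2))
```

The header row also has eight fields, which is why the parser additionally checks for a numeric level and an Up/Down direction.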

CRON Configuration
*A:EKCCBK0200M>config>cron# info
----------------------------------------------
        script "SAA-ETH-CFM"
            location "cf3:/cron-script/SAA-ETH-CFM.txt"
            no shutdown
        exit
        action "SAA-ETH-CFM"
            results "cf3:/cron-result/SAA-ETH-CFM"
            script "SAA-ETH-CFM"
            no shutdown
        exit
        schedule "SAA-ETH-CFM"
            description "SAA-ETH-CFM"
            action "SAA-ETH-CFM"
            type calendar
            day-of-month all
            hour all
            minute 15 45
            month all
            weekday all
            no shutdown
        exit
        schedule "Deleted-SAA-result"
            description "Deleted-SAA-result"
            action "Deleted-SAA-result"
            type calendar
            day-of-month all
            hour 4
            minute 45
            month all
            weekday all
            no shutdown
        exit
----------------------------------------------
- SAA-ETH-CFM runs the SAA test at minutes 15 and 45 of every hour, every day. Its result logs are deleted by the Deleted-SAA-result script.

Script Files
*A:EKCCBK0200M>file cf3:\cron-script\ # dir

Volume in drive cf3 on slot A is SMART.
Volume in drive cf3 on slot A is formatted as FAT32.

Directory of cf3:\cron-script\

03/05/2013  05:25p      <DIR>          .
04/01/2013  02:03a      <DIR>          ..
02/15/2013  05:07p                 128 SAA-Icmp-ping.txt
02/18/2013  02:59p                  38 Deleted-SAA-result.txt
03/05/2013  05:25p                 121 SAA-ETH-CFM.txt
               3 File(s)                    287 bytes.
               2 Dir(s)              1764376576 bytes free.

*A:EKCCBK0200M>file cf3:\cron-script\ # type SAA-ETH-CFM.txt
File: SAA-ETH-CFM.txt
-------------------------------------------------------------------------------
exit all
oam saa "E_TEKC02_02/1/03-TEKC23_01/2/02" owner "sas:0:1:1:k" start
file delete cron-result\Deleted*.* force
===============================================================================
- The oam saa ... start lines run the SAA tests one by one, if there is more than one.
- The file delete line deletes the result file generated by the Deleted-SAA-result script.

*A:EKCCBK0200M>file cf3:\cron-script\ # type Deleted-SAA-result.txt
File: Deleted-SAA-result.txt
-------------------------------------------------------------------------------
file delete cron-result\SAA*.* force
===============================================================================

SAA Troubleshooting Example

Troubleshooting SAA Alarms

Confirm the fault location (optional)
- The SAA name already gives a clue to the problem location.
- Recheck to make sure that the name correctly defines the problem location.

Correlate the SAA alarm with other faults
- If there is no other fault related to the problem location indicated by SAA, apply the verification steps.
- If there are other alarms at the location indicated by SAA, follow the troubleshooting steps for those alarms.

SAA Alarm Example #1

SAA alarms on 3 links, on 3 different cards, at PNCABKCA17W (PNC-1 CSN).
It is unlikely that the problem is caused by 3 faulty cards/ports at the same time.
Nevertheless, this demonstration picks one SAA test to check further, as an example.

Example #1 Understanding the situation

Name of the SAA test: I_TPNCA7W_04/2/12-TPKG04_02/1/05
- I = ICMP ping
- T = TUC
- PNCA7W is from PNCABKCA17W, the node where the SAA test is running
- 04/2/12 = the port that originates the SAA ICMP ping to its neighbor
- T = TUC
- PKG04 is from PKGGBK0406W (PKG-04 PTN), the ping target
- 02/1/05 = the port on PKG-04 that is the target of the ICMP ping

Time of detection

Summary
The SAA test that sends ICMP pings from CSN PNC-1 (PNCABKCA17W), egressing on port 4/2/12,
to PTN PKG-04 (PKGGBK0406W), ingressing on port 2/1/5, lost 14 of the 1000 packets sent.
The packet loss of 14 exceeds the threshold of 10, so the alarm was raised on
Mar 29 at 16:30:24 local time.

Example #1 Go to the NE to check

It is not always necessary to go to the NE if other alarms already indicate a
problem with a component involved in the alarm, for example:
- an alarm on the transmission network of the link
- an alarm on the NE itself indicating a faulty card/port
If there is no other alarm related to the components used by the SAA test,
or you want to check on the NE for another reason, reach the NE CLI by
right-clicking the NE and selecting NE Sessions, then Telnet Session or
SSH Session.

Example #1 View the SAA Test status

*A:PNCABKCA17W# show saa "I_TPNCA7W_0
"I_TPNCA7W_01/2/03-THAM05_01/1/03"
"I_TPNCA7W_02/2/11-TBGU05_02/1/03"
"I_TPNCA7W_04/2/12-TPKG04_02/1/05"
- Type show saa "I_ and press TAB; the CLI partially fills in the name and shows the available SAA test names.
- Type a few more characters and press TAB; the SAA name is completed. Type the keyword owner and press TAB again.

*A:PNCABKCA17W# show saa "I_TPNCA7W_04/2/12-TPKG04_02/1/05" owner
  owner <test-owner>
  "sas:0:3:3:e"
- Type s and the CLI fills in the owner of the SAA test automatically; then press ENTER to view the result.
*A:PNCABKCA17W# show saa "I_TPNCA7W_04/2/12-TPKG04_02/1/05" owner "sas:0:3:3:e"
===============================================================================
SAA Test Information
===============================================================================
Test name                    : I_TPNCA7W_04/2/12-TPKG04_02/1/05
Owner name                   : sas:0:3:3:e
Description                  : I_TPNCA7W_04/2/12-TPKG04_02/1/05
Accounting policy            : None
Continuous                   : No
Administrative status        : Enabled
Test type                    : icmp-ping 10.197.254.46 rapid size 1500 source
                               10.100.17.53 next-hop 10.100.17.54 count 1000
Trap generation              : test-fail-enable test-fail-threshold 1
Test runs since last clear   : 1529
Number of failed test runs   : 1
Last test result             : Success
-------------------------------------------------------------------------------
Threshold
Type        Direction Threshold  Value      Last Event          Run #
-------------------------------------------------------------------------------
Jitter-in   Rising    None       None       Never               None
Press any key to continue (Q to quit)

The source address 10.100.17.53 is the parameter that precisely identifies
the network interface; this value is used in the next step.
Note: Although the port number in the SAA test name can also identify the
ingress/egress port, it could be wrong because of a configuration mistake.

Example #1 View the SAA Test status (contd)

-------------------------------------------------------------------------------
Threshold
Type        Direction Threshold  Value      Last Event          Run #
-------------------------------------------------------------------------------
Jitter-in   Rising    None       None       Never               None
            Falling   None       None       Never               None
<snipped>
Loss-out    Rising    None       None       Never               None
            Falling   None       None       Never               None
Loss-rt     Rising    10         14         03/29/2013 16:30:24 1396
            Falling   None       None       Never               None
===============================================================================
Test Run: 1528
Total number of attempts: 1000
Number of requests that failed to be sent out: 0
Number of responses that were received: 1000
Number of requests that did not receive any response: 0
Total number of failures: 0, Percentage: 0
 (in ms)        Min          Max          Average      Jitter
Outbound  :     0.000        0.000        0.000        0.000
Inbound   :     0.000        0.000        0.000        0.000
Roundtrip :     1.33         2.08         1.37         0.013
Per test packet:
  Sequence     Outbound      Inbound    RoundTrip  Result
         1        0.000        0.000         1.38  Response Received
         2        0.000        0.000         1.44  Response Received
         3        0.000        0.000         1.37  Response Received
         4        0.000        0.000         1.36  Response Received
         5        0.000        0.000         1.36  Response Received
         6        0.000        0.000         1.36  Response Received
<snipped>

Example #1 Check for the ports involved

From the command below, we can confirm that:
- The port that sources the SAA test is 4/2/12.
- The network interface is To_PKG04_7750_G_2/1/5.
- The interface name is only a description; it CANNOT confirm that the target
  node is PKG-04 or that the target port is 2/1/5.
Use show router interface and look for the line that contains the source IP
address, plus the line before it.
*A:PNCABKCA17W# show router interface | match 10.100.17.53 pre-lines 1
To_PKG04_7750_G_2/1/5            Up          Up/Down     Network 4/2/12
   10.100.17.53/30                                       n/a
*A:PNCABKCA17W#
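In the examples in this document, the next-hop of each ICMP test is the other host address of a /30 subnet containing the source address (10.100.17.53/30 pairs with next-hop 10.100.17.54; source 10.197.170.18 is consistent with a /30 pairing with next-hop 10.197.170.17). Assuming network interfaces are /30 point-to-point links, as in the output above, the far-end address to telnet to can be derived with Python's standard ipaddress module; p2p_peer is a hypothetical helper name.

```python
import ipaddress


def p2p_peer(interface_cidr: str) -> str:
    """Return the far-end address of a /30 point-to-point network interface."""
    iface = ipaddress.ip_interface(interface_cidr)
    if iface.network.prefixlen != 30:
        raise ValueError("this sketch only handles /30 subnets")
    # A /30 has exactly two usable hosts; drop our own address.
    others = [h for h in iface.network.hosts() if h != iface.ip]
    if len(others) != 1:
        raise ValueError(f"{iface.ip} is not a usable host address in {iface.network}")
    return str(others[0])


if __name__ == "__main__":
    print(p2p_peer("10.100.17.53/30"))
```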

Example #1 Check for the ports involved (contd)

Telnet to the destination IP address:
- The ingress port at the destination is 2/1/5.
- The network interface is To_PNC1_7750_G_4/2/12.
*A:PNCABKCA17W# telnet 10.100.17.54
Trying 10.100.17.54 ...
####################################################################
#                          W A R N I N G                           #
####################################################################
#                                                                  #
# Unauthorized access to this system is forbidden and will be      #
# prosecuted by law. Disconnect IMMEDIATELY if you are not         #
# authorized user.                                                 #
#                                                                  #
# By accessing this system, you agree that your actions may be     #
# monitored if unauthorized usage is suspected.                    #
#                                                                  #
####################################################################
Login: someuser
Password:
<snipped>
Use show router interface and look for the line that contains the destination
IP address, plus the line before it.
*A:PKGGBK0406W# show router interface | match 10.100.17.54 pre-lines 1
To_PNC1_7750_G_4/2/12            Up          Up/Down     Network 2/1/5
   10.100.17.54/30                                       n/a
*A:PKGGBK0406W#

Example #1 Check for the ports involved (contd)

The previous checks confirm that the SAA alarm correctly indicates a problem between
CSN PNC-1 (PNCABKCA17W) port 4/2/12 and
PTN PKG-04 (PKGGBK0406W) port 2/1/5,
as indicated by the SAA name.
If the checks reveal inconsistent information, for example
- an incorrect destination node name
- an incorrect source/destination port number
then the results of the CLI steps above should be used as the reference.
*** Correcting the SAA name so that it reflects the correct source/destination requires you to:
- remove the SAA test and re-create it in the Service Test Manager GUI
- reconfigure the new SAA test in the CLI to add the next-hop parameter
- edit the script file to update the name of the SAA test to be executed

SAA Alarm Example #2


In this case, PTN RST-02 rebooted: the reboot completed at 13:34 and started some time before that.
It can be concluded that the SAA alarms on the underlying RCUs were caused by the PTN reboot.

A node reboot at the PTN causes SAA tests to fail at many RCUs.

SAA Alarm Example #3


Alarm on the SAA test named E_TEKC02_02/1/03-TEKC23_01/2/02. This alarm occurred 20 days ago, and there is no clue in the
Active Alarm window.
Try checking the Historical Alarm window.

SAA Alarm Example #3 (Contd)


The Historical Alarm window indicates that there was a problem with the links between EKC-23
and RCUs such as EKC-02 and EKC-20.
In this case, no need

Further Troubleshooting Commands


If the Alarm / Historical Alarm windows cannot identify the cause of the
SAA alarm, apply the following commands to the related
components (port/card/SAP/SDP).
ICMP Ping Test (for network ports)
show port <slot/mda/port> detail
show card <slot> detail
ETH-CFM (for access ports)
show service id <service-id> sap <sap> detail
show service id <service-id> sdp <sdp-id> detail
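Which of these commands applies depends on the test type, so a small hypothetical helper can assemble them from the port, service ID and SAP/SDP identified in the earlier steps. It only builds the command strings listed above; it does not connect to the NE, and followup_commands is an illustrative name.

```python
from typing import List, Optional


def followup_commands(test_type: str, port: Optional[str] = None,
                      service_id: Optional[str] = None,
                      sap: Optional[str] = None,
                      sdp: Optional[str] = None) -> List[str]:
    """Build the further troubleshooting show commands for one SAA test."""
    if test_type == "icmp-ping":
        if port is None:
            raise ValueError("icmp-ping checks need the network port")
        slot = port.split("/")[0]  # card/slot is the first field of slot/mda/port
        return [f"show port {port} detail", f"show card {slot} detail"]
    if test_type == "eth-cfm":
        if service_id is None or (sap is None and sdp is None):
            raise ValueError("eth-cfm checks need a service ID and a SAP or SDP")
        cmds = []
        if sap is not None:
            cmds.append(f"show service id {service_id} sap {sap} detail")
        if sdp is not None:
            cmds.append(f"show service id {service_id} sdp {sdp} detail")
        return cmds
    raise ValueError(f"unknown test type: {test_type!r}")


if __name__ == "__main__":
    for cmd in followup_commands("icmp-ping", port="4/2/12"):
        print(cmd)
```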
