Common Faults and Alarms On The RTN
Handling Common Faults and Alarms on the RTN Network
www.huawei.com
11 Reference Documents
Start: Check alarms. Check key configurations. Collect data.
1. Are there any wrong operations? If yes, perform rollbacks.
2. Are there any ODU or IF board faults? If yes, handle the alarms.
3. Is the Tx power normal? If no, handle the fault.
4. Is the Rx power lower than normal? If yes, handle the fault.
5. Does fading cause abnormal Rx power? If yes, handle the fault.
6. Are links faulty unidirectionally? If yes, handle the fault.
7. Locate faults by performing loopbacks.
After each handling action, check whether the faults are rectified. If yes, end. If no, go to the next step.
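The ordered checks in the flow above can be sketched as a first-match walk through a list. This is only an illustration: the observation keys (`wrong_operation`, `rx_power_low`, and so on) are hypothetical names chosen for this sketch, not NMS fields or alarm names.

```python
from typing import Dict

# Ordered checks taken from the flow above. The keys are hypothetical
# observation names invented for this sketch.
CHECKS = [
    ("wrong_operation", "Perform rollbacks."),
    ("odu_or_if_board_fault", "Handle alarms."),
    ("tx_power_abnormal", "Handle the fault (Tx power)."),
    ("rx_power_low", "Handle the fault (Rx power lower than normal)."),
    ("fading", "Handle the fault (fading)."),
    ("unidirectional_fault", "Handle the fault (unidirectional link fault)."),
]


def next_action(observations: Dict[str, bool]) -> str:
    """Return the action of the first check that is true.

    If no check is true, fall back to the last step of the flow.
    """
    for key, action in CHECKS:
        if observations.get(key, False):
            return action
    return "Locate faults by performing loopbacks."


if __name__ == "__main__":
    print(next_action({"rx_power_low": True}))
```

Keeping the checks in a list preserves the order of the flow, so an earlier condition always wins over a later one.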
reflection point fall on the ground that has a small reflection coefficient,
reducing multipath fading.
•Configure 1+1 SD for microwave links.
•For microwave links with 1+1 SD, adjust the height difference between two
antennas' Tx power.
To handle up-fading:
•Check for co-channel interference.
•Use a spectrum analyzer to analyze interference sources.
•Contact the spectrum management department to clear the interference.
Handling Procedure
See "Handling Faults of Microwave Links."
Cause 2: If the ODU reports the alarm, handle the alarm as follows:
Check whether the alarm is caused by other alarms.
[Flowchart: Check for physical-layer alarms according to the type of NNI port (E1, MLPPP, or MW). Are there any alarms on E1 ports? Are there any alarms on IF ports? If yes, handle the alarms. Are there any alarms on the boards? If yes, reset or replace the alarmed boards.]
[Figure: Network topology. BTS 1, BTS 2, and BTS 3 connect through RTN NEs over GE/FE, 10G/GE, and STM-1 links to the BSC. CES and ETH services are carried over the MPLS core network.]
Possible Causes
1. Fiber cuts
2. Faulty optical modules
3. Excessive optical attenuation
Possible Causes
1. Negotiation fails due to different working modes at the two ends.
2. Electrical cables, fiber connections, or opposite units are faulty.
Possible Causes
1. Excessive bit errors are detected at the MAC layer.
2. Line signals degrade.
3. Fiber performance deteriorates.
4. Optical ports are dirty.
Possible Causes
1. Failure in received signals
2. Malfunction of clock extraction modules
Possible Causes
1. Fiber cuts
2. Excessive loss on the line
3. Malfunction of opposite transmit units
Possible Causes
1. Excessive attenuation of received signals
2. Unframed structure of signals from the opposite site
3. Malfunction of local receive units
Handling Procedure
Cause 3: Fibers are misconnected.
Verify that fibers are connected correctly.
Cause 4: The signals transmitted from the opposite site do not have the frame structure.
Check for the HARD_BAD alarm on the opposite transmit board and clear this alarm
immediately if it is reported.
Cause 5: The local receive board is faulty.
Check for the HARD_BAD alarm on the local receive board and clear this alarm
immediately if it is reported.
Possible Causes
1. E1/T1 services are not received.
2. Fibers on the DDF-side E1/T1 output ports are disconnected or loosely connected.
3. Fibers on local E1/T1 output ports are disconnected or loosely connected.
4. A certain board is faulty.
5. The electrical cable is faulty.
Possible Causes
Some alarms are reported on the opposite site.
Handling Procedure
Cause 3: The opposite equipment is faulty.
Perform a self-loop for the alarmed channel on the DDF side. If the alarm clears, the
opposite equipment is faulty and the fault needs to be rectified.
Cause 4: The electrical cable is faulty.
Perform a self-loop for the alarmed channel on the DDF side. If the alarm persists, perform
a self-loop for the alarmed channel on the interface board side. If the alarm clears, the
E1 cable is faulty and needs to be replaced.
Cause 5: The alarmed board is faulty.
Perform a self-loop for the alarmed channel on the interface board side. If the alarm
persists, set an inloop for the alarmed channel on the NMS. If the alarm clears, the
interface board is faulty and needs to be replaced.
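The three loopback tests in Causes 3 to 5 narrow the fault down step by step, from the far end toward the local board. A minimal sketch of that logic follows, assuming each test result is recorded as a boolean meaning "the alarm cleared". The function and argument names are hypothetical, and the "undetermined" outcome is an addition of this sketch for the case the procedure above does not cover.

```python
def locate_e1_fault(clears_on_ddf_loop: bool,
                    clears_on_board_loop: bool,
                    clears_on_nms_inloop: bool) -> str:
    """Map loopback results to the faulty segment.

    Each argument is True if the alarm cleared during that test:
    a self-loop on the DDF side, a self-loop on the interface board
    side, and an inloop set on the NMS.
    """
    if clears_on_ddf_loop:
        # Everything up to the DDF works, so the fault is beyond it.
        return "opposite equipment"
    if clears_on_board_loop:
        # The board port works but the path to the DDF does not.
        return "E1 cable"
    if clears_on_nms_inloop:
        # Only the internal loop works, so the interface board is suspect.
        return "interface board"
    # Not covered by the procedure above; further checks are needed.
    return "undetermined"


if __name__ == "__main__":
    print(locate_e1_fault(False, True, False))
```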
Handling Procedure
Cause 3: The local receive power is lower than the lower threshold.
Verify that the bending radius of the pigtail on the local site is no smaller than 6 cm.
If the alarm persists, use proper optical attenuators and correctly connect the
local optical module. If the alarm persists, replace the optical module and clean
the fiber connectors at the two ends.
Cause 4: The receive board is faulty.
Check whether the processing board and cross-connect board on the local site report
any hardware-related alarms such as HARD_BAD and TEMP_OVER. If yes, replace
the boards that report hardware-related alarms.
1. Are there any equipment alarms? If yes, handle the alarms.
2. Are there any pointer justifications? If yes, handle the pointer justifications.
3. For SDH optical interface boards, handle RS errors on SDH optical interface boards.
5. For STM-1 electrical boards, handle RS errors on STM-1 electrical interface boards.
6. Are there any alarms or events related to MS errors or HOP errors? If yes, handle MS errors and HOP errors.
7. Are there any alarms related to LOP errors? If yes, handle LOP errors.
End
Handling Procedure
Cause 1: The number of E1 signals is different on both ends of a microwave link.
Cause 2: The AM enabling is different on both ends of a microwave link.
Cause 3: The IEEE 1588 overhead enabling is different on both ends of a microwave link.
Cause 4: The modulation mode is different on both ends of a microwave link.
Cause 5: The channel spacing is different on both ends of a microwave link.
Determine the possible cause of the alarm according to the alarm parameters. Then, check the configuration on
both ends of the microwave link. Ensure that the configuration is the same on both ends of the microwave link.
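The five causes above all come down to one parameter differing between the two ends of the link. A minimal sketch of such a comparison follows, assuming the configuration of each end has been exported into a dictionary. The key names and sample values are hypothetical and do not correspond to actual NMS parameter names.

```python
# Parameters that must match on both ends, one per cause listed above.
# The key names are hypothetical labels chosen for this sketch.
LINK_PARAMS = (
    "e1_count",
    "am_enabled",
    "ieee1588_overhead_enabled",
    "modulation_mode",
    "channel_spacing_mhz",
)


def mismatched_params(local: dict, remote: dict) -> list:
    """Return the parameters whose values differ between the two ends."""
    return [p for p in LINK_PARAMS if local.get(p) != remote.get(p)]


if __name__ == "__main__":
    local = {"e1_count": 16, "am_enabled": True,
             "ieee1588_overhead_enabled": False,
             "modulation_mode": "16QAM", "channel_spacing_mhz": 28}
    remote = dict(local, am_enabled=False, channel_spacing_mhz=14)
    print(mismatched_params(local, remote))
```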
•HARD_BAD, TEMP_OVER, BUS_ERR, or COMMUN_FAIL occurs? If yes, there are board hardware errors or an inter-board communication failure. Reset, reseat, or replace the boards.
•MPLS_TUNNEL_LOCV occurs? If yes, there are tunnel faults. Troubleshoot the physical links or the opposite equipment.
•SYNC_C_LOS or LTI occurs? If yes, the synchronization clock is lost. Troubleshoot the clock faults or the opposite equipment.
•Faults are rectified? If no, go to the next check.
COMMUN_FAIL
T_ALOS
UP_E1_AIS or DOWN_E1_AIS
MPLS_TUNNEL_LOCV
SYNC_C_LOS or LTI
Cause 1: The board carrying CES services cannot work properly due to hardware
errors, over-high temperature, or inter-board communication failure.
Cause 2: The signal transmitted to the processing board or interface board is lost or
degrades.
Cause 3: The tunnel or PW carrying CES services is interrupted.
Cause 4: On the NE, the priority of synchronization clock source is lost, or the
synchronization clock source is lost.
Cause 5: On the PW carrying CES services, the number of lost packets, errored
packets, or jitters within a time unit crosses the threshold.
Handling Procedure
Cause 1: Clock synchronization cannot be performed.
On the NMS, check whether the LTI or other clock alarms are reported. If yes, clear these alarms.
Cause 2: Link quality deteriorates, causing more jitters.
Check whether the alarmed port also reports IN_PWR_ABN or TEM_HA. If yes, clear the IN_PWR_ABN or TEM_HA alarm
immediately.
Cause 3: The size of buffer area is set to a low value.
On the NMS, increase the size of buffer area if possible.
Cause 4: There are too many hops of microwave link on the network side, which generates a large number of jitters.
Reduce the number of hops on the network side.
Handling Procedure
Cause 1: Clock synchronization cannot be performed.
On the NMS, check whether the LTI or other clock alarms are reported. If yes, clear these alarms.
Cause 2: Parameter settings are different at the two ends of CES services.
Set the parameters to the same values at the two ends.
Cause 3: The tunnel or PW carrying CES services is congested.
On the NMS, check whether the bandwidth configured for the tunnel or PW is too low and whether the QoS parameters
are set properly. If the bandwidth and QoS settings cannot meet the requirements of CES services, increase the
bandwidth, replan the service trail, and change QoS settings.
Cause 4: The link signal deteriorates or is interrupted due to a fault of cables, optical fibers, or optical modules.
Verify that electrical cables and fibers are correctly connected to the ports. Clean the fiber connectors and optical
modules. If the alarm persists, replace the cables, fibers, or optical modules that may be faulty.
Handling Procedure
Cause 1: Parameters of CES services are set incorrectly.
Modify the incorrect parameter settings on the NMS.
Cause 2: The tunnel or PW carrying CES services is congested.
On the NMS, check whether the bandwidth configured for the tunnel or PW is too low and whether the QoS
parameters are set properly. If the bandwidth and QoS settings cannot meet the requirements of CES
services, increase the bandwidth, replan the service trail, and change QoS settings.
Cause 3: The link signal deteriorates or is interrupted due to a fault of cables, optical fibers, or optical
modules.
Verify that electrical cables and fibers are correctly connected to the ports. Clean the fiber connectors and
optical modules. If the alarm persists, replace the cables, fibers, or optical modules that may be faulty.
Handling Procedure
Cause 1: Clock synchronization cannot be performed.
On the NMS, check whether the LTI or other clock alarms are reported. If yes, clear these alarms.
Cause 2: The tunnel or PW carrying CES services is congested.
On the NMS, check whether the bandwidth configured for the tunnel or PW is too low and whether the QoS
parameters are set properly. If the bandwidth and QoS settings cannot meet the requirements of CES
services, increase the bandwidth, replan the service trail, and change QoS settings.
Cause 3: The link signal deteriorates or is interrupted due to a fault of cables, optical fibers, or optical
modules.
Verify that electrical cables and fibers are correctly connected to the ports. Clean the fiber connectors and
optical modules. If the alarm persists, replace the cables, fibers, or optical modules that may be faulty.
Handling Procedure
Cause 1: Parameter settings are different at the two ends of CES services.
Set the parameters to the same values at the two ends.
Cause 2: Fibers or cables are connected incorrectly.
Reconnect the fibers or cables correctly.
•HARD_BAD, TEMP_OVER, BUS_ERR, or COMMUN_FAIL occurs? If yes, there are board hardware errors or an inter-board communication failure. Reset, reseat, or replace the boards.
•LOOP_ALM occurs? If yes, loopbacks are performed on ports. Release the loopbacks.
•Faults are rectified? If yes, end. If no, contact Huawei engineers.
COMMUN_FAIL
ETH_LOS, ETH_LINK_DOWN, ETH_AUTO_LINK_DOWN, or
LOOP_ALM
LASER_SHUT or LSR_WILL_DIE
LSR_WILL_DIE
FLOW_OVER
Cause 1: The board carrying ETH services cannot work properly due to hardware
errors, over-high temperature, or inter-board communication failure.
Cause 2: The signal is lost in the receive direction.
Cause 3: Negotiation between Ethernet ports fails due to incorrect connections
on Ethernet ports.
Cause 4: Loopbacks are performed for Ethernet ports.
Cause 5: Traffic limit on Ethernet ports is set to a low value or parameter settings
are different on source and sink ports.
Perform TraceRoute tests.
Start link-layer detection.
Common Symptoms
MPLS tunnels cannot be created, and therefore services cannot be provisioned.
MPLS tunnels are faulty, causing service interruption.
Protection switching fails, causing service interruption, packet loss, or bit errors.
Common Causes
Cause 1: Cross-connections cannot be created.
Cause 2: The physical links carrying the tunnels are faulty.
Cause 3: Protection switching fails.
Handling Procedure
Cause 1: The ingress node on the tunnel stops transmitting CV/FFD packets.
1. Check whether the settings of detection mode and detection packet type are consistent on the two ends.
If not, make consistent settings.
2. Check the parameter of CV/FFD status on the ingress node. If the CV/FFD status is disabled, change it to
enabled.
Cause 2: The physical link carrying the tunnel is faulty.
On the NMS, check whether the egress node reports the HARD_BAD, ETH_LOS, or ETH_LINK_DOWN alarm. If
yes, clear this alarm.
Handling Procedure
Cause: The upstream NE detects that the tunnel at the physical layer is faulty.
On the physical link between the local NE and its upstream NE, check for
faults such as fiber cuts, optical module failures, and board failures. Rectify
any fault that is found.
Common Symptoms
1. PWs cannot be created, and therefore services cannot be provisioned.
2. PWs are faulty, causing service interruption, packet loss, or bit errors.
Common Causes
Cause 1: The physical link carrying the PW is faulty.
Cause 2: Cross-connections of PWs cannot be created.
Cause 3: The tunnels carrying PWs are faulty.
Possible Causes
Cause: A small number of packets are lost on the PW.
Handling Procedure
Cause 1: A small number of packets are lost on the PW.
Check whether any service ports on the PW are congested. If yes, replan
the trail of services or increase the bandwidth of congested ports.
After the working channel of a 1+1 protection group is restored, services cannot
Possible Causes
Cause 1: SNCP switching fails because the NE software version mismatches the
board software version.
Cause 2: The working and protection channels of an SNCP protection group fail.
Cause 3: TU_AIS insertion upon E1_AIS is not provided (for OptiX RTN 600 V100R005
and OptiX RTN 900 V100R002C01 and later versions).
•Fibers or cables are connected incorrectly? If yes, reconnect the fibers or cables.
•APS protocol is enabled on both ends? If no, enable the APS protocol on both ends.
•Hardware alarms occur? If yes, rectify the board hardware faults.
•Clock alarms occur? If yes, troubleshoot the clocks.
•Tunnel-level alarms occur on the protection channel? If yes, troubleshoot the protection channel.
•Faults are rectified? If yes, end. If no, contact Huawei engineers.
ETH_APS_SWITCH_FAIL
ETH_APS_TYPE_MISMATCH
MPLS_TUNNEL_MISMERGE
MPLS_TUNNEL_MISMATCH
MPLS_TUNNEL_Excess
MPLS_TUNNEL_SD
MPLS_TUNNEL_SF
MPLS_TUNNEL_UNKNOWN
Cause 1: The settings of the APS protection group differ between the two ends.
Cause 2: The APS protection group is deactivated.
Cause 3: Fibers or cables are connected incorrectly.
Cause 4: APS frames cannot be transmitted because hardware-related alarms
occur on the board that carries the protection channel.
Cause 5: The system reports clock alarms.
Cause 6: The working tunnel or protection tunnel is faulty.
Cause 2: The settings of the APS protection group differ between the two ends.
Handling Procedure
Cause 1: The opposite NE is not configured with APS protection.
On the NMS, check whether the opposite NE is configured with APS protection. If the opposite NE is
not configured with APS protection, create a matching APS protection group on the opposite NE and
activate the APS protocol.
Cause 2: The settings of the APS protection group differ between the two ends.
On the NMS, check whether the settings of the APS protection group are the same at the two ends.
If the settings differ between the two ends, make them consistent.
Cause 3: The APS protection group is deactivated.
Check whether the APS protocol is activated at both ends. If the APS protocol is deactivated at one
end, deactivate the APS protocol at the other end and then activate the APS protocol at both ends.
Cause 4: The service on the protection channel is interrupted.
Check whether the protection channel reports an alarm related to signal loss or signal degrade,
such as ETH_LOS. If yes, clear the alarm immediately.
Possible Causes
Cause 1: The settings of the APS protection group differ between the two ends.
Handling Procedure
Cause 1: The settings of the APS protection group differ between the two ends.
On the NMS, check whether the settings of the APS protection group are the same at the two ends. If the
settings differ between the two ends, make them consistent. Then, deactivate and activate the APS
protection group at the two ends.
Possible Causes
Cause 1: The switching type is different.
Handling Procedure
Cause: The switching type, switching mode, or revertive mode of the protection group differs between the
two ends.
On the NMS, check whether the settings of the APS protection group are the same at the two ends. If the
settings differ between the two ends, make them consistent. Then, deactivate and activate the APS
protection group at the two ends.
HUAWEI TECHNOLOGIES CO., LTD. Huawei Confidential
Locating ETH LAG Faults - Common Locating Process
LOOP_ALM
ETH_LOS
ETH_LINK_DOWN
Cause 1: The NEs at the two ends of the LAG are incorrectly
configured.
Cause 2: The working mode of the member ports in the LAG is set to
half-duplex.
Cause 3: The loopback is configured on the member ports in the
LAG.
Cause 4: The connections of the member ports in the LAG are faulty
or lost.
Handling Procedure
Cause 1: The opposite NE is not configured with any LAGs.
On the NMS, check whether the opposite NE is configured with a LAG. If the
opposite NE is not configured with a LAG, configure one on the opposite NE
and check whether the alarm clears.
Cause 2: All member ports in the LAG are unavailable.
When a member port in the LAG is unavailable, the system generates an
ETH_LOS, ETH_LINK_DOWN, or LAG_MEMBER_DOWN alarm. Handle and
clear the alarm and activate the member port.
Possible Causes
Cause 1: The port link is unavailable.
Handling Procedure
Cause 1: The port link is unavailable.
On the NMS, check whether the port in the LAG is enabled. If the port is not enabled, enable the port in
the LAG and check whether the alarm clears. If the alarm persists, check whether an
ETH_AUTO_LINK_DOWN alarm occurs on the port that reports the LAG_MEMBER_DOWN alarm. If yes,
clear the LAG_MEMBER_DOWN alarm.
Cause 2: The port receives no LACP packet.
On the NMS, check whether the opposite port is added to the LAG. If the opposite port is not added to
the LAG, add the opposite port to the LAG and check whether the alarm clears. If the alarm persists,
check whether an ETH_LOS or FLOW_OVER alarm occurs on the port that reports the
LAG_MEMBER_DOWN alarm. If yes, clear the LAG_MEMBER_DOWN alarm.
Cause 3: The port works in half-duplex mode.
Change the working mode of the port to auto-negotiation or full-duplex.
Cause 4: The port is looped back.
Release the loopback on the port.
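The four causes above can be checked in one pass over the state of a LAG member port. A minimal sketch follows, assuming the port state has already been read from the NMS into a small record. The `LagPort` class and its field names are hypothetical and exist only for this illustration.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class LagPort:
    """Hypothetical snapshot of a LAG member port, for illustration only."""
    enabled: bool       # Cause 1: is the port enabled in the LAG?
    peer_in_lag: bool   # Cause 2: is the opposite port added to the LAG?
    duplex: str         # Cause 3: "full", "half", or "auto"
    looped_back: bool   # Cause 4: is a loopback set on the port?


def lag_member_actions(port: LagPort) -> List[str]:
    """Return the corrective actions that apply, in the order of the causes."""
    actions = []
    if not port.enabled:
        actions.append("Enable the port in the LAG.")
    if not port.peer_in_lag:
        actions.append("Add the opposite port to the LAG.")
    if port.duplex == "half":
        actions.append("Change the working mode to auto-negotiation or full-duplex.")
    if port.looped_back:
        actions.append("Release the loopback on the port.")
    return actions


if __name__ == "__main__":
    print(lag_member_actions(LagPort(True, True, "half", True)))
```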
priority list, but the external clock source cannot be detected or becomes
invalid.
Handling Procedure
Cause 1: The external clock source is configured in the clock source priority
list, but the external clock source cannot be detected or becomes invalid.
Check whether the equipment that provides the external clock source is
faulty, and check whether the cable that connects the external clock
source is normal.
Handling Procedure
Cause 1: The clock configuration is incorrect.
Query the clock synchronization status and check whether the data in the clock
source priority list meets the network planning requirement.
Cause 2: All the clock sources in the clock source priority list fail.
Troubleshoot the synchronization sources based on the clock source priority list.
If the synchronization source is an external clock, handle the EXT_SYNC_LOS
alarm; if the synchronization source is a line clock, handle the alarm that occurs
on the line board; if the synchronization source is an IF clock, handle the alarm
that occurs on the IF board; if the synchronization source is a tributary clock,
handle the alarm that occurs on the tributary board; if the synchronization
source is an Ethernet clock, handle the alarm that occurs on the Ethernet
board.
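The rule in Cause 2 is a lookup from the type of synchronization source to the place where the alarm must be handled. A minimal sketch follows, assuming the priority list is available as an ordered list of source types. The type labels (`"external"`, `"line"`, and so on) are hypothetical names chosen for this sketch.

```python
from typing import List, Tuple

# Source type -> handling action, as described in Cause 2 above.
# The dictionary keys are hypothetical labels chosen for this sketch.
SOURCE_ACTIONS = {
    "external": "Handle the EXT_SYNC_LOS alarm.",
    "line": "Handle the alarm that occurs on the line board.",
    "if": "Handle the alarm that occurs on the IF board.",
    "tributary": "Handle the alarm that occurs on the tributary board.",
    "ethernet": "Handle the alarm that occurs on the Ethernet board.",
}


def troubleshoot_plan(priority_list: List[str]) -> List[Tuple[str, str]]:
    """Return (source type, action) pairs in priority order.

    Raises KeyError for a source type that is not listed above.
    """
    return [(source, SOURCE_ACTIONS[source]) for source in priority_list]


if __name__ == "__main__":
    for source, action in troubleshoot_plan(["external", "if"]):
        print(source, "->", action)
```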
Common Clock Alarms (3)
The S1_SYN_CHANGE is an alarm indicating that the clock source is
switched in SSM or extended SSM mode.
Possible Causes
Cause 1: The original clock source is lost when the SSM protocol or extended
Handling Procedure
Cause 1: The clock source is lost.
Based on the clock source priority list, determine the synchronization
source corresponding to the lost clock source.
Incorrect IF cable
connections
Fault Symptoms
On a new OptiX RTN network, NE01, NE02, and NE03 formed a chain. A user could log in to
NE03 from NE02 but could not log in to NE03 from NE01.
Cause Analysis
Possible cause 1: NE03 has a hardware fault, causing a DCN communication failure.
Possible cause 2: The network configuration is incorrect.
Handling Procedure
(1) Queried NE03's adjacent routes and found that the NE IDs of NE01 and NE02 were
displayed.
(2) Performed a reset on NE03 and found that the fault persisted.
(3) Checked NE03 on site, and found that one optical port of the EG2 board was connected
to NE02 and another optical port of the EG2 board was connected to NE04.
(4) Logged in to NE04 and found that the NE ID of NE04 was the same as that of NE01.
(5) Changed the NE ID of NE04 to a unique value on the network. Then, logged in to NE03
from NE01. The login was successful.
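The root cause in this case was two NEs sharing one NE ID. A check of this kind can be sketched as follows, assuming an inventory that maps NE names to NE IDs is available. The ID values in the example are invented for illustration.

```python
from collections import defaultdict
from typing import Dict, List


def duplicate_ne_ids(ne_ids: Dict[str, int]) -> Dict[int, List[str]]:
    """Return every NE ID that is used by more than one NE.

    The result maps each duplicated ID to the sorted names of the NEs using it.
    """
    by_id = defaultdict(list)
    for name, ne_id in ne_ids.items():
        by_id[ne_id].append(name)
    return {ne_id: sorted(names) for ne_id, names in by_id.items() if len(names) > 1}


if __name__ == "__main__":
    # Invented IDs: NE04 reuses the ID of NE01, as in the case above.
    inventory = {"NE01": 1, "NE02": 2, "NE03": 3, "NE04": 1}
    print(duplicate_ne_ids(inventory))
```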
Handling Procedure
(1) Connected a BER tester to NE01 and set an inloop at one 2 Mbit/s port of NE04. The BER tester
detected a large number of bit errors.
(2) Configured a static ARP entry at NE03 with the MAC address being the egress port of NE03 and
the IP address being NE04, and created a tunnel whose egress label was the same as its ingress
label between NE03 and NE04.
(3) Set an outloop at the network-side port of NE04. Then, on NE03, set an inloop at the network-
side port that was connected to NE04. In both cases, the BER tester detected bit errors.
(4) On NE03, set an outloop at the network-side port that was connected to NE02 and found that no
bit error occurred. Therefore, it was inferred that NE03 malfunctioned.
(5) On NE03, replaced the 10GE line board that was connected to NE02.
Handling Procedure
(1) Suspected that the clock configuration of NE01 was incorrect because NE01 did not
report an alarm.
(2) Queried the clock source priority lists of NE01 and NE02, and found that NE01 traced
the line clock from optical port 1 on the EG2 board in slot 1 (of NE01) and NE02 traced
the line clock from optical port 1 on the EG2 board in slot 2 (of NE02). The two optical
ports were directly interconnected. As a result, the clock signals traced by NE01 and
NE02 formed a loop, resulting in clock quality deterioration and large clock frequency
deviations on the NodeBs connected to NE01.
(3) Changed the clock source of NE01 according to the NE planning table.
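The fault in this case was a clock tracing loop: NE01 traced NE02 while NE02 traced NE01. A minimal sketch of how such a loop could be detected follows, assuming the tracing relationships have been collected into a dictionary that maps each NE to the NE it traces. The function name and data layout are hypothetical.

```python
from typing import Dict, List, Optional


def find_clock_loop(traces: Dict[str, Optional[str]]) -> Optional[List[str]]:
    """Return the NEs that form a clock tracing loop, or None if there is none.

    `traces` maps each NE to the NE whose clock it traces. A value of None
    means the NE traces a reference source outside this map.
    """
    for start in traces:
        seen: List[str] = []
        node: Optional[str] = start
        # Follow the tracing chain until it ends or revisits an NE.
        while node is not None and node not in seen:
            seen.append(node)
            node = traces.get(node)
        if node is not None:
            # The chain came back to an NE already visited: a loop.
            return seen[seen.index(node):]
    return None


if __name__ == "__main__":
    print(find_clock_loop({"NE01": "NE02", "NE02": "NE01"}))
```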
(2) Analyzed the distribution of the affected NEs and found that all interrupted
services were first converged to an NE and then backhauled to the BSC.
(3) Found that an ARP entry was frequently and automatically added and deleted on
the convergence NE. Changed the ARP entry to a static entry. Then, the tunnel
alarms cleared and some services were restored.
(4) Checked the configurations of the convergence NE, and found that the NE was
configured with multiple tunnels and that the next-hop IP address of the port was set
to the same value as the next-hop IP address of the convergence port. The incorrect
settings caused abnormal ARP learning and further interrupted tunnel services.
(5) Deleted incorrectly configured services and tunnels, and re-configured services
and tunnels according to the network planning document. The services were normal
even after the static ARP entry was deleted.
11 Reference Documents
http://support.huawei.com/support/pages/navigation/gotoKBNavi.do?actionFlag=intoKBNavigation&autoFlag=autoThink&colID=ROOTENWEB|CO0000000173&itemId0=29-2&itemId1=3-400
For the preceding documents, please download the latest versions from
support.huawei.com. For any comments or suggestions on the
documents, please send your feedback to Chen Shaoying (employee ID:
59800).