Wireless Mesh Diagnosis: A Survey

Master of Technology - Seminar Report

Vijay P Gabale
Computer Science and Engineering Department,
Indian Institute of Technology, Bombay, India

Under the guidance of Prof. Bhaskaran Raman

May 19, 2008
…specific root cause is difficult.

Thus, a vulnerable network makes it necessary to diagnose faults and take remedial actions as early as possible. This results in a robust, resilient and well-planned network that gives enhanced performance to users with better connectivity. The issues involved, however, are identifying the exact causes of faults across layers, exploiting physical layer information, taking appropriate remedial actions and automating these tasks.

Most of the issues and problems that we encounter in mesh networks are due to the fact that systems are not designed or deployed with support for easy diagnosis built in right from the start. When a wireless link or wireless node goes down, the node might become unreachable or, at worst, the whole network may come to a grinding halt. Until we figure out what has gone wrong, system downtime increases and users lose productivity. Without tools or appropriate measures to characterize these failures, repair work consumes several hours of time and energy.

In the case of dense enterprise networks, the challenge lies in characterizing the wireless behavior by monitoring the network, quantifying physical layer parameters and attributing causes to specific faults. The monitoring infrastructure is intended to provide answers to questions like: how many concurrent transmissions were there? how is per-packet signal strength varying over time? are there any non-reachable nodes? The answers to these questions help identify hidden terminals, antenna misalignment and RF holes (and similar problems) respectively.

While we can deploy multiple sensors to perform fault diagnosis in dense enterprise networks, the problem over long-distance links through remote (rural) areas becomes critical. Typically, nodes are spread far apart from each other, expertise to solve simple networking problems may not be available, and figuring out a fault at a distant node requires a personal visit. The causes of faults can be as diverse as interference from a water pump in a farm field or damage to router boards due to power spikes. Failures due to poor power quality can be rife in rural areas. The task lies in arriving at a remote monitoring solution; deciding whether to have a pull-based architecture, where a daemon running on one machine queries the others, or a push-based architecture, where every node pushes pertinent data to a central server; and incorporating additional software and hardware components to make recovery automatic whenever possible.

This survey attempts to provide answers to questions like: What techniques should be in place to get insights into the possible causes? How do we categorize the causes of a fault? If the diagnosis says that interference is the problem, then how do we go about fixing it? What is the mechanism to apply or action to take? The organization of the report is as follows: Section 2 classifies and explores various techniques proposed to handle fault diagnosis. Section 3 elaborates on how to handle specific faults in wireless mesh networks. Section 4 briefly states the system components required for the various techniques. Finally, the survey concludes with the possible scope for future work and conclusion in section 5 and section 6 respectively.

2 Survey of existing techniques

To provide answers to these questions in a systematic manner, network management and fault diagnosis research has been undertaken in the domain of small and large scale enterprise WiFi deployments. Comparatively fewer efforts have been made so far towards long-distance mesh network management and root cause analysis. This section categorizes the approaches taken so far for enterprise and long-distance fault diagnosis into five major approaches: (a) offline diagnosis on network traces (b) online anomaly detection system (c) simulating expected behavior to compare with observed behavior (d) a daemon working as a part of the node (e) system incorporating additional (redundant) software and hardware components in the node.

2.1 Offline diagnosis

In this approach, multiple monitors are deployed close to clients and APs. Each monitor individually collects data, bundles it over a time period and sends it to a central machine. The central machine, equipped with inference measures, collects these traces and applies certain techniques to synchronize and unify them. To synchronize traces received across clients, either beacon frames, which carry a unique 64-bit timestamp, are used, or certain unique reference frames are identified which help order the frames in the sequence in which they were transmitted.

Although the dense deployment of monitors is expected to capture all ongoing transmissions, some transmissions sometimes elude the monitors. To infer these losses and to build a comprehensive trace of wireless activity, Finite State Machines (FSMs) are developed for the wireless protocols and applied over the traces. Custom inference techniques are then used to infer frames missed by the monitors themselves or to get rid of duplicates. Eventually a single unified trace characterizing the entire wireless behavior over a period of time is built. Figure 1 shows the technique pictorially. [1] uses this approach, where around 150 passive radios collect traces and the complete wireless behavior is reconstructed in terms of records and conversations. Inference techniques are then applied to detect concurrent transmissions and calculate the number of frames that collided. The motive behind this system is to produce a precisely synchronized global picture of physical, link, network and transport layer activities for the analysis of large 802.11 networks. The system gives deeper insights about the fraction of overall traffic made up of beacon and ARP traffic in their network, the probability of interference given simultaneous transmissions, and overprotective 802.11g devices.

Figure 1: Offline diagnosis framework
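The unification step can be sketched as below. This is a toy illustration, not the actual inference machinery of [1]; it assumes each monitor's trace is a list of (timestamp, sender, sequence number) tuples already mapped onto a common clock using the beacon timestamps:

```python
def unify_traces(monitor_traces):
    """Merge per-monitor packet traces into one global trace.

    monitor_traces: list of per-monitor traces, each a list of
    (timestamp, sender, seq) tuples on a common clock.
    """
    seen = set()
    unified = []
    for trace in monitor_traces:
        for ts, sender, seq in trace:
            key = (sender, seq)        # same frame heard by many monitors
            if key not in seen:        # drop duplicates, keep first sighting
                seen.add(key)
                unified.append((ts, sender, seq))
    unified.sort(key=lambda rec: rec[0])   # order by transmission time
    return unified

# Two monitors overhearing overlapping sets of frames:
m1 = [(1.0, "A", 10), (2.0, "B", 7)]
m2 = [(1.05, "A", 10), (3.0, "A", 11)]
print(unify_traces([m1, m2]))
```

The duplicate sighting of frame (A, 10) is discarded, leaving a single time-ordered trace; the FSM-based recovery of frames missed by every monitor would run on top of such a unified trace.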
2.2 Online diagnosis

While the offline approach works well to characterize the entire network, we require an online and dynamic setting to detect faults as soon as they take place. The online diagnosis approach also involves deploying multiple monitors close to clients and APs to capture transmitted frames. Nodes periodically sample parameters like noise floor, signal strength etc. and forward them to a central inference engine which makes decisions dynamically. The inference engine running at this central machine outputs probable faults, such as RF holes (based on known spatial locations), link asymmetries etc.

[2] employs this approach: faults like hidden terminals, capture effect and noise are artificially replicated in the network, and the behavior of the network under a known fault is characterized in terms of thresholds. Whenever certain parameters cross a pre-defined threshold, the corresponding fault is flagged as the possible reason for the anomaly. This study very elegantly characterizes faults like hidden terminals, capture effects and non-802.11 interference with the help of a small test bed and custom techniques to introduce noise.

2.3 Diagnosis through simulation

Figure 2: Decision tree for detecting faults in simulation

Troubleshooting[3] uses this technique to categorize faults as packet dropping at the receiver, excessive transmissions resulting in link congestion, external interference, or MAC misbehavior.

The search space of candidate faults is high-dimensional due to combinations of faults. But to make the search efficient, we can take advantage of the fact that different types of faults often change only one or a few metrics. For example, external noise sources increase the noise experienced by neighboring nodes but do not increase the sending rate, and therefore can be differentiated from MAC misbehavior and packet dropping at hosts. In [3], a decision tree is built based on this predicate, as shown in figure 2. The figure shows how the difference between simulated parameters and observed parameters can be used along with parameter-specific thresholds to classify the faults. Though simulations are cost-effective, they may not capture the intrinsic wireless behavior, and hence the results obtained often deviate from real-world measurements.
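The per-node threshold test of such a decision tree can be sketched as follows. The metric names, tree shape and numbers are illustrative assumptions, not the exact ones from [3]:

```python
def classify_fault(observed, simulated, thresholds):
    """Classify a fault from the gap between observed metrics and the
    simulated (fault-free) baseline, in the spirit of the decision
    tree of [3]. Metric names and the tree below are illustrative.
    """
    def exceeds(metric):
        return observed[metric] - simulated[metric] > thresholds[metric]

    if exceeds("noise"):
        # External noise raises the noise floor at neighbours but does
        # not raise anyone's sending rate; MAC misbehavior does both.
        return "mac_misbehavior" if exceeds("send_rate") else "external_noise"
    if exceeds("send_rate"):
        return "link_congestion"
    if exceeds("loss_rate"):
        return "packet_dropping"
    return "no_fault"

obs = {"noise": -82.0, "send_rate": 120.0, "loss_rate": 0.02}
sim = {"noise": -95.0, "send_rate": 118.0, "loss_rate": 0.01}
th  = {"noise": 6.0, "send_rate": 50.0, "loss_rate": 0.2}
print(classify_fault(obs, sim, th))   # noise floor 13 dB above baseline
```

Each branch consumes only the one or two metrics that distinguish a fault class, which is exactly what keeps the search over fault combinations tractable.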
…explained in section 3.

2.5 Software and hardware redundancy

This approach is particularly suitable for long-distance links. Through the Aravind and AirJaldi[5] networks, it has been observed that software and hardware failures are quite rampant in long-distance (rural) networks. Thus we require independent control mechanisms to curb software and hardware component failures inside a node. As a part of this structure, a daemon runs on each node to supply parameters regarding the health of the node and its links to neighboring nodes to a remote server. The remote server, in turn, runs certain inference techniques to diagnose causes of poor performance, like packet loss due to interference or external noise. Software and hardware watchdogs are built as part of the node to mitigate problems due to power quality and system malfunction events. Backchannels can be used in case of primary link failure to learn the health of the node. One way to realize a backchannel is to use the Short Messaging Service (SMS) by keeping a mobile phone inside the node.

Beyond Pilots[5] uses these techniques to resolve software and hardware failures experienced in their deployments in India. These techniques are further elaborated in section 3.

3 Fault Diagnosis in wireless mesh networks

The previous section categorized the techniques to deal with fault diagnosis. This section exemplifies how different faults in the network can be detected using these techniques. Here, the fault diagnosis examples are explained in terms of the symptoms observed, the possible causes of the fault, the techniques to be used and the actions to be taken to mitigate the faults. This survey identifies two broad network types for fault diagnosis: dense enterprise networks and long-distance mesh networks. The techniques for the two differ substantially, as do the faults occurring. Separate subsections are dedicated to these networks, followed by a short description of how to get the parameters required to quantify the performance measures.

3.1 Dense enterprise networks

3.1.1 Connectivity problems

• Symptoms: Intermittent connectivity and total failure

• Causes: Weak RF signal, lack of signal, unpredictable ambiance, obstructions

• Techniques: The key lies in tracking the received signal strength values, as they depict the coverage in a specific region. The problem here is that, once a client gets detached from the network due to loss of connectivity, how can it possibly convey the problem to the central administrator? The client conduit protocol[4] mentioned in the Architecture and Techniques paper tackles this as follows:

1. The Diagnostic Client on the disconnected client (disconnected from the AP) configures the machine to operate in promiscuous mode. It scans all channels to determine if any nearby client is connected to the infrastructure network.

2. The disconnected node then forms an AP and broadcasts its beacon like a regular AP.

3. Every client in the network has to perform active scanning; while doing this, when a client receives this beacon, it sends a probe message.

4. The disconnected station becomes a normal station again and sends a reply message.

5. The connected node starts an ad hoc network with the disconnected client via MultiNet. (The connected client first performs authentication through certificates. Also, the number of times this can be done is constrained, to keep the connected client from wasting its resources helping the disconnected client.)

The traces of disconnected clients are then conveyed to the central server. The Double Indirection for Approximating Location (DIAL) protocol is then used to locate the disconnected client. This is done by (a) measuring signal strength values for the connected client, which acts as an intermediary, and then (b) using the parameters of the disconnected client.
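The signal-strength-to-location step of DIAL is not spelled out above. As a hedged sketch, a standard log-distance path-loss model can turn an RSSI reading obtained through the intermediary into a rough distance estimate; the model and the default parameters below are assumptions, not taken from [4]:

```python
def estimate_distance(rssi_dbm, tx_power_dbm=20.0, path_loss_exp=3.0):
    """Rough distance estimate (metres) from received signal strength,
    using a simplified log-distance path-loss model:
        RSSI = Ptx - 10 * n * log10(d)
    so  d = 10 ** ((Ptx - RSSI) / (10 * n)).
    A toy stand-in for the location step of DIAL; the transmit power
    and path-loss exponent defaults are illustrative assumptions.
    """
    return 10 ** ((tx_power_dbm - rssi_dbm) / (10.0 * path_loss_exp))

# A -40 dBm reading with these defaults maps to roughly 100 m.
print(estimate_distance(-40.0))
```

In practice such estimates from several intermediaries would be combined (e.g. by trilateration), and indoor multipath makes any single reading noisy; the sketch only shows the basic signal-strength-to-distance mapping.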
3.1.2 Rogue access points

Figure 3: Flow diagram to detect rogue AP

…network; such APs can result in security holes and unwanted RF and network load. It is required to locate rogue APs as soon as possible, before they cause destruction. The Architecture and Techniques paper[4] employs the following approach:

1. For each AP that a node detects, it sends a four-tuple (MAC address, SSID, channel, RSSI) to the Diagnostic (central) Server; this four-tuple uniquely identifies an AP in a particular location and channel. The AP MAC address can be determined using beacon frames. These are then mapped against location databases to calculate the current position of the detected rogue AP. Figure 3 shows the decision tree to detect rogue access points. Such a decision tree makes use of the parameters collected from the nodes and the location database available at the central node.

2. Active scanning on the part of the client can also result in detecting an AP working on an unexpected channel or through channel overlaps.

• Actions: Authentication using certificates

3.1.3 Hidden terminal

• Symptoms: Degraded performance, lower throughput

• Causes: One transmitter not able to hear other transmissions to the same receiver (the hidden node problem[2]), heterogeneous transmit power

• Techniques: In Mojo[2], the hidden terminal anomaly is deliberately arranged and then quantified in terms of goodput and the percentage of frames collided. It is observed that about 40% of the frames collide when the hidden terminal anomaly is present. This number is termed the threshold for the hidden terminal. To detect an instance of this fault, multiple monitors capture ongoing activities. For every pair of adjacent data frame transmissions, it is checked whether there are more than (or close to) 40% (overlapping) concurrent transmissions directed to the same node. If they are present, then the degradation in performance is due to the hidden terminal effect.

3.1.4 Capture effect

• Capture Effect: In a wired communication environment, packets are considered collisions if two packets arrive at the same station at the same time. Even though the received power for one of the packets is much larger than the other, the station still treats all the packets as collisions. However, as it turns out, in a wireless communication network, even when two or more packets arrive at the same station at the same time, the packet with a high Signal to Noise Ratio (SNR) (greater than a pre-determined threshold) can still be received successfully. Thus the corresponding station will always capture the channel. The natural question is: why would two stations that are in range of each other transmit concurrently when both use the CSMA/CA protocol? The answer is that the contention window is set to its minimum after each successful ACK (Acknowledgment) and the backoff interval is selected based on this number. Also, it takes only 25 microseconds to perform clear channel assessment.

• Techniques: In [2], the capture effect anomaly is deliberately arranged and then quantified in terms of goodput and the percentage of frames collided. It is observed that about 5% of the frames collide when the capture effect anomaly is present. This number is termed the threshold for the capture effect. To detect an instance of this fault, multiple monitors capture ongoing activities. For every pair of adjacent data frame transmissions, it is checked whether there are more than (or close to) 5% (overlapping) concurrent transmissions directed to the same node. If they are present, then the degradation in performance is due to the capture effect.

• Actions: This anomaly results from the mismatch between transmit power and receiver sensitivity across stations, and can be mitigated by adjusting transmit power to give fair access to the medium.
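The adjacent-pair overlap check used for both the hidden terminal (40%) and capture effect (5%) thresholds can be sketched as follows, assuming the unified trace provides each frame as a (start, end, destination) tuple sorted by start time; this is a simplification of the detection in [2]:

```python
def collision_fraction(frames):
    """Fraction of adjacent frame pairs that overlap in time and are
    directed to the same receiver.

    frames: list of (start, end, dst) tuples sorted by start time.
    """
    if len(frames) < 2:
        return 0.0
    overlapping = sum(
        1
        for (s1, e1, d1), (s2, e2, d2) in zip(frames, frames[1:])
        if s2 < e1 and d1 == d2      # concurrent, and to the same receiver
    )
    return overlapping / (len(frames) - 1)

def diagnose(frames, hidden_threshold=0.40, capture_threshold=0.05):
    # Compare the observed overlap fraction against the empirically
    # derived thresholds of [2], hidden terminal first.
    f = collision_fraction(frames)
    if f >= hidden_threshold:
        return "hidden terminal suspected"
    if f >= capture_threshold:
        return "capture effect suspected"
    return "no anomaly"

# Two of three frames to node N overlap: half the adjacent pairs collide.
print(diagnose([(0, 2, "N"), (1, 3, "N"), (4, 5, "N")]))
```

Note the ordering of the two tests: any trace crossing the 40% mark also crosses the 5% mark, so the more specific (hidden terminal) diagnosis must be checked first.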
3.1.5 Non-802.11 device interference

• Symptoms: Retransmissions at the MAC layer, no concurrent transmissions

• Causes: Since non-802.11 devices do not follow the medium access protocol and cause channel interference, they result in 802.11 frame corruption, excessive backoffs and frequent retransmissions.

• Techniques: To detect erratic noise eruptions, we need to dynamically track the signal energy level present at a station. As soon as any unwarranted spike is detected, it could be due to noise interference from non-802.11 devices. Thus, for a time period termed the EPOCH INTERVAL in [2], set by the administrator, the noise floor value is sampled and a moving average is maintained. As soon as the windowed average crosses a threshold (set based on observations), a non-802.11 device interference fault is triggered.

• Actions: Typically the non-802.11 device is identified and removed; another measure could be changing the operating frequency (channel) for a node.

3.2 Long-distance mesh networks

3.2.1 Connectivity problems

• Symptoms: Remote node not reachable

• Causes: IP address misconfiguration, routing misconfiguration, primary link failure, power shutdown at the remote node, a board failure, malfunctioning wireless card

Here we need to detect the exact cause among these possible causes. To achieve this, we should be able to query the node through some other means, like backchannels (explained later). If this succeeds, it confirms that the node is up and working; the reason for the problem could then be configuration issues, which can be solved once logged into the node. If the attempt to connect to the node fails, then we should have some mechanism which queries the 'on board but independent' equipment working inside the node. This mechanism, in turn, gives us back a status report for the node. This should assist us in predicting whether there is a power shutdown, a board failure or a software malfunction. Figure 4 shows the flow chart of this decision process and how we can arrive at a probable conclusion.
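The decision flow above can be sketched as a small function; the status strings and field names below are illustrative, not taken from [5]:

```python
def diagnose_unreachable(backchannel_reply, watchdog_status=None):
    """Walk the decision flow for a remote node that stopped responding
    on the primary link.

    backchannel_reply: response obtained over an independent channel
        (e.g. SMS), or None if even the backchannel query failed.
    watchdog_status: report from the on-board independent equipment,
        e.g. {"board_ok": True}, or None if no report came back.
    """
    if backchannel_reply is not None:
        # Node is alive and answering: the primary link itself or the
        # node's configuration (IP/routing) is the likely culprit.
        return "node up: check link and IP/routing configuration"
    if watchdog_status is None:
        # Nothing answers, not even the independent equipment.
        return "no status at all: suspect power shutdown"
    if watchdog_status.get("board_ok") is False:
        return "board failure"
    return "board powered but unresponsive: suspect software malfunction"

print(diagnose_unreachable(None, {"board_ok": False}))
```

Each branch narrows the candidate causes using one independent observation, mirroring the flow chart of figure 4.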
…packet signal strength values, noise floor or sequence numbers in the packet. To extract these values, low-level access to physical layer header information is required in terms of 802.11 frames, where we can capture the per-packet signal strength and noise values and can also determine whether the checksum passed. The following examples give a glimpse of how these parameters can be obtained:
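As one hedged illustration of such extraction: the fixed 8-byte header layout below is invented purely for the sketch; real radiotap-style capture headers are variable-length and must be parsed per the radiotap specification.

```python
import struct

# Hypothetical fixed capture header prepended to each 802.11 frame:
# timestamp (uint32), RSSI (int8, dBm), noise floor (int8, dBm),
# flags (uint16). Invented for illustration only.
HDR = struct.Struct("<IbbH")
FCS_OK = 0x0001   # hypothetical "frame checksum passed" flag bit

def phy_info(frame_bytes):
    """Pull per-packet PHY parameters out of the (made-up) header."""
    ts, rssi, noise, flags = HDR.unpack_from(frame_bytes)
    return {
        "timestamp": ts,
        "rssi_dbm": rssi,        # per-packet signal strength
        "noise_dbm": noise,      # per-packet noise floor
        "fcs_ok": bool(flags & FCS_OK),
    }

print(phy_info(struct.pack("<IbbH", 1000, -60, -92, 1)))
```

The signal-to-noise ratio and checksum outcome of every captured frame can then feed the threshold checks of the previous sections.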
This continuous monitoring system also allowed them to identify faults like antenna misalignment.

• Inference engine: When the data collected through the monitoring system arrives at a central node, which is specially configured to be able to apply costly search algorithms, the node goes over the collected data and makes predictions or draws conclusions from the variation in various parameters. Finite State Machines, as mentioned in section 2, are incorporated into the inference engine to cope with monitoring system inabilities like missed packets in a conversation.

• Optional back channel: These come in handy when the primary link goes down due to reasons explained in the previous subsections. Optionally, we can have shell access or can open reverse SSH tunnels to execute commands at the remote node to find the exact problem. An SMS request-reply mechanism, which has become affordable in India, can be coupled with the nodes (as done in [5]) and applied to detect node health.

5 Future work

Though the faults faced in enterprise and long-distance deployments have been quantified and characterized, there is a need to develop a comprehensive network monitoring and inference tool for both enterprise and long-distance networks. Also, though sophisticated tools do exist to detect individual faults, a single tool could serve to characterize the performance of the network at scale.

The Beyond Pilots[5] paper has manifold ingeniously developed solutions to tackle problems in long-distance mesh networks, but the paper does not quantify the efficacy of the techniques in terms of performance improvement. Thus experiments must be designed and run for techniques like software and hardware redundancy to quantify the performance of the enhanced network.

In the case of rural areas, where we need to employ local expertise, a user-friendly GUI could be of great help for managing and maintaining the network locally.

The procedure of remedial actions can be made automatic. To give an example, after detecting the capture effect[2], we should automatically be able to set the transmit power of both stations appropriately to give them fair access.

6 Conclusion

The intrinsic characteristics of the wireless medium and the by-products of dense and long-distance WiFi deployments give rise to manifold performance problems. This survey classifies the techniques to solve performance problems into five different categories: (a) offline diagnosis (b) online diagnosis (c) diagnosis through simulation (d) system with a per-node daemon (e) redundancy in terms of hardware and software components, across two different networks: enterprise and long-distance. It also delves into the possible faults, and their remedies, encountered in enterprise and long-distance mesh networks. Some of the techniques mentioned are developed specially to characterize the entire wireless behavior in a campus building, whereas others deal with quantifying thresholds for certain faults.

The problem becomes non-trivial in the case of long-distance networks, where hardware failures become the cause of concern. Various faults like RF holes, hidden terminals, capture effects, interference, power failures etc. are studied in terms of the symptoms experienced by the users, the possible causes, how we categorize the causes and select the one that best matches the anomaly, and what possible actions we can take to mitigate the faults. The metrics to quantify the causes are also listed, along with the major components of the framework required to gather the information. The survey ends with the need for a comprehensive, automated, user-friendly tool that can monitor a remote network barring failures and can help manage the network locally.

References

[1] Yu-Chung Cheng, John Bellardo, and Peter Benko. Jigsaw: Solving the Puzzle of Enterprise 802.11 Networks. In SIGCOMM, 2006.

[2] Anmol Sheth, Christian Doerr, Dirk Grunwald, Richard Han, and Douglas Sicker. Mojo: A Distributed Physical Layer Anomaly Detection System for 802.11 WLANs. In MOBISYS, 2006.

[3] Lili Qiu, Paramvir Bahl, Ananth Rao, and Lidong Zhou. Troubleshooting Wireless Mesh Networks. In SIGCOMM, 2006.

[4] Atul Adya, Paramvir Bahl, Ranveer Chandra, and Lili Qiu. Architecture and Techniques for Diagnosing Faults in IEEE 802.11 Infrastructure Networks. In MOBICOM, 2004.

[5] Sonesh Surana, Rabin Patra, and Sergiu Nedevschi. Beyond Pilots: Keeping Rural Wireless Networks Alive. In USENIX NSDI, 2008. To appear.

[6] Dynamic Configuration of IPv4 Link-Local Addresses. http://www.ietf.org/rfc/rfc3927.txt.

[7] Ratul Mahajan, Maya Rodrig, David Wetherall, and John Zahorjan. Analyzing the MAC Level Behavior of Wireless Networks in the Wild. In SIGCOMM, 2006.

[8] Sonesh Surana, Rabin Patra, and Eric Brewer. Simplifying Fault Diagnosis in Locally Managed Rural WiFi Networks. In SIGCOMM NSDR, 2007.

[9] Kameswari Chebrolu, Bhaskaran Raman, and Sayandeep Sen. Long-Distance 802.11b Links: Performance Measurements and Experience. In MOBICOM, 2006.
Appendix
Comparison table of techniques