Understanding and Using The SP Switch
Abbas Farazdel, Gonzalo R. Archondo-Callao, Eva Hocks, Takaaki Sakachi, Federico Vagnini
http://www.redbooks.ibm.com
SG24-5161-00
April 1999
Take Note!
Before using this information and the product it supports, be sure to read the general information in
Appendix D, “Special Notices” on page 257.
This edition applies to IBM Parallel System Support Programs for AIX (PSSP) Version 3,
Release 1 for use with AIX 4.3.2 and the SP Switch.
When you send information to IBM, you grant IBM a non-exclusive right to use or distribute the
information in any way it believes appropriate without incurring any obligation to you.
Figures . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .ix
Tables. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .xi
Preface . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . xiii
The Team That Wrote This Redbook . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . xiv
Comments Welcome . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . xv
8.2.2 Using the Eclock Command . . . . . . . . . . . . . . . . . . . . . . . . . . . 136
8.2.3 The Actions of Eclock . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 138
8.3 Starting the SP Switch . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 139
8.4 Removing a Node from the SP Switch Network . . . . . . . . . . . . . . . . 140
8.5 Adding a Node to the SP Switch Network . . . . . . . . . . . . . . . . . . . . . 140
8.6 Stopping the SP Switch . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 141
8.7 Automatic Management of the SP Switch . . . . . . . . . . . . . . . . . . . . . 142
8.7.1 Managing the Switch Before PSSP 3.1 . . . . . . . . . . . . . . . . . . . 142
8.7.2 The Switch Admin Daemon . . . . . . . . . . . . . . . . . . . . . . . . . . . 143
8.7.3 The Implementation of the Switch Admin Daemon . . . . . . . . . . 144
8.7.4 Management Tasks Not Yet Automated . . . . . . . . . . . . . . . . . . 146
A.2.2 Second Error Capture Register . . . . . . . . . . . . . . . . . . . . . . . . . . . 243
Index . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 267
1. A Two-Frame SP System . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4
2. Switch Board Physical Layout . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
3. Switch Board Logical Layout . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
4. Multiple Parallel Routes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
5. Multiple Paths between Two Nodes. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8
6. Two Frames Cabling . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9
7. Three Frames Cabling . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10
8. The 128 Nodes SP Switch Topology . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
9. Logical View of SP Switch-8 Switch Board . . . . . . . . . . . . . . . . . . . . . . . . 12
10. Packet Routing. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14
11. Multiple Data Flows on the SP Switch . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15
12. Switch Channel Signals . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16
13. Clock Components Position. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18
14. Clock Distribution Tree. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20
15. Switch Chip Components. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23
16. Receiver Modules Components. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24
17. Route Byte . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26
18. Sender Module Components . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29
19. Service Packet Format. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33
20. TB3 Adapter Structure . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 38
21. TBMX Adapter Structure . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 39
22. TB3PCI Adapter Structure . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 40
23. TB3MX2 Adapter Structure . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41
24. Adapter Logical Structure . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43
25. Switch Packet Creation for User Mode . . . . . . . . . . . . . . . . . . . . . . . . . . . 45
26. Switch Packet Creation for Service . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 45
27. Switch Packet Creation for IP . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 46
28. Communication Subsystem Components . . . . . . . . . . . . . . . . . . . . . . . . . 50
29. PIPE FIFOs among Tasks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 53
30. PIPE-to-Adapter Data Flow . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 55
31. Packet Created by PIPE Layer . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 56
32. IP Kernel Extension . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 60
33. IP Kernel Data Structures . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 61
34. IP Send FIFO . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 62
35. Node Numbers. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 75
36. Chip Interconnection and Slot Assignment . . . . . . . . . . . . . . . . . . . . . . . . 76
37. Switch Node Numbers . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 77
38. SP Switch Router Connection to the SP System. . . . . . . . . . . . . . . . . . . . 79
39. GRF with Multiple Switch Router Adapter . . . . . . . . . . . . . . . . . . . . . . . . . 80
40. Switch Network Without ARP. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 81
The current day SP uses a shared-nothing model, meaning each node has its
own memory, disk, CPU, and so on. In order for a group of nodes to work
together on a problem, they must share data and status as needed through
inter-node messages. These messages are sent by the components of a
running application and are delivered through “packets” sent over the
communication network chosen for the application.
We give special thanks to Skip Lundin from the IBM PPS Lab Poughkeepsie
for his efforts in reviewing the entire redbook and providing useful information
and suggestions.
Thanks also to the following people for their contributions to this project:
Bernard King-Smith
Bob Simon
Nick Rash
Kevin Reilly
Bill Tuel
Jonathon Kaufman
Peter Chenevert
A. Z. Muszynski
Fu-Chung Chang
Aruna Ramanan
IBM PPS Lab Poughkeepsie
Paul Crumley
IBM Watson Research
Comments Welcome
Your comments are important to us!
• Fax the evaluation form found in “ITSO Redbook Evaluation” on page 279
to the fax number shown on the form.
• Use the electronic evaluation form found on the Redbooks Web sites:
For Internet users http://www.redbooks.ibm.com
For IBM Intranet users http://w3.itso.ibm.com
• Send us a note at the following address:
redbook@us.ibm.com
There are four basic physical components of an SP (See Figure 1 on page 4):
frame A containment unit consisting of a rack to hold computers, together
with supporting hardware, including power supplies, cooling
equipment and communication media such as the system
Ethernet.
nodes AIX RS/6000 workstations packaged to fit in the SP frame; a node
has no display head or keyboard, so human interaction with it
must be done remotely.
switch The medium that allows high-speed communication between nodes.
CWS The Control Workstation (CWS) is a stand-alone AIX workstation,
with display and keyboard, possessing the hardware required to
monitor and control the frame(s) and nodes of the system.
SP nodes are available in three form factors: thin, wide and high. The SP is
generally available with 2 to 128 nodes, which are packaged
in 1 to 9 logical frames. The number of physical frames can
be higher, depending on the types of nodes.
Each frame may contain an SP Switch board. The nodes of the frames are
connected to their respective switch board via a special SP Switch adapter
and corresponding “switch cables”, and the switch boards of the system are
connected via the same type of cables. Thus, a high-speed communication
network is formed which allows the nodes to communicate with each other to
share data and status. The primary purpose of this high-speed network is the
support of solving problems in parallel.
This software is responsible for sending and receiving packets to and from
other nodes, on behalf of applications. It also sends packets to switch
hardware components as part of its monitoring and controlling functions.
[Figure 1: A Two-Frame SP System — two frames, each with a switch board and thin nodes attached through CSS adapters, connected by a switch-to-switch cable; both frames are attached to the Control Workstation by serial cables.]
The switch board is connected with the rest of the switch complex by
interposer cards. They provide access to data channels and the supervisor
card, and receive the 48-volt power. For each interposer there is a
connection, or jack, on the bulkhead side where the appropriate cable can be
attached. Bulkhead jacks are numbered: 1 for 48-volt power, 2 for switch
supervisor card connection, 3-to-35 for data channels.
There are five cooling fans in the switch assembly. The rotation of these fans
is monitored by the switch supervisor. There is an N+1 redundancy in the
switch assembly so that a fan may be faulty without affecting the system.
From the network point of view, the switch board is a box that has 32 links
with the outside and there are several paths between any pair of them. When
connected to nodes, the switch boards have 16 links to nodes and the
remaining 16 to other switch boards. To create bigger networks, several
switch boards may be interconnected with other switch boards.
All internal and external switch board point-to-point links are bidirectional full
duplex. They comprise two channels that can carry data in opposite
directions simultaneously, each channel capable of carrying 150 MB of data
per second.
Inside the SP Switch there are eight switching devices called switch chips.
They are the heart of the switch board and are responsible for routing the
data from one link to another. They are non-blocking devices, so any two data
paths can traverse them in parallel if they do not require access to the same
link between switch chips.
Each switch chip has eight ports from which it can send and receive data
simultaneously. Data arriving at one port can be routed to and transmitted
from any other port.
[Figures 6 and 7: Two Frames Cabling and Three Frames Cabling — switch boards shown with their node-facing ports P0 through P15 and the switch-to-switch connections between boards.]
All switch chip ports are used to create the network. When only two frames
are present, there are 16 connections between two frames, while with three
frames, only eight connections are possible. In any case, the four shortest
paths are always present.
If the system grows and a new frame is added, all the board-to-board cabling
must be redone. This means that the network has to be stopped during the
re-configuration. A way to avoid this downtime is to start using the
intermediate switch boards we discuss in 2.2.2, even for a system with less
than 80 nodes. This is a more expensive solution but the investment may be
worthwhile if the system is expected to grow.
Additional switch boards are then added to the network topology, and their
task is to provide sufficient links between other switch boards. Since these
boards do not connect to nodes, they are called intermediate switch boards
(ISBs), while the boards that are connected to nodes are called node switch
boards (NSBs).
[Figure 8: The 128 Nodes SP Switch Topology — nodes on both sides connect to node switch boards (NSBs), which are interconnected through intermediate switch boards (ISBs).]
The SP Switch-8 switch board is a multistage switched network that has eight
bidirectional links that connect only to nodes and whose logical topology is
depicted in Figure 9.
[Figure 9: Logical View of SP Switch-8 Switch Board — a single board whose eight bidirectional links all connect to nodes.]
This switch board is used only in small environments. It has the same
components as the full switch frame and all the descriptions we give in the
following chapters also apply to this board. However, all future references to
the term switch board imply the full-featured board comprising eight switch
chips.
Each packet is destined either to a node or to a single switch chip. In the first
case it is typically a data packet that has been created by some node
application, while in the second case it is called a service packet and is used
to configure the features of the switch chip. A service packet may also be
created by a switch chip to notify a node of some event that is meaningful in
network administration, or by a node for node-to-node administrative
communications.
Both data and service packets flow on the same network. This choice keeps
architecture simple and gives to both packet types similar path redundancy.
Service packets do not greatly affect the network performance since they are
used basically whenever a switch network configuration is executing or in
case of failure recovery. The protocol used is designed to keep service
communication low.
[Figure 10: Packet Routing — a packet travels from the source node through three switch chips, leaving the first chip on port 4, the second on port 3, and the third on port 7 to reach the destination.]
Each switch chip reads the incoming packet starting from the BOP character
and then scans the routing part to detect to which of its output ports the
packet has to be forwarded. As soon as it detects the port, the bytes are
passed through it to the following switch chip or the destination node
adapter.
Since packets may be very long and minimal buffering is used during
transmission to reduce latency, the data of a single packet may be strung out
across the network route.
In order to improve link reliability, all bytes transmitted across the link
(packets and control characters) are protected by a time-based, 2-byte Error
Detection Code (EDC) that gives a checksum value of the transmitted data.
Time boundaries, start to end, are called EDC Frames, expressed as a
programmable number of cycles ranging in powers of two from 32 to 256 bytes.
EDC errors are used to monitor link data quality and to identify possible link
degradation.
Since the EDC Frame is time-based instead of packet-based, the EDC Frame
may span a partial packet, a full packet, multiple packets, or no packets at all.
It is not a separate entity from a message packet, but they overlap. The EDC
Frame has no effect on the message packet except to steal a fraction of the
bandwidth to insert the EDC bytes into the data flow.
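To make the time-based nature of the check concrete, here is a minimal C sketch. The frame length, the simple additive check value, and the function names are our own assumptions for illustration; the document does not specify the code the hardware actually computes.

    #include <stdint.h>
    #include <stddef.h>

    #define EDC_FRAME_BYTES 64   /* assumed frame length; the hardware value is programmable */

    /* Accumulate a 2-byte check value over one EDC frame of link bytes.
     * A simple sum is used only to show that the check covers a span of
     * time on the link, not individual packets. */
    static uint16_t edc_compute(const uint8_t *frame, size_t len)
    {
        uint16_t edc = 0;
        for (size_t i = 0; i < len; i++)
            edc = (uint16_t)(edc + frame[i]);
        return edc;
    }

    /* Receiver side: count mismatches and report link degradation when a
     * programmable threshold is reached (the caller then resets the link). */
    int edc_check(const uint8_t *frame, size_t len, uint16_t received_edc,
                  unsigned *error_count, unsigned threshold)
    {
        if (edc_compute(frame, len) != received_edc) {
            if (++(*error_count) >= threshold)
                return -1;
        }
        return 0;
    }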
An example of data flow on a switch board is shown in Figure 11. There are
three data flows in the same direction that make use of four switch chips. For
each flow, the data bytes are continuously flowing from one switch chip to
another. In the same instant, all the following may occur:
• The beginning of the packet may be on one switch chip
• Other packet bytes may be on the same chip
• Still other bytes may be travelling on the link between two chips
• The end of the packet may not yet have even been inserted into the switch
network
Two data flows may traverse the same chip in parallel, but if they require use
of the same port, one of them is stopped. In the figure, the beginning of the
packet of Flow C is buffered in Chip 1, while other bytes are still travelling to
Chip 1. When the packet in Flow B finishes, Flow C will proceed.
[Figure 11: Multiple Data Flows on the SP Switch — flows A, B and C crossing chips 1 through 4, with the individual bytes of each packet spread along the route.]
[Figure 12: Switch Channel Signals — each channel between a send port and a receive port (on a switch chip or switch adapter) carries 8 data bits, a 1-bit TAG, a 1-bit Data Valid signal, and a 1-bit Token Signal in the reverse direction.]
The signals can have a different meaning depending on the operational mode
of the link, as defined by the sending port. The TAG bit identifies the 8-bit data
as a control character or as a data character.
The token signal is used for flow control. When sending and receiving ports
are initialized, the sending port gets its token counter set to the size of the
buffer present in the receive port. Each time the sending port transmits two
bytes of data, it decreases its token counter by one. When the receiving port
frees its internal buffer of the same fixed amount of data, it sends a token
signal to the sending port. Upon the reception of a token signal, the sending
port increases its token counter.
The number of tokens on the sending port reflects the space available on the
receiving port queue, except for data and tokens actually in flight on the link.
When the counter reaches zero, the sending port does not send any more
data and starts buffering the bytes destined for that link.
The token protocol is designed to detect and correct token signal errors that
may create spurious tokens on the sender or prevent tokens from being
received. Sender and receiver periodically check their token counter values
and synchronize themselves.
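A minimal C sketch of the sender side of this token protocol may help; the structure and function names are assumptions, and the real logic is implemented in hardware.

    #include <stdbool.h>

    /* Sender-side view of the token protocol (illustrative only). */
    struct send_port {
        unsigned tokens;       /* flits the receiver can still absorb */
        unsigned max_tokens;   /* set at initialization to the receive FIFO size */
    };

    void port_init(struct send_port *p, unsigned receive_fifo_flits)
    {
        p->max_tokens = receive_fifo_flits;
        p->tokens = receive_fifo_flits;
    }

    /* Called for every two bytes (one flit) the sender wants to transmit.
     * Returns false when the data must be buffered because no tokens are left. */
    bool port_try_send_flit(struct send_port *p)
    {
        if (p->tokens == 0)
            return false;      /* receiver queue may be full: hold the data */
        p->tokens--;
        return true;
    }

    /* Called when a token signal arrives from the receiver (buffer space freed). */
    void port_token_received(struct send_port *p)
    {
        if (p->tokens == p->max_tokens)
            return;            /* counter is clamped; a token overflow error is reported */
        p->tokens++;
    }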
Data channels are sampled to detect data bits using this common clock;
however, because of the phase difference inherent at each switch chip due to
irregularities in the clock distribution network, the data leaving one chip must
be synchronized to the clock seen by the receiving chip. Such link tuning is
performed for each port of the switch chip using a Self-Timed Interface (STI)
macro embedded in the receive section of each switch chip’s port. The STI
both initially tunes the links at system initialization and dynamically maintains
the proper sampling of the received data as the operating environment
changes. See 3.3, “STI Timing and Logic Synchronization Process” on page
20 for more details.
Different clock sources may be used. A single switch board can receive its
clock from any of 35 sources:
• There are two local 18.75 MHz crystal oscillators on a switch board
• There is a differential clock signal received on each of the 32 switch
board’s data cables
• There is a connector for receiving a differential clock signal from a source
external to the SP system. The switch board can use such a signal, but this
feature is currently not supported.
Each switch board assembly has a switch supervisor card which is controlled
from the control workstation. The switch supervisor cards are used to define
the clock signal distribution. Each supervisor sends signals to all its board’s
switch chips to define how the clock signal is generated and used by all the
chips on the board.
From the selected source the clock signal is fed to one of the four Phase
Locked Loops (PLL) present on the board for clock pulse shaping and to
reduce clock jitter. Clock shaping is very critical for data receiving in each
switch chip. The clock signal provided by the PLL is then used by all the
switch chips on the board (Figure 12 on page 16).
Different topologies for the clock distribution network are possible depending
on the number of switch boards. The simplest configuration is an SP Switch
with only one switch board: one of its internal oscillators is selected as the
clock source.
The PLLs are strictly tied to switch chips, as can be seen in Figure 13. Only
switch chips SW2, SW3, SW4 and SW5 have PLLs and can receive the clock
signal from the oscillators.
[Figure 13: Clock Components Position — on the switch-to-switch side, jacks connect to chips SW1, SW2 and SW3; on the node side, connections attach to SW4, SW5 and SW6; PLL-2 through PLL-5 sit with chips SW2 through SW5, together with oscillators OSC-2 and OSC-4.]
For each slave board, one switch chip is selected to receive the source clock
at one port. If it has a PLL, it shapes the signal and re-drives it to all the other
switch chips, acting as a master chip for the board, otherwise it does not use
the signal but forwards it to the master.
All the data cables that connect a switch board to another board or to a node
also carry a clock signal. Each switch chip always sends the clock it uses
along with the data. Data and clock are carried by different wires. Therefore,
each switch chip on a slave board receives four different clock signals from
four sources.
ISBs may receive a clock signal from any port of any switch chip, while NSBs
may only receive it from SW0, SW1, SW3 or SW4 since the other switch chips
are dedicated to node connections.
Any switch chip’s port can be selected to receive the clock signal. However, if
you look at clock distribution files, PSSP software always uses jacks J3, J4 or
J5. This limitation is for compatibility with the High Performance Switch that
has limited clock distribution capabilities.
Clock distribution files mention in their description section only the ports of a
switch chip that are provided with a PLL, but a system administrator can decide
to use any other port and provide a corresponding configuration file.
Both master and slave switch boards can provide a clock signal to other slave
boards, creating a clock distribution tree, as you can see in Figure 14.
[Figure 14: Clock Distribution Tree — boards 1, 3, 5 and 6 shown forming a clock distribution tree, with the master board feeding slave boards that in turn feed further slaves.]
Important
The clock signal is of vital importance to the SP Switch.
If a switch board does not receive the clock signal, none of its internal and
external data links will be functional and all the switch boards that depend
on it for the clock will also fail in the same way.
The initialization sequence can be started by either the master or the slave
side of the link. The sequence will start in response to a reset of the port due
to power on or chip reconfiguration, or by the detection of one error condition
on the link. If the synchronization process fails for any reason, it is restarted
again.
Once the link has been synchronized, data can flow between the sender and
the receiver. The STI macro follows the signal phase and continuously tunes
the channel.
A new synchronization process may be needed due to link errors or link delay
variations that cannot be tuned out. When the STI macro detects that the
phase of the data is nearing the edge of the guard band defined by the
current setting of the delay elements, or when the receiver logic determines
that there is an error on the link such that the incoming data is suspect, the
receiver transmits a synchronization request code to the sender. The sender
decodes this code and at the end of the current packet, the synchronization
process is started again.
[Figure 15: Switch Chip Components — the receiver modules and sender modules, each behind an STI interface, connected through the 4 KB Central Queue and Queue Control, with the chip’s Service Logic alongside.]
The received flits of a data packet are inserted into the Central Queue from
where the sender module can extract and transmit them when it is not busy
sending another packet. When more than one packet is directed to the same
output port, flits from the first packet are immediately extracted from the
Central Queue, while the others remain in the queue waiting for the first
packet to finish.
Flits from a service packet directed to the switch chip are routed directly into
the chip’s service logic.
The receiver logic of the switch is responsible for taking data off the link,
synchronizing the data with the chip’s clock, and routing that data to the
appropriate sender port via the Central Queue buffer. The receiver also
routes service packets to the chip’s service logic if the packets are destined
for that chip. The receiver controls the data flow through the link using a
token-based flow control implementation, where it transmits a token signal to
the sending port each time it frees buffer space.
[Figure 16: Receiver Module Components — the link STI feeds an EDC checker, FIFO, parity generation, route modification and deserializer stages, with paths to the Central Queue, the service logic, and the token generation state machine.]
As the data enters the receiver, it is checked for correctness. All bytes
transmitted across the link (packets and control characters) are protected by
a time-based, two-byte Error Detection Code (EDC). The transmitted data
and the two-byte EDC code are read from the STI interface and run through
the EDC checker. EDC errors can be reported when they happen or when the
number of errors has reached a programmable threshold. If the threshold is
reached, the link is reset.
A parity bit is internally used by the switch chip to protect each byte of data
that flows through it. The bit is generated by the receiver module and checked
by the sender module to avoid data corruption inside the chip.
Data read from the STI interface is written into a 64-flit size RAM array with
one write and one read port; this is used as a FIFO. The purpose of the array
is to buffer the packet data if the flow to the sender is somehow delayed, for
example when other receivers are sending data to the same output port.
Each time a single flit is extracted from the FIFO, a token signal is transmitted
back to the sending module to inform it that there is one more space
available.
Taking into account cable and wire length, propagation delays and clock
cycles for STIs and token management on both sides of the link, the optimal
number of tokens is 47. In order to allow significant buffer for possible
implementations with longer cables and faster links, the array size was
chosen at 64 tokens.
If the FIFO becomes full and valid packet data is received on the link despite
the token protocol, the FIFO overflows and packet data is lost. The packet
currently in transmission is terminated with a packet fail character and the
receiver module resets itself, starting a re-initialization of the link with the
sender.
A packet ending with a packet fail character may be transmitted over the
switch network to the destination node or chip where it will be treated
accordingly. There is no other way to handle this kind of packet problem since
the preceding data has already been transmitted to the following stage of the
network.
In order to prevent an overflow of the FIFO due to link errors, two tokens less
than the physical FIFO size are used on the link. In this way the link can gain
up to 2 flits through link errors without data loss. When a third flit is gained,
then the FIFO overflows.
The route modification stage uses the packet’s route information to decide if
the packet is destined to the switch chip or to which sending module it has to
be passed. The route information is also modified to make it usable by the
following switch chip.
The receiver is concerned only with the first route flit it encounters, defined as
the flit following the BOP control character. Each route byte contains one or
two route values. So in each flit, up to four route values may be present. Once
the receiver decodes the route value to be used, it invalidates that value.
A route byte is depicted in Figure 17, and it contains two route values.
Depending on the value of the most significant bit, bits 1 to 3 or bits 5 to 7 will
be processed as the next route data.
[Figure 17: Route Byte — bit 0 is the MSB, bits 1 to 3 hold the high route value, and bits 5 to 7 hold the low route value.]
The parity bit in the route byte is used as an extra check on the correctness
of the byte. If a parity error occurs, the receiver would no longer know how to
direct the packet, so it simply dumps it without routing it.
When bits 1 to 3 contain the current route value to be used (the MSB bit is 0),
that choice must be invalidated. This is done by setting the MSB bit to 1 and
passing the modified route byte with the packet. When route modification
logic encounters a route whose MSB is 1, bits 5 to 7 are decoded as the
route. Since there is no more useful information in this byte, the entire byte
has to be invalidated. If it is the first (high) byte of the current flit, the packet
will exit from the switch chip with a BOPb character. If it is the second (low)
byte, all data in the flit has been used and the flit is not passed on.
As the packet passes through switch chips, all the route bytes get used and
discarded. A packet reaches its destination with no route bytes.
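The following C sketch models the route-modification step just described, using the bit numbering of Figure 17 (bit 0 is the MSB); it is an illustration of the rule, not the switch chip logic, and it omits the parity check.

    #include <stdint.h>

    /* Decode the current route value from a route byte and invalidate it.
     * Returns the output port (0..7) and sets *byte_consumed when no useful
     * information remains in the byte (so the whole byte is dropped). */
    int route_decode(uint8_t *route_byte, int *byte_consumed)
    {
        int port;

        *byte_consumed = 0;
        if ((*route_byte & 0x80) == 0) {
            /* MSB = 0: bits 1..3 hold the current route value */
            port = (*route_byte >> 4) & 0x7;
            *route_byte |= 0x80;    /* invalidate: the next chip will use bits 5..7 */
        } else {
            /* MSB = 1: bits 5..7 hold the current route value */
            port = *route_byte & 0x7;
            *byte_consumed = 1;     /* nothing useful left in this byte */
        }
        return port;
    }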
As seen in Figure 15 on page 23, after route modification, the data is ready
for transmission to the service logic (if the destination is the current switch
chip) or to the sending port through the deserializer for buffering into the
Central Queue.
The Central Queue is a 4 KByte array with one read and one write port; since
there is only one write port and eight receivers requesting its resources,
access to the write port has to be arbitrated among the receivers.
The use of a Central Queue instead of several local buffers on the receivers
makes it possible to allocate more buffer space to the most active receiving
ports, reducing the probability that a receiver does not have any more buffer
space available. As long as the Central Queue is not full, each input port can
continue to receive flits at full bandwidth. If the Central Queue eventually fills,
input ports with flits destined for the Central Queue are not able to empty their
input FIFO queues, and a lack of tokens causes the associated upstream
output port to be blocked.
The receiver state machine is the brain behind the operations of the receive
port. By having a state machine on each port, faults are better isolated and
recovered locally, without affecting the remainder of the chip function. All the
logic in the receiver is controlled by the receiver state machine. Each state
defines what is currently happening within the receiver.
The disabled state may be entered for several reasons, among them a port or
chip reset, a disabled link, an internal error, or a link synchronization request
from the sender. The reset and link disabling are requested by the supervisor,
while errors are internally generated.
A link enable bit is associated with each receiver. When the receiver port is
not enabled, it does not respond to the synchronization request from its
corresponding send port, and actually isolates the send port from the rest of
the switch fabric. It defaults to active during switch chip initialization, but may
be changed by a service packet.
The disabled state is used by the receive logic to set the state machine in a
known state. If the link enable bit is active, the STI starts a synchronization
process (see 3.3, “STI Timing and Logic Synchronization Process” on page 20).
In the tuning state, the STI proceeds in its signal exchange with the sending
port in order to synchronize with it. If any error is detected, the state machine
returns to the disabled state and all synchronization restarts from the
beginning. As soon as the STI identifies a living sender and it is aligned with
its clock phase, the state machine goes into the operational state.
While in the operational state, the data packets are received and transmitted
to the proper destination. The receiver waits for a BOP character and the
route bytes are analyzed to define which sending module has to manage the
packet.
If a parity error is detected in the route byte, the error is reported, all flits up to
the EOP are discarded, and the port waits for the next BOP.
If a timeout occurs while waiting for the EOP character, the state machine will
append a packet fail character to the packet and will restart waiting for the
BOP. The packet fail character will indicate the end of the packet and will tell
you that the packet has to be discarded.
If the received packet is a service packet sent to this switch chip, the receiver
module reads a special character after the BOP character. The data portion
of the packet is then passed to the chip’s service logic and the state machine
makes a transition into the service state.
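The receive-port states just described can be summarized in a small C sketch; the state and event names are ours, and the return-to-operational transition after servicing a packet is an assumption.

    /* Simplified model of the receive-port state machine described above. */
    enum rcv_state { RCV_DISABLED, RCV_TUNING, RCV_OPERATIONAL, RCV_SERVICE };

    enum rcv_event {
        EV_RESET, EV_LINK_DISABLED, EV_ERROR, EV_SYNC_REQUEST,  /* back to a known state */
        EV_STI_SYNCHRONIZED,                                    /* tuning -> operational */
        EV_SERVICE_PACKET,                                      /* operational -> service */
        EV_SERVICE_DONE                                         /* service -> operational */
    };

    enum rcv_state rcv_next(enum rcv_state s, enum rcv_event e, int link_enabled)
    {
        switch (e) {
        case EV_RESET:
        case EV_LINK_DISABLED:
        case EV_ERROR:
        case EV_SYNC_REQUEST:
            /* if the link enable bit is set, the STI immediately starts tuning */
            return link_enabled ? RCV_TUNING : RCV_DISABLED;
        case EV_STI_SYNCHRONIZED:
            return (s == RCV_TUNING) ? RCV_OPERATIONAL : s;
        case EV_SERVICE_PACKET:
            return (s == RCV_OPERATIONAL) ? RCV_SERVICE : s;
        case EV_SERVICE_DONE:
            return (s == RCV_SERVICE) ? RCV_OPERATIONAL : s;
        }
        return s;
    }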
The sender logic of the switch is responsible for taking the packet data out of
the Central Queue or from the service logic and transmitting it across the
cable or board to the receiver on the adjacent switch chip. The sender can
only transmit data when it has received valid tokens from its corresponding
receiver indicating that the receiver has room to handle it.
[Figure 18: Sender Module Components — data selected from the Central Queue or the service path passes through a parity check, the output selector, a FIFO and an EDC generator before being driven onto the link.]
The selection stage of the sender module is driven by the state machine that
defines from where data has to be taken and sent to the output port. Data
from the selector is passed into the sender’s FIFO.
The purpose of the FIFO is to smooth out the data flow from the selector into
the remainder of the sender logic. It is a 16-byte Register Array with one read
and one write port. There are times when the normal data flow through the
sender must be stopped for a while, for example, to insert EDC bytes. When
these interruptions occur, data is buffered into the FIFO and the source
continues to transmit.
Each byte of packet data being transmitted through the switch chip is
protected by a parity bit. Parity is checked just before data enters the sender
data selector, from which it is launched from the chip. Parity errors that are
detected are reported to the service logic to enable problem isolation. The
packet is transmitted as it is and data flow is not interrupted. Because of this,
parity errors on data will also show up as CRC errors inside the data at the
destination processor nodes. Care must be taken when CRC errors are
detected to ensure the fault is isolated to the proper place.
The sender module is in charge of creating the EDC frames used to check the
correctness of communication on the link. It inserts the EDC values that will
be checked by the corresponding receiver module.
The output data selector generates the data to be sent on the link depending
on the state of the sender. It will transmit packet data, EDC bytes or control
characters to the receiver module.
If the token counter is at maximum value, a token is detected and a flit is not
simultaneously transmitted across the link, the token counter will signal a
token overflow error. The counter, however, will not overflow and will remain
at the maximum count. By design, the counter cannot underflow since a token
count of zero will not allow packet data to be transmitted, and transmitted
data is what decrements the counter. The token protocol is designed to check
if tokens are missing, and to realign sender and receiver counters.
Token transmission errors may occur on the token line in the form of wrong
token sequences. If this happens, an internal error counter is incremented
and if a pre-defined threshold is reached, a token error character is sent to
the receive logic and the link is re-initialized.
The sender’s state machine is the "brain" that governs the operation of the
send port. By having a state machine on each port, faults are better isolated
and recovered locally, without affecting the remainder of the chip function. All
the logic in the sender is controlled by the sender state machine, and each
state defines what is currently happening within the sender.
The disabled state may be entered for several reasons including a port or chip
reset, disabling of the link, or a request for a synchronization by the receiver.
A link enable bit is associated with each sender. When the sender port is not
enabled, it does not transmit any valid packet to its corresponding receive
port and actually isolates that receive port from the rest of the switch fabric.
No token signals are processed when the link is disabled; flits that would use
the port are discarded and an invalid route error is reported to the service
logic. The link enable bit defaults to active during service initialization, but
may be changed by a service packet.
In the tuning state, the synchronization process is active and, if any error is
detected, the state machine returns to the disabled state and all
synchronization restarts from the beginning. When the process ends, the
sender module receives a specific signal from the token line and the state
machine goes into the operational state.
The operational state indicates that the entire link is active and the sender is
either waiting for or already busy handling packets coming from the Central
Queue or the service path. Tokens received in this state increment the token
counter and each flit sent out the port decrements the counter.
Service packets have priority over all other packets, and other packets have
to wait for the service packet to be transmitted.
generated by this switch chip and not packets that have been received by a
switch chip’s input port, since all data coming from other ports is treated in
the same way.
When the service logic is not ready to send any service packets, the source
selector checks if there are new packets buffered in the central queue and
starts to extract the flits from queue into the FIFO.
The packet data written to and read from the array is grouped into chunks of
eight flits, representing 16 bytes of packet data, and is moved a chunk at a
time to reduce the latency of the buffering operations. Each chunk is defined
as either a header chunk (the first chunk associated with every packet) or a
continuation chunk (all the chunks following the header chunk inside the packet).
One chunk of the Central Queue storage is reserved for each transmitter and
it is called the emergency slot. The remaining Central Queue storage is
shared. Whenever a receiver has a non-critical chunk and space is available
in the shared pool, the receiver requests service from the Central Queue.
Whenever a receiver has a critical chunk, it requests service and the chunk is
put in the transmitter’s emergency slot. Priority is assigned to the receivers on
a least recently serviced basis.
The emergency slots provide a way for the receiver to transmit the packet
data the sender is waiting for into the Central Queue buffer, even when the
buffer appears full, thus avoiding deadlocks. There are special cases where
the Central Queue buffer may become full: the emergency slots ensure that
all critical chunks are sent to senders.
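A hedged C sketch of this admission decision is shown below; the data structure and the accounting are simplified assumptions meant only to show why the per-sender emergency slot prevents deadlock when the shared pool is full.

    #include <stdbool.h>

    #define NPORTS 8

    struct central_queue {
        unsigned shared_free;             /* free chunks in the shared pool */
        bool     emergency_used[NPORTS];  /* one reserved chunk per sender */
    };

    /* Decide whether a receiver may write a chunk destined for sender 'port'.
     * Non-critical chunks only use the shared pool; a critical chunk (one the
     * sender is stalled waiting for) may fall back to that sender's emergency
     * slot, so forward progress is always possible. */
    bool cq_admit_chunk(struct central_queue *q, int port, bool critical)
    {
        if (q->shared_free > 0) {
            q->shared_free--;
            return true;
        }
        if (critical && !q->emergency_used[port]) {
            q->emergency_used[port] = true;
            return true;
        }
        return false;   /* chunk stays in the input FIFO; tokens dry up upstream */
    }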
The service packet looks exactly like any other message packet to a receive
port until all the route bytes are used by previous switch chips. In this case
the service frame character (SVC) is detected right after the BOP and the
receiver routes the data of the packet to the service logic (see Figure 19).
Each byte transmitted has a tag added to it which identifies it as a data
character or a control character.
Since a service packet can come into the chip through any of the eight
receive ports, the service logic must be able to accept the service data from
any of these ports. It is assumed that only one receiver module transmits a
service packet to the service logic at any given time. If two or more packets
are received into the service logic simultaneously, they are all accepted
(logical OR of their data), producing an invalid length, invalid command, and
CRC error, and an error packet is sent. This causes the service logic to
discard the packet as if it has never been received. It is up to the packet’s
sender to detect that data has not been received and a new packet has to be
sent.
The initialization packet contains the information that has to be loaded into
the switch chip’s configuration registers. It sets information such as which
receive and send ports are enabled, the route to be used when sending a
report message, and the switch chip identifier.
The read status packet is used to make the switch chip send back a complete
report message with its current status.
The reset packet requests a reset of selected error registers and logic on the
switch chip. This message is required to receive notification of secondary
error events as described in 3.4.5, “Error Isolation” on page 34.
Two time-of-day (TOD) packets are used to define and propagate to all switch
chips a value, the TOD, that is maintained synchronously in the whole SP
Switch to make problem determination and error isolation easier. The TOD is
increased at each switch chip’s internal clock tick.
When the service logic is busy preparing and sending an error/status packet,
it cannot process any incoming service packets. They are discarded until the
switch chip has sent the message.
Important
An important consideration on internally generated error/status packets is
that two of them cannot be sent by a switch chip without an intervening
reset packet being received from the primary node. This keeps the switch
chips from flooding the switch fabric and the primary node with service
packets. It also requires the primary node to manage all incoming signals
and explicitly reset them.
Upon detection of the first error, an error/status service packet is sent to the
primary node, and succeeding errors are not reported spontaneously until the
primary node has handled the error condition and has cleared the error with a
reset packet.
There are three multibyte error registers that have to be considered when
getting the report from the switch chip. Each bit of each register indicates a
separate error present.
The first error capture register is where the first occurrence of any error is
stored. It marks the kind of error present.
The second error capture register details exactly where the error indicated in
the first error capture register occurred, as well as where any subsequent
errors occur. It will continue to accumulate errors until it is reset by a reset
service packet.
When an error is detected by the switch chip, the corresponding bit in the
second error register is checked. If the bit is already set, the error has already
been addressed, so no action is taken. If the bit is not set, the error is added
to the register and the first error register is checked. If first error register has a
value of zero, this is the first error and the corresponding bit is set; otherwise
the register is not changed.
The pending error capture register is a shadowed version, bit for bit, of the
second error capture register. This register is used to ensure that errors
occurring between the reporting of previous errors (in an error/status packet)
and the service reset packet that should clear those earlier errors are not lost
because of the service reset packet. The contents of the pending error
capture register are not transmitted in an error/status packet, but are copied
to the second error capture register when servicing a reset packet.
A more detailed description of the error registers content can be found in A.2,
“Error Registers” on page 242.
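The update rules for the three registers can be summarized with a short C sketch; register widths, bit assignments, and the way the chip knows a report is outstanding are simplifications of the description above.

    #include <stdint.h>

    /* Simplified model of the three error capture registers. */
    struct error_regs {
        uint32_t first;    /* kind of the first error seen since the last reset */
        uint32_t second;   /* accumulates every error seen so far */
        uint32_t pending;  /* errors arriving between a report and its reset */
    };

    /* Record a newly detected error (one bit per error kind/location). */
    void record_error(struct error_regs *r, uint32_t bit, int report_outstanding)
    {
        if (r->second & bit)
            return;                 /* already noted: nothing more to do */
        r->second |= bit;
        if (r->first == 0)
            r->first = bit;         /* only the very first error lands here */
        if (report_outstanding)
            r->pending |= bit;      /* keep it so the coming reset does not lose it */
    }

    /* Handle a reset service packet from the primary node. */
    void reset_errors(struct error_regs *r)
    {
        r->first   = 0;
        r->second  = r->pending;    /* pending errors survive the reset */
        r->pending = 0;
    }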
The supervisor sends signals to the switch chips in order to configure them
according to the requests coming from the Control Workstation via the frame
supervisor. When powered on, it sets the internal oscillator (OSC2) as the
clock source on the board, until otherwise configured.
There is N+1 fan redundancy in the switch assembly. If a fan fails, the
maintenance on the switch is deferred to an acceptable time, instead of
causing an entire switch board incident, affecting multiple nodes and possibly
other switch boards. The fan assembly is also a Field Replaceable Unit (FRU)
of the switch assembly.
If the switch supervisor detects that the 48 volts is available, it enables the
power supplies to provide the 3.3 volts to the switch board. If the current
levels monitored by the switch supervisor move out of the valid operating
range, the supervisor will turn that power supply off. Likewise, if the output
voltage sensors indicate the 3.3 volts is not within the valid operating range,
the power supplies will be shut off.
The second power supply is redundant in the switch assembly. If one of the
power supplies fails, the maintenance on the switch is deferred to an
acceptable time, instead of causing an entire switch incident, affecting
multiple nodes and possibly other switch boards. Each power supply is also a
Field Replaceable Unit (FRU) of the switch assembly.
There are different kinds of adapters, depending on the bus type of the
RS/6000 node. Their internal structures differ because of technology
improvements and bus architecture differences. The functions they perform
are, however, the same, since the software present on the RS/6000 is the
same except for the peculiarities introduced by the system bus.
[Figure 20: TB3 Adapter Structure — an 80 MHz 601 processor with 512 KB SRAM, an MCA interface on the 160 MB/sec MCA bus, two 4 KB FIFOs, Xilinx logic and a TBIC driving the STI link to the 150 MB/sec switch fabric; 100 MB/sec sustained rate.]
The heart of the adapter is an 80 MHz 601 PowerPC with 512 KBytes of
SRAM.
[Figure 21: TB3MX Adapter Structure — the 400 MB/sec MX bus attaches through the MBA chip; a TBIC2, two 4 KB FIFOs and Xilinx logic drive the STI link to the 150 MB/sec switch fabric; 200 MB/sec sustained rate.]
It is a redesign of the TB3 adapter for the MX bus of the node. The most
significant difference is the replacement of the MCA interface with an MX bus
interface achieved with the MX Bus ASIC chip (MBA). Other significant
differences are:
• 100 MHz PowerPC 603e (instead of the 80 MHz PowerPC 601)
• Internal bus operates at 50 MHz (instead of 40 MHz)
• New TBIC-2 chip replaces old TBIC
• 50 MHz 60x interface
• 4 MB send buffer
• IP checksum generation logic
The MBA chip provides both the MX bus interface and buffering for DMA
operations between the MX and 60x buses.
[Figure 22: TB3PCI Adapter Structure — the 132 MB/sec PCI bus attaches through an AMCC S5933 chip; two 4 KB FIFOs, Xilinx logic and a TBIC2 drive the STI link to the 150 MB/sec switch fabric; 85 MB/sec sustained rate.]
The interface with the PCI bus is made by an AMCC chip, and a TBIC-2 is
used to connect to the switch board. An internal bidirectional FIFO is used as
in the TB3 adapter. The internal components are:
• 99 MHz 603e PowerPC
• 512K SRAM
• Single eight-byte 33 MHz bidirectional internal bus
[Figure 23: TB3MX2 Adapter Structure — the 400 MB/sec MX bus attaches through the MBA chip; a TBIC2, two 4 KB FIFOs and Xilinx logic drive the STI link to the 150 MB/sec switch fabric.]
Each window has its own receive FIFO (rFIFO) and send FIFO (sFIFO) in the
node’s main memory. The window uses these FIFOs to store application
packets before they are received from (rFIFO) or sent to (sFIFO) the switch
adapter. These structures are also available to the adapter’s microcode,
which uses them to move packets between the RS/6000 main memory and
the adapter’s TBIC FIFO, using DMA.
Each window has a set of variables that describe the status of its FIFOs, like
the position of first available packet and the number of packets. These
variables are available to both windows and adapter microcode and are used
to properly transfer data to and from the window’s FIFOs and the adapter.
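As an illustration, a per-window descriptor of this kind might look like the following C sketch; all field names are invented and do not come from the PSSP source.

    #include <stdint.h>

    #define WINDOW_PACKET_SIZE 1024   /* packets from a window sFIFO are at most 1 KB */

    /* Illustrative descriptor for one window's send or receive FIFO living in
     * node main memory and shared with the adapter microcode. */
    struct window_fifo {
        uint32_t base;           /* start of the FIFO area in main memory */
        uint32_t slots;          /* number of packet slots in the FIFO */
        volatile uint32_t head;  /* first packet available to the consumer */
        volatile uint32_t tail;  /* next free slot for the producer */
        volatile uint32_t count; /* packets currently queued */
    };

    struct window {
        struct window_fifo sfifo;   /* packets waiting to be moved to the adapter */
        struct window_fifo rfifo;   /* packets moved from the adapter, waiting for the task */
    };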
The internal structure of different adapters may vary, but they all share a
common basic structure as shown in Figure 24 on page 43. The adapter has
a bidirectional FIFO where packets from the windows’ send FIFO are put
before these packets are sent to the adapter’s TBIC FIFO on their way to the
switch network. All data transfers are made by DMA engines on the adapter
under the control of the adapter’s microcode.
[Figure 24: Adapter Logical Structure — each window’s rFIFO and sFIFO in node memory (including the IP and service windows) exchange packets with the adapter’s bidirectional FIFO, which feeds the TBIC FIFO attached to the switch.]
After packets arrive from the SP Switch, another DMA engine moves the data
from the TBIC FIFO directly to the windows’ receive FIFO. The microcode
uses its internal tables to create the correct switch routing information for
outgoing packets, and to send incoming packets to the correct window on the
SP node.
The packet coming from the sFIFO of a window has a 1 KByte maximum size
and has the following structure:
• Header
• Destination node identifier
• Destination route
• Data
The data contained in the packet is formatted according to the layer that
manages the communication at the sFIFO level (see Chapter 4,
“Communicating with the SP Switch” on page 49 for more details about User
Space).
The header is used by the microcode to create the real route bytes that are
used by the switch chips to transport the data. A set of tables in the SRAM of
the adapter contain all the information needed to identify the destination and
the correct route, depending on the window type.
When a user space window packet is passed into the adapter’s FIFO, the
destination node of the header is a logical identifier and not a node number.
The user process does not have to know what the SP system topology is, or
on which node the destination process is actually running.
Together, the destination node and window make it possible for the receiving
node to correctly dispatch the incoming packet. The key field is a unique value
that is used for the communication between two windows. No other windows,
including IP and service, have the same identifier.
The use of the key ensures that a process never receives a packet from a
wrong sender. A new job may use a window that was previously allocated to
another job, and processes may not yet be aware that the window has
changed identity. Wrong packets may therefore be sent to the new job, but
since they carry a different key, they are discarded by the adapter.
The route field of the header is used to select one of the four shortest paths to
the destination. The message libraries that user processes use do not know
the actual route, but they are aware that four different choices are available,
so they make use of all of them, creating packets that continuously change the
route number. This helps maximize throughput in the network.
The adapter has a complete route table that gives the four shortest paths to
each available node in terms of switch chip ports that are used to create the
packet’s route to destination. This table is created and updated by the fault
service daemon that runs on the node.
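The following C sketch pulls together the header fields, the key check, and the route rotation described in the last few paragraphs; field names, widths, and the round-robin policy are illustrative assumptions.

    #include <stdint.h>
    #include <stdbool.h>

    #define NROUTES 4   /* four shortest paths are kept for every destination */

    /* Illustrative layout of the header a User Space window puts in front of
     * each packet in its send FIFO; names and widths are assumptions. */
    struct us_packet_header {
        uint16_t dest_task;   /* logical destination identifier, not a node number */
        uint16_t route_id;    /* which of the four shortest paths to use */
        uint32_t key;         /* unique value shared only by the communicating windows */
        uint16_t window;      /* destination window on the receiving node */
        uint16_t src_node;    /* lets the receiver identify the sender */
    };

    /* Route selection: the message library simply rotates through the four
     * precomputed routes, spreading traffic over the redundant paths. */
    static unsigned next_route;
    uint16_t pick_route(void)
    {
        return (uint16_t)(next_route++ % NROUTES);
    }

    /* Receive-side check modelled on the adapter microcode: a packet whose
     * source/key pair does not match the table entry for the window is dropped,
     * so a stale sender from a previous job never reaches the new job. */
    bool accept_packet(const struct us_packet_header *h,
                       uint16_t expected_src, uint32_t expected_key)
    {
        return h->src_node == expected_src && h->key == expected_key;
    }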
Once a user space packet is put on the adapter’s FIFO and is selected to be
transmitted, the microcode sends the SP Switch packet, feeding the TBIC first
with all the information needed for routing the packet, and then with the data
that is present on the adapter’s FIFO.
[Figure 25: Switch Packet Creation for User Mode — the sFIFO entry (destination ID, route ID, the four routes, key, window and node ID, followed by the data) becomes a switch packet of BOP, route, source node ID, key, window, data, CRC and EOP.]
The adapter adds the BOP, the CRC and the EOP and just copies the route
and the data from the packet it has received (see Figure 26).
[Figure 26: Switch Packet Creation for Service — the adapter adds BOP, CRC and EOP around the route and data copied from the received packet.]
The IP kernel extension does not know the SP Switch topology but only the
physical destination node, so it is the adapter’s microcode that has to supply
the correct routing. This is done by looking into a table that contains the
routes to every node in the system; the table is supplied to the adapter by the
fault service daemon. To make better use of the switch, the microcode
changes the route used to reach the same IP destination with each IP packet.
Switch packets are then created as in the previous cases: BOP, route, the key
field, the window, the data, CRC and EOP (see Figure 27).
[Figure 27: Switch Packet Creation for IP — the destination ID, route ID and routing information, followed by the data, become a switch packet of BOP, route, source node ID, key, window, data, CRC and EOP.]
On the receiving side, when a packet is received by the TBIC, the microcode
looks at the header received, detecting the sending node identifier, the key
and the destination window. Then it checks for the CRC. If any checksum
problem arises, the packet is discarded and an Error/Status error message is
sent.
If the destination window is for User Space, the source and key values are
compared with the expected ones in the table in the adapter’s SRAM to check
that the packet really belongs to that window; if they do not match, the packet
is discarded.
All the links in the network are able to transfer 150 MB/sec in each direction,
so the bandwidth is constant. The latency is very low compared to other
networks and is constant within the ranges shown. When more than 80 nodes
are involved, intermediate switch boards are needed to provide the required
path redundancy; this new network stage causes a small increase in latency.
IP packets can be sent over the SP Switch. The IP kernel extension takes
care of feeding data into the SP Switch, and any application that makes use
of IP can be run on the system and transfer data through the switch network.
The use of IP requires the application to issue system calls, thus switching to
kernel mode in order to send and receive data. Mode switching and IP
communications bring significant overhead that affects overall performance.
Network degradation may be a factor, especially for parallel processing where
a large number of computing nodes are involved and many messages are
exchanged.
As seen in 3.6.5, “Data Flow” on page 41, the switch adapter takes care of the
physical transmission of data through the SP Switch network and transfers
packets of data to and from the node’s main memory. The packets have a
specific format and are always sent to the correct environment (window) or
process that uses them.
Several interfaces are needed to provide both user applications and kernel
access to the switch adapter for communication and switch management
purposes. The Communication Subsystem (CSS) protocol software and its
interaction with other components is shown in Figure 28 on page 50.
[Figure 28: Communication Subsystem Components — user applications use MPI/MPL over MPCI, PIPE and Packet layers, or LAPI; TCP and UDP use the IP kernel extension; these paths, the E-commands and the fault service daemon reach the switch adapter through hardware abstraction layers.]
In PSSP 3.1, up to four User Space applications on a single node may use
message passing libraries to exchange data with the switch adapter. If more
than four applications are required, the communication has to rely on UDP/IP
communications over the switch. Performance is reduced, but no application
change is required. The UDP protocol can also be used to exchange
messages with machines that are not connected to the switch network but are
reachable using other TCP/IP communication links.
The library does not take care of allocation of computing resources and
starting of tasks on appropriate computing nodes. The MPI message passing
library expects all tasks to be started at the same time, and does not deal with
situations where the number of tasks varies during a parallel job. A run time
environment, like Parallel Environment for AIX (PE), must be used for this
purpose. We do not describe the run-time environment. Parallel Environment
supports MPI and MPL (IBM’s proprietary and practically obsolete Message
Passing Library).
The MPI API defines a number of data types, topology and environment
functions. These are implemented within the MPI layer of the stack, along
with the collective communication operations.
All MPI functions involving actual communication with other tasks are mapped
into point-to-point operations implemented via the MPCI library.
The MPCI layer provides functions to MPI and MPL, giving them access to
the lower PIPE layer. It also serializes message passing over its window,
since only one thread at a time may access a window.
MPCI deals with messages sent and received by the user application, relying
on the PIPE layer for a reliable, byte-stream physical transport medium
between tasks. Each user task has two PIPE FIFOs for each other task it
speaks to, as shown in Figure 29 on page 53, one for sending and one for
receiving messages. Whenever one task inserts a set of bytes on its send
FIFO, it expects that the destination task receives the same bytes in the same
order.
The PIPE layer hides the physical position of tasks. Once the library
environment is set up and all the PIPE FIFOs created, the MPCI layer is not
aware of whether it is using UDP or direct access to the switch adapter to
send the messages, or if the tasks are on the same node or on a different
machine. MPCI just continuously uses the transport layer given by PIPE
structures.
[Figure 29: PIPE FIFOs among Tasks — each pair of tasks (tasks 2, 3 and 4 are shown) is connected by a send PIPE and a receive PIPE in each direction.]
When the user application sends a message, it specifies the buffer where the
message lies, the destination task and several sending options. MPCI selects
the appropriate PIPE structure and inserts data into it.
On the receiving end, the MPCI layer continuously looks into all the PIPE
FIFOs for the task. It searches the header sent by the corresponding MPCI
layer to detect the beginning and the length of a message. It then waits for the
PIPE layer to fill the receiving buffer with all the data sent. When the message
is in the FIFO, MPCI extracts it. If the task has requested the message, the
data is directly copied into the user’s buffer; otherwise, it is copied into an
Early Arrival Table while waiting for the task to request it.
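A simplified C sketch of this receive-side choice is given below; the structures and names are invented for the illustration and do not reflect the actual MPCI implementation.

    #include <string.h>
    #include <stdlib.h>

    /* If the user task has already posted a matching receive, the data goes
     * straight into the user buffer; otherwise it is parked in an Early
     * Arrival Table until the receive is posted. */
    struct posted_recv { void *buffer; size_t length; int valid; };
    struct early_entry { void *data; size_t length; int valid; };

    #define EA_SLOTS 256
    static struct early_entry early_arrival[EA_SLOTS];

    void mpci_deliver(const void *msg, size_t len, struct posted_recv *recv, int slot)
    {
        if (recv->valid) {
            /* receive already posted: copy directly into the user's buffer */
            size_t n = len < recv->length ? len : recv->length;
            memcpy(recv->buffer, msg, n);
            recv->valid = 0;
        } else {
            /* no matching receive yet: stash a copy in the Early Arrival Table */
            void *copy = malloc(len);
            if (copy == NULL)
                return;              /* a real implementation would report the failure */
            memcpy(copy, msg, len);
            early_arrival[slot].data = copy;
            early_arrival[slot].length = len;
            early_arrival[slot].valid = 1;
        }
    }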
If the header is received for a message whose size is greater than the
MP_EAGER_LIMIT, MPCI first checks if the task has already requested it. If
the message has been requested, MPCI sends back a message asking the
sender to transmit the message data.
Once the message is extracted from a PIPE, the space is freed and the PIPE
can accept other bytes from the sending task.
Each PIPE structure may be seen as a FIFO where data is inserted by one
task and removed by the other task with no data loss during transmission. It is
the PIPE layer that takes care of using the correct transport layer (UDP or
direct access to switch adapter) and of performing recovery actions in case of
data loss during data transfer.
The PIPE layer takes the bytes from a send FIFO, assembles them into one
or more packets that fit the physical transport layer and sends them to the
destination task. When MPCI asks to transmit data with a size greater than
the MP_EAGER_LIMIT, the PIPE layer creates the packets directly from the
user buffer.
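The only control an application normally has over this boundary is the MP_EAGER_LIMIT environment variable of Parallel Environment. The sketch below is ours and uses an assumed limit of 4 KB (the real default depends on the PE/PSSP level and the number of tasks): the first send fits under the limit and travels eagerly, while the second is large enough to use the request/acknowledge exchange and the direct packetization from the user buffer described above.

  /* Sketch only: assumes the job was started with MP_EAGER_LIMIT=4096.
   * The message sizes are arbitrary; they just land on the two sides
   * of the assumed limit.                                               */
  #include <stdlib.h>
  #include <mpi.h>

  int main(int argc, char *argv[])
  {
      enum { SMALL = 1024, LARGE = 1 << 20 };     /* 1 KB and 1 MB       */
      char *small = calloc(SMALL, 1);
      char *large = calloc(LARGE, 1);
      int rank;
      MPI_Status st;

      MPI_Init(&argc, &argv);
      MPI_Comm_rank(MPI_COMM_WORLD, &rank);

      if (rank == 0) {
          MPI_Send(small, SMALL, MPI_CHAR, 1, 0, MPI_COMM_WORLD); /* eager      */
          MPI_Send(large, LARGE, MPI_CHAR, 1, 1, MPI_COMM_WORLD); /* rendezvous */
      } else if (rank == 1) {
          MPI_Recv(small, SMALL, MPI_CHAR, 0, 0, MPI_COMM_WORLD, &st);
          MPI_Recv(large, LARGE, MPI_CHAR, 0, 1, MPI_COMM_WORLD, &st);
      }

      free(small);
      free(large);
      MPI_Finalize();
      return 0;
  }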
All packets have a sequence number and a base address that are used to
reconstruct the byte stream when packets are received out of order.
The receiving PIPE acknowledges the delivery of the packets and implements
packet level token flow control. If a packet has been dropped or otherwise
unsuccessfully received, the PIPE layer also retransmits that packet.
Each task has one receive and one send PIPE for each task it exchanges
messages with. The corresponding sending and receiving PIPEs on the two
tasks synchronize themselves in order to provide a continuous data flow to
the MPCI layer.
The data packets created by the PIPE layer are transmitted to the receiving
PIPE either using UDP or through direct access to the switch adapter. If
UDP/IP is used as the transport layer, a switch to kernel mode is required to
execute the IP code both on the sending and on the receiving side to make
packets flow from sender to receiver.
Direct access to the switch adapter is the most efficient way to communicate
with the SP Switch. Each task has one output and one input DMA buffer that
is shared among all PIPEs. Packets are copied from each send PIPE of the
task to the output DMA buffer, where the adapter gets them using DMA. The
adapter sends the packets into the switch network as described in 3.6.5,
“Data Flow” on page 41. On the receiving side, the adapter moves the packets using DMA into the destination task’s input DMA buffer, from where they are copied into the correct receive PIPE. Figure 30 shows a representation of this data flow.
[Figure 30: User Space data flow between two tasks through the switch adapters and the SP Switch]
Packets from all send PIPEs are evenly copied into the output DMA buffer. If
there is no more space in the output DMA buffer, the PIPE layer stops
transmitting until space is freed.
The packet structure, when direct access to the switch adapter is used, is described in Figure 31 on page 56. The first part of the packet contains the destination task logical identifier and a route indicator that tells which of the precomputed routes through the switch network the packet has to take.
The second part of the header is used for synchronization between PIPE
layers on different tasks. The adapter sends packets to the DMA buffer of the
destination task and the source field is used by the PIPE layer to identify the
receive FIFO to which the data belongs. Then the sequence number and
base address are used to define in which position of the FIFO the packet’s
content has to be put, creating the FIFO abstraction for the MPCI layer.
The token field is used for flow control. It contains the number of packets the sender task can receive from the destination task. In this way each PIPE buffer is aware of how many packets can be sent, thereby avoiding receive buffer overflows.
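Putting the fields just described together, a packet header can be pictured roughly as the C structure below. The field names, widths, and ordering are our own assumptions for illustration; the actual layout used by the adapter is not documented here.

  /* Assumed layout, for illustration only; not the real adapter format. */
  #include <stdint.h>

  struct pipe_packet_header {
      uint16_t dest_task;    /* destination task logical identifier             */
      uint16_t route;        /* which of the precomputed routes the packet uses  */
      uint16_t source_task;  /* identifies the receive FIFO the data belongs to  */
      uint16_t tokens;       /* flow control: packets the sender can accept      */
      uint32_t sequence;     /* sequence number, acknowledged by the receiver    */
      uint32_t base_address; /* position of the payload in the FIFO byte stream  */
      /* payload bytes follow the header */
  };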
When packets are removed by MPCI from the input buffer, the PIPE creates
an acknowledge packet for the received sequence number that is sent back to
the sender. When the sender receives the acknowledge, it frees the space
held for the sent packets and accepts more data from MPCI.
If sent packets are not acknowledged within a given period of time (less than
one second), all outstanding packets are retransmitted and will be placed in
the correct place in the byte stream of the receiver.
When the active message brings data from the originating process, LAPI
requires the handler to be written as two separate routines:
• A header_handler function, which is the function that is specified in the
active message call. It is called when the message first arrives at the target process (that is, when the first packet of the message arrives), and it
provides the LAPI dispatcher (the LAPI layer that deals with the arrival of
messages and the invocation of handlers) with:
• An address where the arrival data of the message must be copied
• The address of the optional completion handler
• A completion handler, which is called after the whole message has been received, meaning that all the packets of the message have reached the target process.
The separation of the handler into a header handler and completion handler
in the active message infrastructure allows multiple independent streams of
messages to be sent and received simultaneously within an LAPI context.
LAPI supports messages that can be larger than the size supported by the
underlying network layer. This implies that data sent using an active message
call will arrive in multiple network packets, possibly out of order. This places
some requirements on how the handler is written.
When the first packet is received, the LAPI dispatcher receives it, identifies it as a new message, and calls the header handler specified in it. The handler
returns a buffer pointer where incoming data is to be copied and the address
of the completion handler. The LAPI library then moves all the received
packets to the specified buffer and when the whole message is received, the
completion handler is executed.
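The division of labor between the two handlers can be sketched in C. The types and the dispatcher below are hypothetical (they are not the actual LAPI prototypes); they only mirror the sequence described above: the header handler returns a destination buffer and the address of a completion handler, and the completion handler runs once the last packet has been assembled into that buffer.

  /* Hypothetical handler interface, for illustration only. */
  #include <stddef.h>
  #include <stdio.h>

  typedef void completion_handler_t(void *user_info);
  typedef void *header_handler_t(void *uhdr, size_t msg_len,
                                 completion_handler_t **comp, void **user_info);

  static char message_buffer[65536];          /* where incoming data is copied */

  /* Runs after the whole message has reached the target process. */
  static void my_completion_handler(void *user_info)
  {
      printf("message assembled at %p\n", user_info);
  }

  /* Runs when the first packet of a new message arrives: it tells the
   * dispatcher where to copy the data and which completion handler to run. */
  static void *my_header_handler(void *uhdr, size_t msg_len,
                                 completion_handler_t **comp, void **user_info)
  {
      (void)uhdr;
      (void)msg_len;
      *comp = my_completion_handler;
      *user_info = message_buffer;
      return message_buffer;                  /* destination for the payload   */
  }

  int main(void)
  {
      /* What an active message call would carry: a pointer to the header
       * handler to be invoked on the target side (illustration only).        */
      header_handler_t *handler = my_header_handler;
      (void)handler;
      return 0;
  }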
The data operations that LAPI supports are Put and Get. They are unilateral
operations. They transfer data between the initiating task and another task
specified by the initiator. Put transfers data from the local virtual memory of
the initiating task to the virtual memory of the other task. Get transfers the
data in the opposite direction. Concurrent use of a buffer which will be
modified by multiple asynchronous operations has to be controlled by the
user.
The Put function in LAPI enables the push mode of communication: write
data into the address space of another task in the parallel job. The Get
function enables the pull operation: read from the address space of another
task in the parallel job.
4.6 IP Layer
The SP system was created to provide a powerful parallel processing
environment and initially did not use IP to provide communication over the
switch. IP was then introduced to make applications that rely on the TCP/IP
protocol work on the SP system and to provide a communication layer for
tasks that exchange messages with machines not in the SP system.
The most common method used in the TCP/IP protocol stack to map IP
addresses into network identifiers is ARP. The ARP protocol maintains an
ARP table with the mappings. If an entry does not exist for a given IP
address, then the protocol issues an ARP request: a broadcast on the
network asking the node with the requested IP address to reply giving its
network identifier. The node with the requested IP address responds back.
ARP table entries are deleted periodically (the default is 20 minutes).
The SP system uses the ARP protocol. The SP Switch does not support
broadcast but only point-to-point communications. In order to support ARP,
the interface layer has to translate an ARP request into several messages,
one for each node in the switch network.
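Schematically, the interface layer turns one ARP broadcast into a loop of point-to-point sends, one per node known to be on the switch. The code below is only an illustration with invented names; it is not the actual if_ls implementation.

  /* Schematic only: an ARP "broadcast" over the SP Switch becomes one
   * point-to-point message per node, because the switch has no hardware
   * broadcast.                                                            */
  #include <stdio.h>

  #define MAX_SWITCH_NODES 16                      /* small demo size      */

  struct arp_request { unsigned int target_ip; };  /* hypothetical         */

  static int node_present[MAX_SWITCH_NODES] = { 1, 1, 0, 1 };  /* demo data */

  /* Hypothetical per-node unicast; the real interface layer hands the
   * packet to the adapter with the destination switch node number.        */
  static void send_to_switch_node(int node, const struct arp_request *req)
  {
      printf("ARP request for 0x%x sent to switch node %d\n",
             req->target_ip, node);
  }

  static void broadcast_arp_request(const struct arp_request *req)
  {
      int n;
      for (n = 0; n < MAX_SWITCH_NODES; n++)
          if (node_present[n])
              send_to_switch_node(n, req);         /* one unicast per node */
  }

  int main(void)
  {
      struct arp_request req = { 0xC0A86401 };     /* 192.168.100.1        */
      broadcast_arp_request(&req);
      return 0;
  }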
In order to reduce network bandwidth usage due to ARP packets, when if_ls
is configured in a node, it “broadcasts” a gratuitous ARP response packet. In
this manner, the ARP tables for all nodes get the mapping for the incoming
node.
The interface layer works directly with the switch adapter. It can write some
variables to the adapter’s SRAM, and some kernel memory segments are
read and written by the adapter. Using these shared areas, the kernel and the
adapter can synchronize themselves and exchange data.
A send FIFO (sFIFO) structure is part of the interface layer. It has 512 entries
of 256 bytes each. If an outgoing IP datagram consists of only one mbuf, it is
copied to the sFIFO and the mbuf is freed. Each entry of the sFIFO contains a
header that tells to which physical node the data has to be sent.
When sending a larger datagram, too much overhead would occur if it had to
be split into smaller pieces and copied into the sFIFO. To improve
performance, a Send Pool buffer is used in pinned memory that is shared with
the adapter. The Send Pool’s size can be from 512 KB to 16 MB.
For IP communications over the switch, the kernel allocates mclusters from
the Send Pool instead of the standard pool. The cluster size can be 4, 8, 16,
32 or 64 KB (the switch MTU is 64 KB). The interface layer then receives an
mbuf structure where all the mclusters are already in a memory zone that can
be read by the adapter, so it creates a new entry in the sFIFO containing
pointer(s) to the Send Pool.
The adapter reads the sFIFO, detects that there is more data on the Send
Pool and starts reading it using DMA. In this case the IP datagram is larger
than the switch packet: the adapter’s microcode splits the datagram while it is
being read from the Send Pool, creating switch packets whose headers
contain an offset within the mcluster. When all the data is inserted into the
switch, the space in the pool is freed.
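The send-side decision just described can be summarized as follows. All names are invented for illustration; the real if_ls code operates on mbuf chains and adapter descriptors rather than flat buffers.

  /* Simplified sketch of the IP send path over the switch (names invented). */
  #include <stddef.h>
  #include <stdio.h>

  /* Single small mbuf: copy the data into an sFIFO entry and free the mbuf. */
  static void copy_into_sfifo_entry(const void *data, size_t len, int dest_node)
  {
      printf("sFIFO entry: %lu bytes copied for node %d\n",
             (unsigned long)len, dest_node);
  }

  /* Larger datagram: the data already sits in Send Pool mclusters, so the
   * sFIFO entry only carries pointers into the pool.                        */
  static void point_sfifo_entry_at_send_pool(const void *mclusters, size_t len,
                                             int dest_node)
  {
      printf("sFIFO entry: pointer to %lu bytes in the Send Pool for node %d\n",
             (unsigned long)len, dest_node);
  }

  static void ip_output_over_switch(const void *data, size_t len,
                                    int is_single_mbuf, int dest_node)
  {
      if (is_single_mbuf)
          copy_into_sfifo_entry(data, len, dest_node);
      else
          point_sfifo_entry_at_send_pool(data, len, dest_node);
  }

  int main(void)
  {
      char small[128], big[32768];
      ip_output_over_switch(small, sizeof(small), 1, 5);   /* small datagram */
      ip_output_over_switch(big, sizeof(big), 0, 5);       /* large datagram */
      return 0;
  }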
[Figure: the IP send path — an IP datagram held in mbufs/mclusters is either copied into an sFIFO entry or left in the Send Pool, with a pointer to it placed in the sFIFO]
When an incoming IP datagram is larger than 228 bytes, a Receive Pool, a pinned memory segment shared with the adapter, is used. The microcode on the adapter reads the packet, looks into the header for the datagram's size, allocates a buffer in the Receive Pool, and finally transfers the data into the correct position of that buffer using DMA. When the last packet is received, the microcode adds an entry in the rFIFO that points to the data in the Receive Pool. The space in the Receive Pool is released after the receiving application reads the data into its memory space.
The if_ls interface is notified of the presence of new data in the rFIFO using
interrupts. The number of IP packets received that trigger an interrupt varies
depending on the load, in order to reduce the total number of interrupts and
increase the performance.
If a packet that is part of a large IP datagram is lost in the switch network, the
buffer will never be filled completely. A timer is set up upon the receipt of the
first packet and if it expires before the entire datagram is received, the partial
datagram is dropped and the space in the Receive Pool is freed. if_ls, like any other IP interface layer, takes no recovery action.
The use of a Receive Pool improves the overall performance of the system
since the data coming from the adapter is not copied into any other
intermediate buffer before being dispatched to the user application. The only
drawback is that applications may not be fast enough to release the buffer
quickly and the pool may fill up.
The primary daemon is the one that is in charge of configuring and monitoring
the SP Switch’s behavior. Exactly one primary daemon is present in every SP
Switch partition. Its fundamental functions are:
• Initialization of SP Switch
• Recovery actions for SP Switch faults
• Generation and update of route tables
The primary backup daemon is a secondary daemon, but it also has the goal
of checking that the primary daemon is up and running. If it detects that the
primary daemon is no longer functional, it changes its personality to primary
and chooses the new primary backup from among the available secondary
daemons.
The secondary daemons only take care of the route table generation and
update, in response to the data passed by the primary daemon.
All the fault service daemons are responsible for local switch adapter configuration,
initialization and monitoring. They download routing information to the
adapter’s SRAM, start the microcode and handle adapter faults.
All the switch chips receive initialization packets that instruct them on how to
manage events and what kind of operation they are enabled to perform.
Individual receiving and sending ports are enabled or disabled according to
the network configuration and fault isolation. Each chip receives two different
paths along the switch network to be used by the error/status notifications
generated by the chip.
After the network scan is completed, all the fault service daemons compute
the routes from their node to all the other nodes, using the information
collected by the primary daemon.
During the periodic scan, the primary daemon may detect that a node
previously removed from the network because it was not responding has
started working properly; for example, a node may have been turned off and
then back on. The node is then automatically inserted (automatic Eunfence)
into the network.
Each daemon uses the same network topology file to create the routing
information. The primary daemon is in charge of updating the topology view
of all the secondary daemons. When an update is received, all daemons
compute a new set of paths to all nodes and update the adapter’s internal
tables.
Routing table updates are done at the initialization of the switch and also
each time the primary daemon enables or disables a network component
after a switch fault or a specific request of the system administrator.
The primary backup listens to all the network scans that the primary
performs. For each scan it starts a 2.5 minute timer. If no activity from the
primary is received before the timer expires, another 2.5 minute timer is
activated. If this second timer also expires without a scan being detected, the primary daemon is considered faulty.
4.8.1 Initialization
The CSS kernel extension is loaded during system boot time by the switch
adapter configuration method. Its initialization is done in two phases. In the
first initialization phase, at system boot time, AIX recognizes the kernel
extension. In the second initialization phase, the fault service daemon
executes DMA setup code that has to be run by a process.
Clients have a local data structure that has to be synchronized with the data
present on the adapter. Data structures are mostly updated by clients. These
updates are made using kernel services, system calls and, in some cases,
direct reads and writes to the adapter’s address space.
The Kernel Extension validates the values provided by the client to be sure
that the window is not already in use by a running job, and checks the job
identifier for consistency.
If data from the client is valid, a shared memory segment is allocated for DMA
data transfers to and from the adapter. It contains the send and receive FIFOs
of the client. The memory is pinned and attached to the User Space’s address
space. All necessary steps to configure DMA operations are performed.
The adapter generates interrupts when packets arrive or when switch faults
occur. The Kernel Extension is configured to signal the correct process that
data has arrived, or to activate the fault daemon for fault handling.
The Service Library, running as a user client, allocates the DMA buffer from
the fault service daemon process heap.
The Switch Routing Table identifies the physical routes to each node of the
system. This is generated by the fault service daemon code and downloaded
to the switch adapter’s microcode as part of switch fault handling or the
microcode loading process. For each node four different paths are provided
that are used in a round-robin fashion by User Space and IP packets. After
each switch fault, the Kernel Extension is used to update the adapter’s
partition information.
The switch table, also called the Job Switch Resource Table (JSRT) or
partition table, maps logical task destination IDs to switch node numbers and
window IDs. The JSRT is created by the LoadLeveler or the Resource
Manager from Parallel Environment requests of nodes for parallel jobs. When
a parallel job starts, the Kernel Extension is invoked to load the JSRT into the
adapter.
The kernel uses the device driver to register the routines that have to be
called whenever the adapter issues an interrupt to signal that a DMA
operation has completed, that is, that new packets are available. Depending
on the window involved, a separate routine is called.
Several ioctls are supported to read and write data from and to the adapter’s
registers and SRAM. They are used to update information needed by the
adapter to manage the incoming and outgoing switch packets, and to interact
with the adapter’s microcode.
In this chapter, we describe some basic areas you need to consider when
planning for the switch of your SP system. The tips and hints discussed here
are not meant to be a replacement for the two comprehensive planning
guides SP: Planning, Volume 1, Hardware and Physical Environment, GA22-7280, and SP: Planning, Volume 2, Control Workstation and Software Environment, GA22-7281. Consultation with these planning guides is strongly
recommended.
Although levels of PSSP prior to 3.1 support both the SP Switch and HiPS,
the two switch networks are not physically compatible and cannot be mixed
within an SP system, not even in a partitioned SP system.
If you expect your system to eventually have more than eight nodes or more than two partitions, or you plan to connect it to another frame, you must not choose the SP Switch-8. Its internal configuration does not provide the scalability and flexibility of the full 16-port switch.
The switch boards are also numbered and they get their number from the
order they appear in the sequence of frames. That is, if all frames are
switched frames (frames with a switch board) the switch number of the switch
board in a frame will be the same as the frame’s number. But if the system
has non-switched frames (which are frames without a switch board) the frame
and switch numbers will not necessarily match, as shown in Table 2.
Table 2. Frame Numbers and Switch Numbers

  Frame tty   Switched frame   Frame number   Switch number
  tty0        yes              1              1
  tty1        no               2              -
  tty2        no               3              -
  tty3        yes              4              2
The two non-switched frames in the preceding table are expansion frames for
frame 1. That is, the nodes in frame 2 and frame 3 are connected to switch
board number 1. There are two aspects about expansion frames that should
be mentioned:
1. Expansion frames have to follow the switched frame they are
complementing.
2. A switched frame and its expansion frames cannot contain just any combination of nodes; only specific node configurations are supported.
The slot number of a thin node is clearly the number of the slot it occupies.
For wide and high nodes, the slot number is the number of the lowest slot the
node occupies.
[Figure: slot numbers and node numbers in a multi-frame SP system]
[Figure: switch board layout — external jacks, switch chips, and the nodes connected to them]
SP Switch-8 has only 8 ports and the connection rule is different from that of
the standard SP Switch. Nodes are connected to the switch board in the order
they are placed in the frame, independently of their slot number.
Note that in the example, the switch node number of the nodes is always the
node number minus one. This will always be the case, unless the system has
expansion frames. Without going into the details of how nodes in expansion
frames are connected to the switch board, we show in Figure 37 on page 77
the switch node numbers of the system previously shown in Figure 35 on
page 75. Compare the node numbers with switch node numbers.
In configurations that require up to five switch boards, all the boards can still
be connected directly to each other; there are 4 connections between each
pair of boards. We call this configuration a single-stage environment.
Direct connection of five frames thus provides only four separate paths
between two nodes, with no backup path. This means that any problem on a
link on the switch will cause the packet traffic to flow on one of the other three already used paths, causing network performance degradation. In this
situation, it may be better to introduce an Intermediate Switch Board (ISB),
which provides a higher number of possible routes and makes the transition
to a 6-switch board system easier.
With more than five NSBs, ISBs are required to provide at least four paths between every pair of nodes. ISBs are installed in a switch frame, which can contain only switch boards. When ISBs are installed, NSBs are no longer directly
connected. NSBs are connected only through the second level of staging
provided by the ISBs. This results in 4 ISBs being included for a 6-NSB
system, and 8 ISBs for a 12-NSB system.
Attention
Originally, GRF was available as an extension node, so documents
published before PSSP 3.1 use the term extension node only for GRF.
There are two models of SP Switch router product, GRF-400 and GRF-1600,
while RS/6000 Enterprise Server Model S70 is currently the only self-framed
node.
A valid unused node switch port in the SP system is required to attach the SP
Switch router to an SP switch, that is, a switch port that meets the rules for
configuring frames and switches. You do not install the GRF in an SP frame, but you must reserve a switch port for it.
[Figure: a GRF SP Switch router connected to the administrative network and to other networks and hosts, and attached through switch router adapters to the SP Switch of two frames; the asterisks mark the switch port numbers being used, and the actual physical connections are made to the switch boards]
For a more detailed explanation of GRF and its external connections, refer to
the redbook PSSP 2.4 Technical Presentation, SG24-5173, and to SP Switch
Router Adapter Guide, GA22-7310.
But unlike other networks, you do not need to use ARP over the switch. If you
do not enable ARP, you need to specify the switch network subnet mask and
the IP address of the first node in the switch. The IP addresses for
subsequent nodes are calculated using the switch node numbers of the
nodes. It is the fault service daemon that provides the IP interface layer with a
mapping function that translates IP addresses into switch node numbers, as
depicted in Figure 40. Note that with ARP disabled, you can only have a single IP subnet on your switch.
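The mapping itself is simple arithmetic: the css0 address of a node is the starting address you specified plus the node's switch node number. The helper below is ours, just to show the calculation; it assumes the first node has switch node number 0 and that all addresses stay within the single switch subnet.

  #include <stdio.h>
  #include <arpa/inet.h>

  /* Illustrative only: derive a node's css0 address from the address given
   * for switch node number 0 plus the node's switch node number.           */
  static struct in_addr css0_address(struct in_addr base, int switch_node_number)
  {
      struct in_addr a;
      a.s_addr = htonl(ntohl(base.s_addr) + (unsigned)switch_node_number);
      return a;
  }

  int main(void)
  {
      struct in_addr base, node;
      inet_aton("192.168.100.1", &base);      /* address of switch node 0   */
      node = css0_address(base, 9);           /* switch node number 9       */
      printf("%s\n", inet_ntoa(node));        /* prints 192.168.100.10      */
      return 0;
  }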
[Figure 40: mapping IP addresses to switch node numbers. Without ARP, the fault service daemon supplies the if_ls interface layer with a mapping function from IP address to switch node number; with ARP, the if_ls layer resolves the mapping through an ARP cache filled by the simulated broadcasts.]
Partitioning is an optional feature. If you do not make use of it, a single default
system partition containing all the nodes is created when the system is
installed the first time, named with the hostname of the Control Workstation.
Attention
Some configuration procedures can only be done on the default
system partition. If you are using multiple partitions on your SP
system, you have to reconfigure your system to the default single
system partition. As an example, when you add a switch board, you
need to have an unpartitioned system.
There are some rules that must be followed to create good partitions that guarantee a minimum level of availability. The concept is that at least two disjoint paths are required between two nodes in the same partition, instead of the four paths that are always present in an unpartitioned system.
In a switch board (recall Figure 36), a node switch chip is a switch chip that
can be connected to nodes, while a link switch chip is a switch chip whose
ports can be connected only to other link switch chips of other switch boards.
In node switch boards, there are four node switch chips and four link switch
chips, while in intermediate switch boards, all switch chips are link switch
chips.
[Figure: a node switch chip with its four attached nodes — the minimum boundary for partitioning]
When more than four nodes must be in the same partition, then more node
switch chips are involved. To make those switch chips exchange data, link
switch chips are required. In order to have at least two disjoint paths, at least
two link switch chips are then used.
As a consequence of the second rule, only two partitions that make use of
two or more node switch chips can be defined in the same node switch board:
each such partition reserves for itself two link switch chips, leaving no more
link switch chips for other partitions.
[Figure: a partition containing Node A and Node B, showing the link switch chips (SW4 through SW7) reserved on each switch board]
As you can see, several switch chips are already reserved for the partition,
limiting the topology of any additional partitions. Creating many partitions
when using several switch boards may be tricky, and you should use a
scheme like the one in this example to ease your work.
If you do not find a suitable pattern in the supplied templates, you can create
your custom partitioning scheme very easily with the Partitioning Aid GUI
tool, which also makes the needed consistency checks on the defined
partitioning scheme, applying partitioning rules.
[Figure: VSD architecture — on each node the VSD layer sits above LVM and IP, using the disk and network device drivers, with requests shipped over the SP Switch (IP network) to the node owning lv_X or lv_Y]
[Figure: RVSD — RVSD_X (lv_X) and RVSD_Y (lv_Y) served by two nodes sharing twin-tailed or SSA disks over the SP Switch (IP network)]
[Figure: RVSD failover — when the node serving RVSD_X goes down, the surviving node takes over RVSD_X in addition to RVSD_Y]
For more detailed information about VSD and RVSD, see PSSP: Managing Shared Disks, SA22-7349.
[Figure: the GPFS software stack — applications access GPFS (or JFS) through the VFS layer; GPFS uses RVSD and VSD from PSSP, which in turn use LVM and IP, with the switch carrying the data traffic]
For planning and installation, refer to the redbook GPFS: A Parallel File System, SG24-5165.
In this chapter we discuss some of the installation issues for the SP Switch,
as described in PSSP Installation and Migration Guide, GA22-7347. It is not
our objective to discuss all the installation steps of an SP system, and we
assume you are consulting the installation manual to understand the context in which the actions described here are to be taken.
You may also partition your system, as more fully discussed in 6.5, “Setting
up System Partitions” on page 103.
After your nodes are installed, you are ready to start the switch through the
Estart command (refer to 7.3, “Starting the SP Switch” on page 119). Before
that, you may also want to specify your primary and primary backup nodes
(refer to 8.1, “Selecting the Primary and Primary Backup Nodes” on page
131).
[Figure: switch configuration flow — the switch topology file is annotated with Eannotator and stored in the SDR with Etopology; the clock configuration file, derived from the hardware configuration, is applied with Eclock]
When you configure SP Switch adapters for your system, you have to define the following three options through the smit add_adapt_dialog panel or the spadaptrs command:
• Skip IP Addresses for Unused Slots (-s flag).
• Enable ARP for the css0 Adapter (-a flag).
• Use Switch Node Numbers for css0 IP Addresses (-n flag).
As you can see, the name of the file reflects the SP hardware configuration by
using the following three variables:
• <num_nsb>: number of node switch boards in the configuration
• <num_isb>: number of intermediate switch boards in the system
• <type>: type of topology, usually 0
The only exception to this convention is for the topology file of an SP Switch-8
system. The name of that topology file is:
expected.top.1nsb_8.0isb.1
Actually, the topology files in the /etc/SP directory are symbolic links to files under /spdata/sys1/syspar_configs/topologies. In the initial releases of PSSP, the files were located in /etc/SP; in later releases the file placement was changed and the links were introduced to maintain compatibility.
The same topology file can be used in both SP Switch systems and the older
HiPS systems. The only important difference between these systems, with
respect to the topology files, is in the use of external connections (jacks) of
the node-to-switch and switch-to-switch cables.
The topology files shipped with PSSP specify the jack connections for HiPS
systems. Thus, to make use of the topology files in finding out which physical
external connection is being used or is showing a problem, we need to
change the standard topology files, a process called annotation.
The topology files are annotated through the Eannotator command. In all
following discussions, unless otherwise indicated, the topology files shown
have been annotated for an SP Switch.
The topology files are plain text files. Each non-comment line in the file
represents a point-to-point link in the SP Switch network. There are two types
of connections: between a switch chip port and a node, and between two
ports of different switch chips. Typical connections are shown in Figure 50.
Node connection:
  Before Eannotator:  s 15 1 tb0 4 0 L01-S00-BH-J14 to L01-N5
  After Eannotator:   s 15 1 tb3 4 0 E01-S17-BH-J9 to E01-N5
  (s 15 1 = switch chip 15, port 1; tb3 4 0 = switch adapter, switch node number 4, adapter port 0; E01-S17-BH-J9 = physical frame 1, slot 17, external jack J9; E01-N5 = frame 1, node 5)

Switch-to-switch connection:
  Before Eannotator:  s 13 2 s 23 2 L01-S00-BH-J5 to L02-S00-BH-J5
  After Eannotator:   s 13 2 s 23 2 E01-S17-BH-J4 to E02-S17-BH-J4
  (switch chip and port on each side, followed by the physical frame and external jack of each end)
Each end of a link in the topology file is defined by the following three fields:
<device type > <device ID> <port>
The <device type> field represents to which device the link is connected.
Possible values are:
• s, for switch chip
• tb3, for switch adapters
The name tb3 is generic and does not indicate the actual adapter’s type; it is
just an indication that this is a system with an SP Switch.
After the identification of both sides of a link comes the physical connection
information. There is one entry per side of the link. For the switch side we
have:
• The number of the frame where the switch board is installed
• The frame slot where the switch board is installed
• An indication if this is an intraboard connection (SC) or an external
connection (BH)
• If this is an external connection, the external jack for the corresponding
port
If there is no physical node connected, the frame number and node number
appear as xx.
Important
The topology files should not be changed, because the resulting configuration may be one that is not supported. All RS/6000 SP systems
are cabled in a standard way that matches the information in the topology
files. The cabling or the contents of the topology files should only be
changed for diagnostic purposes on the advice of an IBM engineer.
Here we describe two topology files. The full content of the two files can be
found in B.1, “Example of a Switch Topology File” on page 247.
The links corresponding to the highlighted lines in the preceding file are
shown in Figure 51 on page 98.
[Figure 51: switch board layout showing the links that correspond to the highlighted topology file lines]
s 15 4 s 10 5 E01-S17-SC
s 15 1 tb3 4 0 E01-S17-BH-J9 E01-N5
# switch 1 to switch 2
s 13 3 s 23 3 E01-S17-BH-J3 to E02-S17-BH-J3
s 13 2 s 23 2 E01-S17-BH-J4 to E02-S17-BH-J4
s 13 1 s 23 1 E01-S17-BH-J5 to E02-S17-BH-J5
s 13 0 s 23 0 E01-S17-BH-J6 to E02-S17-BH-J6
The annotated topology files, one per partition, are stored as SDR files.
These files are kept in the /spdata/sys1/sdr/partitions/<ip address>/files
directory, where <ip address> is the IP alias that identifies the partition. The
name of the topology file is stored in the Switch_partition SDR class.
When a topology file is stored in the SDR, the actual name stored is the name
you specify followed by a version number. For example, if you specify the
name of your topology file as expected.top.annotated, the actual name stored
would be expected.top.annotated.1. If you again store a topology file in the
SDR for that partition, the name specified will be appended with ".2", and so
on.
The topology files for partitioned systems, as well as the other files that define
the partitioning, are in the directory /spdata/sys1/syspar_configs. Figure 52 on
page 100 shows the directory structure.
The clock topology files are located in the /etc/SP directory. The appropriate
standard clock topology file is uniquely determined by your hardware
configuration. The naming convention used for the clock topology files is the
same one used for the switch topology files, with two exceptions: the filenames start with "Eclock" instead of "expected", and the clock topology file for an SP Switch-8 system has a type of 0.
You may use the Eclock command to specify your clock topology file. You can explicitly specify the name of the distribution file using the -f <clock configuration file> flag.
Each clock topology file starts with a brief explanation of the contents of the
file. The full text of a sample clock topology file can be found in B.2, “Example
of a Clock Topology File” on page 250. Then two distribution trees follow: the
standard distribution and an alternate distribution. An alternate distribution is
provided so you can work around a clock distribution problem, for instance
when your master switch board is down. The distribution tree is defined using
a relatively simple syntax. For instance, an excerpt from the file
Eclock.top.7nsb.4isb.0 follows:
# Switch number
# | Clock multiplexor (mux) value
# | | Clock receiver jack number (High Performance Switch) / (SP Switch)
# | | | Clock source switch number
# | | | | Clock source jack number (High Performance Switch)
# | | | | | Clock source jack number (SP Switch)
# | | | | | |
1 1 J3/J3 1001 J3 J3
2 1 J3/J3 1001 J5 J4
3 1 J3/J3 1001 J7 J5
4 1 J3/J3 1001 J9 J6
5 1 J3/J3 1001 J4 J34
6 1 J3/J3 1001 J6 J33
7 1 J3/J3 1001 J8 J32
1001 0 xx/xx 0 xx xx
1002 1 J3/J3 1 J5 J4
1003 1 J3/J3 1 J7 J5
1004 2 J5/J4 2 J9 J6
The important values are the first and second columns. All the other columns
simply document the clock distribution tree.
The first column identifies the switch board and the second column identifies
its clock source. The clock source is defined by the board’s mux value. The
mux value defines the clock source in the following way:
• mux 0: the clock source is the internal oscillator, that is, this is the master
switch board.
• mux 1: the clock source comes from external jack 3.
• mux 2: the clock source comes from external jack 4.
• mux 3: the clock source comes from external jack 5.
• mux n (between 4 and 34): the clock source comes from external jack n.
(Specifying that the clock source comes from external jacks 7-10, 15-18,
23-26, and 31-34 only makes sense for intermediate switch boards.)
Note that all other values in the file can be derived from the mux value and
the standard SP cabling for the system in question. The remaining columns
are ignored by Eclock, except for the clock source switch (board) number,
which is used to define the order in which the switch boards’ clocks are
configured: first the master switch board, then the first level of the distribution
tree, and so forth.
Figure 53. Standard Clock Distribution for a System with 7 NSBs and 4 ISBs
[Figure: repartitioning flow — archive the SDR, apply the configuration (restoring the SDR if it fails), reboot the nodes, and verify the configuration]
After applying your configuration, you should verify it by issuing the following
commands:
spverify_config
splstdata -p
syspar_ctrl -E
The Eunpartition script sends a request to the primary node for it to enable
the partition’s boundary chip ports. Refer to 7.3.3, “Phase One of Switch
Initialization” on page 124 for an explanation of why this is necessary. After
the ports have been enabled, the primary and primary backup fault service
daemons are restarted.
If you do not run Eunpartition before applying the new configuration, you may
face several Estart problems. To solve these problems, you should restore the
previous partitioning by restoring the SDR. If you do not have a backup of the
SDR, recovery can be accomplished by the following sequence:
1. Issuing Eclock to reset the switch, which will take down the switch in all
partitions, even those not being affected by the repartitioning.
2. Rebooting all the nodes or issuing an rc.switch command on all nodes.
3. Issuing Estart in each of the system partitions.
6.6.1.1 CSS_test
The CSS_test command checks the following:
• Whether each node is alive
• Switch IP connectivity, using ping
• The level of the ssp.basic and ssp.css software on each node
6.6.1.2 spmon
The spmon command operates the system controls and monitors system
activity. It has many flags. One of the most popular ones is spmon -d, which
shows the state of all nodes in the partition. Adding the option -G, you can
get global status information.
6.6.1.3 splstdata
The splstdata command is used to list the system’s configuration data. It has
many flags, one for each type of data maintained in the SDR.
Following are some of the most relevant SDR classes for switch verification
and debugging.
6.6.2.1 Adapter
This class contains the adapter information shown by spethernt and
spadaptrs.
6.6.2.2 switch_responds
This class shows the current status of the switch. Arguably it is the most
consulted class while debugging switch problems. In the following sample of
the contents of the class, the first two nodes are off the switch:
# SDRGetObjects switch_responds
node_number switch_responds autojoin isolated adapter_config_status
1 0 1 1 css_ready
5 0 1 1 css_ready
9 1 1 0 css_ready
13 1 1 0 css_ready
6.6.2.3 Syspar_map
The Syspar_map class is one of the most important in the SDR. It is created
early during the installation process. It is referenced by many commands to
create other object classes and to execute SP-related tasks. You should verify
the objects in this class after installation or hardware reconfiguration.
6.6.2.4 Syspar
Syspar is a partition class and contains several attributes for the partition:
partition name, IP alias, system version, authentication methods, and so on.
6.6.2.5 Switch
This class contains the main hardware-related switch attributes, as follows:
switch_number
frame_number
slot_number
switch_partition_number
switch_type
clock_input
switch_level
switch_name
clock_source
clock_change
6.6.2.6 Switch_partition
This class is used to store the switch starting values for the partition. Its
attributes follow:
switch_partition_number
topology_filename
primary_name
arp_enabled
switch_node_number_used
primary_backup_name
6.6.3 Logs
To verify installation or customization problems, you should also check the
logs maintained by CSS.
You should initially check the AIX log through errpt, and if more information is
needed, look at the log files generated by the switch software. These files are
located in the /var/adm/SPlogs/css directory.
Refer to 9.2, “SP Switch Log Files” on page 150 for more information about
the logs and their use in solving switch-related problems.
[Figure: switch startup timeline on the CWS and the nodes — each node boots, cfgmgr configures the switch adapter, the adapter is initialized and the fault service daemon is started; the primary node then starts the switch and the non-primary nodes join it]
If all the configuration steps have been completed successfully, the method
sets the css0’s adapter_status attribute to css_ready in ODM’s CuAt.
Otherwise, the attribute is set to an appropriate value describing the point
where configuration failed. For example, a value of make_special_fail
indicates a failure while creating the special file. You can find the list of values
this attribute can assume in PSSP: Diagnosis Guide, GA22-7350. This value
is eventually transferred to the SDR and used from there. This intermediate
step is necessary because, at this point in the boot process, the SDR is
inaccessible.
If the adapter was not configured successfully, you can check the node’s boot
log by running the alog -f /var/adm/ras/bootlog -o command.
The ucfgtb3 utility is used to unconfigure the switch adapter. This utility sets
the device css0 as defined and marks the adapter as not_configured in the
SDR. This utility used to terminate and unload the kernel extension and
device driver, but changes in AIX’s IP implementation (as of AIX 4.1)
prevent the utility from doing so.
The cfgtb3 utility is used to configure the SP Switch adapter after it has
been unconfigured. This utility is the configuration method for the css0
adapter.
After running these utilities you must restart the fault service daemon by
running rc.switch or css_restart_node, as described in the following
section.
The last two functions are detailed in 7.3, “Starting the SP Switch” on page
119.
The daemon then enters into its main loop, waiting for service packets from
the primary node, service packets from the switch fabric (if this node is the
primary node), interrupts from the adapter, or E-commands. E-commands,
like Estart, send requests to the fault service daemon through the i_stub_SP
or sp_fs_control programs, which queue those requests in the fault service
work queue maintained by the fault_service_SP kernel extension. Requests
from the adapter are also put in the same work queue.
The primary node’s fault service daemon does not depend on the switch
fabric successfully notifying it when a problem arises. It periodically scans the
whole switch fabric to check for link and chip failures. The switch scan is
executed every two minutes, sending Read Status service packets to all
active switch chips. If errors are reported or no response is received, the
faulty component is disabled and the topology is updated. The primary
backup node is also scanned. If its daemon is not responsive, an error entry
is cut in the AIX error log and a primary backup takeover takes place, where a
new primary backup node is selected.
If no errors were detected in the first part of the scan, the daemon checks for nodes waiting to be automatically unfenced. Any node that is off the switch but has its autojoin attribute set is automatically brought back onto the switch network.
Every daemon in the system handles error conditions that occur locally. They
handle adapter hardware errors, bad packets received, and microcode errors.
If an error is not recoverable or reoccurs beyond its specified threshold, it is
considered permanent. A permanent error leaves the TBIC in reset,
effectively removing the node from the switch network. When a permanent
error is detected the autojoin attribute in the SDR is turned off, so that the
node will not unfence at the next scan.
An interesting situation arises when the daemon is killed with the SIGKILL
signal. The node does not have a fault service daemon running but continues
to be part of the switch fabric. All protocols continue to run normally. The
daemon is only needed when the switch configuration is changed (an unfence of another node, for instance) or when Estart is run; it is only then that its absence causes the node to be fenced.
The fault service daemon, besides cutting error entries in the AIX log,
generates several log files of its own: daemon.stderr, daemon.stdout,
worm.trace, fs_daemon_print.file, cable_miswire and flt. The first two files are
not used much in PSSP 3.1. The worm.trace and fs_daemon_print.file files
trace many of the daemon’s activities. The file cable_miswire reports any
node-to-switch or switch-to-switch connection apparently miswired. The last
file, flt, is definitely the most important log file generated by the daemon. It
contains a summary of major events and errors encountered by the daemon.
Also, whenever the daemon faces a serious error, it automatically invokes the
css.snap utility, which collects the aforementioned logs and other files
generated by the daemon, as well as the output of some diagnostics utilities
into a single compressed tar image. The description of the fault service
daemon’s log files, as well as the use of the css.snap command, can be found
in Chapter 9, “SP Switch Problem Determination Tools” on page 149.
Whenever the fsd entry has a wait action, the script changes its behavior and waits for the node to join the switch before exiting. Changing the fsd entry on all the nodes is not enough by itself, since rc.switch does not start the SP Switch, just the node's daemon. For a node to join the switch, the switch has to be running with at least one node, the primary node. The script has no logic to start the switch; it assumes that the Estart command has already been executed. To prevent the initialization process from being blocked indefinitely when the node is unable to join the switch, or when the node is the primary node, rc.switch waits a maximum of five minutes for the node to join the switch, after which it exits.
This feature of PSSP 3.1, combined with the automatic starting of the switch described in 8.7, “Automatic Management of the SP Switch” on page 142, effectively guarantees that in most situations any later configuration scripts will only run after the switch is up.
You can directly execute the rc.switch script to restart the fault service
daemon. This script may be run in a node where the connection to the
switch has been lost. When the daemon is restarted, it resets the adapter,
which may solve some adapter problems, like, for example, losing the clock
after an Eclock. Another way to restart the daemon is to execute the
css_restart_node script, which includes a call to the rc.switch script.
# Estart
Estart: 0028-070 Unable to Estart, the clock source for one or more
switch boards has changed. Eclock must be run to re-establish the clock
distribution of the switch clock network.
If all the tests pass, Estart invokes, using rsh, the Estart_sw script on the
oncoming primary node.
Then the script checks if the topology file needs to be distributed to the nodes
in the partition. This verification is done by matching the number of nodes
with switch adapters to the value of the num_nodes_success attribute in the
Switch_partition SDR class. This attribute saves the number of nodes that
successfully received the current topology file in the last distribution. If the
values do not match, the topology file is distributed to all the nodes.
Attention
The attribute num_nodes_success is reset by the Etopology command to
force the distribution of the new topology file. You can also reset this value
when you need to redistribute the topology file. This could be useful, for
example, when the topology file in one of the nodes has been lost or
become corrupted. To reset this attribute, you can use:
SDRChangeAttrValues Switch_partition num_nodes_success=0
The boot/install servers check whether there were any distribution errors by
checking the timestamp and size of the files. They also inform the primary
node which nodes have successfully received a copy. The primary node
collects these values and updates the num_nodes_success attribute in the
SDR. The distribution process creates a log file on the primary node, named
/var/adm/SPlogs/css/dist_topology.log.
In PSSP 3.1, if there is a problem during the distribution of the topology file,
the switch initialization continues, giving the following warning message.
# Estart
Estart: 0028-061 Estart is being issued to the primary node: sp5n09.msc.itso.ib\
m.com.
dist_to_bootservers: 0028-178 Received errors in distribution of topology file f\
rom bootserver to at least one node.
See /var/adm/SPlogs/css/dist_topology.log on primary node for details.
dist_to_bootservers: 0028-075 Could not distribute the topology file to these no\
des.
They may not come up on the switch network.
sp5n05.msc.itso.ibm.com
Switch initialization started on sp5n09.msc.itso.ibm.com.
Initialized 15 node(s).
Switch initialization completed.
#
It is even possible that the problem that caused the distribution failure (an
authentication problem, for instance), does not prevent the node from joining
the switch: the node tries to use an existing copy of the topology file. But if the
topology file is not present or is corrupted, the node will fail to join the switch
and its fault service daemon will terminate.
Finally, the fault service daemon is signaled so it can start the switch. At this
point in time, the message Switch initialization started on <primary node> is
sent to the Control Workstation.
The Estart_sw script now waits for the fault service daemon to complete the
switch initialization. The initialization is considered complete when the
daemon creates the act.top.<pid> file, where <pid> is the PID of Estart_sw.
This file is created in the usual /var/adm/SPlogs/css directory and its contents
are as follows:
Note that the nodes are identified by their switch node numbers.
Therefore, you may sometimes use the presence of such act.top.n files as an indication of an error in the switch, and use their contents to check the result of the switch reinitialization. We should emphasize that this
under-the-covers switch initialization is not a reexecution of the Estart
command, but solely the reexecution of the Worm code described in the
following sections.
After the act.top.<pid> is created by the fault service daemon, the Estart_sw
script reads that file and, from its contents, updates the SDR with the current
primary and primary backup nodes. It also displays the number of initialized
nodes and the number of uninitialized links, if any, at the Control Workstation.
As its last step, the script renames the act.top.<pid> file to topology.data.
The script verifies whether there was a problem during the switch initialization
in two ways:
1. It waits a limited amount of time for the creation of the act.top.<pid> file.
2. It checks whether the fault service daemon is still active.
If either condition fails, the script sends the appropriate message to the
Control Workstation.
Attention
Although Estart_sw sets the time-out in accordance with the size of the system, under extreme conditions that time limit may be insufficient for the
primary fault service daemon to finish the switch initialization (for example,
if the primary node is heavily loaded). If you run into this problem, you
should choose another primary node.
The Worm uses a Breadth First Search (BFS) algorithm to discover the
topology of the switch. BFS is a very well-known algorithm to visit all nodes in
a graph. The basic idea behind the algorithm is to have a First-In-First-Out
queue containing vertices to visit and to mark all visited vertices, as follows (a compact C sketch appears after the list):
1. Pick an initial vertex and put it in the queue.
2. Get a vertex from the queue. If the queue is empty, we are done.
3. Mark the current vertex as visited.
4. Traverse a link from the current vertex. If there are no more links to
traverse, go to step 2.
5. Check whether the vertex on the other side of the link has already been
visited.
6. If it has not been visited yet, put it at the end of the queue.
7. In any case, go to step 4.
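The steps above are a textbook BFS; as a reference, here is a compact C version over a toy adjacency matrix (not a switch topology). The only deviation from the listed steps is that a vertex is marked as soon as it is enqueued, which avoids placing the same vertex in the queue twice.

  /* Breadth First Search sketch: a FIFO queue of vertices plus a visited mark. */
  #include <stdio.h>

  #define NV 6                                   /* vertices in the toy graph */

  static void bfs(int adj[NV][NV], int start)
  {
      int queue[NV], head = 0, tail = 0;
      int visited[NV] = { 0 };
      int v, w;

      visited[start] = 1;
      queue[tail++] = start;                     /* step 1: enqueue the start */

      while (head < tail) {                      /* step 2: stop when empty   */
          v = queue[head++];                     /* step 3: visit this vertex */
          printf("visiting vertex %d\n", v);
          for (w = 0; w < NV; w++) {             /* step 4: traverse links    */
              if (adj[v][w] && !visited[w]) {    /* step 5: seen already?     */
                  visited[w] = 1;
                  queue[tail++] = w;             /* step 6: enqueue neighbor  */
              }
          }                                      /* step 7: next link         */
      }
  }

  int main(void)
  {
      int adj[NV][NV] = { 0 };
      int edges[6][2] = { {0,1}, {0,2}, {1,3}, {2,3}, {3,4}, {4,5} };
      int i;

      for (i = 0; i < 6; i++) {                  /* build a small undirected graph */
          adj[edges[i][0]][edges[i][1]] = 1;
          adj[edges[i][1]][edges[i][0]] = 1;
      }
      bfs(adj, 0);
      return 0;
  }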
The Worm uses the expected topology file to build the graph. Both switch
chips and processor nodes are the vertices of the graph. The initial vertex is
the primary node. The node-to-chip connections and the chip-to-chip
connections are the links of the graph.
Nodes that are customer-fenced, that is, that have the SDR attribute isolated
equal to 1 and the autojoin attribute equal to 0, are removed from the list of
vertices to be visited. These nodes will not join the switch.
For example, consider an SP with a single frame and 8 wide nodes, as shown
in Figure 56 on page 125. Suppose that the primary node is node 9.
[Figure 56: switch chips and node connections for the example system]
The Worm code visits the switch board in the following order:
From Node 9 to Chip 4
From Chip 4 to Node 13, Chip 0, Chip 1, Chip 2, Chip 3
From Chip 0 to (Chip 4), Chip 5, Chip 6, Chip 7
From Chip 1 to (Chip 4), (Chip 5), (Chip 6), (Chip 7)
From Chip 2 to (Chip 4), (Chip 5), (Chip 6), (Chip 7)
From Chip 3 to (Chip 4), (Chip 5), (Chip 6), (Chip 7)
From Chip 5 to Node 5, Node 1, (Chip 0), (Chip 1), (Chip 2), (Chip 3)
From Chip 6 to Node 3, Node 7, (Chip 0), (Chip 1), (Chip 2), (Chip 3)
From Chip 7 to Node 11, Node 15, (Chip 0), (Chip 1), (Chip 2), (Chip 3)
Note that switch chips are visited more than once. These revisits, shown
within parentheses, are needed to check all links between chips and to check
for miswires in multiboard configurations.
The Worm initializes a switch chip on its first visit. A single route to the
primary node is given to each, and all error reporting is disabled, except for
Re-Timer and Link Synchronization errors. The Worm also informs each chip
of its device ID.
If the Worm cannot talk to a chip or a node, the Worm considers the link to
that component to be faulty and the corresponding predecessor chip port is
disabled. Wrap-plugs that should not be there are detected and the user
informed. Also, all chip ports not present in the topology file are disabled.
Thus, to enable a link that was down and then fixed, you need to reset the
switch chips by running the Eclock command. A similar problem occurs
when two partitions are joined without a previous Eunpartition. The Worm
will fail to reclaim some of the links because ports on the other side were
left disabled.
During a revisit, a switch chip returns its device ID to the primary node, and if
it is not the expected one, a miswire is suspected. A miswire involving a node
is detected during the first visit to the node, since the node’s device ID is
already known and reported by that node’s fault service daemon. During this
visit, each node also receives the name of the partition’s expected topology
file name, as well as a single route back to the primary node.
The Worm now knows the actual topology of the network. It constructs the
out.top file, which contains the expected topology file with annotations
describing the status of links and devices as detected by the Worm’s first
phase. You can find the list of possible device and link status in 9.2.4, “The
out.top File” on page 160. With the actual topology known, the Worm
generates the two disjoint service routes from all nodes and chips back to the
primary node.
The Worm then initializes the time-of-day (TOD) counter in all chips and adapters. Before the TOD initialization proper, the Worm quiesces the fabric, broadcasting to all nodes a command to stop sending packets, since such traffic would interfere with the TOD initialization.
Important
The TOD counter is called a "switch clock" in some SP documentation, and
should not be confused with the oscillating signal that drives the switch
fabric. It should also not be mistaken for the time of day maintained by AIX, which is synchronized in an SP system through the use of NTP.
The fault service daemon initializes the TOD in all components with almost
the same value. The TOD counters do not necessarily have the same value
because the algorithm used in the SP Switch estimates the distance
(therefore, the delay) between switch chips. The values of the counters,
though, do not drift apart since the same global oscillator increments them.
The switch’s TOD is not used by the PSSP and AIX software; it is used for CSS application debugging. The global time implemented by these counters allows a
precise (for all practical purposes, at least) ordering of switch events, mainly
fault occurrences. Applications have access to the TOD counter through an
ioctl.
The fault service daemons on all nodes use the DB Updates to create an
out.top file by updating the expected topology file. The update will fail and the
node will be fenced if the expected topology file (whose name was distributed
during the Worm’s first phase) does not exist on that node. The
acknowledgment of the last DB Update packet is only sent after the node has
managed to create out.top. The absence of this acknowledgment packet
informs the primary fault service daemon that there is a problem.
All participant nodes now have the actual topology file and are ready to run
the RTG code to generate the routes between the nodes.
Note that the algorithm calculates the best routes between all pairs of nodes.
This is necessary to obtain a global load balance. Every daemon runs RTG,
calculating all best routes. But only the routes between the local node and the
other nodes are loaded to the adapter. This could seem a waste of computing
cycles, since one node could compute the routes for every other node and
distribute them. This used to be the case with the predecessor
High-Performance Switch. But the approach was abandoned because of the
larger overhead of distributing the routes through the switch, mainly in larger
SP systems.
The actual loading of the routes is only done when the Worm broadcasts a
Load Routes service packet. Each node then loads the calculated routes into
its adapter and sends an acknowledgment back to the primary node.
The distribution of DB updates and the generation and loading of routes are
also carried out whenever the topology of the switch changes; that is, during
fences, unfences, and recovery from permanent errors.
During the Worm’s second phase, error recovery is not in effect. The topology
of the switch has to be stable so it can be distributed to all nodes and the
routes between nodes can be generated. An asynchronous error or the
absence of an acknowledgment is considered unrecoverable: the node or the
chip in error is disabled and the initialization is retried. The primary fault
service daemon attempts to initialize the switch four times. If the switch
cannot initialize, the daemon terminates.
The switch initialization is now finished. If the specified primary backup node
did not come up, the primary node chooses a new primary backup node and
notifies the fault service daemon on that node to change its personality.
Meanwhile, all fault service daemons enable the IP and User Space protocols
for their nodes. In particular, the IP interface css0 is configured as up.
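Once the css0 interface is up, you can check it from any node with the usual
AIX commands, for example:
# ifconfig css0
# netstat -in | grep css
The first command should show the interface flagged UP and RUNNING; the
second shows its IP address and packet counters.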
In this chapter we go through the actions you take to manage the SP Switch
during normal operation. To verify the complete syntax of the commands
mentioned in this chapter, refer to the PSSP: Command and Technical
Reference, SA22-7351.
The primary backup node passively listens for activity from the primary node.
The daemon checks every two and a half minutes whether it has been
scanned by the primary node. If the primary backup node is not contacted by
the primary node during two consecutive check periods, it considers the
primary down and assumes its role. This takeover will happen at most seven
minutes after the primary node failure (one scan time of two minutes plus two
detection periods of two and a half minutes). The primary node takeover
involves:
• Selecting itself as the oncoming primary node
• Reinitializing the switch fabric
• Selecting a new primary backup node
The new primary backup node is selected so that it is as far as possible from
the primary node, so that the chances of a simultaneous failure are as low as
possible. This translates into the following:
1. Select a node attached to a different switch board.
2. If there is a single switch board, select a node attached to a different
switch chip.
Attention
The reinitialization of the switch during a primary node takeover is not an
Estart, as explained in 7.3.2, “Starting the Worm Code” on page 121. A
noticeable difference is that, in this case, the daemon generates an
act.top.1 file, instead of the usual topology.data file generated by the Estart
command. You can trace a sequence of primary node takeovers by
checking the existence and timestamp of such files, using, for example, the
command:
dsh -a ls -o /var/adm/SPlogs/css/act.top.1
The primary node also watches over the primary backup node. If the primary
node detects that the primary backup node can no longer be contacted on the
switch fabric, it initiates a primary backup node takeover, selecting a new
primary backup node and using the same criteria previously described.
During the period of time between the failure or loss of the primary node and
the takeover by the primary backup node, the switch fabric continues to
function. Failures that happen while there is no active primary node are
detected during the reinitialization of the switch fabric by the new primary
node.
Using the Eprimary command, the administrator may select two nodes, one to
be used as the primary node and the other as the primary backup node. Note
that a dependent node cannot be a primary or primary backup node for the
SP Switch. The Eprimary command selects a default oncoming primary or
oncoming primary backup node if one is not specified. The default oncoming
primary is the lowest numbered node in the partition, and the default
oncoming primary backup is the highest numbered node in the partition.
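For example, assuming nodes 5 and 10 exist in the partition (a sketch; verify
the exact syntax in the command reference for your PSSP level), you could
select them as oncoming primary and oncoming primary backup with:
# Eprimary 5 -backup 10
The selection takes effect at the next Estart, as explained below.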
Until the next Estart is issued, the nodes specified in the Eprimary command
are referred to as the oncoming primary and oncoming primary backup
nodes, and are recorded in the oncoming_primary_name and
oncoming_primary_backup_name attributes of the Switch_partition SDR
class. Once Estart completes, the primary_name attribute is updated based
on the oncoming_primary_name attribute and the primary_backup_name
attribute is updated based on the oncoming_primary_backup_name attribute.
Initially, the primary node and primary backup node fields have the value
none. In a 16-node system, the oncoming primary node field has the default
value of 1 and the oncoming primary backup node field has the default value of 16.
Current Oncoming
Primary Primary
Node Node
none 1
Current Oncoming
Primary Primary
Backup Backup
Node Node
none 16
When Estart is executed, the node specified in the oncoming primary field
becomes the primary node. The node specified in the oncoming primary
backup becomes the primary backup.
Current Oncoming
Primary Primary
Node Node
1 1
Current Oncoming
Primary Primary
Backup Backup
Node Node
16 16
If the primary backup node fails, the primary node selects a new primary
backup node (node 15 in this example):
Current Oncoming
Primary Primary
Node Node
1 1
Current Oncoming
Primary Primary
Backup Backup
Node Node
15 16
If the primary node fails, the primary backup node automatically becomes the
primary node and a new primary backup is selected.
Current Oncoming
Primary Primary
Node Node
15 1
Current Oncoming
Primary Primary
Backup Backup
Node Node
2 16
In summary, primary and primary backup fields reflect the current state of the
partition, and the oncoming fields are not applicable until the next invocation
of the Estart command.
The Eclock command establishes a clock distribution tree after the system is
powered up or when an alternate clock tree must be selected. It sets the
clocking sources that provide the clocking of each switch board.
The clock_input attribute contains the bulkhead jack that is carrying the clock
signal that is driving the corresponding switch board. A value of 0 indicates
that this board is the master board. The clock_source attribute indicates
which switch board is generating the clock signal. This value is a function of
the jack being used and the standard SP cabling for the system in question. It
is saved in the SDR to help the Eclock script in correctly ordering the setting
of the clock sources.
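You can display these attributes for every switch board with a command such as:
# SDRGetObjects Switch switch_number clock_input clock_source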
The information in the SDR indicates what the clock distribution tree should
be. To get the actual clock distribution tree you should issue the spmon -G -d
command and consult the frame information. A sample (partial) output for the
same two-frame system follows:
4. Checking frames
Whenever you need to reestablish your clock distribution tree, for example
after powering up your system, you may use the Eclock -r command, which
extracts the clock distribution data from the SDR. Using Eclock with such a
flag, instead of the better known -f or -d flags, will reestablish the most recent
clock distribution tree, which may not be the standard clock distribution.
Assume that you are having a clock distribution problem, and to work around
it you established an alternate distribution. A subsequent Eclock -d would not
reestablish the appropriate distribution tree: your switch network will not
come up complete, or may not come up at all; and you will need to issue
another Eclock. If you make a habit of using Eclock -r, you will not run into
such problems.
When you have a single switch board system, you do not need to run Eclock
after powering up your system, since there is only one possible clock source,
the default. You should be aware, though, that the master switch chip and PLL
chosen by the Eclock script are not the same as the power-on defaults. This
could lead to a behavior difference in the extremely unlikely event of a single
PLL or chip failure.
If you are having any hardware problem that is affecting your clock
distribution tree, you need to establish a different distribution tree. Your first
option is to try the alternate distribution that comes in the clock configuration
files. You invoke the alternate distribution by issuing Eclock -a <clock config
file>.
Important
You should use the following information with extreme care and at your own
discretion. The Eclock command does take your switch network down, and
its incorrect use may leave your network unusable. If you are facing serious
clock distribution problems, we advise you to get help from IBM Service.
If the standard alternate tree does not work around the hardware problem you
are facing, then you may want to create your own clock configuration file or
change the clock source settings of individual switch boards.
If you run Eclock with the alternate clock configuration, you will find out that
switch board 1004 is up but there are now two other switch boards with no
clock source, intermediate switch boards 1001 and 1003. Now you are in
worse shape than before.
The first thing to do is to look into the annotated switch topology file for your
system, assuming you have an unpartitioned system. Find the external links
of the clockless switch board, which is 1004. Then pick the first link that
connects that board to an on-line switch board. For instance, the first link in
the file for switch board 1004 will do:
s 10043 3 s 13 0 E08-S07-BH-J3 to E01-S17-BH-J6
So we have a cable running from node switch board 1 jack 6, to switch board
1004 jack 3. At this point we are only interested in where the cable connects
to the clockless switch board, which is jack 3. We must transform that value
into the corresponding mux value, which is 1. See 6.4, “Specifying the SP
Switch Clock Distribution Tree” on page 100. Now we are ready to issue the
appropriate Eclock command:
Eclock -s 1004 -m 1
Switch board 1004 should be receiving the clock signal by now, but we are
not done yet. Eclock -s does not update the clock_source attribute in the
SDR. Therefore, it is still indicating that board 1004 is getting the clock signal
from board 2, not 1. In this case, and probably in many others, this
discrepancy will not make any difference, since this value is used just to order
the changes to the clock sources. But to make sure the SDR is correct, we
should issue the command:
SDRChangeAttrValues Switch switch_number==1004 clock_source=1
In large SP systems, you could try to avoid bringing down your entire switch
network by running Eclock -s for the switch board connected to that cable
and farthest away from the primary node, that is, farthest away with respect
to the BFS algorithm. This could be complicated to figure out, depending
on your system configuration. You may try, then, the following approach:
1. Your system does not have ISBs: change your primary node to a node in
one of the switch boards connected to the cable; Eclock the switch board
on the other side of the cable; run Estart.
2. Your system has ISBs: change your primary node to a node that is not in
the node switch board connected to the cable; Eclock the node switch
board connected to the cable; run Estart.
Observe that with the preceding step we are avoiding the Eclock of an
intermediate switch board. We leave it as an exercise for you to figure out why
the Eclock of an ISB will never do you any good.
A final note: always keep in mind the clock configuration tree. If the switch
board you are resetting feeds its clock to other boards, these other boards
have to be reset, too. So, in some situations, a piecewise approach may not
bring you any significant gain.
Starting the switch with only the primary and primary backup nodes up is
enough in PSSP 3.1 because of its automatic unfence feature.
You can issue the Estart with the -m option, which specifies that the Emonitor
daemon should be started. But in PSSP 3.1, the Emonitor subsystem is no
longer needed, because of the new Switch Admin daemon and the automatic
unfence features. However, if you do not run the Switch Admin daemon you
may still wish to use the Emonitor subsystem. And if you are using a primary
node at PSSP 2.4 or earlier in a coexistence environment, you may still need
to use the Emonitor subsystem.
The Efence has an -autojoin option, which allows the node to automatically
join the switch once the node is again operational. In PSSP 3.1, this option is
no longer needed due to the automatic unfence feature. Note further that in
PSSP 3.1, a node that is fenced with autojoin will automatically join the switch
within two minutes. However, if the primary node is at PSSP 2.4 or earlier, the
autojoin option enables the nodes in the argument list to be fenced and to
automatically rejoin the current switch network if the node is rebooted, the
fault service daemon is restarted, or an Estart command is issued.
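For example, to fence node 13 with the autojoin attribute and later unfence it
explicitly, you would issue commands such as:
# Efence -autojoin 13
# Eunfence 13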
Note that the automatic unfence is a feature of the fault service daemon
running in the primary node and works even when all other nodes are at
PSSP 2.4 or earlier. But for a pre-PSSP 3.1 fault service daemon to be able
to coexist with this new feature, the node should be at one of the following
PTF sets:
• PSSP 2.1 - PTF set 30 (warning: a coexistence environment with nodes
at PSSP 2.1 is not supported)
• PSSP 2.2 - PTF set 17
• PSSP 2.3 - PTF set 9
• PSSP 2.4 - any
The fix provided in the aforementioned PTFs makes the fault service daemon
able to coexist with the automatic unfence feature.
If your primary node is at PSSP 2.4 or earlier, you also need to unfence a
node when the node has been rebooted or the fault service daemon has been
restarted without a previous Efence -autojoin command. But if you are
running the Emonitor subsystem, it automatically attempts to unfence a node
when the node is rebooted.
Eunfence first distributes the current topology file to the nodes before they are
unfenced. The Eunfence command, unlike Estart, unconditionally distributes
the topology file to all nodes being unfenced. If the topology file distribution to
a node fails, that node is not unfenced.
The Equiesce command changes the personality of the primary and the
primary backup nodes to secondary, effectively disabling switch error
recovery and primary node takeover. After an Equiesce, data still flows over
the switch.
You must issue an Estart to reestablish switch recovery and primary node
takeover after running Equiesce.
The pre-PSSP 3.1 Emonitor daemon runs on the Control Workstation and
monitors host_responds through spmon. Whenever a node comes up, it sets a
three minute timer, waiting for other nodes to come up. Each time another
node comes up, the timer is reset to three more minutes, up to a maximum of
12 minutes. When the timer triggers, it checks whether any node listed in the
/etc/SP/Emonitor.cfg file has host_responds on and switch_responds off. If
there is any, it tries to bring it back to the switch network. If the primary node
is coming back and the primary backup shows as active in the SDR, Emonitor
takes no action—a primary node takeover should take place. If the preceding
condition is false and the primary or the primary backup node is coming up,
Emonitor attempts an Estart.
One problem that Emonitor does not address is the need for an explicit Estart
after your system is powered up. The Emonitor daemon itself can only be
started through the Estart -m command. After this manual Estart, nodes
rebooting automatically join the switch, at least in most normal situations.
Also, if you need to keep a node fenced after a reboot, you need to edit
Emonitor’s configuration file and remove the node from it. Otherwise the node
will get unfenced after the reboot, even if you fence it without -autojoin. A
further inconvenience of the Emonitor daemon is that you have to set up its
configuration file before using it.
In PSSP 3.1 the behavior of the Emonitor daemon is slightly changed when
your primary daemon is at PSSP 3.1. The Emonitor daemon is aware of the
new automatic unfence feature and does not attempt to unfence a node in any
situation. Thus, a node that is explicitly fenced, either by you or by the system
after an unrecoverable error, remains fenced, even if the node is in the
daemon’s configuration file.
The PSSP 3.1 Emonitor also checks whether the Switch Admin daemon is
active, and if so, does not attempt an Estart when one is needed.
The Switch Admin daemon starts the switch in any partition where no node is
acting as the primary node.
This occurs after a power up or reboot of all nodes in a partition, or anytime
the primary and primary backup are included in the set of nodes that are
rebooted. The Switch Admin daemon runs Estart whenever the oncoming
primary node shows itself as available through host_responds.
The Switch Admin daemon is the program cssadm and runs on the Control
Workstation. The Switch Admin daemon is started out of inittab and is SRC
controlled. The subsystem name for the daemon is swtadmd. There is a
single subsystem for the entire SP system, that is, a single daemon
administers all system partitions.
If you need to disable the daemon, you should edit the file, changing the 1 to
a 0. Next you should stop the daemon with the command:
stopsrc -s swtadmd
You may also stop the subsystem and remove the swtadmd entry from inittab.
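A sketch of the complete sequence, assuming the inittab entry is labeled with
the subsystem name, as is usual for SRC-controlled PSSP daemons:
# stopsrc -s swtadmd
# rmitab swtadmd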
The following are the tests and resulting actions triggered by the occurrence
of any of the events the daemon monitors:
1. If the current primary node has just lost its IP connectivity or there is no
current primary node in the partition (primary_name==none in the
Switch_partition SDR object):
• If the primary backup node has switch_responds on, the daemon takes
no action as normal primary takeover should occur.
• If the primary backup node has switch_responds off and the oncoming
primary node has host_responds on, the daemon attempts an Estart.
• If the oncoming primary has host_responds off, no action is taken. The
daemon waits until the oncoming primary comes up.
2. If the oncoming primary node has just come on-line:
• An Estart is attempted.
If the Estart fails, the daemon takes no recovery action. It waits for another
significant event to occur.
In addition, when the daemon starts, it verifies, in every partition, whether the
current primary node has no switch connectivity or there is no current primary
node. If so, it simulates the corresponding event and executes the algorithm
previously described.
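You can make the same check manually; the attribute names are those of the
Switch_partition class described earlier:
# SDRGetObjects Switch_partition primary_name primary_backup_name oncoming_primary_name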
Note that the actions taken when test 1 above is true are also taken whenever
any node has come up and there is no current primary node. For example,
assume you are doing a cluster start-up and your oncoming primary node
comes up before your oncoming primary backup node. The daemon is
notified that the oncoming primary node is up and attempts an Estart. The
command fails because the oncoming primary backup is still down. But when
the oncoming primary backup comes up, the first condition still holds and the
daemon attempts a second Estart, which this time should succeed.
Notice also that an Estart is attempted whenever the oncoming primary node
comes on-line. At first glance, it could seem unnecessary to reinitialize the
switch if it is up and running. But if the switch is up and the oncoming primary
node has just come back, the reinitialization returns the primary personality to
the node you originally selected.
In addition, some of the SP Switch logging events may trigger the execution
of the /usr/lpp/ssp/css/css.snap script to generate a snapshot of the log files
on the node where the error was reported. See section 9.4, “SP Switch
Utilities” on page 175 for more information on the css.snap script.
Especially in larger systems, having to go through the AIX error log on all
nodes and on the Control Workstation is not an easy task. To make problem
determination easier, PSSP 3.1 introduces a centralized error log in the
Control Workstation that contains a summary of all switch-related error log
entries, as described in 9.2.1, “The Centralized Switch Error Log” on page
152. This summary log file should be used as the starting point for any SP
Switch problem solving.
Enter the following command to view all the SP Switch error log entries in all
nodes:
dsh -a errpt -a -N Worm
Enter the following command to view all the SP Switch adapter diagnostics
log entries in all nodes:
dsh -a errpt -a -N css0
You can also use the AIX Error Notification Facility to be notified of errors
reported in the AIX error log as soon as they occur. This facility causes the
execution of an ODM-defined method when a particular error occurs. For
example, you could create a method which would send you an e-mail
whenever the fault service daemon cuts an error log entry
(en_resource="Worm"). Refer to the PSSP: Diagnosis Guide, GA22-7350, for
more information.
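The following is a minimal sketch of such a notification object; the object
name and mail command are only examples, while the stanza fields are the
standard AIX errnotify attributes:
# cat > /tmp/worm_notify.add <<'EOF'
errnotify:
        en_name = "worm_mail"
        en_persistenceflg = 1
        en_resource = "Worm"
        en_method = "errpt -a -l $1 | mail -s 'SP Switch error log entry' root"
EOF
# odmadd /tmp/worm_notify.add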
To reduce the usage of /var you may trim log files or remove old log files. AIX
has a default crontab entry that removes all hardware error entries after 90
days and all software error entries after 30 days. SP-related log files in
/var/adm/SPlogs are cleaned up on the nodes by the script
/usr/lpp/ssp/bin/cleanup.logs.nodes. Log files are cleaned up on the Control
Workstation by /usr/lpp/ssp/bin/cleanup.logs.ws.
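The default AIX root crontab typically carries entries similar to the following;
check crontab -l on your system for the exact values:
0 11 * * * /usr/bin/errclear -d S,O 30
0 12 * * * /usr/bin/errclear -d H 90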
When you are frequently running diagnostic scripts, make sure you save
copies of the log files in /var/adm/SPlogs/css; they may be automatically
overwritten or deleted to make free space for new log files. The script
css.snap, for example, checks the filesystem size at startup and cleans up old
files.
Use the summary CSS error log as the primary starting point to diagnose SP
Switch problems. The centralized summary log resides in the
/var/adm/SPlogs/css/summlog file on the Control Workstation. Any SP Switch
or SP Switch adapter error that has an entry cut in the AIX error log in any
node triggers the generation of a summary record. Each record is appended
in the order received by the Control Workstation. The fields of the summary
record are shown in Table 6 on page 153.
The summlog file allows you to identify all nodes that are affected by a fault,
and shows the failure symptoms on each node. The log is formatted for
processing by user scripts. It will be the basis for automated error analysis
mechanisms in future releases. The log cleaning scripts include this log to
keep it to a reasonable size.
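For instance, since the node name appears as the second field of each record
(as in the sample shown later in this chapter), a quick count of entries per
node can be produced with:
# awk '{ print $2 }' /var/adm/SPlogs/css/summlog | sort | uniq -c | sort -rn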
Important
You should check the summlog periodically to find out about problems on the
switch that were automatically recovered or that have no perceptible
consequences.
The messages in the flt file can be informational (i), notification (n), or error
messages (e).
In the following paragraphs we show some examples of the contents of the flt
file. One general observation is about the device id, shown in several
messages. It is either a switch node number or the chip id plus 100000. For
example, an id of 100015 indicates chip 5 in switch board 1. The description
in this section and the following ones assume an understanding of the
workings of the SP Switch. Refer to Chapter 3, “Communication Network
Hardware” on page 13, Chapter 4, “Communicating with the SP Switch” on
page 49, and Chapter 7, “Initialization of the SP Switch” on page 111 for more
information.
Attention
It is not uncommon to find packets with error capture registers
containing all zeroes. They are responses to the recovery code when it
resets errors on the switch chips.
4. Broadcast failure
During switch initialization, the primary fault service daemon broadcasts
several commands to all nodes. When the node finishes executing the
command successfully, it sends an acknowledgment to the primary node.
The primary then tracks the acknowledgments to determine if everyone
has responded.
One of the possible errors during a broadcast operation follows:
(i) 07/28/98 16:24:08 : 2510-606 A switch Error/Status service packet
was received during a broadcast operation.
SP Switch chip ports that have no connection are usually wrapped. This is
noted as follows:
The possible node error status present in the out.top file is shown in Table 7.
Table 7. SP Switch Device Status
The possible link error status present in the out.top file is shown in Table 8.
Table 8. SP Switch Link Status
You should check the files’ timestamp across the nodes to verify the current
one. The information in this file is self-explanatory; an example follows:
The messages tell you the initialization packets were received at the node.
The node is able to talk over the switch.
The worm.trace file on the primary node contains messages tracing the
switch initialization process. For instance, one of the first actions during an
initialization is to clean the receive FIFO, discarding any not yet received
error/status packets.
(i) print_the_time_worm: The date and time = Mon Aug 3 19:10:04 1998
TBSswitchInit: Switch network Initialization Started!
(i) TBSswitchInit: The Primary backup is node with switch node number 4
TBSworm_bfs_phase1: Switch Phase1 network Initialization Started!
syncFifoPh1: Cleaning up the Receive FIFO
At each visited chip, all known ports connected to a switch chip are checked,
and the expected device id is shown:
TBSworm_bfs_phase1: The current device port = 0
TBSworm_bfs_phase1: The attached device id = 13
When a chip or a node is initialized during this phase, the route from the
primary node to the device is shown, as well as the route from the device to
the primary node. The latter is sent to the device in the initialization packet:
route to device 100010 = 8c000000 00000001
route from device 100010 = 43000000 00000001
When a node is initialized, its personality is noted. The primary node has the
personality 1, the primary backup node is 2 and secondary nodes are 3.
personality = 3 db_cmd = 1 error_enable = 0000
During phase two of the switch initialization, the chips are reset and properly
initialized for normal operation:
Phase-2 Switch Initialization Packet for device 100014
----------------------------------------------------------
route = 00000000 00000000
Primary = b0000000 00000000
Secondary = b0000000 00000000
bypass_enable = 0 central_queue_enable = ff edc_frame_length = 1f
mode_bits = 2 receiver_enable = 1f sender_enable = 1f
receiver_error_enable = 1f sender_error_enable = 1f
All node commands sent by the primary fault service daemon are also logged
in this file. For instance, the following node initialization command is logged:
displayPacket Node Cmd = NODE_INIT:
fs_daemon_fsm_main: packet Node Command (NC) = 1
The command to load the node-to-node routes to the adapter is also logged:
displayPacket: Node Cmd = KLOAD_ROUTES:
When the fault service daemon is killed, the following messages are logged in
the file and the switch_responds is turned off in the SDR:
(i) handler_sigTerm: 2510-195 The fault service daemon got a SIGTERM
signal.
fs_daemon_exit: Turning off this nodes switchResponds bits in the SDR
The diagnostics process goes through the following phases, which are then traced:
1. Diagnostics setup consists of making sure that ODM is configured
properly, that the device css0 is configured and that diagnostics can get
exclusive use of the device. At that point diagnostics is run.
2. Clock selection can be done between the adapter's own internal clock and
the external clock. To complete diagnostics successfully, one of the external
clock sources must be available for test purposes. If these clocks are not
available, diagnostics will still be attempted on the internal clock. However,
even if the diagnostics pass on this internal clock, a failure code is
returned. This is due to the fact that, even though the adapter is okay,
without an external clock source the card cannot communicate with the
switch.
Clock selection is as follows:
• First test whether the internal clock is operational. If it is not,
diagnostics assumes the adapter is bad and no further testing is
attempted.
• Select the data cable clock. If it is available for testing, proceed with the
test.
You should start examining the dtbx_failed.trace file by looking for the
completion status at the bottom of the file. The SRN number at the bottom of
the file should help you locate where in the file to start looking.
You can use Table 9 to decode the failure code given the indicator n. This
table pertains to TB3 only. The table for MX or PCI is different from that of
TB3.
Table 9. SP Switch Adapter TB3 Diagnostics Failure Codes
0    POST test
B    Diagnostic setup (TED)
# SDRGetObjects switch_responds
node_number switch_responds autojoin isolated adapter_config_status
1 0 0 1 not_configured
5 1 1 0 css_ready
9 1 1 0 css_ready
13 0 1 1 css_ready
In the example, nodes 1 and 13 are not operational on the SP Switch; they are
fenced (isolated). Nodes 5 and 9 are connected to the SP Switch.
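You can also restrict the query, for example, to list only the fenced nodes:
# SDRGetObjects switch_responds isolated==1 node_number adapter_config_status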
You may also use Event Perspective to monitor resources and query resource
variables, including switchResponds. Event Perspective uses the Event
Management EMAPI interface. For instance, you may define an event "Switch
Responds" that monitors the condition SwitchResponds. By requesting to
view the Event Notification Log you can obtain a historic listing of the events
as shown in Figure 62 on page 171.
The fault service daemon is essential to keep the SP Switch running. If the
system administrator can be made aware of a daemon failure early, the
availability of the switch may be improved. We show next how to monitor the
fault service daemon (fault_service_Worm_RTG_SP) by running the pmandef
command, as follows:
pmandef -s Monitor_Worm_daemon \
-e ’IBM.PSSP.Prog.xpcount:NodeNum=*; \
ProgName=fault_service_Worm_RTG_SP;UserName=root:X@0==0’ \
-r ’X@0>0’ \
-c /usr/lpp/ssp/bin/notify_event \
-C "/usr/lpp/ssp/bin/notify_event -r" \
-n 0 -U root
Event Definition
----------------
Resource: IBM.PSSP.Prog.xpcount
Instance: UserName=root;ProgName=fault_service_Worm_RTG_SP;NodeNum=12
Condition: X@0==0
Resource Value
--------------
Type: sbs
Field0: CurPIDCount=0
Field1: PrevPIDCount=1
Field2: CurPIDList=
Table 10 lists additional Event Management resource variables for the Switch:
Table 10. Event Management Resource Variables for the SP Switch
Network status monitoring commands, such as netstat -I, cannot get such
information. These resource variables are helpful in determining the cause of
network problems. Most of them, however, are too detailed to be handled as
events.
You can use the AIX error log notification facility to generate SNMP traps
when specific log entries are cut. In addition, you may use the Problem
Management subsystem to generate traps when conditions of interest are
met. Refer to RS/6000 SP Monitoring: Keeping It Alive, SG24-4873 for a
detailed description of setting SNMP traps.
SP_SW_FIFOOVRFLW_RE soft
SP_SW_RECV_STATE_RE soft
SP_SW_INVALID_RTE_RE soft
SP_SW_NCLL_UNINT_RE soft
SP_SW_PE_INBFIFO_RE soft
SP_SW_PE_ON_DATA_RE soft
SP_SW_PE_ON_NCLL_RE soft
SP_SW_PE_RTE_TBL_RE soft
SP_SW_RECV_STATE_RE soft
SP_SW_SNDLOSTEOP_RE soft
SP_SW_PE_ON_NCLL_RE soft
TB3_CONFIG1_ER full
TB3_LINK_ER full
TB3_PIO_ER soft
TB3_SVC_QUE_FULL_ER full
TB3_THRESHOLD_RE full
POS REGS : 0 1 2 3 4 5 6 7 31 32 33 34 35 36 37
CONTENTS : 69 8F 9B FE FF 8F 00 FF BF 07 00 FD 0E 71 FE
TBIC Registers:
TIME_OF_DAY : 800009f8 768be320
XMIT_FREE_SPACE Data & Hdr : 00000100 00000040
RECV_FULL_SPACE Data & Hdr : 00000000 00000000
SVC_FULL_SPACE : 00000000
XMIT_FIFO_THRESH_DATA & Hdr : 00000000 00000000
RECV_FIFO_THRESH_DATA & Hdr : 00000000 00000000
INT_ERR error status flags : 00001000 00000000
STI_RETIMING
INT_MASK : ffffe8c7 ffffffe0
TBIC_CTRL : 0000ec7f ba1fff04
TBIC_STATUS : 78000000
TBIC_DIAG : 00000000
Check the TBIC_STATUS register. If bits 3 and 4 are both set, the switch
clock is driving the adapter (bits are numbered from left to right, starting at
0). The value 0x78 for the first byte indicates that the node is part of the
switch.
You can also execute /usr/lpp/ssp/css/diags/read_tbic -s to get the TBIC
status.
In this chapter we go over some steps that may help you with problem
determination of the SP Switch hardware and software. Refer to the PSSP:
Diagnosis Guide, GA22-7350, for further information.
The list of recovery examples included in this section is not complete. The
recovery actions suggested might not be successful with the problem you are
experiencing. Nevertheless, this chapter is meant to provide first-aid
information.
You should not only check the summlog file when you discover a switch
problem, but use that file as a monitoring tool. Many switch faults can leave
no trace other than an error message in the log files. For example, the failure
of a switch-to-switch cable can go undetected since its absence may not
necessarily take some nodes down, but an error log entry will be generated
by the primary node when failing to initialize that link during an
under-the-covers switch initialization.
It is also a good idea to run css.snap on the node that reported the error (and
possibly on the primary node if it is a different node) as soon as possible after
the error, so you do not risk losing the log files from the time of the error.
Estart may exit with an error right away if some of the initial checks fail. A
common problem occurs when the oncoming primary node is not reachable
or is fenced. If you need to start the switch right away, you may try using the
Eprimary command to select another oncoming primary node and then issuing
another Estart. You may also try to solve the problem in your oncoming
primary node, since it could be a problem that is affecting all nodes. Refer to
10.2.2, “Node Is off the Switch” on page 182 on how to diagnose a node
problem.
The fault service daemon may fail or time out during the switch initialization. If
this happens, you should check the end of the
/var/adm/SPlogs/css/fs_daemon_print.file file on the oncoming primary node
for the cause of the error. In PSSP: Diagnosis Guide, GA22-7350, there is a
table with possible causes and actions that could be taken. If the information
there does not help you in solving the problem, you may try to look at the
/var/adm/SPlogs/css/worm.trace file on the oncoming primary node to trace
the activities of the fault service daemon. In C.1, “SP Switch Worm Return
Codes” on page 253 you may find the list of error codes returned by the fault
service daemon during the switch initialization.
Estart may finish normally but with some nodes or links not initialized. You
should initially check the /var/adm/SPlogs/css/out.top file on the primary node
for a more detailed explanation of why the nodes or links were not initialized.
If the problem is in a single node, you should try to diagnose the node as
described in 10.2.2, “Node Is off the Switch” on page 182. If the problem is on
a single switch-to-switch connection, you may have a cable problem. If
several nodes or links were not initialized, you should try to find a pattern that
could indicate a common problem. For instance, if all links of two switch chips
do not initialize, you may have an intermittent cable problem between those
two chips. You may also have a cable miswire, in which case the
/var/adm/SPlogs/css/cable_miswire file is created with additional information.
Refer to PSSP: Diagnosis Guide, GA22-7350 for additional information on the
analysis of the out.top log file.
Whenever the fault service daemon is not running on a node, you should first
check the AIX error log. Almost all error situations will cause the fault service
daemon to make an error log entry. Refer to PSSP: Diagnosis Guide,
GA22-7350, for the table with an explanation and recovery actions for most of
the daemon’s error log entries.
An error log entry is not generated when there is a failure in the rc.switch
script that bars the fault service daemon from running. You should check the
/var/adm/SPlogs/css/rc.switch.log file to get more details. One possible
reason for a failure in rc.switch is an error during the configuration of the SP
Switch adapter as described in 7.1, “Configuration Method of the SP Switch
Adapter” on page 111. You may check the adapter_config_status attribute of
the switch_responds SDR object for the error status returned by the
configuration method. An explanation and some actions that could be taken
are found in PSSP: Diagnosis Guide, GA22-7350.
When you are having problems on a node, you should also check whether the
adapter is being driven by the switch’s synchronous clock or not. You may use
the read_tbic -s utility, which should return a value with bits 3 and 4 set.
Typical values for TB3 are 1C000000, 1E000000, and 78000000. If not, you
should initially try to restart the fault service daemon. If that does not work,
your next step would be to Eclock the frame or the system, as described in
8.2, “Establishing the SP Switch Clock” on page 134. If running Eclock is not
effective, you may be experiencing a cable problem.
The latter command also works even when the switch is down, especially
when you want to unfence the oncoming primary node.
You may also get a timeout, even with the fault service daemon running, if
there is an adapter problem that prevents the primary node from talking to the
fault service daemon on the node being unfenced. That may happen after a
cable was changed, for example. In this case you should restart the daemon
through rc.switch. If it does not work, you might try unconfiguring and
reconfiguring the adapter as described in 7.1, “Configuration Method of the
SP Switch Adapter” on page 111, checking for any errors during the
reconfiguration, and then restarting the daemon. Your last resort would be to
reboot the node.
# Estart
Estart: 0028-029 Fault service worm not up on oncoming primary backup node, cannot \
Estart : sp5n01.
Check the AIX error log on the failing node for possible fault service daemon
failures:
Description
Switch Fault Service Daemon Terminated
............
---------------------------------------------------------------------------
LABEL: SP_SW_SIGTERM_ER
IDENTIFIER: 5A38E3B0
Description
Switch daemon received SIGTERM
Probable Causes
Another process sent a SIGTERM
User Causes
Operator ran Eclock
Operator ran rc.switch on node and switch daemon was restarted
User program sent SIGTERM
Recommended Actions
Run rc.switch to restart switch daemon
Detail Data
Software ID String
LPP=PSSP,Fn=fs_daemon_init.c,SID=1.33,L#=819,
PID of process sending SIGTERM
23120
Name of process sending SIGTERM
---------------------------------------------------------------------------
The first (most recent in time) AIX error log entry, with label HPS_FAULT6_ER,
indicates that the daemon has terminated. The previous switch-related error
entry describes the possible cause of the fault service daemon's termination. It
indicates that the daemon got a SIGTERM signal that killed it. The Eclock and
rc.switch scripts kill the daemon with that signal, but they automatically restart
it. Since there are no later log entries that indicate a problem, the daemon was
most likely killed by some other process and not restarted.
The log entry suggests restarting the fault service daemon through rc.switch:
# /usr/lpp/ssp/css/rc.switch
"adapter/mca/tb3"
# ps -ef | grep Worm
root 14014 14550 4 10:54:15 pts/0 0:00 grep Worm
root 15914 1 0 10:53:35 - 0:00 \
/usr/lpp/ssp/css/fault_service_Worm_RTG_SP -r 8 -b 1 -s 4 -p 3 -a TB3 -t 22
You should now check the cables to nodes 7 and 8 (jacks 24 and 23,
respectively), which could be switched. This problem could also have
occurred because the nodes were placed in the wrong slots during
maintenance.
# Eunfence 8
dist_to_bootservers: 0028-178 Received errors in distribution of topology file from \
bootserver to at least one node.
See /var/adm/SPlogs/css/dist_topology.log on primary node for details.
Unable to unfence the following nodes:
sp3n08.msc.itso.ibm.com No topology
There is a Kerberos problem on the node. You should correct it using the
procedures described in PSSP: Diagnosis Guide, GA22-7350. For example,
you could start checking the Kerberos configuration files on the node:
# ls /etc/krb*
/etc/krb.conf /etc/krb.realms
# spbootins -r customize -s no -l 8
# create_krb_files
create_krb_files: tftpaccess.ctl file and client srvtab files created/updated
on server node 0.
# ls /tftpboot/*srvtab
/tftpboot/sp3n08-new-srvtab
# spbootins -r disk -s no -l 8
The file was created on the Control Workstation with the name
sp3n08-new-srvtab on the /tftpboot directory. You must copy it to the node as
/etc/krb-srvtab. You should use FTP, since remote commands are not
functioning.
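A sketch of the transfer, assuming root is allowed to log in to the node over FTP:
# ftp sp3n08
ftp> binary
ftp> put /tftpboot/sp3n08-new-srvtab /etc/krb-srvtab
ftp> quit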
Through the SDR we can see that node 8 was not initialized:
# tail /var/adm/SPlogs/css/summlog
110311211998 sp3n15 N sp3en0 49 SP_SW_RSGN_BKUP_RE
110311211998 sp3n08 N sp3en0 51 HPS_FAULT6_ER
110311221998 sp3n01 N sp3en0 51 SP_SW_UNINI_NODE_RE
110311221998 sp3n01 N sp3en0 52 SP_SW_UNINI_LINK_RE
The only relevant error log entry, with label HPS_FAULT6_ER, indicates only that
the fault service daemon on the node terminated with an error. There are also
no helpful log entries on the primary node. We now check the flt file on the
primary node:
As marked on the screen, the switch initialization was retried, quite certainly
because of the Error/Status packets received during a broadcast. We also
notice that device 7 (node 8) was removed from the switch during the retry. To
get more detail on the broadcast error we check the fs_daemon_print.file:
# tail /var/adm/SPlogs/css/daemon.stdout
ERROR: deviceConnect: id 100015 port 7 connected more than once
This is most likely an indication of a corrupted topology file on the node. One
possible solution is to force the redistribution of the topology file on the next
Estart.
Now run rc.switch to restart the fault service daemon on the failing node.
Then run Estart on the Control Workstation to distribute the correct topology
file to all nodes and start the switch.
In this chapter we describe the challenge of tuning the SP system and its
applications for maximum possible performance. Applications running on the
SP system with radically different network traffic characteristics can cause
tuning problems. Where an application inherits its network tunables is key to
optimal performance.
You need to take into account the total system picture when you are tuning
large configurations. For example, if you are tuning a partitioned SP system,
subsystems that span several system partitions (such as the SP system
Ethernet) can cause changes within more than the partition being tuned.
When setting tunables, take into account the total system, not necessarily
only those nodes being used for the application or subsystem being tuned.
The approach to tuning the SP system is, in some situations, the opposite of
how you would tune an AIX workstation. In tuning an AIX workstation or
server, the most common approach is to tune the machine to handle the
amount of traffic or services requested of it. In the case of a file server, the
server is tuned to the maximum amount of traffic it can receive. In general,
the bottleneck in a high-end server is the capacity of the network through
which services are requested.
In the SP system, the SP Switch is faster than any other network available.
With the non-blocking nature of the switch, the number of requests and
volume of data that may be requested of a node can far exceed the node’s
capacity. To properly handle this situation, the volume of services requested
of a server or destination node must be managed. Instead of trying to tune a
node for the maximum amount of services requested, each of the nodes
requesting services needs to manage the volume of requests made of
another node. In other words, you should consider reducing the volume of
requests generated by the requesting nodes rather than trying to make the
destination node absorb them all.
The SP usually requires that tunable settings be changed from the default
values in order to achieve optimal performance of the entire system.
Network options can be changed with the no command. However, where to
set these tunable values is very important. If they are not set in the correct
places, subsequent rebooting of the nodes, or other changes, can cause
tunable settings to change or be lost.
All dynamically tunable values (those that take effect immediately) should be
set in the /tftpboot/tuning.cust script on each node. There is also
a copy of the file in this same directory on the Control Workstation. Tunables
changed using the no, nfso, or vmtune command can be included in this file.
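As a sketch only (the values shown are illustrative, not recommendations for
your workload), a fragment of /tftpboot/tuning.cust might contain lines such as:
/usr/sbin/no -o thewall=65536
/usr/sbin/no -o sb_max=1310720
/usr/sbin/no -o tcp_sendspace=262144
/usr/sbin/no -o tcp_recvspace=262144
/usr/sbin/no -o rfc1323=1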
There are several reasons why the /tftpboot/tuning.cust file should be used
rather than /etc/rc.net for dynamically tunable settings. If you cause errors in
/etc/rc.net, you can render the node unusable, requiring a reinstall of the
node. If the errors in /etc/rc.net are only partial, getting to the node through
the console connection from the Control Workstation can take several
minutes or even more than one hour. This is because parts of the initialization
of the node try to access remote nodes or the Control Workstation. Because
/etc/rc.net is defective, each attempt to get remote data takes 9 minutes to
timeout and fail.
If you cause errors in tuning.cust, at least the console connection will work,
enabling you to log in through the Control Workstation and fix the bad tunable
settings.
If the system has nodes that require different tuning settings, we recommend
that a copy of each settings file be saved on the Control Workstation. Then,
when nodes with a specific tuning setting are installed, that version of
tuning.cust is moved to the /tftpboot directory on the node from the Control
Workstation.
As discussed in 4.6.2, “Send Data Flow” on page 61, the send pool and
receive pool are separate buffer pools, one for outgoing data (send pool) and
one for incoming data (receive pool). The amount of data that will fit in an
mbuf is up to 228 bytes, depending on what type of TCP options are set.
When an IP packet is passed to the switch interface, if the data can fit in an
mbuf (that is, the data is at most 228 bytes long), an mbuf is used from the
mbuf pool, and no send buffer pool space is allocated for the packet. If,
however, the data is too large to fit in an mbuf, buffer pool space will be
allocated.
Use the lsattr command to view the current settings for the SP Switch
pools:
lsattr -El css0
The switch buffer pools are allocated in the kernel at initialization of the switch
adapter. They are not dynamic for the SP Switch. This may change for future
technology. Make the changes for the switch buffer pool size to the ODM
database only. Then reboot the node for the changes to become effective.
Important
When allocating the receive pool and send pool, realize that this space is
pinned kernel space in physical memory. This takes away space from user
applications and is particularly important in small memory nodes.
The sizes of the switch buffer pools allocated by the device driver start at
4096 bytes and increase to 65536 bytes. All allocation sizes in between are a
power of 2 value. This means that the buffer sizes allocated from the pools
are:
Bytes      Size in K
4096        4
8192        8
16384      16
32768      32
65536      64
If the size of the data being sent is just slightly larger than one of these sizes,
the buffer allocated from the pool is the next size up. This can reduce the
efficiency of the buffer pools to about 50%: more than half of the pool can go
unused in bad circumstances. When assembling a TCP/IP packet, one mbuf
from the mbuf pool is always used for the headers.
When sending 4 KB of data over the switch, an mbuf from the mbuf pool is
used as well as one 4 KB send pool buffer for the data. If the amount of data
being sent is less than 228 or so bytes, no buffer from the send pool is
allocated because there is space in the mbuf used for assembling the
headers to stage the data. However, if sending 256 bytes of data, you will end
up using one mbuf for the IP headers, and one 4 KB send pool buffer for the
data. This is the worst case, in which you are wasting 15/16 of the buffer
space in the send pool. These same scenarios apply to the receive pool when
a packet is received on a node.
The key for peak efficiency of the send pool and receive pool buffers is to
send messages that are at or just below the buffer sizes allocated from the
pools, or less than 228 bytes.
The appropriate values for the tunables are unique for each installation.
Therefore, a sizing exercise like the one described here is necessary. When
trying to determine the appropriate receive pool and send pool sizes, you
need to get a profile of the message sizes that are being sent by all
applications on a node. This will help determine how the receive pool and
send pool will be allocated, in total number of buffers. At a given pool size,
you will get 16 times as many buffers allocated out of the pool for 4 KB
messages as for 64 KB messages. However, the total pool size in both
cases is identical.
Once you have a profile of the packet or buffer sizes used by all applications
using the switch on a node, you can then determine roughly how many of
each size send pool or receive pool buffers will be needed. This then
determines your initial receive pool and send pool settings. Consider a node
that stages a mix of packet sizes, of which 25% are smaller than 228 bytes,
50% are 5 KB, and 25% are 32 KB. If the number of packets projected to be
staged at any one time is 1024, then the initial send pool setting should be
12582912, or 12 megabytes. None of the small packets need any send pool space, the
5 KB packets each use 8 KB out of the send pool and need about 4 MB of
space, and the 32 KB packets need about 8 MB of send pool space. The total
estimated pool size needed is 12 MB.
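The arithmetic can be checked quickly from the shell; remember that the 5 KB
packets round up to 8 KB buffers:
# echo $(( (512 * 8192) + (256 * 32768) ))
12582912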
The allocations from the pools are transient. The actual number of buffers
used at any time is very dynamic due to the large volume of packets the
switch can handle. The above calculation is a conservative estimate in that it
assumes all packets will be staged at once. In reality, as packets are being
staged into the pool, other packets are being drained out, so the effective
number of active buffers should be less.
The sizes allocated from the pool are not fixed. At any point in time, the
device driver will divide the pool into the sizes needed for the switch traffic. If
there is free space in the send pool but no free buffer of the size currently
needed, the device driver carves the smaller buffer needed for the current
packet out of a larger one. As soon as the buffer is released,
it is joined back with the rest of the 64 KB buffer it came from. The buffer pool
manager tries to return to 64 KB block allocations as often as possible to
maintain high bandwidth at all times. If the pool were fragmented, and a large
buffer needed 64 KB, then there may not be 64 KB of contiguous space
available in the pool. Such circumstances would degrade performance for the
large packets.
If all buffer space is used, then the current packet is dropped, and TCP/IP or
the application will need to resend it, expecting that some of the buffer space
was freed up in the meantime. This is the same way that the transmit queues
are managed for Ethernet, Token Ring and FDDI adapters. If these adapters
are sent more packets than their queues can handle, the adapter drops the
packet.
Currently, the upper limit for the send pool and receive pool is 16 MB for each.
This means you can get a maximum of 4096 4 KB or 256 64 KB buffers each
for sending and receiving data.
To verify that the switch buffer pools are exceeding the threshold, run the
vdidl3 -i command to check the send pool:
# vdidl3 -i|pg
get ifbp info...
For the receive pool, check the AIX error log for "mbuf pool threshold
exceeded" entries by running errpt -a|grep ENOBUF. If you experience
slow network traffic or ENOBUF errors in the error log, it indicates the system
is running out of receive buffer pool space.
Use netstat -m to check for "request for memory denied" errors for small
(less than 228 bytes) packets. Increase the size of thewall in the no options:
/usr/sbin/no -o thewall=131072
For large packets (greater than 228 bytes) use the vdidl3 -i command to
check if the failed count is non-zero. Increase the switch buffer pools using
the chgcss command and reboot the system:
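The exact values depend on your workload. As a sketch, assuming the ODM
attribute names rpoolsize and spoolsize used by the css0 device and a target
of 4 MB per pool, the sequence might be:
# /usr/lpp/ssp/css/chgcss -l css0 -a rpoolsize=4194304 -a spoolsize=4194304
# shutdown -Fr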
If there is a small number of active sockets, then there is usually enough
receive and send pool space available. In systems where a node
has a large number of sockets opened across the switch, it is very easy to run
out of send pool space when all sockets transmit at once. For example, 300
sockets each sending 33 KB of data will far exceed the 16 MB limit for the
send pool. Or, 1100 sockets each sending four 1 KB packets will also exceed
the maximum limit.
The other situation that aggravates exhaustion of the pools is the use of SMP
nodes. Only one CPU is used to manage the send or receive data streams
over the switch. However, each of the other CPUs in the SMP node is
capable of generating switch traffic. As the number of CPUs increases, so
does the aggregate volume of TCP/IP traffic that can be generated. We
suggest that for SMP nodes, the send pool size be scaled by the number of
processors relative to a single-CPU node setting.
The key to tuning the buffer pools on the SP is to understand the number of
packets using the switch pools at any one time and the size of the data being
sent. The receive and send pool sizes need to be set so there are enough
buffers allocated to handle the size and number of packets being sent or
received. Applications that do not efficiently use full buffers from the pool
aggravate the pool size needed or can cause exhaustion of the pools.
The first level network protocol is the Address Resolution Protocol (ARP),
which dynamically translates Internet addresses into the unique hardware
MAC addresses on local area networks. These addresses are kept in a series
of entries in buckets. If a MAC address is not in the ARP cache when
connecting to a remote adapter, an ARP broadcast is sent to get the MAC
address from the remote host and store the information in the local ARP
cache. If the cache is too small, then the ARP cache thrashes, loading up
MAC addresses, causing network performance to suffer.
SP systems with more than 128 nodes can suffer from ARP thrashing.
Though the SP Switch adapter does not have a real MAC address, the system
uses the switch node number as the MAC address and stores it in the ARP cache:
sp2a03.gsi.de (140.181.80.3) at 0:2:0:0:0:0
sp2a04.gsi.de (140.181.80.4) at 0:3:0:0:0:0
sp2a07.gsi.de (140.181.80.7) at 0:6:0:0:0:0
sp2b05.gsi.de (140.181.80.21) at 0:14:0:0:0:0
sp2b07.gsi.de (140.181.80.23) at 0:16:0:0:0:0
The size of the ARP cache is controlled by two network options: arptab_nb,
the number of buckets, and arptab_bsiz, the number of entries in each bucket.
The total number of ARP cache entries is arptab_nb * arptab_bsiz.
For fast lookups, a large number of small buckets is ideal. For memory
efficiency, a small number of medium buckets is best. Having too many
buckets wastes memory. The following is the recommended way to calculate
the values for the ARP cache sizing. For systems with more than 64 nodes,
round the number of nodes down to the next power of 2, and use that for
arptab_nb.
Table 12 shows the values for systems with 1 to 512 nodes.
Table 12. arptab_nb Tuning
Nodes      arptab_nb
1-64       25 (default)
65-128     64
129-256    128
257-512    256
To set the number of entries for each ARP bucket, set the arptab_bsize to
double the number of adapter interfaces on a node.
Table 13. arptab_bsize Tuning
Interfaces   arptab_bsiz
1-3          7 (default)
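As a sketch, for a 256-node system with four network interfaces per node
(following Table 12 and the doubling rule), the settings would be as follows;
these options only take effect for interfaces configured after they are set, so
place them early in the boot sequence:
/usr/sbin/no -o arptab_nb=128
/usr/sbin/no -o arptab_bsiz=8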
To see how many entries are currently in the ARP cache, count the lines of the
arp -a output:
# arp -a|wc -l
23
You can see if any of your buckets are full by pinging an IP address on a local
subnet that is not in the ARP cache and is not being used. See how long the
ARP entry stays in the cache. If it lasts for a few minutes, that particular
bucket is not a problem. If it disappears quickly, that bucket is doing some
thrashing. Carefully choosing the IP addresses to ping will let you monitor
different buckets. Make sure the ping actually made the round trip before
timing the ARP cache entry.
A 200 msec delay in sending data over a switch can be caused by either the
Nagle Algorithm being invoked or by the application writing buffers so that
only one packet is sent over the interface and the delayed ACK timer causing
a 200 msec delay on the acknowledgment.
The effect of the Nagle Algorithm or delayed ACK timer is easy to see if only
one socket connection is running. If you check the packet rate over the switch,
you should see an average of 5 packets per second. Typically, a throughput
rate of 150 to 300 KB per second transfer rate is reported by an application.
To check the packet rate over the switch and total IP packet rates per second,
use the following command:
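One way to do this is netstat with an interface name and an interval argument,
for example:
# netstat -I css0 1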
This listing shows how many packets per second go through the switch
interface. The following suggestions will help to avoid the Nagle Algorithm:
• If you are running an application and do not have access to the source
code, use the no command to increase the TCP window. This may not
always be effective because increasing the tcp_sendspace and
tcp_recvspace size on the sending and receiving nodes may cause other
negative effects on the SP system or to other applications running on the
system. Make sure that you set rfc1323 to 1 if the window size exceeds
65536.
• Changing the MTU of the switch will move the window size and buffer size
combination. When writing 32 KB buffers to a TCP connection, if the
TCP/IP window is 65536, only 5 packets per second are transmitted. If you
change the MTU of the switch interface to 32768, there is no delay in
transmitting a 32768 buffer because it is the same size as the segment
size of the switch. However, reducing the MTU of the switch to 32768
degrades the peak TCP/IP throughput slightly. Reducing the MTU even
further degrades the peak throughput even more.
• From within an application, you can change the TCP window size on the
socket by setting a different size using the SO_SNDBUF and
SO_RCVBUF settings on the socket. For good performance across a
switch, we suggest that both SO_SNDBUF and SO_RCVBUF be set to at
least 327680 on both the client and server nodes. You need to set both
sides since TCP uses the lowest common size to determine the actual
TCP window.
If you set the SO_SNDBUF and SO_RCVBUF sizes above 65536, you
also need to set TCP_RFC1323 on the socket unless the no options
already have it set. Setting TCP_RFC1323 to 1 will take advantage of
window sizes greater than 65536. You also want to ensure that the system
setting for sb_max is at least 655360 or sb_max will reduce the effective
values for SO_SNDBUF and SO_RCVBUF. You cannot change the
sb_max setting from within an application. You must use the no command.
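As a minimal sketch of the first suggestion in the list above, using only values already mentioned in this section (327680 for the window, 655360 for sb_max), the window can be raised system-wide with no; remember these are node-wide settings and can affect other adapters:

# Raise the TCP window on both the sending and the receiving node,
# then restart the application so new sockets pick up the values
/usr/sbin/no -o sb_max=655360        # must be at least the window size
/usr/sbin/no -o tcp_sendspace=327680
/usr/sbin/no -o tcp_recvspace=327680
/usr/sbin/no -o rfc1323=1            # required for windows larger than 65536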
Important
Changes to the network options do not affect existing socket connections.
Child processes inherit default socket settings from parent processes.
Set MP_EAGER_LIMIT, the message size up to which MPI sends messages
eagerly; larger messages use the rendezvous protocol. Size it so that at least
32 messages can be outstanding between any two tasks in User Space.
Table 14 shows suggested MP_EAGER_LIMIT values relative to the number
of tasks:
Table 14. MP_EAGER_LIMIT
Tasks MP_EAGER_LIMIT
1-16 4096
17-32 2048
33-64 1024
65-128 512
For User Space, the default communication subsystem buffer memory (set
with MP_BUFFER_MEM) is 64 MB.
Important
Using MPI over IP gives you a much smaller default memory buffer size. It
can be increased by setting a larger value through your shell variables.
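A sketch of how those variables might be exported for a 64-task User Space job; the program name is hypothetical, the eager limit is the Table 14 entry, and the buffer value is the 64 MB User Space default noted above:

# Eager limit for a 64-task job (from Table 14)
export MP_EAGER_LIMIT=1024
# Early-arrival buffer memory, in bytes (64 MB shown only as an illustration)
export MP_BUFFER_MEM=67108864
poe ./my_parallel_app -procs 64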
One difficulty in tuning networks is that different mixes of traffic, and the way
applications utilize the network, can work against each other. A set of
tunables that benefits a single stream point-to-point large data transfer can
cause severe problems for a parallel application. In many cases you cannot
optimize for two types of traffic using the no tunables. In addition, if your
configuration has several types of network media (Token ring, Ethernet, FDDI
and ATM), it is not always possible to optimize for all network interfaces.
The sections that follow address settings for a single application environment.
Other options to optimize tuning can be handled by setting groups of nodes
into different settings, or using the applications themselves to set network
tuning for their own socket connection.
IBM supplies three alternate tuning files which contain initial performance
tuning parameters for three different SP environments:
/usr/lpp/ssp/install/config/tuning.commercial contains initial performance
tuning parameters for a typical commercial environment.
/usr/lpp/ssp/install/config/tuning.development contains initial performance
tuning parameters for a typical interactive and/or development
environment.
/usr/lpp/ssp/install/config/tuning.scientific contains initial performance tuning
parameters for a typical engineering and/or scientific environment.
Use SMIT or issue the cptune command. When you select one of these files,
it is copied to /tftpboot/tuning.cust on the Control Workstation and is
propagated from there to each node in the system when it is installed,
migrated, or customized. Each node inherits its tuning file from its boot/install
server. Nodes that have as their boot/install server another node (other than
the Control Workstation) obtain their tuning.cust file from that server node,
so it is necessary to propagate the file to the server node before attempting to
propagate it to the client node. The settings in the /tftpboot/tuning.cust file
are maintained across a boot of the node.
Using SMIT:
Select SP System Management
Select SP Cluster Management
Select The desired tuning file
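From the command line, the equivalent selection can be sketched as a straight copy of the chosen profile into place (the paths are the ones listed above):

# On the Control Workstation: make the scientific profile the file that each
# node picks up as tuning.cust when it is installed, migrated, or customized
cp /usr/lpp/ssp/install/config/tuning.scientific /tftpboot/tuning.cust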
Table 16. Initial Values for SP Switch Performance at the Expense of Ethernet (excerpt)
rfc1323            1 (all columns)
subnetsarelocal    1 (all columns)
If you have additional network adapters in your nodes and you want to
optimize the network performance for these adapters, use the initial
suggested settings shown in Table 17:
Table 17. Initial Values for Other Adapter Types (excerpt)
rfc1323            1 (all adapter types)
subnetsarelocal    1 (all adapter types)
Once you have updated tuning.cust, continue installing the nodes. After the
nodes are installed or customized, on all subsequent boots, the tunable
values in tuning.cust will be automatically set on the nodes. Note that each of
the supplied network tuning parameter files, including the default tuning
parameter file, contains the command /usr/sbin/no -o ipforwarding=1.
IBM suggests that on non-gateway nodes, you change this command to read
/usr/sbin/no -o ipforwarding=0. After a non-gateway node has been
installed, migrated, or customized, you can make this change in the
/tftpboot/tuning.cust file on that node.
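A sketch of that change on a non-gateway node, using the file path given above; the no call applies the new value immediately, without waiting for the next boot:

# Turn off IP forwarding in the node's tuning script so the setting survives reboots
sed 's/ipforwarding=1/ipforwarding=0/' /tftpboot/tuning.cust > /tmp/tuning.cust
mv /tmp/tuning.cust /tftpboot/tuning.cust
chmod +x /tftpboot/tuning.cust
# Apply it now as well
/usr/sbin/no -o ipforwarding=0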
For the latest performance and tuning information, refer to the RS/6000 Web
site at
http://www.rs6000.ibm.com/support/sp/perf
In systems where there is one server and the rest of the nodes run an
application that needs larger tcp_sendspace and tcp_recvspace, it is
acceptable to use different settings on the appropriate nodes. In this
situation, the nodes talking to each other use large TCP windows for peak
performance, and when talking to the server they use small windows; TCP
negotiates each connection down to the smaller of the two settings.
The following list provides network tunable settings designed as initial values
for server environments. You need to change these values on each of the
installed nodes. To temporarily change these values, use the no command. If
you want them to be preserved across booting the nodes, we suggest that
you change the tuning.cust script on each node. Details on how to change
the script are found in 11.2, “Tuning Considerations” on page 196.
thewall =16384
sb_max =131072
subnetsarelocal =1
ipforwarding =1
tcp_sendspace =65536
tcp_recvspace =65536
udp_sendspace =65536
udp_recvspace =655360
rfc1323 =1
tcp_mssdflt =1448
tcp_pmtu_discover (new in AIX 4.2.1) =1
udp_pmtu_discover (new in AIX 4.2.1) =1
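If you want to apply these values by hand before they are picked up from tuning.cust at the next boot, they map one-for-one onto no calls; a sketch using exactly the values listed above:

/usr/sbin/no -o thewall=16384
/usr/sbin/no -o sb_max=131072
/usr/sbin/no -o subnetsarelocal=1
/usr/sbin/no -o ipforwarding=1
/usr/sbin/no -o tcp_sendspace=65536
/usr/sbin/no -o tcp_recvspace=65536
/usr/sbin/no -o udp_sendspace=65536
/usr/sbin/no -o udp_recvspace=655360
/usr/sbin/no -o rfc1323=1
/usr/sbin/no -o tcp_mssdflt=1448
/usr/sbin/no -o tcp_pmtu_discover=1   # AIX 4.2.1 and later
/usr/sbin/no -o udp_pmtu_discover=1   # AIX 4.2.1 and later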
These settings are only initial suggestions. Start with them and realize that
you may need to change them. These initial values are derived by expecting
network traffic from lots of connections. To prevent exhaustion of the TCP/IP
buffer pools or the switch buffer pools, tcp_sendspace, tcp_recvspace and
sb_max are reduced. If the total number of active requests from the server is
small, these values may be increased to allow more buffer area per
connection. This increase will help improve performance for small numbers of
connections as long as the aggregate TCP window space does not exceed
other buffer areas.
List the current NFS settings on a node with the nfso -a command. The
parameters vary depending on the NFS version you are running on the AIX
system. The example was taken from NFS V3:
# nfso -a
portcheck= 0
udpchecksum= 1
nfs_socketsize= 1300000
nfs_tcp_socketsize= 60000
nfs_setattr_error= 0
nfs_gather_threshold= 4096
nfs_repeat_messages= 0
nfs_udp_duplicate_cache_size= 1000
nfs_tcp_duplicate_cache_size= 1000
nfs_server_base_priority= 0
nfs_dynamic_retrans= 1
nfs_iopace_pages= 32
nfs_max_connections= 0
nfs_max_threads= 96
nfs_use_reserved_ports= 0
nfs_device_specific_bufs= 1
nfs_server_clread= 1
nfs_max_write_size= 0
nfs_max_read_size= 0
The first problem related to NFS that we generally see in large SP system
configurations occurs when a single node is acting as the NFS server for a
large number of nodes. In this scenario, the aggregate number of NFS
requests can overwhelm the NFS socket or the nfsd daemons on the server,
or it may not be possible to configure enough daemons.
For the server configuration, nfsd daemons are the primary concern. If you
have a 64-node configuration and configured one NFS server for the other 63
nodes, and if all 63 client nodes made an NFS request at the same time, you
will need at least 63 nfsd daemons on the server. If you had only 8 nfsd
daemons, as in the default configuration, then 55 NFS requests would have to
wait until an nfsd daemon became free.
However, there is a limit on how many nfsd daemons you can configure
before the amount of processing on behalf of the NFS traffic overwhelms the
server. Generally, when you configure more than 100 nfsd daemons you start
to see NFS performance degradation, depending on the characteristics of the
NFS traffic. The size of the requests, the mix of NFS operations, and the
number of processes on each node that generates NFS requests influence
the amount of NFS performance degradation. Examine nfsstat command
output to check indications that you are generating too much NFS traffic to a
server.
Server nfs:
calls badcalls public_v2 public_v3
1944188 0 0 0
Version 2: (1939153 calls)
null getattr setattr root lookup readlink read
271345 13% 164811 8% 3021 0% 0 0% 723431 37% 269332 13% 399288 20%
wrcache write create remove rename link symlink
0 0% 30071 1% 1924 0% 1725 0% 871 0% 23 0% 4 0%
mkdir rmdir readdir statfs
10 0% 3 0% 63445 3% 9849 0%
Version 3: (5035 calls)
null getattr setattr lookup access readlink read
57 1% 424 8% 0 0% 1175 23% 74 1% 8 0% 19 0%
write create mkdir symlink mknod remove rmdir
3180 63% 0 0% 0 0% 0 0% 0 0% 0 0% 0 0%
rename link readdir readdir+ fsstat fsinfo pathconf
0 0% 0 0% 12 0% 17 0% 27 0% 34 0% 0 0%
commit
8 0%
If you see RPC timeouts or RPC retransmits when checking on the client
nodes, then you probably are overwhelming the current number of nfsd
daemons or nfs_socketsize size on the server.
Client rpc:
calls badcalls retrans badxid timeout wait newcred
3595281 695 0 665 690 0 0
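If the client statistics look like this, one hedged first step is to run more nfsd daemons on the server (64 below matches the 64-node example above) and to check the NFS socket buffer against the request load:

# On the NFS server node: increase the number of nfsd daemons
chnfs -n 64
# Enlarge the NFS UDP socket buffer if requests are still being dropped;
# nfs_socketsize must stay below the sb_max value set with no
nfso -o nfs_socketsize=1300000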
The limitation to 6 biod’s on a client node per NFS mounted file system is
imposed by NFS. In NFS Version 2.0, all NFS writes use a fully synchronous
4 KB write model. When you write to an NFS-mounted file system, you can
send only 4 KB at a time from the client to the server. If you have only one I/O
outstanding at a time, this is very slow. In a fully synchronous model the data
must be written on the disk at the server before the client can send the next
4 KB block. To speed this up, NFS allows staging 6 requests at the client per
file system, but no more. Given the time it takes to write a block on disk in
AIX, one block at a time is very slow.
If you are using AIX 4.2.1 or higher, on the server and client, you have the
ability of running NFS Version 3.0. This version provides improved
performance by using larger I/O request sizes. In NFS V2.0, the read request
size was 8 KB and the write size 4 KB. In NFS V3.0, you can increase the
read and write request size to 32 KB each. This larger I/O request size
improves performance due to the larger amount of data per request. In
addition, NFS V3.0 has the ability to use asynchronous write. This improves
write performance to close to read performance. However, there is a very
slight risk that data can be lost if the server crashes.
If there is more than one NFS-mounted file system on your client nodes, you
can configure up to 6 biod daemons per file system. If the number of biod’s
gets too high, performance will suffer in the same way as when too many nfsd
daemons are configured on the server. Again, a general rule is that more than
100 biod’s will start to degrade performance.
In general, the same rules discussed in the previous section apply for nfsds
on the client node when the client is also used as NFS server. However, we
do not suggest that you use the same number of nfsd daemons on the client
nodes acting as servers when a large number of nodes all use the same
server (16 clients to 1 server, for example). If all nodes have 16 biod
daemons, each nfsd daemon on the server nodes can potentially be sent 16
NFS requests. In the situation where you use nodes as server and client,
monitor nfsstat on both the client and server nodes to look for RPC
timeouts and retransmits.
If you already have a lot of nfsd daemons configured on the server, the best
solution is to split the NFS server duties across several nodes, and keep the
potential number of concurrent NFS requests below 100. As a general
guideline, configuring up to 100 nfsd’s on a server usually works well on the
SP system.
In NFS V2.0 you only have the nfs_socketsize. In NFS V3.0, since it now
includes the ability to use TCP as the transport mechanism between client
and server, you must make sure that both these parameters are set to the
correct size. To determine the appropriate size on the server, you need to
calculate the maximum number of clients that will be writing to the server at
the same time, times the number of file systems they will be writing to, times
6, which is the maximum number of biods per remote file system. By taking
this number, and multiplying it by 4K for NFS V2.0 or 32K for NFS V3.0, you
can estimate the queue space needed.
If the sizes needed for nfs_socketsize and nfs_tcp_socketsize are very large,
you might want to reduce the number of biods per mounted file system on the
client nodes. If you do, you can reduce the size calculated above by using the
smaller number of biods per file system rather than the value of 6.
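As a worked example under assumed numbers (16 client nodes writing, one mounted file system each, the default 6 biods per file system, and NFS V3 32 KB requests): 16 x 1 x 6 x 32 KB = 3072 KB, so the socket sizes would need to be roughly 3 MB:

# sb_max must first be raised above this value with no before these take effect
nfso -o nfs_socketsize=3145728
nfso -o nfs_tcp_socketsize=3145728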
For more detailed information regarding tuning NFS, refer to the IBM AIX
Versions 3.2 and 4 Performance Tuning Guide, SC23-2365, and the AIX
System Management Guide: Communications and Networks, GC23-2487.
To preserve these values across booting of the nodes, we suggest that you
change the tuning.cust script on each node. These settings are only initial
suggestions. Start with them and realize you may need to change them.
thewall =16384
sb_max =1310720
subnetsarelocal =1
ipforwarding =1
tcp_sendspace =65536
tcp_recvspace =65536
udp_sendspace =32768
udp_recvspace =65536
rfc1323 =1
tcp_mssdflt =1448
tcp_pmtu_discover (new in AIX 4.2.1) =1
udp_pmtu_discover (new in AIX 4.2.1) =1
These initial values are derived by expecting the network traffic to be small
packets with lots of socket connections. tcp_sendspace and tcp_recvspace
are kept small so that a single socket connection cannot use up lots of
network buffer space, causing buffer space starvation. It is also set so that
high performance for an individual socket connection is not expected.
However, in aggregate, if lots of sockets are active at any one time, the overall
resources will enable high aggregate throughput over the switch.
To achieve peak data transfer across a switch using TCP/IP, you need to
increase the no tunables that affect buffer sizes, queue sizes, and the TCP/IP
window. The tunables to adjust are thewall, sb_max, tcp_sendspace,
tcp_recvspace, and rfc1323; initial values for all of them appear in the list
later in this section.
To get the best TCP/IP transfer rate, you need to size the TCP/IP window
large enough to keep data streaming across the SP Switch without stopping
the IP stream. The switch has a Maximum Transmission Unit (MTU) of 65520
bytes. This is the largest buffer of data that it can send. When using TCP/IP,
TCP will send as many buffers as it can until the total data sent without
acknowledgment from the receiver reaches the tcp_sendspace value. We
found that a window of at least four MTU-sized buffers allows TCP/IP
to reach high transfer rates. Faster nodes can require a greater number of
buffers. However, if you set tcp_sendspace and tcp_recvspace to 655360, it
can hurt the performance of the other network adapters connected to the
node. This can cause adapter queue overflows. The settings below are only
initial suggestions. Start with them and realize you may need to change them.
thewall =16384
sb_max =1310720
subnetsarelocal =1
ipforwarding =1
tcp_sendspace =655360
tcp_recvspace =655360
udp_sendspace =65536
udp_recvspace =655360
rfc1323 =1
tcp_mssdflt =(Varies depending on other network types.)
tcp_pmtu_discover (new in AIX 4.2.1) =1
udp_pmtu_discover (new in AIX 4.2.1) =1
These initial values are derived by expecting the network traffic to be small
packets with lots of socket connections. However, when running a parallel
database product, you want to be able to get as much SP switch throughput
to a single connection as you can without causing problems on other network
adapters. The settings are also designed to enable a single socket to be able
to send to an Ethernet adapter without causing adapter queue overruns. In
addition, tcp_sendspace and tcp_recvspace are large enough to get most of
the switch bandwidth at database size packets.
If other applications are run on the same node that have vastly different
network characteristics, such as ADSM, or are a data mining type application
that tends to use few sockets, these settings will not provide peak
performance. In these cases, the TCP window settings may have to be
increased. Conflicts with the settings needed by ADSM can be resolved by
having ADSM do its own socket-level tuning. See 11.6.1, “Tuning for the
ADSTAR Distributed Storage Manager (ADSM)” on page 221 for more
information.
The following are the no tunable values that will achieve the best ADSM
performance on the SP ADSM server node:
thewall =16384
sb_max =1310720
rfc1323 =1
tcp_mssdflt =32768
If you already have a larger value for thewall, you need to keep it.
Buddy Buffers:
The VSD server node uses buddy buffers to temporarily store data for I/O
operations originating at a non-server node. Buddy buffers are used only
when a shortage in the switch buffer pool occurs, or on certain networks. In
contrast to the data in the cache buffer, the data in a buddy buffer is purged
immediately after the I/O operation completes.
If you do not plan to create any GPFS file system with a block size larger than
16KB, two or three buddy buffers of 16 KB each on VSD server nodes should
suffice. Create one buddy buffer of 16 KB on each non-server node.
If you do plan to create a GPFS file system with a block size larger than 16
KB, one buddy buffer equal to the largest block size used will do for
non-server nodes, but the number of buddy buffers on VSD server nodes
should be sufficient to handle the maximum number of disk I/Os expected at
any given time. Start with two buffers per physical disk attached to a server. If
you monitor the output of the statvsd command for queued buddy buffer
requests, you can determine whether this setting is correct.
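A sketch of that check; statvsd is the command named above, and the grep pattern is an assumption about the wording of its output:

# Look for requests that had to queue because no buddy buffer was free;
# a steadily growing count suggests the VSD servers need more buddy buffers
statvsd | grep -i "buddy"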
Each VSD requires parameters in the System Data Repository (SDR). These
parameters must be set to sufficient values in order to operate efficiently:
IP Packet Size
Cache Buffers
Request Count
Buddy Buffers
The actual settings you need will vary from system to system. You can display
the values currently stored in the SDR with the vsdatalst -n command; on a
system where VSD is already set up, they can be changed with the
updatevsdnode command.
You can use SMIT or the mmchconfig command to change the following
GPFS configuration attributes after the initial configuration has been set:
1. pagepool
2. mallocsize
3. priority
4. autoload
5. client_ports
Attributes 1 through 3 take effect the next time GPFS is started. The remaining
attributes require that the nodes be rebooted before new values take effect.
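For instance, the page pool could be enlarged with a one-line mmchconfig call; the 80 MB value is purely illustrative:

# Raise the GPFS page pool; takes effect the next time GPFS is started
mmchconfig pagepool=80M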
Attention
In a byte, bit 0 is the most significant bit. For instance, if a byte has the
hexadecimal value 0x40, only bit 1 is set on the byte.
When a byte describes the status of the switch chip ports, bit 0 represents
port 0, bit 1 represents port 1, and so on.
Byte Description
1 Reserved
29 Mode Bits
35 Reserved
45 Reserved
53 Reserved
The two Route Tables, primary and secondary, are used when the switch chip
is either reporting errors or generating acknowledgments. The three least
significant bits of the last byte will indicate how many of the previous seven
route bytes were valid. If the destination node is directly attached to the
switch chip, the length field will be equal to zero. The destination port through
which the service logic will transmit the Error/Status packet must be in the
most significant nibble of the first byte of the route table specification, even for
a route of length equal to zero.
The Link Round Trip Time-out Threshold is used by each receiver port to
control link synchronization.
When the EDC errors on a switch chip link reach the EDC Error Threshold, an
Error/Status packet is sent and the link starts the initialization process.
When a sender's Token Error counter reaches the Sender Token Error Count
Threshold, an Error/Status packet is sent and the link starts the initialization
process.
The Receiver Bypass Path Enable Bits are used by each receiver state
machine to determine if packets are allowed to take a bypass path to the
destination port. If a receiver has its bypass path disabled, it is forced to
buffer all packets into the central queue buffer.
The Receiver Central Queue Path Enable Bits are used by each receiver
state machine to determine if packets are allowed to use the central queue
buffer. If a receiver has its central queue path disabled, it is forced to transmit
all packets across the bypass path.
The Receiver Link Enable Bits are used by each receiver state machine to
determine if they should accept packets or not. One cannot disable all the
links of the chip since this makes it impossible for the chip to receive a new
configuration packet.
The Sender Link Enable Bits are used by each sender state machine to
determine if it should send packets or not. One cannot disable all the links of
the chip since this makes it impossible for the chip to send error reports.
The EDC Frame Length is used by each of the senders to define the number
of bytes per EDC Frame. The sender notifies its corresponding receiver of the
change in the buffer.
The Mode Bits are used to control the Central Queue data allocation.
The EDC Error Enable Bits are used by each receiver logic to determine if
single EDC errors should be reported when detected. The EDC Error
Threshold Counter, however, is increased even if this error is disabled.
Service Error Enables are used to enable a set of reports on the consistency
of service packets received:
• Incorrect CRC on a Service Packet
• Incorrect Service Packet Length
• Parity Error on Inbound FIFO
• Parity Error on Route Table
• Invalid Link Enable
• Send TOD Error
• State Machine Error
Byte # Description
0 Service Command
Only the command is needed, since the real action to be taken is to generate
the Error/Status packet. The remainder of the bytes are reserved for future
use.
Byte # Description
0 Service Command
1 Reserved
The two Receiver Error Resets bytes are used to reset the second error
capture registers defined as receiver-type errors. Their meaning is the
following:
• High Byte (byte #2)
  • Bits 0:3 are undefined
  • Bit 4 resets the EDC Error Register
  • Bit 5 resets the Parity Error on Route Register
  • Bit 6 resets the Undefined Control Character Error Register
  • Bit 7 resets the Unsolicited Data Error Register
• Low Byte (byte #3)
  • Bit 0 resets the Lost EOP Error Register
  • Bit 1 is reserved
  • Bit 2 resets the STI Data Re-Time Reporting Register
  • Bit 3 resets the Link Synchronization Error Register
  • Bit 4 resets the FIFO Overflow Error Register
The Receiver EDC Error Threshold Counter Reset byte is used to reset the
EDC Error Threshold Counter of one or more receiver modules in the switch
chip. Each bit corresponds to one receiver. The counters can be used as an
instrumentation device, where the software monitors may come in and read
the counter values for each receiver (Read Status packets), and reset them
as they please to monitor link error activity. The Receiver EDC Error
Threshold counter is automatically reset when its threshold value has been
reached and the error is enabled.
The Receiver Port Reset Bits are used by the service logic to determine if the
receiver logic should be reset to a known (Disabled) state.
The two Sender Error Resets bytes are used to reset the second error
capture registers defined as sender-type errors. Their meaning is the
following:
• High Byte (byte #6)
  • Bits 0:6 are undefined
  • Bit 7 resets the Parity Error on Data Register
• Low Byte (byte #7)
  • Bit 0 resets the Token Sequence Error Register
  • Bit 1 resets the Invalid Route Error Registers
  • Bit 2 is reserved
  • Bit 3 resets the STI Token Re-Time Reporting Register
  • Bit 4 resets the Token Count Overflow Error Register
  • Bit 5 resets the Token Error Threshold Register
  • Bit 6 resets the Link Synchronization Error Register
  • Bit 7 resets the State Machine Error Register
The Sender Token Error Threshold Counter Reset Bits are used to reset the
Token Error Counter of one or more sender modules in a switch chip. Each bit
represents a sender port. These counters can be used as an instrumentation
device, where the software monitors may come in and read the counter
values for each sender (Read Status packets), and reset them as they please
to monitor link error activity. The Sender Token Error Threshold counter is
automatically reset when its threshold value has been reached and the error
is enabled.
The Sender Port Reset Bits are used by the switch chip’s sender module to
determine if the sender logic should be reset to a known (Disabled) state.
The Central Queue Error Resets and Control Logic Reset bytes are used to
reset the second error capture registers defined as central queue type errors,
and to reset the central queue logic. The bits are used in the following way:
• Bits 0:3 are undefined
• Bit 4 resets the Next Chunk Linked List Initialization Error Latch
• Bit 5 resets the Next Message Linked List Error Latch
• Bit 6 resets the Next Chunk Linked List Error Latch
• Bit 7 resets the Central Queue logic
The Service Error Enables and Control Logic Reset bytes reset the second
error capture registers defined as service-type errors, and also reset the
service logic. In the latter case, the bits are used in the following way:
• Bit 0 resets the CRC Error Latch
• Bit 1 resets the Length Error Latch
• Bit 2 resets the Parity Error on Inbound FIFO Latch
• Bit 3 resets the Parity Error on Route Table Latch
• Bit 4 resets the Invalid Link Enable Error Latch
• Bit 5 resets the Send TOD Error Latch
• Bit 6 resets the State Machine Error Latch
• Bit 7 resets the Service logic
The last action of the service state machine when it finishes handling this
service message is to reset the First Error Capture Register, thereby enabling
the reporting of new errors. This will be done for any Reset Packet received,
even if there are no reset bits specified in the packet.
Byte # Description
0 Service Command
1 Reserved
2:7 Reserved
8:14 Time-of-Day
15 Time-of-Day Delay
The time-of-day field is the value of the TOD counter inserted by the
transmitting adapter or switch.
The time-of-day Delay field is the calculated delay between the launch of the
Set TOD packet from the adapter or the switch to the loading of the TOD clock
at the destination adapter or switch. This value is calculated by the software.
The time-of-day Delay is concatenated with the time of day and loaded into
the destination chip's TOD counter. The switch inherits this new time of day
and starts incrementing the TOD counter from there.
Byte # Description
0 Service Command
1 Reserved
2:6 Reserved
7 Transmit Port
8:14 Reserved
15 Time-of-Day Delay
The three least significant bits of the Transmit Port byte indicate which of the
eight send ports of the switch chip the resulting Set TOD packet is to be
transmitted from.
The time-of-day Delay is the calculated delay between the launch of the Set
TOD packet from the adapter or the switch to the loading of the TOD clock at
the destination adapter or switch. This value is calculated by the software.
Byte # Definition
0 Service Command
1 Reserved
4 Sequence Number
5 Reserved
14 Reserved
24 Reserved
Chip Status
48 Sender0 States
49 Sender1 States
50 Sender2 States
51 Sender3 States
52 Sender4 States
53 Sender5 States
54 Sender6 States
55 Sender7 States
129 Reserved
139 Reserved
147 Reserved
156:251 Reserved
The two-byte Chip Identification uniquely identifies each switch chip to the
service processor. It is defined to the chip during initialization through the
initialization service packet.
The First Error Capture Register and the Second Error Capture Register
collect all events on the switch chip, separating the first event from the
following ones. See A.2, “Error Registers” on page 242 for more details.
The Chip Status bytes give information relative to the status of the switch
chip. This information can be used for debugging, instrumentation, and
monitoring the switch chip. Several parameters are collected from each
sender and receiver module and from the central queue which provide a
snapshot of the chip status. Network congestion, link and protocol problems
can be monitored and isolated using this data.
The Initialization Register Values are set up during chip initialization. They
are read out to verify that an initialization packet has been received properly.
The time-of-day Counter Value is included for two reasons. First, it puts a time
stamp on the Error/Status packet. Secondly, the most significant bit of the
TOD clock indicates whether or not the TOD clock has ever been initialized. A
value of logic 1 says that the TOD clock has been initialized. If the
Error/Status packet is generated as a result of an error, the TOD value
contains the time that the error occurred.
Attention
As previously mentioned, in a byte, bit 0 is the most significant bit. For
instance, if a byte has the hexadecimal value 0x40, only bit 1 is set in the
byte.
[Only a fragment of the byte/bit definition table for the error capture registers
survived here; the recoverable entries include EDC Error, Unsolicited Data,
FIFO Overflow, and Service Error, with the remaining positions reserved.]
# NAME: Eclock.top.7nsb.4isb.0
#
# FUNCTIONS: This file describes the clock configuration for an 112-way
# with seven (7) Node Switch Boards (NSBs) and four (4)
# Intermediate Switch Boards (ISBs).
# It is used during clock source selection process
# ("Eclock" command). It should not be changed unless the
# switch-to-switch cabling differs from the prescribed pattern.
# FORMAT: See below.
# NOTES:
# Switch Numbers: This is the switch number of the target switch
board.
# Node Switch Boards (NSBs) numbers are less than 1000.
# Intermediate Switch Boards (ISBs) numbers are greater
# than 1000.
#
# Multiplexor values:
# 0 - Use the internal oscillator (make this switch board the
# master frame).
# 1 - Use input 1 (Clock input from Jack J3 -- Both Switches)
# 2 - Use input 2 (Clock input from Jack J5 -- High Perf. Switch)
# (Clock input from Jack J4 -- SP Switch)
# 3 - Use input 3 (Clock input from Jack J7 -- High Perf. Switch)
# (Clock input from Jack J5 -- SP Switch)
# 4 - Use input from Jack J4 (SP Switch)
# 5 - Use input from Jack J5 (SP Switch)
# 6 - Use input from Jack J6 (SP Switch)
# 7 - Use input from Jack J7 (SP Switch w/ ISBs)
# 8 - Use input from Jack J8 (SP Switch w/ ISBs)
In this appendix we list the return codes from the Worm code and the Route
Table Generation code.
This publication will help both RS/6000 SP specialists and general users who
want in-depth knowledge about the SP Switch. The information in this
publication is not intended as the specification of any programming interfaces
that are provided by the SP Switch Communication Subsystem (CSS). See
the PUBLICATIONS section of the IBM Programming Announcement for
Parallel System Support Programs (PSSP) for more information about what
publications are considered to be product documentation.
IBM may have patents or pending patent applications covering subject matter
in this document. The furnishing of this document does not give you any
license to these patents. You can send license inquiries, in writing, to the IBM
Director of Licensing, IBM Corporation, 500 Columbus Avenue, Thornwood,
NY 10594 USA.
Licensees of this program who wish to have information about it for the
purpose of enabling: (i) the exchange of information between independently
created programs and other programs (including this one) and (ii) the mutual
use of the information which has been exchanged, should contact IBM
Corporation, Dept. 600A, Mail Drop 1329, Somers, NY 10589 USA.
The information contained in this document has not been submitted to any
formal IBM test and is distributed AS IS. The information about non-IBM
("vendor") products in this manual has been supplied by the vendor and IBM
assumes no responsibility for its accuracy or completeness. The use of this
information or the implementation of any of these techniques is a customer
responsibility and depends on the customer’s ability to evaluate and integrate
them into the customer’s operational environment. While each item may have
been reviewed by IBM for accuracy in a specific situation, there is no
guarantee that the same or similar results will be obtained elsewhere.
Customers attempting to adapt these techniques to their own environments
do so at their own risk.
Any pointers in this publication to external Web sites are provided for
convenience only and do not in any manner serve as an endorsement of
these Web sites.
The following document contains examples of data and reports used in daily
business operations. To illustrate them as completely as possible, the
examples contain the names of individuals, companies, brands, and products.
All of these names are fictitious and any similarity to the names and
addresses used by an actual business enterprise is entirely coincidental.
Reference to PTF numbers that have not been released through the normal
distribution process does not imply general availability. The purpose of
including these reference numbers is to alert IBM customers to specific
information relative to the implementation of the PTF when it becomes
available to each customer according to the normal IBM PTF distribution
process.
Microsoft, Windows, Windows NT, and the Windows 95 logo are trademarks
or registered trademarks of Microsoft Corporation in the United States and/or
other countries.
The publications listed in this section are considered particularly suitable for a
more detailed discussion of some of the topics covered in this redbook.
This section explains how both customers and IBM employees can find out about ITSO redbooks,
redpieces, and CD-ROMs. A form for ordering books and CD-ROMs by fax or e-mail is also provided.
• Redbooks Web Site http://www.redbooks.ibm.com/
Search for, view, download or order hardcopy/CD-ROM redbooks from the redbooks web site. Also
read redpieces and download additional materials (code samples or diskette/CD-ROM images) from
this redbooks site.
Redpieces are redbooks in progress; not all redbooks become redpieces and sometimes just a few
chapters will be published this way. The intent is to get the information out much quicker than the
formal publishing process allows.
• E-mail Orders
Send orders via e-mail including information from the redbooks fax order form to:
e-mail address
In United States usib6fpl@ibmmail.com
Outside North America Contact information is in the “How to Order” section at this site:
http://www.elink.ibmlink.ibm.com/pbl/pbl/
• Telephone Orders
United States (toll free) 1-800-879-2755
Canada (toll free) 1-800-IBM-4YOU
Outside North America Country coordinator phone number is in the “How to Order”
section at this site:
http://www.elink.ibmlink.ibm.com/pbl/pbl/
• Fax Orders
United States (toll free) 1-800-445-9269
Canada 1-403-267-4455
Outside North America Fax phone number is in the “How to Order” section at this site:
http://www.elink.ibmlink.ibm.com/pbl/pbl/
This information was current at the time of publication, but is continually subject to change. The latest
information for customers may be found at http://www.redbooks.ibm.com/ and for IBM employees at
http://w3.itso.ibm.com/.
A
Active Message 57
Address Resolution Protocol 60, 81, 92, 204
ARP
  See Address Resolution Protocol

B
bandwidth 47
BFS
  See Breadth First Search algorithm
Breadth First Search algorithm 124
  See also fault service daemon

E
Eannotator 95, 98, 105
Eclock 100, 119
  mux 101
  switch reset 105, 126
  system power up 134, 146
EDC
  See Error Detection Code
Efence 140, 157
Emonitor 140, 142
Eprimary 105, 132
Equiesce 141
error
  asynchronous 116
  permanent 116
  unrecoverable 128

N
Nagle Algorithm 206
NFS 214
nfso 196, 214
nfs_socketsize 217
nfs_tcp_socketsize 217
nfsstat 215
no 196, 204, 218, 221
  ipforwarding 212
  rfc1323 208
  sb_max 213
  tcp_pmtu_discover 213
  tcp_recvspace 207, 208, 212, 218, 219
  tcp_sendspace 207, 208, 212, 218, 219
  thewall 202
  udp_pmtu_discover 213
notify_event 171
numbering
  node number 75
  slot number 74
  switch node number 76
  switch port number 76

O
ODM classes
  CuAt 112, 115
  PdDv 111
oncoming primary backup node 119, 132
oncoming primary node 119, 132

P
Parallel Environment for AIX 51
Partitioning Aid 86
pending error capture register 35
Phase Locked Loop 18, 136
PIPE 54

R
rc.switch 105, 114, 146, 175
read_regs 176
read_tbic 177
Remote Memory Copy 57
Route Table Generation 128, 158
RPC 215
RTG
  See Route Table Generation
RVSD
  See Recoverable Virtual Shared Disk

S
SDR classes
  Adapter 107, 115
  host_responds 144
  Node 115
  Switch 108, 119, 135, 137
  Switch_partition 99, 105, 108, 115, 120, 123, 132, 146
  switch_responds 107, 115, 117, 129, 144, 169
  Syspar 108
  Syspar_map 107
SDR_config 115
SDRGetObjects 169
second error capture register 34
Self-Timed Interface 17, 27
SIGTERM 165
socket
  SO_RCVBUF 207
  SO_SNDBUF 207
  TCP_NODELAY 208
  TCP_RFC1323 207
sp_fs_control 116
SP_NAME 104
spadaptrs 92, 107
spapply_config 105
spethernt 107
splstdata 93, 105, 106, 107
spmon 106, 135
spverify_config 105
subsystems
  swtadmd 143
  swtlog 152
switch adapter 38, 165
  microcode 42, 112
Switch Admin daemon 140, 142, 143
switch board 5, 9
  intermediate switch board 10, 83, 86, 94, 96
  master board 19
  master chip 18
  master switch board 100
  node switch board 10, 83, 94
  oscillator 36
  slave board 19, 100
  SP Switch-8 12, 73, 76, 94
  supervisor card 5
switch buffer pool
  receive pool 63, 197
  rpoolsize 223
  send pool 62, 197
  spoolsize 223
switch chip 6, 21
  Central Queue 31
  chunk 32
  emergency slot 32
  flit 22
  link enable bit 27, 30
  link switch chip 83
  node switch chip 83
  receiver module 23
  route information 25
  sender module 28
switch packet
  beginning of packet 13
  data packet 13
  end of packet 13
  packet fail 25, 28
  service packet 13, 32
switch supervisor card 35, 37
switch topology file
  actual 126
  annotated 94
  distribution 120
  expected 94
  use 124
syspar_ctrl 105

T
TBIC
  See Trail Blazer Interface Chip
Time-Of-Day counter 17, 126, 166
TOD
  See Time-Of-Day counter
Trail Blazer Interface Chip 38, 112, 116, 117, 141

U
ucfgtb3 113
updatevsdnode 223
usconfig 116
user space application 50

V
vdidl3 201
Virtual Shared Disk 82, 86, 221
  Buddy Buffers 222
vmtune 196
VSD
  See Virtual Shared Disk
vsdatalst 223
vsdnode 222

W
window 42
  service 45
  user space 43, 46
Worm
  See fault service daemon