SPARC M7 Course PDF
Processor
• M7-8: 4 to 8 SPARC M7 processors, each with 32 cores and 8 threads per core
• M7-16: 8 to 16 SPARC M7 processors, each with 32 cores and 8 threads per core
Memory
• 16 DIMM slots per processor, 16GB or 32GB DIMMs
I/O expansion
• M7-8: 12 to 24 low-profile PCIe Generation 3 (PCIe3) card slots
• M7-16: 24 to 48 low-profile PCIe Generation 3 card slots
Storage
• 1 or more optional flash accelerator PCIe cards
Service processors
• Two redundant SPs. The M7-16 has an additional four redundant SPPs. The SPs and SPPs are used to monitor and control the server remotely.
The M6-32 has an internal midplane. The M7-8 servers have internal interconnects, and the M7-16 server has an external interconnect consisting of wire trusses.
It is very important to understand the terminology used in these servers before going any further. Some of the
terms may be familiar from other server lines.
The M7-8 and M7-16 servers consist of CPU, memory, and I/O, which reside on CMIOU boards. Each CMIOU board contains one CPU module, 16 Dual Inline Memory Module (DIMM) slots, and three hot-plug, low-profile PCIe Gen 3 slots.
Domain Configuration Units (DCUs) are groups of four CMIOUs in the M7-16, connected by the local
coherency interconnect. A physical domain consists of one or more DCUs.
The M7-8 servers consist of one CMIOU chassis. The M7-16 server consists of two CMIOU chassis with a switch chassis connecting them.
This physical block diagram displays components of the CMIOU base board. The M7 processor is on a
mezzanine board and connects to its memory through four on-chip memory controllers, with two ports off
each controller.
The I/O interface of the M7 processor consists of two 16-lane PCIe interconnects that are expanded through a PCIe switch referred to as the I/O Hub (IOH). The switch also connects to three PCIe Generation 3 slots. Note that these PCIe slots require the adapters to be packaged in carriers. The other I/O device is an embedded USB device, known as eUSB, with a density of 2GB, which can be used in conjunction with VersaBoot to boot the operating system. VersaBoot requires an external boot device accessed over InfiniBand.
A SPARC M7-16 server is divided into four configurable units called DCUs. These DCUs have four CMIOUs
each. Two of the DCUs are in the top CMIOU chassis, and two are in the bottom CMIOU chassis. You can
configure the DCUs in up to four PDomains.
Each CMIOU chassis also has two service processor proxies. SPP2 and SPP3 are in the top chassis, and SPP0
and SPP1 are in the bottom chassis. These SPPs have two service processor modules each. To achieve
redundancy, each SPM (SPM0 and SPM1) on an SPP is assigned to a different DCU. SPM0 on each SPP
manages one DCU in the CMIOU chassis, while SPM1 on each SPP manages the other DCU in the chassis.
There are two redundant SPs in a SPARC M7-8 server, SP0 and SP1.
In a SPARC M7-16 server, the SPs in each CMIOU chassis are referred to as SPPs. SPP2 and SPP3 are in the
top CMIOU chassis, and SPP0 and SPP1 are in the bottom CMIOU chassis. A SPARC M7-16 server also has
two SPs in the switch chassis. The SPPs in the CMIOU chassis manage DCU activity, and the SPs in the switch
chassis manage system activity.
ILOM on the M7 servers has extensions to support Enterprise features.
Oracle Integrated Lights Out Manager (Oracle ILOM) is the system management firmware that is preinstalled
on the M7 Servers. Oracle ILOM enables you to actively manage and monitor components installed in your
server. Oracle ILOM provides a browser-based interface and a command-line interface, as well as SNMP and
IPMI interfaces.
The Oracle ILOM SP runs independently of the server and regardless of the server power state as long as AC
power is connected to the server. When you connect the server to AC power, the ILOM service processor
immediately starts up and begins monitoring the server. All environmental monitoring and control are
handled by Oracle ILOM.
In the M7-8 servers, each CMIOU chassis has two service processors.
In the M7-16 server, the service processors reside on the Switch chassis. Each CMIOU chassis has two
service processor proxies. SPP2 and SPP3 are in the top chassis, and SPP0 and SPP1 are in the bottom
chassis. These SPPs have two service processor modules each. To achieve redundancy, each SPM (SPM0
and SPM1) on an SPP is assigned to a different DCU. SPM0 on each SPP manages one DCU in the CMIOU
chassis, while SPM1 on each SPP manages the other DCU in the chassis.
The M7-16 server consists of two CMIOU chassis with a switch chassis connecting them.
In SPARC M7-16 servers, switch units are part of the scalability feature that allows a PDomain to control
more than one DCU. Switch units are configured to work together as a single unit.
Oracle ILOM Remote System Console Plus is a Java application that enables you to remotely redirect and
control the following devices on the host server. This group of devices is commonly abbreviated as KVMS.
• Keyboard
• Video display
• Mouse
• Storage devices or images (CD/DVD)
/Servers/PDomains/PDomain_x/HOST is the equivalent of /HOSTx
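In the ILOM CLI the two path forms are interchangeable. For example (an illustrative session for host 0, where x = 0; output omitted), both of the following read the same property:

```
-> show /Servers/PDomains/PDomain_0/HOST power_state
-> show /HOST0 power_state
```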
The hypervisor implements the software component of the sun4v virtual machine interface, providing low-overhead hardware abstraction. It enforces hardware and software resource access restrictions for guests, including inter-guest communication, to provide isolation and security. It also performs initial triage and correction of hardware errors.
The Fault Management Architecture (FMA) is part of Oracle ILOM.
I. Processor Features
---------------------
1. L1Cache protection
- Because the L1 caches are clean, they do not need the stronger ECC protection used on the larger caches; parity protection suffices.
- If a parity error is detected, the affected cacheline is invalidated and a new copy is fetched from the L2 cache. The new fetch uses a bypass path, so it is guaranteed not to hit the same location again if, by chance, there is a persistent failure there.
2. L2Cache protection
- Because the L2 can hold modified (i.e., dirty) data, ECC is used for protection. ECC provides Single Error Correct (SEC)/Double Error Detect (DED) coverage, so data can always be recovered from a single-bit error. Data is arranged in the physical SRAM of the cache to ensure that cosmic rays and alpha particles affect only a single bit of data covered by the same ECC check bits.
- For performance reasons, error correction is not done in the critical access path of the cache. So a
detected error will cause a trap to hypervisor (part of the system firmware), which will then flush the
line from the cache and allow the user process to resume. Again, a bypass path ensures forward
progress in the presence of a persistent failure.
- If the system firmware determines that a specific location in the cache is generating a lot of errors, it
will update the state of the cache so that that location is no longer used. This process is called either
cacheline "retire" or cacheline "sparing".
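To make the parity-versus-ECC distinction concrete, here is a minimal Python sketch using the classic Hamming(7,4) code. This is a teaching toy, not the cache's actual code (which protects much wider words): parity can only detect a single flipped bit, while a SEC-style code can locate and repair it.

```python
def parity_ok(word):
    # Even parity over a word that includes its parity bit:
    # detects any single-bit flip but cannot locate or fix it.
    return sum(word) % 2 == 0

def hamming74_encode(d):
    # d: four data bits. Parity bits occupy positions 1, 2, 4 (1-indexed);
    # data bits occupy positions 3, 5, 6, 7.
    p1 = d[0] ^ d[1] ^ d[3]
    p2 = d[0] ^ d[2] ^ d[3]
    p3 = d[1] ^ d[2] ^ d[3]
    return [p1, p2, d[0], p3, d[1], d[2], d[3]]

def hamming74_correct(codeword):
    # For a valid codeword, the XOR of the (1-indexed) positions of all
    # set bits is 0; a single flipped bit makes it the error position.
    c = list(codeword)
    syndrome = 0
    for pos, bit in enumerate(c, start=1):
        if bit:
            syndrome ^= pos
    if syndrome:
        c[syndrome - 1] ^= 1   # single-bit error: flip it back
    return c

cw = hamming74_encode([1, 0, 1, 1])
corrupted = list(cw)
corrupted[4] ^= 1              # simulate a single-bit upset
assert hamming74_correct(corrupted) == cw
```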
3. L3Cache protection
- Much the same as the L2 but with inline correction, which means that if a cacheline is loaded to the
L2 from the L3, the data will be corrected when it's placed in the L2 if it contains an error when read
from the L3.
4. Status and Directory protection
- The Status and Directory arrays in both the L2 and L3, which are used for maintaining coherence, are
protected by ECC and have automatic correction built into the HW, to ensure forward progress.
5. Register File protection
- The architectural registers that are used by executing processes are also protected by ECC, so that
any form of upset will not cause loss of data. The HW does not automatically correct, but a detected
error will cause a trap to hypervisor, which will then correct the data and allow the process to resume.
IV. New to M7
1. Redundant SP Proxies
On the M7-16 system, each DCU of (up to) 4 CPUs has an SP Proxy (SPP). While an SPP is not critical to
the runtime functionality of the 4 CPUs, a functioning SPP must be present in order to boot a physical
domain using that DCU. Having redundant SPPs ensures that a DCU will continue to be bootable in the
circumstance of a failed SPP.
It also has the advantage of ensuring the ongoing management of a single-DCU physical domain that
incurs the loss of an SPP.
To enable this feature, the FPGA that resides on the CMIOU has been designed with two ports, one for
each SPP. This means both SPPs within a CMIOU chassis have access paths to all CMIOUs in the chassis,
which in turn means that either SPP can assume responsibility for managing the CMIOUs.
2. DIMM Sparing
The physical address space provided by the memory DIMMs controlled by an individual CPU node is typically interleaved for performance reasons. An N-way interleave means that each successive cacheline resides in a different DIMM, from DIMM 0 to DIMM N-1; cacheline N resides in DIMM 0 again, and the pattern repeats.
Interleave support is usually implemented in powers of 2. The plan of record for fully configured M7 systems is to use a 16-way interleave. The M7 processor also supports a 15-way interleave, which means that if a physical domain is started with an unusable DIMM, that CPU node can still provide 15 DIMMs worth of physical address space, rather than having to drop to an 8-way interleave and provide only 8 DIMMs worth.
In addition, the M7 processor supports dynamically switching from a 16-way to a 15-way interleave, with no interruption of service to the user. Because it is not possible to fit 16 DIMMs worth of address space into only 15 DIMMs, this feature is available only if the user specified, at the time the physical domain was started, that the capability is desired. In that case, the platform firmware ensures that only 15 DIMMs worth of physical address space is made available to the system, even though the DIMMs are configured for 16-way interleave operation.
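The interleave arithmetic above can be sketched as follows. The DIMM size (32GB) is hypothetical, and the mapping is the simple modulo rotation the text describes, not the processor's actual address decoding:

```python
def dimm_for_cacheline(line_index, ways):
    # Successive cachelines rotate across the interleave set:
    # line 0 -> DIMM 0, ..., line N-1 -> DIMM N-1, line N -> DIMM 0 again.
    return line_index % ways

def usable_capacity_gb(ways, dimm_gb=32):
    # An N-way interleave exposes N DIMMs worth of physical address space.
    return ways * dimm_gb

# With one DIMM spared out, a 15-way interleave keeps 15/16 of capacity,
# instead of falling back to 8-way (only 8/16 of capacity):
assert usable_capacity_gb(15) == 480
assert usable_capacity_gb(8) == 256
assert dimm_for_cacheline(16, 16) == 0
assert dimm_for_cacheline(15, 15) == 0
```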
Customers can enable or disable DIMM sparing, though the procedure is not documented, and they will need to open an SR (service request).
DIMM sparing is a mode that protects against persistent DRAM failure. Sparing monitors for an excessive
amount of correctable errors and, when it detects the errors, copies the contents of an unhealthy portion of
memory to an available spare. It increases the system reliability and uptime.
Reference Doc ID 2037793.1 – SPARC T7/M7 Servers : DIMM sparing FAQ
These servers are based on the new SPARC M7 processor. They are optimized for massive workloads and
application consolidation and are designed with extensive RAS and high availability features.
SPARC is more than just a fast processor. It accelerates functionality across multiple layers of software critical to a corporation's operations. Critical or repetitive software functions are converted to operations in the SPARC processor.
PDomains and DCUs are logical units, not physical components, that you can configure depending on how
you want to isolate applications and data to satisfy particular business requirements.
A SPARC M7-8 server can be configured as a single PDomain with eight CMIOUs in a single CMIOU chassis.
In this configuration, there are two redundant service processors in this chassis. To facilitate failover from
one SP to another, one SP serves as the Active SP, and one serves as the Standby SP. The Active SP manages
system resources unless it can no longer do so, in which case the Standby SP assumes its role.
A SPARC M7-8 server can also be configured as two Static PDomains. Each Static PDomain consists of four
of the eight CMIOUs in the chassis; one PDomain consists of cmiou0-cmiou3, the other consists of cmiou4-
cmiou7.
In this configuration, two service processors (SP0 and SP1) achieve redundancy with two SPMs each that
are assigned to different Static PDomains. SPM0 on each SP manages one Static PDomain in the chassis,
while SPM1 on each SP manages the other Static PDomain.
Oracle ILOM identifies one of the SPMs from each SP as the Active SP to manage activity on the Static
PDomain. The other SPM on the SP runs Oracle ILOM, but remains inactive unless the Active SP can no
longer manage the hardware. In this case, the inactive SPM assumes the Active SP role.
You cannot specify which SPMs will serve as the Active SP, but you can request that Oracle ILOM should
assign a new SPM to serve each role. You might do this when, for example, you are replacing an SP.
A SPARC M7-16 server is divided into four configurable units called DCUs. These DCUs have four CMIOUs each. Two of
the DCUs are in the top CMIOU chassis, and two are in the bottom CMIOU chassis. You can configure DCUs into one to
four PDomains.
Each CMIOU chassis also has two service processor proxies. SPP2 and SPP3 are in the top chassis, and SPP0 and SPP1
are in the bottom chassis. These SPPs have two service processor modules each. To achieve redundancy, each SPM
(SPM0 and SPM1) on an SPP is assigned to a different DCU. SPM0 on each SPP manages one DCU in the CMIOU
chassis, while SPM1 on each SPP manages the other DCU in the chassis.
Oracle ILOM identifies one of these SPMs as a DCU-SPP to manage DCU activity. The other SPM runs Oracle ILOM, but
remains inactive unless the DCU-SPP can no longer manage the hardware. In this case, the inactive SPM assumes the
role of the DCU-SPP.
Oracle ILOM also identifies one of the DCU-SPPs from the pool of DCU-SPPs on the same PDomain as the PDomain-
SPP to manage activity on that host.
You cannot specify which SPMs will serve as DCU-SPPs or the PDomain-SPP, but you can specify that Oracle ILOM
should assign a new SPM to serve each role. You might do this when, for example, you are replacing an SPP.
The switch chassis in the SPARC M7-16 server contains six switch units that allow CMIOUs to communicate with each
other. The switch chassis also has two service processors (SP0 and SP1) that manage system resources. Each SP in the
switch chassis has a single SPM (SPM0 in each SP). The SPM that you identify as the Active SP manages system
resources unless it can no longer do so, in which case the Standby SP assumes its role.
M7-8 systems are configured in static domains. M7-16 PDoms are dynamic and are configurable.
The on-chip network is a high bandwidth coherent network that maintains coherency for L3$.
There are eight CMIOUs in a CMIOU chassis.
The image above shows the numbering for the M7-16 server. The numbering for the M7-8 servers is
represented in the lower 1/3 of this picture. The CMIOU numbers go from bottom to top in each
CMIOU chassis.
The M7-8 servers are considered a “glue-less” system as they are self-contained.
The M7-16 is considered a “glued” system as it consists of two M7-8 servers connected as 16-way SMP
through the switch chassis.
This diagram shows the layout of the DIMM slots on the CMIOU. It also shows the fault remind button in
blue. The orange dots on the DIMM ejector levers represent the DIMM fault LEDs.
The blue Fault Remind button resides on the CMIOU board. An illuminated green Fault Remind Power LED
indicates that there is power available to light the faulty DIMM LED. Any faulty DIMM is identified by an
associated amber LED until you release the button.
Reference Doc ID 2037793.1 – SPARC T7/M7 Servers : DIMM sparing FAQ
MR (Mystic River) is the code name for the IO Hub (IOH).
For the SPARC M7-8 and SPARC M7-16 servers, some of the terms that are used to describe the I/O maps
have changed from prior releases of the M-series platforms.
This allows the firmware to assign the PCIe device paths in a logical order, allowing the host firmware
to organize and group the device paths for slots so that they are in order, either ascending or
descending, depending on the platform.
The PCI Express I/O is hosted by a new IOH. Each I/O controller has five root complexes. Four of these root
complexes are 16 lanes wide (x16). The PCIe device path for each root complex is controlled by host
firmware.
This figure shows the root complex assignments for one CMIOU.
As shown, each CMIOUx/CMP connects to a single IOH. Thus, the naming is:
• DMU0/PEU0/RC0 has the lowest /pci@xxx root complex name.
• DMU1/PEU1/RC1 has the next highest /pci@xxx root complex name.
• DMU3/PEU3/RC3 has the next highest /pci@xxx root complex name.
This diagram shows the mapping for an individual CMIOU board. The pci_x terminology is used by the
LDoms Manager to track root complexes. This slide is showing the association between root complexes and
slots, using the pci_x terminology.
The switch chassis connects the two CMIOU chassis together in the M7-16 server.
In SPARC M7-8 servers, five internal interconnect assemblies in the CMIOU chassis connect the CMIOUs to
each other.
SP = Service Processor
SPP = Service Processor Proxy
SPM = Service Processor Module
The M7 systems improve availability compared to the M6 by minimizing potential system outages. They use redundant, hot-plug SPs/SPPs to allow failover and minimize service-action outages.
The SPPs (/SYS/SPPx) on an M6-32 are now SPMs (/SYS/SPPx/SPMy) on M7 platforms. The DCUs are
controlled by a pair of redundant SPMs. The SPM which is currently controlling the DCU is called DCU-SPM,
compared to the M6 'Golden' SPP which controls one PDOM.
The M7-8 servers have a single SP tray with two SPs.
The M7-16 has two CMIOU chassis and a switch. The SPs in the CMIOU chassis become SPPs and the SPs in
the Switch become the primary SPs.
In a SPARC M7-16 server, there are two SPPs in each CMIOU chassis and two SPs in the switch chassis. SPPs have two service processor modules each; SPs have one SPM each. This supports hot repair of the Standby SP/SPP FRU, along with redundant SPPs for the M7-16 platform and redundant SPs for the M7-8 platforms.
System firmware consists of two components, an SP component and a host component. The SP firmware
component is located on the SP’s SPM. The host firmware component is located on the SP boards, and is
downloaded to the CMIOU when the host is started. For the server to operate correctly, the firmware in
these two components must be compatible.
Before replacing an SP, save the configuration using the Oracle ILOM backup utility. After replacing an SP,
the SP firmware, the host firmware, and the configuration will be restored from the existing Active SP to
the new SP.
This tray is common to the Switch and CMIOU Chassis.
Connection to the external network needs to be from the NET0 port. For VLAN you can use any of the
NET1/NET2/NET3 ports though the above mentioned convention is preferred. If the chassis/ports are
not connected properly, VLAN will not work for that chassis.
Before the server arrives, ensure that the receiving area is large enough for the shipping package.
If your existing loading dock meets height or ramp requirements for a standard freight carrier
truck, you can use a pallet jack to unload the server. If not, use a standard forklift or other means
to unload the server, or request that the server be shipped in a truck with a lift gate.
The SPARC M7-16 server and certain SPARC M7-8 servers ship from the factory preinstalled in a
rack. Stand-alone SPARC M7-8 servers do not ship in a rack.
When turning the server, temporarily provide additional space in front or rear of the installation site
beyond the minimum aisle width. The server requires at least 52 in. (1.32m) of space to turn.
Before installing the server, prepare a service area that provides enough room to install and service the
server.
The design of your environmental control system, such as computer room air-conditioning units, must
ensure that intake air to the servers complies with the limits specified.
Electrostatic discharge (ESD) is easily generated, and less easily dissipated, in areas where the relative humidity is below 35 percent. ESD becomes critical when levels drop below 30 percent. The narrow 5 percent relative humidity range, between 45 and 50 percent, might seem unreasonably tight compared to the guidelines used in typical office environments or other loosely controlled areas. However, this range is not as difficult to maintain in a data center because of the high-efficiency vapor barrier and low rate of air changes normally present.
Excessive concentrations of certain airborne contaminants can cause the server’s electronic
components to corrode and fail. Take measures to prevent contaminants such as metal particles,
atmospheric dust, solvent vapors, corrosive gasses, soot, airborne fibers, or salts from entering, or
being generated within, the data center.
Note: To avoid introducing airborne contaminants to the data center, unpack the server outside of the
data center and then move the server to its final location.
There are two cooling zones in a CMIOU chassis. In one cooling zone, eight fans pull air through the
CMIOUs from the front of the server and exhaust it at the rear of the server. In the other cooling zone, six
fans pull air through the power supplies and exhaust it through the chimney to the SPs and directly out the
rear of the chassis.
There are seven cooling zones in a switch unit chassis. In six of the cooling zones, six fans pull air through
the switch units from the front of the server and exhaust it at the rear of the server. In the other cooling
zone, four fans pull air through the power supplies and exhaust it through the chimney to the SPs and
directly out the rear of the chassis.
The servers contain hot-swappable, redundant power supplies. The SPARC M7-8 servers each contain 6 power supplies, and a SPARC M7-16 server has 16 power supplies. These specifications are for each power supply, not for the entire server.
Note - All power supplies must be installed, and all power cords must be connected, to power the server.
Use these power supply specifications only as a planning aid. For more precise power values, use the online
power calculator to determine the power consumption of the server with your configuration. To locate the
appropriate power calculator, go to the following web site and navigate to the specific server page:
http://www.oracle.com/goto/powercalculators/
The rack mounted servers ship with two redundant three-phase PDUs. To support the power requirements
of all geographical regions, the PDUs can be either low voltage or high voltage.
• Low-voltage PDU – North America, Japan, and Taiwan
• High-voltage PDU – Europe, Middle East, Africa, rest of the world
Notes
• All six PDU power cords must be connected to the facility AC power receptacles to power the server.
• While the PDU power cords are 4m (13.12 ft.) long, 1 to 1.5m (3.3 to 4.9 ft.) of the cords will be routed
within the rack cabinet. The facility AC power receptacles must be within 2m (6.6 ft.) of the rack.
• The installation site must have a local power disconnect (for example, circuit breakers) between the
power source and the power cords. You will use this local disconnect to supply or remove AC power
from the server.
The server is designed to be powered by two utility power grids. Connect the three power cords from PDU
A to one power grid, and connect the three power cords from PDU B to a second power grid. (When facing
the rear of the server, PDU A is on the left and PDU B is on the right.) All six power cords must be
connected when operating the server.
There are no circuit breakers in the system. When power is applied to the power grids, it is applied to the
system.
Note: With this dual-power feed setup, every power cord connected to the server is used to supply power,
and the power load is balanced. When power loads are greater than 5% of the power supply capacity, the
power loads are balanced at ±10%.
Six PDU power cords provide power to the two PDUs in the rack. The server power cords within the rack connect to the PDUs. Connect the three PDU A power cords (left) to one facility AC power grid, and the three PDU B power cords (right) to another AC power grid. To ensure redundant operation of the power supplies, connect the server chassis power cords to alternate PDUs. When connected to alternate PDUs, the power supplies provide 1+1 (2N) redundancy in case of a power failure on a single AC power grid.
When facing the rear of the SPARC M7-16 server, the server power cord-to-PDU connections are shown
above.
Six AC power cords supply power to the six power supplies of a stand-alone SPARC M7-8 server. These
server power cords connect the rear IEC 60320-C19 AC inputs to your facility AC power sources.
To ensure the redundant operation of the power supplies, connect the server power cords to alternate
power sources. For example, connect server power cords from AC inputs labeled AC0, AC2, and AC4 to one
power source and from AC inputs labeled AC1, AC3, AC5 to another power source. When connected to
alternate power sources, the server has 2N redundancy in case of a power failure to a single power source.
You can install up to three SPARC M7-8 servers into a Sun Rack II 1242 rack. To provide sufficient cooling,
you must provide 3U of space between each server in the rack. Install the servers in the locations listed on
the slide.
Caution - A rackmounted SPARC M7-8 server will be factory-installed at the lowest location shown. To
prevent the rack from becoming top heavy, always install the second SPARC M7-8 server in the middle
location before installing a third server in the top location.
Provide a separate circuit breaker for each PDU power cord connected to the server. These circuit breakers
must accommodate the facility fault current ratings for the power infrastructure.
Standard 3-pole circuit breakers are acceptable. The server has no specific time-to-trip requirements.
Contact your facilities manager or a qualified electrician to determine what type of circuit breakers
accommodate your facility power infrastructure.
In preparation for a system installation, you should always reference the latest version of the Servers
Installation Guide before starting the installation. This document, along with the EIS document,
contains the most current information and procedures for system installation.
The SPARC M7-16 server and certain SPARC M7-8 servers ship from the factory preinstalled in a rack.
Stand-alone SPARC M7-8 servers do not ship in a rack. The M7-8 servers occupy 10 RU (rack units).
You must provide the service area outlined in the slide for the server. Do not attempt to install or
operate the server in a smaller service area.
Many components housed within the chassis can be damaged by electrostatic discharge. To protect
these components from damage, perform the following before opening the chassis for service.
1. Prepare an antistatic surface to set parts on during the removal, installation, or replacement process. Place ESD-
sensitive components, such as printed circuit boards, on an antistatic mat.
2. Attach an antistatic wrist strap. When servicing or removing server components, attach an antistatic strap to
your wrist and then to a metal area on the chassis.
For installations carried out by Oracle, the Install Coordinator will arrange for only a Dayton lift (Svc Item P/N 7312636) to be dispatched; no other Genie-type lift is allowed.
When you replace a component, be aware that other components might be affected by the
replacement, and special steps might be required.
This table identifies the server components that are replaceable. All components listed as ‘No’ in the right
column are CRUs.
These topics explain how to access Oracle ILOM to prepare the components for service in the server. This
includes how to power off and on the server and individual domains.
Target_name is as specified for the following components:
Component Target Name and Values
CMIOUs /DCUs/DCUx/CMIOUy
SPARC M7-8 server with 2 PDoms: x is 0 and y is 0-3, or x is 1 and y is 4-7
SPARC M7-8 server: x is 0 and y is 0-7
SPARC M7-16 server: x is 0-3 and y is 0-15
SPs /Other_Removable_Devices/Service_Processors/Service_Processor_x/Service_Processor_Module_y
Where x is 0 or 1 and y is 0 or 1.
SPPs /Other_Removable_Devices/Proxy_Service_Processors/Proxy_Service_Processor_x/Service_Processor_Module_y
Note - If you issue the prepare_to_remove command, but decide not to remove a component, you must return the component to
service.
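As an illustration (assuming the prepare_to_remove_action and return_to_service_action properties documented for M-series Oracle ILOM; exact syntax may vary by firmware release), preparing a CMIOU for removal and then returning it to service might look like:

```
-> set /DCUs/DCU0/CMIOU0 prepare_to_remove_action=true
-> set /DCUs/DCU0/CMIOU0 return_to_service_action=true
```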
Stopping the server can take some time, and you must wait until the following message appears on the
host console before proceeding to the next step.
-> SP NOTICE: Host is off
Switch off the circuit breakers to remove power from the server only after you have powered off the
server.
Caution - Because standby power is always present in the server, you must switch off the circuit breakers
on the PDUs before accessing any cold-serviceable components.
As soon as PDU circuit breakers are switched on, standby power is applied, and the SP boots.
On multidomain servers, if one of the hosts is already running, you cannot use the start /System
command. The server will return a message that the target is already started.
To verify that the system or domain has power, execute one of the following commands.
• To check the power status of a domain on a single-domain server, at the Oracle ILOM prompt, type:
-> show /System power_state
• To show the power state for a specific domain, at the Oracle ILOM prompt, type:
-> show /Servers/PDomains/PDomain_x/HOST power_state
If you unseat a component, but do not remove it from the server, or, if you use the
prepare_to_remove command for a component, but decide not to remove it, you must return the
component to service before it can be made operational.
On SPARC M7-8 servers, the front indicator panel is located on the CMIOU chassis. On the SPARC M7-16
server, the front indicator panel is located on the switch chassis.
On SPARC M7-8 servers, the rear indicator panel is located on the CMIOU chassis. On the SPARC M7-16
server, rear indicator panels are located on both CMIOU chassis and on the switch chassis.
You must access these components from the front of the server. These components are not part of the
chassis. You must remove these components to access the chassis.
You must access these components from the rear of the server. These components are not part of the
chassis. You must remove these components to access the chassis.
You must access these components from the front of the switch. These components are not part of the
chassis. You must remove these components to access the chassis.
You must access these components from the rear of the switch. These components are not part of the
chassis. You must remove these components to access the chassis.
There are six power supplies in a CMIOU chassis, and four power supplies in a switch chassis (SPARC M7-16
server only). The power supplies for the CMIOU and switch chassis are 2N redundant. In the event that a
power supply fails in either a switch or CMIOU chassis, the system can operate with only three power
supplies in the switch chassis or five power supplies in the CMIOU chassis.
There are no restrictions on which slots the power supplies must be installed in. You can install them in any of the power supply slots.
This is a hot-service procedure that can be performed by a customer while the server is running.
Caution - If a power supply fails and you do not have a replacement available, leave the failed power supply
installed to ensure proper airflow in the server.
Determine which power supply requires service by looking at the Service required LEDs on the power
supplies. Locate the power supply that you want to remove.
Disengage the power supply from the server. At the front of the server, squeeze the green release latch on
the power supply to be removed. Pull the extraction lever toward you to disengage the power supply.
Remove the power supply from the server by sliding it halfway out of the server. Fold the extraction lever back toward the center of the power supply until it latches into place. This keeps the lever from being damaged when you pull the power supply out. Carefully remove the power supply from the server.
Before installing the power supply, be sure to use a grounding strap to protect the equipment from
ESD damage. Remove any protective caps from the connectors on the power supply. Inspect the connectors
on the replacement power supply for damage, as well as the connectors inside the empty slot.
To insert the power supply, open the latch on the replacement power supply. Align the power supply and
slide it into the empty bay. Secure the power supply in the chassis. Press on the center of the power supply
grill until the lowered latch lever moves upward. Lift the lever up and press the lever against the power
supply to fully seat it in the server. Verify that the fault has been cleared and the replaced component is
operational.
The SPARC M7-8 server has a single power module that includes six AC power receptacles and the rear
indicator panel.
The SPARC M7-16 server has three power modules; one on each CMIOU chassis that each include six AC
power receptacles and a rear indicator panel, and one on the switch unit chassis that includes four AC
power receptacles and the rear indicator panel. This is a cold-service procedure that can be performed only
by qualified service personnel.
Remove AC power using the circuit breakers on the appropriate PDU before performing this procedure.
Locate the power module at the rear of the server. In order to gain access to the power module, you must
first remove power using the circuit breakers on the appropriate PDU, remove all CMIOUs or switch units
from the impacted chassis, remove the SPs or SPPs and SP tray from the impacted chassis, unseat the
power supplies in the impacted chassis, and remove all PDECBs from the impacted chassis. Then loosen the
two T-20 captive screws from the top and bottom of the power module to free it from the chassis and slide
the power module out of the server.
Slide the power module partway in and align it with the slot in the side wall. When the module is about
1 inch from being seated in the chassis, locate the spool standoff on the left side and the slot in the
chassis side wall. Press the side wall while sliding the module into the chassis to ensure that the spool
slides properly into the chassis slot. Tighten the two T-20 captive screws on the top and bottom of the
power module to secure it to the chassis.
Complete the following steps to reinstall components you removed to service the power module: reinstall
all PDECBs, reseat the power supplies, reinstall the SP tray and the SPs or SPPs, and reinstall all CMIOUs or
switch units. Return the server to operation and verify that the fault has been cleared and the replaced
component is operational.
There are eight fan modules in the CMIOU chassis. The server will continue to operate at full capacity with
seven fan modules installed in the CMIOU chassis. The server will not operate with fewer than seven fan
modules. If the server is operating with seven fan modules and one or more of those fan modules fails, the
server will continue operating until defined temperature thresholds are exceeded, at which point the
server will perform a graceful shutdown.
This is a hot-service procedure that can be performed by a customer while the server is running if there are
at least seven operating fan modules.
Determine which fan module requires service by looking at the fan module LEDs. To remove the fan
module, push the green button on the handle of the fan module to disengage the fan latch (panel 1). Then
slide the fan module out to remove it from the server (panel 2).
To install a fan module, insert the new fan module into the empty fan slot. Push the fan module into the
slot until it clicks into place to completely seat the module into the slot. Power on the server, if necessary.
Verify that the fault has been cleared and the replaced component is operational.
There are two fan cable assemblies in a CMIOU chassis. This is a cold-service procedure that can be
performed only by qualified service personnel. Remove AC power using the circuit breakers on the
appropriate PDU before performing this procedure. From the rear of the server, remove all four fans from
the side of the chassis that contains the failed fan cable assembly. Unseat the CMIOUs and the SP tray.
Remove all of the internal interconnects from the front of the affected chassis and unscrew the bracket at
the top of the cable assembly.
Release all four panel mount connectors on each tab. On the black tab, push down on the arrow while
sliding the tab to the left until the pegs on the tab align with the slats in the metal brace. Press the tab back
to release it. Remove the cable through the opening in the chassis.
To install a CMIOU chassis fan cable assembly, slide the new cable through the opening in the chassis
where the failed cable was previously installed. Insert the four panel mount connectors through each of the
receiving tabs and slide the black tab to the left to secure it.
Screw the top bracket of the cable assembly to the chassis and reinstall the interconnects in the same slots
from which you removed them. Reseat the SP tray and the CMIOUs. Reinstall the fans in the same slots
from which you removed them. Switch on the appropriate PDU circuit breakers and power on the server.
Verify that the fault has been cleared and the replaced component is operational.
There are 36 fan modules in the switch chassis. Each switch unit has six dedicated fan modules. The server
will continue to operate at full capacity with five of the six fan modules operating for each switch unit. If
the server is operating with one or more switch units that have only five fan modules, and one or more of
those fan modules fails, the server will continue operating until defined temperature thresholds are
exceeded, at which point the server will perform a graceful shutdown.
This is a hot-service procedure that can be performed by a customer while the server is running if there are
at least five of the six fan modules operating for each switch unit.
Determine which fan module requires service. Remove the fan module by pressing the green latch above
the LEDs to the right to disengage the fan module latch (panel 1). Grasp the metal handle on the fan
module and slide the module out to remove it from the server (panel 2).
To install a fan module, insert the new fan module into the empty fan slot. Push the fan into the slot until it
clicks into place to completely seat the fan module into the slot. Power on the server, if necessary and
verify that the fault has been cleared and the replaced component is operational.
There are two redundant SPs in a SPARC M7-8 server, SP0 and SP1. In a SPARC M7-16 server, the SPs in each
CMIOU chassis are referred to as SPPs. SPP2 and SPP3 are in the top CMIOU chassis, and SPP0 and SPP1 are
in the bottom CMIOU chassis. A SPARC M7-16 server also has two SPs in the switch chassis. The SPPs in the
CMIOU chassis manage DCU activity, and the SPs in the switch chassis manage system activity.
Note - It is recommended that you replace a failed SP or SPP as soon as possible. Do not remove an SP or
SPP until a replacement is available.
Before replacing an SP, save the configuration using the Oracle ILOM backup utility. Refer to the Oracle
ILOM documentation for instructions on backing up and restoring the Oracle ILOM configuration.
After replacing an SP, the SP firmware, the host firmware, and the configuration will be restored from the
existing Active SP to the new SP.
The SP that is identified as the Active SP manages system activity. On SPARC M7-8 servers, this SP is located
in the CMIOU chassis. On a SPARC M7-16 server, this SP is located in the switch chassis.
To change the current roles of the SP pair, set the initiate_failover_action variable to true.
You will need to change the SP that is currently identified as the Active SP to be the Standby SP if you are
replacing it.
This example shows that SPP0/SPM0 is assigned the status of the DCU-SPM.
You cannot specify which SPM will serve as the DCU-SPM or PDomain-SPM, but you can specify that Oracle
ILOM should assign a new SPM in the pair to serve these roles. You might do this when, for example, you
are replacing an SPP.
Complete this task to change the current roles of the SPMs. You would need to change the SPM in the pair
that is currently identified as the DCU-SPM to be the Standby-SPM if you are replacing it.
In SPARC M7-8 servers, there are two SPs in the CMIOU chassis. In a SPARC M7-16 server, there are two
SPs in the switch chassis. On all servers, one SP serves as the Active SP, and one serves as the Standby SP.
The Active SP manages DCU activity and all system resources unless it can no longer do so, in which case
the Standby SP assumes its role.
Note - If you issue the prepare_to_remove command for an SPM, you must also prepare the SP for
removal. You must then return the SP to service for the SPM to restart. This automatically happens when
an SP is physically removed and installed into the slot, but if you do not physically install an SP into the slot,
you must issue the return_to_service command for the SP.
If state_pcie returns a value of Online, log onto the host and prepare the PCIe card for removal by
taking the devices offline for the desired SP/SPM slot.
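As a hedged sketch, the checks and commands described in the notes above might look like the following
from the Oracle ILOM CLI. The SPM and SP target paths shown are assumptions; verify the exact paths
against your firmware documentation before use.
-> show /System/SP/SPMs/SPM0 state_pcie
-> set /System/SP/SPMs/SPM0 action=prepare_to_remove
-> set /System/SP/SPs/SP0 action=prepare_to_remove
-> set /System/SP/SPs/SP0 action=return_to_service
The final return_to_service command is needed only if you prepared the SP for removal but did not
physically remove and reinstall it.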
In a SPARC M7-16 server, there are two SPPs in each CMIOU chassis, and two SPs in the switch chassis. SPPs
have two service processor modules (SPMs) each; SPs have one SPM each.
Note - If you issue the prepare_to_remove command for an SPM, you must also prepare the SPP for
removal. You must then return the SPP to service for the SPM to restart. This automatically happens when
an SPP is physically removed and installed into the slot, but if you do not physically install an SPP into the
slot, you must issue the return_to_service command for the SPP.
To achieve redundancy, each SPM (SPM0 and SPM1) on an SPP is assigned to a different DCU. SPM0 on
each SPP manages one DCU in the CMIOU chassis, while SPM1 on each SPP manages the other DCU in the
chassis.
Before you remove an SPP, you must ensure that neither of the SPMs on the SPP is the DCU-SPM, the SPM
that manages DCU activity.
It can take up to two minutes for the SPMs to go offline. Once they are offline, you can prepare the SPP
for removal.
If state_pcie returns a value of Online, log onto the host and take the card offline for the desired
SP/SPM slot. Setting action=prepare_to_remove powers off the SPP and turns on the blue
Ready to Remove LED on the SPP.
This is a hot-service procedure that can be performed only by qualified service personnel. The procedure is
the same whether you are removing an SP or an SPP.
Note - Only remove an SP or SPP when you have verified that the blue Ready to Remove LED on the SP or
SPP is lit.
To remove an SP or SPP:
If you can access the SP or SPP, back up the configuration information:
-> cd /SP/config
-> dump -destination uri target
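For example, with a hypothetical TFTP server address and file name (placeholders, not values from this
material; use a URI that is reachable from the SP network management port):
-> cd /SP/config
-> dump -destination tftp://192.0.2.10/sp-backup.config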
If you are replacing a faulty SP or SPP, relocate any existing port plugs. Remove any port plugs that are
installed in the serial and network ports and install them in the same ports on the replacement SP or SPP.
Label and disconnect the cables attached to the serial and network ports. Pinch the ejector latches and
open the ejector arms.
Pull the SP or SPP halfway out of the SP tray. The SPs and SPPs are tightly secured in their bays. Pull firmly
on the ejector latches to free the SP or SPP from the bay. Close the ejector arms. This will keep the arms
from getting damaged when you pull the SP or SPP out. Carefully remove the SP or SPP from the SP tray,
using two hands, and avoid bumping the rear connectors. Place the SP or SPP on an antistatic mat.
A single battery is installed in each SP and SPP in the server.
This procedure can be performed only by qualified service personnel after removing the SP or SPP that
contains the battery. You do not need to power down the server before performing this procedure, but you
do need to ensure that the SP or SPP that you remove is not managing system or DCU activity.
Release the thumb screw and slide the top cover on the SP or SPP to the rear. Lift the top cover of the SP or
SPP. Locate the system battery in the SP or SPP. Using a fingernail or flathead screwdriver, lift the battery
out of the holder. Insert the new battery in the SP or SPP, with the positive side (+) facing out. Replace the
top cover of the SP or SPP.
An SPM will automatically restart following the installation of an SP or SPP. Insert the SP or SPP into the slot
and slide it in until the extraction levers start to close. Close the extraction levers fully until they lock into
place. Reinstall the serial management and network management cables.
The newly installed SP/SPP will update its system firmware from the Active SP in the system. Confirm that
the correct system firmware is running. If an update is needed, update the system firmware on the Active
SP (not on the newly installed SP/SPP).
Restore the configuration information.
-> cd /SP/config
-> load -source uri target
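For example, restoring from the same hypothetical TFTP location used in the backup step (the server
address and file name are placeholders):
-> cd /SP/config
-> load -source tftp://192.0.2.10/sp-backup.config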
Verify that the fault has been cleared and the replaced component is operational.
There is one SP tray in each chassis. Each SP tray contains two SPs or SPPs. This is a cold-service procedure
that can be performed only by qualified service personnel.
Remove AC power using the circuit breakers on the appropriate PDU before performing this procedure.
You must unseat an SP tray before you can remove the interconnects and before you can remove the SP
tray.
Locate the SP tray in the server. Power off the server and switch off the appropriate PDU circuit breakers.
Label and disconnect the cables attached to the serial and network ports on the SPs or SPPs. Remove the
SPs or SPPs to gain access to the SP tray ejector locking latches. Press the ejector latches to release the SP
tray. Using a flathead screwdriver, press up on the small metal latch inside the recess of the tray to release
the tray from the chassis. Repeat this step for the second latch.
To unseat the SP tray, disengage the SP tray (panel 1) by opening and squeezing the ejector latches
together. Carefully slide the tray out towards you and press the levers back together, toward the center of
the SP tray (panel 2). This will keep the levers from getting damaged while you are servicing the server.
Then determine your next step. If you are removing the SP tray, pull the ejector latches out and slide the SP
tray completely out of the server. If you are unseating the SP tray to service another component, refer to
the service procedures for the component.
Before installing a new SP tray into an empty bay, open the ejector latches so that they are fully open. Then
slide the SP tray into its slot in the server until the ejector latches begin to engage. Ensure that the tray
aligns with the alignment block on the left side of the chassis and the metal tab on the right side of the
chassis before latching the tray in place. If you do not properly align the tray, the latch will not engage.
Press the ejector latches back together toward the center of the SP tray, and then press the latches firmly
against the SP tray to fully seat the tray back into the server. The latches should click into place when the
tray is fully seated in the server. Reinstall the SPs or SPPs you removed from the SP tray. Reinstall the
cables you removed from the SPs or SPPs. Switch on the appropriate PDU circuit breakers and power on
the server and verify that the fault has been cleared and the replaced component is operational.
There is one SP internal interconnect assembly in each CMIOU chassis. These assemblies connect the
CMIOUs to the SPs or SPPs. There is one SP internal interconnect assembly in the switch chassis. This
assembly connects switch units to the SPs.
This is a cold-service procedure that can be performed only by qualified service personnel. Remove AC
power using the circuit breakers on the appropriate PDU before performing this procedure. The
replacement procedures are the same whether you are replacing an SP internal interconnect assembly in
the CMIOU chassis or in the switch chassis.
Determine which interconnect requires service. Power off the server and switch off the appropriate PDU
circuit breakers. Locate the SP internal interconnect assembly at the front of the server. To gain access to
the SP internal interconnect assembly: From the rear of the server, remove the SPs or SPPs and the SP tray
from the impacted chassis. Then unseat all CMIOUs or switch units from the impacted chassis. Remove all
interconnect assemblies that are installed adjacent to the SP internal interconnect assembly you are
removing. From the front of the server, remove the SP internal interconnect assembly by loosening the
captive screws on the SP internal interconnect assembly to free it from the chassis.
Remove two T-20 screws that are located inside the square holes on the white, horizontal strip that
extends to the left of the SP internal interconnect assembly.
If you are using a screwdriver that has interchangeable bits, the bit needs to be 2 inches long because the
socket on the driver does not fit through the square holes.
Carefully slide the SP internal interconnect assembly out of the chassis.
Insert the guide pin shield to avoid damaging the connector pins on the rear of the SP internal interconnect
assembly.
If you are installing an SP internal interconnect assembly in the CMIOU chassis, place the guide pin shield as
shown in panel 1.
If you are installing an SP internal interconnect assembly in the switch chassis, place the guide pin shield as
shown in panel 2.
From the front of the server, carefully slide the SP internal interconnect assembly into the chassis.
Caution - The small connector pins on the back of the assembly are susceptible to damage. Align the
assembly in the chassis and install the assembly slowly to avoid bending or otherwise damaging them.
Secure the SP internal interconnect assembly to the chassis by tightening the captive screws on the face of
the assembly.
Install the T-20 screws in the white, horizontal strip that extends to the left of the SP internal interconnect
assembly. Complete the following steps to reinstall components you removed to service the assembly.
Reinstall all of the interconnect assemblies that you removed. Reseat all CMIOUs or switch units from the
impacted chassis. From the rear of the server, reinstall the SP tray and the SPs or SPPs. Switch on the
appropriate PDU circuit breakers and power on the server and verify that the fault has been cleared and
the replaced component is operational.
There are eight CMIOUs in a CMIOU chassis.
If you issue the prepare_to_remove command, but decide not to remove the CMIOU, you must return
the component to service. To do this, either issue the return_to_service command or physically
remove the CMIOU from the server and reinstall it. When setting the action variable for:
• SPARC M7-8 server with 2 PDoms: x is 0 and y is 0–3, or x is 1 and y is 4–7
• SPARC M7-8 server: x is 0 and y is 0–7
• SPARC M7-16 server: x is 0–3 and y is 0–15
When the CMIOU is ready to remove, this command will return Offline, and the blue Ready to Remove LED
will light.
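As an illustrative sketch using the x and y placeholders described above, the commands might look like
the following. The target path shown is an assumption; consult the server administration guide for the
exact path used by your firmware.
-> set /Servers/PDomains/PDomain_x/HOST/cmiou_y action=prepare_to_remove
-> show /Servers/PDomains/PDomain_x/HOST/cmiou_y
When the CMIOU is ready to remove, the show command returns Offline and the blue Ready to Remove LED
lights.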
Verify that the blue Ready to Remove light on the CMIOU is on. Unseat the CMIOU by pinching the latch on
the back of each ejector arm (panel 1). Pull the ejector arms toward you to disengage the CMIOU
connectors from the server (panel 2). Grasp the ejector arm latches as close to the base as possible and pull
the CMIOU one-third to halfway out of the server (panel 3).
This is a hot-service procedure that can be performed while the server is running.
To remove a CMIOU, ensure that you have prepared the CMIOU for removal. Unseat the CMIOU. Remove
the CMIOU from the server by folding the ejector arms back together, toward the center of the CMIOU
until they latch into place (panel 1). This will keep the levers from getting damaged when you pull the
CMIOU out. Carefully remove the CMIOU from the server, using two hands, and avoid bumping the rear
connectors (panel 2).
Caution - The CMIOU is heavy, and most of its weight is at the rear. The CMIOU weighs 25 lbs (11.3 kg).
Use two hands to remove the CMIOU from the chassis.
Place the CMIOU on an antistatic mat. Install the plastic covers that you removed from the connectors on
the new CMIOU on the connectors of the CMIOU you are replacing. If you are replacing DIMMs, a faulty
eUSB disk, or a faulty CMIOU (which involves removing DIMMs for installation in the new CMIOU), remove
the CMIOU top cover. Do not remove the top cover if you are removing a CMIOU to access another
component for service. Press down on the green button at the top of the cover to disengage the cover from
the CMIOU. While pressing the button, grasp the rear edge of the cover and slide it toward the rear of the
CMIOU until it stops. Lift the cover off.
Some component replacement tasks (for example, those for interconnects and PDECBs) require you to
unseat a CMIOU before you perform them. After completing those tasks, you need to reseat the CMIOU.
If you removed the CMIOU cover, reinstall it and slide the cover forward until the latch clicks into place.
Install the CMIOU by carefully sliding the CMIOU less than halfway into the slot, taking care to avoid
bumping the connectors on the back of the CMIOU (panel 1). Open the green CMIOU levers so that they
are fully open (panel 2). Insert the CMIOU back into its slot in the server until the levers begin to engage
(panel 3). Press the levers back together toward the center of the CMIOU, and then press the levers firmly
against the CMIOU to fully seat it into the server (panel 4). The levers should click into place when the
module is fully seated in the server. If you replaced the entire CMIOU or removed a CMIOU as part of
another service procedure, connect to the host console, then connect to the host and restart it. Verify that
the fault has been cleared and the replaced component is operational.
Full-population and half-population configurations are supported. This image shows a CMIOU with all of the
DIMM slots populated with DIMMs. In a half-populated configuration, DIMMs are installed only in the slots
with the end ejector levers displayed in black.
In some cases, the LEDs for two DIMMs will be identified as faulty. In this case, replace both of the
DIMMs that are identified. Confirm that the DIMM next to the illuminated DIMM Fault LED is the same DIMM that was
reported to be faulty by the fmadm faulty command. Visually check to ensure that all of the other
DIMMs are seated properly in their slots.
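For example, the reported fault can be checked from the Oracle ILOM fault management shell. The output of
fmadm faulty is not shown here because its exact format varies; compare the DIMM location it reports with
the DIMM Fault LED that is lit on the CMIOU.
-> start /SP/faultmgmt/shell
faultmgmtsp> fmadm faulty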
DIMMs can be serviced by customers after removing the CMIOU from the CMIOU chassis. Be sure that
you are properly grounded. Once you have removed the CMIOU and its cover and have identified the
DIMM to be replaced, remove the DIMM. Push down on the ejector tabs on each side of the DIMM
until the DIMM is released. Then grasp the top corners of the faulty DIMM, and lift it out of its slot.
Place the DIMM on an antistatic mat.
Unpack the replacement DIMM, and place it on an antistatic mat. You must install the same size and capacity of
DIMMs that are already in the existing CMIOU. Ensure that the ejector tabs on the connector that will receive the
DIMM are in the open position. To install the DIMM, align the DIMM notch with the key in the connector. Be sure that
the orientation is correct. The DIMM might be damaged if the orientation is reversed. Push the DIMM into the
connector until the ejector tabs lock the DIMM in place. If the DIMM does not easily seat into the connector, check
the DIMM's orientation.
Install the CMIOU into the server and verify that the fault has been cleared and the replaced DIMM is operational.
Each CMIOU in the server has three slots, each of which can hold one PCIe hot-plug card carrier. Each of
these carriers contains a single low-profile PCIe card.
Caution - To remove a PCIe card that is assigned to an I/O domain, first remove the device from the I/O
domain. Then, add the device to the root domain before you physically remove the device from the system.
These steps enable you to avoid a configuration that is unsupported by the Direct I/O or SR-IOV feature.
The server supports the PCIe 16-lane format. PCIe card slots connect to root complexes in CMPs that are
mounted on CMIOUs. Each CMIOU has three slots, each of which can hold a single PCIe card carrier. The
slot numbers repeat from 1 to 3 in each CMIOU, so to identify the physical location of a card you must
specify the CMIOU number, as well as the card slot number. Note that the physical numbering for card slots
starts at 1 (not 0) and ends at 3 (not 2).
PCIe cards are hot-service components that can be replaced at any time if the card is not currently in use. If you
have not already done so, determine which PCIe card requires service. Refer to the M7 Service Manual. Ensure that
you have already taken antistatic measures. Press the ATTN button on the carrier that contains the PCIe card that you
wish to remove. The LEDs on the carrier blink for approximately 10 seconds as the PDomain disables the I/O card.
When the LEDs on both the carrier and the card turn off, the carrier and card are ready to be removed. Label and
remove any I/O cables from the PCIe card. To remove the carrier from the slot, pull the carrier’s extraction lever.
Swing the extraction lever out 90 degrees until the far end of the lever begins to push the carrier out of the slot.
Remove the carrier from the slot.
To remove the PCIe card from the carrier, press the green tab to unlock the carrier latch and open the top
of the PCIe carrier (panel 1). Slide the card from the slot (panel 2). Avoid twisting, tilting, or pulling
unevenly on the PCIe card, which could damage the carrier slot or components on the PCIe carrier circuit
board. Place the PCIe card on an antistatic mat or into its antistatic packaging.
To install the PCIe card in the carrier, unlatch and swing open the arm of the PCIe card carrier, and insert
the new PCIe card until the bottom connector is firmly seated in the carrier's connector (panel 1). The card
is correctly seated only when the notch at the top of the card bracket fits around a guide post on the
carrier. Do not twist or turn the PCIe card as you insert it into the carrier. The card's connector must be
fully seated in the carrier's slot before you attempt to close the top cover.
Note - If the PCIe card includes a mounting screw, do not use the mounting screw. The carrier does not
accept mounting screws.
Close the top of the carrier (panel 2). The green latch should click into place. If the top is difficult to close,
verify that the notch of the card bracket or filler panel fits around the guide post.
To install the carrier and I/O card into the CMIOU slot, push evenly on both sides of the carrier so that the
carrier slides straight into the slot (panel 1). If the carrier slides correctly into the slot, you should feel a
slight resistance as the carrier starts to seat in the connector.
Caution - Do not push the extraction lever while you insert the carrier into the slot. The carrier can enter at
an angle and damage the connectors.
Lock the carrier’s extraction lever (panel 2) and attach the I/O cables to the card. Press the ATTN button on
the carrier to reconfigure the PCIe card into the PDomain. The carrier’s LEDs should blink for a few seconds
until the PDomain enables the PCIe card. The card’s LEDs will show activity when the card is enabled.
There is one PDECB (Power Distribution Electronic Circuit Breaker) for each CMIOU, and one for each
switch unit (SPARC M7-16 server only). The replacement procedures are the same for all PDECBs. This is a
hot-service procedure that can be performed by a customer while the server is running and after removing
the CMIOU or switch unit that is installed in front of it.
Determine which PDECB requires service. Ensure that the CMIOU that contains the PDECB you are
removing displays an amber LED. Remove the CMIOU or switch unit that is installed in front of the failed
PDECB.
Press the latch on the bottom of the PDECB to the left and slide it out of the chassis.
To install a PDECB, slide the new PDECB into the empty slot. Install the removed CMIOU or switch unit.
Verify that the fault has been cleared and the replaced component is operational.
A single eUSB disk is installed in each CMIOU in the server. This is a cold-service procedure that can be
performed only by qualified service personnel after removing the CMIOU that contains it. Power down the server
completely before performing this procedure. Determine which eUSB disk requires service. Shut down the
server and remove the CMIOU that contains the faulty eUSB disk. Remove the CMIOU top cover by pressing
down on the green button at the top of the cover to disengage the cover from the CMIOU. While pressing
the button, grasp the rear edge of the cover and slide it toward the rear of the CMIOU until it stops. Lift the
cover off. To remove the eUSB disk, remove the green T-10 screw that secures the eUSB disk to the CMIOU
(panel 1). Carefully lift the eUSB disk out of the CMIOU (panel 2).
To install the new eUSB disk, position the new eUSB disk in the CMIOU (panel 1). Ensure proper pin
engagement with the connectors before pressing the eUSB disk into place. Tighten the green screw to secure the
eUSB disk to the CMIOU (panel 2). Reinstall the CMIOU you removed to access the eUSB disk.
The front indicator panel is located on the front of the CMIOU chassis for SPARC M7-8 servers, and on the
front of the CMIOU chassis and switch chassis for SPARC M7-16 servers. The service procedures are the
same on either chassis.
This is a cold-service procedure that can be performed only by qualified service personnel. Remove AC
power using the circuit breakers on the appropriate PDU before performing this procedure. Locate the
indicator panel at the front of the server. Remove the Torx screws and loosen the captive screws to remove
the face plate.
With the face plate removed, remove the No. 1 Phillips screws to free the indicator panel from the chassis.
Unplug the small cable from the panel and remove the panel completely from the server.
Plug the small cable from the chassis into the new indicator panel and position the panel in place. Install
the No. 1 Phillips screws to secure the indicator panel to the chassis.
Position the face plate over the panel and install the Torx screws and tighten the captive screws to secure
the face plate to the panel. Switch on the appropriate PDU circuit breakers and power on the server. Verify
that the fault has been cleared and the replaced component is operational.
The front indicator panel cable connects the front indicator panel to the chassis. This is a cold-service procedure
that can be performed only by qualified service personnel. Remove AC power using the circuit breakers on
the appropriate PDU before performing this procedure. From the rear of the server, remove the SPs or
SPPs and unseat the SP tray from the impacted chassis. Remove and unseat the required CMIOUs and
switch units from the impacted chassis. If the cable you are replacing is in the CMIOU chassis, remove the
top four CMIOUs and unseat the lower four CMIOUs. If the cable you are replacing is in the switch unit
chassis, remove the top three switch units and unseat the lower three switch units. If the cable you are
replacing is in the CMIOU chassis, remove the four left fan modules. Remove all of the interconnects from
the front of the affected chassis. Remove the front indicator panel. If the cable you are replacing is in the
CMIOU chassis, remove the left CMIOU chassis fan cable assembly. Release the Molex connector by
reaching through the hinging access door on the chassis divider floor where the SP tray slides to unlatch the
Molex connector.
Remove the two captive screws that secure the cable to the chassis.
Remove the cable through the opening in the chassis. If the cable you are replacing is in the CMIOU chassis,
remove the cable through the opening between the fan modules and the CMIOU area (panel 1). If the cable
you are replacing is in the switch unit chassis, remove the access panel on the left side of the interconnect
assembly area and slide the cable through the opening (panel 2).
Install the cable through the opening in the chassis and secure the Molex connector. If the cable you are
replacing is in the CMIOU chassis, from inside the CMIOU area, feed the cable through the area behind the
fan module bays (panel 1). If the cable you are replacing is in the switch unit chassis, feed the cable through
the area where the access panel on the left side of the interconnect assembly was removed (panel 2).
Reinstall the two captive screws that secure the cable to the chassis. If the cable you are replacing is in the
switch unit chassis, reinstall the access panel on the left side of the interconnect assembly area. If the cable
you are installing is in the CMIOU chassis, reinstall the left CMIOU chassis fan cable assembly. Reinstall the
front indicator panel and all of the interconnects in the front of the affected chassis. You must return
interconnect assemblies to their original slots in the server. Reinstall the removed fan modules. If the cable
you are replacing is in the CMIOU chassis, reinstall the four left fan modules. Reinstall and reseat the
required CMIOUs and switch units from the impacted chassis. If the cable you are installing is in the CMIOU
chassis, reinstall the top four CMIOUs and reseat the lower four CMIOUs. If the cable you are installing is in
the switch unit chassis, reinstall the top three switch units and reseat the lower three switch units. Reseat
the SP tray and reinstall the SP or SPP in the server. Switch on the appropriate PDU circuit breakers and
power on the server and verify that the fault has been cleared and the replaced component is operational.
SPARC M7 servers receive power from six PDU power cords, which provide power to the two PDUs in the
rack. To ensure the redundant operation of the power supplies, the server must receive power from two
separate power grids, with three power cords receiving power from one power grid and the remaining
three PDU power cords receiving power from a second power grid. For example, in a SPARC M7-8 server,
AC inputs labeled AC0, AC2, and AC4 are connected to one PDU, and AC inputs labeled AC1, AC3, and AC5
are connected to the other PDU. All six PDU power cords must be connected.
When facing the rear of the server, PDU-A is at the left side and PDU-B is at the right side. Each PDU has
nine circuit breakers, one for each socket group.
Determine which PDU requires service. Disconnect the PDU input power cords that connect the faulted
PDU to the facility AC power source. Unpack the replacement PDU on a static-safe mat, open the rear
server door, and attach an antistatic wrist strap. Confirm that all PDU circuit breakers are switched off by
ensuring that the circuit breakers on both the faulted and replacement PDUs are turned completely off, as
shown in 2 above.
This is a cold-service component that can be replaced only by qualified service personnel after powering off
the server. Ensure that you have prepared the faulty PDU for removal. Shut down and power off any
ancillary equipment installed in the rack. From the rear of the server, switch off all of the PDU circuit
breakers in the rack in the following sequence.
Press down on each Off (0) toggle switch to power off the PDU. These circuit breakers are at the rear
of the rack cabinet.
For a SPARC M7-8 server: R6, R7, R8, L2, L1, L0
For a SPARC M7-16 server: R0, R1, R2, L8, L7, L6, R4, R5, L5, L4, R6, R7, R8, L2, L1, L0
where R indicates the right PDU from the rear of the server, L indicates the left PDU from the rear of the
server, and the number represents the PDU group number.
Caution - Because standby power is always present in the server, you must switch off the circuit breakers
on the PDUs before accessing any cold-serviceable components.
Disconnect any power jumper cords connected to the faulty PDU from equipment in the rack. Note where
these jumper cords were attached to the PDU. You will need to reinstall the jumper cords in the same
locations on the new PDU. Cut any tie-wraps securing the faulty PDU power input lead cords to the tie-
down brackets. Disconnect the grounding strap connecting the top of the faulty PDU to the rack.
If the rack included a factory-installed PDU, use a T-25 Torx wrench key to remove the four M5 screws and
washers securing the faulty PDU to the mounting brackets. These screws secure the PDUs for shipping
purposes. You might have already removed these screws when you installed the rack at the installation
site. Carefully lift the faulty PDU up and off the mounting brackets, remove it from the rack, and place it
on a clean work table.
Always install the replacement PDU in the same location as the original PDU. If installed closer to the center
of the rack, the PDU will interfere with the installed components. If installed nearer to the rear of the rack,
the PDU will interfere with the cable management hooks and you will not be able to access the PDU circuit
breakers.
Lift up the replacement PDU and, while ensuring that the circuit breakers are facing the rear of the rack,
carefully set the replacement PDU's standoff bolts into the top and bottom bracket's keyhole slots.
Caution - You need two people to lift and secure the PDU to the rack. The PDU is held in the rack by gravity,
with the standoff bolts resting in the mounting brackets’ keyhole slots. The circuit breakers must face the
rear of the rack so that you can reset a breaker if one trips.
(Optional) Use a T-25 Torx wrench and four M5 shipping screws and washers to secure the replacement
PDU to the mounting brackets. For extra durability, secure the PDU to the mounting brackets using the
shipping screws and washers (two screws and washers per bracket). If you plan to ship the rack to another
location, you must secure the PDU using these shipping screws.
Route the power input lead cords between the rear rail and side panel. The replacement PDU has three
power input lead cords, which you must route between the side panel and the rear rail. Route the power
input lead cords either down through the bottom of the rack or up through the top of the rack, depending
on where you plan to connect them to the main power source. Never twist, kink, or tightly bend a power
input lead. Using tie-wraps, secure the replacement PDU input lead cables to the cable routing brackets.
Ensure that you have switched off every PDU circuit breaker on the replacement PDU. Locate the
replacement PDU input lead cord connectors. Depending on how you routed the cords when you installed
the PDUs, route these cords either out the bottom of the rack or out the top. Connect the replacement
PDU power lead cords to the facility AC power source. If your rack contains two PDUs, ensure that each
PDU is connected to different AC power source circuits and reinstall the jumper cords in the same locations
from which you removed them.
Reconnect the grounding strap connecting the top of the replacement PDU to the rack.
Switch the circuit breakers on in the following sequence:
For a SPARC M7-8 server: L2, L1, L0, R6, R7, R8
For a SPARC M7-16 server: R4, R5, L5, L4, R6, R7, R8, L2, L1, L0, R0, R1, R2, L8, L7, L6
where R indicates the right PDU from the rear of the server, L indicates the left PDU from the rear of the
server, and the number represents the PDU group number.
Note - As soon as PDU circuit breakers are switched on, standby power is applied, and the SP boots. Restart
the server.
A SPARC M7-8 server has a single CMIOU chassis. A SPARC M7-16 server has two CMIOU chassis and a
single switch chassis. The steps involved in replacing either type of chassis are the same. This is a cold-
service procedure that can be performed only by qualified service personnel.
Remove AC power using the circuit breakers on the appropriate PDU before performing this procedure. At
the front of the server, remove the power supplies. At the rear of the server, remove the cables (ensure
that you label the cables), SPs, SP tray, fan modules (SPARC M7-16 server only), CMIOUs, Switch units
(SPARC M7-16 server only), PDECBs, and power modules.
Caution - Do not attempt to remove the chassis without the aid of another person or a mechanical
lift.
At the front of the server, remove the interconnects and fan modules. From the rear of the chassis, remove
the top hold-down brackets.
From the front of the server, deploy the anti-tilt legs by pulling the pin to release the anti-tilt leg
while pulling the leg from the bottom (panel 1). Loosen the leveling foot until it makes solid
contact with the ground (panel 2). Repeat these steps for the second leg. Both legs must be
deployed. If you have another person to assist you, remove the chassis from the rack, and place it
on an appropriate surface. If you are alone, place a mechanical lift under the chassis, and remove
the screws that fasten it to the rack.
If you are alone, place the chassis on a mechanical lift. Install the chassis into the rack and secure the
chassis with the five screws that were removed.
At the front of the server, install the interconnects and fan modules.
At the rear of the server, install the power modules, PDECBs, switch units (SPARC M7-16 server only),
CMIOUs, fan modules (SPARC M7-16 server only), SP tray, SPs and cables.
At the front of the server, install the power supplies. From the rear of the server, install the top hold-down
brackets. Switch on the appropriate PDU circuit breakers and power on the server. Verify that the fault has
been cleared and the replaced component is operational.
In SPARC M7-8 servers, five internal interconnect assemblies in the CMIOU chassis connect the CMIOUs to
each other.
This is a cold-service procedure that can be performed only by qualified service personnel. Remove AC
power using the circuit breakers on the appropriate PDU before performing this procedure.
Determine which interconnect requires service. Locate the internal interconnect assembly at the front of
the server. Be sure to label the internal interconnect assembly slots and assemblies. If you are removing
one or more internal interconnect assemblies and will return them to the server (for example, because you
are removing them to gain access to other components), you must return them to their original locations in
the server. Take care to properly label each slot and assembly to ensure that it is properly reinstalled.
From the rear of the server, remove the SPs or SPPs and unseat the SP tray from the impacted chassis.
Unseat all CMIOUs or switch units from the impacted chassis. From the front of the server, remove the
internal interconnect assembly by loosening the screws on the assembly. When the latch springs out, press
it in to free the assembly from the chassis.
Grasp the assembly and slide it out of the chassis.
You must return internal interconnect assemblies to their original slots. If you are inserting one or more
previously installed internal interconnect assemblies to the server (for example, because you removed
them to gain access to other components), you must return them to their original locations in the server.
From the front of the server, carefully slide the internal interconnect assembly into the chassis. The small
connector pins on the back of the assembly are susceptible to damage. It is important that you align the
assembly in the chassis and install the assembly slowly to avoid bending or otherwise damaging them.
Tighten the captive screws on the face of the assembly to secure the internal interconnect assembly to the
chassis.
Reseat all CMIOUs or switch units and the SP tray and reinstall the SPs or SPPs. Switch on the
appropriate PDU circuit breakers and power on the server. Verify that the fault has been cleared and the
replaced component is operational.
There is one switch chassis in the SPARC M7-16 server. This is a cold-service procedure that can be
performed only by qualified service personnel. Power down the server completely before performing this
procedure.
At the rear of the server, remove all of these components in the order indicated: cables (ensure that you
label the cables), SPs, SP tray, switch units, PDECBs, and power module, including AC inlet strip and front
indicator panel.
Caution - Do not attempt to remove the chassis without the aid of another person or a mechanical
lift.
At the front of the server, remove the external interconnects and SP interconnect. From the rear of the
chassis, remove the top hold-down brackets.
From the front of the server, deploy the anti-tilt legs. Pull the pin to release the anti-tilt leg while pulling the
leg from the bottom (panel 1). Loosen the leveling foot until it makes solid contact with the ground (panel
2). Repeat these steps for the second leg. Both legs must be deployed.
If you have another person to assist you, remove the chassis from the rack, and place it on an appropriate
surface. If you are alone, place a mechanical lift under the chassis, and remove the screws that fasten it to
the rack.
If you are alone, place the chassis on a mechanical lift. Install the chassis into the rack, and secure the
chassis with the five screws that were removed.
At the front of the server, install the SP interconnect and external interconnects. At the rear of the server,
install the power module, which includes the AC input strip and the front indicator panel, PDECBs, switch
units, SP tray, SPs and cables (ensure that you install the cables as they were originally installed). At the
front of the server, install the power supplies.
From the rear of the server, install the top hold-down brackets. Switch on the appropriate PDU circuit
breakers and power on the server. Verify that the fault has been cleared and the replaced component is
operational.
In SPARC M7-16 servers, switch units are part of the scalability feature that allows a PDomain to control
DCUs. Switch units are configured to work together as a single unit. If a switch unit fails, the server will
operate in degraded mode. At least five switch units must be functioning for the server to operate. It is
recommended that you replace a failed switch unit as soon as possible.
Determine which switch unit requires service. If you are replacing a switch unit, unpack the new switch unit
on a static-safe mat.
Note - If you issue the prepare_to_remove command, but decide not to remove the switch unit, you
must return the component to service. To do this, either issue the return_to_service command or
physically remove the switch unit from the server and reinstall it.
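The two actions referenced in the note take the same general form. As a sketch, for scalability switch board n (the target path follows the return_to_service example shown for restarting the switch unit):

```
-> set /System/Other_Removable_Devices/Scalability_Switch_Boards/Scalability_Switch_Boardn action=prepare_to_remove
-> set /System/Other_Removable_Devices/Scalability_Switch_Boards/Scalability_Switch_Boardn action=return_to_service
```

After issuing prepare_to_remove, verify that the blue Ready to Remove LED on the switch unit is lit before unseating it.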
Remove the plastic covers from the connectors on the new switch unit and set them aside for installation
on the old switch unit connectors, once you have removed it from the system.
Some component replacement tasks (for example, those for interconnects and PDECBs) require you to
unseat switch units before you perform them.
When removing a switch unit, verify that the blue Ready to Remove LED on the switch unit is on. Unseat
the switch unit by pulling the ejector arm out to disengage the switch unit from the server (panel 1). Press
the arm back toward the unit to prevent it from being damaged (panel 2).
This is a hot-service procedure that can be performed by a customer while the server is running. Pull the
switch unit out of the server less than halfway, then carefully remove the switch from the server, taking
care to avoid bumping the rear connectors. Place the switch on an antistatic mat. Install the plastic covers that you
removed from the connectors on the new switch on the connectors of the switch you are replacing.
Install the switch unit by opening the ejector arm so that it is fully open (panel 1). Install the new switch
unit into its slot in the server until the ejector arm begins to engage (panel 2). Press the arm back toward
the switch unit, and then press the arm firmly against the switch unit to fully seat it back into the server
(panel 3). The lever should click into place when the switch unit is fully seated in the server.
Restart the switch unit.
• -> set /System/Other_Removable_Devices/Scalability_Switch_Boards/Scalability_Switch_Boardn action=return_to_service
Restart the server.
• -> start /System
In a SPARC M7-16 server, eight external interconnect assemblies connect the CMIOUs in the top chassis
and the CMIOUs in the bottom chassis to the switch units in the switch chassis. One half of an assembly is
installed in the switch chassis, and the other half of the assembly is installed in a CMIOU chassis.
This is a cold-service procedure that can be performed only by qualified service personnel. Remove AC
power using the circuit breakers on the appropriate PDU before performing this procedure.
Locate the external interconnect assembly at the front of the server. Label the external interconnect
assembly slots and assemblies. If you are removing one or more external interconnect assemblies and will
return them to the server (for example, because you are removing them to gain access to other
components), you must return them to their original locations in the server. Take care to properly label
each slot and assembly to ensure that it is properly reinstalled. From the rear of the server, remove the SPs
or SPPs and unseat the SP tray from the impacted chassis. Unseat all CMIOUs and switch units from the
impacted chassis. Remove the four T-20 screws that secure the interconnect cable cover to the chassis and
remove the cover.
From the front of the server, remove the external interconnect assembly. The small connector pins on the
back of the assembly are susceptible to damage. It is important that you remove the assembly slowly to
avoid bending or otherwise damaging them. In addition, take care to avoid flexing or twisting these
assemblies. Loosen the screws on the assembly. When the latch springs out, press it in to free the assembly
from the chassis. Grasp the assembly and slide it out of the chassis.
From the front of the server, have one person carefully slide the interconnect into the switch chassis while
another person slides the interconnect into the CMIOU chassis.
The small connector pins on the back of the assembly are susceptible to damage. It is important that you
align the assembly in the chassis and install the assembly slowly to avoid bending or otherwise damaging
them. In addition, take care to avoid flexing or twisting these assemblies.
If you are inserting one or more previously installed external interconnect assemblies to the server (for
example, because you removed them to gain access to other components), you must return them to their
original locations in the server.
Tighten the captive screws on the face of the assembly to secure the external interconnect assembly to the
chassis.
Position the interconnect cable cover on the chassis and reinstall the four T-20 screws that secure it to the
chassis. Reseat all CMIOUs and switch units. Reseat the SP tray and reinstall the SPs or SPPs. Switch on the
appropriate PDU circuit breakers and power on the server. Verify that the fault has been cleared and the
replaced component is operational.
Note: SPMs (service processor modules) are not serviceable components. If an SPM fails, you must replace the SP or SPP that contains it.
ILOM can be accessed via the command-line interface (CLI), the browser user interface (BUI), intelligent
platform management interface (IPMI) and simple network management protocol (SNMP). System firmware
updates are performed through ILOM. ILOM is also used for management of the host remotely as well as system
monitoring and power consumption management.
Oracle ILOM 3.2.5 or later will be released with the M7-8 and M7-16. The SP looks and behaves very similarly to
ILOM running on other platforms, with a set of extensions added to support PDoms and SPPs.
Some of the concepts of the eXtended System Control Facility (XSCF) software that runs in the M-Series
servers were leveraged in developing the ILOM extensions.
Components in the diagram in the slide include the following.
• Integrated Lights Out Management (ILOM):
• The active service processor (SP) consolidation
• Is composed of multiple daemons
• Mostly generic across SPARC and x86
• Provides common look-and-feel across product lines
• Monitors platform HW
– Hypervisor (HV)
• Is runtime platform firmware that provides the sun4v hardware abstraction
• Manages guest isolation via the machine descriptors generated by the LDoms Manager
• Performs all CPU-specific operations, including error handling
Mapping:
– Most POD functionality runs on the active SP.
• Communication with GM is abstracted by Physical Domain Manager (PDM) so that POD needs to know only about
PDoms, not which SPP is currently the “PDOM-SPP.”
• When the system is powered on, there is a race to become the active SP. All chassis configuration is done
from the active SP. Its top level targets are servers, system, and SP. On the standby SP, only the SP target is
available.
SP failovers can be performed manually from the SP shell. The target used to perform the failover is
/SP/redundancy. When the initiate_failover_action property is set to true, the information is
propagated to the SP and the SPPs. A reset of the active SP is required, which will reboot the SPs and the SPPs.
There is no impact to the running domains. To confirm that you are connected to the active SP, execute ->
show /SP/redundancy status. If status = Active, you are connected to the active SP. If status
= Standby, you are logged in to the standby SP. If status = Standalone, only one of the SPs is
responding.
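As a sketch, checking the redundancy role and then manually initiating a failover from the SP shell might look like the following (the property names match the description above; output shown is illustrative):

```
-> show /SP/redundancy status
 /SP/redundancy
    Properties:
        status = Active
-> set /SP/redundancy initiate_failover_action=true
```

Remember that setting initiate_failover_action resets the active SP and reboots the SPs and SPPs, so your CLI session will be dropped and must be re-established.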
To continue with the installation, you must log in to the ILOM software on the active SP through a local serial
connection. Log in to the active SP as root with a password of changeme. You can run the commands shown
in the slide to perform chassis identification for your system. Be sure to follow the latest EIS installation
procedures to perform the installation. They can be found at: http://eis.us.oracle.com/checklists.
If the SP host name is not set, the login prompt displays ORACLE-PSN. When configuring the SP network, you
need at least three IP addresses: one for the active SP, which floats to whichever SP is currently active; one
for SP0; and one for SP1.
The SPs do not support DHCP. You must assign static network addresses to the following components:
– SP0: NET MGT port on SP0
– SP1: Net MGT port on SP1
– Active_SP: Active SP. If the active SP fails, the standby SP will be assigned this address.
– HOST0: The IP address for PDomain0-SPP host. The server is configured as one PDomain, so only one host requires an
address. If you reconfigure the server to have multiple PDomains then you must assign network addresses to the other
hosts.
• These network addresses must be configured before you can access the ILOM software over a network
connection.
After setting the IP addresses, you must commit them for the new addresses to take effect. If you change the IP
address that you are connected to, you will need to re-establish your connection after you set
commitpending=true.
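A sketch of assigning the static addresses described above from the CLI follows. The pending* property names and target names such as /SP/network/ACTIVE_SP are assumptions based on typical ILOM network targets and the component names listed above; verify them against your platform documentation. The addresses are placeholders:

```
-> set /SP/network/ACTIVE_SP pendingipaddress=10.0.0.10
-> set /SP/network/SP0 pendingipaddress=10.0.0.11
-> set /SP/network/SP1 pendingipaddress=10.0.0.12
-> set /SP/network pendingipnetmask=255.255.255.0 pendingipgateway=10.0.0.1
-> set /SP/network commitpending=true
```

No pending setting takes effect until the final commitpending=true is issued.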
To locate the server using the BUI, log in to the ILOM web interface. View the System Information >
Summary page. Click the Locator Indicator button in the Actions panel. When prompted, click Yes to
confirm the action. The server’s LOCATE LED illuminates on both the front and rear of the server so
that you can physically identify the server. To turn off the Locate LED, you can press the Locate LED
button if you are at the server. You can also turn it off remotely through the web interface. On the
Summary page, click the Locator Indicator button.
If you need to service a component, lighting the system Locate LED assists in easily identifying the
correct server. You do not need administrator permissions to use these commands. The server’s
Locate LED blinks on both the front and rear of the server so that you can physically identify the server.
It is labeled as #1 in the image in the slide.
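From the CLI, the same indicator can be toggled with something like the following sketch (the /System locator_indicator property name is an assumption based on ILOM 3.2 conventions; check your platform documentation):

```
-> set /System locator_indicator=on
-> show /System locator_indicator
-> set /System locator_indicator=off
```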
To display the server information through the BUI, log in to the ILOM web interface and view the
Summary page. The Summary page provides the following information:
– General Information panel – Provides general information such as the serial number, firmware version, primary
OS, host MAC address, SP IP addresses, and MAC addresses
– Actions panel – Provides the power state of the host
– Status panel – Provides the overall status of the server components
To see more detailed information, click specific components listed under System Information.
In the above example there is an issue with CMIOU8 (which is in DCU2).
If more detail is required:
-> show -l all -o table /System
This produces a long table of output.
The system serial number is located on the front of the rack near the top-right corner.
There is no progress meter during the upgrade, which takes about 45 minutes. A firmware upgrade causes
the SP to be reset, so any connection to the active SP will need to be re-established. It is
recommended that you perform a clean shutdown of the server before the upgrade procedure.
To ensure that all of the hosts are updated at the same time as the SPs, the hosts must be powered off.
However, the firmware can also be updated without impacting running hosts: if any hosts are running, their
firmware is automatically updated the next time those hosts are reset.
The SP will enter a special mode to load new firmware. No other tasks can be performed on the SP until the
firmware upgrade is complete and the SP is reset.
A user account is a record of an individual user that can be verified through a username and password.
Each user account is assigned specific roles that allow a user to execute a subset of ILOM commands
and perform select actions on a specific set of components. Those components can be physical
components, domains, or physical components within a domain. By specifying roles for each user, you
can control which operations each user is allowed to perform.
When you assign user roles to a user account for a specific component, the capabilities granted mirror
those of the user roles assigned for the platform, but they are restricted to commands executed on the
given component.
A specific role is required for certain tasks as shown in the table in the slide.
To configure user accounts through the BUI, you need to do the following:
• Log in to the ILOM web interface.
• Navigate to the ILOM Administration > User Management page. The Active Sessions page is displayed.
• Click the User Accounts tab.
• In the Users table, click Add. Enter the username, password, and password confirmation. Select a CLI mode and
the appropriate roles for this user. Select the appropriate roles for the domains that you want this user to have.
• Click Save.
To configure user accounts through the CLI, log in to ILOM. Create a user account, assign a password
to it as well as its roles. Password length must be between 8 and 16 characters.
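As a sketch, creating an account and assigning its roles from the CLI might look like the following (the username svcuser and the role string aucro are illustrative; ILOM prompts for the password interactively):

```
-> create /SP/users/svcuser
Enter new password: ********
Enter new password again: ********
-> set /SP/users/svcuser role=aucro
```

The password entered at the prompts must be between 8 and 16 characters, as noted above.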
You must have user (u) permissions to view the properties of and delete existing user accounts.
ILOM can authenticate user accounts through local accounts that you configure or against a remote
user database, such as Active Directory, LDAP/SSL, or RADIUS. With remote authentication, you can
use a centralized user database rather than configuring local accounts on each ILOM instance.
User access can be remotely authenticated and authorized based on a user’s membership in a host
group. A user can belong to more than one host group and, on this server, you can configure up to 10
host groups. The tasks involved in configuring host groups include managing certificates (LDAP/SSL),
administrator groups, operator groups, custom groups, and user domains.
Active Directory is the distributed directory service included with Microsoft Windows Server operating
systems. Like an LDAP directory service implementation, Active Directory is used to authenticate user
credentials.
LDAP/SSL offers enhanced security to LDAP users by way of SSL technology. To configure LDAP/SSL
in an SP, you enter basic data (such as primary server, port number, and certificate mode) and optional
data (such as alternate server, event, or severity levels). You can enter this data by using the
LDAP/SSL configuration page of the ILOM web interface, the CLI, or SNMP.
Users will need the “u” role to modify any settings under host groups.
To configure host groups through the BUI:
• Access the ILOM web interface.
• Navigate to the ILOM Administration > User Management page. The Active Sessions page is displayed.
• Click the Active Directory or LDAP/SSL tab.
• At the top of the page, click the link to access the Host Groups category. Enable the radio button of the individual
table, and then click Edit.
• Enter the name of the host group and select the hosts that you want to be members of the specified host group.
Select the appropriate roles for this host group.
• Click Save.
You must have user (u) permissions to configure host groups.
The example in the slide uses the following values:
– Host group ID number: 2
– Existing host group name: platadm
– New host group name: platops
– Host that is assigned roles by this host group: HOST2
– New host that is assigned roles by this host group: HOST1
– Existing host group roles: a, r
– New host group roles: a, c, r
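Expressed as CLI properties, the first example might look like the following sketch. The /SP/clients/activedirectory/hostgroups path is an assumption; the exact target depends on whether Active Directory or LDAP/SSL is in use, so verify it against your platform documentation:

```
-> show /SP/clients/activedirectory/hostgroups/2
-> set /SP/clients/activedirectory/hostgroups/2 name=platops hosts=HOST1 roles=acr
```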
The example in the slide uses the following values:
– Host group ID number: 3
– Existing host group name: platadm
– New host group name: platops
– Host that is assigned roles by this host group: /PDomain_1, PDomain_2
– New host that is assigned roles by this host group: /PDomain_1, PDomain_3
– Existing host group roles: a, r
– New host group roles: a, c, r
You must set the server altitude so that the server can adjust its fan speeds and monitor the
surrounding environmental conditions required for its elevation. Set the server altitude using the
SP system_altitude property. This property is set to 200 meters by default.
Setting the system_altitude property causes the server to adjust the temperature thresholds
so it can more accurately detect any abnormality in the air intake temperature. However, even if
you do not set the system altitude, the server still detects and responds to any abnormality in the
air temperature, such as the CMP temperature.
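For example, for a server installed at 1500 meters (the altitude value is illustrative), the property described above can be checked and set as follows:

```
-> show /SP system_altitude
-> set /SP system_altitude=1500
```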
To view power consumption information by using the BUI, log in to the ILOM web interface. View the
Power Management > Consumption page. The server’s power consumption wattage value is displayed
for the Actual Power and Peak Permitted Power properties. The consumption metric identifies the input
power wattage that the server is currently consuming. The peak permitted power consumption metric
identifies the maximum power wattage the server can consume.
You can view the power allocation requirements shown for the components. The power usage
statistics can be seen from the Power Management > Statistics page. These statistics are displayed in
15, 30, and 60 second intervals. The per-component power map provides wattage allocations for each
server component. Power history can be viewed from the Power Management > History page. The
power history for the minimum, average, and maximum power usage is displayed.
The Oracle SPARC M7-8 with two PDoms uses a 2x4 socket configuration, as labeled on the diagram.
Each of the two PDoms contains one DCU, which consists of up to 4 CMIOUs.
NOTE: For the population rules refer to the Oracle SPARC M7 Series Service Manual.
The Oracle SPARC M7-8 uses a 1x8 socket configuration with one PDom that has one DCU, as labeled on the diagram.
The PDom is a single domain that contains from 1 to 8 CMIOUs.
NOTE: For the population rules refer to the Oracle SPARC M7 Series Service Manual.
The Oracle SPARC M7-16 consists of four DCUs with up to 4 CMIOUs in each DCU.
NOTE: For the population rules refer to the Oracle SPARC M7 Series Service Manual.
This table compares the building-block equivalents on Oracle’s other high-end servers.
The examples in the slide show the syntax for assigning a DCU to a PDom. They require the user role
of admin (a) to be executed.
The dcus_assignable property enables you to control which DCUs can be assigned to a PDomain.
Setting these properties overwrites their current values. If you want to append to the current
settings, you must include the current settings in the “set” syntax along with the new values.
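A sketch of this syntax, assuming the dcus_assignable property and its companion dcus_assigned property under the PDomain HOST target (the PDomain and DCU names are illustrative):
-> set /Servers/PDomains/PDomain_0/HOST dcus_assignable="/SYS/DCU0 /SYS/DCU1"
-> set /Servers/PDomains/PDomain_0/HOST dcus_assigned=/SYS/DCU0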
In a multi-DCU domain scenario, the lowest-numbered available SPP in the domain automatically
becomes the PDomain SPP, which manages the domain across all of its DCUs. If this golden SPP
fails, the next-higher-numbered SPP takes over.
NOTE: Starting the console before starting the host is a good practice because it displays any errors it may
encounter.
NOTE: Access to the console may take up to 60 seconds.
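A minimal sketch of starting the console for PDomain 0 (either path form works, per the /HOSTx shorthand used in this course):
-> start /Servers/PDomains/PDomain_0/HOST/console
-> start /HOST0/console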
NOTE: All these commands perform a graceful shutdown. To perform an immediate shutdown, add -f
to the command line.
You can use the /HOST/console/history console output buffer to write all types of log
information. If you enter the show /HOST/console/history command without first setting any
arguments with the set command, Oracle ILOM displays all lines of the console log, starting from the
end.
Note: Timestamps recorded in the console log reflect server time. By default, the Oracle ILOM
console log uses UTC (Coordinated Universal Time). The Oracle Solaris OS system time is
independent of the Oracle ILOM time.
The power on and boot progress can be monitored with the following commands:
-> show /Servers/PDomains/PDomain_0/HOST status
or
-> show /HOST0 status
When the host is powered on but the OS is not booted, you communicate with the OBP firmware. The
OBP firmware displays ok as its prompt.
When PDomains are powered on, their clocks synchronize to the NTP server when the system is
configured to listen to NTP multicast (the default for the current Oracle Solaris OS). If the PDomains
and SPs use the same NTP server, events logged in the Oracle Solaris OS and on the SP can be
correlated based on their timestamps. If the PDomains and SPs use different NTP servers, their times
might drift, and correlating log files could become difficult. If you connect a domain to an NTP server
other than the one used by the SP, ensure that both are low-stratum NTP servers that provide the
same degree of accuracy.
Timestamps in the console log reflect server time. By default, the Oracle ILOM console log uses
UTC/GMT, but you can use the /Servers/PDomains/PDomain_x/SP/clock timezone property to set
the SP clock to another time zone. The Oracle Solaris OS system time is independent of the
Oracle ILOM time.
NOTE: To perform an immediate shutdown, use the -force option from the stop command. Ensure
that all data is saved before entering this command. To perform power operations on the server or a
specific domain, user accounts on each must be assigned reset (r) user roles.
You can start, stop, or reset the whole system or an individual PDomain. To perform these tasks, user
accounts for the components you want to start, stop, or reset must be assigned reset (r) user
roles.
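For example (a sketch; PDomain_0 is illustrative):
-> start /Servers/PDomains/PDomain_0/HOST
-> stop /Servers/PDomains/PDomain_0/HOST
-> stop -force /Servers/PDomains/PDomain_0/HOST
-> reset /Servers/PDomains/PDomain_0/HOST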
NOTE: The VersaBoot/eUSB option is new, so refer to the Product Notes to determine its status as the
product matures.
NOTE: VersaBoot uses iSCSI over InfiniBand as well as IP over InfiniBand (iSCSI/IPoIB).
The VersaBoot boot process supports a boot pool consisting of the embedded USB (eUSB)
flash-based disk devices, one located in each CMIOU.
Other parameters include:
osroot-iscsi-target-ip, osroot-iscsi-port, osroot-iscsi-partition,
osroot-iscsi-lun, osroot-iscsi-target-name, osroot-subnet-mask,
osroothost-ip
NOTE: rKVMS stands for remote Keyboard, Video, Mouse and Storage.
NOTE: This configuration is covered in the SPARC M7 Series Servers Installation Guide.
The KVMS software that is preinstalled on this server allows for both video-redirection and serial-
redirection connections to the Oracle Solaris OS. However, only the serial-redirection connection
supports Oracle Solaris console. Video redirection provides a standard X-session connection to the
Oracle Solaris OS. If an X server has not already been enabled on the Oracle Solaris OS, video
redirection will display a blank screen. Complete the steps in the slide to install X server packages on
the server so you can access the command prompt for a video redirection session.
Note: The OBP input-device=rkeyboard and output-device=rscreen properties are not supported on
this server.
One SPP is assigned to manage each DCU. One of these SPPs is identified as a PDomain SPP, which
is responsible for hosting the KVMS server. In some cases (for example, if the PDomain SPP that
hosts the KVMS server reboots), the network connection to Oracle ILOM Remote Console Plus might
terminate. The PDomain will not automatically attempt to re-establish these links.
Oracle VM Server for SPARC, also known as Logical Domains, provides highly efficient, enterprise-
class virtualization capabilities for Oracle’s SPARC M-Series servers.
Oracle VM Server for SPARC leverages the built-in hypervisor to subdivide system resources (CPUs,
memory, network, and storage) by creating partitions called logical (or virtual) domains. Each logical
domain can run an independent operating system.
Oracle VM Server for SPARC provides the flexibility to deploy multiple Oracle Solaris operating
systems simultaneously on a single platform. This virtualization solution has been designed to fully
optimize Oracle Solaris and SPARC for enterprise server workloads.
The primary roles of the hypervisor are to implement the software components of the sun4v virtual
machine, providing low-overhead hardware abstraction; to enforce hardware and software resource
access restrictions for guests, including inter-LDom communication; and to provide isolation and
security. It also performs initial triage and correction of hardware errors.
The secondary roles of Hypervisor include implementing dynamic LDom reconfiguration, providing
data for performance statistics, and managing hardware elements of some power management
features.
Control domain: The LDoms Manager runs in this domain, which enables you to create and manage other logical
domains, and to allocate virtual resources to other domains. You can have only one control domain per PDom. The
control domain is the first domain created when you install the Oracle VM Server for SPARC software. The control domain
is named primary.
Service domain: A service domain provides virtual device services to other domains, such as a virtual switch, a virtual
console concentrator, and a virtual disk server. You can have more than one service domain, and any domain can be
configured as a service domain.
I/O domain: An I/O domain has direct access to a physical I/O device, such as a network card in a PCI EXPRESS
(PCIe) controller. An I/O domain owns a PCIe root complex. An I/O domain can share physical I/O devices with other
domains in the form of virtual devices when the I/O domain is also used as a service domain.
Root domain: A root domain has a PCIe root complex assigned to it. This domain owns the PCIe fabric and provides all
fabric-related services, such as fabric error handling. A root domain is also an I/O domain, as it owns and has direct
access to physical I/O devices. The number of root domains that you can have depends on your platform architecture.
Guest domain: A guest domain is a non-I/O domain that consumes virtual device services that are provided by one or
more service domains. A guest domain does not have any physical I/O devices, but only has virtual I/O devices, such as
virtual disks and virtual network interfaces.
The hypervisor software is responsible for maintaining the separation between logical domains. The
hypervisor software also provides logical domain channels (LDCs) that enable logical domains to
communicate with each other. LDCs enable domains to provide services to each other, such as
networking or disk services.
The maximum number of Physical Domains is 4 and the minimum number of DCUs per Physical
Domain is 1. Each physical domain has its own hypervisor.
This image shows the device mapping on an individual CMIOU board.
This is the device map for a M7-8 server with one PDomain.
This is the device map for a M7-8 server with two PDomains.
This is the device map for a M7-16 server with three PDomains.
Features of the M7 I/O Controller include:
• 4 Root Complexes of 16 lanes each, where each group can be individually configured as 1x16, 2x8, or 4x4
Root Ports (that is, bifurcated or quadfurcated)
• 1 Root Complex of 8 lanes that can be configured as 1x8 or 2x4 Root Ports
Each logical domain is only permitted to observe and interact with those server resources that are
made available to it by the hypervisor. The LDoms Manager enables you to specify what the
hypervisor should do through the control domain. Thus, the hypervisor enforces the partitioning of the
server’s resources and provides limited subsets to multiple operating system environments. This
partitioning and provisioning is the fundamental mechanism for creating logical domains.
The diagram in the slide shows the hypervisor supporting two logical domains in the PDom on the left
and three logical domains in the PDom on the right.
The number and capabilities of each logical domain that a specific SPARC hypervisor supports are
server-dependent features. The hypervisor can allocate subsets of the overall CPU, memory, and I/O
resources of a server to a given logical domain. This enables support of multiple operating systems
simultaneously, each within its own logical domain. Resources can be rearranged between separate
logical domains with an arbitrary granularity. For example, CPUs are assignable to a logical domain
with the granularity of a CPU thread.
Each logical domain can be managed as an entirely independent machine with its own resources,
such as Kernel, patches, and tuning parameters, user accounts and administrators, disks, network
interfaces, MAC addresses, and IP addresses.
Each logical domain can be stopped, started, and rebooted independently of each other without
requiring you to perform a power cycle of the server.
There are three types of Machine Descriptions (MDs), the Platform MD (aka mini-MD), the Guest
MD(s), and the Hypervisor MD (HVMD).
The Platform MD is generated by Hostconfig during power-on-reset (POR) operations. It contains
low-level platform configuration information intended for Hypervisor consumption only.
The Hypervisor MD (HVMD) defines a set of Logical Domains and their resources. It is generated by
Hostconfig for initial factory-default configuration only. The LDoms Manager produces new HVMDs
and dynamically updates them as the system runs. The LDoms Manager-generated HVMD and
guest MDs can be saved on the SP as selectable bootsets.
OBP is only resident in the stack for booting. When Solaris has booted, OBP is no longer part of the
picture.
The Guest Manager (GM) resides on the SP. It provides services to Guest domains via Logical
Domain Channels (LDCs). It is the communication bridge between the Host and ILOM and is
responsible for managing LDom configurations.
The service processor (SP) monitors and runs the physical machine, but it does not manage the
logical domains. The LDoms Manager manages the logical domains.
SER – Serious Error Reports
The following virtual device services must be created to use the control domain as a service domain
and to create virtual devices for other domains:
vcc – Virtual console concentrator service
vds – Virtual disk server
vsw – Virtual switch service
In the first example, the command adds a virtual console concentrator service (primary-vcc0)
with a port range from 5000 to 5100 to the control domain (primary).
In the second example, the command adds a virtual disk server (primary-vds0) to the control domain
(primary).
In the third example, the command adds a virtual switch service (primary-vsw0) on network adapter
driver net0 to the control domain (primary).
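From the Oracle Solaris command line in the control domain, the three examples above correspond to ldm commands of this form (a sketch; the service and device names follow the examples):
# ldm add-vcc port-range=5000-5100 primary-vcc0 primary
# ldm add-vds primary-vds0 primary
# ldm add-vsw net-dev=net0 primary-vsw0 primary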
Initially, all system resources are allocated to the control domain. To allow the creation of other
logical domains, you must release some of these resources.
You must reboot the control domain for the configuration changes to take effect and for the resources
to be released for other logical domains. Either a reboot or a power cycle instantiates the new
configuration. Only a power cycle actually boots the configuration saved to the service processor
(SP), which is then reflected in the list-config output.
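A sketch of releasing control-domain resources and saving the configuration to the SP (the CPU and memory sizes and the configuration name are illustrative):
# ldm start-reconf primary     (place the control domain in delayed reconfiguration)
# ldm set-vcpu 16 primary
# ldm set-memory 16G primary
# ldm add-config initial       (save the configuration to the SP)
# shutdown -y -g0 -i6          (reboot the control domain)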
You must enable the virtual network terminal server daemon (vntsd) to provide access to the virtual
console of each logical domain.
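On Oracle Solaris, vntsd is managed as an SMF service; a sketch:
# svcadm enable vntsd
# svcs vntsd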
The guest domain must run an operating system that understands both the sun4v platform and the
virtual devices presented by the hypervisor. Currently, this means that you must run at least the
Oracle Solaris 10 11/06 OS. Running the Oracle Solaris 10 8/11 OS provides you with all the
Oracle VM Server for SPARC 3.0 features. When you have created default services and reallocated
resources from the control domain, you can create and start a guest domain.
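A minimal sketch of creating and starting a guest domain (the domain name, resource sizes, and device paths are illustrative):
# ldm add-domain ldg1
# ldm add-vcpu 8 ldg1
# ldm add-memory 8G ldg1
# ldm add-vnet vnet1 primary-vsw0 ldg1
# ldm add-vdsdev /dev/dsk/c2t1d0s2 vol1@primary-vds0
# ldm add-vdisk vdisk1 vol1@primary-vds0 ldg1
# ldm bind-domain ldg1
# ldm start-domain ldg1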
The virtual disks are generic block devices that are associated with different types of physical
devices, volumes, or files. A virtual disk is not synonymous with a SCSI disk and, therefore, excludes
the target ID in the disk label. Virtual disks in a logical domain have the following format: cNdNsN,
where cN is the virtual controller, dN is the virtual disk number, and sN is the slice.
Before removing a guest domain, you must first stop it and unbind the resources from it. After
removing the guest domain, you need to clean up the configuration on the SP.
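A sketch of the removal sequence (ldg1 and config-name are placeholders):
# ldm stop-domain ldg1
# ldm unbind-domain ldg1
# ldm remove-domain ldg1
# ldm rm-config config-name    (clean up the saved configuration on the SP, if applicable)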
An I/O domain has direct ownership of and direct access to physical I/O devices. It can be created by
assigning a PCI EXPRESS (PCIe) bus or a PCIe endpoint device to a domain. Use the ldm add-io
command to assign a bus or device to a domain.
An I/O domain might have direct access to one or more I/O devices, such as PCIe buses, network
interface units (NIUs), PCIe endpoint devices, and PCIe single root I/O virtualization (SR-IOV) virtual
functions.
This type of direct access to I/O devices means that more I/O bandwidth is available to provide
services to the applications in the I/O domain as well as virtual I/O services to guest domains.
You can use the Oracle VM Server for SPARC software to assign an entire PCIe bus (also known as a
root complex) to a domain. An entire PCIe bus consists of the PCIe bus itself, and all of its PCI
switches and devices. PCIe buses that are present on a server are identified with names such as
pci@300. An I/O domain that is configured with an entire PCIe bus is also known as a root domain.
The diagram in the slide shows a system that has two PCIe buses (pci_0 and pci_1). Each bus is
assigned to a different domain. Thus, the system is configured with two I/O domains.
When you assign a PCIe bus to an I/O domain, all devices on that bus are owned by that I/O domain.
You are not permitted to assign any of the PCIe endpoint devices on that bus to other domains. Only
the PCIe endpoint devices on the PCIe buses that are assigned to the primary domain can be
assigned to other domains.
When a server is initially configured in a Logical Domains environment or is using the factory-default
configuration, the primary domain has access to all the physical device resources. This means that
the primary domain is the only I/O domain configured on the system and that it owns all the PCIe
buses.
This example procedure shows how to create a new I/O domain from an initial configuration where
several buses are owned by the primary domain. By default, the primary domain owns all buses
present on the system.
First, you must retain the bus that has the primary domain’s boot disk. Then, remove another bus
from the primary domain and assign it to another domain.
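A sketch of this procedure (the bus and domain names are illustrative; use the actual names reported by ldm list-io):
# ldm list-io                  (identify the buses and which one holds the boot disk)
# ldm remove-io pci_1 primary
# ldm add-config io-split      (save the configuration, then reboot the control domain)
# ldm add-io pci_1 ldg1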
Support for the Peripheral Component Interconnect Express (PCIe) single root I/O virtualization (SR-
IOV) feature has been added starting with the Oracle VM Server for SPARC 2.2 release.
The SR-IOV standard enables the efficient sharing of PCIe devices among virtual machines and is
implemented in hardware to achieve I/O performance comparable to native performance. The SR-IOV
specification defines a standard in which newly created virtual devices enable a virtual machine to
be connected directly to the I/O device.
A single I/O resource, which is known as a physical function, can be shared by many virtual
machines. The shared devices provide dedicated resources and also use shared common resources.
In this way, each virtual machine has access to unique resources. Therefore, a PCIe device, such as
an Ethernet port, that is SR-IOV-enabled with appropriate hardware and OS support can appear as
multiple, separate physical devices, each with its own PCIe configuration space.
Each SR-IOV device can have a physical function and each physical function can have up to 64,000
virtual functions associated with it. This number is dependent on the particular SR-IOV device. The
virtual functions are created by the physical function.
After SR-IOV is enabled in the physical function, the PCI configuration space of each virtual function
can be accessed by the bus, device, and function number of the physical function. Each virtual
function has a PCI memory space, which is used to map its register set. The virtual function device
drivers operate on the register set to enable its functionality and the virtual function appears as an
actual PCI device. After creation, you can directly assign a virtual function to an I/O domain. This
capability enables the virtual function to share the physical device and to perform I/O without CPU
and hypervisor software overhead.
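A sketch of creating and assigning a virtual function with the ldm command (the physical function name is illustrative; actual names come from ldm list-io):
# ldm create-vf /SYS/MB/NET0/IOVNET.PF0
# ldm add-io /SYS/MB/NET0/IOVNET.PF0.VF0 ldg1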
This figure shows the relationship between virtual functions (VFs) and a physical function in an I/O
domain.
SR-IOV has the following function types:
– Physical function – A PCI function that supports the SR-IOV capabilities as defined by the SR-IOV
specification. A physical function contains the SR-IOV capability structure and manages the SR-IOV
functionality. Physical functions are fully featured PCIe functions that can be discovered, managed, and
manipulated like any other PCIe device. Physical functions can be used to configure and control a PCIe device.
– Virtual function – A PCI function that is associated with a physical function. A virtual function is a lightweight
PCIe function that shares one or more physical resources with the physical function and with virtual functions
that are associated with that physical function. Unlike a physical function, a virtual function can only configure its
own behavior.
Use Cases
– Creating NPRD domains by assigning PCIe buses that can be set up as alternate service
domains
– Live service of a CMIOU
The flowchart in the slide illustrates the general process of diagnostics for this server. Depending on
the fault, you might need to perform all of the steps or just some of them. You also might have to run
diagnostic software that needs to be installed or enabled.
The table in the slide provides descriptions of the troubleshooting actions that you should take to
identify a faulty component. The diagnostic tools you use, and the order in which you use them,
depend on the nature of the problem you are troubleshooting.
Log files can be found both in the domain and on the SP. Domain log information can be obtained with
the dmesg command to see what is in the system buffer and by viewing the contents of the
/var/adm/messages file. SP log information can be obtained by viewing the event log, -> show
/SP/logs/event/list, and the audit log, -> show /SP/logs/audit/list.
You can use a variety of diagnostic tools, commands, and indicators to monitor and troubleshoot a
server. LEDs provide a quick visual notification of the status of the server and some of the replaceable
components.
Oracle ILOM firmware runs on the SPs. In addition to providing the interface between the hardware
and OS, ILOM also tracks and reports the health of key server components. It also works closely with
POST, Solaris Predictive Self-Healing technology and the Fault Management Architecture on Oracle
ILOM to keep the server running even when there is a faulty component. POST performs diagnostics
on server components upon server reset to ensure the integrity of those components. POST is
configurable and works with Oracle ILOM to take faulty components offline if needed.
Solaris’s predictive self-healing (PSH) technology continuously monitors the health of the CPU,
memory, and other components, and works with FMA on Oracle ILOM to take a faulty component
offline if needed. The PSH technology enables servers to accurately predict component failures and
mitigate many serious problems before they occur. The system provides the standard Oracle Solaris
OS log files and investigative commands that can be accessed and displayed on the device of your
choice. SunVTS is an application that exercises the server, provides hardware validation, and
discloses possible faulty components with recommendations for repair.
The table in the slide describes what tools are available at the different states in which the server
operates.
The Oracle ILOM shell commands listed in the slide are used most frequently when performing service-
related tasks.
Remember, /HOSTx = /Servers/PDomains/PDomain_x/HOST.
PSH enables the server to diagnose and mitigate problems before they negatively affect operations.
PSH uses the Fault Manager daemon, fmd, which starts at boot time and runs in the background, to
monitor all of the faults that are generated by the components in the server. On the SP, PSH works
with Oracle ILOM to manage all of the components on the server. On the host, PSH works with POST
and the Oracle Solaris OS to manage the components assigned to the host.
If a component generates a fault, the fmd daemon correlates the fault with data from previous faults
and other relevant information to diagnose the problem. After diagnosis, the daemon assigns a UUID
to the error. This value distinguishes this error across any set of systems.
When possible, the Fault Manager daemon initiates steps to self-heal the failed component and take
the component offline. The daemon also logs the fault to the syslogd daemon and provides a fault
notification with a MSGID. You can use the message ID to get additional information about the problem
from the knowledge article database. If PSH detects a faulty component, use the fmadm faulty
command to display information about the fault.
The fmadm faulty command displays the list of faults detected by PSH. You can run this command
from either the host or through the fault management shell under ILOM.
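For example, from the Oracle ILOM fault management shell (a sketch):
-> start /SP/faultmgmt/shell
faultmgmtsp> fmadm faulty
faultmgmtsp> exit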
Note: Oracle ILOM automatically clears the messages in the Open Problems table upon detecting the
replacement or repair of a server component.
4. When applicable, click the URL link in the message to view further details about the problem and for
suggested corrective actions.
When PSH detects faults, the faults are logged and displayed on the console. In most cases, after the
fault is repaired, the corrected state is detected by the server and the fault condition is repaired
automatically. However, this repair should be verified. In cases where the fault condition is not
automatically cleared, the fault must be cleared manually.
You can use the fmadm replaced command to indicate that the suspect FRU has been replaced or
removed. If the system automatically discovers that an FRU has been replaced (the serial number has
changed), this discovery is treated in the same way as if fmadm replaced had been entered on the
command line. The fmadm replaced command is not allowed if fmd can automatically confirm that
the FRU has not been replaced (the serial number has not changed). If the system automatically
discovers that an FRU has been removed but not replaced, the current behavior is unchanged. The
suspect is displayed as not present, but is not considered to be permanently removed until the fault
event is 30 days old, at which point it is purged.
You can use the fmadm repaired command when some physical repair has been carried out to
resolve the problem, other than replacing an FRU. Examples of such repairs include reseating a card
or straightening a bent pin.
Often you use the acquit option when you determine that the resource was not the cause. Acquittal can
also happen implicitly when additional error events occur, and the diagnosis gets refined. Replacement
takes precedence over repair, and both replacement and repair take precedence over acquittal. Thus,
you can acquit a component and then subsequently repair it, but you cannot acquit a component that
has already been repaired.
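A sketch of the three verbs (the component path and UUID are placeholders):
faultmgmtsp> fmadm replaced /SYS/CMIOU0
faultmgmtsp> fmadm repaired /SYS/CMIOU0
faultmgmtsp> fmadm acquit <UUID>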
Within Solaris, use cfgadm to locate and manage I/O cards and hard drives. The cfgadm
command un-configures and configures HDDs and SSDs. Use format to locate hard
drives. The format command does not support I/O cards. Use hotplug to list I/O card slots and
identify card types. The hotplug command currently does not support hard drives. The device
path is how the Oracle Solaris hotplug command identifies a slot location.
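For example (a sketch; the attachment-point ID is illustrative):
# cfgadm -al                          (list attachment points)
# cfgadm -c unconfigure c0::dsk/c0t1d0
# hotplug list -lv                    (list I/O card slots and card types)
# format                              (locate hard drives)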
NOTE: The CMIOU has a Ready to Remove LED but it is not currently a hot swap item. Check
Product Notes to see if it becomes a hot swap item in the future.
With the OS running on the server, you have the full complement of Oracle Solaris OS files and
commands available for collecting information and for troubleshooting. If PSH does not indicate the
source of a fault, check the message buffer and log files for notifications for faults. Drive faults are
usually captured by the Oracle Solaris message files.
The error logging daemon, syslogd, automatically records various system warnings, errors, and faults
in message files. These messages can alert you to system problems, such as a device that is about to
fail. The /var/adm directory contains several message files. The most recent messages are in the
/var/adm/messages file.
You can use the /Servers/PDomains/PDomain_x/HOST/console/history console output buffer
to write all types of log information. If you enter show /HOSTx/console/history without first
setting any arguments with the set command, Oracle ILOM will display all lines of the console log,
starting from the end.
NOTE: Timestamps recorded in the console log reflect server time. By default, the Oracle ILOM
console log uses UTC (Coordinated Universal Time). The Oracle Solaris OS system time is
independent of the Oracle ILOM time.
To manage the event and audit logs:
1. Log in to the Oracle ILOM web interface.
2. View the ILOM Administration > Logs page. The event log is displayed.
3. If needed, filter the event types shown, or control the display properties for rows and pages.
4. Use the controls at the top of the log table.
5. If needed, clear all log entries shown in the table by clicking Clear Log.
6. A confirmation dialog box appears. In the confirmation dialog box, click OK to clear the entries.
7. Click the Audit tab to view the audit log. The audit log is displayed.
POST is a group of PROM-based tests that run when the server is powered on or when it is reset.
POST checks the basic integrity of the critical hardware components in the server. You can set other
Oracle ILOM properties to control various other aspects of POST operations. For example, you can
specify the events that cause POST to run, the level of testing POST performs, and the amount of
diagnostic information POST displays.
If POST detects a faulty component, the component is disabled automatically. If the system is able to
run without the disabled component, the system boots when POST completes its tests. For example, if
POST detects a faulty processor core, the core is disabled, POST completes its test sequence, and the
system boots using the remaining cores. Remember, /HOSTx =
/Servers/PDomains/PDomain_x/HOST.
This example sets the virtual keyswitch to normal, which will configure POST to run according to
other parameter values. Remember, /HOSTx = /Servers/PDomains/PDomain_x/HOST.
This procedure describes how to configure the server to run the maximum level of POST for an
individual PDom. The virtual keyswitch can be used to run full POST diagnostics without having to
modify the diagnostic property settings. Note that POST diagnostics can take a significant amount of
time to run at server reset. Remember, /HOSTx = /Servers/PDomains/PDomain_x/HOST.
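A sketch, using the keyswitch_state property (PDomain 0 is illustrative):
-> set /HOST0 keyswitch_state=Diag     (run full POST on the next reset)
-> set /HOST0 keyswitch_state=Normal   (return POST control to the other diagnostic properties)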
The server has two status panels, one located at the front of the server and the other at the rear.
The labeling in the diagram represents:
System Locator LED (white) – The Locator LED can be turned on to identify the server. When on, it blinks rapidly. There
are two methods for turning on the Locator LED:
• Issuing the Oracle ILOM command set /SYS/LOCATE value=Fast_Blink
• Pressing the Locator button on the front of the server.
System Service Required LED (amber) – Indicates that service is required.
• The Oracle ILOM show faulty command provides details about any faults that cause this indicator to light.
• Under some faulty conditions, individual component fault LEDs light up in addition to the Service Required
LED.
System Power OK LED (green) - Indicates the following conditions:
• Off – System is not running in its normal state. System power might be off. The SPs might still be running.
• Steady on – System is powered on and is running in its normal operating state. No service actions are
required.
• Fast blink – System is running in standby mode and can be quickly returned to full function.
• Slow blink – A normal but transitory activity is taking place. Slow blinking might indicate that system
diagnostics are running or that the system is booting.
To view the information, the user accounts for each component must be assigned read-only operator
(o) user roles.
This server provides many ways to identify faulty behavior, including LEDs, Oracle ILOM, and POST.
In addition to the system wide and subcomponent statuses you can access with Oracle ILOM, for this
server you can view the state of individual PDomains or specific components (DCUs, CMIOUs, or
CPUs). To view the information, the user accounts for each component must be assigned read-only
operator (o) user roles.
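As a sketch (output omitted; the exact properties shown vary by firmware version), the state of one PDomain can be viewed with:

-> show /Servers/PDomains/PDomain_0/HOST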
This output is continued from
-> show /Servers/PDomains/PDomain_1/HOST in the previous slide.
Note: Setting the auto-boot parameter to false is a one-time setting. The next time a PDomain is reset, the auto-boot parameter returns to its default setting.
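This one-time behavior is characteristic of bootmode settings; an illustrative sketch for PDomain 1 (as in the slide):

-> set /Servers/PDomains/PDomain_1/HOST/bootmode script="setenv auto-boot? false"

After the next reset of the PDomain, the bootmode script is cleared and auto-boot? returns to its default value.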
To display server information from the web interface, view the Summary page.
The summary page provides information about:
• General Information panel – Provides general information, such as the serial number, firmware version, primary
OS, host MAC address, SP IP addresses, and MAC addresses
• Actions panel – Provides the power state of the host
• Status panel – Provides the overall status of the server components
Click specific components listed under System Information for more details.
Oracle Explorer is a collection of shell scripts and a few binary executables that gathers information
and creates a detailed snapshot of a system’s configuration and state. Explorer output enables
Oracle’s engineers to perform assessments of the system by applying the output against a
knowledge-based rules engine. Information related to drivers, patches, recent system event history,
and log file entries is obtained from the Explorer output. It can be used by Oracle and Oracle’s
customers to identify and solve problems both reactively, to expedite problem diagnosis and
resolution, and proactively, to prevent future problems.
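On a running Oracle Solaris host, Explorer is conventionally launched from its installation directory (the path shown is the usual default; command options vary by Explorer version):

# /opt/SUNWexplo/bin/explorer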
The Snapshot utility provides a single solution for collecting SP data for use by Oracle Services personnel to diagnose problems. The utility collects log files, runs various commands and captures their output, and sends the resulting data collection as a zip file to a user-defined location.
Snapshot can be invoked in normal mode through the DMTF CLI and the BUI. Collecting a snapshot requires the Admin (a) role. Snapshot supports the SFTP (Secure File Transfer Protocol) and FTP (File Transfer Protocol) transfer protocols, as well as HTTPS when the browser is the target in the BUI. Snapshot can also encrypt the entire output file.
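A minimal CLI sketch, assuming an SFTP target (the host, credentials, and directory are placeholders):

-> set /SP/diag/snapshot dataset=normal
-> set /SP/diag/snapshot dump_uri=sftp://user:password@host/directory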
The Service Snapshot utility collects data about the current state of the service processor, including
environmental data, logs, and FRU information. It can also run host diagnostics and capture the log.
The output from Snapshot is saved as a zip or an encrypted zip file. This information is used by Oracle
Service for diagnostic purposes.
The Admin (a) role is required to run the Snapshot utility. The Host Control and Reset (r) role
is required for running diagnostics that reset the host. Snapshot is available from the BUI under the
ILOM Administration -> Maintenance tab. Click the More details… link to gather additional information
about the fields in this window.
Lab Guide
Copyright © 2016, Oracle and/or its affiliates. All rights reserved. | Confidential – Oracle Internal
FRU Cold Replacement
Exercise
In this exercise, perform the tasks that are presented.
Preparation
• Prepare for this exercise with a lab tour of the M7-16 Server, identifying system components (CMIOUs, switch chassis, power supplies, fan trays, and so on), and practice removing cold replacement FRUs.
Task 1: Identify Server Components
• The topics listed in the table identify key components of the server,
including major modules and assemblies, as well as front and rear
panel features. Use the links in the table to help you identify all of the
server components.
Task 2: Rack Details
• When rack mounting the M7-8 server, follow the procedures listed in
the SPARC M7 Series Servers Installation Guide.
Task 3: Replacing Cold Service FRUs
Follow the procedures in the SPARC M7 Series Servers Service Manual to
replace:
• SP Trays
• Power Modules
• Fan Modules
• CMIOU Chassis Fan Cable Assembly
• Front Indicator Panel
• Front Indicator Panel Cable
• SP Internal Interconnect Assembly
Task 3: Replacing Cold Service FRUs
Follow the procedures in the SPARC M7 Series Servers Service Manual to
replace:
• Internal Interconnect Assembly
• External Interconnect Assembly
Platform Configuration Using EIS
Exercise
In this exercise, perform the tasks that are presented.
Preparation
• Prepare for this exercise by obtaining the latest EIS Checklist for the SPARC M7
Series Servers and the lab configuration information.
Task 1: EIS Checklist – Platform Configuration
Perform the step-by-step instructions as shown in the M7 EIS Checklist
to:
• Obtain a serial connection to the SP
• Check SP boot sequence
• Log in as root
• Create a user
– create /SP/users/admin
– set /SP/users/<username> role=aucro
– set /SP/users/<username> password
• Create an additional user for later service usage (aucros)
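A hedged example of the user-creation sequence (the username svc is illustrative):

-> create /SP/users/svc
-> set /SP/users/svc role=aucros
-> set /SP/users/svc password

The set password command prompts for the new password.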
Task 1 continued
• Check the PS
– show -t /System/Power/Power_Supplies health output_power
• Verify the system health
– show /System health
• Configure the network
• Set the hostname, altitude, date, and so on
• Upgrade FW
• Configure alertmgt
• Configure ASR
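Illustrative forms of a few of these settings (all values are placeholders; consult the EIS checklist for site-specific values):

-> set /SP hostname=m7-lab-sp
-> set /SP/clients/ntp/server/1 address=10.0.0.10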
Task 2: EIS Checklist – Domain Configuration
Perform the step-by-step instructions as shown in the M7 EIS Checklist
to:
• Check the PDom configuration
• Verify the PDom host flash info
• Configure as many PDoms as needed for the training
• Review possible host properties
• Configure POST levels
• Start the hosts
• Check the host progress and results
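A sketch of the start-and-monitor flow for one PDomain (PDomain_0 is used as an example):

-> start /Servers/PDomains/PDomain_0/HOST
-> show /Servers/PDomains/PDomain_0/HOST status
-> start /Servers/PDomains/PDomain_0/HOST/console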
Task 2 continued
• Verify OBP variables
• Install Solaris via host storage redirection
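At the resulting ok prompt, individual OBP variables can be checked before installation, for example:

{0} ok printenv auto-boot?
{0} ok printenv boot-device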
FRU Hot Replacement
Exercise
In this exercise, perform the tasks that are presented.
Preparation
• Prepare for this exercise by reviewing the procedures in the SPARC M7 Series
Servers Service Manual. You will practice removing hot replacement FRUs.
Task 1: Replacing Hot Service FRUs
Follow the procedures in the SPARC M7 Series Servers Service Manual to
replace:
• Power Supplies
• CMIOU Chassis Front Components
• Switch Chassis Front Components
• Fan Modules (CMIOU Chassis)
• Fan Modules (Switch Chassis)
• CMIOUs
Task 1: Replacing Hot Service FRUs
Follow the procedures in the SPARC M7 Series Servers Service Manual to
replace:
• Service Processors
– Interpret SP General Status LEDs
– Determine which SP is managing system activity
– Determine which SPP is managing DCU activity (SPARC M7-16)
• PDECBs
• Switch Chassis
Logical Domain Configuration
Exercise
In this exercise, perform the tasks that are presented.
Preparation
• Prepare for this exercise by obtaining the lab configuration information and
reviewing the Oracle VM Server for SPARC documentation.
• The goal of this lab is a brief introduction to building logical domains. You will build two guests that provide a sufficient example of how the ldm commands are used to display guest configuration.
Task 1: Verification
NOTE: Make certain you run all ldm commands as the root user
1. Verify that the Logical Domains Manager daemon (ldmd) service is running properly
# ldm ls
NAME STATE FLAGS CONS VCPU MEMORY UTIL NORM UPTIME
primary active -n-c-- UART 384 4193024M 0.1% 0.1% 25m
Task 2: LDOM Configuration
1. Start a delayed reconfiguration to make resource changes on primary
# ldm start-reconf primary
Initiating a delayed reconfiguration operation on the primary
domain. All configuration changes for other domains are disabled
until the primary domain reboots, at which time the new
configuration for the primary domain will also take effect.
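While the delayed reconfiguration is pending, resources are typically trimmed from the primary domain and the domain is then rebooted. One plausible sequence (an illustration with arbitrary sizes, not the checklist's actual step 2):

# ldm set-vcpu 16 primary
# ldm set-memory 32G primary
# reboot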
Task 2: LDOM Configuration cont.
3. Add the new configuration to the SP.
# ldm add-config initial
# ldm list-config
factory-default
initial [current]
Task 3: Virtual Services
1. Configure the virtual services, including console, disk and network
# ldm add-vcc port-range=5000-5100 primary-vcc0 primary
# ldm add-vds primary-vds0 primary
# ldm add-vsw net-dev=net0 primary-vsw0 primary
Task 3: Virtual Services cont.
# ldm list-services
VCC
NAME LDOM PORT-RANGE
primary-vcc0 primary 5000-5100
VSW
NAME LDOM MAC NET-DEV ID DEVICE LINKPROP DEFAULT-VLAN-ID PVID VID MTU MODE INTER-VNET-LINK
primary-vsw0 primary 00:14:4f:fb:5c:b8 net0 0 switch@0 1 1 1500 on
VDS
NAME LDOM VOLUME OPTIONS MPGROUP DEVICE
primary-vds0 primary
Task 4: Logical Domains Creation
1. Create the logical domains
# ldm add-domain guest1
# ldm add-domain guest2
2. Add vcpus
# ldm add-vcpu 128 guest1
# ldm add-vcpu 128 guest2
3. Add memory
# ldm add-memory 10G guest1
# ldm add-memory 10G guest2
Task 5: Virtual Disk Devices
1. Create virtual disk devices (i.e., guest boot disks)
# ldm add-vdsdev /export/ldoms/disk1 vol1@primary-vds0
# ldm add-vdsdev /export/ldoms/disk2 vol2@primary-vds0
3. Create vdsdev devices backing the Solaris 11 ISO text installer (so that Solaris can be installed)
# ldm add-vdsdev /export/ldoms/sol-11_3-text-sparc.iso iso1@primary-vds0
# ldm add-vdsdev /export/ldoms/sol-11_3-text-sparc.iso.2 iso2@primary-vds0
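The backing files exported above must already exist on the control domain; for plain-file boot disks they can be created with mkfile, for example (the sizes are illustrative):

# mkfile 20g /export/ldoms/disk1
# mkfile 20g /export/ldoms/disk2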
Task 6: Root Complex Assignment
Root complex assignment per HOST
HOST0: pci_0(guest1), pci_5(guest2)
HOST1: pci_20(guest1), pci_25(guest2)
HOST2: pci_40(guest1), pci_45(guest2)
HOST3: pci_60(guest1), pci_65(guest2)
1. Remove some root complexes from primary (NOTE: this is dynamic I/O)
# ldm rm-io pci_0 primary
# ldm rm-io pci_5 primary
2. Assign root complexes to guests
# ldm add-io pci_0 guest1
# ldm add-io pci_5 guest2
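Before the guests are bound in the next task, the virtual disks and network devices also need to be attached; a sketch for guest1, using volume and alias names consistent with the devalias output shown in Task 8:

# ldm add-vdisk vdisk1 vol1@primary-vds0 guest1
# ldm add-vdisk iso1 iso1@primary-vds0 guest1
# ldm add-vnet vnet1 primary-vsw0 guest1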
Task 7: Final Prep and Start
1. Bind resources
# ldm bind-domain guest1
# ldm bind-domain guest2
2. Set auto-boot? false
# ldm set-var "auto-boot?=false" guest1
# ldm set-var "auto-boot?=false" guest2
3. Start guests
# ldm start-domain guest1
LDom guest1 started
# ldm start-domain guest2
LDom guest2 started
4. Save the config to the SP
# ldm add-config guests
Task 8: OS Installation
1. Connect to the guest console (console logs are kept in /var/log/vntsd)
# telnet localhost 5000
{0} ok show-disks
a) /reboot-memory@0
b) /virtual-devices@100/channel-devices@200/disk@1
c) /virtual-devices@100/channel-devices@200/disk@0
d) /iscsi-hba/disk
q) NO SELECTION
Enter Selection, q to quit: q
{0} ok devalias
iso1 /virtual-devices@100/channel-devices@200/disk@1
vdisk1 /virtual-devices@100/channel-devices@200/disk@0
vnet1 /virtual-devices@100/channel-devices@200/network@0
net /virtual-devices@100/channel-devices@200/network@0
disk /virtual-devices@100/channel-devices@200/disk@0
virtual-console /virtual-devices/console@1
name aliases
Task 8: OS Installation cont.
2. Install Solaris 11.3 from the ISO text image (NOTE: For simplicity, do NOT configure the network address)
{0} ok boot iso1:f
Boot device: /virtual-devices@100/channel-devices@200/disk@1:f File and args:
SunOS Release 5.11 Version 11.3 64-bit
Copyright (c) 1983, 2015, Oracle and/or its affiliates.
All rights reserved. Remounting root read/write
Probing for device nodes ...
3. Reboot domain
Task 9: Domain Verification
1. Check the domain
# prtdiag
# hotplug list -c
# cfgadm -v
2. Check the resources and dependencies
# ldm list-group -l -d guest1
# ldm list-netdev guest1
# ldm list-dependencies -l
# ldm list-rsrc-group -l -d guest1
Data Collection and Fault Diagnosis
Exercise
In this exercise, perform the tasks that are presented.
Preparation
• Prepare for this exercise by obtaining the lab configuration information as well as
the documentation listed in this lab.
Task 1: Verification
• Check the live system:
– PSN
– show /System/Open_Problems
– show disabled
– show components
– show /SYS -t -l all current_config_state==[!E]* current_config_state
disable_reason
Task 2: Data Collection
• Data collection
– snapshot vs explorer (CDOM, GDOM ...)
• Collect the snapshot (Doc ID 1020204.1)
– set /SP/diag/snapshot dataset=fruid
– set /SP/diag/snapshot dump_uri=xxxxx
– BUI
• Go through snapshot
– Look at interesting files/info
– Observe snapshot structure and files
Task 3: Fault Management
• Use the faultmgmt shell, rshell, XIR, SP traces from rshell, and so on
– -> start -script /SP/faultmgmt/shell
– faultmgmtsp>
• Review example FMA logs
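An illustrative faultmgmt shell session (commands as in Oracle Solaris fault management; output omitted):

-> start -script /SP/faultmgmt/shell
faultmgmtsp> fmadm faulty
faultmgmtsp> fmdump -v
faultmgmtsp> exit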
Task 4: Hangs and XIRs
• -> set SESSION mode=restricted
WARNING: The "Restricted Shell" account is provided solely
to allow Services to perform diagnostic tasks.
[(restricted_shell) sca-m74-046-sp1:~]#
– Pressing Tab twice lists all available commands
– showpsnc
– traceroute, ping
• Review host hang (Doc ID 2063096.1)
• Initiate an XIR from the SP, and an XIR from an SPP
– options
– retrieve xir files
Task 5: Troubleshooting
• Check the device paths
• Review DIMM sparing policy
Task 6: EoUSB
• EoUSB (Doc ID 2063349.1)
– /System/Other_Removable_Devices/Service_Processors/Service_Processor_0/Service_Processor_Module_0
– ldm list-netstat
– ldm list-netdev
– dladm show-phys -L
– dladm show-vnic
– hotplug / cfgadm
– ilomconfig list interconnect
– /var/svc/log/network-ilomconfig-interconnect:default.log
– SP/SPP failover
– loss of CMIOU / pcie-path
• remove/deconfigure the CMIOU hosting the pci-path and observe SP failover upon restart
Task 7: Faults and Dumps
• [IO] fault proxying (Doc ID 1942045.1)
• Configure deferred dump (Doc ID 2012629.1)
– check enable
– force crash and observe
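A hedged sketch of checking and exercising deferred dump from the domain (reboot -d forces a crash dump before rebooting; use only on a lab system):

# dumpadm
# reboot -d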