Ek DS150 SG A01
Service Guide
This manual is intended for service providers and self-maintenance customers for DS15 systems.
Hewlett-Packard Company
October 2003

© 2003 Hewlett-Packard Company.

Linux is a registered trademark of Linus Torvalds in several countries. UNIX is a trademark of The Open Group in the United States and other countries. All other product names mentioned herein may be trademarks of their respective companies.

HP shall not be liable for technical or editorial errors or omissions contained herein. The information in this document is provided "as is" without warranty of any kind and is subject to change without notice. The warranties for HP products are set forth in the express limited warranty statements accompanying such products. Nothing herein should be construed as constituting an additional warranty.

FCC Notice
This equipment generates, uses, and may emit radio frequency energy. The equipment has been type tested and found to comply with the limits for a Class A digital device pursuant to Part 15 of FCC rules, which are designed to provide reasonable protection against such radio frequency interference. Operation of this equipment in a residential area may cause interference, in which case the user, at his own expense, will be required to take whatever measures may be required to correct the interference. Any modifications to this device, unless expressly approved by the manufacturer, can void the user's authority to operate this equipment under Part 15 of the FCC rules.

Modifications
The FCC requires the user to be notified that any changes or modifications made to this device that are not expressly approved by Hewlett-Packard Company may void the user's authority to operate the equipment.

Cables
Connections to this device must be made with shielded cables with metallic RFI/EMI connector hoods in order to maintain compliance with FCC Rules and Regulations.

Taiwanese Notice
Japanese Notice
Canadian Notice
This Class A digital apparatus meets all requirements of the Canadian Interference-Causing Equipment Regulations.

Avis Canadien
Cet appareil numérique de la classe A respecte toutes les exigences du Règlement sur le matériel brouilleur du Canada.

European Union Notice
Products with the CE Marking comply with both the EMC Directive (89/336/EEC) and the Low Voltage Directive (73/23/EEC) issued by the Commission of the European Community. Compliance with these directives implies conformity to the following European Norms (in brackets are the equivalent international standards):
EN55022 (CISPR 22) - Electromagnetic Interference
EN50082-1 (IEC801-2, IEC801-3, IEC801-4) - Electromagnetic Immunity
EN60950 (IEC950) - Product Safety

Warning! This is a Class A product. In a domestic environment this product may cause radio interference, in which case the user may be required to take adequate measures.

Achtung! Dieses ist ein Gerät der Funkstörgrenzwertklasse A. In Wohnbereichen können bei Betrieb dieses Gerätes Rundfunkstörungen auftreten, in welchen Fällen der Benutzer für entsprechende Gegenmaßnahmen verantwortlich ist.

Attention! Ceci est un produit de Classe A. Dans un environnement domestique, ce produit risque de créer des interférences radioélectriques, il appartiendra alors à l'utilisateur de prendre les mesures spécifiques appropriées.
Contents
Preface ..... xv
Chapter 1 System Overview
1.1 System Enclosure Configurations ..... 1-2
1.2 Common Components ..... 1-5
1.3 Front View ..... 1-6
1.4 Top View ..... 1-8
1.5 Rear Ports and Slots ..... 1-10
1.6 Network Connections ..... 1-12
1.7 Operator Control Panel ..... 1-14
  1.7.1 Remote Commands ..... 1-15
1.8 System Motherboard ..... 1-16
  1.8.1 CPU ..... 1-17
  1.8.2 DIMMS ..... 1-17
  1.8.3 PCI ..... 1-17
1.9 Slots on the PCI Riser Card ..... 1-18
1.10 Storage Cage Options ..... 1-20
  1.10.1 Internal Storage Cage ..... 1-20
  1.10.2 Front Access Storage Cage ..... 1-22
1.11 Console Terminal ..... 1-24
1.12 Power Connection ..... 1-26
1.13 System Access Lock ..... 1-28
Chapter 2 Troubleshooting
2.1 Questions to Consider ..... 2-2
2.2 Diagnostic Categories ..... 2-3
  2.2.1 Error Beep Codes ..... 2-4
  2.2.2 Diagnostic LEDs on the OCP ..... 2-5
  2.2.3 Power Problems ..... 2-7
  2.2.4 Problems Getting to Console Mode ..... 2-8
  2.2.5 Problems Reported by the Console ..... 2-9
  2.2.6 Boot Problems ..... 2-10
  2.2.7 Errors Reported by the Operating System ..... 2-11
  2.2.8 Memory Problems ..... 2-12
  2.2.9 PCI Bus Problems ..... 2-13
  2.2.10 SCSI Problems ..... 2-14
  2.2.11 Thermal Problems and Environmental Status ..... 2-15
2.3 Fail-Safe Booter Utility ..... 2-16
  2.3.1 Starting the FSB Automatically ..... 2-16
  2.3.2 Starting the FSB Manually ..... 2-16
  2.3.3 Required Firmware ..... 2-17
2.4 Updating Firmware ..... 2-18
2.5 Service Tools and Utilities ..... 2-20
  2.5.1 Error Handling/Logging Tools (System Event Analyzer) ..... 2-20
  2.5.2 Loopback Tests ..... 2-20
  2.5.3 SRM Console Commands ..... 2-20
  2.5.4 Remote Management Console (RMC) ..... 2-21
  2.5.5 Crash Dumps ..... 2-21
2.6 Q-Vet Installation Verification ..... 2-22
  2.6.1 Installing Q-Vet ..... 2-24
  2.6.2 Running Q-Vet ..... 2-26
  2.6.3 Reviewing Results of the Q-Vet Run ..... 2-28
  2.6.4 De-Installing Q-Vet ..... 2-29
2.7 Information Resources ..... 2-30
  2.7.1 HP Service Tools CD ..... 2-30
  2.7.2 DS15 Service HTML Help File ..... 2-30
  2.7.3 Alpha Systems Firmware Updates ..... 2-30
  2.7.4 Fail-Safe Booter ..... 2-31
  2.7.5 Software Patches ..... 2-31
  2.7.6 Learning Utility ..... 2-31
  2.7.7 Late-Breaking Technical Information ..... 2-31
  2.7.8 Supported Options ..... 2-31
Chapter 3
3.1 Overview of Power-Up Diagnostics ..... 3-2
3.2 System Power-Up Sequence ..... 3-3
3.3 Power-Up Displays ..... 3-5
  3.3.1 Power-Up Display ..... 3-5
  3.3.2 Console Power-Up Display ..... 3-8
  3.3.3 SRM Console Event Log ..... 3-10
3.4 Power-Up Error Messages ..... 3-12
  3.4.1 Checksum Error ..... 3-13
  3.4.2 SROM Memory Configuration Errors ..... 3-15
3.5 Forcing a Fail-Safe Load ..... 3-17
  3.5.1 Starting the FSB Automatically ..... 3-17
  3.5.2 Starting the FSB Manually ..... 3-17
3.6 Updating the RMC ..... 3-19
3.7 Field Use of a Floppy Diskette ..... 3-20
Chapter 4
4.1 Diagnostic Command Summary ..... 4-2
4.2 Buildfru ..... 4-4
4.3 cat el and more el ..... 4-8
4.4 clear_error ..... 4-9
4.5 crash ..... 4-10
4.6 deposit and examine ..... 4-11
4.7 exer ..... 4-15
4.8 grep ..... 4-20
4.9 hd ..... 4-22
4.10 info ..... 4-24
4.11 kill and kill_diags ..... 4-39
4.12 memexer ..... 4-40
4.13 memtest ..... 4-42
4.14 net ..... 4-47
4.15 nettest ..... 4-49
4.16 set sys_serial_num ..... 4-52
4.17 show error ..... 4-53
4.18 show fru ..... 4-55
4.19 show_status ..... 4-58
4.20 sys_exer ..... 4-60
4.21 test ..... 4-62
Chapter 5 Error Logs
5.1 Error Log Analysis with System Event Analyzer ..... 5-2
  5.1.1 WEB Enterprise Service (WEBES) Director ..... 5-3
  5.1.2 Using System Event Analyzer ..... 5-4
  5.1.3 Bit to Text ..... 5-8
5.2 Fault Detection and Reporting ..... 5-14
5.3 Machine Checks/Interrupts ..... 5-16
  5.3.1 Error Logging and Event Log Entry Format ..... 5-18
Chapter 6
6.1 System Consoles ..... 6-2
  6.1.1 Selecting the Display Device ..... 6-3
6.2 Displaying the Hardware Configuration ..... 6-4
6.3 Setting Environment Variables ..... 6-5
6.4 Setting Automatic Booting ..... 6-14
  6.4.1 Setting the Operating System to Auto Start ..... 6-14
6.5 Changing the Default Boot Device ..... 6-15
6.6 Setting SRM Security ..... 6-16
6.7 Configuring Devices ..... 6-19
  6.7.1 CPU Location ..... 6-20
  6.7.2 Memory Configuration ..... 6-21
  6.7.3 PCI Configuration and Installation ..... 6-25
6.8 Booting Linux ..... 6-28
Chapter 7
7.1 RMC Overview ..... 7-2
7.2 Operating Modes ..... 7-4
  7.2.1 Bypass Modes ..... 7-6
7.3 Terminal Setup ..... 7-9
7.4 SRM Environment Variables for COM1 ..... 7-10
7.5 Entering the RMC ..... 7-11
7.6 Using the Command-Line Interface ..... 7-13
  7.6.1 Displaying the System Status ..... 7-14
  7.6.2 Displaying the System Environment ..... 7-18
  7.6.3 Using Power On and Off, Reset, and Halt Functions ..... 7-19
7.7 Configuring Remote Dial-In ..... 7-21
7.8 Configuring Dial-Out Alert ..... 7-25
7.9 RMC Firmware Update and Recovery ..... 7-29
7.10 Resetting the RMC to Factory Defaults ..... 7-32
7.11 RMC Command Reference ..... 7-35
7.12 Troubleshooting Tips ..... 7-44
Chapter 8
8.1 Overview of FRU Procedures ..... 8-1
8.2 Important Information before Replacing FRUs ..... 8-4
8.3 Recommended Spares ..... 8-5
  8.3.1 Power Cords ..... 8-7
8.4 FRU Locations ..... 8-8
8.5 Removing the Top Cover ..... 8-10
8.6 Removing the Side Panel ..... 8-12
8.7 Replacing the PCI Fan ..... 8-14
8.8 Replacing the CPU Fan ..... 8-16
8.9 Replacing the Disk in Center Internal Storage Bay ..... 8-18
8.10 Replacing a Front Access Drive ..... 8-22
8.11 Accessing the Front Access Storage Cage ..... 8-26
8.12 Accessing the Internal Storage Cage ..... 8-28
8.13 Replacing or Installing a PCI Option Module ..... 8-30
8.14 Replacing the PCI Riser Card ..... 8-34
8.15 Replacing Bottom Drive - Front Access Storage Cage ..... 8-36
8.16 Replacing Bottom Drive - Internal Storage Cage ..... 8-39
8.17 Replacing Middle Drive - Internal Storage Cage ..... 8-41
8.18 Replacing DVD/CD-RW Drive - Internal Storage Cage ..... 8-43
8.19 Replacing the Power Supply ..... 8-45
8.20 Replacing the System Fan ..... 8-49
8.21 Removing or Installing a Memory DIMM ..... 8-51
  8.21.1 Removing a Memory DIMM ..... 8-54
  8.21.2 Installing a Memory DIMM ..... 8-55
8.22 Replacing the Operator Control Panel ..... 8-57
8.23 Replacing the Speaker ..... 8-61
8.24 Preparing to Replace the Motherboard ..... 8-63
8.25 Removing Intervening Components ..... 8-64
8.26 Replacing the Motherboard ..... 8-69
8.27 Reinstalling System Components ..... 8-71
Appendix A
A.1 Location of Jumpers ..... A-2
A.2 Function of Jumpers ..... A-3
  A.2.1 System Jumpers ..... A-3
  A.2.2 Server Management Jumpers ..... A-4
  A.2.3 Jumper for COM1 Pass through Enable ..... A-5
A.3 Setting Jumpers ..... A-6
Appendix B
B.1 Information for Isolating Failures ..... B-2
B.2 DIMM Isolation Procedure ..... B-3
B.3 EV68 Single-Bit Errors ..... B-12
Examples
Example 2-1 Memory Sizing ..... 2-12
Example 2-2 Running LFU ..... 2-18
Example 3-1 Sample Power-Up Display ..... 3-5
Example 3-2 Power-Up Display ..... 3-8
Example 3-3 Sample Console Event Log ..... 3-10
Example 3-4 Using the Log Command to Check for Errors ..... 3-11
Example 3-5 Checksum Error and Fail-Safe Boot Console ..... 3-13
Example 3-6 Report for Illegal DIMM ..... 3-15
Example 3-7 Report for Missing DIMM ..... 3-15
Example 3-8 Report for Incompatible DIMM ..... 3-16
Example 3-9 Report for Failed DIMM ..... 3-16
Example 4-1 Buildfru Command ..... 4-4
Example 4-2 more el ..... 4-8
Example 4-3 clear_error ..... 4-9
Example 4-4 deposit and examine ..... 4-11
Example 4-5 exer ..... 4-15
Example 4-6 grep ..... 4-20
Example 4-7 hd ..... 4-22
Example 4-8 info 0 ..... 4-25
Example 4-9 info 1 ..... 4-26
Example 4-10 info 2 ..... 4-27
Example 4-11 info 3 ..... 4-28
Example 4-12 info 4 ..... 4-29
Example 4-13 info 5 ..... 4-31
Example 4-14 info 6 ..... 4-35
Example 4-15 info 7 ..... 4-37
Example 4-16 info 8 ..... 4-38
Example 4-17 kill and kill_diags ..... 4-39
Example 4-18 memexer ..... 4-40
Example 4-19 memtest ..... 4-42
Example 4-20 net -ic and net -s ..... 4-47
Example 4-21 nettest ..... 4-49
Example 4-22 set sys_serial_num ..... 4-52
Example 4-23 show error ..... 4-53
Example 4-24 show fru ..... 4-55
Example 4-25 show_status ..... 4-58
Example 4-26 sys_exer ..... 4-60
Example 4-27 test -lb ..... 4-62
Example 6-1 Set Password ..... 6-16
Example 6-2 set secure ..... 6-17
Example 6-3 clear password ..... 6-18
Example 6-4 Linux Boot Output ..... 6-29
Example 7-1 Dial-In Configuration ..... 7-21
Example 7-2 Unsetting the Password ..... 7-24
Example 7-3 Dial-Out Alert Configuration ..... 7-25
Figures
Figure 1-1 DS15 Rackmounted and Pedestal System ..... 1-2
Figure 1-2 DS15 Desktop System with Internal Storage Cage Option ..... 1-3
Figure 1-3 DS15 Desktop System with Front Access Storage Cage Option ..... 1-4
Figure 1-4 Front View with Optional Front Access Storage Cage ..... 1-6
Figure 1-5 Top View ..... 1-8
Figure 1-6 Rear Ports and Slots ..... 1-10
Figure 1-7 Ethernet Network Connection ..... 1-12
Figure 1-8 Network LED Indicators ..... 1-13
Figure 1-9 Operator Control Panel ..... 1-14
Figure 1-10 System Motherboard ..... 1-16
Figure 1-11 Slots on the PCI Riser Card ..... 1-18
Figure 1-12 Internal Storage Cage Configuration ..... 1-20
Figure 1-13 Front Access Storage Cage Configuration ..... 1-22
Figure 1-14 Console Terminal Connected to COM Port ..... 1-24
Figure 1-15 Console Terminal Connected to Optional Video Card ..... 1-25
Figure 1-16 Connecting the Power for the Desktop ..... 1-26
Figure 1-17 Connecting the Power for a Rackmount System ..... 1-27
Figure 1-18 System Access Lock ..... 1-28
Figure 2-1 LED Patterns during Power-Up ..... 2-5
Figure 2-2 FSB Switch "On" Setting ..... 2-17
Figure 3-1 Power-Up Sequence ..... 3-4
Figure 3-2 FSB Switch "On" Setting (Rackmounted Orientation) ..... 3-18
Figure 3-3 Location of Floppy Device Connector ..... 3-20
Figure 5-1 System Event Analyzer Initial Screen ..... 5-4
Figure 5-2 Problem Reports Screen ..... 5-5
Figure 5-3 System Event Analyzer Problem Report Details ..... 5-6
Figure 5-4 Correctable System Event Sample Table ..... 5-9
Figure 6-1 CPU Location ..... 6-20
Figure 6-2 Stacked and Unstacked DIMMs ..... 6-22
Figure 6-3 Memory Configuration ..... 6-24
Figure 6-4 Slots on the PCI Riser Card ..... 6-27
Figure 7-1 Data Flow in Through Mode ..... 7-4
Figure 7-2 Data Flow in Bypass Mode ..... 7-6
Figure 7-3 Setup for RMC with VT Terminal ..... 7-9
Figure 7-4 Setup for RMC with VGA Monitor ..... 7-9
Figure 7-5 RMC Jumpers (Default Positions) ..... 7-34
Figure 8-1 FRU Locations: Front and Top ..... 8-8
Figure 8-2 Removing the Top Cover ..... 8-10
Figure 8-3 Removing the Side Panel ..... 8-12
Figure 8-4 Replacing the PCI Fan ..... 8-14
Figure 8-5 Replacing the CPU Fan
Figure 8-6 Accessing the Center Internal Storage Bay
Figure 8-7 Replacing the Disk in the Center Internal Storage Bay
Figure 8-8 Replacing a Front Access Disk Drive
Figure 8-9 Replacing a Front Access Tape Drive
Figure 8-10 Accessing the Front Access Storage Cage
Figure 8-11 Accessing the Internal Storage Cage
Figure 8-12 Slots on the PCI Riser Card
Figure 8-13 Replacing or Installing a PCI Option Module
Figure 8-14 Replacing the PCI Riser Card
Figure 8-15 Replacing Bottom Drive - Front Access Storage Cage
Figure 8-16 Replacing Bottom Drive - Internal Storage Cage
Figure 8-17 Replacing Middle Drive - Internal Storage Cage
Figure 8-18 Replacing DVD/CD-RW Drive - Internal Storage Cage
Figure 8-19 Removing Connectors from the Power Supply
Figure 8-20 Replacing the Power Supply
Figure 8-21 Replacing the System Fan
Figure 8-22 Locations for DIMMs on the Motherboard
Figure 8-23 Removing a Memory DIMM
Figure 8-24 Installing a Memory DIMM
Figure 8-25 Removing the Front Bezel
Figure 8-26 Replacing the Operator Control Panel
Figure 8-27 Replacing the Speaker
Figure 8-28 Components Connected to the Motherboard
Figure 8-29 Removing Rear Screws
Figure 8-30 Removing the Center Support Bracket
Figure 8-31 Replacing the Motherboard
Figure A-1 Locations of Jumpers
Tables
Table 1-1 How Physical I/O Slots Map to Logical Slots
Table 2-1 Error Beep Codes
Table 2-2 OCP Switches
Table 2-3 OCP LED Indications
Table 2-4 Power Problems
Table 2-5 Problems Getting to Console Mode
Table 2-6 Problems Reported by the Console
Table 2-7 Boot Problems
Table 2-8 Errors Reported by the Operating System
Table 2-9 Memory Testing
Table 3-1 Error Beep Codes
Table 4-1 Summary of Diagnostic and Related Commands
Table 4-2 Show Error Message Translation
Table 5-1 DS15 Fault Detection and Correction
Table 5-2 Machine Checks/Interrupts
Table 5-3 Sample Error Log Event Structure Map
Table 6-1 SRM Environment Variables
Table 6-2 Comparison of Physical and Logical Slot Numbering
Table 6-3 How Physical I/O Slots Map to Logical Slots
Table 7-1 Status Command Fields
Table 7-2 Modem Initialization Commands
Table 7-3 Elements of Dial String and Alert String
Table 7-4 DS15 initialization commands with MODEMDEF enabled
Table 7-5 RMC Troubleshooting
Table 8-1 Recommended Spares
Table 8-2 Optional Disk and Tape Drives
Table 8-3 Country-Specific Power Cords
Table 8-4 DIMM and Array Reference
Table A-1 Jumpers for System-Level Functions
Table A-2 Server Management Jumpers
Table A-3 Jumper to Enable COM1 Pass through Mode
Table B-1 Information Needed to Isolate Failing DIMMs
Table B-2 Determining the Real Failed Array for 2-Way Interleaving
Table B-3 Description of DPR Locations 80, and 84
Table B-4 Failing DIMM Lookup Table
Table B-5 Syndrome to Data Check Bits Table
Index
Preface
Intended Audience
This manual is for service providers and self-maintenance customers for AlphaServer DS15 systems.
Document Structure
This manual uses a structured documentation design. Topics are organized into small sections, usually consisting of two facing pages. Most topics begin with an abstract that provides an overview of the section, followed by an illustration or example. The facing page contains descriptions, procedures, and syntax definitions.
This manual contains eight chapters and two appendixes.
• Chapter 1, System Overview, provides an overview of the system.
• Chapter 2, Troubleshooting, describes the starting points for diagnosing problems on AlphaServer DS15 systems and also provides information resources.
• Chapter 3, Power-Up Diagnostics and Display, explains the power-up process and RMC, SROM, and SRM power-up diagnostics.
• Chapter 4, SRM Console Diagnostics, describes troubleshooting with the SRM console.
• Chapter 5, Error Logs, explains how to interpret error logs reported by the operating system.
• Chapter 6, System Configuration and Setup, describes how to configure and set up a DS15 system.
• Chapter 7, Using the Remote Management Console, explains how to manage the system through the remote management console (RMC).
• Chapter 8, FRU Removal and Replacement, describes the procedures for removing and replacing Field Replaceable Units (FRUs) on AlphaServer DS15 systems.
• Appendix A, Jumpers on System Motherboard, provides detailed information on the configuration of jumpers on the system motherboard.
• Appendix B, Isolating Failing DIMMs, explains how to manually isolate a failing DIMM from the failing address and failing data bits.
Documentation Titles
hp AlphaServer DS15 and AlphaStation DS15 Documentation
Title                                                   Order Number
User Documentation Kit                                  QA-72XAA-G8
DS15 AlphaServer and DS15 AlphaStation Owners Guide     EKDS150OG
AlphaServer DS15 Quick Setup                            EKDS150IG
AlphaServer DS15 Floor Stand Kit                        EKDS150FS
DS15 AlphaServer and DS15 AlphaStation Service Guide    EKDS150SG
CD-ROM Installation Guide                               EKDS152CD
AlphaServer DS15 Release Notes                          EKDS150RN
Chapter 1 System Overview
This chapter provides an overview of the system, including:
• System Enclosure Configurations
• Common Components
• Front View
• Top View
• Rear Ports and Slots
• Network Connection
• Operator Control Panel
• System Motherboard
• PCI Slots
• Storage Cage Options
• Console Terminal
• Power Connection
• System Access Lock
1.1
System Enclosure Configurations
The DS15 family consists of a rackmounted system, a standalone pedestal system, and a desktop system. All have similar features, components, capabilities and options; the desktop system will be shown throughout this manual in illustrations and examples.
Figure 1-1 DS15 Rackmounted and Pedestal System
Figure 1-2 DS15 Desktop System with Internal Storage Cage Option
Figure 1-3 DS15 Desktop System with Front Access Storage Cage Option
1.2
Common Components
The basic building block of AlphaServer DS15 systems is the system enclosure chassis, which houses the following common components:
• Alpha 1 GHz CPU with 2 MB onboard ECC cache
• 512 MB, 1 GB, or 2 GB SDRAM memory, expandable to a maximum memory capacity of 4 GB
• Onboard dual 10/100 BaseT Ethernet ports
• Four 64-bit PCI expansion slots
• Onboard dual-channel Ultra160 SCSI controller
• Choice of storage cage subsystems:
  a. Internal Storage Cage with a maximum SCSI storage capacity of 218.4 GB
  b. Front Access Storage Cage with a maximum SCSI storage capacity of 510.4 GB
• Two serial ports:
  a. COM1 port with RMC port with modem control and a full-duplex asynchronous communications port
  b. COM2 port with full-duplex asynchronous communications port
• PS/2-style keyboard port and mouse port
• 400 W (120/240 V, 60/50 Hz) power supply
1.3
Front View
Figure 1-4 Front View with Optional Front Access Storage Cage
1. Center internal storage bay
2. DVD/CD-RW drive
3. Disk storage
4. Operator control panel
1.4
Top View
Figure 1-5 Top View
1. Operator Control Panel
2. DVD/CD-RW drive
3. Internal disk drive
4. Power supply
5. PCI riser
6. CPU
7. System motherboard
8. Memory
9. Speaker (hidden)
10. Center internal storage bay
11. Cover
1.5
Rear Ports and Slots
Figure 1-6 Rear Ports and Slots
1. Power supply ground
2. Key
3. Mouse connector
4. PCI Slots
5. Ethernet port B
6. Ethernet port A
7. Cable run hook
8. SCSI connector
9. Keyboard connector
10. COM 1 serial port (top), COM 2 serial port (bottom)
11. Power connector
1.6
Network Connections
There are two onboard Ethernet network connectors on the rear of the DS15 system. The DS15 system has dual onboard 10/100 BaseT Ethernet ports. You can connect to either or both.
Figure 1-7 Ethernet Network Connection
Figure 1-8 Network LED indicators
The LEDs to the left of each Ethernet connector indicate its status.
• Speed/Activity LED: indicates activity through the connection.
• Link LED: a network connection exists when this LED is lit.
1.7
Operator Control Panel
The control panel provides system controls and status indicators. The controls are the Power and Halt/Reset buttons. The panel has a green power LED, a green disk activity indicator LED, and three amber diagnostic LEDs.
Figure 1-9 Operator Control Panel
1. Halt/Reset button
2. Amber system fault LED
3. Amber over temperature fault LED
4. Amber fan fault LED
5. Green disk activity LED
6. Green system power LED
7. System Power Switch (On/Off)
NOTE: Jumper J22 (pins 13-14) must be installed for the Halt/Reset button to function as a reset button.
1.7.1
Remote Commands
Commands issued from the remote management console (RMC) can be used to reset, halt, and power the system on or off. For information on RMC, see Chapter 7.
RMC Command    Function
Power on       Turns on power. Emulates pressing the Power button to the On position.
Power off      Turns off power. Emulates pressing the Power button to the Off position.
Halt           Halts the system.
Halt in        Halts the system and causes the halt to remain asserted.
Halt out       Releases a halt created with halt in.
Reset          Resets the system.
1.8
System Motherboard
Figure 1-10 System Motherboard
1. CPU
2. Internal SCSI connector
3. IDE connector
4. Memory DIMM slot - array 2, DIMM 2
5. Memory DIMM slot - array 0, DIMM 0
6. Memory DIMM slot - array 2, DIMM 3
7. Memory DIMM slot - array 0, DIMM 1
8. Slot for PCI riser
1.8.1
CPU
The CPU microprocessor is a superscalar pipelined processor packaged in a 675-pin LGA carrier. The CPU has the ability to issue up to four instructions during each CPU clock cycle and a peak instruction execution rate of four times the CPU clock frequency.
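The peak rate described above is simple arithmetic. The following minimal sketch applies it to the 1 GHz CPU listed under Common Components. The function name is hypothetical, and the result is a theoretical ceiling rather than a measured figure.

```python
def peak_instruction_rate(clock_hz, issue_width=4):
    """Theoretical peak instructions per second: issue width times clock."""
    return issue_width * clock_hz

# 1 GHz clock, up to four instructions issued per clock cycle.
rate = peak_instruction_rate(1_000_000_000)
print(f"{rate / 1e9:.0f} billion instructions per second (peak)")
```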
1.8.2
DIMMs
The AlphaServer DS15 system supports up to two pairs of 200-pin synchronous DIMMs. Supported DIMM sizes are 256 MB, 512 MB, and 1 GB, allowing memory to be configured from 512 MB to 4096 MB.
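The 512 MB to 4096 MB range follows from the supported DIMM sizes. The sketch below enumerates the totals, assuming that the two DIMMs within a pair are identical (as the memory troubleshooting guidance in Chapter 2 requires for each array) and that the two pairs may differ in size from each other. That second assumption is not stated in this section, so treat the intermediate totals as illustrative.

```python
from itertools import combinations_with_replacement

# DIMM sizes, in MB, that the text lists as supported.
SUPPORTED_DIMM_MB = (256, 512, 1024)

def memory_totals(max_pairs=2):
    """Return the sorted total memory sizes (MB) for one or two DIMM pairs.

    Assumption: both DIMMs in a pair are the same size.
    """
    totals = set()
    for pairs in range(1, max_pairs + 1):
        for sizes in combinations_with_replacement(SUPPORTED_DIMM_MB, pairs):
            totals.add(sum(2 * size for size in sizes))
    return sorted(totals)

totals = memory_totals()
print(totals)
```

The smallest total is one pair of 256 MB DIMMs and the largest is two pairs of 1 GB DIMMs, which matches the range given in the text.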
1.8.3
PCI
The AlphaServer DS15 system supports two PCI buses: one serves the onboard integrated I/O, and the other controls the four expansion slots through the PCI riser card.
1.9
PCI Slots
Figure 1-11 Slots on the PCI Riser Card
• Slot 1: 66/33 MHz, 3.3 V
• Slot 2: 66/33 MHz, 3.3 V
• Slot 3: 33 MHz, 3.3 V
• Slot 4: 33 MHz, 3.3 V
• LED connected to +5 VAUX
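When planning where to place an option module, the slot speeds above can be treated as a small lookup table. The sketch below is illustrative only: the names are hypothetical, and the 5 V check reflects the statement in Chapter 2 that five-volt PCI adapters are not allowed.

```python
# Bus speeds (MHz) supported by each slot on the PCI riser card.
PCI_SLOT_SPEEDS = {1: (33, 66), 2: (33, 66), 3: (33,), 4: (33,)}

def slots_running_at(mhz, card_is_5v_only=False):
    """Return the riser slots that can run a card at the given bus speed."""
    if card_is_5v_only:
        return []  # all four slots are 3.3 V
    return [slot for slot, speeds in sorted(PCI_SLOT_SPEEDS.items())
            if mhz in speeds]

print(slots_running_at(66))
print(slots_running_at(33))
```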
1.10
Storage Cage Options
Figure 1-12 Internal Storage Cage Configuration
1. Center internal storage bay
2. DVD/CD-RW drive
3. DVD/CD-RW or internal drive bay (disk or tape)
4. Internal drive bay
Figure 1-13 Front Access Storage Cage Configuration
1. Center internal storage bay
2. DVD/CD-RW drive
3. Universal drive bay
4. Universal drive bay
5. Internal drive bay
1.11
Console Terminal
Figure 1-14 Console Terminal Connected to COM Port
Figure 1-15 Console Terminal Connected to Optional Video Card
1.12
Power Connection
Figure 1-16 Connecting the Power for the Desktop
Figure 1-17 Connecting the Power for a Rackmount System
1. Thumbscrew
2. Power cord bracket with attached screw
3. Power cord
4. Power cord bracket
To connect the power cord, loosen the thumbscrew, plug the cord in, rotate the bracket so that it supports the power cord plug, and tighten the attached screw.
1.13
System Access Lock
Figure 1-18 System Access Lock
Chapter 2 Troubleshooting
This chapter describes the starting points for diagnosing problems on AlphaServer DS15 systems. The chapter also provides information resources.
• Questions to Consider
• Diagnostic Categories
• Fail-Safe Booter Utility
• Updating Firmware
• Service Tools and Utilities
• Q-Vet Installation Verification
• Information Resources
2.1
Questions to Consider
Before troubleshooting any system problem, first check the site maintenance log for the system's service history. Be sure to ask the system manager the following questions:
• Has the system been used and did it work correctly?
• Have changes to hardware or updates to firmware or software been made to the system recently? If so, are the revision numbers compatible for the system? (Refer to the system release notes.)
• What is the current state of the system?
  o If the operating system is down, but you are able to access the SRM console, use the console environment diagnostic tools, including the Operator Control Panel (OCP) LEDs and SRM commands.
  o If you are unable to access the SRM console, enter the Remote Management Console (RMC) command-line interface (CLI) and issue commands to determine the hardware status. See Chapter 7.
  o If the operating system has crashed and rebooted, the Computer Crash Analysis Tool (CCAT), the System Event Analyzer service tools (to interpret error logs), the SRM crash command, and operating system exercisers can be used to diagnose system problems.
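The three system states described above amount to a short decision procedure. A minimal sketch follows; the function and parameter names are hypothetical, and the returned strings only summarize the tools named in this section.

```python
def starting_point(os_up, srm_accessible, crashed_and_rebooted=False):
    """Suggest a first diagnostic step from the current system state."""
    if crashed_and_rebooted:
        return "CCAT, System Event Analyzer, SRM crash command, OS exercisers"
    if not os_up and srm_accessible:
        return "OCP LEDs and SRM console commands"
    if not os_up and not srm_accessible:
        return "RMC command-line interface (see Chapter 7)"
    return "No console-level starting point is given for a running system"

print(starting_point(os_up=False, srm_accessible=True))
print(starting_point(os_up=False, srm_accessible=False))
```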
2.2
Diagnostic Categories
System problems can be classified into the following categories. Using these categories, you can quickly determine a starting point for diagnosis and eliminate the unlikely sources of the problem. The next several subsections group problems into one of several categories.
• Error beep codes
• Diagnostic LEDs on the OCP
• Power problems
• Problems getting to the console mode
• Problems reported by the console mode
• Boot problems
• Errors reported by the operating system
• Memory problems
• PCI bus problems
• SCSI problems
• Thermal problems and environmental status
WARNING: To prevent injury, access is limited to persons who have appropriate technical training and experience. Such persons are expected to understand the hazards of working within this equipment and take measures to minimize danger to themselves or others. These measures include:
1. Remove any jewelry that may conduct electricity.
2. If accessing the system card cage, power down the system and wait 2 minutes to allow components to cool.
3. Wear an anti-static wrist strap when handling internal components.
2.2.1
Error Beep Codes
Audible beep codes announce specific errors that might be encountered while the system is powering up. Table 2-1 identifies the error beep codes.
Action to Repair
No action necessary. Check memory and memory configuration. Check system configurations. Replace the system board. Possible CPU problem.
2.2.2
Diagnostic LEDs on the OCP
Diagnostic LEDs on the operator control panel indicate error conditions and power-up information.
Figure 2-1 LED Patterns during Power-Up
LED On Function
System power is on.
There is disk activity.
There is a fan fault.
The system is over temperature.
There is a system fault.
RMC image is corrupted but the system is not in emergency runtime image recovery mode or emergency runtime image recovery mode has timed out. If recovery has timed out, unplug the system power cord and wait until the LED on the PCI Riser card turns off. Plug in the power cord and try again.
System is in emergency runtime image recovery mode and is awaiting firmware update.
RMC has failed or the system is configured for emergency runtime image recovery but is not powered on.
Firmware update is in progress.
and
Amber/blink in unison
2.2.3
Power Problems
Power problems can prevent the system from operating. Use the following table to troubleshoot these problems.
Check:
• Front-panel power switch
• Power at the wall receptacle
• AC cord
• Power cable connectors
Unplug the power cord for 15 seconds, then reconnect.
Enter the RMC. Use the poe command to check for power-on errors, and use the log or log # command to check the event log for symptoms of failure. Make sure that all jumpers are in their default state.
On, but the monitor screen is black for approximately 40 seconds and then turns blue.
• Monitor power indicator is On.
• Video cable is properly connected.
• SRM console environment variable setting. EV may not be set to graphics.
NOTE: A black raster is displayed if the console environment variable is set to serial mode rather than graphics mode.
2.2.4
Problems Getting to Console Mode
Certain problems can prevent access to console mode. Use the following table to troubleshoot these problems.
Problem/Possible Cause: Power-up screen is not displayed at system console.
Action:
• Note any error beep codes and observe the OCP for a failure detected during self-tests.
• Check keyboard and monitor connections.
• Press the Return key. If the system enters console mode, check that the console environment variable is set correctly.
• If the console terminal is a VGA monitor, the console variable should be set to graphics. If it is a serial terminal, the console environment variable should be set to serial.
• If console is set to serial, the power-up screen is routed to the COM1 serial communication port and cannot be viewed from the VGA monitor.
• Try connecting a console terminal to the COM2 serial communication port. When using the COM2 port, set the console environment variable to serial.
• Use RMC commands to determine status.
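The relationship between the terminal type and the console environment variable can be summarized as a lookup. In the sketch below, the dictionary keys and the function name are hypothetical, and the generated command simply follows the set variable value form used for other SRM environment variables in this chapter. Confirm the exact syntax against Chapter 6 before relying on it.

```python
# Console environment variable value expected for each terminal type.
CONSOLE_SETTING = {"vga": "graphics", "serial": "serial"}

def console_command(terminal_type):
    """Return the SRM command that matches the attached console terminal."""
    value = CONSOLE_SETTING[terminal_type]
    return f"set console {value}"

print(console_command("vga"))
print(console_command("serial"))
```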
2.2.5
Problems Reported by the Console
The console may report certain problems. Use the following table to troubleshoot these problems.
Action
Use error beep codes or console serial terminal to determine what error occurred. Check the power-up screen for error messages. Enter the RMC. Use the poe command to check for power-on errors, and use the log or log # command to check the event log for symptoms of failure.
Interpret the error beep codes at power-up and check the power-up screen for a failure detected during self-tests. Examine the console event log (use the more el command) to check for error messages recorded during power-up. If the power-up screen or console event log indicates problems with mass storage devices or PCI devices, or if devices are missing from the show config or show device display, see Sections 2.2.9 and 2.2.10. Enter the RMC and check the power-on errors (poe) and the event log (log, log #) for symptoms of failure. Use the SRM test command to verify the problem.
2.2.6
Boot Problems
Certain problems may interfere with the boot process. Use the following table to troubleshoot these problems.
Action
Install the operating system and license key.
Problem/Possible Cause: Target boot device is not listed in the SRM show device or show config command.
Action: Check the cables. Are the cables oriented properly and not cocked? Are there bent pins? Check all the SCSI devices for incorrect or conflicting IDs. Refer to the device's documentation. SCSI termination: The SCSI bus must be terminated at the end of the internal cable and at the last external SCSI peripheral. Review the position of all relevant jumpers.
Problem/Possible Cause: System cannot find the boot device.
Action: Use the SRM show config and show device commands. Use the displayed information to identify target devices for the boot command, and verify that the system sees all of the installed devices. If you are attempting to use bootp, first set the following variables as shown. (Replace ewa0 with the appropriate device designation.)
>>>set ewa0_inet_init BOOTP
>>>set ewa0_protocols BOOTP
Verify that no unsupported adapters are installed.
This could happen if the main logic board has been replaced, which would cause a loss of the previous configuration information. Use the SRM show and set commands to check and set the values assigned to boot-related variables such as auto_action, bootdef_dev, and boot_osflags.
For problems booting over a network, check the ew*0_protocols, ei*0_protocols, or eg*0_protocols environment variable settings: systems booting from a Tru64 UNIX server should be set to bootp; systems booting from an OpenVMS server should be set to mop. Run the test command to check that the boot device is operating. Check ei*0_mode. Refer to Table 6-1, SRM Environment Variables.
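The protocol choice for network booting can be sketched as a small helper that produces the SRM set commands for a given boot server. The function name and dictionary keys are hypothetical. The BOOTP commands mirror the ones shown above; writing MOP in upper case is an assumption made by analogy, since the text gives the value only as mop.

```python
# Protocol value per boot server operating system, as described in the text.
BOOT_PROTOCOL = {"tru64": "BOOTP", "openvms": "MOP"}

def network_boot_commands(server_os, device="ewa0"):
    """Return the SRM set commands for booting over the given device."""
    protocol = BOOT_PROTOCOL[server_os]
    commands = []
    if protocol == "BOOTP":
        # The text sets inet_init only for the bootp case.
        commands.append(f"set {device}_inet_init BOOTP")
    commands.append(f"set {device}_protocols {protocol}")
    return commands

for command in network_boot_commands("tru64"):
    print(">>>" + command)
```

Pass a different device designation, for example a hypothetical eia0, when the boot adapter is not ewa0.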
2.2.7
Errors Reported by the Operating System
The operating system may hang, crash, or log errors. Use the following table to troubleshoot these problems.
Action
If possible, halt the system by using either the Halt/Reset button or the RMC halt command. (Jumper J22 pins 13-14 must be removed. If jumper J22 is installed, you will reset the system and lose system context, so no crash dump can be acquired.) Then enter the SRM crash command and examine the crash dump file. Refer to the Guide to Kernel Debugging (AA-PS2TD-TE) for information on using the Tru64 UNIX Crash utility.
2.2.8
Memory Problems
Memory problems may affect system performance. Use the following table to troubleshoot these problems.
Problem/Possible Cause: DIMMs ignored by system, or system unstable. System hangs or crashes.
Action: Ensure that each memory array has identical DIMMs installed.
Problem/Possible Cause: DIMMs failing memory power-up self-test.
Action: Replace the DIMMs that the SROM has isolated on power up. See Example 2-1.
Problem/Possible Cause: DIMMs may not have ECC bits.
Action: Some third-party DIMMs may not be compatible with DS15 systems. Ensure that the DIMMs are qualified.
Problem/Possible Cause: Noticeable performance degradation. The system may appear hung or run very slowly.
Action: This could be a result of hard single-bit ECC errors on a particular DIMM. Check the error logs for memory errors. Ensure that the memory DIMMs are qualified.
Testing AAR0
Memory data test in progress
Memory address test in progress
Memory pattern test in progress
Memory initialization
Failed DIMM 3
Loading console
Code execution complete (transfer control)
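When power-up output is captured from a serial console, the failing DIMM can be picked out mechanically. The following sketch assumes the message always takes the form shown in the example (Failed DIMM followed by a number); other SROM message formats are not covered, and the function name is hypothetical.

```python
import re

def failed_dimms(srom_output):
    """Return the DIMM numbers reported as failed in captured SROM output."""
    return [int(n) for n in re.findall(r"Failed DIMM (\d+)", srom_output)]

captured = """Testing AAR0
Memory data test in progress
Memory address test in progress
Memory pattern test in progress
Memory initialization
Failed DIMM 3
Loading console
"""
print(failed_dimms(captured))
```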
2.2.9
PCI Bus Problems
PCI bus problems at startup are usually indicated by the inability of the system to detect the PCI device. Use the following steps to diagnose the likely cause of PCI bus problems.
1. Five-volt PCI adapters are not allowed.
2. Confirm that the PCI option module is supported and has the correct firmware and software versions.
3. Confirm that the PCI option module and any cabling are properly seated.
4. Check for a bad PCI slot by moving the last installed PCI option module to a different slot.
5. Contact HP Service to replace the PCI riser card.
PCI Parity Error
Some PCI devices do not implement PCI parity, and some have a parity generating scheme that may not comply with the PCI specification. In such cases, the device should function properly if parity is not checked. Parity checking can be turned off with the set pci_parity off command so that false PCI parity errors do not result in machine check errors. However, if you disable PCI parity, no parity checking is implemented for any device. Turning off PCI parity is therefore not recommended or supported.
Troubleshooting with the show power Command
The SRM console show power command can help you determine if environmental problems necessitate the replacement of a power supply, system fan or fans, or the motherboard.
Show power indicates:    Action
Bad voltage              Replace the power supply and/or the system motherboard. Contact HP Services.
Bad fan                  Replace the indicated fan or fans. Contact HP Services.
Bad temperature          The problem could be a bad fan or an obstruction to the airflow. Check the airflow first. If there is no obstruction, contact HP Services to replace the bad fan.
Troubleshooting with RMC Commands
The RMC status command displays the system status and the current RMC settings. See Section 7.6.1 for more information.
The RMC env command provides a snapshot of the system environment. See Section 7.6.2 for more information.
The log command prints out a brief summary of the last 10 system events that have been logged. Issuing the log command followed by a number (0-9) provides detailed information about the selected system event (0 = most recent event).
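If RMC commands are sent from a script over a serial connection, the argument rule for log is easy to enforce before the command is sent. This is a minimal sketch with a hypothetical helper name; it only builds the command string and does not talk to the RMC.

```python
def rmc_log_command(event=None):
    """Build an RMC log command: summary, or detail for event 0-9."""
    if event is None:
        return "log"  # brief summary of the last 10 events
    if not 0 <= event <= 9:
        raise ValueError("event number must be 0-9 (0 = most recent)")
    return f"log {event}"

print(rmc_log_command())
print(rmc_log_command(0))
```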
2.3
Fail-Safe Booter Utility
The fail-safe booter utility (FSB) is another variant of the SRM console. The FSB provides an emergency recovery mechanism if the firmware image contained in flash memory becomes corrupted. You can run the FSB and boot another image from a CD-ROM or network that is capable of reprogramming the flash ROM. Use the FSB when one of the following failures at power-up prohibits you from getting to the console program:
• Firmware image in flash memory corrupted
• Power failure or accidental power-down during a firmware upgrade
• Error in the nonvolatile RAM (NVRAM) file
• Incorrect environment variable setting
• Driver error
2.3.1
If the firmware image is unavailable when the system is powered on or reset, the FSB runs automatically.
1. Reset the system to restart the FSB. The FSB loads from the flash.
2. Update the firmware as described in Section 2.4.
2.3.2
1. Power the system off, unplug the AC power cord, and remove the cover.
2. Insert jumper J8 over pins 1-2 on the system motherboard. See Figure 2-2 for the location.
3. Reconnect the AC power cord and reinstall the system cover.
4. Power up the system to the FSB console.
2.3.3
Required Firmware
The required firmware for your system is preloaded onto the flash ROM. Copies of the firmware files are included on your distribution CD. You can also download the latest firmware files from the Alpha systems firmware Web site: ftp://ftp.digital.com/pub/Digital/Alpha/firmware/readme.html The utilities that are used to reload or update the firmware need to find the files on a CD.
2.4
Updating Firmware
Be sure to read the information on starting the FSB utility before continuing with this section.
Updating the Console Firmware
Perform the following steps to update the console firmware. Refer to Example 2-2.
1. Insert the Alpha Firmware CD into the DVD/CD-RW drive.
2. At the SRM console prompt, issue the >>>b dqa0 command.
3. At the UPD> prompt, enter the update SRM command.
After the update has completed, enter the exit command to exit the utility.
UPD> exit
Do you want to do a manual update? [y/(n)] n
UPD> list
Device   Current Revision   Filename    Update Revision
FSB      T6.6-6             fsb_fw      T6.6-8
SRM      T6.6-7             srm_fw      T6.6-7
booter   V0.5-6             booter_fw   No Update Available
rt       V0.6-3             rt_fw       V0.6-3
srom     V1.0-1             srom_fw     V1.0-1
tig      1.9                tig_fw      1.9
UPD> u fsb
Confirm update on: FSB [Y/(N)]y
WARNING: updates may take several minutes to complete for each device.
DO NOT ABORT!
FSB      Updating to V6.6-8... Verifying T6.6-8... PASSED.
UPD> exit
Initializing....
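The list output in the example shows, for each device, the current revision and the revision available on the CD. A short script can flag which devices differ. The sketch below hard-codes the table from the example and uses a hypothetical function name. It compares the revision strings for inequality only and does not attempt to decide which revision is newer.

```python
# Table copied from the "list" output in the firmware update example.
LIST_OUTPUT = """\
Device  Current Revision  Filename   Update Revision
FSB     T6.6-6            fsb_fw     T6.6-8
SRM     T6.6-7            srm_fw     T6.6-7
booter  V0.5-6            booter_fw  No Update Available
rt      V0.6-3            rt_fw      V0.6-3
srom    V1.0-1            srom_fw    V1.0-1
tig     1.9               tig_fw     1.9
"""

def devices_needing_update(list_output):
    """Return devices whose update revision differs from the current one."""
    pending = []
    for line in list_output.splitlines()[1:]:   # skip the header row
        fields = line.split(None, 3)            # device, current, file, update
        if len(fields) != 4:
            continue
        device, current, _filename, update = fields
        if update.strip() not in ("No Update Available", current):
            pending.append(device)
    return pending

pending = devices_needing_update(LIST_OUTPUT)
print(pending)
```

For the table in the example, only the FSB entry differs, which is consistent with the u fsb command that follows it.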
2.5
Service Tools and Utilities
This section lists some of the tools and utilities available for acceptance testing and diagnosis and gives recommendations for their use.
2.5.1
The operating systems provide fault management error detection, handling, notification, and logging. The primary tool for error handling is System Event Analyzer (SEA), a fault analysis utility designed to analyze both single and multiple error/fault events. SEA uses error/fault data sources other than the traditional binary error log. See Chapter 5 for more information.
2.5.2
Loopback Tests
Internal and external loopback tests are used to test the I/O components and adapter cards. The loopback tests are a subset of the SRM diagnostics. Use loopback tests to isolate problems with the COM2 serial port, the parallel port, and Ethernet controllers. See the test command in Chapter 4 for instructions on performing loopback tests.
2.5.3
SRM console commands are used to set and examine environment variables and device parameters. For example, the show configuration and show device commands are used to examine the configuration, and the set envar and show envar commands are used to set and view environment variables. SRM commands are also used to invoke ROM-based diagnostics and to run native exercisers. For example, the test and sys_exer commands are used to test the system. See Chapter 4 for information on running console exercisers. See Chapter 6 for information on configuration-related console commands and environment variables. See Chapter 7 for a list of console commands used most often on AlphaServer DS15 systems.
2.5.4
The remote management console is used for managing the server either locally or remotely. It also plays a key role in error analysis by passing error log information to the dual-port RAM (DPR), which is shared between the RMC and the system motherboard logic, so that this information can be accessed by the system. RMC also controls the diagnostic LEDs on the Operator Control Panel (OCP). RMC has a command-line interface from which you can enter a few diagnostic commands.
RMC can be accessed as long as the power cord for a working supply is plugged into the AC wall outlet and a console terminal is attached to the system. This feature ensures that you can gather information when the operating system is down and the SRM console is not accessible. See Chapter 7.
2.5.5
Crash Dumps
For fatal errors, the operating systems save the contents of memory to a crash dump file. This file can be used to determine why the system crashed. The Computer Crash Analysis Tool (CCAT) is the primary crash dump analysis tool for analyzing crash dumps on Alpha systems. CCAT compares the results of a crash dump with a set of rules. If the results match one or more rules, CCAT notifies the system user of the cause of the crash and provides information to avoid similar crashes in the future.
2.6
CAUTION: Customers are not authorized to access, download, or use Q-Vet. Q-Vet is for use by HP engineers to verify the system installation. Misuse of Q-Vet may result in loss of customer data.
Q-Vet is the Qualification Verifier Exerciser Tool that is used by HP engineers to exercise systems under development. HP recommends running the latest Q-Vet released version to verify that hardware is installed correctly and is operational. Q-Vet does not verify specific operating system or layered product configurations.
The latest Q-Vet release, information, Release Notes, and documentation are located at http://cisweb.mro.cpqcorp.net/projects/qvet/. HP recommends that Compaq Analyze be installed on the operating system prior to running Q-Vet.
CAUTION: Do not install the Digital System Verification Software (DECVET) on the system; use Q-Vet instead.
Non-IVP Q-Vet scripts verify disk operation for some drives with "write enabled" techniques. These are intended for Engineering and Manufacturing Test. Run ONLY IVP scripts on systems that contain customer data or any other items that must not be written over. See the Q-Vet Disk Testing Policy Notice on the Q-Vet Web site for details.
All Q-Vet IVP scripts use Read Only and/or File I/O to test hard drives. Floppy and tape drives are always write tested and should have scratch media installed.
Q-Vet must be de-installed upon completion of system verification.
Swap or Pagefile Space
The system must have adequate swap space (on Tru64 UNIX) or pagefile space (on OpenVMS) for proper Q-Vet operation. You can set this up either before or after Q-Vet installation. During initialization, Q-Vet displays a message indicating the minimum amount of swap/pagefile space needed if it determines that the system does not have enough. You can then reconfigure the system. If you wish to address the swap/pagefile size before running Q-Vet, see the Swap/Pagefile Estimates on the Q-Vet Web site.
2.6.1
Installing Q-Vet
The procedures for installation of Q-Vet differ between operating systems. You must install Q-Vet on each partition in the system. Install and run Q-Vet from the SYSTEM account on OpenVMS and the root account on Tru64 UNIX.
Tru64 UNIX
1. Make sure that there are no old Q-Vet or DECVET kits on the system by using the following command:
   setld -i | grep VET
   Note the names of any listed kits, such as OTKBASExxx, and remove the kits using qvet_uninstall if possible. Otherwise use the command:
   setld -d kit1_name kit2_name kit3_name
2. Copy the kit tar file (QVET_Vxxx.tar) to your system.
3. Be sure that there is no directory named output. If there is, move to another directory or remove the output directory:
   rm -r output
4. Untar the kit with the command:
   tar xvf QVET_Vxxx.tar
   Note: The case of the file name may be different depending upon how it was stored on the system. Also, you may need to enclose the file name in quotation marks if a semicolon is used.
5. Install the kit with the command:
   setld -l output
6. During the install, if you intend to use the GUI, you must select the optional GUI subset (QVETXOSFxxx).
7. The Q-Vet installation will size your system for devices and memory. It also runs qvet_tune. Answer 'y' to the questions that are asked about setting parameters. If you do not, you may have trouble running Q-Vet.
8. After the installation completes, delete the output directory with rm -r output. You can also delete the kit tar file.
9. You must reboot the system before starting Q-Vet. On reboot you can start the Q-Vet GUI via vet& or run the non-GUI (command-line) interface via vet -nw.
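The check for old kits in step 1 of the Tru64 UNIX procedure can be scripted. The following is a minimal sketch only: it assumes that the subset name is the first whitespace-separated field on each line of setld -i output, which this guide does not confirm, and the function names and sample listing are hypothetical.

```python
import shlex


def find_vet_kits(setld_output: str) -> list:
    """Return subset names from lines containing 'VET', mimicking `grep VET`.

    Assumption (not confirmed by this guide): the subset name is the first
    whitespace-separated field on each line of `setld -i` output.
    """
    kits = []
    for line in setld_output.splitlines():
        fields = line.split()
        if "VET" in line and fields:
            kits.append(fields[0])
    return kits


def removal_command(kits: list) -> str:
    """Build the fallback `setld -d` command from step 1, or '' if no kits."""
    if not kits:
        return ""
    return "setld -d " + " ".join(shlex.quote(k) for k in kits)


# Hypothetical listing for illustration; real output will differ.
sample = """OSFBASE540   installed   Base System
QVETBASE123  installed   hypothetical base subset
QVETXOSF123  installed   hypothetical GUI subset
"""
print(removal_command(find_vet_kits(sample)))
```

Prefer qvet_uninstall where it is available, as step 1 states; the generated setld -d line is only the fallback.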
OpenVMS
1. Delete any QVETAXPxxx.A or QVETAXPxxx.EXE file from the current directory.
2. Copy the self-extracting kit image file (QVETAXPxxx.EXE) to the current directory.
3. It is highly recommended, but not required, that you purge the system disk before installing Q-Vet. This will free up space that may be needed for pagefile expansion during the AUTOGEN phase.
   $purge sys$sysdevice:[*]*.*
4. Extract the kit saveset with the command
   $run QVETAXPxxx.EXE
   and verify that the kit saveset was extracted by checking for the "Successful decompression" message.
5. Use @sys$update:vmsinstal for the Q-Vet installation. The installation will size your system for devices and memory. You should choose all the default answers during the Q-Vet installation. This will verify the Q-Vet installation, tune the system, and reboot.
6. During the install, if you do not intend to use the GUI, you can answer no to the question "Do you want to install Q-Vet with the DECwindows Motif interface?"
7. After the installation completes, delete the QVETAXP0xx.A file and the QVETAXPxxx.EXE file. On reboot you can start the Q-Vet GUI via $vet or the command interface via $vet/int=char.
2.6.2
Running Q-Vet
You must run Q-Vet on each partition in the system to verify the complete system. Review the Special Notices and the Testing Notes section of the Release Notes located at http://cisweb.mro.cpqcorp.net/projects/qvet/ before running Q-Vet. Follow the instructions listed for your operating system to run Q-Vet in each partition.
Tru64 UNIX
Graphical Interface
1. From the Main Menu, select IVP, Load Script and select Long IVP (the IVP tests will then load into the Q-Vet process window).
2. Click the Start All button to begin IVP testing.
Command-Line Interface
   > vet -nw
   Q-Vet_setup> execute .Ivp.scp
   Q-Vet_setup> start
Note that there is a "." in front of the script name, and that commands are case sensitive.
OpenVMS
Graphical Interface
1. From the Main Menu, select IVP, Load Script and select Long IVP (the IVP tests will then load into the Q-Vet process window).
2. Click the Start All button to begin IVP testing.
Command-Line Interface
   $ vet /int=char
   Q-Vet_setup> execute ivp.vms
   Q-Vet_setup> start
Note that commands are case sensitive.
NOTE: A short IVP script is provided for a simple verification of device setup. It is selectable from the GUI IVP menu, and the script is called .Ivp_short.scp (ivp_short.vms). This script will run for 15 minutes and then terminate with a Summary log. The short script may be run prior to the long IVP script if desired, but not in place of the long IVP script, which is the full IVP test.
The long IVP will run until the slowest device has completed one pass (typically 2 to 12 hours). This is called a Cycle of Testing.
2.6.3
After running Q-Vet, check the results of the run by reviewing the summary log. If you follow the above steps, Q-Vet will run all exercisers until the slowest device has completed one full pass. Depending on the size of the system (number of CPUs and disks), this will typically take 2 to 12 hours. Q-Vet will then terminate testing and produce a summary log. The termination message will tell you the name and location of this file. All exerciser processes can also be manually terminated with the Suspend and Terminate buttons (stop and terminate commands). After all exercisers report Idle, the summary log is produced containing Q-Vet specific results and statuses. If there are no Q-Vet errors, no system event appendages, and testing ran to the specified completion time, the following message will be displayed:
Q-Vet Tests Complete: Passed
It is recommended that you run System Event Analyzer to review test results. The testing times (for use with System Event Analyzer) are printed to the Q-Vet run window and are available in the summary log.
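If the run output or summary log has been captured to a text file, the pass condition can be checked mechanically. This is a sketch under assumptions: the guide states only that the message is displayed, so its presence on a line of its own in the captured text is assumed, and the function name is hypothetical.

```python
PASS_MESSAGE = "Q-Vet Tests Complete: Passed"


def run_passed(captured_text: str) -> bool:
    """Return True if the pass message appears on a line of its own.

    Assumption: the message quoted in this guide is present verbatim in the
    captured run output or summary log.
    """
    return any(line.strip() == PASS_MESSAGE
               for line in captured_text.splitlines())


# Hypothetical captured text for illustration.
print(run_passed("...\nQ-Vet Tests Complete: Passed\n"))
```

A True result does not replace the review with System Event Analyzer recommended above.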
2.6.4
De-Installing Q-Vet
The procedures for de-installation of Q-Vet differ between operating systems. You must de-install Q-Vet from each partition in the system. Failure to do so may result in the loss of customer data at a later date if Q-Vet is misused. Follow the instructions listed under your operating system to de-install Q-Vet from a partition. The qvet_uninstall programs will remove the Q-Vet supplied tools and restore the original system tuning/configuration settings.
Tru64 UNIX
1. Stop, Terminate, and Exit from Q-Vet testing.
2. Execute the command qvet_uninstall. This will also restore the system configuration/tuning file sysconfigtab.
3. Note: log files are retained in /usr/field/tool_logs
4. Reboot the system. You must reboot in any case, even if Q-Vet is to be reinstalled.
OpenVMS
1. Stop, Terminate, and Exit from any Q-Vet testing.
2. Execute the command @sys$manager:qvet_uninstall. This will restore system tuning (modparams.dat) and the original UAF settings.
3. Note: log files are retained in sys$specific:[sysmgr.tool_logs]
4. Reboot the system. You must reboot in any case, even if Q-Vet is to be reinstalled.
2.7
Information Resources
Many information resources are available, including tools that can be downloaded from the Internet, firmware updates, a supported options list, and more.
2.7.1
HP Service Tools CD
The HP Service Tools CD-ROM enables field engineers to upgrade customer systems with the latest version of software when the customer does not have access to HP Web pages. The Web site is: http://www.mse.qvar.cpqcorp.net/ServiceTools/default.asp
2.7.2
The information contained in this guide, including the FRU procedures and illustrations, is available in HTML Help format as part of the Maintenance Kit. It can also be accessed from the Learning Utility and ProSIC Web sites.
2.7.3
The firmware resides in the flash ROM on the system motherboard. You can obtain the latest system firmware from CD-ROM or over the network.
Quarterly Update Service
The Alpha Systems Firmware Update Kit CD-ROM is available by subscription.
Alpha Firmware Internet Access
You can obtain Alpha firmware updates from the following Web site: http://ftp.digital.com/pub/Digital/Alpha/firmware/readme.html
The README file describes the firmware directory structure and how to download and use the files. If you don't have a Web browser, download the files using anonymous ftp: ftp://ftp.digital.com/pub/Digital/Alpha/firmware/
Individual Alpha system firmware releases that occur between releases of the firmware CD are located in the interim directory: ftp://ftp.digital.com/pub/Digital/Alpha/firmware/interim/
2.7.4
Fail-Safe Booter
The fail-safe booter (FSB) allows you to run another console to repair files that reside in the flash ROMs on the system motherboard. See Chapter 3 for information on running the FSB.
2.7.5
Software Patches
Software patches for the supported operating systems are available from: http://h18002.www1.hp.com/alphaserver/
2.7.6
Learning Utility
The Learning Utility provides information about various technical topics. http://learning1.americas.cpqcorp.net/mcsl-html/home.asp
2.7.7
You can download up-to-date files and late-breaking technical information from the Internet. The information includes firmware updates, the latest configuration utilities, software patches, lists of supported options, and more. http://h18002.www1.hp.com/alphaserver/
2.7.8
Supported Options
This chapter describes the power-up process and RMC, SROM, and SRM power-up diagnostics. The following topics are covered:
   Overview of Power-Up Diagnostics
   System Power-Up Sequence
   Power-Up Displays
   Power-Up Error Messages
   Forcing a Fail-Safe Load
   Updating the RMC
3.1
The power-up process begins with the power-on of the power supply. After the AC and DC power-up sequences are completed, the remote management console (RMC) reads EEROM information and deposits it into the dual-port RAM (DPR). The SROM minimally tests the CPU, initializes and tests backup cache, and minimally tests memory. Finally, the SROM loads the SRM console program into memory and jumps to the first instruction in the console program.
There are three distinct sets of power-up diagnostics:
1. System power controller and remote management console diagnostics
   These diagnostics check the power regulators, temperature, and fans. Failures are reported in the dual-port RAM and on the Operator Control Panel (OCP) LEDs. Certain failures may prevent the system from powering on.
2. Serial ROM (SROM) diagnostics
   SROM tests check the basic functionality of the system and load the console code from the FEPROM on the system motherboard into system memory. Failures during SROM tests are indicated by error beep codes and messages to the power-up console terminal.
3. Console firmware diagnostics
   These tests are executed by the SRM console code. They test the core system, including boot path devices. Failures during these tests are reported to the console terminal through the power-up screen or console event log.
3.2
The RMC is responsible for the power-up sequence of the AlphaServer DS15. The general power-up sequence follows:
1. Verify that the MLB FRU EEPROM is accessible and has a valid checksum. The system will not be allowed to power on unless these conditions are met (this check, and others, can be disabled with the FEATURE_0 jumper).
2. Verify that the RMC did not detect any system problems during its power-on self-test (this check can be overridden with the FEATURE_0 jumper).
3. Verify that the PCI Riser Card (PRC) is installed (this check can be disabled with the FEATURE_3 jumper).
4. Assert DC Enable to the bulk power supply and wait for Power OK (POK) to assert.
5. Check the 2.5V regulator to ensure it is within specified tolerances.
6. Check to see if the disk fan is spinning. If it is and the FEATURE_4 jumper is installed, flag a configuration error.
7. Release Titan Interrupt and General logic chip (TIG) from reset.
8. Assert power-on signal to TIG.
9. Release system from reset.
10. Wait for TIG to assert ready.
11. Wait for TIG to request 1.65V enable.
12. Assert the 1.65V VRM enable signal.
13. Wait for the 1.65V VRM Power OK signal to assert.
Figure (power-up sequence flowchart): each check has yes, no, and disabled branches. Boxes shown: verify system problems with RMC; verify PCI riser card installed; apply DC enable to power supply; 2.5V regulator checked for tolerances.
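The first six steps of the power-up sequence, including the jumper overrides, can be modeled as a short function. This is an illustrative sketch, not RMC firmware: the class and field names are hypothetical, and unlike the real sequence it collects every failed check instead of stopping at the first one.

```python
from dataclasses import dataclass, field


@dataclass
class SystemState:
    """Hypothetical snapshot of the inputs the RMC examines (illustrative names)."""
    eeprom_ok: bool = True           # MLB FRU EEPROM accessible, valid checksum
    rmc_selftest_ok: bool = True     # RMC power-on self-test found no problems
    riser_installed: bool = True     # PCI riser card (PRC) present
    power_ok: bool = True            # POK asserted after DC Enable
    regulator_2v5_ok: bool = True    # 2.5V regulator within tolerance
    disk_fan_spinning: bool = False
    jumpers: set = field(default_factory=set)  # e.g. {"FEATURE_0"}


def power_up_checks(s: SystemState) -> list:
    """Return the problems raised by steps 1-6; an empty list means continue."""
    problems = []
    if "FEATURE_0" not in s.jumpers:          # steps 1 and 2
        if not s.eeprom_ok:
            problems.append("MLB FRU EEPROM inaccessible or bad checksum")
        if not s.rmc_selftest_ok:
            problems.append("RMC self-test detected a system problem")
    if "FEATURE_3" not in s.jumpers and not s.riser_installed:   # step 3
        problems.append("PCI riser card not installed")
    if not s.power_ok:                        # step 4
        problems.append("POK did not assert")
    if not s.regulator_2v5_ok:                # step 5
        problems.append("2.5V regulator out of tolerance")
    if s.disk_fan_spinning and "FEATURE_4" in s.jumpers:         # step 6
        problems.append("configuration error: disk fan spinning with FEATURE_4 installed")
    return problems


print(power_up_checks(SystemState(riser_installed=False)))
```

Passing jumpers={"FEATURE_0"} shows how one jumper suppresses both the EEPROM check and the self-test check.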
3.3
Power-Up Displays
Power-up information is displayed on the OCP LEDs and on the console terminal startup screen. Messages sent from the RMC and SROM programs are displayed first, followed by messages from the SRM console.
3.3.1
Power-Up Display
The following example describes the power-up sequence and shows the power-up messages.
Power-Up Sequence
When the system powers up, the SROM code is loaded into the I-cache (instruction cache) on the CPU. A minimum amount of hardware is verified, including the EV6 and certain Titan-related items.
The CPU attempts to access the PCI bus. If it cannot, either a hang or a failure occurs, and this is the only message displayed.
Clock speed is determined. At this point the SROM checks a jumper to see if it needs to go to the mini-debugger or wait for the RMC to complete populating the DPR.
The CPU interrogates the I2C EEROM on the system board through shared RAM. The CPU determines the system configuration to jump to.
The CPU next checks the SROM checksum to determine the validity of the flash SROM sectors. If flash SROM is invalid, the CPU reports the error and continues the execution of the SROM code. Memory is programmed and tested, and the SROM transfers execution to the console, indicating in the DPR that the flash is BAD. Invalid flash SROM must be reprogrammed. If flash SROM is good, the CPU programs appropriate registers with the values from the flash data and selects itself as the target CPU to be loaded.
When the SROM is reloaded from flash, the system will be programmed with correct values and running at correct speed. The CPU initializes and tests the B-cache and memory, then loads the flash SROM code. At this point code execution begins from STEP 1, just as with the on-chip SROM code. However, a flag indicates that the CPU is running flash SROM and there is no need to reload the flash on the second pass.
The flash SROM performs B-cache tests. For example, the ECC data test verifies the detection logic for single- and double-bit errors.
The CPU initiates all memory tests. The memory is tested for address and data errors for the first 32 MB of memory in each array. It also initializes all the sized memory in the system.
If a memory failure occurs, an error is reported. An untested memory array is assigned to address 0 and the failed memory array is de-assigned. Memory tests are rerun on the first 32 MB of memory in each remaining array. If all memory fails, the No Memory Available message is reported and the system halts.
The CPU validates that its external interrupts are functioning.
If all memory passes, the CPU loads the console and transfers control to it.
NOTE: The power-up text that is displayed on the screen depends on what kind of terminal is connected as the console terminal: VT or VGA. If the SRM console environment variable is set to serial, the entire power-up display, consisting of the SROM and SRM power-up messages, is displayed on the VT terminal screen. If console is set to graphics, no SROM messages are displayed, and the SRM messages are delayed until VGA initialization has completed.
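The effect of the console environment variable described in the note can be summarized as a small lookup. This is only a restatement of the note in code form; the function name and dictionary keys are hypothetical.

```python
def power_up_display(console: str) -> dict:
    """Summarize which power-up messages appear for a given console setting."""
    if console == "serial":
        # VT terminal: SROM and SRM messages are both shown as they occur.
        return {"srom_messages": True,
                "srm_messages": True,
                "srm_delayed_until_vga_init": False}
    if console == "graphics":
        # VGA monitor: no SROM messages; SRM messages wait for VGA init.
        return {"srom_messages": False,
                "srm_messages": True,
                "srm_delayed_until_vga_init": True}
    raise ValueError("console must be 'serial' or 'graphics'")


print(power_up_display("graphics"))
```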
3.3.2
When power-up is complete, the CPU transfers control to the SRM console. The console continues the system initialization. Failures are reported to the console terminal through the power-up screen and a console event log. The following section shows the messages that are displayed once the SROM has transferred control to the SRM console.
The primary CPU prints a message indicating that it is running the console. Starting with this message, the power-up display is sent to any console terminal, regardless of the state of the console environment variable. If console is set to graphics, the display from this point on is saved in a memory buffer and displayed on the VGA monitor after the PCI buses are sized and the VGA device is initialized.
The memory size is determined and memory is tested.
The I/O subsystem is probed and I/O devices are reported. I/O adapters are configured.
Device drivers are started.
Entering the idle loop.
Various diagnostics are performed.
The console terminal displays the SRM console banner and the prompt, >>>. From the SRM prompt, you can boot the operating system.
NOTE: If the console requires the heap to be expanded, it restarts.
3.3.3
The SRM console event log helps you troubleshoot problems that do not prevent the system from coming up to the SRM console. The console event log consists of status messages received during power-up self-tests.
To check for and locate errors, enter the Log command at the RMC> prompt.
3.4
Audible beep codes announce specific errors that might be encountered while the system is powering up. Table 3-1 identifies the error beep codes.
Message/Meaning                            Action to Repair
Done with execution; jumping to console    No action necessary.
No usable memory available                 Check memory and memory configuration.
Configuration error detected               Check system configurations.
ROM checksum error detected                Replace the system board.
Bcache error detected                      Possible CPU problem.
3.4.1
Checksum Error
When the system detects a ROM checksum error, it attempts to load the fail-safe booter (FSB) console so that you can load new console firmware images. A sequence similar to the one in Example 3-5 occurs.
create poll
create timer
create powerup
access NVRAM
1024 MB of System Memory
Testing Memory
...
probe I/O subsystem
starting drivers
entering idle loop
initializing keyboard
initializing GCT/FRU at 1cc000
Initializing dqa dqb eia eib pka pkb
****************************************************************************
*                                                                          *
*                        DS15 Failsafe Boot Console                        *
*Please use the LFU utility to update/recover your SRM console flash image.*
*                                                                          *
****************************************************************************
AlphaServer DS15 Console X6.6-3140, built on Aug 15 2003 at 00:53:42
>>>
NOTE: For more information on LFU, see the Firmware Updates Web site: http://ftp.digital.com/pub/digital/Alpha/firmware/
3.4.2
If the SROM fails, the display will show all the DIMMs that are missing. The system uses the JEDEC data on the DIMM and reports configuration errors if no memory is available for the console. The system reports DIMM failure as ILLEGAL, MISSING, INCMPAT, or FAILED. The following excerpts are examples of error reports. See Chapter 6 for memory configuration rules.
Testing AAR0
Memory data test in progress
Memory address test in progress
Memory pattern test in progress
Memory initialization
Failed DIMM 3
Loading console
Code execution complete (transfer control)
*****************************************

Memory initialization
Missing DIMM 1
Loading console
Code execution complete (transfer control)
******************************************

Testing AAR0
Memory data test in progress
Memory address test in progress
Memory pattern test in progress
Memory initialization
*********************************
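When SROM output has been captured from a serial console, the DIMM reports can be extracted with a regular expression. The sketch below assumes reports take the form of a status word followed by "DIMM" and a number, as in the excerpts above; the function name is hypothetical and the pattern may not cover every message the SROM can print.

```python
import re

# Status words listed in this section, matched case-insensitively because
# the excerpts show mixed case ("Failed DIMM 3", "Missing DIMM 1").
DIMM_REPORT = re.compile(
    r"\b(ILLEGAL|MISSING|INCMPAT|FAILED)\s+DIMM\s+(\d+)", re.IGNORECASE)


def dimm_problems(srom_output: str) -> list:
    """Return (dimm_number, STATUS) pairs found in captured SROM output."""
    return [(int(number), status.upper())
            for status, number in DIMM_REPORT.findall(srom_output)]


capture = """Memory initialization
Failed DIMM 3
Loading console
"""
print(dimm_problems(capture))
```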
3.5
The fail-safe booter is another variant of the SRM console. The FSB provides an emergency recovery mechanism if the firmware image contained in flash memory becomes corrupted. You can run the FSB and boot another image from a CD-ROM or network that is capable of reprogramming the flash ROM. Use the FSB when one of the following failures at power-up prohibits you from getting to the console program:
   Firmware image in flash memory corrupted
   Power failure or accidental power-down during a firmware upgrade
   Error in the nonvolatile RAM (NVRAM) file
   Incorrect environment variable setting
   Driver error
3.5.1
If the firmware image is unavailable when the system is powered on or reset, the FSB runs automatically.
1. Reset the system to restart the FSB. The FSB loads from the flash.
2. Update the firmware as described in Chapter 7.
3.5.2
1. Power the system off, unplug the AC power cord, and remove the top cover. (See Chapter 8 for instructions.)
2. Insert jumper J8 over pins 1-2 on the system motherboard. See Figure 3-2.
3. Reconnect the AC power cord and reinstall the system cover. Power up the system to the FSB console.
3.6
Under certain circumstances, the RMC will not function. If the problem is caused by corrupted RMC flash ROM, you need to update RMC firmware. The RMC will not function if:
   No AC power is provided.
   DPR does not pass its self-test (DPR is corrupted).
   RMC flash ROM is corrupted, but the system will still power up.
You can update the remote management console firmware from flash ROM using the LFU. For details, see Chapter 7, RMC Firmware Update and Recovery.
NOTE: For more information on LFU, see the Firmware Updates Web site: http://ftp.digital.com/pub/digital/Alpha/firmware/
3.7
The DS15 does not ship with a floppy diskette device. However, the console software and hardware maintain floppy support. Carrying a floppy device and associated cabling could be quite handy in the field if there are no other means to update the console firmware. Additionally, if a motherboard needs to be replaced, one can preserve the customer NVRAM settings by invoking the save_nvram and restore_nvram console commands. The floppy device plugs into the motherboard at connector J4, as shown in the following figure.
This chapter describes troubleshooting with the SRM console. The SRM console firmware contains ROM-based diagnostics that allow you to run system-specific or device-specific exercisers. The exercisers run concurrently to provide maximum bus interaction between the console drivers and the target devices. Run the diagnostics by using commands from the SRM console. To run the diagnostics in the background, use the background operator & at the end of the command. Errors are reported to the console terminal, the console event log, or both. If you are not familiar with the SRM console, see the AlphaServer DS15 and AlphaStation DS15 Owner's Guide.
4.1
Diagnostic commands are used to test the system and help diagnose failures. Table 4-1 gives a summary of the SRM diagnostic commands and related commands. See Chapter 6 for a list of SRM environment variables, and see Chapter 7 for a list of RMC commands most commonly used for the DS15 system.
Command             Function
buildfru            Initializes I2C bus EEPROM data structures for the named FRU.
cat el              Displays the console event log. Same as more el, but scrolls rapidly. The most recent errors are at the end of the event log and are visible on the terminal screen.
clear_error         Clears errors logged in the FRU EEPROMs as reported by the show error command.
crash               Forces a crash dump at the operating system level.
deposit             Writes data to the specified address of a memory location, register, or device.
examine             Displays the contents of a memory location, register, or device.
exer                Exercises one or more devices by performing specified read, write, and compare operations.
grep                Searches for regular expressions (specific strings of characters) and prints any lines containing occurrences of the strings.
hd                  Dumps the contents of a file (byte stream) in hexadecimal and ASCII.
info                Displays registers and data structures.
kill                Terminates a specified process.
kill_diags          Terminates all executing diagnostics.
more el             Same as cat el, but displays the console event log one screen at a time.
memexer             Runs a requested number of memory tests in the background.
memtest             Tests a specified section of memory.
net -ic             Initializes the MOP counters for the specified Ethernet port.
net -s              Displays the MOP counters for the specified Ethernet port.
nettest             Runs loopback tests for PCI-based Ethernet ports. Also used to test a port on a live network.
scsi_poll           Controls whether or not a particular SCSI device driver polls for devices on the bus when the driver is started. This EV is supported by some, but not all, console SCSI device drivers.
scsi_reset          Controls whether or not a particular SCSI device driver resets the SCSI bus when the driver is started. This EV is supported by some, but not all, console SCSI device drivers.
set sys_serial_num  Sets the system serial number, which is then propagated to all FRUs that have EEPROMs.
sys_com1_rmc        Enables/disables internal COM1 access to the RMC.
show error          Reports errors logged in the FRU EEPROMs.
show fru            Displays information about field replaceable units (FRUs), including CPUs, memory DIMMs, and PCI cards.
show_status         Displays the progress of diagnostic tests. Reports one line of information for each executing diagnostic.
sys_exer            Exercises the devices displayed with the show config command.
sys_exer -lb        Runs console loopback tests for the COM2 serial port during the sys_exer test sequence.
test                Verifies the configuration of the devices in the system.
test -lb            Runs loopback tests for the COM2 serial port in addition to verifying the configuration of devices.
4.2
Buildfru
The buildfru command initializes I2C bus EEPROM descriptive data structures for the named FRU and initializes its SDD and TDD error logs. This command uses data supplied on the command line to build the FRU descriptor. Buildfru is used by Manufacturing, FRU repair operations, or Field Service. The buildfru command is used for several purposes:
   By Manufacturing to build a FRU table containing a description of each FRU in the system.
   By FRU repair operations for initializing good stocking spares.
   By Field Service to make any FRU descriptor adjustments required by the customer.
3. Build the motherboard EEPROM at offset 80 with sequential data: the value at offset 80 is 47, the value at offset 81 is 46, and so on.
The information supplied on the buildfru command line includes the console name for the FRU, part number, serial number, model number, and optional information. The buildfru command facilitates writing the FRU information to the EEPROM on the device. Use the show fru command to display the FRU table created with buildfru. Use the show error command to display FRUs that have errors logged to them. Typically, you only need to use buildfru in Field Service if you replace a device for which the information displayed with the show fru command is inaccurate or missing. After replacing the device, use buildfru to build the new FRU descriptor.
NOTE: Be sure to enter the FRU information carefully. If you enter incorrect information, the callout used by System Event Analyzer will not be accurate.
Three areas of the EEPROM can be initialized: the FRU generic data, the FRU specific data, and the system specific data. Each area has its own checksum, which is recalculated any time that segment of the EEPROM is written. When the buildfru command is executed, the FRU EEPROM is first flooded with zeros and then the generic data, the system specific data, and EEPROM format version information are written and checksums are updated. For certain FRUs, such as CPU modules, additional FRU specific data can be entered using the -s option. This data is written to the appropriate region, and its corresponding checksum is updated.
FRU Assembly Hierarchy
Alpha-based systems can be decomposed into a collection of FRUs. Some FRUs carry various levels of nested FRUs. For instance, the system motherboard is a FRU that carries a number of child FRUs such as DIMMs. The naming convention for FRUs represents the assembly hierarchy. The following is the general form of a FRU name:
   <frun>[.<frun>[.<frun>]]
Here fru is a placeholder for the appropriate FRU type at that level and n is the number of that FRU instance on that branch of the system hierarchy.
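The naming convention can be illustrated with a small parser that splits a FRU name into its levels. This is a sketch under assumptions: it treats each level as letters followed by an optional instance number, which fits names such as HMB, PWR0, and HMB.DIMM2 but is not a rule stated in this guide, and the function name is hypothetical.

```python
import re
from typing import List, Optional, Tuple

# One level of the hierarchy: a FRU type and an optional instance number.
# Assumption: types are alphabetic and the number, if any, trails the type.
_LEVEL = re.compile(r"^([A-Za-z]+)(\d*)$")


def parse_fru_name(name: str) -> List[Tuple[str, Optional[int]]]:
    """Split a FRU name into (type, instance) pairs, top level first."""
    levels = name.split(".")
    if not 1 <= len(levels) <= 3:
        raise ValueError("a FRU name has one to three levels")
    parsed = []
    for level in levels:
        match = _LEVEL.match(level)
        if match is None:
            raise ValueError("unrecognized FRU level: %r" % level)
        fru_type, number = match.groups()
        parsed.append((fru_type.upper(), int(number) if number else None))
    return parsed


print(parse_fru_name("HMB.DIMM2"))
```

Reading the result from left to right gives the parent chain: the last pair is the FRU itself, and the pairs before it are the higher-level FRUs that carry it.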
The DS15 FRU assembly hierarchy has three levels. The FRU types from the top to the bottom of the hierarchy are as follows:
Level          FRU Type          Description
First Level    HMB               System motherboard
               PWR0              Power supply
               FAN               Fans for system, PCI, disks, CPU
Second Level   HMB.DIMM(0-3)     Memory DIMMs
               HMB.PCIRSR        PCI riser card
Third Level    HMB.PCIRSR.PCI    PCI slots (1-4)
To build a FRU descriptor for a lower-level FRU, point back to the higher-level FRUs to which the lower-level FRU is associated. See preceding Section 4.2. If you enter the buildfru data correctly for a device that has an EEPROM to program, nothing is displayed after you enter the command. If you enter incorrect data or the device does not have an EEPROM to program, an error message similar to the following is displayed:
   >>>buildfru "sys fan" 12-10010-01 ay12345678
   Device SYS FAN does not support setting FRU values
   >>>
Syntax
   buildfru ( <fru_name> <part_num> <serial_num> [<misc> [<other>]] or -s <fru_name> <offset> <byte> [<byte>...] )
Arguments
<fru_name>     Console name for this FRU. This name reflects the position of the FRU in the assembly hierarchy.
<part_num>     The FRU 2-5-2.4 part number. This ASCII string should be 16 characters (extra characters are truncated). This field should not contain any embedded spaces. If a space must be inserted, enclose the entire argument string in double quotes. This field contains the FRU revision. In some cases, an embedded space is allowed between the part number and the revision.
<serial_num>   The FRU serial number. This ASCII string must be 10 characters (extra characters are truncated). The manufacturing location and date are extracted from this field.
<misc>         The FRU model name, number, or the common name for the FRU. This ASCII string may be up to 10 characters (extra characters are truncated). This field is optional, unless <alias> is specified.
<other>        The FRU HP alias number, if one exists. This ASCII string may be up to 16 characters (extras are truncated). This field is optional.
<offset>       The beginning byte offset (0-255 hex) within this FRU EEPROM, where the following supplied data bytes are to be written.
<byte>         The data bytes to be written. At least one data byte must be supplied after the offset.
Options
-s             Writes raw data to the EEPROM. This option is typically used to apply any FRU specific data.
               Generates a unique serial number for each DIMM in the system.
4.3 cat el and more el

The cat el and more el commands display the contents of the console event log. In Example 4-2, the console reports that the CPU did not power up and fans 1 and 2 failed.

Example 4-2 more el

>>> more el
*** Error - CPU failed powerup diagnostics ***
Secondary start error
EV6 BIST = 1
STR status = 1
CSC status = 1
PChip0 status = 1
DIMx status = 0
TIG Bus status = 1
DPR status = 0
CPU speed status = ff
CPU speed = 1000
Powerup time = 08-06-51 14:30:19
CPU SROM sync = 0
DPR has failed. 1=good 0=bad

Status and error messages are logged to the console event log at power-up, during normal system operation, and while running system tests. Standard error messages are indicated by asterisks (***).

When cat el is used, the contents of the console event log scroll by. Use the Ctrl/S key combination to stop the screen from scrolling, and use Ctrl/Q to resume scrolling. The more el command allows you to view the console event log one screen at a time.

Syntax

cat el
or
more el
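When an event log has been captured to a text file, the "1=good 0=bad" convention shown in Example 4-2 can be checked mechanically. The following Python sketch is illustrative only: the helper name failed_items and the one-field-per-line layout are assumptions, not part of the console firmware.

```python
def failed_items(log_text):
    """Return the names of '<name> status = <value>' fields whose value is 0 (bad)."""
    failed = []
    for line in log_text.splitlines():
        name, sep, value = line.partition(" status = ")
        # Only lines of the form '<name> status = <value>' are considered.
        if sep and value.strip() == "0":
            failed.append(name.strip())
    return failed


# Sample text copied from the status fields in Example 4-2 (layout assumed).
SAMPLE = """\
EV6 BIST = 1
STR status = 1
CSC status = 1
PChip0 status = 1
DIMx status = 0
TIG Bus status = 1
DPR status = 0
CPU speed status = ff
"""

print(failed_items(SAMPLE))
```

Fields whose value is neither 0 nor 1, such as CPU speed status = ff, are deliberately left out of the result because the log does not say how to interpret them.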
4.4 clear_error
The clear_error command clears errors logged in the FRU EEPROMs as reported by the show error command.
Example 4-3 clear_error

>>>clear_error HMB
>>>
>>>clear_error all
>>>

clear_error HMB    Clears all errors logged in the FRU EEPROM on the system motherboard (HMB).
clear_error all    Clears all errors logged to all FRU EEPROMs in the system.

The clear_error command clears TDD, SDD, and checksum errors. Hardware failures and unreadable EEPROM errors are not cleared. See Table 4-2.

Syntax

clear_error <fruname>
    Clears all errors logged to a specific FRU. Fruname is the name of the specified FRU. If you do not specify a FRU, you must use clear_error all to clear errors.
clear_error all
    Clears all errors logged to all system FRUs.
See the show error command for information on the types of errors that might be logged to the FRU EEPROMs.
4.5 crash
The SRM crash command forces a crash dump to the selected device for Tru64 UNIX and OpenVMS systems.

>>>crash
CPU 0 restarting
operator requested crash dump on cpu 0
DUMP: blocks available: 66044768
DUMP: blocks wanted: 168562 (partial compressed dump) [OKAY]
DUMP: Device Disk Blocks Available
DUMP: -------------------------
DUMP: 0x1300003 449308 - 786429 (of 786430) [primary swap]
DUMP.prom: Open: dev 0x5100001, block 786432: SCSI 0 8 0 0 0 0 0
DUMP: Writing header... [1024 bytes at dev 0x1300003, block 786430]
DUMP: Writing data....... [7MB]
DUMP: Writing header... [1024 bytes at dev 0x1300003, block 786430]
DUMP: crash dump complete.
halted CPU 0
halt code = 5
HALT instruction executed
PC = fffffc00008f0aac
>>>
Use the crash command when the system has hung and you are able to halt it with the halt/reset button (if configured for halt) or the RMC halt command. The crash command restarts the operating system and forces a crash dump to the selected device. See the OpenVMS Alpha System Dump Analyzer Utility Manual for information on how to interpret OpenVMS crash dump files. See the Guide to Kernel Debugging for information on using the Tru64 UNIX Krash Utility.
4.6 deposit and examine

The deposit command writes data to the specified address of a memory location, register, or device. The examine command displays the contents of a memory location, register, or a device.

>>>e dpr:34f0 -l -n 5
dpr: 34F0 00000000
dpr: 34F4 00000000
dpr: 34F8 00000000
dpr: 34FC 00000000
dpr: 3500 00000000
dpr: 3504 00000000
>>>
Deposit

The deposit command stores data in the location specified. If no options are given, the system uses the options from the preceding deposit command. If the specified value is too large to fit in the data size listed, the console ignores the command and issues an error. If the data is smaller than the data size, the higher order bits are filled with zeros. In Example 4-4:

• Clear first 512 bytes of physical memory.
• Deposit 5 into four longwords starting at virtual memory address 1234.
• Load GPRs R0 through R8 with -1.
• Deposit 8 in the first longword of the first 17 pages in physical memory.
• Deposit 0 to physical memory address 0.
• Deposit FF to physical memory address 4.
• Deposit 820000 to SCBB.

Examine

The examine command displays the contents of a memory location, a register, or a device. If no options are given, the system uses the options from the preceding examine command. If conflicting address space or data sizes are specified, the console ignores the command and issues an error. For data lengths longer than a longword, each longword of data should be separated by a space. In Example 4-4:

• Examine the DPR starting at location 34f0 and continuing through the next 5 locations, and display the data in longwords.
Syntax

deposit [-{b,w,l,q,o,h}] [-{n value, s value}] [space:] address data
examine [-{b,w,l,q,o,h}] [-{n value, s value}] [space:] address

-b          Defines data size as byte.
-w          Defines data size as word.
-l          Defines data size as longword.
-q          Defines data size as quadword.
-o          Defines data size as octaword.
-h          Defines data size as hexword.
-d          Instruction decode (examine command only)
-n value    The number of consecutive locations to modify.
-s value    The address increment size. The default is the data size.
space:      Device name (address space) of the device to access.

Device names are:

dpr         Dual-port RAM. See Appendix C for the DPR address layout.
eerom       Nonvolatile ROM used for EV storage.
fpr         Floating-point register set; name is F0 to F31. Alternatively, can be referenced by name.
gpr         General register set; name is R0 to R31. Alternatively, can be referenced by name.
ipr         Internal processor registers. Alternatively, some IPRs can be referenced by name.
pcicfg      PCI configuration space.
pciio       PCI I/O space.
pcimem      PCI memory space.
pt          The PALtemp register set; name is PT0 to PT23.
pmem        Physical memory (default).
vmem        Virtual memory.
offset data
Symbolic forms can be used for the address. They are:

pc    The program counter. The address space is set to GPR.
+     The location immediately following the last location referenced in a deposit or examine command. For physical and virtual memory, the referenced location is the last location plus the size of the reference (1 for byte, 2 for word, 4 for longword). For other address spaces, the address is the last referenced address plus 1.
-     The location immediately preceding the last location referenced in a deposit or examine command. Memory and other address spaces are handled as above.
*     The last location referenced in a deposit or examine command.
@     The location addressed by the last location referenced in a deposit or examine command.
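The rule for the symbolic forms that step forward or backward from the last referenced location can be written out as a short calculation. The Python sketch below is an illustration only, not console code: the function names are hypothetical, and the sizes for quadword, octaword, and hexword (8, 16, and 32 bytes) are the conventional Alpha values rather than figures quoted in this section.

```python
# Reference size in bytes for each data-size option.
# b, w, and l are stated in the text; q, o, and h are the conventional Alpha sizes.
DATA_SIZES = {"b": 1, "w": 2, "l": 4, "q": 8, "o": 16, "h": 32}

# Address spaces that step by the size of the reference.
MEMORY_SPACES = {"pmem", "vmem"}


def step(space, size_flag):
    """Distance between consecutive locations for the given address space."""
    return DATA_SIZES[size_flag] if space in MEMORY_SPACES else 1


def next_location(space, last_address, size_flag="l"):
    """Location immediately following the last referenced location."""
    return last_address + step(space, size_flag)


def previous_location(space, last_address, size_flag="l"):
    """Location immediately preceding the last referenced location."""
    return last_address - step(space, size_flag)


print(hex(next_location("pmem", 0x1234, "l")))   # physical memory, longword
print(next_location("gpr", 5))                   # register space steps by 1
```

This models only the symbolic address forms. Stepping with the -n and -s options is governed by the address increment size, which defaults to the data size.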
4.7 exer

The exer command exercises one or more devices by performing specified read, write, and compare operations. Typically exer is run from the built-in console script. Advanced users may want to use the specific options described here. Note that running exer on disks can be destructive. Optionally, exer reports performance statistics.

• A read operation reads from a device that you specify into a buffer.
• A write operation writes from a buffer to a device that you specify.
• A compare operation compares the contents of the two buffers.
The exer command uses two buffers, buffer1 and buffer2, to carry out the operations. A read or write operation can be performed using either buffer. A compare operation uses both buffers.
Example 4-5 exer
>>> exer dk*.* -p 0 -secs 36000
Read SCSI disks for the entire length of each disk. Repeat this until 36000 seconds, 10 hours, have elapsed. All disks will be read concurrently. Each block read will occur at a random block number on each disk.
>>> exer -l 2 dka0
Write hex 5a's to every byte of blocks 1, 2, and 3. The packet size is bc * bs, 4 * 512, 2048 for all writes.
0/0 0/0
0 0
0 0
dka0.0.0.8.0 exer completed
dka100.1.0.8.0 exer completed

packet                                            IOs             elapsed  idle
  size     IOs  bytes read  bytes written        /sec  bytes/sec  seconds  secs
  8192   12753   104472576              0         635    5204632       20    15
>>>
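The figures in the throughput line are related by simple arithmetic: the byte count is the packet size multiplied by the number of I/Os, and the rates are the totals divided by the elapsed time. The Python sketch below checks this against the values shown above. The function name is hypothetical, and the small difference between the computed and displayed rates is assumed to come from the elapsed time being shown as a whole number of seconds.

```python
def throughput(packet_size, io_count, elapsed_seconds):
    """Return (total bytes, I/Os per second, bytes per second)."""
    total_bytes = packet_size * io_count
    return (total_bytes,
            io_count / elapsed_seconds,
            total_bytes / elapsed_seconds)


total_bytes, ios_per_sec, bytes_per_sec = throughput(8192, 12753, 20)
print(total_bytes)            # matches the "bytes read" column: 104472576
print(round(ios_per_sec))     # close to the 635 shown by the console
print(round(bytes_per_sec))   # close to the 5204632 shown by the console
```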
A destructive write test over block numbers 0 through 100 on disk dka0. The packet size is 2048 bytes. The action string specifies the following sequence of operations:

1. Set the current block address to a random block number on the disk between 0 and 97. A four block packet starting at block numbers 98, 99, or 100 would access blocks beyond the end of the length to be processed, so 97 is the largest possible starting block address of a packet.
2. Write a packet of hex 5a's from buffer1 to the current block address.
3. Set the current block address to what it was just prior to the previous write operation.
4. From the current block address read a packet into buffer2.
5. Compare buffer1 with buffer2 and report any discrepancies.
6. Repeat steps 1 through 5 until enough packets have been written to satisfy the length requirement of 101 blocks.

>>> exer -a '?r-w-Rc' dka0

A nondestructive write test with packet sizes of 512 bytes. Use this test only if the customer has a current backup of any disks being tested. The action string specifies the following sequence of operations:

1. Set the current block address to a random block number on the disk.
2. From the current block address on the disk, read a packet into buffer1.
3. Set the current block address to the device address where it was just before the previous read operation occurred.
4. Write the contents of buffer1 back to the current block address.
5. Set the current block address to what it was just prior to the previous write operation.
6. From the current block address on the disk, read a packet into buffer2.
7. Compare buffer1 with buffer2 and report any discrepancies.
8. Repeat the above steps until each block on the disk has been written once and read twice.
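The step sequences above can be traced with a small model. The following Python sketch interprets a subset of the action string characters (?, -, r, R, w, W, c) against an in-memory byte array that stands in for a disk. It is an illustration under stated assumptions, not the exer implementation: the names run_pass and last_start_block are hypothetical, only one pass is made, and the random numbers come from Python's random module rather than the console's random program.

```python
import random


def last_start_block(end_block, blocks_per_io):
    """Largest start block for which a whole packet still ends at or before end_block."""
    return end_block - blocks_per_io + 1


def run_pass(actions, disk, block_size=512, blocks_per_io=1, seed=0):
    """Apply one pass of an action string to 'disk' (a bytearray); return mismatch count."""
    rng = random.Random(seed)
    packet = block_size * blocks_per_io
    end_block = len(disk) // block_size - 1
    buf1 = bytearray(b"\x5a" * packet)   # default pattern: hex 5A in every byte
    buf2 = bytearray(b"\x5a" * packet)
    pos = 0          # current block address
    prev = 0         # block address just before the last read or write
    mismatches = 0
    for ch in actions:
        if ch == "?":                    # seek to a random block in range
            pos = rng.randint(0, last_start_block(end_block, blocks_per_io))
        elif ch == "-":                  # seek back to before the last read or write
            pos = prev
        elif ch in "rRwW":
            start = pos * block_size
            if start + packet > len(disk):
                raise ValueError("packet would run past the end of the disk")
            prev = pos
            if ch == "r":
                buf1[:] = disk[start:start + packet]
            elif ch == "R":
                buf2[:] = disk[start:start + packet]
            elif ch == "w":
                disk[start:start + packet] = buf1
            else:
                disk[start:start + packet] = buf2
            pos += blocks_per_io
        elif ch == "c":                  # compare buffer1 with buffer2
            if buf1 != buf2:
                mismatches += 1
        else:
            raise ValueError(f"unsupported action character: {ch!r}")
    return mismatches


# Blocks 0 through 100 with four-block packets: the last usable start block is 97.
print(last_start_block(100, 4))
```

With the string '?r-w-Rc' the modeled disk is left unchanged after the pass, which is why the test is described as nondestructive. A string that writes before reading, such as '?w-Rc', overwrites one packet with the buffer1 pattern.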
You can tailor the behavior of exer by using options to specify the following:

• An address range to test within the test device(s)
• The packet size, also known as the I/O size, which is the number of bytes read or written in one I/O operation
• The number of passes to run
• How many seconds to run
• A sequence of individual operations performed on the test devices. The qualifier is called the action string qualifier.
Syntax

exer ( [-sb <start_block>] [-eb <end_block>] [-p <pass_count>] [-l <blocks>] [-bs <block_size>] [-bc <block_per_io>] [-d1 <buf1_string>] [-d2 <buf2_string>] [-a <action_string>] [-sec <seconds>] [-m] [-v] [-delay <milliseconds>] <device_name>... )
Arguments

device_name
    Specifies the names of the devices or filestreams to be exercised.

Options

-sb <start_block>
    Specifies the starting block number (hex) within a filestream. The default is 0.
-eb <end_block>
    Specifies the ending block number (hex) within a filestream. The default is 0.
-p <pass_count>
    Specifies the number of passes to run the exerciser. If 0, then run forever or until Ctrl/C. The default is 1.
-l <blocks>
    Specifies the number of blocks (hex) to exercise. -l has precedence over -eb. If only reading, then specifying neither -l nor -eb defaults to read till end of file (eof). If writing, and neither -l nor -eb are specified, then exer will write for the size of device. The default is 1.
-bs <block_size>
    Specifies the block size (hex) in bytes. The default is 200 (hex).
-bc <block_per_io>
    Specifies the number of blocks (hex) per I/O. On devices without length (tape), use the specified packet size or default to 2048. The maximum block size allowed with variable length block reads is 2048 bytes. The default is 1.
-d1 <buf1_string>
    String argument for eval to generate buffer1 data pattern from. Buffer1 is initialized only once before any I/O occurs. Default = all bytes set to hex 5A's.
-d2 <buf2_string>
    String argument for eval to generate buffer2 data pattern from. Buffer2 is initialized only once before any I/O occurs. Default = all bytes set to hex 5A's.
-a <action_string>
    Specifies an exerciser action string, which determines the sequence of reads, writes, and compares to various buffers. The default action string is ?r. The action string characters are:

    r    Read into buffer1.
    w    Write from buffer1.
    R    Read into buffer2.
    W    Write from buffer2.
    n    Write without lock from buffer1.
    N    Write without lock from buffer2.
    c    Compare buffer1 with buffer2.
    -    Seek to file offset prior to last read or write.
    ?    Seek to a random block offset within the specified range of blocks. exer calls the program, random, to deal each of a set of numbers once. exer chooses a set that is a power of two and is greater than or equal to the block range. Each call to random results in a number that is then mapped to the set of numbers that are in the block range, and exer seeks to that location in the filestream. Since exer starts with the same random number seed, the set of random numbers generated will always be over the same set of block range numbers.
    s    Sleep for a number of milliseconds specified by the delay qualifier. If no delay qualifier is present, sleep for 1 millisecond. Times as reported in verbose mode will not necessarily be accurate when this action character is used.
    z    Zero buffer 1
    Z    Zero buffer 2
    b    Add constant to buffer 1
    B    Add constant to buffer 2

-sec <seconds>
    Terminates the exercise after the specified number of seconds has elapsed. By default the exerciser continues until the specified number of blocks or passcount are processed.
-m
    Specifies metrics mode. At the end of the exerciser a total throughput line is displayed.
-v
    Specifies verbose mode. Data read is also written to stdout. This is not applicable on writes or compares. The default is verbose mode off.
-delay <millisecs>
    Specifies the number of milliseconds to delay when s appears as a character in the action string.
4.8 grep
The grep command is very similar to the UNIX grep command. It allows you to search for regular expressions (specific strings of characters) and prints any lines containing occurrences of the strings. Using grep is similar to using wildcards.
Example 4-6 grep

>>>sh fru | grep PCI
HMB.PCIRSR        00  54-30560-01.A1  SW31200018
HMB.PCIRSR.PCI    00                              3D Labs OX
PCI FAN           00  12-49806-04                 FAN J1
>>>
In Example 4-6 the output of the show fru command is piped into grep (the vertical bar is the piping symbol), which passes only the lines that contain PCI. Grep supports the following metacharacters:
^     Matches beginning of line
$     Matches end of line
.     Matches any single character
[ ]   Set of characters; [ABC] matches either 'A' or 'B' or 'C'; a dash (other than first or last of the set) denotes a range of characters: [A-Z] matches any uppercase letter; if the first character of the set is '^' then the sense of match is reversed: [^0-9] matches any non-digit; several characters need to be quoted with backslash (\) if they occur in a set: '\', ']', '-', and '^'
*     Repeated matching; when placed after a pattern, indicates that the pattern should match any number of times. For example, '[a-z][0-9]*' matches a lowercase letter followed by zero or more digits.
+     Repeated matching; when placed after a pattern, indicates that the pattern should match one or more times. '[0-9]+' matches any non-empty sequence of digits.
?     Optional matching; indicates that the pattern can match zero or one times. '[a-z][0-9]?' matches a lowercase letter alone or followed by a single digit.
\     Quote character; prevents the character that follows from having special meaning.
Print only the number of lines matched. Ignore case. By default grep is case sensitive. Print the line numbers of the matching lines. Print all lines that do not contain the expression. Take regular expressions from a file, instead of command.
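The metacharacter examples given above can be tried outside the console. The Python sketch below uses Python's re module as a stand-in, on the assumption that these particular metacharacters behave the same way there. It is not the console grep, and the sample FRU names are taken from earlier in this chapter.

```python
import re


def matches(pattern, text):
    """True if the whole string matches the pattern."""
    return re.fullmatch(pattern, text) is not None


# '[a-z][0-9]*' : a lowercase letter followed by zero or more digits
print(matches(r"[a-z][0-9]*", "a"), matches(r"[a-z][0-9]*", "a123"))

# '[0-9]+' : any non-empty sequence of digits
print(matches(r"[0-9]+", "2003"), matches(r"[0-9]+", ""))

# '[a-z][0-9]?' : a lowercase letter alone or followed by a single digit
print(matches(r"[a-z][0-9]?", "a1"), matches(r"[a-z][0-9]?", "a12"))

# '[^0-9]' : any non-digit
print(matches(r"[^0-9]", "x"), matches(r"[^0-9]", "7"))

# Filtering lines, in the spirit of piping a listing through grep PCI.
fru_names = ["HMB", "PWR0", "FAN", "HMB.PCIRSR", "HMB.PCIRSR.PCI"]
pci_lines = [name for name in fru_names if re.search(r"PCI", name)]
print(pci_lines)
```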
4.9 hd
The hd command dumps the contents of a file (byte stream) in hexadecimal and ASCII.
Example 4-7 hd

>>> hd -eb 0 dpr:2b00
block 0
00000000  01 80 01 01 01 00 01 01 DD 01 FF E8 03 00 00 00  ................
00000010  17 53 43 31 07 51 00 7D 00 00 00 00 00 00 80 02  .SC1.Q.}........
00000020  00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
00000030  00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
00000040  00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
00000050  00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
00000060  00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
00000070  00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
00000080  40 10 00 00 41 10 00 00 00 00 00 00 00 00 00 00  @...A...........
00000090  00 01 00 00 02 03 00 00 02 08 00 00 00 00 00 00  ................
000000a0  00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
000000b0  00 C3 C1 F0 BE 23 01 00 B8 00 00 00 00 00 00 00  .....#..........
000000c0  00 00 00 02 02 01 03 00 00 00 00 00 00 00 00 00  ................
000000d0  00 00 00 00 00 00 00 00 00 00 00 DB 00 00 00 00  ................
000000e0  00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
000000f0  00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
00000100  80 08 04 0D 0B 01 48 00 01 75 54 02 82 04 04 01  ......H..uT.....
00000110  8F 04 06 01 01 16 0E F0 90 00 00 14 0F 14 2D 80  ..............-.
00000120  15 08 15 08 00 00 00 00 00 00 00 00 00 00 00 00  ................
00000130  00 00 00 00 00 00 00 00 00 00 00 00 00 00 02 A3  ................
00000140  CE 00 00 00 00 00 00 00 01 4D 33 20 37 38 53 36  .........M3 78S6
00000150  34 35 30 44 54 43 2D 43 37 35 20 43 44 34 02 25  450DTC-C75 CD4.%
00000160  39 CE 15 FF 59 41 46 33 37 30 41 FF FF FF FF FF  9...YAF370A.....
00000170  FF FF FF FF FF FF FF FF FF FF FF FF FF FF FF FF  ................
00000180  53 57 33 32 34 30 30 30 31 31 00 30 32 31 00 20  SW32400011.021. 
00000190  20 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00   ...............
000001a0  00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
000001b0  00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
000001c0  00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
000001d0  00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
000001e0  00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
000001f0  00 00 00 00 00 00 00 00 00 00 00 00 00 4A 21 73  .............J!s
>>>
Example 4-7 shows a hex dump of DPR location 2b00, ending at block 0.

Syntax

hd [-{byte|word|long|quad}] [-{sb|eb} <n>] <file>[:<offset>]

Arguments

<file>[:<offset>]    Specifies the file (byte stream) to be displayed.

Options

-byte      Print out data in byte sizes
-word      Print out data by word
-long      Print out data by longword
-quad      Print out data by quadword
-sb <n>    Start block
-eb <n>    End block
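The byte-size layout produced by hd (an offset, sixteen hexadecimal bytes, and an ASCII column in which nonprinting bytes appear as periods) can be reproduced offline for captured data. The Python sketch below is a minimal illustration of that layout. The function name hex_dump is hypothetical, and the exact column spacing is an assumption rather than a statement about the console's output.

```python
def hex_dump(data, width=16):
    """Format bytes as lines of offset, hexadecimal bytes, and printable ASCII."""
    lines = []
    for offset in range(0, len(data), width):
        chunk = data[offset:offset + width]
        hex_part = " ".join(f"{b:02X}" for b in chunk)
        # Bytes outside the printable ASCII range are shown as '.'.
        ascii_part = "".join(chr(b) if 0x20 <= b < 0x7F else "." for b in chunk)
        lines.append(f"{offset:08x}  {hex_part:<{width * 3 - 1}}  {ascii_part}")
    return lines


# One sixteen-byte row containing the text ".SC1.Q.}" followed by nonprinting bytes.
sample = bytes.fromhex("175343310751007D0000000000008002")
for line in hex_dump(sample):
    print(line)
```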
4.10 info
The info command displays registers and data structures. You can enter the command by itself or followed by a number (0-8). If you do not specify a number, a list of selections is displayed and you are prompted to enter a selection. The following commands are available:

info 0    Displays the SRM memory descriptors as described in the Alpha System Reference Manual.
info 1    Displays the page table entries (PTE) used by the console and operating system to map virtual to physical memory. Valid data is displayed only after a boot operation.
info 2    Dumps the Galaxy Configuration Tree (GCT) FRU table. Galaxy is a software architecture that allows multiple instances of OpenVMS to execute cooperatively on a single computer.
info 3    Dumps the contents of the system control status registers (CSRs) for the C-chip, D-chip, and P-chips.
info 4    Displays the per CPU impure area in abbreviated form. The console uses this scratch area to save processor context.
info 5    Displays the per CPU impure area in full form.
info 6    Displays the per CPU machine check logout area.
info 7    Displays the contents of the Console Data Log.
info 8    Clears all event frames in the Console Data Log.

For information about the data displayed by the info commands, see the following documents:

• For info 0, info 1, and info 4, see the Alpha System Reference Manual.
• For info 2, see the Galaxy Console and Alpha Systems V5.0 FRU Configuration Tree Specification.
• For info 3, see the Titan 21274 Chipset Functional Specification.
• For info 6 and info 7, see the AlphaServer DS15 Platform Fault Management Specification.
Example 4-8 info 0

>>>info
0. HWRPB MEMDSC
1. Console PTE
2. GCT/FRU 5
3. Dump System CSRs
4. IMPURE area (abbreviated)
5. IMPURE area (full)
6. LOGOUT area
7. Dump Error Log
8. Clear Error Log
Enter selection: 0

HWRPB: 2000 MEMDSC:25c0 Cluster count: 3

Cluster: 0, Usage: Console
START_PFN: 00000000 PFN_COUNT: 0000015b PFN_TESTED: 00000000
347 pages from 0000000000000000 to 00000000002b5fff

Cluster: 1, Usage: System
START_PFN: 0000015b PFN_COUNT: 0003fe9c PFN_TESTED: 0003fe9c
BITMAP_VA: 0000000000000000 BITMAP_PA: 00000000001be020
261788 good pages from 00000000002b6000 to 000000007ffedfff

Cluster: 2, Usage: Console
START_PFN: 0003fff7 PFN_COUNT: 00000009 PFN_TESTED: 00000000
9 pages from 000000007ffee000 to 000000007fffffff
>>>
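The address ranges printed for each memory cluster follow directly from START_PFN and PFN_COUNT. The Python sketch below reproduces the three ranges in Example 4-8. The 8 KB page size is an assumption inferred from the printed ranges (347 pages ending at 2b5fff), and the function name is hypothetical.

```python
PAGE_SIZE = 0x2000  # assumption: 8 KB pages, consistent with the ranges in the example


def cluster_range(start_pfn, pfn_count):
    """Return (first byte address, last byte address) covered by a memory cluster."""
    first = start_pfn * PAGE_SIZE
    last = (start_pfn + pfn_count) * PAGE_SIZE - 1
    return first, last


# (START_PFN, PFN_COUNT) for clusters 0, 1, and 2 in the example output.
clusters = [(0x00000000, 0x0000015B), (0x0000015B, 0x0003FE9C), (0x0003FFF7, 0x00000009)]
for start_pfn, pfn_count in clusters:
    first, last = cluster_range(start_pfn, pfn_count)
    print(f"{pfn_count} pages from {first:016x} to {last:016x}")
```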
Example 4-9 shows an info 1 display. This output is available only after a boot operation.

Example 4-9 info 1

>>> info 1
pte 000000003FFA8000 0000000100001101 va 0000000010000000 pa 0000000000002000
pte 000000003FFA8008 0000000200001101 va 0000000010002000 pa 0000000000004000
pte 000000003FFA8010 0000000300001101 va 0000000010004000 pa 0000000000006000
pte 000000003FFA8018 0000000400001101 va 0000000010006000 pa 0000000000008000
pte 000000003FFA8020 0000000500001101 va 0000000010008000 pa 000000000000A000
pte 000000003FFA8028 0000000600001101 va 000000001000A000 pa 000000000000C000
pte 000000003FFA8030 0000000700001101 va 000000001000C000 pa 000000000000E000
pte 000000003FFA8038 0000000800001101 va 000000001000E000 pa 0000000000010000
pte 000000003FFA8040 0000000900001101 va 0000000010010000 pa 0000000000012000
pte 000000003FFA8048 0000000A00001101 va 0000000010012000 pa 0000000000014000
pte 000000003FFA8050 0000000B00001101 va 0000000010014000 pa 0000000000016000
pte 000000003FFA8058 0000000C00001101 va 0000000010016000 pa 0000000000018000
pte 000000003FFA8060 0000000D00001101 va 0000000010018000 pa 000000000001A000
pte 000000003FFA8068 0000000E00001101 va 000000001001A000 pa 000000000001C000
pte 000000003FFA8070 0000000F00001101 va 000000001001C000 pa 000000000001E000
pte 000000003FFA8078 0000001000001101 va 000000001001E000 pa 0000000000020000
pte 000000003FFA8080 0000001100001101 va 0000000010020000 pa 0000000000022000
pte 000000003FFA8088 0000001200001101 va 0000000010022000 pa 0000000000024000
pte 000000003FFA8090 0000001300001101 va 0000000010024000 pa 0000000000026000
pte 000000003FFA8098 0000001400001101 va 0000000010026000 pa 0000000000028000
pte 000000003FFA80A0 0000001500001101 va 0000000010028000 pa 000000000002A000
pte 000000003FFA80A8 0000001600001101 va 000000001002A000 pa 000000000002C000
pte 000000003FFA80B0 0000001700001101 va 000000001002C000 pa 000000000002E000
pte 000000003FFA80B8 0000001800001101 va 000000001002E000 pa 0000000000030000
pte 000000003FFA80C0 0000001900001101 va 0000000010030000 pa 0000000000032000
pte 000000003FFA80C8 0000001A00001101 va 0000000010032000 pa 0000000000034000
pte 000000003FFA80D0 0000001B00001101 va 0000000010034000 pa 0000000000036000
pte 000000003FFA80D8 0000001C00001101 va 0000000010036000 pa 0000000000038000
pte 000000003FFA80E0 0000001D00001101 va 0000000010038000 pa 000000000003A000
pte 000000003FFA80E8 0000001E00001101 va 000000001003A000 pa 000000000003C000
. . .
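In the listing above, the pa column can be derived from the PTE contents, and the va column from the position of the entry in the table. The Python sketch below shows one way to reproduce both columns. It is an inference from the displayed values, not a definition of the PTE format: it assumes the page frame number occupies the upper 32 bits of the entry, that pages are 8 KB, and that the first entry shown maps virtual address 10000000. The function and constant names are hypothetical.

```python
PAGE_SIZE = 0x2000            # assumption: 8 KB pages
PTE_BASE = 0x3FFA8000         # address of the first PTE shown in the listing
VA_BASE = 0x10000000          # virtual address mapped by that first PTE
PTE_BYTES = 8                 # each entry in the listing is 8 bytes apart


def pte_to_pa(pte_value):
    """Physical address implied by a PTE value (PFN assumed in the upper 32 bits)."""
    return (pte_value >> 32) * PAGE_SIZE


def pte_address_to_va(pte_address):
    """Virtual address mapped by the PTE stored at pte_address."""
    index = (pte_address - PTE_BASE) // PTE_BYTES
    return VA_BASE + index * PAGE_SIZE


# First and last entries shown in the listing.
print(f"{pte_to_pa(0x0000000100001101):016X}")
print(f"{pte_address_to_va(0x3FFA80E8):016X} {pte_to_pa(0x0000001E00001101):016X}")
```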
Example 4-10 shows an info 2 display. This command is the SRM's view of the configuration tree that the RCM displays.
show flags? ( Y/<N>)
Dump a Node - Enter Handle (hex) ?
show fw_usage flags? ( Y/<N>)
>>>
CSRs:
PCHIP 0 CSRs: GWSBA0 GWSBA1 GWSBA2 GWSBA3 GWSM0 GWSM1 GWSM2 GWSM3 GTBA0 GTBA1 GTBA2 GTBA3 GPCTL GPLAT SERROR SERREN
GPERROR     0000000000000000 : 0500
GPERREN     00000000000007F6 : 0540
SCTL        0000000002831411 : 0700
AWSBA0      0000000000800000 : 1000
AWSBA1      0000000080000001 : 1040
AWSBA2      0000000000000000 : 1080
AWSBA3      0000000000000002 : 10c0
AWSM0       0000000000700000 : 1100
AWSM1       000000003FF00000 : 1140
AWSM2       000000003FF00000 : 1180
AWSM3       00000000FFF00000 : 11c0
ATBA0       0000000000000000 : 1200
ATBA1       0000000000000000 : 1240
ATBA2       0000000004C00000 : 1280
ATBA3       0000000005000000 : 12c0
APCTL       00000004C00200C2 : 1300
APLAT       000000000000FF00 : 1340
AGPERROR    0020000000000000 : 1400
AGPERREN    0000000000000000 : 1440
APERROR     00200000003B8000 : 1500
APERREN     00000000000007F6 : 1540
>>>
cns$i_ctl                  21300386 : 0360
cns$i_ctl+4                00000000 : 0364
cns$pctr_ctl               00000000 : 0368
cns$pctr_ctl+4             00000000 : 036c
cns$process_context        00000004 : 0370
cns$process_context+       00000000 : 0374
cns$i_stat                 c0000000 : 0378
cns$i_stat+4               00000143 : 037c
cns$dtb_alt_mode           00000000 : 0380
cns$dtb_alt_mode+4         00000000 : 0384
cns$mm_stat                000000b0 : 0388
cns$mm_stat+4              00000000 : 038c
cns$m_ctl                  00000020 : 0390
cns$m_ctl+4                00000000 : 0394
cns$dc_ctl                 000000c3 : 0398
cns$dc_ctl+4               00000000 : 039c
cns$dc_stat                00000000 : 03a0
cns$dc_stat+4              00000000 : 03a4
cns$write_many             00000000 : 03a8
cns$write_many+4           00000000 : 03ac
cns$virbnd                 00000000 : 03b0
cns$virbnd+4               00000000 : 03b4
cns$sysptbr                00000000 : 03b8
cns$sysptbr+4              00000000 : 03bc
cns$report_lam             00000000 : 03c0
cns$report_lam+4           00000000 : 03c4
cns$report_cstat0          00000000 : 03c8
cns$report_cstat0+4        00000000 : 03cc
cns$crd_count              00000000 : 03d0
cns$crd_count+4            00000000 : 03d4
cns$m_fix                  00000000 : 03d8
cns$m_fix+4                00000000 : 03dc
>>>
cns$gpr[25]                24d0b9ca : 00d8
cns$gpr[25]+4              00000000 : 00dc
cns$gpr[26]                00781048 : 00e0
cns$gpr[26]+4              fffffc00 : 00e4
cns$gpr[27]                0096d1b0 : 00e8
cns$gpr[27]+4              fffffc00 : 00ec
cns$gpr[28]                00782550 : 00f0
cns$gpr[28]+4              fffffc00 : 00f4
cns$gpr[29]                009b6330 : 00f8
cns$gpr[29]+4              fffffc00 : 00fc
cns$gpr[30]                a124f7b0 : 0100
cns$gpr[30]+4              fffffe04 : 0104
cns$gpr[31]                00000000 : 0108
cns$gpr[31]+4              00000000 : 010c
cns$fpr[0]                 00000000 : 0110
cns$fpr[0]+4               89000000 : 0114
cns$fpr[1]                 9999999a : 0118
cns$fpr[1]+4               40ada999 : 011c
cns$fpr[2]                 00000000 : 0120
cns$fpr[2]+4               00000000 : 0124
cns$fpr[3]                 00000000 : 0128
cns$fpr[3]+4               00000000 : 012c
cns$fpr[4]                 00000000 : 0130
cns$fpr[4]+4               00000000 : 0134
cns$fpr[5]                 00000000 : 0138
cns$fpr[5]+4               00000000 : 013c
cns$fpr[6]                 00000000 : 0140
cns$fpr[6]+4               00000000 : 0144
cns$fpr[7]                 00000000 : 0148
cns$fpr[7]+4               00000000 : 014c
cns$fpr[8]                 00000000 : 0150
cns$fpr[8]+4               00000000 : 0154
cns$fpr[9]                 00000000 : 0158
cns$fpr[9]+4               00000000 : 015c
cns$fpr[10]                00000000 : 0160
cns$fpr[10]+4              00000000 : 0164
cns$fpr[11]                00000000 : 0168
cns$fpr[11]+4              00000000 : 016c
cns$fpr[12]                9999999a : 0170
cns$fpr[12]+4              3fb99999 : 0174
cns$fpr[13]                00000000 : 0178
cns$fpr[13]+4              00000000 : 017c
cns$fpr[14]                00002008 : 0180
cns$fpr[14]+4              00000000 : 0184
cns$fpr[15]                00000000 : 0188
cns$fpr[15]+4              00000000 : 018c
cns$fpr[16]                00000000 : 0190
cns$fpr[16]+4              00000000 : 0194
cns$fpr[17]                9999999a : 0198
cns$fpr[17]+4              40ada999 : 019c
cns$fpr[18]                00000000 : 01a0
cns$fpr[18]+4              40a00000 : 01a4
cns$fpr[19]                00000000 : 01a8
cns$fpr[19]+4              00000000 : 01ac
cns$fpr[20]                00000000 : 01b0
cns$fpr[20]+4              00000000 : 01b4
cns$fpr[21]                00000000 : 01b8
cns$fpr[21]+4              00000000 : 01bc
cns$fpr[22]                00000000 : 01c0
cns$fpr[22]+4              00000000 : 01c4
cns$fpr[23]                00000000 : 01c8
cns$fpr[23]+4              00000000 : 01cc
cns$fpr[24]                00000000 : 01d0
cns$fpr[24]+4              00000000 : 01d4
cns$fpr[25]                00000000 : 01d8
cns$fpr[25]+4              00000000 : 01dc
cns$fpr[26]                00000000 : 01e0
cns$fpr[26]+4              00000000 : 01e4
cns$fpr[27]                00000000 : 01e8
cns$fpr[27]+4              00000000 : 01ec
cns$fpr[28]                00000000 : 01f0
cns$fpr[28]+4              00000000 : 01f4
cns$fpr[29]                00000000 : 01f8
cns$fpr[29]+4              00000000 : 01fc
cns$fpr[30]                00000000 : 0200
cns$fpr[30]+4              00000000 : 0204
cns$fpr[31]                00000000 : 0208
cns$fpr[31]+4              00000000 : 020c
cns$mchkflag               00000228 : 0210
cns$mchkflag+4             00000000 : 0214
cns$pt                     00004200 : 0218
cns$pt+4                   00000000 : 021c
cns$whami                  00000000 : 0220
cns$whami+4                00000000 : 0224
cns$scc                    00000000 : 0228
cns$scc+4                  00000000 : 022c
cns$prbr                   00000000 : 0230
cns$prbr+4                 00000000 : 0234
cns$ptbr                   55418000 : 0238
cns$ptbr+4                 00000000 : 023c
cns$trap                   00000000 : 0240
cns$trap+4                 00000000 : 0244
cns$halt_code              00000005 : 0248
cns$halt_code+4            00000000 : 024c
cns$ksp                    a124f730 : 0250
cns$ksp+4                  fffffe04 : 0254
cns$scbb                   00000000 : 0258
cns$scbb+4                 00000000 : 025c
cns$pcbb                   1264fa00 : 0260
cns$pcbb+4                 00000000 : 0264
cns$vptb                   00000000 : 0268
cns$vptb+4                 fffffe00 : 026c
cns$shadow4                00004200 : 02d8
cns$shadow4+4              00000000 : 02dc
cns$shadow5                00000000 : 02e0
cns$shadow5+4              00005b00 : 02e4
cns$shadow6                04f57f11 : 02e8
cns$shadow6+4              00000000 : 02ec
cns$shadow7                0001fb84 : 02f0
cns$shadow7+4              00000000 : 02f4
cns$shadow20               00000005 : 02f8
cns$shadow20+4             00000000 : 02fc
cns$p_temp                 00007000 : 0300
cns$p_temp+4               00000000 : 0304
cns$p_misc                 00000004 : 0308
cns$p_misc+4               00000000 : 030c
cns$shadow23               007683f0 : 0310
cns$shadow23+4             fffffc00 : 0314
cns$fpcr                   00000000 : 0318
cns$fpcr+4                 89000000 : 031c
cns$va                     00000008 : 0320
cns$va+4                   00000000 : 0324
cns$va_ctl                 00000000 : 0328
cns$va_ctl+4               fffffe00 : 032c
cns$exc_addr               007683f0 : 0330
cns$exc_addr+4             fffffc00 : 0334
cns$ier_cm                 e0000000 : 0338
cns$ier_cm+4               0000006a : 033c
cns$sirr                   00000000 : 0340
cns$sirr+4                 00000000 : 0344
cns$isum                   00000000 : 0348
cns$isum+4                 00000000 : 034c
cns$exc_sum                00000000 : 0350
cns$exc_sum+4              00000000 : 0354
cns$pal_base               00018000 : 0358
cns$pal_base+4             00000000 : 035c
cns$i_ctl                  21300396 : 0360
cns$i_ctl+4                fffffe00 : 0364
cns$pctr_ctl               00000000 : 0368
cns$pctr_ctl+4             00000000 : 036c
cns$process_context        00000000 : 0370
cns$process_context+       00005b00 : 0374
cns$i_stat                 80000000 : 0378
cns$i_stat+4               00000147 : 037c
cns$dtb_alt_mode           00000000 : 0380
cns$dtb_alt_mode+4         00000000 : 0384
cns$mm_stat                00000290 : 0388
cns$mm_stat+4              00000000 : 038c
cns$m_ctl                  00000024 : 0390
cns$m_ctl+4                00000000 : 0394
cns$dc_ctl                 000000c3 : 0398
cns$dc_ctl+4               00000000 : 039c
cns$dc_stat                00000000 : 03a0
cns$dc_stat+4              00000000 : 03a4
cns$write_many             00000000 : 03a8
cns$write_many+4           00000000 : 03ac
cns$virbnd                 00000000 : 03b0
cns$virbnd+4               00000000 : 03b4
cns$sysptbr                00000000 : 03b8
cns$sysptbr+4              00000000 : 03bc
cns$report_lam             00000000 : 03c0
cns$report_lam+4           00000000 : 03c4
cns$report_cstat0          00000000 : 03c8
cns$report_cstat0+4        00000000 : 03cc
cns$crd_count              00000000 : 03d0
cns$crd_count+4            00000000 : 03d4
cns$m_fix                  00000000 : 03d8
cns$m_fix+4                00000000 : 03dc
>>>
mchk_crd__i_stat           00000000 : 0018
mchk_crd__i_stat+4         00000000 : 001c
mchk_crd__dc_stat          00000000 : 0020
mchk_crd__dc_stat+4        00000000 : 0024
mchk_crd__c_addr           00000000 : 0028
mchk_crd__c_addr+4         00000000 : 002c
mchk_crd__dc1_syndrome     00000000 : 0030
mchk_crd__dc1_syndrome+4   00000000 : 0034
mchk_crd__dc0_syndrome     00000000 : 0038
mchk_crd__dc0_syndrome+4   00000000 : 003c
mchk_crd__c_stat           00000000 : 0040
mchk_crd__c_stat+4         00000000 : 0044
mchk_crd__c_sts            00000000 : 0048
mchk_crd__c_sts+4          00000000 : 004c
mchk_crd__mm_stat          00000000 : 0050
mchk_crd__mm_stat+4        00000000 : 0054
mchk_crd__os_flags         00000000 : 0058
mchk_crd__os_flags+4       00000000 : 005c
mchk_crd__cchip_dirx       00000000 : 0060
mchk_crd__cchip_dirx+4     00000000 : 0064
mchk_crd__cchip_misc       00000000 : 0068
mchk_crd__cchip_misc+4     00000000 : 006c
mchk_crd__pachip0_serror   00000000 : 0070
mchk_crd__pachip0_serror+  00000000 : 0074
mchk_crd__pachip0_aperror  00000000 : 0080
mchk_crd__pachip0_aperror  00000000 : 0084
mchk_crd__pachip0_gperror  00000000 : 0078
mchk_crd__pachip0_gperror  00000000 : 007c
mchk_crd__pachip0_agperro  00000000 : 0088
mchk_crd__pachip0_agperro  00000000 : 008c
mchk_crd__pachip1_serror   00000000 : 0090
mchk_crd__pachip1_serror+  00000000 : 0094
mchk_crd__pachip1_aperror  00000000 : 00a0
mchk_crd__pachip1_aperror  00000000 : 00a4
mchk_crd__pachip1_gperror  00000000 : 0098
mchk_crd__pachip1_gperror  00000000 : 009c
mchk_crd__pachip1_agperro  00000000 : 00a8
mchk_crd__pachip1_agperro  00000000 : 00ac
mchk__flag_frame           000000f8 : 00b0
mchk__flag_frame+4         00000000 : 00b4
mchk__offsets              00000018 : 00b8
mchk__offsets+4            000000a0 : 00bc
mchk__mchk_code            00000202 : 00c0
mchk__mchk_code+4          00000001 : 00c4
mchk__i_stat               00000000 : 00c8
mchk__i_stat+4             00000000 : 00cc
mchk__dc_stat              00000000 : 00d0
mchk__dc_stat+4            00000000 : 00d4
mchk__c_addr               00000000 : 00d8
mchk__c_addr+4             00000000 : 00dc
mchk__dc1_syndrome         00000000 : 00e0
mchk__dc1_syndrome+4       00000000 : 00e4
mchk__dc0_syndrome         00000000 : 00e8
mchk__dc0_syndrome+4       00000000 : 00ec
mchk__c_stat               00000000 : 00f0
mchk__c_stat+4             00000000 : 00f4
mchk__c_sts                00000000 : 00f8
mchk__c_sts+4              00000000 : 00fc
mchk__mm_stat              00000000 : 0100
mchk__mm_stat+4            00000000 : 0104
mchk__exc_addr             004c3050 : 0108
mchk__exc_addr+4           fffffc00 : 010c
mchk__ier_cm               e0000000 : 0110
4-35
mchk__ier_cm+4 mchk__isum mchk__isum+4 mchk__reserved_0 mchk__reserved_0+4 mchk__pal_base mchk__pal_base+4 mchk__i_ctl mchk__i_ctl+4 mchk__process_context mchk__process_context+4 mchk__reserved_1 mchk__reserved_1+4 mchk__reserved_2 mchk__reserved_2+4 mchk__os_flags mchk__os_flags+4 mchk__cchip_dirx mchk__cchip_dirx+4 mchk__cchip_misc mchk__cchip_misc+4 mchk__pachip0_serror mchk__pachip0_serror+4 mchk__pachip0_aperror mchk__pachip0_aperror+4 mchk__pachip0_gperror mchk__pachip0_gperror+4 mchk__pachip0_agperror mchk__pachip0_agperror+4 mchk__pachip1_serror mchk__pachip1_serror+4 mchk__pachip1_aperror mchk__pachip1_aperror+4 mchk__pachip1_gperror mchk__pachip1_gperror+4 mchk__pachip1_agperror mchk__pachip1_agperror+4 >>>
0000006e 00000000 00000002 00000000 00000000 00018000 00000000 21300396 fffffe00 00000004 00004380 00000000 00000000 00000000 00000000 00000004 00000000 00000000 40000000 000000e0 00000012 56780002 00000034 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
: : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : :
0114 0118 011c 0120 0124 0128 012c 0130 0134 0138 013c 0140 0144 0148 014c 0150 0154 0158 015c 0160 0164 0168 016c 0178 017c 0170 0174 0180 0184 0188 018c 0198 019c 0190 0194 01a0 01a4
Console Uncorrectable Error Frame Header OCT 25 15:19:36 Processor Machine Check Frame CPU ID Frame Flag/Size Frame Offsets Frame Revision/Code I_STAT DC_STAT C_ADDR DC1_SYNDROME DC0_SYNDROME
0050 0058 0060 0068 0070 0078 0080 0088 0090 0098 00a0 00a8 00b0 00b8 00c0 00c8 00d0 00d8 00e0 00e8 00f0 00f8 0100
: : : : : : : : : : : : : : : : : : : : : : :
0000000000000000 000000000000000d 00000000000002d1 00000000001caf00 0000002280000000 0000000200000000 0000000000000000 0000000000008000 0000000016304386 0000000000000004 0000000000000000 0000000000000000 0000000000000004 0000000000000000 0000000000000000 0000000000000000 0000000000000000 0000000000000000 0000000000000000 0000000000000000 0000000000000000 0000000000000000 0000000000000000
C_STAT C_STS MM_STAT EXC_ADDR IER_CM ISUM RESERVED PAL_BASE I_CTL PROCESS_CONTEXT Reserved Reserved OS Flags Cchip DIRx Cchip MISC PChip 0 SERROR PChip 0 GPERROR PChip 0 APERROR PChip 0 AGPERROR PChip 1 SERROR PChip 1 GPERROR PChip 1 APERROR PChip 1 AGPERROR Titan Pchip0 Extended Frame SCTL SERREN APCTL APERREN AGPERREN ASPRST AWSBA0 AWSBA1 AWSBA2 AWSBA3 AWSM0 AWSM1 AWSM2 AWSM3 ATBA0 ATBA1 ATBA2 ATBA3 GPCTL GPERREN GSPRST GWSBA0 GWSBA1 GWSBA2 GWSBA3 GWSM0 GWSM1 GWSM2 GWSM3 GTBA0 GTBA1 GTBA2 GTBA3
0000 : 00010008000c0110 0008 : 0000000002831411 0010 : 000000000000000e 0018 : 00000004c00000c2 0020 : 00000000000007f6 0028 : 0000000000000000 0030 : 0000000000000000 0038 : 0000000000000000 0040 : 0000000080000001 0048 : 00000000c0000003 0050 : 0000000000000002 0058 : 0000000000000000 0060 : 000000003ff00000 0068 : 000000003ff00000 0070 : 0000000000000000 0078 : 0000000000000000 0080 : 0000000000000000 0088 : 0000000003100000 0090 : 0000000000000000 0098 : 00000004c10000c2 00a0 : 00000000000007f6 00a8 : 0000000000000000 00b0 : 0000000000800003 00b8 : 0000000080000001 00c0 : 00000000c0000003 00c8 : 0000000000000002 00d0 : 0000000000700000 00d8 : 000000003ff00000 00e0 : 000000003ff00000 00e8 : 0000000000000000 00f0 : 00000000002cc000 00f8 : 0000000000000000 0100 : 0000000002f00000 0108 : 0000000000000000 Error 3 0000 : 0001000200050018 0008 : 00003308060e3a34
0010 : 0000000100000080 0000 0008 0010 0018 0020 0028 0030 0038 0040 0048 0050 0058 0060 0068 0070 0078 : : : : : : : : : : : : : : : : 00010003000c0080 0000000000000000 0000000000000070 0000001800000018 0000000100000206 0000000000000000 0000000000000000 0000000000000009 0000000000000000 0000000000000000 0000000000000000 0000000000000000 0000000000000000 0000000000000400 0000010001000000 0000000000000000 System Event Frame CPU ID Frame Flag/Size Frame Offsets Frame Revision/Code OS Flags Cchip DIRx TIG Info Reserved RMC Override Power Info RMC Info Temp Info Fan Info Fatal Summary Reserved
The kill command terminates a specified process. The kill_diags command terminates all diagnostics.

Syntax

kill_diags
kill [PID. . . ]

Arguments

[PID. . . ]   The process ID of the diagnostic to terminate. Use the show_status command to determine the process ID.
4.12 memexer
The memexer command runs a specified number of memory exercisers in the background. Nothing is displayed unless an error occurs. On each pass, each exerciser tests all available memory in blocks twice the size of the backup cache. The following example shows no errors.
The following example shows a memory compare error indicating bad DIMMs. In most cases, the failing bank and DIMM position are specified in the error message.
>>> memexer 3
*** Hard Error - Error #41 - Memory compare error

Diagnostic Name   ID         Device   Pass   Test   Hard/Soft   03-Sept-2003
memtest           00000193   brd0     114    1      0           12:00:01

Expected value:   25c07
Received value:   35c07
Failing addr:     a11848
Use the show_status command to display the progress of the tests. Use the kill or kill_diags command to terminate the test.

Syntax

memexer [number]

Arguments

[number]   Number of memory exercisers to start. The default is 1. The number of exercisers, as well as the length of time for testing, depends on the context of the testing.
4.13 memtest
The memtest command exercises a specified section of memory. Typically memtest is run from the built-in console script. Advanced users may want to use the specific options described here.
Use the show memory command or an info 0 command to see where memory is located.
Starting address.
Length of the section to test in bytes.
Passcount. In this example, the test will run for 10 passes.
The test detected a failure on DIMM 3.
Use the show_status command to display the progress of the test. Use the kill or kill_diags command to terminate the test.

Memtest provides a graycode memory test. The test writes to memory and then reads the previously written value for comparison. The data in the section of memory under test is destroyed. The -z option allows testing outside of the main memory pool. Use caution because this option can overwrite the console.

Memtest may be run on any specified address. If the -z option is not included (the default), the address is verified and allocated from the firmware's memory zone. If the -z qualifier is included, the test is started without verification of the starting address.

When a starting address is specified, the memory is allocated beginning at the starting address minus 32 bytes, for the length specified. The extra 32 bytes are reserved for the allocation header information. Therefore, if a starting address of 0xa00000 and a length of 0x100000 are requested, the area from 0x9fffe0 through 0xb00000 is reserved.

This may be confusing if you try to begin two memtest processes simultaneously, one beginning at 0xa00000 for a length of 0x100000 and the other at 0xb00000 for a length of 0x100000. The second memtest process reports that it is "Unable to allocate memory of length 100000 at starting address b00000", because its 32-byte header would fall inside the area already reserved by the first process. Instead, the second process should use the starting address of 0xb00020.
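The allocation arithmetic described above can be checked with a short sketch. This is an illustration only, not console firmware code: the function names are hypothetical, and the reserved area is modeled as a half-open range (an assumption chosen so that a second test starting at 0xb00020 does not collide with the first).

```python
HEADER_BYTES = 32  # allocation header size stated in the text


def reserved_range(start, length):
    """Return (first, end) of the area reserved for one memtest run.

    The allocation begins HEADER_BYTES below the requested start address
    and runs up to start + length. The end is treated as exclusive,
    which is an assumption made for this sketch.
    """
    return start - HEADER_BYTES, start + length


def overlaps(a, b):
    """True if two half-open reserved ranges share any address."""
    return a[0] < b[1] and b[0] < a[1]


first = reserved_range(0xA00000, 0x100000)
bad_second = reserved_range(0xB00000, 0x100000)
good_second = reserved_range(0xB00020, 0x100000)

print([hex(v) for v in first])        # reserved area of the first test
print(overlaps(first, bad_second))    # header of the second test collides
print(overlaps(first, good_second))   # starting 32 bytes higher avoids it
```

The same calculation can be used to pick non-conflicting start addresses when several memtest processes are run over adjacent ranges.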
NOTE:
If memtest is used to test large sections of memory, testing may take a while to complete. If you issue a Ctrl/C or kill PID in the middle of testing, memtest may not abort right away. For speed reasons, a check for a Ctrl/C or kill is done outside of any test loops. If this is not satisfactory, you can run concurrent memtest processes in the background with shorter lengths within the target range.
Memtest Test 1 Graycode Test

Memtest Test 1 uses a graycode algorithm to test a specified section of memory. The graycode algorithm used is: data = (x>>1)^x, where x is an incrementing value. Three passes are made of the memory under test.

The first pass writes alternating graycode and inverse graycode to each group of four longwords. This causes many data bits to toggle between each 16-byte write. For example, the graycode patterns for a 32-byte block would be:

Graycode(0)           00000000
Graycode(1)           00000001
Graycode(2)           00000003
Graycode(3)           00000002
Inverse Graycode(4)   FFFFFFF9
Inverse Graycode(5)   FFFFFFF8
Inverse Graycode(6)   FFFFFFFA
Inverse Graycode(7)   FFFFFFFB

The second pass reads each location, verifies the data, and writes the inverse of the data, one longword at a time. This causes every data bit to be written as both a one and a zero.

The third pass reads and verifies each location.
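The first-pass pattern can be reproduced with a few lines of Python. This is a minimal sketch for illustration, not the firmware's implementation; the function names are hypothetical, and it assumes 32-bit longwords and that the graycode/inverse alternation continues every four longwords beyond the 32-byte example.

```python
MASK32 = 0xFFFFFFFF  # longwords are assumed to be 32 bits wide


def graycode(x):
    """Graycode of x as given in the text: data = (x >> 1) ^ x."""
    return ((x >> 1) ^ x) & MASK32


def first_pass_pattern(n_longwords):
    """Alternate graycode and inverse graycode every four longwords."""
    pattern = []
    for x in range(n_longwords):
        value = graycode(x)
        if (x // 4) % 2 == 1:          # odd groups of four are inverted
            value = ~value & MASK32
        pattern.append(value)
    return pattern


# Eight longwords cover the 32-byte block used in the example above.
for index, value in enumerate(first_pass_pattern(8)):
    print(f"{index}: {value:08X}")
```

Run for eight longwords, the sketch yields the same eight values listed in the example.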
You can specify the -f (fast) option so that the explicit data verify sections of the second and third loops are not performed. This does not catch address shorts but stresses memory with a higher throughput. The ECC/EDC logic can be used to detect failures.
Syntax

memtest ( [-sa <start_address>] [-ea <end_address>] [-l <length>] [-bs <block_size>] [-i <address_inc>] [-p <pass_count>] [-d <data_pattern>] [-rs <random_seed>] [-ba <block_address>] [-t <test_mask>] [-se <soft_error_threshold>] [-g <group_name>] [-rb] [-f] [-m] [-z] [-h] [-mb] )

Options

-sa   Start address. Default is first free space in memzone.
-ea   End address. Default is start address plus length size.
-l    Length of section to test in bytes. Default is the zone size with the -rb option and the block_size for all other tests. -l has precedence over -ea.
-bs   Block (packet) size in bytes, in hex. Default is 8192 bytes. This is used only for the random block test. For all other tests the block size equals the length.
-i    Specifies the address increment value in longwords. This value is used to increment the address through the memory to be tested. The default is 1 (longword). This is only implemented for the graycode test. An address increment of 2 tests every other longword. This option is useful for multiple CPUs testing the same physical memory.
-p    Passcount. If 0, run indefinitely or until Ctrl/C is issued. Default = 1.
-t    Test mask. Default = run all tests in selected group.
-g    Group name.
-se   Soft error threshold.
-f    Fast. If -f is included in the command line, the data compare is omitted. Detects only ECC/EDC errors.
-m    Timer. Prints out the run time of the pass. Default = off.
-z    Tests the specified memory address without allocation. Bypasses all checking but allows testing in addresses outside of the main memory heap. Also allows unaligned input.
      CAUTION: This flag can overwrite the console. If the system hangs, press the halt/reset button (if configured for reset).
-d    Used only for march test (2). Uses this pattern as test pattern. Default = 5's.
-h    Allocates test memory from the firmware heap.
-rs   Used only for random test (3). Uses this data as the random seed to vary random data patterns generated. Default = 0.
-rb   Randomly allocates and tests all of the specified memory address range. Allocations are done in units of block_size.
-mb   Memory barrier flag. Used only in the -f graycode test. When set, an mb is done after every memory access. This guarantees serial access to memory.
-ba   Used only for block test (4). Uses the data stored at this address to write to each block.
4.14 net
The net command performs maintenance operations on a specified Ethernet port. Net -ic initializes the MOP counters for the specified Ethernet port, and net -s displays the current status of the port, including the contents of the MOP counters.
4.15 nettest
The nettest command tests the network ports using MOP loopback. Typically nettest is run from the built-in console script. Advanced users may want to use the specific options and environment variables described here.
e*
Internal loopback test on port ei*0
Internal loopback test on ports ewa0/ewb0
External loopback test on all Ethernet ports; wait 10 seconds between tests

Nettest performs a network test. It can test the ei* or ew* ports in internal loopback, external loopback, or live network loopback mode. Nettest contains the basic options to run MOP loopback tests.

Many environment variables can be set from the console to customize nettest before nettest is started. The environment variables, a brief description, and their default values are listed in the syntax table in this section. Each variable name is preceded by e*a0_ or e*b0_ to specify the desired port.

You can change other network driver characteristics by modifying the port mode. See the -mode option.

Use the show_status display to determine the process ID when terminating an individual diagnostic test. Use the kill or kill_diags command to terminate tests.
Syntax

nettest ( [-f <file>] [-mode <port_mode>] [-p <pass_count>] [-sv <mop_version>] [-to <loop_time>] [-w <wait_time>] [<port>] )

Arguments

<port>   Specifies the Ethernet port on which to run the test.

Options

-f <file>            Specifies the file containing the list of network station addresses to loop messages to. The default file name is lp_nodes_e*a0 for port e*a0. The default file name is lp_nodes_e*b0 for port e*b0. The files by default have their own station address.
-mode <port_mode>    Specifies the mode to set the port adapter. The default is ex (external loopback). Allowed values are:
                     df : default, use environment variable values
                     ex : external loopback
                     in : internal loopback
                     nm : normal mode
                     nf : normal filter
                     pr : promiscuous
                     mc : multicast
                     ip : internal loopback and promiscuous
                     fc : force collisions
                     nofc : do not force collisions
                     nc : do not change mode
-p <pass_count>      Specifies the number of times to run the test. If 0, run until terminated by a kill or kill_diags command. The default is 1.
                     NOTE: This is the number of passes for the diagnostic. Each pass sends the number of loop messages set by the environment variable ega*_loop_count, eia*_loop_count, or ewa*_loop_count.
-sv <mop_version>    Specifies which MOP version protocol to use. If 3, then MOP V3 (DECNET Phase IV) packet format is used. If 4, then MOP V4 (DECNET Phase V IEEE 802.3) format is used.
-to <loop_time>      Specifies the time in seconds allowed for the loop messages to be returned. The default is 2 seconds.
-w <wait_time>       Specifies the time in seconds to wait between passes of the test. The default is 0 (no delay). The network device can be very CPU intensive. This option allows other processes to run.

Environment Variables

e*a*_loop_count   Specifies the number (hex) of loop requests to send. The default is 0x3E8 loop packets.
e*a*_loop_inc     Specifies the number (hex) of bytes the message size is increased on successive messages. The default is 0xA bytes.
e*a*_loop_patt    Specifies the data pattern (hex) for the loop messages. The following are legitimate values:
                  0 : all zeros
                  1 : all ones
                  2 : all fives
                  3 : all 0xAs
                  4 : incrementing data
                  5 : decrementing data
                  ffffffff : all patterns
loop_size         Specifies the size (hex) of the loop message. The default packet size is 0x2E.
IMPORTANT:
The system serial number must be set correctly. System Event Analyzer will not work with an incorrect serial number.
When the system motherboard is replaced, you must use the set sys_serial_num command to restore the master setting.

Syntax

set sys_serial_num value

Value is the system serial number, which is on a sticker on the back of the system enclosure.
The output of the show error command is based on information logged to the serial control bus EEPROMs on the system FRUs. Both the operating system and the ROM-based diagnostics log errors to the EEPROMs. This functionality allows you to generate an error log from the console environment. No errors are displayed for fans and the power supply because these components do not have an EEPROM.

Syntax

show error

All FRUs with errors are displayed. If no errors are logged, nothing is displayed and you are returned to the SRM console prompt.

Example 4-23 shows TDD, SDD, checksum, and sys_serial_num mismatch errors logged to the EEPROM on the system motherboard (HMB). Table 4-2 shows a reference to these errors. The bit masks correspond to the bit masks that would be displayed in the E field of the show fru command.

FRU to which errors are logged; in this example the system motherboard, HMB.
A TDD error has been logged. TDDs (test-directed diagnostics) test specific functions sequentially. Typically, nothing else is running during the test. TDDs are performed in SROM or XSROM or early in the console power-up flow.
An SDD error has been logged. SDDs (symptom-directed diagnostics) are generic diagnostic exercisers that try to cause random behavior and look for failures or symptoms. All SDDs are logged by the System Event Analyzer.
Three checksum errors have been logged.
There was a mismatch between the serial number on the system motherboard and the system serial number. This could occur if a motherboard from a system with a different serial number was swapped into this system.
FRUname HMB HMB.DIMM0.J14 HMB.DIMM1.J12 HMB.DIMM2.J15 HMB.DIMM3.J13 HMB.PCIRSR HMB.PCIRSR.PCI PWR0 SYS FAN PCI FAN CPU FAN DISK FAN >>>
Serial# Model/Other Alias/Misc SW32400011 SW32400011 SW32400011 SW32400011 SW32400011 SW31200018 234 234 234 234 3D Labs OX PS FAN FAN FAN FAN ce ce ce ce
J3 J1 J32 J31
FRUname       The FRU name recognized by the SRM console. The name also indicates the location of that FRU in the physical hierarchy. HMB = system motherboard; CPU = CPUs; DIMMn = DIMMs; CPB = PCI; PCI = PCI option; SBM = SCSI backplane; PWR = power supply; FAN = fans; JIO = I/O connector module (junk I/O).

E             Error field. Indicates whether the FRU has any errors logged against it. FRUs without errors show 00 (hex). FRUs with errors have a non-zero value that represents a bit mask of possible errors. See Table 4-2.

Part #        The part number of the FRU in ASCII, either an HP part number or a vendor part number.

Serial #      The serial number. For HP FRUs, the serial number has the form XXYWWNNNNN, where XX = manufacturing location code, YWW = year and week, and NNNNN = sequence number. For vendor FRUs, the 4-byte sequence number is displayed in hex.

Model/Other   Optional data. For HP FRUs, the HP part alias number (if one exists). For vendor FRUs, the year and week of manufacture.

Alias/Misc    Miscellaneous information about the FRUs. For HP FRUs, a model name, number, or the common name for the entry in the Part # field. For vendor FRUs, the manufacturer's name.
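The XXYWWNNNNN serial number layout can be split mechanically. The sketch below is illustrative only: the function and type names are hypothetical, and it assumes that Y is a single year digit and that WW and NNNNN are decimal digits, which the text implies but does not state outright.

```python
from typing import NamedTuple


class HpSerial(NamedTuple):
    location: str    # XX: manufacturing location code
    year_digit: int  # Y: assumed to be the last digit of the year
    week: int        # WW: week of manufacture
    sequence: int    # NNNNN: sequence number


def parse_hp_serial(serial):
    """Split an HP FRU serial number of the form XXYWWNNNNN."""
    if len(serial) != 10:
        raise ValueError("expected 10 characters in the form XXYWWNNNNN")
    location, yww, seq = serial[:2], serial[2:5], serial[5:]
    if not (yww.isdigit() and seq.isdigit()):
        raise ValueError("YWW and NNNNN are assumed to be decimal digits")
    return HpSerial(location, int(yww[0]), int(yww[1:]), int(seq))


# Serial number taken from the show fru example output.
print(parse_hp_serial("SW32400011"))
```

Vendor FRUs use a different layout (a 4-byte sequence number in hex), so the sketch rejects anything that does not fit the HP form.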
The following table lists bit assignments for failures that could potentially be listed in the E (error) field of the show fru command. Because the E field is only two characters wide, bits are ORed together if the device has multiple errors. For example, the E field for a FRU with both TDD (02) and SDD (04) errors would be 06: 010 | 100 = 110 (6)
02
04
<fruname> SDD - Type:0 LastLog: 0 Overwrite: 0 <fruname> EEPROM Unreadable <fruname> Bad checksum 0 to 64 EXP:01 RCV:02 <fruname> Bad checksum 64 to 126 EXP:01 RCV:02
08 10
20
Text Message <fruname> Bad checksum 128 to 254 EXP:01 RCV:02 <fruname> SYS_SERIAL_NUM Mismatch
Meaning and Action Informational. Use the clear_error command to clear the error unless TDD or SDD is also set. Informational. Use the clear_error command to clear the error unless TDD or SDD is also set.
80
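Because the E field is a bit mask, a two-character value can be split back into its individual error bits. The following is a minimal sketch with hypothetical names. Only the TDD (02) and SDD (04) assignments stated in the text are named; any other set bit is reported by its hex value rather than guessed.

```python
# Bit assignments confirmed by the text; other bits are left unnamed here.
KNOWN_BITS = {0x02: "TDD", 0x04: "SDD"}


def decode_e_field(field):
    """Return the error bits set in a two-character hex E field."""
    value = int(field, 16)
    if not 0 <= value <= 0xFF:
        raise ValueError("E field is two hex characters (00 to FF)")
    errors = []
    for bit in (1 << n for n in range(8)):
        if value & bit:
            errors.append(KNOWN_BITS.get(bit, f"bit {bit:02X}"))
    return errors


print(decode_e_field("06"))  # both TDD and SDD bits are set
print(decode_e_field("00"))  # no errors logged
```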
4.19 show_status
The show_status command displays the progress of diagnostics. The command reports one line of information per executing diagnostic. Many of the diagnostics run in the background and provide information only if an error occurs.
ID       Program      Device       Pass   Hard/Soft Bytes Written Bytes Read
-------- ------------ ------------ ------ --------- ------------- -----------
00000001 idle         system            0    0    0             0           0
00002147 memtest      memory            1    0    0     742391808   742391808
0000214c memtest      memory            1    0    0     742391808   742391808
00002151 memtest      memory            2    0    0     729808896   729808896
0000218b memtest      memory            1    0    0             0           0
0000218c memtest      memory            2    0    0     734003200   734003200
000021cf exer_kid     dka0.0.0.8.0      0    0    0             0      483328
000021d0 exer_kid     dka100.1.0.8      0    0    0             0      483328
000021df exer_kid     dqa0.0.0.13.      0    0    0             0      482304
00002211 exer_kid     tta1              4    0    0          4252        4252
0000227b nettest      eia0.0.0.9.0     38    0    0         53504       53504
000022d4 nettest      eib0.0.0.10.     37    0    0         52096       52096
>>>
ID              Process ID.
Program         The SRM diagnostic for the particular device.
Device          The device under test.
Pass            Number of diagnostic passes that have been completed.
Hard/Soft       Error count (hard and soft). Soft errors are not usually fatal; hard errors halt the system or prevent completion of the diagnostics.
Bytes Written   Bytes successfully written by the diagnostic.
Bytes Read      Bytes successfully read by the diagnostic.

The following command string is useful for periodically displaying diagnostic status information for diagnostics running in the background:
>>> while true;show_status;sleep n;done
4.20 sys_exer
The sys_exer command exercises the devices displayed with the show config command. Tests are run concurrently and in the background. Nothing is displayed after the initial test startup messages unless an error occurs.
Device Pass Hard/Soft Bytes Written Bytes Read ------------ ------ --------- ------------- ------------system 0 0 0 0 0 memory 1 0 0 339738624 339738624 memory 1 0 0 335544320 335544320 memory 1 0 0 335544320 335544320 memory 1 0 0 327155712 327155712 memory dka0.0.0.8.0 dka100.1.0.8 dqa0.0.0.13. eia0.0.0.9.0 eib0.0.0.10. lp_nodes_eib 1 0 0 0 17 15 15 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 23936 21270 0 0 156672 156672 197632 23936 21120 6
OpenVMS PALcode V1.98-6, Tru64 UNIX PALcode V1.92-7 starting console on CPU 0
Use the show_status command to display the progress of diagnostic tests. The diagnostics started by the sys_exer command automatically reallocate memory resources, because these tests require additional resources. Use the init command to reconfigure memory before booting an operating system. Because the sys_exer tests are run concurrently and indefinitely (until you stop them with the init command), they are useful in flushing out intermittent hardware problems. When using the sys_exer command after shutting down an operating system, you must initialize the system to a quiescent state. Enter the following command at the SRM console:
>>> init . . . >>> sys_exer
By default, no write tests are performed on disk and tape drives. Media must be installed to test the drives. When the -lb argument is used, a loopback connector is required for the COM2 port (9-pin loopback connector, 12-27351-01).

Syntax

sys_exer [-lb] [-t]

Arguments

[-lb]   The loopback option runs console loopback tests for the COM2 serial port during the test sequence.
[-t]    Number of seconds to run. The default is to run until terminated by a kill or kill_diags command.
4.21 test
The test command verifies all the devices in the system. This command can be used on all supported operating systems.
The test command also does a quick test on the system speaker. A beep is emitted as the command starts to run. The tests are run sequentially, and the status of each subsystem test is displayed to the console terminal as the tests progress. If a particular device is not available to test, a message is displayed. The test script does no destructive testing; that is, it does not write to disk drives. Syntax test [argument] Use the -lb (loopback) argument for console loopback tests.
To run a complete diagnostic test using the test command, the system configuration must include:

A serial loopback connected to the COM2 port (not included)
A trial CD-ROM with files installed

The test script tests devices in the following order:

1. Memory tests (one pass)
2. Read-only tests: DK* disks, DR* disks, DQ* disks, and MK* tapes
   NOTE: You must install media to test disks and tape drives. Since no write tests are performed, it is safe to test disks and tapes that contain data.
3. Console loopback tests if -lb argument is specified: COM2 serial port.
4. VGA console tests: These tests are run only if the console environment variable is set to serial. The VGA console test displays rows of the word HP.
5. Network internal loopback tests for EW*, EI*, and EG* networks.
This chapter explains how to interpret error logs reported by the operating system. The following topics are covered:

Error Log Analysis with System Event Analyzer
Fault Detection and Reporting
Machine Checks/Interrupts
Error Logs
5.1
System Event Analyzer (SEA) is a fault management diagnostic tool that is used to determine the cause of hardware failures. System Event Analyzer performs system diagnostic processing of both single and multiple error/fault events. System Event Analyzer may or may not be installed on the customer's system with the operating system, depending on the release cycle. If SEA is installed, the System Event Analyzer Director starts automatically as part of the system start-up. SEA provides automatic background analysis. When an error event occurs, it triggers the firing of an analysis rule. The analysis engine collects and processes the information and typically generates a problem found report, if appropriate. The report can be automatically sent to users on a notification mailing list and, if DSNlink is installed, a call can be logged with the customer support center. System Event Analyzer has the capability to support the Tru64 UNIX and OpenVMS operating systems on Alpha platforms.
NOTE:
Compaq Analyze was a successor tool to DECevent and typically did not support the same systems as DECevent. Compaq Analyze was renamed to System Event Analyzer in release V4.2.
5.1.1
System Event Analyzer uses the functionality contained in the WEBES Director, a process that manages all other WEBES processes and executes continuously on the machine when configured to do so. The Director manages the decomposition processing of system error events, provides required information to the analysis engine, and performs notification message routing for the system.

System Event Analyzer provides the functionality for system event analysis and Bit-To-Text (BTT) translation. System Event Analyzer includes common WEBES code. Subsequent releases of System Event Analyzer will continue to ship with the common WEBES code.

The Director is started when the system is booted. Normally you do not need to start the Director. If the Director has stopped running, restart it by following the instructions in the System Event Analyzer User's Guide.

System Event Analyzer includes a graphical user interface (WUI) that allows the user to interact with the Director. While only one Director process executes on the machine at any time, many WUI processes can run at the same time, connected to the single Director. Refer to the System Event Analyzer installation and user manuals for the respective operating system to launch the System Event Analyzer WUI.

The HP service tools Web sites available to customers are:

http://h18023.www1.hp.com/support/svctools/webes
or
http://www.compaq.com/support/svctools/webes

The applicable System Event Analyzer documentation includes the following:

System Event Analyzer User's Guide
WEBES Installation Guide for Tru64 UNIX
WEBES Installation Guide for OpenVMS
System Event Analyzer Release Notes
WEBES Release Notes
5.1.2
After you have logged on to System Event Analyzer, the following screen appears. If an event has occurred, it is listed under localhost events. See Figure 5-1.
In Figure 5-2, the Other Logs file is selected and the list of Problem Reports is displayed.
Full View is selected and the problem reports are listed. You may select any log listed in Other Logs to view a list of all problems found. You may also view each report by clicking on the underlined hot link under Problem Reports.
Managed Entity
The Managed Entity designator includes the system host name (typically a computer name for networking purposes), the type of computer system (AlphaServer DS15), and the error event identification. The error event identification uses new common event header Event_ID_Prefix and Event_ID_Count components. The Event_ID_Prefix refers to an OS-specific identification for this event type. The Event_ID_Count indicates the number of this event and the event type.

Service Obligation Data
This item provides the obligation number and validity, the system serial number, and the company name of the service provider.
Brief Description
The Brief Description designator indicates whether the error event is related to the CPU, system (PCI, storage, and so on), or environmental subsystem.

Callout ID
The Callout ID designator provides information about the analysis rule-set. Most characters within this designator are reserved for HP-specific purposes.

Full Description
The Full Description designator provides detailed error information, which can include a description of the detected fault or error condition, the specific address or data bit where this fault or error occurred, the probable FRU list, and service-related information.

FRU List
The FRU List designator lists the most probable defective FRUs. This list indicates that one or more of these FRUs needs to be serviced. The information typically includes the FRU probability, manufacturer, system device type, system physical location, part number, serial number, and firmware revision level (if applicable).
5.1.3 Bit to Text
The following is an example of the Correctable System Event for ds15a.errlog. To access the data, select the Events tab for the problem report selected.
NOTE:
1. By default, SEA does not display correctable CPU or system events (event type 630/620). To display these events with the WUI, the -adv must be added to the logon profile, for example: test-adv
2. When using the CLI to translate event type 630 or 620, the showall qualifier must be added to the command, for example: wsea x trans ds15a.errlog showall.
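In the translated event that follows, each register field is labeled with its bit range, for example Asiz2[15:12] or CRQMAX[19:16]. The sketch below shows how such a field can be extracted from the raw register value to cross-check a translation by hand. It is an illustration only: the helper name is hypothetical and is not part of SEA or the wsea CLI; the register values are copied from the example event.

```python
def field(value, high, low):
    """Extract bits [high:low] (inclusive) from a register value."""
    width = high - low + 1
    return (value >> low) & ((1 << width) - 1)


# Raw values as they appear in the example event translation.
AAR_2 = 0x0000000040007009
P0_SCTL = 0x0000000002831411

print(hex(field(AAR_2, 15, 12)))    # Asiz2[15:12]
print(hex(field(AAR_2, 34, 24)))    # Addr2[34:24]
print(hex(field(P0_SCTL, 7, 0)))    # REV[7:0]
print(hex(field(P0_SCTL, 12, 11)))  # SWARB[12:11]
print(hex(field(P0_SCTL, 19, 16)))  # CRQMAX[19:16]
```

The extracted values can be compared against the field values printed next to each register in the translation.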
------
Tru64 UNIX Alpha Hewlett-Packard Company Titan Corelogic CPU Logging this Event
AlphaServer DS15 Compaq Tru64 UNIX V5.1B (Rev. 2650) EPMDS15033 May 29, 2003 2:44:50 PM GMT-04:00 csse32 620
Logout_Frame_CPU_Section Frame_Size x0000 00B0 Frame_Flags x8000 0000 CPU_Area_Offset System_Area_Offset Mchk_Error_Code Code Value[31:0] Frame_Rev I_STAT DC_STAT C_ADDR Register Io_M[43] C_SYNDROME_1 QW_Upper[7:0] C_SYNDROME_0 QW_Lower[7:0] C_STAT x0000 0018 x0000 0058 x0000 0204 x204 x0000 x0000 x0000 x0000 0001 0000 0000 0000 0000 0000 0000 0000 0000 0000
Cbox Read Erred Address System Memory Access Odd QW Data Syndrome No Syndrome Even QW Data Syndrome No Syndrome
x0 x0000 0000 0000 0000 x0 x0000 0000 0000 0000 x0 x0000 0000 0000 0000
Logout_Frame_System_Section SW_Error_Sum_Flags x0000 0000 Pchip0_PCI_Error[0] x0 Pchip1_PCI_Error[1] x0 Pchip_Mem_Error[2] x1 Detected Hot_Plug_Slot[39:32]x0 Cchip_DIRx x1000 0000 Register Pchip0_Cerr[60] x1 Detected Cchip_MISC x0000 0012 Nxm[28] x0 Nxs[31:29] x0 Device P0_Serror xF100 0023 Corr_Ecc_Error[2] x1 Sys_Addr[46:15] x46 9332 [34:3] Bus_Source[53:52] x0 TransAction_Cmd[55:54]x0 ECC_Syndrome[63:56] xF1 xF1 P0_GPerror x0000 0000 PCI_Cmd[55:52] x0 P0_APerror x0020 0000 PCI_Addr[46:14] xED PCI_Cmd[55:52] x2 P0_AGPerror x0000 0000 AGP_Lost_Err[0] x0 AGP_Cmd[52:50] x0 P1_Serror x0000 0000 Bus_Source[53:52] x0 TransAction_Cmd[55:54]x0 ECC_Syndrome[63:56] x0 P1_GPerror x0000 0000 PCI_Cmd[55:52] x0 P1_APerror x0000 0000 PCI_Cmd[55:52] x0 P1_AGPerror x0000 0000 START OF SUBPACKETS IN THIS EVENT
0000 00E0
Cchip Miscellaneous Register Nxs[31:29] NOT Valid If Nxm[28] = 1 - CPU 0 Source Pchip0 System Error Register Non-Fatal ECC Error System Erred Address Bits GPCI Bus DMA Read Data Bit 30 Error ECC Syndrome No Error Detected Interrupt Acknowledge Pchip0 Aport Error Register PCI Erred Address Bits [34:2] IO Read No Error Detected Read No Error Detected GPCI Bus DMA Read No ECC Error No Error Detected Interrupt Acknowledge No Error Detected Interrupt Acknowledge No Error Detected
4999 0004
ES4X Dual Port RAM Subpacket, Version 1 DPR_0 x40 (DS15 - 2 Dimms), configured as lowest array DPR_1 x10 Array_0_Size[7:0] x10 DPR_2 x00 DPR_3 x00 Array_1_Size[7:0] x0 DPR_4 x41 (DS15 - 2 Dimms), configured as next lowest array DPR_5 x10
Non - Split, Set0 - 4 Dimms Array 0 Dpr Location x81 1 Gbytes DPR Location x82 Unused DPR Location x83 Unused No Good Memory in Array 1 Non - Split, Set0 - 4 Dimms Array 2 Dpr Location x85
1 Gbytes DPR Location x86 Unused DPR Location x87 Unused No Good Memory in Array 3
System Memory / IO Configuration Subpacket, Version 1 AAR_0 x0000 0000 0000 7009 Memory Array 0 Configuration Register Sa0[8] x0 Non - Split Array Asiz0[15:12] x7 1 Gb Addr0[34:24] x0 Array0 Base Address [34:24] Bits AAR_1 x0000 0000 0000 0000 Memory Array 1 Configuration Register Sa1[8] x0 Non - Split Array Asiz1[15:12] x0 Array 1 Not Used Addr1[34:24] x0 Array1 Base Address [34:24] Bits AAR_2 x0000 0000 4000 7009 Memory Array 2 Configuration Register Sa2[8] x0 Non - Split Array Asiz2[15:12] x7 1 Gb Addr2[34:24] x40 Array2 Base Address [34:24] Bits AAR_3 x0000 0000 0000 0000 Memory Array 3 Configuration Register Sa3[8] x0 Non - Split Array Asiz3[15:12] x0 Array 3 Not Used Addr3[34:24] x0 Array3 Base Address [34:24] Bits P0_SCTL x0000 0000 0283 1411 Pchip0 System Control Register REV[7:0] x11 PID[8] x0 Pchip ID Value RPP[9] x0 ECCEN[10] x1 DMA ECC Enabled SWARB[12:11] x2 Round Robin CRQMAX[19:16] x3 CDQMAX[23:20] x8 PTPMAX[27:24] x2 INUM[28] x0 256K Max Downstream PTP/PIO Writes to bypass PIO Read NEWAMU[29] x0 GPCI Enabled to Perform PTE Fetch Xactions PTPWAR[30] x0 PTP Writes Disabled During Pending Reads P0_GPCTL x0000 0004 C100 00C2 Pchip 0 Gport Control Register FBTB[0] x0 THDIS[1] x1 TLB Anti-Thrash Disabled CHAINDIS[2] x0 TGTLAT[4:3] x0 Target Latency Timer = 128 PCI Clocks Win_HOLE[5] x0 MnStr_WIN_Enable[6] x1 Monster Window Enabled ARBENA[7] x1 Pchip 0 Internal Arbiter Enabled PRIGRP[15:8] x0 No req_l[6:0] High Priority PPRI[16] x0 PCISPD66[17] x0 GPCI Frequency = 33 MHz CNGSTLT[21:18] x0 All DMA Reads Retry w/delayed Completion PTPDESTEN[29:22] x4 Writes to Pchip0, APCI Enabled
DPCEN[30] APCEN[31] Enabled DCR_Timer[33:32] EN_Stepping[34] P0_APCTL FBTB[0] THDIS[1] CHAINDIS[2] TGLAT[4:3] HOLE_Enable[5] MWIN_Enable[6] ARBENA[7] PRIGRP[15:8] PCISPD66[17] CNGSTLT[21:18] Completion PTPDESTEN[29:22] DPCEN[30] APCEN[31] DCR_Timer[33:32] EN_Stepping[34] Enabled AGP_Rate[53:52] AGP_SBA_Enabled[54] AGP_Enabled[55] AGP_Present[57] AGP_HP_RD[60:58] AGP_LP_RD[63:61] P1_SCTL REV[7:0] PID[8] RPP[9] ECCEN[10] SWARB[12:11] CRQMAX[19:16] CDQMAX[23:20] PTPMAX[27:24] INUM[28] Enabled to bypass PIO Read NEWAMU[29] Fetch Xactions PTPWAR[30] Pending Reads P1_GPCTL FBTB[0] Disabled THDIS[1] CHAINDIS[2] Disabled TGLAT[4:3] Clocks WIN_Hole[5] Disabled Mnstr_Win_Enable[6] ARBENA[7] PRIGRP[15:8] Enabled PCISPD66[17] CNGSTLT[21:18] Completion Enabled
x1 x1 x0 x1 x0000 0004 C002 00C2 x0 x1 x0 x0 x0 x1 x1 x0 x1 x0 x0 x1 x1 x0 x1 x0 x0 x0 x0 x0 x0 x0000 0000 0000 0000 x0 x0 x0 x0 x0 x0 x0 x0 x0 x0 x0 x0000 0000 0000 0000 x0 x0 x0 x0 x0 x0 x0 x0 x0 x0
Data Parity Checking Enabled Address Parity Checking DCR Timer Count = 2^15 Address Stepping Enabled Pchip0 Aport Control Register TLB Anti-Thrashing Disabled TGLAT = 128 PCI Clocks Monster Window Enabled Internal Arbiter Enabled APCI = 66MHz All DMA Reads Retry w/delayed No Legal PTPs Enabled DCRT Count = 2^15 PCI Config Address Stepping AGP Rate = 1X PCI Bus Enabled 0 Cchip HP Outstanding Reads 0 Cchip LP Outstanding Reads Pchip1 System Control Register Pchip PID = 0 Pchip1 ECC Disabled GPCI > APCI > AGPX
256K MAX PTP/PIO Writes GPCI Enabled to Perform PTE PTP Writes Disabled During Pchip1 Gport Control Register PCI Fast Back-To_Back Xactions TLB Anti-Thrashing Disabled GPCI PIO Write Chaining Target RetryTimer = 128 PCI 512Kb - 1Mb Window Hole Monster Window Disabled Internal Arbiter Disabled No Arbitor Priority Groups GPCI Frequency = 33 Mhz Every DMA Read Retry w/Delayed
PTPDESTEN[29:22] Destinations Disabled DPCEN[30] Disabled APCEN[31] Disabled DCRTV[33:32] EN_Stepping[34] P1_APCTL FBTB[0] Disabled THDIS[1] CHAINDIS[2] Enabled TGLAT[4:3] Clocks Win_Hole[5] Mnstr_Win_Enable[6] ARBENA[7] PRIGRP[15:8] Enabled PPRI[16] PCISPD66[17] CNGSLT[21:18] Completion Enabled PTPDESTEN[29:22] Destinations Disabled DPCEN[30] Disabled APCEN[31] Detection Disabled DCRTV[33:32] EN_Stepping[34] AGP_Rate[53:52] AGP_SBA_EN[54] AGP_EN[55] AGP_Present[57] AGP_HP_RD[60:58] AGP_LP_RD[63:61]
All GPCI Legal PTP Data Parity Error Detection Address Parity Error Detection DCR Timer = 2^15 Counts Address Stepping Disabled Pchip1 Aport Control Register PCI Fast Back-To-Back Xactions TLB Anti-Thrashing Enabled APCI PIO Write Chaining Target Latency Timer = 128 PCI 512Kb Monster Arbiter No High 1Mb Hole Disabled Window Disabled Disabled Priority Groups
APCI Frequency = 33 Mhz Every DMA Read Retry w/Delayed All APCI legal PTP Data Parity Error Detection Address Command Parity Error DCR Timer = 2^15 Counts Address Stepping Disabled AGP Rate = 1X SideBand Addressing Disabled AGP Xactions Disabled agp_present = 0 No Cchip Pending HP Reads No Cchip Pending LP Reads
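The memory array configuration registers (AAR_0 through AAR_3) in the subpacket above can be decoded by hand with shifts and masks. The following Python sketch is illustrative only: the function names decode_aar and base_address are hypothetical, the bit positions are taken from the field labels in the log (Sa[8], Asiz[15:12], Addr[34:24]), and treating Addr as bits 34:24 of the array base address is an assumption based on the label "Array Base Address [34:24] Bits".

```python
def decode_aar(value: int) -> dict:
    """Split an AAR register value into the fields labeled in the error log."""
    return {
        "sa": (value >> 8) & 0x1,       # Sa[8]: split-array bit
        "asiz": (value >> 12) & 0xF,    # Asiz[15:12]: array size code
        "addr": (value >> 24) & 0x7FF,  # Addr[34:24]: base address bits
    }


def base_address(value: int) -> int:
    """Assumption: Addr[34:24] holds bits 34:24 of the array base address."""
    return decode_aar(value)["addr"] << 24


aar_2 = 0x0000_0000_4000_7009  # AAR_2 value from the example log
print(decode_aar(aar_2), hex(base_address(aar_2)))
```

For AAR_2 this gives Asiz = 7 and Addr = x40, which agrees with the decoded values printed next to the register in the log.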
5.2
Table 5-1 provides a summary of the fault detection and correction components of DS15 systems. Generally, PALcode handles exceptions/interrupts as follows:
1. The PALcode determines the cause of the exception/interrupt.
2. If possible, it corrects the problem and passes control to the operating system for error notification, reporting, and logging before returning the system to normal operation.
3. If PALcode is unable to correct the problem, it:
   Logs double error halt error frames into the flash ROM.
   Logs uncorrectable error logout frames to the DPR.
4. For single error halts, logs the uncorrectable logout frame into the DPR.
If error/event logging is required, control is passed through the OS Privileged Architecture Library (PAL) handler. The operating system error handler logs the error condition into the binary error log. System Event Analyzer should then diagnose the error to the defective FRU.
5.3
Machine Checks/Interrupts
The exceptions that result from hardware system errors are called machine checks/interrupts. They occur when a system error is detected during the processing of a data request. During the error-handling process, errors are first handled by the appropriate PALcode error routine and then by the associated operating system error handler. PALcode transfers control to the operating system through the PAL handler. Table 5-2 lists the machine checks/interrupts that are related to error events. The designations 630, 670, 620, 660, and 680 indicate a system control block (SCB) offset to the fatal system error handler for Tru64 UNIX and OpenVMS.
CPU Uncorrectable Error (670) Fatal microprocessor machine check errors that result in a system crash.
System Environmental Error (680) System-detected machine check caused by an overtemperature condition, fan failure, or power supply failure.
NOTE: The override for fan and overtemperature shutdown is set in the RMC. If override is set, the system continues operating until a more severe error occurs.
5.3.1
The operating system error handlers generate several entry types that vary in length based on the number of registers within the entry. Each entry consists of an operating system header, several device frames, and an end frame. Most entries have a PAL-generated logout frame, and may contain frames for CPU, memory, and I/O. Table 5-3 shows an event structure map for a system uncorrectable PCI target abort error. An AlphaServer DS15 has only PCHIP 0. PCHIP 1 information in error registers is always 0.
ech0000 ech+nnnn lfh0000 lfh+nnnn lfEV680000 lfEV68+nnnn lfctt_A0[u] lfctt_A8[u] lfctt_B0[u] lfctt_B8[u] lfctt_C0[u] lfett_C8[u] lfett_138[u] eelcb_140 eelcb_190 eelcb_1E0 eelcb_230 2D8 SESF<63:32> = Reserved(MBZ)
NEW COMMON OS HEADER STANDARD LOGOUT FRAME HEADER COMMON PAL EV68 SECTION (first 8 QWs Zeroed) <39:32>= (MBZ) SESF<31:16> = Reserved(MBZ) SESF<15:0>= 0001(hex)
Cchip CPUx Device Interrupt Request Register (DIRx<62> = 1) Cchip Miscellaneous Register (MISC) Pchip0 Error Register (P0_PERROR<51>=0;<47:18>=PCI Addr;<17:16>=PCI Opn; <6>=1) Pchip1 Error Register (P1_PERROR<63:0> = 0) Pchip0 Extended Titan System Packet Pchip 0 PCI Slot 1 Single Device Bus Snapshot Packet Pchip 0 PCI Slot 2 Single Device Bus Snapshot Packet Pchip 0 PCI Slot 3 Single Device Bus Snapshot Packet Pchip 0 PCI Slot 4 Single Device Bus Snapshot Packet Termination or End Packet
This chapter describes how to configure and set up an AlphaServer DS15 system. The following topics are covered:
System Consoles
Displaying the Hardware Configuration
Setting Environment Variables
Setting Automatic Booting
Changing the Default Boot Device
Setting SRM Security
Configuring Devices
Booting Linux
6.1
System Consoles
The SRM console is located in a flash ROM on the system motherboard. From the console interface, you can set up and boot the operating system, display the system configuration, and run diagnostics. For complete information, see the AlphaServer DS15 and AlphaStation DS15 Owner's Guide.
SRM Console
Systems running the Tru64 UNIX or OpenVMS operating systems are configured from the SRM console, a command-line interface (CLI). From the CLI you can enter commands to configure the system, view the system configuration, boot the system, and run ROM-based diagnostics.
NOTE: The operating systems use different algorithms for system time. If you switch between operating systems (for example, between Tru64 UNIX and OpenVMS), be sure to reset the time at the operating system level.
Linux
The procedure for installing Linux on an Alpha system is described in the Alpha Linux installation document for your Linux distribution. The installation document can be downloaded from the following Web site: http://www.compaq.com/alphaserver/linux
RMC CLI
The remote management console (RMC) provides a command-line interface (CLI) for controlling the system. You can use the CLI either locally or remotely (modem connection) to power the system on and off, halt or reset the system, and monitor the system environment. You can also use the dump, env, and status commands to help diagnose errors. See Chapter 7 for details.
6.1.1
The SRM console environment variable determines to which display device (VT-type terminal or VGA monitor) the console display is sent. The console terminal that displays the SRM user interface can be either a serial terminal (VT320 or higher, or equivalent) or a VGA monitor. The SRM console environment variable determines the display device. If console is set to serial, and a VT-type device is connected, the SRM console powers on in serial mode and sends power-up information to the VT device. If console is set to graphics, the SRM console expects to find a VGA card and, if so, displays power-up information on the VGA monitor after VGA initialization has been completed.
You can verify the display device with the SRM show console command and change the display device with the SRM set console command. If you change the display device setting, you must reset the system (with either the halt/reset button, if configured, the RMC reset command, or the SRM init command) to put the new setting into effect. In the following example, the user displays the current console device (a graphics device) and then resets it to a serial device. After the system initializes, output will be displayed on the serial terminal.
>>> show console
console                 graphics
>>> set console serial
>>> init
.
.
.
6.2
View the system hardware configuration by entering commands from the SRM console. Viewing the hardware configuration lets you confirm that the system recognizes all devices, the memory configuration, and network connections. Use the following SRM console commands to view the system configuration. See the Owner's Guide for details.
show boot*     Displays the boot environment variables.
show config    Displays the logical configuration of interconnects and buses on the system and the devices found on them.
show device    Displays the bootable devices and controllers in the system.
show fru       Displays the physical configuration of FRUs (field-replaceable units).
show memory    Displays configuration of main memory.
6.3
Environment variables pass configuration information between the console and the operating system. Their settings determine how the system powers up, boots the operating system, and operates. To check the setting for a specific environment variable, enter the show envar command, where the name of the environment variable is substituted for envar. To reset an environment variable, use the set envar command, where the name of the environment variable is substituted for envar.
set envar
The set command sets or modifies the value of an environment variable. It can also be used to create a new environment variable if the name used is unique. The syntax is:
set envar value
envar    The name of the environment variable to be modified.
value    The new value of the environment variable.
New values for the following environment variables take effect only after you reset the system with either the halt/reset button, if configured, the RMC reset command, or the SRM init command.
auto_action
console
os_type
pk*0_fast
pk*0_host_id
pk*0_soft_term
show envar
The show envar command displays the current value (or setting) of an environment variable. The syntax is:
show envar
envar    The name of the environment variable to be displayed. The wildcard * displays all environment variables.
Table 6-1 summarizes the SRM environment variables used most often on the DS15 system.
Attributes
NV,W1
Description
Action the console should take following an error halt or power failure. Defined values are:
boot - Attempt bootstrap.
halt - Halt, enter console I/O mode.
restart - Attempt restart. If restart fails, try boot.
Device or device list from which booting is to be attempted when no path is specified. Set at factory to disk with factory-installed software; otherwise NULL.
Default file name used for the primary bootstrap when no file name is specified by the boot command. The default value is NULL.
Default parameters to be passed to system software during booting if none are specified by the boot command.
OpenVMS: Additional parameters are the root_number and boot flags. The default value is NULL.
root_number: Directory number of the system disk on which OpenVMS files are located.
0 (default) - [SYS0.SYSEXE]
1 - [SYS1.SYSEXE]
2 - [SYS2.SYSEXE]
3 - [SYS3.SYSEXE]
bootdef_dev
NV,W
boot_file
NV,W
boot_osflags
NV,W
NV - Nonvolatile. The last value saved by system software or set by console commands is preserved across cold bootstraps (when the system goes through a full initialization), and long power outages.
W - Warm nonvolatile. The last value set by system software is preserved across warm bootstraps (Tru64 UNIX shutdown -r command, OpenVMS REBOOT command, or a crash and reboot; not all of the SRM initialization is run) and restarts.
Attributes
NV,W
Description
boot_flags: The hexadecimal value of the bit number or numbers to set. To specify multiple boot flags, add the flag values (logical OR).
1 - Bootstrap conversationally (enables you to modify SYSGEN parameters in SYSBOOT).
2 - Map XDELTA to running system.
4 - Stop at initial system breakpoint.
8 - Perform a diagnostic bootstrap.
10 - Stop at the bootstrap breakpoints.
20 - Omit header from secondary bootstrap file.
80 - Prompt for the name of the secondary bootstrap file.
100 - Halt before secondary bootstrap.
10000 - Display debug messages during booting.
20000 - Display user messages during booting.
Tru64 UNIX: The following parameters are used with this operating system:
a - Autoboot. Boots /vmunix from bootdef_dev, goes to multi-user mode. Use this for a system that should come up automatically after a power failure.
s - Stop in single-user mode. Boots /vmunix to single-user mode and stops at the # (root) prompt.
i - Interactive boot. Requests the name of the image to boot from the specified boot device. Other flags, such as -kdebug (to enable the kernel debugger), may be entered using this option.
boot_osflags (continued)
D - Full dump; implies s as well. By default, if Tru64 UNIX crashes, it completes a partial memory dump. Specifying D forces a full dump at system crash.
Common settings are a (autoboot) and Da (autoboot and create full dumps if the system crashes).
Attributes
NV,W
Description
Sets the baud rate of the COM1 (MMJ) port. The default baud rate is 9600. Baud rate values are 1800, 2000, 2400, 3600, 4800, 7200, 9600, 19200, 38400, 57600.
com2_baud
NV,W
Sets the baud rate of the COM2 port. The default baud rate is 9600. Baud rate values are 1800, 2000, 2400, 3600, 4800, 7200, 9600, 19200, 38400, 57600.
com1_flow com2_flow
NV,W
The com1_flow and com2_flow environment variables indicate the flow control on the serial ports. Defined values are:
none - No flow control is applied to data in or out of the serial ports. Use this setting for devices that do not recognize XON/XOFF or that would be confused by these signals.
software - Use XON/XOFF (default). This is the setting for a standard serial terminal.
hardware - Use modem signals CTS/RTS. Use this setting if you are connecting a modem to a serial port.
com1_mode com1_modem
NV NV,W
Specifies the COM1 data flow paths so that data either flows through the RMC or bypasses it.
Used to tell the operating system whether a modem is present on the COM1 port.
on - Modem is present.
off - Modem is not present (default value).
console
NV
Sets the device on which power-up output is displayed.
graphics - Sets the power-up output to be displayed at a VGA monitor or device connected to the VGA module.
serial - Sets the power-up output to be displayed on the device that is connected to the COM1 port.
Attributes
NV
Description
Determines whether the interface's internal Internet database is initialized from nvram or from a network server (via the bootp protocol).
Sets the Ethernet controller to the default Ethernet device type.
aui - Sets the default device to AUI.
bnc - Sets the default device to ThinWire.
fast - Sets the default device to fast 100BaseT.
fastfd - Sets the default device to fast full duplex 100BaseT.
full - Sets the default device to full duplex twisted pair.
twisted-pair - Sets the default device to 10BaseT (twisted-pair).
autonegotiate - Automatically negotiates highest common performance with other network controller(s) supporting IEEE 802.3u auto-negotiation. If no Ethernet cable is connected in this mode, the SRM reports a failure: Error (eib0.0.10.0), No link, auto negotiation did not complete. This applies to ei*, ew*, and eg* devices in autonegotiate mode.
Determines which network protocols are enabled for booting and other functions.
mop - Sets the network protocol to MOP for systems using the OpenVMS operating system.
bootp - Sets the network protocol to bootp for systems using the Tru64 UNIX operating system.
bootp,mop - When the settings are used in a list, the mop protocol is attempted first, followed by bootp.
NV
NV
Attributes
NV
Description
Increases the amount of memory available for the SRM console's heap. Valid selections are: NONE (default), 64KB, 128KB, 256KB, 512KB, 1MB, 2MB, 3MB, 4MB.
Sets the keyboard hardware type as either PCXAL or LK411 and enables the system to interpret the terminal keyboard layout correctly.
Specifies the default value for the KZPSA host SCSI bus node ID.
Specifies the console keyboard layout. The default is English (American).
Specifies the extent to which memory is tested prior to a boot on Tru64 UNIX. The options are:
full - Full memory test will be run. Required for OpenVMS.
partial - First 256 MB of memory will be tested.
none - Only first 32 MB will be tested.
NV
W NV NV
Attributes
NV NV NV
Description
Sets the default operating system.
vms or unix - Sets system to boot the SRM firmware.
Sets a console password. Required for placing the SRM into secure mode.
Disable or enable parity checking on the PCI bus.
on - PCI parity enabled (default value)
off - PCI parity disabled
Some PCI devices do not implement PCI parity checking, and some have a parity-generating scheme in which the parity is sometimes incorrect or is not fully compliant with the PCI specification. In such cases, the device functions properly so long as parity is not checked.
Enables fast SCSI devices on a SCSI controller to perform in standard or fast mode.
0 - Sets the default speed for devices on the controller to standard SCSI. If a controller is set to standard SCSI mode, both standard and fast SCSI devices will perform in standard mode.
1 - Sets the default speed for devices on the controller to fast SCSI mode. Devices on a controller that connects to both standard and fast SCSI devices will automatically perform at the appropriate rate for the device, either fast or standard mode.
pk*0_fast
NV
Attribute
NV
Description
Sets the controller host bus node ID to a value between 0 and 7.
0 to 7 - Assigns bus node ID for specified host adapter.
Enables or disables SCSI terminators for optional SCSI controllers. This environment variable applies to systems using the Qlogic SCSI controller, though it does not affect the onboard controller. The Qlogic SCSI controller implements the 16-bit wide SCSI bus. The Qlogic module has two terminators, one for the 8 low bits and one for the high 8 bits. There are five possible values:
off - Turns off both low 8 bits and high 8 bits.
low - Turns on low 8 bits and turns off high 8 bits.
high - Turns on high 8 bits and turns off low 8 bits.
on - Turns on both low 8 bits and high 8 bits.
pk*0_soft_term
NV
sys_serial_num
NV
Sets the system serial number, which is then propagated to all FRUs that have EEPROMs. The serial number can be read by the operating system.
Enables or disables login to the SRM console firmware on alternative console ports.
0 - Disables login on alternative console ports.
1 - Enables login on alternative console ports (default setting).
If the console output device is set to serial, set tt_allow_login 1 allows you to log in on the primary COM1 port, or alternate COM2 port, or the VGA monitor. If the console output device is set to graphics, set tt_allow_login 1 allows you to log in through either the COM1 or COM2 console port.
tt_allow_login
NV
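The boot_flags entry in Table 6-1 says that multiple OpenVMS boot flags are combined by adding the hexadecimal flag values (a logical OR). The Python sketch below works through that arithmetic. The dictionary keys and the function name combine_boot_flags are hypothetical labels chosen for illustration; only the hexadecimal values come from the table.

```python
# OpenVMS boot_flags values from Table 6-1 (hexadecimal).
# The key names are illustrative labels, not console keywords.
BOOT_FLAGS = {
    "conversational": 0x1,
    "map_xdelta": 0x2,
    "initial_breakpoint": 0x4,
    "diagnostic_bootstrap": 0x8,
    "bootstrap_breakpoints": 0x10,
    "omit_header": 0x20,
    "prompt_secondary_name": 0x80,
    "halt_before_secondary": 0x100,
    "debug_messages": 0x10000,
    "user_messages": 0x20000,
}


def combine_boot_flags(*names: str) -> str:
    """OR the selected flag values together and return them as hex digits."""
    value = 0
    for name in names:
        value |= BOOT_FLAGS[name]
    return format(value, "X")


# Conversational bootstrap plus user messages during booting.
print(combine_boot_flags("conversational", "user_messages"))  # prints 20001
```

Because each flag occupies its own bit, OR and addition give the same result as long as no flag is listed twice.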
6.4
Tru64 UNIX and OpenVMS systems are factory set to halt in the SRM console. You can change these defaults, if desired. Systems can boot automatically (if set to autoboot) from the default boot device under the following conditions:
When you first turn on system power
When you power cycle or reset the system
When system power comes on after a power failure
After a bugcheck (OpenVMS) or panic (Linux or Tru64 UNIX)
6.4.1
The SRM auto_action environment variable determines the default action the system takes when the system is power cycled, reset, or experiences a failure. The factory setting for auto_action is halt. The halt setting causes the system to stop in the SRM console. You must then boot the operating system manually. For maximum system availability, auto_action can be set to boot or restart. With the boot setting, the operating system boots automatically after the SRM init command is issued or the Reset button is pressed. With the restart setting, the operating system boots automatically after the SRM init command is issued or the Reset button is pressed, and it also reboots after an operating system crash.
To set the default action to boot, enter the following SRM commands:
>>> set auto_action boot
>>> init
See the AlphaServer DS15/AlphaStation DS15 Owner's Guide for more information.
6.5
You can designate a default boot device with the set bootdef_dev SRM console command. For example, to set the boot device to the IDE CD-ROM, enter commands similar to the following:
>>> show bootdef_dev
bootdef_dev             dka400.4.0.1.1
>>> set bootdef_dev dqa500.5.0.1.1
>>> show bootdef_dev
bootdef_dev             dqa500.5.0.1.1
See the DS15 AlphaServer and DS15 AlphaStation Owner's Guide for more information.
6.6
The set password and set secure commands set SRM security. The login command turns off security for the current session. The clear password command returns the system to user mode. The SRM console has two modes, user mode and secure mode. User mode allows you to use all SRM console commands. User mode is the default mode. Secure mode allows you to use only the boot and continue commands. The boot command cannot take command-line parameters when the console is in secure mode. The console boots the operating system using the environment variables stored in NVRAM (boot_file, bootdef_dev, boot_osflags).
Setting a password. If a password has not been set and the set password command is issued, the console prompts for a password and verification. The password and verification are not echoed.
Changing a password. If a password has been set and the set password command is issued, the console prompts for the new password and verification, then prompts for the old password. The password is not changed if the validation password entered does not match the existing password stored in NVRAM.
The password length must be between 15 and 30 alphanumeric characters. Any characters entered after the 30th character are not stored.
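The password length rule can be expressed as a short check, for example when preparing a password before entering it at the console. This Python sketch is not console firmware behavior: the function name stored_password is hypothetical, and it assumes that truncation to 30 characters happens before the length and character checks and that "alphanumeric" means ASCII letters and digits.

```python
import string

# Assumption: "alphanumeric" means ASCII letters and digits only.
ALLOWED = set(string.ascii_letters + string.digits)


def stored_password(entered: str) -> str:
    """Return the portion of the entry that would be kept, or raise ValueError."""
    kept = entered[:30]  # characters after the 30th are not stored
    if len(kept) < 15:
        raise ValueError("password must be at least 15 characters long")
    if not set(kept) <= ALLOWED:
        raise ValueError("password must contain only letters and digits")
    return kept


print(len(stored_password("A1" * 20)))  # a 40-character entry keeps 30 characters
```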
>>> b dkb0
Console is secure - parameters are not allowed.
>>> login
Please enter the password:
>>> b dkb0
(boot dkb0.0.0.3.1)
.
.
.
The set secure command puts the console into secure mode. The operator attempts to boot the operating system with command-line parameters. A message is displayed indicating that boot parameters are not allowed when the system is in secure mode. Entering the login command turns off security features for the current console session. After successfully logging in, the operator enters a boot command with command-line parameters.
The set secure command enables secure mode. If no password has been set, you are prompted to set the password. Once you set a password and enter the set secure command, secure mode is in effect immediately and only the continue, boot (using the stored parameters), and login commands can be performed. The syntax is: set secure
The wrong password is entered. The system remains in secure mode. The password is successfully cleared.
The clear password command is used to exit secure mode and return to user mode. To use clear password, you must know the current password. Once you clear the password, the console is no longer secure. To clear the password without knowing the current password, you must use the login command in conjunction with the RMC halt in/out commands.
6.7
Configuring Devices
Become familiar with the configuration requirements for CPUs and memory before removing or replacing those components. See Chapter 8 for removal and replacement procedures.
WARNING: To prevent injury, access is limited to persons who have appropriate technical training and experience. Such persons are expected to understand the hazards of working within this equipment and take measures to minimize danger to themselves or others. These measures include:
Remove any jewelry that may conduct electricity.
If accessing the system card cage, power down the system and wait 2 minutes to allow components to cool.
Wear an anti-static wrist strap when handling internal components.
6.7.1
CPU Location
The CPU
6.7.2
Memory Configuration
Become familiar with the rules for memory configuration before adding DIMMs to the system. Refer to Figure 6-3 and observe the following rules for installing DIMMs.
You can install up to 4 DIMMs.
There are two memory arrays (0 and 2).
An array consists of 2 DIMMs, which must be the same capacity and type.
A maximum of 4 GB of memory is supported.
A memory array must be populated with two DIMMs of the same size and speed. (See the following table for supported sizes and capacity.)
Populate memory arrays in numerical order, starting with array 0.
CAUTION: Using different DIMMs in an array may result in loss of data.
512MB 1024MB 1024MB 1536MB 1536MB 2048MB 2048MB 2560MB 2560MB 3072MB 3072MB 4096MB
256MB 256MB 512MB 256MB 512MB 512MB 1024MB 1024MB 256MB 1024MB 512MB 1024MB 256MB
256MB 256MB 512MB 256MB 512MB 512MB 1024MB 1024MB 256MB 1024MB 512MB 1024MB 256MB
DIMM Information for Two System Types You can mix stacked and unstacked DIMMs within the system, but not within an array. The DIMMs within an array must be of the same capacity and type (stacked or unstacked) because of different memory addressing.
Unstacked DIMMS
Stacked DIMMS
Only the following DIMMs and DIMM options can be used in the DS15 system.
Density    DIMM           DIMM Option (2 DIMMs per)
512 MB     20-01EBA-09    3X-MS315-EA
1 GB       20-00FBA-09    3X-MS315-FA
2 GB       20-00GBA-09    3X-MS315-GA
CAUTION:
Memory Performance Considerations Interleaved operations reduce the average latency and increase the memory throughput over non-interleaved operations. With one memory option (2 DIMMs) installed, memory interleaving will not occur. For 2-way interleaving, array 0 and 2 must have the same size memory. The output of the show memory command provides the memory interleaving status of the system.
Memory
Array       Size         Base Address        Intlv Mode
---------   ----------   ----------------    ----------
0           1024Mb       0000000000000000    2-Way
2           1024Mb       0000000040000000    2-Way
Populate both arrays with the same size memory. See Figure 6-3 for array locations.
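The interleaving rule described under Memory Performance Considerations can be sketched as a small check. The function name is_two_way_interleaved is hypothetical, sizes are given in MB per array, and the sketch models only the rule stated in this section (array 0 and array 2 both populated with the same amount of memory); it does not query the system.

```python
def is_two_way_interleaved(array0_mb: int, array2_mb: int) -> bool:
    """True when arrays 0 and 2 are both populated with the same size memory."""
    # A size of 0 represents an unpopulated array (one memory option installed).
    return array0_mb > 0 and array0_mb == array2_mb


# The configuration shown in the show memory example: 1024 MB in each array.
print(is_two_way_interleaved(1024, 1024))  # True
# One memory option only (array 2 empty): no interleaving.
print(is_two_way_interleaved(1024, 0))     # False
```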
Memory DIMM slot - array 0, DIMM 1 Memory DIMM slot - array 2, DIMM 3 Memory DIMM slot - array 0, DIMM 0 Memory DIMM slot - array 2, DIMM 2
6.7.3
The DS15 PCI slots are all 3.3 volts, and are normally automatically configured when you boot the system after installing the option. When installing PCI option modules, you do not normally need to perform any configuration procedures; the system configures PCI modules automatically. But because some PCI option modules require and provide their own configuration utility CDs, refer to the option documentation. PCI Slot Information PCI slot 1 is the bottom slot on a desktop or rackmounted system or the right-hand slot as viewed from the back of a pedestal system. PCI option modules are either designed for 5.0 volts or 3.3 volts, or are universal in design and can plug into either 3.3 or 5.0 volt slots. The DS15 system provides only 3.3 volt slots. Some PCI options require drivers to be installed and configured. These options come with a CD-ROM. Refer to the installation document that came with the option and follow the manufacturer's instructions. There is no direct correspondence between the physical numbers of the slots on the PCI riser and the logical slot identification reported with the SRM console show config command. Table 6-2 maps the physical slot numbers to the SRM logical ID numbers for the PCI slots.
PCI Configuration Rules
To run at 66 MHz, the following conditions must be met:
Slots 3 and 4 must both be empty.
A 33 MHz module must not be installed in either slot 1 or 2.
A 66 MHz module must be installed in slot 1, slot 2, or both; otherwise the bus will run at 33 MHz.
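The 66 MHz conditions above can be checked mechanically when planning option placement. The following Python sketch is a planning aid under stated assumptions, not a description of how the hardware arbitrates bus speed: the function name pci_bus_mhz and the input format (a mapping from slot number to the module's rated speed in MHz, or None for an empty slot) are hypothetical.

```python
from typing import Dict, Optional


def pci_bus_mhz(slots: Dict[int, Optional[int]]) -> int:
    """Return the expected PCI bus speed (66 or 33) for a slot population.

    slots maps slot number (1-4) to the installed module's rated speed in MHz,
    or None if the slot is empty. Missing keys are treated as empty slots.
    """
    # Condition 1: slots 3 and 4 must both be empty.
    if slots.get(3) is not None or slots.get(4) is not None:
        return 33
    upper = [slots.get(1), slots.get(2)]
    # Condition 2: no 33 MHz module in slot 1 or 2.
    if any(speed == 33 for speed in upper):
        return 33
    # Condition 3: at least one 66 MHz module in slot 1 or 2.
    if any(speed == 66 for speed in upper):
        return 66
    return 33


print(pci_bus_mhz({1: 66, 2: None, 3: None, 4: None}))  # 66
print(pci_bus_mhz({1: 66, 2: 66, 3: 33, 4: None}))      # 33
```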
CAUTION: Check the keying before you install the PCI module and do not force it in. Plugging a module into a wrong slot can damage it.
Slot 1 66/33MHz, 3.3v Slot 2 66/33MHz, 3.3v Slot 3 33MHz, 3.3v Slot 4 33MHz, 3.3v LED connected to +5 VAUX For more information, see http://h18002.www1.hp.com/alphaserver/ .
6.8
Booting Linux
Obtain the Linux installation document and install Linux on the system. Then verify the firmware version, boot device, and boot parameters, and issue the boot command.
The procedure for installing Linux on an Alpha system is described in the Alpha Linux installation document for your Linux distribution. The installation document can be downloaded from the following Web site: http://www.compaq.com/alphaserver/linux
You need V6.6-24 or higher of the SRM console to install Linux. If you have a lower version of the firmware, you will need to upgrade. For instructions and the latest firmware images, see the following URL: http://ftp.digital.com/pub/DEC/Alpha/firmware/
Linux Boot Procedure
1. Power up the system to the SRM console and enter the show version command to verify the firmware version.
>>> show version
version                 V6.6-24 Sept 5 2003 08:36:11
>>>
2. Enter the show device command to determine the unit number of the drive for your boot device, in this case dka0.0.0.8.0.
DKA0 DKA100 DQA0 DVA0* EIA0 EIB0 PKA0 PKB0 COMPAQ BF03665A32 COMPAQ BF03665A32 DW-224E 00-02-A5-20-C0-39 00-02-A5-20-C0-3A SCSI Bus ID 7 SCSI Bus ID 7 3B01 3B01 A.1J
>>> show device dka0.0.0.8.0 dka100.1.0.8.0 dqa0.0.0.13.0 dva0.0.0.1000.0 eia0.0.0.9.0 eib0.0.0.10.0 pka0.7.0.8.0 pkb0.7.0.108.0 >>> * DS15 systems have no floppy drives.
3.
From SRM enter the boot command. The following example shows boot output.
0-9
aboot> l
iso: Max size:329552 Log zone size:2048
iso: First datazone:28 Root inode number 57344
#
# Red Hat Linux/Alpha aboot configuration options:
#
# 0 - Boot the Red Hat Linux installer
# 1 - Boot the Red Hat Linux installer with serial console (ttyS0)
# 2 - Boot the Red Hat Linux installer with callback console (srm)
#     (required for "serial" console on AlphaServers ES47, ES80, GS1280)
# 3 - Boot the Red Hat Linux installer in text only mode
# 4 - Boot the Red Hat Linux installer in text only rescue mode
# 5 - Boot the Red Hat Linux installer but allow manual selection of drivers
# 6 - Boot the Red Hat Linux installer and allow for other than just
#     a CD install (offers http, nfs, ftp, and local disk install methods)
#
# Additional arguments can be provided at the aboot> prompt. For example,
# '6 console=ttyS0' will boot an 'expert' install using a serial console.
#
0:/kernels/vmlinux.gz initrd=/images/cdrom.img
1:/kernels/vmlinux.gz initrd=/images/cdrom.img console=ttyS0
2:/kernels/vmlinux.gz initrd=/images/cdrom.img console=srm
aboot>
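The firmware check in step 1 of the procedure amounts to comparing the reported SRM version with the V6.6-24 minimum. The Python sketch below shows one way to make that comparison. It assumes that version strings always follow the V<major>.<minor>-<build> pattern seen in the show version example and that the fields compare numerically; the function names are hypothetical.

```python
import re

# Minimum SRM console version stated in this section (V6.6-24).
MIN_SRM = (6, 6, 24)


def parse_srm_version(text):
    """Extract (major, minor, build) from text such as 'V6.6-24 Sept 5 2003'.

    Assumes the V<major>.<minor>-<build> layout shown in the example output.
    """
    match = re.search(r"V(\d+)\.(\d+)-(\d+)", text)
    if match is None:
        raise ValueError("no SRM version string found")
    return tuple(int(group) for group in match.groups())


def srm_supports_linux(show_version_output):
    """Return True if the reported version is V6.6-24 or higher."""
    return parse_srm_version(show_version_output) >= MIN_SRM


print(srm_supports_linux("version V6.6-24 Sept 5 2003 08:36:11"))
```

Comparing tuples of integers avoids the pitfall of comparing version strings as text, where "V6.10-1" would sort before "V6.6-24".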
You can manage the system through the Remote Management Console (RMC). The RMC is implemented through an independent microprocessor that resides on the system board. The RMC also provides configuration and error log functionality. This chapter explains the operation and use of the RMC. Sections are:
RMC overview
Operating modes
Terminal setup
SRM environment variables for COM1
Entering the RMC
Using the command-line interface
Configuring remote dial-in and dial-out alert
RMC firmware update and recovery
Resetting the RMC to factory defaults
RMC command reference
Troubleshooting tips
7.1 RMC Overview
The remote management console provides a mechanism for monitoring the system (voltages, temperature, and fans) and manipulating it on a low level (reset, power on/off, halt). The RMC performs the following monitoring and control functions to ensure the successful operation of the system:
Monitors the thermal sensor on the system motherboard.
Monitors voltages and fans.
Detects alert conditions such as excessive temperature, fan failure, and voltage failure. On detection, pages an operator and sends an interrupt to SRM, which then passes the interrupt to the operating system or an application.
Shuts down the system if any fatal condition exists, for example, if the temperature reaches the failure limit or any system fan fails.
Provides a command-line interface (CLI) for the user to control the system. From the CLI you can power the system on and off, halt or reset the system, and monitor the system environment.
Passes error log information to shared RAM so that this information can be accessed by the system.
The RMC logic is implemented using the QLogic Zircon baseboard management controller. The RMC logic is responsible for monitoring temperature, fan speed, and all voltages. The RMC firmware images (booter and runtime) are stored in flash ROM. If the firmware becomes corrupted or obsolete, you can update it manually using the Loadable Firmware Update Utility. See Chapters 2 and 5 for details. The microprocessor can also communicate with the system power control logic to turn power to the rest of the system on or off.
You can gain access to the RMC as long as AC power is available to the system (through the wall outlet). Thus, if the system fails, you can still access the RMC and gather information about the failure.
Configuration, Error Log, and Asset Information
The RMC provides additional functionality to read and write configuration and error log information to Field Replaceable Unit (FRU) error log devices. These operations are carried out via shared RAM (also called dual-port RAM or DPR).
At power-on, the RMC reads the EEPROMs in the system and dumps the contents into the DPR. These EEPROMs contain configuration information, asset inventory and revision information, and error logs. During power-up the SROM sends status and error information for the CPU to the DPR. The system also writes error log information to the DPR when an error occurs. Service providers can access the contents of the DPR to diagnose system problems.
7.2 Operating Modes
The RMC can be configured to manage different data flow paths defined by the com1_mode environment variable. In Through mode (the default), all data and control signals flow from the system COM1 port through the RMC to the active external port. You can also set bypass modes so that the signals partially or completely bypass the RMC. The com1_mode environment variable can be set from either SRM or the RMC. See Section 7.11.
[Figure 7-1: Data flow in Through mode. Labels: DUART (PC16552D), external COM1 port, UART, Zircon RMC, modem, RMC> prompts.]
Through Mode
Through mode is the default operating mode. The RMC routes every character of data between the internal system COM1 port and the external COM1 port. If a modem is connected, the data goes to the modem. The RMC filters the data for a specific escape sequence. If it detects the escape sequence, it enters the RMC CLI.
Figure 7-1 illustrates the data flow in Through mode. The internal system COM1 port is connected to one port of the DUART chip, and the other port is connected to a 9-pin external COM1 port, providing full modem controls. The DUART is controlled by the RMC microprocessor, which moves characters between the two UART ports. The escape sequence signals the RMC to enter the CLI. Data issued from the CLI is transmitted between the RMC microprocessor and the external port.
In Through mode, the RMC also broadcasts power-up and power-down error messages through the COM1 port. Additional RMC broadcast messages may occur when the RMC CLI is active.
NOTE: The internal system COM1 port should not be confused with the external COM1 serial port on the back of the system.
7.2.1 Bypass Modes
For modem connection, you can set the operating mode so that data and control signals partially or completely bypass the RMC. The bypass modes are Snoop, Soft Bypass, and Firm Bypass.
[Figure 7-2: Data flow in the bypass modes. Labels: DUART (PC16552D), Zircon RMC, DCD and RX lines, modem, RMC> prompts.]
Figure 7-2 shows the data flow in the bypass modes. Note that the internal system COM1 port is connected directly to the external COM1 port.
NOTE: You can connect a serial terminal to the external COM1 port in any of the bypass modes.
Snoop Mode
In Snoop mode data partially bypasses the RMC. The data and control signals are routed directly between the system COM1 port and the external COM1 port, but the RMC taps into the data lines and listens passively for the RMC escape sequence. If it detects the escape sequence, it enters the RMC CLI.
The escape sequence is also passed to the system on the bypassed data lines. If you decide to change the default escape sequence, be sure to choose a unique sequence so that the system software does not interpret characters intended for the RMC and so that you do not inadvertently invoke the RMC CLI.
In Snoop mode the RMC is responsible for configuring the modem for dial-in as well as dial-out alerts and for monitoring the modem connectivity.
Because data passes directly between the system COM1 port and the 9-pin external COM1 port (bypassing the DUART), Snoop mode is useful when you want to monitor the system but also ensure optimum COM1 performance.
In Snoop mode, the RMC also broadcasts power-up and power-down error messages through the COM1 port. Additional RMC broadcast messages may occur when the RMC CLI is active.
Soft Bypass Mode
In Soft Bypass mode all data and control signals are routed directly between the system COM1 port and the external COM1 port, and the RMC does not listen to the traffic on the COM1 data lines. The RMC is responsible for configuring the modem and monitoring the modem connectivity. If the RMC detects loss of carrier or the system loses power, it switches automatically into Snoop mode. If you have set up the dial-out alert feature, the RMC pages the operator if an alert is detected and the modem line is not in use.
Soft Bypass mode is useful if management applications need the COM1 channel to perform a binary download, because it ensures that the RMC does not accidentally interpret some binary data as the escape sequence.
After downloading binary files, you can set the com1_mode environment variable from the SRM console to switch back to Snoop mode or other modes for accessing the RMC. The RMC also switches back to Snoop mode when the system power is off or when no DCD signal is detected on COM1.
Firm Bypass Mode
In Firm Bypass mode all data and control signals are routed directly between the system COM1 port and the external COM1 port. The RMC does not configure or monitor the modem. Firm Bypass mode is useful if you want the system, not the RMC, to fully control the modem and you want to disable RMC remote management features such as remote dial-in and dial-out alert.
You can switch to other modes by resetting the com1_mode environment variable from the SRM console, but you must set up the RMC again from the local terminal.
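The differences among the four operating modes can be summarized as a small lookup table. The Python sketch below encodes only what the mode descriptions above state. The class and field names are made up for this example, and the modem-management entry for Through mode is an assumption, since the text does not spell it out.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Com1Mode:
    data_through_rmc: bool    # characters are relayed by the RMC via the DUART
    listens_for_escape: bool  # RMC watches COM1 data for the escape sequence
    rmc_manages_modem: bool   # RMC configures and monitors the modem


# Keys follow the COM1_MODE values THROUGH, SNOOP, SOFT_BYPASS, FIRM_BYPASS.
MODES = {
    "through": Com1Mode(True, True, True),  # modem entry assumed, not stated
    "snoop": Com1Mode(False, True, True),
    "soft_bypass": Com1Mode(False, False, True),
    "firm_bypass": Com1Mode(False, False, False),
}


def can_enter_rmc_from_com1(mode):
    """The RMC CLI is reachable from COM1 only if the RMC sees the escape sequence."""
    return MODES[mode.lower()].listens_for_escape


def safe_for_binary_download(mode):
    """Binary data cannot be mistaken for the escape sequence if the RMC is not listening."""
    return not MODES[mode.lower()].listens_for_escape


print(can_enter_rmc_from_com1("SNOOP"), safe_for_binary_download("SOFT_BYPASS"))
```

Read this way, Soft Bypass and Firm Bypass differ only in whether the RMC still configures and monitors the modem.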
7.3 Terminal Setup
Figure 7-3 and Figure 7-4 show the connections for a VT terminal and a VGA monitor to the system. To set up the RMC to monitor a system remotely, see Section 7.7 for the procedure.
[Figure 7-3: VT terminal connections (callouts 1 and 2, ENET).]
[Figure 7-4: VGA monitor connections (callouts 1 and 2, ENET).]
7.4 SRM Environment Variables for COM1
Several SRM environment variables allow you to set up the COM1 serial port for use with the RMC. You may need to set the following environment variables from the SRM console, depending on how you decide to set up the RMC.
com1_baud
    Sets the baud rate of the COM1 serial port. The default is 9600. See Table 6-1.
com1_flow
    Specifies the flow control on the serial port. The default is software.
com1_mode
    Specifies the COM1 data flow paths so that data either flows through the RMC or bypasses it. This environment variable can be set from either the SRM or the RMC. The default for com1_mode is through. See Section 7.11.
com1_modem
    Specifies to the operating system whether or not modem controls are to be utilized on COM1. The default for com1_modem is off/disabled.
7.5 Entering the RMC
You type an escape sequence to invoke the RMC. You can enter the RMC from either of the following:
A modem or terminal connected to the 9-pin external COM1 port. The RMC must be in Through mode or Snoop mode. In Snoop mode the escape sequence is also passed to the system and displayed.
The local VGA monitor, through the SRM console. COM1_MODE must be set to THROUGH, the console environment variable must be set to graphics, the 9-pin external COM1 port must be inactive, and the SRM must be loaded.
NOTE: Only one RMC session can be active at a time.
Entering from a Serial Terminal
Invoke the RMC from a serial terminal by typing the following default escape sequence:
^[^[ rmc
This sequence is equivalent to typing Ctrl/left bracket, Ctrl/left bracket, rmc. On some keyboards, the Esc key functions like the Ctrl/left bracket combination.
To exit, enter the quit command. This action returns you to whatever you were doing on COM1 before you invoked the RMC.
RMC> quit
Returning to COM port
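Ctrl/left bracket produces the ASCII ESC character (hexadecimal 1B), so the default escape sequence is two ESC bytes followed by the letters rmc. The Python sketch below shows how a program that watches a serial stream might recognize that sequence even when it arrives split across reads. It illustrates the idea only and is not the RMC firmware's algorithm. The class name is hypothetical, and the exact byte string (lowercase letters, no intervening space) is an assumption.

```python
ESC = b"\x1b"  # Ctrl/left bracket, displayed as ^[
DEFAULT_ESCAPE = ESC + ESC + b"rmc"  # assumed byte form of the default sequence


class EscapeDetector:
    """Report when the escape sequence appears in a stream of byte chunks."""

    def __init__(self, sequence=DEFAULT_ESCAPE):
        self.sequence = sequence
        self.tail = b""  # trailing bytes kept from earlier chunks

    def feed(self, chunk):
        """Return True if the sequence completes within this chunk."""
        data = self.tail + chunk
        if self.sequence in data:
            self.tail = b""
            return True
        # Keep just enough bytes to match a sequence split across chunks.
        keep = len(self.sequence) - 1
        self.tail = data[-keep:] if keep > 0 else b""
        return False


detector = EscapeDetector()
for piece in (b"login: \x1b", b"\x1brm", b"c\r"):
    print(detector.feed(piece))
```

Keeping only the last few bytes between reads bounds the memory used no matter how much ordinary console traffic passes through.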
Entering from the Local VGA Monitor
To enter the RMC from the local VGA monitor, the console environment variable must be set to graphics and COM1_MODE must be set to THROUGH. Invoke the SRM console on the VGA monitor and enter the rmc command.
>>> set Com1_mode through
>>> rmc
You are about to connect to the Remote Management Console.
Use the RMC reset command or press the front panel reset button to disconnect and to reload the SRM console.
Do you really want to continue? [y/(n)] y
Please enter the escape sequence to connect to the Remote Management Console.
After you enter the escape sequence, the system enters the CLI and the RMC> prompt is displayed. When the RMC session is completed, reset the system with the halt/reset button (if configured for reset) on the operator control panel or issue the RMC reset command. (Jumper J22 pins 13-14 must be inserted for the halt/reset button to operate as a reset button.)
RMC> reset
Returning to COM port
7.6 Using the Command-Line Interface
The remote management console supports setup commands and commands for managing the system. For detailed descriptions of the RMC commands, see Section 7.11.
Command Conventions
Observe the following conventions for entering RMC commands:
Enter enough characters to distinguish the command.
NOTE: The reset, quit, and rmcreset commands are exceptions. You must enter the entire string for these commands to work.
For commands consisting of two words, enter the entire first word and enough characters of the second word to distinguish it from others. For example, you can enter disable a for disable alert.
For commands that have parameters, you are prompted for the parameter.
Use the Backspace key to erase input.
If you enter a nonexistent command or a command that does not follow conventions, the following message is displayed:
*** ERROR - unknown command ***
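The abbreviation rules above can be expressed as a small matching routine. The Python sketch below models the conventions as described and is not the RMC's actual parser. The command list is a partial one assembled from commands mentioned in this chapter, and the function name is hypothetical.

```python
# Partial command list taken from this chapter; the real RMC supports more.
COMMANDS = [
    "alert", "clear alert", "clear log", "clear port",
    "disable alert", "disable fan", "disable modemdef", "disable reboot",
    "env", "halt", "halt in", "halt out", "hangup",
    "power on", "power off", "quit", "reset", "rmcreset", "status",
]

# These commands must be typed in full.
EXACT_ONLY = {"reset", "quit", "rmcreset"}


def resolve_command(text):
    """Return the unique command matching the input, or None if unknown or ambiguous."""
    words = text.lower().split()
    matches = []
    for command in COMMANDS:
        parts = command.split()
        if len(parts) != len(words):
            continue
        if len(parts) == 1:
            if command in EXACT_ONLY:
                ok = words[0] == command
            else:
                ok = command.startswith(words[0])
        else:
            # Two-word commands: full first word, unique prefix of the second.
            ok = words[0] == parts[0] and parts[1].startswith(words[1])
        if ok:
            matches.append(command)
    return matches[0] if len(matches) == 1 else None


for entry in ("stat", "disable a", "res", "power o"):
    print(entry, "->", resolve_command(entry) or "*** ERROR - unknown command ***")
```

In this model, power o is rejected because it could mean either power on or power off, while power of resolves to power off.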
7.6.1
The RMC status command displays the system status and the current RMC settings. Table 7-1 explains the status fields. See Section 7.11 for information on the commands used to set the user-defined status fields.
RMC>status
hp AlphaServer DS15 Platform Status
RMC Runtime Firmware Revision: V0.6-5
RMC Booter Firmware Revision: V1.0-0
System Power: ON
System Halt: Deasserted
Escape Sequence: ^[^[RMC
Remote Access: Disabled
Modem RMC Defaults: Disabled Status: Not Initialized
RMC Password: Not Set
Alerts: Disabled
Warning Alerts: Disabled
Alert Pending: NO
Latest Alert: Fan failure
Init String:
Dial String: ATD72125
Alert String: pager #
User String: there is something wrong with my DS15 system
Com1 Baud:9600 Flow:SOFTWARE Mode:THROUGH Modem:DISABLED Rmc:ENABLED
Logout Timer: 10 minutes
Voltage Status: OK
Thermal Status: OK
Thermal Shutdown: Enabled
Warning Threshold: 45.00C
Fatal/Power-Down Threshold: 50.00C
Fan Status: OK
Fan Shutdown: Enabled
PCI Riser: Installed
POST DPR: OK NVRAM: OK GPIOs: OK LM75: OK
RMC>
Field
    Meaning
RMC Runtime Firmware Revision
    RMC runtime firmware revision.
RMC Booter Firmware Revision
    RMC booter firmware revision.
System Power
    State of system power:
    ON = System is on.
    OFF = System is off.
System Halt
    System halt state:
    Asserted = Halt is asserted.
    Deasserted = Halt is not asserted.
Escape Sequence
    Current escape sequence used to access the RMC.
Remote Access
    Remote access state:
    Enabled = System is enabled for remote access via modem.
    Disabled = System is not enabled for remote access via modem.
Modem RMC Defaults
    Older AlphaServer / AlphaStation modem-initialization sequence:
    Enabled = System is configured to append additional fixed commands to the user-supplied modem initialization string.
    Disabled = System will not append additional fixed commands to the user-supplied modem initialization string.
Modem Status
    Message indicating the current COM1 modem status. Messages include Initialized, Not Initialized, Not Present, and various modem initialization error messages.
RMC Password
    Modem access password state:
    Set = Password set for modem access.
    Not set = Password not set for modem access.
Alerts
    Dial-out alert status:
    Enabled = Dial-out for sending alerts is enabled.
    Disabled = Dial-out for sending alerts is disabled.
Warning Alerts
    Warning alert status:
    Enabled = System warnings will generate alerts.
    Disabled = System warnings will not generate alerts.
Alert Pending
    Alert pending status:
    YES = Alert condition is awaiting delivery.
    NO = No alert condition is awaiting delivery.
Latest Alert
    Text string that describes the last alert generated on the system.
Init String
    Initialization string that was set for the modem.
Dial String
    Dial string that is sent to the modem when an alert occurs.
Alert String
    Identification string to be sent to the pager when an alert occurs. Usually set to the phone number of the alerting system.
User String
    System notes supplied by the user.
COM1
    State of the system's COM1 settings:
    COM1_BAUD: 1800, 2000, 2400, 3600, 4800, 7200, 9600, 19200, 38400, 57600
    COM1_FLOW: NONE, SOFTWARE, HARDWARE, BOTH
    COM1_MODE: THROUGH, SNOOP, SOFT_BYPASS, FIRM_BYPASS
    COM1_MODEM: ENABLED, DISABLED
    COM1_RMC: ENABLED, DISABLED
Logout Timer
    The amount of time before the RMC terminates an inactive modem connection (in minutes).
Voltage Status
    Current state of system power:
    OK = All power is good.
    FAIL = One or more of the system voltages has crossed the fatal threshold.
Thermal Status
    System thermal status:
    OK = Thermal status is good.
    WARNING = Thermal warning threshold has been crossed (fatal threshold has not been crossed).
    FAIL = Thermal fatal threshold has been crossed.
Thermal Shutdown
    Thermal failure shutdown status:
    Enabled = System will shut down if the thermal fatal threshold is crossed.
    Disabled = System will not shut down if the thermal fatal threshold is crossed.
Warning Threshold
    The temperature at which a thermal warning is generated.
Fatal/Power-Down Threshold
    The temperature at which a thermal failure is generated.
Fan Status
    Current fan status:
    OK = All fans are good.
    WARNING = One or more of the fans has crossed the warning threshold (none have crossed the fatal threshold).
    FAIL = One or more of the fans has crossed the fatal threshold.
Fan Shutdown
    Fan failure shutdown status:
    Enabled = System will shut down if a fan crosses its fatal threshold.
    Disabled = System will not shut down if a fan crosses its fatal threshold.
PCI Riser
    Indicates whether the PCI riser is installed:
    Installed = PCI riser is installed.
    Not Installed = PCI riser is not installed.
POST
    Status results of various RMC power-on self tests:
    DPR (Dual-Port RAM): OK or FAIL
    NVRAM (RMC non-volatile storage): OK or FAIL
    GPIOs (GPIOs/PCF8574 IO Expander): OK or FAIL
    LM75 (Thermal sensor): OK or FAIL
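The Thermal Status values in the table follow from comparing the measured temperature with the two thresholds. The Python sketch below illustrates that comparison using the 45.00C and 50.00C thresholds from the sample status output. It assumes a threshold counts as crossed when the temperature reaches it, which the table does not state explicitly; the function name is hypothetical.

```python
def thermal_status(temp_c, warning_c=45.0, fatal_c=50.0):
    """Classify a temperature reading as OK, WARNING, or FAIL.

    Assumption: a threshold is treated as crossed when the reading is
    equal to or above it.
    """
    if temp_c >= fatal_c:
        return "FAIL"
    if temp_c >= warning_c:
        return "WARNING"
    return "OK"


for reading in (24.0, 47.5, 50.0):
    print(reading, thermal_status(reading))
```

The fatal check comes first so that a reading above both thresholds is reported as FAIL rather than WARNING.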
7.6.2
System Temperature:
Inlet Air : 24.00C
Warning Threshold: 45.00C
Fatal/Power-Down Threshold: 50.00C
Fan Speeds:
System Fan: 1950RPM
PCI Fan : 1560RPM
System Status Summary:
Voltage: OK (System Power is ON)
Temperature: OK
Fan: OK
RMC>
NOTE: If the system is configured with an internal storage cage, there is no disk fan. In this case the output will not display (Disk Fan: xxxRPM).
System Voltages
System Temperature
Fan Speeds
System Status
    Summary of system power, system temperature, and system fans.
7.6.3
The RMC power {on, off}, halt, and reset commands perform the same functions as the buttons on the operator control panel.
Power On and Power Off
The RMC power on command powers the system on, and the power off command powers the system off. The Power button on the OCP, however, has precedence.
If the system has been powered off with the Power button, the RMC cannot power the system on. If you enter the power on command, the message Power-On Error: Cannot power on system when power button is off is displayed, indicating that the command will have no effect.
If the system has been powered on with the Power button, and the power off command is used to turn the system off, you can toggle the Power button to power the system back on.
When you issue the power on command, the terminal exits RMC and reconnects to the server's COM1 port.
RMC> power on
Returning to COM port
hp AlphaServer DS15 Remote Management Controller - Revision V1.1-0
RMC> power off
RMC>
Halt In and Halt Out
The halt in command halts the system, while the halt out command releases the halt. When you issue the halt in or halt out command, the terminal exits RMC and reconnects to the server's COM1 port. Toggling the Power button on the operator control panel overrides the halt in condition.
hp AlphaServer DS15 Remote Management Controller - Revision V1.1-0
RMC>halt in
Returning to COM port
hp AlphaServer DS15 Remote Management Controller - Revision V1.1-0
RMC>halt out
Returning to COM port
NOTE: The SRM will not boot any images with halt asserted (halt in).
Halt
The halt command halts the system. This is the same as pressing the halt/reset button (when configured for halt, which is the default). Jumper J22 pins 13-14 must not be inserted for the halt/reset button to operate as a halt button.
RMC>halt
Returning to COM port
Reset
The RMC reset command restarts the system. The terminal exits RMC and reconnects to the server's COM1 port.
RMC> reset
Returning to COM port
RMCReset
The rmcreset command resets the RMC controller. It does not reset the system.
7.7 Configuring Remote Dial-In
Before you can dial in through the RMC modem port or enable the system to call out in response to system alerts, you must configure the RMC for remote dial-in. You can use either a VT terminal or a VGA monitor to configure the RMC for remote dial-in:
1. Connect to the RMC using either a VT terminal attached to COM1 or through the VGA monitor. See Figure 7-3 and Figure 7-4.
2. Initialize the remote dial-in configuration as shown in Example 7-1.
3. Complete one of the following:
a. If you use a VT terminal, disconnect the terminal and connect the modem to COM1.
b. If you are using a VGA monitor, connect the modem to COM1.
NOTE: When configuring the system for dial-in access, com1_mode must be set so that you are able to gain access to the RMC via either the VT terminal on COM1 or the VGA monitor.
Latest Alert: AC Loss
Init String: AT&H2E0&C1&D0S0=2
Dial String: ATD915085554444
Alert String: ,,,,,,,,,,5551234
User String:
Com1 Baud:9600 Flow:SOFTWARE Mode:THROUGH Modem:DISABLED Rmc:ENABLED
Logout Timer: 20 minutes
Voltage Status: OK
Thermal Status: OK
Thermal Shutdown: Enabled
Warning Threshold: 45.00C
Fatal/Power-Down Threshold: 50.00C
Fan Status: OK
Fan Shutdown: Enabled
PCI Riser: Installed
POST DPR: OK NVRAM: OK GPIOs: OK LM75: OK
RMC>
Sets the password that is prompted for at the beginning of a modem session. The string cannot exceed 14 characters and is not case sensitive. For security, the password is not echoed on the screen. When prompted for verification, type the password again.
Sets the initialization string. The string is limited to 31 characters and can be modified depending on the type of modem used. Because the modem commands disallow mixed cases, the RMC automatically converts all alphabetic characters entered in the init string to uppercase.
Clears the current alert.
Tells the RMC not to append its own fixed flow-control and carrier-detect commands to the user-supplied modem initialization string. Instead, these will be included as part of the user-supplied initialization string.
Enables remote access to the RMC modem port by configuring the modem with the setting stored in the initialization string once the modem is connected to the system.
Status of the RMC configuration.
NOTE: Once the RMC is configured, disconnect the VT terminal from COM1 (if present) and connect the modem.
Dialing In
This example shows the screen output when a modem connection is established.
ATDT915085553333
CONNECT 9600/ARQ/V34/LAPM
RMC Password: *****
Welcome to RMC V1.1-0
>>>
>>>
hp AlphaServer DS15 Remote Management Controller - Revision V1.1-0
RMC>stat
hp AlphaServer DS15 Platform Status
RMC Runtime Firmware Revision: V1.1-0
RMC Booter Firmware Revision: V1.1-0
System Power: ON
System Halt: Deasserted
Escape Sequence: ^[^[RMC
Remote Access: Enabled
Modem RMC Defaults: Disabled Status: Initialized
RMC Password: Set
Alerts: Disabled
Warning Alerts: Disabled
Alert Pending: NO
Latest Alert: AC Loss
Init String: AT&H2E0&C1&D0S0=2
Dial String: ATD915085554444
Alert String: ,,,,,,,,,,5551234
User String:
Com1 Baud:9600 Flow:SOFTWARE Mode:THROUGH Modem:DISABLED Rmc:ENABLED
Logout Timer: 20 minutes
Voltage Status: OK
Thermal Status: OK
Thermal Shutdown: Enabled
Warning Threshold: 45.00C
Fatal/Power-Down Threshold: 50.00C
Fan Status: OK
Fan Shutdown: Enabled
PCI Riser: Installed
POST DPR: OK NVRAM: OK GPIOs: OK LM75: OK
RMC>hangup
+++
NO CARRIER
At the RMC> prompt, enter commands to monitor and control the remote system. When you have finished a modem session, enter the hangup command to cleanly terminate the session and disconnect from the server.
Unsetting the Password
If the password is forgotten, you can reset it by using the set password command. When prompted for verification, intentionally type an incorrect verification password. The following appears:
*** ERROR Password verification failed (Password is NOT set) ***
NOTE: You also may reset the RMC to use factory defaults. See Section 7.10.
Modem Initialization Commands
The modem initialization commands in the following table do not necessarily apply to all modems because different modems use different command sets. Consult the user's guide for your modem when determining the modem initialization string for your system configuration.
Command    Description
&Hx        Flow control, where x is as follows:
           0: No flow control
           1: Hardware flow control
           2: Software (XON/XOFF) flow control
           3: Both hardware and software flow control
E0         Local echo off
&C1        Normal Carrier Detect (CD) operations
&D0        DTR override
S0=2       Auto answer after 2 rings
7.8 Configuring Dial-Out Alert
When you are not monitoring the system from a modem connection, you can use the RMC dial-out alert feature to remain informed of system status. If dial-out alert is enabled, and the RMC detects alarm conditions within the managed system, it can call a preset pager number. You must configure remote dial-in for the dial-out feature to be enabled. See Section 7.7. To set up the dial-out alert feature, enter the RMC from the COM1 serial terminal or local VGA monitor.
User String:
Com1 Baud:9600 Flow:SOFTWARE Mode:THROUGH Modem:DISABLED Rmc:ENABLED
Logout Timer: 20 minutes
Voltage Status: OK
Thermal Status: OK
Thermal Shutdown: Enabled
Warning Threshold: 45.00C
Fatal/Power-Down Threshold: 50.00C
Fan Status: OK
Fan Shutdown: Enabled
PCI Riser: Installed
POST DPR: OK NVRAM: OK GPIOs: OK LM75: OK
RMC>
A typical alert situation might be as follows:
1. The RMC detects an alarm condition, such as over temperature failure.
2. The RMC dials your pager and sends a message identifying the system.
3. You dial the system from a remote serial terminal.
4. You enter the RMC, check system status with the env command, and, if the situation requires, power down the managed system. (In many cases, a failure may have already powered the system down.)
5. When the problem is resolved, you power up and reboot the system.
The elements of the sample dial string and alert string are shown in Table 7-3. Paging services vary, so you need to become familiar with the options provided by the paging service you will be using. The RMC supports only numeric messages.
Sets the string to be used by the RMC to dial out when an alert condition occurs. The dial string must include the appropriate modem commands to dial the number.
Sets the alert string, typically the phone number of the modem connected to the remote system. The alert string is appended after the dial string, and the combined string is sent to the modem when an alert condition is detected.
Enables remote access to the RMC's modem port.
Clears the current alert condition.
Enables the RMC to page a remote system operator.
Forces an alert condition. This command is used to test the setup of the dial-out alert function. It should be issued from the local serial terminal or local VGA monitor. As long as no one connects to the modem and there is no other alert pending, this alert will be sent to the pager as soon as the modem is connected to the system. If the pager does not receive the alert, re-check your setup.
Status of the RMC configuration.
NOTE: If you do not want dial-out paging enabled at this time, enter the disable alert command after you have tested the dial-out alert function. Alerts continue to be logged, but no paging occurs.
ATXDT
9,
Alert String
,,,,,,
    Each comma (,) provides a 2-second delay. In this example, a delay of 12 seconds is set to allow the paging service to answer.
5085553332#
    A call-back number for the paging service. The alert string must be terminated by the pound (#) character.
;
    A semicolon (;) must be used to terminate the entire string.
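The delay arithmetic and the way the two strings are combined can be checked with a few lines of Python. This is a sketch for planning a configuration, not an RMC tool. The function names are hypothetical, and the dial string used here is the one from the sample status output earlier in this chapter rather than the Table 7-3 sample.

```python
PAUSE_SECONDS_PER_COMMA = 2  # each comma provides a 2-second delay


def pause_seconds(modem_string):
    """Total delay contributed by the commas in a dial or alert string."""
    return modem_string.count(",") * PAUSE_SECONDS_PER_COMMA


def combined_alert_command(dial_string, alert_string):
    """The alert string is appended after the dial string before it is sent to the modem."""
    return dial_string + alert_string


dial = "ATD915085554444"             # dial string from the sample status output
alert = "," * 6 + "5085553332#;"     # six commas, call-back number, terminators

print(pause_seconds(alert))
print(combined_alert_command(dial, alert))
```

If the paging service takes longer to answer, add one comma for every additional 2 seconds of delay you need.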
NOTE:
1. The above sample dial string commands are commonly used sequences that don't necessarily apply to all configurations. Because different modems use different command sets, consult the user's guide for your modem when determining the dial string for your system configuration.
2. The above alert string sequence, including the pound and semicolon termination characters, is not necessarily applicable to all configurations. Consult with your paging service to determine the appropriate alert string for your configuration.
7.9 RMC Firmware Update and Recovery
This section contains definitions, explanations, and examples about RMC firmware update and recovery.
Flash Accessibility
Under normal circumstances, the RMC flash part is fully write-enabled, and LFU can update the firmware components contained within it. However, write access to this flash can be completely disabled by installing the DISABLE_FLASH jumper (J21) on pins 1-2. Installing this jumper disconnects the write-enable line from the RMC to the flash part, which prevents LFU (or any other utility) from modifying the contents of the flash part.
RMC Flash Update
The RMC code consists of two images: the booter image and the runtime image. Firmware updates for the RMC are performed using the standard SRM console Loadable Firmware Update (LFU) utility. The runtime image is the firmware image most likely to be updated.
Updating the Booter
It is unlikely that the booter image will ever need to be updated. However, should it become necessary, the booter image will be included in the manual portion of the LFU update utility (see Example 7-4). If a booter image update is available, the revision of the image is displayed instead of No Update Available. To update the booter, the write-enable jumper (BOOTER_ENABLE, J22 pins 7-8) must be installed first. If this jumper is not installed, the booter image update is not allowed.
Emergency Runtime Image Recovery
If the RMC runtime image becomes corrupted or is otherwise unusable, you can use the emergency recovery mechanism built into the booter. To use it, remove power from the system (unplug it) and install the RMC emergency runtime image recovery jumper (J22 pins 11-12); see Figure 7-5. Because this mode requires that the RMC be able to control com1_mode, move jumper J30 to pins 2-3. After you re-apply power (plug in the system), the RMC comes up in emergency update mode, which uses only the booter image. Power the system on using the OCP button (the RMC prompt is not available). Once at the SRM prompt, use the standard LFU mechanisms to update the runtime image. When the update completes, remove power (unplug the system) and then remove the RMC emergency runtime image recovery jumper. If jumper J30 was moved, return it to its initial position.
NOTE:
1. The booter image cannot be updated while in the emergency runtime image recovery mode.
2. The amber LEDs on the OCP sequentially blink when updating the RMC images.
3. When the booter detects that the runtime image is corrupt, the system and fan fault LEDs flash on and off in unison. The user must configure the system for emergency runtime image recovery to correct this problem.
4. When the user configures the system to enter emergency runtime image recovery mode by adding jumper J22 pins 11-12, all three amber lights flash on and off in unison until the firmware update is started.
5. For a complete listing of OCP LED indications, see Section 2.2.2.
WARNING: To prevent injury, access is limited to persons who have appropriate technical training and experience. Such persons are expected to understand the hazards of working within this equipment and take measures to minimize danger to themselves or others.
The following procedure restores the default settings:
1. Shut down the operating system and unplug the power cord from the power supply.
2. Remove the system cover (see Chapter 4) and wait for all the internal LEDs to go out.
3. Insert the FORCE_DEFAULT jumper (J22 pins 9-10) on the main logic board.
4. Re-install the system cover and plug the system in. Note: You do not need to power the system on.
5. When the RMC becomes available on the external COM1 port, the defaults have been reset.
6. Unplug the power cord.
7. Remove the system cover and make sure that none of the internal system LEDs are lit.
8. Remove the FORCE_DEFAULT jumper from the main logic board.
9. Re-install the system cover and plug in the system.
10. Press the power button on the OCP to turn the system on.
NOTE:
Resetting the RMC to the factory settings does not alter the personality of the system set by the RMC set systemtype command (Section 7.11).
To set the RMC-related system jumpers to their default settings, configure as follows (see Figure 7-5 for locations):

Feature_1 Jumper / J22 pins 13-14
  On          OCP halt/reset button performs reset
  Off         OCP halt/reset button performs halt (default)

Feature_2 Jumper / J22 pins 11-12
  On          Forces RMC emergency image recovery mode
  Off         Normal operation (default)

RMC_PASSTHRU Mode Jumper / J30
  No jumper   Always bypass the RMC
  1-2         Always pass through the RMC
  2-3         Normal operation (default)
  Note: The user selects modes through COM1_MODE.

RMC Force_Default Jumper / J22 pins 9-10
  On          Forces RMC environment to default state
  Off         Normal operation (default)

FORCE_DTR Jumper / J28
  On          Forces DTR
  Off         DTR unaffected (default)

Booter_enable Jumper / J22 pins 7-8
  On          Allows RMC booter image updates
  Off         Disables RMC booter image updates (default)
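The default jumper positions listed above can be captured as a small lookup table for use in a service checklist. The following is a minimal sketch only: the dictionary layout, the key strings, and the function name non_default_jumpers are illustrative assumptions, not part of any HP tool. The default values themselves are taken from the list above.

```python
# Default RMC-related jumper settings, transcribed from the list above.
# The data layout and names are illustrative assumptions.
DEFAULT_JUMPERS = {
    "Feature_1 (J22 pins 13-14)": "off",     # halt/reset button performs halt
    "Feature_2 (J22 pins 11-12)": "off",     # normal operation
    "RMC_PASSTHRU (J30)": "2-3",             # normal operation
    "Force_Default (J22 pins 9-10)": "off",  # normal operation
    "FORCE_DTR (J28)": "off",                # DTR unaffected
    "Booter_enable (J22 pins 7-8)": "off",   # booter image updates disabled
}


def non_default_jumpers(observed):
    """Return {jumper: (observed, default)} for each jumper that differs.

    Jumpers missing from `observed` are assumed to be at their default.
    """
    diffs = {}
    for name, default in DEFAULT_JUMPERS.items():
        value = observed.get(name, default)
        if value != default:
            diffs[name] = (value, default)
    return diffs
```

For example, passing a record in which Feature_2 is "on" reports that the emergency image recovery jumper is still installed.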
NOTE:
The CPU, deposit, and dump commands are reserved for service providers.

alert
The alert command displays the latest alert condition along with detailed system status information gathered when the alert was generated.

clear alert
The clear alert command clears the current alert condition and causes the RMC to stop paging the system operator. If the alert is not cleared, the RMC pages the operator every 30 minutes (if the dial-out alert feature is enabled). Once the current alert is cleared, the RMC can capture a new alert. The Alert Pending field of the status command becomes NO after the alert is cleared.
clear log
The clear log command clears all events from the system event log.

clear port
The clear port command clears the UARTs controlled by the RMC in an attempt to clear any stuck conditions that might exist.

disable alert
The disable alert command stops the RMC from paging the system operator when an alert condition is detected. System monitoring continues, and any alert conditions that are detected are still logged.

disable fan
The disable fan command prevents the system from powering off when a fatal fan failure occurs. By default, a fan failure powers off the system after a 3-minute delay.

disable modemdef
This command instructs the RMC to use the user-supplied modem initialization string without the additional commands that were automatically appended to the initialization string on older AlphaServer and AlphaStation models.

disable reboot
The disable reboot command prevents the system from rebooting when the watchdog timer expires. By default, the system does not reboot if the watchdog timer expires.
NOTE: The watchdog timer is not available on DS15 systems.

disable remote
The disable remote command disables remote access to the RMC's modem port and disables automatic dial-out.

disable thermal
The disable thermal command prevents the system from powering off when a thermal failure occurs. By default, a thermal failure powers off the system after a 3-minute delay.
disable warning
When the disable warning command is issued, warning-level events no longer generate system alerts (this is the default state).

disable wdt
The disable wdt command disables the operating system watchdog timer (the default state). This does not prevent the operating system from providing the watchdog clock; it simply prevents the RMC from monitoring it.
NOTE: The watchdog timer is not available on DS15 systems.

enable alert
The enable alert command enables the RMC to page the system operator. Before the enable alert command can be used, the system must be configured for remote dial-in and dial-out. See Sections 7.7 and 7.8.

enable fan
The enable fan command allows the RMC to power off the system in the event of a fatal fan failure condition (the default state) after a 3-minute delay.

enable modemdef
The enable modemdef command instructs the RMC to append additional fixed commands to the user-supplied modem initialization string. These commands were automatically appended to the initialization string on older AlphaServer / AlphaStation models. See Table 7-4, which follows.
Description
Normal Carrier Detect (CD) operations
Select flow control per the current COM1 settings, where x is as follows:
  0: No flow control
  3: Hardware flow control
  4: Software (XON/XOFF) flow control
  6: Both hardware and software flow control
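The relationship between the COM1 flow-control setting and the value of x in the table above can be sketched as a simple lookup. This is an illustration only: the function name flow_control_code is hypothetical, and the setting names (none, software, hardware, both) are the choices listed for the set com1_flow command elsewhere in this chapter.

```python
# Value of x for each COM1 flow-control setting, per the table above.
# The function name and error handling are illustrative assumptions.
FLOW_CONTROL_CODES = {
    "none": 0,       # no flow control
    "hardware": 3,   # hardware flow control
    "software": 4,   # software (XON/XOFF) flow control
    "both": 6,       # hardware and software flow control
}


def flow_control_code(com1_flow):
    """Return x for a com1_flow setting; raise ValueError if it is unknown."""
    key = com1_flow.strip().lower()
    if key not in FLOW_CONTROL_CODES:
        raise ValueError(f"unknown com1_flow setting: {com1_flow!r}")
    return FLOW_CONTROL_CODES[key]
```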
enable reboot
The enable reboot command enables the watchdog timer to reset the system if the timer should expire. By default, the system does not reset if the watchdog timer expires (and the watchdog timer is enabled).
NOTE: The watchdog timer is not available on DS15 systems.

enable remote
The enable remote command enables remote access to the RMC's modem port. It also allows the RMC to automatically dial the pager number set with the set dial command when an alert condition is detected, if alerts are enabled. Before the enable remote command can be used, the system must be configured for remote dial-in. See Section 7.7.

enable thermal
The enable thermal command allows the RMC to power off the system in the event of an over-temperature condition. By default, a thermal failure powers off the system after a 3-minute delay.

enable warning
The enable warning command allows warning-level events to generate system alerts (by default, warnings do not generate alerts). Note that alerts are delivered in the order in which they occur. Therefore, a pending warning-level alert blocks the delivery of a fatal-level alert (although the fatal alerts continue to be logged).
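The ordering behavior described for enable warning can be illustrated with a small model. This sketch is a simplification under stated assumptions, not the RMC firmware logic: it assumes a single pending-alert slot that stays occupied until clear alert is issued, and that events arriving while an alert is pending are logged only. The class and method names are hypothetical.

```python
class AlertModel:
    """Toy model of RMC alert capture (an assumption, not firmware code)."""

    def __init__(self, warnings_enabled=False):
        self.warnings_enabled = warnings_enabled  # state set by enable warning
        self.pending = None                       # the single pending alert
        self.log = []                             # every event is logged

    def event(self, level, text):
        """Log an event; return True if it became the pending alert."""
        self.log.append((level, text))
        alertable = level == "fatal" or (
            level == "warning" and self.warnings_enabled
        )
        if alertable and self.pending is None:
            self.pending = (level, text)
            return True
        return False  # blocked by a pending alert, or not alertable

    def clear_alert(self):
        """Model the clear alert command: free the pending slot."""
        self.pending = None
```

With warnings enabled, a warning that arrives first occupies the pending slot, so a later fatal event is logged but not delivered until the alert is cleared. With warnings disabled (the default), the same fatal event is captured immediately.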
enable wdt
The enable wdt command enables the operating system watchdog timer (disabled by default).
NOTE: The watchdog timer is not available on DS15 systems.

env
The env command provides a current snapshot of the status of the system environment (voltages, temperature, fans). If a sensor has crossed its warning threshold, the reading is displayed in bold; if a sensor has crossed its fatal threshold, the reading is displayed bold and blinking.

fwrev
The fwrev command displays the RMC-accessible firmware revisions. Note that prior to the first successful SRM-console load, the RMC only has access to the RMC Booter image and RMC Runtime image firmware revisions.

halt
The halt command halts the system. This is the same as pressing and releasing the momentary contact halt button on the OCP. (Jumper J22 pins 13-14 must not be installed for the halt/reset button to operate as a halt button.)

halt in
The halt in command asserts halt to the system, halting the platform. To deassert a halt, issue the halt out command.
NOTE: Halt will de-assert if system power is cycled.

halt out
The halt out command releases the system from the halted state.

hangup
The hangup command terminates the current modem session. A modem session automatically terminates after a period of idle time set by the set logout command (default = 20 minutes).
help or ?
The help or ? command displays the RMC command set.

help or ? command-word
Issuing the command help or ? followed by the first word of another command provides additional information on all of the commands that start with the supplied word.

log
The log command prints out a brief summary of the last 10 system events that have been logged.

log number
Issuing the log command followed by a number (0-9) provides detailed information about the selected system event (0 = most recent event).

poe
The poe command displays the latest power-on error (if any).

power off
The power off command performs the same function as releasing the on/off button on the OCP; it turns the system power off.

power on
The power on command performs the same function as pressing the on/off button on the OCP; it turns the system power on. The system cannot be powered on with this command if the OCP power button is in the off position.

quit
The quit command exits the RMC and returns the terminal to external control.

reset
The reset command restarts the system. It performs the same function as pressing the reset button on the OCP. (Jumper J22 pins 13-14 must be inserted for the halt/reset button to operate as a reset button.)
rmcreset
The rmcreset command resets the RMC controller; it does not reset the DS15.

send alert
The send alert command forces an alert condition. It is used primarily to test the set-up of the dial-out alert function.

set alert
The set alert command sets the alert string that is transmitted through the modem when an alert condition is detected. Generally, the alert string is set to the phone number that can be used to dial in to the system that is experiencing the alert condition. The alert string is appended to the dial string and the combination is sent to the modem.

set com1_baud
The set com1_baud command sets the baud rate on the external 9-pin RMC/COM1 port. The available choices are: 1800, 2000, 2400, 3600, 4800, 7200, 9600, 19200, 38400, and 57600. This command changes the setting of the SRM environment variable COM1_BAUD.

set com1_flow
The set com1_flow command sets the flow control to be used on the external 9-pin RMC/COM1 port. The available choices are: none, software, hardware, both. This command changes the setting of the SRM environment variable COM1_FLOW.

set com1_mode
The set com1_mode command specifies the COM1 data flow path so that data either passes through the RMC or bypasses it. The available choices are: through, snoop, soft_bypass, firm_bypass. The set com1_mode command changes the setting of the SRM environment variable COM1_MODE.
Mode          Description
through       All data passes through RMC and is filtered for the escape sequence that is used to enter the RMC CLI.
snoop         Data partially bypasses RMC, but RMC taps into data lines listening for the escape sequence that is used to enter the RMC CLI.
soft_bypass   Data bypasses RMC; however, RMC automatically switches into Snoop Mode if the system is powered off or DCD is not detected.
firm_bypass   Data bypasses the RMC. You cannot gain access to the RMC CLI from this mode.
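The COM1 choices and mode behavior described above can be summarized in a short sketch. It is illustrative only: the function names are hypothetical, and the pairing of each mode name with its description assumes that the descriptions are listed in the same order as the set com1_mode choices (through, snoop, soft_bypass, firm_bypass).

```python
# Valid settings as listed for set com1_baud, set com1_flow, set com1_mode.
COM1_BAUD_CHOICES = (1800, 2000, 2400, 3600, 4800, 7200,
                     9600, 19200, 38400, 57600)
COM1_FLOW_CHOICES = ("none", "software", "hardware", "both")
COM1_MODE_CHOICES = ("through", "snoop", "soft_bypass", "firm_bypass")


def validate_com1_baud(baud):
    """Return baud if it is one of the listed rates, else raise ValueError."""
    if baud not in COM1_BAUD_CHOICES:
        raise ValueError(f"unsupported COM1 baud rate: {baud}")
    return baud


def rmc_cli_reachable(mode, system_powered=True, dcd_detected=True):
    """Return True if the escape sequence can reach the RMC CLI on COM1.

    Assumption: soft_bypass behaves like snoop whenever the system is
    powered off or DCD is not detected, as the description states.
    """
    if mode not in COM1_MODE_CHOICES:
        raise ValueError(f"unknown COM1 mode: {mode!r}")
    if mode in ("through", "snoop"):
        return True
    if mode == "soft_bypass":
        return (not system_powered) or (not dcd_detected)
    return False  # firm_bypass: no access to the RMC CLI
```

This mirrors the first troubleshooting entry in this chapter: if the RMC cannot be entered, checking whether COM1_MODE is set to one of the bypass modes is a reasonable first step.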
set com1_modem
The set com1_modem command indicates whether or not modem control signals are to be used on the external 9-pin RMC/COM1 port. The available choices are: enabled or disabled. This variable is intended for use by the OS; it is not used by the RMC. This command changes the setting of the SRM environment variable COM1_MODEM.

set com1_rmc
The set com1_rmc command enables or disables the ability of the internal COM1 port (Acer) to access the RMC command set. After issuing the command, the user is prompted for the desired setting: enabled or disabled. This command changes the setting of the SRM environment variable COM1_RMC. The setting of COM1_RMC is generally controlled by the SRM console; under normal circumstances, the user should not change the setting of COM1_RMC and will, therefore, not use this command.

set dial
The set dial command sets the string to be used by the RMC to dial out whenever an alert condition occurs. The string must be in the correct format for the attached modem. If a paging service is to be contacted, the string should include the appropriate modem commands to dial the number.
NOTE: All lowercase characters are converted to uppercase.

set escape
The set escape command sets a new escape sequence for invoking the RMC. The escape sequence can be any string, but cannot exceed 14 characters. A typical escape sequence includes two or more control characters.
set init
The set init command sets the modem initialization string. The string is limited to 31 characters and is converted to uppercase.

set logout
The set logout command sets the amount of time before the RMC terminates an inactive modem connection. The default is 20 minutes. The settings are in tens of minutes (0-9). The zero (0) setting disables logout. When logout is disabled, the RMC never disconnects an idle modem session.

set password
The set password command lets you set or change the password at the beginning of a modem session. You must set a password to enable access through a modem. The string cannot exceed 14 characters and is not echoed to the screen.

set systemtype
The set systemtype command is a special hidden command that sets the current system type: AlphaServer DS15, AlphaStation DS15, or AlphaServer TS15. This command cannot be abbreviated; it must be typed in its entirety. When the command is issued, it prompts for a special hard-coded password (password=setsystem15). After correctly typing the password, the user is then prompted to select the system type from a list.
NOTE: This command is only for use by HP personnel. It does not appear in any user documentation and is not listed by the help/? command.

set user
The set user command allows the user to set a user string to be displayed by the status command. This string is typically used to make notes about the current status of the system. The string is limited to 63 characters.

status
The status command displays information about the current status of the system and its RMC settings. (See Section 7.6.1.)
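The string limits and the logout encoding described for the set commands above can be checked before a value is typed at the RMC prompt. The following is a minimal sketch; the function names and the sample string "at&f" used in the usage note are hypothetical, while the limits (14, 14, 31, and 63 characters) and the tens-of-minutes encoding come from the command descriptions.

```python
# Maximum string lengths stated for the RMC set commands.
LIMITS = {"escape": 14, "password": 14, "init": 31, "user": 63}


def check_length(kind, text):
    """Return text unchanged if it fits the limit for kind, else raise."""
    limit = LIMITS[kind]
    if len(text) > limit:
        raise ValueError(
            f"{kind} string is {len(text)} characters; the limit is {limit}"
        )
    return text


def normalize_init(text):
    """Mimic set init: enforce the 31-character limit, then uppercase."""
    return check_length("init", text).upper()


def logout_minutes(setting):
    """Convert a set logout setting (0-9, tens of minutes) to minutes.

    Returns None for 0, which disables automatic logout.
    """
    if setting not in range(10):
        raise ValueError("logout setting must be an integer from 0 to 9")
    return None if setting == 0 else setting * 10
```

For example, normalize_init("at&f") returns the uppercased string, and logout_minutes(2) corresponds to the 20-minute default.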
Possible Cause: The RMC may be in soft bypass or firm bypass mode.
Suggested Solution: Issue the show com1_mode command from SRM and change the setting if necessary.

Possible Cause: System and terminal baud rates do not match.
Suggested Solution: Set the baud rate for the terminal to be the same as for the system. For first-time setup, note that the RMC and system default baud is 9600.

Possible Cause: Modem cables may be incorrectly installed.
Suggested Solution: Check modem phone lines and connections.

Possible Cause: RMC remote access is disabled or the modem was power cycled since last being initialized.
Suggested Solution: From the local serial terminal or VGA monitor, enter the set password and set init commands, and then enter the enable remote command. (See Section 7.7.)

Possible Cause: The modem is not configured correctly.
Suggested Solution: Modify the modem initialization string according to your modem documentation.

Symptom: RMC will not answer when modem is called.
Possible Cause: On AC power-up, RMC defers initializing the modem for 30 seconds to allow the modem to complete its internal diagnostics and initializations.
Suggested Solution: Wait 30 seconds after powering up the system and RMC before attempting to dial in.
Possible Cause
Suggested Solution
RMC console must be reset to factory defaults.
This chapter presents detailed procedures for removing and replacing Field Replaceable Units (FRUs) on AlphaServer DS15 systems. Unless otherwise specified, install an FRU by reversing the steps shown in the removal procedures.
8.1
The procedures are organized by relative difficulty. For example, replacing a PCI fan is easier than replacing a memory DIMM because the DIMMs are underneath the center internal storage bay. Virtually all the procedures are of the remove and replace style.

You can remove and replace the following components directly:
  Top cover
  Side panel
  PCI fan
  CPU fan
  Disk drive in center internal storage bay
  Front access drive (disk or tape)
  Front access storage cage
  Internal storage cage
  PCI option module

You can remove the following components only after removing one or more other components:
  PCI riser card
  Bottom disk drive of front access storage cage
  Bottom disk drive, middle drive, or half-height DVD/CD-RW drive of internal storage cage
  Power supply
  System fan
  Memory DIMM
  Operator control panel (OCP)
  Speaker
  Motherboard
You can also refer to video procedures on the HP Intranet or order the CD. Intranet:
http://mediadocs.mro.cpqcorp.net/video_Presentations/video%20fru/video%20fru.htm
CD: hp AlphaServer DS15 Field Replaceable Unit (FRU) video presentation AG-XXXXX-BE, release August 5, 2003

WARNING: To prevent injury, access is limited to persons who have appropriate technical training and experience. Such persons are expected to understand the hazards of working within this equipment and take measures to minimize danger to themselves or others. These measures include:
1. Remove any jewelry that may conduct electricity.
2. If accessing the system card cage, power down the system and wait 2 minutes to allow components to cool.
3. Wear an anti-static wrist strap when handling components.

WARNING: Before servicing the system, power it down, unplug the power cord from the power supply, and make sure that the OCP power LED on the PCI riser card is not lit. Failure to do this may result in damage to modules such as the system motherboard and dual inline memory modules (memory DIMMs).

IMPORTANT! After replacing FRUs and determining that the system has been restored to its normal operating condition, you must clear the system error information repository (error information logged to the DPR). Use the clear_error all command to clear all errors logged in the FRU EEPROMs and to initialize the central error repository. See Section 4.4 for details on clear_error.
CAUTION: Static electricity can damage integrated circuits. Always use a grounded wrist strap (29-26246) and grounded work surface when working with internal parts of a computer system. Remove jewelry before working on internal parts of the system.
NOTE:
If you are installing or replacing memory DIMMs or PCI modules, become familiar with the location of the module slots and configuration rules. See Chapter 6.
8.2
The operating system must be shut down before you replace any FRUs. After replacing an FRU, you must clear the system error information repository with the SRM clear_error all command.

Tools
You need the following tools to remove or replace FRUs.
  Phillips #1 and #2 screwdrivers (10-inch magnetic tools are recommended)
  Flat blade screwdriver
  Cordless screwdriver
  Allen wrench (3 mm)
  Anti-static wrist strap

Hot-Plug FRUs
There are no hot-plug FRUs on the AlphaServer DS15.

Before Replacing FRUs
Follow the procedure below before replacing any FRUs. For universal disk drives, you must shut down the operating system, but you do not need to turn off system power.
1. Shut down the operating system.
2. Shut down power to external options, where appropriate.
3. Turn off power to the system.
4. Unplug the power cord from the power supply.
8.3
Recommended Spares
Table 8-1 lists the recommended spare parts (or FRUs) by part number and description. Figure 8-1 shows their location.
Universal Disk Drives for Front Access Storage Cage
  3R-A3848-AA   18.2-GB Ultra320 SCSI 15,000 rpm 1-inch Univ. disk drive
  3R-A3838-AA   36.4-GB Ultra320 SCSI 10,000 rpm 1-inch Univ. disk drive
  3R-A3849-AA   36.4-GB Ultra320 SCSI 15,000 rpm 1-inch Univ. disk drive
  3R-A3839-AA   72.8-GB Ultra320 SCSI 10,000 rpm 1-inch Univ. disk drive
  3R-A3851-AA   72.8-GB Ultra320 SCSI 15,000 rpm 1-inch Univ. disk drive
  3R-A3841-AA   146-GB Ultra320 SCSI 10,000 rpm 1-inch Univ. disk drive

Tape Drives for Internal Storage Cage
  3R-A2392-AA   AIT 35/70-GB tape drive (LVD), carbon black
  3R-A3752-AA   DAT 20/40-GB DDS4
  3R-A3753-AA   AIT 50/100-GB
  3R-A3623-AA   AIT 100/200-GB

Universal Tape Drives for Front Access Storage Cage
  3R-A2396-AA   AIT 35/70-GB LVD Univ. tape drive, uses two slots
  3R-A2779-AA   AIT 50/100-GB LVD Univ. tape drive, uses two slots
  3R-A2780-AA   DAT 20/40-GB DDS4 LVD Univ. tape drive, uses two slots
  3R-A3621-AA   AIT 100/200-GB LVD Univ. tape drive, uses two slots
8.3.1
Power Cords
Table 8-3 lists the country-specific power cords for tower and pedestal systems.
8.4
FRU Locations
[Figure 8-1: FRU locations, rear and front views, with callouts 1 through 15, A, and B (MR0546B)]
Key to Figure 8-1
  1   Center internal storage bay and disk drive
  2   PCI fan assembly
  3   PCI riser card
  4   Memory DIMMs
  5   Disk fan (front access storage cage only)
  6   System fan
  7   CPU fan
  8   Power supply
  9   Bottom disk drive
  10  Optional disk or tape drive
  11  DVD/CD-RW drive
  12  Operator control panel
  13  Front accessible disk drive
  14  Optional front accessible disk drive
  15  Motherboard
  A   Internal storage cage
  B   Front access storage cage
8.5
To access internal components, you must first remove the top cover. Refer to the following figure and procedure.
[Figure: removing the top cover (MR0511A)]
Removing the Top Cover
1. Unlock the system if it is locked.
2. Loosen the thumbscrew that secures the cover to the enclosure.
3. Pull the catch lever.
4. Slide the cover.
NOTE:
Notice the quick reference labels on the inside of the top cover. The labels provide detailed information about the system.
8.6
To gain access to components on the PCI side of the enclosure, you must first remove the side panel. Refer to the following figure and procedure.
[Figure: removing the side panel (MR0556B)]
Removing the Side Panel
1. First remove the top cover as explained in Section 8.5.
2. Locate the metal tab on the panel on the PCI side of the system.
3. To allow more room for your hand, you may lift up the PCI fan and set it aside.
4. Press the metal tab, push the panel to the rear to release it, and slide it away from the system.
5. To replace the cover, follow these steps in reverse order.
8.7
The PCI fan provides cooling for the PCI side of the enclosure. Refer to the following figure and procedure when replacing the PCI fan.
[Figure: replacing the PCI fan (MR0606A)]
Replacing the PCI Fan
1. Shut down the operating system.
2. Turn off system power and unplug the power cord from the power supply.
3. Remove the top cover as explained in Section 8.5.
4. Lift the PCI fan and lay it on its side.
5. Unplug the fan connector from the motherboard connector and lift the fan from the enclosure.
6. Plug the new fan's connector into the motherboard connector.
7. Lower the fan into place in the front of the PCI side of the enclosure. Make sure that the fan snaps into place.
8. Replace the top cover as explained in Section 8.5.
8.8
The CPU fan is mounted directly atop the CPU heat sink and provides cooling for the CPU. Refer to the following figure and procedure when replacing the CPU fan.
Replacing the CPU Fan
1. Shut down the operating system.
2. Turn off system power and unplug the power cord from the power supply.
3. Remove the top cover as explained in Section 8.5.
4. Remove the fan connector from the motherboard connector.
5. Starting at a corner of the fan cover away from the center partition, press down on one of the metal retaining clips and pull it slightly away from the heat sink. Repeat this action for the other clip on that same side.
6. Move to the clips next to the center partition, press down and pull each one away from the heat sink. (A flat-bladed screwdriver may be needed for these clips.)
7. Lift the fan and fan cover from the heat sink and separate the two items.
8. Place the fan cover and new fan onto the CPU heat sink so that the fan connector reaches the motherboard connector J32. [...] is down, toward the heat sink.
9. Starting at the side of the fan cover next to the center partition, press down on one of the metal retaining clips and snap the clip into place. Repeat this action for the other clip on that same side.
10. Move to the clips away from the center partition, press down on each one, and snap them into place.
11. Reconnect the fan connector to the motherboard connector. The connector is keyed to install in only one way.
12. Replace the top cover as explained in Section 8.5.
13. Reconnect the power cord, turn on system power, and boot the system.
8.9
The center internal storage bay provides a disk drive as optional storage. Refer to the following figures and procedures when replacing this disk drive.
[Figure: removing the center internal storage bay (MR0509A)]
Removing the Center Internal Storage Bay
1. Shut down the operating system.
2. Turn off system power and unplug the power cord from the power supply.
3. Remove the top cover as explained in Section 8.5.
4. Pull the two spring-loaded pull pins at the rear of the storage bay and slide the unit toward the back of the enclosure.
5. Lift the storage bay from the enclosure and turn it over.
6. Remove the power cable and data cable from the rear of the storage bay and set the storage bay aside.
[Figure: replacing the disk in the center internal storage bay (MR0532)]
Replacing the Disk in the Center Internal Storage Bay
1. Remove the four screws from the bottom of the storage bay and slide the disk drive out.
2. Install the new disk drive and replace the bottom screws.
3. Referring to preceding Figure 8-6, connect the power cable and data cable to the new disk drive.
4. Slide the storage bay forward into the enclosure.
5. Pull the two spring-loaded pull pins and lower the storage bay until the pins snap into place.
6. Replace the top cover as explained in Section 8.5.
7. Reconnect the power cord, turn on system power, and boot the system.
[Figure 8-8 (MR0020A)]
CAUTION: Do not remove a drive that is in operation. Remove a drive only if its activity LED is off.
Replacing a Front Access Disk Drive
1. Verify that the disk drive is not in use (the activity LED is off).
2. To remove the drive, press in the colored rubber button to release the handle.
3. Pull the handle forward to release the SCSI connection and then pull the drive from the cage. If only one disk drive is installed, a filler plate fills the other drive slot.
4. Insert the new disk drive into the cage with the front handle fully opened. With the drive resting on top of the rail guides of the cage, slide the drive in until it stops.
5. Push in the handle to make the backplane connection and to secure the drive into the cage.
Verification You must enter the init command and use the show device command to verify that the system sees the new drive.
[Figure 8-9 (MR0634)]
CAUTION: Do not remove a drive that is in operation. Remove a drive only if its activity LED is off.
Replacing a Front Access Tape Drive
1. Verify that the tape drive is not in use (the activity LED is off).
2. To remove the drive, press the locking tab to the left to disengage the tape drive.
3. Pull on the handle. [...] slides out with the tape drive.
4. Snap the new plastic filler onto the new tape drive.
5. Insert the new tape drive into the cage. With the drive resting on top of the rail guides of the cage, slide the drive in until it stops.
6. Push on the handle to make the backplane connection and verify that the locking tab snaps into place.
Verification You must enter the init command and use the show device command to verify that the system sees the new drive.
[Figure: accessing the front access storage cage (MR0597)]
Accessing the Front Access Storage Cage
1. Shut down the operating system.
2. Turn off system power and unplug the power cord from the power supply.
3. Remove the top cover as explained in Section 8.5.
4. Remove the IDE data cable and power cable from the back of the slimline DVD-CD-RW drive.
5. Pull the storage cage back, pivot the rear end up, and remove it from the enclosure.
6. Turn the cage over to access the remaining cables.
7. Remove the SCSI cable and power cable.
8. Disconnect the fan cable from the motherboard.
9. The cage is now completely disconnected. Reverse the procedure to install the storage cage.

NOTE:
All the cables are routed through the top slot area, except the fan cable, which is routed through the lower slot area.
[Figure: accessing the internal storage cage (MR0525)]
Accessing the Internal Storage Cage
1. Shut down the operating system.
2. Turn off system power and unplug the power cord from the power supply.
3. Remove the top cover as explained in Section 8.5.
4. Pull the two spring loaded insert posts inward so that the posts come out of the receiving holes.
5. Pull the storage cage back, pivot the rear end up, and remove it from the enclosure.
6. Turn the cage over to access the remaining cables.
7. Remove the IDE data cable and power cable from the DVD/CD-RW drive. All the cables are routed through the top slot area.
8. Remove the SCSI cable and power cable.
9. Reverse the procedure to install the storage cage. Since this storage cage does not have a fan, verify that the Feature_4 jumper is installed. See Table A-2 for details.
WARNING: To prevent injury, access is limited to persons who have appropriate technical training and experience. Such persons are expected to understand the hazards of working within this equipment and take measures to minimize danger to themselves or others.
WARNING: To prevent fire, use only modules with current limited outputs. See National Electrical Code NFPA 70 or Safety of Information Technology Equipment, Including Electrical Business Equipment EN 60 950.
WARNING: High current area. Currents exceeding 240 VA can cause burns or eye injury. Avoid contact with parts or remove power prior to access.
WARNING: The I/O area houses parts that operate at high temperatures. Avoid contact with components to prevent a possible burn.
WARNING: To prevent personal injury or damage to any of the system modules, unplug the power cord from the power supply before installing components. Make sure the power LEDs are not lit before removing or replacing modules.

PCI slot 1 is the bottom slot on a desktop or rackmounted system or the right-hand slot as viewed from the back of a pedestal system. The following figure shows the positions of, and other details for, the PCI slots.

[Figure: PCI slot locations (MR0502C)]

  Slot 1   66/33 MHz, 3.3 V
  Slot 2   66/33 MHz, 3.3 V
  Slot 3   33 MHz, 3.3 V
  Slot 4   33 MHz, 3.3 V
  LED connected to +5 VAUX

When installing PCI option modules, you do not normally need to perform any configuration procedures. The system configures PCI modules automatically. However, because some PCI option modules require configuration CDs, refer to the documentation for that PCI option module.
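The slot capabilities listed above can be expressed as a lookup for planning where a module can run at a given bus clock. This is a minimal sketch under stated assumptions: the function name and the "3.3V"/"5V" signaling labels are hypothetical, and the rule that 5 V cards are rejected follows the caution in the PCI module procedure in this chapter.

```python
# Bus clocks (MHz) supported by each PCI slot, per the list above.
# All four slots use 3.3 V signaling.
PCI_SLOT_CLOCKS = {1: (66, 33), 2: (66, 33), 3: (33,), 4: (33,)}


def slots_supporting_clock(clock_mhz, signaling="3.3V"):
    """Return the slot numbers that can run a card at clock_mhz.

    A card that requires 5 V signaling is not allowed in any slot,
    so an empty list is returned for it.
    """
    if signaling != "3.3V":
        return []
    return [
        slot
        for slot, clocks in sorted(PCI_SLOT_CLOCKS.items())
        if clock_mhz in clocks
    ]
```

A card that should run at 66 MHz is therefore limited to slots 1 and 2, while a 33 MHz card can use any of the four slots.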
[Figure: replacing or installing a PCI module (MR0522)]
Replacing or Installing a PCI Module

CAUTION: Check the keying before you install the PCI option module and do not force it into place. Plugging a module into the wrong slot can damage it. 5 V cards are not allowed.

1. Shut down the operating system.
2. Turn off system power and unplug the power cord from the power supply.
3. Remove the side panel as explained in Section 8.6.
4. Remove the slot cover screw, slide out the slot cover, and set it aside.
5. To install a PCI option module, grasp it at the corners and push it into the appropriate unused slot in the PCI riser card.
6. Insert the retaining screw to secure the module.
7. Replace the side panel as explained in Section 8.6.
8. Replace the top cover as explained in Section 8.5.
9. Reconnect the power cord, turn on system power, and boot the system.

Verification
1. Turn on power to the system.
2. At the >>> prompt, enter the SRM show config command. Examine the PCI bus information in the display to make sure that the new option is listed.
3. If you installed a bootable device, enter the SRM show device command to determine the device name.
[Figure: replacing the PCI riser (MR0521)]
Replacing the PCI Riser
1. Shut down the operating system.
2. Turn off system power and unplug the power cord from the power supply.
3. Remove the side panel as explained in Section 8.6.
4. Remove all PCI option modules as explained in Section 8.13.
5. Remove the two screws from the top corners of the PCI riser card.
6. Grasp the card.
7. Grasp the new PCI riser card by its upper corners so that the option slots face the open side of the enclosure. Push it down into its slot on the motherboard.
8. Line up the holes in the riser's upper corners with the holes in the support bracket and insert the two screws.
9. Reinstall all PCI option modules as explained in Section 8.13.
10. Replace the side panel as explained in Section 8.6.
11. Replace the top cover as explained in Section 8.5.
12. Reconnect the power cord, turn on system power, and boot the system.
[Figure: replacing the bottom drive, front access storage cage (MR0593)]
Replacing Bottom Drive Front Access Storage Cage 1. 2. 3. 4. 5. 6. 7. 8. 9. Shut down the operating system. Turn off system power and unplug the power cord from the power supply. Remove the top cover as explained in Section 8.5. Remove the front access storage cage and all cables as explained in Section 8.11. Remove the four screws from the cage. from the sides of the bottom drive and slide the drive
Slide the new drive into the cage and insert the four screws. Reinstall the front access storage cage as explained in Section 8.11. Replace the top cover as explained in Section 8.5. Reconnect the power cord, turn on system power, and boot the system.
Verification Enter the init command and use the show device command to verify that the system has identified the new drive.
Replacing Bottom Drive, Internal Storage Cage
1. Shut down the operating system.
2. Turn off system power and unplug the power cord from the power supply.
3. Remove the top cover as explained in Section 8.5.
4. Remove the internal storage cage and all cables as explained in Section 8.12.
5. Remove the four screws from the sides of the bottom drive and slide the drive from the cage.
6. Slide the new drive into the cage and insert the four screws.
7. Reinstall the storage cage as explained in Section 8.12.
8. Replace the top cover as explained in Section 8.5.
9. Reconnect the power cord, turn on system power, and boot the system.
Verification
Enter the init command and use the show device command to verify that the system has identified the new drive.
Replacing Middle Drive, Internal Storage Cage
1. Shut down the operating system.
2. Turn off system power and unplug the power cord from the power supply.
3. Remove the top cover as explained in Section 8.5.
4. Remove the internal storage cage and all cables as explained in Section 8.12.
5. Remove the four screws from the sides of the storage cage and slide the drive assembly from the storage cage.
6. Remove the four screws from the bottom of the drive assembly and slide the drive out. Set the drive aside.
7. Insert the new drive into the drive assembly and fasten the four bottom screws.
8. Slide the drive assembly into the storage cage and fasten the four side screws.
9. Reinstall the storage cage as explained in Section 8.12.
10. Replace the top cover as explained in Section 8.5.
11. Reconnect the power cord, turn on system power, and boot the system.
Verification
Enter the init command and use the show device command to verify that the system has identified the new drive.
Replacing DVD/CD-RW Drive, Internal Storage Cage
1. Shut down the operating system.
2. Turn off system power and unplug the power cord from the power supply.
3. Remove the top cover as explained in Section 8.5.
4. Remove the internal storage cage and all cables as explained in Section 8.12.
5. Remove the four screws that fasten the drive to the storage cage.
6. Pull the drive forward. Push away the EMC finger stock clips if they hang up on the drive.
7. Before installing the new drive, be sure all eight (8) EMC finger stock clips are in place.
8. Slide the new drive into the storage cage and insert the four screws as shown.
9. Reinstall the storage cage as explained in Section 8.12.
10. Replace the top cover as explained in Section 8.5.
11. Reconnect the power cord, turn on system power, and boot the system.
Removing Connectors from the Power Supply
WARNING: Hazardous voltages are contained within the power supply. Do not attempt to service. Return to factory for service.
1. Shut down the operating system.
2. Turn off system power and unplug the power cord from the power supply.
3. Remove the top cover as explained in Section 8.5.
4. To provide clearance for the power supply, remove the storage cage from that bay.
   - Remove a front access storage cage and all cables as explained in Section 8.11.
   - Remove an internal storage cage and all cables as explained in Section 8.12.
5. Lift the pull pin near the power supply and move that end of the cable channel to the side. The cable channel covers the cables running from the power supply to the motherboard.
6. Remove the three connectors from the motherboard by pressing the locking tab and pulling back on the connector. The connectors are different sizes, making it simple to reconnect.
7. Continue with the next procedure, Replacing the Power Supply.
Replacing the Power Supply
1. Perform the steps as explained in the preceding procedure.
2. At the rear of the enclosure, remove the four screws from the power supply case. Three small screws require use of a screwdriver, while the fourth screw (thumbscrew) varies with the configuration. For rackmounted systems, a bracket is provided as a lock for the power cord.
3. Push the power supply into the enclosure until it stops. Lift up the cable end of the power supply, slide it out at a slight angle, and set it aside.
4. Before installing the new power supply, take note of two clips on its bottom. These clips must slide into matching channels in the enclosure.
5. Holding the power supply at a slight angle, slide it into the enclosure until it lies flat. Push the power supply into its corner until it stops. You should be able to feel the two bottom clips lock into place.
6. Insert the four screws into the power supply case.
7. Referring to Figure 8-19, plug the three connectors (from the power supply) into their sockets on the motherboard. Work from shortest to longest cable. Be sure to gather the cables together as tightly as possible so they fit under the cable channel.
8. Slide the cable channel over the cables, carefully tucking the wires under the channel, and snap the pull pin into place.
9. Reinstall the storage cage into that bay.
   - Install a front access storage cage and all cables as explained in Section 8.11.
   - Install an internal storage cage and all cables as explained in Section 8.12.
10. Replace the top cover as explained in Section 8.5.
11. Reconnect the power cord, turn on system power, and boot the system.
Verification
At the >>> prompt, enter the show power command or the RMC status command to verify system voltages.
WARNING: Contact with moving fan can cause severe injury to fingers. Avoid contact or remove power prior to access.
WARNING: High current area. Currents exceeding 240 VA can cause burns or eye injury. Avoid contact with parts or remove power prior to access.
Replacing the System Fan
1. Shut down the operating system.
2. Turn off system power and unplug the power cord from the power supply.
3. Remove the top cover as explained in Section 8.5.
4. Remove the center internal drive bay as explained in Section 8.9. This provides access to the fan connector.
5. Locate the fan at the front of the enclosure, as shown.
6. Pull back the three clips that secure the fan and slowly work it free.
7. Unplug the fan connector from the motherboard and remove the fan.
8. Install the new fan by pushing it into the three clips until they snap into place.
9. Slide the fan cable under the partition and insert the connector into the motherboard at connector J3 (see inset A).
10. Replace the center internal drive bay as explained in Section 8.9.
11. Replace the top cover as explained in Section 8.5.
12. Reconnect the power cord, turn on system power, and boot the system.
Verification
1. Invoke the remote management console.
2. Enter the env command to verify the fan status.
Memory Configuration Rules
You can install up to four (4) DIMMs on the motherboard.
A maximum of 4 GB of memory is supported.
There are two memory arrays, numbered 0 and 2, with two slots per array.
A memory array must be populated with two DIMMs of the same size and speed. (See the table above for supported sizes and capacity.)
Memory arrays must be populated in numerical order, starting with array 0.
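The memory configuration rules can be expressed as a small checker, which may help when planning an upgrade on paper. This is a minimal sketch, not a vendor tool: the names Dimm, check_memory_config, and speed_mhz are hypothetical, and the list of supported DIMM sizes is not checked because it comes from the referenced table.

```python
from dataclasses import dataclass

MAX_TOTAL_MB = 4 * 1024   # manual: a maximum of 4 GB is supported
ARRAY_ORDER = (0, 2)      # manual: two arrays, numbered 0 and 2
SLOTS_PER_ARRAY = 2       # manual: two slots per array


@dataclass(frozen=True)
class Dimm:
    size_mb: int
    speed_mhz: int  # hypothetical field; the manual only says "same speed"


def check_memory_config(arrays):
    """Return a list of rule violations for a planned configuration.

    `arrays` maps an array number to the list of Dimm objects in that array.
    An empty result means no rule stated in this section is violated.
    """
    problems = []
    for number in arrays:
        if number not in ARRAY_ORDER:
            problems.append(f"array {number} does not exist on this system")

    populated = []
    for number in ARRAY_ORDER:
        dimms = arrays.get(number, [])
        if not dimms:
            continue
        populated.append(number)
        if len(dimms) != SLOTS_PER_ARRAY:
            problems.append(f"array {number} must hold exactly two DIMMs")
        elif dimms[0] != dimms[1]:
            problems.append(f"array {number} DIMMs differ in size or speed")

    # Arrays must be filled in numerical order, starting with array 0.
    if populated != list(ARRAY_ORDER[:len(populated)]):
        problems.append("arrays must be populated in order, starting with array 0")

    total_mb = sum(d.size_mb for dimms in arrays.values() for d in dimms)
    if total_mb > MAX_TOTAL_MB:
        problems.append(f"total memory {total_mb} MB exceeds 4 GB")
    return problems


if __name__ == "__main__":
    pair = [Dimm(512, 133), Dimm(512, 133)]  # example values only
    print(check_memory_config({0: pair}))    # no violations
    print(check_memory_config({2: pair}))    # array 0 was skipped
```

The checker returns messages instead of raising an error so that every violation in a planned configuration is reported at once.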
1 3 0 2
0 2 0 2
The DIMMs in the preceding table are located as shown in the following figure.
WARNING: To prevent injury, access is limited to persons who have appropriate technical training and experience. Such persons are expected to understand the hazards of working within this equipment and take measures to minimize danger to themselves or others.
WARNING: Do not remove memory DIMMs until the green LED on the PCI riser card is off (approximately 20 seconds after a power-down).
WARNING: Modules have parts that operate at high temperatures. Wait 2 minutes after power is removed before touching any module.
WARNING: To prevent personal injury or damage to any of the system modules, unplug the power cord from the power supply before installing components. Make sure the power LEDs are not lit before removing or replacing modules.
Removing a Memory DIMM
1. Shut down the operating system.
2. Turn off system power and unplug the power cord from the power supply.
3. Remove the top cover as explained in Section 8.5.
4. Remove the center internal storage bay as described in Section 8.9.
5. Use Table 8-4 and Figure 8-22 to determine the location of all DIMMs.
6. Release the clips securing the affected DIMM, grasp it by its top corners, and pull upward. Note the capacity and slot location of the DIMM.
7. Continue with the next procedure, Installing a Memory DIMM.
Installing a Memory DIMM
1. Perform the steps in the preceding procedure, Removing a Memory DIMM.
2. Use preceding Table 8-4, Figure 8-22, and the memory configuration rules to determine the proper location in which to install the new memory DIMM.
3. Make sure the clips are pushed down and away from the memory socket.
4. Remove the DIMM from the static-free container and hold it by its left and right edges.
5. Align the notches on the DIMM's gold fingers with the connector keys in the memory slot, and push the DIMM down firmly into the slot until the clips snap into place. Verify that the clips are engaged.
6. Replace the center internal drive bay as explained in Section 8.9.
7. Replace the top cover as explained in Section 8.5.
8. Reconnect the power cord, turn on system power, and boot the system.
Verification
1. At the SRM console prompt, issue the buildfru -dimm command to provide each new DIMM with a unique serial number.
2. Issue the show memory command to display the amount of memory in each array and the total amount of memory in the system.
8-55
3
1 2
MR0512A
8-56
Removing the Front Bezel
CAUTION: Care must be taken when installing a new OCP so that the LEDs line up with the holes in the enclosure. Failure to align the LEDs correctly may result in damage to an LED.
1. Shut down the operating system.
2. Turn off system power and unplug the power cord from the power supply.
3. Remove the top cover as explained in Section 8.5.
4. Remove the side panel as explained in Section 8.6.
5. Remove the center internal storage bay as explained in Section 8.9.
6. If installed, remove the front access storage cage and all cables as explained in Section 8.11. Otherwise, remove the internal storage cage and all cables as explained in Section 8.12.
7. Remove the front bezel. To do this, remove the four side screws and one front screw, then remove the front bezel.
Replacing the Operator Control Panel
1. After removing the front bezel (as explained in the preceding procedure), unplug the OCP connector from the motherboard.
2. Push in the tabs (shown in inset A) that fasten the OCP to the front panel.
3. Pull the OCP away from the front panel and remove the two button caps.
4. Put the button caps on the new OCP, and snap the new OCP back into place inside the front panel. Be sure to align the LEDs with their mounting holes in the enclosure.
5. Plug the OCP connector into connector J7 on the motherboard.
6. Reinstall the front bezel by inserting the four side screws and one front screw.
7. Reinstall the side panel as explained in Section 8.6.
8. Reinstall either the front access storage cage (as explained in Section 8.11) or the internal storage cage (as explained in Section 8.12).
9. Reinstall the center internal storage bay as explained in Section 8.9.
10. Replace the top cover as explained in Section 8.5.
11. Reconnect the power cord, turn on system power, and boot the system.
Replacing the Speaker
1. Shut down the operating system.
2. Turn off system power and unplug the power cord from the power supply.
3. Remove the top cover as explained in Section 8.5.
4. Remove the center internal drive bay as explained in Section 8.9.
5. Remove the speaker connector from the motherboard.
6. Slide the speaker out of its retaining clips and set it aside.
7. Slide the new speaker down and into its retaining clips.
8. Insert the speaker connector into connector J2 on the motherboard. To properly connect the speaker, note that pin 1 is marked on the motherboard with a small black dot and the red speaker wire connects to pin 1.
9. Reinstall the center internal drive bay as explained in Section 8.9.
10. Replace the top cover as explained in Section 8.5.
11. Reconnect the power cord, turn on system power, and boot the system.
WARNING: To prevent personal injury or damage to any of the system modules, unplug the power cord from the power supply before touching components. Make sure the power LEDs are not lit before removing or replacing modules.
WARNING: Modules have parts that operate at high temperatures. Wait 2 minutes after power is removed before touching any module.
CAUTION: When removing the system motherboard, be careful not to flex the board. This can result in damage to the circuitry.
NOTE: Replacing the system motherboard requires the removal of other FRUs. Review the removal procedures for the fans, CPUs, and options before beginning the system motherboard removal procedure. Mark the original locations of all components and cables as you remove them. A cordless screwdriver is highly recommended for these procedures.
Removing Intervening Components
1. Shut down the operating system.
2. Turn off system power and unplug the power cord from the power supply.
3. Remove all external cables from the rear of the enclosure.
4. Remove the four hex screws that fasten the COM ports, the one mouse/keyboard screw, and one screw from the SCSI connector, as shown in the preceding figure.
5. Remove the top cover as explained in Section 8.5.
6. Remove the side panel as explained in Section 8.6.
7. Remove all PCI option modules as explained in Section 8.13.
8. Remove the PCI riser card as explained in Section 8.14.
9. Remove the PCI fan as explained in Section 8.7.
10. Remove the center internal storage bay as explained in Section 8.9.
11. Remove the center support bracket as explained in the following procedure.
Removing the Center Support Bracket
1. Perform the steps in the preceding procedure, Removing Intervening Components.
2. Remove the three retaining screws from the top and rear side of the bracket. Lift the bracket up and set it aside.
Removing the Remaining Components
1. If installed, remove the front access storage cage and all cables as explained in Section 8.11. Otherwise, remove the internal storage cage and all cables as explained in Section 8.12.
2. Remove the system fan as explained in Section 8.20.
3. Remove the connector for the operator control panel as explained in Section 8.22.
4. Remove the power supply cables that are under the channel as explained in Section 8.19.
Removing the Motherboard
1. After removing all intervening components as described in the preceding procedures, release the motherboard by removing all the screws that secure the motherboard to the enclosure.
2. Slide the motherboard slightly away from the rear face of the enclosure.
3. Lift the edge of the motherboard near the front of the enclosure and carefully lift it out. Set it aside.
4. When removing the motherboard, be careful not to lose the small metal shield on the SCSI connector. Remove the shield and set it aside because you will need to install it on the new motherboard.
Installing the New Motherboard
1. Slide the metal shield removed from the old motherboard onto the SCSI connector. Be careful not to drop the shield when installing the motherboard into the enclosure.
2. Hold the motherboard with its COM port end pointing down toward the rear of the enclosure.
3. Slide the motherboard gently toward the rear of the enclosure until the I/O connectors on the motherboard pass through the openings on the rear panel of the enclosure.
4. Insert the four hex screws that fasten the COM ports, the one mouse/keyboard screw, and the one screw for the SCSI connector, as shown in Figure 8-29.
5. Insert all the retaining screws through the motherboard into the bottom of the enclosure.
Note the positions of all jumpers on the old motherboard and be sure to correctly set the jumpers on the new motherboard. See Appendix A for more information.
10. Reinstall the side panel as explained in Section 8.6.
11. Replace the top cover as explained in Section 8.5.
12. Reconnect the power cord, turn on system power, and boot the system.
After Installing a New Motherboard:
1. Power up to the P00>>> prompt.
2. Enter the clear_error all command.
3. Enter the set sys_serial_num command to set the system serial number. (The serial number is on a label on the back of the system.) For example:
>>> set sys_serial_num NI900100022
IMPORTANT: The system serial number must be set correctly. System Event Analyzer will not work with an incorrect serial number.
The serial number propagates to all FRU devices that have EEPROMs.
This appendix describes the configuration of jumpers on the system motherboard. Sections are as follows:
Locations of Jumpers
Function of Jumpers
Setting Jumpers
A.2.1 System Jumpers
System jumpers are used for system-level functions. The default state for each jumper is off, that is, no jumper is installed. Refer to preceding Figure A-1 for the locations of these jumpers.
Jumper   Pins   Function (Off = No Jumper)
J35             (default) Off = 8-bit SCSI on Channel B
                On = Wide 16-bit SCSI on Channel B
J41             (default) Off = Enable SCSI terminator
                On = Disable SCSI terminator and enable shared bus (Tru64 UNIX)
A.2.2 Server Management Jumpers
Server management jumpers control functions related to the Server Management (SM) subsystem. The Feature jumpers provide an extended set of features to the Remote Management Console (RMC). The other jumpers either configure portions of the SM logic or provide information to Zircon. The following table explains the function of these jumpers. The default state for these jumpers is normally off (no jumpers). Refer to preceding Figure A-1 for the locations of these jumpers.
A.2.3 COM1 Pass Through Jumper
Jumper J30 enables or disables the COM1 pass through mode. The settings are shown in the following figure. Unlike other jumper settings, normal mode requires a jumper to be installed.
J30
12 23
None
CAUTION: Static electricity can damage integrated circuits. Always use a grounded wrist strap (29-26246) and grounded work surface when working with internal parts of a computer system. Remove jewelry before working on internal parts of the system.
Setting Jumpers
1. Shut down the operating system.
2. Shut down power on all external options connected to the system.
3. Turn off power to the system.
4. Unplug the power cord from each power supply and wait for all LEDs to turn off.
5. Remove the top cover (as explained in Chapter 8) to gain access to the system motherboard.
6. If you need to remove a Field Replaceable Unit (FRU) to set jumpers, see Chapter 8.
7. Locate the jumper you need to set. Refer to Figure A-1 in this appendix.
8. Set the jumpers as needed.
9. Reinstall any FRUs you removed. Reinstall the enclosure panels.
This appendix explains how to manually isolate a failing DIMM from the failing address and failing data bits. It also covers how to isolate single-bit errors. The following topics are covered:
Information for Isolating Failures
DIMM Isolation Procedure
EV68 Single-Bit Errors
B.1 Information for Isolating Failures
Table B-1 lists the information needed to isolate the failure. The failing address and failing data can come from a variety of different locations such as the SROM serial line, SRM screen displays, the SRM event log, and errors detected by the 21264 (EV68) chip. Convert the address to data bits if the address is not on a 256-bit alignment (address ends in a value less than 20 or address xxxxx20 or address xxxxxnn, where nn is 1 through 1F). For example, using failing address 0x1004 and failing data bit 8 (decimal), first multiply the low-order address offset, 4, by 8 to get 32. Then add 32 to the failing data bit to yield the actual failing data bit, 40. After this conversion, the failing information is failing address 0x1000 and failing data bit 40 (decimal).
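The address conversion can be written out as a short calculation. The sketch below assumes that "256-bit alignment" means a 32-byte boundary, which is consistent with the manual's example; the function name normalize_failure is hypothetical.

```python
ALIGN_BYTES = 32  # 256 bits expressed in bytes


def normalize_failure(address, data_bit):
    """Fold the byte offset within a 256-bit block into the data bit number.

    Returns (aligned_address, adjusted_data_bit).
    """
    offset = address % ALIGN_BYTES          # bytes past the 256-bit boundary
    return address - offset, data_bit + offset * 8


if __name__ == "__main__":
    aligned, bit = normalize_failure(0x1004, 8)
    print(hex(aligned), bit)  # 0x1000 40, matching the manual's example
```

An address that is already aligned passes through unchanged, so the function can be applied to every reported failure before consulting the tables.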
NOTE: Arrays 1 and 3 do not exist on the AlphaServer DS15. Registers for these arrays (AAR1, AAR3, DPR 82, and DPR 86) are always zero.
B.2 DIMM Isolation Procedure
Use the following procedure to isolate a failing DIMM.
1. Find the failing array by using the failing address and the Array Address Registers. Use the AAR base address and size to create an address range for comparing the failing address.
2. Determine if address XORing is enabled. If address XORing is enabled, use Table B-2 to find the real array on which two-way interleaving has failed. If bit 51 of the CSC register is set to 1, XORing is disabled.
Original Array 0    Real Array 0    Real Array 2
Original Array 2    Real Array 2    Real Array 0
3. After finding the real array, determine whether it is the lower array set or the upper array set. Use DPR locations 80 and 84 listed in Table B-1. Table B-3 shows the description of these locations.
84    Array 2 (AAR 2) configuration
4. Now that you have the real array, the failing data/check bits, and the correct set, use Table B-4 to find the failing DIMM or DIMMs.
The table shows data bits 0-127 and check bits 0-15. These data bits indicate a single-bit error. An SROM compare error would yield address and data bits from 0-63. When you convert the address to be in the correct range, the failing data would be somewhere between 0 and 127.
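The first two steps of the isolation procedure can be sketched as follows. This is an illustration only: it assumes the base address and size of each array have already been decoded from the AAR registers (that decoding is not shown here), and the names find_array and xor_enabled, along with the example ranges, are hypothetical.

```python
CSC_XOR_DISABLE_BIT = 51  # manual: bit 51 of CSC set to 1 means XORing is disabled


def xor_enabled(csc_value):
    """Return True when address XORing is enabled according to CSC bit 51."""
    return (csc_value >> CSC_XOR_DISABLE_BIT) & 1 == 0


def find_array(failing_address, arrays):
    """Return the array whose address range contains failing_address.

    `arrays` maps an array number to (base, size) in bytes, as decoded
    from the Array Address Registers. Returns None if no range matches.
    """
    for number, (base, size) in sorted(arrays.items()):
        if size and base <= failing_address < base + size:
            return number
    return None


if __name__ == "__main__":
    # Example ranges only; real values come from AAR0 and AAR2.
    ranges = {0: (0x0000_0000, 0x2000_0000), 2: (0x2000_0000, 0x2000_0000)}
    print(find_array(0x1000, ranges))   # 0
    print(xor_enabled(1 << 51))         # False
```

If xor_enabled returns True, the array found here is only the original array, and the XOR lookup table is still needed to determine the real array.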
14 14 14 14 14 14 14 14 14 14 14 14 14 14 14 14 12 12 12 12 12 12 12 12
12 12 12 12 12 12 12 12 14 14 14 14 14 14 14 14 14 14 14 14 14 14 14 14
12 12 12 12 12 12 12 12 12 12 12 12 12 12 12 12 14 14 14 14 14 14 14 14
14 14 14 14 14 14 14 14 12 12 12 12 12 12 12 12 12 12 12 12 12 12 12 12
14 14 14 14 14 14 14 14 14 14 14 14 14 14 14 14 12 12 12 12 12 12 12 12
12 12 12 12 12 12 12 12
0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
14 14 12 12 14 14 12 12 14 14 12 12 14 14 12 12
15 15 13 13 15 15 13 13 15 15 13 13 15 15 13 13
B.3 EV68 Single-Bit Errors
The procedure for detection down to the set of DIMMs for a single-bit error is very similar to the procedure described in the previous sections. However, you cannot isolate down to a specific data or check bit. The 21264C (EV68) chip detects and reports a C_ADDR<42:6> failing address that is accurate to the cache block (64 bytes). The syndrome registers (shown in Table B-5) detect data syndrome information, providing isolation down to the low or high quadword of the target octaword in which the fault was detected. Each of the syndrome registers is able to report 64 data bits (the quadword) and 8 check bits (memory data bus ECC bits). Table B-5 shows the syndrome hexadecimal to physical data or check bit decoding. For example, if you have an EV68 single-bit C_Syndrome_0 hexadecimal error value equal to 23, the second column indicates the decoded physical data or check bit for this encoding. Use these physical data bits in conjunction with the previously described isolation procedure to isolate the failing DIMMs.
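Because C_ADDR<42:6> carries only address bits 42 through 6, the reported address identifies a 64-byte cache block and not an individual byte. The sketch below shows the byte range such an address covers. It assumes the value is handled as an ordinary byte address whose low six bits are ignored; the function name cache_block_range is hypothetical.

```python
BLOCK_BYTES = 64                  # manual: accurate to the cache block (64 bytes)
ADDRESS_MASK = (1 << 43) - 1      # keep bits 42..0 of the physical address


def cache_block_range(c_addr):
    """Return (first_byte, last_byte) of the cache block containing c_addr."""
    start = c_addr & ADDRESS_MASK & ~(BLOCK_BYTES - 1)  # clear bits 5..0
    return start, start + BLOCK_BYTES - 1


if __name__ == "__main__":
    first, last = cache_block_range(0x1004)
    print(hex(first), hex(last))  # 0x1000 0x103f
```

Any address in the returned range may be the true failing location, so the DIMM lookup should be treated as identifying a set of DIMMs and not a single bit.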
Index
A
AAR memory addresses, B-2 Alert string, 7-28 Alpha System Reference Manual, 4-24 Auto start Tru64 UNIX or OpenVMS, 6-14 auto_action environment variable, 6-7 auto_action environment variable, SRM, 6-6 Automatic booting, 6-14
B
beep codes, 2-4 Beep codes, 3-12 Boot device, changing, 6-15 Boot problems, 2-10 boot_file environment variable, 6-7 boot_osflags environment variable, 6-7 bootdef_dev environment variable, 6-7 Booting Linux, 6-28 buildfru command, 4-4 Bypass modes, 7-6
C
cat el command, 4-8 Checksum error, 3-13 clear password command, 6-18 clear_error command, 4-9, 4-57 Clearing errors, 4-9 COM1 pass through jumper, A-5 com1_modem environment variable, 6-9 com1_baud environment variable, 6-9 com1_flow environment variable, 6-9 com1_mode environment variable, 6-9 com2_baud environment variable, 6-9 com2_flow environment variable, 6-9 com2_modem environment variable, 6-9 Command conventions (RMC), 7-13
Commands RMC, 7-35 Common components, 1-5 Components common, 1-5 Configuration front access storage cage, 1-22 internal storage cage, 1-20 memory, 6-21, 6-24 system, 1-2 Configuring devices, 6-19 console environment variable, 6-9 Console terminal COM port, 1-24 video card, 1-25 Correctable System Event, 5-8, 5-9 CPU location, 6-20 CPU correctable error (630), 5-16 CPU overview, 1-17 CPU uncorrectable error (670), 5-16 crash command, 4-10 Crash dumps, 2-21, 4-10
D
Data structures, info command, 4-24 De-installing Q-Vet, 2-29 deposit command, 4-11 Desktop system front access storage cage, 1-4 internal storage cage, 1-3 Devices, configuring, 6-19 Diagnostic categories, 2-3 Diagnostic commands, 4-2 buildfru, 4-4 cat el, 4-8 clear_error, 4-9 crash, 4-10 deposit, 4-11
examine, 4-11 exer, 4-15 grep, 4-20 hd, 4-22 info, 4-24 kill, 4-39 kill_diags, 4-39 memexer, 4-40 memtest, 4-42 more el, 4-8 net, 4-47 nettest, 4-49 set sys_serial_num, 4-52 show error, 4-53 show fru, 4-55 show_status, 4-58 sys_exer, 4-60 test, 4-62 diagnostic LEDs, 2-5 Diagnostics power-up, 3-1 running in background, 4-1 SRM console, 4-1 Dial string, 7-28 Dial-in configuration, 7-21 Dial-out alert, 7-25 DIMM isolating failures, B-2 isolation procedure, B-3 lookup table, failures, B-5 DIMM slots, 1-16 DIMMs configuration, 6-21 locations, 8-52 stacked and unstacked, 6-22 supported configurations, 6-21 DIMMs overview, 1-17 Display device selecting, 6-3 verifying, 6-3 Displaying FRU configuration, 4-55 DPR, 7-3 DPR locations, B-4 DPR memory addresses, B-2 Dual-port RAM, 7-3 DUART ports, 7-5
E
ECC logic, 5-15 EEPROMs, 7-3 ei*0_inet_init environment variable, 6-10 ei*0_mode environment variable, 6-10 ei*0_protocols environment variable, 6-10 Emergency runtime image recovery, 7-31 env command, 2-15 env command (RMC), 7-18 Environment variables, 6-5, 6-7 set command, 6-6 show command, 6-6 Environment, monitoring, 7-18 Error handling tools, 2-20 Error log event structure map, 5-18 Error log analysis, 5-2 Error log information (RMC), 7-3 Error logs, 5-1 Error messages memory, 3-15 power-up beep codes, 3-12 Error repository, clearing, 8-3, 8-5 Errors logged to FRU EEPROMs, 4-53 Escape sequence (RMC), 7-11 Event structure map error log, 5-18 ew*0_inet_init environment variable, 6-10 ew*0_mode environment variable, 6-10 ew*0_protocols environment variable, 6-10 examine command, 4-11 exer command, 4-15 Exercising devices, 4-15, 4-60
F
Fail-safe booter, 2-31, 3-13, 3-17 automatic start, 3-17 jumpers, 3-18 manual start, 3-17 Fail-safe booter utility, 2-16 Fan replacing, 8-49 Fault detection and reporting, 5-14 Field replaceable units. See FRUs Firm bypass mode, 7-8 Firmware updating RMC, 7-29
Firmware files, 2-17 Firmware updates, 2-30 Firmware, updating, 2-18 Floppy device, 3-20 Front access storage cage desktop system:, 1-4 Front access storage cage, 1-22 Front view, 1-6 FRU assembly hierarchy, 4-5 FRU descriptor, 4-6 FRU list designator SEA, 5-8 FRU procedure accessing front access storage cage, 8-26 accessing internal storage cage, 8-28 after replacing motherboard, 8-71 installing memory DIMM, 8-55 installing motherboard, 8-70 motherboard, prior to removal, 8-63 removing center support bracket, 8-67 removing components above motherboard, 8-64 removing memory DIMM, 8-51, 8-54 removing motherboard, 8-70 removing side panel, 8-12 removing top cover, 8-10 replacing bottom drive, front access cage, 8-36 replacing bottom drive, internal cage, 8-39 replacing CPU fan, 8-16 replacing disk, center storage bay, 8-18 replacing DVD/CD-RW drive, internal cage, 8-43 replacing front access drive, 8-22 replacing middle drive, internal cage, 8-41 replacing OCP, 8-57 replacing PCI fan, 8-14 replacing PCI option module, 8-30 replacing PCI riser card, 8-34 replacing power supply, 8-45 replacing speaker, 8-61 replacing system fan, 8-49 FRUs before replacing, 8-4 locations, 8-8 physical configuration, 4-55 recommended spares, 8-5 replacement, 8-1
G
Graycode test, 4-43, 4-44 grep command, 4-20
H
Halt remote, 7-20 hangup command (RMC), 7-23 Hardware configuration displaying, 6-4 hd command, 4-22 heap_expand environment variable, 6-11 Hex dump, 4-22
I
info 0 example, 4-25 info 1 example, 4-26 info 2 example, 4-27 info 3 example, 4-28 info 4 example, 4-29 info 5 example, 4-31 info 6 example, 4-35 info 7 example, 4-37 info 8 example, 4-38 info command, 4-24 Information resources, 2-30 Installing Q-Vet, 2-24 Internal storage cage desktop system, 1-3 Internal storage cage, 1-20 Interrupts, 5-16
J
Jumpers, 3-18, A-1 COM1 pass through, A-5 default positions, 7-34 location, A-2 resetting to factory defaults, 7-33 server management, A-4 setting, A-6 system, A-3
K
kbd_hardware_type environment variable, 6-11 kill command, 4-39 kill_diags command, 4-39 kzpsa_host_id environment variable, 6-11
L
language environment variable, 6-11 Learning Utility, 2-31 LEDs OCP, 2-5 LFU, 2-18 LFU utility, 3-14, 3-19 Linux booting, 6-28 Loadable Firmware Update utility, 2-18, 3-14 Lock, 1-28 log command, 2-15 login command, 6-17 Loopback connectors, 4-61 Loopback tests, 2-20
M
Machine checks, 5-16 memexer command, 4-40 Memory supported configurations, 6-21 Memory configuration, 6-21 pedestal, 6-24 Memory configuration rules, 8-51 Memory exercisers, 4-40, 4-42 Memory overview, 1-17 Memory problems, 2-12 Memory slots, 1-16 memory_text environment variable, 6-11 memtest command, 4-42 Memtest test 1, 4-44 Modem initialization commands, 7-24 MOP loopback tests, 4-49 more el command, 4-8 Motherboard, 1-16
N
net command, 4-47 nettest command, 4-49 Network connections, 1-12 Network port test, 4-49 No MEM error, 3-15
O
OCP, 1-14 Operating system autostart, 6-14 errors reported by, 2-11 Operator control panel, 1-14 Options, supported, 2-31 os_type environment variable, 6-12 Overtemperature, 2-15
P
Pagers, 7-27 PAL handler, 5-14 PALcode error routines, 5-16 exception/interrupt handling, 5-14 password environment variable, 6-12 Patches, 2-31 PCI configuration rules, 6-26 option modules, 6-25 slot locations, 6-27 slots, 6-25 PCI bus problems, 2-13 PCI overview, 1-17 PCI parity error, 2-13 PCI slots, 1-18, 8-31 pci_parity environment variable, 6-12 Pedestal configuration, 1-2 Physical and logical I/O slots, 6-26 Physical and logical slots, 1-19 pk*0_fast environment variable, 6-12 pk*0_host_id environment variable, 6-13 pk*0_soft_term environment variable, 6-13 Ports and slots, 1-10 Power desktop, 1-26 rackmounted, 1-27 Power cords, 8-7 Power problems, 2-7 Power-up diagnostics, 3-1, 3-2 Power-up display, 3-5 console, 3-8 Power-up error messages, 3-12 Power-up sequence, 3-3, 3-4, 3-6 Problem report SEA, 5-5 Problem report details SEA, 5-6, 5-7
Q
quit command (RMC), 7-11 Q-Vet de-installing, 2-29 installation verification, 2-22 installing, 2-24 reviewing results, 2-28 running, 2-26
R
Rackmount configuration, 1-2 Rear view, 1-10 Recommended spares, 8-5 Registers, info command, 4-24 Remote commands, 1-15 Remote Management Console, 2-21, 6-2 overview, 7-2 Replacing FRUs, 8-1 Reset, from RMC, 7-20 Resetting RMC defaults, 7-32 RMC, 2-21 bypass modes, 7-6 CLI, 7-13 command conventions, 7-13 command reference, 7-35 configuring remote dial-in, 7-21 data flow diagram, 7-4 default jumper positions, 7-34 default jumper settings, 7-33 dial-out alert, 7-25 emergency runtime image recovery, 7-31 entering, 7-11 env command, 7-18 escape sequence, 7-11 exiting, 7-11 exiting from local VGA, 7-12 firm bypass mode, 7-8 hangup command, 7-23 logic, 7-3 operating modes, 7-4 overview, 7-2 power, 7-3 quit command, 7-11 remote power on/off, 7-19 remote reset, 7-20 resetting to factory defaults, 7-32 snoop mode, 7-7 soft bypass mode, 7-7 status command, 7-14 terminal setup, 7-9 Through mode, 7-5 troubleshooting, 7-44 updating firmware, 7-29 RMC commands, 1-15 Running Q-Vet, 2-26
S
SCB offsets, 5-16 SCSI problems, 2-14 scsi_poll, 4-3 scsi_reset, 4-3 SDD errors, 4-54 SEA. See System Event Analyzer Security clear password, 6-18 set password, 6-16 set secure, 6-17 SRM, 6-16 Serial number mismatch, 4-54 Serial terminal, 6-3 Server management jumpers, A-4 Service help file, 2-30 Service tools CD, 2-30 set console command, 6-3 set envar command, 6-6 set password command, 6-16 set secure command, 6-17 set sys_serial_num command, 4-52 Setting jumpers, A-6 Shared RAM, 7-3 show console command, 6-3 show envar command, 6-6 show error command, 4-53 message translation, 4-56 show fru command, 4-55
show power command, 2-15
show_status command, 4-58
Single-bit errors, detecting, B-12
Slots
    DIMM, 1-16
    memory, 1-16
    PCI, 1-18
Snoop mode, 7-7
Soft bypass mode, 7-7
Software patches, 2-31
Speaker, testing, 4-62
SRM COM1 environment variables, 7-10
SRM console, 6-2
    commands, 6-4
    diagnostic commands, 4-2
    diagnostics, 4-1
    environment variable, 6-3
    Fail-safe booter utility, 2-16
    problem accessing, 2-8
    problems reported by, 2-9
SRM Console event log, 3-10
SRM console commands, 2-20
SRM security, 6-16
status command, 2-15
status command (RMC), 7-14
Storage cage
    front access, 1-22
    internal, 1-20
Storage cage options, 1-20
Storage drives
    optional, 8-6
Supported options, 2-31
sys_com1_rmc, 4-3
sys_exer command, 4-60
sys_serial_num environment variable, 6-13
System access lock, 1-28
System configuration, 1-2
System correctable error (620), 5-17
System environmental error (680), 5-17
System Event Analyzer, 2-20
System Event Analyzer WEBES Director, 5-3
System Event Analyzer documentation, 5-3
System Event Analyzer initial screen, 5-4
System Event Analyzer Problem Reports, 5-5
System Event Analyzer Problem Report details, 5-6, 5-7
System jumpers, A-3
System serial number
    setting, 4-52, 8-72
System uncorrectable error (660), 5-17
T
TDD errors, 4-54
Technical information on Internet, 2-31
Terminal setup (RMC), 7-9
Terminating diagnostics, 4-39
test command, 4-62
Test script, 4-63
Testing devices, 4-62
Testing drives, 4-61
Testing floppy and drives, 4-63
Testing VGA console, 4-63
Thermal problems, 2-15
Through mode, RMC, 7-5
Tools and utilities, 2-20
Tools for FRUs, 8-4
Top view, 1-8
troubleshooting with beep codes, 2-4
Troubleshooting
    boot problems, 2-10
    console event log, 3-10
    crash dumps, 2-21
    diagnostic categories, 2-3
    memory problems, 2-12
    operating system errors, 2-11
    PCI bus problems, 2-13
    power problems, 2-7
    power-up beep codes, 3-12
    problem getting to console, 2-8
    problems reported by console, 2-9
    RMC, 7-44
    SCSI problems, 2-14
    strategy, 2-2
    System Event Analyzer, 5-2
    thermal problems, 2-15
    tools and utilities, 2-20
U
Updating firmware, 2-18
Updating firmware with floppy device, 3-20
Updating RMC, 3-19
V
VGA monitor, 6-3
W
WEBES Director, 5-3