Chapter 4. Troubleshooting and Diagnostics

If you are experiencing problems with your Silicon Graphics® Tezro™ visual workstation, review the material in this chapter. If you are unable to resolve the problem, contact your service provider.

This chapter includes the following sections:

  • Troubleshooting

  • Diagnostics

Troubleshooting

This section covers the following topics:

  • Environmental Fault Monitoring

  • Bezel LEDs

Environmental Fault Monitoring

The workstation monitors its environment to ensure proper operation. It will automatically power off if any of the following faults are found:

  • Any fan spins at less than 80% of nominal speed.

  • Any temperature sensor registers 158 °F (70 °C) or above.

  • Any voltage deviates from its nominal value by 20% or more (±20%).

If your workstation is powering off unexpectedly, check for these conditions.
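
The three shutdown conditions above can be expressed as simple threshold checks. The following Python sketch is illustrative only and is not part of the workstation firmware or of any SGI tool. The function name, the input format, and the reading of "±20%" as "20% or more away from nominal in either direction" are all assumptions.

```python
# Thresholds taken from the list above.
FAN_MIN_FRACTION = 0.80       # below 80% of nominal speed is a fault
TEMP_LIMIT_C = 70.0           # 70 °C (158 °F) or above is a fault
VOLTAGE_MAX_DEVIATION = 0.20  # assumed: 20% or more away from nominal


def find_faults(fans, temps_c, voltages):
    """Return a list of human-readable fault descriptions.

    fans:     iterable of (name, measured_rpm, nominal_rpm)
    temps_c:  iterable of (name, degrees_celsius)
    voltages: iterable of (name, measured_volts, nominal_volts)
    """
    faults = []
    for name, rpm, nominal in fans:
        if rpm < FAN_MIN_FRACTION * nominal:
            faults.append(f"fan {name}: {rpm} rpm is below 80% of {nominal} rpm")
    for name, deg_c in temps_c:
        if deg_c >= TEMP_LIMIT_C:
            faults.append(f"sensor {name}: {deg_c} C is at or above {TEMP_LIMIT_C} C")
    for name, volts, nominal in voltages:
        if abs(volts - nominal) >= VOLTAGE_MAX_DEVIATION * abs(nominal):
            faults.append(f"rail {name}: {volts} V is 20% or more from {nominal} V")
    return faults
```

An empty list means none of the three conditions is met; any entry corresponds to a condition that would cause the workstation to power off.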

Bezel LEDs

The LEDs in the workstation bezel can provide important troubleshooting information. Table 4-1 shows a list of LED signals and what they mean.

Table 4-1. Bezel LED Signals

LED Signal        Explanation
----------------  ------------------------------------------------------------
Blinking white    Power button pressed (on or off)
Solid white       Successful PROM boot; OS running
Solid yellow      L1 has detected a problem. Check the L1 display for more
                  information.
Blinking red      General system failure. Check the L1 display for more
                  information.
Solid red         System node board failure (failed to read PROM at power on)
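
For scripted checklists or help-desk tools, Table 4-1 can be transcribed as a lookup. This is a minimal sketch; the dictionary and function names are hypothetical, and the text is copied from the table rather than read from the hardware.

```python
# Table 4-1 transcribed as a lookup; keys are lowercase "<pattern> <color>".
BEZEL_LED_SIGNALS = {
    "blinking white": "Power button pressed (on or off)",
    "solid white": "Successful PROM boot; OS running",
    "solid yellow": "L1 has detected a problem. Check the L1 display.",
    "blinking red": "General system failure. Check the L1 display.",
    "solid red": "System node board failure (failed to read PROM at power on)",
}


def explain_led(signal):
    """Return the Table 4-1 explanation for a signal such as 'Solid red'."""
    key = " ".join(signal.lower().split())  # normalize case and spacing
    return BEZEL_LED_SIGNALS.get(key, "Unknown signal; not listed in Table 4-1")
```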


Diagnostics

The Silicon Graphics Tezro visual workstation is equipped with diagnostics to test the system hardware and diagnose part failures. These diagnostics are grouped into three categories:

  • Power-on diagnostics (POD)
    Power-on diagnostics are PROM-resident tests that run automatically when you power on the system. As the boot process discovers hardware components, it runs power-on diagnostics to verify that each component that is needed to boot the system is working correctly. Refer to “Power-on Diagnostics” for more information about POD.

  • Offline diagnostics
    Offline diagnostics use a standalone diagnostic environment to test the system hardware; the operating system cannot be running while you use offline diagnostics. Refer to “Offline Diagnostics” for more information.

  • Online diagnostics
    Online diagnostics are tests that verify system hardware while the operating system is running. To prevent data loss, you should use the online diagnostics only when the system is idle. Refer to “Online Diagnostics” for more information.

All diagnostics are loaded on your workstation when you receive it. To upgrade to future revisions of the diagnostics, download the appropriate Customer Diagnostics package from Supportfolio (http://support.sgi.com). Contact your service representative for more information.


Note: The diagnostics described in this document run only on Silicon Graphics Tezro visual workstations. They will not work on any other SGI systems.


Power-on Diagnostics

The power-on diagnostics run automatically when you power on or reset the system. As the boot process discovers hardware, it verifies that each component is functional enough to load the operating system.

The power-on diagnostics test the hardware in the following order:

  • CPU

  • Bedrock ASIC

  • PROM

  • Memory DIMMs

  • Secondary cache

  • PIC ASICs

  • PCI slots

  • Serial ports

  • SCSI controller

  • VPro graphics

If the power-on diagnostics complete successfully, the System Maintenance menu appears or the system automatically boots, depending on how the system is configured.

If the power-on diagnostics detect errors, the diagnostics disable the failing hardware and continue testing. When testing completes, the system may or may not be able to boot, depending on the hardware that has been disabled. If the system does not boot, contact your service representative.

Offline Diagnostics

Offline diagnostics run a sequence of tests on the system hardware under a standalone diagnostic environment; the operating system cannot be running while the offline diagnostics test the system.

The offline diagnostics include a “launcher” that automatically runs a sequence of tests. In most cases, you should run the offline diagnostics automatically with the launcher. Use the following procedure to run the launcher:

  1. Power on the system.

  2. Wait until the System Maintenance menu appears.


    Note: If the Autoload PROM variable is set to Yes, you must click on the Stop for Maintenance button to access the System Maintenance menu.


  3. Select the Run Diagnostics option.


    Note: You can also start the launcher by entering the following command at the command monitor (PROM) prompt (>>):
    boot -f dksc(0,1,0)/stand/smdk/smdk --a


The launcher automatically runs the offline diagnostics on system components in the following order:

  • CPU


    Note: The CPU test supports only single-CPU systems; if a system has more than one CPU, the CPU test does not run.


  • Secondary cache

  • Memory DIMMs

  • I/O components: the IO9 card and the audio and I/O daughtercard (including the SCSI controller, serial ports, Ethernet port, mouse port, keyboard port, and RTO/RTI connectors)


    Note: The offline diagnostics test the simpler components first and then proceed to the more complex components.


Table 4-2 shows the approximate time required (in minutes and seconds format) to automatically run the offline diagnostics on various workstation configurations. (Your testing time will vary, depending on your hardware configuration.)

Table 4-2. Time Required to Run Offline Diagnostics

Total elapsed time (minutes:seconds), by workstation configuration:

                                   1-CPU,          2-CPU,         4-CPU,
Testing Progress                   512 MB memory   1 GB memory    1 GB memory
---------------------------------  --------------  -------------  -------------
CPU testing completes              0:26            N/A[a]         N/A[a]
Secondary cache testing completes  1:18            0:25           1:54
Memory DIMM testing completes      4:47            4:32           5:07
I/O testing completes              6:15            5:34           6:09

[a] CPU testing is not performed on systems that have more than one CPU.
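
The times in Table 4-2 are cumulative, so the time spent in each stage is the difference between consecutive entries. The following Python sketch shows that arithmetic for the 1-CPU column. The helper names are hypothetical, and the figures are the approximate values from the table, not measurements.

```python
def to_seconds(stamp):
    """Convert an 'm:ss' string to a number of seconds."""
    minutes, seconds = stamp.split(":")
    return int(minutes) * 60 + int(seconds)


def to_stamp(total_seconds):
    """Convert a number of seconds back to 'm:ss'."""
    return f"{total_seconds // 60}:{total_seconds % 60:02d}"


def stage_durations(cumulative):
    """Turn cumulative completion times into per-stage durations.

    cumulative: list of (stage, 'm:ss' or None) in test order.
    A stage with None (N/A in the table) is skipped.
    """
    durations = []
    previous = 0
    for stage, stamp in cumulative:
        if stamp is None:
            continue
        elapsed = to_seconds(stamp)
        durations.append((stage, to_stamp(elapsed - previous)))
        previous = elapsed
    return durations


# 1-CPU workstation with 512 MB memory, from Table 4-2.
ONE_CPU = [
    ("CPU", "0:26"),
    ("Secondary cache", "1:18"),
    ("Memory DIMM", "4:47"),
    ("I/O", "6:15"),
]
print(stage_durations(ONE_CPU))
```

For the 1-CPU column this gives 0:26 for CPU testing, 0:52 for secondary cache, 3:29 for memory DIMMs, and 1:28 for I/O, which shows that memory testing accounts for most of the run.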

The offline diagnostics display test status information as they run. If the diagnostics complete testing without detecting errors, the output is similar to the following example:

Starting diagnostic program...

                       Press <Esc> to return to the menu.

SMDK SGI Version 6.152 TEST built 08:41:26 AM Mar  6, 2003
smdk loading io discovery code...
smdk loading launcher code...
smdk>
sMDK Diagnostic Launcher: Version 2.0
Built 00:42:56 Mar  6 2003
Setting up diagnostics.....
term none
Starting diagnostics.....
 


Testing  CACHE..........   PASSED
Testing  DIMM.................................................   PASSED
Testing  IO................... PASSED
FINISHED
Reseting...
resetting the system...

If the launcher detects an error, it displays a FAILED status message for the hardware it is testing and stops testing. If any of the components do not pass the offline diagnostics, contact your service representative.
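
If you capture the launcher output (for example, from a serial console log), the status lines can be picked out with a short script. The sketch below is hypothetical and not an SGI tool. It assumes that a FAILED line has the same "Testing NAME.... STATUS" shape as the PASSED lines in the example above, which this chapter does not show.

```python
import re

# Matches lines such as "Testing  CACHE..........   PASSED".
# The FAILED form is assumed to follow the same layout.
STATUS_LINE = re.compile(r"^Testing\s+(\w+)\.*\s*(PASSED|FAILED)\s*$")


def parse_launcher_output(text):
    """Return a list of (component, status) pairs in the order tested."""
    results = []
    for line in text.splitlines():
        match = STATUS_LINE.match(line.strip())
        if match:
            results.append((match.group(1), match.group(2)))
    return results


def first_failure(results):
    """Return the first component reported as FAILED, or None."""
    for component, status in results:
        if status == "FAILED":
            return component
    return None
```

Because the launcher stops at the first error, a FAILED entry, if present, is expected to be the last status line in the log.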

Online Diagnostics


Caution: Run the runalldiags script only while the system is idle. If you run the online diagnostics while the system is in use, data may be lost.

Online diagnostics are tests that verify system hardware while the operating system is running. When you run the online diagnostics from the IRIX operating system prompt, each diagnostic runs a set of tests for a certain number of loops. The online diagnostics test the following areas of the system:

  • CPU

  • Memory

  • I/O

  • Graphics

  • Storage devices

  • Network devices

The online diagnostics also run a system stress test, which tests all areas of the system under heavy load.

The runalldiags script automatically runs a sequence of online diagnostics. It runs in three modes:

  • Basic mode verifies memory and performs 30 minutes of stress testing. (If you want to perform regularly scheduled testing, use basic mode.)

  • Normal mode performs the same tests as basic mode and also performs I/O testing. (The I/O testing may disrupt any serial port and USB devices.)

  • Extensive mode performs more disruptive I/O testing. (Ethernet is unavailable, and USB operations are disrupted.) It also performs more intensive CPU, memory, and stress testing. Use this mode only if you suspect there is a problem with the system.

Follow these steps to run the runalldiags script:


Note: You must have root level access to the system to run online diagnostics.


  1. Enter the following command at the IRIX command prompt to change to the directory that contains the diagnostics:
    #>cd /usr/diags/bin

  2. Enter the following command to start the script:
    #>./runalldiags [options]


    Note: When you run runalldiags in -normal or -extensive mode, you should run it from the console. The Ethernet testing that runalldiags performs in -normal and -extensive modes disrupts any telnet sessions on the system.


Refer to Table 4-3 for descriptions of the command-line options.

Table 4-3. runalldiags Command-line Options

Option           Description
---------------  ------------------------------------------------------------
-h | -help       Displays help information
-basic           Runs the script in basic mode
-normal          Runs the script in normal mode (default)
-extensive       Runs the script in extensive mode
-host <host>     Specifies a system to target for network tests
-d <directory>   Specifies the directory that contains the online diagnostics
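
If you start runalldiags from a wrapper script, the options in Table 4-3 can be assembled into an argument list. The following Python sketch is hypothetical. It uses only the options listed in the table, and it assumes that a mode option can be combined with -host and -d in this order, which the table does not state.

```python
MODES = ("basic", "normal", "extensive")


def build_runalldiags_argv(mode="normal", host=None, diag_dir=None):
    """Build an argument list for ./runalldiags from Table 4-3 options.

    mode:     one of "basic", "normal" (the default), or "extensive"
    host:     optional system to target for network tests (-host)
    diag_dir: optional directory containing the diagnostics (-d)
    """
    if mode not in MODES:
        raise ValueError(f"unknown mode {mode!r}; expected one of {MODES}")
    argv = ["./runalldiags", f"-{mode}"]
    if host is not None:
        argv += ["-host", host]
    if diag_dir is not None:
        argv += ["-d", diag_dir]
    return argv
```

The resulting list could be passed to `subprocess.run(argv, cwd="/usr/diags/bin")` by a user with root access. The script only builds the arguments, so it can be checked without starting any diagnostics.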

If a diagnostic fails, the script saves the output from the diagnostic in a file in the /tmp directory (for example, /tmp/diagTestOutput.1.olenet). Output from the script indicates the actual name of the file. When a diagnostic fails, the script continues to run the remaining diagnostics.


Note: If you have USB devices connected to your workstation, you must disconnect the USB cables from the rear of the enclosure after the online diagnostics have finished running. Then reconnect the cables to restore the USB devices.

Online diagnostics display PASS(testname) when a test passes and FAIL(testname) when a test fails, as shown in Example 1 and Example 2. If any of the components do not pass the online diagnostics, contact your service representative.

Example 1

The following example shows output from running runalldiags in basic mode with no errors:

olab1 12# ./runalldiags -basic

Running online diagnostics at Basic level
Time: Tue Jun 24 16:25:36 CDT 2003
System Information: IRIX64 olab1 6.5 6.5.20m 04091957 IP35
Plan on running: olmem pandora

olmem - Online Memory Diagnostic    (Check /var/adm/SYSLOG for error message)
PASS(olmem)
pandora - System Stress Test
PASS(pandora)

Finished running at Tue Jun 24 17:00:05 CDT 2003
Ran: 2  Failed: 0

Example 2

The following example shows output from running runalldiags in basic mode with one error:

olab1 3# ./runalldiags -basic

Running online diagnostics at Basic level
Time: Tue Jun 24 10:55:36 CDT 2003
System Information: IRIX64 olab1 6.5 6.5.20m 04091957 IP35
Plan on running: olmem pandora

olmem - Online Memory Diagnostic    (Check /var/adm/SYSLOG for error message)
PASS(olmem)
pandora - System Stress Test
FAIL(pandora): see /tmp/diagFailure.0.pandora

Time: Tue Jun 24 11:35:38 CDT 2003
Ran: 1 Failed: 1
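
The PASS and FAIL lines in the two examples above can be collected from saved runalldiags output with a short script. This is a minimal sketch, not an SGI tool. The function name is hypothetical, and the pattern is based only on the line shapes shown in Example 1 and Example 2.

```python
import re

# Matches "PASS(olmem)" and "FAIL(pandora): see /tmp/diagFailure.0.pandora".
RESULT_LINE = re.compile(r"^(PASS|FAIL)\((\w+)\)(?::\s*see\s+(\S+))?\s*$")


def parse_runalldiags_output(text):
    """Return one dict per PASS/FAIL line found in the output.

    Each dict has the keys "status", "test", and "output_file";
    "output_file" is None when the line names no file.
    """
    results = []
    for line in text.splitlines():
        match = RESULT_LINE.match(line.strip())
        if match:
            results.append({
                "status": match.group(1),
                "test": match.group(2),
                "output_file": match.group(3),
            })
    return results
```

For the output in Example 2, the entry for pandora carries the status FAIL and the file name reported by the script, which is the file to send to your service representative.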