This chapter provides an overview of the physical and architectural aspects of your SGI Altix 3000 series system. The major components of the Altix 3000 series systems are described and illustrated.
The Altix 3000 series is a family of multiprocessor distributed shared memory (DSM) computer systems that scale from 4 to 64 Intel Itanium 2 processors as a cache-coherent single system image (SSI). In a DSM system, each processor contains memory that it shares with the other processors in the system. Because the DSM system is modular, it combines the advantages of low entry-level cost with global scalability in processors, memory, and I/O. You can install and operate the Altix 3000 series system in a rack in your lab or server room.
This chapter consists of the following sections:
Figure 3-1 shows the front views of a single-rack system (the Altix 3300 system) and a multiple-rack system (the Altix 3700 system).
The C-brick contains the processors (zero or four processors per C-brick) for the server system. The number of processors and whether or not a router (R-brick) is configured determine the Altix 3000 server model. The following two models, discussed in the sections that follow, are available:
Altix 3300 server. A 17U rack is used to house the power bay, up to three C-bricks, and one IX-brick.
Altix 3700 server. The 40U rack in this server houses all bricks, drives, and other components.
The Altix 3300 server system has up to 12 Intel Itanium 2 processors, one I/O brick (an IX-brick), no routers, and a single power bay. The system is housed in a short 17U rack enclosure with a single power distribution system (PDS). Although the L2 controller is optional with the Altix 3300 server, the L2 controller touch display is not available for this model. You can also add racks containing D-brick2s and TP900 storage modules to your Altix 3300 server system.
Figure 3-2 shows one possible configuration of an Altix 3300 server system.
The Altix 3700 server system has up to 512 Intel Itanium 2 processors, a minimum of one IX-brick for every 64 processors, and a minimum of four 8-port routers for every 32 processors. The system requires a minimum of one 40U tall rack with at least one power bay and one single-phase PDU per rack. (The single-phase PDU has two openings with three cables that extend from each opening to connect to the power bay. The three-phase PDU has two openings with six cables that extend from each opening to connect to two power bays.)
Each tall rack enclosure containing C-bricks comes with an L2 controller. An L2 controller touch display is provided in the first rack of the system (rack 001).
You can also add racks containing C-bricks, R-bricks, I/O bricks, and disk storage to your server system.
Figure 3-3 shows an example configuration of a 32-processor Altix 3700 server.
The Altix 3000 computer system is based on a distributed shared memory (DSM) architecture. It uses a global-address-space, cache-coherent multiprocessor that scales to 64 Intel Itanium 2 processors in a cache-coherent domain. Because it is modular, the DSM combines the advantages of low entry cost with the ability to scale processors, memory, and I/O independently.
The system architecture for the Altix 3000 system is a third-generation NUMAflex DSM architecture known as NUMA 3. In the NUMA 3 architecture, all processors and memory are tied together into a single logical system with special crossbar switches (R-bricks). This combination of processors, memory, and crossbar switches constitutes the interconnect fabric called NUMAlink.
The basic building block for the NUMAlink interconnect is the C-brick, which is sometimes referred to as the compute node. A C-brick contains two processor nodes; each processor node consists of a Super-Bedrock ASIC and two processors with large on-chip secondary caches. The two Intel Itanium 2 processors are connected to the Super-Bedrock ASIC via a single high-speed front side bus. The two Super-Bedrock ASICs are then interconnected internally by a single 6.4-GB/s NUMAlink 4 channel.
The Super-Bedrock ASIC is the heart of the C-brick. This specialized ASIC acts as a crossbar between the processors, local SDRAM memory, the network interface, and the I/O interface. The Super-Bedrock ASIC has a total aggregate peak bandwidth of 6.4 GB/s. Its memory interface enables any processor in the system to access the memory of all processors in the system. Its I/O interface connects processors to system I/O, which allows every processor in a system direct access to every I/O slot in the system.
Another component of the NUMA 3 architecture is the router ASIC, a custom-designed 8-port crossbar ASIC in the R-brick. Used with highly specialized cables, the R-bricks provide a high-bandwidth, extremely low-latency interconnect between all C-bricks in the system. This interconnection creates a single contiguous system memory of up to 2 TB (terabytes).
Figure 3-4 shows a functional block diagram of the Altix 3000 series system.
The main features of the Altix 3000 series server systems are introduced in the following sections:
The Altix 3000 series systems are modular systems. The components are housed in building blocks referred to as bricks. You can add different brick types to a system to achieve the desired system configuration. You can easily configure systems around processing capability, I/O capability, memory size, and storage size. You place individual bricks that create the basic functionality (compute/memory, I/O, and power) into custom 19-inch racks. The air-cooled system has redundant, hot-swap fans at the brick level and redundant, hot-swap power supplies at the rack level.
In the Altix 3000 series server, memory is physically distributed among the C-bricks (compute nodes); however, it is accessible to and shared by all compute nodes. Note the following types of memory:
If a processor accesses memory that is physically located on a compute node, the memory is referred to as the node's local memory.
The total memory within the system is referred to as global memory.
If processors access memory located in other C-bricks, the memory is referred to as remote memory.
Memory latency is the amount of time required for a processor to retrieve data from memory. Memory latency is lowest when a processor accesses local memory.
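On a Linux system, an application can influence memory latency by controlling where its buffers are placed. The following sketch is illustrative only and is not taken from SGI documentation; it uses the standard libnuma interface to place one buffer in local memory and one on another node (node 1 and the buffer size are assumptions).

```c
/*
 * Illustrative sketch only: place one buffer in local memory and one on a
 * remote node with libnuma.  Node 1 and the buffer size are assumptions.
 * Compile with:  cc latency_demo.c -lnuma
 */
#include <numa.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

int main(void)
{
    size_t len = 64UL * 1024 * 1024;        /* 64 MB test buffer */

    if (numa_available() < 0) {
        fprintf(stderr, "NUMA policy is not supported on this system\n");
        return EXIT_FAILURE;
    }

    /* Local memory: allocated on the node where this process runs. */
    void *local = numa_alloc_local(len);

    /* Remote memory: explicitly placed on node 1 (assumes a second node). */
    void *remote = numa_alloc_onnode(len, 1);

    if (local == NULL || remote == NULL) {
        fprintf(stderr, "allocation failed\n");
        return EXIT_FAILURE;
    }

    /* Touch both buffers: accesses to 'local' see the lowest latency,
     * while accesses to 'remote' traverse the interconnect fabric. */
    memset(local, 0, len);
    memset(remote, 0, len);

    numa_free(local, len);
    numa_free(remote, len);
    return EXIT_SUCCESS;
}
```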
Like DSM, I/O devices are distributed among the compute nodes (each compute node has an I/O port that can connect to an I/O brick) and are accessible by all compute nodes through the NUMAlink interconnect fabric.
As the name implies, the cache-coherent non-uniform memory access (ccNUMA) architecture has two parts, cache coherency and non-uniform memory access, which are discussed in the sections that follow.
The Altix 3000 server series uses caches to reduce memory latency. Although data exists in local or remote memory, copies of the data can exist in various processor caches throughout the system. Cache coherency keeps the cached copies consistent.
To keep the copies consistent, the ccNUMA architecture uses a directory-based coherence protocol. In a directory-based coherence protocol, each block of memory (128 bytes) has an entry in a table that is referred to as a directory. Like the blocks of memory that they represent, the directories are distributed among the compute nodes. A block of memory is also referred to as a cache line.
Each directory entry indicates the state of the memory block that it represents. For example, when the block is not cached, it is in an unowned state. When only one processor has a copy of the memory block, it is in an exclusive state. And when more than one processor has a copy of the block, it is in a shared state; a bit vector indicates which caches contain a copy.
When a processor modifies a block of data, the processors that have the same block of data in their caches must be notified of the modification. The Altix 3000 server series uses an invalidation method to maintain cache coherence. The invalidation method purges all unmodified copies of the block of data, and the processor that wants to modify the block receives exclusive ownership of the block.
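The following minimal sketch illustrates the idea of a directory entry and the invalidation step described above. The structure layout, field widths, and function names are hypothetical; they are not the actual Super-Bedrock directory format.

```c
/*
 * Simplified sketch of a directory entry and the invalidation step.
 * Field widths, names, and message handling are illustrative only.
 */
#include <stdint.h>

enum dir_state {
    DIR_UNOWNED,    /* block is not cached anywhere            */
    DIR_EXCLUSIVE,  /* exactly one processor holds the block   */
    DIR_SHARED      /* two or more processors hold read copies */
};

struct dir_entry {            /* one entry per 128-byte memory block */
    enum dir_state state;
    uint64_t       sharers;   /* bit vector: bit n set => cache n has a copy */
};

static void send_invalidate(unsigned cache_id)
{
    (void)cache_id;           /* placeholder for an interconnect message */
}

/* Processor 'writer' wants to modify the block: purge all other copies
 * and hand the writer exclusive ownership. */
static void handle_write(struct dir_entry *e, unsigned writer)
{
    uint64_t others = e->sharers & ~(1ULL << writer);

    for (unsigned n = 0; others != 0; n++, others >>= 1) {
        if (others & 1)
            send_invalidate(n);
    }

    e->sharers = 1ULL << writer;   /* only the writer retains a copy */
    e->state   = DIR_EXCLUSIVE;
}

int main(void)
{
    /* Example: caches 0 and 2 hold a shared copy; processor 2 then writes. */
    struct dir_entry e = { DIR_SHARED, (1ULL << 0) | (1ULL << 2) };

    handle_write(&e, 2);           /* cache 0 is invalidated */
    return e.state == DIR_EXCLUSIVE ? 0 : 1;
}
```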
The Altix 3000 server series components have the following features to increase the reliability, availability, and serviceability (RAS) of the systems.
Power and cooling:
Power supplies are redundant and can be hot-swapped.
Bricks have overcurrent protection.
Fans are redundant and can be hot-swapped.
Fans run at multiple speeds in all bricks except the R-brick. Speed increases automatically when temperature increases or when a single fan fails.
System monitoring:
System controllers monitor the internal power and temperature of the bricks, and automatically shut down bricks to prevent overheating.
Memory, L2 cache, L3 cache, and all external bus transfers are protected by single-bit error correction and double-bit error detection (SECDED).
The NUMAlink interconnect network is protected by cyclic redundancy check (CRC).
The L1 primary cache is protected by parity.
Each brick has failure LEDs that indicate the failed part; LEDs are readable via the system controllers.
Systems support Embedded Support Partner (ESP), a tool that monitors the system; when a condition occurs that may cause a failure, ESP notifies the appropriate SGI personnel.
Systems support remote console and maintenance activities.
Power-on and boot:
Automatic testing occurs after you power on the system. (These power-on self-tests, or POSTs, are also referred to as power-on diagnostics, or PODs.)
Processors and memory are automatically de-allocated when a self-test failure occurs.
Boot times are minimized.
Further RAS features:
Systems support partitioning.
Systems have a local field-replaceable unit (FRU) analyzer.
All system faults are logged in files.
Memory can be scrubbed when a single-bit error occurs.
The Altix 3000 series system features the following major components:
17U rack. This deskside rack is used for the Altix 3300 systems.
40U rack. This is a custom rack used for both the compute rack and I/O rack in the Altix 3700 system. The power bays are mounted vertically on one side of the rack.
C-brick. This contains the compute power and memory for the Altix 3000 series system. The C-brick is 3U high and contains two Super-Bedrock ASICs, four Intel Itanium 2 processors, and up to 32 memory DIMMs.
IX-brick. This 4U-high brick provides the boot I/O functions and 12 PCI-X slots.
PX-brick. This 4U-high brick provides 12 PCI-X slots on 6 buses for PCI expansion.
R-brick. This is a 2U-high, 8-port router brick.
Power bay. The 3U-high power bay holds a maximum of six power supplies that convert 220 VAC to 48 VDC. The power bay has eight 48-VDC outputs.
D-brick2. This is a 3U-high disk storage enclosure that holds a maximum of 16 low-profile Fibre Channel disk drives.
TP900 disk storage module. This is a 2U-high disk storage enclosure that holds a maximum of eight low-profile Ultra160 SCSI disk drives.
SGIconsole. This is a combination of hardware and software that allows you to manage multiple SGI servers.
Figure 3-5 shows the Altix 3300 system components, and Figure 3-6 shows the Altix 3700 system components.
Bays in the racks are numbered using standard units. A standard unit (SU) or unit (U) is equal to 1.75 inches (4.445 cm). Because bricks occupy multiple standard units, brick locations within a rack are identified by the bottom unit (U) in which the brick resides. For example, in a tall 40U rack, the C-brick positioned in U05, U06, and U07 is identified as C05. In a short 17U rack, the IX-brick positioned in U13, U14, U15, and U16 is identified as I13.
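As a simple illustration of this naming convention, the following sketch (a hypothetical helper, not an SGI utility) builds a brick location name from the brick type letter and the bottom unit number.

```c
/*
 * Hypothetical helper, not an SGI utility: build a brick location name
 * from the brick type letter and the bottom rack unit it occupies.
 */
#include <stdio.h>

static void brick_location(char type, unsigned bottom_unit, char out[8])
{
    snprintf(out, 8, "%c%02u", type, bottom_unit);
}

int main(void)
{
    char name[8];

    brick_location('C', 5, name);   /* C-brick in U05-U07 of a 40U rack */
    printf("%s\n", name);           /* prints C05 */

    brick_location('I', 13, name);  /* IX-brick in U13-U16 of a 17U rack */
    printf("%s\n", name);           /* prints I13 */

    return 0;
}
```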
A rack is numbered with a three-digit number. Compute racks are numbered sequentially beginning with 001. A compute rack is a rack that contains C-bricks. I/O racks are numbered sequentially and by the physical quadrant in which the I/O rack resides. Figure 3-7 shows the rack numbering scheme for multi-rack Altix 3700 systems. The Altix 3300 system is a single compute rack system; therefore, the rack number is 001.
The Altix 3000 series system has the following external storage options:
Host bus adapter interfaces (HBA)
2 Gbit Fibre Channel, 200 MB/s peak bandwidth
Ultra160 SCSI, 160 MB/s peak bandwidth
Gigabit Ethernet copper and optical
JBOD (just a bunch of disks)
SGI TP900, Ultra160 SCSI
RAID
D-brick2, 2 Gbit Fibre Channel (Model 3700 only)
SGI TP9500, 2 Gbit Fibre Channel
Data servers
SGI File Server 830, Gigabit Ethernet interface
SGI File Server 850, Gigabit Ethernet interface
SGI SAN Server 1000, 2 Gbit Fibre Channel interface
Tape libraries
STK L20, L40, L80, L180, and L700
Tape drives
STK 9840B, 9940B, LTO, SDLT, and DLT
ADIC Scalar 100, Scalar 1000, Scalar 10000, and AIT