PCI Express switching and Remote I/O in the data center.
The emergence of the PCI Express standard enables us to realise some of the advanced features enjoyed within a blade server environment while still using standard commodity parts. By providing a remote I/O module connected via PCI Express to the original 1U servers, users can upgrade the I/O capability without resorting to larger, complex, expensive servers. This dramatically reduces the overall system cost for these features, enabling some of the benefits of high availability solutions to cascade into the entry level server markets.
To achieve this aim, three problems need to be resolved:
* Hot plug server and I/O modules at commodity prices
* Suitable PCI Express interconnect with support for multiple hosts
* PCI Express cable standards that support the interconnect requirement
Remote I/O--Building blocks
The basic technology building blocks of Remote I/O are starting to become available, driven in part by the higher specification requirements of the blade server system architecture.
[FIGURE 1 OMITTED]
Server I/O Modules
Server I/O Modules (SIOMs) are designed to meet the needs of the next generation of servers and workstations. The standardization of these new modules has already been started by the PCI-SIG and a draft specification is currently in progress. The basic physical requirements of a Remote I/O entry appliance and a SIOM are identical. They fundamentally address the concerns of current PCI cards, namely:
* Need for a robust low cost module, providing improved power, cooling, and EMI handling
* Hot plug capability
* I/O module installation and removal from a closed chassis without the need for any special service skills or tools
* Internal system bandwidth provisioned to meet future IO requirements
* 10G Ethernet and Fibre Channel
The SIOM is available in a size that is approximately half the current full length PCI card, with a power budget the same as today. It is thus compatible in form factor with most of the current generation of high end I/O cards. However, with its more prescribed case, power and connection specification, it enables:
* Improved system I/O density
* Greater flexibility of chassis designs
Packaged in a protective module, the SIOM features EMI gasketing, insertion/removal levers and a defined cooling specification suitable for 10G Ethernet or Fibre Channel. It also provides an expansion route via a double width card for very high end applications.
The aim of the SIOM standard is to provide commodity PCI Express technology modules at a very small overhead to standard PCI cards of equivalent performance.
PCI Express switching
Interconnect is the next issue. Consider a typical solution shown in Fig 1. This shows a simplified system diagram incorporating a Remote I/O appliance. The I/O devices (HBAs and NICs) are legacy PCI devices with legacy PCI drivers; therefore each must be logically connected to a single host. However, unlike legacy systems where the connection is physical, i.e., a particular HBA is plugged into a particular server, here the connection is determined only by the configuration of the switch.
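The switch-configured ownership model described above can be pictured as a simple assignment table held by the fabric manager. The sketch below is purely illustrative (the `FabricManager` class and its method names are invented for this article, not a real product API); it shows the one-device-to-one-host constraint that legacy PCI drivers impose.

```python
# Illustrative sketch of a fabric manager's assignment table: each
# legacy PCI device (HBA/NIC) is logically owned by exactly one host,
# and the mapping lives in switch configuration, not in physical slots.
class FabricManager:
    def __init__(self):
        self.assignments = {}  # device id -> owning server id

    def assign(self, device, server):
        # Legacy PCI devices cannot be shared: refuse double assignment.
        if device in self.assignments:
            raise ValueError(f"{device} already owned by {self.assignments[device]}")
        self.assignments[device] = server

    def release(self, device):
        self.assignments.pop(device, None)

    def devices_for(self, server):
        # The per-server view: a host sees only its allocated I/O modules.
        return [d for d, s in self.assignments.items() if s == server]

fm = FabricManager()
fm.assign("hba0", "server1")
fm.assign("nic0", "server1")
fm.assign("hba1", "server2")
```

Reconfiguring the switch, rather than moving cards between chassis, is what makes the later failover and sparing scenarios possible.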
PCI Express was not designed to support this environment, so non-transparent bridges are introduced to support a multiple root complex connection. However, this conflicts with the requirement for transparent legacy I/O card support and, ultimately, for full sharing of resources. It is therefore necessary to look for solutions beyond current PCI Express capability. There are two possible solutions to consider:
* Shared PCI Express. The PCI Express SIG is looking at shared PCI Express, which offers remote I/O capability and the enhancement to allow two hosts to truly share a single I/O resource. This would require additional routing fields within the PCI Express packet structure, which requires at minimum additional silicon within the switch. However, it does promise a lightweight entry to Remote I/O, with expansion to Shared I/O, but until the findings of the SIG are finalised the full capability cannot be judged. Clearly, if this additional function can be provided for a nominal silicon cost then it would suit smaller Remote I/O configurations and would allow ASI to tunnel traffic from these boxes into a larger fabric, possibly providing a best of both worlds solution.
* Advanced Switching Interconnect (ASI). ASI was designed to tunnel PCI Express and so fully fulfils the requirement for Remote I/O, because the additional routing is already part of the ASI specification. ASI also provides a rich set of capabilities, as it additionally supports full fabric capability with advanced traffic management support. ASI is a fully published specification and multiple vendors are already implementing silicon suitable for Remote I/O; it is therefore the current solution of choice.
The ASI standard permits the simple encapsulation and transportation of PCI Express packets in a way that is 100% backwards compatible with existing PCI Express hardware and firmware. This allows systems with ASI system interconnect to use legacy PCI Express I/O devices and software. ASI also allows encapsulation and transportation of many other communication protocols, such as Ethernet, Fibre Channel and Infiniband. ASI can therefore bridge across networks containing multiple communication protocols whilst providing the capability to reuse previously installed legacy hardware and software.
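The key property of this tunnelling is that the PCI Express packet itself travels untouched, which is what preserves backwards compatibility. The toy sketch below illustrates the idea only; the header layout is invented for clarity (the real PI-8 frame format is defined by the ASI specification, not by this code).

```python
import struct

# Toy illustration of ASI-style tunnelling: prepend a route header at
# the fabric ingress, strip it at egress. The field layout here is
# invented; the real PI-8 format is defined by the ASI specification.
def asi_encapsulate(route, tlp_bytes):
    # The PCI Express TLP is carried as an opaque payload.
    header = struct.pack(">I", route)
    return header + tlp_bytes

def asi_decapsulate(frame):
    (route,) = struct.unpack(">I", frame[:4])
    return route, frame[4:]

tlp = b"\x40\x00\x00\x01payload"   # stand-in for an opaque PCIe packet
frame = asi_encapsulate(0x2A, tlp)
route, out = asi_decapsulate(frame)
```

Because the payload emerges bit-for-bit identical, legacy devices and drivers on either side never see the fabric in between.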
The diagram above shows how simple AS-to-Express bridges are easily embedded in the switch port. Thus the Remote I/O system using the SIOM modules has a number of benefits:
* A failed server, HBA or NIC can be "replaced" by an automated system management function--employing legacy O/S hot-swap mechanisms--without any physical intervention by a system administrator. The mean time to repair (MTTR) is now seconds rather than minutes or hours
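Because repair is a switch reconfiguration rather than a hands-on swap, the failover step reduces to walking the assignment table. The sketch below is a hypothetical illustration of that sequence (the `fail_over` function and the dictionary shape are inventions of this article, not a vendor interface); the real work of signalling hot-remove and hot-add to the operating systems is noted only in comments.

```python
# Hypothetical automated failover: every I/O module owned by the failed
# server is reassigned to a standby, using switch configuration only.
def fail_over(assignments, failed, standby):
    """Reassign each device owned by `failed` to `standby`; return the list moved."""
    moved = []
    for device, owner in list(assignments.items()):
        if owner == failed:
            # In a real system this step would raise a hot-remove event to
            # the failed host and a hot-add event to the standby host,
            # reusing the legacy O/S hot-swap mechanisms.
            assignments[device] = standby
            moved.append(device)
    return moved

a = {"hba0": "server1", "nic0": "server1", "hba1": "server2"}
moved = fail_over(a, "server1", "spare")
```

The elapsed time is dominated by the hot-plug handshake, which is why MTTR drops from hours to seconds.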
Another advantage is that it is now possible to better manage PCI Express enumeration. Events that may have previously caused direct resets or hot plug events can now be controlled and sequenced using the fabric manager embedded in the appliance. We believe this feature will enable a more reliable and controlled operating environment and help to improve overall system availability.
PCI Express is currently ratifying the cable standard and looks set to standardize on a 7 meters maximum cable length. For simple remote I/O complexes, this cable length will suffice. Any server with a PCI Express external connector could attach directly to the Remote I/O appliance.
To maximize PCI Express efficiency, it is important to understand the capability of the chipset, and for the switch to add minimal overhead and latency. However, for most applications a single 4X cable provides a perfect solution for 10G capability at reasonable efficiency. This would more than address the entry level needs to connect to existing Fibre Channel or Ethernet resources.
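The 4X claim is easy to check with back-of-the-envelope arithmetic, assuming first-generation PCI Express signalling of 2.5 GT/s per lane with 8b/10b line coding:

```python
# Back-of-the-envelope check of the 4X cable claim, assuming PCI Express
# 1.x signalling: 2.5 GT/s per lane with 8b/10b line coding.
lanes = 4
raw_per_lane = 2_500_000_000   # bits/s on the wire, per lane
raw = lanes * raw_per_lane     # aggregate raw rate across the cable
usable = raw * 8 // 10         # payload rate after 8b/10b encoding overhead
print(raw, usable)             # 10 Gb/s raw, 8 Gb/s usable
```

So a 4X cable carries 10 Gb/s raw, of which 8 Gb/s remains after encoding overhead: enough, at reasonable efficiency, for a 10G-class Ethernet or Fibre Channel attachment before protocol overheads are considered.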
For larger complexes, with more servers and more I/O, the capability to extend cable length to 15 meters is essential. The current cost of optical connections is too high, and this means that higher power electrical transceivers would be needed; but similar lengths have already been achieved with copper connections for interfaces faster than PCI Express. It can be assumed, therefore, that suitable Host Channel Adapters and switch transceivers would be capable of meeting this requirement.
Remote I/O--A practical solution
With all the building blocks available, this section examines how they can be combined in a useful product.
Typical configurations for small applications would require a minimum of two to four servers with three cards in each server. The SIOM form factor requires a 3U height for vertically mounted modules. Vertical mounting with horizontal stacking is essential to minimize conflicts with I/O cable plugging. Within this 3U form factor, it is possible to provide sufficient SIOM capability and server cable interconnects for the small system.
In addition to basic connectivity, a practical solution requires:
* Redundant power supplies
* Resilient switch managers
* Hot plug switch modules
* Fully passive backplane
* Dual cable connectivity
Such capabilities can easily be fitted within a 3U enclosure form factor. A typical 2+1 redundant server system, currently with five PCI cards in each PC, could be reduced from 9U to a 6U solution requiring lower power, smaller space, fewer I/O cards and greater reliability. On top of this, the roadmap could drive additional benefits of performance and scalability.
Initially Remote I/O will provide support for the PI-8 protocol, which permits the transportation of legacy PCI Express packets through an ASI switch fabric. The Servers will communicate with the IO Modules using PCI Express, and consequently the legacy Operating System, PCI Software and IO Module device drivers will not require modification. The ASI switch will provide a "virtual" PCI Express link for each Server, and the user will be able to allocate IO Modules to the various servers. Each Server will see only its allocated IO Modules. Redundant or spare IO Modules or Servers may easily be switched in when required through a simple management interface or automated failover mechanism. Thus the power and flexibility of the switch product is made available without the need for extensive SW modifications.
[FIGURE 2 OMITTED]
It would also be expected that the embedded appliance would support plug-and-play operation and include the following basic feature set:
* Configuration management. This could be a web browser interface for simple systems or a full API with management of the dual paths in redundant systems
* Hot plug event management. Hot plug events require re-enumeration of the PCI Express bus, but it is essential to manage when this occurs and how far its effects extend
* Enclosure service. Power management is very important and control of power down, fans or PSU failures etc. needs to be tied in to the server strategy
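The hot plug management point above amounts to queueing: rather than letting each insertion or removal immediately reset or re-enumerate the bus, the embedded fabric manager can serialise events and apply them at controlled moments. The following sketch is a hypothetical illustration of that sequencing (the `HotPlugSequencer` class is invented for this article):

```python
from collections import deque

# Sketch of sequenced hot-plug handling: events are queued rather than
# triggering an immediate re-enumeration, so the fabric manager decides
# when, and for which slot, the bus topology is updated.
class HotPlugSequencer:
    def __init__(self):
        self.pending = deque()
        self.log = []

    def raise_event(self, slot, kind):
        # kind is "add" or "remove"; nothing happens yet.
        self.pending.append((slot, kind))

    def service_one(self):
        # One controlled re-enumeration at a time, in arrival order.
        if self.pending:
            slot, kind = self.pending.popleft()
            self.log.append(f"re-enumerate after {kind} in slot {slot}")

seq = HotPlugSequencer()
seq.raise_event(3, "add")
seq.raise_event(5, "remove")
seq.service_one()
```

Deferring and ordering the re-enumeration is what turns a disruptive bus event into a routine, scheduled operation.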
Remote I/O--Shared I/O Roadmap
It is clear that the roadmap requirement to move from Remote I/O to Shared I/O will involve the internal fabric moving from simple PCI Express to the full ASI implementation. The ASI transport facilitates the fast switching of data over a wide range of Networking, Storage, and Peer to Peer protocols. To support this, however, it will be necessary to provide software and hardware that interfaces between the various protocol stacks and the AS endpoint hardware device driver. The production of the full set of required SW is a substantial task, and one that requires the participation of the wider open source community.
It would appear that the simplest next stage of ASI integration into the Server is to facilitate the exchange of Peer to Peer information. This would fully operate on the existing Remote I/O hardware base if an ASI Endpoint is used in the server. Several established silicon vendors are indicating that external ASI endpoint devices will be available later this year. Currently this can only be achieved in the PCI domain by the use of non-transparent bridging. Non-transparent bridging, however, does not scale, is asymmetrical, and requires additional software. Other technologies, such as Infiniband, are available for Peer to Peer communications but would require a separate control network in addition to the PCI Express based IO Modules. Remote I/O based architectures require only one switching network, which can be designed to support both base PCI Express and Peer to Peer ASI communications. A typical implementation of Peer to Peer communication can be achieved by providing a shared memory structure within the Remote I/O solution.
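A shared memory structure of this kind is typically a ring buffer visible to both peers. The sketch below is a toy model only: an in-process Python list stands in for a fabric-visible memory window, and the `SharedRing` class is invented for illustration; a real implementation would place the ring in memory reachable through ASI endpoint hardware.

```python
# Toy model of peer-to-peer exchange through a shared memory structure:
# a single-producer, single-consumer ring buffer. A Python list stands
# in for the fabric-visible memory window a real system would use.
class SharedRing:
    def __init__(self, size):
        self.slots = [None] * size
        self.head = 0   # producer index
        self.tail = 0   # consumer index
        self.size = size

    def send(self, msg):
        nxt = (self.head + 1) % self.size
        if nxt == self.tail:
            return False          # ring full: producer must retry
        self.slots[self.head] = msg
        self.head = nxt
        return True

    def recv(self):
        if self.tail == self.head:
            return None           # ring empty
        msg = self.slots[self.tail]
        self.tail = (self.tail + 1) % self.size
        return msg

ring = SharedRing(4)
ring.send(b"hello from server1")
msg = ring.recv()
```

Because producer and consumer only ever advance their own index, the structure needs no locking across the fabric, which suits a memory window shared between two hosts.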
The final stage of integration of the AS capability is to "replace" the PCI Express based IO Modules with ASI attached Modules, and to provide the appropriate SW in the Servers that will allow the use of the existing Network and Storage Stacks. This will allow full IO Virtualization, and multiple Servers will be able to share single IO Modules. Again this requires end point silicon and appropriate software for the type of application involved. The drawing below shows the general direction, discussed in the previous white paper, and highlights the key software modules within the roadmap.
[FIGURE 3 OMITTED]
These are expected to be implemented through the cooperation of partners within the ASI standards committees and we are encouraging an open source approach to any solutions. This will facilitate acceptance of the technology and speed up its adoption.
The introduction posed the question: is it possible to provide the TCO and flexibility of a blade centre for the SMB market? The aims were:
* Low initial investment
* Greater expansion capability
* Use of commodity hardware
* Open standards
We can meet these basic requirements by implementing a Remote I/O Appliance which complies with the emerging SIOM standard and consolidates all I/O for a multi server environment in a single sub system enclosure as shown below.
When the Remote I/O appliance is connected to multiple 1U servers it would deliver many of the high availability benefits of blade server system architecture to the SMB market. The number of standby components would also be significantly reduced relative to a conventional system architecture of equivalent resilience. This would deliver a significant saving in capital outlay, power consumption and floor space. The benefits of the new bladed architecture would also scale if the installed server and I/O base were of a higher density, allowing an even greater reuse of capital.
No additional software beyond the assignment of I/O to servers would be required, and thus the open standards and commodity hardware would be fully utilized. In addition, the overall system management and upgrade processes would be greatly simplified.
As we move forward, the base PCI Express switching can be combined with the ASI fabric to provide a platform that offers numerous expansion options:
* For Remote I/O, only the I/O subsystem carries forward the host-centric PCI architecture. The servers will be interconnected by the emerging ASI switch fabric with peer-to-peer message passing and high performance RDMA capabilities.
* The Remote I/O model allows users to scale beyond the bottlenecks in their current systems. If more computing power is required, adding another processor does not require extra I/O if sharing is available. Adding I/O where processing is required does not require another server to house the I/O, and so on. Introducing storage into the picture makes a very flexible system possible.
Remote I/O provides us with just the first step on a roadmap, enabled by ASI and PCI Express, towards the next generation of data centers and compute platforms.
Paul Millard is CTO for Network Systems and Paul Lombardelli is Project Manager for Network Systems at Xyratex (Havant, Hampshire UK).