PCI Express switching and Remote I/O in the data center.
To achieve this aim, three problems must be resolved:
* Hot plug server and I/O modules at commodity prices
* Suitable PCI Express interconnect with support for multiple hosts
* PCI Express cable standards that support the interconnect requirement
Remote I/O--Building blocks
The basic technology building blocks of Remote I/O are starting to become available, driven in part by the higher specification requirements of the blade server system architecture.
[FIGURE 1 OMITTED]
Server I/O Modules
Server I/O Modules (SIOMs) are designed to meet the needs of the next generation of servers and workstations. Standardization of these new modules has already begun within the PCI-SIG, and a draft specification is in progress. The basic physical requirements of a Remote I/O entry appliance and a SIOM are identical. They fundamentally address the shortcomings of current PCI cards, namely:
* Need for a robust low cost module, providing improved power, cooling, and EMI handling
* Hot plug capability
* I/O module installation and removal from a closed chassis without the need for any special service skills or tools
* Internal system bandwidth provisioned to meet future I/O requirements
* 10G Ethernet and Fibre Channel
The SIOM is approximately half the length of a current full-length PCI card, with the same power budget as today's cards. It is thus compatible in form factor with most of the current generation of high-end I/O cards. However, its more tightly prescribed case, power and connection specification enables:
* Improved system I/O density
* Greater flexibility of chassis designs
Packaged in a protective module, the SIOM features EMI gasketing, insertion/removal levers and a defined cooling specification suitable for 10G Ethernet or Fibre Channel. It also provides an expansion route via a double width card for very high end applications.
The aim of the SIOM standard is to provide commodity PCI Express technology modules at a very small cost overhead relative to standard PCI cards of equivalent performance.
PCI Express switching
Interconnect is the next issue. Consider the typical solution shown in Fig 1, a simplified system diagram incorporating a Remote I/O appliance. The I/O devices (HBAs and NICs) are legacy PCI devices with legacy PCI drivers; each must therefore be logically connected to a single host. However, unlike legacy systems where the connection is physical, i.e., a particular HBA is plugged into a particular server, here the connection is determined solely by the configuration of the switch.
PCI Express was not designed to support this environment, so non-transparent bridges are introduced to support a multiple root complex connection. However, non-transparent bridging moves away from transparent support of legacy I/O cards and from full sharing of resources, the ultimate requirement. It is therefore necessary to look for solutions beyond current PCI Express capability. There are two possible solutions to consider:
* Shared PCI Express. The PCI-SIG is looking at shared PCI Express, which offers Remote I/O capability and an enhancement allowing two hosts to truly share a single I/O resource. This would require additional routing fields within the PCI Express packet structure, which in turn requires at minimum additional silicon within the switch. It does, however, promise a lightweight entry to Remote I/O with expansion to Shared I/O, although until the findings of the SIG are finalised the full capability cannot be judged. If this additional function can be provided for a nominal silicon cost, it would suit smaller Remote I/O configurations and would allow ASI to tunnel traffic from these boxes into a larger fabric, possibly providing a best-of-both-worlds solution.
* Advanced Switching Interconnect (ASI). ASI was designed to tunnel PCI Express and so fulfils the Remote I/O requirement directly, because the additional routing is already part of the ASI specification. ASI also provides a rich set of capabilities, additionally supporting full fabric operation with advanced traffic management. ASI is a fully published specification, and multiple vendors are already implementing silicon suitable for Remote I/O; hence it is the current solution of choice.
The ASI standard permits the simple encapsulation and transportation of PCI Express packets in a way that is 100% backwards compatible with existing PCI Express hardware and firmware. This allows systems with an ASI system interconnect to use legacy PCI Express I/O devices and software. ASI also allows encapsulation and transportation of many other communication protocols, such as Ethernet, Fibre Channel and InfiniBand. ASI can therefore bridge across networks containing multiple communication protocols whilst providing the capability to reuse previously installed legacy hardware and software.
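Conceptually, tunnelling works by prepending an ASI route header, which carries the Protocol Interface (PI) number and the route through the fabric, to an unmodified PCI Express packet. The sketch below illustrates the idea only; the field widths and layout here are simplified placeholders, not the actual ASI wire format.

```python
import struct

def asi_encapsulate(pcie_tlp: bytes, pi: int, route: int) -> bytes:
    """Prepend a simplified ASI route header to an unmodified PCIe TLP.

    The 4-byte header here (PI number, flags, route) is illustrative;
    the real ASI header format is defined by the ASI specification.
    """
    header = struct.pack(">BBH", pi, 0, route)
    return header + pcie_tlp

def asi_decapsulate(frame: bytes):
    """Strip the route header, recovering the original PCIe packet intact."""
    pi, _flags, route = struct.unpack(">BBH", frame[:4])
    return pi, route, frame[4:]
```

Because the encapsulated packet is carried unchanged, the PCI Express device and driver at the far end see exactly what they would see on a direct link, which is what makes the scheme backwards compatible.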
The diagram above shows how simple AS-to-Express bridges are easily embedded in the switch port. Thus the Remote I/O system using the SIOM modules has a number of benefits:
* A failed server, HBA or NIC can be "replaced" by an automated system management function--employing legacy O/S hot-swap mechanisms--without any physical intervention by a system administrator. The mean time to repair (MTTR) is now seconds rather than minutes or hours
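The failover described above is possible because the server-to-module assignments live in a software table inside the fabric manager, so "replacing" a failed module is a table update plus standard hot-plug events rather than a physical swap. A minimal sketch of this idea, with hypothetical class and method names:

```python
# Hypothetical fabric-manager sketch: module assignments are soft state,
# so failover is a table update, not a physical replacement.
class FabricManager:
    def __init__(self):
        self.assignments = {}   # server name -> set of assigned I/O module IDs
        self.spares = set()     # unassigned spare modules

    def assign(self, server, module_id):
        self.assignments.setdefault(server, set()).add(module_id)

    def add_spare(self, module_id):
        self.spares.add(module_id)

    def fail_over(self, server, failed_module):
        """Swap a failed module for a spare via legacy hot-plug semantics."""
        if not self.spares:
            raise RuntimeError("no spare I/O modules available")
        spare = self.spares.pop()
        modules = self.assignments[server]
        modules.discard(failed_module)   # hot-remove event to the server's OS
        modules.add(spare)               # hot-add event: spare appears in its place
        return spare
```

Since both steps reuse the operating system's existing hot-swap mechanisms, no driver changes are needed, and the repair completes in the seconds it takes the OS to process the hot-plug events.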
Another advantage is that it is now possible to better manage PCI Express enumeration. Events that may have previously caused direct resets or hot plug events can now be controlled and sequenced using the fabric manager embedded in the appliance. We believe this feature will enable a more reliable and controlled operating environment and help to improve overall system availability.
The PCI-SIG is currently ratifying the PCI Express cable standard and looks set to standardize on a 7-meter maximum cable length. For simple Remote I/O complexes this cable length will suffice; any server with a PCI Express external connector could attach directly to the Remote I/O appliance.
It is important to understand the capability of the chipset to maximize PCI Express efficiency and for the switch to provide the minimum overhead and latency. However, for most applications a single 4X cable provides a perfect solution for 10G capability at reasonable efficiency. This would more than address the entry level needs to connect to existing Fibre Channel or Ethernet resources.
For larger complexes, with more servers and more I/O, the capability to extend cable length to 15 meters is essential. The current cost of optical connections is too high, so higher power electrical transceivers would be needed; however, similar lengths have already been achieved with copper connections for interfaces faster than PCI Express. It can therefore be assumed that suitable host channel adapters and switch transceivers would be capable of meeting this requirement.
Remote I/O--A practical solution
With all the building blocks available, this section examines how they can be combined in a useful product.
Typical configurations for small applications would require a minimum of two to four servers with three cards in each server. The SIOM form factor requires a 3U height for vertically mounted modules. Vertical mounting with horizontal stacking is essential to minimize conflicts with I/O cable plugging. Within this 3U form factor, it is possible to provide sufficient SIOM capability and server cable interconnects for the small system.
In addition to basic connectivity, a practical solution requires:
* Redundant power supplies
* Resilient switch managers
* Hot plug switch modules
* Fully passive backplane
* Dual cable connectivity
Such capabilities can easily be fitted within a 3U enclosure form factor. A typical 2+1 redundant server system, with five PCI cards in each server today, could be reduced from 9U to a 6U solution requiring less power, less space and fewer I/O cards, with greater reliability. On top of this, the roadmap could drive additional benefits of performance and scalability.
Initially, Remote I/O will support the PI-8 protocol, which permits the transportation of legacy PCI Express packets through an ASI switch fabric. The servers will communicate with the I/O modules using PCI Express, and consequently the legacy operating system PCI software and I/O module device drivers will not require modification. The ASI switch will provide a "virtual" PCI Express link for each server, and the user will be able to allocate I/O modules to the various servers. Each server will see only its allocated I/O modules. Redundant or spare I/O modules or servers may easily be switched in when required through a simple management interface or automated failover mechanism. Thus the power and flexibility of the switch product is made available without the need for extensive software modifications.
[FIGURE 2 OMITTED]
It would also be expected that the embedded appliance would support a plug-and-play operation and include the following basic feature set:
* Configuration management. This could be a web browser interface for simple systems or a full API with management of the dual paths in redundant systems
* Hot plug event management. Hot plug events force re-enumeration of the PCI Express bus, so it is essential to manage when re-enumeration occurs and how far it extends
* Enclosure services. Power management is very important, and control of power down, fan or PSU failures etc. needs to be tied in to the server strategy
Remote I/O--Shared I/O roadmap
It is clear that the roadmap requirement to move from Remote I/O to Shared I/O will involve the internal fabric moving from simple PCI Express to the full ASI implementation. The ASI transport facilitates the fast switching of data over a wide range of networking, storage and peer-to-peer protocols. To support this, however, it will be necessary to provide software and hardware that interfaces between the various protocol stacks and the ASI endpoint hardware device driver. The production of the full set of required software is a substantial task, and one that requires the participation of the wider open source community.
It would appear that the simplest next stage of ASI integration into the server is to facilitate the exchange of peer-to-peer information. This would fully operate on the existing Remote I/O hardware base if an ASI endpoint is used in the server. Several established silicon vendors are indicating that external ASI endpoint devices will be available later this year. Currently this can only be achieved in the PCI domain by the use of non-transparent bridging. Non-transparent bridging, however, does not scale, is asymmetrical, and requires additional software. Other technologies, such as InfiniBand, are available for peer-to-peer communications but would require a separate control network in addition to the PCI Express based I/O modules. Remote I/O based architectures require only one switching network, which can be designed to support both base PCI Express and peer-to-peer ASI communications. A typical implementation of peer-to-peer communication can be achieved by providing a shared memory structure within the Remote I/O solution.
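The shared memory structure mentioned above can be modelled in software: one peer writes a message into a memory window both sides can address, and the other reads it out. The sketch below uses Python's `multiprocessing.shared_memory` purely to illustrate the concept; it stands in for the hardware-backed window an ASI endpoint would expose, and the one-byte length-prefix framing is an assumption for the example.

```python
# Conceptual peer-to-peer exchange through a shared memory window,
# standing in for the window a Remote I/O appliance could expose to
# two ASI endpoints. This models the idea only; it is not ASI hardware.
from multiprocessing import shared_memory

def peer_write(name: str, payload: bytes) -> None:
    """One peer deposits a length-prefixed message into the shared window."""
    shm = shared_memory.SharedMemory(name=name)
    shm.buf[0] = len(payload)              # 1-byte length prefix (messages < 256 B)
    shm.buf[1:1 + len(payload)] = payload
    shm.close()

def peer_read(name: str) -> bytes:
    """The other peer retrieves the message from the same window."""
    shm = shared_memory.SharedMemory(name=name)
    n = shm.buf[0]
    data = bytes(shm.buf[1:1 + n])
    shm.close()
    return data

# Create the shared window, write from one "peer", read from the other.
window = shared_memory.SharedMemory(create=True, size=256)
try:
    peer_write(window.name, b"hello peer")
    print(peer_read(window.name))
finally:
    window.close()
    window.unlink()
```

A real implementation would of course add synchronization (doorbells or interrupts) so the reader knows when a message has landed, but the data path is the same: a single write into the shared structure, a single read out.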
The final stage of integration of the ASI capability is to "replace" the PCI Express based I/O modules with ASI-attached modules, and to provide the appropriate software in the servers to allow use of the existing network and storage stacks. This will allow full I/O virtualization, and multiple servers will be able to share single I/O modules. Again this requires endpoint silicon and appropriate software for the type of application involved. The drawing below shows the general direction, discussed in the previous white paper, and highlights the key software modules within the roadmap.
[FIGURE 3 OMITTED]
These are expected to be implemented through the cooperation of partners within the ASI standards committees and we are encouraging an open source approach to any solutions. This will facilitate acceptance of the technology and speed up its adoption.
The introduction posed the question: is it possible to provide the TCO and flexibility of a blade centre for the SMB market? The aims were:
* Low initial investment
* Greater expansion capability
* Use of commodity hardware
* Open standards
We can meet these basic requirements by implementing a Remote I/O appliance which complies with the emerging SIOM standard and consolidates all I/O for a multi-server environment in a single subsystem enclosure as shown below.
When the Remote I/O appliance is connected to multiple 1U servers, it delivers many of the high-availability benefits of the blade server system architecture to the SMB market. The number of standby components is also significantly reduced relative to a conventional system architecture of equivalent resilience, delivering a significant saving in capital outlay, power consumption and floor space. The benefits of the new bladed architecture would also scale with a higher-density installed server and I/O base, allowing even greater reuse of capital.
No additional software beyond the assignment of I/O to servers would be required, so the open standards and commodity hardware would be fully utilized. In addition, the overall system management and upgrade processes would be greatly simplified.
As we move forward, the base PCI Express switching can be combined with the ASI fabric to provide a platform that offers numerous expansion options:
* For Remote I/O, only the I/O subsystem carries forward the host-centric PCI architecture. The servers will be interconnected by the emerging ASI switch fabric with peer-to-peer message passing and high performance RDMA capabilities.
* The Remote I/O model allows users to scale beyond the bottlenecks in their current systems. If more computing power is required, adding another processor does not require extra I/O when sharing is available; adding I/O where processing is required does not require another server to house it. Bringing storage into the picture makes a very flexible system possible.
Remote I/O provides us with just the first step on a roadmap, enabled by ASI and PCI Express, towards the next generation of data centers and compute platforms.
Paul Millard is CTO for Network Systems and Paul Lombardelli is Project Manager for Network Systems at Xyratex (Havant, Hampshire UK).