Implementing PCI Express for storage.
Busses like PCI have developed extensions to increase capacity, but these often adversely impact cost and complexity. A newer solution is PCI Express technology, which brings its own application and implementation considerations to storage system design. Data flow, quality of service (QoS), software modifications, silicon component selection criteria and board layout all merit review for this next-generation PCI bus.
Storage Array Data Flow
The data path between clients and storage arrays is a journey across diverse networking devices and I/O busses. The weakest link in this network is typically an I/O bottleneck or a source of non-recoverable errors. For some, the PCI bus has been a weak link and PCI Express will alleviate their issues with bandwidth and robustness. Many clients, servers, and storage arrays implement PCI busses, creating five or more locations where PCI Express can improve the data flow.
Data Path: Figure 1 shows a network connecting clients to a storage array, as well as the locations of PCI busses and where PCI Express migrations are likely to occur. In this example, clients on an Ethernet LAN access multiple heterogeneous servers. The servers also connect to the storage array through an alternate interface. The data flow between a client and a storage array is described below, with reference to Figure 1:
* The process begins with a client request to a general purpose server on the LAN. The client may be an individual PC or a workstation with specific data and application demands. The client is a CPU and chipset with a PCI bus, which interfaces to an Ethernet controller connected to the LAN.
* The server runs applications and manages data for clients. The server implements an architecture similar to the client's CPU/chipset combination. However, the server uses faster PCI-X busses to interface its Ethernet controller to the LAN.
[FIGURE 1 OMITTED]
* The server requires a connection to the storage array independent of the LAN. The server connects to a channel adapter that manages the communication with the switched backplane. The server uses another PCI-X bus to host a network controller that could support an Ethernet, iSCSI, Fibre Channel (FC) or custom bus.
* The channel adapter sends the message to the switch on an iSCSI, FC, Infiniband or custom bus. In some cases, the switch may be more of a router, understanding how to send messages to specific devices as opposed to just steering messages to a particular port. The channel adapter handles messaging tasks such as TCP/IP offload, data queuing and bus translation.
* The storage array controller is often a special function server implementing PCI-X busses. One PCI-X link will host a network controller connected to the switch/router. The storage array controller runs applications, caches data and helps ensure the data integrity of the system. For low to midrange systems, the storage array controller may perform RAID functions.
* The storage array controller sends data requests to the disk adapter on a PCI-X bus. The disk adapter controls the disk array through an FC link and typically performs RAID functions.
* The disks implement an FC-AL link, where AL stands for arbitrated loop. The arbitrated loop is a link that connects all the disk drive nodes together and is managed with a token-acquisition protocol.
In this example, requests from the client to the storage array travel across five PCI/PCI-X busses. The PCI/PCI-X busses are located in the client, the applications server and the storage array controller. In the second half of 2004, these PCI/PCI-X busses will begin to migrate to PCI Express.
In addition, some switch/router interfaces in the storage array may transition from Ethernet, iSCSI, Fibre Channel or custom to PCI Express. PCI Express offers cost advantages, scalability, and full-duplex operation.
Reliability, availability and serviceability (RAS) improvements can increase data integrity, a key criterion in storage. PCI Express enhances reliability by implementing signal lines as differential pairs, which have greater noise immunity than high-speed parallel busses. Reliability is further enhanced by 8b/10b encoding, which embeds the clock in the data signal and eliminates skew between separate signal and clock lines. PCI Express supports two levels of cyclic redundancy check (CRC) error detection: a link CRC at the Data Link Layer and an optional end-to-end CRC at the Transaction Layer.
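Link-level integrity checking of this kind can be illustrated with a short sketch. It is a simplification under stated assumptions: Python's zlib.crc32 stands in for the link's 32-bit check (the bit ordering and framing on a real link differ), and the send and receive functions are hypothetical names, not part of any PCI Express software interface.

```python
import zlib


def send(payload: bytes) -> bytes:
    """Append a 32-bit CRC to the payload, as a transmitter would."""
    return payload + zlib.crc32(payload).to_bytes(4, "big")


def receive(frame: bytes):
    """Return the payload if the CRC matches, or None to signal a retry."""
    payload = frame[:-4]
    received_crc = int.from_bytes(frame[-4:], "big")
    if zlib.crc32(payload) != received_crc:
        return None  # corrupted in flight: the sender must retransmit
    return payload


frame = send(b"block 42")
# Flip a single bit of the first byte to model a transmission error.
corrupted = bytes([frame[0] ^ 0x01]) + frame[1:]
```

Here receive(frame) returns the original payload, while receive(corrupted) returns None, modeling the point at which a receiver would ask for the packet to be resent.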
With parallel busses, a bus failure can bring down all the boards connected to the bus. As a point-to-point bus, a PCI Express link failure may be isolated from other boards so portions of the system continue to function and remain available.
To aid serviceability, PCI Express supports features such as hot plug, power budgeting and power management.
[FIGURE 2 OMITTED]
QoS for Storage
Storage arrays maintain data for a wide variety of servers hosting a range of applications. PCI Express offers Quality of Service (QoS) features to provide higher bus bandwidth to priority data types.
Figure 2 shows a conceptual example of a storage array implementing PCI Express. A switch is connected to three servers as well as a root complex that manages the storage array. PCI Express implements Virtual Channels and Traffic Classes to provide a flexible control mechanism to shape data flow.
Each server is transmitting three data types, which are assigned a priority in terms of a Traffic Class (TC). There are eight TCs with TC0 the lowest priority and TC7 the highest. The Administration Application Server assigns Error Signaling messages its highest priority, TC6. Since the primary responsibility of a storage array is to maintain data integrity, error-signaling messages receive the highest priority so error recovery sequences can be launched as quickly as possible.
Data Backup is assigned to TC1 and Power Management is assigned TC0. Data types maintain the same TCs throughout the system.
At each node, TCs are assigned to a Virtual Channel (VC). VCs provide a means for the application to allocate bus bandwidth. In Figure 2, the root has three VCs and assigns timeslots as follows: one time slot for VC2, followed by two time slots for VC1, then another time slot for VC2 and finally two time slots for VC0. The sequence repeats itself continuously.
VC2, which is mapped to TC6 for error signaling, has two of the six time slots in the cycle. The two time slots are separated in the cycle so the latency is no greater than three time slots. Time slots can be assigned in one of three tables to allow applications to define flexible weighted round-robin arbitration schemes. These tables accommodate 32, 64, and 128 entries.
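The time-slot scheme can be modeled in a few lines. The sketch below uses the slot sequence described for the root in Figure 2. Only the TC6-to-VC2 mapping comes from the text; the mappings for TC1 and TC0, and all function names, are assumptions made for illustration.

```python
from collections import Counter

# Hypothetical TC-to-VC mapping; only TC6 -> VC2 is stated in the text.
TC_TO_VC = {6: "VC2", 1: "VC1", 0: "VC0"}

# One arbitration cycle: VC2, two slots of VC1, VC2, two slots of VC0.
SLOT_TABLE = ["VC2", "VC1", "VC1", "VC2", "VC0", "VC0"]


def bandwidth_share(table):
    """Return the fraction of time slots owned by each virtual channel."""
    counts = Counter(table)
    return {vc: n / len(table) for vc, n in counts.items()}


def max_wait_slots(table, vc):
    """Longest run of slots owned by other channels, treating the table as cyclic."""
    positions = [i for i, owner in enumerate(table) if owner == vc]
    if not positions:
        raise ValueError(f"{vc} has no slots in the table")
    n = len(table)
    gaps = []
    for k, pos in enumerate(positions):
        nxt = positions[(k + 1) % len(positions)]
        # Number of foreign slots between this slot and the channel's next slot.
        gaps.append((nxt - pos - 1) % n)
    return max(gaps)
```

With this table, max_wait_slots(SLOT_TABLE, "VC2") is 2, so a VC2 slot recurs at least every third slot, while VC1 and VC0 can each wait four slots. Each channel owns the same one-third share of the slots, so in this particular table the benefit to error signaling is lower worst-case latency rather than extra bandwidth.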
PCI Express is software compatible with PCI and PCI-X systems. This means that existing operating systems and device driver software will function properly in a PCI Express system without change. However, to take advantage of the additional features of PCI Express, software needs to be rewritten.
In particular, system designers should consider taking advantage of PCI Express enhancements to interrupts, quality of service, power management and error correction and detection capabilities. The implementation of these features involves various levels of hardware and software modifications.
PCI Express supports in-band Message Signaled Interrupts (MSI), similar to PCI-X. This mechanism reduces interrupt servicing latency and eliminates interrupt signal lines on the printed circuit board.
Quality of service has been discussed, and new software is required to configure this capability as well as to determine the arbitration schemes for virtual channels and ports. These mechanisms help manage data flow for specific transaction types.
Power budgeting features include mechanisms to query add-in cards for power requirements, so software can determine whether a new card can be supported from power delivery and cooling perspectives. Power management provides the means to place PCI Express-enabled devices into different power states (fully active, standby, sleep, off, etc.) depending upon the state of the storage array. In addition, software elements supporting hot-plug are defined by the PCI Express specification.
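The power budgeting decision reduces to a simple admission check. The following is a minimal sketch, not the mechanism defined by the specification: the function name, its parameters and the wattage figures are all hypothetical, and a real system would read the card's requirements from the device rather than take them as arguments.

```python
def can_admit_card(card_watts, slot_limit_watts, system_budget_watts, allocated_watts):
    """Return True if a newly inserted card fits both the slot and system power budgets."""
    if card_watts > slot_limit_watts:
        return False  # the slot cannot deliver (or cool) this much power
    # The card must also fit within what remains of the system-wide budget.
    return allocated_watts + card_watts <= system_budget_watts


# Hypothetical figures: a 25 W slot in a chassis budgeted for 150 W,
# with 130 W already allocated to other cards.
fits = can_admit_card(15, 25, 150, 130)
too_much = can_admit_card(25, 25, 150, 130)
```

In this example the 15 W card is admitted and the 25 W card is refused, because it would push the allocation past the system budget even though the slot itself could supply it.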
PCI Express supports more extensive error detection, signaling and logging than predecessor PCI busses. New software is required to respond intelligently to this additional error information.
To take full advantage of new PCI Express features, significant software coding is required. However, the backward compatibility of PCI Express allows system developers to migrate to these new features at their desired pace. Also, many of these features are system level, which may have limited impact on the basic functionality of endpoints that were originally designed for PCI and PCI-X.
The selection of PCI Express enabled devices encompasses data bandwidth, system architecture and usage model considerations. System designers should ensure their PCI Express subsystems are well-balanced and the required data traffic profile is realized.
Considerations: A storage array may employ a PCI Express topology with up to four types of components. Figure 2 shows three of them: a root complex, a switch, and four endpoints corresponding to the four channel adapters. If any endpoint is not PCI Express capable, a fourth type is required: a bridge, such as a PCI Express to PCI/PCI-X bridge.
The selection criteria for root complex, switch, endpoint and bridge components are driven by data bandwidth, system architecture and usage model considerations. The data bandwidth needs of different priority transactions of the storage array dictate requirements around port configuration, arbitration mechanisms and maximum payload size. System architecture and usage model determine the applicability of features such as peer-to-peer transfers, hot plug capability and power consumption.
The first order task is to properly provision priority bandwidth to meet guaranteed latency specifications. For example, error signaling transactions have a higher priority than data backup transactions. To ensure a PCI Express port complies with priority data flow requirements, port capability is computed using simple calculations involving link speed, width (lanes) and maximum payload size as well as more complex modeling of port and virtual channel arbitration.
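The simple part of that calculation can be sketched as follows. The figures are not from the article: the sketch assumes first-generation signaling at 2.5 Gbit/sec per lane in each direction, 8b/10b encoding, and roughly 20 bytes of per-packet overhead (framing, sequence number, a three-doubleword header and the link CRC). The function name is hypothetical. Flow control, acknowledgements and retries are ignored, so the result should be read as an upper bound.

```python
GEN1_RAW_GBPS_PER_LANE = 2.5   # assumed raw signaling rate per lane, per direction
ENCODING_EFFICIENCY = 8 / 10   # 8b/10b carries 8 data bits in every 10 line bits

# Assumed per-packet overhead in bytes: framing (2) + sequence number (2)
# + 3-DW header (12) + link CRC (4). Larger headers or an end-to-end CRC add more.
TLP_OVERHEAD_BYTES = 20


def port_throughput_mbytes(lanes, max_payload_bytes, overhead_bytes=TLP_OVERHEAD_BYTES):
    """Approximate one-direction payload throughput of a port in Mbytes/sec."""
    data_gbps = GEN1_RAW_GBPS_PER_LANE * ENCODING_EFFICIENCY * lanes
    link_mbytes = data_gbps * 1000 / 8
    # Fraction of each packet that is payload rather than overhead.
    packet_efficiency = max_payload_bytes / (max_payload_bytes + overhead_bytes)
    return link_mbytes * packet_efficiency
```

Under these assumptions a x4 port with a 128-byte maximum payload delivers about 865 Mbytes/sec in each direction, and raising the maximum payload to 256 bytes lifts that to roughly 928 Mbytes/sec. This is why maximum payload size appears alongside link width in the selection criteria.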
Port arbitration can occur in two components, switches and the root complex. In Figure 2, the switch performs port arbitration for the four application servers to control the traffic flow between its four ingress ports and the root complex. A root complex with multiple ports performs port arbitration for peer-to-peer transactions and for access to a common egress port such as system memory. An examination of the virtual channel port arbitration schemes, such as weighted round robin, is advised to ensure sufficient data flow for priority transactions.
The next task is to size non-priority data flow and to budget for incidentals such as Data Link Layer retries. At this point, it may be useful to consider whether the priority transactions are bursty in nature. This warrants a special analysis of the arbitration schemes to confirm low-priority data flow is sufficient to ensure functional correctness.
[FIGURE 3 OMITTED]
Finally, storage array system architecture and usage model drive other component selection criteria. System architecture may require peer-to-peer transaction support, such as direct communication between two application servers across the switched backplane. The usage model may support hot swap of cards to increase the availability of the storage array. For some appliances, power consumption may be a key concern, especially if the system includes a mix of root complexes, bridges, switches and endpoints.
Selection: First generation root complexes, switches and bridges typically support one or two virtual channels with eight traffic classes. For most storage arrays, this configuration offers ample traffic shaping capability; high and low priority transactions may be split between the two virtual channels, with the high priority channel assigned greater bandwidth than the low priority channel.
Designers should select components with balanced capabilities. For example, connections between endpoints and switches default to the lowest common denominator for the number of virtual channels and maximum payload size. In other words, an endpoint can only transfer data as fast as the switch can handle, and vice versa. A switch supporting flexible port widths is useful for handling various combinations of ports and lane widths to match different endpoints.
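The lowest-common-denominator rule can be expressed directly. This is a sketch with hypothetical names (PortCaps, effective_link); treating link width the same way as virtual channel count and maximum payload size is an assumption added for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PortCaps:
    """Hypothetical summary of one port's capabilities."""
    lanes: int
    virtual_channels: int
    max_payload_bytes: int


def effective_link(a, b):
    """Both ends of a link operate at the smaller of each capability."""
    return PortCaps(
        lanes=min(a.lanes, b.lanes),
        virtual_channels=min(a.virtual_channels, b.virtual_channels),
        max_payload_bytes=min(a.max_payload_bytes, b.max_payload_bytes),
    )


# Hypothetical parts: a capable switch port paired with a modest endpoint.
switch_port = PortCaps(lanes=8, virtual_channels=2, max_payload_bytes=256)
endpoint = PortCaps(lanes=4, virtual_channels=1, max_payload_bytes=128)
link = effective_link(switch_port, endpoint)
```

The resulting link matches the endpoint in every respect, so the extra lanes, second virtual channel and larger payload support of the switch port go unused on this connection.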
In future generations of PCI Express components, root complexes with two or more ports will act as a switch. From a topology perspective, a discrete switch component is eliminated, saving board space. On the other hand, the flexibility to choose among a wide variety of discrete switches and features is forgone.
PCI Express provides system designers greater flexibility to shape the data traffic flow of their system. During component selection, system designers balance the features and capabilities of various components to ensure bandwidth, architecture and usage model requirements are supported throughout the system.
Laying out fewer signal lines isn't always easier. Although PCI Express boasts greater I/O bandwidth per pin than prior PCI bus family members, the faster bit rate for PCI Express necessitates using different layout techniques.
PCI design guides recommend routing signal lines along a similar path with trace lengths matching to within 25 mils. With PCI Express, the two signals comprising a lane must match trace lengths to within 5 mils. This stringent matching specification helps maintain the signal integrity of this differential pair. The impact is that "bumps" must be inserted in PCI Express signal lines to add trace length that compensates for bends.
Figure 3 illustrates trace length compensation, showing PCI Express and PCI bus routing from the chipset pins. In part A, the PCI Express signal lines have additional bumps to match trace lengths, whereas these bumps are not required for PCI bus lines. For example, signal Z and signal Y make two bends after leaving the device pin, and signal Z cuts the corner more sharply than signal Y. Signal Z therefore traverses a slightly shorter distance, which necessitates the addition of four bumps after point C for trace length compensation. Similarly, signal Y takes a shorter path than signal Z at point D, and two compensation bumps were added to its trace.
The need to carefully match trace lengths has several repercussions. First, layout designers need to spend additional time counting bends and planning the placement of compensation bumps nearby. Differential pairs require length bumps to be placed near bends to minimize line-to-line signal skew. Therefore, trace compensation is needed throughout signal traces and cannot be 'ganged' together in one location. Adding bumps is a manual process today, but in the future, one can anticipate layout tools to assist bump placement.
Second, layout designers quickly learn to relax the spacing between PCI Express signal lines so that bumps can be easily added intermittently without impacting the fan-out of the overall PCI Express bus. This may effectively double the line-to-line spacing (shown on the left side of Figure 3, part A).
Third, layout engineers also need to ensure trace lengths between lanes match to within 15 mils. This tolerance is less forgiving than PCI and also requires layout designers to pay attention to an additional set of constraints.
Fourth, layout designers are discouraged from switching layers when routing PCI Express signal lines. This restriction necessitates greater up-front planning of component placement on the board. Although PCI Express busses have fewer traces than prior PCI busses, the strict trace length matching specifications make board layout a time intensive task.
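The two matching tolerances lend themselves to an automated check of routed lengths. The sketch below is hypothetical: the function name, the input format and the choice to represent each lane by the average length of its two signals are assumptions. Only the 5 mil and 15 mil tolerances come from the text.

```python
INTRA_PAIR_TOLERANCE_MILS = 5     # between the two signals of a differential pair
LANE_TO_LANE_TOLERANCE_MILS = 15  # between lanes


def check_lengths(pairs):
    """pairs maps a lane name to (positive_len_mils, negative_len_mils).

    Returns a list of human-readable violations (empty if the routing passes).
    """
    violations = []
    for name, (p_len, n_len) in pairs.items():
        skew = abs(p_len - n_len)
        if skew > INTRA_PAIR_TOLERANCE_MILS:
            violations.append(f"{name}: intra-pair mismatch {skew} mils")
    # Assumption: a lane's length is the average of its two signal traces.
    lane_lengths = [(p + n) / 2 for p, n in pairs.values()]
    if lane_lengths:
        spread = max(lane_lengths) - min(lane_lengths)
        if spread > LANE_TO_LANE_TOLERANCE_MILS:
            violations.append(f"lane-to-lane mismatch {spread} mils")
    return violations


# Hypothetical routed lengths in mils.
routed = {"lane0": (2000, 2003), "lane1": (2010, 2018)}
report = check_lengths(routed)
```

For these example lengths the report flags lane1, whose two signals differ by 8 mils, while the 12.5 mil spread between the two lanes stays inside the lane-to-lane tolerance.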
PCI Express addresses the need for bus performance in storage arrays and is capable of supporting 10Gbit/sec networks and beyond. Additional features for quality of service and power management provide for even more reliable storage arrays. PCI Express offers reduced complexity, although system designers must do additional work to capture all the benefits offered by new PCI Express features.
PCI Express is a major departure from prior PCI busses. Although backward compatibility has been maintained, the impact of PCI Express on I/O bandwidth, system architecture and usage models for storage arrays will be significant.
Craig Szydlowski is a strategic marketing engineer at Intel Corporation (Santa Clara, CA).
Title Annotation: Storage Networking
Publication: Computer Technology Review
Date: Aug 1, 2004