Back-end switching in storage server design: improving the performance and availability of storage systems. (High Availability).
Until recently, including a back-end switch as an essential component of a storage server has not been practical because the power, cooling, packaging, and cost of a back-end switch have made it unreasonable. In addition, adding a back-end switch would also have required significant changes to the firmware of the storage server. However, recent changes in back-end switch design now make it practical.
In order to set the stage to illustrate the benefits of adding a back-end switch to a storage server, we will first discuss design issues of conventional storage servers. A storage server consists of one or more "controllers" that actually deliver the storage service, plus the packaging that holds those controllers and their back-end disks. Storage servers are designed to provide a variety of value-added services, but a primary goal they all share is to enhance the characteristics of their individual disk drives in the areas of performance (bandwidth, throughput, and latency) and RAS (reliability, availability, and scalability).
We will discuss storage server design issues using the example of modular RAID servers, which represent the majority of enterprise storage servers shipped worldwide. In general, the same design issues arise in monolithic RAID servers such as the EMC Symmetrix, HDS Lightning, and IBM Shark, and in enterprise file servers (NAS). We will then demonstrate how incorporating a back-end switch into the design of a storage server can lead to significant performance and availability improvements.
Modular RAID Server Design
Modular RAID servers are available from many manufacturers and they are all variants on a common theme. As shown in Figure 1, each controller in a modular RAID server consists of the following functional elements:
* Processor on a control bus with local memories for programs (program load memory) and control structures (control memory).
* Data bus, data/cache memory and data retention system.
* Host and disk interfaces.
* Cache mirror interface connecting to the other controller in the server.
* Parity computation logic.
All of these functional elements are integrated into a single controller that is replicated in its entirety to create a high-availability server. The controller's processing power, memory bandwidth, number of host interfaces, and disk interfaces are all fixed, although the amount of cache memory and the number of back-end disks in the server may be upgradeable.
Common elements of a modular RAID server include host and disk interfaces, processor and data bus, cache, modular disk packaging, and high availability mechanisms.
Host and disk interfaces: The most prevalent host interface found today in modern modular RAID controllers is Fibre Channel, and the most prevalent disk interface is FC-AL. Modular RAID controllers generally have two host interfaces. The number of disk interfaces determines the maximum number of disks the controller can connect to, as well as how much performance the controller can get from requests that miss its cache.
Processor and data path design: The RAID controller processor has a private bus connecting to a memory that holds code and control structures to assure that the processor's memory traffic does not interfere with data traffic through the controller. The two controllers in a modern modular storage server have enough processing power to operate 120-500 of the fastest available back-end disks at full speed, and processing power is rarely a performance bottleneck in these controllers. The data bus connects the host interfaces, the disk interfaces, the data/cache memory, and the cache mirror interface in a manner that is optimized for burst data traffic. A pair of modern controllers have enough combined data bus and data memory bandwidth to transfer data between six 2Gbps back-end FC-AL buses and six 2Gbps front-end Fibre Channel fabrics at full speed, which removes internal bandwidth as a performance bottleneck.
With the recent and continuing improvements in commodity processor, memory, and bus performance, back-end bus efficiency and disk drive connectivity have become the most significant areas in which storage server designers can differentiate themselves on performance.
Cache: Controller cache provides two functions--read caching and write caching. Read caching improves latency and throughput by holding disk data that is anticipated to be read by applications. Write caching captures write data in the cache instead of writing it immediately to disk, thus providing the illusion of low-latency disk writes. Customers buy lots of controller cache believing it will help their application performance. The performance improvements on real-world workloads due to adding more than the minimum amount of controller cache, however, are far less than most customers (and some storage server designers) believe!
Modern operating systems and database systems have figured out how to use today's large host memories to effectively cache application data requests close to the application. The read requests that miss in the application server caches and are issued to the storage server controller have a very low probability of hitting in the controller's read cache. Database systems even cache writes effectively and safely by using journaling techniques.
Given the limitations of storage controller cache in improving I/O throughput and latency, performance of backend buses and disks becomes much more important. As back-end disks get faster, there is more strain on back-end buses to maintain low bus latency in the face of high bus utilization.
Modular disk packaging: This is critical to the availability of a modular RAID server; it must provide reliable power, cooling, and interconnect, and prevent failures in a single disk from affecting other disks. The most popular form of enterprise disk packaging is the modular disk shelf or JBOD (Just a Bunch of Disks). The JBODs and controllers are connected to an external FC-AL hub or are daisy-chained using a three-port hub built into each JBOD and controller.
High-availability mechanisms: High availability in general is achieved by component replication, failure independence, failover, and online component replacement. In a modular RAID server there are five components that these high availability mechanisms apply to: the disks themselves, the controllers, the cache, the packaging (power supplies and fans), and the back-end disk buses.
All storage servers use some form of RAID to protect data from disk failures through disk redundancy. Because all RAID implementations must store redundant information to reconstruct application data if a disk fails, a single application write causes multiple I/O operations on the controller back-end. The number of back-end I/Os per application write is always two for RAID 1; in OLTP applications it is generally four for RAID 5 and six for RAID 6.
This I/O multiplication on writes puts a further strain on back-end disks and buses in RAID controllers.
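The write-multiplication arithmetic above is easy to make concrete. Here is a minimal sketch (the function name is ours; the per-level I/O counts are the ones stated in the text, for the small-write OLTP case):

```python
# Back-end I/Os generated by one application write, per the text
# (RAID 5 and RAID 6 counts assume the small-write / OLTP case).
BACKEND_IOS_PER_WRITE = {"RAID 1": 2, "RAID 5": 4, "RAID 6": 6}

def backend_write_load(app_writes_per_sec, raid_level):
    """Back-end I/O rate implied by a given application write rate."""
    return app_writes_per_sec * BACKEND_IOS_PER_WRITE[raid_level]

# 1,000 application writes/s on RAID 5 become 4,000 back-end I/Os/s.
print(backend_write_load(1000, "RAID 5"))  # -> 4000
```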
Failover: RAID acts to protect data against disk failure, but a failure in a storage controller can prevent applications from accessing the protected data. As a result, all storage servers must implement mechanisms to protect against controller failure. In modular RAID servers, the RAID controller in its entirety is replicated for high availability. When one controller fails, the other controller assumes the failed controller's I/O load in addition to its own; when the failed controller returns to service, the I/O load is redistributed across the two controllers.
Cache mirroring: Read cache is inherently tolerant to cache memory failures because the data in the read cache is a copy of the "real" data on the back-end disks. Write cache, however, holds the sole copy of application data until it is written to disk. Therefore, write cache must be mirrored across controllers to protect data across failures that can affect cache memory: power failure, cache failure, and controller failure. This mirroring is generally done by forwarding write data to the other controller's cache across a dedicated inter-controller link before reporting to the application that the write has completed.
Providing dedicated links between controllers just for cache mirroring adds cost, but current back-end bus topologies allow no alternative; the added strain of cache mirroring traffic on congested back-end buses would produce unacceptable write performance and latency.
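The mirrored write-cache path described above can be sketched in a few lines. This is a hypothetical illustration only (class and method names are ours; real controller firmware is considerably more involved): the write is staged in both controllers' caches before the host sees a completion.

```python
# Hypothetical sketch of the mirrored write-back cache path.
class Controller:
    def __init__(self, name):
        self.name = name
        self.write_cache = {}   # block number -> data
        self.peer = None        # partner controller (set after creation)

    def host_write(self, block, data):
        self.write_cache[block] = data       # stage in local write cache
        self.peer.write_cache[block] = data  # mirror over the inter-controller link
        return "ack"                         # only now report completion to the host

a, b = Controller("A"), Controller("B")
a.peer, b.peer = b, a
a.host_write(7, b"payload")
# If controller A now fails, controller B still holds block 7.
assert b.write_cache[7] == b"payload"
```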
Back-end buses: Every FC-AL disk has two ports for connection to two independent FC-AL loops, and every JBOD runs two FC-AL buses to each disk slot, but this redundancy does not guarantee high availability. Any disk slot that does not contain a disk, or any disk that fails in such a way that it becomes non-responsive, can interrupt the continuity of both FC-AL loops and effectively cause all disks on those loops to fail. JBOD electronics must provide a way to disconnect an empty slot or a non-responsive disk from both of its FC-AL loops. This is usually accomplished by the use of a port bypass circuit (PBC). A single PBC can disconnect a single disk slot from one of its two FC-AL loops. The PBCs for each loop are generally placed on an interface electronics card along with the three-port hub used for external connections. Each JBOD has two of these electronics cards, one for each FC-AL loop in the shelf.
Even though a JBOD has two independent FC-AL loops running to each dual-ported disk, failures in the two FC-AL buses are not independent. Both FC-AL buses share common logic inside each disk, and that logic can fail in such a way that the disk issues excessive loop initializations, corrupts packets passing through it, or transmits out of turn, disrupting communication on one or both FC-AL loops. Port bypass circuits cannot detect or correct any of these conditions. As a result, the high-availability strategy of redundant components with failover is less effective on the back-end buses of a storage server than in any other part of the server architecture.
Advantages of Back-end Switches in Storage Server Design
Enhancing the back-end FC-AL buses of a storage server from a loop topology to a switched topology can significantly improve availability, performance, and even overall system cost.
Implementation issues had previously prevented designers from including back-end switches in storage servers. The switch implementations were too large, too expensive compared with the cost of a disk, had significant power requirements, and required adding complex fabric service code to the back-end software of the storage server. All these issues have disappeared with the introduction of an embedded storage switch on a single chip.
Given a pair of FC-AL loops that connect a pair of controllers to one or more JBODs, there are two back-end switching topologies that can be added. With intra-shelf switching, a pair of switches is embedded into each JBOD, converting it into an SBOD (switched bunch of disks). One switch takes the place of the multiple port bypass circuits and the three-port mini hub on the interface electronics card for each of the two FC-AL loops.
With inter-shelf switching, place an external switch on each loop connecting the controllers and disk shelves (JBODs or SBODs).
Performance: Inter-shelf switching has the implementation advantage that there are no changes required in the disk shelves. It has the performance advantage, in multi-shelf modular storage servers, of dramatically increasing the back-end bandwidth of the storage server. This is because the two controllers in the storage server can simultaneously communicate with two disks as long as the disks are not in the same disk shelf. The back-end bandwidth dividend resulting from inter-shelf switching is 50% when there are only two disk shelves and asymptotically approaches 100% as the number of shelves increases. Next-generation storage servers incorporating more than two controllers for scalability could see a bandwidth dividend well in excess of 100% from inter-shelf switching.
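One simple model that reproduces these dividend figures (our assumption, not spelled out in the text): if two controllers pick target disks independently and uniformly at random, the second transfer can proceed in parallel only when its target is in a different shelf, which happens with probability (S - 1) / S for S shelves.

```python
def inter_shelf_dividend(num_shelves):
    """Expected extra back-end bandwidth from inter-shelf switching.

    Assumed model: two controllers pick target disks uniformly at
    random; the second transfer overlaps the first only when it lands
    in a different shelf, i.e. with probability (S - 1) / S.
    """
    return (num_shelves - 1) / num_shelves

# 2 shelves -> 50%; the dividend approaches 100% as shelves are added.
for shelves in (2, 4, 8, 16):
    print(shelves, f"{inter_shelf_dividend(shelves):.0%}")
```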
Intra-shelf switching alone will also produce a bandwidth dividend, but only when each pair of FC-AL loops runs to a single disk shelf. In this case, the switch will allow the two controllers to communicate simultaneously with two disjoint disks in the shelf, resulting in a 90-95% bandwidth dividend if the shelf is full of disks.
In addition to the above bandwidth dividend from switching, both inter-shelf and intra-shelf switching reduce FC-AL transit time and thereby increase the inherent efficiency of the back-end buses. FC-AL transit time is the sum of all the delays in the loop caused by node logic (including elasticity buffers), port bypass circuits, cables, hubs, and switches in the loop path. For a 2Gbps FC-AL loop, the overhead that transit time adds to every I/O request can be approximated as the number of nodes (disks plus controllers) on the bus, multiplied by 1.08 microseconds for read requests or 1.4 microseconds for write requests. Given that the 8KB I/O operations common to OLTP workloads take only 41 microseconds of useful bus time, the efficiency of fully configured FC-AL loops can fall below 20%.
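The transit-time approximation above is easy to turn into numbers. A minimal sketch (the function name and the 126-node example, 124 disks plus 2 controllers, are our assumptions; the per-node delays and the 41-microsecond useful bus time come from the text):

```python
def loop_efficiency(num_nodes, op="write", useful_us=41.0):
    """Approximate FC-AL bus efficiency for one I/O request.

    Per the text's approximation for a 2Gbps loop: transit-time
    overhead per request ~= nodes * 1.08 us (read) or 1.40 us (write);
    41 us is the useful bus time of an 8KB OLTP transfer.
    """
    per_node_us = {"read": 1.08, "write": 1.40}[op]
    overhead_us = num_nodes * per_node_us
    return useful_us / (useful_us + overhead_us)

# Fully configured loop pair: 124 disks plus 2 controllers = 126 nodes.
print(f"{loop_efficiency(126, 'write'):.0%}")  # -> 19%, below the 20% cited
```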
A switched topology reduces transit time, and therefore increases efficiency, by effectively shortening the FC-AL loop: all nodes not in the direct switched path between initiator and target are bypassed. Inter-shelf switching increases efficiency more than intra-shelf switching when multiple disk shelves are involved. The benefits of back-end switching continue to grow as more disks and disk shelves are added to the back-end bus. The combination of intra- and inter-shelf switching produces over a 7:1 improvement in effective bandwidth for transaction processing workloads on a fully loaded back-end bus, as can be seen in the graph in Figure 2. Even in data warehousing workloads with large (64KB) transfers, the combination of inter-shelf and intra-shelf switching increases the effective back-end bus bandwidth by more than 3:1 in large configurations.
This increase in effective back-end bus bandwidth can be exploited in one or more of the following ways: improved performance, lower RAID reconstruction time leading to improved data availability, fewer back-end buses leading to reduced cost and/or wiring complexity, and simplification of cache mirroring.
Availability: Switches provide significant added functionality over port bypass circuits in isolating misbehaving disks that interfere with proper FC-AL loop operation. Port bypass circuits can only remove dead disks from a loop; they cannot detect a "crazy" disk that disrupts communication on one or both FC-AL loops. These failures cannot be isolated without a global view of all activities on the FC-AL loops. A controller, in conjunction with the environmental monitoring unit (EMU) in all shelves on the affected loops, must isolate the failure to the malfunctioning disk and then disconnect that disk slot from one or both FC-AL buses. Unfortunately, port bypass circuits provide no information as to which drive is malfunctioning, so the storage server designer must improvise. One method used today is to disable every disk on the affected loop, one at a time, using custom logic designed into the shelves, checking each time to see if the problem goes away. This method is complex to implement, disruptive to ongoing I/O operations, and only works if the disk failure is a hard failure; it will not isolate an intermittent failure.
Because a switch must receive and analyze all incoming packets to determine packet routing, a switch in the shelf packaging of a storage server is in an ideal position to know that a disk is violating low level bus standards. Switches provide some natural fault isolation of misbehaving nodes because they do not route packets through any node other than the one being addressed, but they can also log the detection of improper packets or bus protocol violations for later examination by the EMU. An inter-shelf switch will track any failure to a single shelf, and an SBOD will track any failure to a single disk slot. This is a superior way of finding misbehaving disks, and the only practical way to track down an intermittent fault.
A switch is, of course, a single point of failure for a single FC-AL loop, but no more so than the multiple port bypass circuits and/or hub that it replaces. By enabling true failure independence between FC-AL loops, switches increase the availability of the back-end buses of storage servers to the same level as the other redundant, failure-independent components of the storage server.
Cost: Storage server vendors typically recommend configuring a limited number of disks per storage server, less than the maximum of 124 disks per loop pair supported by FC-AL, due to concerns about performance. If users need to add drives past this limit and maintain performance, they are obliged to add more storage servers, which increases the cost of purchasing and managing storage. By adding back-end switching and reducing the number of storage servers, the vendor can lower the total system cost to the end user by 15-20% without compromising performance.
Replacing the loop topology currently used to connect storage servers with their disk drives with a switched topology can bring significant gains in performance, availability, and total system cost. The technology to create switched back-end topologies was not available until Vixel introduced its InSpeed Technology in 2001. InSpeed Technology allows RAID controller vendors to incorporate back-end switching in new storage server designs, and to turbocharge existing storage server designs and even existing storage servers via field upgrade, because InSpeed Technology can change a loop topology to a switched topology transparently to the firmware in the storage server.
[FIGURE 2 OMITTED]
Richard Lary is an independent consultant in the storage industry and former technical director for storage at Compaq and Digital.
Publication: Computer Technology Review
Date: Jul 1, 2002