Back-end switching in storage server design: improves the performance and availability of storage systems. (High Availability).


Moving from a shared back-end bus structure to a switch-based backend structure in the design of RAID servers or file servers can significantly enhance the performance and availability of these systems.

Until recently, including a back-end switch as an essential component of a storage server was not practical because the power, cooling, packaging, and cost requirements of a back-end switch made it unreasonable. In addition, adding a back-end switch would have required significant changes to the firmware of the storage server. However, recent changes in back-end switch design now make it practical and feasible.

To set the stage for illustrating the benefits of adding a back-end switch to a storage server, we will first discuss design issues of conventional storage servers. A storage server consists of one or more "controllers" that actually deliver the storage service, plus the packaging that holds those controllers and their back-end disks. Storage servers are designed to provide a variety of value-added services, but a primary goal they all share is to enhance the characteristics of their individual disk drives in the areas of performance (bandwidth, throughput, and latency) and RAS (reliability, availability, and scalability).

We will discuss storage server design issues using the example of modular RAID servers, which represent the majority of enterprise storage servers shipped worldwide. In general, the same design issues arise in monolithic RAID servers such as the EMC Symmetrix, HDS Lightning, and IBM Shark, and in enterprise file servers (NAS). We will then demonstrate how incorporating a back-end switch into the design of a storage server can lead to significant performance and availability improvements.

Modular RAID Server Design

Modular RAID servers are available from many manufacturers and they are all variants on a common theme. As shown in Figure 1, each controller in a modular RAID server consists of the following functional elements:

* Processor on a control bus with local memories for programs (program load memory) and control structures (control memory).

* Data bus, data/cache memory and data retention system.

* Host and disk interfaces.

* Cache mirror interface connecting to the other controller in the server.

* Parity computation logic.

All of these functional elements are integrated into a single controller that is replicated in its entirety to create a high-availability server. The controller's processing power, memory bandwidth, number of host interfaces, and number of disk interfaces are all fixed, although the amount of cache memory and the number of back-end disks in the server may be upgradeable.

Common elements of a modular RAID server include host and disk interfaces, processor and data bus, cache, modular disk packaging, and high-availability mechanisms.

Host and disk interfaces: The most prevalent host interface found today in modern modular RAID controllers is Fibre Channel, and the most prevalent disk interface is FC-AL (Fibre Channel Arbitrated Loop). Modular RAID controllers generally have two host interfaces. The number of disk interfaces determines the maximum number of disks the controller can connect to as well as how much performance the controller can get from requests that miss its cache.

Processor and data path design: The RAID controller processor has a private bus connecting to a memory that holds code and control structures, which ensures that the processor's memory traffic does not interfere with data traffic through the controller. The two controllers in a modern modular storage server have enough processing power to operate 120-500 of the fastest available back-end disks at full speed, and processing power is rarely a performance bottleneck in these controllers. The data bus connects the host interfaces, the disk interfaces, the data/cache memory, and the cache mirror interface in a manner that is optimized for burst data traffic. A pair of modern controllers has enough combined data bus and data memory bandwidth to transfer data between six 2Gbps back-end FC-AL buses and six 2Gbps front-end Fibre Channel fabrics at full speed, which removes internal bandwidth as a performance bottleneck.

With the recent and continuing improvements in commodity processor, memory, and bus performance, back-end bus efficiency and disk drive connectivity have become the most significant areas in which storage server designers can differentiate themselves on performance.

Cache: Controller cache provides two functions--read caching and write caching. Read caching improves latency and throughput by holding disk data that is anticipated to be read by applications. Write caching captures write data in the cache instead of writing it immediately to disk, thus providing the illusion of low-latency disk writes. Customers buy lots of controller cache believing it will help their application performance. The performance improvements on real-world workloads due to adding more than the minimum amount of controller cache, however, are far less than most customers (and some storage server designers) believe!

Modern operating systems and database systems have figured out how to use today's large host memories to effectively cache application data requests close to the application. The read requests that miss in the application server caches and are issued to the storage server controller have a very low probability of hitting in the controller's read cache. Database systems even cache writes effectively and safely by using journaling techniques.

Given the limitations of storage controller cache in improving I/O throughput and latency, the performance of back-end buses and disks becomes much more important. As back-end disks get faster, there is more strain on back-end buses to maintain low bus latency in the face of high bus utilization.

Modular disk packaging: This is critical to the availability of a modular RAID server; it must provide reliable power, cooling, and interconnect, and prevent failures in a single disk from affecting other disks. The most popular form of enterprise disk packaging is the modular disk shelf or JBOD (Just a Bunch of Disks). The JBODs and controllers are connected to an external FC-AL hub or are daisy-chained using a three-port hub built into each JBOD and controller.

High-availability mechanisms: High availability in general is achieved by component replication, failure independence, failover, and online component replacement. In a modular RAID server there are five components that these high availability mechanisms apply to: the disks themselves, the controllers, the cache, the packaging (power supplies and fans), and the back-end disk buses.

RAID

All storage servers use some form of RAID to protect data from disk failures through disk redundancy. Because all RAID implementations must store redundant information to reconstruct application data if a disk fails, a single application write causes multiple I/O operations on the controller back end. The number of back-end I/Os per application write is always two for RAID 1; in OLTP (online transaction processing) applications it is generally four for RAID 5 and six for RAID 6.

This I/O multiplication on writes puts a further strain on back-end disks and buses in RAID controllers.
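The write multiplication described above can be turned into a rough back-end load estimate. The following is a minimal sketch, not a sizing tool: the function name and the sample workload are hypothetical, and it assumes every host read misses the controller cache and costs exactly one back-end I/O.

```python
# Back-end I/Os generated by one application write, using the figures
# quoted in the text (the RAID 5 and RAID 6 values are the typical OLTP case).
BACKEND_IOS_PER_WRITE = {"RAID 1": 2, "RAID 5": 4, "RAID 6": 6}


def backend_iops(read_iops, write_iops, raid_level):
    """Return the back-end I/O rate implied by a host workload.

    Assumption (illustrative only): each host read costs one back-end I/O,
    and each host write costs the multiplier for the chosen RAID level.
    """
    return read_iops + write_iops * BACKEND_IOS_PER_WRITE[raid_level]


if __name__ == "__main__":
    # Hypothetical workload: 7,000 reads/s and 3,000 writes/s from hosts.
    for level in BACKEND_IOS_PER_WRITE:
        print(level, backend_iops(7000, 3000, level))
```

Under these assumptions the same host workload nearly doubles its back-end I/O rate when moving from RAID 1 to RAID 6, which is the strain the paragraph above refers to.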

Failover: RAID acts to protect data against disk failure, but a failure in a storage controller can prevent applications from accessing the protected data. As a result, all storage servers must implement mechanisms to protect against controller failure. In modular RAID servers, the RAID controller in its entirety is replicated for high availability. When one controller fails, the other controller assumes the failed controller's I/O load in addition to its own; when the failed controller returns to service, the I/O load is redistributed across the two controllers.

Cache mirroring: Read cache is inherently tolerant to cache memory failures because the data in the read cache is a copy of the "real" data on the back-end disks. Write cache, however, holds the sole copy of application data until it is written to disk. Therefore, write cache must be mirrored across controllers to protect data across failures that can affect cache memory: power failure, cache failure, and controller failure. This mirroring is generally done by forwarding write data to the other controller's cache across a dedicated inter-controller link before reporting to the application that the write has completed.
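The ordering that matters in write-cache mirroring (store locally, forward to the peer, and only then acknowledge the host) can be sketched as a toy model. The class and method names here are hypothetical and do not describe any vendor's firmware; refusing the write when no mirror is available is a simplification chosen for the example.

```python
class Controller:
    """Toy model of one RAID controller's write cache (hypothetical names)."""

    def __init__(self, name):
        self.name = name
        self.write_cache = {}  # block address -> data not yet written to disk
        self.peer = None       # the other controller in the pair

    def mirror_in(self, address, data):
        # Stands in for a transfer over the dedicated inter-controller link.
        self.write_cache[address] = data
        return True

    def host_write(self, address, data):
        self.write_cache[address] = data
        # The write is acknowledged only after the peer holds a second copy.
        if self.peer is None or not self.peer.mirror_in(address, data):
            del self.write_cache[address]
            raise RuntimeError("no mirror available; write not cached")
        return "ack"


if __name__ == "__main__":
    a, b = Controller("A"), Controller("B")
    a.peer, b.peer = b, a
    print(a.host_write(100, b"payload"), 100 in b.write_cache)
```

Because the acknowledgment waits on the peer copy, the latency of the inter-controller link is added to every cached write, which is why that link's bandwidth and congestion matter.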

Providing dedicated links between controllers just for cache mirroring adds cost, but current back-end bus topologies allow no alternative; the added strain of cache mirroring traffic on congested back-end buses would produce unacceptable write performance and latency.

Back-end buses: Every FC-AL disk has two ports for connection to two independent FC-AL loops, and every JBOD runs two FC-AL buses to each disk slot, but this redundancy does not guarantee high availability. Any disk slot that does not contain a disk, or any disk that fails in such a way that it becomes non-responsive, can interrupt the continuity of both FC-AL loops and effectively cause all disks on those loops to fail. JBOD electronics must provide a way to disconnect an empty slot or a non-responsive disk from both of its FC-AL loops. This is usually accomplished by the use of a port bypass circuit (PBC). A single PBC can disconnect a single disk slot from one of its two FC-AL loops. The PBCs for each loop are generally placed on an interface electronics card along with the three-port hub used for external connections. Each JBOD has two of these electronics cards, one for each FC-AL loop in the shelf.

Even though a JBOD has two independent FC-AL loops running to each dual-ported disk, failures in the two FC-AL buses are not independent. Both FC-AL buses share common logic inside each disk, and that logic can fail in such a way that the disk issues excessive loop initializations, corrupts packets passing through it, or transmits out of turn, disrupting communication on one or both FC-AL loops. Port bypass circuits cannot detect or correct any of these conditions. As a result, the high-availability strategy of redundant components with failover is less effective on the back-end buses of a storage server than in any other part of the server architecture.
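The limits of port bypass circuits can be illustrated with a small model of one loop. This is a sketch under stated assumptions: the slot state names ("ok", "empty", "dead", "babbling") and both functions are invented for the example, and "babbling" stands for any disk that is powered and responsive but disrupts the loop.

```python
def pbc_bypassed(slot_state):
    """A port bypass circuit reacts only to an empty slot or a silent disk."""
    return slot_state in ("empty", "dead")


def loop_is_usable(slot_states, use_pbc=True):
    """Return True if traffic can circulate cleanly around the whole loop."""
    for state in slot_states:
        if use_pbc and pbc_bypassed(state):
            continue  # the slot is cut out of the loop
        if state != "ok":
            return False  # this slot breaks or disrupts the entire loop
    return True


if __name__ == "__main__":
    print(loop_is_usable(["ok", "empty", "ok"], use_pbc=False))
    print(loop_is_usable(["ok", "empty", "dead", "ok"]))
    print(loop_is_usable(["ok", "babbling", "ok"]))
```

In this model the bypass circuits rescue the loop from empty slots and dead disks, but a single misbehaving disk still takes the loop down, which matches the failure mode described above.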

Advantages of Back-end Switches in Storage Server Design

Enhancing the back-end FC-AL buses of a storage server from a loop topology to a switched topology can significantly improve availability and performance, and can even reduce overall system cost.

Implementation issues had previously prevented designers from including back-end switches in storage servers. The switch implementations were too large, were too expensive compared with the cost of a disk, had significant power requirements, and required adding complex fabric service code to the back-end software of the storage server. All these issues have disappeared with the introduction of an embedded storage switch on a single chip.

Given a pair of FC-AL loops that connect a pair of controllers to one or more JBODs, there are two back-end switching topologies that can be added. With intra-shelf switching, embed a pair of switches into each JBOD, converting it into an SBOD (switched bunch of disks). One switch would take the place of the multiple port bypass circuits and the three-port mini hub on the interface electronics card for each of the two FC-AL loops.

With inter-shelf switching, place an external switch on each loop connecting the controllers and disk shelves (JBODs or SBODs).

Performance: Inter-shelf switching has the implementation advantage that there are no changes required in the disk shelves. It has the performance advantage, in multi-shelf modular storage servers, of dramatically increasing the back-end bandwidth of the storage server. This is because the two controllers in the storage server can simultaneously communicate with two disks as long as the disks are not in the same disk shelf. The back-end bandwidth dividend resulting from inter-shelf switching is 50% when there are only two disk shelves and asymptotically approaches 100% as the number of shelves increases. Next-generation storage servers incorporating more than two controllers for scalability could see a bandwidth dividend well in excess of 100% from inter-shelf switching.

Intra-shelf switching alone will also produce a bandwidth dividend, but only when each pair of FC-AL loops runs to a single disk shelf. In this case, the switch will allow the two controllers to communicate simultaneously with two disjoint disks in the shelf, resulting in a 90-95% bandwidth dividend if the shelf is full of disks.
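One simple way to reason about these bandwidth dividends is to assume that each of the two controllers addresses a target chosen uniformly at random, and that their transfers overlap whenever the targets differ. The article does not state the model behind its figures, so this is only a sketch; the function name is hypothetical. Under that assumption the dividend is 1 - 1/n, where n is the number of shelves (inter-shelf switching) or the number of disks in the shelf (intra-shelf switching).

```python
def bandwidth_dividend(n_targets):
    """Expected extra bandwidth from switching with two controllers.

    Assumption: each controller picks one of n_targets uniformly at random,
    and the two transfers run in parallel unless both pick the same target.
    """
    if n_targets < 1:
        raise ValueError("need at least one target")
    return 1.0 - 1.0 / n_targets


if __name__ == "__main__":
    for shelves in (2, 4, 8):
        print("shelves:", shelves, "dividend:", bandwidth_dividend(shelves))
    for disks in (10, 20):
        print("disks per shelf:", disks, "dividend:", bandwidth_dividend(disks))
```

With two shelves this gives the 50% quoted above, and for full shelves of roughly 10 to 20 disks it gives values in the 90-95% range; the shelf sizes are an assumption used to show the fit, not figures from the text.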

In addition to the above bandwidth dividend from switching, both inter-shelf and intra-shelf switching reduce FC-AL transit time and thereby increase the inherent efficiency of the back-end buses. FC-AL transit time is the sum of all the delays in the loop caused by node logic (including elasticity buffers), port bypass circuits, cables, hubs, and switches in the loop path. For a 2Gbps FC-AL loop, the overhead that transit time adds to every I/O request can be approximated as the number of nodes (disks plus controllers) on the bus multiplied by 1.08 microseconds for read requests or 1.4 microseconds for write requests. Given that the 8KB I/O operations common to OLTP workloads take only 41 microseconds of useful bus time, the efficiency of fully configured FC-AL loops can fall below 20%.
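The efficiency figure follows directly from the approximation in the preceding paragraph. The sketch below applies it to a fully configured loop, taking the node count as 124 disks plus 2 controllers; that count is an assumption based on the 124-disk loop limit cited elsewhere in this article, and the function name is hypothetical.

```python
# Per-node transit overhead on a 2Gbps FC-AL loop, in microseconds,
# as quoted in the text.
TRANSIT_US_PER_NODE = {"read": 1.08, "write": 1.4}
USEFUL_US_8KB = 41.0  # useful bus time for one 8KB transfer


def loop_efficiency(nodes, op, useful_us=USEFUL_US_8KB):
    """Fraction of bus time that moves data for a single I/O request."""
    overhead_us = nodes * TRANSIT_US_PER_NODE[op]
    return useful_us / (useful_us + overhead_us)


if __name__ == "__main__":
    full_loop = 124 + 2  # disks plus controllers (assumed configuration)
    print("read: ", round(loop_efficiency(full_loop, "read"), 3))
    print("write:", round(loop_efficiency(full_loop, "write"), 3))
```

For 8KB writes on this assumed configuration the overhead is 126 x 1.4 = 176.4 microseconds against 41 microseconds of useful time, so efficiency comes out just under 20%; reads fare slightly better.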

A switch topology reduces transit time, and therefore increases efficiency, by effectively shortening the FC-AL loop: it bypasses all nodes not in the direct switched path between initiator and target. Inter-shelf switching increases efficiency more than intra-shelf switching when multiple disk shelves are involved. The benefits of back-end switching continue to grow as more disks and disk shelves are added to the back-end bus. The combination of intra- and inter-shelf switching produces over a 7:1 improvement in effective bandwidth for transaction processing workloads on a fully loaded back-end bus, as can be seen in the graph in Figure 2. Even in data warehousing workloads with large (64KB) transfers, the combination of inter-shelf and intra-shelf switching increases the effective back-end bus bandwidth by more than 3:1 in large configurations.
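How the benefit grows with the number of disks can be sketched by comparing transit overhead on a full loop with that on a switched path. The assumptions are loud ones: the switched path is taken to contain only the initiator and the target (2 nodes), the switch itself is assumed to add no delay, and the per-node and useful-time constants are the 8KB write figures quoted earlier in the article. This isolates the transit-time effect only and is not an attempt to reproduce Figure 2.

```python
def path_efficiency(nodes_in_path, per_node_us=1.4, useful_us=41.0):
    """Fraction of bus time spent moving data for one 8KB write."""
    return useful_us / (useful_us + nodes_in_path * per_node_us)


def switching_gain(disks_on_bus, controllers=2):
    """Ratio of switched-path efficiency to full-loop efficiency.

    Assumption: with switching, only the initiator and target (2 nodes)
    contribute transit delay; switch latency is ignored.
    """
    looped = path_efficiency(disks_on_bus + controllers)
    switched = path_efficiency(2)
    return switched / looped


if __name__ == "__main__":
    # Hypothetical configurations, from one small shelf up to a full loop.
    for disks in (14, 28, 56, 124):
        print(disks, "disks -> gain", round(switching_gain(disks), 2))
```

The gain rises steadily with the disk count, which is consistent with the statement that switching pays off most on heavily populated buses; any bandwidth dividend from parallel transfers would come on top of this.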

This increase in effective back-end bus bandwidth can be exploited in one or more of the following ways: improved performance, lower RAID reconstruction time leading to improved data availability, fewer back-end buses leading to reduced cost and/or wiring complexity, and simplification of cache mirroring.

Availability: Switches provide significant added functionality over port bypass circuits in isolating misbehaving disks that interfere with proper FC-AL loop operation. Port bypass circuits can only remove dead disks from a loop; they cannot detect a crazy disk that disrupts communication on one or both FC-AL loops. These failures cannot be isolated without a global view of all activities on the FC-AL loops. A controller, in conjunction with the environmental monitoring unit (EMU) in all shelves on the affected loops, must isolate the failure to the malfunctioning disk and then disconnect that disk slot from one or both FC-AL buses. Unfortunately, port bypass circuits provide no information as to which drive is malfunctioning, so the storage server designer must improvise. One method used today is to disable every disk on the affected loop, one at a time, using custom logic designed into the shelves, checking each time to see if the problem goes away. This method is complex to implement, disruptive to ongoing I/O operations, and only works if the disk failure is a hard failure; it will not isolate an intermittent failure.
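The one-at-a-time search described above, and the reason it fails on intermittent faults, can be sketched as follows. The function names and the probe callback are hypothetical; a real implementation would drive shelf hardware and observe loop traffic instead of calling a Python function.

```python
def isolate_by_bypass(slots, loop_healthy_without):
    """Bypass one disk at a time; blame the first slot whose removal
    makes the loop look healthy. Returns None if no slot qualifies."""
    for slot in slots:
        if loop_healthy_without(slot):
            return slot
    return None


def make_probe(bad_slot, fault_active):
    """Build a simulated probe for a loop with one faulty disk."""
    def probe(bypassed_slot):
        # The loop looks healthy if the bad disk is bypassed, or if its
        # fault simply is not occurring while we watch.
        return bypassed_slot == bad_slot or not fault_active()
    return probe


if __name__ == "__main__":
    slots = range(8)
    hard = make_probe(bad_slot=5, fault_active=lambda: True)
    quiet = make_probe(bad_slot=5, fault_active=lambda: False)
    print("hard failure blamed on slot", isolate_by_bypass(slots, hard))
    print("intermittent fault blamed on slot", isolate_by_bypass(slots, quiet))
```

With a hard failure the search lands on the right slot, but only after bypassing several healthy disks along the way. When the fault happens to be quiet during the test, the first disk bypassed appears to be the culprit, so the search blames an innocent disk.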

Because a switch must receive and analyze all incoming packets to determine packet routing, a switch in the shelf packaging of a storage server is in an ideal position to know that a disk is violating low-level bus standards. Switches provide some natural fault isolation of misbehaving nodes because they do not route packets through any node other than the one being addressed, but they can also log the detection of improper packets or bus protocol violations for later examination by the EMU. An inter-shelf switch will track any failure to a single shelf, and an SBOD will track any failure to a single disk slot. This is a superior way of finding misbehaving disks, and the only practical way to track down an intermittent fault.
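The logging idea can be pictured as a per-port violation counter that the EMU reads later. This is a minimal sketch with invented names (`SwitchPortLog`, `record`, `suspects`); it does not describe the interface of any actual switch chip.

```python
from collections import Counter


class SwitchPortLog:
    """Hypothetical per-port record of bad frames seen by a shelf switch."""

    def __init__(self):
        self.violations = Counter()  # port number -> violation count

    def record(self, port, frame_ok):
        # The switch inspects every incoming frame anyway, so counting
        # the bad ones per port costs little.
        if not frame_ok:
            self.violations[port] += 1

    def suspects(self, threshold=1):
        """Ports with at least `threshold` logged violations, sorted."""
        return sorted(p for p, n in self.violations.items() if n >= threshold)


if __name__ == "__main__":
    log = SwitchPortLog()
    # Simulated traffic: port 3 emits an occasional bad frame.
    for port, ok in [(1, True), (3, False), (2, True), (3, True), (3, False)]:
        log.record(port, ok)
    print(log.suspects(threshold=2))
```

Because the counts accumulate over time, a fault that appears only occasionally still leaves a trail tied to one port, which is what makes intermittent failures traceable in this scheme.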

A switch is, of course, a single point of failure for a single FC-AL loop, but no more so than the multiple port bypass circuits and/or hub that it replaces. By enabling true failure independence between FC-AL loops, switches increase the availability of the back-end buses of storage servers to the same level as the other redundant, failure-independent components of the storage server.

Cost: Storage server vendors typically recommend configuring a limited number of disks per storage server, less than the maximum of 124 disks per loop pair supported by FC-AL, due to concerns about performance. If users need to add drives past this limit and maintain performance, they are obliged to add more storage servers. This increases the cost of purchasing and managing storage. By adding back-end switching and reducing the number of storage servers, the vendor can lower the total system cost to the end user by 15-20% without compromising performance.

Replacing the loop topology currently used to connect storage servers to their disk drives with a switched topology can bring significant gains in performance, availability, and total system cost. The technology to create switched back-end topologies was not available until Vixel introduced its InSpeed Technology in 2001. InSpeed Technology allows RAID controller vendors to incorporate back-end switching in new storage server designs, to turbocharge existing storage server designs, and even to upgrade existing storage servers in the field, because InSpeed Technology can change a loop topology to a switched topology transparently to the firmware in the storage server.

[FIGURE 2 OMITTED]

Richard Lary is an independent consultant in the storage industry and former technical director for storage at Compaq and Digital.
COPYRIGHT 2002 West World Productions, Inc.
No portion of this article can be reproduced without the express written permission from the copyright holder.
Copyright 2002, Gale Group. All rights reserved. Gale Group is a Thomson Corporation Company.

Article Details
Author:Lary, Richard
Publication:Computer Technology Review
Geographic Code:1USA
Date:Jul 1, 2002
Words:2870