Roundtable: "switched on storage arrays": Part 2 of 3.
Mark Ferelli: How, in very real terms, would a switched-architecture solution enable providers to improve reliability, availability, and serviceability? Our favorite new acronym is RAS (Reliability-Availability-Serviceability) of storage architectures, and I'd like to identify some of the more serious RAS issues that a switched architecture can solve. Brian and Bob, I'd like you to start on this.
Brian Reed: Even though today's systems are highly reliable, there are some nuances of problems that you see in shared-based architectures. In any type of shared-based architecture, every read/write request has to move to every individual drive before a completion, which causes latency and congestion issues. And some of the biggest RAS issues you see are things like--even though Fibre Channel drives have dual ports--you get dual loop failures because there are individual parts on a drive that can bring down both loops.
You have things like rogue and flaky drives, which don't necessarily fail but may imminently fail, and it's very difficult to diagnose those problems and, therefore, it's very hard to prevent these problems from happening. Switched-based architectures basically isolate drives and solve those types of problems today.
Mark Ferelli: Bob, what do you think?
Bob Rumer: There are a couple of things that haven't been mentioned yet--or haven't been directly called out. One of the main issues here that gives you the diagnostics benefit is that we're moving away from JBODs manufactured with PBCs--port bypass circuits. These are gigabit analog muxes. When your signal is always in the serial domain, you rarely have access to that information, and diagnostics is an added cost. It is always fairly difficult to implement.
What we're seeing in a switched architecture is a SERDES-based disk array, and because they are SERDES-based we have parallel data going in and then coming out of the SERDES that can be monitored for a variety of diagnostics and capabilities. And that really is transforming the industry. Vitesse has provided an awfully large part of the world's PBCs. We'll never build another one again.
The second thing is that many of these systems are also deployed with more sophisticated enclosure management, which is the CPU that a company is referred. Vitesse's emphasis has always been in-band management that can play over Fibre Channel and mitigate diagnostics to drive autonomously. So the entire level of sophistication of the whole system is going up, and the end user benefits as a whole.
Mark Ferelli: Bob, does the enclosure issue impact the total cost of ownership of the device?
Bob Rumer: The initial hardware cost is intact, but the total cost of ownership drops drastically with all the benefits that you've already heard, again, from our system friends.
Mark Ferelli: Speaking of systems, James, how do you at HP look at switched architectures for reliability, availability, and serviceability issues?
James Myers: There are two situations we've seen where it's really been beneficial. The first is that before, when we used to have a couple hundred drives on the loop, if you got a rogue hard drive, it would many times go through a myriad of errors and error recovery, and it could actually tie up, saturate, and just dominate the entire loop.
Now, any other production application that's trying to do I/Os to the remaining couple of hundred hard drives begins to just bog down; service times for users are impacted and, basically, it's a semi-failure situation for customers.
The other thing is, many customers need instantaneous growth in their environment and they would like to do that any time of the day, any day of the week.
With all of these couple of hundred drives on a Fibre Channel loop, when you interrupt the loop to do an addition, you basically still have access but you sort of cause things to bog down. And, again, customers don't want to have to wait until the wee hours of the morning to do capacity expansions and those sorts of things, so by having a switched backend architecture, you can be assured that you can add capacity on the fly, in real time, without any degradation to your application.
Mark Ferelli: Then you can take a real time--or real byte, rather--out of a lot of the latency problems in terms of installations and implementation?
James Myers: Right. Exactly.
Mark Ferelli: Mark, where do you see the reliability issue sitting and the serviceability ones?
Mark Nossokoff: Yes, some of the errors that occur in the arbitrated loop situations--when they occur--they kind of get congregated down and around the loop and it is hard to identify specifically where the individual problem is. Through implementing these loop switches in the JBOD, we'll be able to really pinpoint where a rogue drive actually resides and the drive can be more quickly diagnosed and more quickly serviced.
Mark Ferelli: Chris, as a solution supplier yourself, where are you impacted by the RAS issue?
Chris Bennett: Well, if you look at switches themselves, they often provide the ability to have a collection of link events where you can actually detect what's failing or on the way to failing. That provides you--if the system is designed for online diagnostics access--with the ability to do some preemptive diagnostic work, fast fault isolation or, even better, predict that the device is going to fail before it actually happens.
This provides a richer tool set that allows you to go in and probe what's going on in the system. The net result of all that, of course, is a higher availability system, but it just provides a richer access, a more pinpointed, targeted look at what it is that's going on. So I think that's another big win.
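As a minimal sketch of the kind of link-event collection Chris Bennett describes, the Python below counts events per drive port and flags any port that crosses a threshold. The event names, the port numbers, the threshold, and the function name `flag_suspect_ports` are all hypothetical, chosen only for illustration; they are not taken from any vendor's switch or enclosure firmware.

```python
from collections import Counter

def flag_suspect_ports(link_events, threshold):
    """Return the sorted list of ports whose event count reaches the threshold.

    link_events is an iterable of (port, event_name) tuples. Both the tuple
    shape and the event names are assumptions made for this sketch.
    """
    counts = Counter(port for port, _event in link_events)
    return sorted(port for port, n in counts.items() if n >= threshold)

# Hypothetical event log: port 3 is logging repeated link errors.
events = [
    (3, "crc_error"),
    (7, "loss_of_sync"),
    (3, "crc_error"),
    (3, "loss_of_sync"),
    (12, "crc_error"),
]

# An assumed policy: three or more events marks a port for preemptive service.
suspects = flag_suspect_ports(events, threshold=3)
print(suspects)
```

With this sample log, only port 3 is flagged. Because a switch sees each drive port individually, counts like these can be attributed to one drive rather than to the loop as a whole.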
Mark Ferelli: Does it give the opportunity to short stop some of those problems before you have to dispatch the field engineer?
Chris Bennett: Yes, that's the whole idea.
Mark Ferelli: Good deal. How does switching enable storage solution providers like yourself and your customers to impact latency? Latency is a very, very, very big issue, especially as you are trying to scale. Are there applications that are more likely to benefit from the increased performance that a switched architecture provides and, if so, what are they? Jim, would you like to lead off on this one?
Jim Beckman: Sure. Moving to a switched architecture provides more paths, obviously, to the back-end--multiple non-blocking paths--so you have better access to the drive. It is the size of the drive that allows you to better utilize that drive. In certain workloads, it allows faster access to the back-end so that you can accommodate those workloads to maintain consistent performance.
So, from the workload standpoint, obviously the random type of workload is going to benefit more from a switched architecture and can allow better access at the drive level to get to that data and get it back out to the end-user.
Mark Ferelli: Chris, what do you think?
Chris Bennett: I think the big win for the decreased latency is applications that are latency sensitive. I think the other big win is if you are doing some heavy streaming workloads where keeping the link fully utilized is also important. I think those are two applications where the decreased latency and increased performance would really help.
Mark Ferelli: James, HP is very application-sensitive and I hear it in all areas of storage where you play. Where do you see the applications that are likeliest to benefit from what switched architectures have to offer?
James Myers: I mentioned earlier that we do a lot of benchmarking. We did the SPC benchmark and saw a dramatic increase and we, for a while, were scratching our heads as to "What's going on here? Why are we seeing such a dramatic 40 percent increase in I/O performance?"
Well, what we found was that under intense transaction and data streaming workloads, the actual sporadic write bursts required to flush the cache memory of the controllers were actually being smoothed out by the back-end switched architectures to increase the overall I/Os per second performance. So, that was a very interesting phenomenon that took us a while to sort out but was real.
Mark Ferelli: Did file size make a difference?
James Myers: In some cases, yes. Again, in data streaming files that were being sequentially accessed, it had a dramatic impact.
Mark Ferelli: Mark, where do you see the calls from the clients?
Mark Nossokoff: We expect to see the performance improvements in the OLTP space, where you have lots of spindles, and in the larger configurations where you've got hundreds of drives in your system today. In traditional arbitrated loop architectures, you've got a latency for each hop along the loop, which can be over 100 drives along the loop. With the switch back there and the loop switch on it, it reduces the 100 hops down to a handful, or less than a handful of hops, depending on where the drive is in the topology configuration.
So the latencies get drastically reduced and those are, again, typically in the larger configurations where you have lots of spindles operating back there.
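The hop arithmetic Mark Nossokoff describes can be sketched in a few lines of Python. The per-hop delay below is an assumed placeholder, not a measured figure for any drive or loop switch, and the hop counts (100 on the loop, 4 through the switch) simply stand in for "over 100" and "a handful."

```python
def path_latency_us(hops, per_hop_delay_us):
    """Total path latency as hop count times a fixed per-hop delay."""
    return hops * per_hop_delay_us

# Assumed per-hop delay in microseconds, for illustration only.
PER_HOP_DELAY_US = 0.25

loop_latency = path_latency_us(100, PER_HOP_DELAY_US)     # arbitrated loop
switched_latency = path_latency_us(4, PER_HOP_DELAY_US)   # switched back-end

print(loop_latency, switched_latency, loop_latency / switched_latency)
```

Under these assumptions the loop path costs 25 microseconds against 1 microsecond through the switch. Whatever the real per-hop delay is, the ratio depends only on the hop counts, which is why the gain shows up mainly in large configurations.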
Mark Ferelli: Okay, thank you. Now, let's get to one of the harder questions and it is a two-parter. We have focused on technology to this point. I now would like to know how does this back-end switched architecture impact costs for the better for the storage service supplier? That's something to really dig into and, Jim, why don't you take a swing at that?
Jim Beckman: What we find is that with the switched architecture, from a cost standpoint, it allows us to put a lot more storage behind a controller and, obviously, drive down the overall cost of the storage solution.
It also allows us to keep up. As the drives continue to grow in capacity, there's a lot of apprehension from a customer standpoint on moving to a larger drive size from an areal density standpoint. They have a lot of fear along those lines and the switched architecture has allowed us to continue to move to larger drive sizes, maintain consistent or increased performance to those customers, make them feel comfortable with the larger drive sizes, and allow them to move forward with this technology. This then, obviously, allows us to reduce the manufacturing cost of our products to the end user.
Mark Ferelli: Mark, from LSI's point of view, where are these driving?
Mark Nossokoff: We see a lot of the request and desire for this from a financial perspective, in improving the margins on the service call aspects of it. The expectation is that there'll be fewer service calls, better diagnostic ability, and less need for problem resolution, which means less need to send field engineers out to customer sites and lower overall service costs, both to the end user and in the service and support overhead of the storage supplier.
Mark Ferelli: Actually, margin brings up an issue, though. One of the big questions would be how this would open up room for improved margins for storage providers as long as end users are benefited with better diagnostics and a lesser frequency in servicing.
Arun, you have done some homework in this space. How does this open up room for those margins?
Arun Taneja: As we all know, a very large percentage of repair time is actually spent in diagnostics. So anytime you improve the diagnostics, you reduce the time for repair, and the lower MTTR means the service providers can actually enjoy better margins.
I think in reality what's going to happen is the improved margins are going to be shared between the storage vendor who's providing the service, and also the end user who enjoys some lower prices.
This is a kind of situation where you can therefore have a win-win: better margins for the vendor and you have a lower price for the end-user, and the quality of the product is better, the uptime is better, and all of those kinds of things. So this is a win-win on both sides and it's not just on the service provider side. By the way, I think the same thing also applies on the product price side, because as it has already been stated: you can put more drives behind a single controller, you get better performance and you can put bigger drives. So, actually, what it does is it makes your product more scalable and more competitive so you can price it better and this, therefore, results in more competitiveness. I think what it does is it improves margins for the storage providers as well. So, I really see a big, huge win-win in all directions from this.
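To make the MTTR point concrete, here is a small Python sketch using the standard steady-state availability formula, availability = MTBF / (MTBF + MTTR). The MTBF and the two MTTR values are assumptions picked for illustration; the panel gives no figures.

```python
HOURS_PER_YEAR = 8760

def availability(mtbf_hours, mttr_hours):
    """Steady-state availability: MTBF / (MTBF + MTTR)."""
    return mtbf_hours / (mtbf_hours + mttr_hours)

def annual_downtime_hours(mtbf_hours, mttr_hours):
    """Expected hours of downtime per year at the given MTBF and MTTR."""
    return (1 - availability(mtbf_hours, mttr_hours)) * HOURS_PER_YEAR

# Assumed values: same failure rate, but faster diagnosis cuts repair time.
MTBF_HOURS = 10_000
slow_repair = annual_downtime_hours(MTBF_HOURS, mttr_hours=8)
fast_repair = annual_downtime_hours(MTBF_HOURS, mttr_hours=2)

print(round(slow_repair, 2), round(fast_repair, 2))
```

With the failure rate held constant, cutting repair time from 8 hours to 2 reduces expected downtime roughly fourfold, which is the uptime half of the win-win Arun Taneja describes.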
Mark Ferelli: I'm going to keep driving on cost-related issues for a moment. I'd like to know how a switched architecture, from your experience, is going to enable solution providers to lower the overall cost-per-megabyte of storage for your end-user type customers, and still get these yummy margins that Arun is talking about. Brian, would you take a swing at this one?
Brian Reed: Yes. Arun began talking about this. Today, with shared architectures, all the system vendors are limited in how many hard drives and how much capacity they can put behind a controller--just because of the nature of the shared architecture. With a switched-based architecture, they can now scale much larger systems and maintain--or have greater performance than they had before. Therefore, they are maximizing the expensive controller resource with the hard-disk drives behind them.
So, the number of hard-disk drives may be equal in configuration, but I have fewer controllers now. Therefore, the overall system cost that they get to sell to an end-user is less, because the same capacity, with greater performance, now requires fewer controllers behind it, and that's how they get to sell less expensive systems.
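Brian Reed's cost argument reduces to simple arithmetic, sketched below in Python. Every number here (drive count, drives per controller, and both prices) is an assumption invented for the example, as is the function name `system_cost`; none comes from the panel.

```python
import math

def system_cost(drives, drives_per_controller, controller_cost, drive_cost):
    """Total cost of a system: enough controllers to front all the drives,
    plus the drives themselves."""
    controllers = math.ceil(drives / drives_per_controller)
    return controllers * controller_cost + drives * drive_cost

# Assumed figures: same 240 drives, but a switched back-end is taken to
# support twice as many drives behind each controller.
DRIVES = 240
CONTROLLER_COST = 20_000
DRIVE_COST = 500

shared_cost = system_cost(DRIVES, 60, CONTROLLER_COST, DRIVE_COST)
switched_cost = system_cost(DRIVES, 120, CONTROLLER_COST, DRIVE_COST)

print(shared_cost, switched_cost)
```

Under these assumptions the drive spend is identical in both cases, and the whole saving (200,000 against 160,000) comes from needing two controllers instead of four for the same capacity.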
Bob Rumer: One area that hasn't been explicitly mentioned is the small player, the guy who buys one or maybe two controllers and wants to scale over time. Traditionally, JBODs based on AL architectures have started suffering performance penalties with relatively small numbers of drives--15 or 30, something in that range.
But now, using a switch embedded inside the JBOD--what we're calling an SBOD--allows them to keep adding to the full size of the loop and the performance scales along with capacity, linearly.
And he avoids ever having to buy a standalone switch to go between that controller and those JBODs. So, for the guy who is going from a small stance to a mid-range stance, I think he has a lower cost impact of his growth in capacity.
Part 3 of this discussion will continue in an upcoming issue of Computer Technology Review.
Publication: Computer Technology Review
Date: Jul 1, 2003