In computer technology, benchmarks, which are as old as computers themselves, are useful tools for comparing configuration alternatives when making purchase or performance-tuning decisions. There are a large number of storage-related benchmarks, and this feature discusses the most popular ones, including the following topics:
* Benchmarking definitions, general tips, and guidelines
* Block I/O-based storage benchmarks
* File system-based storage benchmarks
* Application-based system benchmarks
Artificial, over-simplified programs (known as benchmarks) are needed to study performance because real applications and workloads are costly, hard to measure, impractical to set up, not repeatable, and made of unknown, complex interactions. Benchmarks overcome these limitations: they are a set of well-defined, representative workloads that can be executed on many systems to compare their performance, providing measurable, repeatable results. They are used for monitoring system performance, diagnosing problems, and comparing alternatives. Because benchmarks must be simple by definition, they provide only a partial picture. Consider them a complementary source of information to other studies involving cost, reliability, ease of use, and so on.
Benchmarks range from toy benchmarks that have simple kernel loops, to large system benchmarks that simulate enterprise-level information processing. This analysis focuses on benchmarks that provide meaningful information about storage subsystems.
A good benchmark must provide understandable, relevant information for its domain. It must be scalable for testing a wide range of systems, and must be sufficiently unbiased to be acceptable to a wide range of users and vendors.
The following is a list of items that you should consider when setting up a benchmark test-bed for storage. Further tips are discussed later as different benchmarking concepts are introduced.
* In multiple central processing unit (CPU) systems (Symmetric Multiprocessing, or SMP), the number of processors will affect the system's performance. In addition, on these systems, assigning subsets of processors to specific processes and their threads will change execution performance. To utilize processor caches more effectively, threads could be assigned to a small subset of available processors. On many systems, this is accomplished through processor affinity system calls. Using affinity calls should be carefully considered.
* Caches change benchmark behavior in unexpected ways. Therefore, you must make sure that all caches are empty before each benchmark execution. Examples of caches to consider in storage benchmarks include the processor caches (L1, L2, and L3), the file system buffer cache, and network file system client and server caches. You may need to power-cycle client and server machines, or unmount and remount file systems between benchmark runs, to ensure that the caches have a cold start.
* In addition to the choice of using or not using a cache, the size and scheduling policies of caches profoundly affect execution profiles. Cache sizes and cache policies are some of the first factors to vary in a benchmarking study.
* In benchmarks that can execute on multiple machines simultaneously, you might need to synchronize the machine times and timings of events. Network Time Protocol (NTP) can be used to synchronize machine clocks. Distributed benchmark utilities generally have options that can be set to synchronize the benchmark events (such as start and stop execution) between multiple benchmark instances.
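The processor affinity tip in the list above can be tried from a script before a benchmark is launched. The following is a minimal sketch, assuming a Linux host where Python exposes the scheduler affinity calls; the function name pin_to_cpus is illustrative and is not part of any benchmark tool discussed here.

```python
import os


def pin_to_cpus(cpus):
    """Restrict the calling process to the given CPU ids.

    Returns the resulting affinity set, or None on platforms that do not
    expose the Linux scheduler affinity calls (an assumption of this sketch).
    """
    if not hasattr(os, "sched_setaffinity"):
        return None
    os.sched_setaffinity(0, set(cpus))  # pid 0 means the calling process
    return os.sched_getaffinity(0)


# Example: pin to the first CPU this process is currently allowed to use.
if hasattr(os, "sched_getaffinity"):
    allowed = sorted(os.sched_getaffinity(0))
    print(pin_to_cpus(allowed[:1]))
```

Child processes started afterward inherit the affinity mask, so a benchmark launched from the same script would stay on the chosen processors.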
Here we discuss storage benchmarks in three major categories: block I/O benchmarks, file system benchmarks, and application-level benchmarks.
Block I/O Benchmarks
Benchmarks that exercise and measure storage systems using block I/O interfaces are useful for obtaining the performance characteristics of storage layers below the file system level. These could be regarded as raw storage performance benchmarks. IOmeter and SPC-1 are the two most widely accepted benchmarks in this category.
IOmeter was developed by Intel Corp., and is currently distributed as an open-source project. It is a tool for generating tightly controlled I/O workloads and collecting performance data such as response time, throughput, and CPU usage.
It is best used for stress testing I/O systems to find system bottlenecks. Because it does not have any prescribed, real-world workload definitions, it cannot be used as an application performance predictor. However, it can easily be configured to generate almost any kind of I/O pattern through a graphical user interface.
Although IOmeter sometimes appears in literature for benchmarking file servers, it lacks the necessary capabilities to generate a proper file access workload. IOmeter generates reads and writes to volumes (raw or formatted). However, file server workloads are generally dominated by metadata-type operations such as directory searches and file attribute checks. When used to benchmark file systems, IOmeter is only useful for checking the read/write throughput of a file system.
It is more useful for testing the block devices directly (as raw devices, which bypass the file system). At the block device level, everything is a data read or a data write operation, no matter what the higher levels might be doing (for example, accessing file attributes).
IOmeter can be set to execute with any queue length. A device's queue length denotes the number of outstanding I/Os on that device at a given time. Deep queue lengths generally increase the throughput and the response time.
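The link between queue length, throughput, and response time follows Little's Law: the number of outstanding I/Os equals throughput multiplied by response time. The sketch below applies that relation to made-up measurements (they are not taken from Figure 1) to show why throughput and response time can rise together as the queue deepens.

```python
def iops_from_queue(queue_depth, response_time_ms):
    """Little's Law: throughput = outstanding I/Os / response time."""
    return queue_depth * 1000.0 / response_time_ms


# Hypothetical measurements: (queue depth, mean response time in ms).
samples = [(2, 4.0), (8, 8.0), (32, 20.0)]
for depth, rt_ms in samples:
    rate = iops_from_queue(depth, rt_ms)
    print(f"depth={depth:3d}  response={rt_ms:5.1f} ms  throughput={rate:7.1f} IOPS")
```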
Figure 1 contains sample throughput-response time curves obtained using IOmeter. In these experiments, four IOmeter "worker" processes are executed on four different client machines. These machines access a disk array as their back-end storage, using 16KB random write operations.
In the figure, the two curves represent tests with write caching enabled and disabled on the disk array, respectively. Successive data points on each curve are obtained by setting the client queue depths at values of 2, 8, 32, 128, and 512. The figure shows the range where write caching significantly reduces the response time. The figures are obtained from spreadsheet outputs of IOmeter runs. This example shows the usefulness of IOmeter as a stress test tool.
Storage Performance Council Benchmarks
Storage Performance Council (SPC) is a group comprised of companies that are predominantly in the data storage and server business. SPC was formed to develop industry-standard benchmarks for storage networks. By overseeing the development and publication of benchmark results, the group aims to provide a level playing field for storage system vendors. SPC plans to release a series of benchmarks, each intended for use in a different environment. The first benchmark, SPC-1 (SPC, 2002; McNutt, 2001), represents a workload that is both throughput- and response time-sensitive. It was developed by studying the workload of transaction processing systems that require small, mostly random, read and write operations (for example, database systems, OLTP systems, and mail servers).
SPC started working on the SPC-2 benchmark, which will represent workloads with large I/O sizes and mostly sequential access. It is intended to emulate video on-demand servers, film rendering applications, and backup/restore operations.
SPC benchmark source code is controlled by the member companies, and SPC has put an audit structure in place to authenticate test results. After a test sponsor (a member company) submits benchmark results in the form of a Full Disclosure Report (FDR), auditors assigned by SPC review the results and post them for peer-review for 60 days, at the end of which time the results are assumed to be official.
SPC-1 reports two main performance metrics. SPC-1 IOs per second (IOPS) represents the highest IOPS rate achieved during the benchmark. Any reported SPC-1 IOPS result should not have a response time greater than 30ms. The second reported metric is the SPC-1 LRT (Least Response Time), which is obtained at 10 percent of the load level of the reported SPC-1 IOPS rate. Figure 2 illustrates these two metrics.
An FDR is supposed to contain the throughput-response time curve, as shown in Figure 2, and the IOPS rate and LRT. In addition, sponsors who want to publish SPC-1 results are supposed to disclose the total price of their Tested Storage Configuration (TSC). An FDR contains a cost/performance value in the form of dollars per SPC-1 IOPS.
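The cost/performance value is a simple ratio of the TSC price to the reported SPC-1 IOPS. A minimal sketch follows; the price and IOPS figures are hypothetical and the function name is illustrative.

```python
def price_performance(total_tsc_price, spc1_iops):
    """Dollars per SPC-1 IOPS for a tested storage configuration."""
    if spc1_iops <= 0:
        raise ValueError("SPC-1 IOPS must be positive")
    return total_tsc_price / spc1_iops


# Hypothetical configuration: a $250,000 TSC that sustains 10,000 SPC-1 IOPS.
print(f"${price_performance(250_000, 10_000):.2f} per SPC-1 IOPS")
```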
Table 1 presents a summary of public SPC-1 results currently available. Test sponsors are also supposed to disclose the data capacity and the data protection level they have used in the tests, as shown in Table 1.
In SPC-1, all host computers involved in the test, the storage network, and the storage subsystems comprise the Benchmark Configuration (BC), as shown in Figure 3. All the storage-related items, including the host adapters, cables, network switches, and hubs, constitute the Tested Storage Configuration (TSC). This is significant because the reported system cost involves everything in the TSC.
The workload generator in SPC-1 is based on Business Scaling Units (BSUs). One BSU represents a group of users collectively generating a prescribed I/O demand. Each BSU demands 50 I/O operations per second. These are generated through eight streams. Five streams generate random reads and writes. Two of the streams generate sequential reads. The final stream generates sequential writes. The streams are assigned to Application Storage Units (ASUs), as shown in Figure 4.
There are three ASUs. ASU-1 is the data store that holds the raw data for the applications. It contains 45 percent of the total ASU capacity. ASU-2 is the user store that holds an organized, secure store for user files. It contains 45 percent of the total ASU capacity. ASU-3 is the log that provides information consistency for the data in ASU-1 and ASU-2. It contains 10 percent of the total ASU capacity.
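The BSU and ASU figures above translate into simple arithmetic. The sketch below uses only the numbers stated in the text (50 I/O operations per second per BSU and a 45/45/10 percent capacity split); the function names and the sample inputs are illustrative.

```python
IOPS_PER_BSU = 50  # I/O demand of one BSU, as stated above
ASU_SHARE_PERCENT = {"ASU-1": 45, "ASU-2": 45, "ASU-3": 10}


def offered_load(bsu_count):
    """Total I/O demand (operations per second) for a number of BSUs."""
    return bsu_count * IOPS_PER_BSU


def asu_capacities(total_capacity_gb):
    """Split a total ASU capacity across the three ASUs."""
    return {
        name: total_capacity_gb * percent / 100
        for name, percent in ASU_SHARE_PERCENT.items()
    }


# Hypothetical test: 200 BSUs against 1,000 GB of total ASU capacity.
print(offered_load(200))
print(asu_capacities(1000))
```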
An official SPC-1 test, which can be submitted to SPC, comprises several mandatory phases. Figure 5 shows the progress of the test in time (not drawn to scale). Changing the number of simulated BSUs controls the load in SPC-1. To increase the throughput, more BSUs are added, and to decrease the load, fewer BSUs are used. The primary metrics (IOPS and LRT) are measured only in the stable states (shown as plateaus in Figure 5).
The first phase is the sustainability test, which executes at the highest BSU load for a long period. The intention is to ensure that the highest throughput can be sustained over time. Toward the end of the sustainability test, a measurement is taken to obtain the official SPC-1 IOPS rate. Then, in the response time ramp test, the load is decreased gradually to obtain a response time-throughput graph. At the end of the ramp, at the 10 percent load level, SPC-1 LRT is recorded.
IOPS and LRT measurements are repeated twice to check the repeatability of the test results after system shutdowns. SPC-1 IOPS and SPC-1 LRT results must be within 5 percent of the values obtained in the sustainability and repeatability tests to be valid for official submittal.
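The 5 percent repeatability rule can be expressed as a relative-difference check. This is a sketch under the assumption that the tolerance is measured relative to the reported value; the SPC-1 specification defines the exact rule, and the sample numbers are hypothetical.

```python
def within_tolerance(reported, repeated, tolerance=0.05):
    """True if `repeated` differs from `reported` by at most `tolerance` (relative)."""
    return abs(repeated - reported) <= tolerance * abs(reported)


# Hypothetical IOPS and LRT values from a reported run and a repeated run.
print(within_tolerance(10_000, 9_600))  # 4 percent lower
print(within_tolerance(2.0, 2.2))       # 10 percent higher
```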
SPC-1 is the first and only industry benchmark applicable to storage networking environments. Although SPC-1 has a narrow workload scope, SPC is working on additional benchmarks to broaden the covered workload types.
File System Benchmarks
File systems add another layer above the block I/O interface, and they change the storage workload coming down to the storage devices. File system caching and metadata handling cause a unique type of workload that must be tested with special benchmarks. This category includes Bonnie, IOzone, NetBench, PostMark, and SPEC SFS.
File System Benchmarking Considerations
File system performance is sensitive to many system configuration parameters, and benchmarking file systems can be a tricky task. The following is a short list of tips to consider when setting up file systems for benchmarking:
* File systems (local or network-mounted) buffer write operations in buffer caches before committing them to stable storage (such as hard disks). This data destaging can happen immediately before the write call returns, or later at a synchronization time or at the file close time. Applications that require absolute reliability can use synchronous write operations, which force all data to stable storage before an acknowledgement is generated. In most systems, this is done through the O_SYNC option at the file open time. You should be aware of the method your benchmark is using. Using synchronous write operations will prohibit any gains from write buffering. If synchronous operations are allowed, the benchmark developer or user must decide whether to include file destaging overhead (fsync and fflush calls) in the benchmark timings.
* File locking keeps data consistent when multiple readers/writers are operating on the file. Files can be locked on local or network-mounted file systems. In both cases, locking files might disable file caching. Check your benchmark options for allowing or disallowing file locks.
* Memory-mapped files effectively cache the entire file on the client computer's memory. This eliminates almost all disk I/O until the file is closed or a synchronization system call is made. Performance and reliability implications of this behavior are obvious and should be considered in file system benchmarks.
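The write-buffering tip in the list above can be illustrated with a small timing harness. This is a minimal sketch, not taken from any benchmark named here: it times one write with and without O_SYNC and with and without the final fsync, so the destaging overhead is visible. The function name, file name, and payload size are assumptions.

```python
import os
import tempfile
import time


def timed_write(path, data, synchronous=False, include_fsync=True):
    """Write `data` to `path` and return the elapsed time in seconds."""
    flags = os.O_WRONLY | os.O_CREAT | os.O_TRUNC
    if synchronous:
        flags |= getattr(os, "O_SYNC", 0)  # O_SYNC is not defined on every platform
    start = time.perf_counter()
    fd = os.open(path, flags, 0o600)
    try:
        view = memoryview(data)
        while view:  # os.write may write fewer bytes than requested
            written = os.write(fd, view)
            view = view[written:]
        if include_fsync:
            os.fsync(fd)  # count destaging to stable storage in the timing
    finally:
        os.close(fd)
    return time.perf_counter() - start


with tempfile.TemporaryDirectory() as tmp:
    target = os.path.join(tmp, "bench.dat")
    payload = b"x" * (1 << 20)  # 1 MiB test payload
    modes = [
        ("buffered, no fsync", {"include_fsync": False}),
        ("buffered + fsync", {"include_fsync": True}),
        ("O_SYNC", {"synchronous": True, "include_fsync": False}),
    ]
    for label, kwargs in modes:
        print(f"{label:20s} {timed_write(target, payload, **kwargs):.6f} s")
```

On most systems the first mode reports the shortest time because the data may still be in the buffer cache when the timer stops, which is exactly the effect the tip warns about.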
The following list contains tips that are especially useful when benchmarking network file systems:
* Decide where the benchmark program, data files, and output files will reside. When benchmarking network file systems, executing the benchmark programs from local file systems while the data files reside on the remote mounted directories is advisable.
* Network File System (NFS) client caches must be cleared by unmounting and remounting the remote directories between test runs. This guarantees that the performance of later benchmark runs is not tainted by the cached data of previous runs.
* Server write commit times must be included in the execution time. Any outstanding write operations on the server side are committed to the disk storage at file close time. This could have a large impact on the performance of small files that are entirely in the server cache.
* While benchmarking networked storage, the parameters used to set up the network connection will directly impact performance. NFS, for example, can execute over Transmission Control Protocol (TCP) and User Datagram Protocol (UDP) sockets. The choice between the two protocols will determine where data error handling is performed (in the network stack in the case of TCP, and in upper-layer protocols with UDP).
* Socket buffer sizes and TCP window sizes determine the amount of in-transit data over the network. These can be changed by changing the registry keys in MS Windows-based machines, and by manipulating the system parameters in the /proc file system on Linux machines.
* NFS allows the user to set the granularity of the read and write data exchange (rsize and wsize). You will need to experiment with multiple values to find the optimum settings for your environment.
* Operational parameters of Ethernet connections are important as well. Full-duplex connections will eliminate contention at the wire level. Many older Ethernet adapter cards might be set to half-duplex connections by default. In addition, a larger Maximum Transmission Unit (MTU) size (jumbo frames) will reduce the network packets' fragmentation.
* Network file system performance depends on the number of clients accessing the file server. Both the throughput and the response time increase with the number of clients.
* The number of NFS daemons on the client (biod) and server (nfsd) affects the throughput. The number of daemons determines the number of operations that can be served simultaneously.
* Besides data, file system metadata (inodes and vnodes) is cached on both the client side and the server side. Metadata caches allow fast access to file attributes, and they will eliminate disk accesses as long as the data is in the cache. Therefore, the inode cache size is important.
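For the socket buffer tip in the list above, the current limits can be recorded alongside each benchmark run. The sketch below reads a few standard Linux procfs entries; it assumes a Linux host, silently skips entries that are not present, and the function name is illustrative.

```python
from pathlib import Path

# Linux procfs entries that hold socket buffer limits, in bytes.
PROC_PARAMS = {
    "rmem_max": "/proc/sys/net/core/rmem_max",
    "wmem_max": "/proc/sys/net/core/wmem_max",
    "tcp_rmem": "/proc/sys/net/ipv4/tcp_rmem",  # min, default, max
    "tcp_wmem": "/proc/sys/net/ipv4/tcp_wmem",  # min, default, max
}


def read_socket_buffer_limits(params=PROC_PARAMS):
    """Return {name: [ints]} for every readable entry; missing ones are skipped."""
    limits = {}
    for name, path in params.items():
        try:
            text = Path(path).read_text()
        except OSError:
            continue  # not Linux, or the entry is hidden in this environment
        limits[name] = [int(field) for field in text.split()]
    return limits


print(read_socket_buffer_limits())
```

Logging these values with the results makes it easier to explain differences between runs made on differently tuned machines.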
Bonnie and its variants (for example, Bonnie++) are simple file system workload generators that can be used to quickly test a file system's throughput on UNIX machines. You can use them for quick comparisons. However, the benchmark results do not have a real-world correspondence.
Bonnie uses standard C library calls, which are portable to many platforms. The benchmark performs a series of operations on a large file. A sample output looks like this:
              -------Sequential Output-------- ---Sequential Input-- --Random--
              -Per Char- --Block--- -Rewrite-- -Per Char- --Block--- --Seeks---
Machine    MB K/sec %CPU K/sec %CPU K/sec %CPU K/sec %CPU K/sec %CPU  /sec %CPU
testsys   500  4332 13.7  4722  3.3  1413  2.8  4674 10.3  4744  5.6  52.0  1.0
One of the useful outputs is the CPU utilization percentage, which can be used to check whether the CPU is a bottleneck.
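To check CPU utilization across phases, a Bonnie result row can be split into per-phase pairs. The sketch below assumes the classic Bonnie column order (per-character output, block output, rewrite, per-character input, block input, random seeks) and uses the sample row above with stray characters removed; the phase names and the parser are illustrative, not part of Bonnie.

```python
PHASES = [
    "seq_output_per_char",
    "seq_output_block",
    "seq_output_rewrite",
    "seq_input_per_char",
    "seq_input_block",
    "random_seeks",
]


def parse_bonnie_line(line):
    """Split one result row into per-phase rate and %CPU values.

    The first five rates are in K/sec; the random seek rate is in seeks/sec.
    """
    fields = line.split()
    machine, size_mb = fields[0], int(fields[1])
    values = [float(v) for v in fields[2:]]
    if len(values) != 2 * len(PHASES):
        raise ValueError("unexpected number of columns")
    phases = {
        name: {"rate": values[2 * i], "cpu_pct": values[2 * i + 1]}
        for i, name in enumerate(PHASES)
    }
    return {"machine": machine, "size_mb": size_mb, "phases": phases}


row = parse_bonnie_line(
    "testsys 500 4332 13.7 4722 3.3 1413 2.8 4674 10.3 4744 5.6 52.0 1.0"
)
busiest = max(row["phases"], key=lambda name: row["phases"][name]["cpu_pct"])
print(busiest, row["phases"][busiest]["cpu_pct"])
```

In this sample the per-character phases use the most CPU, which is expected because they issue one library call per byte.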
IOzone is a free, open-source file system benchmark. It enables the study of the effects of system configurations on file system performance. The user can set IOzone to generate a wide variety of access patterns and to collect statistics on performance. Rather than being an application-specific benchmark, IOzone is a tool for generating a large number of access patterns.
The source code is in ANSI C and can be compiled on a large number of platforms, including many Microsoft Windows and UNIX-based machines. The parameters can be set to generate the following:
* Read
* Write
* Re-read
* Re-write
* Read/write backwards/strided
* Random read/write
* Memory-mapped read/write
* Asynchronous read/write
On a single machine, the benchmark can use multiple processes or threads. On multiple machines, IOzone can execute as a distributed file system benchmark.
IOzone can purge processor caches and mount/unmount file systems to remove dirty cache effects. It can generate Microsoft Excel output data that can be used to draw surface plots showing the interactions between file sizes, access sizes, and performance metrics. IOzone can be configured to generate synchronous or asynchronous I/O operations.
IOzone has useful parameters for benchmarking both local file systems and network mounted file systems. A typical invocation of IOzone for a network mounted file system would look like this:
./iozone -acR -U /mnt/test -f /mnt/test/testfile -b output.xls > logfile
In this example, -a is used to let IOzone test all file sizes between 64KB and 512MB and record sizes from 4KB to 16MB. The parameter -c is used to tell IOzone to include write commit times at the end of an NFS V3 file close. The -R option generates output data as Excel spreadsheets. The -U option causes IOzone to unmount the file system between tests. The next two parameters denote the directory and file name to use for data accesses. -b denotes the Excel output filename. The standard output is redirected to a local file.
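When the same invocation has to be repeated for several mount points, it can help to build the argument list in a script. The sketch below assembles only the options described above and does not run IOzone; it assumes that writing the flags separately (-a -c -R) is equivalent to the combined -acR form, and the helper name is illustrative.

```python
import shlex


def iozone_command(mount_point, test_file, spreadsheet, binary="./iozone"):
    """Build the argument list for the invocation described above."""
    return [
        binary,
        "-a",               # automatic mode: sweep file and record sizes
        "-c",               # include file close (commit) time in the timings
        "-R",               # generate Excel-style report output
        "-U", mount_point,  # unmount and remount this mount point between tests
        "-f", test_file,    # file used for data accesses
        "-b", spreadsheet,  # Excel output file name
    ]


argv = iozone_command("/mnt/test", "/mnt/test/testfile", "output.xls")
print(shlex.join(argv))
```

The resulting list can be passed to subprocess.run with stdout directed to a log file, which mirrors the redirection in the command line above.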
Figure 6 shows the results of file system performance data obtained using IOzone. Here, two file systems (fs1 and fs2) are compared for two file sizes (16MB and 1GB) and various buffer sizes (4KB to 16MB).
The figure clearly shows the effects of various caches on the data path. For small-sized reads from a small file, most of the data is in the processor cache, which yields a very high throughput. The data for the large file size comes directly from the physical disks, causing a big drop in throughput. The point that differentiates the two file systems is the use of the buffer cache for the small file size. The first file system (fs1) effectively uses the buffer cache, while the second one (fs2) bypasses the buffer cache and performs poorly. These results were obtained on Linux servers using two commercially available journaling file systems.
Although IOzone results cannot be used to predict the performance of a particular application on a particular platform, its wide variety of configuration parameters and ease of use make it an excellent tool for diagnosing and debugging the performance pitfalls in file system-based storage networks.
NetBench is a network file system benchmark for Common Internet File System (CIFS) clients and servers. CIFS is a network file system protocol based on Microsoft's Server Message Block (SMB), and is the native resource-sharing protocol for Microsoft Windows platforms.
Although NetBench is freely available, the source code is controlled by Ziff-Davis, Inc. NetBench accepts workload definition files and replays these workloads on client machines. In standard NetBench practice, the "disk-mix" workload definition file provided as part of the distribution is used. This workload was obtained by collecting traces of popular desktop applications.
Figure 7 shows a breakdown of SMB operations generated between a client machine running NetBench and a CIFS server. The figure shows that a vast majority of the operations are writes and metadata access (get attribute, open, close) operations.
The workload has a profound effect on performance outcomes. For example, a home-directory server, which keeps and serves user files, will have a very different operation distribution than the one shown in Figure 7. A home-directory server will face mostly metadata-type operations (such as directory opens, closes, searches, and file attribute checks). One study shows that actual reads and writes in a home-directory server are less than 25 percent of all operations (Ramany, 2001).
Figure 8 shows the distribution of operation sizes generated by the NetBench disk-mix workload. While most of the write operations (updates) are less than 1KB, read operations center around 4KB.
Another important factor determining workload is the place where the workload is defined in the storage network. In NetBench, CIFS workload is defined from the perspective of end-user desktops. However, network-attached storage (NAS) devices are increasingly being used as back-end storage for file servers, Web servers, or database servers. Therefore, the client traffic is filtered and transformed into server traffic before it reaches a CIFS server (for example, a NAS device).
Note: Previous studies showed that file servers generate exclusively write-dominant traffic because almost all read and metadata traffic is captured by the large caches on the servers.
NetBench executes workload generators on multiple clients (Windows 95/98/NT/2000), which are controlled through a control station (Windows NT/2000). It incorporates a GUI-based control program, which enables the easy launch of the benchmark and generation of output files. The output is in the form of Excel spreadsheets that contain total bandwidth (throughput, in NetBench terms) in Mb/s and average response time in milliseconds. A single setup can be repeated for different numbers of clients.
A sample result is shown in Figure 9. The figure combines the NetBench results obtained for two different server file systems. The objective is to study the effect of the server's local file system on overall network file access performance. In this example, the first file system scaled better to a high number of clients and consistently provided better response time.
As the figure shows, NetBench performance depends on the number of clients. A high-end file server or a NAS device might require 60 clients before it is saturated. This makes NetBench impractical if you do not have access to a large test-bed.
Another concern with NetBench is the small footprint of the accessed data. This causes most of the data to be served from client and/or server caches and makes the benchmark insensitive to back-end stable storage (disk) performance. The server's processing and communication power becomes the key factor for higher NetBench results. To remove the effects of various caches, the caches must be enabled/disabled using separate system configuration manipulations.
PostMark is a very specialized file system workload generation tool. It is intended for testing small file, high-throughput environments such as e-mail and netnews servers. It generates a large pool of small files and performs opens, closes, reads, and updates on that pool. The results are very specific to that environment and are not generally applicable to other workloads. PostMark can exercise local file systems and both CIFS- and NFS-mounted file systems.
Standard Performance Evaluation Corporation (SPEC) is an industry consortium that develops and publishes a broad range of benchmarks--from CPU to Web server benchmarks--for the evaluation of computing systems.
System File Server (SFS) is SPEC's benchmark for the performance of NFS servers. It is based on two earlier benchmarks, LADDIS and Nhfsstone. The latest version is SPEC SFS97_R1 V3.0. It supports both NFS V2 and NFS V3, as well as TCP and UDP as transport protocols.
Newer versions are updated to include workload specifications for modern NFS servers. SFS executes on client computers (which must be UNIX-based) and can access any server that supports NFS. The primary outputs include a table of throughput versus response time. The single figure of merit is the highest throughput obtained at a response time less than 50ms.
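Picking the single figure of merit from a throughput versus response time table is a filter followed by a maximum. The sketch below uses a hypothetical results table and takes the 50ms limit from the text as a default parameter; the function name is illustrative.

```python
def peak_throughput(points, max_response_ms=50.0):
    """Highest throughput among (throughput, response_ms) points under the limit."""
    eligible = [ops for ops, response_ms in points if response_ms < max_response_ms]
    return max(eligible) if eligible else None


# Hypothetical results: (delivered ops/sec, average response time in ms).
results = [(1000, 3.1), (2000, 4.8), (3000, 9.5), (3500, 27.0), (3600, 61.0)]
print(peak_throughput(results))
```

The last point delivers more operations per second but exceeds the response time limit, so it does not count toward the reported figure.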
SPEC SFS is storage throughput-sensitive, and using more spindles will provide better SFS numbers. Instead of using the newer, bigger disk drives, using a larger number of older, smaller drives is more advantageous. Although this might be seen as a benchmark anomaly, it is a fact of throughput performance.
Application-level benchmarks stress the system end to end, from the CPU to network, to storage devices. These benchmarks emulate business application loads, and their results are more meaningful in their application domains. This category includes TPC and SPEC benchmarks.
The Transaction Processing Performance Council (TPC) is a group of companies that produces benchmarks for transaction processing and database applications. Most of the TPC benchmarks are system-level, end-to-end benchmarks that exercise almost all parts of the computing system, including the clients, the network, the servers, and the storage subsystems.
The current flagship TPC benchmark is TPC-C, which simulates an Online Transaction Processing (OLTP) environment with multiple terminal sessions in a warehouse-based distribution operation (TPC, 1998). It contains read-only and read/write operation mixes that simulate new-order, payment, order-status, stock-level, and delivery transactions. TPC-C can be scaled quite well by increasing the number of warehouses and the users.
The TPC-C metrics include throughput (new-order transactions per minute, tpmC) and price/performance ($/tpmC). TPC-C is a widely accepted benchmark with results submitted from all major systems companies. Because it stresses all components, it is hard to tell the effect of storage subsystems on TPC-C results directly, unless the storage subsystem is the bottleneck. One obvious expectation from storage subsystems is a high throughput rate, rather than high data bandwidth.
TPC-H and TPC-R benchmarks simulate Decision Support System (DSS) environments. However, they are not as popular as TPC-C, and most of the time vendors ignore them.
TPC-W is one of the latest benchmarks from TPC (TPC, 2000). TPC-W simulates a transactional Web environment such as that seen with e-commerce sites. It provides performance and price/performance metrics. It is modeled after a Web bookstore. Primary transactions include browsing, shopping, ordering, and business-to-business transactions.
TPC-W's primary metrics are Web Interactions per Second (WIPS), dollars per WIPS ($/WIPS), and Web Interaction Response Time (WIRT). TPC-W improves over TPC-C/H/R by requiring a very detailed system performance disclosure that includes the CPU utilizations, database logical and physical I/O activity, and network and storage I/O rates.
SPEC produced a series of Web server benchmarks over the years (Eigenmann, 2001). The latest version, SPECWeb99, is based on Web workloads obtained from logs of large Web installations and agreed upon by major server vendors. A companion benchmark, SPECWeb99_SSL, measures the performance of Web servers using secure communication protocols.
Newer versions reflect the latest developments in Web technology, including dynamic HTTP, rotating ads, cookies, and so on. Similar to SPEC SFS, SPECWeb99 is a client-based benchmark and supports any Web server capable of serving HTTP.
The benchmark's primary outputs are a table of requested load and response times. The peak throughput is the single figure of merit, with no limits on response time. Web servers generate mostly read, random, small-size storage I/O operations. Therefore, SPECWeb99 will be sensitive to the throughput of the storage system for such I/O patterns.
Real Applications as Benchmarks
A common assumption is that the best benchmark for testing alternative systems is the real application that will be used on these systems in the production phase. Although there is truth in this argument, there are some pitfalls as well.
The problem with real applications is that their workload is very difficult to control. The real workloads are mostly dynamic, time- and input-sensitive, which makes repeating the same execution twice almost impossible. Without repeatability, comparing two configurations in a meaningful way is difficult. Real applications are also difficult, costly, and time-consuming to set up.
Therefore, the application under study is the best benchmark for a purchase decision or performance tuning only if it enables the generation of repeatable workloads.
Performance results obtained from applications will not be publishable because they will not be repeatable outside the test-bed in which they are obtained.
Tested Storage Configuration                          SPC-1 IOPS  SPC-1 LRT (ms)  ASU Capacity (GB)  $/IOPS  Data Protection Level  FDR Submission Date
3PAR InServ S800 Storage Server                       47,001      2.34            4,444.44           $34.65  Mirroring              10-Oct-02
Dell Corp., PERC3/QC SCSI RAID Controller             7,650       3.10            440.00             $4.48   Mirroring              19-Jun-02
HP StorageWorks Enterprise Virtual Array Model 2C12D  24,006      2.29            2,596.3            $22.00  Mirroring              2-Oct-02
IBM Enterprise Storage Server F20                     8,009       2.99            1,259.85           $44.58  RAID 5                 20-May-02
LSI E4600 FC Storage System                           15,708      1.64            400.00             $16.01  Mirroring              20-May-02
Sun StorEdge 9910                                     8,404       2.07            343.51             $74.29  Mirroring              20-May-02
Table 1. SPC-1 Results Submitted as of the End of October 2002
|Title Annotation:||Data Storage|
|Date:||May 1, 2004|