Supercharging SATA drive performance: Native Command Queuing makes SATA viable in workstation, server applications.
Take the elevator in a high-rise office building. As workers and visitors enter the elevator to move to different floors, they push floor buttons. When a number of people enter at different levels, all bound for different floors, the elevator doesn't go to the floors in the sequence the buttons were pushed--an approach that would cause unnecessary wear and tear to various components of the elevator such as its roping system, brakes, hydraulics, gears and roller guides. For many riders, a dumb elevator would lead to excessive wait times as the car yo-yos from one floor to another, dropping off passengers in the sequence the floor buttons were pushed. Instead, the elevator moves to the closest floor and then the next, in smooth, efficient succession, working its way to the lowest or highest drop-off point before reversing course.
NCQ on a disc drive uses a similar approach to stage and move data commands with high efficiency. Drives without NCQ suffer from inefficiency similar to the unintelligent elevator, executing commands in the order they are delivered to the drive. With NCQ, the drive considers the location of the read/write head on the disc platter and determines the most efficient path for executing the commands, moving, like the elevator, to the command closest to the head and then to others in a similar fashion.
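The elevator behavior described above can be sketched as a classic "LOOK" sweep. This is an illustrative model only, not Seagate's actual firmware algorithm; the function name and track numbers are invented for the example:

```python
def elevator_order(head, requests):
    """Order pending track requests elevator-style (LOOK algorithm):
    sweep from the current head position toward one end of the disc,
    then reverse. Illustrative model, not real drive firmware."""
    above = sorted(r for r in requests if r >= head)    # sweep outward first
    below = sorted((r for r in requests if r < head), reverse=True)
    return above + below

# A naive FIFO drive would seek 98 -> 10 -> 120 -> 15 from track 50;
# elevator ordering visits 98 and 120, then 15 and 10, in one smooth sweep.
print(elevator_order(50, [98, 10, 120, 15]))  # [98, 120, 15, 10]
```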
More specifically, NCQ increases performance and disc drive life by allowing the drive to internally optimize the execution order of workloads or commands. Intelligent re-ordering of commands within the drive's internal command queue helps improve performance of queued workloads by minimizing mechanical positioning latencies on the drive. This is important because hard drives, with all their moving parts--actuator arms, platters and the like--are among the few remaining mechanical devices in today's computing systems.
It is this ability to efficiently order requests--minimizing mechanical wear while optimizing performance--that makes Serial ATA disc drives and elevators kindred technologies.
For system builders, NCQ offers an easy way to scale high-capacity Serial ATA disc drives from desktop PCs to high-performance PCs, workstations and entry servers--and at just pennies per gigabyte.
NCQ, among the most advanced features introduced in the "Extensions to Serial ATA 1.0 Specification," is a command protocol that allows multiple commands to be outstanding within a drive at the same time. NCQ drives maintain an internal queue where these outstanding commands can be rescheduled or re-ordered, along with mechanisms that track outstanding and completed portions of the workload. NCQ also allows the host to issue additional commands to the drive while the drive seeks data for another command. The result is higher performance and less mechanical movement, which reduces rotational disc latencies.
There are several ways to minimize rotational latency. One is to deploy high-RPM drives, such as ATA drives with 10K or higher spindle speeds. However, drives with high-RPM spindle rates are costly. Another approach is to re-order outstanding commands in a way that takes into account the rotational position of the drive head when determining the best command to service next. Rotational latency can also be reduced using out-of-order data delivery, a feature that doesn't require the head to access the starting LBA first; the head can begin reading at any position within the target LBAs. Rather than waiting out the partial rotation needed to reach the first LBA of the requested data, the drive starts reading as soon as it has settled on the correct track and picks up the missing data at the end of the same rotation.
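The rotational-position approach can be modeled as picking the command with the shortest combined seek and rotational wait. This is a toy sketch: the timing constants, field names and linear seek model are made-up illustrations, not real drive characteristics:

```python
# Toy model of rotational-position-aware scheduling: choose the queued
# command with the shortest combined seek time + rotational wait.
REV_MS = 8.33            # one revolution at 7,200 RPM (illustrative)
SEEK_MS_PER_TRACK = 0.01 # simplistic linear seek model (illustrative)

def service_time(head_track, angle_deg, cmd):
    seek = abs(cmd["track"] - head_track) * SEEK_MS_PER_TRACK
    # Angular position of the platter under the head once the seek ends.
    arrive = (angle_deg + seek / REV_MS * 360) % 360
    # Wait for the target sector's angle to come around.
    wait = (cmd["angle"] - arrive) % 360 / 360 * REV_MS
    return seek + wait

def next_command(head_track, angle_deg, queue):
    return min(queue, key=lambda c: service_time(head_track, angle_deg, c))
```

A command on the current track whose sector is about to pass under the head wins out over a nearer-in-queue but rotationally distant one.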
NCQ reduces rotational latency by processing more I/O operations per second (IOPS) with fewer disc revolutions, so more IOPS can be processed under aggressive transactional workloads. NCQ is dynamic: as the workload increases, so does the disc drive's performance.
While there is clearly a need to re-order outstanding commands to reduce mechanical overhead and improve input/output (I/O) latencies, doing so goes well beyond simply collecting commands in a queue. Re-ordering algorithms optimize both the linear and the angular position of the target data to minimize total service time, a process called "command re-ordering based on seek and rotational optimization," or tagged command queuing (TCQ). Decreasing the mechanical workload through command queuing also reduces mechanical wear, extending drive life. NCQ is a highly efficient protocol implementation of tagged command queuing.
Unlike TCQ, Serial ATA NCQ reduces the overhead a queuing algorithm can create. The reason TCQ was not widely implemented in ATA drives is that the performance gain was wiped out by the overhead. Serial ATA significantly reduces overhead by using an efficient status return mechanism (race free), low interrupt overhead and First Party DMA--all capabilities that could not be implemented with Parallel ATA.
Race-Free Status Return Mechanism
This feature eliminates the "handshake" traditionally required with the host to enable the status return, allowing the status of any command to be communicated at any time and the drive to complete multiple commands sequentially or at the same time.
Interrupt Aggregation
The drive typically interrupts the host multiple times for each command it completes. The more interrupts, the greater the host processing burden and the slower the performance. NCQ reduces the average number of interrupts per command to as few as one. If the drive completes multiple commands in a short time span--common with a highly queued workload--NCQ can aggregate the individual interrupts so that the host controller only has to process one interrupt.
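The aggregation idea can be sketched as grouping completions that land close together in time, with each group costing the host a single interrupt. The window value and function names are illustrative assumptions, not controller behavior:

```python
def coalesce(completions, window_ms=1.0):
    """Group (time_ms, tag) command completions that occur within
    `window_ms` of the first completion in a burst; each group costs
    the host one interrupt. Toy model of NCQ interrupt aggregation."""
    groups, current, start = [], [], None
    for t, tag in sorted(completions):
        if start is None or t - start > window_ms:
            if current:
                groups.append(current)   # close the previous burst
            current, start = [], t       # open a new burst
        current.append(tag)
    if current:
        groups.append(current)
    return groups

bursts = coalesce([(0.1, 3), (0.4, 7), (5.0, 1), (5.2, 5)])
print(len(bursts))  # 2 -- four completions, but only two host interrupts
```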
First Party DMA (FPDMA)
NCQ uses First Party DMA to allow the drive to set up a Direct Memory Access (DMA) operation for a data transfer without host software intervention. The drive selects the DMA context by sending a DMA Setup FIS (Frame Information Structure) to the host controller. This FIS specifies the tag of the command for which the DMA is to be set up. Based on the tag value, the host controller loads the PRD table pointer for that command into the DMA engine, and the transfer proceeds with no software intervention--allowing the drive to efficiently re-order commands, since it can select which buffer to transfer on its own.
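The tag-based lookup described above can be sketched as a table of per-tag PRD (physical region descriptor) pointers on the host side. The class, method names and addresses are simplified illustrations, not the Serial ATA FIS layout or a real controller interface:

```python
class HostController:
    """Toy model of FPDMA context selection: one PRD table pointer is
    kept per command tag, and a DMA Setup FIS carrying a tag selects
    which buffer the transfer uses--with no host software in the path."""
    def __init__(self):
        self.prd_table = {}                 # tag -> PRD table pointer

    def issue(self, tag, prd_pointer):
        self.prd_table[tag] = prd_pointer   # recorded when command issued

    def on_dma_setup_fis(self, tag):
        # Drive sent a DMA Setup FIS for this tag; load its DMA context.
        return self.prd_table[tag]

hc = HostController()
hc.issue(5, 0x1000)   # tag 5's data buffer descriptors live at 0x1000
hc.issue(9, 0x2000)
print(hex(hc.on_dma_setup_fis(9)))  # 0x2000
```

Because the drive chooses which tag's FIS to send first, it effectively chooses the transfer order, which is what lets it service commands out of issue order.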
[FIGURE 2 OMITTED]
Detailed Description of NCQ
NCQ consists of three main processes:
* Building a queue of commands in the drive
* Transferring data for each command
* Returning status for completed commands
When a drive receives a command, it needs to know whether to queue or immediately execute the command and which protocol--NCQ, DMA or PIO, for example--to use to process it. The drive determines the protocol by using the particular command operational code or opcode issued.
To make it possible for Serial ATA to take advantage of NCQ, two special NCQ commands--Read FPDMA Queued and Write FPDMA Queued--were developed. Both are extended LBA and sector count commands designed to accommodate today's large-capacity drives. The commands also contain a force unit access (FUA) bit for high availability applications. When the FUA bit is set for a Write FPDMA Queued command, the drive will commit the data to media before indicating that the command has been completed with no errors. By using the FUA bit for writes, the host can manage data in the drive's internal cache that has not been committed to the platter.
In addition, each queued command is assigned a unique tag value used to identify any outstanding commands between the host and the device. Tag values can range from 0 to 31--although the drive can support a queue depth of less than 32--allowing the status for all commands to be reported in one 32-bit value.
The differences between queued and non-queued commands are revealed after the command is issued. When a non-queued command is issued, the drive transfers the associated data, clears the busy (BSY) bit in the Status register and notifies the host that the command was completed. When a queued command is issued, the drive clears the BSY bit before any data is transferred to the host. The BSY bit is not used to signal that the command has been completed. Instead, it communicates that the drive is ready to accept a new command. As soon as the BSY bit is cleared, the host can issue another queued command to the drive, allowing a queue of commands to be built in the drive.
NCQ uses First Party DMA to transfer data between the drive and the host. First Party DMA gives the drive control to program the DMA engine for a data transfer, allowing it to re-order commands in the most efficient way to reduce rotational latency--an important enhancement to Serial ATA, since only the drive knows the angular and rotational position of the drive head at any given point in time. With this DMA engine programming control, the drive can select data transfers that minimize both seek and rotational latencies.
First Party DMA further minimizes rotational latency by allowing the drive to return data out-of-order. First Party DMA allows the drive to return partial data for a command, send partial data for another command and then finish sending the data for the first command if this is the most efficient way to complete the data transfers--all the while performing like the intelligent elevator.
Race-free status return enables the interrupts for multiple commands to be aggregated to increase performance. The host and drive work in concert to achieve race-free status return without handshakes and to ensure that the SActive register, a 32-bit register that enables the host and drive to determine which commands are outstanding, is accurate at all times. The SActive register has one bit allocated to each possible tag--for example, bit x shows the status of the command with tag x. A set bit in the SActive register means that a command with that tag is outstanding in the drive (or is about to be issued to the drive); a cleared bit means that no command with that tag is outstanding.
Another key status return capability is that the Set Device Bits FIS can notify the host that multiple commands have been completed at the same time, ensuring that the host receives just one interrupt for multiple command completions.

Queuing only optimizes command re-ordering if a queue of requests is built up in the drive. In today's desktop workloads, many applications request only one piece of data at a time and often only ask for another once the previous piece has been received. In these cases the drive holds only one outstanding command at a time and can't take advantage of queuing, since there is nothing to re-order.
Hyper-Threading Technology, however, makes it possible to build a queue even when applications issue one request at a time. Hyper-Threading Technology enables significantly more multi-threading--the concurrent execution of multiple program threads--so that multiple applications are more likely to have I/O requests pending at the same time. Still, applications modified to take advantage of queuing will see the best performance improvements.
The modifications necessary to prepare an application for queuing are fairly minor. Today most applications are written to use synchronous I/O, also called blocking I/O. In synchronous I/O, the function call to read from or write to a file does not return until the actual read or write is complete. To take advantage of queuing, applications should be written to use asynchronous I/O. Asynchronous I/O is nonblocking, meaning that the function call to read from or write to a file returns before the request is complete. The application determines whether the I/O is complete by checking for an event or receiving a callback. Since the call returns immediately, the application can continue to do useful work, including issuing more read or write file functions.
It's not enough for the hardware to support NCQ. The OS, driver software and applications also need to send asynchronous I/O commands to the storage device to take advantage of NCQ. Together, hardware and software support for NCQ can produce less heat, improve system reliability and increase IOPS performance for systems ranging from single-drive desktop and notebook PCs to workstations and entry-level servers.
The best way to write an application to access multiple files is to issue all of the file accesses using non-blocking I/O calls. The application then can use events or callbacks to determine when individual calls have been completed. If there are a large number of I/Os--say, four to eight--issuing all of them at the same time can cut total data retrieval time in half.
Independent Software Vendors (ISVs) and operating system providers are key players in the drive to bring the significant performance benefits of NCQ to mainstream computing. Most NCQ performance gains can be experienced during the system boot process along with application loading and file copying. But the performance gains of NCQ will be extended to all computing applications when asynchronous I/O is widely deployed in software applications and operating systems. By implementing asynchronous I/O now, ISVs can take a big step in optimizing the performance of computers of all stripes--from notebook and desktop PCs to workstations and entry servers where NCQ disk drives and host hardware are becoming widely available.
Joni Clark is Serial ATA product marketing manager at Seagate Technology (Scotts Valley, CA)
Publication: Computer Technology Review, Jun 1, 2005