Taming The Scalability Beast.
Why is scalability becoming so important, especially in the e-commerce world? Winter lists three primary factors for operational databases: huge numbers of concurrent users, the need for continuous availability, and extremely large stored-data volume.
* Huge numbers of concurrent users. As e-commerce handles larger and larger shares of retail products and services, it will soon require online, constantly updated databases that are used directly by populations of 50 to 100 million people or more. Many of these people won't need a PC; they will access e-commerce sites from cheap, ubiquitous appliances. With transactions increasing as buyers pour onto the Web, e-commerce databases must be far more easily scalable than they are now.
* Continuous availability. Large numbers of concurrent users span multiple time zones and connect through mobile Internet access, which means that significant e-business operations must be continuously available. The challenge at this level of availability is scaling database size and transaction volume, both of which grow quickly along with user volume.
* Large stored-data volume. Businesses are increasingly storing clickstream data in massive data warehouses and mining it. Winter says, "Clickstreams for large user populations are the biggest things around. It doesn't take long to accumulate a terabyte of clickstream when you have a large-scale, actively used Web site."
Measuring For Scalability Success
So how do you measure and plan for scalability in a high-transaction environment? Winter suggests a Fantastic Four of data size, speed, workload, and transaction cost to define linear-scalable systems.
* Data size or "size up." It means that if your database size increases by a factor of x, given a constant hardware configuration, then your query response time will increase by no more than a factor of x. Example: If you increase your stored data volume from 100GB to 500GB and make no hardware changes, the system should increase its query and update transaction time from one second to no more than five seconds.
* Speed or "speed up." If you increase your hardware configuration's capacity by a factor of x, then your query response time will decrease by no less than a factor of x. Example: If you upgrade from one node to 64 and the node configuration is balanced for the task at hand, the system's transaction time should decrease from 64 seconds to no more than one.
* Workload or "scale up." If you increase the workload on your system by a factor of x, then you can maintain response time and throughput by increasing your capacity by a factor no greater than x. Example: If your transaction volume increases from 3.6 per hour to 3,600, the workload has grown by a factor of 1,000. The system should continue to deliver the same response time with a capacity increase of no more than 1,000 times.
* Transaction cost. There are two considerations with transaction cost in a scalable system. First, workload increases should not increase transaction cost. If it costs seven cents to process an order when you have one processor, it should still cost no more than seven cents to process an order with 1,000 processors. Capacity should not have to increase faster than demand.
Second, if data size increases by a factor of x, transaction cost should increase by no more than a factor of x. Example: If a query consumes 15 cents of system resources when it is run against a 100GB database, then it should consume no more than 75 cents of system resources on a 500GB database.
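The four criteria above reduce to simple inequalities, and it can help to write them down as checks you run against your own measurements. The following is a minimal sketch, not Winter's method: the function names are hypothetical, and the numbers are the article's own examples (times in seconds, costs in cents).

```python
# Hypothetical helpers that express the four linear-scalability criteria
# as inequalities. All names here are illustrative assumptions.

def size_up_ok(data_factor, base_time, new_time):
    """Data grows by data_factor on fixed hardware:
    response time may grow by at most the same factor."""
    return new_time <= base_time * data_factor

def speed_up_ok(capacity_factor, base_time, new_time):
    """Hardware capacity grows by capacity_factor:
    response time should shrink by at least that factor."""
    return new_time <= base_time / capacity_factor

def scale_up_ok(workload_factor, capacity_factor):
    """Workload grows by workload_factor: the capacity needed to hold
    response time steady should grow by no more than that factor."""
    return capacity_factor <= workload_factor

def cost_ok(data_factor, base_cost, new_cost):
    """Data grows by data_factor: cost per transaction may grow
    by at most the same factor."""
    return new_cost <= base_cost * data_factor

if __name__ == "__main__":
    # 100GB -> 500GB is a factor of 5; one second may become five.
    print("size up:", size_up_ok(5, base_time=1, new_time=5))
    # 1 node -> 64 nodes; 64 seconds should drop to one or less.
    print("speed up:", speed_up_ok(64, base_time=64, new_time=1))
    # 3.6 -> 3,600 transactions per hour is a factor of 1,000.
    print("scale up:", scale_up_ok(1000, capacity_factor=1000))
    # A 15-cent query on 100GB may cost up to 75 cents on 500GB.
    print("cost:", cost_ok(5, base_cost=15, new_cost=75))
```

A measurement that fails any of these checks indicates worse-than-linear behavior on that dimension, which is the signal to investigate before the system grows further.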
There are, of course, software products that deal well with simpler scalability questions. For example, Viathan's soon-to-be-released middleware product, Internet Database System, is an XML-based platform for developing and running Internet applications. It handles the non-relational database needs of web-scale applications, that is, per-user data such as customer information, shopping carts, address books, and clickstream data. The product allows database administrators to add new servers without having to reprogram the existing database. (This is something its founders understand very well. They came from MSN, where the main part of their jobs as database administrators was reprogramming the database for ever-expanding hardware requirements. They got tired of getting calls to do this at 3 a.m., which is why they must be the first guys ever to found an Internet start-up in order to rest.)
This type of approach is very useful with straightforward data and database installations. But if you're an integrator or consultant to an enterprise or an Internet heavy hitter that needs a heavy-duty workgroup or enterprise database like Oracle8, Microsoft SQL Server, or IBM DB2, scalability issues mean intensive up-front planning and heavy up-front spending. This is particularly true in e-business, where customer data flows in at an alarming rate. That's great for business--until the database outgrows its own structure. Then what do you do? Do you restructure, reprogram, rebuild, and come back up? Not when reprogramming a database can take months. Start too late and you're history.
Common scalability challenges exist in particular database environments such as OLAP, where the system needs to handle large volumes of historical data effectively and provide real-time answers. The traditional database approach to large volumes of data is to use indexes. The database engine scans the indexes in high-I/O RAM and avoids most of the data stored on low-I/O media. However, most data that an OLAP system needs to deal with cannot be indexed or cached without serious scalability problems. SeaTab Software's approach with PivotLink was to establish a finite number of unique values in the key elements, which curbed the growth of an OLAP system's data storage. This resulted in a more scalable solution, as much of the data ended up in shared memory for fast processing while maintaining a more linear degradation pattern.
Other scalability challenges exist because of the nature of the project. An example is the Sloan Digital Sky Survey Archive Software, produced by Johns Hopkins University. The project aims to produce an ultra-high-bandwidth database server that allows astronomers to capture data from five different filters spanning the spectrum from the ultraviolet to the near infrared, detecting over 200 million objects in this area. Other properties, such as spectra and redshifts, will be measured for the brightest one million galaxies. The challenge is to enable astronomers to perform this and other large-scale surveys on the "Digital Sky" using multiple terabyte-size databases interoperating seamlessly. Here, scalability means an absolute need to balance network speed, disk I/O, and CPU resources. (Johns Hopkins believes that "astronomers will have to be just as familiar with mining data as with observing on telescopes"--and e-commerce thinks it has problems!)
Planning For Scalability
According to developer OOP/L (Object Oriented Pty., Ltd.), the first step in planning for scalability is a flexible system architecture. A component of that architecture might be a mature middleware product such as Orbix, which allows for various client and server objects to be easily redistributed, allowing for easier future expansion such as adding, migrating or relocating servers.
Another important scalability issue with relational databases is handling high-volume data access efficiently by using an object cache. Since objects in the cache are accessed at memory speed, this approach provides great flexibility in tuning the database to increase performance and throughput. Also consider how the system handles event logging, alarms and alarm filtering, network and server performance monitoring, user-request monitoring, and high availability. These kinds of issues must be addressed early in the design stage. Parallelism is another consideration. It depends on the number of servers the database uses, including cluster configurations or MPP. The concept applies not only to query performance, but also to database management and data-loading tasks such as index creation.
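To make the object-cache idea concrete, here is a minimal sketch of a least-recently-used (LRU) cache placed in front of a slower lookup. The class name, the loader callback, and the LRU eviction policy are all assumptions chosen for illustration; the article does not name a specific cache design or product API.

```python
from collections import OrderedDict

class ObjectCache:
    """Small LRU object cache in front of a slower data source.

    `loader` is a hypothetical callback that fetches an object by key,
    for example by running a database query.
    """

    def __init__(self, loader, capacity=1000):
        self._loader = loader
        self._capacity = capacity
        self._items = OrderedDict()  # least recently used entry first
        self.hits = 0
        self.misses = 0

    def get(self, key):
        if key in self._items:
            # Cache hit: served from memory, mark as most recently used.
            self._items.move_to_end(key)
            self.hits += 1
            return self._items[key]
        # Cache miss: fall back to the slow source, then remember the result.
        self.misses += 1
        value = self._loader(key)
        self._items[key] = value
        if len(self._items) > self._capacity:
            self._items.popitem(last=False)  # evict least recently used
        return value

if __name__ == "__main__":
    def load_customer(customer_id):
        # Stand-in for a real database read.
        return {"id": customer_id}

    cache = ObjectCache(load_customer, capacity=2)
    for customer_id in (1, 1, 2, 3, 1):
        cache.get(customer_id)
    print("hits:", cache.hits, "misses:", cache.misses)
```

Tracking hits and misses is a deliberate choice: the hit rate is the number to watch when tuning cache capacity against the memory you can afford.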
The new releases of Oracle8, DB2, and SQL Server all talk about how scalable they are and, to a certain extent, that's true. They're all selling in the e-commerce space where things are happening very quickly and incoming user data can quickly reach critical mass, but you can only apply scalability to your own installation by doing your upfront work.
Winter says, "The truth is that it's hard to make statements about this product's or that product's degree of scalability. If your company wants to implement a database with a scalability requirement that's near or beyond the frontier of prior experience, then you really can't be guided by any great extent by what anybody knows. You have to measure for yourself."
The database companies are, of course, extremely aware of the Internet market's fast-growing technological needs and expectations, and of the tremendous challenges and opportunities this presents for highly scalable products. Scalability, though, is only one part of a database mix that includes integrated databases, middleware, and development tools. Aberdeen Group notes that database suppliers will need to add more multimedia and object support in the database engine, more push and ORB technology in the middleware, and more scalability support in the development toolset. As if that weren't enough, customers are also clamoring to incorporate oncoming technologies such as workflow or business-process support. It's a brave new world.
Editor's note: Winter Corporation's DATABASE SCALABILITY PROGRAM 2000 is an annual survey that identifies and honors the world's largest databases.
|Title Annotation:||Industry Trend or Event|
|Publication:||Computer Technology Review|
|Date:||Aug 1, 2000|