AMPP: combining SMP and MPP to speed database queries.In the current age of data warehousing See data warehouse. data warehousing - data warehouse , the timely processing and retrieval of vast amounts of data is vital to the decision-making process. However, just as important as timeliness is the depth of data analysis possible. According to according to prep. 1. As stated or indicated by; on the authority of: according to historians. 2. In keeping with: according to instructions. 3. Greg's Law, it is estimated that on average a company's data doubles every nine months. With the growing size of the average data warehouse, the task of data analysis has become increasingly difficult; already, terabyte-sized data warehouses are fairly common. The Challenge Most organizations today experience exponential growth Extremely fast growth. On a chart, the line curves up rather than being straight. Contrast with linear. in the amount of data they want to store. That's not just because they're growing their base of customers. It's because they're storing much more information about almost everything: * Lots of detail about purchases and transactions * Click stream data encapsulating the interests displayed by visitors to their website * Supply chain management data * Competitive intelligence * Employee knowledge and expertise * Demographic information about customers If it can be accessed effectively, this information becomes valuable intelligence, capable of improving business operations Business operations are those activities involved in the running of a business for the purpose of producing value for the stakeholders. Compare business processes. The outcome of business operations is the harvesting of value from assets across the board, from marketing to merchandising. If the data can't be stored or retrieved efficiently, the organization risks losing critical ground to those competitors who have better intelligence. In addition to experiencing exponential growth in data storage, organizations are also experiencing growth in user demand for that data. Through their Intranet, Extranet, and Internet sites, they are providing concurrent access The ability to gain admittance to a system or component by more than one user or process. For example, concurrent access to a computer means multiple users are interacting with the system simultaneously. to their data to more people for more hours in the day than ever before. For example: * Over an Intranet, marketers access demographic information to perform finer segmentation and to achieve better targeting for their new marketing strategies and tactics. * Over an Extranet, suppliers analyze purchases by location and time to manage inventories and capitalize on Cap´i`tal`ize on` v. t. 1. To turn (an opportunity) to one's advantage; to take advantage of (a situation); to profit from; as, to capitalize on an opponent's mistakes s>. transient sales opportunities at the local geographical level. * Over the Internet, customers are invited to view the hottest products and track order status. Lots of people now have access to lots of data. And all of these people want answers quickly. If decisions are made in real-time, then the information required for those decisions must be available in real-time. Waiting an hour or more for a query to return results is unacceptable; users want answers in seconds. Leading software vendors have made great strides in accelerating database queries, but these applications cannot address the inherent bottlenecks of existing data warehousing architectures. As demand for access and analysis continues to grow, users must look to new database architectures to provide improved performance. Traditional Database Architectures Many databases currently use either symmetric multiprocessing See SMP. (parallel) symmetric multiprocessing - (SMP) Two or more similar processors connected via a high-bandwidth link and managed by one operating system, where each processor has equal access to I/O devices. (SMP (Symmetric MultiProcessing) A multiprocessing architecture in which multiple CPUs, residing in one cabinet, share the same memory. SMP systems provide scalability. As business increases, additional CPUs can be added to absorb the increased transaction volume. ) or massively parallel See MPP. processing (MPP (Massively Parallel Processing or Massively Parallel Processor) A multiprocessing architecture that uses up to thousands of processors. Some might contend that a computer system with 64 or more CPUs is a massively parallel processor. ) to handle queries. SMP uses multiple CPUs with a common memory. While additional CPUs can be added to improve performance, the SMP architecture often creates system bottlenecks. With MPP, queries are sent to multiple processors, each with their own storage device. This prevents the bottlenecks experienced with SNIP snip a white mark on the muzzle extending into one nostril. , but requires additional programming to enable all of the segments to communicate with each other. The higher cost and administrative hassle of the no-shared-resources approach makes pure MPP systems infrequent in practice. Typical MPP systems are implemented virtually in clusters of SMPs. This preserves some of the performance and scalability advantages of MPP while reducing cost and communication latencies. However, sharing resources to any degree imposes coordination overhead that ultimately limits performance and scalability. The most common argument against the effectiveness of MPP approaches is that uneven data distributions mitigate the inherent advantages of parallelism. While this is true in theory, most MPP-based systems use a hash distribution scheme that spreads data evenly across all nodes, for most queries. In any event, a similar concern applies to architectures like SNIP and clusters that share disks. In these cases, commonly accessed data may reside on selected sections of the shared disk. As multiple processors try to update the same disk blocks, they experience delays as they coordinate their locks. Fundamentally, neither SMP nor MPP are successful on their own, due to these constraints. The AMPP AMPP Apache, MySQL, PHP and Perl AMPP Actual Medicinal Product Pack (UK) AMPP Advanced Materials and Processing Program Approach A more effective approach can be found by combining the strengths of SMP and MPP to create a new architecture called Asymmetric Massively Parallel Processing (AMPP). AMPP architecture is built to harness the processing power of SNIP and combine it with the scalability of MPP. Simply put, AMPP applies SNIP and MPP approaches in a two-tier system The two-tier system, in the context of labor relations, is a type of contract employed by companies to scale back negotiated wages and benefits. When a two-tier system is in place in a new contract, workers hired before ratification of that contract have a wage progression to the areas where they can provide the largest benefits in performance and scalability. On the first tier, an SMP-based host compiles queries into parallel execution plans and provides the right amount of processing power to sort and aggregate large sets of query results. On the second tier, data is distributed across many nodes to minimize I/O (Input/Output) The transfer of data between the CPU and a peripheral device. Every transfer is an output from one device and an input to another. See PC input/output. I/O - Input/Output latency and increase scalability. A host divides a query up into a sequence of smaller requests called snippets that can be executed in parallel, and it distributes snippets to the second tier for execution. In addition to coordinating second-tier components, the host is also available for query processing on its own. It is typically called upon to perform aggregate operations like sorting, joining and grouping intermediate results. The host makes good advantage of the SMP's shared memory (1) Using part of main memory to support a low-cost display circuit that does not have its own memory. See shared video memory. (2) The common memory in a symmetric multiprocessing system that is available to all CPUs. See SMP. 1. model and intrinsic load balancing The fine tuning of a computer system, network or disk subsystem in order to more evenly distribute the data and/or processing across available resources. For example, in clustering, load balancing might distribute the incoming transactions evenly to all servers, or it might redirect them . AMPP combines the best elements of SNIP and MPP into a new architecture to allow a query to be processed in the most optimized way possible. It is architected to remove all the bottlenecks to data flow so that the only remaining limit is the disk speed--a "data flow" architecture where data moves at "streaming" speeds. Through standard interfaces, it is fully compatible with existing BI applications, tools and data. And it is extremely simple to use. With AMPP, the traditional 110 bus bandwidth problem of SMP is mitigated by delegating most I/O to the second, massively parallel tier. The traditional difficulty of managing large numbers of independent components is lessened by concentrating system management and reliability functions in the SNIP host(s). The second tier of AMPP architecture consists of hundreds to thousands of query snippet A small amount of something. In the computer field, it often refers to a small piece of program code. processors, called snippet processing units (SPUs). Each SPU SPU Seattle Pacific University SPU Seattle Public Utilities SPU Strategy and Policy Unit SPU Sripatum University (Thailand) SPU Split, Croatia (Airport Code) SPU Synergistic Processor Unit is solely responsible for managing a slice of the overall database. To this end, it has dedicated memory, disk, 110 bus, general purpose CPU CPU in full central processing unit Principal component of a digital computer, composed of a control unit, an instruction-decoding unit, and an arithmetic-logic unit. and a programmable disk controller. The massively parallel, shared-nothing set of SPUs provides the performance advantage of MPP. The SPUs in the second tier are not directly accessible to the end user or application. They respond to requests from the host elements in the first tier. The requests sent from the host elements to the SPUs typically require significant processing on the part of the SPUs, so that the higher interprocess communication See IPC. typical of MPP is dispersed over more processing. This mitigates one of the traditional disadvantages of MPP. While the SPUs in the second tier respond to requests from the hosts in the first tier, they are highly autonomous, performing their own scheduling, storage management, transaction management, concurrency control In a DBMS, managing the simultaneous access to a database. It prevents two users from editing the same record at the same time and is also concerned with serializing transactions for backup and recovery. and replication. This significant degree of autonomy relieves the host from the responsibility of coordinating these functions and synchronizing this coordination with other hosts in the first tier, allowing full scalability in both tiers of the architecture. Using this massively parallel approach, AMPP requires only two percent of the host's processing power, as compared to other database architectures. Simply put, AMPP architecture applies SMP and MPP approaches to the areas where they can provide the largest benefits in performance and scalability, enabling companies to quickly harness data for real-time decisions. Conclusion The size of the average data warehouse is increasing and showing no signs of slowing down--multi-terabyte sized data warehouses are becoming more and more common--and with this increased store of knowledge comes an increased demand to generate intelligence from data. Businesses should not need to discard customer data from two months ago because their database slows to a crawl when the data is kept. As the amount of data continues to grow, the data warehouse must be fast enough, and flexible enough, to grow as well. While almost all large companies have a growing amount of data, those organizations experiencing a true data explosion include financial services, retail, telecommunications, bioinformatics and government. Companies in these sectors are struggling to access data quickly in order to make more intelligent business decisions. An AMPP architecture will be very valuable as companies in these and other areas explore data for business purposes and begin building data warehouses for the first time. The benefits of this type of architecture are impressive. Users who are forced to wait hours and days for analysis on traditional systems benefit from tremendous speeds with the new architecture, which can generate results in minutes and seconds rather than hours or days. That speed translates into complete freedom to conduct ad hoc For this purpose. Meaning "to this" in Latin, it refers to dealing with special situations as they occur rather than functions that are repeated on a regular basis. See ad hoc query and ad hoc mode. and complex analyses without restraints. In the next chapter of data warehousing, AMPP architecture will play an important role as companies continue to struggle to make sense of ever-growing oceans of data. www.netezza.com Foster Hinshaw is co-founder and CTO (Chief Technical Officer) The executive responsible for the technical direction of an organization. See CIO and salary survey. of Netezza Corp. (Framingham, Mass.) |
|
||||||||||||||||||

Printer friendly
Cite/link
Email
Feedback
Reader Opinion