Dataflow distributed database systems.
Key words: dataflow, DDBMS, query, SQL, transaction, replication.
The development of database system technology is closely connected with the development of computer networks and distributed technologies. The result is distributed database systems, which are becoming dominant in data-intensive applications. Results from recent years show that multiprocessor database systems are a solution for providing data to many users. The Department of Computers and Informatics focuses on high-load data processing. Its research concentrates on dataflow systems, with emphasis on the design and realization of its own dataflow system. This article deals with the design of a dataflow system from the viewpoint of distributed database systems.
2. DATAFLOW KPI ARCHITECTURE
Dataflow systems are based on a different architecture than the classic von Neumann one. Von Neumann architectures control a program by a stream of instructions, whereas dataflow systems control a program by the stream of data: an operation can be executed only when its operands are available. A dataflow program is represented graphically as a dataflow graph, which unambiguously determines when each operation can be executed. The main strength of this architecture lies in solving parallel problems. A dataflow architecture avoids the synchronization problems of parallelism in instruction-controlled architectures and builds parallelism directly into the architecture.
At the Department of Computers and Informatics in Kosice, intensive research is under way on parallel architectures, with emphasis on dataflow systems. One of the results is the design of the department's own dataflow system [VEGA: 1/9027/2002], [VEGA1/1064/04]. The Dataflow KPI system (fig. 1) is designed as a dynamic system with direct operand matching. Combining the local control flow model with the global one makes it possible to organize the parallel implementation of a functional program effectively. The Dataflow KPI architecture became the core of the dataflow DDBMS design.
A distributed database is a collection of multiple, logically related databases distributed throughout a computer network. Several architectures are associated with DDBMSs. The best known is the CLIENT/SERVER architecture, in which multiple users access one data server. The main task of the server is to respond to users' queries; the whole database is administered centrally by the database system. The MULTI-CLIENT/MULTI-SERVER architecture is more distributed and more flexible. Data are distributed among individual servers, which share them. Users access a home server, which routes their queries to the defined data centers. Fully distributed systems are peer-to-peer systems. In this architecture there is no difference between client and server: every network node acts as server and client at the same time.
The designed DDBMS architecture is of the peer-to-peer type. Independent central processors are interconnected by a linking network (fig. 3). Each of the main units has its own database. Based on an input query, a central processor selects the requested data and requests information from other central processors; the query results are saved into the Data Queue Unit for further processing. Every central processor is therefore client and server at the same time. The coordination processor plays a very important role. In addition to matching operands and other tasks, it has to fragment data among the central processors. Data fragmentation is performed according to an input condition and the dependence of data. Horizontal fragmentation selects whole records based on a selection criterion. The second form, vertical fragmentation, chooses only some attributes of a relation; it is a projection.
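The two forms of fragmentation can be illustrated with a minimal sketch. It assumes a relation held in memory as a list of records; the function names, the `employees` relation, and its attributes are hypothetical and are not part of the Dataflow KPI design.

```python
from typing import Callable

Row = dict[str, object]

def horizontal_fragment(relation: list[Row], predicate: Callable[[Row], bool]) -> list[Row]:
    """Selection: keep whole records that satisfy the selection criterion."""
    return [dict(row) for row in relation if predicate(row)]

def vertical_fragment(relation: list[Row], attributes: list[str]) -> list[Row]:
    """Projection: keep only the listed attributes of every record."""
    return [{a: row[a] for a in attributes} for row in relation]

# Hypothetical example relation.
employees = [
    {"id": 1, "name": "Ana", "city": "Kosice", "salary": 1200},
    {"id": 2, "name": "Ben", "city": "Presov", "salary": 1500},
    {"id": 3, "name": "Eva", "city": "Kosice", "salary": 1100},
]

# Horizontal fragment: whole records for one city.
kosice_rows = horizontal_fragment(employees, lambda r: r["city"] == "Kosice")
# Vertical fragment: only the key and the salary attribute.
pay_columns = vertical_fragment(employees, ["id", "salary"])
```

A vertical fragment normally keeps the key attribute (here `id`) so that the original relation can later be rebuilt by a join.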
[FIGURE 1 OMITTED]
3. QUERY EXECUTION PLAN
One of the most important problems of the designed dataflow DDBMS architecture is query execution. Because data are distributed among the individual central processors, finding the optimal plan for executing user queries is an NP-hard problem. The designed system therefore uses heuristic analysis, which reduces the time complexity.
Query execution is a process that transforms an input request for data into low-level operations on the data. The dataflow architecture supports this division of user queries into atomic operations on data. A user query can be transformed into a sequence of operations from which a dataflow graph of the query execution is created, which allows parallel data processing without a separate solution to synchronization problems.
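The idea of executing a query as a dataflow graph can be sketched as follows. This is a sequential simulation under stated assumptions, not the Dataflow KPI mechanism itself: `run_dataflow`, the node names, and the sample fragments are hypothetical. Each node fires only when all of its operands are available, so the order of firing is determined by the data and not by the order in which the nodes are listed.

```python
from collections import deque

def run_dataflow(nodes, inputs):
    """nodes: name -> (function, [operand names]); inputs: name -> initial data token."""
    values = dict(inputs)
    pending = deque(nodes)
    fired_order = []
    stalled = 0  # consecutive attempts that found a node not ready
    while pending:
        name = pending.popleft()
        func, operands = nodes[name]
        if all(op in values for op in operands):
            # Firing rule: every operand is present, so the operation executes.
            values[name] = func(*(values[op] for op in operands))
            fired_order.append(name)
            stalled = 0
        else:
            pending.append(name)
            stalled += 1
            if stalled > len(pending):
                raise ValueError("graph has operands that never become available")
    return values, fired_order

# Hypothetical query: names of employees with salary > 1150, over two fragments.
nodes = {
    "project": (lambda rows: [r["name"] for r in rows], ["union"]),
    "union": (lambda a, b: a + b, ["sel_1", "sel_2"]),
    "sel_1": (lambda rows: [r for r in rows if r["salary"] > 1150], ["frag_1"]),
    "sel_2": (lambda rows: [r for r in rows if r["salary"] > 1150], ["frag_2"]),
}
inputs = {
    "frag_1": [{"name": "Ana", "salary": 1200}, {"name": "Eva", "salary": 1100}],
    "frag_2": [{"name": "Ben", "salary": 1500}],
}
values, fired_order = run_dataflow(nodes, inputs)
```

The two selections have no operands in common, so in a real dataflow machine they could fire at the same time on different processors; the simulation above merely fires them one after the other.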
Selection of the best plan consists of four steps (fig. 2): query decomposition, data localization, global optimization, and local optimization. Three of these steps (decomposition, data localization, and global optimization) are tasks of the coordination central processor.
Decomposition transforms the input query into relational algebra form. In this step the query is analysed semantically in order to eliminate incorrect queries. A semantically correct query is then simplified and redundant predicates are eliminated.
The first step is followed by data localization. Its main task is to find the location of data in the distributed database based on the input data fragmentation. Localization is performed with the reverse of the fragmentation operations: join and union operations are applied. In this transformation of the query, global relations are replaced by the correct ones with respect to data location and then assigned to a particular central unit.
Optimization of the input query means replacing the input algebraic expression, by means of algebraic rules, with an equivalent expression that has a better rating. In the designed system this process includes global optimization, copying of operations from each central unit, and local optimization, that is, optimization with respect to the central processor on which the query is executed. In the designed system, local optimization is performed as the third step. This transformation of the query is very similar to optimization in a centralized DBMS. It is based on system resources such as the amount of cache, the speed of the central processor, and the cost of the necessary operations. The last step of query optimization is global optimization. It uses the results of the previous steps and, based on the created dataflow graph, tries to exploit the possibility of parallel query execution.
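The statement that localization uses the reverse of the fragmentation operations can be sketched in a few lines. The helper names and the sample fragments below are hypothetical; the sketch assumes that horizontal fragments are disjoint and that vertical fragments share a key attribute.

```python
def union_fragments(*fragments):
    """Reverse of horizontal fragmentation: union of disjoint sets of records."""
    return [dict(row) for fragment in fragments for row in fragment]

def join_fragments(left, right, key):
    """Reverse of vertical fragmentation: join attribute groups on the shared key."""
    right_by_key = {row[key]: row for row in right}
    return [
        {**row, **right_by_key[row[key]]}
        for row in left
        if row[key] in right_by_key
    ]

# Hypothetical vertical fragments of one relation, stored on two central units.
names = [{"id": 1, "name": "Ana"}, {"id": 2, "name": "Ben"}]
pay = [{"id": 1, "salary": 1200}, {"id": 2, "salary": 1500}]
rebuilt_by_join = join_fragments(names, pay, "id")

# Hypothetical horizontal fragments of the same relation.
part_a = [{"id": 1, "name": "Ana", "salary": 1200}]
part_b = [{"id": 2, "name": "Ben", "salary": 1500}]
rebuilt_by_union = union_fragments(part_a, part_b)
```

Both reconstructions yield the same global relation, which is the property the localization step relies on when it replaces a global relation by an expression over its fragments.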
[FIGURE 2 OMITTED]
Whenever multiple users access a shared database, their access has to be synchronized to preserve database consistency. User accesses are wrapped into so-called transactions, which consist of low-level accesses to data: reads and writes. Concurrent access to data is controlled by isolating individual transactions. It is necessary to avoid synchronization problems between read-write and write-write operations.
One of the most widely used ways to control concurrent access to data is locking. In a DDBMS, a serious problem for transactions is serialization. The problems in distributed databases lie mainly in executing the operations of transactions: operations have to be processed at the same time on several nodes of the network. Such coordinated transaction execution is possible only if:
1. transaction execution is serialized on each node of the DDBMS
2. the order of transactions is identical on each node of the DDBMS network
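The two conditions above can be checked on recorded execution histories. The sketch below makes a simplifying assumption that is stricter than the article requires: a node counts as serialized only if its history is strictly serial, that is, the operations of different transactions are not interleaved. The function names and the sample histories are hypothetical.

```python
def serial_order(history):
    """Return the order of transactions if the history is serial, else None.

    history is a list of (transaction id, operation) pairs in execution order.
    """
    order = []
    for txn, _operation in history:
        if order and order[-1] == txn:
            continue  # the same transaction is still running
        if txn in order:
            return None  # transaction resumed after another one ran: interleaved
        order.append(txn)
    return order

def globally_serialized(histories):
    """Condition 1: every node is serial. Condition 2: all nodes use the same order."""
    orders = [serial_order(history) for history in histories.values()]
    if any(order is None for order in orders):
        return False
    return all(order == orders[0] for order in orders)

# Hypothetical histories of two central processors.
same_order = {
    "cp1": [("T1", "r(x)"), ("T1", "w(x)"), ("T2", "r(x)")],
    "cp2": [("T1", "w(y)"), ("T2", "w(y)")],
}
different_order = {
    "cp1": [("T1", "w(x)"), ("T2", "w(x)")],
    "cp2": [("T2", "w(y)"), ("T1", "w(y)")],
}
interleaved = {
    "cp1": [("T1", "r(x)"), ("T2", "w(x)"), ("T1", "w(x)")],
}
```

Here `globally_serialized(same_order)` holds, while the other two histories violate the second and the first condition respectively.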
With locking, three options are most commonly used to ensure global serialization: central locking, primary copy locking, and a distributed algorithm. Centralized locking is based on locking at a central node: all locking is performed at one central point. The implementation is relatively simple, but it has two drawbacks. The central node can become a bottleneck, and, no less importantly, a failure of that node can make the whole DDBMS unavailable. Primary copy locking starts from the assumption that data are replicated among the individual units of the network. All nodes of the network have information about the primary copies, and for transactions, locking is performed only on the primary copies. In distributed locking, the task of locking is divided among the individual nodes of the DDBMS network. Processing of transactions needs the participation and coordination of several lock managers. Distributed locking does not incur costs as high as central locking, but the complexity of the algorithm is higher.
The designed dataflow DDBMS architecture uses a transaction system closest to central locking, even though this mechanism carries the risk of a bottleneck or of unavailability of the whole system. The task is performed by the coordinating processor, whose main task is to ensure serialization of the individual operations that form transactions. Data consistency has to be ensured, which is why every transaction has to be performed concurrently on every central processor. The parallelism of transactions does not need to be solved separately because of the dataflow architecture. After a transaction is assigned to a central processor, the central processor decides whether the operations of the transaction can be performed and returns its answer to the coordinating processor, which waits for answers from all nodes of the network. If even one central processor refuses the transaction, the transaction is aborted. While a transaction is executing, other transactions wait in the transaction queue, which ensures the autonomy of transactions.
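The coordination scheme described above can be sketched as a small simulation. Everything below is an illustrative assumption and not the Dataflow KPI implementation: the class names, the rule by which a central processor accepts a transaction (it accepts only writes to keys it already stores), and the step of applying the writes after a unanimous answer are all hypothetical.

```python
from collections import deque

class CentralProcessor:
    def __init__(self, name, data):
        self.name = name
        self.data = dict(data)

    def can_perform(self, writes):
        # Hypothetical acceptance rule: only keys already stored here may be written.
        return all(key in self.data for key in writes)

    def apply(self, writes):
        self.data.update(writes)

class Coordinator:
    def __init__(self, processors):
        self.processors = processors
        self.queue = deque()  # transactions wait here while another one executes

    def submit(self, txn_id, writes):
        self.queue.append((txn_id, writes))

    def run(self):
        outcomes = {}
        while self.queue:
            txn_id, writes = self.queue.popleft()
            # Wait for an answer from every central processor.
            answers = [cp.can_perform(writes) for cp in self.processors]
            if all(answers):
                for cp in self.processors:
                    cp.apply(writes)
                outcomes[txn_id] = "committed"
            else:
                # A single refusal aborts the whole transaction.
                outcomes[txn_id] = "aborted"
        return outcomes

cp1 = CentralProcessor("cp1", {"x": 1, "y": 2})
cp2 = CentralProcessor("cp2", {"x": 1})
coordinator = Coordinator([cp1, cp2])
coordinator.submit("T1", {"x": 10})  # accepted by both processors
coordinator.submit("T2", {"y": 5})   # refused by cp2, which does not store y
outcomes = coordinator.run()
```

Because the coordinator takes one transaction at a time from the queue, every central processor sees the transactions in the same order; the price is that the coordinator is a single point through which all transactions pass.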
[FIGURE 3 OMITTED]
The goal of this article was to point out selected properties of distributed database systems and to present the design of a dataflow DDBMS architecture. Now that the volume of data increases continually, these data have to be handled very effectively. The DDBMSs described here represent one way of processing growing amounts of data, and they offer adequate room for further scalability. The created KPI dataflow DDBMS architecture is based on the existing KPI DF architecture. Every central unit has its own database, distributed by the coordinating central processor. The coordinating processor plays a very important role in the designed architecture: apart from matching operands, it is assigned further tasks in processing user queries, data fragmentation, and transaction execution. In spite of the risk that the coordinating processor becomes a bottleneck or that the whole DDBMS becomes unavailable, the created architecture provides the performance properties expected of a DDBMS.
Vokorokos, L.: Data Flow computer architecture principles. Monograph. Copycenter, spol. s.r.o, Kosice, 2002. ISBN 80-7099-824-5
Carey, M.: Parallelism and concurrency control performance in distributed database machines, 1989, ACM 0-89791-317-5/89/0005/0122
Ozsu, M.: Distributed database systems, Waterloo
Bernstein, P.: Concurrency control in distributed database systems, June 1981, Computing Surveys Vol. 13 No. 2
Supported by VEGA project No. 1/1064/04
Author: Vokorokos, L.; Balaz, A.; Adam, N.; Petrik, S.
Publication: Annals of DAAAM & Proceedings
Article Type: Technical report
Date: Jan 1, 2005