An approach for reverse engineering of relational databases.
There are various motives for reverse engineering of databases. One may want to migrate among database paradigms: from past hierarchical, network, and relational databases to modern relational and object-oriented databases. A more mundane task would be to migrate between different implementations of a database paradigm, for example, from one vendor's RDB to another's RDB. Reverse engineering can also elucidate poorly documented existing software when the developers are no longer available for advice. We have derived substantial benefit from reverse engineering of vendor software. Reverse engineering provides an unusual source of insight.
Object-oriented models provide a natural language for facilitating the reengineering process. An object-oriented model can describe the existing software, the reverse-engineered semantic intent, and the forward-engineered new system. This article adopts the object-modeling technique (OMT) notation for modeling data. Graphical OMT models are intuitive, yet provide a rigorous basis for specifying software. OMT concepts are similar to those in extended entity-relationship modeling.
This article makes two major contributions: 1) We propose a more robust process than advanced in the literature. We recognize generalization early in our reverse-engineering process and provide guidelines for coping with design optimizations and unfortunate implementation decisions. Our process emphasizes analysis of candidate keys rather than primary keys. We incorporate three sources for information: schema, observed patterns of data, and our semantic understanding of application intent. 2) We report our experiences with industrial examples.
Typical Implementation Strategies for Forward Engineering
Reverse-engineering strategies must consider past practices. In our case studies we have seen some surprising design techniques. Often the model or its implementation as a schema was fundamentally flawed or violated good design practice. We found violations of most of the assumptions that are made in existing reverse-engineering methods.
Classes. Each class is usually implemented as a table, with a column for each attribute. Unique indexes are usually specified on primary and candidate keys. There are several approaches to constructing a primary key. Often the primary key is a single generated identifier. Sometimes one or more attributes from the problem domain serve as the primary key. Finally, a class may derive its identity from another class or classes through a relationship.
Classes may be split or combined to form tables. Classes related by a one-to-one association are sometimes mapped to a single table. A class may be split into vertical or horizontal fragments. A vertical split maps attributes to columns in different tables. A horizontal split apportions records to several tables with identical columns.
If a class has no attributes, it may simply not be mapped to a table. We call this a lightweight class, and have seen examples in many of our case studies.
Generalizations. The usual approach for generalizations is to map each class in a hierarchy to a table with columns corresponding to the attributes and a common primary key. A generalization partitions data. For each record in the superclass table there is a record in at most one subclass table. Unique indexes are created for primary keys and candidate keys. Some relational databases support foreign-key constraints.
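The common-primary-key mapping can be sketched concretely. The following sqlite fragment uses hypothetical class names (Equipment with subclasses Pump and Tank); it is an illustration of the mapping described above, not a schema from any of our case studies.

```python
import sqlite3

# Hypothetical generalization: superclass Equipment, subclasses Pump and Tank.
# Each subclass table shares the superclass primary key (common-key mapping).
con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE Equipment (
    equipment_id  INTEGER PRIMARY KEY,       -- generated identifier
    name          TEXT NOT NULL,
    discriminator TEXT NOT NULL              -- 'PUMP' or 'TANK'
);
CREATE TABLE Pump (
    equipment_id INTEGER PRIMARY KEY
        REFERENCES Equipment(equipment_id),  -- identity derived from superclass
    flow_rate    REAL
);
CREATE TABLE Tank (
    equipment_id INTEGER PRIMARY KEY
        REFERENCES Equipment(equipment_id),
    volume       REAL
);
INSERT INTO Equipment VALUES (1, 'P-101', 'PUMP');
INSERT INTO Pump VALUES (1, 12.5);
""")

# The generalization partitions data: each Equipment record has a matching
# record in at most one subclass table, joined on the common primary key.
row = con.execute("""
    SELECT e.name, p.flow_rate
    FROM Equipment e JOIN Pump p USING (equipment_id)
""").fetchone()
print(row)  # ('P-101', 12.5)
```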
Most generalizations are exhaustive and nonoverlapping without multiple inheritance. In a few cases we have seen subclasses without attributes that are not mapped into tables. Then, some superclass records do not have corresponding subclass records. A discriminator may or may not be used. The values for a discriminator are usually similar to the names of the subclasses, including any lightweight subclasses. In a multiple-level hierarchy, there may be a discriminator at each level, or there may be a single discriminator at the root.
Other implementations of generalizations infrequently occur:
* Push superclass attributes down. Map each leaf subclass to a table, including columns for all inherited attributes.
* Push subclass attributes up. Map the root superclass to a table, including columns for all subclass attributes. Each record will not populate all columns.
Associations. The most common association construct is the buried foreign key, used to implement binary associations that are not many-to-many. The foreign key is usually buried on one end of the association, subject to multiplicity constraints. Usually the foreign-key column names are semantically meaningful names. Sometimes the name of each foreign-key column matches the name of the referenced column. Unique indexes can enforce multiplicity constraints, particularly for qualified associations.
Implementation of null values varies. For example, we have seen schemas in which 1) "not null" was specified for a foreign key and 2) the designer selected a special value to represent null. Usually the data types of the foreign-key columns match the data types of the columns of the corresponding primary or candidate key; however, we have seen elaborate schemes in which foreign keys are concatenated from other information. One-to-one and zero-one-to-one associations may be implemented with a common primary key. This implementation is nearly indistinguishable from a generalization.
If an association involves a lightweight superclass, it may be pushed down to the subclasses, resulting in multiple foreign keys or a foreign key that could point to multiple tables. Ternary or higher associations are seldom encountered. The usual practice is to promote these associations into classes.
Many-to-many associations are usually implemented with an intersection table. The two foreign keys combine to form a candidate key. In some cases, the intersection table may have another candidate key, typically a generated identifier. Some unusual ways for implementing many-to-many associations will be discussed later. Associations that are not many-to-many may also be implemented as a separate table, either at the discretion of the designer, or because the association has link attributes or is a link class.
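The intersection-table pattern can be sketched as follows; Person, Project, and Assignment are hypothetical names invented for illustration.

```python
import sqlite3

# Intersection table for a many-to-many association between Person and
# Project. The two foreign keys combine to form the candidate key.
con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE Person  (person_id  INTEGER PRIMARY KEY, name  TEXT);
CREATE TABLE Project (project_id INTEGER PRIMARY KEY, title TEXT);
CREATE TABLE Assignment (
    person_id  INTEGER REFERENCES Person(person_id),
    project_id INTEGER REFERENCES Project(project_id),
    role       TEXT,    -- a link attribute
    PRIMARY KEY (person_id, project_id)   -- combined candidate key
);
INSERT INTO Person  VALUES (1, 'Ann'), (2, 'Bob');
INSERT INTO Project VALUES (10, 'Turbine');
INSERT INTO Assignment VALUES (1, 10, 'lead'), (2, 10, 'analyst');
""")

# Traverse the many-to-many association through the intersection table.
members = con.execute("""
    SELECT p.name FROM Person p
    JOIN Assignment a ON a.person_id = p.person_id
    WHERE a.project_id = 10 ORDER BY p.name
""").fetchall()
print(members)  # [('Ann',), ('Bob',)]
```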
Usually the implementation of associations does not introduce redundancy; however, sometimes an association is implemented by burying a pointer on both ends. For example, we have encountered one-to-one associations implemented by burying foreign keys in both associated tables.
Our reverse-engineering experiments have exercised a suite of tools: OMTool (an OMT diagram editor developed by General Electric), SQL code, AWK scripts, simple programs, editor macros, and manual analysis. We found this to be an effective way to cope with surprises. In our case studies, we have observed a wide range of styles and unusual implementation constructs that complicate reverse engineering.
Data dictionary for an RDBMS. We chose the data dictionary for a leading relational database management system (RDBMS) in order to have an example with optimization by database design experts. The resulting object diagram spanned a dozen sheets. We had several sources of information: the data stored in the data dictionary, user manuals, and our own problem domain knowledge. We did not have foreign-key information when we started; we deduced keys.
This schema included only one simple generalization that was implemented with separate superclass and subclass tables. The discriminator was stored as an encoded enumerated type in the superclass table. We deciphered the coding with view definitions.
We did not have enough data to empirically determine all candidate and foreign keys. Without semantic clues (our general understanding of RDBMS implementation), we would have been unable to do a very good job. Most candidate key and foreign-key attributes ended with the character #. Candidate key attribute names were usually similar to the name of the table. For example, TS# is a candidate key for the table TS$. Foreign-key attribute names usually (but not always) matched the primary-key attribute names to which they pointed.
We proceeded as follows. First, we seeded the object diagram with a class for each table in the schema. Next, we found unique indexes from the data dictionary to determine some candidate keys. We found additional candidate keys using database queries and semantic knowledge. Next we looked for foreign keys. We suspected attributes with # in their names and verified them using database queries. We mechanically generated tentative associations from candidate keys and foreign keys. We then applied transformations to rationalize the model. In particular, we converted a group of one-to-zero-one associations into a generalization. A few unusual constructs prompted us to formulate additional queries and further refine the model.
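The verification queries we ran against suspected foreign keys were of the kind sketched below: every non-null value in the referencing column should appear among the candidate-key values of the referenced table. The table and column names here (SEG, TS) are hypothetical stand-ins, not actual data-dictionary tables.

```python
import sqlite3

def dangling_count(con, child, fk_col, parent, key_col):
    """Count non-null values in child.fk_col absent from parent.key_col.
    Zero dangling references is evidence for (not proof of) a foreign key."""
    sql = f"""
        SELECT COUNT(*) FROM {child}
        WHERE {fk_col} IS NOT NULL
          AND {fk_col} NOT IN (SELECT {key_col} FROM {parent})
    """
    return con.execute(sql).fetchone()[0]

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE TS  (ts_no  INTEGER PRIMARY KEY);
CREATE TABLE SEG (seg_no INTEGER PRIMARY KEY, ts_no INTEGER);
INSERT INTO TS  VALUES (1), (2);
INSERT INTO SEG VALUES (10, 1), (11, 2), (12, NULL);
""")
print(dangling_count(con, "SEG", "ts_no", "TS", "ts_no"))  # 0
```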
We used transformations to refine the model. Figure 1 shows a transformation to raise the visibility of a qualifier. A Cluster contains many Tables. A Table may belong to at most one Cluster. The combination of a Cluster and a table# yields a specific Table. This association was implemented by burying cluster-id as a foreign key in Table. Because of the optional membership in a cluster, the foreign key can be null, and the combination of cluster-id and table# is not a candidate key of Table. Therefore it is difficult to mechanically detect this qualified association. In this case, we believe the qualified association is a more accurate representation of the designer's intent.
Another unusual construct was an alternate qualifier, as shown in Figure 2. A Column derives its identity from a Table plus a qualifier, either column-name or column#. In fact, the association pairs column names with column numbers. At first, we thought there were two separate associations.
There were two classes in the schema that we thought should participate in a one-to-one association. However, there appeared to be an extra, unmatched record in one of the tables. On closer examination, we discovered that the extra record is apparently used by the ID-generating algorithm as a placeholder for the next record in the table.
Another unusual implementation of an association is shown in Figure 3. The original design model derived from the data dictionary is shown at the top; our revised model of the semantic intent is shown at the bottom. The qualifier between Table and Comment can be null; optional comments can be attached to either a table or a column. If the qualifier between Table and Comment is null, the comment is attached to a table; otherwise the comment is attached to the corresponding Column. The association is not between Table and Comment as shown in the top of the figure, but is the more complex diagram at the bottom of the figure. We deduced the association between Column and Column-comment by traversing both qualified associations in the top of the figure.
A product data manager. We studied the database of a leading product data manager for mechanical parts because we were evaluating the product and felt that an understanding of its underlying model would enrich our evaluation.
This was our most complex case study. The final object diagram spanned 20 sheets, containing a deep inheritance hierarchy, many associations, and several unusual implementation techniques. On one hand, the size of the schema complicated reverse engineering and prevented us from working on a table-by-table basis. On the other hand, the size appears to have encouraged the designers to use a uniform strategy for implementing classes and associations. We were able to extrapolate from specific tables to the entire schema; at each step of the process we would closely inspect a few tables. When we thought we understood these few tables, we generated queries for all the tables. Because of several unexpected implementation strategies, we required iteration. Nevertheless, we were able to automate much testing of our theories.
First we seeded an object diagram with classes, attributes, and candidate keys by querying the relational database's data dictionary. We scrutinized data and schema to develop an implementation strategy. We noticed apparent generated IDs consisting of alphanumeric character strings. Also, many tables had a candidate key consisting of an attribute with "ID" embedded in the name, which we assumed to be primary keys. Some tables had multiple candidate keys. Examining a few tables closely, we conjectured that many associations were implemented with buried foreign keys and that most generalizations were implemented with a common primary key. This assumption turned out to be largely true, with a few notable exceptions, as we discovered later.
In order to determine generalizations, we generated a list (directed-assoc-list) of possible one-to-zero-one associations through queries involving every pair of primary keys. The resulting list was large, indicating a possibly extensive generalization structure. This list is the transitive closure of all direct subclass-superclass pairs. We then deduced the direct pairs with the query shown in Figure 4, producing a simple inheritance hierarchy with more than 40 classes and four levels. Later, we discovered that a few one-to-zero-one associations were included by our query, but we rejected these by inspection. All generalizations were exhaustive, and none were overlapping.
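The reduction from transitive closure to direct subclass-superclass pairs can be sketched in a few lines: a pair (sub, super) is direct unless some intermediate class sits between them. The class names below are invented for illustration.

```python
# Transitive closure of subclass-superclass pairs, as produced by pairwise
# primary-key queries: D < B < A, and C < A (hypothetical class names).
closure = {("D", "B"), ("D", "A"), ("B", "A"), ("C", "A")}

# A pair is direct if no intermediate class mid satisfies sub < mid < sup.
classes = {c for pair in closure for c in pair}
direct = {
    (sub, sup) for (sub, sup) in closure
    if not any((sub, mid) in closure and (mid, sup) in closure
               for mid in classes if mid not in (sub, sup))
}
print(sorted(direct))  # [('B', 'A'), ('C', 'A'), ('D', 'B')]
```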
Next we considered foreign keys and associations. Attribute names did not help. We did spot a few obvious foreign keys, which gave us a few clues to determine strategy. It appeared that IDs were char(14) or char(15) and that foreign keys were char(14), char(15), char(27), or char(28). We discovered that the char(14) and char(15) formats were equivalent to a char(14) ID with an appended blank. Furthermore, both char(27) and char(28) IDs were equivalent to char(14) and char(15) formats, with appended strings of "A." We thought there might be some semantic significance to these different formats, but could not tell from our data. This highlights the problem: one can suggest a proposition with data but can only prove a proposition with semantic knowledge. We automatically checked every possible association. Using this list and the inheritance hierarchy, we generated a list of actual associations.
Next, we determined multiplicity by querying the data. A complicating factor was that sometimes a special value was used to represent a null pointer in cases where the schema formally did not allow nulls. We did not get the correct results until we took the special value into account.
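Multiplicity queries give misleading answers unless such sentinel "nulls" are excluded first. The sketch below uses invented table, column, and sentinel values to illustrate the point.

```python
import sqlite3

# Hypothetical sentinel the designers used to mean "no owner" in a column
# declared NOT NULL.
SENTINEL = "00000000000000"

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE Part (puid TEXT PRIMARY KEY, owner_id TEXT NOT NULL);
INSERT INTO Part VALUES ('P1', 'U1'), ('P2', 'U1'), ('P3', '00000000000000');
""")

# Naive count suggests every Part has an owner (mandatory multiplicity);
# excluding the sentinel shows the association is really optional.
naive = con.execute("SELECT COUNT(*) FROM Part").fetchone()[0]
real  = con.execute(
    "SELECT COUNT(*) FROM Part WHERE owner_id <> ?", (SENTINEL,)
).fetchone()[0]
print(naive, real)  # 3 2
```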
By this point, we had made the following discoveries:
* Pointers were used on one or both ends of an association, based on the direction(s) of traversal. For example, a one-to-one association traversed in both directions was implemented by burying a pointer in both ends. At first, this led us to the mistaken conclusion that there were two separate associations, where in fact a careful examination of the data showed that there was really only one.
* A typical implementation of a many-to-many association is shown in Figure 5. The left object diagram in Figure 6 is the corresponding design model; the right object diagram in Figure 6 is the reverse-engineered intent. Many approaches for reverse engineering would have difficulty correctly interpreting these fixed-length arrays. The primary key is (PUID,PSEQ). PVALU-0 through PVALU-9 are foreign keys to table USER. One record in List of Receivers points to as many as 10 users. Whenever all 10 array elements in a record are filled, another record is created with the same PUID and the next available value for PSEQ.
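Unpacking such a fixed-length array into an ordinary intersection relation can be sketched as follows. The column names follow the PUID/PSEQ/PVALU-n pattern described above (truncated to three array slots for brevity); the data are invented.

```python
import sqlite3

# Fixed-length-array implementation of a many-to-many association,
# after the pattern of Figure 5 (hypothetical data, 3 slots instead of 10).
con = sqlite3.connect(":memory:")
con.execute("""
CREATE TABLE ListOfReceivers (
    PUID TEXT, PSEQ INTEGER,
    PVALU_0 TEXT, PVALU_1 TEXT, PVALU_2 TEXT,
    PRIMARY KEY (PUID, PSEQ)
)""")
con.execute("INSERT INTO ListOfReceivers VALUES ('M1', 0, 'U1', 'U2', NULL)")
con.execute("INSERT INTO ListOfReceivers VALUES ('M1', 1, 'U3', NULL, NULL)")

# Flatten the array columns: one (message, user) pair per logical link.
pairs = []
for puid, *vals in con.execute(
        "SELECT PUID, PVALU_0, PVALU_1, PVALU_2 FROM ListOfReceivers"):
    pairs.extend((puid, v) for v in vals if v is not None)
print(sorted(pairs))  # [('M1', 'U1'), ('M1', 'U2'), ('M1', 'U3')]
```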
At this point we still had implementation constructs in the model. Although the model was consistent with the schema and the data, it is hardly what the database designer conceived. To produce a better model, we applied further transformations:
* We added several lightweight classes, based on evidence that the corresponding tables had been dropped from the schema as an optimization.
* We eliminated several lightweight classes that appeared to be implementation artifacts.
Other Experiences. With other case studies, we have encountered several additional unusual practices that must be considered during reverse engineering.
Table names may be computed. We encountered this as part of a strategy to partition the data in a master class. There were several tables with the same columns. Each table had a different name, concatenated from the base table name and a code for the partition of the data stored in the master table.
A database may mix data with metadata. For example, we encountered a column in a table with values equal to the names of the columns in another table. This situation can be difficult to detect.
Composite qualifiers can also arise. As shown in Figure 7, the combination of version# and archive-date qualifies the association between Archive and Test-data-set.
Here we summarize our recommended process for reverse engineering. A practical approach must tolerate a wide range of design styles; some database designers are more talented than others. In one sense, the flawed database designs are the most relevant problems for reverse engineering, because remedy of the flaws provides additional benefits. A practical approach must consider all information: semantics, schema, and patterns of data. Data analysis often cannot prove a proposition, but the more data that is examined, the greater the confidence in the conclusion.
We propose an informal process that requires judgment. Our steps are weakly ordered--in practice there is much iteration, backtracking, and reordering of steps. The linear presentation is largely an artifact of the presentation medium (i.e., writing on paper). To date, we have only automated portions of the process. In the longer term, we envision a toolkit of reverse-engineering functions for designers. A compiler is too rigid to be practical; reverse engineering requires frequent interaction with the designer, making semantic decisions as patterns emerge.
In general, the mapping between object models and schemas is many-to-many. Various optimizations and design decisions can be used to transform an object model into a database schema. Similarly, when reverse-engineering a database, alternate interpretations of the structure and data can yield different object models. Usually, there is no obvious, single correct answer for reverse engineering. Multiple interpretations can yield plausible results.
Step 1. Prepare an initial object model.
* Represent each table as a tentative class. All columns of tables become attributes of classes.
Step 2. Determine candidate keys.
* Look for unique indexes, but keep in mind that some candidate keys may not be enforced by unique indexes. Automated scanning of data can yield potential candidate keys. Semantic knowledge can be used to interpret suggestive patterns of data.
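The automated scanning mentioned above can be sketched as a uniqueness test over columns and column pairs. Any combination whose values are unique across the data is a candidate-key hypothesis; as throughout, data can only suggest a key, and semantic knowledge must confirm it. The table and column names below are invented.

```python
import sqlite3
from itertools import combinations

def potential_keys(con, table, cols, max_width=2):
    """Return column combinations whose values are unique in the data.
    These are candidate-key hypotheses, not proven keys."""
    n = con.execute(f"SELECT COUNT(*) FROM {table}").fetchone()[0]
    hits = []
    for width in range(1, max_width + 1):
        for combo in combinations(cols, width):
            distinct = con.execute(
                f"SELECT COUNT(*) FROM "
                f"(SELECT DISTINCT {', '.join(combo)} FROM {table})"
            ).fetchone()[0]
            if distinct == n:
                hits.append(combo)
    return hits

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE Col (tab_no INTEGER, col_no INTEGER, col_name TEXT);
INSERT INTO Col VALUES (1, 1, 'A'), (1, 2, 'B'), (2, 1, 'A');
""")
print(potential_keys(con, "Col", ["tab_no", "col_no", "col_name"]))
# [('tab_no', 'col_no'), ('tab_no', 'col_name')]
```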
Step 3. Determine foreign-key groups. If you are lucky, you will have a modern RDB with foreign-key clauses specified as part of the schema. Otherwise proceed as listed below.
* Try to resolve homonyms, attributes with the same name that refer to different things, and synonyms, attributes with different names that refer to the same thing.
* Matching attribute names, data types, and/or domains may suggest foreign keys. Join clauses in view definitions and frequent traversal paths with secondary indexes may also be germane. Data analysis can refute some foreign-key hypotheses.
* Aside from homonyms and synonyms, the target of a foreign-key reference may be ambiguous because of generalization. This is why this step refers to foreign-key groups. During this step we do not attempt to determine specific reference-referent attribute pairs--but merely groups of attributes within which foreign keys may be found. With generalization, the patterns of superclass-subclass, disjoint or overlapping, and multiple inheritance are not immediately apparent.
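A crude heuristic in the spirit of this step is to collect columns whose data type matches a known candidate key, boosting the score when the names also match. All names below are invented; this is a hypothesis generator, not a decision procedure.

```python
# Known candidate keys as (table, column, type); hypothetical names.
candidate_keys = {("TS", "ts_no", "INTEGER"), ("USER", "user_id", "CHAR(14)")}
columns = [
    ("SEG",  "ts_no",  "INTEGER"),
    ("SEG",  "seg_no", "INTEGER"),
    ("PART", "owner",  "CHAR(14)"),
]

# Each matching (column, key) pair joins a foreign-key group for later
# data analysis; a name match earns a higher score than a type match alone.
groups = []
for table, col, ctype in columns:
    for ktab, kcol, ktype in candidate_keys:
        if ctype == ktype and (table, col) != (ktab, kcol):
            score = 2 if col == kcol else 1
            groups.append((table, col, ktab, score))
print(sorted(groups))
# [('PART', 'owner', 'USER', 1), ('SEG', 'seg_no', 'TS', 1), ('SEG', 'ts_no', 'TS', 2)]
```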
Step 4. Refine tentative classes.
* Agglomerate horizontally partitioned classes into a single class. Horizontally partitioned classes have the same schema. Distributed databases often use horizontal partitioning to disperse records. (Horizontally partitioned classes must also have the same semantic intent. We presume that identical schema is a good indicator of semantic intent.)
* Detect functions and constraints that are represented as tables. Classes that do not participate in foreign keys are suspect.
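The agglomeration of horizontally partitioned tables in Step 4 can be sketched with a union query; PART_A and PART_B are hypothetical partition names of the kind produced by the computed-table-name strategy described earlier.

```python
import sqlite3

# Horizontally partitioned tables with identical columns, agglomerated
# into a single class and tagged with their source partition.
con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE PART_A (id INTEGER, name TEXT);
CREATE TABLE PART_B (id INTEGER, name TEXT);
INSERT INTO PART_A VALUES (1, 'bolt');
INSERT INTO PART_B VALUES (2, 'nut');
""")
rows = con.execute("""
    SELECT 'A' AS partition, id, name FROM PART_A
    UNION ALL
    SELECT 'B', id, name FROM PART_B
    ORDER BY id
""").fetchall()
print(rows)  # [('A', 1, 'bolt'), ('B', 2, 'nut')]
```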
Step 5. Discover generalizations.
* Analyze large foreign-key groups, particularly those with 5, 10, or more cross-related attributes. A primary key that is entirely composed of a foreign key of another table may indicate generalization. Derived identity is symptomatic of an implementation of generalization with distinct superclass and subclass tables or propagation of identity via a one-to-one association. Data analysis can increase confidence in the discovery of generalization by revealing subsets of records.
* Look for patterns of many replicated attributes. A generalization may have been implemented by pushing superclass attributes down to each subclass.
* Look for patterns of data where a class has mutually exclusive subsets of attributes. This may indicate an implementation of generalization where subclass attributes were pushed up to the superclass.
* When discovering generalizations one must not forget there may be a forest of generalizations with multiple superclass roots and intermediate levels. Data analysis can help distinguish multiple, disjoint, and overlapping inheritance. As is true throughout reverse engineering, data analysis only yields hypotheses, and semantic understanding is required to reach firm conclusions.
Step 6. Discover associations.
* Convert a tentative class to an association when a candidate key is a concatenation of two or more foreign keys.
* Introduce a qualified association when a candidate key combines a foreign key with nonforeign-key attributes. This will find some, but not all, qualifiers. (See Figure 1.)
* The remaining associations are buried and manifest as foreign keys.
* Note minimum multiplicity for associations. Optional multiplicity is the permissive case; a lower limit of one (or another number) is more restrictive. One can never prove the restrictive case with data, but can sometimes prove the permissive case.
* Note maximum multiplicity for associations. Many multiplicity is the permissive case; an upper limit of one (or another number) is more restrictive. One can never prove the restrictive case with data, but can sometimes prove the permissive case.
* Apply semantic understanding and restate some associations as aggregations. Aggregation is the a-part-of relationship.
Step 7. Perform transformation. Various optimizations may have been employed in preparing the original RDB schema. Optimizations are often motivated by current RDB limitations and the desire to improve time and/or space performance. Some transformations are listed here.
* Convert a class to a link class as needed. A link class is an association whose links can participate in associations with other classes. An association has derived, rather than intrinsic, identity.
* Lightweight one-to-one associations should be more simply represented as an attribute. For example, it is unnecessary to represent city as a class, when city-name is the only attribute of interest.
* Nonatomic n-ary associations should be decomposed into their constituent associations of lesser order. Our experience is that most associations are binary. We have encountered an occasional bona fide ternary association, but never an association of higher order.
* Consider shifting associations via transitive closure. For example, associations from A to B and B to C could possibly be restated as associations from A to B and A to C. In general, multiplicity constrains derivation of associations, but the vague multiplicity limits often obtained through reverse engineering allow more latitude.
* Double-buried associations should be merged into a single association. For example, an association between A and B may have been buried in both the A and B classes.
* You may need to insert an intermediate class in a generalization hierarchy to recognize common semantics, attributes, and associations.
* Transitive closure also arises through the combination of generalization and association. This is because association and generalization are largely indistinguishable until semantic interpretation is applied. Where possible, one should eliminate an imprecise association to a superclass in favor of a more restrictive association to a subclass. Figure 8 is similar to design models that are obtained; before applying semantic knowledge we cannot determine the referent in the hierarchy for a buried foreign key in X. If our semantic knowledge is that X only associates with B and never with C, then we can eliminate the association between X and A and the association between X and C.
* Similarly, we can eliminate associations to subclasses by recognizing patterns of commonality. In Figure 8, if all instances of B partition across classes D and E, we can eliminate the association between X and D and the association between X and E.
Hogshead-Davis and Arora [3] propose a systematic approach for reverse engineering of RDB schemas into entity-relationship (ER) models. Their goal is an invertible transformation from RDB schema to ER schema. By invertible, we mean that reverse engineering of an RDB schema followed by forward engineering of the resulting ER schema will yield the original RDB schema. Their approach has several limitations. They ignore inheritance. The input RDB schema must be in third-normal form (3NF). There cannot be any homonyms or synonyms. In essence they require prior knowledge of foreign-key relationships.
Navathe and Awong [6] propose a more powerful approach than that of Hogshead-Davis and Arora, one that considers inheritance. They also require semantic preprocessing to convert to third-normal form and eliminate homonyms and synonyms. Consequently they are vulnerable to ambiguities in recognizing foreign keys. They relax the requirement of invertibility by permitting designers to inject semantic content into the reverse-engineering process. Their rationale is that the ultimate purpose of reverse engineering is to discover intent and not just change syntactic form. Their approach has the drawback of requiring much early semantic input by the designer. For example, before using their technique the designer must choose a primary key for each table from the candidate key possibilities.
Markowitz et al. [4,5] present a thorough and theoretically sound treatment of the mathematical basis for reverse-engineering RDB schema to ER schema. However, their work does not address the cascading optimizations and poor design that occur in practice.
Our approach differs from the literature in that we address non-3NF RDBs and the heavily optimized and/or flawed schemas that are often encountered in practice. The databases may be large, and semantic advice may not be readily available. Past RDBs complicate reverse engineering by not enforcing foreign keys. We exploit information obtained from multiple sources: semantic understanding, analysis of schema, and analysis of data. We consider it impractical to adopt a rigid process as embodied in a compiler. Instead we favor a toolkit approach, where appropriate functions can be combined with semantic understanding to solve a problem.
Based on our observations, we have reached the following conclusions:
* Database designers, even the experts, occasionally violate rules of good database design and often employ unusual constructs. In some cases, it is impossible to produce a complete, accurate model of the database, because one never existed.
* A flexible, interactive approach to reverse engineering of databases is more likely to succeed. Batch-oriented compilers will fail for most existing databases. The designer needs a suite of flexible, loosely coupled tools.
* Determination of candidate keys and foreign keys is an important step in reverse engineering. Finding candidate keys is easy. Finding foreign keys is difficult.
* Databases are typically designed using a consistent strategy, including consistent violations of good design practice. It is usually possible to look at the schema as a whole and deduce the underlying strategy. Reengineering tools should allow the user to specify assumptions to guide the reverse-engineering process.
* A purely mechanical approach to reverse engineering of databases does not consider the transformations designers often apply in moving from design to implementation. Furthermore, the mapping between models and schemas is many-to-many. Transformations are useful for exploring the space of equivalent (equivalent given the context of semantic insight) models and approaching the original intent of the database designer.
[1.] Blaha, M., and Premerlani, W. Object-Oriented Modeling and Design for Database Applications. Prentice-Hall, Englewood Cliffs, N.J. To be published.
[2.] Blaha, M.R., Premerlani, W.J., Bender, A.R., Salemme, R.M., Kornfein, M.M., and Harkins, C.K. Bill-of-material configuration generation. In Proceedings of the Sixth International Conference on Data Engineering. IEEE Computer Society, Los Alamitos, Calif., 1990.
[3.] Hogshead-Davis, K. and Arora, A.K. Converting a relational database model into an entity-relationship model. In Proceedings of the Sixth International Conference on Entity-Relationship Approach. (1987).
[4.] Markowitz, V.M. and Makowsky, J.A. Identifying extended entity-relationship object structures in relational schemas. IEEE Trans. Softw. Eng. 16, 8 (Aug. 1990), 777-790.
[5.] Markowitz, V.M. and Shoshani, A. On the correctness of representing extended entity-relationship structures in the relational model. In SIGMOD '89. ACM, New York, 1989.
[6.] Navathe, S.B. and Awong, A.M. Abstracting relational and hierarchical data with a semantic data model. In Proceedings of the Sixth International Conference on Entity-Relationship Approach. (1987).
[7.] Rumbaugh, J., Blaha, M., Premerlani, W., Eddy, F., and Lorensen, W. Object-Oriented Modeling and Design. Prentice-Hall, Englewood Cliffs, N.J., 1991.
Excerpt from a paper presented at the May 1993 ACM/IEEE Computer Society Working Conference on Reverse Engineering. Authors: William J. Premerlani and Michael R. Blaha. Published in Communications of the ACM, May 1, 1994.