Deriving functional dependencies from the entity-relationship model.
* Requirements collection: In this stage, users are interviewed to obtain descriptions of their information needs. This results in a preliminary specification of needs of all user groups in an organization.
* Conceptual design: A conceptual model of the application is developed using the data collected in the first stage. This model is independent of the database management system to be employed.
* Logical design: The conceptual model is translated into a specific model such as the relational, network or hierarchical model. For example, if a relational database management system (DBMS) is to be used, the relations to be implemented are specified.
* Physical design: The logical schema from the logical design step is mapped into a storage representation for a DBMS. The appropriate storage structures, indexes, and physical parameters to optimize performance are selected.
In this article we concentrate on a specific part of this design cycle, namely the interface between the conceptual and logical design stages for the relational model [ILLUSTRATION FOR FIGURE 1 OMITTED]. Specifically, this article describes an expert system called FDEXPERT that derives functional dependencies from a conceptual model of the database in the form of an Entity-Relationship (ER) model .
The ER model was chosen because it is one of the most well-known and frequently used models  for analysis and design of the conceptual model of a database. The concept of Normal Forms has provided a sound basis for relational database design . Identifying associations among attributes is an essential precursor to designing relations. These logical associations, known as functional dependencies (FDs) can then be used in two different ways to yield an efficient and effective design: decomposition and synthesis.
In the synthesis approach, pioneered by Bernstein  relations are directly created from a set of FDs after determining a minimal cover. This approach has been enhanced by several authors [18, 24, 25]. In general this approach works well for functional dependencies, but is not very suitable for multivalued dependencies (MVD). In the decomposition approach, a set of relations that are not in 3NF  are decomposed into 3NF relations by using a set of FDs [12, 15].
Regardless of the approach, the core of the design problem lies in determining a set of FDs. The design of the relations using either of the previously mentioned approaches requires a complete set of functional dependencies. The responsibility for specifying these FDs is that of the database designer. In most of the research on relational design, it is assumed that FDs are determined before the logical design stage. However, it is well-known that derivation of FDs is not a simple process - it requires considerable experience on the part of designers, who use their design expertise and a knowledge of the application to derive a set of FDs. Deriving FDs can be a time-consuming and error-prone exercise. Further, it is not possible to prove that the set of FDs is complete since there is no well-known algorithm that can be applied to derive the FDs. For large applications that deal with hundreds of data items, deriving FDs is no small problem. The question that is often asked is: How are functional dependencies derived or inferred from a description of the application? We have attempted to provide a solution to this question in this article.
The ER model used in this article is based on the original model defined by Chen . An entity is a concrete or abstract object of an existing system. Relationships can be defined to be of three types: one-one (1-1), one-many (1-m), or many-many (m-n). Ternary relationships (such as 1-1-1 or one-one-one) can also be represented. These relationships relate three entities together. The isa relationship is a special type of relationship that is also included in our definition. We also support notions of weak entities. Figures 2 and 3 show diagrammatic examples of ER models for two different applications. Figure 2 corresponds to a university database application while Figure 3 illustrates the ER model for a department store. Standard symbols have been used to represent entities, relationships, and attributes. The keys of each entity are underlined.
An FD represented as: X [right arrow] Y defines a logical association between two sets of attributes X and Y . Functional dependencies act as structural descriptors as well as integrity constraints. The example FD indicates that no more than one value of Y corresponds to one value of X. The attributes on the left-hand side (LHS) of an FD are known as determinants.
Automated Tools for Relational Database Design
Some of the prominent research on tools for database design is briefly reviewed in this section - for a detailed survey of several other automated tools for database design, see [7, 20]. Several automated tools that support the different stages of database design have been developed under the DATAID project [1, 2, 8]. INCOD and GINCOD are two tools that can be used for conceptual design of database applications using the ER model [1, 4]. GDOC is a tool that starts with the ER model and translates it into the relational model . REDI (Relational Design Tool) is a tool that aids in logical and physical design of databases .
The logical design of the relational database is performed by accepting functional dependencies that are user-defined. SECSI, a tool developed at INRIA in France supports requirements collection in natural language followed by logical and physical design of relational database applications . The tool has some built-in heuristics that are able to isolate the functional dependencies from a semantic description of the application using instances of data. These dependencies are then used for normalizing the relational schema to 3NF. A system to derive functional dependencies from forms has also been proposed . Computer Corporation of America has developed a workbench of tools that aid in relational database design using the ER model . Recently, Dogac et al. have described a two-module system that can be used to support the process of database design. The first module assists in generating a methodology for database design, while the second module uses the methodology to generate the database design . Use of an object-oriented methodology for relational database design has been described by Blaha et al. .
While a large number of tools have been developed to support the database design process, few deal exclusively with generating functional dependencies automatically from the conceptual model of a database. Most of the tools that do make use of FDs for relational design require the designer to manually derive FDs. In certain cases such as SECSI, FDs are derived from instances of data. However, our contention is that it is absolutely essential to shift the burden of specifying the functional dependencies from the user to an automated tool. Further, most FDs can be derived directly from the conceptual model rather than from specific instances of data.
Heuristics for Inferring Functional Dependencies
FDEXPERT requires the user to input a description of requirements in the form of an ER model. The system then uses a set of rules (heuristics) and a fact base consisting of semantic knowledge about specific applications. The heuristics captured in the system are based partly on rules described by Teorey et al. . Many of the heuristics were also obtained by interviewing expert database designers.
The system first prompts the user to describe the application in one word or phrase. This word or phase is used to associate with a semantic fact base consisting of a set of tables representing semantic knowledge about the particular type of application. For instance, if the current application deals with a hospital database the user may enter: Hospitals or Medical-database to describe the application. The system uses this input as a keyword to open the appropriate tables containing information about a hospital database. This implies that these tables comprising the list of facts known to FDEXPERT are built in advance. They contain keywords and names of typical entities, attributes, keys, and relationships that are found in a hospital database. The tables also contain information on possible synonyms for attributes. Other application-specific information in the fact base consists of typical entity class names for the application. For instance, in a fact base for a hospital, typical entity classes could be Physicians, Doctors, Nurses, and Patients. In addition, the fact base contains information on typical specializations/generalizations of entity classes and a list of typical FDs to be found for each application. Note that this information is constantly updated as more users utilize the tool for their database design. If there is no information available for a particular application, a set of tables is created using the current application. These tables are then made available for subsequent consultations.
FDEXPERT prompts the user to enter the names of entities, their characteristics and relationships. It checks all the entities and relationships to make sure they have unique names. The system also detects the presence of synonyms and homonyms and clarifies them with the user. If the user enters more than one relationship between the same two entities, the system checks with the user to determine if the user is in fact referring to the same relationship by two different names. It may be that the application has two different relationships between the same two entities: An example would be: Faculty-members teach students (m-m) and faculty-members advise students (1-m).
Often, a user may forget to specify an ISA relationship between entity classes. In such cases, the semantic fact base is used to determine if such a relationship can exist among the various classes already defined by the designer. Examples of such classes are given and the user is asked to verify them. Similarly, the type of relationship specified by the user (such as 1-1, 1-m) is verified using the fact base. If there is a divergence between the cardinality specified by the user and that contained in the fact base, the user is prompted to verify this. In the case of an m-m relationship, information in the fact base is used to suggest possible ways to further describe the relationship using a third entity class. For example, in a university application, the m-m relationship between STUDENT and COURSE can be further described by introducing a third entity class called TRANSCRIPT. Finally as stated earlier, once all the FDs are generated, they are verified by looking at examples in the fact base. If there is a discrepancy between the generated FDs and the examples, the user is asked to verify them. Thus the semantic fact base is used at every step of the process described here.
Once the ER model is defined, the ruins for deriving FDs are invoked. These rules are classified into four categories [ILLUSTRATION FOR FIGURE 4 OMITTED]. The first category consists of rules to identify potential determinants. These are used to generate intra-entity FDs, which are FDs that relate attributes of one entity. Category two consists of rules that use binary relationships to derive inter-entity FDs. Rules in category three also generate inter-entity FDs, but use ternary relationships. Finally, rules in category four use instances of data to suggest FDs for confirmation by the user.
Category I: Heuristics using Keyword Analysis
This category consists of rules that use keyword analysis to identify potential determinants.
The first rule is based on the name given to attributes. If an attribute has a suffix or prefix of ID, number, no, #, etc., it is marked as a potential determinant. Combinations of attributes such as name and address are also identified as potential determinants. Once an attribute is identified as a potential determinant, the user is consulted to check if that attribute has a unique value. For instance, if student-id is the attribute in question, the user is consulted to determine if the attribute will only accept unique values and if it can take on null values. This helps to identify attributes that can be used as determinants. Attributes of an entity that are not of this type are marked as the right-hand side (RHS) attributes of FDs, with the LHS consisting of the attribute(s) identified as determinants.
Certain attributes may have a mathematical relationship with other attributes. The attributes that fall in this category may be identified by key words such as maximum, minimum, total, sum, average, etc. For example, in a university application, an attribute such as maximum salary of an administrative employee may be determined by the level and years of experience of the employee. In other cases, Total-qty, Average-price, etc., are attributes that can usually be derived from other attributes of an entity. In such cases, the system asks the user to enter the formula for calculating the attribute value and uses the attributes in the formula as the determinants for the calculated attribute. Note that an FD is generated only if the left-hand side of the formula indicates other attributes of the same entity class or other entity classes. Similarly in an application dealing with suppliers and parts, it is possible that an attribute such as total-discount for an order may be determined by the attribute quantity-ordered.
Category II: Heuristics Using Binary Relationships
The rules in this category use the type of relationship defined between entities to determine FDs. As stated earlier, relationships may be of several types: 1-1, 1-m, m-n, or isa. The rules of this category are invoked after the rules of category I are exhausted on all the attributes.
Inferences from ISA relationships: If entity1 ISA entity2 then:
key of entity2 [right arrow] all attributes of entity1
If an isa relationship is defined between two entities such as: ENTITY1 isa ENTITY2, then all attributes of entity2 are inherited by entity1. However, there may be several specific attributes describing entity1 further. For example, in the department store ER model, there is a relationship between MANAGER and EMPLOYEE: MANAGER isa EMPLOYEE [ILLUSTRATION FOR FIGURE 3 OMITTED]. Attributes for EMPLOYEE are ename, address, and salary. The entity type MANAGER inherits all the attributes of the entity type EMPLOYEE in addition to having its own attributes such as date-of-appointment. The inherited key from EMPLOYEE viz. ename can be used to infer that:
Ename [right arrow] Date-of-appointment, manager-id
Inferences from 1-1 relationships: If there is 1-1 relationship between two entities, the following FDs can be inferred:
Key of entity1 [right arrow] key of entity2 Key of entity2 [right arrow] key of entity1
For instance, consider the relationship manages between DEPARTMENT and MANAGER.
Dname [right arrow] manager-id Manager-id [right arrow] dname
Inferences from 1-m relationships: If entity1 is related to entity2 using a 1-m relationship, then:
key of entity2 [right arrow] key of entity1.
For instance, consider the relationship works-for between DEPARTMENT and EMPLOYEE:
Ename [right arrow] Dname.
Inferences from m-m relationships: Relationships of the type m-m can be used to infer FDs. The system considers a relationship between entity1 and entity2 of the type m-m and conducts a dialog with the user to determine if the m-m relationship can be refined by defining a third entity type with several attributes. This concept is similar to the decomposition process used in defining a CODASYL schema . In some cases an intersection record that helps to decompose an m-m relationship into several 1-m relationships may also be established through the dialog. For instance, consider the relationship enrollment between STUDENT and COURSE. The system asks the user to enter details describing the relationship enrollment to check if there are any attributes that can be used to define the relationship further. The user may enter information for a new entity called TRANSCRIPT, with attributes such as Student-id, Course-id, semester, grade, and number-of-credits. Two of the attributes that define the new entity type are actually keys for the entities STUDENT and COURSE. Further, TRANSCRIPT is a weak entity type [ILLUSTRATION FOR FIGURE 2 OMITTED]. Hence, we use the following rule:
Keys of strong entities [right arrow] attributes of weak entity
Since TRANSCRIPT is related to both the strong entities STUDENT and COURSE, it is inferred that
Student-id, Course-id [right arrow] grade, semester, number-of-credits.
Category III: Heuristics using Ternary Relationships
Relationships among three or more entities can be used to derive FDs of the following form [ILLUSTRATION FOR FIGURES 2 AND 3 OMITTED]:
Inferences from 1 to 1 to 1 relationships: If there is a 1 to 1 relationship between entity1, entity2 and entity3 then:
Key of entity1, Key of entity2 [right arrow] key of entity3 Key of entity2, Key of entity3 [right arrow] key of entity1 Key of entity1, Key of entity3 [right arrow] key of entity2
For instance consider the relationship uses between INSTRUCTOR, CASEBOOK and COURSE. This relationship expresses the fact that an instructor uses one casebook for a given course. Different instructors may use different casebooks for the same course. No instructor may use the same casebook for different courses, but different instructors may use the same casebook for different courses. From this relationship the following FDs can be derived:
Instructor-id, Book-id [right arrow] Course-id Book-id, Course-id [right arrow] Instructor-id Instructor-id, Course-id [right arrow] Book-id
Inferences from 1 to 1 to many relationships: If entity1, entity2, and entity3 are related in a 1 to 1 to m relationship then the following FDs can be inferred:
Key of entity1, Key of entity3 [right arrow] key of entity2 Key of entity2, Key of entity3 [right arrow] key of entity1
For instance, consider the relationship supervises between INSTRUCTOR, PROJECT, and STUDENT: This relationship expresses the fact that students work on projects under the supervision of instructors. No instructor can supervise any given student on more than one project. No student can work on a given project under the supervision of more than one instructor.
Instructor-id, Student-id [right arrow] Project-id Project-id, Student-id [right arrow] Instructor-id
Inferences from 1 to many to many relationships: If entity1, entity2 and entity3 are related in a 1 to many to many relationship then the following FD can be inferred:
Key of entity2, Key of entity3 [right arrow] Key of Entity1
For instance, consider the relationship assigned-to between PROJECT, EMPLOYEE, and LOCATION. This relationship expresses the fact that an employee can be assigned to at most one project at a given location.
Employee-id, Location-id [right arrow] Project-name
After processing all the attributes using rules in categories I, II and III, we may still be left with some attributes that do not take part in any FDs at all. In such cases, the system will prompt the user to enter sample data for each entity and the system will postulate FDs using this data. In general, FDs reflect intension rather than extension, hence they cannot be definitively derived from data. Hence the system suggests FDs to be confirmed by the human designer. This is similar to the concept used in SECSI . The sample data can also be used to confirm some of the FDs that were determined using the previous rules. For instance, the following data (see Table 1) can be used to infer or verify that
Course-id, Student-id [right arrow] Grade
Architecture and Implementation of FDEXPERT
Figure 5 illustrates the components of the FDEXPERT tool that has been implemented incorporating the heuristics described in the previous section. The tool has been implemented in Turbo Pascal to run on an IBM PC or equivalent system.
The user interacts with FDEXPERT using an intelligent dialog. The tool uses a combination of menus and question-and-answer dialog to direct the user. The dialog manager uses a sophisticated parser to detect synonyms, homonyms and other types of errors during the definition process. The dialog thus assists in capturing the entities, attributes and relationships. The inference control module invokes the rules for inferring FDs. These rules are contained in the rule base. As explained earlier, the inferencing process also uses a fact base consisting of semantic knowledge about the application such as names of attributes that are potential determinants for specific applications. Other information in the fact base includes typical entity classes, relationships and their cardinalities, generalization/specializations of entity classes, synonyms for attributes and entity class names, and typical FDs. All this information is application-specific.
With every consultation using the tool, the semantic fact base is enhanced. The semantic fact base can be accessed independently by expert database designers. These users can fill the fact base with descriptions of the most commonly occurring application areas. For example, designers familiar with a university database can fill the fact base with names and descriptions of entities, relationships, and attributes that can be found in a university database. They can also store the common keywords that could be identified as potential determinants. The fact base is refined over time as information from several consultations for different applications is stored in it.
Table 1. Sample Data for Inferring FDs Course-id Student-id Grade MIS 531B 111 A MIS 507A 111 C MIS 531B 254 B MIS 696G 254 A
During the inferencing process, the system interacts with the user for extra information or clarifications as described in the previous section. Thus FDEXPERT replicates a human database designer as closely as possible. Included in the system is an Explain facility that can be used to provide explanations to the user during the inferencing process. The output from the system consists of a set of FDs as well as the ER model of the application and a data dictionary.
Sample Design Session
A sample session using FDEXPERT to design a database schema and generate FDs is described here. Screen 1 is the introductory screen in the system. To begin, the user chooses option 1 to define a schema using the ER model. Screen 2 asks the user to describe the application using some keywords. In our example, we define the schema for a university database. Hence the user enters the keyword University. Using this keyword, the system checks its fact base to see if there are any existing ER models for a university application. In this case, it finds one such example. It then checks with the user to see if the user would like to use the fact base. The user then proceeds to define the entity classes, relationships and attributes using screen 3, 4, 5, 6, and 7. Screen 6 shows an example of how the fact base is used to refine the definition provided by the user. For instance, the user defined a m-m relationship called "Supervises" between INSTRUCTOR and STUDENT. The system checks its fact base and asks whether the user would like to refine this relationship by adding Projects to the relationship. The user considers this suggestion and adds Projects to the relationship, thus creating a ternary relationship between INSTRUCTOR, STUDENT and PROJECT as suggested by the system. Similarly, the system can suggest the addition of attributes to entity classes about which it has information in its fact base. Once the schema definition is complete, the FDs are generated using the options shown on Screens 8, 9 and 10. The Explain option on Screens 9 and 10 can be used by the user to understand how a certain FD was generated. Screen 11 illustrates the use of the Explain option. The user wanted to know how the following FD was generated:
Instructor-id, Student-id [right arrow] Project-id
The system explains that this FD was generated because of the ternary (1-m-m) relationship specified between INSTRUCTOR, STUDENT and PROJECT.
Design Experience with FDEXPERT
FDEXPERT has been used in the design of eight real-life database design projects. The number of FDs inferred for these applications ranged from 35 to 210. These applications included a university, a hospital, a manufacturing organization, an automobile dealership, a library, and several small businesses. In many cases, FDEXPERT was able to simplify and accelerate the process of deriving functional dependencies. Several experiments has been conducted to compare the output of FDEXPERT with that of expert database designers. For small applications FDEXPERT has performed as well as the experts. For large applications (more than 200 attributes), FDEXPERT has exceeded the performance of experts. In fact, the tool has assisted the experts in the process of deriving FDs.
Use of the system in these design projects has resulted in very useful feedback for improving the system. The most important one is adding a graphical user interface. An important area of research is to extend the tool to be used in a group environment. Many of the designers who used the tool expressed interest in working with the system in a group rather than in a standalone fashion. Such an extension is not trivial, since it will require the tool to be implemented in a networked environment. More importantly, it will need to have some features and techniques for identifying and resolving design conflicts among the designers. This requires a definition of techniques for view integration that could be implemented in an intelligent system.
Another important area for further research is the application of case-based reasoning to the problem of defining functional dependencies. Case-based reasoning uses the process of solving a problem by retrieving the solution of a previously solved problem and transforming it into a solution that is suitable for the current problem. Past experience is used to solve new problems so that the process does not have to start from scratch when a familiar problem is encountered. In order to apply case-based reasoning to the problem of defining FDs, several issues need to be explored. An appropriate representation of cases needs to be developed along with a basis for selecting the ones that match the problem at hand. The matching process could be based on a similarity metric to compare different cases and select the most applicable one. Techniques for transforming the existing solution to that for the new problem need to be explored as well.
Currently, FDEXPERT also functions as a natural front-end to a relational database design tool called Synthesizer . Synthesizer accepts the defined functional dependencies, determines all the minimal covers for a given set of FDs, and creates relations in Fourth Normal Form.
The tool described in this article has made several contributions. It is an enhancement over current automated tools for database design because it aids in one of the most difficult phases of the design process, namely, definition of functional dependencies. The identified FDs can be used in relational design using either the decomposition or the synthesis approach. It is an expert system, because it uses knowledge that has been gleaned from several database design experts. The tool has been implemented and validated extensively. Currently we are enhancing the user interface for FDEXPERT by adding a graphical interface to the tool.
1. Albano, A., De Antonellis, V., and Di Leva, A., Eds. Computer-Aided Database Design: The DATAID Project. Elsevier Science Publishers, North-Holland, 1985.
2. Batini, C. and Ceri, S. Database design: Methodologies, tools, and environments. In Proceedings of ACM-SIGMOD, 1985.
3. Batini, C. and Lenzirini, M. INCOD: A system for interactive conceptual data base design. Proceedings of the IEEE 18th Design Automation Conference, 1981.
4. Bernstein, P.A. Synthesizing third normal form relations from functional dependencies. ACM Trans. Database Syst. 1, 4, (Dec. 1976).
5. Bjornerstedt, A. and Hulten, C. REDI: A Database Design Tool for the Relational Model. SYSLAB Report, 18, Department of Computer Sciences, University of Goteborg, 1983.
6. Blaha, M., Premerlani, W., and Rumbaugh, J. Relational database design using an object oriented methodology. Commun. ACM 31, 4 (Apr. 1988), 414-427.
7. Bouzeghoub, M. and Gardarin, G. The design of an expert system for database design in New Applications of Databases, Gardarin, G. and Gelenbe, E., Ed. Academic Press, London, 1984.
8. Ceri, S., Ed. Methodology and Tools for Database Design. North-Holland Publishing Company, 1985.
9. Chen, P.P.S. The entity relationship model - Towards a unified view of data. ACM Trans. Database Syst. 1, (1976).
10. Chilson, D. and Kudlac, M. Database design: A survey of logical and physical design techniques Database 15, (Fall 1983).
11. Choobineh, J. A form based expert systems approach for derivation of functional dependencies. In Proceedings of the 22nd Hawaii International Conference on System Sciences, Kona, HI, Jan. 1989.
12. Codd, E.F. A relational model of data for large shared data banks. Commun. ACM, 13, 6 (June 1970), 377-387.
13. Date, C.J. An Introduction to Database Systems. Vol. 1, Fourth Edition, Addison-Wesley, Reading, MA, 1986.
14. Dogac, A., B. Yuruten, and S. Spaccapietra. A generalized expert systems for database design. IEEE Trans. Softw. Eng. 15, 4 (Apr. 1989).
15. Fagin, R. Multivalued dependencies and a new normal form for relational databases. ACM Trans. Database Syst. 2, (Sept. 1977).
16. Ferrara, F.M., and C. Batini. GDOC: A tool for computerized design and documentation of database systems. Database (Summer 1984).
17. Navathe, S., Ceri, G. and Batini. C. Automating Database Design. Benjamin Cummings, 1991.
18. Ram, S. and Curran, S.M. The synthesis approach for relational database design. Infor. Sci. 49, (Oct. 1989a).
19. Ram, S. and Curran, S.M. An automated tool for relational database design. Infor. Syst. 14, 3 (1989b).
20. Ram, S. Automated Tools for Database Design: State of the Art. CMI Working Paper, Department of MIS, University of Arizona, 1994.
21. Reiner, D., et al. The database design and evaluation workbench (DDEW) project at CCA. Database Engineering, 7, 4 (Dec. 1984).
22. Teorey, T.J., D. Yang, and J.P. Fry. A logical design methodology for relational databases using the extended entity-relationship model. ACM Comput. Surv., 18, 2 (June 1986).
23. Ullman, J.T. Principles of Database and Knowledge Base Systems. Second Edition, Computer Science Press, Rockville, Md, 1988.
24. Zaniolo, C. and Melkanoff, M. On the design of relational database schemata. ACM Trans. Database Syst. 6, 1 (Mar. 1981).
25. Zaniolo, C. and Melkanoff, M. A formal approach to the definition and the design of conceptual schemata for database schemata. ACM Trans. Database Syst. 7, 1 (Mar. 1982).
SUDHA RAM is Associate Professor of Management Information Systems at the University of Arizona. Current research interests include semantic modeling, object-oriented modeling and design, data allocation in distributed database systems, schema and view integration, automated tools for database design, and interoperability among heterogeneous database systems. Author's Present Address: Department of Management Information Systems, University of Arizona, Tucson, AZ, 85721; email: firstname.lastname@example.org
|Printer friendly Cite/link Email Feedback|
|Publication:||Communications of the ACM|
|Date:||Sep 1, 1995|
|Previous Article:||Unraveling the semantics of conceptual schemas.|
|Next Article:||The relationship between theory and practice in software engineering.|