Salvaging Information Engineering Techniques In A Data Warehouse Environment

During the 1980s and early 90s, Information Engineering (IE) was in its prime. Most major corporations were using some form of system development methodology that could be tied back to IE. The first step in any IE project was the Information Strategy Plan, or ISP. The ISP would look at the data, process, organization, technology, and interactions of an enterprise. The ISP was top-down analysis at its best. Three key deliverables of an ISP were a data model, a functional decomposition, and an interaction (CRUD) matrix. The data model was an entity relationship diagram that encompassed the entire enterprise. The functional decomposition diagram would examine the business functions and decompose them to a process level. The CRUD matrix, which stands for CREATE, READ, UPDATE, DELETE, examined the interaction of data and process. These three deliverables provided a basis for top-down analysis.

The mid-90s saw the rise of data warehousing and its related disciplines. One aspect of data warehousing is the data mart.
A data mart is a single subject area data warehouse, usually developed to support a single business unit. The data warehouse market has realized that there must be a balancing act between the "build it and they will come" approach to data warehousing and the single subject legacy data marts, or "legamarts". Salvaging some information engineering techniques can provide the contextual models and top-down requirements needed to create architected data warehouses. Since many people in a leadership or architecture role in data warehousing are descendants of the information engineering age, they should be familiar with the CRUD matrix and its techniques.

This article will describe three techniques that can reuse existing information engineering work in a data warehouse project. First, the entity relationship diagram and its use in a three phase data model approach. Second, the functional decomposition diagram and its use in segmenting and defining key performance indicators and dimensions. Third, a modified CRUD matrix that deals with logical entities and current systems.

The Entity Relationship Diagram

The entity relationship diagram is the standard data technique for creating data models. The entity relationship diagram enables an analyst to create a graphical view of the data concepts of an organization and their relationships.
Traditional system development dictates creation of an entity relationship diagram that is converted to a database design for a relational database. In a data warehouse environment the traditional normalized Entity Relationship diagram cannot be easily translated into a database design. By nature a normalized Entity Relationship diagram tends to separate the data concepts into separate entities. A traditional approach to Entity Relationship modeling is concerned with three concepts: entities, relationships, and attributes.

Components Of The Entity Relationship Diagram

Entity. A data concept which has relevance to the enterprise. An entity can be a person, place, thing, or concept. Typically an entity consists of a single identifiable concept such as EMPLOYEE, STUDENT, CLASS, PURCHASE ORDER, or SHIPMENT. An entity can consist of subtypes. Subtypes are a decomposition of an entity into its various types. For example, an EMPLOYEE entity can be modeled with subtypes FULL-TIME and PART-TIME. Subtyping is necessary when clarity is required about the data (and to some extent, the behavior) of the supertype entity.

Relationship. A relationship is a description of the association that exists between two entities. Information about how the entities relate, in particular the optionality and cardinality of the relationship, is modeled. A relationship should only be modeled when the relationship has relevance.
If one desired, any entity could loosely be related to any other entity, but this is not the intention of modeling relationships. A special relationship, known as a recursive relationship, exists between an entity and itself, such as EMPLOYEE to EMPLOYEE related by a REPORTS TO relationship.

Attributes. Attributes are details about a specific entity. These details provide greater clarification about the data that can or will be captured regarding an entity. One must be careful not to confuse entities and attributes. Entities can exist without attributes, but attributes cannot exist without entities.

Data Modeling For A Data Warehouse

There has been significant work done on utilizing specialized data modeling techniques for data warehousing. In particular, the dimensional approach has been adopted to model data warehouses for a relational database. With a dimensional modeling approach, many of the traditional normalization techniques are not used. Instead, the model uses a mixed approach, with highly normalized portions and highly denormalized portions. The model is centered on two types of entities, facts and dimensions. Facts are entities that deal with measurements or indicators.
A fact entity for a sales organization could measure revenue per month, or units sold per day. A fact for a manufacturing organization could measure defects per lot per day or units produced per week.

Dimensions are entities that represent dimensional information about the facts. Dimensions are ways that the data can be sliced, viewed, or segmented. Dimensions typically represent an n-leveled hierarchy such as a product hierarchy or sales organization hierarchy. For example, in a sales organization the SALES TERRITORY can be a dimension which represents the sales territory, its district, the district's region, and the region's division. In a dimensional model, this is one entity. In a traditional Entity Relationship diagram, this would be four entities: TERRITORY, DISTRICT, REGION, and DIVISION. The same is true for a product dimension. The product dimension can represent the SKU, its brand, the brand's category, and the category's division. In a traditional Entity Relationship diagram, this would be represented as four entities: SKU, BRAND, CATEGORY, and DIVISION.

Shortcomings
Although the dimensional model is a useful approach for modeling the data needed to create a database for data warehouses and quick queries, there are shortcomings. One of the shortcomings is similar to the problems that traditional Entity Relationship modeling encountered: the data model is designed with an implementation in mind. The dimensional model typically focuses on facts and dimensions for the reporting and analysis needs of the system being designed. The broad brush approach of traditional Entity Relationship modeling is discounted or ignored. This leads to a myopic view of the data as represented in the dimensional Entity Relationship model.

The dimensional model also pushes business rules and representations to a lower level of abstraction. This can lead to overlooking or discounting business rules that would be much more apparent in a traditional Entity Relationship model. For example, suppose a sales organization represents a product in the following fashion. A multidivisional company sells some products which are part of a consumer brand, which in turn is part of a category. Some other products, though, do not have a consumer brand, since they are sold through a non-consumer channel. Take a pharmaceutical company that sells prescription and over-the-counter (OTC) drugs. The OTC products may have brands associated with the product, and the category may be OTC.
The pharmaceutical products may not have a name brand, but are associated with the category PHARM. In a dimensional model, the product would be compressed or denormalized into one table. The table would be encoded with attributes which represent the entities in the hierarchy.

The Role Of The Two Data Models

Based on the evaluation of each method, there are uses for each. Yet neither is complete enough on its own to represent both the logical and dimensional data needs. This drives the requirement for a multiphase data model approach. This approach loosely follows a Zachman model of a Conceptual Model and a Logical Model. The conceptual model is represented as a traditional Entity Relationship model and the logical model is represented as the dimensional Entity Relationship model.

The importance of these separate but related models is amplified in a data warehouse environment. In a traditional OLTP environment, normalization is the norm in the conceptual and logical models. There may be some differences, but generally the models are quite similar in number of entities. In a data warehouse environment, a conceptual model of 25 entities could yield a logical model of seven entities. The example above compressed the five entity structure of a product into one entity with many attributes.

This compression is a double edged sword. The concise logical model is much easier to convert to a database structure which is geared toward data warehousing. This logical model, though, hides the business rules in attributes and their optionality and cardinality.
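
As a rough illustration of how the compressed dimension hides a rule, the sketch below models the product hierarchy both ways in Python. The class and field names (`Sku`, `Brand`, `ProductDim`, `brand_name`, and so on) and the sample values are hypothetical, chosen only for this example. In the normalized form, the rule that a product may have no consumer brand is a visible optional relationship; in the denormalized row it survives only as a nullable attribute.

```python
from dataclasses import dataclass
from typing import Optional

# Conceptual (normalized) view: each level of the hierarchy is its own entity.
@dataclass(frozen=True)
class Division:
    name: str

@dataclass(frozen=True)
class Category:
    name: str
    division: Division

@dataclass(frozen=True)
class Brand:
    name: str
    category: Category

@dataclass(frozen=True)
class Sku:
    code: str
    category: Category        # every product belongs to a category
    brand: Optional[Brand]    # the brand relationship is optional

# Logical (dimensional) view: one denormalized row per product.
@dataclass(frozen=True)
class ProductDim:
    sku_code: str
    brand_name: Optional[str]  # the optional-brand rule is now just a nullable column
    category_name: str
    division_name: str

def flatten(sku: Sku) -> ProductDim:
    """Compress the product hierarchy into a single dimension row."""
    return ProductDim(
        sku_code=sku.code,
        brand_name=sku.brand.name if sku.brand else None,
        category_name=sku.category.name,
        division_name=sku.category.division.name,
    )

# Hypothetical sample data following the OTC / PHARM example.
health = Division("Health")
otc = Category("OTC", health)
pharm = Category("PHARM", health)

otc_row = flatten(Sku("SKU-1", otc, Brand("BrandA", otc)))
rx_row = flatten(Sku("SKU-2", pharm, None))

print(otc_row)
print(rx_row)
```

Nothing in `ProductDim` itself says why `brand_name` can be empty; that knowledge lives only in the normalized classes and in the values loaded into the row.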
It is only through the full examination of the attributes and their allowable values, optionality, and cardinality that the rules are uncovered. Even then, there is no simple graphical way to represent the rules short of representing them in the conceptual model.

This leads to the proposal that in a data warehouse environment, a traditional Entity Relationship model is a business modeling tool while a logical model is a technology tool. This in fact was the practice in traditional Information Engineering SDLC projects. The current problem is that much of the industry has shunned the traditional model as unnecessary. Instead, the industry and its methodologies propose starting with the dimensional model. It is true that system development starts with the dimensional model, but business data understanding is facilitated through the traditional model. For many dimensional modelers, this is not enough to justify creating a traditional model. There are other uses, though. First, the Entity Relationship model can be used as a contextual map of the data connecting the multiple dimensional modeling subject areas. Second, the Entity Relationship model can represent divergent hierarchies in the data. Third, the Entity Relationship model can feed the CRUD matrix, explained later, for use as a technical scoping and increment management tool.

The Functional Decomposition

Another major component of information engineering is the functional decomposition diagramming technique.
In this practice, an organization is modeled in terms of the functions for which it is responsible. This follows a strict top-down approach. The enterprise is modeled as one box, which is then decomposed to the next level. This continues until the decomposition reaches a process level. It is at this point that the diagram ceases being a functional decomposition and instead becomes a process decomposition.

It is important at this time to differentiate between a function and a process.

* A function is a set of business activities which does not have a finite start and end point. The function is generally an ongoing effort within a company.
* A process is a business activity which does have a finite start and end point. Processes are generally at a lower level of abstraction than a function.

Some examples are:

* Class Registration is a function and Register for a Class is a process.
* Accounts Payable is a function and Create a Check is a process.
* Flight Reservations is a function and Reserve a Seat is a process.

A functional decomposition diagram was useful in an information engineering environment. The functions were decomposed to processes. The processes were converted to program specifications. The program specifications were converted to program code. This was a logical progression.

If done properly, a functional decomposition diagram should represent the entire organization. This representation should categorize functional needs.
As data warehousing has shown, the design is user centric and is based on the needs of a particular set of users. For each function at the leaf (lowest) level of the functional decomposition diagram, an inquiry should be made as to the measurements of that function. This measurement, usually a fact in a data warehouse, should represent how the function analyzes or measures itself. One should guard against analyzing what data is captured by that function, and instead focus on the measurement, or fact. By organizing at a fact-to-function level, an organization gains a high level understanding of the measurements across the organization. This is done by function, rather than by organizational unit.

Suppose an organization has the functional decomposition (greatly simplified for discussion purposes) illustrated in Table 1. Upon interviewing the respective employees in the functional areas, it is determined that:

* Sales Management measures Revenue per Week per Sales Representative.
* Sales Forecasting measures Revenue per Month per Region.
* Production Scheduling measures Units Produced per Hour per Plant.
* Quality Assurance measures Defective Units Percentage per Lot per Plant.
By arranging the captured information in a matrix format where the functions are rows and the columns are the measurements, one for each measure and one for each dimension, a concise graphical representation of enterprise high-level reporting needs can be demonstrated (Table 2). Once this analysis is complete and the matrix is validated, it should be examined and analyzed for commonalities. These commonalities can be exploited in system design and scoping, and can also have an impact on organizational and business process refinement.

The CRUD Matrix

A CRUD matrix by nature is enterprise wide. It deals with both data and process, but more importantly with their interactions. It is in this interaction that the true relation to data warehousing disciplines becomes relevant. The CRUD matrix in its traditional ISP use compares functions to entities. For use with a data warehouse, the CRUD matrix should compare current systems to entities. By taking this more physical slant to the analysis, implementation and data redundancy issues are more easily identified.

Identification of Systems. A system should be a logical grouping of programs that support at least one business function. This could be as small as one program or could be an entire purchased system.
For example, an SAP ERP system may have Accounts Payable, Order Entry, and Accounts Receivable. The system should be represented at the Accounts Payable level, not the SAP level. This system inventory will most likely be part of the plan, since it is required prior to estimating or designing a back end effort.

Identification of Entities. The entities should be representative of data concepts which have meaning to the organization. For most organizations, this can be derived from an enterprise data model, as described above. Usually this model is at the level of CUSTOMER, ORDER, SHIPMENT, etc. Anything below this level is probably too detailed for this analysis.

By completing a CRUD matrix, data redundancy and synchronization issues are highlighted. First, standard CRUD analysis must be done to determine that the checks and balances are present (Table 3). Once the matrix has been determined to be complete and correct, a set of data warehouse related checks should be done (Table 4).

In the example (see Table 4), the Customer entity can be created by two systems. Let us assume that the Customer Information System stores Customers in a file, and the Order Entry System uses the Customer Master Table.
Since the two systems use different physical resources for the data, a yellow flag must be raised. The data must be examined on a field by field basis. The metadata associated with each field must be gathered. If the data sources are similar or the same, a combination strategy must be defined. The matrix does not define this combination; instead, the matrix can be used as an estimating and impact analysis tool. The number of creates/updates that utilize separate data sources assists in determining the amount of effort required when performing the cleansing and loading of the data warehouse. One must be cautious about drawing a direct relationship between data sources and effort. It is possible that the combination of two data sources may require two units of effort, whereas the combination of three data sources may require four or more units of effort. With each added data source comes the increased risk that the fields will be incongruent.

Let us also assume that the Sales Tracking and the Billing Systems read from the Customer Master Table and create a local copy. In this case, one must examine this duplication or replication strategy. Some key issues are:

* What is the timing of the replication (daily, hourly, monthly)?
* Is the timing the same for both the Sales Tracking and Billing Systems?
* Is there any filtering or aggregation of the replicated data?
* What is the physical mechanism for replication?
By following a set of questions similar to this, one can determine if this is truly a replication and what impact could be made by combining the reporting capabilities of these "downstream" systems with the envisioned data warehouse.

Information Engineering by nature is a data centric methodology. The same is true for current data warehouse methods and projects. Some of the techniques from information engineering can be reused and modified for a pragmatic approach to data warehousing. Three of these techniques are Entity Relationship Diagramming, Functional Decomposition, and Interaction (CRUD) Analysis. These three techniques can be used separately, but are best used in concert.

In Table 5, each of the information engineering techniques plays a role in building the star schema. The Entity Relationship Diagram identifies the enterprise level data interactions. It also identifies commonalities and shared data. The Functional Decomposition identifies the functional data needs. One or more of these needs must then become the subject of the iteration of the data warehouse being designed. The CRUD matrix identifies the system limitations and redundancies required to source the data. Based on these three prerequisite tasks, a star schema design can concentrate on the functional information needs without losing sight of the contextual data view. The schema, and subsequent design, can also be scoped to a manageable level of technical complexity based on the star schema.

Anthony L. Politano is CEO of MIS AG's U.S. operations (Short Hills, NJ).

Table 2

Function               Measurement           Time   Dimensions
Sales Management       Revenue               Week   Sales Representative
Sales Forecasting      Revenue               Month  Region
Production Scheduling  Units                 Hour   Plant
Quality Assurance      Defective Percentage  N/A    Lot, Plant
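
The search for commonalities in Table 2 can be sketched in a few lines of Python. The data comes from the table; the dictionary layout and the helper name `shared` are assumptions made for this illustration, not part of any methodology described here.

```python
from collections import defaultdict

# Rows of Table 2: function -> (measurement, time grain, other dimensions).
MATRIX = {
    "Sales Management": ("Revenue", "Week", ["Sales Representative"]),
    "Sales Forecasting": ("Revenue", "Month", ["Region"]),
    "Production Scheduling": ("Units", "Hour", ["Plant"]),
    "Quality Assurance": ("Defective Percentage", None, ["Lot", "Plant"]),
}

def shared(matrix, key):
    """Group functions by a value and keep only values used by 2+ functions."""
    groups = defaultdict(list)
    for function, row in matrix.items():
        for value in key(row):
            groups[value].append(function)
    return {value: fns for value, fns in groups.items() if len(fns) > 1}

# Measurements used by more than one function.
shared_measures = shared(MATRIX, lambda row: [row[0]])
# Non-time dimensions used by more than one function.
shared_dimensions = shared(MATRIX, lambda row: row[2])

print(shared_measures)
print(shared_dimensions)
```

On this small matrix, Revenue is shared by the two sales functions and Plant is shared by the two manufacturing functions, which is the kind of overlap that can guide scoping.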

Table 3

Check                        Action

Is there at least one        If not, one must question how the data is
Create?                      being populated.

Is there at least one        If not, one must determine if the business
Delete?                      really does require this.

Does each system have        If not, this could indicate that the system's
interaction with at least    decomposition is too granular. Or, it may
one entity?                  indicate that the analysis is not complete.
                             Or it may indicate that this system is just
                             a big calculator.

Does each entity interact    If not, this could indicate that not all
with at least one system?    systems are listed. Or, it could indicate
                             that the data may be manually stored, such
                             as in a file cabinet or on index cards.
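
The four checks in Table 3 are mechanical enough to sketch in code. The following Python is a minimal illustration under stated assumptions: the matrix is held as a nested dictionary, and the function name `crud_checks` and the sample systems and entities are hypothetical.

```python
# A CRUD matrix as {system: {entity: letters}}; a missing cell means no interaction.
def crud_checks(matrix, systems, entities):
    """Return a list of findings for the four checks in Table 3."""
    findings = []
    for entity in entities:
        letters = "".join(matrix.get(s, {}).get(entity, "") for s in systems)
        if not letters:
            findings.append(f"{entity}: no system interacts with it")
            continue
        if "C" not in letters:
            findings.append(f"{entity}: no Create")
        if "D" not in letters:
            findings.append(f"{entity}: no Delete")
    for system in systems:
        if not any(matrix.get(system, {}).values()):
            findings.append(f"{system}: interacts with no entity")
    return findings

# Hypothetical sample matrix for illustration only.
systems = ["Order Entry", "Billing", "Tax Calculator"]
entities = ["CUSTOMER", "ORDER", "SHIPMENT"]
matrix = {
    "Order Entry": {"CUSTOMER": "CRU", "ORDER": "CRUD"},
    "Billing": {"CUSTOMER": "R", "ORDER": "R"},
}

findings = crud_checks(matrix, systems, entities)
for finding in findings:
    print(finding)
```

Each finding is only a prompt for the questions in the Action column; the code cannot decide, for example, whether a missing Delete is a gap in the analysis or a genuine business rule.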

Table 4

Area                 Explanation

Multiple creates     If the complete matrix shows that more than one
                     system can create the data, this may be a data
                     quality problem.

Creates related      Most entities will have at least one C and more
to Reads             than one R.

Clustering           By performing an affinity analysis on the rows
                     and columns, groupings of data and process can
                     be found.

Area                 Actions

Multiple creates     The first step is to determine if both the systems
                     are accessing the same physical file(s) or
                     table(s). If they are not, this must be further
                     analyzed to determine the business rules and
                     prioritization rules. If they are accessing the
                     same physical file, the purpose, method and user
                     community must be determined. In particular the
                     aspect of timeliness and consistency must be
                     examined.

Creates related      If there is more than one R, each R must be
to Reads             analyzed to determine: Is this the same file? If
                     not, what is the replication reason and method,
                     and what is the synchronization method? If it is
                     the same file, what is the business usage?

Clustering           Based on the affinity analysis, physical
                     increments of the data warehouse can be defined.
                     A small set of data sources and systems can be
                     outlined to be the increment from a technical
                     perspective. Obviously, this must be balanced
                     with the actual data needs outlined across the
                     organization as defined by a technique such as
                     User Community Segmentation or Measure to
                     Dimension Matrix.

Example: Multiple Creates and Reads

                             Customer  Order  Invoice  Sales Person  Check
Customer Information System  CRUD      R      R        R
Order Entry System           CRU       CRUD   R        R
Sales Tracking               R         R      R        R
Billing System               R         R      R
Check Writer                                                         CRUD
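
The data warehouse checks in Table 4 can also be sketched in Python. This is an illustration under stated assumptions, not a prescribed method. The Customer cells follow the example discussed in the text; the Order cells are assumed for the sake of the sketch. The function names are hypothetical, and the Jaccard overlap used for `affinity` is only one possible way to perform an affinity analysis, since the article does not name a specific measure.

```python
# Hypothetical CRUD matrix: {system: {entity: letters}}.
# Customer cells follow the text; Order cells are assumptions.
MATRIX = {
    "Customer Information System": {"Customer": "CRUD"},
    "Order Entry System": {"Customer": "CRU", "Order": "CRUD"},
    "Sales Tracking": {"Customer": "R", "Order": "R"},
    "Billing System": {"Customer": "R", "Order": "R"},
}

def systems_with(matrix, entity, letter):
    """Systems whose cell for this entity contains the given CRUD letter."""
    return [s for s, row in matrix.items() if letter in row.get(entity, "")]

def warehouse_checks(matrix):
    """Flag multiple creates and list read-only systems for each entity."""
    entities = sorted({e for row in matrix.values() for e in row})
    report = {}
    for entity in entities:
        creators = systems_with(matrix, entity, "C")
        # Read-only systems are candidates for local copies or replication.
        readers = [s for s in systems_with(matrix, entity, "R")
                   if s not in creators]
        report[entity] = {
            "multiple_creates": len(creators) > 1,
            "creators": creators,
            "readers": readers,
        }
    return report

def affinity(matrix, a, b):
    """Jaccard overlap of the entities two systems touch (assumed measure)."""
    ea, eb = set(matrix[a]), set(matrix[b])
    return len(ea & eb) / len(ea | eb)

report = warehouse_checks(MATRIX)
print(report["Customer"])
print(affinity(MATRIX, "Sales Tracking", "Billing System"))
```

A flagged multiple create, or a long list of read-only systems, does not answer the questions in the Actions column; it only identifies where the field by field and replication questions need to be asked.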