
An evaluation of current research trends in data cleaning for data warehousing.

Introduction

The data cleaning process is used to detect inaccurate, incomplete, or unreasonable data and then to improve data quality by correcting and eliminating the detected errors [JA, 00], [REH, 00], [MTH+, 99]. Errors in a data warehouse can result from human error during data entry, the merging of two databases, a lack of common data formats, erroneous or inaccurate data, outdated data, and missing data [PMV, 99], [Pyl, 99]. Different approaches address different issues in data cleaning [OJA, 05], [REH, 00]. The major application areas of data cleaning are data warehousing, data mining, and data quality management.

A data warehouse is a centralized database that captures information from various sources. The purpose of a data warehouse is to provide information from which predictive relationships can be determined and analyzed, so that better business decisions can be made through the use of data mining techniques [AIR+, 99]. Data mining is the process of extracting useful hidden information from the data warehouse. The extracted information should be effective for finding relationships and patterns, and the results should be usable to predict decisions for the business organization [SS, 06]. Data quality is the process of verifying the reliability and effectiveness of data. To improve the quality of data, we need to check the data periodically and scrub it. Data quality is important in data mining from a business perspective because many large companies depend on quality data [Ric, 96], [WKM, 93].

The data cleaning problem appears in the literature under several names: record linkage, semantic integration, instance identification (or the object identity problem), the merge/purge problem, and the data duplication problem [JA, 00], [OJA, 05], [MS, 98]. Most organizations take extra effort to correct data errors before doing any work with the data. A manual data cleansing process is laborious, time consuming, and itself prone to errors [JA, 00]. Useful and powerful tools that automate or greatly assist the data cleansing process are therefore necessary and may be the only practical and cost-effective way to achieve a reasonable quality level in an existing data set.

One important part of data cleaning is the identification of the basic types of errors that can occur in the data warehouse, so that data quality can be improved for the mining work [JA, 03]. This paper reviews the existing methods for detecting and cleaning errors in the data warehouse. It discusses common errors, guidelines, methodologies, and tools that should be considered in any data cleaning exercise. The paper presents a survey of data cleansing approaches and methods, which enables the evaluation and comparison of existing approaches with respect to the types of errors they handle and eliminate.

Common Errors in Data Warehousing

Understanding errors and error propagation can lead to active quality control and managed improvement of the overall data quality in a data warehouse. Identifying common errors helps improve the quality of data and make them "fit for use" by reducing errors in the data [Art, 05], [REH, 00].

Missing data

Missing data may appear as null values, absent values, zeros, or blanks. Dealing with missing or unknown data is critical in the data mining process. Handling null data is further complicated by the variety of ways in which it can be represented [Art, 05], [KCH, 03]. Before doing any work with the data, missing values are often replaced by neighbouring non-missing values, taken in some definite order.

Invalid data

One of the first and most important steps in any data cleaning task is to verify that the data values are correct [REH, 00]. For example, a variable called GENDER would be expected to have only two values, and a variable representing height in inches would be expected to fall within reasonable limits. In general, we need to identify invalid data values and determine where they occur in the data warehouse.

Data Standards

Integrating databases from multiple sources that follow different data standards also causes problems in data representation, because the same piece of information can be represented in many different ways [Kim, 96], [XGS+, 06]. E.g., "M" and "F" versus "Male" and "Female", or "0" and "1".

Inconsistent data

There may be inconsistencies in the data recorded for some transactions. Some data inconsistencies may be corrected manually using external references. There may also be inconsistencies due to data integration, where a given attribute can have different names in different databases. Moreover, data within one system can be inconsistent across locations, reporting units, and time [KCH, 03].

Incorrect Data

Incorrect values may be caused by transposition of keystrokes, entering data in the wrong place, or misunderstanding of the meaning of the captured data, for example where mandatory fields require a value but the data entry operator does not know the correct value. Incorrect data values are the most obvious and common errors and can affect every data value in every field. Spelling errors in scientific names are a common pattern associated with incorrect values in taxonomic data [BL, 94].

Data Entry Problem

A user can easily type Sydeny instead of Sydney; here it is assumed that the person doing the data entry knows the correct spelling of the word but makes a typing error. If the data entry program is not intelligent enough to trap these types of errors, the system will insert dirty data into production data stores. This type of data integrity problem is the most difficult to spot [Art, 05], [Pet, 06], [XGS+, 06].

Poor database design

Many issues arise because of initial design decisions that do not take into account future growth plans for the database. Poor database design often results in a non-normalized database. It is therefore important to recognize the purpose of normalization in database design: most people are aware that a normalized design helps ensure data integrity and improves the performance of updates [Art, 05].

Redundancy

Non-standardised or synonymous data values exist where two or more values or codes have the same meaning. Redundancy is very typical with descriptive data when standardised terminologies are not followed, or where the compilation of data from different sources is badly controlled [ACG, 02].

Incomplete Data

Motro and Rakov (1998 from Dalcin 2004) referred to completeness as "whether all the data are available" and divided data completeness into the completeness of files (no records are missing), and the completeness of records (all fields are known for each record).

Synonyms

Synonyms are words that have the same or almost the same meaning, that is, different names for the same real-life object [ETO, 05], [Kim, 96]. For example, A+ and 99% both denote excellent grade performance.

Homonyms

Homonyms are words that share the same spelling or pronunciation but have different meanings, that is, the same name is used for different objects [ETO, 05], [Pyl, 99]. E.g., "John Smith" and "John Smith" (two different people).

Abbreviations

An abbreviation is a shortened form of a word or phrase. Usually, but not always, it consists of a letter or group of letters taken from that word [ETO, 05], [OJA, 05]. For example, "University of Windsor" can be represented by the abbreviation "U of W".

Format Differences

The format of data in the source system does not always conform to the desired format of data in the data warehouse. Examples include storing addresses as they would appear on an envelope as opposed to a group of separate address lines or atomic address fields (that is, city, state, zip). Other examples include the formatting of orders with associated items, or the formatting of any type of data to look like the forms used by the analysts accessing the data warehouse [ETO, 05]. E.g., "24/12/1999" and "Dec 24, 1999"; "519-256-6416" and "2566416".

Data duplication

Data might be stored in one field that ought to be in multiple fields. Data duplication can also become a major factor in a data warehouse. Removing duplicate data is a difficult process, and serious problems occur without careful data scrubbing and list de-duplication (de-duping). Repetitive data entry and redundancy in programs cause data duplication, particularly when a business has multiple entry points for customer data [XGS+, 06].

Existing Approaches for Data Cleansing

This section describes the existing data cleansing approaches, with an explanation of the data cleaning process and the methods or techniques they use, and then compares them. The existing approaches for data cleaning are [HDD+, 00]:

* AJAX

* FraQL

* Potter's Wheel

* ETL

* ARKTOS

* IntelliClean

* WHIRL

* SchemaSQL

* Febrl

The AJAX system

AJAX is an extensible and flexible framework that separates the logical and physical levels of data cleansing. AJAX's major concern is transforming existing data from one or more data collections into a target schema and eliminating duplicates within this process. For this purpose a declarative language based on a set of five transformation operations is defined: mapping, view, matching, clustering, and merging. The matching operator in particular is fundamental for duplicate detection [GFS+, 01], [WC, 03], [HJ, 03].

The first contribution of AJAX is a framework wherein the logic of a data cleaning program is modeled as a directed graph of data transformations that starts from some input source data and ends by returning clean data. Second, AJAX provides a declarative and extensible language for specifying those transformations, consisting of SQL statements enriched with a set of specific primitives [GFS, 00].

The main features provided by the AJAX data cleaning system are [HDE+, 00]:

* an expressive declarative language for specifying data cleaning programs

* a set of five types of predefined data transformations which support the invocation of externally defined, general or domain-specific functions

* a data lineage mechanism to interactively backtrack within the data cleaning process for providing explanation

* extensibility through an open library of Java functions and algorithm implementations that is very easy to extend

* a graphical interface to display the flow of data transformations applied and to facilitate the user interaction within the cleaning process.

FraQL

FraQL is another declarative language supporting the specification of a data cleansing process. The language is an extension to SQL based on an object-relational data model. It supports the specification of schema transformations as well as data transformations at the instance level, i.e., standardization and normalization of values. This can be done using user-defined functions, which have to be implemented for the domain-specific requirements of the individual data cleansing process [HJ, 03].

With its extended join and union operators, in conjunction with user-defined reconciliation functions, FraQL supports the identification and elimination of duplicates. FraQL can also be used to fill in missing values and to eliminate invalid tuples by detecting and removing outliers and noise in the data.

FraQL provides the following benefits [SS, 01]:

--transparent access to external sources,

--integration of relations from different sources via join and/or union operations,

--resolution of description, structural and semantic conflicts with the help of renaming operations, conversion and mapping functions,

--advanced schema transformation, e.g. transposition of relations, as well as resolution of meta-level conflicts,

--resolution of data discrepancies, i.e. instance-level conflicts, using reconciliation functions and user-defined aggregates.

Potter's Wheel

Potter's Wheel is an interactive data cleansing system that integrates data transformation and discrepancy detection in a single interactive framework with a spreadsheet-like interface. The effects of the performed operations are shown immediately on the tuples visible on screen, while error detection for the whole data collection is done automatically in the background.

Potter's Wheel allows users to define custom domains and corresponding algorithms to enforce domain format constraints; these arbitrary domains are closely related to domain formats. Users specify the desired results on example values, and the system automatically infers regular expressions describing the domain format, so the user does not have to specify them in advance. The inferred domain format specification can afterwards be used to detect discrepancies. Specification of the data cleansing process is done interactively; the immediate feedback on performed transformations and detected errors enables users to gradually develop and refine the process as further discrepancies are found, and to react individually to exceptions [RH, 01], [VJ, 00], [HJ, 03].

ETL

ETL stands for extract, transform, and load. That is, ETL programs periodically extract data from source systems, transform the data into a common format, and then load the data into the target data store, usually a data warehouse. ETL is the heart and soul of business intelligence (BI). ETL tools are also known as migration or mapping tools [WC, 03].

There are many commercial ETL tools that support transformations of different degrees, ranging from tools that do default conversions between a small set of common formats (e.g. WhiteCrane, Cognos) to fairly general tools that support a wide variety of transformations (e.g. Data Junction, Data Builder). Many of these tools provide a visual interface for specifying some common transformations, but typically require users to program other transformations using conversion libraries (e.g. Data Junction's CDI SDK and DJXL). Moreover, these tools typically perform transformations as long-running batch processes, so users do not get early feedback on the effectiveness of a transformation. Complementing the ETL tools are data quality analysis tools (a.k.a. "auditing" tools) that find discrepancies in data. This operation is again long-running and causes many delays in data cleaning [VJ, 00].

ARKTOS

ARKTOS is a framework capable of modeling and executing the Extraction-Transformation-Load (ETL) process for data warehouse creation. The authors consider data cleansing an integral part of this ETL process, which consists of single steps that extract relevant data from the sources, transform it to the target format, cleanse it, and then load it into the data warehouse [HJ, 03], [PZS, 00]. A meta-model is specified allowing the modeling of the complete ETL process. The single steps (cleansing operations) within the process are called activities. Each activity is linked to input and output relations. The logic performed by an activity is declaratively described by a SQL statement. Each statement is associated with a particular error type and a policy which specifies the behaviour (the action to be performed) in case of error occurrence.

ARKTOS is also enriched with a set of "template" generic entities that correspond to the most popular data cleaning tasks (like primary or foreign key violations) and policies (like deleting offending rows, or reporting offending rows to a file or table). For connectivity, ARKTOS uses JDBC to connect to the underlying data stores. More specifically, the cleaning primitives include: (a) primary key violation, (b) reference violation, (c) NULL value existence, (d) uniqueness violation and (e) domain mismatch. Moreover, the tool offers propagation and transformation primitive operations. The ARKTOS policies are: (a) ignore (i.e., do not react to the error and let the row pass), (b) delete (from the source data store), (c) report to a contingency file and (d) report to a contingency table.

IntelliClean

IntelliClean is a rule-based approach to data cleansing with its main focus on duplicate elimination. The processing stage evaluates cleansing rules on the conditioned data items; these rules specify actions to be taken under certain circumstances. There are four different classes of rules [HJ, 03]. Duplicate identification rules specify the conditions under which tuples are classified as duplicates. Merge/purge rules specify how duplicate tuples are to be handled; it is not specified how the merging is to be performed or how its functionality can be declared, and if no merge/purge rule has been specified, duplicate tuples can also be merged manually at the next stage. Update rules specify the way data is to be updated in a particular situation. This enables the specification of integrity-constraint-enforcing rules: for each integrity constraint an update rule defines how to modify the tuple in order to satisfy the constraint. Update rules can also be used to specify how missing values ought to be filled in. Alert rules specify conditions under which the user is notified, allowing for certain actions. During the first two stages of the data cleansing process the actions taken are logged, providing documentation of the performed operations. In the human verification and validation stage these logs are investigated to verify and possibly correct the performed actions.

IntelliClean applies three steps to clean data [ZE, 02], [PFP, 06]. First, it preprocesses the data to remove abbreviations and standardize data formats. Second, it applies two synonymous-record identification rules (Rule 1 and Rule 2), based on a certainty factor (CF), to detect and match synonymous records. The certainty factor CF, where 0 < CF <= 1, represents the confidence in the rule's effectiveness in identifying true duplicates; Rule 1 and Rule 2 have CFs of 1 and 0.9, respectively. IntelliClean also uses one merge/purge rule to merge and purge the synonymous records. The merge/purge rule is applied only where the CF is greater than a user-defined threshold (TH); for example, records A and B with CF = 0.8 and TH = 0.7 are merged because CF > TH. Finally, IntelliClean interacts with the human user to confirm the sets of synonymous records. To allow the user some control over the matching process, recall and precision are measured and presented to the user. Recall is the percentage of true synonymous records that are identified as synonymous; precision is the percentage of records identified as synonymous that are indeed synonymous. High recall is achieved by accepting records with a low degree of similarity as synonymous, at the cost of lower precision; high precision is achieved analogously at the cost of lower recall.

WHIRL

WHIRL retains the original local names and reasons explicitly about the similarity of pairs of names, using statistical measures of document similarity that have been developed in the information retrieval (IR) community. WHIRL combines some properties of statistical IR systems with some properties of database systems. WHIRL queries can involve many different relations, instead of a single document collection [Coh, 00], [Wil, 98].

WHIRL thus takes a different approach, dealing with instance heterogeneity during integration: textual similarity is used for computing joins between relations from different sources. This permits integration without normalization of values but is restricted to textual data [SS, 01].

SchemaSQL

SchemaSQL is one of a number of related higher-order languages. It is used to integrate views from heterogeneous sources [BM, 02].

SchemaSQL offers the capability of uniform manipulation of data and meta-data in relational multi-database systems. It extends the traditional SQL syntax and semantics. It also permits the creation of views whose schema is dynamically dependent on the contents of the input instance. It provides a great facility for interoperability and data management in relational multi-database systems [LSS, 96].

Febrl--Freely Extensible Biomedical Record Linkage

Febrl provides data standardization and probabilistic record linkage with choices of methods for blocking and comparison functions. Febrl's data standardisation "primarily employs a supervised machine learning approach implemented through a novel application of hidden Markov models (HMMs)" [CC, 03].

The main features of Febrl are:

--probabilistic and rules-based cleaning and standardisation routines for names, addresses and dates;

--look-up and frequency tables for names and addresses;

--various comparison functions for names, addresses, dates and localities, including approximate string comparisons, phonetic encodings, geographical distance comparisons, and time and age comparisons;

--several blocking (indexing) methods, as used in many record linkage programs;

--probabilistic record linkage based on the Fellegi and Sunter approach.

Comparison

In Table 3a the described data cleansing approaches are compared according to the types of anomalies they handle, together with a short indication of the methods and techniques each of them uses [HJ, 03].

Similarity functions for Record Linkage

Record linkage algorithms fundamentally depend on string similarity functions for discriminating between equivalent and non-equivalent record fields, as well as on record similarity functions for combining similarity estimates from individual string fields [Mik, 03].

Techniques for calculating similarity between strings can be roughly separated into two broad groups: sequence-based functions and token-based functions. Sequence-based functions model string similarity by viewing strings as contiguous sequences that differ at the level of individual characters. Token-based functions, on the other hand, do not view strings as contiguous sequences but as unordered sets of tokens.

Sequence-based String Similarity Functions

Hamming Distance

This is defined as the number of bits which differ between two binary strings i.e. the number of bits which need to be changed to turn one string into the other. The simple hamming distance function can be extended into a vector space approach where the terms within a string are compared, counting the number of terms in the same positions. The Hamming distance is used primarily for numerical fixed size fields like Zip Code or SSN. It counts the number of mismatches between two numbers [MVA, 02].
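
As an illustration only (not code from any of the cited tools), a minimal Python sketch of the Hamming distance on fixed-length string fields might look like this:

```python
def hamming_distance(a: str, b: str) -> int:
    """Count positions at which two equal-length strings differ.

    Suitable for fixed-size fields such as Zip Code or SSN; it is not
    defined for strings of different length.
    """
    if len(a) != len(b):
        raise ValueError("Hamming distance requires equal-length strings")
    return sum(1 for x, y in zip(a, b) if x != y)


# Example: two zip codes differing in one digit.
print(hamming_distance("48202", "48203"))  # -> 1
```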

String Edit Distance

The Hamming distance cannot be used for variable-length fields since it does not take into account the possibility of a missing letter, e.g., "John" and "Jon", or an extra letter, e.g., "John" and "Johhn". The most well-known sequence-based string similarity measure is string edit distance, also known in its simplest form as Levenshtein distance. The edit distance between two strings is the minimum cost to convert one of them into the other by a sequence of character insertions, deletions, and replacements, where each of these modifications is assigned a cost value [MVA, 02], [Mik, 03], [UMR, 02].
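
A minimal Python sketch of unit-cost Levenshtein distance, assuming all insertions, deletions and substitutions cost 1, could look as follows:

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: minimum number of insertions, deletions,
    and substitutions (unit costs) needed to turn `a` into `b`."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,         # deletion
                            curr[j - 1] + 1,     # insertion
                            prev[j - 1] + cost)) # substitution
        prev = curr
    return prev[-1]


print(edit_distance("John", "Jon"))    # -> 1 (missing letter)
print(edit_distance("John", "Johhn"))  # -> 1 (extra letter)
```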

String Edit Distance with Affine Gaps

Needleman and Wunsch (Needleman & Wunsch, 1970) [UMR, 02] extended the distance model to allow contiguous sequences of mismatched characters, or gaps, in the alignment of two strings. Most commonly the gap penalty is calculated using the affine model, which assigns a high penalty for starting a gap, and a lower linearly increasing penalty for extending a gap. Different edit operations have varying significance in different domains. Therefore, adapting string edit distance to a particular domain requires assigning different weights to different edit operations. In prior work, Ristad and Yianilos have developed a generative model for Levenshtein distance along with an Expectation-Maximization algorithm that learns model parameters using a training set consisting of matched strings [BM, 03].

Soundex

The purpose of the Soundex code is to cluster together names that have similar sounds [MVA, 02]. For example, the Soundex code of "Hilbert" and "Heilbpr" is the same, as is the Soundex code of "John" and "Jon". The Soundex code of a name consists of one letter followed by three numbers.
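
A simplified Python sketch of the classical Soundex encoding is shown below; the letter-to-digit mapping is the standard American Soundex table, and some edge cases of the full algorithm are ignored:

```python
SOUNDEX_MAP = {c: d for d, letters in
               {"1": "BFPV", "2": "CGJKQSXZ", "3": "DT",
                "4": "L", "5": "MN", "6": "R"}.items()
               for c in letters}


def soundex(name: str) -> str:
    """Simplified Soundex: one letter followed by three digits, so that
    names that sound alike map to the same code."""
    name = name.upper()
    first, code, prev = name[0], "", SOUNDEX_MAP.get(name[0], "")
    for c in name[1:]:
        digit = SOUNDEX_MAP.get(c, "")
        if digit and digit != prev:
            code += digit
        if c not in "HW":       # H and W do not break a run of equal codes
            prev = digit
    return (first + code + "000")[:4]


print(soundex("Hilbert"), soundex("Heilbpr"))  # same code: H416 H416
print(soundex("John"), soundex("Jon"))         # same code: J500 J500
```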

Jaro

Jaro proposed a metric that is less sensitive to character transpositions. Jaro introduced a string comparison function that accounts for insertions, deletions, and transpositions. Jaro's algorithm finds the number of common characters and the number of transposed characters in the two strings. A common character is a character that appears in both strings within a distance of half the length of the shorter string. A transposed character is a common character that appears in different positions [MVA, 02].
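
A Python sketch of the Jaro similarity is given below; it uses the commonly cited matching window of half the length of the longer string minus one, which differs slightly from the half-of-the-shorter-string wording above:

```python
def jaro(a: str, b: str) -> float:
    """Jaro similarity: combines the number of common characters and the
    number of transpositions between two strings (1.0 = identical)."""
    if not a or not b:
        return 0.0
    window = max(max(len(a), len(b)) // 2 - 1, 0)
    a_match = [False] * len(a)
    b_match = [False] * len(b)
    m = 0
    for i, ca in enumerate(a):                       # find common characters
        lo, hi = max(0, i - window), min(len(b), i + window + 1)
        for j in range(lo, hi):
            if not b_match[j] and b[j] == ca:
                a_match[i] = b_match[j] = True
                m += 1
                break
    if m == 0:
        return 0.0
    t, k = 0, 0                                      # count transpositions
    for i, ca in enumerate(a):
        if a_match[i]:
            while not b_match[k]:
                k += 1
            if ca != b[k]:
                t += 1
            k += 1
    t //= 2
    return (m / len(a) + m / len(b) + (m - t) / m) / 3


print(round(jaro("MARTHA", "MARHTA"), 3))  # -> 0.944
```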

Jaro-Winkler String Similarity

The Jaro-Winkler distance (Winkler, 1999) is a measure of similarity between two strings. It is designed and best suited for short strings. It is a variant of the Jaro distance metric (Jaro, 1989, 1995) and mainly used in the area of record linkage (duplicate detection) [Mik, 03].

The Jaro-Winkler string comparator [Winkler] is the comparator developed at the U.S. Census Bureau and used in the Census Bureau record linkage software. The basis of this comparator is the count of common characters between the strings, where a character is counted as common if it occurs in the other string within a position distance that depends on the string length.

The value of the string comparator is further determined by a count of transpositions. Transpositions are determined by pairs of common characters out of order. A further enhancement to the score computation due to Winkler is based on the observation that typographical errors occur more commonly toward the end of a string, so that strings that agree on prefixes (1 to 4 characters) are given a higher comparison score. Specifically, the algorithm considers the first four prefixes of each string and lets p be the length of the longest prefix pair that agree exactly (ai = bi).
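
Building on the jaro() sketch above, the Winkler prefix adjustment can be added in a few lines; the prefix weight p = 0.1 and the four-character prefix limit are commonly used defaults, not values mandated by the text:

```python
def jaro_winkler(a: str, b: str, p: float = 0.1) -> float:
    """Jaro-Winkler: boosts the Jaro score when the strings share a common
    prefix of up to four characters, reflecting the observation that
    typographical errors occur more often toward the end of a string."""
    j = jaro(a, b)                       # jaro() as sketched above
    prefix = 0
    for ca, cb in zip(a[:4], b[:4]):
        if ca != cb:
            break
        prefix += 1
    return j + prefix * p * (1.0 - j)


print(round(jaro_winkler("MARTHA", "MARHTA"), 3))  # -> 0.961
```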

Disadvantage of Sequence based similarity

While sequence-based string similarity functions work well for estimating distance between shorter strings that differ largely at character level, they become too computationally expensive and less accurate for larger strings.

Token-based String Similarity Functions

The string edit distance becomes computationally prohibitive for larger strings such as text documents on the Web since the computational complexity is quadratic in average string size. The vector-space model of text avoids these problems by viewing strings as "bags of tokens" and disregarding the order in which the tokens occur in the strings (Salton & McGill, 1983) [Mik, 03].

Jaccard Coefficient

Jaccard similarity can be used as the simplest method for computing likeness, as the proportion of tokens shared by both strings. The primary problem with Jaccard similarity is that it does not take into account the relative importance of different tokens [Mik, 03].
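
A minimal Python sketch of Jaccard similarity over word tokens follows; the whitespace tokenization is an assumption for illustration:

```python
def jaccard(a: str, b: str) -> float:
    """Jaccard similarity: proportion of distinct word tokens shared by
    both strings, |A intersection B| / |A union B|."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    if not ta and not tb:
        return 1.0
    return len(ta & tb) / len(ta | tb)


print(round(jaccard("John Smith", "Smith John Jr"), 3))  # 2 shared / 3 total -> 0.667
```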

TF-IDF cosine similarity

The Term Frequency-Inverse Document Frequency (TF-IDF) weighting scheme achieves this by associating with every token vi from string s a weight

w(vi, s) = [N(vi, s) / max{vj in s} N(vj, s)] x log[N / N(vi)],

where N(vi, s) is the number of times vi occurs in s (term frequency), N is the number of strings in the overall corpus under consideration, and N(vi) is the number of strings in the corpus that include vi (document frequency).

Given a corpus of strings that yields the set V of distinct tokens after tokenization, a string s can be represented as a |V|-dimensional vector of weights w(vi, s), every nonzero component of which corresponds to a token present in s. TF-IDF cosine similarity between two strings is defined as the cosine of the angle between their vector representations:

SimTF-IDF(s, t) = [sum over vi in V of w(vi, s) x w(vi, t)] / [sqrt(sum over vi in V of w(vi, s)^2) x sqrt(sum over vi in V of w(vi, t)^2)]

With the help of appropriate hashing data structures, TF-IDF cosine similarity is computationally efficient due to high sparsity of most vectors, and provides a reasonable off-the-shelf metric for long strings and text documents [Mik, 03].
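
The following Python sketch implements the TF-IDF weighting and cosine similarity defined above on a toy corpus; the whitespace tokenization and the example strings are assumptions made purely for illustration:

```python
import math
from collections import Counter


def tfidf_vectors(corpus):
    """TF-IDF weight vectors as defined above: tf normalised by the most
    frequent token in the string, idf = log(N / document frequency)."""
    tokenised = [s.lower().split() for s in corpus]
    n = len(tokenised)
    df = Counter(tok for toks in tokenised for tok in set(toks))
    vectors = []
    for toks in tokenised:
        tf = Counter(toks)
        max_tf = max(tf.values())
        vectors.append({t: (tf[t] / max_tf) * math.log(n / df[t]) for t in tf})
    return vectors


def cosine(u, v):
    """Cosine of the angle between two sparse weight vectors."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0


corpus = ["john smith main street", "smith john main st", "mary jones park road"]
vecs = tfidf_vectors(corpus)
print(round(cosine(vecs[0], vecs[1]), 3))  # shares informative tokens -> positive
print(round(cosine(vecs[0], vecs[2]), 3))  # no tokens in common -> 0.0
```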

n-grams

Tokenization is typically performed by treating each individual word of certain minimum length as a separate token, usually excluding a fixed set of functional "stopwords" and optionally stemming tokens to their roots (Baeza-Yates & Ribeiro-Neto, 1999). An alternative tokenization scheme is known as n-grams: it relies on using all overlapping contiguous subsequences of length n as tokens.

N-grams provide another approach for computing the distance between two strings. The n-grams comparison function forms the set of all substrings of length n for each string. The distance between the two strings is then defined as the sum over all substrings x of |fa(x) - fb(x)|, where fa(x) and fb(x) are the number of occurrences of the substring x in the two strings a and b, respectively. Bigrams comparison (n = 2) is known to be very effective with minor typographical errors and is widely used in the field of information retrieval [Mik, 03], [MVA, 02]. Trigrams comparison (n = 3) was used by Hylton for record linkage of bibliographical data. More recently, n-grams were extended to what is referred to as q-grams for computing approximate string joins efficiently. N-grams are more efficient than edit distance or Jaro's algorithm for strings that contain multiple words and are commonly in error with respect to word order. For example, comparing "John Smith" with "Smith John" results in 0.342 using Jaro's algorithm, 0.5 using edit distance, 0.375 using trigrams, and 0.222 using bigrams. Bigrams comparison gives the lowest value, which means that the two strings are much closer using bigrams than using the other comparison functions.
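
A Python sketch of the n-gram distance defined above (the sum of absolute differences of n-gram counts), with bigrams as the default:

```python
from collections import Counter


def ngrams(s: str, n: int = 2) -> Counter:
    """All overlapping character n-grams of a string (bigrams by default)."""
    s = s.lower()
    return Counter(s[i:i + n] for i in range(len(s) - n + 1))


def ngram_distance(a: str, b: str, n: int = 2) -> int:
    """Distance as defined above: sum over all n-grams x of |fa(x) - fb(x)|,
    where fa and fb count occurrences of x in strings a and b."""
    ga, gb = ngrams(a, n), ngrams(b, n)
    return sum(abs(ga[x] - gb[x]) for x in set(ga) | set(gb))


print(ngram_distance("John Smith", "Smith John"))  # word-order swap -> 4
print(ngram_distance("Sydney", "Sydeny"))          # small typo       -> 6
```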

Vector-space Similarity Functions

In Euclidean space the Minkowski family of metrics, also known as the Lk norms, includes the most commonly used similarity measures for objects described by d-dimensional vectors [Mik, 03], [BM, 03]. The L2 norm, commonly known as Euclidean distance, is frequently used for low-dimensional vector data. Its popularity is due to a number of factors:

* Intuitive simplicity: the L2 norm corresponds to straight-line distance between points in Euclidean space;

* Invariance to rotation or translation in feature space;

* Mathematical metric properties: non-negativity (L2(x, y) >= 0), reflexivity (L2(x, y) = 0 iff x = y), symmetry (L2(x, y) = L2(y, x)), and triangle inequality (L2(x, y) + L2(y, z) >= L2(x, z)).

Despite these attractive characteristics, Euclidean distance as well as other Minkowski metrics suffers from the curse of dimensionality when they are applied to high-dimensional data (Friedman, 1997). As the dimensionality of the Euclidean space increases, sparsity of observations increases exponentially with the number of dimensions, which leads to observations becoming equidistant in terms of Euclidean distance.

The TF-IDF weighting scheme is useful for similarity computations because it attempts to give tokens weights that are proportional to the tokens' relative importance. However, the true contribution of each token to similarity is domain-specific.

Fellegi and Sunter

Fellegi and Sunter gave a formal model for record linkage that involves optimal decision rules dividing the product space AxB of pairs of records from two files A and B into matches and non-matches, denoted by M and U, respectively. The main issue is the accuracy of the estimates of the probability distributions used in a crucial likelihood ratio. When the estimates are sufficiently accurate, the decision rules are (nearly) optimal, in the sense that, for fixed bounds on the proportions of false matches and false non-matches, the size of the set of pairs on which no decision is made is minimized [BM, 03], [FS, 69], [Wil, 93].
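
A schematic Python sketch of a Fellegi-Sunter style decision rule is shown below; the per-field m and u probabilities, the thresholds, and the field names are hypothetical values chosen for illustration, not estimates from any real data set:

```python
import math


def match_weight(record_a, record_b, params):
    """Fellegi-Sunter style match weight: for each compared field, add
    log2(m/u) when the field agrees and log2((1-m)/(1-u)) when it disagrees,
    where m and u are the (estimated) agreement probabilities among true
    matches and true non-matches."""
    w = 0.0
    for field, (m, u) in params.items():
        if record_a[field] == record_b[field]:
            w += math.log2(m / u)
        else:
            w += math.log2((1 - m) / (1 - u))
    return w


def classify(weight, upper, lower):
    """Decision rule: above the upper threshold -> match, below the lower
    threshold -> non-match, otherwise left undecided (clerical review)."""
    if weight >= upper:
        return "match"
    if weight <= lower:
        return "non-match"
    return "possible match"


# Hypothetical m/u probabilities per field and hypothetical thresholds.
params = {"surname": (0.95, 0.01), "zip": (0.9, 0.05), "dob": (0.98, 0.001)}
a = {"surname": "smith", "zip": "48202", "dob": "1970-01-01"}
b = {"surname": "smith", "zip": "48203", "dob": "1970-01-01"}
w = match_weight(a, b, params)
print(round(w, 2), classify(w, upper=8.0, lower=0.0))
```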

Blocking Methods

As potentially each record in one data set has to be compared to all records in a second data set, the number of record pair comparisons grows quadratically with the number of records to be matched. This approach is computationally infeasible for large data sets. To reduce the huge number of possible record pair comparisons, traditional record linkage techniques work in a blocking fashion, i.e. they use a record attribute (or sub-set of attributes) to split the data sets into blocks. Record pairs for detailed comparison are then generated from all the records in the same block. The choice of a good blocking method can greatly reduce the number of record pair evaluations to be performed and so achieve significant performance speed-ups.

Standard Blocking

The Standard Blocking (SB) method clusters records into blocks where they share the identical blocking key. A blocking key is defined to be composed from the record attributes in each data set. A blocking key can also be composed of more than one attribute [RPT, 03].

A further consideration is the choice of blocking keys. To achieve good accuracy in creating blocking keys (tokens), further steps are needed. Ezeife and Ohanekwu have developed a technique for token creation [ETO, 05]. It first decomposes a field value into different parts and removes unimportant elements and title tokens like "Mr" and "Dr". Tokens are then formed as numeric, alphanumeric, or alphabetic values. A numeric token comprises only digits (0-9). An alphabetic token consists of letters (a-z, A-Z); it takes the first character of each word in the field and sorts the characters in order. An alphanumeric token comprises both numeric and alphabetic parts; a given alphanumeric element is decomposed into its numeric and alphabetic parts, and the resulting set of tokens is sorted in a certain order.
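
A rough Python sketch in the spirit of this token-based blocking-key construction follows; the title-token list and the exact token recipe are simplified assumptions, not the full procedure of [ETO, 05]:

```python
import re
from collections import defaultdict

TITLES = {"mr", "mrs", "ms", "dr", "prof"}   # assumed list of title tokens


def smart_tokens(field: str) -> str:
    """Simplified token key: drop title tokens, then concatenate the sorted
    first letters of alphabetic words and any sorted numeric parts."""
    words = [w for w in re.findall(r"[A-Za-z0-9]+", field.lower())
             if w not in TITLES]
    alpha = sorted(w[0] for w in words if w.isalpha())
    numeric = sorted(w for w in words if w.isdigit())
    return "".join(alpha) + "".join(numeric)


def standard_blocking(records, key_field):
    """Group records into blocks sharing an identical blocking key; detailed
    pair comparison is then done only within each block."""
    blocks = defaultdict(list)
    for rec in records:
        blocks[smart_tokens(rec[key_field])].append(rec)
    return blocks


records = [{"id": 1, "name": "Dr John Smith"},
           {"id": 2, "name": "Smith, John"},
           {"id": 3, "name": "Mary Jones"}]
for key, block in standard_blocking(records, "name").items():
    print(key, [r["id"] for r in block])   # records 1 and 2 share the key "js"
```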

Multiple blocking keys are also used as a way to mitigate the effects of errors in blocking keys. Several passes (iterations) with different blocking keys are performed, resulting in different blocks and different record pair comparisons.

Sorted Neighbourhood

The Sorted Neighbourhood (SN) method sorts the records on a sorting key (SK) and then moves a window of fixed size w, called the sliding window (SW), sequentially over the sorted records. Records within the window are then paired with each other and included in the candidate record pair list. The use of the window limits the number of possible record pair comparisons for each record to 2w-1 [MS, 98], [RPT, 03], [JV]. The resulting total number of record pair comparisons (assuming two data sets with n records each) of the sorted neighbourhood method is O(wn). One problem with the sorted neighbourhood method arises if the number of records with the same sorting key value is larger than the window size. As with standard blocking, it is advantageous to do several passes (iterations) with different sorting keys and a smaller window size rather than one pass only with a large window size [MMT+, 03].
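
A minimal Python sketch of the sliding-window pair generation; the record structure and the sorting key are assumptions made for illustration:

```python
def sorted_neighbourhood_pairs(records, sort_key, w=3):
    """Sorted Neighbourhood method: sort records on a sorting key and slide
    a window of fixed size w over the sorted list; only records falling in
    the same window are paired for detailed comparison."""
    ordered = sorted(records, key=sort_key)
    pairs = set()
    for i, rec in enumerate(ordered):
        for j in range(i + 1, min(i + w, len(ordered))):
            pairs.add((rec["id"], ordered[j]["id"]))
    return pairs


records = [{"id": 1, "name": "smith john"},
           {"id": 2, "name": "smith jon"},
           {"id": 3, "name": "jones mary"},
           {"id": 4, "name": "smyth john"}]
print(sorted_neighbourhood_pairs(records, sort_key=lambda r: r["name"], w=3))
```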

Multi-Pass Approach

In general, no single key will be sufficient to catch all matching records. The attributes or fields that appear first in the key have higher discriminating power than those appearing after them. Hence, if the error in a record occurs in the particular field or portion of the field that is the most important part of the key, there is little chance the record will end up close to a matching record after sorting. The alternative strategy is to execute several independent runs of the sorted-neighborhood method, each time using a different key and a relatively small window; Hernandez and Stolfo call this strategy the multi-pass approach [MMT+, 03]. Each independent run produces a set of pairs of records which can be merged, after which the transitive closure is applied to those pairs of records.

Transitive Closure

In general, no single pass will be sufficient to catch all matching records; an attribute that appears first in the key has higher discriminating power than those appearing after it. The result is a union of all pairs discovered by all independent runs, with no duplicates, plus all those pairs that can be inferred by transitivity of equality. Note that the use of a transitive closure step is not limited to the multi-pass approach [MMT+, 03]: computing the transitive closure of the results can also improve the accuracy of a single pass. If records a and b are found to be similar and, at the same time, records b and c are also found to be similar, the transitive closure step can mark a and c as similar even if this relation was not detected by the equational theory. Moreover, records a and b must be within w records of each other to be marked as similar by the equational theory, and the same is true for records b and c; but if the transitive closure step is used, a and c need not be within w records of each other to be detected as similar. The use of a transitive closure at the end of any single-pass run of the sorted-neighborhood method should therefore allow a smaller scanning window w while still detecting a comparable number of similar pairs to a run with a larger w and no final closure phase.
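
A small Python sketch of the transitive closure step using union-find; the pair format is an assumption for illustration:

```python
def transitive_closure(pairs):
    """Merge pairwise matches from one or more passes into clusters using
    union-find, so that a~b and b~c puts a, b and c in the same cluster."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]   # path halving
            x = parent[x]
        return x

    for a, b in pairs:
        parent[find(a)] = find(b)

    clusters = {}
    for x in list(parent):
        clusters.setdefault(find(x), set()).add(x)
    return list(clusters.values())


# Pairs found in two independent passes with different keys.
print(transitive_closure([("a", "b"), ("b", "c"), ("d", "e")]))
# clusters: {a, b, c} and {d, e}
```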

Duplication Elimination-Sorted Neighborhood Method (DE-SNM)

Using the sorted neighbourhood method, exact or very closely matching duplicate records with equivalent keys can be identified, and it would likely be more efficient to find these records first during the sorting phase [MMT+, 03]. Removing them from the dataset, and retaining only one representative member of this set of duplicates for comparison by the rule set against other records with different keys, may yield better results. Hernandez and Stolfo call the single retained member of the set of equivalent records with duplicate keys the "prime representative" of this set of records. The following heuristic strategy captures this idea. While sorting, records with exactly equal keys are considered, and it is then determined whether these records are exactly equal or matching. All records found to be exact duplicates or matching records are replaced by one prime example, which is later processed by the sorted-neighborhood method against records with different keys. Removing all matching records except one tends to let more records enter the same window, for comparison by the equational theory, than would otherwise be possible if all the records with duplicate keys were allowed to occupy all available slots in the window.

Bigram Indexing

The Bigram Indexing (BI) method as implemented in the Febrl record linkage system allows for fuzzy blocking. The basic idea is that the blocking key values are converted into a list of bigrams (sub-strings containing two characters) and sub-lists of all possible permutations will be built using a threshold (between 0.0 and 1.0). The resulting bigram lists are sorted and inserted into an inverted index, which will be used to retrieve the corresponding record numbers in a block [RPT, 03].

The number of sub-lists created for a blocking key value depends both on the length of the value and on the threshold. The lower the threshold, the shorter the sub-lists, but also the more sub-lists there will be per blocking key value, resulting in more (and smaller) blocks in the inverted index. In the information retrieval field, bigram indexing has been found to be robust to small typographical errors in documents. As with standard blocking, the number of record pair comparisons for two data sets with n records each and b blocks all containing the same number of records is O(n^2 / b) [MA, 07]. However, as discussed above, the number of blocks b will be much larger with bigram indexing.
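
A rough Python sketch of this fuzzy blocking idea follows; the way the threshold is turned into a sub-list length and the example records are simplifying assumptions, not the exact Febrl implementation:

```python
from itertools import combinations
from collections import defaultdict


def bigram_index(records, key_field, threshold=0.6):
    """Fuzzy blocking via bigram indexing: each blocking key value becomes a
    sorted list of bigrams, every sub-list containing roughly a `threshold`
    fraction of those bigrams is an entry in an inverted index, and records
    that share any entry fall into the same block."""
    index = defaultdict(set)
    for rec in records:
        value = rec[key_field].lower().replace(" ", "")
        bigrams = sorted(value[i:i + 2] for i in range(len(value) - 1))
        k = max(1, int(round(threshold * len(bigrams))))
        for sub in combinations(bigrams, k):
            index["".join(sub)].add(rec["id"])
    return index


records = [{"id": 1, "surname": "baxter"}, {"id": 2, "surname": "bakter"}]
index = bigram_index(records, "surname", threshold=0.6)
print([ids for ids in index.values() if len(ids) > 1])
# the two misspelt surnames share at least one sub-list, e.g. "baerte"
```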

Canopy Clustering with TFIDF

Canopy Clustering with TFIDF (Term Frequency/Inverse Document Frequency) forms blocks of records based on those records placed in the same canopy cluster. A canopy cluster is formed by choosing a record at random from a candidate set of records (initially, all records) and then putting into its cluster all the records within a certain loose threshold distance of it. The record chosen at random and any records within a certain tight threshold distance of it are then removed from the candidate set of records. The TFIDF distance metric is used, with bigrams as tokens; the algorithm and details can be found in [AKL, 00]. The number of record pair comparisons resulting from canopy clustering is O(f n^2 / c), where n is the number of records in each of the two data sets, c is the number of canopies, and f is the average number of canopies a record belongs to. The threshold parameters should be set so that f is small and c is large, in order to reduce the amount of computation; however, if f is too small, the method will not be able to detect typographical errors [RPT, 03], [MA, 07].

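A schematic Python sketch of canopy formation is given below; for brevity it uses a toy numeric distance instead of the TF-IDF/bigram metric described above, and the thresholds are arbitrary illustrative values:

```python
import random


def canopy_clusters(records, distance, loose, tight):
    """Canopy clustering: repeatedly pick a random seed record, put every
    record within the loose threshold distance into its canopy, and remove
    the seed plus everything within the tight threshold from the candidate
    set (tight <= loose, so canopies may overlap)."""
    candidates = list(records)
    canopies = []
    while candidates:
        seed = random.choice(candidates)
        canopies.append([r for r in records if distance(seed, r) <= loose])
        candidates = [r for r in candidates if distance(seed, r) > tight]
    return canopies


points = [1.0, 1.2, 1.3, 5.0, 5.1, 9.0]
print(canopy_clusters(points, distance=lambda a, b: abs(a - b),
                      loose=1.0, tight=0.5))
```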

Performance Analysis

Performance measures are a valuable addition to quality control procedures and are also used to manage data cleaning processes. Performance measures may include statistical checks on the data (for example, 95% of all records have an accuracy of less than 5,000 meters from their reported position), on the level of quality control, and on completeness.

Performance Parameters Used

The performance of each data cleaning algorithm should be measured against the following parameters [ETO, 05]: 1) recall (RC), 2) false-positive error (FPE), 3) reverse false-positive error (RFP), 4) precision, and 5) threshold.

Recall:

Recall is the ratio indicating the number of duplicates correctly identified by a given algorithm: X is the number of duplicates identified out of Y total duplicates. Formally,

Recall = X/Y

False-Positive Error

It is possible that true duplicate records are not detected, while similar records which do not represent the same real-world entity are treated as duplicates. These incorrectly paired records are known as false positives. The false-positive error is the ratio of wrongly identified duplicates. Formally,

FPE = (number of wrongly identified duplicates x 100) / (total number of identified duplicates)

Reverse False-Positive error (RFP)

The reverse false-positive error indicates the number of duplicates that a given data cleaning algorithm could not identify. Formally,

RFP = (number of duplicates that escaped identification x 100) / (total number of duplicates)

Every data cleaning algorithm should have 1) a high recall, 2) a very low (ideally zero) FPE, and hence high precision, and 3) a very low (ideally zero) RFP. To achieve high recall, i.e. to detect more duplicate records in the dataset, we can accept records with low degrees of similarity as duplicates. This can be done by using a smaller value for the threshold. However, this leads to lower precision, i.e. a higher false-positive error, as more "duplicates" are identified wrongly. Analogously, we can achieve higher precision at the cost of lower recall by setting a higher threshold [LLL, 01].

Precision

Precision, also known as positive predictive value, is the ratio of detected data points that indeed contain artifacts. Recall, also known as sensitivity, is the ratio of data artifacts that are detected.

F-Score

Strategies for data cleaning may differ according to the types of data artifacts, but they generally face the recall-precision dilemma. We first define recall and precision using true positives (TP), false positives (FP) and false negatives (FN).

precision = TP / (TP + FP)

recall = TP / (TP + FN)

The F-score combines recall and precision into a single score:

F-score = (2 x precision x recall) / (precision + recall)

The recall-precision dilemma indicates that the higher the recall, the lower the precision, and vice versa. Data artifact detection methods are commonly associated with criteria or thresholds that differentiate the artifacts from the non-artifacts. Higher recall can be achieved by relaxing some of the criteria or thresholds, which increases the number of TP, but precision is correspondingly reduced because FP also increases. Stringent criteria or high thresholds may reduce FP and thus increase precision, but at the same time reduce the number of positives detected and thus the recall. Achieving both high recall and high precision, and therefore a high F-score, is a common objective for data cleaners.
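
A small worked Python example, computing the F-score from assumed raw counts (the counts are hypothetical, chosen only to illustrate the formula):

```python
def f_score(tp: int, fp: int, fn: int) -> float:
    """Harmonic mean of precision and recall computed from raw counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# A cleaning run that flags 90 true artifacts, raises 10 false alarms,
# and misses 30 artifacts: precision 0.9, recall 0.75.
print(round(f_score(tp=90, fp=10, fn=30), 3))  # -> 0.818
```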

Conclusion

This paper contains a comprehensive survey of the existing approaches that are used for data cleaning and of the common errors which can occur in data warehouse records. This survey should be useful as a basis for substantial improvement in the quality of existing data.

Currently, many algorithms are available for field similarity and blocking methods to clean data effectively and to reduce the time needed for the cleaning process. Research in data warehousing emphasizes relatively simple and fast comparison techniques that can be applied to data warehouses with millions of records to reduce the number of comparisons. Such techniques typically do not rely on the existence of training data and emphasize efficiency over effectiveness.

An interesting direction for future research is to survey existing techniques and compare existing approaches in order to improve the quality of the data cleaning process. Most of the similarity functions and blocking methods available today offer various algorithmic approaches for speeding up the data cleaning process.

Effective data cleaning algorithms and techniques are needed for incoming data from any data source. Another related aspect of the data cleaning process is to develop methods that allow the user to detect and correct errors in data cleaning projects. Data cleaning is computationally expensive on very large data sets and is very difficult to do with older methodologies; new technology has to be developed so that the data cleansing process completes in acceptable time on large amounts of data. Researchers still have to tackle many issues in the data cleansing area.

References

[1] [AIR+, 99] M. S. Almeida, M. Ishikawa, J. Reinschmidt, and T. Roeber, "Getting Started with Data Warehouse and Business Intelligence". IBM Redbooks, 1999.

[2] [AMA, 03] Andrew Borthwick, Martin Buechi, and Arthur Goldberg, "Automated Database Blocking and Record Matching", U.S. Patent # 7,152,060. Filed April 11, 2003. Awarded December 19, 2006.

[3] [Art, 05] Arthur D. Chapman, Principles and Methods of Data Cleaning--Primary Species and Species-Occurrence Data, version 1.0. Report for the Global Biodiversity Information Facility, Copenhagen, 2005.

[4] [BL, 94] V. Barnett and T. Lewis, "Outliers in Statistical Data", John Wiley and Sons, New York, 1994.

[5] [BM, 02] Barbancon, F.; Miranker, D.P, "Implementing federated database systems by compiling SchemaSQL", Database Engineering and Applications Symposium, Proceedings. International Volume, Issue, Pages 192-201, 2002.

[6] [BM, 03] M. Bilenko and R. J. Mooney, "Adaptive duplicate detection using learnable string similarity measures", ACM SIGKDD, pages 39-48, 2003.

[7] [BZS+, 99] V. Brusic, J. Zeleznikow, T. Sturniolo, E. Bono, and J. Hammer. "Data cleansing for computer models: a case study from immunology", 6th International Conference on Neural Information Processing (ICONIP), pages 603-609, 1999.

[8] [CC, 03] Christen, P., Churches, "Febrl--Freely extensible biomedical record linkage'", T. Release 0.2 edition, Apr. 2003.

[9] [CDE, 03] Cochinwala M, Dalal S, Elmagarmid A, and Verykios, V, "Record Matching: Past, Present and Future", Submitted to ACM Computing Surveys, 2003.

[10] [Coh, 00] Cohen, W. "WHIRL: A Word-based Information Representation Language Artificial Intelligence", pages 118:163-196, 2000.

[11] [Coh, 00] W. W. Cohen, "Data Integration using similarity joins and a word-based information representation language". Information Systems, pages 607-633, 2001.

[12] [EKV, 07] A. K. Elmagarmid, P. G. Ipeirotis, and V. S. Verykios, "Duplicate Record Detection: A Survey", IEEE TKDE, pages 1-16, 2007.

[13] [ETO, 05] C.I. Ezeife and Timothy E. Ohanekwu, "Use of Smart Tokens in Cleaning Integrated Warehouse Data", the International Journal of Data Warehousing and Mining (IJDW), Vol. 1, No. 2, pages 1-22, Ideas Group Publishers, April-June 2005.

[14] [EW] Edward H. Porter and William E. Winkler, "Approximate String Comparison and its Effect on an Advanced Record Linkage System".

[15] [FS, 69] Fellegi, I. and Sunter, A., "A Theory for Record Linkage", JASA, 1969.

[16] [FSW, 97] Famili, A., Shen, W.-M., Weber, R., Simoudis, E. "Data Pre-Processing and Intelligent Data Analysis", International Journal on Intelligent Data Analysis, pages 1-28. January 1997,

[17] [GFS+, 00] Galhardas, H., Florescu, D., Shasha, D. and Simon, E, "Declaratively cleaning your data using AJAX", In Journees Bases de Donnees, Oct. 2000.

[18] [GFS+, 01] H. Galhardas, D. Florescu, D. Shasha, E. Simon, and C. A. Saita. Declarative Data Cleaning: Language, model, and algorithms, VLDB, pages 371-380, 2001.

[19] [HDD+, 00]Helena Galhardas, Daniela Florescu, and Dennis Shasha, "An Extensible Framework for Data Cleaning", Proc. of Int. Conf. on Data Engineering (ICDE), San Diego, February, 2000.

[20] [HDE+, 00] Helena Galhardas, Daniela Florescu, Eric Simon, Dennis Shasha, "AJAX: An Extensible Data Cleaning Tool", Proc. of ACM SIGMOD Conf. on Management of Data, pages 590-602, May, 2000.

[21] [HDE+, 01] Helena Galhardas, Daniela Florescu, Eric Simon, Cristian Augustin Saita, Dennis Shasha, "Declarative Data Cleaning: Language, Model, and Algorithms", Proc. of the Int. Conf. on Very Large Data Bases (VLDB), Rome, Italy, September, 2001

[22] [Her, 95] M. Hernandez, "A generation of band joins and the merge/purge problem", Techical report CUCS-005-1995, Department of Computer Science, Columbia University, 1995.

[23] [HJ, 03] Heiko Muller, Johann-Christoph Freytag, "Problems, Methods, and Challenges in Comprehensive Data Cleansing", 2003

[24] [HS, 95] A. M. Hernandez and J. S. Stolfo, "The Merge/Purge Problem for Large Databases", ACM SIGMOD, pages 127-138, 1995.

[25] [HS, 98] A. H. Mauricio and J. S. Stolfo, "Real-world Data is dirty: Data Cleansing and the Merge/Purge Problem", Data Mining and Knowledge Discovery, pages 9-27, 1998.

[26] [Inm, 93] W. H. Inmon. "Building the Data Warehouse", Wiley-QED, New York, 1993.

[27] [JA, 00] Jonathan I. Maletic and Andrian Marcus, "Data Cleansing: Beyond Integrity Analysis", In Proceedings of the Conference on Information Quality (IQ2000).

[28] [JA, 03] Jeremy Kubica, Andrew Moore, "Probabilistic Noise Identification and Data Cleaning", icdm, Third IEEE International Conference on Data Mining (ICDM'03), page 131, 2003.

[29] [JM, 01] Jin I Li C and Mehrotra, S. Efficient Record Linkage in Large Data Sets, Paper 113, UC Irvine Tech Reports, 2001.

[30] [JV] Jordi Nin and Vicenç Torra, "Blocking anonymized data", IIIA, Artificial Intelligence Research Institute, CSIC, Spanish National Research Council, Campus UAB.

[31] [KB, 05] J. L. Y. Koh and V. Brusic, "Bioinformatics Database Warehousing. Bioinformatics Technologies", Y. P. P. Chen ed., Springer, pages 45-62, 2005.

[32] [KCH+, 03] W. Kim, B. J. Choi, E. K. Hong, S. K. Kim, and D. Lee, "A Taxonomy of Dirty Data", Data Mining and Knowledge Discovery, pages 81-99, 2003.

[33] [Kim, 96] R. Kimball, "Dealing with Dirty Data. DBMS Online", www.dbmsmag.com/9609d14.htm, 1996.

[34] [KM, 03] J. Kubica and A. Moore, "Probabilistic Noise Identification and Data Cleaning", ICDE, pages 131-138, 2003.

[35] [KNT, 00] E. M. Knorr, R. T. Ng, V. Tucakov, "Distance-based Outliers: Algorithms and Applications", VLDB Journal, pages 237-253, 2000.

[36] [LHK, 04] M. L. Lee, W. Hsu, and V. Kothari, "Cleaning up the spurious links in data", IEEE Intelligent Systems: Special issue on Data and Information Cleaning and Preprocessing. 19(2):28-33, 2004.

[37] [LLL, 00] M. L. Lee, T. W. Ling, and W. L. Low, "IntelliClean: a knowledge-based intelligent data cleaner", ACM SIGKDD, pages 290-294, 2000.

[38] [LLL,01] Low, W.L., Lee, M.L., Ling, T.W., "A Knowledge- based Approach for Duplicate Elimination in Data Cleaning", Information Systems 26, 2001

[39] [LRD, 03] Lifang Gu, Rohan Baxter, Deanne Vickers, and Chris Rainsford, "Record Linkage: Current Practice and Future Directions", CMIS Technical Report No. 03/ 83, Apr. 2003.

[40] [LSS, 96] Lakshmanan, Sadri, and Subramanian, "SchemaSQL--A Language for Interoperability in Relational Multi-Database Systems", appeared in the proceedings of VLDB Conference in Bombay, 1996.

[41] [MA, 07] Markus Doring, Andreas Muller, "Analysis of existing software and methods to build a duplicate detection software for the GBIF cache database", Botanischer Garten und Botanisches Museum Berlin-Dahlem, Freie Universitat Berlin, 5th January 2007

[42] [ME, 96] A. E. Monge and C. P. Elkan, "The field matching problem: Algorithms and applications", SIGMOD workshop on research issues on knowledge discovery and data mining, pages 267-270, 1996.

[43] [Mik, 03] Mikhail Bilenko, "Learnable Similarity Functions and Their Applications to Record Linkage and Clustering", 2003

[44] [MMT+, 03] Monica Scannapieco, Massimo Mecella, Tiziana Catarci, Cinzia Cappiello, Barbara Pernici, Francesca Mazzoleni, "DL3: Comparative Analysis of the Proposed Methodologies for Measuring and Improving Data Quality and Description of an Integrated Proposal", 10 Gennaio, 2003.

[45] [MNF, 03] H. Muller, F. Naumann, and J. Freytag, "Data Quality in Genome Databases", International Conference on Information Quality, pages 269-284, 2003.

[46] [MS, 98] Mauricio Hernandez, Salvatore Stolfo, "Real World Data is Dirty: Data Cleansing and The Merge/Purge Problem", Journal of Data Mining and Knowledge Discovery, 1(2), 1998.

[47] [MTH+, 99] Mong-Li Lee, Tok Wang Ling, Hongjun Lu, and Yee Teng Ko, "Cleansing Data for Mining and Warehousing", In Proceedings of the International Conference on Database and Expert Systems Applications (DEXA), volume 1677 of LNCS, pages 751-760, Florence, Italy, 1999.

[48] [MVA, 02] Mohamed G. Elfeky, Vassilios S. Verykios, and Ahmed K. Elmagarmid, "TAILOR: A Record Linkage Toolbox", Proceedings of the ICDE'02, IEEE, 2002.

[49] [OJA, 05] Oded Maimon, Jonathan I. Maletic and Andrian Marcus, "Data Cleansing, Data Mining and Knowledge Discovery Handbook", Springer US, Pages 21-36, 2005.

[50] [Pet, 06] Peter Christen, "A Comparison of Personal Name Matching: Techniques and Practical Issues", icdmw, Sixth IEEE International Conference on Data Mining--Workshops (ICDMW'06), pages 290-294, 2006.

[51] [PFP, 06] Paulo Oliveira and Fatima Rodrigues and Pedro Rangel Henriques, "An Ontology-Based Approach For Data Cleaning", 11th International Conference on Information Quality, EUA, Nov 2006.

[52] [PMB, 99] Paul Jermyn, Maurice Dixon and Brian J Read, "Preparing Clean Views of Data for Data Mining", 12th ERCIM Workshop on Database Research, Amsterdam, 2-3 November 1999

[53] [Poe, 96] V. Poe, "Building a Data Warehouse for Decision Support", Prentice Hall PTR, 1996.

[54] [Pyl, 99] D Pyle, "Data Preparation for Data Mining", Morgan Kaufmann Inc., ISBN 1-55860-529-0, 1999.

[55] [PZS+, 00] Panos Vassiliadis, Zografoula Vagena, Spiros Skiadopoulos, Nikos Karayannidis, Timos Sellis, "ARKTOS: A Tool for Data Cleaning and Transformation in Data Warehouse Environments", Bulletin of the IEEE Computer Society Technical Committee on Data Engineering, vol. 28, no. 4, pages 42-47, December 2000.

[56] [REH, 00] Rahm, Erhard; Do, Hong-Hai, "Data Cleaning: Problems and Current Approaches", IEEE Bulletin. of the Technical Committee on Data Engineering, Vol 23 No. 4, December 2000

[57] [RH, 01] V. Raman and J. M. Hellerstein, "Potter's wheel: an interactive data cleaning system", VLDB, pages 381-390, 2001.

[58] [Ric, 96] Richard J. Orli, "Data Quality Methods", Based on a Public Document prepared for the US Gov't., 1996.

[59] [RPT, 03] Rohan Baxter, Peter Christen and Tim Churches, "A Comparison of Fast Blocking Methods for Record Linkage", Workshop on Data Cleaning, Record Linkage and Object Consolidation, KDD, pages 24-27, August 2003.

[60] [SB, 02] S. Sarawagi and A. Bhamidipaty, "Interactive deduplication using active learning", ACM SIGKDD, pages 269-278, 2002.

[61] [SRC, 99] System Resources Cooperation et al., "Record Linkage Software", pages 1-61, Nov. 18, 1999.

[62] [SS, 01] K. Sattler and E. Schallehn, "A Data Preparation Framework based on a Multidatabase Language", in Proc. of Int. Database Engineering and Applications Symposium (IDEAS 2001), M. Adiba, C. Collet, and B. P. Desai, eds., IEEE Computer Society, (Grenoble, France), pages 219-228, 2001.

[63] [SS, 06] S. Sumathi and S.N. Sivanandam, "Data Warehousing, Data Mining, and OLAP", Studies in Computational Intelligence, Volume 29/2006, pages 21-73, 2006.

[64] [Ten, 03] C. M. Teng, "Applying Noise Handling Techniques to Genomic Data: A Case Study", IEEE ICDM, pages 743-746, 2003.

[65] [UMR, 02] Un Yong Nahm, Mikhail Bilenko and Raymond J. Mooney, "Two Approaches to Handling Noisy Variation in Text Mining", Proceedings of the ICML-2002 Workshop on Text Learning, Sydney, Australia, pages 18-27, July 2002.

[66] [VCE+, 05] J. Van den Broeck, S. A. Cunningham, R. Eeckels, and K. Herbst, "Data Cleaning: Detecting, Diagnosing, and Editing Data Abnormalities", PLoS Med. pages 261-265, 2005.

[67] [VE, 99] Verykios, V.S. and Elmagarmid, A.K., "Automating the Approximate Record Matching Process", Computer Sciences Dept., Purdue University, West Lafayette, IN, Jun. 15, 1999.

[68] [VEE+, 00] Verykios, V., Elmagarmid, A., Elfeky, M., Cochinwala, M., Dalal, "On the Accuracy and Completeness of the Record Matching Process", Information Quality Conference, pages 54-69, 2000.

[69] [VJ, 00] Vijayshankar Raman and Joseph M. Hellerstein, "An Interactive Framework for Data Cleaning", Computer Science Division (EECS), University of California, Report No. UCB/CSD-0-1110, September 2000.

[70] [VVS+, 00] P. Vassiliadis, Z. Vagena, S. Skiadopoulos, N. Karayannidis and T. Sellis, "ARKTOS: A tool for data cleaning and transformation in data warehouse Environments", IEEE Data Engineering Bulletin, pages 42-47, 2000.

[71] [WC, 03] Wayne Eckerson and Colin White, "Evaluating ETL and Data Integration Platforms", TDWI Report Series, 2003

[72] [Whe, 04] M. Wheatley, "Operation Clean Data", CIO Magazine, Jul. 1, 2004.

[73] [Wil, 93] William E. Winkler, "Improved Decision Rules in the Fellegi-Sunter Model of Record Linkage", Proceedings of the Section on Survey Research Methods, American Statistical Association, pages 274-279, 1993.

[74] [Wil, 98] William W. Cohen, "Integration of heterogeneous databases without common domains using queries based on textual similarity", Proceedings of the 1998 ACM SIGMOD International Conference on Management of Data (SIGMOD-98), pages 201-212, 1998.

[75] [Win, 99] Winkler, W, "The State of Record Linkage and Current Research Problems", Proceedings of the Survey Methods Section, pages 73-79, 1999.

[76] [WKM, 93] Wang, R., Kon, H. & Madnick, S., "Data quality requirements analysis and modelling", IEEE ICDE, pages 670-677, 1993.

[77] [WSF, 95] R. Wang, H. Kon and S. Madnick, "A framework for analysis of data quality research", IEEE TKDE, pages 623-640, 1995.

[78] [XGS+, 06] Xiong, H.; Gaurav Pandey; Steinbach, M.; Vipin Kumar, "Enhancing data analysis with noise removal", Transactions on Knowledge and Data Engineering, Volume 18, Issue 3, Page 304-319, March 2006.

[79] [ZE, 02] Zoubida Kedad, Elisabeth Metais, "Ontology-Based Data Cleaning", Natural Language Processing and Information Systems: 6th International Conference on Applications of Natural Language to Information Systems, NLDB 2002, Stockholm, Sweden, pages 137-149, June 27-28, 2002.

(1) J. Jebamalar Tamilselvi and (2) V. Saravanan

(1) Assistant Professor (SG), Department of Computer Application, Karunya University, Coimbatore--641 114, Tamilnadu, India

E-mail: jjebamalar@gmail.com

(2) Director, Department of Computer Application, Dr. N.G.P. Institute of Technology, Coimbatore--641 048, Tamilnadu, India

E-mail: tvsaran@hotmail.com
Table 1: Data Cleaning Approaches.

ETL--Developed by Rahm and Do, 2000, and Vassiliadis et al., 2001.
ETL's primary tools are data flow graphs, which track the
transformation of dirty data into cleaned data.

SchemaSQL--Proposed by V.S. Lakshmanan, Fereidoon Sadri and N.
Subramanian in 1996. SchemaSQL is used to integrate views from
heterogeneous sources; it extends traditional SQL syntax and
semantics.

IntelliClean--Proposed by Mong Li Lee, Tok Wang Ling and Wai Lup
Low, 2000-2001. IntelliClean is a knowledge-based, rule-driven
approach to data cleansing with the main focus on duplicate
elimination; a minimal sketch of such a condition-action rule
follows this table.

AJAX--Proposed by Galhardas, Florescu, Shasha & Simon, 2000-2001.
AJAX provides a declarative and extensible language for specifying
data cleaning transformations, which consist of SQL statements
enriched with a set of specific primitives.

Potter's Wheel--Proposed by Raman & Hellerstein, 2001. Potter's
Wheel is an interactive data cleansing system that integrates data
transformation and discrepancy detection in a single interactive
framework.

ARKTOS--Vassiliadis, Simitsis and Skiadopoulos, 2002-2003, proposed
a novel conceptual model; Simitsis, Vassiliadis, Skiadopoulos and
Sellis, 2003-2004, presented a novel logical model. ARKTOS is a
framework capable of modelling and executing the
Extraction-Transformation-Load (ETL) process for data warehouse
creation.

Febrl--Freely Extensible Biomedical Record Linkage--Developed by
ANU Data Mining Group. Febrl provides data standardization and
probabilistic record linkage with choices of methods for blocking
and comparison functions.
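The rule-based duplicate elimination ascribed to IntelliClean above
can be pictured as a small condition-action rule. The Python sketch
below is only a generic illustration under assumed record fields and
an assumed similarity threshold; it is not code from IntelliClean,
and difflib's SequenceMatcher merely stands in for whatever
similarity measure a real system would use.

```python
from difflib import SequenceMatcher

# Hypothetical customer records; field names are illustrative only.
records = [
    {"id": 1, "name": "John A. Smith", "phone": "555-0101"},
    {"id": 2, "name": "Jon Smith",     "phone": "555-0101"},
    {"id": 3, "name": "Mary Jones",    "phone": "555-0202"},
]

def similar(a, b, threshold=0.8):
    """Approximate string similarity in [0, 1] using difflib."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio() >= threshold

def duplicate_rule(r1, r2):
    """Condition part of a condition-action rule: two records are
    candidate duplicates if their phone numbers match exactly and
    their names are approximately equal."""
    return r1["phone"] == r2["phone"] and similar(r1["name"], r2["name"])

# Action part: collect (and later merge or purge) the flagged pairs.
candidate_duplicates = [
    (r1["id"], r2["id"])
    for i, r1 in enumerate(records)
    for r2 in records[i + 1:]
    if duplicate_rule(r1, r2)
]

print(candidate_duplicates)   # [(1, 2)] for the sample records above
```

In a rule-based cleaner the action part would typically merge the
matched records or route them to a human reviewer rather than simply
reporting the pair.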

Table 2: Data Cleaning Approaches with Type of Anomalies.

AJAX--Constraint violation: filter violating tuples; Duplicates:
match, cluster and merge; Database design: mapping; Invalid type:
formatting; Natural error: object identity problem; Domain format
error: user defined; Missing values: user defined.

FraQL--Duplicates: union, join and reconciliation; Database design:
schema transformations; Invalid type: statistical method; Natural
error: missing values, invalid tuples; Domain format error: user
defined; Missing values: statistical values.

Potter's Wheel--Constraint violation: domain format constraints;
Database design: transforms; Natural error: syntax errors and
irregularities; Domain format error: pattern learning.

ARKTOS--Constraint violation: three types of constraints; Database
design: PRIMARY KEY VIOLATION, UNIQUENESS VIOLATION; Domain format
error: FORMAT MISMATCH; Missing values: NULL EXISTENCE.

IntelliClean--Constraint violation: alert and update rule;
Duplicates: merge/purge rule; Database design: integrity constraint,
update rule; Domain format error: user defined; Missing values:
update rule.

ETL--Constraint violation: no standard format; Database design:
migrating; Natural error: transformation; Domain format error: load;
Missing values: transformation.

SchemaSQL--Constraint violation: semantic, syntactic; Database
design: uniform manipulation of data and metadata; Natural error:
semantic heterogeneity.

WHIRL--Database design: instance heterogeneity; Invalid type:
statistical measures.

Febrl--Database design: hidden Markov models for data
standardisation; Invalid type: data tagging; Natural error:
rule-based data cleaning; Domain format error: supplied lookup and
frequency tables.
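To make the anomaly categories in Table 2 concrete, the sketch below
checks a small record set for three of the listed anomaly types:
null existence (missing values), uniqueness violation and domain
format mismatch. It is a generic Python illustration rather than the
checking logic of ARKTOS or any other tool in the table; the field
names, the expected date pattern and the sample records are
assumptions made only for this example.

```python
import re
from collections import Counter

# Hypothetical records; "id" should be unique, "dob" should match YYYY-MM-DD.
records = [
    {"id": "A1", "name": "John Smith", "dob": "1970-02-11"},
    {"id": "A1", "name": "J. Smith",   "dob": "11/02/1970"},  # duplicate id, bad date format
    {"id": "A2", "name": None,         "dob": "1985-07-30"},  # missing name
]

DATE_PATTERN = re.compile(r"^\d{4}-\d{2}-\d{2}$")

def find_anomalies(rows):
    anomalies = []

    # NULL EXISTENCE: any attribute that is None or empty.
    for i, row in enumerate(rows):
        for field, value in row.items():
            if value is None or value == "":
                anomalies.append(("null existence", i, field))

    # UNIQUENESS VIOLATION: key values that occur more than once.
    id_counts = Counter(row["id"] for row in rows)
    for i, row in enumerate(rows):
        if id_counts[row["id"]] > 1:
            anomalies.append(("uniqueness violation", i, "id"))

    # FORMAT MISMATCH: date values that do not follow the expected pattern.
    for i, row in enumerate(rows):
        dob = row.get("dob")
        if dob and not DATE_PATTERN.match(dob):
            anomalies.append(("format mismatch", i, "dob"))

    return anomalies

for kind, row_index, field in find_anomalies(records):
    print(f"row {row_index}: {kind} on '{field}'")
```

In practice such checks are expressed as declarative constraints or
transformation rules inside the cleaning tool rather than as ad hoc
scripts, but the detected anomaly classes are the same.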

Table 3: Comparison of Similarity Functions.

Jaccard coefficient--1901--Jaccard
similarity can be used for computing
likeness as the proportion of tokens
shared by both strings.

Soundex encoding--1918--The
purpose of the Soundex code is to
cluster together names that have
similar sounds.

Levenshtein/edit distance--1965--It is
defined as the minimum number of
insertions, deletions or substitutions
necessary to transform one string into
the other (simple sketches of this and
related measures follow Table 3).

Fellegi Sunter--1969--It provides a
formal model for record linkage that
involves optimal decision rules to find
matches and non-matches.

TF-IDF cosine similarity--1983--TF-IDF
cosine similarity is computationally
efficient due to the high sparsity of
most vectors, and provides a reasonable
measure of similarity for long strings.

Jaro--1989--Jaro introduced a string
comparison function that accounts for
insertions, deletions, and
transpositions.

Winkler--1999--The Jaro-Winkler
distance is a measure of similarity
between two strings; it is designed
for and best suited to short strings.
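Several of the similarity functions compared in Table 3 are simple
enough to sketch directly. The Python fragment below gives
deliberately simplified, illustrative implementations of the Jaccard
coefficient over word tokens, the Levenshtein edit distance and a
basic Soundex encoding; the h/w handling of full American Soundex is
omitted, and a production record linkage system would normally rely
on an optimised library rather than code like this.

```python
def jaccard(s1, s2):
    """Jaccard coefficient: proportion of word tokens shared by both strings."""
    t1, t2 = set(s1.lower().split()), set(s2.lower().split())
    if not t1 and not t2:
        return 1.0
    return len(t1 & t2) / len(t1 | t2)

def levenshtein(s1, s2):
    """Minimum number of insertions, deletions or substitutions
    needed to transform s1 into s2 (dynamic programming)."""
    prev = list(range(len(s2) + 1))
    for i, c1 in enumerate(s1, start=1):
        curr = [i]
        for j, c2 in enumerate(s2, start=1):
            cost = 0 if c1 == c2 else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]

# Soundex digit codes for consonant groups with similar sounds.
SOUNDEX_CODES = {c: d for d, letters in
                 {"1": "BFPV", "2": "CGJKQSXZ", "3": "DT",
                  "4": "L", "5": "MN", "6": "R"}.items()
                 for c in letters}

def soundex(name):
    """Simplified Soundex: keep the first letter, encode the rest as
    digits, drop repeats and uncoded letters, pad/truncate to four
    characters."""
    name = name.upper()
    code = name[0]
    prev_digit = SOUNDEX_CODES.get(name[0], "")
    for c in name[1:]:
        digit = SOUNDEX_CODES.get(c, "")
        if digit and digit != prev_digit:
            code += digit
        prev_digit = digit
    return (code + "000")[:4]

print(jaccard("john smith", "smith john jr"))   # 0.666...
print(levenshtein("kitten", "sitting"))         # 3
print(soundex("Robert"), soundex("Rupert"))     # R163 R163
```

Token-based measures such as the Jaccard coefficient and TF-IDF
cosine similarity tolerate word reordering in long strings, while
edit-distance measures such as Levenshtein and Jaro-Winkler are
better suited to short strings containing typographical errors,
which is why record linkage systems commonly combine both families.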