ETL Detail: Data Cleansing: data scrubbing, Dirty Data, Lexical Errors, Irregularities, Integrity Constraint Violation, Duplication

<< Issues of ETL: legacy data, Web scrapping, data quality, ETL vs ELT

Data Duplication Elimination and BSN Method: Record linkage, Merge, purge, Entity reconciliation, List washing and data cleansing >>

Lecture-19

ETL Detail: Data Cleansing

Background

Other names: Called as data scrubbing or cleaning.

More than data arranging.

Big problem, big effect: Enormous problem, as most data is dirty. GIGO

Dirty is relative.

Paradox: Must involve domain expert, as detailed domain knowledge is required, so it

becomes semi-automatic, but has to be automatic because of large data sets.

Data duplication.

It is crucial to the process of loading the data warehouse that you not just rearrange the grouping

of source data and deliver it to a destination database, but that you also ensure the quality of that

data. Data cleansing is vitally important to the overall health of your warehouse project and

ultimately affects the health of your company. Do not take this statement lightly.

The true scope of a data cleansing project is enormous. Much of production data is dirty, and you

don't even want to consider what work cleaning it up would take. By "dirty," I mean that it does

not conform to proper domain definitions or "make sense." The age -old adage "garbage in,

garbage out" still applies, and you can do nothing about it short of analyzing and correcting the

corporate data. What is noise in one domain may be information in another.

Usually the process of data cleansing can not be performed without the involvement of a domain

expert because the detection and correction of anomalies requires detailed domain knowledge.

Data cleansing is therefore described as semi-automatic but it should be as automatic as possible

because of the large amount of data that usually is be processed and the time required for an

expert to cleanse it manually.

The original aim of data cleansing was to eliminate duplicates in a data collection, a problem

occurring already in single database applications and gets worse when integrating data from

different sources. Will have a complete lecture on it.

Lighter Side of Dirty Data

Year of birth 1995 current year 2005

Born in 1986 hired in 1985

Who would take it seriously? Computers while summarizing, aggregating, populating etc.

Small discrepancies become irrelevant for large averages, but what about sums, medians,

maximum, minimum etc.?

146

Value validation is the process of ensuring that each value that is sent to the data warehouse is

accurate. You may had that experience in which you look at the contents of one of your major flat

files or database tables and intuitively pick that the data is incorrect. No way could that employee

be born in 2004! You know your company doesn't hire infants. You may also discover another

incorrect record. How could someone be born in 1978 but hired in 1977?

All too often, these types of data integrity problems are laughed at and then ignored. All too often

people say "No one would actually take that information seriously, would they?" Well, maybe

people won't, but MIS systems will. That information can be summarized, aggregated, and/or

manipulated in some way, and then populated into another data element. And when that data

element is moved into the DWH, analytical processing will be performed on it that can affect the

way your company does business. What if data from the DWH is being analyzed to revise hiring

practices? That data may make a wrong impact on the business decisions if enough of the hire

and birth dates are inaccurate.

Small data discrepanci es can become statistically irrelevant when large volumes of data are

averaged. But averaging is not the only analytical function that is used in analytical data

warehouse queries. What about sums, medians, max/min, and other aggregate and scalar

functions? Even further, can you actually prove that the scope of your data problems is as small

as you think it is? The answer is probably "no."

Serious Problems due to dirty data

Decisions taken at government level using wrong data resulting in undesirable results.

In direct mail marketing sending letters to wrong addresses loss of money and bad

reputation.

Administration: The government analyses data collected by population census to decide which

regions of the country require further investments in health, education, clean drinking water,

electricity etc. because of current and expected future trends. If the rate of birth in one region has

increased over the last couple of years, the existing health facilities and doctors employed might

not be sufficient to handle the number of current and expected patients. Thus, additional

dispensaries or employment of doctors will be needed. Inaccuracies in analyzed data can lead to

false conclusions and misdirected release of funds with catastrophic results for a poor country like

Pakistan.

Supporting business processes : Erroneous data leads to unnecessary costs and probably bad

reputation when used to support business processes. Consider a company using a list of consumer

addresses and buying habits and preferences to advertise a new product by direct mailing. Invalid

addresses cause the letters to be returned as undeliverable. People being duplicated in the mailing

list account for multiple letters sent to the same person, leading to unnecessary expenses and

frustration . Inaccurate information about consumer buying habits and preferences contaminate

and falsify the target group, resulting in advertisement of products that do not correspond to

consumer's needs. Companies trading such data face the possibility of an additional loss of

reputation in case of erroneous data.

147

3 Classes of Anomalies...

Syntactically Dirty Data

§ Lexical Errors

§ Irregularities

Semantically Dirty Data

§ Integrity Constraint Violation

§ Business rule contradiction

§ Duplication

Coverage Anomalies

§ Miss ing Attributes

§ Missing Records

Syntactically Dirty Data

Lexical errors: For example, assume the data to be stored in table form with each row

representing a tuple and each column an attribute. If we expect the table to have five columns

because each tuple has five attributes but some or all of the rows contain only four columns then

the actual structure of the data does not conform to the specified format.

Irregularities are concerned with the non-uniform use of values, units and abbreviations.

This can happen for example if we use different currencies to specify an employee's salary. This

is especially profound if the currency is not explicitly listed with each value, and is assumed to be

uniform. Annual salary or 20,000 means $20,000 or Rs. 20,000. This results in values being

correct representations of facts if we have the necessary knowledge about their representation

needed to interpret them.

Semantically dirty data

Integrity constraint violations describe tuples (or sets of tuples) that do not satisfy one or more

of the integrity constraints. Integrity constraints are used to describe our understanding of the

mini-world by restricting the set of valid instances. Each constraint is a rule representing

knowledge about the domain and the values allowed for representing certain facts.

Contradictions are values within one tuple or between tuples that violate some kind of

dependency between the values. An example for the first case could be a contradiction between

the attribute AGE and DATE_OF_BIRTH for a tuple representing persons. Contradictions are

either violations of functional dependencies that can be represented as integrity constraints or

duplicates with inexact values. They are therefore not regarded as separate data anomaly.

Duplicates are two or more tuples representing the same entity from the mini -world. The values

of these tuples do not need to be completely identical. Inexact duplicates are specific cases of

contradiction between two or more tuples. They represent the same entity but wit h different

values for all or some of its properties. This hardens the detection of duplicates and there

mergence.

148

3 Classes of Anomalies...

Coverage or lack of it

§ Missing Attribute

Result of omissions while collecting the data.

A constraint violatio n if we have null values for attributes where NOT NULL constraint

exists.

Case more complicated where no such constraint exists.

Have to decide whether the value exists in the real world and has to be deduced here or

not.

Coverage Problems

Missing values

Result of omissions while collecting the data. A constraint violation if we have null values for

attributes where NOT NULL constraint exists. Case more complicated where no such constraint

exists. Have to decide whether the value exists in the real world and has to be deduced here or

not.

Missing tuples

Result from omissions of complete entities existent in the mini -world that are not represented by

tuples in the data collection. Anomalies could further be classified according to the amount of

data accounting for the constraint violation. This can be single values, values within a tuple,

values within one or more columns of the data collection or tuples and sets of tuples from

different relations.

Why Missing Rows?

Equipment malfunction (bar code reader, keyboard etc.)

Inconsistent with other recorded data and thus deleted.

Data not entered due to misunderstanding/illegibility.

Data not considered important at the time of entry (e.g. Y2K).

There can be a number of reasons for missing rows or missing data. For example there may be

fault with the hardware for recording or inputting the data, this could be a bar code reader which

is faulty and missing part of the UPC (Universal Product Code) or on a low tech level, there could

be problem with a keyboard (numeric part) that results in entry of different keys as same keys and

only one is kept and all others are removed, which infact corresponded to unique keys. In many

cases I can not even read what I have written, I am sure I am not alone. If I can't read my own

handwriting sometime, than in many instances people will be unable to read my handwriting.

This means either the data will not be entered or it will be entered incorrectly, even using

automatic means using OCR (Optical Character Reader) soft wares. Another example is the

famous (or infamous) Y2K problem, when the first two most significant digits of the year where

not recorded, and in many cases it was almost impossible to differentiate between 1901 and 2001!

149

Handling missing data

Dropping records.

"Manually" filling missing values.

Using a global constant as filler.

Using the attribute mean (or median) as filler.

Using the most probable value as filler.

There can be a number of ways of handling missing data, the most convenient being to just drop

the records with missing values for certain attributes. The problem with this approach is what do

we gain by dropping records with missing values, and how to decide which records to drop i.e.

with one missing value or which one or how many missi g values? Obviously this is a complex

problem, and writing code would not be an easy task, so one easy way out could be to manually

fill missing values or use a global constant as a filler. For example if gender is not known for the

customers, then filling the gender by (say) NA. For numeric attributes, filling using a global value

may make some sense, as it could be the mean or median of the column value. Or this could also

be the most probable value to be used as a filler.

Key Based Classification of Problems

Primary key problems

Non -Primary key problems

The problems associated with the data can correspond to either the primary keys or non primary

keys, as the nature of the problem and the corresponding solution varies with the type of the key

invo lved. In the subsequent slides we will discuss each of them one by one.

Primary key problems

1. Same PK but different data.

2. Same entity with different keys.

3. PK in one system but not in other.

4. Same PK but in different formats.

Some of the problems associated with PK are;

1. Records may have the same primary key but might have different data. This can occur if

primary keys are reused or when two organizations or units merge.

2. The same entity might occur with different primary keys. This can easily arise when different

segments of an entity design databases independently of one another

3. Data may have a primary key in one system but not in another. The classic example of this

kind of error occurs when an entity is represented in more than one source database. It is quite

possible that the entity is central to one of the databases and therefore has a primary key field or

150

fields while the same entity is so peripheral to the purposes of the other database that it is not

dignified by being given a primary key.

4. Primary Keys may be intended to be the same but might occur in different formats. Probably,

the most widespread instance of this error occurs when NID are used as primary keys: are they

character data or numeric; if character data, do they contain dashes?

Non primary key problems...

1. Different encoding in different sources.

Multiple ways to represent the same information.

Sources might contain invalid data.

Two fields with different data but same name.

1. Data may be encoded differently in different sources. The domain of a "gender" field in some

database may be {`F', `M'} or as {"Female", "Male"} or even as {1, 0}.

2. There are often multiple ways to represent the same piece of information. "FAST", "National

University", "FAST NU" and "Nat. Un iv. of Computers " can all can be found in the literature as

representing the same institution.

3. Sources might contain invalid data. A point of sale terminal may require that the sales clerk

enters a customer's telephone number. If the customer does not wish to give it, clerks may enter

999-999-9999.

4. Two fields may contain different data but have the same name. There are a couple of ways in

which this can happen. "Total Sales" probably means fiscal year sales to one part of an enterprise

and calendar year sales to another. The second instance can be much more dangerous. If an

application is used by multiple divisions, it is likely that a field that is necessary for one business

unit is irrelevant to another and may be left blank by the second unit or, worse, used for otherwise

undocumented purposes.

Non primary key problems

Required fields left blank.

Data erroneous or incomplete.

Data contains null values.

5. Required fields may be left blank. Clerical errors account for most of these, but mobil e phones

did not come into use until early 90's, so contact numbers recorded before then will not have

them.

6. Data may be erroneous or inconsistent. The telephone number may be for Lahore but the city

listed as Karachi.

7. The data may contain null values. Null values can occur for a wide variety of reasons, the most

common of these are:

151

a. Data that is genuinely missing or unknown,

b. An attribute does not apply to an entity,

c. Data that is pending, or

d. Data that is only partially known.

Automatic Data Cleansing...

1) Statistical

2) Pattern Based

3) Clustering

4) Association Rules

Some of the data cleansing techniques are listed. Let's discuss each of them in detail.

Statistical : Identifying outlier fields and records using the values of mean, standard deviation,

range, etc., based on Chebyshev's theorem, considering the confidence intervals for each field.

Outlier values for particular fields are identified based on automatically computed statistics. For

each field the average and the standard deviation are utilized and based on Chebyshev's theorem

those records that have values in a given field outside a number of standard deviations from the

mean are identified. The number of standard deviations to be considered is customizable.

Confidence intervals are taken into consideration for each field.

Pattern-based: Identify outlier fields and records that do not conform to existing patterns in the

data. Combined techniques (partitioning, classification, and clustering) are used to identify

patterns that apply to most records. A pattern is defined by a group of records that have similar

characteristics ("behavior") for p% of the fields in the data set, where p is a user-defined value

(usually above 90).

Clustering : Identify outlier records using clustering based on Euclidian (or other) distance.

Existing clustering algorithms provide little support for identifying outliers. However, in some

cases clustering the entire record space can reveal outliers that are not identified at the field level

inspection. The main drawback of this method is computational time. The clustering algorithms

have high computational complexity. For large record spaces and large number of records, the run

time of the clustering algorithms is prohibitive.

Association rules: Association rul es with high confidence and support define a different kind of

pattern. As before, records that do not follow these rules are considered outliers. The power of

association rules is that they can deal with data of different types. However, Boolean association

rules do not provide enough quantitative and qualitative information.

152

Table of Contents: