Lecture-19
ETL Detail: Data Cleansing
Background
§ Other names: also called data scrubbing or data cleaning.
§ More than data arranging.
§ Big problem, big effect: an enormous problem, as most data is dirty. GIGO.
§ Dirty is relative.
§ Paradox: must involve a domain expert, as detailed domain knowledge is required, so it becomes semi-automatic; but it has to be automatic because of large data sets.
§ Data duplication.
It is crucial to the process of loading the data warehouse that you not just rearrange the grouping
of source data and deliver it to a destination database, but that you also ensure the quality of that
data. Data cleansing is vitally important to the overall health of your warehouse project and
ultimately affects the health of your company. Do not take this statement lightly.
The true scope of a data cleansing project is enormous. Much of production data is dirty, and you
don't even want to consider what work cleaning it up would take. By "dirty," I mean that it does
not conform to proper domain definitions or "make sense." The age-old adage "garbage in,
garbage out" still applies, and you can do nothing about it short of analyzing and correcting the
corporate data. What is noise in one domain may be information in another.
Usually the process of data cleansing cannot be performed without the involvement of a domain
expert, because the detection and correction of anomalies requires detailed domain knowledge.
Data cleansing is therefore described as semi-automatic, but it should be as automatic as possible
because of the large amount of data that usually has to be processed and the time required for an
expert to cleanse it manually.
The original aim of data cleansing was to eliminate duplicates in a data collection, a problem that
already occurs in single-database applications and gets worse when data from different sources is
integrated. A complete lecture will be devoted to it.
Lighter Side of Dirty Data
§ Year of birth 1995, current year 2005.
§ Born in 1986, hired in 1985.
§ Who would take it seriously? Computers, while summarizing, aggregating, populating etc.
§ Small discrepancies become irrelevant for large averages, but what about sums, medians, maximum, minimum etc.?
Value validation is the process of ensuring that each value that is sent to the data warehouse is
accurate. You may have had the experience of looking at the contents of one of your major flat
files or database tables and intuitively sensing that the data is incorrect. No way could that employee
be born in 2004! You know your company doesn't hire infants. You may also discover another
incorrect record. How could someone be born in 1978 but hired in 1977?
All too often, these types of data integrity problems are laughed at and then ignored. All too often
people say "No one would actually take that information seriously, would they?" Well, maybe
people won't, but MIS systems will. That information can be summarized, aggregated, and/or
manipulated in some way, and then populated into another data element. And when that data
element is moved into the DWH, analytical processing will be performed on it that can affect the
way your company does business. What if data from the DWH is being analyzed to revise hiring
practices? If enough of the hire and birth dates are inaccurate, that data can push business
decisions in the wrong direction.
Small data discrepancies can become statistically irrelevant when large volumes of data are
averaged. But averaging is not the only analytical function that is used in analytical data
warehouse queries. What about sums, medians, max/min, and other aggregate and scalar
functions? Even further, can you actually prove that the scope of your data problems is as small
as you think it is? The answer is probably "no."
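As a concrete illustration, the sketch below runs the kind of sanity checks described above over a handful of employee records. The record layout, field names and minimum-hiring-age rule are assumptions made for this example, not part of any particular system.

```python
from datetime import date

# Illustrative employee records; field names are assumptions for this sketch.
employees = [
    {"emp_id": 1, "date_of_birth": date(2004, 3, 1), "hire_date": date(2005, 6, 1)},
    {"emp_id": 2, "date_of_birth": date(1978, 5, 9), "hire_date": date(1977, 1, 15)},
    {"emp_id": 3, "date_of_birth": date(1970, 7, 2), "hire_date": date(1995, 4, 20)},
]

MIN_HIRING_AGE = 16  # assumed business rule

def validate(rec):
    """Return a list of value-validation failures for one employee record."""
    problems = []
    if rec["hire_date"] < rec["date_of_birth"]:
        problems.append("hired before being born")
    else:
        age_at_hire = (rec["hire_date"] - rec["date_of_birth"]).days / 365.25
        if age_at_hire < MIN_HIRING_AGE:
            problems.append(f"hired at age {age_at_hire:.1f}")
    return problems

for rec in employees:
    for problem in validate(rec):
        print(f"emp_id={rec['emp_id']}: {problem}")
```

Checks like these are cheap to run during ETL and catch exactly the "born in 2004, hired as an infant" records discussed above before they are summarized into the warehouse.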
Serious Problems due to dirty data
§ Decisions taken at the government level using wrong data, resulting in undesirable results.
§ In direct mail marketing, sending letters to wrong addresses causes loss of money and a bad reputation.
Administration: The government analyses data collected by the population census to decide which
regions of the country require further investments in health, education, clean drinking water,
electricity etc. because of current and expected future trends. If the rate of birth in one region has
increased over the last couple of years, the existing health facilities and doctors employed might
not be sufficient to handle the number of current and expected patients. Thus, additional
dispensaries or employment of doctors will be needed. Inaccuracies in analyzed data can lead to
false conclusions and misdirected release of funds with catastrophic results for a poor country like
Pakistan.
Supporting business processes : Erroneous data leads to unnecessary costs and probably bad
reputation when used to support business processes. Consider a company using a list of consumer
addresses and buying habits and preferences to advertise a new product by direct mailing. Invalid
addresses cause the letters to be returned as undeliverable. People being duplicated in the mailing
list account for multiple letters sent to the same person, leading to unnecessary expenses and
frustration. Inaccurate information about consumer buying habits and preferences contaminates
and falsifies the target group, resulting in advertisement of products that do not correspond to
consumers' needs. Companies trading such data face the possibility of an additional loss of
reputation in case of erroneous data.
3 Classes of Anomalies...
§ Syntactically Dirty Data
  §  Lexical Errors
  §  Irregularities
§ Semantically Dirty Data
  §  Integrity Constraint Violation
  §  Business rule contradiction
  §  Duplication
§ Coverage Anomalies
  §  Missing Attributes
  §  Missing Records
Syntactically Dirty Data
Lexical errors: For example, assume the data to be stored in table form with each row
representing a tuple and each column an attribute. If we expect the table to have five columns
because each tuple has five attributes but some or all of the rows contain only four columns then
the actual structure of the data does not conform to the specified format.
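A minimal sketch of catching this kind of lexical error during extraction is shown below; it simply counts columns per row before loading. The file name, delimiter and expected column count are assumptions for the example.

```python
import csv

EXPECTED_COLUMNS = 5  # the table is specified to have five attributes

# Flag rows whose structure does not conform to the specified format,
# instead of silently loading them. "employees.csv" is an assumed source file.
with open("employees.csv", newline="") as src:
    for line_no, row in enumerate(csv.reader(src), start=1):
        if len(row) != EXPECTED_COLUMNS:
            print(f"line {line_no}: expected {EXPECTED_COLUMNS} columns, got {len(row)}")
```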
Irregularities are concerned with the non-uniform use of values, units and abbreviations.
This can happen, for example, if different currencies are used to specify an employee's salary. The
problem is especially profound if the currency is not explicitly listed with each value and is
assumed to be uniform: an annual salary of 20,000 could mean $20,000 or Rs. 20,000. Values are
correct representations of facts only if we have the knowledge about their representation that is
needed to interpret them.
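A small sketch of removing such an irregularity, assuming the source at least records which currency each salary is in; the conversion rate and function names are illustrative, not real figures.

```python
# Illustrative rate only; a real ETL job would read rates from a reference table.
PKR_PER_USD = 60.0

def normalise_salary(amount, currency):
    """Convert a salary to a single unit (PKR here) so values become comparable."""
    if currency == "USD":
        return amount * PKR_PER_USD
    if currency == "PKR":
        return amount
    raise ValueError(f"unknown currency: {currency!r}")

print(normalise_salary(20000, "USD"))  # 1200000.0
print(normalise_salary(20000, "PKR"))  # 20000
```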
Semantically dirty data
Integrity constraint violations describe tuples (or sets of tuples) that do not satisfy one or more
of the integrity constraints. Integrity constraints are used to describe our understanding of the
mini-world by restricting the set of valid instances. Each constraint is a rule representing
knowledge about the domain and the values allowed for representing certain facts.
Contradictions are values within one tuple or between tuples that violate some kind of
dependency between the values. An example for the first case could be a contradiction between
the attribute AGE and DATE_OF_BIRTH for a tuple representing persons. Contradictions are
either violations of functional dependencies that can be represented as integrity constraints or
duplicates with inexact values. They are therefore not regarded as a separate data anomaly.
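As a sketch, the check below flags the AGE versus DATE_OF_BIRTH contradiction mentioned above; the reference date and field handling are assumptions for illustration.

```python
from datetime import date

def age_contradicts_dob(age, date_of_birth, as_of=date(2005, 1, 1)):
    """Return True if the stored AGE disagrees with the age implied by DATE_OF_BIRTH."""
    implied = as_of.year - date_of_birth.year - (
        (as_of.month, as_of.day) < (date_of_birth.month, date_of_birth.day)
    )
    return age != implied

print(age_contradicts_dob(30, date(1974, 6, 1)))  # False: consistent tuple
print(age_contradicts_dob(19, date(1974, 6, 1)))  # True: contradiction
```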
Duplicates are two or more tuples representing the same entity from the mini-world. The values
of these tuples need not be completely identical. Inexact duplicates are specific cases of
contradiction between two or more tuples: they represent the same entity but with different values
for some or all of its properties. This makes the detection and merging of duplicates harder.
3 Classes of Anomalies...
§ Coverage or lack of it
  §  Missing Attribute
    §  Result of omissions while collecting the data.
    §  A constraint violation if we have null values for attributes where a NOT NULL constraint exists.
    §  The case is more complicated where no such constraint exists.
    §  Have to decide whether the value exists in the real world and has to be deduced here, or not.
Coverage Problems
Missing values
Missing values result from omissions while collecting the data. They are a constraint violation if
we have null values for attributes where a NOT NULL constraint exists. The case is more
complicated where no such constraint exists: we have to decide whether the value exists in the
real world and has to be deduced here, or not.
Missing tuples
Missing tuples result from omissions of complete entities that exist in the mini-world but are not
represented by tuples in the data collection. Anomalies can further be classified according to the
amount of data accounting for the constraint violation: a single value, values within a tuple,
values within one or more columns of the data collection, or tuples and sets of tuples from
different relations.
Why Missing Rows?
§ Equipment malfunction (bar code reader, keyboard etc.).
§ Inconsistent with other recorded data and thus deleted.
§ Data not entered due to misunderstanding/illegibility.
§ Data not considered important at the time of entry (e.g. Y2K).
There can be a number of reasons for missing rows or missing data. For example, there may be a
fault with the hardware used for recording or inputting the data: a faulty bar code reader that
misses part of the UPC (Universal Product Code), or, at a lower-tech level, a problem with the
numeric part of a keyboard that causes different keys to be read as the same key, so that what
should have been unique key values collapse into one and all but one of the corresponding records
are dropped. In many cases I cannot even read what I have written, and I am sure I am not alone.
If I sometimes cannot read my own handwriting, then in many instances other people will be
unable to read it either. This means the data will either not be entered or will be entered
incorrectly, even when automated means such as OCR (Optical Character Recognition) software
are used. Another example is the famous (or infamous) Y2K problem, when the two most
significant digits of the year were not recorded, and in many cases it was almost impossible to
differentiate between 1901 and 2001!
Handling missing data
§ Dropping records.
§ "Manually" filling missing values.
§ Using a global constant as filler.
§ Using the attribute mean (or median) as filler.
§ Using the most probable value as filler.
There can be a number of ways of handling missing data, the most convenient being to simply
drop the records with missing values for certain attributes. The problem with this approach is
deciding what we gain by dropping such records and which records to drop, i.e. those with one
missing value, or which ones, or how many missing values? Obviously this is a complex problem,
and writing code for it would not be an easy task, so one easy way out is to manually fill the
missing values, or to use a global constant as a filler. For example, if gender is not known for
some customers, the gender can be filled in as (say) NA. For numeric attributes, filling with a
global value may make more sense, such as the mean or median of the column. The most probable
value can also be used as a filler.
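The sketch below shows these options side by side using pandas; the table and column names are made up for illustration, and which filler is appropriate depends on the attribute and on the analysis the warehouse will support.

```python
import numpy as np
import pandas as pd

# Illustrative customer table with missing values; column names are assumptions.
df = pd.DataFrame({
    "gender": ["F", None, "M", None],
    "age":    [34, np.nan, 51, 28],
    "city":   ["Lahore", "Karachi", None, "Lahore"],
})

dropped = df.dropna()                                    # 1. drop records with any missing value
df["gender"] = df["gender"].fillna("NA")                 # 2. global constant as filler
df["age"]    = df["age"].fillna(df["age"].median())      # 3. attribute median as filler
df["city"]   = df["city"].fillna(df["city"].mode()[0])   # 4. most probable value as filler
print(df)
```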
Key Based Classification of Problems
§ Primary key problems
§ Non-primary key problems
The problems associated with the data can correspond to either the primary keys or the non-primary
keys, as the nature of the problem and the corresponding solution varies with the type of key
involved. In the subsequent slides we will discuss each of them one by one.
Primary key problems
1. Same PK but different data.
2. Same entity with different keys.
3. PK in one system but not in other.
4. Same PK but in different formats.
Some of the problems associated with primary keys are:
1. Records may have the same primary key but might have different data. This can occur if
primary keys are reused or when two organizations or units merge.
2. The same entity might occur with different primary keys. This can easily arise when different
parts of an organization design databases independently of one another.
3. Data may have a primary key in one system but not in another. The classic example of this
kind of error occurs when an entity is represented in more than one source database. It is quite
possible that the entity is central to one of the databases and therefore has a primary key field or
fields while the same entity is so peripheral to the purposes of the other database that it is not
dignified by being given a primary key.
4. Primary keys may be intended to be the same but might occur in different formats. Probably
the most widespread instance of this error occurs when NIDs (national identity numbers) are used
as primary keys: are they character data or numeric? If character data, do they contain dashes?
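A minimal sketch of bringing such keys to one format before matching records; the digits-only convention and the 13-digit width are assumptions for the example, not a required standard.

```python
def normalise_nid(raw):
    """Reduce an NID key to a canonical form: digits only, fixed width.
    The 13-digit width is an illustrative assumption."""
    digits = "".join(ch for ch in str(raw) if ch.isdigit())
    return digits.zfill(13)

# The same key arriving as numeric data and as character data with dashes:
print(normalise_nid(3520212345671))       # '3520212345671'
print(normalise_nid("35202-1234567-1"))   # '3520212345671'
```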
Non primary key problems...
1. Different encoding in different sources.
2. Multiple ways to represent the same information.
3. Sources might contain invalid data.
4. Two fields with different data but same name.
1. Data may be encoded differently in different sources. The domain of a "gender" field in one
database may be {'F', 'M'}, in another {"Female", "Male"}, or even {1, 0} (a small normalization
sketch follows this list of explanations).
2. There are often multiple ways to represent the same piece of information. "FAST", "National
University", "FAST NU" and "Nat. Univ. of Computers" can all be found in the literature as
representing the same institution.
3. Sources might contain invalid data. A point of sale terminal may require that the sales clerk
enters a customer's telephone number. If the customer does not wish to give it, clerks may enter
999-999-9999.
4. Two fields may contain different data but have the same name. There are a couple of ways in
which this can happen. "Total Sales" probably means fiscal year sales to one part of an enterprise
and calendar year sales to another. The second instance can be much more dangerous. If an
application is used by multiple divisions, it is likely that a field that is necessary for one business
unit is irrelevant to another and may be left blank by the second unit or, worse, used for otherwise
undocumented purposes.
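The sketch below, referenced from item 1 above, normalizes two of these problems per record: it maps the different gender encodings onto a single domain and nulls out the placeholder telephone numbers from item 3. The mapping, the 1/0 assignment and the placeholder list are assumptions for illustration and would be built per source system in practice.

```python
# Per-source mapping of gender encodings onto one target domain.
# Which of 1/0 means Female/Male is an assumption here; confirm it per source.
GENDER_MAP = {"F": "Female", "M": "Male", "Female": "Female", "Male": "Male",
              "1": "Female", "0": "Male"}

# Values clerks are assumed to enter when the real data is unavailable.
PLACEHOLDER_PHONES = {"999-999-9999", "000-000-0000"}

def clean_customer(rec):
    """Normalise the gender encoding and null out placeholder phone numbers."""
    rec = dict(rec)  # do not mutate the source record
    rec["gender"] = GENDER_MAP.get(str(rec.get("gender")))
    if rec.get("phone") in PLACEHOLDER_PHONES:
        rec["phone"] = None
    return rec

print(clean_customer({"gender": "1", "phone": "999-999-9999"}))
# {'gender': 'Female', 'phone': None}
```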
Non primary key problems
§ Required fields left blank.
§ Data erroneous or incomplete.
§ Data contains null values.
5. Required fields may be left blank. Clerical errors account for most of these, but mobile phones
did not come into use until the early 1990s, so contact numbers recorded before then will not have
them.
6. Data may be erroneous or inconsistent. The telephone number may be for Lahore but the city
listed as Karachi.
7. The data may contain null values. Null values can occur for a wide variety of reasons, the most
common of these are:
a. Data that is genuinely missing or unknown,
b. An attribute does not apply to an entity,
c. Data that is pending, or
d. Data that is only partially known.
Automatic Data Cleansing...
1) Statistical
2) Pattern Based
3) Clustering
4) Association Rules
Some of the data cleansing techniques are listed. Let's discuss each of them in detail.
Statistical: Identifying outlier fields and records using the values of mean, standard deviation,
range, etc., based on Chebyshev's theorem, considering the confidence intervals for each field.
Outlier values for particular fields are identified based on automatically computed statistics. For
each field the average and the standard deviation are utilized and based on Chebyshev's theorem
those records that have values in a given field outside a number of standard deviations from the
mean are identified. The number of standard deviations to be considered is customizable.
Confidence intervals are taken into consideration for each field.
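A minimal sketch of this technique on a single numeric field; the data, the choice of k = 2 standard deviations, and the use of the population standard deviation are assumptions for illustration.

```python
import statistics

def chebyshev_outliers(values, k=2.0):
    """Flag values more than k standard deviations from the mean.
    By Chebyshev's theorem at most 1/k^2 of any distribution lies that far from
    the mean, so such values are suspicious whatever the distribution's shape."""
    mean = statistics.mean(values)
    stdev = statistics.pstdev(values)
    return [(i, v) for i, v in enumerate(values)
            if stdev > 0 and abs(v - mean) > k * stdev]

ages = [34, 29, 41, 37, 33, 38, 30, 412]  # 412 looks like a data-entry error
print(chebyshev_outliers(ages))           # [(7, 412)]
```

The number of standard deviations k stays customizable, exactly as described above; note that a single extreme value also inflates the standard deviation itself, which is why a very large k can fail to flag anything.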
Pattern-based: Identify outlier fields and records that do not conform to existing patterns in the
data. Combined techniques (partitioning, classification, and clustering) are used to identify
patterns that apply to most records. A pattern is defined by a group of records that have similar
characteristics ("behavior") for p% of the fields in the data set, where p is a user-defined value
(usually above 90).
Clustering: Identify outlier records using clustering based on Euclidean (or other) distance.
Existing clustering algorithms provide little support for identifying outliers. However, in some
cases clustering the entire record space can reveal outliers that are not identified at the field level
inspection. The main drawback of this method is computational time. The clustering algorithms
have high computational complexity. For large record spaces and a large number of records, the run
time of the clustering algorithms is prohibitive.
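The sketch below illustrates the idea with scikit-learn's KMeans (an assumption; any clustering routine that exposes distances would do): records unusually far from their nearest cluster centroid are reported as outlier candidates. The data and the threshold rule are made up for the example.

```python
import numpy as np
from sklearn.cluster import KMeans  # assumes scikit-learn is installed

# Illustrative 2-attribute record space (e.g. age, yearly purchases in thousands).
records = np.array([[25, 3], [27, 4], [26, 3],
                    [52, 9], [50, 8], [51, 9],
                    [40, 40]])  # the last record sits far from both natural clusters

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(records)
dist_to_nearest_centroid = kmeans.transform(records).min(axis=1)

# Flag records whose distance to their nearest centroid is unusually large.
threshold = dist_to_nearest_centroid.mean() + 2 * dist_to_nearest_centroid.std()
print(np.where(dist_to_nearest_centroid > threshold)[0])  # typically flags index 6
```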
Association rules: Association rules with high confidence and support define a different kind of
pattern. As before, records that do not follow these rules are considered outliers. The power of
association rules is that they can deal with data of different types. However, Boolean association
rules do not provide enough quantitative and qualitative information.
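A minimal sketch of the idea, restricted to one-to-one rules over categorical fields; the thresholds, data and field names are assumptions, and real association-rule miners such as Apriori handle far more general itemsets.

```python
from collections import Counter
from itertools import permutations

def rule_outliers(records, min_support=0.3, min_confidence=0.9):
    """Mine simple rules (field_a=x -> field_b=y) with high support and confidence,
    then flag records that match a rule's antecedent but violate its consequent."""
    n = len(records)
    item_counts, pair_counts = Counter(), Counter()
    for rec in records:
        items = set(rec.items())
        item_counts.update(items)
        pair_counts.update(permutations(items, 2))  # ordered (antecedent, consequent) pairs

    outliers = set()
    for (a, b), count in pair_counts.items():
        support, confidence = count / n, count / item_counts[a]
        if support >= min_support and confidence >= min_confidence:
            for i, rec in enumerate(records):
                if rec.get(a[0]) == a[1] and rec.get(b[0]) != b[1]:
                    outliers.add(i)
    return sorted(outliers)

customers = ([{"city": "Lahore", "country": "Pakistan"}] * 9
             + [{"city": "Lahore", "country": "India"}]       # breaks a strong rule
             + [{"city": "Karachi", "country": "Pakistan"}] * 3)
print(rule_outliers(customers))  # [9]
```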