Data Duplication Elimination & BSN Method
The duplicate elimination task is typically performed after most other transformation and
cleaning steps have taken place, especially after having cleaned single -source errors and
conflicting representations. It is performed either on two cleaned sources at a time or on a single
already integrated data set. Duplicate elimination requires to first identify (i.e. match) simi lar
records concerning the same real world entity. In the second step, similar records are merged into
one record containing all relevant attributes without redundancy. Furthermore, redundant records
Why data duplicated?
A data warehouse is created from heterogeneous sources, with heterogeneous databases (different
schema/representation) of the same entity.
The data coming from outside the organization owning the DWH can have even lower quality
data i.e. different representation for same entity, transcription or typographical errors.
A data warehouse is created by merging large databases acquired from different sources with
heterogeneous representations of information. This raises the issue of data quality, the foremost
being identification and elimination of duplicates, crucial for accurate statistical analyses. Other
than using own historical/transactional data, it is not uncommon for large businesses to acquire
scores of databases each month, with a total size of hundreds of millions t o over a billion records
that need to be added to the warehouse. The fundamental problem is that the data supplied by
various sources typically include identifiers or string data that are either different among different
datasets or simply erroneous due to a variety of reasons including typographical or transcription
errors or purposeful fraudulent activity, such as aliases in the case of names.
Problems due to data duplication
Data duplication, can result in costly errors, such as:
§ False frequency distributions.
Incorrect aggregates due to double counting.
Difficulty with catching fabricated identities by credit card companies.
Without accurate identification of duplicated information frequency distributions and various
other aggregates will produce false or misleading statistics leading to perhaps untrustworthy new
knowledge and bad decisions. Thus this has become an increasingly important and complex
problem for many organizations that are in the process of establishing a data warehouse or
updating the one already in existence.
Credit card companies routinely assess the financial risk of potential new customers who may
purposely hide their true identities and thus their history or manufacture new ones.
The sources of corruption of data are many. To name a few, errors due to data entry mistakes,
faulty sensor readings or more malicious activities provide scores of erroneous datasets that
propagate errors in each successive generation of data.
Data Duplication: Non-Unique PK
Duplicate Identification Numbers
Multiple Customer Numbers
M. Ismail Siddiqi
M. Ismail Siddiqi
M. Ismail Siddiqi
Multiple Employee Numbers
Unable to determine customer relationships (CRM)
Unable to analyze employee benefits trends
What if the customers are divided amon g sales people to make telephone calls? Maybe the three
records of the same person go to three telesales people or maybe to the same one, but arranged as
per Cust.No. The point is that no telesales person would know that a single customer is going to
get multiple calls from the company, resulting in wastage of time and resources of the company,
and at the same time annoying the customer.
In the other case, the employee is same, who has been moving among different departments, and
possibly during the process got new employee numbers. He has also been getting bonuses
regularly, maybe because every time he appears to be an employee who has never got a bonus.
This resulting in loss to the company and an inability of the company to do any meaningful
employee benefit analysis.
Data Duplication: House Holding
Group together all records that belong to the same household.
440, Munir Road, Lahore
No. 440, Munir Rd, Lhr
House # 440, Munir Road, Lahore
In a joint family system, many working people live at the same house, but have different bank
accounts and ATM cards. If the bank would like to introduce a new service and wants to get in
touch with its customers, it would be much better if a single letter is sent to a single household,
instead of sending multiple letters. The problem is difficult because multiple persons living at the
same house may have written the same address differently even worse, one person with multiple
accounts at the same bank may have written his/her name and address differently.
One month is a typical business cycle in certain direct marketing operations. This means that
sources of data need to be identified, acquired, conditioned and then correlated or merged within
a small portion of a month in order to prepare mailings and response analyses. With tens of
thousands of mails to be sent, reducing the numbers of duplicate mails sent to the same household
can result in a significant savings.
Data Duplication: Individualization
Identify multiple records in each household which represent the same individual
440, Munir Road, Lahore
440, Munir Road, Lahore
Address field is standardized, is this by coincidence? This could be your luck data, but don't
count on it i.e. this is very unlikely to be the default case.
Formal definition & Nomenclature
§ "Given two databases, identify the potentially matched records Efficiently and
Many names, such as:
§ Record linkage
§ Entity reconciliation
§ List washing and data cleansing.
Current market and tools heavily centered towards customer lists.
Within the data warehousing field, data cleansing is applied especially when several databases are
merged. Records referring to the same entity are represented in different formats in the different
data sets or are represented erroneously. Thus, duplicate records will appear in the merged
database. The issue is to identify and eliminate these duplicates. The problem is known as the
merge/purge problem. Instances of this problem appearing in literature are called record linkage,
semantic integration, instance identification, or o bject identity problem.
Data cleansing is much more than simply updating a record with good data. Serious data
cleansing involves decomposing and reassembling the data. One can break down the cleansing
into six steps: elementizing, standardizing, verifyin g, matching, house holding, and documenting.
Although data cleansing can take many forms, the current marketplace and the current technology
for data cleansing are heavily focused on customer lists.
Need & Tool Support
Logical solution to dirty data is to clean it in some way.
Doing it manually is very slow and prone to errors.
Tools are required to do it "cost" effectively to achieve reasonable quality.
Tools are there, some for specific fields, others for specific cleaning phase.
Since application specific, so work very well, but need support from other tools for
broad spectrum of cleaning problems.
For existing data sets, the logical solution to the dirty data problem is to attempt to cleanse the
data in some way. That is, explore the data set for possible problems and endeavor to correct the
errors. Of course, for any real world data set, doing this task "by hand" is completely out of the
question given the amount of man hours involved. Some organizations spend millions of dollars
per year to detect data errors. A manual process of data cleansing is also laborious, time
consuming, and itself prone to errors. The need for useful and powerful tools that automate or
greatly assist in the data cleansing process are necessary and may be the only practical and cost
effective way to achieve a reasonable quality level in an existing data set.
A large variety of tools is available in the market to support data transformation and data
cleaning tasks, in particular for data warehousing. Some tools concentrate on a specific domain,
such as cleaning name and address data, or a specific cleaning phase, such as data analysis or
duplicate elimination. Due to their restricted domain, specialized tools typically perform very
well but must be complemented by other tools to address the broad spectrum of transformation
and cleaning problems.
Overview of the Basic Concept
In its simplest form, there is an identifying attribute (or combination) per record for
Records can be from single source or multiple sources sharing same PK or other common
Sorting performed on identifying attributes and neighboring records checked.
What if no common attributes or dirty data?
The degree of similarity measured numerically, different attributes may contribute
In the simplest case, there is an identifying attribute or attribute combination per record that can
be used for matching records, e.g., if different sources share the same primary key or if there are
other common unique attributes. For example people in income tax department get customer
details from the phone company and the Electricity Company and want to identify customers who
have high electricity and high telephone bill but do not pay tax accordingly. Would be the
common field, NID, they have changed over time, how about addresses, too many ways to write
addresses so have to figure out common attributes.
Instance matching between different sources is then achieved by a standard equi-join on the
identifying attribute(s), if you are very, very lucky. In the case of a single data set, matches can be
determined by sorting on the identifying attribute and checking if neighboring records match. In
both cases, efficient implementations can be achieved even for large data sets . Unfortunately,
without common key attributes or in the presence of dirty data such straightforward approaches
are often too restrictive. For example, consider a rule that states person records are likely to
correspond if name and portions of the address match.
The degree of similarity between two records, often measured by a numerical value between 0
and 1, usually depends on application characteristics. For instance, different attributes in a
matching rule may contribute different weight to the overall degree of similarity.
Basic Sorted Neighborhood (BSN) Method
Concatenate data into one sequential list of N records
Steps 1: Create Keys
§ Compute a key for each record in the list by extracting relevant fields or portions
Effectiveness of the this method highly depends on a properly chosen key
Step 2: Sort Data
§ Sort the records in the data list using the key of step 1
Step 3: Merge
§ Move a fixed size window through the sequential list of records limiting the
comparisons for matching records to those records in the window
If the size of the window is w records then every new record entering the window
is compared with the previous w-1 records.
1. Create Keys: Compute a key for each record in the list by extracting relevant fields or
portions of fields. The choice of the key depends upon an error model that may be viewed as
knowledge intensive and domain specific for the effectiveness of the sorted neighborhood
method. This highly depends on a properly chosen key with the assumption that common but
erroneous data will have closely matching keys .
2. Sort Data: Sort the records in the data list using the key of step -1.
3. Merge: Move a fixed size window through the sequential list of records limiting the
comparisons for matching records to those records in the window. If the size of the window is
w records as shown in Figure 20.1, then every new record entering the window is compared
with the previous w - 1 records to find "matching" records. The first record in the window
slides out of the window.
Note that in the absence of a window, the comparison would cover all the sorted records i.e. a
pair-wise comparison, resulting in O (n2) comparisons.
BSN Method : Sliding Window
Figure-20.1: BSN Method: Sliding Window
Figure -20.1 shows the outcome of the sliding window i.e. the IDs sorted after completion of the
run. The outcome will also depend on the window size, as it may so happen that when the
window size is relatively small and the keys are dispersed then some (or most o f the keys) may
not land in their corresponding neighborhood, thus defeating the purpose of the BSN method.
One way of overcoming this problem is working with different window sizes in a multipass
Complexity Analysis of BSN Method
Time Complexi ty: O(n log n)
§ O (n) for Key Creation
§ O (n log n) for Sorting
§ O (w n) for matching, where w ≤ 2 ≤ n
§ Constants vary a lot
At least three passes required on the dataset.
For large sets disk I/O is detrimental.
Complexity or rule and window size detrimental.
When this procedure is executed serially as a main memory based process the create keys phase is
an O(n) operation the sorting phase is O(n log n) and the merging phase is O( wn) where n is the
number of records in the database. Thus, the total time complexity of this method is O (n log n) if
w < log n , O(wn) otherwise. However the constants in the equations differ greatly. It could be
relatively expensive to extract relevant key values from a record during the create key phase.
Sorting requires a few machine instructions to compare the keys. The merge phase requires the
application of a potentially large number of rules to compare two records and thus has the
potential for the largest constant factor.
Notice that w is a parameter of the window scanning procedure. The legitimate values of w may
range from 2 (whereby only two consecutive elements are compared) to n (whereby each element
is compared to all others). The latter case pertains to the full quadratic O (n2) time process at the
maximal potential accuracy. The former case (where w may be viewed as a small constant
relative to n) pertains to optimal time performance (only O (n) time) but at minimal accuracy.
Note however that for very large databases the dominant cost is likely disk I/O and hence the
number of passes over the data set. In this case at least three passes would be needed one pass for
conditioning the data and preparing keys at least a second pass likely more for a high speed sort,
and a final pass for window processing and application of the rule program for each record
entering the sliding window. Depending upon the complexity of the rule program and window
size w the last pass may indeed be the dominant cost.
BSN Method: Selection of Keys
Selection of Keys
§ Effectiveness highly dependent on the key selected to sort the records middle
name vs. last name,
A key is a sequence of a subset of attributes or sub -strings within the attributes
chosen from the record.
The keys are used for sorting the entire dataset with the intention that matched
candidates will appear close to each other.
MuhammedAhmad 440 Munir Road
MuhammadAhmad 440 Munir Road
MuhammedAhmed 440 Munir Road
MuhammadAhmar 440 Munawar Road 34535334AHM440MUN345
Table-20.1: BSN Method: Selection of Keys
The key consists of the concatenation of several ordered fields or attributes in the data. The first
three letters of a middle name are concatenated with the first three letters of the first name field
followed by the address number field and all of the consonants of the road name. This is followed
by the first three digits of the National ID field as shown in Table 20.1.
These choices are made since the key designer determined that family names are not very
discriminating (Khan, Malik andCheema). The first names are also not very discriminating and
are typically misspelled (Muhammed, Muhammad, Mohamed) due to mistakes in vocalized
sounds vowels but middle names are typically more uncommon and less prone to being
misunderstood and hence less likely to be recorded incorrectly.
The keys are now used for sorting the entire dataset with the intention that all equivalent or
matching data will appear close to each other in the final sorted list. Notice in table 20.1, how the
first and second records are exact duplicates while the third is likely to be the same person but
with a misspelled middle name i.e. Ahmed instead of Ahmad. We would expect that this
phonetically bas ed mistake will be caught by a reasonable equational theory. However the fourth
record although having the exact same key as the prior three records appears unlikely to be the
BSN Method: Problem with keys
Since data is dirty, so keys WILL also be dirty, and matching records will not come
Data becomes dirty due to data entry errors or use of abbreviations. Some real examples
are as follows:
Solution is to use external standard source files to validate the data and resolve any data
Since the data is dirty and the keys are extracted directly from the data then the keys for sorting
will also be dirty. Therefore the process of sorting the records to bring matching records togethe r
will not be as effective. A substantial number of matching records may not be detected in the
subsequent window scan.
Data can become dirty due to data entry errors. To speed -up data entry abbreviations are also
used (all taken from the telephone directory.), such as:
Tehcnology, Tech. Tehcno. Tchnlgy
The above is a typical case of a non-PK error of same entity with multiple representations. To
ensure the correctness of data in the database, solution is using external source files to validate
the data and resolve any data conflicts.
BSN Method: Problem with keys (e.g.)
If contents of fields are not properly ordered, similar records will NOT fall in the same window.
Example: Records 1 and 2 are similar but will occur far apart.
1 N. Jaffri, SyedNo. 420, Street 15, Chaklala 4, Rawalpindi M
2 S. Noman
420, Scheme 4, Rwp
3 Saiam Noor Flat 5, Afshan Colony, Saidpur Road, Lahore F
Solution is to TOKENize the fields i.e. break them further. Use the tokens in different fields for
sorting to fix the error.
Example: Either using the name or the address field records 1 and 2 will fall close.
Syed N Jaffri
420 15 4 Chaklala No Rawalpindi Street
420 4 Rwp Scheme
5 Afshan Colony Flat Lahore Road Saidpur
We observe that characters in a string can be grouped into meaningful pieces. We can often
identify important components or tokens within a Name or Address field by using a set of
delimiters such as space and punctua tions. Hence we can first tokenize these fields and then sort
the tokens within these fields. Records are then sorted on the select key fields; note that this
approach is an improvement over the standard BSN method.
BSN Method: Matching Candidates
Mergin g of records is a complex inferential process.
Example-1: Two persons with names spelled nearly but not identically, have the exact same
address. We infer they are same person i.e. Noma Abdullah and Noman Abdullah.
Example-2: Two persons have same National ID numbers but names and addresses are
completely different. We infer same person who changed his name and moved or the records
represent different persons and NID is incorrect for one of them.
Use of further information such as age, gender etc. can alter the decision.
Example-3: Noma-F and Noman-M we could perhaps infer that Noma and Noman are siblings
i.e. brothers and sisters. Noma-30 and Noman-5 i.e. mother and son.
The comparison of records during the merge phase to determine their equivalenc e is a complex
inferential process that considers and requires much more information in the compared records
than the keys used for sorting. For example suppose two person names are spelled nearly but not
identically, and have the exact same address. We might infer they are the same person. For
example Noma Abdullah and Noman Abdullah could be the same person if they have the same
On the other hand suppose two records have exactly the same National ID numbers but the names
and addresses are completely different. We could either assume the records represent the same
person who changed his name and moved or the records represent different persons and the NID
number field is incorrect for one of them. Without any further information we may perhaps
assume the latter. The more information there is in the records the better inferences can be made.
For example, if gender and age information is available in some field of the data (Noma-F,
Noman-M) we could perhaps infer that Noma and Noman are siblings.
BSN Method: Equational Theory
To specify the inferences we need equational Theory.
Logic is NOT based on string equivalence.
Logic based on domain equivalence.
Requires declarative rule language.
To specify the inferences as discussed above we need an equational theory in which the logic is
based not on the string equivalence but on the domain equivalence. A natural approach to
specifying an equational theory and making it practical too would be the use of a declarative rule
language. Rule languages have been effectively used in a wide range of applications requiring
inference over large data sets. Much research has been conducted to provide efficient means for
their compilation and evaluation and this technology can be exploited here for purposes of data
BSN Method: Rule Language
Rule Language Example
Given two records r1 and r2
IF the family_name of r1 equals the family_name of r2
AND the first names differ slightly
AND the address of r1 equals the address of r2
THEN r1 is equivalent to r2
This is rather self explanatory. The rule merely states that if for two records the last names and
the addresses are same, but the first names differ slightly, then both the records are same. This
makes sense, as using default values l st names can be easily fixed, and sorting the address and
then using default values can fix most part of the address too, as the addresses are repeating.
BSN Method: Mis -Matching Candidates
Transposition of names
How do you specify "Differ Slightly"?
§ Calculate on the basis of a distance function applied to the family_name fields of
Multiple options for distance functions
Notice that rules do not necessarily need to compare values from the same attribute or same
domain. For instance to detect a transposition in a persons name we could write a rule that
compares the first name of one record with the last name of the second record and the last name
of the first record with the first name of the second record.
BSN Method: Distance Metrics
§ Phonetic distance
§ Typewriter distance
§ Edit Distance
A widely used metric to define string similarity
§ Ed(s1,s2)= minimum # of operations (insertion, deletion, substitution) to change
s1 to s2
s1: Muhammad Ehasn
s2: Muha mmed Ahasn
ed(s1,s2) = 2
The implementation is based upon the computation of a distance function applied to the first
name fields of two records and the comparison of its results to a threshold to capture obvious
typographical errors that may occur in the data. The selection of a distance function and a proper
threshold is also a knowledge intensive activity that demands experimental evaluation. An
improperly chosen threshold will lead to either an increase in the number of falsely matched
records or to a decrease in the number of matching records that should be merged. A number of
alternative distance functions for typographical mistakes are possible, including distances based
upon (i) edit distance (ii) phonetic distance and (iii) typewriter distance.
Limitations of BSN Method
BSN Method Limitations
§ No single key is sufficient to catch all matching records.
Fields that appear first in the key have higher discriminating power than those
appearing after them.
If NID number is first attribute in the key 81684854432 and 18684854432 are
highly likely to fall in windows that are far apart.
Two Possible Modifications to BSN Method
§ Increase w, the size of window
§ Multiple passes over the data set
In general no single key will be sufficient to catch all matching records. The attributes or fields
that appear first in the key have higher discriminating power than those appearing after them.
Hence if the error in a record occurs in the particular field or portion of the field that is the most
important part of the key there may be little chance a record will end up close to a matching
record after sorting. For instance if an employee has two records in the database one with NID
number 81684854432 and another with NID number 18684854432 (the first two numbers were
transposed) and if the NID number is used as the principal field of the key then it is very unlikely
both records will fall under the same window i.e. the two records with transposed NID numbers
will be far apart in the sorted list and hence they may not be merged. As it was empirically
established that the number of matching records missed by one run of the sorted neighborhood
method can be large unless the neighborhood grows very large.
Increases the computational complexity.
Quality does not increase dramatically.
Not useful unless the window spans the entire database.
§ W spanning the entire set of rows is infeasible under strict time and cost
What would be the computational complexity?
Increasing w clearly this increases the computational complexity and as discussed in the next
section does not increase dramatically the number of similar records merged in the test cases we
ran unless of course the window spans the entire database which we have presumed is infeasible
under strict time and cost constraints.
Several independent runs of the BSN method each time with a different key and a
relatively small window.
Each independent run will produce a set of pairs of records which can be merged (takes
care of transposition errors)
Apply the transitive closure property to those pairs of records.
§ If records a and b are found to be similar and at the same time records b and c are
also found to be similar the transitive closure step ca n mark a and c to be similar.
The results will be a union of all pairs discovered by all independent runs with no
duplicates plus all those pairs that can be inferred by transitivity of equality.
The alternative strategy we implemented is to execute sev eral independent runs of the sorted
neighborhood method each time using a different key and a relatively small window. We call this
strategy the multipass approach. For instance in one run we use the address as the principal part
of the key while in another run we use the last name of the employee as the principal part of the
key. Each independent run will produce a set of pairs of records which can be merged. We then
apply the transitive closure to those pairs of records.
The results will be a union of a ll pairs discovered by all independent runs with no duplicates plus
all those pairs that can be inferred by transitivity of equality. The reason this approach works for
the test cases explored here has much to do with the nature of the errors in the data. Transposing
the first two digits of the NID number leads to nonmergeable records as we noted. However in
such records the variability or error appearing in another field of the records may indeed not be so
large. Therefore although the NID numbers in two records are grossly in error the name fields
may not be. Hence first sorting on the name fields as the primary key will bring these two records
closer together, lessening the negative effects of a gross error in the social security field.
Table of Contents: