Data Transfer Service (DTS)

<< Web Warehousing: Issues, Time-contiguous Log Entries, Transient Cookies, SSL, session ID Ping-pong, Persistent Cookies

Lab Data Set: Multi -Campus University >>

Lab Lect-2

Lab Data Set

In previous lecture I gave you an overview of the tool to be used for the lab i.e. Data

Transformation Services (DTS), MS SQL Server. Now keeping in view the real issue of data

acquisition, we will provide you with a simulated data set, so as to make you ready to start

exploring the tool. The data is for a multi-campus university having campuses in four major

cities. We discussed the details of such a university in Lect-6 of the course i.e. normalization.

Each of the campus has its own conventions and norms regarding storing Student information.

Multi -Campus University

University has four campuses situated at:

§ Lahore

§ Karachi

§ Islamabad

§ Peshawar

University Head Office in Islamabad

Data warehouse is a single source of truth. We have to put all data from different data sources

(campuses) at one place in some standard form. The task is not trivial. Different sources of data

have a lot of inherent issues of ETL. High level steps given in the slide give just an overview of

the task. First of all, we have to identify the source systems. It is quite possible that each campus

uses different database systems or same organization at different geographical locations uses

different database management systems. To put data into a single source, after extracting from

such diverse sources, requires powerful tools especially designed to fulfill the requirements of

ETL. We will use Microsoft SQL Server DTS which is a user friendly graphical tool and makes

such a complex task doable by some practice. After identification of source systems, it is

necessary to study the issues that must be considered before putting all the data together at a

single location. Microsoft SQL Server provides a powerful support to perform Extract, Transform

and Load (ETL) data from source systems to destination system. Finally certain steps are

performed to check and improve quality of data.

In this lab lecture we will look into the data for each of the campuses in detail. This would lead us

to identify the core issues that are needed to be taken care of before extracting data from these

diverse sources into a single destination.

Degree Programs

At each campus university has two degree programs:

§ BS

§ MS

University started its BS degree program in year 1994 and MS degree program in year

2001

Our Example University offers undergraduate and graduate degrees in all of its campuses. The

undergraduate degrees were started in year 1994 and graduate degrees were started in year 2001.

375

Disciplines for BS

Four disciplines at BS level

§ Computer Science (CS)

§ Computer Engineering (CE)

§ System Engineering (SE)

§ Telecommunication (TC)

All campuses offer these four disciplines

The slide is self explanatory.

Disciplines for MS

Four disciplines at MS level

§ Computer Science (MS-CS)

§ Software Project Mgmt. (MS-SPM)

§ Networking (MS-NW)

§ Telecommunication (MS-TC)

Lahore & Karachi campuses offer all the four disciplines

Islamabad offers MS-CS & MS-SPM

Peshawar offers MS-CS & MS-TC

The slide is self explanatory.

The need

Four campuses of the University maintain their students record locally

No standardized way of record management

Standardized reporting is difficult and time consuming.

No centralized repository of data

Head Office wants a central data repository for decision support i.e. a DWH

We will study the record management at each campus

In this lecture, we will collect data from each campus and figure out the is sues

As mentioned earlier, our example university has multiple campuses and each campus

independently maintains its student records without any meaningful level of coordination. There

is no any standardized record management system or agreement among t hese campuses. Each of

376

the campuses uses its own student record management practices independent of the other

campuses. The head office of the university now wants to consolidate the student records from all

of the four campuses into a central repository for decision support. Thus they are planning for a

DWH.

Students Record Keeping & Mgmt.

One by one we discuss the record management system specific to each campus of the

University

Lahore

Karachi

Islamabad

Peshawar

In real life when we need to work with heterogeneous systems from multiple sources then the

problems like poor design becomes prominent and significant. In this student record management

system none of the database is properly designed and in some cases, there is no database at all.

The databases are not normalized. Each of the campus maintains two "tables" to store student

information. I have used double quotes as the word table is not used in its literal meaning,

especially in the case of a single flat text file.

Student Table:

In each database Student table is used to maintain personal records of the students. This table has

only one entry for each student in each campus. A student may have entries in student tables of

two campuses in the issues like transfer cases.

Registration Table:

Second table is registration table that maintains the record for course registration. This table

contains as many records for each student as many times he/she registered any course.

Each campus keeps two tables does not mean that each campus has two files only (one for each

table). Each campus maintains its information independent of each other. Lahore campus

maintains two text files for each batch i.e. entry taken in a year. For each batch one file contains

student information and other file contains registration information. For eleven batches of BS

Lahore campus has 22 text files. For four batches of MS Lahore campus contains eight text files.

Same is the mechanism used in Peshawar campus to store the data in text files. Islamabad campus

has MS Access d atabase with three tables. Two of these three tables contain student information.

One table for MS and the other for BS students. The third table contains Registration data for

both degree programs i.e. MS and BS. Karachi campus manages to store all this information in

MS Excel sheets. Three Excel Books are maintained. Two out of three contains registration

records (one for BS and the other for MS) and the third one contains student records for both

degree programs.

Let us discuss "student record management systems" at each of the campuses.

377

Data from Lahore Campus

Data at Lahore campus is stored in Text files

To store data regarding one complete batch 2 text files are used:

Lhr_Student_batch (Student record)

Lhr_Detail_batch (Course Reg. record)

22 text files for 11 BS batches

8 text files for 4 MS batches

The slide is self explanatory. Here batch is the year the student entry was taken i.e. 94, 95,.... 104

i.e. year 2004.

Data from Lahore Campus: Sample

Flat file student data at Lahore campus

The slide shows the screenshot of a sample text file for student records at Lahore campus. We can

see that the first row contains the header and the columns are delimited by comma. Let's discuss

header of both student and registration tables in deta il.

Lahore: Header of Student Table

SID: Student ID

§ A numerical value, starting from 0

§ Starts from 0 individually for both degrees BS & MS

§ It is unique within a degree (BS/MS) but not unique across the degrees

§ Combination of SID and degree is always unique within a campus

St_Name: Student name

Father_Name: Father name

378

The slide is self explanatory.

Lahore: Header of Student Table

Gender:

§ 0 for Male

§ 1 for Female

Address: Permanent Address

[Date of Birth]:

§ 14-Apr-1980

[Reg Date]: Date on which student was enrolled

This is the convention used for storing some critical data at the Lahore campus. There is no

guarantee that the same convention will be used at other campuses too, actually in some cases the

converse may be true. We will identify and work on these apparent anomalies in the data

profiling phase before we do the actual transformation.

Lahore: Header of Student Table

[Reg Status]:

§ `A' if student was enrolled as new Admission

§ `T' if student was enrolled as Transfer case

§ [Degree Status]:

§ `C' (complete) if student has graduated

§ `I' for incomplete degree

§ [Last Degree]:

§ F.Sc. / A level for BS

§ M.Sc. / BS / BE for MS

The slide is self explanatory.

Lahore: Header of Course Reg. Table

SID:

Degree:

BS/MS

Semester:

e.g. Fall04

Course:

Course code

Marks:

Out of 100

Discipline:

CS/TC/SE/CE

The slide shows the header and sample values for Course registration table at Lahore Campus.

Lahore: Facts About Dat a

Total students = 5,200

Total male students= 3,466

Total BS students= 4,400

Number of graduated students= 3,200

Number of post graduated std.= 600

379

The slide shows some of the facts about Lahore campus. These facts can be used for data

validation in later steps. However, this has to be taken with a "pinch of salt" because the facts

before resolving the data quality issues will most likely be different as compared to the ones after

the data has been cleansed.

Data from Karachi Campus

Data at Karachi campus is stored in MS-Excel books

Three books are maintained

§ STUDENT_KHR (Student record)

§ Reg_BS_KHR (BS course Reg. record)

§ Reg_MS_KHR (MS course Reg. record)

STUDENT_KHR keeps two sheets

§ `BS' for BS students records

§ `MS' for MS students records

The slide is self explanatory.

Data from Karachi Campus: Sample

The slide shows MS Excel screenshot of the sample data for Karachi campus. Let's discuss its

header in detail for both student and registration tables.

Karachi: Header of Student Table

St_ID: Student identity

Name: Student name

Father: Father name

DoB: Date of Birth

M/F: Gender (M/F)

DoReg: Date of Registration/Enrollment

RStatus: Status of enrollment (A/T)

DStatus: Status of Degree (C/I)

Address: Permanent address

Qualification: Last degree achieved

380

The slide is self explanatory.

Karachi: Header of Course Reg. Table

SID:

Courses: Course code

Score: Out of 100

Sem: e.g. Fall04

Disp: CS/TC/SE/CE

The slide is self explanatory.

Karachi: Facts About Data

Total students = 6,000

Total male students= 4,500

Total BS students= 4,000

Number of graduated students= 3,500

Number of post graduated std.= 1,500

The slide shows some of the facts about Karachi campus. These facts can be used for data

validation in later steps. Again we have to look at the facts keeping in mind that the same may

change after data has been cleansed.

Data from Islamabad Campus

M S-Access is used at Islamabad campus

Database has three tables

§ Isb_BS_Student (MS Student record)

§ Isb_MS_Student (BS Student record)

§ Registration (All reg. record BS + MS)

Roll number is also used as primary key in student table

The slide is self explanatory.

Data from Islamabad Campus: Sample

381

The slide shows MS Access screenshot of the sample data for Islamabad campus. Let's discuss its

header in detail for both student and registration tables.

Islamabad: Header of Student Table

Roll Num: Student identity

Name: Student name

Father: Father name

Reg Date: Date of Enrollment

Reg Status: Status of Enrollment (A/T)

Degree Status: Status of Degree (C/I)

Date of Birth: Date of Birth

Education: Last degree achieved

Gender: Gender (Male=1, Female =0)

Address: Permanent address

The slide is self explanatory.

Islamaba d: Header of Course Reg. Table

Roll Num:

Course: Course code

Marks: Out of 100

Discipline: CS/TC/SE/CE

Session: e.g. Fall04

Here we can see that Degree (BS/MS) is missing, whereas same table contains records for both.

Only way to differentiate is through discipline attribute.

Islamabad: Facts About Data

Total students = 4,400

Total male students= 3,700

Total BS students= 3,200

Number of graduated students= 2,500

Number of post graduated std.= 900

The slide shows some of the facts about Islamabad campus. These facts can be used for data

validation in later steps.

382

Data from Peshawar Campus

Data at Peshawar campus is stored in Text files

To store data regarding one complete batch 2 text files are used

§ Lhr_Student_batch (Student record)

§ Lhr_Detail_batch (Course Reg. record)

22 text files for 11 BS batches

8 text files for 4 MS batches

The slide is self explanatory.

Data from Peshawar Campus: Sample

The slide shows the screenshot of a sample test file for student records at Peshawar campus. We

can see that the first row contains the header and the columns are delimited by comma. Let's

discuss header of both student and registration tables in detail.

Peshawar: Header of Student Table

Reg#: Student identity

Name: Student name

Father: Father name

Address: Permanent address

Date of Birth: Date of Birth

lastDeg: Last degree achieved

Reg Date: Date of Enrollment

Reg Status: Status of Enrollment (A/T)

Degree Status: Status of Degree (C/I)

The slide is self explanatory.

Peshawar: Header of Course Reg. Table

Reg#:

Courses: Course code

Score: Out of 100

Program: CS/TC/SE/CE

Sem: Fall/Spring

Year: YYYY e.g. 1999

383

Here we need to identify semester session (fall04) through combination of Sem and Year

Lab Exercise

Collect demographics for Peshawar campus

Figure out problems in data at Peshawar campus

Suggest suitable solutions to the problems identified above

Here is a small exercise. You are required to find the facts for the Peshawar campus. What

problems are there in the data? And what, in your opinion, could be possible solutions for those

problems.

Now by looking at each of the campus data individually, we found following problems that need

to be considered and solved properly before extracting the data and u ltimately loading it into the

central repository.

Problem-1: Non-Standard data sources

Each campus uses data sources independent of other campuses

The major problem is the inconsistent data sources at different campuses. The slide summarizes

the data sources at four campuses. We can see that Lahore and Peshawar campuses are using text

files while Islamabad and Karachi campuses are using MS Access and MS Excel respectively.

384

Problem-2: Non-standard attributes

The second problem is non standardized attributes across campuses. While looking at the header

of data from different campuses we came to know the following problems regarding attributes

and is summarized in the table in the slide.

Each of the campuses uses different attribute name for the identification or primary keys e.g.

Lahore uses SID while Peshawar usesReg# and so on.

Different conventions for representing Gender across the campuses e.g. Lahore campus uses 0/1

while Islamabad uses 1/0 for representing male and female respectively.

Similarly, there are different conventions for representing degree attribute across different

campuses.

Problem-3: No Normalized database

None of the campuses uses well designed normalized database

Each campus uses two "tables":

§ One table to store students' personal data

§ Second table to store course registration data of each student

Each campus uses multiple files to store these two tables

Actually Lahore, Karachi and Peshawar campus does not have databases at all, so there is no

concept of normalization. These campuses maintain the data in sample shown as follows:

Lhr_detail_94 : Is a text file that contains the following details:

SID,Degree,Semester,Course,Marks,Discipline

Lhr_student_94: Is a text file that contains the following details:

SID,St_Name,Father_Name,Gender,Address,Date of Birth,Last Degree _

385

Note pad: Issues (1)

Use of text files in record management systems is least suitable

We cannot run any query on text file

We cannot validate any input to text file

Comma is used as a field separator, any erroneous placement of comma can spoil the

whole record

There is no technical way of locating any particular record

Having discussed the three major problems, lets now look at what are the issues regarding the

record management tools at individual campuses. This slide and the following four list the issues

related to Notepad.

Note pad: Issues (2)

If I want to locate the record of `Mohammad Ali Nawaz' and I do not know his roll

number, what would I do?

At Lahore campus, academic officer used to do it by "Find" option of text file

Is it a proper way? Does it work always?

§ What about `Mohamed Ali Nawaz'?

People at different campuses, including the Lahore campus have developed ways and means to

answer some questions. But these so called "techniques" have their own inherent limitations. For

example, if I want to find the information about a student named `Mohammad Ali Nawaz' I can

use the find command from the notepad, but what if there is a slight change i the spelling? Of

course the technique is not going to work.

Note pad: Issues (3)

If I want to count total students who belong to Multan, can I do it in note pad?

§ No

To achieve this purpose, admin at Lahore used to open the file in Microsoft Word. Then

use "Replace with" functionality of Microsoft Word to count total occurrences of Multan.

In `Replace With' dialog box if I enter `Multan' is replaced with `Multan' & use `Replace All'

option. I can get the total occurrences of Multan. Interesting

Some simple questions that can be answered if there was a database can not be answered, such as

the number of students from any particular city. There can be number of short -term self-

developed ad-hoc mechanisms, but they are not guaranteed to succeed and have their own

inherent limitations.

386

Note pad: Issues (4)

Some improper ways can work for very limited cases

We can't collect demographics in note pad

§ Total number of male students

§ Students with a particular age

§ Students with a particular educational ba ckground

§ Students with a particular CGPA

§ Etc.

Some very simple statistics can not be collected in the absence of a database as we have a big text

file. Some of the examples are number of male students or students of particular age. We can get

answer to these problems by parsing the files, but text parsing is not only very slow, but is also

very complicated. All these complications and inefficiencies can be reduced, and even removed if

he had a database in place.

MS -Excel: Issues (1)

Karachi campus uses MS-Excel sheets to maintain students record

M S-Excel is again not basically developed for this purpose

However, it works somewhat better than note pad, as it can answer to more questions but

once again in an improper way

Both methods adopted for notepad are available here also but it can work more than that

MS Excel is better than having a big text file. For example Excel supports some simple tests and

other commands that can help more efficiently answer the questions that could not be answered

usin g a plain text file. But, still Excel is not the right way to store and keep the data for a host of

reasons, that we discussed in Lect-6 of the theory part.

MS -Excel : Issues (2)

Now, I can count total number of male or female students?

I can sort all columns on basis of gender and get all males and females clustered

I can get student-wise particular scores

I can get answers to many questions through conditional queries supported by MS-Excel

The slides gives a way of finding answer to some questio ns, but remember that we are dealing

with large data sets, and for such large sets comparison sort which at best is O(n log n) really

hurts.

387

MS -Excel : Issues (3)

Maintenance of records in MS-Excel is better with respect to the data quality concernin g

issues

M S-Excel recognizes the correct data type of columns

It somewhat validates the input, i.e. illegal input is filtered

Some more benefits of Excel. At least there is a column type i.e. not all values are textual in

nature, and this helps in the context of data validation.

MS -Access: Issues (1)

M S-Access is a proper RDBMS and can work well for small databases

At Islamabad campus, the problem is the poor design of database, not the tool

SQL of MS-Access is not very powerful, like that of SQL Server, but it works fine to

maintain records at campus level

Finally Islamabad campus at least is using the right tool i.e. Access databases, but it works for

small personal databases not years of data of a single campus and then pooling together the d ata

of multiple campuses. Thus the problem is not of the poor design (As there is no real design) but

of the wrong tool. The correct choice could have been to use MS SQL server which can handle

larger work loads more gracefully.

Problem Statement

We have disparate sources of data

We have to implement single source of truth i.e. DWH, so that decision makers can be

supported to get detailed or summarized university level view, irrespective of particular

campus

In the lab exercises and working we will experience interesting and complicated issues

need to be handled while moving towards single standardized source.

Thus, in view of the issues and challenges in our simulated scenario of a multi-campus university,

the problem ahead can be summarized as under.

There are disparate and diverse data sources and we have to implement a DWH i.e. single source

of truth that can support the decision making at the head office.

388

Table of Contents: