ZeePedia

Course Project

<< DWH Life Cycle: Pitfalls, Mistakes, Tips
Contents of Project Reports >>
img
§
DWH designers must be capable of working across the organization.
§
DWH team requires a combination of many experiences and expertise.
Conclusions
By now you have realized that a building a data warehouse is not an easy task. DWH are very
expensive to build, with the average cost of a system valued at around US$ 2 million. Hence, the
right people, methodology and experience is critical. The dependence on technology is only a
small part in realizing the true business value buried within the mountain of data collected and
stored within organizations business systems and operational databases. Data warehouses touch
the organization at all levels, and the people that design and build the data warehouse must be
capable of working across the organization at all leve ls as well, thus communication skills of the
people are of utmost importance. Thus the key requirements are industry and product experience
of a diverse team, coupled with a business focus and proven methodology. This will make the
difference between just a functional system and true success story.
Lecture-36
Course Project
This will be a group project, consisting of not more than three students per group.
308
img
Two Parts
·
Part -I outline given in 1st lecture. (about coding)
·
Part -II b riefly mentioned in 33rd lecture. (about planning and req definition)
Objective: Give a head-start by identifying the work in advance.
Approach: Combine both parts.
The course project consists of two parts, and both parts have already been briefly touched upon in
the earlier lectures. The first part of the project was discussed in the very first lecture and we call
it Part -I, while the second part of the project was discussed in lecture no. 33 and we will call it
Part -II. In this lecture, we will discuss both parts in detail. The purpose of an early discussion was
to give you a head start on the projects so that you start thinking/working about them, such as
polishing your programming skills for Part -I and identifying the organization for Part -II. Our
approach will be to combine both parts of the projects to come up with a single semester project.
Part -I
1. Code the BSN Method for finding siblings and duplicates (Lect-20).
2. Use 4GL or a high level programming language.
3. Must have GUI for I/P and O/P
4. For input, use data of Lahore campus provided as part of lab work.
5. Submit project report_1.
Part -I of the project basically deals with the implementation of the BSN (Basic Sorted
Neighborhood) Method that we discussed in lecture no. 20. You already have the paper that was
discussed in the said lecture. As you know that the BSN method can be used for dedupication i.e.
removal of duplicates, this is one aspect that you will have to code and test. The other application
of BSN will be to identify the siblings, people who have the same father i.e. they are brothers or
sisters of each other. To implement this part of the project, you can either use a 4GL such as
PL/SQL or any other high level programming language such as C++, or Java. Your application
must have a GUI (Graphical User Interface), console based application will not be acceptable.
The user must have the facility of reading from a data source, which could be a text file or a
database or a spreadsheet. In case of a database or spread sheet, OLE DB connectivity must be
supported by your application.
There can be basically two options for the data input. Either you can use data collected as part of
Project-II (after approval of instructor) or use the text files supplied for the Lahore campus as part
of the SQL DTS (Data Transformation Services) Lab Lectures. If you use data other than the lab
data, it must have a dozen relevant columns and several thousand rows too.
309
img
Part-II: Implementation Life-Cycle
The second part of the semester project i.e. Part -II deals with the DWH implementation life cycle
that we have already discussed in great detail in lectures 32-35. In lectures 33 and 34 we
discussed the Ralph Kimball's approach, you are not required to do any development and
deployment work in project Pa rt-II. During the lectures, the DWH life cycle road -map was
divided into three parts, you only have to cover these parts i.e. (i) project planning (ii) user
requirement definition and (iii) three parallel tracks. You are NOT required to discuss or do DWH
deployment or do any analytics development.
Part-II(a): Identify Organization
1. Do a complete data warehouse implementation life cycle study (and design) as discussed
in lectures 32-35.
2. Identify a large company/organization that is a prime candidate for a DWH.
3. What is a large company/organization? Lot of customers and large number of
transactions.
4. Submit report_2 giving and explaining 4-reasons for selecting a company.
5. Get the company/organization selected approved by the instructor before proceeding
ahead.
As stated earlier, the lifecycle study for project Part -II ends after the three parallel tracks of the
Kimball's approach; this was also stated in lecture 33. You are not required to do any deployment
or the study of deployment. Part -II of the project consist of two parts i.e. Part -II(a) which consists
of identification of a company for the lifecycle study, and Part -II(b) actually doing the lifecycle
study. To do Part -II(a), you have to identify a large company or an organization, this could be
private or government owned, but must be a prime candidate for such a study.
A shop at the corner of your street with ship-owner being the salesman too is not a prime target
for a DWH. He may actually do not even need a database system. So what is a large company or
an organization? Well one which has large number of customers and large number of
transactions. So what is large is this circular logic or a trick question? Neither. Large customers
mean tens of thousands of customers and similarly tens and thou sands of transactions per week.
Note that what we call large on our country, may be small for the developed world. Once you
have identified such a company, submit a report, we call this report_2. The report should have
four reasons why you have selected a particular company or organization. What could be those
four reasons?, this is a good question, the four reasons are (i) the number of customers (ii) the
number of transactions (iii) typical early adopter (from next slide) and (iv) any other reason. Once
you have submitted the report, you can not just by default move on to Part -II(b) of the project.
The company selected must be approved by the instructor before you can proceed ahead.
Large and Typical Early Adopters
310
img
1.
Financial service/insurance.
2.
Telecommunications.
3.
Transportation.
4.
Government.
5.
Educational.
Here you would be asking the question, what is meant by an early adopter? Will discuss this at
the end of the lecture, but if you can't wait, please check fig -36.1. For a developing country like
ours, we are in the phase of talking about typical early adopters of DWH. However, in the
developed world; this stage is no longer there for many large companies, as DWH are now in the
mainstream. The types of organizations and business listed are typically thos e that have a large
number of customers and large number of transactions.
Example DWH Target Organizations
·
Financial service/insurance.
­  Union Bank
­  State Bank of Pakistan
·
Telecommunications.
­  UFone
­  PTCL
­  PAKNET
·
Transportation.
­  PIA
·
Government.
­  NADRA
For example, as per www.paksearch.com Muslim Commercial Bank has 900+ branches all over
Pakistan. With an average of 500 customers per branch, the total number of customers is in the
order of half a million. It would not be surprising if the weekly ATM transactions all over
Pakistan run into millions. Such banks are potential candidates for a data warehouse. Same is true
for telecommunication companies. As per recent government figures, there are 10+ million
mobile phone users in Pakistan, and as per www.fdi.com the number of mobile phone users of
Mobilink is 3.7 million. Again, it would not be surprising to have literally millions of mobile
phone calls made/received per day. So these businesses fall under the category which you should
be looking at to select and study as part of your semester project.
NADRA (National Database and Registration Authority) probably has the largest data warehouse
in Pakistan and is the repository of the 1 98 census which covered the entire population of
9
Pakistan, which at time stood at 130 millions. As part of the census (source: www.nadra.gov.pk)
64 million NDFs or National Data Forms were collected and scanned which are presently stored
in a 4.2 Tera byte DWH in NADRA.
311
img
Part-II(b): Life-Cycle Study...
Project Planning
§  By now you already know the company.
§  Prepare a questionnaire (at least 15 non-trivial questions).
§  Identify and contact a key person who will help you.
Submit report_3
User requirement definition
Set an appointment to meet business users.
§  Collect answers to questions.
§  Get a copy of sample input.
§  Get a copy of sample output.
§  Compile report of interview.
§  Identify business processes.
§  Identify problems.
§  Identify measures of success.
Submit report_4
Once report_2 has been approved, you are all set to move on to Part -II(b) of the project i.e. the
actual lifecycle case study. The first part of this study is project planning, that we discussed in
great detail in lecture no. 33. Do the necessary preparations, which include development of a
questionnaire of atleast 15 questions before you meet the key person in the company. The
questions MUST be non -trivial i.e. must not be overly simplistic i.e. asking for information
already available in the company brochure or their website. Before preparing the questionnaire,
you must have done the necessary background study about the company, so that you make
relevant and probing questions. You must also have identified and set an appointment with the
key person who is willing to help you. Note that any person will not suffice, but a person who is
in a position of influence and wants to help you. IF you can't find such a person, don't worry,
keep on reading, I will provide you guidance on how to resolve this problem. After you have
successfully gone through the project planning phase, you should submit report_3.
Now the next part is user requirement definition, we have discussed this too in great detail in
lecture no. 33. I would suggest that before you meet the user for requirement collection, call and
set an appointment, reach there in time and in formal attire. Also take along a micro cassette
reorder or use your mobile phone for recording the session. Do the recording with permission
from the business user. Along with recording the answers of the user (don't forget to take notes)
312
img
get a copy of the sample input used in the organization, and a copy of the typical output, that
could be in the form of a report etc. Specifically ask questions and understand the business
process, the problems and the measures of success. After you are done, go through the entire
process of debriefing etc, and submit report_4.
Part-II(b): Life-Cycle Study...
Do a technical track study and submit report_5.
Do a data track design and submit report_6.
Do an analytic track study and submit report_7.
About reports:
§  Each report A4 page size and NOT more than 5 pages (excluding diagrams and tables).
§  12-pt Times New Roman font
§  Single space and 1" top, bottom, left and right spacing
What will be covered in report 5 through 7 was covered in detail in lecture no. 34, read the notes
and the reference book for these lectures i.e. "The Data Warehouse Toolkit by Ralph Kimball".
The only difference in the project work w.r.t to the lecture is that you don't have to do the
analytics development, and don't have to go beyond the three parallel tracks.
What if you are not entertained?
·
There are prospects, then there are customers.
·
Typically 1 out of 10 prospects is a buyer or customer i.e. 10% success.
·
The more prospects you meet and they are not interested, the more close you actually get
to your customer.
·
Quitters never win, and winners never quit.
·
Thomas Edison made 2,000 attempts to make the filament for the light bulb that actually
worked.
·
When asked how did he managed 2,000 failures? He said they were not failures; he just
identified WHICH 2,000 filaments don't work.
·
DON'T GIVE UP
313
img
In the context of sales and marketing, a prospect is someone who can become a customer. A
customer is someone who will buy a product or service form you. So every sales person is
looking for a customer or wants to convert a prospect into a customer. However, on an average
10% of the prospects become a customer, it means if you try 10 prospects, in the worst case the
last one will be the customer. Look at positively, meaning the more failures you have, the more
close you get to your customer. This is why the saying goes that "quitters never win, and winners
never quit". Similarly look at the work of Thomas Edison, he kept trying and untimely become
immortal i.e. his name still lives and his invention still used.
What if you are never entertained?
It may so happen that no one ente rtains/helps you, maybe you did not tried hard, maybe you had
the wrong attitude, maybe you did not meet the right people , maybe you were not very
convincing , maybe the end user was apprehensive etc.
Upto 10% less credit if this approach is adopted
In such a case, search the web, read books/magazines and pick any one of the 5 types of
organizations discussed and collect reference/related material (not beyond year 2000).
Use the material collected to write reports 2 to 6.
Now it may so happen, that you tried, and tried and tried very hard, but still unable to get a key
person interested and willing to help you so that you could write to write report_2. There could be
several reasons for this, maybe you don't tried hard, maybe it was because of the other person etc.
In such a case, you will do a lifecycle case study using the Internet. In report_2 you will list all
the reasons for failure and how you tried, and then we will decide about how much credit to
deduct from the semester project. The amount of deduction could be upto 10%. But you still have
to write all the reports and use the material from the case studies tailored to the requirements of
the reports. For this purpose you will have to download a number of focused case studies, better if
about the same organization, and compile the results in the form of the reports as per the Ralph
Kimball's road map.
Contents of Project Reports
The project reports to include, but is not limited to, the following:
§
Narrative summary of results produced (report_1).
§
Listings of computer models/programs coded and utilized (report_1).
§
Reports displaying results (report_1).
§
System usage instructions (report_1).
§
Narrative description of business and tables of appropriate data
(report_4).
§
Descriptions of decisions to be supported by information produced by
system (report_4).
314
img
§
Structure charts, dataflow diagrams and/or other diagrams to document
the structure of the system (report 4-7).
§
Recommendations (reports 5-7)
This is self explanatory, and I have also explained in the lecture. Follow the guidelines to the
word, as your work will be graded based on these guidelines. Note that system usage instructions
have to be specific with screen shots of the application developed, so that your applications can
be executed using the instructions and graded. Don't forget to submit the entire source code,
along with the compiled code with all necessary libraries DLLs attached.
Format of Project Reports: Main
·
Report No.
·
Title of course, semester & submission date
·
Names and roll no. of project members.
·
Campus and name of city.
·
Table of contents.
·
1-page executive summary for each report.
·
Attach (scanned) hard/soft copies of all related material collected and referenced with
each report.
·
At the end of semester, combine ALL reports and submit as a single report.
Again this is self explanatory, and I have also explained in the lecture. Follow the guidelines to
the word, as your work will be graded based on these guidelines.
Format of Project Reports: Other
·
No spelling or grammar mistakes.
·
Make sections and number them (as per contents of report discussed).
·
Pages should be numbered (bottom center).
·
Add an index at the end of report.
·
File name: RPT_no_semester_campus_rollno_CS614.doc
·
e.g. RPT_1_F05_BestComputersWah_234,235_CS614.doc
·
Email copy of report.
·
Warning:
­  Do not copy-paste, I can and I will catch you.
MS Word has a facility to create an index. You begin by creating a file with a list of keywords to
be listed in the index, and then go to Insert then Reference then In dex and Tables, press
AutoMark and select the file with keywords.
315
img
The naming convention of the reports is important, so that your reports can be easily identified as
there are about half a dozen reports for each group. The following convention to be used in report
naming:
RPT: This will be repeated for each report and will not change.
no: is the report number i.e. 1 through 7
semester: is the semester, some possible semesters are F05 i.e. Fall 2005 or SP06 i.e. Spring
2006 or SPL05 i.e. special semester 2005 or SM06 i.e. summer 2006.
campus: The name of the VU campus where you are registered or taking the course along with
the name of the city or town. Write full name of the campus, do not use underscore i.e. _ or
dashes or space in the campus name.
rollno: Since it is a group project, so roll numbers of students in the group separated by a comma.
CS614: This will be at the end of every file name i.e. the course code.
Don't try to copy -paste or use someone else's work with your name, there are smart tools to catch
this, once you are caught, this can result into zero credit.
Why would companies entertain you?
·
You are students, and whom you meet were also once students.
·
You can do an assessment of the company for DWH potential at no cost.
·
Since you are only interested in your project, so your analysis will be neutral.
·
Your report can form a basis for a professional detailed assessment at a later stage.
·
If a DWH already exists, you can do an independent audit.
If you present your case well you are likely to be entertained i.e. welcome by the organization.
The first and the foremost reason to get help is, whom you are talking to was once a student too
as you are currently, so there is a common bond. Since you have studied well DWH (hopefully),
therefore, you can do a requirement assessment (in the form of lifecycle study) of the company at
no cost; the company has nothing to lose. Your only interest is completion of your project and
grade, so you are going to be very objective and very neutra l; hence the company has still nothing
to lose. After you have done the lifecycle study, the same can be used as a seed or input by a
professional organization for an in depth study, thus saving in money to the company for which
you have done the work. Again the company has nothing to lose. It may so happen that the
company you contact already has a data warehouse in place, in such a case doing the lifecycle
development study can be used as a internal audit of the DWH implementation. Hence in short, if
the company allows and helps you with the study, it will be a win -win scenario for both the
parties.
Why you may be entertained?
316
img
Figure-36.1: Adoption/Innovation curve
Fig-36.1 shows a typical adoption curve when a new item, product or service is introduced and
the ratio of people responding to it. As you can see the people who are the first ones to adopt or
embrace it are the innovators, and they are very few. This is followed by early adopters, which is
a sizeable figure, and this is the category of companies you are supposed to target as part of your
project. Note that in our country, Data Warehousing is not yet in the category of early majority,
so there will be more companies who are prime candidates for lifecycle study. You may also
come across people who may be genuinely interested in a DWH, but don't know enough about it,
and want to know. In such a case, be prepared to educate them or enlighten them. It would be best
if you look at a number of case studies or reports about their line of business before meeting
them.
Interestingly, the bell shaped curve (which is not a perfect bell) divides the prospects into two
equal parts i.e. 50% each. You should be looking at that 50% which can and will help you,
instead of those who are likely to help you, but at a later stage.
Finally this project has enough challenges to become your final year BSc project. In such a case
be prepared to do coding leading to system deployment and completion of all the remaining steps.
Remember, many large organizations/businesses may not need a data warehouse today, but they
will need one surely tomorrow. And when that happens, they will be looking for you to help them
achieve their objectives.
The project has more than enough potential to become a final year project, which will cover an
implementation and deployment also.
317
Table of Contents:
  1. Need of Data Warehousing
  2. Why a DWH, Warehousing
  3. The Basic Concept of Data Warehousing
  4. Classical SDLC and DWH SDLC, CLDS, Online Transaction Processing
  5. Types of Data Warehouses: Financial, Telecommunication, Insurance, Human Resource
  6. Normalization: Anomalies, 1NF, 2NF, INSERT, UPDATE, DELETE
  7. De-Normalization: Balance between Normalization and De-Normalization
  8. DeNormalization Techniques: Splitting Tables, Horizontal splitting, Vertical Splitting, Pre-Joining Tables, Adding Redundant Columns, Derived Attributes
  9. Issues of De-Normalization: Storage, Performance, Maintenance, Ease-of-use
  10. Online Analytical Processing OLAP: DWH and OLAP, OLTP
  11. OLAP Implementations: MOLAP, ROLAP, HOLAP, DOLAP
  12. ROLAP: Relational Database, ROLAP cube, Issues
  13. Dimensional Modeling DM: ER modeling, The Paradox, ER vs. DM,
  14. Process of Dimensional Modeling: Four Step: Choose Business Process, Grain, Facts, Dimensions
  15. Issues of Dimensional Modeling: Additive vs Non-Additive facts, Classification of Aggregation Functions
  16. Extract Transform Load ETL: ETL Cycle, Processing, Data Extraction, Data Transformation
  17. Issues of ETL: Diversity in source systems and platforms
  18. Issues of ETL: legacy data, Web scrapping, data quality, ETL vs ELT
  19. ETL Detail: Data Cleansing: data scrubbing, Dirty Data, Lexical Errors, Irregularities, Integrity Constraint Violation, Duplication
  20. Data Duplication Elimination and BSN Method: Record linkage, Merge, purge, Entity reconciliation, List washing and data cleansing
  21. Introduction to Data Quality Management: Intrinsic, Realistic, Orr’s Laws of Data Quality, TQM
  22. DQM: Quantifying Data Quality: Free-of-error, Completeness, Consistency, Ratios
  23. Total DQM: TDQM in a DWH, Data Quality Management Process
  24. Need for Speed: Parallelism: Scalability, Terminology, Parallelization OLTP Vs DSS
  25. Need for Speed: Hardware Techniques: Data Parallelism Concept
  26. Conventional Indexing Techniques: Concept, Goals, Dense Index, Sparse Index
  27. Special Indexing Techniques: Inverted, Bit map, Cluster, Join indexes
  28. Join Techniques: Nested loop, Sort Merge, Hash based join
  29. Data mining (DM): Knowledge Discovery in Databases KDD
  30. Data Mining: CLASSIFICATION, ESTIMATION, PREDICTION, CLUSTERING,
  31. Data Structures, types of Data Mining, Min-Max Distance, One-way, K-Means Clustering
  32. DWH Lifecycle: Data-Driven, Goal-Driven, User-Driven Methodologies
  33. DWH Implementation: Goal Driven Approach
  34. DWH Implementation: Goal Driven Approach
  35. DWH Life Cycle: Pitfalls, Mistakes, Tips
  36. Course Project
  37. Contents of Project Reports
  38. Case Study: Agri-Data Warehouse
  39. Web Warehousing: Drawbacks of traditional web sear ches, web search, Web traffic record: Log files
  40. Web Warehousing: Issues, Time-contiguous Log Entries, Transient Cookies, SSL, session ID Ping-pong, Persistent Cookies
  41. Data Transfer Service (DTS)
  42. Lab Data Set: Multi -Campus University
  43. Extracting Data Using Wizard
  44. Data Profiling