ZeePedia

Why a DWH, Warehousing

<< Need of Data Warehousing
The Basic Concept of Data Warehousing >>
img
Lecture Handout
Data Warehousing
Lecture No. 02
Why a DWH?
·
Data recording and storage is growing.
·
History is excellent predictor of the future.
·
Gives total view of the organization.
·
Intelligent decision -support is required for decision -making.
Data recording and storage is growing.
Moore's law on increase in performance of CPUs and decrease in cost has been surpassed by the
increase in storage space and decrease in cost. Meaning, it is true that the cost of CPUs is going
down and the performance is going up, but this is applicable at a higher rate to storage space and
cost i.e. more and more cheap storage space is becoming available as compared to fast CPUs.
As you would have experienced, when you (or your father's) briefcase seems to be small as
compared to the contents carried in it, it seems a good idea to buy a new and larger briefcase.
However, after sometime the new briefcase too seems to be small for the contents carried. On the
practical side, it has been noted that the amount of data recorded in an organization doubles every
year and this is an exponential increase.
Reason-1: Data Sets are growing
How Much Data is that?
220 or 106 bytes
1 MB
Small novel ­ 31/2 Disk
230 or 109 bytes
1 GB
Paper rims that could fill the back of a pickup van
50,000 trees chopped and converted into paper
240 or 1012 bytes
1 TB
and printed
1 PB = 250 or 1015 bytes
2 PB
Academic research libraries across the U.S.
1 EB = 260 or 101 8 bytes
5 EB
All words ever spoken by human beings
Table-2.1: Quantifying size of data
Size of Data Sets are going up .
§
6
img
Cost of data storage is coming down .
§
§
Total hardware and software cost to store and manage 1 Mbyte of data
§  1990: ~ $15
§  2002: ~ ¢15 (Down 100 times)
§  By 2007: < ¢1 (Down 150 times)
§
A Few Examples
§  WalMart: 24 TB (Tera Byte)
§  France Telecom: ~ 100 TB
§  CERN: Up to 20 PB by 2006 (Peta Byte)
§  Stanford Linear Accelerator Center (SLAC): 500TB
A Ware House of Data
is NOT a
Data Warehouse
Someone says I have a data set of size 1 GB so I have a DWH can you beat this?
Someone else says, I have a data set of size 100 GB, can you beat this?
Someone else says, I have a 1 TB data set, who can beat this?
Who has a data warehouse? Not enough information, it is much more than just the size, it is a
whole concept, it is NOT a shrink wrapped solution, it evolves. A company may have a TB of
data and not have a data warehouse; while on the other hand, a company may have 500 GB of
data and have a fully functional data warehouse.
Size is NOT
Everything
History is excellent predictor of the future
Secondly as I mentioned earlier the data warehouse has the historical data. And one thing that we
have learned by using information is that, "past is the best predictor of the future". You use
historical data, because it gives you an insight into how the environment is changing. Also you
must have heard that "history repeats itself", however this repetition of history is not likely to be
constant for all businesses or all events. Note that you just can't use the historical data to predict
the future; you have to have to bring your own insight and experience to interpret how the
environment is changing in order to predictthe future accurately and meaningfully.
Gives total view of the organization
So why would you want data warehouse in your organization? First of all a data warehouse gives
a total view of an organization. If you look at the operational system i.e. the databases in most
environments, the databases are designed around different lines of business. Consider the case of
a Bank; a bank will typically have current accounts and savings accounts, foreign currency
account etc. The bank will have an MIS system for leasing, and another system for managing
credit cards and another system for every different kind of business they are in. However,
nowhere they have the total view of the environment from the customer's perspective. The reason
being, transaction processing systems are typically designed around functional areas, within a
business environment. For good decision making you should be able to integrate the data across
7
img
the organization so as to cross the LoB (Line of Business). So the idea here is to give the tot al
view of the organization especially from a customer's perspective within the data warehouse, as
shown in Figure -2.1
ATM
Leasing
Savings
Account
DATA WAREHOUSE
Checking Account
Credit Card
Figure-2.1: A Data Warehouse crosses the LoB
Intelligent decision -support is required for decision -making
Consider a bank which is losing customers, for reasons not known. However, one thing is for sure
that the bank is losing business because of lost customers. Therefore, it is important, actually
critical to understand which customers have left and why they have left. This will give you the
ability to predict going forward (in time), to identify which customers will leave you (i.e. the
bank). We are going to talk about this in the course using data mining algorithms, like clustering,
classification, regression analysis etc. However, this being another example of using historical
data to predict the future. So I can predict today, which customers will leave me in the next 3
months before they even leave. There can be, and there are whole courses on data mining, but we
will just have an applied overview of data mining in this course.
Reason-2: Businesses demand intelligence
§
Complex questions from integrated data.
§
"Intelligent Enterprise"
DBMS Approach
Intelligent Enterprise
List of all items that were sold last month?
Which items sell together? Which items to
stock?
List of all items purchased by Khizar?
Where and how to place the items? What
The total sales of the last month grouped by
discounts to offer?
branch?
How best to target customers to increase sales
How many sales transactions occurred during
at a branch?
the month of January?
Which customers are most likely to respond to
my next promotional campaign, and why?
8
img
Table-2.2: Comparison of queries
Let's take a close look at the typical queries for a DBMS. They are either about listing the
contents of tables or running aggregates of values i.e. rather simple and straightforward queries
and fairly easy to program. The queries follow rather pre -defined paths into the database and are
unlikely to come up with something new or abnormal.
Reason-3: Businesses want much more...
1.
What happened?
2.
Why it happened?
3.
What will happen?
4.
What is happening?
5.
What do you want to happen?
These questions primarily point to what is called as the different stages of a Data Warehouse i.e.
starting from the first stage, and going all the way to stage 5. The first stage is not actually a data
warehouse, but a pure batch processing system. Note that as the stages evolve the amount of
batching processing decreases, this be ing maximum in the first stage and minimum in the last or
5th stage. At the same time the amount of ad-hoc query processing increases. Finally in the most
developed stage there is a high level of event based triggering. As the system moves from stage -1
to stage-5 it becomes what is called as an active data warehouse.
What is a DWH?
A complete repository of historical corporate data extracted from transaction systems that is
available for ad-hoc access by knowledge workers
The other key points in this standard definition that I have also underlined and listed below are:
Complete repository
·  All the data is present from all the branches/outlets of the business.
·  Even the archived data may be brought online.
·  Data from arcane and old systems is also brought online.
Transaction System
·  Management Information System (MIS)
·  Could be typed sheets (NOT transaction system)
Ad-Hoc access
·  Does not have a certain predefined database access pattern.
·  Queries not known in advance.
·  Difficult to write SQL in advance.
Knowl edge workers
·  Typically NOT IT literate (Executives, Analysts, Managers).
·  NOT clerical workers.
·  Decision makers.
9
img
The users of data warehouse are knowledge workers in other words they are decision makers in
the organization. They are not the clerical people entering the data or overseeing the transactions
etc or doing programming or performing system design/analysis. These are really decision
makers in the organization like General Manager Marketing, or Executive Director or CEO
(Chief Operating Officer). Typically those decision makers are people in areas like marketing,
finance and strategic planning etc.
Completeness: There is a misnomer here, about completeness. As per the standard definition a
data warehouse is a complete repository of corporate data. The reality is that it can never be
complete. We will discuss this in detail very shortly.
Transaction System: Unlike databases where data is directly entered, the input to the data
warehouse can come from OLTP or transactional systems or other third party databases. This is
not a rule, the data could come from typed or even hand filled sheets, as was the case for the
census data warehouse.
Ad-Hoc access: It dose not have a certain repeatable pattern and it's not known in advance.
Consider financial transactions like a bank deposit, you know exactly what records will be
inserted deleted or updated. That's in OLTP system and in ERP system. But in a data warehouse
there are really no fixed patterns. Say the marketing person, just sits down and thinks abou t what
questions he/she has about customers and there behaviors and so on and they are typically using
some tool to generate SQL dynamically and then that SQL gets executed and that you don't know
in advance.
Although there may be some patterns of queries, but they are really not very predictable and the
query patterns may change over time. Hence there are no predefined access paths into the
database. That's why relational databases are so important for the data warehouse, because
relational databases allow you to navigate the data in any direction that is appropriate using the
primary, foreign key structure within the data model. Meaning, using a data warehouse, does not
implies that we just forget about databases.
Another view of a DWH
Subject
Oriented
Integrated
Time
Variant
Non
Volatile
10
Figure-2.2: Another view of a Data Warehouse
Subject oriented. The goal of data in the data warehouse is to improve decision making,
planning, and control of the major subjects of enterprises such as customer, products, regions, in
contrast to OLTP applications that are organized around the work-flows of the company.
Integrated. The data in the data warehouse is loaded from different sources that store the data in
different formats and focus on different aspects of the subject. The data has to be checked,
cleansed and transformed into a unified format to allow easy and fast access.
Time variant. Time variant records are records that are created as of some moment in time.
Every record in the data warehouse has some form of time variancy associated with it. In an
OLTP system, the contents change with time i.e. updated such as bank account balance or mobile
phone balance, but in a warehouse as the data is loaded; the moment usually becomes its time
stamp.
Non-volatile. Unlike OLTP systems, after inserting data in the data warehouse it is neither
changed nor removed. The only exceptions are when false or incorrect data gets inserted
erroneously or the capacity of the data warehouse exceeded and archiving becomes necessary.
11
Table of Contents:
  1. Need of Data Warehousing
  2. Why a DWH, Warehousing
  3. The Basic Concept of Data Warehousing
  4. Classical SDLC and DWH SDLC, CLDS, Online Transaction Processing
  5. Types of Data Warehouses: Financial, Telecommunication, Insurance, Human Resource
  6. Normalization: Anomalies, 1NF, 2NF, INSERT, UPDATE, DELETE
  7. De-Normalization: Balance between Normalization and De-Normalization
  8. DeNormalization Techniques: Splitting Tables, Horizontal splitting, Vertical Splitting, Pre-Joining Tables, Adding Redundant Columns, Derived Attributes
  9. Issues of De-Normalization: Storage, Performance, Maintenance, Ease-of-use
  10. Online Analytical Processing OLAP: DWH and OLAP, OLTP
  11. OLAP Implementations: MOLAP, ROLAP, HOLAP, DOLAP
  12. ROLAP: Relational Database, ROLAP cube, Issues
  13. Dimensional Modeling DM: ER modeling, The Paradox, ER vs. DM,
  14. Process of Dimensional Modeling: Four Step: Choose Business Process, Grain, Facts, Dimensions
  15. Issues of Dimensional Modeling: Additive vs Non-Additive facts, Classification of Aggregation Functions
  16. Extract Transform Load ETL: ETL Cycle, Processing, Data Extraction, Data Transformation
  17. Issues of ETL: Diversity in source systems and platforms
  18. Issues of ETL: legacy data, Web scrapping, data quality, ETL vs ELT
  19. ETL Detail: Data Cleansing: data scrubbing, Dirty Data, Lexical Errors, Irregularities, Integrity Constraint Violation, Duplication
  20. Data Duplication Elimination and BSN Method: Record linkage, Merge, purge, Entity reconciliation, List washing and data cleansing
  21. Introduction to Data Quality Management: Intrinsic, Realistic, Orr’s Laws of Data Quality, TQM
  22. DQM: Quantifying Data Quality: Free-of-error, Completeness, Consistency, Ratios
  23. Total DQM: TDQM in a DWH, Data Quality Management Process
  24. Need for Speed: Parallelism: Scalability, Terminology, Parallelization OLTP Vs DSS
  25. Need for Speed: Hardware Techniques: Data Parallelism Concept
  26. Conventional Indexing Techniques: Concept, Goals, Dense Index, Sparse Index
  27. Special Indexing Techniques: Inverted, Bit map, Cluster, Join indexes
  28. Join Techniques: Nested loop, Sort Merge, Hash based join
  29. Data mining (DM): Knowledge Discovery in Databases KDD
  30. Data Mining: CLASSIFICATION, ESTIMATION, PREDICTION, CLUSTERING,
  31. Data Structures, types of Data Mining, Min-Max Distance, One-way, K-Means Clustering
  32. DWH Lifecycle: Data-Driven, Goal-Driven, User-Driven Methodologies
  33. DWH Implementation: Goal Driven Approach
  34. DWH Implementation: Goal Driven Approach
  35. DWH Life Cycle: Pitfalls, Mistakes, Tips
  36. Course Project
  37. Contents of Project Reports
  38. Case Study: Agri-Data Warehouse
  39. Web Warehousing: Drawbacks of traditional web sear ches, web search, Web traffic record: Log files
  40. Web Warehousing: Issues, Time-contiguous Log Entries, Transient Cookies, SSL, session ID Ping-pong, Persistent Cookies
  41. Data Transfer Service (DTS)
  42. Lab Data Set: Multi -Campus University
  43. Extracting Data Using Wizard
  44. Data Profiling