Why a DWH, Warehousing

<< Need of Data Warehousing

The Basic Concept of Data Warehousing >>

Lecture Handout

Data Warehousing

Lecture No. 02

Why a DWH?

Data recording and storage is growing.

History is excellent predictor of the future.

Gives total view of the organization.

Intelligent decision -support is required for decision -making.

Data recording and storage is growing.

Moore's law on increase in performance of CPUs and decrease in cost has been surpassed by the

increase in storage space and decrease in cost. Meaning, it is true that the cost of CPUs is going

down and the performance is going up, but this is applicable at a higher rate to storage space and

cost i.e. more and more cheap storage space is becoming available as compared to fast CPUs.

As you would have experienced, when you (or your father's) briefcase seems to be small as

compared to the contents carried in it, it seems a good idea to buy a new and larger briefcase.

However, after sometime the new briefcase too seems to be small for the contents carried. On the

practical side, it has been noted that the amount of data recorded in an organization doubles every

year and this is an exponential increase.

Reason-1: Data Sets are growing

How Much Data is that?

220 or 106 bytes

1 MB

Small novel 31/2 Disk

230 or 109 bytes

1 GB

Paper rims that could fill the back of a pickup van

50,000 trees chopped and converted into paper

240 or 1012 bytes

1 TB

and printed

1 PB = 250 or 1015 bytes

2 PB

Academic research libraries across the U.S.

1 EB = 260 or 101 8 bytes

5 EB

All words ever spoken by human beings

Table-2.1: Quantifying size of data

Size of Data Sets are going up ↑.

Cost of data storage is coming down ↓.

Total hardware and software cost to store and manage 1 Mbyte of data

§ 1990: ~ $15

§ 2002: ~ ¢15 (Down 100 times)

§ By 2007: < ¢1 (Down 150 times)

A Few Examples

§ WalMart: 24 TB (Tera Byte)

§ France Telecom: ~ 100 TB

§ CERN: Up to 20 PB by 2006 (Peta Byte)

§ Stanford Linear Accelerator Center (SLAC): 500TB

A Ware House of Data

is NOT a

Data Warehouse

Someone says I have a data set of size 1 GB so I have a DWH can you beat this?

Someone else says, I have a data set of size 100 GB, can you beat this?

Someone else says, I have a 1 TB data set, who can beat this?

Who has a data warehouse? Not enough information, it is much more than just the size, it is a

whole concept, it is NOT a shrink wrapped solution, it evolves. A company may have a TB of

data and not have a data warehouse; while on the other hand, a company may have 500 GB of

data and have a fully functional data warehouse.

Size is NOT

Everything

History is excellent predictor of the future

Secondly as I mentioned earlier the data warehouse has the historical data. And one thing that we

have learned by using information is that, "past is the best predictor of the future". You use

historical data, because it gives you an insight into how the environment is changing. Also you

must have heard that "history repeats itself", however this repetition of history is not likely to be

constant for all businesses or all events. Note that you just can't use the historical data to predict

the future; you have to have to bring your own insight and experience to interpret how the

environment is changing in order to predictthe future accurately and meaningfully.

Gives total view of the organization

So why would you want data warehouse in your organization? First of all a data warehouse gives

a total view of an organization. If you look at the operational system i.e. the databases in most

environments, the databases are designed around different lines of business. Consider the case of

a Bank; a bank will typically have current accounts and savings accounts, foreign currency

account etc. The bank will have an MIS system for leasing, and another system for managing

credit cards and another system for every different kind of business they are in. However,

nowhere they have the total view of the environment from the customer's perspective. The reason

being, transaction processing systems are typically designed around functional areas, within a

business environment. For good decision making you should be able to integrate the data across

the organization so as to cross the LoB (Line of Business). So the idea here is to give the tot al

view of the organization especially from a customer's perspective within the data warehouse, as

shown in Figure -2.1

ATM

Leasing

Savings

Account

DATA WAREHOUSE

Checking Account

Credit Card

Figure-2.1: A Data Warehouse crosses the LoB

Intelligent decision -support is required for decision -making

Consider a bank which is losing customers, for reasons not known. However, one thing is for sure

that the bank is losing business because of lost customers. Therefore, it is important, actually

critical to understand which customers have left and why they have left. This will give you the

ability to predict going forward (in time), to identify which customers will leave you (i.e. the

bank). We are going to talk about this in the course using data mining algorithms, like clustering,

classification, regression analysis etc. However, this being another example of using historical

data to predict the future. So I can predict today, which customers will leave me in the next 3

months before they even leave. There can be, and there are whole courses on data mining, but we

will just have an applied overview of data mining in this course.

Reason-2: Businesses demand intelligence

Complex questions from integrated data.

"Intelligent Enterprise"

DBMS Approach

Intelligent Enterprise

List of all items that were sold last month?

Which items sell together? Which items to

stock?

List of all items purchased by Khizar?

Where and how to place the items? What

The total sales of the last month grouped by

discounts to offer?

branch?

How best to target customers to increase sales

How many sales transactions occurred during

at a branch?

the month of January?

Which customers are most likely to respond to

my next promotional campaign, and why?

Table-2.2: Comparison of queries

Let's take a close look at the typical queries for a DBMS. They are either about listing the

contents of tables or running aggregates of values i.e. rather simple and straightforward queries

and fairly easy to program. The queries follow rather pre -defined paths into the database and are

unlikely to come up with something new or abnormal.

Reason-3: Businesses want much more...

What happened?

Why it happened?

What will happen?

What is happening?

What do you want to happen?

These questions primarily point to what is called as the different stages of a Data Warehouse i.e.

starting from the first stage, and going all the way to stage 5. The first stage is not actually a data

warehouse, but a pure batch processing system. Note that as the stages evolve the amount of

batching processing decreases, this be ing maximum in the first stage and minimum in the last or

5th stage. At the same time the amount of ad-hoc query processing increases. Finally in the most

developed stage there is a high level of event based triggering. As the system moves from stage -1

to stage-5 it becomes what is called as an active data warehouse.

What is a DWH?

A complete repository of historical corporate data extracted from transaction systems that is

available for ad-hoc access by knowledge workers

The other key points in this standard definition that I have also underlined and listed below are:

Complete repository

· All the data is present from all the branches/outlets of the business.

· Even the archived data may be brought online.

· Data from arcane and old systems is also brought online.

Transaction System

· Management Information System (MIS)

· Could be typed sheets (NOT transaction system)

Ad-Hoc access

· Does not have a certain predefined database access pattern.

· Queries not known in advance.

· Difficult to write SQL in advance.

Knowl edge workers

· Typically NOT IT literate (Executives, Analysts, Managers).

· NOT clerical workers.

· Decision makers.

The users of data warehouse are knowledge workers in other words they are decision makers in

the organization. They are not the clerical people entering the data or overseeing the transactions

etc or doing programming or performing system design/analysis. These are really decision

makers in the organization like General Manager Marketing, or Executive Director or CEO

(Chief Operating Officer). Typically those decision makers are people in areas like marketing,

finance and strategic planning etc.

Completeness: There is a misnomer here, about completeness. As per the standard definition a

data warehouse is a complete repository of corporate data. The reality is that it can never be

complete. We will discuss this in detail very shortly.

Transaction System: Unlike databases where data is directly entered, the input to the data

warehouse can come from OLTP or transactional systems or other third party databases. This is

not a rule, the data could come from typed or even hand filled sheets, as was the case for the

census data warehouse.

Ad-Hoc access: It dose not have a certain repeatable pattern and it's not known in advance.

Consider financial transactions like a bank deposit, you know exactly what records will be

inserted deleted or updated. That's in OLTP system and in ERP system. But in a data warehouse

there are really no fixed patterns. Say the marketing person, just sits down and thinks abou t what

questions he/she has about customers and there behaviors and so on and they are typically using

some tool to generate SQL dynamically and then that SQL gets executed and that you don't know

in advance.

Although there may be some patterns of queries, but they are really not very predictable and the

query patterns may change over time. Hence there are no predefined access paths into the

database. That's why relational databases are so important for the data warehouse, because

relational databases allow you to navigate the data in any direction that is appropriate using the

primary, foreign key structure within the data model. Meaning, using a data warehouse, does not

implies that we just forget about databases.

Another view of a DWH

Subject

Oriented

Integrated

Time

Variant

Non

Volatile

Figure-2.2: Another view of a Data Warehouse

Subject oriented. The goal of data in the data warehouse is to improve decision making,

planning, and control of the major subjects of enterprises such as customer, products, regions, in

contrast to OLTP applications that are organized around the work-flows of the company.

Integrated. The data in the data warehouse is loaded from different sources that store the data in

different formats and focus on different aspects of the subject. The data has to be checked,

cleansed and transformed into a unified format to allow easy and fast access.

Time variant. Time variant records are records that are created as of some moment in time.

Every record in the data warehouse has some form of time variancy associated with it. In an

OLTP system, the contents change with time i.e. updated such as bank account balance or mobile

phone balance, but in a warehouse as the data is loaded; the moment usually becomes its time

stamp.

Non-volatile. Unlike OLTP systems, after inserting data in the data warehouse it is neither

changed nor removed. The only exceptions are when false or incorrect data gets inserted

erroneously or the capacity of the data warehouse exceeded and archiving becomes necessary.

Table of Contents: