ZeePedia buy college essays online

Data Warehousing

<<< Previous Types of Data Warehouses: Financial, Telecommunication, Insurance, Human Resource Next >>>
Lecture Handout
Data Warehousing
Lecture No. 05
Types of Data Warehouses
Human Resource
Financial DWH
First data warehouse that an organization builds. This is appealing because:
Nerve center, easy to get attention.
In most organizations, smallest data set.
Touches all aspects of an organization, with a common denomination i.e. money.
Inherent structure of data directly influenced by the day-to-day activities of
financial processing.
Financial data warehouses suffer from an anomaly of inability to match balanced
accounts, due to many legitimate reasons.
Financial data warehouses are often the first data warehouse that an organization builds. This is
appealing because:
Financial data is ALWAYS at the nerve center of the organization. Therefore, getting attention
drawn to a well-built financial data warehouse is a very easy thing to do. In most organizations
(but not all), financial data represents one of the smallest volumes of data that exist. Finance cuts
across all of the aspects of the corporate data and has a common denominator i.e. money.
Financial data inherently has a structure of data directly influenced by the day -to-day activities of
financial processing and the list goes on. For these reasons, finance becomes a very good target
for beginning to build the corporate data warehouse.
Financial data warehouses have some severe drawbacks that are not found elsewhere. When
purists use a financial data warehouse, they are overjoyed to find that it is almost impossible to
reconcile down to the rupee. Again and again they will report that they saw a report yesterday that
stated the balance to be Rs. 12,345,678 but when did the same report today the using our
accounting MIS system the balance was Rs. 13,245,678. Hence one can not trust the new
Note that it is unlikely that the financial data warehouse would balance to the last rupee, such a
notion is completely wrong for a number of reasons. The accounting periods may be different in
different operational systems, but in a data warehouse, it is the corporate calendar. The
classifications of regions may change. In the operational system Northern Pakis tan may consist of
Murree, Abbotabad etc. but in the national data warehouse, Northern Pakistan consists of every
major city north of Lahore. In the operational system (for example) sale or oranges is recorded in
dozen in one part of Pakistan, while in other they are sold by weight and in another part by crates.
However, in the national data warehouse, they are accounted for based on dozen and the list goes
Dominated by sheer volume of data.
Many ways to accommodate call level detail:
Only a few months of call level detail,
Storing lots of call level detail scattered over different storage media,
Storing only selective call level detail, etc.
Unfortunately, for many kinds of processing, working at an aggregate level is
s imply not possible.
Telecommunications Data Warehouse
The telecommunications data warehouse is dominated by the sheer volume of data generated at
the call level subject area. There are many other types of data in this environment as well. No
other data warehouse environment is dominated by the size of one of the subject areas as call
level detail dominates a telecommunications data warehouse.
There are many ways that call level detail can be accommodated:
Only a few months of call level detail,
Stori ng lots of call level detail scattered over different storage media,
Summarizing or aggregating call level detail,
Storing only selective call level detail, and so forth.
Unfortunately, for many kinds of processing, the only way it can be accommodated is by working
at the call level detail. Working at a summary level or an aggregate level simply is not a
Insurance data warehouses are similar to other data warehouses BUT with a few exceptions.
Stored data that is very, very old, used for actuarial processing.
Typical business may change dramatically over last 40-50 years, but not insurance.
Therefore, if someone says that the insurance business has not changed much over the
last 40-50 years would be telling the truth.
In retailing or telecomm there are a few important dates, but in the insurance environment
there are many dates of many kinds. In the retail sales environment there are a few dates
such as: sale date, stocking date, and perhaps manufacturing date. In the banking
environment there is the transaction date. However, in the insurance environment there
are dates everywhere of every kind.
Long operational business cycles, in years. Thus the operating speed is different.
Consider a bank, you insert the ATM card and draw the money instantly, or deposit a
check and it is cashed at most in a week. You go to store and purchase an item, and
instantly make a telephone call. However, in an insurance environment a claim is made
and it may take several years before it is settled.
Transactions are not gathered and processed, but are in kind of "frozen" state.
Thus a very unique approach of design & implementation.
Differences between different types of data warehouses
Financial data warehouses, often the first a corporation builds. However, it will not
reconcile down to the last rupee with the existing operational financial environment.
Insurance data warehouses are similar to other data warehouses with a few exceptions:
such as the length of time that insurance dat a warehouses exists, in terms of the dates
found in the business, and in terms of the operational business cycle.
The human resources data warehouse is different from others because this without doubt
has only one major subject area (yes you guessed it right!).
The telecommunications data warehouse is dominated by the sheer volume of data
generated at one subject area.
Indeed there are many types of data warehouses, each with their own peculiarities and all of them
can not be discussed here.
Typical Applications of DWH
There are, and there can be many many applications of a data warehouse. It is not possible to
discuss all of them. Some representative applications are listed to be discussed as follows:
Fraud detection.
Profitability analysis.
Direct mail/database marketing.
Credit risk prediction.
Customer retention modeling.
Yield management.
Inventory management.
ROI on any one of these applications can justify hardware/software and consultancy costs in most
Fraud detection
By observing data usage patterns.
People have typical purchase patterns.
Deviation from patterns.
Certain cities notorious for fraud.
Certain items bought by stolen cards.
Similar behavior for stolen phone cards.
You can look at patterns of data usage, and behavior of customer and detect fraud very easily.
Lets suppose you are a Credit Card Company. You know that Danyal as a customer has certain
behaviors. He tends to buy airplane tickets, make hotel reservations, rents cars using his credit
card. When he goes shopping, it is usually only on a weekend and only on certain stores that he
goes shopping and buys sports goods and action clothing. And then all of sudden you see very
different behaviors coming in, such as credit card transactions in Karachi. Karachi by the way has
very high stolen credit card rate. And the patterns of buying behavior are stereos, and CDs and
TVs i.e. electronic goods that are typically the things that people buy on stolen credit cards. You
can notice and say "look, this is not airplane or hotel reservations, this does not look like Danyal".
It is not in the city where Danyal lives or goes to very often. Although he was there a week ago,
so maybe this is how you can tell that when and where his credit card was stolen. And the
purchases were made after he left the city. Hence "high probability of fraud".
So I can predict based on the behavior of the individuals, when things out of the typical routine
happen. Then I can stop the fraud happening even before it has spread. That's the example in
credit card but you can do the same thing with telephone cards. I am sure you have used these
telephone cards. You punch in your number and you get access to telephone network. You find
people at airports or railway stations who make there living by stealing the telephone card codes.
They stay near the phones at the air port, they act like they are talking on the phone, but they are
really looking over your shoulder when you are punching in the code. And then they memorize
the code and then go sell it on the street to people who make long distance phone calls to all their
friends in Brazil and Australia all these expensive places. You can look at the calling patterns and
say look "Danyal makes regular calls to areas like say Rawalpindi and to friends in Lahore" and
all of sudden we see all these phone calls to Toba Taik Singh or Bhawalpur, meaning there is
some thing wrong here.
Profitability Analysis
Banks know if they are profitable or not.
Don't know which customers are profitable.
Typically more than 50% are NOT profitable.
Don't know which one?
Balance is not enough, transactional behavior is the key.
Restructure products and pricing strategies.
Life-time profitability models (next 3-5 years).
It is another example. If you go to an average bank, to Pakistan or any where in the world, the
bank will know if they are profitable or not. But they don't know which customers are profitable
or not, and at most banks this is true, especially true for retail banks. These banks do business
with the consumers, such that more then 50 % of the customers are unprofitable. In other word
the bank is loosing money on 50% of their customers. But they don't know which 50%? That's
the problem. So the idea behind profitability analysis is to do the analysis to figure out which
customers are profitable and which customers are not profitable. And then based on that they can
restructure their product offering and there pricing strategies to do more profitable business. I
used a banking example but same is true for Telecommunications Company or any other
consumer oriented business that requires access to detail data. It is not sufficient to just
the account balance, but also transactional behavior and so on to do profitability analysis. And
once I know profitability retrospectively, I can also build predictive models to understand what
it's going to be prospectively. So I can build what's called lifetime value models to using the
historical data to predict what the future profitability will be for the next 3 or 5 years.
Direct mail marketing
Targeted marketing.
Offering high bandwidth package NOT to all users.
Know from call detail records of web surfing.
Saves marketing expense, saving pennies.
Knowing your customers better.
Assume you are a telecommunications company, and you have an offering that you can make
such as to give value added Internet access through mobile phones. You can send a letter or an
SMS to every customer in the country and tell them about this value added Internet access
service. But 95% of that mail will be completely wasted. Because a small percentage of
customers will be interested in such as service. So the question here is how you your message to
the right people in a meaningful way. If you have the call detailed records, you can tell very very
easily which of your customer's access the Internet and which of them just make missed calls.
Because the patters of the call detail records of Internet access are completely different from
making voice calls and making missed calls. They have different durations; they have different
data bandwidth requirement and so on. So by having access to detail data, you can target those
people that are better to get a second account, or better to get a value added connection/service.
Credit risk prediction
W ho should get a loan?
Customer segregation i.e. stable vs. rolling.
Qualitative decision making NOT subjective.
Different interest rates for different customers.
Do not subsidize bad customer on the basis of good.
Credit risk prediction is another area where data warehouse is used quite frequently. Let's say I
and my friend both apply for home loan to construct a house. How would the bank determine
which one of those should get a loan? So typically they will say, ok I have been employed for 5
years, have good salary, married, have children etc etc and then based on the context of
application then they decide yes we think you are a low risk we will give you the loan. Or say no
you are a high risk because you had a job only for one year, your salary is not high, and you have
lots of change of jobs recently, so you know we don't feel so good about the risk we are taking,
so we won't give you the loan.
So typically today it is based on what is called qualitative decision -making. Basically in the
offices these are just references say OK Saeed is a nice person or is related to me or is my
neighbor so we give him the loan. And I don't really know Waleed over here so we don't give
him a loan. This is not the best way to give fair loan to people and is not the best way to manage
risk in the business. So with credit risk, the idea is to build quantitative models, which predict risk
based on historical data. And not only I use the historical data to predict the risk, if I get more
sophisticated, what I can do is called risk based pricing. So today lets say you offer me (Saeed) a
loan, and also to the other applicant i.e. Waleed also loan, and both are given the loan at the same
interest rates. However, if Waleed is a lower risk then Saeed then why should both pay the same
interest rates? If Waleed is a low risk, then he should get a lower interest rate other wise the bank
is subsidizing the higher risk loan. So what's happening is that you are using very precise credit
prediction models in order to adjust the interest to give lower interest to people who are at lower
risk of default. And again it should use some data mining algorithms to predict the risk, which is
called the risk index on every individual consumer. So that every consumer gets a loan at an
interest that is a ppropriate to their risk level.
Yield Management
Works for fixed inventory businesses.
The price of item suddenly goes to zero.
Item prices vary for varying customers.
Example: Air Lines, Hotels, Car rentals etc.
Price of (say) Air Ticket depends on:
§  How much in advance ticket was bought?
§  How many vacant seats were present?
§  How profitable is the customer?
§  Ticket is one-way or return?
Yield management works, for example for sophisticated airlines which use a variable pricing
mechanism for the seats. Say you are flying on a plane; 90% of the time the person sitting next to
you pays a different price for the ticket as compared to yours. The seats are basically same, yet
you pay a different price for the ticket, why is it that? Because the airline used a sophisticated
method to see how much in advance the person bought the ticket, and how many seats were
available on the plane at the time the person bought the ticket. Or did they buy a round ticket or
one-way ticket; there are all kinds of sophisticated ways to do the pricing. Yield management
helps decide what should be the price at any time for that airline seat.
Yield management is important for almost any type of fixed inventory business. For example,
once the plane leaves, the price of the seat if it is empty or full means nothing any more. So the
goal of the airline is to fill every seat with the maximum possible price for that seat. The same is
true for hotels. For example a three star hotel had an empty room last night. If they can get Rs.
1500 for that night it is better then having nothing at all. So someone who booked the room for
example as a group, would have paid lower price per room as compared to someone who is
booking only one room. The price is different, because the hotels use yield manageme nt to fill out
their hotel rooms as profitably as they can. This is applicable to car rentals also.
Agriculture Systems
Agri and related data collected for decades.
Metrological data consists of 50+ attributes.
Decision making based on expert judgment.
Lack of integration results in underutilization.
What is required, in which amount and when?
Each year different government departments and agencies in Pakistan create tens of thousands of
digital and non digital files from thousands of pest-scouting surveys, yield surveys, metrological
data collection, river flows etc. This data collection has been going on for decades. The data
collected has never been compiled, standardized and integrated to give a complete picture. Thus
the lack of data integration and standardization contributes to an under -utilization of historical
data, and inevitably results in an inability to perform any scientific predictive analysis for
effective decision support and policy making. An Agriculture data warehouse is the answer, as
processing 100+ variables by experts for large historical data is not possible. This has repeatedly
resulted in tragic outcomes.
Although the benefits are many, but are not without some major issues or problems, some of them
are as follows:
Major Issues
·  Unavailability of data and in an illegible format and form.
Most of the detail data is not digitized; therefore it can not be readily used by computers.
Dispersion of data across multiple and geographically displaced organizations.
Even though the said organizations are required to share the data with common citizens,
no official data sharing mechanism is currently in place.
Until the requisite training is provided, government employees and farmers are not equipped with
enough technical and literate skills to use latest information technology tools to use the data
Table of Contents:
  1. Need of Data Warehousing
  2. Why a DWH, Warehousing
  3. The Basic Concept of Data Warehousing
  4. Classical SDLC and DWH SDLC, CLDS, Online Transaction Processing
  5. Types of Data Warehouses: Financial, Telecommunication, Insurance, Human Resource
  6. Normalization: Anomalies, 1NF, 2NF, INSERT, UPDATE, DELETE
  7. De-Normalization: Balance between Normalization and De-Normalization
  8. DeNormalization Techniques: Splitting Tables, Horizontal splitting, Vertical Splitting, Pre-Joining Tables, Adding Redundant Columns, Derived Attributes
  9. Issues of De-Normalization: Storage, Performance, Maintenance, Ease-of-use
  10. Online Analytical Processing OLAP: DWH and OLAP, OLTP
  11. OLAP Implementations: MOLAP, ROLAP, HOLAP, DOLAP
  12. ROLAP: Relational Database, ROLAP cube, Issues
  13. Dimensional Modeling DM: ER modeling, The Paradox, ER vs. DM,
  14. Process of Dimensional Modeling: Four Step: Choose Business Process, Grain, Facts, Dimensions
  15. Issues of Dimensional Modeling: Additive vs Non-Additive facts, Classification of Aggregation Functions
  16. Extract Transform Load ETL: ETL Cycle, Processing, Data Extraction, Data Transformation
  17. Issues of ETL: Diversity in source systems and platforms
  18. Issues of ETL: legacy data, Web scrapping, data quality, ETL vs ELT
  19. ETL Detail: Data Cleansing: data scrubbing, Dirty Data, Lexical Errors, Irregularities, Integrity Constraint Violation, Duplication
  20. Data Duplication Elimination and BSN Method: Record linkage, Merge, purge, Entity reconciliation, List washing and data cleansing
  21. Introduction to Data Quality Management: Intrinsic, Realistic, Orr’s Laws of Data Quality, TQM
  22. DQM: Quantifying Data Quality: Free-of-error, Completeness, Consistency, Ratios
  23. Total DQM: TDQM in a DWH, Data Quality Management Process
  24. Need for Speed: Parallelism: Scalability, Terminology, Parallelization OLTP Vs DSS
  25. Need for Speed: Hardware Techniques: Data Parallelism Concept
  26. Conventional Indexing Techniques: Concept, Goals, Dense Index, Sparse Index
  27. Special Indexing Techniques: Inverted, Bit map, Cluster, Join indexes
  28. Join Techniques: Nested loop, Sort Merge, Hash based join
  29. Data mining (DM): Knowledge Discovery in Databases KDD
  31. Data Structures, types of Data Mining, Min-Max Distance, One-way, K-Means Clustering
  32. DWH Lifecycle: Data-Driven, Goal-Driven, User-Driven Methodologies
  33. DWH Implementation: Goal Driven Approach
  34. DWH Implementation: Goal Driven Approach
  35. DWH Life Cycle: Pitfalls, Mistakes, Tips
  36. Course Project
  37. Contents of Project Reports
  38. Case Study: Agri-Data Warehouse
  39. Web Warehousing: Drawbacks of traditional web sear ches, web search, Web traffic record: Log files
  40. Web Warehousing: Issues, Time-contiguous Log Entries, Transient Cookies, SSL, session ID Ping-pong, Persistent Cookies
  41. Data Transfer Service (DTS)
  42. Lab Data Set: Multi -Campus University
  43. Extracting Data Using Wizard
  44. Data Profiling