De-Normalization: Balance between Normalization and De-Normalization

<< Normalization: Anomalies, 1NF, 2NF, INSERT, UPDATE, DELETE

DeNormalization Techniques: Splitting Tables, Horizontal splitting, Vertical Splitting, Pre-Joining Tables, Adding Redundant Columns, Derived Attributes >>

Lecture Handout

Data Warehousing

Lecture No. 07

De-Normalization

Striking a balance between "good" & "evil"

Figure-7.1: Striking a balance between normalization and de-normalization

There should be a balance between normalized and de -normalized forms. In a fully normalized

form, too many joins are required and in a totally de -normalized form, we have a big, wide single

table. Database should be aligned someplace in between so as to strike a balance, especially in the

context of the queries and the application domain.

A "pure" normalized design is a good starting point for a data model and is a great thing to do and

achieve in academia. However, as briefly mentioned in the previous lecture, in the reality of the

"real world", the enhancement in performance delivered by some selective de -normalization

technique can be a very valuable tool. The key to success is to undertake de-normalization as a

design technique very cautiously and consciously. Do not let proliferation of the technique take

over your data warehouse or you will end up with a single big flat file!

What is De-Normalization?

�

It is not chaos, more like a "controlled crash" with the aim of performance enhancement

without loss of information.

�

Normalization is a rule of thumb in DBMS, but in DSS ease of use is achieved by way of

denormalization.

�

D e-normalization comes in many flavors, such as combining tables, splitting tables,

adding data etc., but all done very carefully.

`Denormalization' does not mean that anything and everything goes. Denormalization does not

mean chaos or disorder or indiscipline. The development of properly denormalized data structures

follows software engineering principles, which insure that information will not be lost. De -

normalization is the process of selectively transforming normalized relations into un -normalized

physical record specifications, with the aim of reducing query processing time. Another

fundamental purpose of denormalization is to reduce the number of physical tables, that must be

accessed to retrieve the desired data by reducing the number of joins required to answer a query.

Some people tend to confuse dimensional modeling with de-normalization. This will become

clear when we will cover dimensional modeling, where indeed tables are collapsed together.

Why De-Normalization in DSS?

�

Bringing "close" dispersed but related data items .

�

Query performance in DSS significantly dependent on physical data model.

�

Very early studies showed performance difference in orders of magnitude for different

number de-normalized tables and rows per table.

�

The level of de-normalization should be carefully considered.

The efficient processing of data depends how close together the related data items are. Often all

the attributes that appear within a relation are not used together, and data from different relations

is needed together to answer a query or produce a report. Although normalized relations solve

data maintenance anomalies (discussed in last lecture), however, normalized relations if

implemented one for one as physical records, may not yield efficient data processing times. DSS

query performance is a function of the performance of every component in the data delivery

architecture, but is strongly associated with the physical data model. Intelligent data modeling

through the use of techniques such as de-normalization, aggregation, and partitioning, can

provide orders of magnitude performance gains compared to the use of normalized data

structures.

The processing performance between totally normalized and partially normalized DSSs can be

dramatic. Inmon (grand father of data warehousing) reported way back in 1988 through a study,

by quantifying the performance of fully and partially normalized DSSs. In his study, a fully

normalized DSS contained eight tables with about 50,000 rows each, another partially normalized

DSS had four tables with roughly 25,000 rows each, and yet another partially normalized DSS

had two tables. The results of the study showed that the less than fully normalized DSSs could

muster a performance as much as an order of magnitude better than the fully normalized DSS.

Although such results depend greatly on the DSS and the type of processing, yet these results

suggest that one should carefully consider whether the physical records should exactly match the

normalized relations for a DSS or not?

How De-Normalization improves performance?

De-normalization specifically improves performance by either:

�

Reducing the number of tables and hence the reliance on joins, which consequently

speeds up performance.

�

Reducing the number of joins required during query execution, or

�

Reducing the number of rows to be retrieved from the Primary Data Table.

The higher the level of normalization, the greater will be the number of tables in the DSS as the

depth of snowflake schema would increase. The greater the number of tables in the DSS, the

more joins are necessary for data manipulation. Joins slow performance, especially for very large

tables for large data extractions, which is a norm in DSS not an exception. De-normalization

reduces the number of tables and hence the reliance on joins, which consequently speeds up

performance.

De-normalization can help minimize joins and foreign keys and help resolve aggregates. By

storing values that would otherwise need to be retrieved (repeatedly), one may be able to reduce

the number of indexes and even tables required to process queries.

4 Guidelines for De-normalization

1. Carefully do a cost-benefit analysis (frequency of use, additional storage, join time).

2. Do a data requirement and storage analysis.

3. Weigh against the maintenance issue of the redundant data (triggers used).

4. When in doubt, don't denormalize.

Guidelines for Denormalization:-

Following are some of the basic guidelines to help determine whether it's time to denormalize the

DSS design or not:

1. Balance the frequency of use of the data items in question, the cost of additional storage to

duplicate the data, and the acquisition time of the join.

2. Understand how much data is involved in the typical query; the amount of data affects the

amount of redundancy and additional storage requirements.

3. Remember that redundant data is a performance benefit at query time, but is a performance

liability at update time because each copy of the data needs to be kept up to date. Typically

triggers are written to maintain the integrity of the duplicated data.

4. De-normalization usually speeds up data retrieval, but it can slow the data modification

processes. It may be noted that both on-line and batch system performance is adversely affected

by a high degree of de-normalization. Hence the golden rule is: When in doubt, don't

denormalize.

Areas for Applying De-Normalization Techniques

�

Dealing with the abundance of star schemas.

�

Fast access of time series data for analysis.

�

Fast aggregate (sum, average etc.) results and complicated calculations.

�

Multidimensional analysis (e.g. geography) in a complex hierarchy.

�

Dealing with few updates but many join queries.

De-normalization will ultimately affect the database size and query performance.

De-normalization is especially useful while dealing with the abundance of star schemas that are

found in many data warehouse installations. For such cases, de-normalization provides better

performance and a more natural data structure for supporting decision making. The goal of most

analytical processes in a typical data warehouse environment is to access aggregates such as

averages, sums, complicated formula calculations, top 10 customers etc. Typical OLTP systems

contain only the raw transaction data, while decision makers expect to find aggregated and time -

series data in their data warehouse to get the big picture through immediate query and display.

Important and common parts of a data warehouse that are a good candidate for de -normalization

include (but are not limited to):

�

Aggregation and complicated calculations.

�

Multidimensional analysis in a complex hierarchy.

�

Few updates, but many join queries.

Geography is a good example of a multidimensional hierarchy (e.g. Province, Division, District,

city and zone). Basic design d cisions for example, the selection of dimensions, the number of

dimensions to be used and what facts to aggregate will ultimately affect the database size and

query performance.

Five principal De-normalization techniques

1. Collapsing Tables.

- Two entities with a One-to-One relationship.

- Two entities with a Many-to-Many relationship.

2. Pre-Joining.

3. Splitting Tables (Horizontal/Vertical Splitting).

4. Adding Redundant Columns (Reference Data).

5. Derived Attributes (Summary, Total, Balance etc).

Now we will discuss de -normalization techniques that have been commonly adopted by

experienced database designers. These techniques can be classified into four prevalent strategies

for denormalization which are:

1. Collapsing Tables.

- Two entities with a One-to-One relationship.

- Two entities with a Many-to-Many relationship.

2. Pre-joining.

3. Splitting Tables (Horizontal/Vertical Splitting).

4. Adding Redundant Columns (Reference Data).

5. Derived Attributes (Summary, Total, Balance etc) .

Collapsing Tables

Figure-7.1: Collapsing Tables

1. Collapsing Tables

One of the most common and safe denormalization techniques is combining of One-to-One

relationships. This situation occurs when for each row of entity A, there is only one related row in

entity B. While the key attributes for the entities may or may not be the same, their equal

participation in a relationship indicates that they can be treated as a single unit. For example, if

users frequently need to see COLA, CO LB, and COLC together and the data from the two tables

are in a One-to-One relationship, the solution is to collapse the two tables into one. For example,

SID and gender in one table, and SID and degree in the other table.

In general, collapsing tables in One-to-One relationship has fewer drawbacks than others. There

are several advantages of this technique, some of the obvious ones being reduced storage space,

reduced amount of time for data update, some of the other not so apparent advantages are reduced

number of foreign keys on tables, reduced number of indexes (since most indexes are created

based on primary/foreign keys). Furthermore, combining the columns does not change the

business view, but does decrease access time by having fewer physical objects and reducing

overhead.

Table of Contents: