ZeePedia buy college essays online

Data Warehousing

<<< Previous Conventional Indexing Techniques: Concept, Goals, Dense Index, Sparse Index Next >>>
Need for Speed: Hardware Techniques
Data Parallelism: Concept
Parallel execution of a single data manipulation task across multiple partitions of data.
Partitions static or dynamic
Tasks executed almost-independently across partitions.
"Query coordinator" must coordinate between the independently executing processes.
So data parallelism is I think the simplest form of parallelization. The idea is that we have parallel
execution of single data operation across multiple partitions of data. So the idea here is that these
partitions of data may be defined statically or dynamically fine, but we are requiring the same
operator across these multiple partitions concurrently. And this idea actually of data parallelism
has existed for a very long time. So the idea is that you are getting parallelization because we are
getting semi -independent execution, data manipulation across the partitions. And as long as we
keep the coordination required,  we can get very good speedups. Well again this que ry
coordinator, the thing that keeps the query distributed but still working and then collects its
results.. Now that query coordinator can potentially be a bottleneck, because if it does too much
work, that is serial execution. So the query coordination has to be very small amount of work.
Otherwise the overhead gets higher and the serialization of the workload gets higher.
Data Parallelism: Example
Figure-25.1: Example of Data Parallelism
Suppose that we have a question that we want to select the number of accounts where balance is
greater that 5,000$ and the open data is after first of June 2000. This account table and what we
end up doing is say ok send the query to each query server and each query server then runs the
query against a particular partition as shown in Figure 25.1. And then how many query servers we
need and each gets a partial result. So in the data blocks that are processed 1,235 that meet up the
criteria. And here we get 536 and here this one found 2,016. The query coordinator just takes
these numbers and lets just say there is a 100 -degree of parallelism, so that I take 100 numbers
and I add them and I get the result. Remember that a query coordinator is software.
Data Parallelism: Ensuring Speed-UP
To get a speed -up of N with N partitions, it must be ensured that:
There are enough computing resources.
Query -coordinator is very fast as compared to query servers.
Work done in each partition almost same to avoid performance bottlenecks.
Same number of records in each partition would not suffice.
Need to have uniform distribution of records w.r.t. filter criterion across partitions.
Linear speed up with data parallelism should be achieved as long as the coordinator is not
burdened by too much work and the workload is distributed relatively evenly to the participating
query servers. If the coordinator does too much work, then the serial processing involved in final
results construction will significantly impact overall performance (remember Amdahl's Law!).
Depending on CPU speed and RDBMS efficiency, the number of partitions may be either greater
or less than (or equal to) the number of data partitions in the parallel execution of a query. Faster
CPUs and more efficient (light weight) query server implementations will generally have more
partitions than CPUs (so that more overlap can take place with computational and I/O
processing). Less efficient (heavy weight) query server implementations or slow CPUs will tend
toward fewer data partitions than CPUs. There is usually a one -to-one correspondence of query
servers to data partitions.
Note: You should not confuse data partitions used for the purposes of data parallelism with range
partitions used to divide up large tables for ease of management - they are two different things!
The first thing is that the amount of work done by the query coordinator should be very small. It
should be small because that work is serial. And if it is serial it will kill the performance. So it
should be almost zero.
The second thing is that the workload performed in each partition should be relatively equal. If
one query server has much more work to do then any other query server then every body waits for
the smothered storm. So in parallel execution environment distribution of work is very important.
Every database that is serious about parallel query capability should have data parallelism in the
market place today. This is sort of cost of playing the game. You have to at least be able to do
this, to be considered as a serious competitor.
Spatial Parallelism (pipelining)
Figure-25.2: Spatial Parallelism (pipelining)
The key idea behind pipeline parallelism is to take a "big" task and break it into subtasks that can
be processed concurrently on a stream of data inputs in multiple, overlapping stages of execution.
Pipeline parallelism is the third form of parallel execution. This involves taking a complex task
and breaking it down in to sub -tasks. And I would like to use an analogy to describe this because
people some times get afraid about pipeline parallelism What I argue is that everybody inherently
known what pipeline parallelism is, they just don't know what by that mean.
Pipelining: Time Chart
Figure-25.3: Pipelining: Time Chart
Pipelining: Speed-Up Calculation
Time for sequential execution of 1 task  = T
Time for sequential execution of N tasks = N * T
(Ideal) time for pipelined execution of one task using an M stage pipeline = T
(Ideal) time for pipelined execution of N tasks using an M stage pipeline = T + ((N-1) × (T/M))
Speed -up (S) =
Pipeline parallelism focuses on increasing throughput of task execution, NOT on decreasing sub -
task execution time .
If we take a "big" task that takes T time units to execute on a single data item, then execution of
this task on N data items will take N*T time units.
If we break the big task into M equal sized subtasks then we can overlap execution using the
pipeline technique. It will take T time units for the first data item to be processed through the
pipeline, but after that we will get one data item every T/M time units (assuming no pipeline
Notice that the time for processing a single data item (latency) does not change with pipelining -
only the rate at which we process a large number of data items (throughput) is enhanced with
pipelined execution.
What happens is that as the number of laundry loads increases the speedup factor approaches the
number of stages in the pipeline but never quite gets there. Because the first load of laundry has
to get through costing me full T. and then every body after that is 3/T. So the more loads of
laundry the more amortize the cost of filling the pipeline. So this is my excuse for saving laundry
to the end of the month because it is much more efficient all right putting a 100 loads of laundry.
Pipeline has a reasonable good speedup factor if you have many stage pipeline and you have a
good distribution of work. Now, normally when we think of pipeline as hardware pipeline
Especially people who come from architecture background. But this can also be a software
pipeline. In databases it is really a software pipeline. So if we go back to the earlier slide and we
ask a question about how can I have pipeline in this example? Is that when I join these two
together in a merge join. I can take result of that and put in to a hash table at the same time as I
am joining rows here. So I am simultaneously doing merges and hashes and producing results
then. And it turns out that again all the pipelines have to be relatively balanced to get that work
well and it may or may not, I don't know about this particular example, I have not worked that.
But you can see theoretically how you can have software pipeline in addition to data and spatial
Pipelining: Speed-Up Example
Example: Bottling soft drinks in a factory
= 10 × T
Sequential execution
= T + T × (9-1)/3 = 4 × T
Fill bottle, Seal bottle, Label Bottle pipeline
Speed-up = 2.50
= 20 × T
Sequential execution
= T + T × (20-1)/3 = 7.3 × T
Fill bottle, Seal bottle, Label Bottle pipeline
Speed-up = 2.72
= 40 × T
Sequential execution
= T + T × (40-1)/3 = 14.0 × T
Fill bottle, Seal bottle, Label Bottle pipeline
Speed-up = 2.85
Since we have taken the bottling task and divided it into three subtasks (filling, sealing and
labeling), we have a three stage (M=3) pipeline.
The speed-up factor is the time it takes to execute in a sequential (non -pipelined) fashion divided
by the time it takes to execute with pipelining.
Notice that the speed-up factor gets larger as the number of crates of bottles increases. This is
because the start -up cost for the pipeline (getting the first load of bottles through) is amortized
over a larger number of laundry "executions" which thereafter are delivered every T/M time
units. The maximum speed-up factor achievable is equal to the number of pipeline stages (M=3
in our example).
Pipelining: Input vs. Speed-Up
10 11 12 13 14 15 16 17 18 19
Input (N)
Asymptotic limit on speed-up for M stage pipeline is M.
The speed-up will NEVER be M, as initially filling the pipeline took T time units.
Figure-25.4: Pipelining: Input vs. Speed-U p
Pipelining: Limitations
Pipeline parallelism is a good fit for data warehousing (where we are working with lots of data),
but it makes no sense for OLTP because OLTP tasks are not big enough to justify breaking them
down into subtasks.
In order to eliminate overhead in breaking a big task into an M -stage pipeline, it is important that
there is efficient synchronization between producer/consumer tasks in the pipeline execution.
Also, it is important to avoid any tasks for which all data must be processed before moving on to
the next step because this results in pipeline stalls (where we cannot feed the next execution step
until all data has been processed through the current execution step). Sort -merge joins "kill"
pipeline execution because the sort will stall the pipeline. Hash join s do very well because there
is no pipeline stall ­ provided that you have large amounts of memory available to accommodate
all the parallel (in -memory) hashing activity.
Performance is dictated by the slowest stage in the pipeline. Efficient RDBMS implementations
will break up the work into pipeline stages that are relatively balanced. Informix XPS is the only
RDBMS that has truly mastered pipelined execution in a data warehouse environment - albeit
with very high communication costs for synchronization between the producer/consumer tasks.
Partitioning & Queries
Full Table Scan
Point Queries:
Range Queries
Round Robin
Hash Partitioning
Range Partitioning
Parallel Sorting
Scan in parallel, and range partition on the go.
As partitioned data becomes available, perform "local" sorting.
Resulting data is sorted and again range partitioned.
Problem: skew or "hot spot".
Solution: Sample the data at start to determine partition points.
Figure-25.5: Parallel Sorting
Consider figure-25.5 that shows a non -uniform distribution of data that defeats the purpose of a
parallel sort as processor-2 develops a hot spot. The solution is to sample the data to determine
the data demographics and then partitioning the data so as to come up with near uniform
distribution to exploit parallelism.
Skew in Partitioning
Attribute-value skew.
Partition skew .
There can be two types of skews i.e. non uniform distribution when the data is distributed across
the processors. One type of skew is dependent in the properties of the data, consider the example
of data about cancellation of reservations. It is obvious that most cancellations in the history of
airline travel occurred during the last quarter of 2001. Therefore, whenever the data is dis tributed
based on date for year 2001 it will be always skewed. This can also be looked at from the
perspective of partition skew, as date is typically seen to result in non-uniform distribution of
Handling Skew in Range-Partitioning
Construct the partition vector
Duplicate entries or imbalances
There are number of ways to handle the skew in the data when it is partitioned based on the
range, here date is a good example with data distributed based in quarters across four processors.
One solution is to sort the data this would identify the "clusters" within the data, then bases on
them more or less equal partitions could be created resulted in elimination or reduction of sekw.
Barriers to Linear Speedup & Scale-up
Amdahl' Law
As I said in the first lecture on parallelism, intuitively we feel that as the number of processors
increase, the speedup should also increase. Thus theoretically there should be a linear speedup.
However, in reality this is not the case. The biggest hurled is the Amdahl's law which we have
discussed in detail. The other problem is the startup cost, it takes a while for the system to get
started and that time when amortized over the entire processing time results in a less than a linea r
speedup. Then is the interference among different processes or the dependencies among the
processes or some operations within the problem itself such as empting of the pipeline, this
results in degradation of performance.
Finally the skew in the data. Parallelization is based on the premise that there is a full utilization
of the processors and all of them are bust most or all of the time. However, if there is a skew in
the partitioning of data i.e. a non-uniform distribution, then some of the processors will be
working while other will be idle. And the processor that takes the mosst time (which has the most
data too) will become the bottleneck.
Table of Contents:
  1. Need of Data Warehousing
  2. Why a DWH, Warehousing
  3. The Basic Concept of Data Warehousing
  4. Classical SDLC and DWH SDLC, CLDS, Online Transaction Processing
  5. Types of Data Warehouses: Financial, Telecommunication, Insurance, Human Resource
  6. Normalization: Anomalies, 1NF, 2NF, INSERT, UPDATE, DELETE
  7. De-Normalization: Balance between Normalization and De-Normalization
  8. DeNormalization Techniques: Splitting Tables, Horizontal splitting, Vertical Splitting, Pre-Joining Tables, Adding Redundant Columns, Derived Attributes
  9. Issues of De-Normalization: Storage, Performance, Maintenance, Ease-of-use
  10. Online Analytical Processing OLAP: DWH and OLAP, OLTP
  11. OLAP Implementations: MOLAP, ROLAP, HOLAP, DOLAP
  12. ROLAP: Relational Database, ROLAP cube, Issues
  13. Dimensional Modeling DM: ER modeling, The Paradox, ER vs. DM,
  14. Process of Dimensional Modeling: Four Step: Choose Business Process, Grain, Facts, Dimensions
  15. Issues of Dimensional Modeling: Additive vs Non-Additive facts, Classification of Aggregation Functions
  16. Extract Transform Load ETL: ETL Cycle, Processing, Data Extraction, Data Transformation
  17. Issues of ETL: Diversity in source systems and platforms
  18. Issues of ETL: legacy data, Web scrapping, data quality, ETL vs ELT
  19. ETL Detail: Data Cleansing: data scrubbing, Dirty Data, Lexical Errors, Irregularities, Integrity Constraint Violation, Duplication
  20. Data Duplication Elimination and BSN Method: Record linkage, Merge, purge, Entity reconciliation, List washing and data cleansing
  21. Introduction to Data Quality Management: Intrinsic, Realistic, Orr’s Laws of Data Quality, TQM
  22. DQM: Quantifying Data Quality: Free-of-error, Completeness, Consistency, Ratios
  23. Total DQM: TDQM in a DWH, Data Quality Management Process
  24. Need for Speed: Parallelism: Scalability, Terminology, Parallelization OLTP Vs DSS
  25. Need for Speed: Hardware Techniques: Data Parallelism Concept
  26. Conventional Indexing Techniques: Concept, Goals, Dense Index, Sparse Index
  27. Special Indexing Techniques: Inverted, Bit map, Cluster, Join indexes
  28. Join Techniques: Nested loop, Sort Merge, Hash based join
  29. Data mining (DM): Knowledge Discovery in Databases KDD
  31. Data Structures, types of Data Mining, Min-Max Distance, One-way, K-Means Clustering
  32. DWH Lifecycle: Data-Driven, Goal-Driven, User-Driven Methodologies
  33. DWH Implementation: Goal Driven Approach
  34. DWH Implementation: Goal Driven Approach
  35. DWH Life Cycle: Pitfalls, Mistakes, Tips
  36. Course Project
  37. Contents of Project Reports
  38. Case Study: Agri-Data Warehouse
  39. Web Warehousing: Drawbacks of traditional web sear ches, web search, Web traffic record: Log files
  40. Web Warehousing: Issues, Time-contiguous Log Entries, Transient Cookies, SSL, session ID Ping-pong, Persistent Cookies
  41. Data Transfer Service (DTS)
  42. Lab Data Set: Multi -Campus University
  43. Extracting Data Using Wizard
  44. Data Profiling