Lecture-25
Need for Speed: Hardware Techniques
Data Parallelism: Concept
§ Parallel execution of a single data manipulation task across multiple partitions of data.
§ Partitions static or dynamic.
§ Tasks executed almost independently across partitions.
§ A "query coordinator" must coordinate between the independently executing processes.
Data parallelism is, I think, the simplest form of parallelization. The idea is that we have parallel execution of a single data operation across multiple partitions of data. These partitions may be defined statically or dynamically; either way, we are running the same operator across the multiple partitions concurrently. This idea of data parallelism has existed for a very long time. We get parallelization because the data manipulation executes semi-independently across the partitions, and as long as we keep the required coordination small, we can get very good speedups. Again, the query coordinator is the thing that keeps the query distributed but still working, and then collects its results. That query coordinator can potentially be a bottleneck, because whatever work it does is serial execution. So query coordination has to be a very small amount of work; otherwise the overhead gets higher and the serialization of the workload gets higher.
Data Parallelism: Example
Figure-25.1: Example of Data Parallelism
Suppose we have a query: we want to count the number of accounts where the balance is greater than $5,000 and the open date is after the first of June 2000, against the account table. What we end up doing is sending the query to each query server, and each query server then runs the query against a particular partition, as shown in Figure 25.1. We use as many query servers as we need, and each produces a partial result. So in the data blocks processed by one server, 1,235 rows met the criteria; here we get 536, and here this one found 2,016. The query coordinator just takes these numbers; if there is, let us say, a 100-degree of parallelism, then I take 100 numbers, add them, and get the result. Remember that the query coordinator is software.
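As a rough illustration of this pattern (our own sketch, not code from the lecture; the function names and toy data are hypothetical), each "query server" counts the qualifying rows in its own partition, and the coordinator only adds the partial counts:

```python
from concurrent.futures import ProcessPoolExecutor
from datetime import date

def count_partition(rows):
    """Query-server work: count accounts with balance > 5,000 opened after 1-Jun-2000."""
    return sum(1 for balance, opened in rows
               if balance > 5000 and opened > date(2000, 6, 1))

def parallel_count(partitions):
    """Query coordinator: one query server per partition, then just add the partial counts."""
    with ProcessPoolExecutor() as pool:
        partial_counts = list(pool.map(count_partition, partitions))
    return sum(partial_counts)   # the only serial work the coordinator does

if __name__ == "__main__":
    # Two toy partitions of (balance, open_date) rows -- purely illustrative data.
    partitions = [
        [(7500, date(2001, 3, 1)), (900, date(2002, 5, 5))],
        [(12000, date(1999, 1, 1)), (6100, date(2003, 8, 9))],
    ]
    print(parallel_count(partitions))   # -> 2
```

The coordinator's serial work here is a single sum over the partial counts, which is exactly what keeps it from becoming a bottleneck.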
Data Parallelism: Ensuring Speed-Up
To get a speed-up of N with N partitions, it must be ensured that:
§ There are enough computing resources.
§ The query coordinator is very fast as compared to the query servers.
§ The work done in each partition is almost the same, to avoid performance bottlenecks.
§ The same number of records in each partition would not suffice.
§ We need a uniform distribution of records w.r.t. the filter criterion across partitions.
Linear speed-up with data parallelism should be achievable as long as the coordinator is not burdened with too much work and the workload is distributed relatively evenly across the participating query servers. If the coordinator does too much work, then the serial processing involved in constructing the final result will significantly impact overall performance (remember Amdahl's Law!).
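As a reminder of what Amdahl's Law says in this setting (the notation below is ours, not the slide's): if s is the serial fraction of the work (the coordinator's share) and N is the number of query servers, then

Speed-up = 1 / (s + (1 - s)/N)

For example, if the coordinator accounts for just 5% of the total work (s = 0.05), then even with N = 100 query servers the speed-up is only 1 / (0.05 + 0.95/100) ≈ 16.8, nowhere near 100.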
Depending on CPU speed and RDBMS efficiency, the number of data partitions used in the parallel execution of a query may be greater than, less than, or equal to the number of CPUs. Faster CPUs and more efficient (lightweight) query server implementations will generally use more partitions than CPUs (so that more overlap can take place between computational and I/O processing). Less efficient (heavyweight) query server implementations or slow CPUs will tend toward fewer data partitions than CPUs. There is usually a one-to-one correspondence between query servers and data partitions.
Note: You should not confuse data partitions used for the purposes of data parallelism with range partitions used to divide up large tables for ease of management; they are two different things!
The first thing is that the amount of work done by the query coordinator should be very small. It should be small because that work is serial, and if it is serial it will kill the performance. So it should be almost zero. The second thing is that the workload performed in each partition should be relatively equal. If one query server has much more work to do than any other query server, then everybody waits for the slowest one to finish. So in a parallel execution environment, the distribution of work is very important.
Every database in the marketplace today that is serious about parallel query capability has data parallelism. This is sort of the cost of playing the game: you have to at least be able to do this to be considered a serious competitor.
Spatial Parallelism (pipelining)
Figure-25.2: Spatial Parallelism (pipelining)
The key idea behind pipeline parallelism is to take a "big" task and break it into subtasks that can be processed concurrently on a stream of data inputs in multiple, overlapping stages of execution.

Pipeline parallelism is the third form of parallel execution. It involves taking a complex task and breaking it down into sub-tasks. I would like to use an analogy to describe this, because people sometimes get scared of pipeline parallelism. What I argue is that everybody inherently knows what pipeline parallelism is; they just don't know it by that name.
Pipelining: Time Chart
Figure-25.3: Pipelining: Time Chart
Pipelining: Speed-Up Calculation
Time for sequential execution of 1 task = T
Time for sequential execution of N tasks = N × T
(Ideal) time for pipelined execution of one task using an M-stage pipeline = T
(Ideal) time for pipelined execution of N tasks using an M-stage pipeline = T + ((N-1) × (T/M))
Speed-up (S) = (N × T) / (T + ((N-1) × (T/M))) = (N × M) / (M + N - 1)
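To make the arithmetic concrete, here is a minimal sketch of this ideal speed-up formula (ours, not from the lecture; the function name is hypothetical). The printed values match the bottling example worked later in this lecture, which the slides round slightly differently (2.72 and 2.85):

```python
def pipeline_speedup(n_items, n_stages):
    """Ideal speed-up of an M-stage pipeline processing N items:
    sequential time N*T divided by pipelined time T + (N-1)*T/M,
    which simplifies to N*M / (M + N - 1)."""
    return (n_items * n_stages) / (n_stages + n_items - 1)

# Three-stage pipeline (M = 3), as in the bottling example below:
for n in (10, 20, 40):
    print(n, "crates ->", round(pipeline_speedup(n, 3), 2))   # 2.5, 2.73, 2.86
```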
Pipeline parallelism focuses on increasing the throughput of task execution, NOT on decreasing sub-task execution time.
If we take a "big" task that takes T time units to execute on a single data item, then execution of this task on N data items will take N × T time units. If we break the big task into M equal-sized subtasks, then we can overlap execution using the pipeline technique. It will take T time units for the first data item to be processed through the pipeline, but after that we will get one data item every T/M time units (assuming no pipeline overhead).

Notice that the time for processing a single data item (latency) does not change with pipelining; only the rate at which we process a large number of data items (throughput) is enhanced by pipelined execution.
What happens is that as the number of laundry loads increases, the speed-up factor approaches the number of stages in the pipeline but never quite gets there, because the first load of laundry has to get all the way through, costing the full T, and everybody after that arrives every T/M time units. So the more loads of laundry, the more we amortize the cost of filling the pipeline. This is my excuse for saving laundry to the end of the month: it is much more efficient to put through 100 loads of laundry.
Pipelining has a reasonably good speed-up factor if you have a many-stage pipeline and a good distribution of work. Normally we think of a pipeline as a hardware pipeline, especially people who come from an architecture background, but this can also be a software pipeline; in databases it really is a software pipeline. So if we go back to the earlier slide and ask how we could have a pipeline in this example: when I join these two tables together in a merge join, I can take the result of that and put it into a hash table at the same time as I am joining rows, so I am simultaneously doing merges and hashes and producing results. It turns out, again, that all the pipeline stages have to be relatively balanced for this to work well, and it may or may not be in this particular example; I have not worked that out. But you can see, theoretically, how you can have a software pipeline in addition to data and spatial parallelism.
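As a rough illustration of a software pipeline (ours, not an example from the lecture), the sketch below chains Python generators so that a scan stage, a filter stage, and a hash-build stage overlap on a stream of rows. The stage names and row layout are hypothetical, and a simple filter stands in for the merge join mentioned above to keep it short; in CPython, generators only model the streaming dataflow, whereas a real query engine would run the stages in separate threads or processes.

```python
def scan(table):
    """Stage 1: stream rows out of an in-memory table, one at a time."""
    for row in table:
        yield row

def filter_rows(rows, predicate):
    """Stage 2: pass along only the rows that satisfy the predicate."""
    for row in rows:
        if predicate(row):
            yield row

def hash_build(rows, key):
    """Stage 3: build the hash table while stage 2 is still producing rows."""
    table = {}
    for row in rows:
        table.setdefault(key(row), []).append(row)
    return table

# Usage: the stages are chained, so no stage waits for the previous one to
# finish the whole input before it starts consuming rows.
accounts = [(1, 7500), (2, 3200), (3, 9100)]            # (acct_id, balance)
ht = hash_build(filter_rows(scan(accounts), lambda r: r[1] > 5000),
                key=lambda r: r[0])
print(ht)   # {1: [(1, 7500)], 3: [(3, 9100)]}
```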
Pipelining: Speed-Up Example
Example: Bottling soft drinks in a factory

10 crate loads of bottles:
Sequential execution = 10 × T
Fill bottle, Seal bottle, Label bottle pipeline = T + T × (10-1)/3 = 4 × T
Speed-up = 2.50
20 crate loads of bottles:
Sequential execution = 20 × T
Fill bottle, Seal bottle, Label bottle pipeline = T + T × (20-1)/3 = 7.3 × T
Speed-up = 2.72

40 crate loads of bottles:
Sequential execution = 40 × T
Fill bottle, Seal bottle, Label bottle pipeline = T + T × (40-1)/3 = 14.0 × T
Speed-up = 2.85
Since we have taken the bottling task and divided it into three subtasks (filling, sealing and labeling), we have a three-stage (M=3) pipeline. The speed-up factor is the time it takes to execute in a sequential (non-pipelined) fashion divided by the time it takes to execute with pipelining.

Notice that the speed-up factor gets larger as the number of crates of bottles increases. This is because the start-up cost for the pipeline (getting the first load of bottles through) is amortized over a larger number of crate "executions", which thereafter are delivered every T/M time units. The maximum speed-up factor achievable is equal to the number of pipeline stages (M=3 in our example).
Pipelining: Input vs. Speed-Up
The asymptotic limit on the speed-up for an M-stage pipeline is M. The speed-up will NEVER be exactly M, as initially filling the pipeline takes T time units.
Figure-25.4: Pipelining: Input (N) vs. Speed-Up
Pipelining: Limitations
Pipeline parallelism is a good fit for data warehousing (where we are working with lots of data), but it makes no sense for OLTP, because OLTP tasks are not big enough to justify breaking them down into subtasks.

In order to eliminate overhead in breaking a big task into an M-stage pipeline, it is important that there is efficient synchronization between producer/consumer tasks in the pipeline execution.

Also, it is important to avoid any tasks for which all data must be processed before moving on to the next step, because this results in pipeline stalls (where we cannot feed the next execution step until all data has been processed through the current execution step). Sort-merge joins "kill" pipeline execution because the sort will stall the pipeline. Hash joins do very well, because there is no pipeline stall, provided that you have large amounts of memory available to accommodate all the parallel (in-memory) hashing activity.
Performance is dictated by the slowest stage in the pipeline. Efficient RDBMS implementations will break up the work into pipeline stages that are relatively balanced. Informix XPS is the only RDBMS that has truly mastered pipelined execution in a data warehouse environment, albeit with very high communication costs for synchronization between the producer/consumer tasks.
Partitioning & Queries
§ Full Table Scan
§ Point Queries
§ Range Queries
§ Round Robin
§ Hash Partitioning
§ Range Partitioning
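The slide above lists query types alongside partitioning schemes. As a rough sketch of the three schemes (ours, not from the lecture; function names are hypothetical, and Python's built-in hash() stands in for whatever hash function an RDBMS would actually use), with a comment noting which query type each scheme serves best:

```python
def round_robin_partition(rows, n):
    """Deal rows out to n partitions in turn; every partition helps on a full table scan."""
    return [rows[i::n] for i in range(n)]

def hash_partition(rows, n, key):
    """Route each row by hash(key) mod n; a point (equality) query touches one partition.
    A real system would use a stable hash rather than Python's salted hash()."""
    parts = [[] for _ in range(n)]
    for row in rows:
        parts[hash(key(row)) % n].append(row)
    return parts

def range_partition(rows, boundaries, key):
    """Route each row to the first range whose upper bound exceeds its key; a range
    query touches few partitions, but a bad boundary choice causes skew."""
    parts = [[] for _ in range(len(boundaries) + 1)]
    for row in rows:
        k = key(row)
        idx = next((i for i, b in enumerate(boundaries) if k < b), len(boundaries))
        parts[idx].append(row)
    return parts
```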
Parallel Sorting
§ Scan in parallel, and range-partition on the go.
§ As partitioned data becomes available, perform "local" sorting.
§ Resulting data is sorted and again range-partitioned.
§ Problem: skew or "hot spot".
§ Solution: Sample the data at the start to determine partition points.
Figure-25.5: Parallel Sorting
Consider Figure-25.5, which shows a non-uniform distribution of data that defeats the purpose of a parallel sort, as processor-2 develops a hot spot. The solution is to sample the data to determine the data demographics, and then partition the data so as to come up with a near-uniform distribution to exploit parallelism.
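A minimal sketch of the sampling step (ours, not from the lecture; the function name, the sample size, and the use of random.sample are assumptions):

```python
import random

def choose_partition_points(rows, n_partitions, key, sample_size=1000):
    """Sample the input, sort the sample, and take evenly spaced splitters so
    that each range partition receives roughly the same number of rows."""
    sample = sorted(key(r) for r in random.sample(rows, min(sample_size, len(rows))))
    step = max(1, len(sample) // n_partitions)
    return [sample[min(i * step, len(sample) - 1)]
            for i in range(1, n_partitions)]
```

The resulting boundaries can be fed to a range-partition step like the one sketched under "Partitioning & Queries"; each processor then sorts its own partition locally, and the sorted partitions concatenate into a globally sorted result.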
Skew in Partitioning
§ Attribute-value skew.
§ Partition skew.
There can be two types of skew, i.e. non-uniform distribution when the data is spread across the processors. One type of skew depends on the properties of the data itself; consider the example of data about cancellation of reservations. It is obvious that most cancellations in the history of airline travel occurred during the last quarter of 2001. Therefore, whenever the data for the year 2001 is distributed based on date, it will always be skewed. This can also be looked at from the perspective of partition skew, as date is typically seen to result in a non-uniform distribution of data.
Handling Skew in Range-Partitioning
§ Sort
§ Construct the partition vector
§ Duplicate entries or imbalances
There are a number of ways to handle skew in the data when it is partitioned based on range; here date is a good example, with the data distributed by quarters across four processors. One solution is to sort the data, which would identify the "clusters" within it; based on them, more or less equal partitions could be created, resulting in the elimination or reduction of skew.
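One way to read the three bullets above is sketched below (ours, not the lecture's code; the function name is hypothetical, and the keys are assumed to be already sorted): pick splitters at equal intervals from the sorted key column, and collapse duplicate splitters, since a heavily repeated value signals an imbalance that needs further handling.

```python
def build_partition_vector(sorted_keys, n_partitions):
    """Given the (already sorted) key column, pick n_partitions - 1 splitters at
    equal intervals; duplicate splitters are collapsed, flagging values so
    frequent that they would overload a single partition."""
    step = max(1, len(sorted_keys) // n_partitions)
    vector = []
    for i in range(1, n_partitions):
        splitter = sorted_keys[min(i * step, len(sorted_keys) - 1)]
        if not vector or splitter != vector[-1]:   # drop duplicate entries
            vector.append(splitter)
    return vector
```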
Barriers to Linear Speedup & Scale-up
§ Amdahl's Law
§ Startup
§ Interference
§ Skew
As I said in the first lecture on parallelism, intuitively we feel that as the number of processors increases, the speedup should also increase; thus, theoretically, there should be a linear speedup. However, in reality this is not the case. The biggest hurdle is Amdahl's Law, which we have discussed in detail. The other problem is the startup cost: it takes a while for the system to get started, and that time, when amortized over the entire processing time, results in a less-than-linear speedup. Then there is the interference among different processes, or the dependencies among the processes, or some operations within the problem itself, such as emptying of the pipeline; this results in degradation of performance. Finally, there is the skew in the data. Parallelization is based on the premise that there is full utilization of the processors and all of them are busy most or all of the time. However, if there is skew in the partitioning of the data, i.e. a non-uniform distribution, then some of the processors will be working while others will be idle, and the processor that takes the most time (which also has the most data) will become the bottleneck.