Lecture-25
Need for Speed: Hardware Techniques
Data Parallelism: Concept
§ Parallel execution of a single data manipulation task across multiple partitions of data.
§ Partitions static or dynamic.
§ Tasks executed almost independently across partitions.
§ A "query coordinator" must coordinate between the independently executing processes.
Data parallelism is, I think, the simplest form of parallelization. The idea is that we have parallel execution of a single data operation across multiple partitions of data. These partitions may be defined statically or dynamically; either way, we are running the same operator across the multiple partitions concurrently. This idea of data parallelism has existed for a very long time. We get parallelization because the data manipulation executes semi-independently across the partitions, and as long as we keep the required coordination small, we can get very good speedups. Again, the query coordinator is the thing that keeps the query distributed but still working, and then collects its results. That query coordinator can potentially be a bottleneck, because whatever work it does is serial execution. So query coordination has to be a very small amount of work; otherwise the overhead gets higher and the serialization of the workload gets higher.
Data Parallelism: Example
Figure-25.1: Example of Data Parallelism
Suppose we have a query: we want to count the number of accounts where the balance is greater than $5,000 and the open date is after the first of June 2000, against the account table. What we end up doing is sending the query to each query server, and each query server then runs the query against a particular partition, as shown in Figure 25.1. We use as many query servers as we need, and each produces a partial result. So in the data blocks processed by one server, 1,235 rows met the criteria; here we get 536, and here this one found 2,016. The query coordinator just takes these numbers; if there is, let us say, a 100-degree of parallelism, then I take 100 numbers, add them, and get the result. Remember that the query coordinator is software.
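As a rough illustration of this pattern (our own sketch, not code from the lecture; the function names and toy data are hypothetical), each "query server" counts the qualifying rows in its own partition, and the coordinator only adds the partial counts:

```python
from concurrent.futures import ProcessPoolExecutor
from datetime import date

def count_partition(rows):
    """Query-server work: count accounts with balance > 5,000 opened after 1-Jun-2000."""
    return sum(1 for balance, opened in rows
               if balance > 5000 and opened > date(2000, 6, 1))

def parallel_count(partitions):
    """Query coordinator: one query server per partition, then just add the partial counts."""
    with ProcessPoolExecutor() as pool:
        partial_counts = list(pool.map(count_partition, partitions))
    return sum(partial_counts)   # the only serial work the coordinator does

if __name__ == "__main__":
    # Two toy partitions of (balance, open_date) rows -- purely illustrative data.
    partitions = [
        [(7500, date(2001, 3, 1)), (900, date(2002, 5, 5))],
        [(12000, date(1999, 1, 1)), (6100, date(2003, 8, 9))],
    ]
    print(parallel_count(partitions))   # -> 2
```

The coordinator's serial work here is a single sum over the partial counts, which is exactly what keeps it from becoming a bottleneck.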
Data Parallelism: Ensuring Speed-Up
To get a speed-up of N with N partitions, it must be ensured that:
§ There are enough computing resources.
§ The query coordinator is very fast as compared to the query servers.
§ The work done in each partition is almost the same, to avoid performance bottlenecks.
§ The same number of records in each partition would not suffice.
§ We need a uniform distribution of records w.r.t. the filter criterion across partitions.
Linear speed-up with data parallelism should be achievable as long as the coordinator is not burdened with too much work and the workload is distributed relatively evenly across the participating query servers. If the coordinator does too much work, then the serial processing involved in constructing the final result will significantly impact overall performance (remember Amdahl's Law!).
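As a reminder of what Amdahl's Law says in this setting (the notation below is ours, not the slide's): if s is the serial fraction of the work (the coordinator's share) and N is the number of query servers, then

Speed-up = 1 / (s + (1 - s)/N)

For example, if the coordinator accounts for just 5% of the total work (s = 0.05), then even with N = 100 query servers the speed-up is only 1 / (0.05 + 0.95/100) ≈ 16.8, nowhere near 100.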
Depending on CPU speed and RDBMS efficiency, the number of data partitions used in the parallel execution of a query may be greater than, less than, or equal to the number of CPUs. Faster CPUs and more efficient (lightweight) query server implementations will generally use more partitions than CPUs (so that more overlap can take place between computational and I/O processing). Less efficient (heavyweight) query server implementations or slow CPUs will tend toward fewer data partitions than CPUs. There is usually a one-to-one correspondence between query servers and data partitions.
Note: You should not confuse data partitions used for the purposes of data parallelism with range partitions used to divide up large tables for ease of management; they are two different things!
The first thing is that the amount of work done by the query coordinator should be very small. It should be small because that work is serial, and if it is serial it will kill the performance. So it should be almost zero. The second thing is that the workload performed in each partition should be relatively equal. If one query server has much more work to do than any other query server, then everybody waits for the slowest one to finish. So in a parallel execution environment, the distribution of work is very important.
Every database in the marketplace today that is serious about parallel query capability has data parallelism. This is sort of the cost of playing the game: you have to at least be able to do this to be considered a serious competitor.
Spatial Parallelism (pipelining)
Figure-25.2: Spatial Parallelism (pipelining)
The key idea behind pipeline parallelism is to take a "big" task and break it into subtasks that can be processed concurrently on a stream of data inputs in multiple, overlapping stages of execution.

Pipeline parallelism is the third form of parallel execution. It involves taking a complex task and breaking it down into sub-tasks. I would like to use an analogy to describe this, because people sometimes get scared of pipeline parallelism. What I argue is that everybody inherently knows what pipeline parallelism is; they just don't know it by that name.
Pipelining: Time Chart
Figure-25.3: Pipelining: Time Chart
Pipelining: Speed-Up Calculation
Time for sequential execution of 1 task = T
Time for sequential execution of N tasks = N × T
(Ideal) time for pipelined execution of one task using an M-stage pipeline = T
(Ideal) time for pipelined execution of N tasks using an M-stage pipeline = T + ((N-1) × (T/M))
Speed-up (S) = (N × T) / (T + ((N-1) × (T/M))) = (N × M) / (M + N - 1)
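To make the arithmetic concrete, here is a minimal sketch of this ideal speed-up formula (ours, not from the lecture; the function name is hypothetical). The printed values match the bottling example worked later in this lecture, which the slides round slightly differently (2.72 and 2.85):

```python
def pipeline_speedup(n_items, n_stages):
    """Ideal speed-up of an M-stage pipeline processing N items:
    sequential time N*T divided by pipelined time T + (N-1)*T/M,
    which simplifies to N*M / (M + N - 1)."""
    return (n_items * n_stages) / (n_stages + n_items - 1)

# Three-stage pipeline (M = 3), as in the bottling example below:
for n in (10, 20, 40):
    print(n, "crates ->", round(pipeline_speedup(n, 3), 2))   # 2.5, 2.73, 2.86
```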
Pipeline parallelism focuses on increasing the throughput of task execution, NOT on decreasing sub-task execution time.
If we take a "big" task that takes T time units to execute on a single data item, then execution of this task on N data items will take N × T time units. If we break the big task into M equal-sized subtasks, then we can overlap execution using the pipeline technique. It will take T time units for the first data item to be processed through the pipeline, but after that we will get one data item every T/M time units (assuming no pipeline overhead).

Notice that the time for processing a single data item (latency) does not change with pipelining; only the rate at which we process a large number of data items (throughput) is enhanced by pipelined execution.
What happens is that as the number of laundry loads increases, the speed-up factor approaches the number of stages in the pipeline but never quite gets there, because the first load of laundry has to get all the way through, costing the full T, and everybody after that arrives every T/M time units. So the more loads of laundry, the more we amortize the cost of filling the pipeline. This is my excuse for saving laundry to the end of the month: it is much more efficient to put through 100 loads of laundry.
Pipelining has a reasonably good speed-up factor if you have a many-stage pipeline and a good distribution of work. Normally we think of a pipeline as a hardware pipeline, especially people who come from an architecture background, but this can also be a software pipeline; in databases it really is a software pipeline. So if we go back to the earlier slide and ask how we could have a pipeline in this example: when I join these two tables together in a merge join, I can take the result of that and put it into a hash table at the same time as I am joining rows, so I am simultaneously doing merges and hashes and producing results. It turns out, again, that all the pipeline stages have to be relatively balanced for this to work well, and it may or may not be in this particular example; I have not worked that out. But you can see, theoretically, how you can have a software pipeline in addition to data and spatial parallelism.
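As a rough illustration of a software pipeline (ours, not an example from the lecture), the sketch below chains Python generators so that a scan stage, a filter stage, and a hash-build stage overlap on a stream of rows. The stage names and row layout are hypothetical, and a simple filter stands in for the merge join mentioned above to keep it short; in CPython, generators only model the streaming dataflow, whereas a real query engine would run the stages in separate threads or processes.

```python
def scan(table):
    """Stage 1: stream rows out of an in-memory table, one at a time."""
    for row in table:
        yield row

def filter_rows(rows, predicate):
    """Stage 2: pass along only the rows that satisfy the predicate."""
    for row in rows:
        if predicate(row):
            yield row

def hash_build(rows, key):
    """Stage 3: build the hash table while stage 2 is still producing rows."""
    table = {}
    for row in rows:
        table.setdefault(key(row), []).append(row)
    return table

# Usage: the stages are chained, so no stage waits for the previous one to
# finish the whole input before it starts consuming rows.
accounts = [(1, 7500), (2, 3200), (3, 9100)]            # (acct_id, balance)
ht = hash_build(filter_rows(scan(accounts), lambda r: r[1] > 5000),
                key=lambda r: r[0])
print(ht)   # {1: [(1, 7500)], 3: [(3, 9100)]}
```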
Pipelining: Speed-Up Example
Example: Bottling soft drinks in a factory

10 crate loads of bottles:
Sequential execution = 10 × T
Fill bottle, Seal bottle, Label bottle pipeline = T + T × (10-1)/3 = 4 × T
Speed-up = 2.50
20 crate loads of bottles:
Sequential execution = 20 × T
Fill bottle, Seal bottle, Label bottle pipeline = T + T × (20-1)/3 = 7.3 × T
Speed-up = 2.72

40 crate loads of bottles:
Sequential execution = 40 × T
Fill bottle, Seal bottle, Label bottle pipeline = T + T × (40-1)/3 = 14.0 × T
Speed-up = 2.85
Since we have taken the bottling task and divided it into three subtasks (filling, sealing and labeling), we have a three-stage (M=3) pipeline. The speed-up factor is the time it takes to execute in a sequential (non-pipelined) fashion divided by the time it takes to execute with pipelining.

Notice that the speed-up factor gets larger as the number of crates of bottles increases. This is because the start-up cost for the pipeline (getting the first load of bottles through) is amortized over a larger number of crate "executions", which thereafter are delivered every T/M time units. The maximum speed-up factor achievable is equal to the number of pipeline stages (M=3 in our example).
Pipelining: Input vs. Speed-Up
The asymptotic limit on the speed-up for an M-stage pipeline is M. The speed-up will NEVER be exactly M, as initially filling the pipeline takes T time units.
Figure-25.4: Pipelining: Input (N) vs. Speed-Up
Pipelining: Limitations
Pipeline parallelism is a good fit for data warehousing (where we are working with lots of data), but it makes no sense for OLTP, because OLTP tasks are not big enough to justify breaking them down into subtasks.

In order to eliminate overhead in breaking a big task into an M-stage pipeline, it is important that there is efficient synchronization between producer/consumer tasks in the pipeline execution.

Also, it is important to avoid any tasks for which all data must be processed before moving on to the next step, because this results in pipeline stalls (where we cannot feed the next execution step until all data has been processed through the current execution step). Sort-merge joins "kill" pipeline execution because the sort will stall the pipeline. Hash joins do very well, because there is no pipeline stall, provided that you have large amounts of memory available to accommodate all the parallel (in-memory) hashing activity.
Performance is dictated by the slowest stage in the pipeline. Efficient RDBMS implementations will break up the work into pipeline stages that are relatively balanced. Informix XPS is the only RDBMS that has truly mastered pipelined execution in a data warehouse environment, albeit with very high communication costs for synchronization between the producer/consumer tasks.
Partitioning & Queries
§ Full Table Scan
§ Point Queries
§ Range Queries
§ Round Robin
§ Hash Partitioning
§ Range Partitioning
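The slide above lists query types alongside partitioning schemes. As a rough sketch of the three schemes (ours, not from the lecture; function names are hypothetical, and Python's built-in hash() stands in for whatever hash function an RDBMS would actually use), with a comment noting which query type each scheme serves best:

```python
def round_robin_partition(rows, n):
    """Deal rows out to n partitions in turn; every partition helps on a full table scan."""
    return [rows[i::n] for i in range(n)]

def hash_partition(rows, n, key):
    """Route each row by hash(key) mod n; a point (equality) query touches one partition.
    A real system would use a stable hash rather than Python's salted hash()."""
    parts = [[] for _ in range(n)]
    for row in rows:
        parts[hash(key(row)) % n].append(row)
    return parts

def range_partition(rows, boundaries, key):
    """Route each row to the first range whose upper bound exceeds its key; a range
    query touches few partitions, but a bad boundary choice causes skew."""
    parts = [[] for _ in range(len(boundaries) + 1)]
    for row in rows:
        k = key(row)
        idx = next((i for i, b in enumerate(boundaries) if k < b), len(boundaries))
        parts[idx].append(row)
    return parts
```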
Parallel Sorting
§ Scan in parallel, and range-partition on the go.
§ As partitioned data becomes available, perform "local" sorting.
§ Resulting data is sorted and again range-partitioned.
§ Problem: skew or "hot spot".
§ Solution: Sample the data at the start to determine partition points.
Figure-25.5: Parallel Sorting
Consider Figure-25.5, which shows a non-uniform distribution of data that defeats the purpose of a parallel sort, as processor-2 develops a hot spot. The solution is to sample the data to determine the data demographics, and then partition the data so as to come up with a near-uniform distribution to exploit parallelism.
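A minimal sketch of the sampling step (ours, not from the lecture; the function name, the sample size, and the use of random.sample are assumptions):

```python
import random

def choose_partition_points(rows, n_partitions, key, sample_size=1000):
    """Sample the input, sort the sample, and take evenly spaced splitters so
    that each range partition receives roughly the same number of rows."""
    sample = sorted(key(r) for r in random.sample(rows, min(sample_size, len(rows))))
    step = max(1, len(sample) // n_partitions)
    return [sample[min(i * step, len(sample) - 1)]
            for i in range(1, n_partitions)]
```

The resulting boundaries can be fed to a range-partition step like the one sketched under "Partitioning & Queries"; each processor then sorts its own partition locally, and the sorted partitions concatenate into a globally sorted result.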
Skew in Partitioning
§ Attribute-value skew.
§ Partition skew.
There can be two types of skew, i.e. non-uniform distribution when the data is spread across the processors. One type of skew depends on the properties of the data itself; consider the example of data about cancellation of reservations. It is obvious that most cancellations in the history of airline travel occurred during the last quarter of 2001. Therefore, whenever the data for the year 2001 is distributed based on date, it will always be skewed. This can also be looked at from the perspective of partition skew, as date is typically seen to result in a non-uniform distribution of data.
Handling Skew in Range-Partitioning
§ Sort
§ Construct the partition vector
§ Duplicate entries or imbalances
There are a number of ways to handle skew in the data when it is partitioned based on range; here date is a good example, with the data distributed by quarters across four processors. One solution is to sort the data, which would identify the "clusters" within it; based on them, more or less equal partitions could be created, resulting in the elimination or reduction of skew.
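One way to read the three bullets above is sketched below (ours, not the lecture's code; the function name is hypothetical, and the keys are assumed to be already sorted): pick splitters at equal intervals from the sorted key column, and collapse duplicate splitters, since a heavily repeated value signals an imbalance that needs further handling.

```python
def build_partition_vector(sorted_keys, n_partitions):
    """Given the (already sorted) key column, pick n_partitions - 1 splitters at
    equal intervals; duplicate splitters are collapsed, flagging values so
    frequent that they would overload a single partition."""
    step = max(1, len(sorted_keys) // n_partitions)
    vector = []
    for i in range(1, n_partitions):
        splitter = sorted_keys[min(i * step, len(sorted_keys) - 1)]
        if not vector or splitter != vector[-1]:   # drop duplicate entries
            vector.append(splitter)
    return vector
```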
Barriers to Linear Speedup & Scale-up
§ Amdahl's Law
§ Startup
§ Interference
§ Skew
As I said in the first lecture on parallelism, intuitively we feel that as the number of processors increases, the speedup should also increase; thus, theoretically, there should be a linear speedup. However, in reality this is not the case. The biggest hurdle is Amdahl's Law, which we have discussed in detail. The other problem is the startup cost: it takes a while for the system to get started, and that time, when amortized over the entire processing time, results in a less-than-linear speedup. Then there is the interference among different processes, or the dependencies among the processes, or some operations within the problem itself, such as emptying of the pipeline; this results in degradation of performance. Finally, there is the skew in the data. Parallelization is based on the premise that there is full utilization of the processors and all of them are busy most or all of the time. However, if there is skew in the partitioning of the data, i.e. a non-uniform distribution, then some of the processors will be working while others will be idle, and the processor that takes the most time (which also has the most data) will become the bottleneck.