|
|||||
Lecture-
28
Join
Techniques
Background
Used to
retrieve data from multiple
tables.
Joins
used frequently, hence lot of
work on improving or optimizing
them.
Simplest
join that works in most
cases is nested-loop join
but results in quadratic
time
complexity.
Tables
identified by FROM clause
and condition by WHERE clause.
Will
cover different types of
joins.
Join
commands are statements that
retrieve data from multiple
tables. A join is identified
by
multiple
tables in the FROM clause,
and the relationship between
the tables to be joined
is
established
through the existence of a
join condition in the WHERE clause.
Because joins are so
frequently
used in relational queries and
because joins are so expensive,
lot of effort has gone
into
developing
efficient join algorithms and
techniques. The simplest
i.e. nested-loop join
is
applicable in
all cases, but results in
quadratic performance. Several fast
join algorithms have
been developed
and extensively used; these
can be categorized as sort
-merge, hash-based,
and
index-based
algorithms. In this lecture we will be
covering the following
join
algorithms/techniques:
·
Nested loop
join
·
Sort
Merge Join
·
Hash
based join
·
Etc.
About
Nested-Loop Join
Typically
used in OLTP
environment.
Limited
application for DSS and
VLDB
In DSS
environment we deal with
VLDB and large sets of
data.
Traditionally
Nested -Loop join has
been and is used in OLTP
environments, but for many
reasons,
such a join mechanism is not
suitable for VLDB and
DSS environments. Nested
loop
joins
are useful when small
subsets of data are joined
and if the join condition is
an efficient way
of accessing
the inner table. Despite
these restrictions/limitations, we will begin
our discussion
with
the traditional join
technique i.e. nested loop
join, so that you can
appreciate the benefits of
the
join techniques typically used in a
VLDB and DSS
environment.
226
Nested-Loop
Join: Code
FOR i = 1 to n DO
BEGIN
/*
N rows in
T1
*/
IF ith
row of T1 qualifies THEN
BEGIN
For j = 1 to m
DO BEGIN
/*
M rows in
T2
*/
IF the
ith row of T1 matches to jth
row of T2 on join key THEN
BEGIN
IF the
jth row of T2 qualifies THEN
BEGIN
produce output
row
END
END
END
END
END
Nested loop
join wo rks like a nested
loop used in a high level
programming language, where each
value of
the index of the outer loop is
taken as a limit (or
starting point or whatever
applicable)
for
the index of the inner loop,
and corresponding actions are
performed on the
statement(s)
following
the inner loop. So basically, if
the outer loop executes R
times and for each
such
execution
the inner loop executes S
times, then the total cost
or time complexity of the nested
loop is
O(RS).
Nested-loop joins
provide efficient access when
tables are indexed on join
columns. Furthermore,
in many
small transactions, such as
those affecting only a small
set of rows, index nested
loops
joins are
far superior to both sort -merge joins
and hash joins. A nested
loop join involves
the
following
steps:
1. The optimizer
determines the major table
(i.e. Table_A) and
designates it as the
outer
table.
Table_A
is
accessed once. If the outer
table has no useful indexes, a
full table scan
is performed. If an index
can reduce I/O costs,
the index is used to locate
the rows.
2. The other
table is designated as the
inner table or Table_B. Table_B
is
accessed once for
each
qualifying row (or touple) in
Table_A.
3. For
every row in the outer
table, DBMS accesses all
the rows in the inner
table. The outer
loop is
for every row in outer
table and the inner
loop is for every row in
the inner table.
If 10 rows from
Table_A
match
the conditions in the query,
Table_B
is
accessed 10 times. If
Table_B
has a
useful index on the join
column, it might require 5 I/Os to
read the data block
for
each
scan, plus one I/O
for each data block. Hence
the total cost of accessing
Table_B
would
be
60 logical
I/Os.
227
Nested-Loop
Join: Working Example
For each
qualifying row of Personal
table,
Academic
table is examined for
matching rows.
Figure-28.1:
The process of Nested-Loop
Join without
indexing
The
process of creating the result
set for a nested-loop join
is achieved by nesting the tables,
and
scanning
the inner table repeatedly
for each qualifying row of
the outer table. This
process is
shown in Figure-28.1:
The question being asked is
"What is
the average GPS of
undergraduate
male
students?".
Nested-Loop
Join: Order of
Tables
If the outer
loop executes R times and
the inner loop executes S
times, then the time
complexity
will be
O(RS).
The
time complexity should be independent of
the order of tables i.e. O(RS) is
same as O(SR).
However, in
the context of I/Os the
order of tables does
matter.
Along
with this the relationship
between the number of
qualifyin g rows/blocks between
the two
tables
matters.
It is true that
the actual number of matching rows
returned as a result of the join
would be
independent of
the order of tables i.e.
inner or outer; actually it is
going to be even independent of
the
type of join used. Now
assume that depending on the
filter criterion, different number of
rows
is returned from
the two tables to be joined.
Let the rows returned from
Table_A be RA
and
rows
returned
from Table_B be RB, and
let R A < RB. If
Table_B would have been
the outer table, then
for
each row returned, the
inner table i.e. Table_A
would have been scanned
(assuming no
indexing)
RB times. However, if
Table_A would have been the
outer table, then for
each row
returned,
the inner table i.e.
Table_B would have been
scanned (assuming no indexing)
RA times
228
i.e.
less number of times scanned. Meaning,
even if the inner table
have fewer number of
qualifying rows,
it is going to be scanned the number of
times the qualifying rows of
the outer
table.
Hence the choice of tables
does matter.
Nested-Loop
Join: Cost Formula
Join cost =
Cost of accessing Table_A
+
# of qualifying
rows in Table_A
×
Blocks of
Table_B
to be
scanned for each qualifying
row
OR
Join cost =
Blocks accessed for Table_A
+
Blocks
accessed for Table_A
×
Blocks
accessed for Table_B
For a
high level programming
language, the time complexity of a nested
-loop join remains
unchanged if
the order of the loops
are changed i.e. the
inner and outer loops are
interchanged.
However,
this is NOT true for
Nested-Loop-Joins in the context of
DSS when we look at the
I/O.
For a
nested-loop join with two
tables, the formula for
estimating the join cost
is:
Join cost =
Cost of accessing Table_A
+
# of qualifying
rows in Table_A
×
Blocks of
Table_B
to be
scanned for each qualifying
row.
OR
Join cost =
Blocks accessed for Table_A
+
Blocks
accessed for Table_A
×
Blocks
accessed for Table_B
For a
Nested-Loop join inner and outer
tables are determined as
follows:
The outer
table is usually the one
that has:
The
smallest number of qualifying rows,
and/or
·
The
largest numbers of I/Os required to
locate the rows.
·
The inner
table usually has:
The
largest number of qualifying rows,
and/or
·
The
smallest number of reads required to
locate rows.
229
Nested-Loop
Join: Cost of reorder
Table_A = 500
blocks and
Table_B = 700
blocks.
Qualifying
blocks for Table_A
QB(A) = 50
Qualifying
blocks for Table_B
QB(B) =
100
Join
cost A&B = 500 + 50× 700 =
35,500 I/Os
Join
cost B&A = 700 + 100× 500 = 50,700
I/Os
i.e. an increase
in I/O of about 43%.
For
example, if qualifying blocks
for Table_A
QB(A) = 50
and qualifying blocks for
Table_B
QB(B) = 100
and size of Table_A is 500
blocks and size of Table_B
is 700 blocks then Join
cost
A&B = 500 +
50×
700 =
35,500 I/Os and using the
other order i.e. Table_B
outer
table and
Table_A
as
inner table, the join
cost B&A = 700 + 100× 500 =
50,700 I/Os i.e. an increase
in I/O
of about
43%.
Nested-Loop
Join: Variants
1. Naive
nested-loop join
2. Index
nested-loop join
3. Temporary index
nested-loop join
Working of Query
optimizer
There
are many variants of the
traditional nested-loop join.
The simplest case is when an
entire
table is
scanned; this is called a
naive nested -loop join. If
there is an index, and that
index is
exploited, then
it is called an index nested-loop join.
If the index is built as part of
the query plan
and
subsequently dropped, it is called as a temporary index
nested -loop join. All
these variants
are
considered by the query optimizer before
selecting the most
appropriate join
algorithm/technique.
Sort-Merge
Join
Joined tables to
be sorted as per WHERE clause of
the join predicate.
Query optimizer
scans for (cluster) index,
if exists performs join.
In the
absence of index, tables are
sorted on the columns as per
WHERE clause.
If multiple
equalities in WHERE clause,
some merge columns
used.
230
The Sort
-Merge join requires that
both tables to be joined are
sorted on those columns that
are
identified by
the equality in the WHERE clause of
the join predicate.
Subsequently the tables
are
merged
based on the join columns.
The query optimizer typically scans an index on
the columns
which
are part of the join, if
one exists on the proper set
of columns, fine, else the
tables are
sorted on
the columns to be joined ,
resulting in what is called a cluster
index. However, in
rare
cases,
there may be multiple
equalities in the WHERE clause, in
such a case, the merge
columns
are
taken from only some of
the given equality
clauses.
Because
each table is sorted, the
Sort -Merge Join operator
gets a row from each
table and
compares it
one at a time with the rows
of the other table. For
example, for equi-join
operations,
the rows
are returned if they
match/equal on the join
predicate. If they are not
equal or don't
match,
whic hever row has
the lower value is
discarded, and next row is
obtained from that
table.
This
process is repeated until
all the rows have been
exhausted.
The Sort
-Merge join process just
described works as follows:
· Sort
Table_A
and
Table_B
on
the join column in ascending
order, then scan them to do
a
``merge''
(on join column), and output
result tuples/rows.
Proceed
with scanning of Table_A
until current
A_tuple ≤ current B_tuple,
then
§
proceed
scanning of Table_B
until
current B_tuple ≤ current A_tuple;
do this
until current
A_tuple = current B_tuple.
§
At this point,
all A_tuples with same value
in Ai (current A_group) and
all
B_tuples
with same value in Bj (current
B_group) match; output
<a, b> for all
pairs of
such tuples/records.
§
Update pointers,
resume scanning Table_A
and
Table_B
.
· Table_A
is
scanned once; each B group
is scanned once per matching
Table_A
tuple.
(Multiple
scans of a B group are
likely to find needed pages
in buffer.)
· Cost: M
log M + N log N +
(M+N)
§ The
cost of scanning is M+N,
could be M*N (very
unlikely!)
231
Figure-28.2:
The process of
merging
Fig-28.2
shows the process of merging
two sorted tables with
IDs shown. Conceptually
the
merging is
similar to the merging you must
have studies in Merge_Sort in your
Algorithm course.
Sort-Merge
Join: Note
Very
fast.
Sorting can be
expensive.
Presorted
data can be obtained from
existing B-tree.
Sort-Merge join
itself is very fast, but it
can be an expensive choice if
sort operations are
required
frequently
i.e. the contents of the t
able's change often resulting in deterioration of
the sort order.
However, it
may so happen that even if
the data volume is large the
desired data can be
obtained
presorted
from existing B-tree. For
such a case sort -merge
join is often the fastest
availabl e join
algorithm.
Hash-Based
Join: Working
Suitable
for the VLDB
environment.
The
choice which table first
gets hashed plays a pivotal
role in the overall performance of
the join
operation, this
decided by the optimizer.
The joined
rows are identified by collisions i.e.
collisions are "good" in case of
hash join.
232
Hash joins
are suitable for the
VLDB environment as they are
useful for joining large data
sets or
tables.
The choice about which
table first gets hashed
plays a pivotal role in the
overall
performance of
the join operation, and
left to the optimizer. The optimizer
decides by using the
smaller of the
two tables (say) Table_A
or
data sources to build a hash
table in the main memory
on the
join key used in the
WHERE
clause. It
then scans the larger table
(say) Table_B
and
probes
the
hashed table to find the
joined rows. The joined rows
are identified by collisions i.e.
collisions
are
"good" in case of hash
join.
The optimizer
uses a hash join to join
two tables if they are
joined using an equij oin
and if either
of the
following conditions are
true:
A large amount
of data needs to be joined.
·
A large portion
of the table needs to be
joined.
·
This
method is best used when
the smaller table fits in
the available main memory.
The cost is
then
limited to a single read pass
over the data for
the two tables. Else
the "smaller" table has
to
be partitioned
which results in unnecessary
delays and degradation of performance due
to
undesirable
I/Os.
Figure-28.3:
Working of Hash-based join
Cost of
Hash -Join
· In partitioning
phase, read + write both
operations requires 2(M+N)
I/Os.
· In matching
phase, read both requires
M+N I/Os.
233
Hash-Based
Join: Large "small"
Table
Smaller of
the two tables may
grow too big to fit
into the main memory.
Optimizer performs
partitioning, but is not simple.
Multi-step
approach followed, each step
has a build phase and probe
phase.
Both tables
entirely consumed and partitioned via
hashing.
Hashing
guarantees any two joining
records will fall in same
pair of partitions.
Task
reduced to multiple, but smaller,
instances of the same
tasks.
It may so
happen that the smaller of
the two tables grows too big
to fit into the main
memory,
then
the optimizer breaks it up by
partitioning, such that a partition
can fit in the main
memory.
However, it is
not that simple because the
qualifying rows of both the tables
have to fall in the
corresponding
partition pairs that are
hashed (build) and probed.
Thus in such a case the hash
join
proceeds in
several steps. Each step has
a build phase and probe
phase. Initially, the two
tables
are entirely
consumed and partitioned (using a hash
function on the hash keys)
into multiple
partitions.
The number of such partitions is
sometimes called the
partitioning fan -out. Using
the
hash
function on the hash keys
(based on the predicates in
the WHERE clause)
guarantees that any
two
joining records must be in
the same pair of partitions.
Therefore, the task of
joining two large
tables
gets reduced to multiple, but
smaller, instances of the
same tasks. The hash
join is then
applied to each
pair of partitions.
Hash-Based
Join: Partition Skew
Partition
skew can
become a problem.
Hashing works under
the assumption of uniformity of
data distribution, may not be always
true.
Consequently
hash-based join degrades
into nested-loop
join.
Solution: Make
available other hash functions to be
chosen by the optimizer; that
better distribute
the
input.
Partition
skew can
become a problem in hash-join. In
the first step of hash
join, records are
hashed
into the main memory into
their corresponding bucket.
This being done based on the
hash
function
used. However, an attribute being hashed
may not be uniformly distributed within
the
relation,
and some buckets may
then contain more records than
other buckets. When
this
difference
becomes large, the corresponding bucket
may no longer fit in the
main memory. As a
consequence,
hash -based join degrades
into performance of a nested-loop join.
The only possible
solution is to
make available other hash functions to be
chosen by the optimizer; that
better
distribute the
input.
234
Hash-Based
Join: Intrinsic
Skew
Intrinsic
skew can
become a problem for hash, as
well as sort -merge
join.
The
skew is in data, NOT due to
hash function.
Example:
Many non-CS majors
registering for CS-101
instead of CS students in
summer.
Intrinsic
skew occurs when
attributes are not distributed uniformly;
it is also called attribute
value
skew.
For example a basic Computer
Science (CS) course being
offered in summer, and taken
by
many non
-CS majors who want to know
about computers. The course
taken by few
CS-majors
who
missed it or got an incomplete (i.e. I)
grade during the regular
semester due to one reason
or
another.
Ironically, intrinsic skew effects
the performance of both hash
and sort -merge
joins.
Sort-merge join
works best when the join
attributes are the primary
key of both tables.
This
property
guarantees absence of duplicates, so
that a record in the left
-hand-side of the relation
will
join with at most one record
in the right -hand-side of
the relation, thus avoiding
the intrinsic
skew.
235
Table of Contents:
|
|||||