Lecture-12: Relational OLAP (ROLAP)
Why ROLAP?
Issue of scalability i.e. curse of dimensionality for MOLAP
§ Deployment of significantly larger dimension tables as compared to MOLAP, using secondary storage.
§ Aggregate awareness allows using pre-built summary tables by some front-end tools.
§ Star schema designs usually used to facilitate ROLAP querying (covered in the next lecture).
The biggest problem with MOLAP is the requirement of large main memory as the cube size increases. There may be many reasons for the increase in cube size, such as an increase in the number of dimensions, an increase in the cardinality of the dimensions, an increase in the amount of detail data, or a combination of some or all of these aspects. Thus there is an issue of scalability which limits its application to large data sets.
Despite advances in MOLAP technologies, high-end OLAP implementations will normally require assistance from a relational database. Hence ROLAP, or Relational OLAP. ROLAP tools query the relational database using SQL generated to conform to the facts-and-dimensions paradigm of the star schema. The other approach is "aggregate awareness" i.e. the environment is smart enough to develop or compute higher level aggregates using lower level or more detailed aggregates.
ROLAP as a "Cube"
§ OLAP data is stored in a relational database (e.g. a star schema).
§ The fact table is a way of visualizing an "un-rolled" cube.
§ So where is the cube?
§ It's a matter of perception.
§ Visualize the fact table as an elementary cube.
Figure-12.1: ROLAP as a cube
Recall from the last lecture that an implementation of the OLAP framework will be accepted as OLAP after it passes the FASMI test i.e. it has to be Multi-dimensional. The question is "how to convert something into a cube" which is intrinsically not a cube. And more importantly, if MOLAP has the space requirement limitations, and to overcome those limitations we are using a different implementation, then wouldn't the conversion back to a "cube" defeat the purpose? This is actually a very good question, and needs some detailed explanation. Fig-12.1 shows two rows of a fact table, and the aggregates corresponding to the cells of the fact table are shown correspondingly in a three dimensional cube. Extending this argument, each and every cell of the fact table can have a corresponding mapping. To clarify this further, consider the analogy of the map of the world. Our planet earth is NOT flat, it is spherical in nature i.e. three dimensional, yet the map of the earth is flat i.e. two dimensional in nature, i.e. there is a mapping from 3D space to 2D space. If you look closely at the map, you will see that it is divided into a grid based on longitude and latitude, and the corresponding cells or rectangles are not of the same size. Similarly, a 2D table can be considered to be a mapping or representation of a multi-dimensional cube, or vice versa.
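As a concrete sketch, the fact table behind Fig-12.1 could be declared as below. The table and column names match those used in the queries later in this lecture; the exact data types are assumptions.

-- Assumed schema for the Sales fact table used in the examples that follow.
-- Column names follow the queries in this lecture; data types are illustrative.
CREATE TABLE Sales (
    Month_Id   INTEGER        NOT NULL,  -- reference into the Time dimension
    Product_Id INTEGER        NOT NULL,  -- reference into the Product dimension
    Sales_Amt  DECIMAL(12,2)  NOT NULL   -- additive fact (sales amount)
);

Each row of this fact table corresponds to one cell of the "un-rolled" cube: the pair (Month_Id, Product_Id) fixes a position along the two dimensions, and Sales_Amt is the value stored at that intersection.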
How to create a "Cube" in ROLAP
§ Cube is a logical entity containing values of a certain fact at a certain aggregation level at an intersection of a combination of dimensions.
§ The following table can be created using 3 queries.
Figure-12.2: Creating tables from queries
When we talked of a cube for MOLAP, it was not actually a physical cube, but a logical entity. We continue with that concept: the aggregates that were stored in the cube at a certain combination of indices are now stored in two-dimensional tables, and we use such groups of tables to store the same data that was stored in MOLAP. The table shown in Fig-12.2 is divided into three parts, shown shaded and also marked by dotted lines. Corresponding to each part of the table there is a query, and consequently the table can actually be filled using three SQL queries as follows:
§ For the table entries, without the totals

SELECT S.Month_Id, S.Product_Id, SUM(S.Sales_Amt)
FROM Sales S
GROUP BY S.Month_Id, S.Product_Id;

§ For the row totals

SELECT S.Product_Id, SUM(S.Sales_Amt)
FROM Sales S
GROUP BY S.Product_Id;

§ For the column totals

SELECT S.Month_Id, SUM(S.Sales_Amt)
FROM Sales S
GROUP BY S.Month_Id;
The first query can be used to fill Part-II of the table, the second query to fill Part-I of the table, and the third query to fill Part-III of the table. Thus, using these three queries, we create a ROLAP structure.
Problem with simple approach
§ Number of required queries increases exponentially with the increase in the number of dimensions.
§ It's wasteful to compute all queries.
§ In the example, the first query can do most of the work of the other two queries.
§ If we could save that result and aggregate over Month_Id and Product_Id, we could compute the other queries more efficiently.
Using typical SQL to fill up the tables quickly runs into a problem: as the number of dimensions increases, the number of aggregates increases, and the number of queries required to calculate those aggregates also increases. Actually it becomes extremely wasteful to compute all queries; wasteful because, if we are smart, we can use the results of the queries already computed to get the answers to new queries. How to do this? It is not very difficult. For example, for the column total query we could simply add up, for each month, the aggregates already produced by the first query. So the moral of the story is "work smart, not hard".
Cube clause
§ The CUBE clause is part of SQL:1999.
§ GROUP BY CUBE (v1, v2, ..., vn)
§ Equivalent to a collection of GROUP BYs, one for each of the subsets of v1, v2, ..., vn.
The other problem with using standard SQL is that one has to write too many statements, and that could lead to mistakes. Therefore, back in 1999, a CUBE clause was made part of SQL, and that clause is equivalent to a collection of GROUP BY clauses.
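For the two-dimensional example above, a single statement using the SQL:1999 syntax (sketched below) produces the detailed cells, both sets of totals and the grand total in one pass; NULLs in the result mark the "All" positions.

-- One statement replaces the three separate GROUP BY queries
-- (and it also returns the grand total).
SELECT S.Month_Id, S.Product_Id, SUM(S.Sales_Amt) AS Sales_Amt
FROM Sales S
GROUP BY CUBE (S.Month_Id, S.Product_Id);
-- Rows with Month_Id NULL are the per-product (row) totals,
-- rows with Product_Id NULL are the per-month (column) totals,
-- and the row with both NULL is the grand total.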
Some students who did a BS final year project with me on a HOLAP implementation used dynamic web page generation to dynamically generate SQL instead of hard-coding the queries that generate the aggregates. Meaning, they used SQL to generate aggregates to fill a MOLAP cube. The project was a success, all got jobs based on this work; the first prize in the 13th Annual National Software Competition, along with a cash prize of Rs. 30,000, was a bonus.
ROLAP and Space Requirement
If one is not careful, with the increase in the number of dimensions, the number of summary tables gets very large.
Consider the example discussed earlier with the following two dimensions on the fact table...
Time: Day, Week, Month, Quarter, Year, All Days
Product: Item, Sub-Category, Category, All Products
OK, so we worked smart and got around the problem of aggregate generation. But the aggregates, once generated, have to be stored somewhere too, i.e. in tables, as this is a ROLAP environment. Be warned: pre-aggregates can very quickly get out of control in a ROLAP environment. Do not try to pre-aggregate all combinations of all dimensions at all levels of the hierarchies. If you do, the storage and maintenance costs will quickly overwhelm your implementation. For example, consider the combinatorial explosion with just the two dimensions shown above...
EXAMPLE: ROLAP & Space Requirement
A naïve implementation will require all combinations of summary tables at each and every aggregation level.
24 summary tables; add in geography, and the result is 120 tables.
Figure-12.3: Sample ROLAP tables
There are 24 possible pre-aggregates just with the Product and Time dimensions (6 Time levels × 4 Product levels)! Add in the geography dimension and we are quickly up to 120 pre-aggregates (and the number of levels in the hierarchies that we are assuming is very conservative). The largest aggregate will be the summarization by day and product, because this is the most detailed. Clearly, we do not want to pre-aggregate all of these combinations.
Smart tools will allow less detailed aggregates to be constructed from more detailed aggregates (full aggregate awareness) at run-time, so that we do not go all the way down to the detail for every aggregation. However, for this to work, the metrics must be additive (e.g., no ratios, averages, etc.). More detailed pre-aggregates are larger, but can also be used to build less detailed aggregates on-the-go.
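As a sketch of this roll-up with additive metrics, a monthly summary could be built from an existing daily summary instead of from the detail. The table and column names here (Daily_Sales, Monthly_Sales, Time_Dim) are assumptions, not part of the lecture's schema.

-- Roll a less detailed aggregate up from a more detailed one.
-- This works only because SUM is additive; an average or ratio stored
-- in Daily_Sales could NOT be re-aggregated this way.
INSERT INTO Monthly_Sales (Month_Id, Product_Id, Sales_Amt)
SELECT T.Month_Id, D.Product_Id, SUM(D.Sales_Amt)
FROM Daily_Sales D
JOIN Time_Dim T ON T.Day_Id = D.Day_Id   -- maps each day to its month
GROUP BY T.Month_Id, D.Product_Id;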
ROLAP Issues
§ Maintenance.
§ Non-standard hierarchy of dimensions.
§ Non-standard conventions.
§ Explosion of storage space requirement.
§ Aggregation pitfalls.
Creating an aggregate is relatively easy compared to keeping it updated; the maintenance issue will eat you alive if (say) you get data late and it also has to be reflected in the aggregate archives.
Dimensions are not just simply atomic item, then SKU, then product category. They can cross boundaries too, and can become expensive to create.
The conventions for (say) a week may be absolute within an organization, but differ across organizations; worse, they could differ within the organization too, e.g. marketing people looking at the week differently than the accounts people.
ROLAP Issue: Maintenance
Summary tables are more of a maintenance issue (similar to MOLAP) than a storage issue.
§ Notice that summary tables get much smaller as dimensions get less detailed (e.g., year vs. day).
§ Should plan for twice the size of the un-summarized data for ROLAP summaries in most environments.
§ Assuming "to date" summaries, every detail record that is received into the warehouse must aggregate into EVERY summary table.
It is not unusual for the collection of pre-aggregate summary tables to take up significantly more space (e.g., double or higher) than the raw data tables in a deployment that is focused heavily on query performance delivery. However, the maintenance cost for keeping all of the pre-aggregates up-to-date with the detailed data (very important for consistent query results) is very high. Pre-aggregate summary tables in the relational environment have the same "update problem" as cubes in a MOLAP environment.
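A minimal sketch of this maintenance burden, assuming a DBMS that supports the standard MERGE statement and a hypothetical staging table Late_Sales holding late-arriving detail rows, is shown below. A statement of this kind has to be run against every summary table each time new or late detail arrives.

-- Fold late-arriving detail into one pre-aggregate summary table.
-- The same kind of statement is needed for EVERY summary table.
MERGE INTO Monthly_Product_Sales M
USING (
    SELECT Month_Id, Product_Id, SUM(Sales_Amt) AS Sales_Amt
    FROM Late_Sales
    GROUP BY Month_Id, Product_Id
) L
ON (M.Month_Id = L.Month_Id AND M.Product_Id = L.Product_Id)
WHEN MATCHED THEN
    UPDATE SET Sales_Amt = M.Sales_Amt + L.Sales_Amt
WHEN NOT MATCHED THEN
    INSERT (Month_Id, Product_Id, Sales_Amt)
    VALUES (L.Month_Id, L.Product_Id, L.Sales_Amt);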
ROLAP Issue: Hierarchies
Dimensions are NOT always simple hierarchies.
Dimensions can be more than simple hierarchies i.e. item, subcategory, category, etc.
The product dimension might also branch off by trade style in ways that cross simple hierarchy boundaries, such as:
Looking at sales of air conditioners that cross manufacturer boundaries, such as COY1, COY2, COY3 etc.
Looking at sales of all "green colored" items that even cross product categories (washing machine, refrigerator, split-AC, etc.).
Looking at a combination of both.
It is quite common that dimensions are more than simple hierarchies. A simple product hierarchy might include item, subcategory, category, etc. However, the product dimension might also branch off by trade style in ways that cross boundaries of the simple hierarchy. We may want to look at sales of air conditioners that cross manufacturer boundaries, such as COY1, COY2, COY3 etc. Looking at all "green colored" items will cross product categories (washing machine, refrigerator, split-AC, etc.). Looking at a combination of both, or others, will result in a combinatorial explosion, as these combinations get very large, such that brute-force pre-aggregation is not practical.
Providing analytic capability along multiple calendar hierarchies is even more unmanageable in terms of pre-aggregate construction.
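As a sketch of a grouping that cuts across the simple hierarchy, assume the Product dimension table (here called Product_Dim, an assumed name) carries descriptive attributes such as Color alongside Item/Sub-Category/Category. No pre-built summary along the standard hierarchy answers this query directly.

-- Sales of all "green colored" items, regardless of product category.
SELECT P.Color, SUM(S.Sales_Amt) AS Sales_Amt
FROM Sales S
JOIN Product_Dim P ON P.Product_Id = S.Product_Id
WHERE P.Color = 'Green'
GROUP BY P.Color;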
ROLAP Issue: Semantics
Conventions are NOT absolute.
Example: What is a calendar year? What is a week?
§ Calendar: 01 Jan. to 31 Dec. or 01 Jul. to 30 Jun. or 01 Sep. to 31 Aug.
§ Week: Mon. to Sun. or Thu. to Wed.
Conventions may vary across organizations, and even within an organization. Consider the apparently simple case of the calendar year. There is no absolute standard definition of the calendar year, i.e. is it Jan to Dec, or Jul to Jun, or any other time period covering 12 months? Similarly, what is meant by a week, i.e. on which day does it start and on which day does it end? Even within an organization the nomenclature may vary; for example, finance people may consider a Mon. to Sun. week, while the marketing people may consider a Wed. to Tue. week. Consequently the results of aggregation will vary across the organization for the same data, thus creating a lot of confusion. Bringing all the departments onto the same grid may require a lot of arm twisting from the highest level, and will involve office politics too.
ROLAP Issue: Storage Space Explosion
Summary tables required for non-standard groupings.
Summary tables required along different definitions of year, week etc.
A brute force approach would quickly overwhelm the system storage capacity due to a combinatorial explosion.
We just saw the problem of semantics and thought it was complex; we were wrong! There are many more non-standard groupings too. For example, during the month of August, to capitalize on the patriotism around Independence Day, different manufacturers print the national flag on their packages. This adds a new grouping, as the decision maker would be interested in knowing the sales of items with the flag as compared to items without a flag printed on them.
In some cases bringing all the departments onto the same grid to agree on a single definition of year or week may not be advisable; in such a case the number of summary tables increases, as tables are required for each definition of the week and the year. One may naively follow the brute force path, i.e. creating all possible summary tables with all possible definitions, groupings and nomenclatures. This may work for small databases, but for VLDB (Very Large Data Bases) the storage requirements will very soon choke the system. Thus this is not a viable option for large data sets.
ROLAP Issue: Aggregation Pitfalls
§ Coarser granularity correspondingly decreases potential cardinality.
§ Aggregating whatever can be aggregated.
§ Throwing away the detail data after aggregation.
As data is aggregated to higher levels in the dimensional hierarchy, the number of rows retained will obviously reduce. However, this reduction is not usually directly proportional to the ratio of entries at each level in the hierarchy. For example, one might (incorrectly) think that the weekly summary table will be about 14% or 1/7th of the daily summary table, as there are seven days in a week. Therefore, keeping the data at a daily level of granularity should take seven times the space of the weekly level, right? Wrong. The reason is that all products that sell in a week definitely do not sell every day of the week. As a result, the factor is closer to 2× rather than 7×. For example, rental of movies usually goes up during the weekends; most heart patients suffer a stroke on Monday, and so do most stock sales.
A general rule of thumb is that aggregation from one level of detail to another should only be undertaken if there is a 10x (or more) reduction in table size from the more detailed level to the coarser level of aggregation. Of course, this is only a guideline and will be influenced by the frequency of query activity at each level of potential aggregation.
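A sketch of how this rule of thumb could be checked before building a weekly summary is shown below, assuming the Daily_Sales and Time_Dim tables introduced earlier; if the ratio of the two counts comes out well below 10, the weekly aggregate is probably not worth building.

-- Rows at the daily grain (one row per day per product that sold).
SELECT COUNT(*) AS daily_rows
FROM Daily_Sales;

-- Rows the weekly summary would contain (one row per week per product).
SELECT COUNT(*) AS weekly_rows
FROM ( SELECT DISTINCT T.Week_Id, D.Product_Id
       FROM Daily_Sales D
       JOIN Time_Dim T ON T.Day_Id = D.Day_Id ) W;

-- Build the weekly summary only if daily_rows / weekly_rows >= 10 (rule of thumb).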
Keeping the detail is important because it is inevitable that advanced analysis will require drill-down to the most granular level of detail available. Remember Murphy's Law: the day you throw away the detail is the day it will be required.
How to reduce summary tables?
Many ROLAP products have developed means to reduce the number of summary tables by:
§ Building summaries on-the-fly as required by end-user applications.
§ Enhancing performance on common queries at coarser granularities.
§ Providing smart tools to assist DBAs in selecting the "best" aggregations to build, i.e. a trade-off between speed and space.
Tools from vendors such as Microsoft and MicroStrategy are "fully aggregate aware", with the ability to take summaries from one (more detailed) level of aggregation and roll them up into a less detailed summary at run-time. In this way, aggregate metrics can be delivered without forcing run-time aggregation from the most detailed data in the warehouse or mart. Wizards have come into the marketplace with the ability to make suggestions as to the optimal aggregates that should be built for maximizing performance for a defined workload.
Performance vs. Space Trade-off
§ Maximum performance boost implies using lots of disk space for storing every pre-calculation.
§ Minimum performance boost implies no disk space with zero pre-calculation.
§ Using metadata to determine the best level of pre-aggregation from which all other aggregates can be computed.
Theoretically there can be two extremes, i.e. free space and free performance. If storage is not an issue, then just pre-compute every cube at every unique combination of dimensions at every level, as it does not cost anything; this will result in maximum query performance. But in reality this implies a huge cost in disk space and in the time for constructing the pre-aggregates. In the other case, where performance is free, i.e. infinitely fast machines and an infinite number of them, there is no need to build any summaries, meaning zero cube space and zero pre-calculations; in reality this would result in minimum performance boost, which does not matter in the presence of infinite performance.
What is meant by metadata? Metadata is data about data. So what is the data about data that one would be interested in? For example, how many entries are there at each level of the hierarchy? How many stores are there? How many stores per zone? How many zones per district? Etc. When the density of each dimension is known, it gives a fair idea of where the aggregation is going to have the biggest bang for the buck. Because if I have a dimension that has, for example, on average two UPCs (Universal Product Codes) per SKU (Stock Keeping Unit), it is really not very interesting to build a summary of UPCs into SKUs, because all it saves is adding two records together.
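A sketch of the kind of metadata query involved is shown below, assuming a hypothetical Store dimension table Store_Dim with Store_Id, Zone_Id and District_Id columns; the counts show how much reduction each level of the hierarchy can actually deliver.

-- How many stores, zones and districts are there?
SELECT COUNT(DISTINCT Store_Id)    AS stores,
       COUNT(DISTINCT Zone_Id)     AS zones,
       COUNT(DISTINCT District_Id) AS districts
FROM Store_Dim;

-- Average number of stores per zone: if this is small (e.g. around 2),
-- a zone-level pre-aggregate saves very little work.
SELECT CAST(COUNT(DISTINCT Store_Id) AS DECIMAL(10,2))
       / COUNT(DISTINCT Zone_Id) AS stores_per_zone
FROM Store_Dim;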
Performance vs. Space Trade-off using Wizard
Figure-12.4: Aggregation vs. Performance
Aggregation design wizards allow a cube or pre-aggregate designer to specify the trade-off between disk storage and performance to determine the maximum volume of pre-calculated aggregates, as shown in Figure-12.4. Suggestions about the amount of aggregation are based on the amount of data reduction along different dimensions and the size of each aggregate.
Optimizing the amount of aggregation is a very hard problem, thus heuristics are used. The heuristics in the aggregate design wizard can be improved by usage-based optimization wizards via examination of query logs to determine which pre-aggregates will deliver the best bang for your storage buck. Microsoft's Usage-Based Optimization Wizard also allows the DBA to tell Analysis (OLAP) Services to create a new set of aggregations for all queries exceeding a defined response time threshold.
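A sketch of such usage-based selection is given below, assuming a hypothetical Query_Log table that records the grain (dimension levels) and response time of each query; aggregates would be suggested for the slow, frequently hit grains. The column names and the 5000 ms threshold are assumptions.

-- Find the query grains that are hit often and run slowly:
-- these are the best candidates for new pre-aggregates.
SELECT Time_Level, Product_Level,
       COUNT(*)         AS hits,
       AVG(Response_Ms) AS avg_response_ms
FROM Query_Log
WHERE Response_Ms > 5000          -- assumed response-time threshold
GROUP BY Time_Level, Product_Level
ORDER BY hits DESC;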
HOLAP
§ Target is to get the best of both worlds.
§ HOLAP (Hybrid OLAP) allows co-existence of pre-built MOLAP cubes alongside relational OLAP or ROLAP structures.
§ How much to pre-build?
The hybrid OLAP (HOLAP) solution is a mix of MOLAP and relational ROLAP architectures that supports queries against summary and transaction data in an integrated fashion. HOLAP environments use MOLAP cubes to support common access paths with reasonably small dimensional cardinality and number of dimensions, and relational structures when greater scalability for OLAP queries is required. This coexistence strategy allows exploiting the best of both worlds. Microsoft OLAP Services supports a HOLAP environment, as do tools such as HOLOS. The HOLAP approach enables a user to perform multidimensional analysis on data in the MDDB along with query-based probing. However, if the user reaches the bottom of the multidimensional hierarchy and requires further detail data, the smart HOLAP engine automatically generates SQL to retrieve the detail data from the source RDBMS and returns it to the end user. This is done transparently to the user. Several MOLAP vendors, such as Arbor and Oracle, have transitioned to HOLAP architectures that include a ROLAP component. However, these HOLAP architectures are typically more complex to implement and administer than ROLAP or MOLAP architectures individually.
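The SQL generated for such a drill-through might look like the sketch below, assuming the detail lives in the Sales fact table introduced earlier and the user has drilled down to one particular month and product; the key values are purely illustrative.

-- Drill-through: fetch the detail rows behind one cube cell.
-- The Month_Id and Product_Id values come from the user's current position.
SELECT S.*
FROM Sales S
WHERE S.Month_Id   = 200301     -- assumed key of the selected month
  AND S.Product_Id = 42;        -- assumed key of the selected product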
DOLAP
Figure-12.5: Concept of Desktop OLAP or DOLAP
DOLAP is typically a simplified version of MOLAP or ROLAP. DOLAP is inexpensive, fast, and easy to set up on small data sets comprising thousands of rows instead of millions of rows. It provides a specific cube for the analysis. Some DOLAP systems developed were extensions of production system report writers, while the systems developed in the early days of client/server computing aimed to take advantage of the power of the emerging desktop PC. DOLAP also provides mobile OLAP operation for people who travel and move extensively, such as sales people. The one obvious disadvantage of DOLAP is that it lacks the ability to manage large data sets. But this is just another technique to suit the business requirement.