Lecture Handout
Data Warehousing
Lecture No. 10
Online Analytical Processing (OLAP)
OLAP is Analytical Processing instead of Transaction Processing. It is also NOT a physical database design or implementation technique, but a framework. Furthermore, OLAP implementations are either highly de-normalized or completely de-normalized, while OLTP implementations are fully normalized. Therefore, a clear understanding of how OLAP (On-Line Analytical Processing) fits into the overall architecture of your data warehouse is essential for its appropriate deployment. There are different methods of deploying OLAP within a decision support environment, often obscured by vendor hype and product salesmanship. It is important that you, as the data warehouse architect, drive your choice of OLAP tools rather than being driven by them!
DWH & OLAP
§ Relationship between DWH & OLAP
§ Data Warehouse & OLAP go together.
§ Analysis supported by OLAP
A data warehouse is a "subject-oriented, integrated, time-varying, non-volatile collection of data that is used primarily in organizational decision making". Typically, the data warehouse is maintained separately from the organization's operational databases and on different (and more powerful) servers. There are many reasons for doing this. The data warehouse supports on-line analytical processing (OLAP), the functional and performance requirements of which are quite different from those of the on-line transaction processing (OLTP) applications traditionally supported by the operational databases.
A data
warehouse implementation without an
OLAP tool is nearly
unthinkable. Data
warehousing
and on-line
analytical processing (OLAP)
are essential elements of
decision support, and
go
hand-in-hand.
Many commercial OLAP
products and services are
available, and all of
the
principal
database management system vendors have
products in these areas.
Decision support systems are very different from MIS systems; consequently, this places some rather different requirements on the database technology compared to traditional on-line transaction processing applications.
Data
Warehouse provides the best
support for analysis while
OLAP carries out the
analysis task.
Although there is more to OLAP technology than data warehousing, the classic statement of OLAP is "decision making is an iterative process, which must involve the users". This statement implicitly points to the not-so-obvious fact that while OLTP systems keep the users' records and assist the users in performing the routine operational tasks of their daily work, OLAP systems, on the other hand, generate the information users require to do the non-routine parts of their jobs. OLAP systems deal with unpredicted circumstances (for which it is very difficult, if not impossible, to plan ahead). If the data warehouse doesn't provide users with helpful information quickly and in an intuitive format, the users will be turned off by it, and will revert to their old and faithful, yet inefficient, tried-out procedures. Data warehouse developers sometimes refer to the importance of intuitiveness to the users as the "field of dreams" scenario, i.e. naively assuming that if they build it, the users will come; in reality, just because you build it doesn't mean they will come and use it.
Supporting
the human thought
process
We will first learn the nature of analysis before going on to the actual structure and functionality behind OLAP systems; this is illustrated in Figure-10.1. Consider a hypothetical (but realistic) scenario in which a decision maker (analyst, CEO) of a large enterprise, i.e. a multinational company, is confronted with the yearly sales figures showing an enterprise-wide fall in profit. Note that we are talking about a large multinational company, which is not only in sales, but also in production. To get a hold of the problem, the decision maker has to start from somewhere, but that somewhere has to be meaningful too. This "somewhere" could be starting the exploration across the enterprise geographically, or exploring across the enterprise functionally, or doing it product-wise, or doing it with reference to time.
One of the valid options of starting is with reference to time, and the query corresponding to the thought process of the decision maker turns out to be "What were the quarterly sales during last year?" The result of this query would be the sales figures for the four quarters of last year. There can
be a number of
possible outcomes, for
example, the sale could be
down in all the quarters, or
it
could be down in
three out of four quarters
with four possible options.
Another possibility could
be sales
down in any two quarters
with six possible scenarios.
For the sake of simplicity,
over
here we
assume that the sales were
down in only one quarter. If
the results were displayed
as
histograms,
the decision maker would
immediately pick the quarter in which
sales were down,
which in
this case comes out to be
the last quarter.
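As a rough illustration, this first exploratory question could be written in SQL along the following lines. The table and column names (a hypothetical sales table with sale_date, region, product and amount columns) are assumptions for illustration only and are not part of this handout; date functions vary slightly by SQL dialect.

    -- "What were the quarterly sales during last year?"
    SELECT EXTRACT(QUARTER FROM sale_date) AS qtr,
           SUM(amount)                     AS total_sales
    FROM   sales
    WHERE  EXTRACT(YEAR FROM sale_date) = EXTRACT(YEAR FROM CURRENT_DATE) - 1
    GROUP  BY EXTRACT(QUARTER FROM sale_date)
    ORDER  BY qtr;

Displayed as a histogram of the four returned rows, the "lossy" quarter stands out immediately.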
Figure-10.1: Supporting the human thought process (thought process and the corresponding query sequence)
Now that the "lossy" quarter has been identified, the decision maker would naturally like to probe further, to get to the root of the problem, and would like to know what happened during the last quarter that resulted in a fall in profit. The objective here is to identify the root cause; once identified, the problem can be fixed. So the obvious question that comes into the mind of
the
decision
maker is what is
special about the last
quarter? Before
jumping directly into the
last
quarter to come
up with hasty outcomes, the
careful decision maker wants
to look at the sales
data
of last year, but with a different perspective; therefore, his/her question is translated into two queries, i.e. (i) What were the quarterly sales at regional level during last year? (ii) What were the quarterly sales at product level during last year? There could be another
sequence of questions
i.e.
probing
deeper into the "lossy" quarter directly,
which is followed in the
next sequence of
queries.
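Continuing with the same hypothetical sales table introduced above (again, purely illustrative names), these two queries could be sketched roughly as:

    -- (i) Quarterly sales at regional level during last year
    SELECT region,
           EXTRACT(QUARTER FROM sale_date) AS qtr,
           SUM(amount)                     AS total_sales
    FROM   sales
    WHERE  EXTRACT(YEAR FROM sale_date) = EXTRACT(YEAR FROM CURRENT_DATE) - 1
    GROUP  BY region, EXTRACT(QUARTER FROM sale_date);

    -- (ii) Quarterly sales at product level during last year
    SELECT product,
           EXTRACT(QUARTER FROM sale_date) AS qtr,
           SUM(amount)                     AS total_sales
    FROM   sales
    WHERE  EXTRACT(YEAR FROM sale_date) = EXTRACT(YEAR FROM CURRENT_DATE) - 1
    GROUP  BY product, EXTRACT(QUARTER FROM sale_date);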
It may seem that the decision maker was overly careful, but he was not. Recognize that the benefit of looking at a whole year of data is that it enables the decision maker to identify whether there was a trend of sales going down that resulted in the lowest sales during the last quarter, or whether the sales went down only during the last quarter. The finding was that the sales went down only during the last quarter, so the decision maker probed further by looking at the sales on a monthly basis during the last quarter with reference to product, as well as with reference to region. Here was the finding: the sales were doing all right with reference to the product, but they were found to be down with reference to the region, specifically the Northern region.
Now the decision maker would like to know what happened in the Northern region in the last quarter with reference to the products, as just looking at the Northern region alone would not give the complete picture. Although not explicitly shown in the figure, the current exploratory query of the decision maker is further processed by taking the sales figures of the last quarter on a monthly basis and coupling them with the products purchased at store level. Now the hard work pays off, as the problem has finally been identified, i.e. the high cost of products purchased during the last quarter in the Northern region. Well, actually it was not hard work in its true literal meaning, as using an OLAP tool, this would have taken not more than a couple of minutes, maybe even less.
Analysis of last example
§ Analysis is Ad-hoc
§ Analysis is interactive (user driven)
§ Analysis is iterative
§ Answer to one question leads to a dozen more
§ Analysis is directional
§ Drill Down
§ Roll Up
§ Drill Across
Now if we carefully analyze the last example, some points become very clear, the first one being that analysis is ad-hoc, i.e. there is no known order to the sequence of questions. The order of the questions actually depends on the outcome of the questions asked, the domain knowledge of the decision maker, and the inferences made by the decision maker about the results retrieved. Furthermore, it is not possible to perform such an analysis with a fixed set of queries, as it is not
possible to know in advance what questions will be asked, and more importantly what answers or results will be presented corresponding to those questions. In a nutshell, not only is the querying ad-hoc, but the universal set of pre-programmed queries is also very large, hence it is a very complex problem.
As you would have experienced while using an MIS system, or while developing an MIS system, such systems are programmer-driven. One could disagree with that, arguing that development comes after system analysis, when the user has given his/her requirements. The problem with decision support systems is that the user is unaware of their requirements until they have used the system. Furthermore, the number and type of queries are fixed as set by the programmer; thus if the decision maker has to use the OLTP system for decision making, he is forced to use only those queries that have been pre-written, and so the process of decision making is not driven by the decision maker. As we have seen, decision making is interactive, i.e. there is nothing in between the decision support system and the decision maker.
In a traditional
MIS system, there is an
almost linear sequence of
queries i.e. one query
followed
by another,
with a limited number of queries to go
back to with almost fixed
and rigid results. As
a consequence,
running the same set of
queries in the same order
will not result in
something
new. As we discussed in the example, the decision support process is exploratory, and depending on the results obtained, the queries are further fine-tuned or the data is probed more deeply; thus it is an iterative process. Going through the loop of queries as per the requirements of the decision maker fine-tunes the results, and the process eventually comes up with the answers to the questions.
Challenges
§ Not feasible to write predefined queries.
§ Fails to remain user-driven (becomes programmer-driven).
§ Fails to remain ad-hoc and hence is not interactive.
§ Enable ad-hoc query support
§ Business user cannot build his/her own queries (does not know SQL, should not know it).
§ On-the-go SQL generation and execution too slow.
Now
that we have identified how
decision making proceeds, this could
lure someone with an
OLTP or
MIS mindset to approach
decision support using the
traditional tools and
techniques.
Let's
look at this (wrong)
approach one point at a
time. We identified one very
important point
that
decision support requires
much more and many more
queries as compared to an MIS
system.
But even if more
queries are written, the
queries still remain predefined,
and the process of
decision making
still continues to be driven by
the programmer (who is not a decision
maker)
instead of
the decision maker. When
something is programmed, it kind of
gets hard-wired, i.e. it cannot be changed significantly, although minor changes are possible by changing the parameter values. That's where the fault lies, i.e. hard-wiring. Decision making has no particular order associated with it.
One could
naively say OK if the
decision making process has to be
user driven, let the end
user or
business
user do the query development himself/herself.
This is a very unrealistic assumption,
as
the
business user is not a programmer, does
not know SQL, and does not
need to know SQL. An
apparently realistic solution would be to let the SQL generation be on-the-go. This approach does make the system user-driven without requiring the user to write SQL, but even if on-the-go SQL generation can be achieved, the data sets used in a data warehouse are so large that there is no way the system can remain interactive!
Challenges
§ Contradiction
§ Want to compute answers in advance, but don't know the questions
§ Solution
§ Compute answers to "all" possible "queries". But how?
§ NOTE: Queries are multidimensional aggregates at some level
So what are the challenges? We have a paradox or a contradiction here. OK, say we manage to assemble a very large team of dedicated programmers, with an excellent project manager, and all sit down and plan to write all queries in advance. Can we do that? The answer is a resounding NO. The reason is that even the user does not know the queries in advance, and consequently the programmers also do not know the queries in advance, so how can the queries be written in advance? It is just not possible.
So what is the
solution then? We need a paradigm shift
from the way we have
been thinking. In a
decision
support environment, we are looking at
the big picture; we are
interested in all
sales
figures
across a time interval or at a particular time
interval, but not for a particular
person. So the
solution lies in
computing all possible aggregates,
which would then correspond
to all possible
queries.
Over here a query is a multidimensional
aggregate at a certain level of
the hierarchy of
the
business.
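To make "all possible queries" concrete, many SQL engines provide GROUP BY CUBE (or GROUPING SETS), which computes the aggregate for every combination of the listed dimensions in a single statement. A minimal sketch over the hypothetical sales table used earlier is shown below; MOLAP servers essentially pre-compute and store such aggregates instead of deriving them at query time.

    -- Aggregates for every combination of quarter, region and product
    -- (2^3 = 8 groupings, from the full detail down to the grand total).
    SELECT EXTRACT(QUARTER FROM sale_date) AS qtr,
           region,
           product,
           SUM(amount)                     AS total_sales
    FROM   sales
    GROUP  BY CUBE (EXTRACT(QUARTER FROM sale_date), region, product);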
To understand this concept, let's look at Figure-10.2, which shows multiple levels or hierarchies of geography. Pakistan is logically divided into provinces, each province is administratively divided into divisions, each division is divided into districts, and within districts are cities. Consider a chain store with outlets in major cities of Pakistan, with large cities such as Karachi and Lahore having multiple outlets, as you may have observed. Now the sales actually take place at the business outlet, which could be in a zone, such as Defense and Gulberg in our example; of course, there may be many more stores in a city, as already discussed.
"All"
possible queries (level
aggregates)
Figure-10.2:
Hierarchies in data
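For a hierarchy like the one in Figure-10.2, the level aggregates can be sketched with GROUP BY ROLLUP. The geography columns used below (province, division, district, city, zone) are assumed for illustration, mirroring the hierarchy in the figure:

    -- Sales rolled up along the geographic hierarchy: one set of rows
    -- per level (zone, city, district, division, province) plus the
    -- grand total for the whole country.
    SELECT province, division, district, city, zone,
           SUM(amount) AS total_sales
    FROM   sales
    GROUP  BY ROLLUP (province, division, district, city, zone);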
OLAP: Facts & Dimensions
§ FACTS: Quantitative values (numbers) or "measures."
§ e.g., units sold, sales $, Co, Kg etc.
§ DIMENSIONS: Descriptive categories.
§ e.g., time, geography, product etc.
§ DIM often organized in hierarchies representing levels of detail in the data (e.g., week, month, quarter, year, decade etc.).
The foundation for design in this environment is the use of dimensional modeling techniques, which focus on the concepts of "facts" and "dimensions" for organizing data. Facts are the quantities, the numbers that can be aggregated (e.g., sales $, units sold etc.) that we measure, and dimensions are how we filter/report on those quantities (e.g., by geography, product, date, etc.). We will discuss this in detail; in fact, we will have a number of lectures on dimensional modeling techniques.
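As a minimal sketch of how facts and dimensions are typically laid out (a simple star schema; the table and column names here are assumptions for illustration, not definitions from this handout):

    -- Dimension tables: descriptive categories used to filter/report.
    CREATE TABLE dim_time      (time_id INT PRIMARY KEY, day DATE, week INT,
                                month INT, quarter INT, year INT);
    CREATE TABLE dim_geography (geo_id INT PRIMARY KEY, zone VARCHAR(50),
                                city VARCHAR(50), district VARCHAR(50),
                                division VARCHAR(50), province VARCHAR(50));
    CREATE TABLE dim_product   (product_id INT PRIMARY KEY, name VARCHAR(100),
                                category VARCHAR(50));

    -- Fact table: the measures (numbers that can be aggregated),
    -- keyed by the surrounding dimensions.
    CREATE TABLE fact_sales (
        time_id    INT REFERENCES dim_time(time_id),
        geo_id     INT REFERENCES dim_geography(geo_id),
        product_id INT REFERENCES dim_product(product_id),
        units_sold INT,
        sales_amt  DECIMAL(12,2)
    );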
Where Does OLAP Fit In?
§ It is a classification of applications, NOT a database design technique.
§ Analytical processing uses multi-level aggregates, instead of record level access.
§ Objective is to support very
§ (i) fast
§ (ii) iterative and
§ (iii) ad-hoc decision-making.
OLAP is a characterization of applications. It is NOT a database design technique. People often confuse OLAP with specific physical design techniques or data structures. This is a mistake. OLAP is a characterization of the application domain centered around slice-and-dice and drill-down analysis. As we will see, there are many possible implementations capable of delivering OLAP characteristics. Depending on data size, performance requirements, cost constraints, etc., the specific implementation technique will vary.
Where does OLAP fit in?
Figure-10.3: How the OLAP pieces fit together
Figure-10.3 shows that OLAP data cubes are generated from the transactional databases (after Extract, Transform, Load). Over here a multidimensional cube is shown, i.e. a MOLAP cube. Data retrieved as a consequence of exploring the MOLAP cube is then used as a source to generate reports or charts etc. Actually, in typical MOLAP environments, the results are displayed as tables, along with different types of charts, such as histograms, pie charts etc.
OLTP vs. OLAP

Feature                                    OLTP                       OLAP
Level of data                              Detailed                   Aggregated
Amount of data retrieved per transaction   Small                      Large
Views                                      Pre-defined                User-defined
Typical write operation                    Update, insert, delete     Bulk insert, almost no deletion
"Age" of data                              Current (60-90 days)       Historical, 5-10 years and also current
Tables                                     Flat tables                Multi-dimensional tables
Number of users                            Large                      Low-Med
Data availability                          High (24 hrs, 7 days)      Low-Med
Database size                              Med (GB-TB)                High (TB-PB)
Query optimizing                           Requires experience        Already "optimized"

Table-10.1: Difference between OLTP and OLAP
The
table summarizes the fundamental
differences between traditional OLTP
systems and typical
OLAP
applications. In OLTP operations
the user changes the
database via transactions
on
detailed
data. A typical transaction in a banking environment
may transfer money from
one
account to
another account. In OLAP
applications the typical user is an analyst who is interested in selecting data needed for decision support. He/she is primarily not interested in detailed data, but usually in aggregated data over large sets of data, as it gives the big picture. A typical OLAP query is to find the average amount of money drawn from an ATM by those customers who are male and aged between 15 and 25 years, from (say) Jinnah Super Market, Islamabad, after 8 pm.
For
this kind of query there are
no DML operations and the
DBMS contents do not
change.
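As a rough sketch of such a query (the withdrawals and customers tables and their columns are hypothetical, used only to illustrate that this is a pure read-only aggregate query):

    -- Average withdrawal by male customers aged 15-25 from the (say)
    -- Jinnah Super Market, Islamabad ATM after 8 pm. No DML is involved,
    -- so the database contents do not change.
    SELECT AVG(w.amount) AS avg_withdrawal
    FROM   withdrawals w
    JOIN   customers   c ON c.customer_id = w.customer_id
    WHERE  c.gender = 'M'
      AND  c.age BETWEEN 15 AND 25
      AND  w.atm_location = 'Jinnah Super Market, Islamabad'
      AND  EXTRACT(HOUR FROM w.withdrawal_time) >= 20;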
OLAP FASMI Test

Fast: Delivers information to the user at a fairly constant rate, i.e. O(1) time. Most queries answered in under 5 seconds.

Analysis: Performs basic numerical and statistical analysis of the data, pre-defined by an application developer or defined ad hoc by the user.

Shared: Implements the security requirements necessary for sharing potentially confidential data across a large user population.

Multi-dimensional: The essential characteristic of OLAP.

Information: Accesses all the data and information necessary and relevant for the application, wherever it may reside and not limited by volume.
FAST means that the system is targeted to deliver responses to most of the end-user queries in under 5 seconds, with the simplest analyses taking no more than a second and very few queries taking more than 20 seconds (for various reasons to be discussed). This is because if queries take significantly longer, the thought process is broken, as users are likely to get distracted; consequently, the quality of analysis suffers. This speed is not easy to achieve with large amounts of data, particularly if on-the-go and ad-hoc complex calculations are required.
ANALYSIS means that the system is capable of doing any business logic and statistical analysis that is applicable to the application and the user, and also keeps it easy enough for the user, i.e. at the point-and-click level. It is absolutely necessary to allow the target user to define and execute new ad hoc calculations/queries as part of the analysis and to view the results of the data in any desired way/format, without having to do any programming.
SHARED means that the system is not supposed to be a stand-alone system, and should implement all the security requirements for confidentiality (possibly down to cell level) for a multi-user environment. The other point is somewhat of a contradiction of the point discussed earlier, i.e. writing back. If multiple DML operations need to be performed, concurrent update locking should be applied at an appropriate level. It is true that not all applications require users to write data back, but it is true for a considerable majority.
MULTIDIMENSIONAL
is
the key requirement to the letter
and at the heart of the
cube concept
of OLAP.
The system must provide a
multidimensional logical view of the
aggregated data,
including
full support for hierarchies
and multiple hierarchies, as
this is certainly the most
logical
way to
analyze organizations and
businesses. There is no "magical" minimum
number of
dimensions that must be handled, as it is too application-specific.
INFORMATION
is
all of the data and
derived information required, wherever it
is and however
much is
relevant for the application.
The objective over here is
to ascertain the capacity of
various
products in
terms of how much input
data they can handle
and process, instead of how
many
Gigabytes of
raw data they can
store. There are many
considerations here, including
data
duplication,
Main Memory required, performance, disk
space utilization, integration with
data
warehouses
etc.