|
|||||
Lecture
Handout
Data
Warehousing
Lecture
No. 02
Why a
DWH?
·
Data recording
and storage is
growing.
·
History is
excellent predictor of the
future.
·
Gives total
view of the organization.
·
Intelligent
decision -support is required for
decision -making.
Data recording
and storage is
growing.
Moore's
law on increase in performance of CPUs
and decrease in cost has
been surpassed by the
increase in
storage space and decrease
in cost. Meaning, it is true that
the cost of CPUs is
going
down
and the performance is going
up, but this is applicable at a higher
rate to storage space
and
cost
i.e. more and more
cheap storage space is
becoming available as compared to
fast CPUs.
As you would
have experienced, when you
(or your father's) briefcase
seems to be small as
compared to
the contents carried in it, it
seems a good idea to buy a new
and larger briefcase.
However,
after sometime the new
briefcase too seems to be small
for the contents carried. On
the
practical
side, it has been noted
that the amount of data
recorded in an organization doubles
every
year
and this is an exponential
increase.
Reason-1:
Data Sets are
growing
How Much
Data is that?
220 or 106 bytes
1 MB
Small
novel 31/2 Disk
230 or 109 bytes
1 GB
Paper rims
that could fill the back of
a pickup van
50,000
trees chopped and converted
into paper
240 or 1012 bytes
1 TB
and
printed
1 PB = 250 or 1015 bytes
2 PB
Academic
research libraries across
the U.S.
1 EB = 260 or 101 8 bytes
5 EB
All words
ever spoken by human
beings
Table-2.1:
Quantifying size of
data
Size of
Data Sets are going up
↑.
§
6
Cost of
data storage is coming down
↓.
§
§
Total hardware
and software cost to store
and manage 1 Mbyte of
data
§ 1990: ~
$15
§ 2002: ~
¢15 (Down 100
times)
§ By 2007:
< ¢1 (Down 150 times)
§
A Few
Examples
§ WalMart: 24 TB
(Tera Byte)
§ France
Telecom: ~ 100 TB
§ CERN: Up
to 20 PB by 2006 (Peta Byte)
§ Stanford
Linear Accelerator Center
(SLAC): 500TB
A Ware
House of Data
is NOT
a
Data
Warehouse
Someone
says I have a data set of
size 1 GB so I have a DWH
can you beat this?
Someone
else says, I have a data
set of size 100 GB, can you
beat this?
Someone
else says, I have a 1 TB
data set, who can
beat this?
Who
has a data warehouse? Not
enough information, it is much more than
just the size, it is
a
whole
concept, it is NOT a shrink wrapped solution, it
evolves. A company may have
a TB of
data
and not have a data
warehouse; while on the other
hand, a company may have 500
GB of
data
and have a fully functional
data warehouse.
Size is
NOT
Everything
History is
excellent predictor of the
future
Secondly as I
mentioned earlier the data
warehouse has the historical
data. And one thing
that we
have
learned by using information is
that, "past is the best
predictor of the future". You
use
historical data,
because it gives you an insight into
how the environment is
changing. Also you
must
have heard that "history
repeats itself", however
this repetition of history is not likely to
be
constant
for all businesses or all
events. Note that you just
can't use the historical
data to predict
the
future; you have to have to bring your
own insight and experience to interpret
how the
environment is
changing in order to predictthe future
accurately and
meaningfully.
Gives total
view of the organization
So why
would you want data warehouse in
your organization? First of all a data
warehouse gives
a total view of
an organization. If you look at the operational
system i.e. the databases in
most
environments,
the databases are designed
around different lines of business.
Consider the case of
a Bank; a
bank will typically have current
accounts and savings
accounts, foreign
currency
account
etc. The bank will
have an MIS system for
leasing, and another system
for managing
credit cards
and another system for
every different kind of
business they are in.
However,
nowhere they
have the total view of the
environment from the
customer's perspective. The
reason
being,
transaction processing systems
are typically designed around functional
areas, within a
business
environment. For good decision making you
should be able to integrate
the data across
7
the organization
so as to cross the LoB (Line
of Business). So the idea
here is to give the tot
al
view of
the organization especially from a
customer's perspective within
the data warehouse,
as
shown in
Figure -2.1
ATM
Leasing
Savings
Account
DATA
WAREHOUSE
Checking
Account
Credit
Card
Figure-2.1:
A Data Warehouse crosses the
LoB
Intelligent
decision -support is required for
decision -making
Consider a
bank which is losing customers,
for reasons not known.
However, one thing is for
sure
that
the bank is losing business
because of lost customers.
Therefore, it is important,
actually
critical to
understand which customers have
left and why they
have left. This will
give you the
ability to
predict going forward (in time), to
identify which customers
will leave you (i.e.
the
bank). We
are going to talk about this
in the course using data
mining algorithms, like
clustering,
classification,
regression analysis etc.
However, this being another
example of using historical
data to predict
the future. So I can predict
today, which customers will
leave me in the next
3
months
before they even leave.
There can be, and
there are whole courses on
data mining, but we
will
just have an applied overview of
data mining in this
course.
Reason-2:
Businesses demand
intelligence
§
Complex
questions from integrated
data.
§
"Intelligent
Enterprise"
DBMS
Approach
Intelligent
Enterprise
List of
all items that were sold
last month?
Which items
sell together? Which items
to
stock?
List of
all items purchased by
Khizar?
Where
and how to place the
items? What
The
total sales of the last
month grouped by
discounts to
offer?
branch?
How
best to target customers to
increase sales
How
many sales transactions
occurred during
at a
branch?
the
month of January?
Which customers
are most likely to respond
to
my next
promotional campaign, and why?
8
Table-2.2:
Comparison of queries
Let's
take a close look at the
typical queries for a DBMS.
They are either about
listing the
contents of
tables or running aggregates of values
i.e. rather simple and
straightforward queries
and
fairly easy to program. The
queries follow rather pre -defined
paths into the database
and are
unlikely to
come up with something new or
abnormal.
Reason-3:
Businesses want much
more...
1.
What
happened?
2.
Why it
happened?
3.
What
will happen?
4.
What is
happening?
5.
What do you
want to happen?
These
questions primarily point to
what is called as the different
stages of a Data Warehouse
i.e.
starting from
the first stage, and
going all the way to
stage 5. The first stage is
not actually a data
warehouse, but a
pure batch processing system. Note
that as the stages evolve
the amount of
batching
processing decreases, this be
ing maximum in the first
stage and minimum in the
last or
5th stage. At the same time
the amount of ad-hoc query processing
increases. Finally in the
most
developed stage
there is a high level of event
based triggering. As the
system moves from stage
-1
to stage-5 it
becomes what is called as an active data
warehouse.
What is a
DWH?
A
complete repository of historical corporate
data extracted from transaction systems
that is
available
for ad-hoc access by knowledge
workers
The
other key points in this
standard definition that I have
also underlined and listed
below are:
Complete
repository
· All
the data is present from
all the branches/outlets of
the business.
· Even
the archived data may be
brought online.
· Data
from arcane and old
systems is also brought
online.
Transaction
System
· Management
Information System
(MIS)
· Could be
typed sheets (NOT
transaction system)
Ad-Hoc
access
· Does not
have a certain predefined
database access
pattern.
· Queries
not known in advance.
· Difficult
to write SQL in
advance.
Knowl
edge workers
· Typically
NOT IT literate (Executives, Analysts,
Managers).
· NOT
clerical workers.
· Decision
makers.
9
The
users of data warehouse are
knowledge workers in other words they are
decision makers in
the
organization. They are not the clerical
people entering the data or overseeing
the transactions
etc or
doing programming or performing system
design/analysis. These are
really decision
makers in
the organization like General Manager
Marketing, or Executive Director or CEO
(Chief
Operating Officer). Typically
those decision makers are
people in areas like
marketing,
finance and
strategic planning
etc.
Completeness:
There is a
misnomer here, about
completeness. As per the standard
definition a
data
warehouse is a complete repository of
corporate data. The reality
is that it can never
be
complete. We
will discuss this in detail
very shortly.
Transaction
System: Unlike
databases where data is directly
entered, the input to the
data
warehouse
can come from OLTP or
transactional systems or other third
party databases. This is
not a rule,
the data could come
from typed or even hand
filled sheets, as was the
case for the
census
data warehouse.
Ad-Hoc
access: It dose
not have a certain repeatable
pattern and it's not
known in advance.
Consider
financial transactions like a
bank deposit, you know
exactly what records will
be
inserted
deleted or updated. That's in
OLTP system and in ERP
system. But in a data
warehouse
there
are really no fixed
patterns. Say the marketing
person, just sits down
and thinks abou t
what
questions
he/she has about customers
and there behaviors and so on
and they are typically
using
some
tool to generate SQL dynamically
and then that SQL
gets executed and that you
don't know
in
advance.
Although
there may be some patterns
of queries, but they are
really not very predictable and
the
query patterns
may change over time.
Hence there are no
predefined access paths into
the
database.
That's why relational
databases are so important for
the data warehouse,
because
relational
databases allow you to navigate
the data in any direction
that is appropriate using the
primary,
foreign key structure within
the data model. Meaning,
using a data warehouse, does
not
implies that we
just forget about
databases.
Another
view of a DWH
Subject
Oriented
Integrated
Time
Variant
Non
Volatile
10
Figure-2.2:
Another view of a Data
Warehouse
Subject
oriented. The
goal of data in the data
warehouse is to improve decision
making,
planning, and
control of the major
subjects of enterprises such as
customer, products, regions,
in
contrast to
OLTP applications that are
organized around the work-flows of
the company.
Integrated.
The
data in the data warehouse
is loaded from different sources
that store the data
in
different
formats and focus on different
aspects of the subject. The
data has to be
checked,
cleansed
and transformed into a
unified format to allow easy
and fast access.
Time
variant. Time
variant records are records
that are created as of some
moment in time.
Every record in
the data warehouse has
some form of time variancy
associated with it. In an
OLTP
system, the contents change
with time i.e. updated such
as bank account balance or
mobile
phone
balance, but in a warehouse as the
data is loaded; the moment usually
becomes its time
stamp.
Non-volatile.
Unlike
OLTP systems, after inserting data in
the data warehouse it is
neither
changed
nor removed. The only
exceptions are when false
or incorrect
data gets inserted
erroneously or
the capacity of the data
warehouse exceeded and
archiving becomes
necessary.
11
Table of Contents:
|
|||||