Lecture 19
ETL Detail: Data Cleansing
Background
§ Other names: also called data scrubbing or data cleaning.
§ More than data arranging.
§ Big problem, big effect: an enormous problem, as most data is dirty. GIGO.
§ Dirty is relative.
§ Paradox: must involve a domain expert, as detailed domain knowledge is required, so it becomes semi-automatic; but it has to be automatic because of large data sets.
§ Data duplication.
It is crucial to the process of loading the data warehouse that you not just rearrange the grouping of source data and deliver it to a destination database, but that you also ensure the quality of that data. Data cleansing is vitally important to the overall health of your warehouse project and ultimately affects the health of your company. Do not take this statement lightly.
The true scope of a data cleansing project is enormous. Much of production data is dirty, and you don't even want to consider what work cleaning it up would take. By "dirty," I mean that it does not conform to proper domain definitions or "make sense." The age-old adage "garbage in, garbage out" still applies, and you can do nothing about it short of analyzing and correcting the corporate data. What is noise in one domain may be information in another.
Usually the process of data cleansing cannot be performed without the involvement of a domain expert, because the detection and correction of anomalies requires detailed domain knowledge. Data cleansing is therefore described as semi-automatic, but it should be as automatic as possible because of the large amount of data that usually has to be processed and the time required for an expert to cleanse it manually. The original aim of data cleansing was to eliminate duplicates in a data collection, a problem that already occurs in single-database applications and gets worse when integrating data from different sources. We will have a complete lecture on it.
Lighter Side of Dirty Data
§ Year of birth 1995, current year 2005.
§ Born in 1986, hired in 1985.
§ Who would take it seriously? Computers, while summarizing, aggregating, populating etc.
§ Small discrepancies become irrelevant for large averages, but what about sums, medians, maximum, minimum etc.?
Value validation is the process of ensuring that each value that is sent to the data warehouse is accurate. You may have had the experience of looking at the contents of one of your major flat files or database tables and intuitively sensing that the data is incorrect. No way could that employee be born in 2004! You know your company doesn't hire infants. You may also discover another incorrect record. How could someone be born in 1978 but hired in 1977?
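As a minimal sketch of such a value-validation check (assuming each employee record is a simple dictionary with hypothetical birth_date and hire_date fields, and an assumed minimum hiring age), the dates above could be caught like this:

from datetime import date

MIN_HIRING_AGE = 15  # assumed business rule: the company does not hire children

def validate_employee(record):
    """Return a list of validation problems found in one employee record."""
    problems = []
    birth, hired = record["birth_date"], record["hire_date"]
    if hired < birth:
        problems.append("hired before being born")
    elif hired.year - birth.year < MIN_HIRING_AGE:
        problems.append("hired below the minimum hiring age")
    if birth > date.today():
        problems.append("birth date lies in the future")
    return problems

# The two suspicious records discussed above
employees = [
    {"id": 1, "birth_date": date(1978, 3, 1), "hire_date": date(1977, 6, 15)},
    {"id": 2, "birth_date": date(2004, 1, 9), "hire_date": date(2005, 5, 20)},
]
for emp in employees:
    for problem in validate_employee(emp):
        print(f"employee {emp['id']}: {problem}")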
All too often, these types of data integrity problems are laughed at and then ignored. All too often people say, "No one would actually take that information seriously, would they?" Well, maybe people won't, but MIS systems will. That information can be summarized, aggregated, and/or manipulated in some way, and then populated into another data element. And when that data element is moved into the DWH, analytical processing will be performed on it that can affect the way your company does business. What if data from the DWH is being analyzed to revise hiring practices? That data may push business decisions the wrong way if enough of the hire and birth dates are inaccurate.
Small data discrepancies can become statistically irrelevant when large volumes of data are averaged. But averaging is not the only analytical function used in analytical data warehouse queries. What about sums, medians, max/min, and other aggregate and scalar functions? Even further, can you actually prove that the scope of your data problems is as small as you think it is? The answer is probably "no."
Serious Problems due to Dirty Data
§ Decisions taken at the government level using wrong data, resulting in undesirable outcomes.
§ In direct-mail marketing, sending letters to wrong addresses: loss of money and a bad reputation.
Administration: The government analyzes data collected by the population census to decide which regions of the country require further investment in health, education, clean drinking water, electricity etc., because of current and expected future trends. If the rate of birth in one region has increased over the last couple of years, the existing health facilities and doctors employed might not be sufficient to handle the number of current and expected patients. Thus, additional dispensaries or the employment of more doctors will be needed. Inaccuracies in analyzed data can lead to false conclusions and a misdirected release of funds, with catastrophic results for a poor country like Pakistan.
Supporting business processes: Erroneous data leads to unnecessary costs and probably a bad reputation when used to support business processes. Consider a company using a list of consumer addresses, buying habits and preferences to advertise a new product by direct mailing. Invalid addresses cause the letters to be returned as undeliverable. People being duplicated in the mailing list account for multiple letters sent to the same person, leading to unnecessary expenses and frustration. Inaccurate information about consumer buying habits and preferences contaminates and falsifies the target group, resulting in the advertisement of products that do not correspond to consumers' needs. Companies trading such data face the possibility of an additional loss of reputation in the case of erroneous data.
3 Classes of Anomalies...
§ Syntactically Dirty Data
  § Lexical Errors
  § Irregularities
§ Semantically Dirty Data
  § Integrity Constraint Violation
  § Business Rule Contradiction
  § Duplication
§ Coverage Anomalies
  § Missing Attributes
  § Missing Records
Syntactically Dirty Data
Lexical errors: For example, assume the data to be stored in table form, with each row representing a tuple and each column an attribute. If we expect the table to have five columns because each tuple has five attributes, but some or all of the rows contain only four columns, then the actual structure of the data does not conform to the specified format.
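Such lexical errors can be detected mechanically while reading the source file. The sketch below assumes a hypothetical comma-delimited file, employees.csv, whose schema defines five attributes per tuple:

import csv

EXPECTED_COLUMNS = 5  # the schema defines five attributes per tuple

def find_lexical_errors(path, delimiter=","):
    """Yield (line_number, actual_width) for rows that do not match the expected width."""
    with open(path, newline="") as f:
        for line_no, row in enumerate(csv.reader(f, delimiter=delimiter), start=1):
            if len(row) != EXPECTED_COLUMNS:
                yield line_no, len(row)

for line_no, width in find_lexical_errors("employees.csv"):
    print(f"line {line_no}: expected {EXPECTED_COLUMNS} columns, found {width}")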
Irregularities are concerned with the non-uniform use of values, units and abbreviations. This can happen, for example, if we use different currencies to specify an employee's salary. This is especially problematic if the currency is not explicitly listed with each value and is assumed to be uniform: an annual salary of 20,000 could mean $20,000 or Rs. 20,000. Values are correct representations of facts only if we have the necessary knowledge about their representation to interpret them.
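One way to remove such an irregularity is to normalize every salary to a single currency before loading. The sketch below assumes a hypothetical currency code stored with each value and uses an illustrative, hard-coded exchange rate (in practice the rates would come from a reference table):

# Illustrative conversion rates to one reporting currency (PKR); assumed values only.
RATES_TO_PKR = {"PKR": 1.0, "USD": 280.0}

def normalize_salary(amount, currency):
    """Convert a salary to PKR so that all loaded values share one unit."""
    try:
        return amount * RATES_TO_PKR[currency]
    except KeyError:
        raise ValueError(f"unknown currency code: {currency!r}")

print(normalize_salary(20000, "USD"))  # 5600000.0 (PKR)
print(normalize_salary(20000, "PKR"))  # 20000.0 (PKR)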
Semantically Dirty Data
Integrity constraint violations describe tuples (or sets of tuples) that do not satisfy one or more of the integrity constraints. Integrity constraints are used to describe our understanding of the mini-world by restricting the set of valid instances. Each constraint is a rule representing knowledge about the domain and the values allowed for representing certain facts.
Contradictions are values within one tuple, or between tuples, that violate some kind of dependency between the values. An example of the first case could be a contradiction between the attributes AGE and DATE_OF_BIRTH for a tuple representing a person. Contradictions are either violations of functional dependencies that can be represented as integrity constraints, or duplicates with inexact values. They are therefore not regarded as a separate data anomaly.
Duplicates are two or more tuples representing the same entity from the mini-world. The values of these tuples do not need to be completely identical. Inexact duplicates are specific cases of contradiction between two or more tuples: they represent the same entity but with different values for all or some of its properties. This makes the detection and merging of duplicates harder.
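A small sketch of checking the AGE versus DATE_OF_BIRTH dependency mentioned above; the reference date against which AGE was recorded is an assumption of the example:

from datetime import date

def age_contradicts_dob(age, dob, as_of=None):
    """Return True if the recorded AGE cannot be derived from DATE_OF_BIRTH."""
    as_of = as_of or date.today()
    derived = as_of.year - dob.year - ((as_of.month, as_of.day) < (dob.month, dob.day))
    return age != derived

# A tuple claiming AGE = 25 for a person born in 1978 is flagged as a contradiction.
print(age_contradicts_dob(25, date(1978, 5, 2), as_of=date(2005, 1, 1)))  # True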
3 Classes of Anomalies...
§ Coverage, or lack of it
  § Missing Attribute
§ Result of omissions while collecting the data.
§ A constraint violation if we have null values for attributes where a NOT NULL constraint exists.
§ The case is more complicated where no such constraint exists.
§ We have to decide whether the value exists in the real world and has to be deduced here, or not.
Coverage Problems
Missing values: These result from omissions while collecting the data. They are a constraint violation if we have null values for attributes where a NOT NULL constraint exists. The case is more complicated where no such constraint exists: we have to decide whether the value exists in the real world and has to be deduced here, or not.
Missing tuples: These result from omissions of complete entities that exist in the mini-world but are not represented by tuples in the data collection.
Anomalies can further be classified according to the amount of data accounting for the constraint violation. This can be single values, values within a tuple, values within one or more columns of the data collection, or tuples and sets of tuples from different relations.
Why Missing Rows?
§ Equipment malfunction (bar code reader, keyboard etc.).
§ Inconsistent with other recorded data and thus deleted.
§ Data not entered due to misunderstanding/illegibility.
§ Data not considered important at the time of entry (e.g. Y2K).
There can be a number of reasons for missing rows or missing data. For example, there may be a fault with the hardware used for recording or inputting the data. This could be a faulty bar code reader that misses part of the UPC (Universal Product Code), or, at a lower-tech level, a problem with a keyboard (its numeric part) that registers different keys as the same key, so that only one value is kept and the others, which in fact corresponded to unique keys, are lost. In many cases I cannot even read what I have written, and I am sure I am not alone: if I can't read my own handwriting sometimes, then in many instances other people will be unable to read it either. This means either the data will not be entered at all or it will be entered incorrectly, even when using automatic means such as OCR (Optical Character Recognition) software. Another example is the famous (or infamous) Y2K problem, when the first two most significant digits of the year were not recorded, and in many cases it was almost impossible to differentiate between 1901 and 2001!
Handling Missing Data
§ Dropping records.
§ "Manually" filling missing values.
§ Using a global constant as filler.
§ Using the attribute mean (or median) as filler.
§ Using the most probable value as filler.
There can be a number of ways of handling missing data, the most convenient being to simply drop the records with missing values for certain attributes. The problem with this approach is deciding what we gain by dropping such records, and how to decide which records to drop, i.e. those with one missing value, or which ones, or how many missing values? Obviously this is a complex problem, and writing code for it would not be an easy task, so one easy way out could be to manually fill the missing values or to use a global constant as a filler. For example, if the gender of some customers is not known, we could fill the gender with (say) NA. For numeric attributes, filling with a global value may make more sense if that value is the mean or median of the column. Or the filler could be the most probable value for that attribute.
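The options above can be sketched with pandas (assumed available), on a hypothetical table with gender and salary columns:

import pandas as pd

df = pd.DataFrame({
    "gender": ["M", None, "F", None],
    "salary": [40000, None, 55000, 47000],
})

dropped = df.dropna()                                    # drop records with missing values
df["gender"] = df["gender"].fillna("NA")                 # fill a categorical attribute with a global constant
df["salary"] = df["salary"].fillna(df["salary"].mean())  # fill a numeric attribute with the column mean (or median)
# df["gender"] = df["gender"].fillna(df["gender"].mode()[0])  # or fill with the most probable value (mode)

print(df)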
Key-Based Classification of Problems
§ Primary key problems
§ Non-primary key problems
The problems associated with the data can correspond either to primary keys or to non-primary keys, and the nature of the problem and the corresponding solution vary with the type of key involved. In the subsequent slides we will discuss each of them one by one.
Primary key problems
1. Same PK but different data.
2. Same entity with different keys.
3. PK in one system but not in the other.
4. Same PK but in different formats.
Some of the problems associated with primary keys are:
1. Records may have the same primary key but might have different data. This can occur if primary keys are reused or when two organizations or units merge.
2. The same entity might occur with different primary keys. This can easily arise when different segments of an enterprise design databases independently of one another.
3. Data may have a primary key in one system but not in another. The classic example of this kind of error occurs when an entity is represented in more than one source database. It is quite possible that the entity is central to one of the databases and therefore has a primary key field or fields, while the same entity is so peripheral to the purposes of the other database that it is not dignified by being given a primary key.
4. Primary keys may be intended to be the same but might occur in different formats. Probably the most widespread instance of this error occurs when national IDs (NIDs) are used as primary keys: are they character data or numeric; if character data, do they contain dashes?
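A sketch of bringing such keys into one canonical form before records are matched across sources, by treating the key as character data and stripping separators. The dashed pattern shown is the one used for Pakistani national ID (CNIC) numbers and is only illustrative:

def normalize_nid(value):
    """Normalize a national ID key: treat it as character data and keep digits only."""
    return "".join(ch for ch in str(value) if ch.isdigit())

# The same key stored as a number, as dashed character data, and as plain character
# data all map to one canonical form.
print(normalize_nid(3520212345671))       # '3520212345671'
print(normalize_nid("35202-1234567-1"))   # '3520212345671'
print(normalize_nid("35202 1234567 1"))   # '3520212345671'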
Non-primary key problems...
1. Different encoding in different sources.
2. Multiple ways to represent the same information.
3. Sources might contain invalid data.
4. Two fields with different data but the same name.
1. Data may be encoded differently in different sources. The domain of a "gender" field in one database may be {'F', 'M'}, in another {"Female", "Male"}, or even {1, 0} (see the sketch after this list).
2. There are often multiple ways to represent the same piece of information. "FAST", "National University", "FAST NU" and "Nat. Univ. of Computers" can all be found in the literature as representing the same institution.
3. Sources might contain invalid data. A point-of-sale terminal may require that the sales clerk enter a customer's telephone number. If the customer does not wish to give it, clerks may enter 999-999-9999.
4. Two fields may contain different data but have the same name. There are a couple of ways in which this can happen. "Total Sales" probably means fiscal-year sales to one part of an enterprise and calendar-year sales to another. The second instance can be much more dangerous: if an application is used by multiple divisions, it is likely that a field that is necessary for one business unit is irrelevant to another and may be left blank by the second unit or, worse, used for otherwise undocumented purposes.
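A sketch of the kind of mapping tables used to reconcile such encodings and to flag obviously invalid filler values; the specific mappings and filler numbers are assumptions for illustration:

# Map every source-specific gender encoding onto one target code.
GENDER_MAP = {"F": "F", "M": "M", "Female": "F", "Male": "M", 1: "F", 0: "M"}

# Filler values clerks are known to type just to satisfy a "required" field.
INVALID_PHONES = {"999-999-9999", "000-000-0000"}

def clean_customer(row):
    row["gender"] = GENDER_MAP.get(row.get("gender"), "U")  # 'U' = unknown
    if row.get("phone") in INVALID_PHONES:
        row["phone"] = None                                  # treat the filler as missing
    return row

print(clean_customer({"gender": "Female", "phone": "999-999-9999"}))
# {'gender': 'F', 'phone': None}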
Non-primary key problems
§ Required fields left blank.
§ Data erroneous or incomplete.
§ Data contains null values.
5. Required fields may be left blank. Clerical errors account for most of these, but mobile phones did not come into use until the early 1990s, so contact numbers recorded before then will not include them.
6. Data may be erroneous or inconsistent. The telephone number may be for Lahore but the city listed as Karachi.
7. The data may contain null values. Null values can occur for a wide variety of reasons; the most common of these are:
a. Data that is genuinely missing or unknown,
b. An attribute that does not apply to an entity,
c. Data that is pending, or
d. Data that is only partially known.
Automatic Data Cleansing...
1) Statistical
2) Pattern Based
3) Clustering
4) Association Rules
Some of the data cleansing techniques are listed above. Let's discuss each of them in detail.
Statistical: Identify outlier fields and records using the values of the mean, standard deviation, range, etc., based on Chebyshev's theorem, considering the confidence intervals for each field. Outlier values for particular fields are identified based on automatically computed statistics. For each field, the average and the standard deviation are computed, and based on Chebyshev's theorem those records that have values in a given field lying more than a certain number of standard deviations from the mean are identified. The number of standard deviations to be considered is customizable. Confidence intervals are taken into consideration for each field.
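A minimal sketch of this statistical approach: flag values lying more than k standard deviations from the field mean. By Chebyshev's theorem, at most 1/k² of the values can lie that far from the mean regardless of the distribution, so anything flagged is genuinely rare:

import statistics

def chebyshev_outliers(values, k=3):
    """Return (index, value) pairs lying more than k standard deviations from the mean."""
    mean = statistics.mean(values)
    sd = statistics.stdev(values)
    return [(i, v) for i, v in enumerate(values) if abs(v - mean) > k * sd]

salaries = [42000, 45000, 43500, 41000, 44000, 420000]  # one suspicious entry
print(chebyshev_outliers(salaries, k=2))                # [(5, 420000)]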
Pattern-based: Identify outlier fields and records that do not conform to existing patterns in the data. Combined techniques (partitioning, classification, and clustering) are used to identify patterns that apply to most records. A pattern is defined by a group of records that have similar characteristics ("behavior") for p% of the fields in the data set, where p is a user-defined value (usually above 90).
Clustering: Identify outlier records using clustering based on Euclidean (or another) distance. Existing clustering algorithms provide little support for identifying outliers. However, in some cases clustering the entire record space can reveal outliers that are not identified by field-level inspection. The main drawback of this method is computational time: the clustering algorithms have high computational complexity, and for large record spaces and large numbers of records their run time is prohibitive.
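A sketch of the clustering approach over the full record space, assuming scikit-learn is available. Records that end up in very small clusters, or unusually far from their cluster centroid, are flagged; the thresholds are illustrative:

import numpy as np
from sklearn.cluster import KMeans

def clustering_outliers(records, n_clusters=3, k=2.0, min_cluster_size=2):
    """Flag records in tiny clusters or far (in Euclidean distance) from their centroid."""
    X = np.asarray(records, dtype=float)
    km = KMeans(n_clusters=n_clusters, n_init=10, random_state=0).fit(X)
    dist = np.linalg.norm(X - km.cluster_centers_[km.labels_], axis=1)
    sizes = np.bincount(km.labels_, minlength=n_clusters)
    far = dist > dist.mean() + k * dist.std()
    tiny = sizes[km.labels_] < min_cluster_size
    return np.where(far | tiny)[0]

# Two tight groups of (age, salary-in-thousands) records plus one that fits neither.
records = [(25, 40), (26, 42), (27, 41), (50, 90), (52, 95), (51, 92), (30, 300)]
print(clustering_outliers(records))  # [6]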
Association rules: Association rules with high confidence and support define a different kind of pattern. As before, records that do not follow these rules are considered outliers. The power of association rules is that they can deal with data of different types. However, Boolean association rules do not provide enough quantitative and qualitative information.