|
|||||
Virtual
University of Pakistan
Data
Profiling
Lab
lec: 4
Ahsan
Abdullah
Assoc. Prof.
& Head
Center for
Agro -Informatics
Research
www.nu.edu.pk/cairindex.asp
National
University of Computers & Emerging
Sciences, Islamabad
Email:
ahsan101@yahoo.com
Slide 2
Single
View
· In SQL
Server we have four
databases
corresponding to
each campus
· Each
campus has a Student &
Registration
table
· These
databases can give us campus
wide
view of
student information and
their
enrollment
· What
about if we want to have a
university
wide
view of student information
and their
enrollment?
In SQL
Server we have four
databases corresponding to each
campus. Each campus has a
Student
& Registration
table. Now both tables for
each campus have been
loaded to MS SQL sever.
It
means by
this stage we have four
databases (one for each
campus) and corresponding to
each
database we
have two tables.
These
databases can give us campus
wide view of student
information and their enrollment.
In
order to
get university wide view of
information we have to have
consolidated information of
all
campuses. To
consolidate we need to standardize
information across all
campuses. In this
lecture
we will
step towards consolidation.
Slide 3
417
Single
Source & Single
View
·
To get
in-depth university wide
view we
need to
put all data into a
single source
·
Can we
just combine all student
tables to
have a
single university student
table?
·
Definitely
NO, as
1.
Order of
columns are
different
2.
Data
types of columns are
different
3.
Number of
columns is different in each
table
4.
Date
formats are different
5.
Gender
convention is different in each
table
At this time we
have four student tables in
different SQL databases. To
consolidate all tables
to
get
singl e source of truth we can not
just glue all tables because
of following facts:
Order of
columns are
different
Data
types of columns are
different
Number of
columns is different in each
table
Date
formats are
different
Gender
convention is different in each
table
Slide 4
Need: Data
Standardization
· Before
combining all tables we need
to
standardize
them
· Number
and types of columns, date
formats
and
storing conventions all of
them should
be consistent in
each table
· The
process of standardization
requires
transformation of
data elements
· To identify
the degree of
transformation
required we
will perform data
profiling
To remove
the factors that are not
letting us putting all data
together we need to standardize
all
tables.
Standardization process involves
the consistency of number and
types of columns,
date
formats, and
storing conventions across all
campuses and more. To
standardize we need to
transform data
elements. To identify the
degree of transformation required we need to
perform
data
profiling.
Slide 5
418
New Columns:
Distinct Row ID
· Add
another column to each
student table
· This
new column is named as
RowID
· It is used to
identify each record
separately
· It can be
simple auto increment
column
By this time,
due to lack of primary key,
records can not be identified uniquely.
At this stage we
need an
attribute that can identify
each record uniquely. We need such an
identifying attribute at
this
stage because a lot of
records in this stage will
be put to error or exception table
due to error
in any of
the columns. In the error
table they will be corrected
and later on after correction
the
changes
will be updated in original
student table. For this
update we need a column like
row id to
join
error table with student
table.
Slide 6
New Columns:
Dirty bit
· Add a
new column to each student
table
· This
new column is named as
"Dirty bit"
· It can be
boolean type
column
· This
column will help us in
keeping record of
rows
with errors, during data
profiling
To keep record
of the rows that have been
inserted into error tables
due to certain errors we
need
an additional column in
student table that will
serve as dirty bit. Dirty
bit of those records is set
to
true
that are inserted in the
error table.
Slide 7
419
Exception
table
· Create
error tables exactly same in
structure
as student
table except
It does
not contain any dirty
bit column
It contains an
additional column
`Comment'
· `Comment'
contains the name of column
having error
· After
correction in error table
only those rows
will be
updated in original student
tables that
are
changed
Error or
exception table contains the
copy of records that have
corrupted values for any
column in
original
student table. Error table
is the copy of original
student table except instead
of dirty bit
we will
have comments column in the
error table. In comments column we
store the name of
columns in
which we encountered errors
for this particular row. For
example consider a
record
`R', it
has missing gender and incorrect
date of birth. For the
record R we will have
comment
[Gender],
[Date of Birth]. These
comments will be used while
correction of error tables.
After
correction of
error table we will update
corresponding rows in the original
student table.
Slide 8
Towards
Standardization: Profiling,
Exception
&transformation
· Data
profiling
Identify
erroneous records
Copy
erroneous records to Exception
table and
set
dirty bit of erroneous
records in student
table of a
campus
· Correct
exception table
Reflect
corrections in exception table in
original
student
table
· Transform
student table
Corrected
records in student table are
then
transformed
and copied to the table
Student_Info
Data
profiling is a process which
involves gathering of information about
column through
execution of
certain queries with intention to
identify erroneous records. In
this process we
identify
the following:
Total
number of values in a column
Number of
distinct values in a column
Domain of a
column
Values out of
domain of a column
Validation of
business rules
420
We run
different SQL queries to get
the answers of above
questions. During this
process we can
identify
the erroneous records. Whenever we
will come across an
erroneous record we will
just
copy it in
error or exception table and
set the dirty bit of
record in the actual student
table.
Then we
will correct the exception
table. After this profiling
process we will transform
the
records
and load them into a new
table Student_Info.
Slide 9
Towards
standardization: Repeat
profiling,
Combine Campus Wise
Standardized
Tables
· Data
profiling
Prepare
profile again and note
the reduction in
errors
· Student_Info
must be
clean and
standardized
campus table
· Same
process will be repeated for
all
campuses
· Finally
Student_Info tables of all
campuses
will be combined to
get a standardized
single
table of data from all
campuses
After
transforming all records, we
will perform data profiling
again and identify the
reduction in
the number of
errors.
Student_Info
must be clean and
standardized c ampus table.
After performing same steps
for all
campuses we
will just put Student_Info tables
from four campuses together
to get consolidated
source.
Slide 10
Process of Data
Profiling
· Data
profiling, gathering information
about
columns,
fulfils the following two
purposes
Identify
the type and extent to
which
transformation is
required
Gives us a
detailed view of data
quality
· Data
profiling is required to be
performed
twice
Before
applying transformations
After
applying transformations
Data
profiling is a process of gathering
information about col umns, It
must fulfil the
following
purposes
· Identify
the type and extent to
which the transformation is
required
· The number
of columns which are required to be
transformed and which
transformation
is required,
meaning date format or
gender convention.
· It should
provide us a detailed view
about the quality of data.
The number of
erroneous
values and
the number of values out of domain.
421
To judge effectiveness of
transformation we perform data profiling
twice. One before
transformation
and the other after
transformation.
Slide 11
Data
Profiling: Lahore
`SID'
· Column
Name
SID
· Data
Type
Varchar
[255]
· Nullable
Yes
· Nature
Numeric values
only
· Missing
values
Zero
· Total
values
5201 ... as
many records
· Unique
values
4401 ....
Same IDs are repeating
for MS Students
also
The
slide shows data profiling
for Student ID. The
results can help us in identifying
that what
transformations
should be applied. Like we can
see from here that
data type is highly
incompatible. To
store SID we should not have
varchar[255], we can use
varchar[10] at
maximum.
To know whether
column is unique or not, we can compare
total values with distinct / unique
values. If both
values are same then its
mean that our column is unique.
Similarly we can also
identify
and count the total
number of null
values.
Slide 12
Data
Profiling: Lahore
`SID'
Problems
· Maximum
value
4400
· Negative
values
No
· When
wizard is used to load data
from text file,
it
creates
table through automatic
query with all
columns
having same data type
`varchar [255]'
· As in this we
have merged the records
for BS and
MS therefore SID is
no more unique whereas
it
was
used as primary key for BS
and MS separately
Two major
problems that are identified
are:
· Varchar[255]
is very large to store IDs,
names and even addresses, we
must need to
change
it.
· There is no unique
column that can be used as
primary key
422
Slide 13
Transformations:
Lahore `SID':
· Column
Name
SID
· Column
Type
Changed to
`Numeric'
· Nullable
No
· The
above two transformations
are required
for
SID
The
slide shows the
transformations that are
suggested for the first
column that is SID.
First of all
its type should be changed
to Numeric, as varchar[255] does not make
any sense.
Secondly we
need to change nullability
option to no, as ID of each
student must be present
there.
NULLs in ID
can cause great trouble
while joining tables.
Slide 14
Data
Profiling: Lahore
`St_Name'
· Column
Name
St_Name
· Data
Type
Varchar
[255]
· Nullable
Yes
· Nature
Text or
char arrays
· Missing
values
Zero
· Total
values
5201 ... as
many records
· Unique
values
4793 .... Some names
are repeating
The
above slide shows the
profiling outpu t of St_Name column. The
slide shows that no
names
are
missing and some of the
names are repeating.
423
Slide 15
Transformation:
Lahore `St_Name'
· `One to
Many' transformation will be
applied
here
· Column
names
First_Name
char[10]
Last_Name
char[10]
Student_Name
char[10]
· Nullable
No
To store
information we need one to
many transformations of names. We need to
transform name
of each
stu ent into 3
columns
d
· First
Name
· Last
Name
· Student
Name (middle part of
name)
This
type of transformation requires scripts.
We will write VB Scripts for
such transformations.
Slide 16
Data
Profiling: Lahore
`Father_Name'
· Column
Name
Father_Name
· Data
Type
Varchar
[255]
· Nullable
Yes
· Nature
Text or
char arrays
· Missing
values
Zero
· Total
values
5201 ... as
many records
· Unique
values
4574.... Students may be
siblings
The
slide shows the profiling
for Fathe r's
name.
424
Slide 17
Data
Profiling: Lahore
`Gender'
· Column
Name
Gender
· Data
Type
Varchar
[255]
· Nullable
Yes
· Nature
Binary
values only (0/1)
· Missing
values
187
· Total
values
5014
The
slide shows the profiling
details for column
Gender.
Slide 18
Data
Profiling: Lahore `Gender
'
Convention &
Unique Values
· Convention
0 for
Male
1 for
Female
· Unique
values
8 including
(0,1,00,11,01,10,001,M)
There
should be only two unique values in
this column (0 and 1). But
while profiling we
have
found
that there are 8 unique
values. Its mean there are a
lot of rows corrupted in original
data.
Some
may be typos or missing
values etc. We need to clean
data to make it usable for
analysis
purposes.
425
Slide 19
Data
Quality: Erroneous
Records
Exception
Table
· Any
record having value other
than 0 or 1
are
erroneous
· Move
all such records to
exception table
INSERT
INTO Exception
(SID,
St_Name, Father_Name , Gender,
Address, [Date of
Birth], [
Reg Date], [Reg Status],
[Degree Status],
[Last
Degree],
RowID)
Select
SID, St_Name, Father_Name,
Gender, Address,
[Date
of Birth], [
Reg Date], [Reg Status],
[Degree Status],
[Last
Degree],
RowID
From
Student
Where
(Gender <>'0') AND (
Gender <>'1')
Legal
values are just 0 and 1.
All other values are
noise and required to be cleaned. To
clean
values put all
corrupted rows to exception table.
Following query can be used to
move all
corrupted rows to
exception table.
INSERT
INTO Exception
(SID,
St_Name, Father_Name, Gender,
Address, [Date of Birth], [Reg Date],
[Reg Status],
[Degree
Status], [Last Degree],
RowID)
Select
SID,
St_Name, Father_Name, Gender,
Address, [Date of
Birth],
[Reg
Date], [Reg Status], [Degree
Status], [Last Degree],
RowID
From
Student
Where
(Gender <>'0') AND (
Gender <>'1')
Slide 20
Data
Quality: Erroneous
Records
ous
cords
Commentts
Collumn
Commen s Co
umn
Above
query woulld copy
allll records
wiitth
ry wou d
copy a records w h
errors in
Gender fiielld to
excepttiion ttablle
der f e d to
excep on ab e
· Insertt the
name of erroneous ffiielld ii..e..
ser the
ame f roneous e d e
Gender iin the
Commentts collumnoff
der n
the Commen s co umn o
Exceptiion ttablle
ept on
abe
UPDATE
Exception
DATE
SET
Commentts =
''Genderr'
Commen s =
Gende '
T
WHERE
(Comments IIS
NULL))
ERE
ments S
NULL
As the
Comment column was added
just to store the name of
column having error, therefore,
above query is
required to be executed to update Comme
nt column to enter the name of
corrupted
column against
each row.
This query is
required to be run right after
each Insert query to exception table,
otherwise we will
lose
the information `What was
wrong in a particular row?' in exception
table.
426
Slide 21
Data
Quality: Erroneous
Records
Set Dirty
Bit
· Furthermore,
set dirty bit of all
records
copied to
exception table
UPDATE
Student
SET
Dirty =
1
WHERE
(Gender
<> '0') AND (Gender <>
'1')
At the
same time we need to update
each corrupted record in original student
table for Lahore
campus by
setting its dirty bit. So
that we can maintain the record
which rows are copied in
error
table.
Dirty bit says that
this record has at least
one field corrupted. It does
not show how
many
fields
are corrupted.
Slide 22
Data
Profiling consists of...
427
Slide 23
Data
Profiling: `Date of
Birth'
· Column
Name
Date of
Birth
· Data
Type
Varchar
[255]
· Nullable
Yes
· Format
d-MMM-yy
e.g. 8-Jul-94
The
slide shows the output of
profiling of column `Date of Birth'.
Again column type is
incorrect,
column can
contain null values, and
format of date is
d-MMM-yy.
Profiling
Date needs to change our
flow of work. For columns
before this we just did
profiling
and no
transformation has been done
yet. We can not profile
date without transforming it . So
we
need to
transform date first and
then profile. The problem
is, when we loaded all
records to SQL
server from
text files they are
loaded as strings (character
arrays). While profiling we
may want
to get
the range of dates (minimum
and maximum dates). We can
not identify date ranges
until or
unless we
transform it as date data
type.
Slide 24
Handling
Dates: Poblem
· While
profiling we need to run
queries
to
identify
Inconsistencies in
date formats
Invalidities
like 29th
Feb
1975
Missing
values of dates
Violations in
business rules like 10
years
student in
graduating class
By this time you
must have got the idea
that data profiling is a
powerful method to have an
idea
about
the quality of data. While
profiling data we need to
run queries to
identify:
· Inconsistencies
in date formats
· Invalidities
like 29th
Feb
1975
· Missing values of
dates
· Violations in
business rules like 10 years
student in graduating class
428
Not
only the values out of domain of
columns hit the quality of
data but certain values in
the
domain of column
may cause the quality to
suffer. Like the values
cause the violation of
business
rules.
While profiling data we must
have a set of rules
regarding to our busines s
like we can not
allow
any student less than 16
years to be admitted for BS.
While profiling we need to
identify
the
records violating such rules
means if we find a 10 years
old student in graduated
students then
its
the violation of business
rule and there must be
some error either in date of
birth of student or
date of
graduation. Similarly any student
has date of graduation less
than date of birth then
again
it is considered
as noise while both dates
are in valid domain. Such
dates are required to be
identified at
the time of profiling and
needs to be dealt separately in
error table.
Slide 25
Data
Quality: Format
Inconsistencies
Date Error:
Identification Query
· Dates
with errors in formats can
be
identified
as
SELECT
*
FROM
Student
WHERE
([Date of
Birth] NOT LIKE '%_-Jan-__')
AND
([Date of
Birth] NOT LIKE '%_-Feb-__')
AND
([Date of
Birth] NOT LIKE '%_-Mar-__')
AND
([Date of
Birth] NOT LIKE '%_-Apr-__')
AND
([Date of
Birth] NOT LIKE '%_-May-__')
AND
([Date of
Birth] NOT LIKE '%_-Jun-__')
AND
([Date of
Birth] NOT LIKE '%_-Jul-__')
AND
Errors in the
format of dates can be
identified through following
query.
SELECT
*
FROM
Student
WHERE ([Date of
Birt h] NOT LIKE
'%_-Jan-__')
AND
([Date of Birth]
NOT LIKE '%_-Feb-__')
AND
([Date of Birth]
NOT LIKE '%_-Mar-__')
AND
([Date of Birth]
NOT LIKE '%_-Apr-__')
AND
([Date of Birth]
NOT LIKE '%_-May-__')
AND
([Date of Birth]
NOT LIKE '%_-Jun
-__')
AND
([Date of Birth]
NOT LIKE '%_-Jul-__')
AND
([Date of Birth]
NOT LIKE '%_-Aug-__')
AND
([Date of Birth]
NOT LIKE '%_-Sep-__')
AND
([Date of Birth]
NOT LIKE '%_-Oct-__')
AND
([Date of Birth]
NOT LIKE '%_-Nov-__')
AND
([Date of Birth]
NOT LIKE '%_-Dec-__')
AND
([Date of Birth]
<> '') AND
([Date of Birth]
IS NOT NULL)
The above query
can not identify all
invalid dates like 30-Feb-2005.
But any date with
invalid
format, o r
having illegal month can be
identified through the above
query.
429
Slide 26
Data
Quality: Format
Inconsistencies
Date Error:
Query output-1
([Date of
Birth] NOT LIKE '%_-Aug-__')
AND
([Date of
Birth] NOT LIKE '%_
-Sep-__') AND
([Date of
Birth] NOT LIKE '%_
-Oct-__') AND
([Date of
Birth] NOT LIKE '%_
-Nov -__') AND
([Date of
Birth] NOT LIKE '%_
-Dec-__') AND
([Date of
Birth] <> '') AND
(Invaed f Btries Iare OT
NULL)
[Datli oeni th] S
N
The output of
the above query shows 3
invalid dates. 22-Jal-75,1/27/75,27-Apl -77.
These are
invalid
dates as their format are
not correct.
Whi le
transforming dates from string to
date format, all dates
must be legal. If any illegal
date
comes
the package will be
terminated and all
transactions will be rolled
back. So through
sequence of
different query we try to
identify all invalid dates
so tha t our package can
complete
its
execution successfully.
Slide 27
Data
Quality: Format
Inconsistencies
Date Error:
Query output-2
· Invalid
dates can be 29 th Feb 1975 or 31st
June
etc
SELECT
*
FROM
Student
WHERE
([Date of
Birth] LIKE '29 -Feb-%')
OR
([Date of
Birth] LIKE
'3_-%')
To identify
invalid dates like 29th Feb 1975 or 31st June, we can run
the following query
SELECT
* FROM
Student
WHERE
([Date of Birth] LIKE '2
-Feb-%') OR
9
([Date
of Birth] LIKE '3_ -% '
)
There is no
finalized methodology to identify errors in
dates. It is up to the designer
what
sequence of
queries he/she generates to
identify the errors in dates
because the purpose is
to
identif y
all erroneous dates before
running the package.
Otherwise, package's execution
will be
terminated
and all transactions caused
due to that package will be
rolled back.
430
Slide 28
Data
Quality: Multiple
Inconsistencies
· If any
record is selected whose
dirty bit is
already
set we will not copy it
again to
exception
table rather we will modify
the
comment of
Exception table
· But in
this query all selected
records have
their
dirty bits off
· So copy
all record to Exception
table and
set
their dirty bits on in
Student table
It has
been discussed earlier, how we
will work with error
table. Now if we meet a record
whose
dirty
bit is already set then we
will not copy the record
again but we will modify the
comment in
exception
table. We will append the
name of second erroneous column
with the name of
column
already present
there.
Slide 29
Data
Quality: Multiple
Inconsistencies
Example
SELECT
[Date of Birth], RowID,
Dirty
FROM
Student
WHERE
([Date of
Birth] LIKE '31-Sep-%')
OR
([Date of
Birth] LIKE '31-Feb-%')
OR
([Date of
Birth] LIKE '31-Apr-%')
OR
([Date of
Birth] LIKE
'31-Nov-%')
We can
use following query as well.
The above query and the query in
the previous slide both
are
aimed at
identify the invalidities in
dates. There may be some
other set of queries that
can be
used
wit h the same intention to
identify invalid dates. Good
queries are those that
identify and
output
only erroneous records and
all erroneous
records.
431
Slide 30
Data
Quality: Multiple
Inconsistencies
Handling
· RowID
3695 has already been
copied due to
error in
Gender, therefore it is not
required
to be copied
again just modify its
comments
as
UPDATE
Exception
SET
Comments =
Comments + '+ Date of
Birth'
WHERE
([Date of
Birth] LIKE '31-Sep-%')
OR
([Date of
Birth] LIKE '31 -Feb-%')
OR
([Date of
Birth] LIKE '31 -Apr-%')
OR
([Date of
Birth] LIKE '31
-Nov-%')
For
the records whose dirty
bit is already on, we do not
ne ed to enter the record again but
to
modify
the comment only through
following query
UPDATE
Exception
SET
Comments =
Comments + '+ Date of
Birth'
WHERE
([Date of
Birth] LIKE '31 -Sep-%')
OR
([Date of
Birth] LIKE '31 -Feb-%')
OR
([Date of
Birth] LIKE '31-Apr-%')
OR
([Date of
Birth] LIKE
'31-Nov-%')
Slide 31
Summary:
`Date of Birth '
· Total dates
found with inconsistent
formats
are 3
· Total
invalid dates are 4
· Total
Missing dates are 9
· Business rules
are yet to be validated
Slide shows
the summary of data
profiling of `Date of Birth'. Business
rule can be as
follows:
At the
time of joining for BS,
student must be at least 16
years old
At the
time of joining for MS,
student must be at least 20
years old
Business
rules are yet to be
validated because to validate
business rule we have to
have date of
registration as
well.
432
Slide 32
Transformation:
Lahore `Date of
Birth'
· Column
Name
DoB
· Column
Type
DateTime
· Nullable
Yes
Following
transformations are suggested for
Date of Birth:
· Change
the name of column as
DoB
· Change
column type as DateTime
format
· Allow
null values as there are
some nulls and empty strings
in the orig inal
data
Slide 33
Data
Profiling: Lahore `Reg
Date'
· Column
Name
Reg
Date
· Data
Type
Varchar
[255]
· Nullable
Yes
· Format
d-MMM-yy
e.g. 8-Jul-94
Now we
need to profile date of registration
column. The same procedure
that we have done
for
Date of
Birth is required to be repeated.
433
Slide 34
Transformation:
Lahore `Reg Date'
· Column
Name
Reg
Date
· Column
Type
DateTime
· Nullable
Yes
The
slide shows t he transformations
that are required for Registration
date.
Slide 35
Data
Profiling: `Business rule
validation'
Two
business rules are required
to be
validated
here
· All new
registrations are done in
month of
August
before 28
· Transfer
cases can also be dealt
in
January
· At the
time of registration for BS
age must
be greater
than 16 years and for MS
age
must be
greater than 20 years
Now as we
have profiled date of registration as
well therefore we can validate
the following
business
rules
· All new
registrations are done in
month of August before 28th
· Transfer
cases can also be dealt in
January
· At the
time of registration for BS age must be
greater than 16 years and
for MS age
must be
greater than 20
years
There
can be a lot of business
rules that are supposed to
be validated at this stage. It
totally
depends upon
the business.
Other than
business rules we also
validate some logical rules
at this stage, like date of
registration
can
not be less than date of
birth. No one could register
for a degree before birth.
Similarly date
of birth
can not be 100 years
back and so on. Such
rules can be devised keeping
in mind the
nature of
business.
434
Slide 36
Data
Profiling: `Business rule
validation'
Correct
Records
· Total
correct records in Student
table
is
4958
SELECT
COUNT(*) AS
Expr1
FROM
Student
WHERE
(Dirty =
0)
· We will
validate business rules only
for
correct
records
At this
stage we have only correct
records in Student table.
All records with errors in
dates have
been
moved to exception table.
Now we will validate the
correct records for given
set of business
rules
and general logical rules.
Records violating any of the
rules are required to copy
in
exception
table.
Slide 37
Data
Profiling: `Business rule
validation'
Registration
Date
· Validate
registrations date
SELECT
COUNT([Reg Date]) AS
Date
FROM
Student
WHERE
(Dirty = 0)
AND ([Reg Date] NOT
LIKE '% -Aug-%')
AND
([Reg
Status] = 'A') OR
(Dirty = 0)
AND ([Reg Date] NOT
LIKE '% -Jan-%')
AND
([Reg
Status] = 'T')
(Dirty = 0)
AND ([Reg Date] NOT
LIKE '% -Aug-%')
AND
([Reg
Status] = `T')
· 0
· All
dates are valid
The s
lide shows the query that
can violate the following
two rules:
· All
new registrations are done
in month of August before 28th
· Transfer
cases can also be dealt in
January
We can
see that the first
check ,`dirty bit = 0', is
giving us the records having
no errors. The
following query
can identify all records
that have registration date in
any month other than
August
and January. Those records
will be erroneous records.
Records having registration date
in
January
and Registration status is
transfer case is legal. Similarly in
August both Transfer
cases
and
new admissions are
legal.
SELECT
COUNT([Reg Date]) AS
Date
FROM
Student
WHERE
(Dirty
= 0) AND ([Reg Date] NOT
LIKE '% -Aug-%') AND ([Reg
Status] = 'A') OR
435
(Dirty
= 0) AND ([Reg Date] NOT
LIKE '% -Jan-%') AND ([Reg
Status] = 'T')
(Dirty
= 0) AND ([Reg Date] NOT
LIKE '% -Aug-%') AND ([Reg
Status] = `T')
Slide38
Data
Profiling: `Business rule
validation'
Age
· Validate
age while
registration
SELECT
*
FROM
(SELECT
*
FROM
Student
WHERE
dirty = 0)
DERIVEDTBL
WHERE
(CAST([Reg Date] AS DateTime)
-
CAST([Date of
Birth] AS DateTime ) <
16)
· 3 violating
records can be seen on
next
slide
The
slide shows the query that
can be used to validate the
following business
rule:
At the
time of registration for BS
age must be greater than 16
years and for MS age
must be
greater
than 20 years
The query again
first identifies the correct
records through dirty bit.
Then checks which of
the
students
are less than 16 years at
the time of registration. To apply `+' or ` -'
operator on dates we
need to
cast date strings into
date data type by using
CAST function. If there comes
any invalid
date
the query terminates with
error.
Slide 39
Data
Profiling: `Business
rule
validation'
Age:
Violating Records
· Copy
these records to Exception
table
and set
dirty bit
· Update
comments in Exception
table
with
value = `BR: Age'
On the
slide we can see three
records that are violating
business rules. In the first
record date of
birth is
20-Nov-75 and date of Registration is
4-Aug-74, which is obviously
impossible. Here,
one
thing is obvious that date
of registration is incorrect as we have data
starting from 1994 only.
Either date of
birth is correct or not,
nothing can be said without
consulting golden copy (Original
copy).
We will
copy these records to
exception table and set
the dirty bit for
these records. For
such
cases we
will update comment by
appending `BR:Age' (Business
rule: Age).
436
Table of Contents:
|
|||||