ZeePedia

Extracting Data Using Wizard

<< Lab Data Set: Multi -Campus University
Data Profiling >>
img
Virtual University of Pakistan
Data Profiling
Lab lec: 4
Ahsan Abdullah
Assoc. Prof. & Head
Center for Agro -Informatics Research
www.nu.edu.pk/cairindex.asp
National University of Computers & Emerging Sciences, Islamabad
Email: ahsan101@yahoo.com
Slide 2
Single View
· In SQL Server we have four databases
corresponding to each campus
· Each campus has a Student & Registration
table
· These databases can give us campus wide
view of student information and their
enrollment
· What about if we want to have a university
wide view of student information and their
enrollment?
In SQL Server we have four databases corresponding to each campus. Each campus has a Student
& Registration table. Now both tables for each campus have been loaded to MS SQL sever. It
means by this stage we have four databases (one for each campus) and corresponding to each
database we have two tables.
These databases can give us campus wide view of student information and their enrollment. In
order to get university wide view of information we have to have consolidated information of all
campuses. To consolidate we need to standardize information across all campuses. In this lecture
we will step towards consolidation.
Slide 3
417
img
Single Source & Single View
·
To get in-depth university wide view we
need to put all data into a single source
·
Can we just combine all student tables to
have a single university student table?
·
Definitely NO, as
1.
Order of columns are different
2.
Data types of columns are different
3.
Number of columns is different in each table
4.
Date formats are different
5.
Gender convention is different in each table
At this time we have four student tables in different SQL databases. To consolidate all tables to
get singl e source of truth we can not just glue all tables because of following facts:
Order of columns are different
Data types of columns are different
Number of columns is different in each table
Date formats are different
Gender convention is different in each table
Slide 4
Need: Data Standardization
· Before combining all tables we need to
standardize them
· Number and types of columns, date formats
and storing conventions all of them should
be consistent in each table
· The process of standardization requires
transformation of data elements
· To identify the degree of transformation
required we will perform data profiling
To remove the factors that are not letting us putting all data together we need to standardize all
tables. Standardization process involves the consistency of number and types of columns, date
formats, and storing conventions across all campuses and more. To standardize we need to
transform data elements. To identify the degree of transformation required we need to perform
data profiling.
Slide 5
418
img
New Columns: Distinct Row ID
· Add another column to each student table
· This new column is named as RowID
· It is used to identify each record separately
· It can be simple auto increment column
By this time, due to lack of primary key, records can not be identified uniquely. At this stage we
need an attribute that can identify each record uniquely. We need such an identifying attribute at
this stage because a lot of records in this stage will be put to error or exception table due to error
in any of the columns. In the error table they will be corrected and later on after correction the
changes will be updated in original student table. For this update we need a column like row id to
join error table with student table.
Slide 6
New Columns: Dirty bit
· Add a new column to each student table
· This new column is named as "Dirty bit"
· It can be boolean type column
· This column will help us in keeping record of
rows with errors, during data profiling
To keep record of the rows that have been inserted into error tables due to certain errors we need
an additional column in student table that will serve as dirty bit. Dirty bit of those records is set to
true that are inserted in the error table.
Slide 7
419
img
Exception table
· Create error tables exactly same in structure
as student table except
­ It does not contain any dirty bit column
­ It contains an additional column `Comment'
· `Comment' contains the name of column having error
· After correction in error table only those rows
will be updated in original student tables that
are changed
Error or exception table contains the copy of records that have corrupted values for any column in
original student table. Error table is the copy of original student table except instead of dirty bit
we will have comments column in the error table. In comments column we store the name of
columns in which we encountered errors for this particular row. For example consider a record
`R', it has missing gender and incorrect date of birth. For the record R we will have comment
[Gender], [Date of Birth]. These comments will be used while correction of error tables. After
correction of error table we will update corresponding rows in the original student table.
Slide 8
Towards Standardization: Profiling,
Exception &transformation
· Data profiling
­ Identify erroneous records
­ Copy erroneous records to Exception table and
set dirty bit of erroneous records in student
table of a campus
· Correct exception table
­ Reflect corrections in exception table in original
student table
· Transform student table
­ Corrected records in student table are then
transformed and copied to the table Student_Info
Data profiling is a process which involves gathering of information about column through
execution of certain queries with intention to identify erroneous records. In this process we
identify the following:
Total number of values in a column
Number of distinct values in a column
Domain of a column
Values out of domain of a column
Validation of business rules
420
img
We run different SQL queries to get the answers of above questions. During this process we can
identify the erroneous records. Whenever we will come across an erroneous record we will just
copy it in error or exception table and set the dirty bit of record in the actual student table.
Then we will correct the exception table. After this profiling process we will transform the
records and load them into a new table Student_Info.
Slide 9
Towards standardization: Repeat
profiling, Combine Campus Wise
Standardized Tables
· Data profiling
­ Prepare profile again and note the reduction in
errors
· Student_Info must be clean and
standardized campus table
· Same process will be repeated for all
campuses
· Finally Student_Info tables of all campuses
will be combined to get a standardized
single table of data from all campuses
After transforming all records, we will perform data profiling again and identify the reduction in
the number of errors.
Student_Info must be clean and standardized c ampus table. After performing same steps for all
campuses we will just put Student_Info tables from four campuses together to get consolidated
source.
Slide 10
Process of Data Profiling
· Data profiling, gathering information about
columns, fulfils the following two purposes
­ Identify the type and extent to which
transformation is required
­ Gives us a detailed view of data quality
· Data profiling is required to be performed
twice
­ Before applying transformations
­ After applying transformations
Data profiling is a process of gathering information about col umns, It must fulfil the following
purposes
·  Identify the type and extent to which the transformation is required
·  The number of columns which are required to be transformed and which transformation
is required, meaning date format or gender convention.
·  It should provide us a detailed view about the quality of data. The number of erroneous
values and the number of values out of domain.
421
img
To judge effectiveness of transformation we perform data profiling twice. One before
transformation and the other after transformation.
Slide 11
Data Profiling: Lahore `SID'
· Column Name
­ SID
· Data Type
­ Varchar [255]
· Nullable
­ Yes
· Nature
­ Numeric values only
· Missing values
­ Zero
· Total values
­ 5201 ... as many records
· Unique values
­ 4401 .... Same IDs are repeating for MS Students also
The slide shows data profiling for Student ID. The results can help us in identifying that what
transformations should be applied. Like we can see from here that data type is highly
incompatible. To store SID we should not have varchar[255], we can use varchar[10] at
maximum.
To know whether column is unique or not, we can compare total values with distinct / unique
values. If both values are same then its mean that our column is unique. Similarly we can also
identify and count the total number of null values.
Slide 12
Data Profiling: Lahore `SID'
Problems
· Maximum value
­ 4400
· Negative values
­ No
· When wizard is used to load data from text file, it
creates table through automatic query with all
columns having same data type `varchar [255]'
· As in this we have merged the records for BS and
MS therefore SID is no more unique whereas it
was used as primary key for BS and MS separately
Two major problems that are identified are:
·  Varchar[255] is very large to store IDs, names and even addresses, we must need to
change it.
·  There is no unique column that can be used as primary key
422
img
Slide 13
Transformations: Lahore `SID':
· Column Name
­ SID
· Column Type
­ Changed to `Numeric'
· Nullable
­ No
· The above two transformations are required
for SID
The slide shows the transformations that are suggested for the first column that is SID.
First of all its type should be changed to Numeric, as varchar[255] does not make any sense.
Secondly we need to change nullability option to no, as ID of each student must be present there.
NULLs in ID can cause great trouble while joining tables.
Slide 14
Data Profiling: Lahore `St_Name'
· Column Name
­ St_Name
· Data Type
­ Varchar [255]
· Nullable
­ Yes
· Nature
­ Text or char arrays
· Missing values
­ Zero
· Total values
­ 5201 ... as many records
· Unique values
­ 4793 .... Some names are repeating
The above slide shows the profiling outpu t of St_Name column. The slide shows that no names
are missing and some of the names are repeating.
423
img
Slide 15
Transformation: Lahore `St_Name'
· `One to Many' transformation will be applied
here
· Column names
­ First_Name
char[10]
­ Last_Name
char[10]
­ Student_Name
char[10]
· Nullable
­ No
To store information we need one to many transformations of names. We need to transform name
of each stu ent into 3 columns
d
·  First Name
·  Last Name
·  Student Name (middle part of name)
This type of transformation requires scripts. We will write VB Scripts for such transformations.
Slide 16
Data Profiling: Lahore `Father_Name'
· Column Name
­ Father_Name
· Data Type
­ Varchar [255]
· Nullable
­ Yes
· Nature
­ Text or char arrays
· Missing values
­ Zero
· Total values
­ 5201 ... as many records
· Unique values
­ 4574.... Students may be siblings
The slide shows the profiling for Fathe r's name.
424
img
Slide 17
Data Profiling: Lahore `Gender'
· Column Name
­ Gender
· Data Type
­ Varchar [255]
· Nullable
­ Yes
· Nature
­ Binary values only (0/1)
· Missing values
­ 187
· Total values
­ 5014
The slide shows the profiling details for column Gender.
Slide 18
Data Profiling: Lahore `Gender '
Convention & Unique Values
· Convention
­ 0 for Male
­ 1 for Female
· Unique values
­ 8 including (0,1,00,11,01,10,001,M)
There should be only two unique values in this column (0 and 1). But while profiling we have
found that there are 8 unique values. Its mean there are a lot of rows corrupted in original data.
Some may be typos or missing values etc. We need to clean data to make it usable for analysis
purposes.
425
img
Slide 19
Data Quality: Erroneous Records
Exception Table
· Any record having value other than 0 or 1
are erroneous
· Move all such records to exception table
INSERT INTO Exception
(SID, St_Name, Father_Name , Gender, Address, [Date of
Birth], [ Reg Date], [Reg Status], [Degree Status], [Last
Degree], RowID)
Select SID, St_Name, Father_Name, Gender, Address, [Date
of Birth], [ Reg Date], [Reg Status], [Degree Status], [Last
Degree], RowID
From Student
Where (Gender <>'0') AND ( Gender <>'1')
Legal values are just 0 and 1. All other values are noise and required to be cleaned. To clean
values put all corrupted rows to exception table. Following query can be used to move all
corrupted rows to exception table.
INSERT INTO Exception
(SID, St_Name, Father_Name, Gender, Address, [Date of Birth], [Reg Date], [Reg Status],
[Degree Status], [Last Degree], RowID)
Select
SID,  St_Name,  Father_Name,  Gender,  Address,  [Date  of  Birth],
[Reg Date], [Reg Status], [Degree Status], [Last Degree], RowID
From Student
Where (Gender <>'0') AND ( Gender <>'1')
Slide 20
Data Quality: Erroneous Records
ous
cords
Commentts Collumn
Commen s Co umn
Above query woulld copy allll records wiitth
ry wou d copy a records w h
errors in Gender fiielld to excepttiion ttablle
der f e d to excep on ab e
· Insertt the name of erroneous ffiielld ii..e..
ser the ame f  roneous e d e
Gender iin the Commentts collumnoff
der n the Commen s co umn o
Exceptiion ttablle
ept on abe
UPDATE  Exception
DATE
SET
Commentts = ''Genderr'
Commen s = Gende '
T
WHERE  (Comments IIS NULL))
ERE
ments S NULL
As the Comment column was added just to store the name of column having error, therefore,
above query is required to be executed to update Comme nt column to enter the name of corrupted
column against each row.
This query is required to be run right after each Insert query to exception table, otherwise we will
lose the information `What was wrong in a particular row?' in exception table.
426
img
Slide 21
Data Quality: Erroneous Records
Set Dirty Bit
· Furthermore, set dirty bit of all records
copied to exception table
UPDATE
Student
SET
Dirty = 1
WHERE
(Gender <> '0') AND (Gender <> '1')
At the same time we need to update each corrupted record in original student table for Lahore
campus by setting its dirty bit. So that we can maintain the record which rows are copied in error
table. Dirty bit says that this record has at least one field corrupted. It does not show how many
fields are corrupted.
Slide 22
Data Profiling consists of...
427
img
Slide 23
Data Profiling: `Date of Birth'
· Column Name
­ Date of Birth
· Data Type
­ Varchar [255]
· Nullable
­ Yes
· Format
­ d-MMM-yy e.g. 8-Jul-94
The slide shows the output of profiling of column `Date of Birth'. Again column type is incorrect,
column can contain null values, and format of date is
d-MMM-yy.
Profiling Date needs to change our flow of work. For columns before this we just did profiling
and no transformation has been done yet. We can not profile date without transforming it . So we
need to transform date first and then profile. The problem is, when we loaded all records to SQL
server from text files they are loaded as strings (character arrays). While profiling we may want
to get the range of dates (minimum and maximum dates). We can not identify date ranges until or
unless we transform it as date data type.
Slide 24
Handling Dates: Poblem
· While profiling we need to run queries
to identify
­ Inconsistencies in date formats
­ Invalidities like 29th Feb 1975
­ Missing values of dates
­ Violations in business rules like 10 years
student in graduating class
By this time you must have got the idea that data profiling is a powerful method to have an idea
about the quality of data. While profiling data we need to run queries to identify:
·  Inconsistencies in date formats
·  Invalidities like 29th Feb 1975
·  Missing values of dates
·  Violations in business rules like 10 years student in graduating class
428
img
Not only the values out of domain of columns hit the quality of data but certain values in the
domain of column may cause the quality to suffer. Like the values cause the violation of business
rules. While profiling data we must have a set of rules regarding to our busines s like we can not
allow any student less than 16 years to be admitted for BS. While profiling we need to identify
the records violating such rules means if we find a 10 years old student in graduated students then
its the violation of business rule and there must be some error either in date of birth of student or
date of graduation. Similarly any student has date of graduation less than date of birth then again
it is considered as noise while both dates are in valid domain. Such dates are required to be
identified at the time of profiling and needs to be dealt separately in error table.
Slide 25
Data Quality: Format Inconsistencies
Date Error: Identification Query
· Dates with errors in formats can be
identified as
SELECT  *
FROM
Student WHERE
([Date of Birth] NOT LIKE '%_-Jan-__') AND
([Date of Birth] NOT LIKE '%_-Feb-__') AND
([Date of Birth] NOT LIKE '%_-Mar-__') AND
([Date of Birth] NOT LIKE '%_-Apr-__') AND
([Date of Birth] NOT LIKE '%_-May-__') AND
([Date of Birth] NOT LIKE '%_-Jun-__') AND
([Date of Birth] NOT LIKE '%_-Jul-__') AND
Errors in the format of dates can be identified through following query.
SELECT  *
FROM
Student
WHERE ([Date of Birt h] NOT LIKE '%_-Jan-__')
AND
([Date of Birth] NOT LIKE '%_-Feb-__')
AND
([Date of Birth] NOT LIKE '%_-Mar-__')
AND
([Date of Birth] NOT LIKE '%_-Apr-__')
AND
([Date of Birth] NOT LIKE '%_-May-__')
AND
([Date of Birth] NOT LIKE '%_-Jun -__')
AND
([Date of Birth] NOT LIKE '%_-Jul-__')
AND
([Date of Birth] NOT LIKE '%_-Aug-__')
AND
([Date of Birth] NOT LIKE '%_-Sep-__')
AND
([Date of Birth] NOT LIKE '%_-Oct-__')
AND
([Date of Birth] NOT LIKE '%_-Nov-__')
AND
([Date of Birth] NOT LIKE '%_-Dec-__')
AND
([Date of Birth] <> '') AND
([Date of Birth] IS NOT NULL)
The above query can not identify all invalid dates like 30-Feb-2005. But any date with invalid
format, o r having illegal month can be identified through the above query.
429
img
Slide 26
Data Quality: Format Inconsistencies
Date Error: Query output-1
([Date of Birth] NOT LIKE '%_-Aug-__') AND
([Date of Birth] NOT LIKE '%_ -Sep-__') AND
([Date of Birth] NOT LIKE '%_ -Oct-__') AND
([Date of Birth] NOT LIKE '%_ -Nov -__') AND
([Date of Birth] NOT LIKE '%_ -Dec-__') AND
([Date of Birth] <> '') AND
(Invaed f Btries Iare OT NULL)
[Datli oeni th] S N
The output of the above query shows 3 invalid dates. 22-Jal-75,1/27/75,27-Apl -77. These are
invalid dates as their format are not correct.
Whi le transforming dates from string to date format, all dates must be legal. If any illegal date
comes the package will be terminated and all transactions will be rolled back. So through
sequence of different query we try to identify all invalid dates so tha t our package can complete
its execution successfully.
Slide 27
Data Quality: Format Inconsistencies
Date Error: Query output-2
· Invalid dates can be 29  th Feb 1975 or 31st
June etc
SELECT
* FROM
Student
WHERE
([Date of Birth] LIKE '29 -Feb-%') OR
([Date of Birth] LIKE '3_-%')
To identify invalid dates like 29th Feb 1975 or 31st June, we can run the following query
SELECT  * FROM
Student
WHERE  ([Date of Birth] LIKE '2 -Feb-%') OR
9
([Date of Birth] LIKE '3_ -% ' )
There is no finalized methodology to identify errors in dates. It is up to the designer what
sequence of queries he/she generates to identify the errors in dates because the purpose is to
identif y all erroneous dates before running the package. Otherwise, package's execution will be
terminated and all transactions caused due to that package will be rolled back.
430
img
Slide 28
Data Quality: Multiple Inconsistencies
· If any record is selected whose dirty bit is
already set we will not copy it again to
exception table rather we will modify the
comment of Exception table
· But in this query all selected records have
their dirty bits off
· So copy all record to Exception table and
set their dirty bits on in Student table
It has been discussed earlier, how we will work with error table. Now if we meet a record whose
dirty bit is already set then we will not copy the record again but we will modify the comment in
exception table. We will append the name of second erroneous column with the name of column
already present there.
Slide 29
Data Quality: Multiple Inconsistencies
Example
SELECT [Date of Birth], RowID, Dirty
FROM
Student
WHERE
([Date of Birth] LIKE '31-Sep-%') OR
([Date of Birth] LIKE '31-Feb-%') OR
([Date of Birth] LIKE '31-Apr-%') OR
([Date of Birth] LIKE '31-Nov-%')
We can use following query as well. The above query and the query in the previous slide both are
aimed at identify the invalidities in dates. There may be some other set of queries that can be
used wit h the same intention to identify invalid dates. Good queries are those that identify and
output only erroneous records and all erroneous records.
431
img
Slide 30
Data Quality: Multiple Inconsistencies
Handling
· RowID 3695 has already been copied due to
error in Gender, therefore it is not required
to be copied again just modify its comments
as
UPDATE
Exception
SET
Comments = Comments + '+ Date of Birth'
WHERE
([Date of Birth] LIKE '31-Sep-%') OR
([Date of Birth] LIKE '31 -Feb-%') OR
([Date of Birth] LIKE '31 -Apr-%') OR
([Date of Birth] LIKE '31 -Nov-%')
For the records whose dirty bit is already on, we do not ne ed to enter the record again but to
modify the comment only through following query
UPDATE
Exception
SET
Comments = Comments + '+ Date of Birth'
WHERE
([Date of Birth] LIKE '31 -Sep-%') OR
([Date of Birth] LIKE '31 -Feb-%') OR
([Date of Birth] LIKE '31-Apr-%') OR
([Date of Birth] LIKE '31-Nov-%')
Slide 31
Summary: `Date of Birth '
· Total dates found with inconsistent formats
are 3
· Total invalid dates are 4
· Total Missing dates are 9
· Business rules are yet to be validated
Slide shows the summary of data profiling of `Date of Birth'. Business rule can be as follows:
At the time of joining for BS, student must be at least 16 years old
At the time of joining for MS, student must be at least 20 years old
Business rules are yet to be validated because to validate business rule we have to have date of
registration as well.
432
img
Slide 32
Transformation: Lahore `Date of Birth'
· Column Name
­ DoB
· Column Type
­ DateTime
· Nullable
­ Yes
Following transformations are suggested for Date of Birth:
·  Change the name of column as DoB
·  Change column type as DateTime format
·  Allow null values as there are some nulls and empty strings in the orig inal data
Slide 33
Data Profiling: Lahore `Reg Date'
· Column Name
­ Reg Date
· Data Type
­ Varchar [255]
· Nullable
­ Yes
· Format
­ d-MMM-yy e.g. 8-Jul-94
Now we need to profile date of registration column. The same procedure that we have done for
Date of Birth is required to be repeated.
433
img
Slide 34
Transformation: Lahore `Reg Date'
· Column Name
­ Reg Date
· Column Type
­ DateTime
· Nullable
­ Yes
The slide shows t he transformations that are required for Registration date.
Slide 35
Data Profiling: `Business rule validation'
Two business rules are required to be
validated here
· All new registrations are done in month of
August before 28
· Transfer cases can also be dealt in
January
· At the time of registration for BS age must
be greater than 16 years and for MS age
must be greater than 20 years
Now as we have profiled date of registration as well therefore we can validate the following
business rules
·  All new registrations are done in month of August before 28th
·  Transfer cases can also be dealt in January
·  At the time of registration for BS age must be greater than 16 years and for MS age
must be  greater than 20 years
There can be a lot of business rules that are supposed to be validated at this stage. It totally
depends upon the business.
Other than business rules we also validate some logical rules at this stage, like date of registration
can not be less than date of birth. No one could register for a degree before birth. Similarly date
of birth can not be 100 years back and so on. Such rules can be devised keeping in mind the
nature of business.
434
img
Slide 36
Data Profiling: `Business rule validation'
Correct Records
· Total correct records in Student table
is 4958
SELECT
COUNT(*) AS Expr1
FROM
Student
WHERE
(Dirty = 0)
· We will validate business rules only for
correct records
At this stage we have only correct records in Student table. All records with errors in dates have
been moved to exception table. Now we will validate the correct records for given set of business
rules and general logical rules. Records violating any of the rules are required to copy in
exception table.
Slide 37
Data Profiling: `Business rule validation'
Registration Date
· Validate registrations date
SELECT  COUNT([Reg Date]) AS Date
FROM
Student
WHERE
(Dirty = 0) AND ([Reg Date] NOT LIKE '% -Aug-%') AND
([Reg Status] = 'A') OR
(Dirty = 0) AND ([Reg Date] NOT LIKE '% -Jan-%') AND
([Reg Status] = 'T')
(Dirty = 0) AND ([Reg Date] NOT LIKE '% -Aug-%') AND
([Reg Status] = `T')
· 0
· All dates are valid
The s lide shows the query that can violate the following two rules:
·  All new registrations are done in month of August before 28th
·  Transfer cases can also be dealt in January
We can see that the first check ,`dirty bit = 0', is giving us the records having no errors. The
following query can identify all records that have registration date in any month other than
August and January. Those records will be erroneous records. Records having registration date in
January and Registration status is transfer case is legal. Similarly in August both Transfer cases
and new admissions are legal.
SELECT  COUNT([Reg Date]) AS Date
FROM
Student
WHERE
(Dirty = 0) AND ([Reg Date] NOT LIKE '% -Aug-%') AND ([Reg Status] = 'A') OR
435
img
(Dirty = 0) AND ([Reg Date] NOT LIKE '% -Jan-%') AND ([Reg Status] = 'T')
(Dirty = 0) AND ([Reg Date] NOT LIKE '% -Aug-%') AND ([Reg Status] = `T')
Slide38
Data Profiling: `Business rule validation'
Age
· Validate age while registration
SELECT
*
FROM
(SELECT  *
FROM
Student
WHERE
dirty = 0) DERIVEDTBL
WHERE  (CAST([Reg Date] AS DateTime) -
CAST([Date of Birth] AS DateTime ) < 16)
· 3 violating records can be seen on next
slide
The slide shows the query that can be used to validate the following business rule:
At the time of registration for BS age must be greater than 16 years and for MS age must be
greater than 20 years
The query again first identifies the correct records through dirty bit. Then checks which of the
students are less than 16 years at the time of registration. To apply `+' or ` -' operator on dates we
need to cast date strings into date data type by using CAST function. If there comes any invalid
date the query terminates with error.
Slide 39
Data Profiling: `Business rule
validation'
Age: Violating Records
· Copy these records to Exception table
and set dirty bit
· Update comments in Exception table
with value = `BR: Age'
On the slide we can see three records that are violating business rules. In the first record date of
birth is 20-Nov-75 and date of Registration is 4-Aug-74, which is obviously impossible. Here,
one thing is obvious that date of registration is incorrect as we have data starting from 1994 only.
Either date of birth is correct or not, nothing can be said without consulting golden copy (Original
copy).
We will copy these records to exception table and set the dirty bit for these records. For such
cases we will update comment by appending `BR:Age' (Business rule: Age).
436
Table of Contents:
  1. Need of Data Warehousing
  2. Why a DWH, Warehousing
  3. The Basic Concept of Data Warehousing
  4. Classical SDLC and DWH SDLC, CLDS, Online Transaction Processing
  5. Types of Data Warehouses: Financial, Telecommunication, Insurance, Human Resource
  6. Normalization: Anomalies, 1NF, 2NF, INSERT, UPDATE, DELETE
  7. De-Normalization: Balance between Normalization and De-Normalization
  8. DeNormalization Techniques: Splitting Tables, Horizontal splitting, Vertical Splitting, Pre-Joining Tables, Adding Redundant Columns, Derived Attributes
  9. Issues of De-Normalization: Storage, Performance, Maintenance, Ease-of-use
  10. Online Analytical Processing OLAP: DWH and OLAP, OLTP
  11. OLAP Implementations: MOLAP, ROLAP, HOLAP, DOLAP
  12. ROLAP: Relational Database, ROLAP cube, Issues
  13. Dimensional Modeling DM: ER modeling, The Paradox, ER vs. DM,
  14. Process of Dimensional Modeling: Four Step: Choose Business Process, Grain, Facts, Dimensions
  15. Issues of Dimensional Modeling: Additive vs Non-Additive facts, Classification of Aggregation Functions
  16. Extract Transform Load ETL: ETL Cycle, Processing, Data Extraction, Data Transformation
  17. Issues of ETL: Diversity in source systems and platforms
  18. Issues of ETL: legacy data, Web scrapping, data quality, ETL vs ELT
  19. ETL Detail: Data Cleansing: data scrubbing, Dirty Data, Lexical Errors, Irregularities, Integrity Constraint Violation, Duplication
  20. Data Duplication Elimination and BSN Method: Record linkage, Merge, purge, Entity reconciliation, List washing and data cleansing
  21. Introduction to Data Quality Management: Intrinsic, Realistic, Orr’s Laws of Data Quality, TQM
  22. DQM: Quantifying Data Quality: Free-of-error, Completeness, Consistency, Ratios
  23. Total DQM: TDQM in a DWH, Data Quality Management Process
  24. Need for Speed: Parallelism: Scalability, Terminology, Parallelization OLTP Vs DSS
  25. Need for Speed: Hardware Techniques: Data Parallelism Concept
  26. Conventional Indexing Techniques: Concept, Goals, Dense Index, Sparse Index
  27. Special Indexing Techniques: Inverted, Bit map, Cluster, Join indexes
  28. Join Techniques: Nested loop, Sort Merge, Hash based join
  29. Data mining (DM): Knowledge Discovery in Databases KDD
  30. Data Mining: CLASSIFICATION, ESTIMATION, PREDICTION, CLUSTERING,
  31. Data Structures, types of Data Mining, Min-Max Distance, One-way, K-Means Clustering
  32. DWH Lifecycle: Data-Driven, Goal-Driven, User-Driven Methodologies
  33. DWH Implementation: Goal Driven Approach
  34. DWH Implementation: Goal Driven Approach
  35. DWH Life Cycle: Pitfalls, Mistakes, Tips
  36. Course Project
  37. Contents of Project Reports
  38. Case Study: Agri-Data Warehouse
  39. Web Warehousing: Drawbacks of traditional web sear ches, web search, Web traffic record: Log files
  40. Web Warehousing: Issues, Time-contiguous Log Entries, Transient Cookies, SSL, session ID Ping-pong, Persistent Cookies
  41. Data Transfer Service (DTS)
  42. Lab Data Set: Multi -Campus University
  43. Extracting Data Using Wizard
  44. Data Profiling