Lab Data Set: Multi -Campus University

<< Data Transfer Service (DTS)

Extracting Data Using Wizard >>

Lab Lec: 3

Extracting Data Using Wizard

Steps towards single source of truth

�

Identify source systems

�

Figure out the issues associated with each source system

�

Extract data

�

Transform data

�

Load data

�

Quality checks

Data warehouse is a single source of truth. We have to put all departmental data from desperate

sources at one place in some standard form. The task is not trivial. Desperate sources of data have

a lot of inherent issues. High level steps given in the slide gives just an overview of the task. First

of all we have to identify the source systems. It is quite possible that each department uses

different database systems or same organization at different geographical locations uses different

database management systems. To put data into a single source after extracting from such diverse

sources requires powerful tools especially designed to fulfill the purpose. We will use Microsoft

SQL Server which is a user friendly graphical tool and makes such a complex task doable by

some practice.

After identification of source systems, it is necessary to study the issues that must be cons idered

before putting all the data together. Microsoft SQL Server provides a powerful support to perform

Extract, Transform and Load (ETL) data from source systems to Destination system. Finally

certain steps are performed to check and improve quality of d ata.

In this lab exercise we will perform all the above steps in a detailed manner through powerful

support of Microsoft SQL Server.

Example: Student Record System: Diversity

�

Identify source systems at different campuses

�

Source systems are as follows

� Lahore campus uses simple Text Files

� Karachi campus uses MS-Excel workbooks

� Islamabad campus uses MS-Access DB

� Peshawar campus uses simple Text Files

The task starts from analysis of source systems at different campuses. Data for Lahore and

Peshawar campus is kept in simple text files, for Karachi Ms-Excel books are used and for

Islamabad MS-Access is used to keep data. Further more table structures, date formats,

conventions for gender M/F or 1/0, etc. are different from campus to campus.

First we will load the data for each campus in MS-SQL Server as it is, in different databases then

before putting all the pieces together all these issues will be addressed.

389

Example: Student Record System: Issues

�

Figure out the issues related with each source syste m

�

Issues include

� Standards and formats of stored data

�

Number of tables in different source systems

�

Type of columns, their number and ordering in different tables to be combined

Here we need to figure out the issues in source systems. As source data is distributed over

different campuses therefore the issues like difference in date formats, conventions of storing

gender (M/F,0/1,1/0), etc are obvious. Microsoft SQL Server has a good support to resolve these

issues.

Extracting University Data

1. Lets start our practical by loading data for the university

2. We have data from four different campuses

3. Initially we will develop four different databases, one for each campus, and load

corresponding data

4. Then we will transform and standardized each database

5. Finally we will combine all the four databases to single source of truth (DWH)

6. At each step we will run queries to collect demographics

For loading data for the university, it is required to load the data for four campuses, separately

and as it is, into the MS-SQL Server. Once all data is loaded to SQL Server then the tasks of

transformation and standardization would be started. First we will transform the database of each

of the campuses individually. Then we will standardize the databases of four campuses

sepa rately. Finally, the data from four different campuses will be put together.

Extracting Data Using Wizards

�

Import and Export Data Wizard provides the easiest method of loading data.

�

The wizard creates package which is a collection of tasks

�

Tasks can be as follows:

� Establish connection through source / destination systems

� Creates similar table in SQL Server

� Extracts data from text files

� Apply very limited basic transformations if required

� Loads data into SQL Server table

After addressing the issues w decide to select a suitable tool in SQL Server to resolve these

issues. At this stage we are not performing transformations rather we are just copying data from

source to destination. For this purpose the easiest method is the use of wizard. Wizard would

create package for us including all required tasks as:

�

Establishes connection through source / destination systems

390

�

Creates similar table in SQL Server

�

Extracts data from text files

�

Applies very limited basic transformations, if required

�

Loads data into SQL Server table

Extracting Data for Lahore Campus

�

First of all load data for the Lahore campus

Connect to source Text files

Connect to Destination SQL Server

Create new database `Lahore_Campus'

Create two tables Student & Registration

Load data from th text files containing student information into Student table

Load data from the text files containing registration records into Registration table

�

Import/Export Wizard is sufficient to perform all above mentioned tasks easily

Loading data for Lahore campus includes following tasks:

1. Connect to source Text files

Since there are many text files for Lahore campus, we need to load those text files separately.

First of all, select the file that is to be loaded first.

2. Connect to Destination SQL Server

In this case our source system is a text file. For transformation and standardization we will load

all data as it is from source file to the SQL server and then through powerful tools of SQL Server,

we will perform these intended task.

3. Create new database `Lahore_Campus'

To load data for four campuses we will develop four separate databases. So, to load data for

Lahore campus we will create a new data base named `Lahore_Campus'.

4. Create two tables Student & Registration

All files containing student information will be loaded in one table Student and all other files

containing registration information will be loaded in other table Registration. After this step we

will have two populated tables only.

5. Load data from the text files containing student informati on into Student table.

6. Load data from the text files containing registration records into Registration table

Import/Export Wizard is sufficient to perform all above mentioned tasks easily. So we will use

the wizard as it can provide us good functionality in this scenario.

391

Seven Steps to Extract Data Using Wizard

1. Launch the Wizard

2. Choose a Data Source

3. Choose a Database

� Specification of file format incase of Text files

4. Specify the Destination

5. Choose Destination Database

� Selection of existing database or creation of a new database

6. Select a table

� Selection of existing table or creation of a new table

7. Finalizing and Scheduling the package

The slide states seven simple steps to create a package for data loading through wizard. Lets

discuss each o f the steps in detail.

Step1: Launch the wizard(1)

�

Two methods to launch the wizard

� Start > Programs > Microsoft SQL Server > Import & Export Data

� Start > Programs > Microsoft SQL Server > Enterprise Manager

1) On console root drop Data Transformation Service node

2) Tools > Data Transformation Service > Import/Export data

These are two different methods to launch the wizard. We can use either.

Step1: Launch the wizard(2)

The slide shows the main screen of SQL Server enterprise ma nager. In the left pane we have

Console root. We can see Data Transformation services highlighted. Expand the node mentioning

Data Transformation Services and then press Tools in the menu bar. This will lead you to launch

the wizard to load data.

392

Step2: Choose a Data Source(1)

After launching the wizard, first of all we will be welcomed by the wizard through a welcome

screen having a button `NEXT'. On pressing `NEXT' we will see the above window. This

window is basically allows to perform step 2 of creating a package through wizard. In this

window we can see a drop down menu `Data Source'. This menu shows all possible connections

that are available through SQL Server. Connection will be selected according to sou rce system.

In our case, source system is a simple text file. Therefore select the last option text file from this

drop down menu.

Step2: Choose a Data Source(2)

�

Data source for Lahore is a Text file

�

After specifying data source, Browse first file to be loaded from Lahore data

�

"Lhr_Student_94.txt", is a text file that contains students data of Lahore campus

�

First of all browse this file to load

After selecting text file as a source system, we will see a new option on the lower part of same

window. The option is to browse the text file considered as a source system. To load data for

Lahore campus first of all select the file `Lhr_Student_94.txt'. Browse option allows us to have

an Open File dialog box, which will let us to locate file through navigatio n. We will locate the

directory in which our source data is placed, select the file `Lhr_Student -94' from Lahore folder.

Then press next to move to step 3 of wizard.

Step3: Specify File Format(1)

�

After specifying file, Wizard asks for file format like

� Columns are fixed length fields or delimited by some character

�

First row contains header information (column/field) names or directly data

�

What are the text qualifiers

�

What is the file storage format

�

How many rows, if required, should be skipped from start

393

Third step in creation of package through wizard is to specify the format of the file to be loaded.

In this new screen we are supposed to provide following information to the SQL Server:

�

Whether the source file has fixed length columns or delimited length columns.

�

Does the first row contain column name?

�

Do we have text qualifiers in our source text file? If yes then select the specified from drop

down list menu.

�

What is the file storage format?

�

Do we want to skip rows from start of file? If yes, then how many rows are supposed to be

skipped?

�

The answers to all above stated questions are provided through graphical user interface

objects like check box, radio buttons and drop down list menus.

Step3: Specify File Format(2)

Formatted: Font: 7 pt

Here we can see the screen to input answers of the questions asked on previous slide. Each and

every object is self explanatory. First of all you will see the above screen and then on pressing

next we need to specify the column delimiter that is used in our text file.

Step4: Specify the Destination(1)

�

As many options for destinations are available as were for the source (in step 2)

�

By selecting Text files as destination, data extracted from text file would be stored in

another text file

�

Incase of given scenario we want to load data in SQL Server, which is a default option

for destination

Step 4 in creation of a package through wizard is to specify the destination database. This needs

to establish a connection with destination system first. For this purpose select the correct

connection from the drop down menu `Destination' which contains all possible connection

options as were available for source in step 2.

394

In case we select Text file as a destination then the data extracted from one text file will be loaded

in another text file. As we have planned to load data in SQL Server, therefore select SQL server

connection, which is by default selected.

Step4: Specify the Destination

Choose Database

Formatted: Font: 7 pt

On selecting SQL Server as destination we can see options appea red on the same window. From

here we are just required to select the destination database in destination system. This can be done

through a drop down list menu titled as Database. This drop down list menu is showing all

databases available to this user in SQL server. <new> option allows us to create a new database.

We will select new as we want to create separate database initially for each campus.

Step5a: Creating a New Database

On selecting <new> option we can see this Create database pop up dialog box. This dialog box is

asking for the name of database and the space needed for that database. Write name as

`Lahore_Campus' and then press OK. Then press NEXT

395

Step6: Creating a New Table

�

Source contains the name of input file

�

Destination is a new table with same structure as that of source text file

�

By default, name of destination table is same as that of text file i.e. "Lhr_Student_94",

Rename it as "Students"

After select ing the destination database from existing databases or new database, we press next.

This time we will see the above window on the screen. This window contains 3 columns source,

destination and transform. Source contains the name of input file. Destination shows the name of

the destination table in destination database which was selected in previous step. The wizard

checks the names of all available tables in destination database. Select the table for loading whose

name is exactly the same as of source table/ view/ text file. If none of the table has matching

name then wizard generates code to create new table with exactly the same name and same

number of columns as that of source table/view/text file. Wizard generates table with all columns

having same data type i.e. VARCHAR(255)

In this case our destination database is empty, as it has not yet physically created. Therefore,

wizard selects the name for new table `Lhr_Student_94' as it is the name of source text file. To

rename this text file we can double click the destination highlighted row. As shown in the figure.

Step6: Creating a New Table: Transform

�

Name and order of columns in new table can be seen through "Transform"

�

Names, data types and order of columns in new table can b changed through

"Transform"

Transform column is used to apply transformations that are available through wizard like

changing the data types of the columns, order of the columns and so on. If we click the transform

button, we can see the mapping between source and destination columns.

396

Step6: Creating a New Table

Column Mappings

On clicking transform we can view the dialog box showing mapping between source and

destination columns. The third column is showing data type of destination column. By default all

types are selected as varchar of size 255. By default Nullable option is checked that means the

corresponding columns can contain null values as well. A pointer variable varchar to hold

character array of length 255 requires a lot of memory to be consumed especially when we are

dealing with a huge input data. So we should change the type according to required lengths, for

example, for Gender one character is enough as gender may be M/F or 0/1.

Step7: Scheduling the Pac kage(1)

�

In the preceding six steps, a package has developed including tasks of establishing

connections, extraction and loading of data

�

Whenever the package would run all the tasks would be executed

After finalizing the mapping we can press next for f inalizing the package. On saving the package

all steps would be written in a script file. Whenever the script would be run all steps would be

executed and tasks would be performed.

Step7: Scheduling the Package(2)

397

The above dialog box provides following options:

1. Run the package immediately

2. Schedule DTS package to run periodically

3. Use of replication to publish destination data ( we are not concerned with this option currently)

4. Save package

For saving package we need to select the destination location that can be one of following:

a. SQL Server

b. SQL Server Meta Data Services

c. Structured storage file

d. Visual Basic file

Now, we have successfully completed the package and it is ready to run. It can be executed

manually or automatically by scheduled option.

Execution of a package

Connection with source (Text file) is established

Connection with destination (MS-SQL Server) is established

New Database at destination is created

New table is created

Data is extracted from source

Data is loaded to destination

When this package will be executed either automatically due to scheduled option enabled or

manually following tasks will be performed:

� Connection with source (Text file) is established

� Connection with destination (Ms-SQL Server) is established

� New Database at destination is created

� New table is created

� Data is extracted from source

� Data is loaded to destination

Execution can be completed successfully or it may be stopped and roll backed due to some error.

In case of successful completion of execution all the transactions will be committed to the

398

database otherwise, if some error occurs, execution will be terminated abnormally and all

transactions will be rolled back. In second case when we will access the database we will find it

in the state that was before the execution of package.

Verification of Results(1)

�

Results can be verified by view resultant table and its rows

�

New Database "Lahore_Campus" can be accessed through SQL Server Enterprise

Manager

�

Expand the tree on the left pane

�

Local > Databases > Lahore_Campus > Student

After successful completion of the package we can verify the results by viewing the destination

table and its rows. To access new database `Lahore_Campus", we will use SQL Server Enterprise

Manager. In SQL Server Enterprise Manager we will expand the console tree in the left pane and

drop the databases, there we can find `Lahore_Campus'. Double click the node tables and locate

Student table in right pane window.

Verification of Results(2)

Right click "Student"

Open Table > Return all rows

Right click student table and select option `Return all rows' in sub menu open Table. This will

show you all the rows loaded in the destination table.

Adding More Records to Table

�

The Loaded file contains the data for batch 1994 only

�

Data for remaining ten batches is stored in ten separate text files

�

In SQL Server we will be having only one student table for all batches in a campus, so

data for remaining table is required to be added in same table

By this step destination table contains the data for one batch only, as the source file was for the

1994 only. Data for remaining ten batches is stored in separate text files. We are required to add

the data for remaining batches in the same table. Following slides guide you for loading data in

the same table.

399

Adding More Records to Table

�

Repeat first five steps as it was done while loading previous table

�

In step six, Drop down destination menu and select the table in which you want to append

the records

To load data in the same table, we are required to repeat first five steps as it was done before.

1. Launch the Wizard

2. Choose a Data Source

3. Choose a Database

Specification of file format incase of Text files

1. Specify the Destination

2. Choose Destination Database

Selection of existing database "Lahore_Campus"

The difference comes in 6th step when we select destination table. To choose destination table

drop the menu in Destination column and locate the name of required table that is "Student". That

is all we are required to add records in the same table.

Load All BS &MS Student Records For Lahore

�

Repeat the same procedure for remaining nine files for the batches 1996 to 2004

�

Then load "Course Registration Details" data in a new table "Registration"

�

We have eleven files for Registration data, one for each batch

�

Load all files in table, "Registration"

400

We are required to repeat the same procedure for remaining nine files for the batches 1996 to

2004. After loading all text files containing student data of Lahore campus, load "Course

Registration Details" data in a new table "Registration". We can find eleven files for registration

data, one for each batch. Load all course registration details files in table, "Registration"

Loading "Course Registration" Records

�

Load "Course Registration" records, in the same way

�

We have 11 BS + 4 MS text files for Registration data, one for each batch

�

Load all files in table, "Registr tion"

�

Registration records already contain a column indicating degree (BS/MS)

Similarly load all text files that contains course registration details records. We have 11 text files

for BS registration detail records and 4 text files for MS registration detail records. Create new

table registration and load all files in new table. After loading all course registration records for

BS and MS we can see that each record would still be identified uniquely as registration table

contains a column indicating t he degree of a student either BS or MS.

Verification: "Course Registration" Records

�

Right click Registration table through enterprise manager, to get the above menu

In the same way as we did for Student table we may verify the results of package targeting

Registration records load.

Verification: "Course Registration" Records

401

The above slide shows the output of query " Return all rows" from Registration table.

Demographics

�

Finally all the data for Lahore Campus has been loaded

�

Now we can collect demographics on this data through queries

By this time we have two tables in SQL Server, Students & Registration that contain all students

and registration records for Lahore Campus respectively. Before starting transformation and

quality check of data, it is required to collect demographics so that we can chalk out the way of

transformation.

Total students & BS Students

�

Total Number of Students

Select COUNT(*) AS Expr1

From Student

�

5,200

402

�

Total BS students

SELECT COUNT(*) AS Expr1

FROM

Student

WHERE ([Last Degree] IN ('F.Sc.', 'FSc', 'A level', 'A -Level', 'HSSC'))

� 4,400

Total number of students

This is the simplest query that counts the total number of students registered in Islamabad

campus. For each campus individually, this query provides correct results but for consolidated

data for all campuses this type of query will count those students twice who admitted in one

campus and then transferred to another campuses as their names would be present in the

databases of both campuses. While running queries for consolidated data we must take care of

such issues.

Total BS students

Repeating IDs for BS and MS students make such queries complicated which involves the

separation of BS and MS students. To count total number of BS students we must separate them

from MS students. In Student table we do not have any such direct information that can filter BS

students from total students. To meet such requirements we devised a methodology of

considering degree information. No doubt this is not a perfect way of filtering as the quality of

result can suffer a lot due to the presence of outliers. However, in real life when we have to deal

with legacy systems we need to face hundreds of such complicated issues due to bad designs and

limitations of legacy systems. In such scenarios we need to devise solutions intelligently through

indirect ways.

SIDs vs. Students

�

Total unique SIDs

SELECT COUNT(DISTINCT SID) AS Expr1

FROM

Student

�

4,400

�

SIDs are exactly equal to the BS students

Total unique SIDs

In this query we counted the distinct student IDs from student table. The answer is exactly equal

to the total number of BS students because the student IDs are unique among BS students only

and same IDs are repeating for BS students.

Unique identification of each student

Unique identification of each student is possible through combination of degree only. In one

campus there can be only one such student who has SID=1 and he is enrolled in BS degree

program. Similarly there can be only one student is possible who has again SID=1 and is enrolled

for MS degree program. Each SID can be repeated at most twice, one time for BS student and

other time for MS student.

Repeating Student IDs for BS and MS

403

�

Students IDs are repeating for MS students

�

After loading records of MS students in the same table with BS students, student ID is no

more useful to identify each record uniquely

�

Now we need some more information to be used with IDs to identify each record

uniquely

After loading all files for BS and MS, you will find an interesting problem. As the university was

managing the records of BS and MS separately, Student ID was used as Primary key to identify

students uniquely. Student ID is just an auto-increment sort of number which starts from zero for

both BS and MS students. In warehouse environment when we have combined both BS and MS

students, Student ID no more uniquely identifies each student in warehouse.

Solution of Repeating IDs

�

Problem of repeating IDs can be resolved through the use of `Last Degree' column with

�

SID + [Last Degree] -> unique record

�

1) SID = `100', [Last Degree] = `F.Sc.' 2) SID = `100', [Last Degree] = `M.Sc.'

�

1) is a BS student & 2) is an MS student

� What can be outliers here?

As ID is no more useful we need to add some additional information with ID to identify each

student uniquely. If we look at table structure of student table, we can find a column Last degree

that can be used to distinguish BS students from that of MS students. But there is also a little

chance of outliers like the students who are doing MS in computer science after MS in physics

etc. These outliers, if exists, will be handled separately.

Solution of Repeating IDs (Cont.)

�

Outlier can be a BS student who has already sixteen years education background (say) in

Mathematics and he is again registered for BS computer science

�

Such a student can have [last degree] = `M.Sc.'

At this time we will use information like if student ID is 1 and last degree is `F.Sc.' he is an

undergraduate student, and if student ID is 1 and last degree is BS then he is a graduate student

who is enrolled in MS.

Male Students

�

Total Number of Male Students

SELECT COUNT(*) AS Expr1

FROM

Student

WHERE (Gender = '0')

�

3,466

404

To find out the total number of male students we counted all records where gender = `0'. This

gives 3,466. We can not say any thing about the quality of this result. The quality can be

discussed after data profiling but not at this stage. There may be a lot of errors in data, we may

have some other male students for whom gender is missing here. We may have some records with

noise like '01' or '10' which out of domain of the column. But all such issues can be identified

first in data profiling only not before that.

Female BS students in Telecom

�

Total Number of Female students in BS Telecom

SELECT COUNT(DISTINCT r.SID) AS Expr1

FROM Registration r INNER JOIN

Student s ON r.SID = s.SID AND

s.[Last Degree] IN ('F.Sc.', 'FSc',

'HSSC', 'A -Level', 'A level') AND

r.Discipline = 'TC' AND s.Gender = '1`

�

365

This query requires access of both student and registration table. Gender of student can be found

from student table where as Discipline can be found from registration table only. To find the

answer of such a query we need to have inner join of both tables.

Extracting Data for Karachi Campus

�

Now load data for Karachi Campus

Connect to source MS-Excel

Connect to Destination SQL Server

Create new database `Karachi_Campus'

Create two tables Student & Registration

Load data from the Excel worksheet containing student information into Student table

Load data from the Excel worksheets containing reg istration records into Registration

table

�

Import/Export Wizard is sufficient to perform all above mentioned tasks easily

By this time we have loaded data and collected demographics for Lahore campus only. Now we

are required to load data fro Karachi ca mpus. For Karachi campus we need to load data from

Excel files. Main steps of loading are as follows:

1. Connect to source MS-Excel

2. Connect to Destination SQL Server

3. Create new database `Karachi_Campus'

4. Create two tables Student & Registration

5. Load data from the Excel worksheet containing student information into Student table

6. Load data from the Excel worksheets containing registration records into Registration

table

405

Again Import/Export Wizard is sufficient to perform all above mentioned tasks easily

Student data for Karachi

�

Step1: Launch the wizard

�

Step2a : Choose a data source

Step 1 is same as it was for the text files. In step two we need to select Microsoft Excel 97-2000

as data source.

Student Data for Karachi (Cont.)

�

Step2b: Browse Student Excel-Worksheet

�

STUDENT.xls for all BS/MS student records

�

Reg_BS_KHR.xls for BS Registration records

�

Reg_MS_KHR.xls for MS Registration records

After selection of MS-Excel data source, we are required to locate the data file that contains

extension ".xls". To load student data we need to locate STUDENT.xls data file as it contains

data for both BS and MS students.

When we will load registration data we will require to load Reg_BS_KHR.xls file for BS students

registration details and Reg_MS_KHR.xls for MS students registration records.

Student Data for Karachi (Cont.)

406

�

Step 3: Specify Destination

�

Step 4: Choose/Create Database

� Karachi_Campus

�

Step 5 a: Table/View or query

In step3 we select Microsoft SQL Server as destination and in step4 we create new database

Karachi_Campus in the same way that we adopted for Lahore_Campus.

As some particular sort of queries can be run in MS Excel therefore wizard asks us whether we

want to extract data from source through query or we want to c opy complete table/view. As we

do not want to filter data through query therefore we will select the other option that is copy

complete table/view.

Student Data for Karachi (Cont.)

�

Step 5b:

�

Choose BS to copy complete worksheet of BS students

In this screen we can see a lot of worksheets, where as in our Student_KHR.xls file there are only

two worksheets, one for BS and one for MS. You can see in the dialog box that the name of

worksheet BS is followed by `$'. To load complete worksheet BS select the first option `BS$'.

Following options show logical divisions with in a worksheet. Like in this case we can see an

option `BS$St_Kch_94', it means that by selecting this option only those students will be copied

who belong to the batch 1994. Similarly, by selecting the option `BS$St_Kch_100' only students

of batch 2000 will be copied but if we select the option BS$ then all students in worksheet BS

irrespective of their batch will be copied.

Student Data for Karachi (Cont.)

407

�

Step 5b (Cont.) :

�

In step 5b you can see two columns Source and Destination

�

If we want to copy all records from BS worksheet we need to check `BS$'

�

If we want to copy only those records from BS worksheet that belongs to year

2000 we need to choose BS$St_Kch_100

�

Similarly BS$St_Kch_101 belongs to records of year 2001 and so on

In step 5b we can see two columns Source and Destination. If we want to copy all records from

BS worksheet we need to check `BS$'. If we want to copy only those records from BS worksheet

that belongs to year 2000 we need to choose BS$St_Kch_100. Similarly BS$St_Kch_101

belongs to records of year 2001 and so on

Student Data for Karachi (Cont.)

�

Step 5b (Cont.) :

�

For Ms-Excel, it is a convention to use `$' after worksheet name like `BS$' /

`MS$'

�

If there are any logical divisions of records within an Excel worksheet, it is

written after `$' sign like `BS$St_Kch_94'

�

In the given dataset of Karachi BS Excel worksheet contains data records for

eleven years (9404 )and MS Excel worksheet contains data records for 4 years

(01-04)

�

Records with in an Excel worksheet are logically divided on annual basis hence

we can see options like `BS$ST_Kch_94'

For MS-Excel, it is a convention to use `$' after worksheet name like `BS$' / `MS$'. If there are

any logical divisions of records within an Excel worksheet, it is written after `$' sign like

`BS$St_Kch_94'. In the given dataset of Karachi BS Excel worksheet contains data records for

eleven years (9404 )and MS Excel worksheet contains data records for 4 years (01-04). Records

with in an Excel worksheet are logically divided on annual basis hence we can see options like

`BS$ST_Kch_94'.

Student Data for Karachi (Cont.)

�

Step 5b (Cont.) :

�

As we want to copy all records from BS worksheet and all records from MS

408

worksheet as well, therefore, we will choose two options from sources BS$ and

MS$

�

As Records from both worksheets are required to be copied into a single table

Student, therefore, rename destinations against both sources as "Students"

Loading data from MS-Excel workbook provides us facility of loading data in all worksheet

through single package. On the same dialog box, scroll down and find option MS$ under the

column Source. By this way you can load other worksheet MS in the same table. Locate t he row

having check box against MS$ and rename the destination as `Student'. So that all data from both

worksheets load in the same table Karachi.

Student Data for Karachi (Cont.)

So, finally two options should be checked under the column Source, one is BS$ and the other one

is MS$. Against both options the table name should be the same i.e. Student, because our

destination is same. We can use third column transform, by pressing button against the selected

row. This option provides us the same mapping view that we have seen earlier, while loading text

file.

Student Data for Karachi (Cont.)

�

Step 6:

�

Run the package immediately

�

Status after package completion is as follows:

After this, run the package. After successful execution we can see the above window. We can see

that there were four steps in this case:

1) Create table student

2) Load data of Karachi BS students

409

3) Create table Student

4) Load data of Karachi MS students

As we checked two options therefore package will try to create the two destination tables. But

both tables have same names therefore SQL Server database will not allow to create two tables

with the same name `Student'. Hence out of four, Task three will be terminated with an error

`Student table already exists'. As no new destination table is created for MS student therefore

finally all records of MS worksheet will be loaded in the same table `Student', due to same

destination name. Same information can be seen in above dialog, but these errors are not reported

in the order in which they occur. Therefore we can see first task failed where as it is third in

reality.

Student Data for Karachi (Cont.)

�

Step 6 (Cont.) :

�

It can be seen that a task listed first in previous figure is failed

�

Choosing Source `BS$' and selecting destination table `Student' means create

table student and copy all records from destination worksheet to student table

�

Similarly Choosing source `MS$' and selecting destination table `Student' means

exactly the same

�

Hence package tries to create table student twice, once for BS and then for MS

It can be seen that a task listed first in previous run failed. Choosing Source `BS$' and selecting

destination table `Student' means create table student and copy all records from destination

worksheet to student table. Similarly Choosing source `MS$' and selecting destination table

`Student' means exactly the same. Hence package tries to create table student twice, once for BS

and then for MS

Student Data for Karachi (Cont.)

�

Step 6 (Cont.) :

�

But second time SQL does not allow to create the table with the same name

�

Therefore a task to create table again fails

�

As there exists only one table student therefore all data for BS and MS is copied

to the same table

�

Rest of the three tasks are successful

� Create table student

� Copy BS records worksheet

� Copy MS records worksheet

But second time SQL does not allow to create the table with the same name. Therefore a task to

create table again fails. As there exists only one table student therefore all data for BS and MS is

copied to the same table. Rest of the three tasks are successful:

1. Create table student,

2. Copy BS records worksheet &

3. Copy MS records worksheet

However, their order of reporting is not the same in which they actually occurred.

410

Registration Data for Karachi

�

Similarly load Registration data for Karachi Campus

�

There are two Excel workbooks for Registration data of Karachi

� Reg_BS_KHR

� Reg_MS_KHR

�

Reg_BS_KHR contains six worksheets each containing registration records of two

batches

�

Reg_MS_KHR contains only single worksheet contianing all records of 4 batches

Now we are required to load registration detailed records for Karachi campus. There are two MS-

Excel workbooks, one for BS & the other one for MS. Workbook for BS contains 6 worksheets,

each sheet containing records for 2 batches whereas the worksheet for MS contains all records in

one worksheet. We need to load all worksheets complete.

Registration Data for Karachi (Cont.)

�

First of all load all BS records into a new table `Registration' in database

`Karachi_Campus'

�

Then append all Ms records into the same table `Registration'

�

All these tasks can be performed through wizard in the same way as previous data was

loaded

For loading all BS records in a new table `Registration' we can write only one package. Then we

will develop another package that will append all MS records in Registration table. This all can

be done in the similar way as we have done before.

To select complete worksheet to be copied we need to check the row indicating the name of

worksheet followed by immediate `$' without any logical division. Then set destination table as

registration. All registration records either for BS or for MS should be loaded in the same

registration table.

Registration Data for Karachi (Cont.)

The above slide shows that we are selecting only those options that are showing name of

worksheet followed by `$; sign. Against both options we are setting destination as `Registration'.

411

Registration Data for Karachi (Cont.)

�

As the source is again an Excel book t herefore we have sources with same convention i.e.

Worksheet name followed by `$' and logical division of work sheet

�

Check only those sources that corresponds to full worksheet like '94 -95$', such sources

always miss the name of logical division after `$' sign

As the source is again an Excel book therefore we have sources with same convention i.e.

Worksheet name followed by `$' and logical division of work sheet. Check only those sources

that correspond to full worksheet like '94-95$', such sources always miss the name of logical

division after `$' sign.

Checking each of the check boxes, results in two tasks in the final package. One is to create a new

table, always, with the name specified in destination with the same structure as specified by the

source table. As we have set same name for destination table that is Registration, therefore only

one task of new table creation will succeed rest of all Create Table task will fail and data will be

loaded in the same table Registration.

Registration Data for Karachi (Cont.)

�

Status of completion is as follows

This time it can be easily identified that only those tasks are failed that attempted to create a new

table Registration. Rest of all tasks of copy data copied data into same destination table.

Registration Data for Karachi (Cont.)

�

All tasks intended to create a new table with the same name `Registration' are failed

�

All the records are appended to already created single table `Registration'

�

Similarly append the Records for MS students in the same Registration table

412

All tasks intended to create a new table with the same name `Registration' are failed. All the

records are appended to the already created single table `Registration'. Similarly append the

Records for MSstudents in the same Registration table

Registration Data for Karachi (Cont.)

Similarly load MS data from the workbook Reg_MS_KHR. Again set Registration table as

destination table and perform remaining steps as it is. As Registration table already e xists, again

create table task would fail and data will be loaded in the same Registration table, appending new

records.

Demographics (1)

�

Finally all the data for Karachi Campus has been loaded

�

Now we can collect demographics on this data through queries

�

First we try the same queries that we used for Lahore Campus

Finally all the data for Karachi campus has been loaded. For collection of demographics, first of

all we try the same queries that we used for Lahore campus. As the table names are same i.e.

Student and Registration, therefore, those queries should work here also.

Demographics (2)

�

Total Number of Students

Select COUNT(*) AS Expr1

From Student

�

8,200

�

Total BS students

SELECT

COUNT(*) AS Expr1

FROM

Student

WHERE

([Last Degree] IN ('F.Sc.', 'FSc', 'A level', 'A -Level', 'HSSC'))

413

�

Error, [Last Degree] invalid column

Total Number of Students

Select COUNT(*) AS Expr1 From Student;

This query returns 8,200, which is correct. This is exactly the same query that we have already

run for Lahore campus. Due to the standardization of table names we get this benefit.

Total BS students

SELECT COUNT(*) AS Expr1

FROM

Student

WHERE ([Last Degree] IN ('F.Sc.', 'FSc', 'A level', 'A -Level', 'HSSC'))

This query returns Error, [Last Degree] invalid column. Because in Karachi campus student

table, there is no column name as [Last Degree], but same information is stored in qualification

column.

Demographics (3)

�

At Karachi campus, correct column is `Qualification' instead of [Last Degree], similarly

names of others columns are also different from those of Lahore, therefore, same queries

are not applicable here

�

Correct Query to count BS students is

SELECT COUNT(*) AS Expr1

FROM

Student

WHERE (Qualification IN ('F.Sc.', 'FSc', 'A level', 'A -Level', 'HSSC'))

�

6,600

This is one of the characteristic of non standardized and heterogeneous data that same

information is stored under different column name. Therefore, we need to standardize all column

names and data-types so that data from different databases can be put together. Now, to extract

same information for Karachi campus we need to correct the above query as

SELECT COUNT(*) AS Expr1

FROM

Student

WHERE (Qualification IN ('F.Sc.', 'FSc', 'A level', 'A-Level', 'HSSC'))

Now there is no problem and query gives correct results (6,600) for Karachi campus. There may

be some outliers but that can not be identified at this stage.

414

Demographics (4)

�

To find total number of male students we use same query that we used for Lahore except

Column name `Gender' which is `M/F' here, so query is

SELECT COUNT(*) AS Expr1

FROM

Student

WHERE ([M/F] = '0')

�

0, Incorrect answer

�

Answer to the query is zero i.e. no male student in Karachi campus which is definitely

incorrect

Now we try to find the number of male students at Karachi campus. We submit the same SQL

query that we run for Lahore campus. The answer turns out to be zero. There must be some thing

incorrect in this query otherwise it can not be zero. To identify the error we look at the data and

can easily point out that the error is due to the inconsistent gender storage conventions. At Lahore

campus, 1 is a convention to store female and 0 is a convention to store male whereas at Karachi

campus M for male and F for female is used. To put all data in single source we need to

standardize the data storage conventions.

Demographics (5)

�

If we look in the data for Karachi campus we can find out the reason for the error in

above query

�

For Karachi campus convention to store gender information is M (male) and F (female)

instead of 0 (male) and 1 (female)

�

Correct query is

SELECT COUNT(*) AS Expr1

FROM

Student

WHERE ([M/F] = `M')

�

5,463

Following is the correct query that can be used for Karachi campus to count the total number of

male students at Karachi campus.

SELECT

COUNT(*) AS Expr1

FROM

Student

WHERE

([M/F] = `M')

The answers to be 5,463.

Demographics (6)

�

Total Number of Female students in BS Telecom

415

�

We again need to correct the query which was written for Lahore Campus, and correct

query is

SELECT COUNT(DISTINCT r.St_ID) AS Expr1

FROM Registration r INNER JOIN

Student s ON r.St_ID = s.St_ID

AND s.Qualification IN ('F.Sc.', 'FSc',

`HSSC', 'A-Level', 'A level') AND r.Disp = 'TC'

AND s.[M/F] = 'F'

�

551

Similarly to get the count of total number of female students in BS telecom, we need to correct

the column names and data storage conventions. The correct query is as follows:

SELECT COUNT(DISTINCT r.St_ID) AS Expr1

FROM Registration r INNER JOIN

Student s ON r.St_ID = s.St_ID

AND s.Qualification IN ('F.Sc.', 'FSc',`HSSC', 'A-Level', 'A level') AND

r.Disp = 'TC'

AND s.[M/F] = 'F'

Answer is 551

Slide 1

416

Table of Contents: