ZeePedia

DWH Life Cycle: Pitfalls, Mistakes, Tips

<< DWH Implementation: Goal Driven Approach
Course Project >>
img
Lecture No. 35
DWH Life Cycle: Pitfalls, Mistakes, Tips
In this lecture we will discusses the problems, troubles and mistakes that commonly occur while
building a data warehouse. We will discuss things to do, and also things not to do. We will
discuss ways to avoid common mistakes that may halt data ware housing process.
5 Signs of trouble
1.
Project proceeded for two months and nobody has touched the data.
2.
End users are not involved hands-on from day one throughout the program.
3.
IT team members doing data design (modelers and DBAs) have never used the acc ess
tools.
4.
Summary tables defined before raw atomic data is acquired and base tables have been
built.
5.
Data design finished before participants have experimented with tools and live data.
Signs of trouble
First of all we will discuss "5 signs of trouble". Any of these signs if present, will serve as a key
indicator that the data warehousing project is under threat. The following situations indicate a
project in trouble:
·
The project has proceeded for two months and nobody has even touched the data. Befo re
even embarking on the project, the team should have had a thorough understanding and
look and feel of the data. As I always tell my students, "know your data intimately".
·
The future consumers are not involved hands-on from day one throughout the program.
Working in isolation will result in systems that no one will accept, and consequently none
is going to use, resulting in negative marketing for you and your company. Thus avoid
this at all costs.
·
The team members doing data design (modelers and DBAs) have never used the access
tools. You need experienced campaigners not "green apples" or raw hands. You need
people who know the job, not those who will learn on the job, and in the process sink the
ship.
·
The summary tables are defined before the raw atomic data is acquired and base tables
have been built. A converse process has been followed, which again is a recipe for
disaster.
·
The data design is finished before participants have experimented with the tools and live
data. As we have discussed at length in lecture no. 33, involve the business users from the
very beginning, get user requirement definition, record it and follow it.
11-Possible Pitfalls
§
1. Weak business sponsor: Getting stuck by office politics, need CXO on your side.
§
2. Not having multi ple servers : Penny wise pound Foolish. (i) Both going down (ii)
Performance degradation
299
img
§
3. Modeling without domain expert:
§
4. Not enough time for ETL: Giving too little time or over complicating by including
everything conceivable. Users will forgive:
§  Less Formatting, slow system, few features, few reports BUT NOT incorrect
results
11-Possible pitfalls in DWH Life Cycle & Development
Many early data warehousing projects failed, having fallen into one or more of the traps we will
discuss. These pitfalls are still difficult to avoid, unless those steering the project are able to
understand and anticipate the associated risks.
1. Weak business sponsor
This phase often turns out to be the trickiest phase of the data warehousing implementation and is
also the Part-II(a) of your semester project. Because data warehousing by definition includes data
from multiple sources spanning many different departments within the enterprise. Therefore,
there are often political battles that center on the willingness of information sharing. Even though
a successful data warehouse benefits the enterprise, there are occasions where departments may
not feel the same way. As a result of unwillingness of certain groups to release data or to
participate in the data warehousing requirements definition, the data warehouse effort either never
gets off the ground, or could not get started in the right direction defined originally. When this
happens, it would be ideal to have a strong business sponsor. If the sponsor is at the CXO level
(X: Information, Knowledge, Financial etc), he/she can often exert enough influence to make sure
everyone cooperates.
2. Not having multiple servers
This is a classical example of penny wise and pound foolish. To save capital, often data
warehousing teams will decide to use only a single database and a single server for the different
environments i.e. development and production. Environment separation is achieved by either a
directory structure or setting up distinct instances of the database. This is awkward for the
following reasons:
·
Sometimes it is possible that the server needs to be rebooted for the development
environment. Having a separate development environment will prevent the production
environment from being effected by this.
·
There may be interference while having different database environments on a single
server. For example, having multiple long queries running on the development server
could affect the performance on the production server, as both are same.
3. Modeling without domain expert
It is essential to have a subject -matter expert as part of the data modeling team. This person can
be an outside consultant or can be someone in -house with extensive industry experience. Without
this person, it becomes difficult to get a definitive answer on many of the questions, and the entire
project gets dragged out, as the end users may not always be available.
4. Not enough time for ETL
This is common everywhere, ETL getting the least time, remember data is always dirtier than you
think. There is a tendency to give this particular phase of DWH too little time and other
300
img
resources. This can prove suicidal to the project, as the end users will usually tolerate less
formatting, longer time to run reports, less functionality (slicing and dicing), or fewer delivered
reports; one thing that they will never ever tolerate is wrong information.
A second common problem is that some people unnecessarily make the ETL process
complicated. In ETL design, the primary goal should be to optimize load speed without
sacrificing on quality. This is, however, sometimes not followed. There are cases when the design
goal is to cover all possible future uses and possible scenarios, some of which may be practical,
while others just plain impractical. When this happens, ETL perfo rmance suffers, and often so
does the performance of the entire data warehousing system.
11-Possible Pitfalls (continued)
§
5. Low priority for OLAP Cube Construction: Giving it the lowest priority.
§
6. Fixation with technology: End users impressed by timely information NOT advanced
infra-structure.
§
7. Wrong test bench: Required performance NOT on fast production level machines.
§
8. QA people NOT DWH literate: Ensure QA people are educated about DWH.
5. Low priority for OLAP Cube Construction
Make sure your OLAP cube-building or pre -calculation process is optimized and given the right
priority. It is common for the data warehouse to be on the bottom of the nightly batch loads, and
after the loading the DWH, usually there isn't much time left for the OLAP cube to be refreshed.
As a result, it is worthwhile to experiment with the OLAP cube generation paths to ensure
optimal performance.
6. Fixation with technology
Just remember that the end users do not care how complex or how technologically advanced you r
front end (or for that matter back-end) infrastructure is. All they care is that they should receive
their information in a timely manner and in the way they specified.
7. Wrong test bench
Make sure the development environment is very similar to the production environment as much
as possible - Performance enhancements seen on less powerful machines sometimes do not
happen on the larger, production -level machines.
8. QA people NOT DWH literate
As mentioned above, usually the QA team members know little abo ut data warehousing, and
some of them may even resent the need to have to learn another tool or tools. Make sure the QA
team members get enough education about data warehousing so that they can complete the testing
themselves.
11-Possible Pitfalls (continued)
§
9. Uneducated user: Take care and address end -user education needs. Intuition does not
work.
301
img
§
10. Improper documentation: Complete documentation before developers leave.
§
11. Doing incremental enhancements: Definite no-no. Dev => QA =>Production
9. Uneducated user
Take care and address the user education needs. There is nothing more frustrating to spend
several months to develop and QA the data warehousing system, only to have little usage because
the users are not properly trained and educated. Regardless of how intuitive or easy the interface
may be, it is always a good idea to send the users to at least a one -day training course to let them
understand what they can achieve by properly using the data warehouse.
10. Improper documentation
Usually by this time most, if not all, of the developers will have left the project, so it is essential
that proper documentation is left for those who are handling production maintenance. There is
nothing more frustrating than staring at something another person did, yet unable to figure it out
due to the lack of proper documentation.
Another pitfall is that the maintenance phase is usually boring. So, if there is another phase of the
data warehouse planned, start on that as soon as possible.
11. Doing incremental enha ncements
Because a lot of times the changes are simple to make, it is very tempting to just go ahead and
make the change in production. This is a definite no -no. Many unexpected problems will pop up
if this is done. It is very strongly recommend that the typical cycle of Development QA
Production be followed, regardless of how simple the change may seem.
Pitfall: Searching for the "Silver Bullet"
Abandon completely even the desire to find a silver bullet.
§
Wasting time on that one access tool that will handle all needs of all users.
§
Many tools in the market, instead do a match-making.
§
Beware: Vendors use ambiguous, confusing, non-standard nomenclature, which
sometimes serves their own purpose.
§
For any meaningful comparison, evaluate tools by classifying on the basis of
functionality.
Pitfall: Searching for the "Silver Bullet": Pitfalls of Selecting the Tools
Abandon completely even the desire to find a silver bullet.
Many data warehouse project teams waste enormous amounts of time searching in vain for a
silver bullet i.e. a panacea or Amratdhara . They believe their mission is to find the one access
tool that will handle all the needs of all their users. Don't even try. It can't be done. One size does
not fit all.
302
img
There is a wide and eve r-growing diversity of tools for data access, exploration, analysis, and
presentation. The appropriate mission of a data warehouse or decision support team is to
understand these diverse alternatives and properly match the tool to the intended usage.
The names typically applied to tools are ambiguous and confusing. Generally every new name
only obscures the issue more. What is the definitive definition of ad hoc query tool, decision
support system, executive information system or online analytic processing? Some terms, like
EIS, carry the burden of years of overzealous selling and underwhelming results.
To evaluate tools, you need to slot the alternative into categories that allow for meaningful
comparison. Since the traditional terms add little discrimi natory power, where do you turn? The
first part of the answer is to create purely functional categories.
Pitfall: Extremes of Tech. Arch. Design
Common mistake: Attacking the problem from two extremes, neither is correct.
§
Focusing on data warehouse delivery, architecture feels like a distraction and
impediment to progress and often end up rebuilding.
§
Investing years in architecture, forgetting primary purpose is to solve business
problems, not to address any plausible (and not so plausible) technical c hallenge.
Pitfall: Extremes of Tech. Arch. Design Data warehouse teams approach the technical
architecture design process from opposite ends of the spectrum. Some teams are so focused on
data warehouse delivery that the architectures feels like a distraction and impediment to progress
and eventually, these teams often end up rebuilding. At the other extreme, some teams want to
invest two years designing the architecture while forgetting that the primary purpose of a data
warehouse is to solve business p oblems, not address any plausible (and not so plausible)
r
technical challenge. Neither end of the architecture spectrum is healthy; the most appropriate
response lies somewhere in the middle.
Top 10-Common Mistakes to Avoid...
§
Mistake 1 : Not interacting directly with the end users.
§
Mistake 2 : Promising an ambitious data mart as the first deliverable.
§
Mistake 3 : Never freezing the requirements i.e. being an accommodating person.
§
Mistake 4: Working without senior executives in loop, waiting to include them after a
significant success.
§
Mistake 5: Doing a very comprehensive and detailed first analysis to do the DWH right
the very first time.
10-Common Data warehouse mistakes to avoid
303
img
So far you have been told what to do, however now we'll balance t hose recommendations with a
list of what not to do. When building and managing a data warehouse, the common mistakes to
avoid are listed. These mistakes are described as a series of negative caricatures. The goal is for
you to learn from these as George S ntayana said, "Those who cannot remember the past are
a
condemned to repeat it." Let's all agree not to repeat any of these mistakes. Each of the mistakes
will be discussed one by one.
Mistake 1: Not interacting directly with the end users; your job is to be the publisher of the right
data. To achieve your job objectives, you must listen to the business users, who are always right.
Nothing substitutes for direct interaction with the users. Develop the ability to listen.
Mistake 2: Promising an ambitious data mart as the first deliverable; these kinds of data marts
are 'consolidated, second-level marts with serious dependencies on multiple sources of data.
Customer profitability requires all the sources of revenue and all the sources of cost, as well as an
allocation scheme to map costs onto the revenue! For the first deliverable, focus instead on a
single source of data, and do the more ambitious data marts later.
Mistake 3: Never freezing the requirements i.e. being an accommodating person ; You need to
think like a software developer and manage three very visible stages of developing each data
mart: (1) the business requirements gathering stage, where every suggestion is considered
seriously, (2) the implementation stage, where changes can be accommodated~ but must be
negotiated and generally will cause the schedule to slip, and (3) the rollout stage, where project
features are frozen. In the second and third stages, you must avoid insidious scope creep (and stop
being such an accommodating person).
Mistake 4: Working without senior executives in loop, waiting to include them after a significant
success; the senior executives must support the data warehouse effort from the very beginning. If
they don't, your organization likely will not be able to use the data warehouse effectively. Get
their support prior to launching the project.
Mistake 5: Doing a very comprehensive and detailed first analysis to do the DWH right the very
first time; Very few organizations and human beings can develop the perfect comprehensive plan
for a data warehouse upfront. Not only are the data assets of an organization too vast and complex
to describe completely, but also the urgent business drivers will change significantly over the life
of the data warehouse. Start with lightweight data warehouse bus architecture of conformed
dimensions and conformed facts, and then build your data warehouse iteratively. You will keep
altering and building it forever.
Top 10-Common Mistakes to Avoid
§
Mistake 6: Assuming the business users will develop their own "killer application" on
their own.
§
Mistake 7: Training users on the detailed features of the tool using dummy data and
consider it a success.
§
Mistake 8: Isolating the IT support people from the end or business users.
§
Mistake 9: After DWH i s finished, holding a planning and communications meeting with
end users.
§
Mistake 10: Shying away from operational source systems people, assuming they are too
304
img
busy.
Mistake 6: Assuming the business users will develop their own "killer application" o n their own;
Business users are not application developers. They will embrace the data warehouse only if a set
of prebuilt analytic applications is beckoning them.
Mistake 7: Training users on the detailed features of the tool using dummy data and consider it a
success; Delay training until your first data mart is ready to go live on real data. Keep the first
training session short, and focus only on the simple uses of the access tool. Allocate more time to
the data content and analytic applications rather than to the tool. Plan on a permanent series of
beginning training classes and follow -up training classes as well. Take credit for the user
acceptance milestone when your users are still using the data warehouse six months after they
have been trained.
Mistake 8: Isolating the IT support people from the end or business users; Data warehouse
support people should be physically located in the business departments, and while on
assignment, they should spend all their waking hours devoted to the business content of the
departments they serve. Such a relationship engenders trust and credibility with the business
users.
Mistake 9: After DWH is finished, holding a planning and communications meeting with end
users; Newsletters, training sessions, and ongoing pe rsonal support of the business community
should be to gather items for the first rollout of the data warehouse.
Mistake 10: Shying away from operational source systems people, assuming they are too busy;
Certainly, they cannot alter their operational proc edures significantly for passing data to or from
the warehouse. If your organization really understands and values the data warehouse, then the
operational source systems should be effective partners with you in downloading the data needed
and in uploading cleaned data as appropriate.
Top 7-Key Steps for a smooth DWH implementation ...
§
Step-1: Assigning a full-time project manager, or doing it yourself full-time.
§
Step-2: Consider handing-off project management.
§
Step-3: During user interview don't go after answers, let the answers come to you.
7-Tips to a smooth data warehouse implementation
There's a long list of things not to do in managing a data warehouse project, but there are also a
number of positive, proactive steps that can increase your chances of a smooth implementation.
Resolve to be open to new ideas, and seek creative inspiration in radically modifying your tried -
and-true practices to fit this new way of thinking.
1. Assigning a full-time project manager, or doing it yourself full-time
It's common, and often unavoidable, that project managers ride herd on several projects at once.
The economics of IT resources make this a fact of life. When it comes to building a data
warehouse, however, don't even think about it. You are entering a domain unlike anything else
you and your crew have worked on. Everything about it--analysis, design, programming, testing,
305
img
modifications, maintenance--will be new. You, or whoever you assign as project manager, will
have a much better shot at success if allowed t o get into that "new" mode and stay there.
2. Consider handing -off project management
Because the phases of a data -warehouse build are so very different, you do yourself no disservice
by handing off to another project manager when a phase is complete, provided you adhere to Step
One above. Why is it reasonable to do this? First, any phase of a data warehouse implementation
can be exhausting, from a project management standpoint. From the deployment of physical
storage to implementing the Extract -Transform-Load, from designing and developing schemas to
OLAP, the phases of a warehouse build are also markedly different from one another. Each not
only could use a fresh hand, management -wise, but a fresh creative perspective. Handing off
management not only do esn't necessarily hurt, it may even help.
3. During user interviews don't go after answers let the answers come to you.
This is important enough to be an article in itself. You must understand, going into the design
process that your potential warehouse users aren't going to be able to clearly articulate what it is
they want the warehouse to do for them. They're going to have to explore and discover it as they
go--and so will your development team, in conducting interviews. Make your interviews open -
ended, with lots of note -taking, and have your development -team interviewer's focus more on the
consequences of processes than the processes themselves.
Since you're conducting these interviews in order to get some idea of what data to store and how
to efficiently store it, you need to (in partnership with your users) come up with new ways to look
at data, not process it. You're trying to find potential information that can be gleaned, not from
transactional data itself, but from the information behind it: the rise and fall of numbers over time,
etc. Don't chase answers in these interviews. Let answers come to you.
Top 7-Key Steps for a smooth DWH implementation
§
Step-4: Assigning responsibilities to oversee and ensure continuity.
§
Step-5: Accept the "fact" t hat DWH will require many iterations before it is ready.
§
Step-6: Assign significant resources for ETL.
§
Step-7: Be a diplomat NOT a technologist.
4. Assigning responsibilities to oversee and ensure continuity.
These don't need to be full -time assignments, but because the phases of a data warehouse
implementation differ so greatly, you're going to need people out there assuring continuity. There
are three important areas: (i) architecture, (ii) technology, and (iii) business. Assign an
architecture lead to ensure that the generally agreed-upon architecture of the data warehouse,
from the physical level on up, is maintained throughout the project. A technology lead should be
appointed, because your developers and key users will all be using tools they 've never used
before--someone needs to oversee the deployment and consistent use of these tools.
Finally, the business needs that will be met through use of the warehouse must be carefully
observed and documented, to spur continued development. Since the analytics and metrics to be
306
img
derived from the process are developed over time, by users who will not necessarily communicate
well with one another, someone must watch this development, encourage its continuation, and
nurture it into progressing to higher levels.
5. Accept the "fact" that DWH will require many iterations before it is ready.
A data warehouse will never, ever be right the first time. Why? You don't know what you're
really looking for until you see it. Or, to say it more precisely, the ultimate users of the system
won't know what they're really going to use it for until they've used it for awhile. As contrary as
that may seem to all that you've sworn by throughout your career, it really is the way to go:
business intelligence is an infant scie nce, and different for every company.
You'll have to fish around for the right data in the right format, and things will change often. BI
is very "personal," unique to your environment, your market, and your partnerships. What does
this mean? First of all, it means you need to lock your database administrator in a room
somewhere and break the news that the data warehouse data structures are going to change and
change and change, as will the ETL procedures. There is no way around this. Make your peace
with it now, and save both yourself and the DBA a lot of stress.
6. Assign significant resources for ETL
You're going to be stepping in it again and again as you wade through oceans of old data, in old
databases, on old magnetic tape, from remote sources. Much of it will be dirty. Much of it will be
hard to get to. You're going to be doing a lot of this, and you're going to be devising ETL
procedures to seek out and retrieve information like this forevermore. You do yourself and the
project a great service by establishing a method of doing this right the first time. Have your
development people put in the extra time to explore old data thoroughly, characterize "dirty" data
issues realistically, and to design and implement robust extraction and transformation procedures
exhaustively. The ETL portion of a data warehouse can consume as much as 80 percent of your
total project resources! Make sure you spend wisely.
7. Be a diplomat NOT a technologist
The biggest problem you will face during a warehouse implementation will be people, not the
technology or the development. You're going to have senior management complaining about
completion dates and unclear objectives. You're going to have development people protesting
that everything takes too long and why can't they do it the old way? You're going to have users
with outrageously unrealistic expectations, who are used to systems that require mouse-clicking
but not much intellectual investment on their part. And you're going to grow exhausted,
separating out Needs from Wants at all levels. Commit from the outset to work very hard at
communicating the realities, encouraging investment, and cultivating the development of new
skills in your team and your users (and even your bosses).
Most of all, keep smiling. When all is said and done, you'll have a resource in place that will do
magic, and your grief will be long past. Eventually, your smile will be effortless and real.
Conclusions
§
DWH is not simple.
§
DWH is very expensive.
§
DWH is not ONLY about technology.
307
Table of Contents:
  1. Need of Data Warehousing
  2. Why a DWH, Warehousing
  3. The Basic Concept of Data Warehousing
  4. Classical SDLC and DWH SDLC, CLDS, Online Transaction Processing
  5. Types of Data Warehouses: Financial, Telecommunication, Insurance, Human Resource
  6. Normalization: Anomalies, 1NF, 2NF, INSERT, UPDATE, DELETE
  7. De-Normalization: Balance between Normalization and De-Normalization
  8. DeNormalization Techniques: Splitting Tables, Horizontal splitting, Vertical Splitting, Pre-Joining Tables, Adding Redundant Columns, Derived Attributes
  9. Issues of De-Normalization: Storage, Performance, Maintenance, Ease-of-use
  10. Online Analytical Processing OLAP: DWH and OLAP, OLTP
  11. OLAP Implementations: MOLAP, ROLAP, HOLAP, DOLAP
  12. ROLAP: Relational Database, ROLAP cube, Issues
  13. Dimensional Modeling DM: ER modeling, The Paradox, ER vs. DM,
  14. Process of Dimensional Modeling: Four Step: Choose Business Process, Grain, Facts, Dimensions
  15. Issues of Dimensional Modeling: Additive vs Non-Additive facts, Classification of Aggregation Functions
  16. Extract Transform Load ETL: ETL Cycle, Processing, Data Extraction, Data Transformation
  17. Issues of ETL: Diversity in source systems and platforms
  18. Issues of ETL: legacy data, Web scrapping, data quality, ETL vs ELT
  19. ETL Detail: Data Cleansing: data scrubbing, Dirty Data, Lexical Errors, Irregularities, Integrity Constraint Violation, Duplication
  20. Data Duplication Elimination and BSN Method: Record linkage, Merge, purge, Entity reconciliation, List washing and data cleansing
  21. Introduction to Data Quality Management: Intrinsic, Realistic, Orr’s Laws of Data Quality, TQM
  22. DQM: Quantifying Data Quality: Free-of-error, Completeness, Consistency, Ratios
  23. Total DQM: TDQM in a DWH, Data Quality Management Process
  24. Need for Speed: Parallelism: Scalability, Terminology, Parallelization OLTP Vs DSS
  25. Need for Speed: Hardware Techniques: Data Parallelism Concept
  26. Conventional Indexing Techniques: Concept, Goals, Dense Index, Sparse Index
  27. Special Indexing Techniques: Inverted, Bit map, Cluster, Join indexes
  28. Join Techniques: Nested loop, Sort Merge, Hash based join
  29. Data mining (DM): Knowledge Discovery in Databases KDD
  30. Data Mining: CLASSIFICATION, ESTIMATION, PREDICTION, CLUSTERING,
  31. Data Structures, types of Data Mining, Min-Max Distance, One-way, K-Means Clustering
  32. DWH Lifecycle: Data-Driven, Goal-Driven, User-Driven Methodologies
  33. DWH Implementation: Goal Driven Approach
  34. DWH Implementation: Goal Driven Approach
  35. DWH Life Cycle: Pitfalls, Mistakes, Tips
  36. Course Project
  37. Contents of Project Reports
  38. Case Study: Agri-Data Warehouse
  39. Web Warehousing: Drawbacks of traditional web sear ches, web search, Web traffic record: Log files
  40. Web Warehousing: Issues, Time-contiguous Log Entries, Transient Cookies, SSL, session ID Ping-pong, Persistent Cookies
  41. Data Transfer Service (DTS)
  42. Lab Data Set: Multi -Campus University
  43. Extracting Data Using Wizard
  44. Data Profiling