Data Warehousing Keys To Success WP Us
Data Warehousing
The Keys for a Successful Implementation
WHITE PAPER: DATA QUALITY & DATA INTEGRATION
OVERVIEW
Many organizations have successfully implemented data warehouses to analyze the data contained in their multiple
operational systems to compare current and historical values. By doing so, they can better, and more profitably,
manage their business, analyze past efforts, and plan for the future. When properly deployed, data warehouses
benefit the organization by significantly enhancing its decision-making capabilities, thus improving both its
However, the quality of the decisions that are facilitated by a data warehouse is only as good as the quality of the
data contained in the data warehouse – this data must be accurate, consistent, and complete. For example, in order to
determine its top ten customers, an organization must be able to aggregate sales across all of its sales channels
and business units and recognize when the same customer is identified by multiple names, addresses, or customer
numbers. In other words, the data used to determine the top ten customers must be integrated and of high quality.
After all, if the data is incomplete or incorrect, so are the results of any analysis performed on it.
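The top-ten-customers example above can be sketched in a few lines. This is a minimal illustration, not a description of any product: the sales rows, the `master_id` cross-reference, and the `top_customers` helper are all hypothetical, and the sketch assumes a data quality step has already decided which recorded names and numbers belong to the same customer.

```python
from collections import defaultdict

# Hypothetical sales rows from several channels: (customer as recorded, amount).
sales = [
    ("ACME Corp.", 1200.0),
    ("Acme Corporation", 800.0),
    ("CUST-0042", 500.0),   # the same customer, known only by number in another system
    ("Globex", 1500.0),
]

# Hypothetical cross-reference produced by a data quality step:
# recorded identifier -> master customer id.
master_id = {
    "ACME Corp.": "acme",
    "Acme Corporation": "acme",
    "CUST-0042": "acme",
    "Globex": "globex",
}


def top_customers(rows, xref, n=10):
    """Sum sales per master customer and return the n largest totals."""
    totals = defaultdict(float)
    for recorded_name, amount in rows:
        # Records without a mapping are counted under the name as recorded.
        totals[xref.get(recorded_name, recorded_name)] += amount
    return sorted(totals.items(), key=lambda item: item[1], reverse=True)[:n]


ranking = top_customers(sales, master_id)  # identifiers consolidated first
naive = top_customers(sales, {})           # no consolidation at all
```

With the cross-reference applied, the three ACME records roll up into one customer that outranks Globex; without it, the same sales are split across three apparent customers and the ranking changes.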
www.pbinsight.com
However, this is much more complicated than it might first appear, especially since each production system was developed to satisfy a particular operational need. Consequently, each application system was designed with its own data standards and thus was poorly integrated with other systems. This integration is particularly challenging when dealing with legacy systems that were implemented before any real effort was made to establish enterprise data standards or even common data definitions.

Benefit: The acquired functionality included the ability to:
• Execute order placements
• Monitor the books
• Provide decision support
• Analyze transactional data, product performance, sales and earnings statistics, and information on customer experiences.
Variations on a Theme: Data Warehouse, Data Mart, Operational Data Store, EII:

The need to bring consistent data from disparate sources together for analysis purposes is the basic premise behind any data warehouse implementation. Based on this need, various data warehouse architectures and implementation approaches have evolved from the basic concept as originally formulated by Bill Inmon [1]. In his book he stated, “a data warehouse is a subject oriented, integrated, time variant, and nonvolatile collection of data in support of management’s decisions.”

There are now a variety of approaches to data warehousing including enterprise data warehouses, data marts, operational data stores, and enterprise information integration. However, most organizations deploy a hybrid combination with each approach complementing the others. Although they may differ in content, scope, permanency, or update cycle, they all have two characteristics in common: the need to integrate data and the need for this data to be of high quality.

Components Manufacturer
Challenge: Provide access to information on production efficiency, sales activities and logistics and transform it into useful intelligence. Enable users to query data sources without having to rely on IT assistance.
Solution: Implemented a Web-based data warehousing solution, providing the ability to:
• Access and analyze data from anywhere via the Web.

Data Warehouses

From a conceptual perspective, data warehouses store snapshots and aggregations of data collected from a variety of source systems. Data warehouses encompass a variety of subject areas. Each of these source systems could store the same data in different formats, with different editing rules, and different value lists. For example, gender code could be represented in three separate systems as male/female, 0/1, and M/F respectively; dates might be stored in a year/month/day, month/day/year, or day/month/year format. In the United States “03062010” could represent March 6, 2010 while in the United Kingdom it might represent June 3, 2010.

Data warehouses involve a long-term effort and are usually built in an incremental fashion. In addition to adding new subject areas, at each iteration, the breadth of data content of existing subject areas is usually increased as users expand their analysis and their underlying data requirements.

Users and applications can directly use the data warehouse to perform their analysis. Alternatively, a subset of the data warehouse data, often relating to a specific line-of-business and/or a specific functional area, can be exported to another, smaller data warehouse, commonly referred to as a data mart. Besides integrating and cleansing an organization’s data for better analysis, one of the benefits of building a data warehouse is that the effort initially spent to populate it with complete and accurate data content further benefits any data marts that are sourced from the data warehouse.
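The format differences described above for gender codes and dates can be made concrete with a short sketch. Everything named here is an assumption for illustration: the three source system names, which digit stands for which gender, and the `standardize` helper are hypothetical, and a real implementation would also need handling for unknown or invalid values.

```python
from datetime import date, datetime

# Hypothetical per-source value lists mapped to one warehouse code set.
GENDER_MAPS = {
    "system_a": {"male": "M", "female": "F"},
    "system_b": {"0": "M", "1": "F"},   # which digit means which is an assumption
    "system_c": {"m": "M", "f": "F"},
}

# Hypothetical per-source date layouts, written as strptime format strings.
DATE_FORMATS = {
    "system_a": "%Y%m%d",   # year/month/day
    "system_b": "%m%d%Y",   # month/day/year (United States style)
    "system_c": "%d%m%Y",   # day/month/year (United Kingdom style)
}


def standardize(source, gender, raw_date):
    """Return (gender code, date) in the warehouse's single common format."""
    code = GENDER_MAPS[source][gender.strip().lower()]
    parsed = datetime.strptime(raw_date, DATE_FORMATS[source]).date()
    return code, parsed


us_row = standardize("system_b", "0", "03062010")
uk_row = standardize("system_c", "F", "03062010")
```

The same eight digits resolve to two different calendar dates depending on the source, which is why the source's convention has to travel with the data until it has been converted.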
a data warehouse needs to integrate the data of multiple
operational systems and disparate sources and establish
a common format
[Figure. Labels shown: ACCESS, INTERPRET, STANDARDIZE, VALIDATE, MATCH, CONSOLIDATE, ENHANCE, DELIVER. Caption: Users and applications can directly use the data warehouse to perform their analysis.]
upon the contents of the data warehouse. Independent data marts are those that are developed without regard to an overall data warehouse architecture, perhaps at the departmental or line-of-business level, typically for use as a temporary solution.

As the independent data mart cannot rely on an existing data warehouse for its content, implementation will take longer than for a dependent data mart. Regardless, for a data mart operating independently of any other data mart or data warehouse, it is still important that the data within it be complete and accurate. If not, erroneous analysis is likely to occur and invalid conclusions drawn.

Pragmatically, an independent data mart may be the only viable approach when the existing enterprise warehouse is being built incrementally and the data needed by the data mart users is not yet available from the warehouse.

Building a corporate data warehouse on a “subject by subject” approach is certainly a reasonable and proven strategy. Many organizations that have tried to populate their enterprise data warehouses with data for all requested subject areas prior to initial rollout have found that this was akin to trying to “boil the ocean”: the task was simply too overwhelming to be realistically accomplished in anything other than a phased approach.

It is reasonable to assume that an organization’s independent data marts will ultimately be combined. Eventually they will lose their independence as individual data needs are ultimately satisfied through an enterprise data warehouse.

Combining the content requirements of these independent data marts to determine the contents of the enterprise data warehouse will be significantly easier if each
data mart contains high quality, complete data. This “bottom-up” approach of using the requirements of existing independent data marts to then determine the requirements of a data warehouse from which they will be populated has been effective in organizations where several departments first needed to quickly implement their own solutions. These organizations could simply not wait for their “top down” data warehouse to first be built.

A common problem that exists in many organizations is the inability to quickly combine operational data about the same entity, such as a customer or vendor, that exists in multiple systems. A classic example occurred when banking institutions first started adding new service offerings such as investment accounts to their more traditional savings and checking account offerings. Many of these new services were supported by systems that existed independently. When the bank needed to see all of the current financial information it had about a customer, it needed to combine and consolidate data from all of these systems, assuming of course it could identify that a customer whose account information resided in several systems was the same customer. As this need became increasingly important, the operational data store (ODS) came into vogue.

A primary difference between data warehouses and operational data stores is that while a data warehouse frequently contains multiple time-stamped historical data snapshots, with new snapshots being added on a well-defined periodic schedule, an operational data store contains current values that are continually in flux. A data warehouse adds new time-stamped data values and retains the old ones; an operational data store updates existing data values. While the initial load and continual updating of the operational data store are classic examples of data integration, the ability to identify and link different accounts, each captured from a different system, as belonging to the same customer is also a classic example of data quality. This underscores the importance of, and interdependence between, data quality and data integration when solving real-world business problems.

Enterprise Information Integration (EII):

While not necessarily a new concept, the idea of enterprise information integration, or EII, has received much publicity in the past few years. Simply stated, it involves quickly bringing together data from multiple sources for analysis purposes without necessarily storing it in a separate database. Some vendors have even gone so far as to claim that an EII approach can replace a traditional data warehouse or data mart with a “virtual data warehouse” by eliminating the need to extract and store the data into another database. However, the ramifications associated with this approach (e.g., the underlying data changing, or even being purged, between analyses) must not be overlooked.

However, an EII solution complements the other data warehousing variants and can be a valuable resource for those wishing to perform quick, perhaps ad hoc, analysis on current data values residing in operational systems. It can help alleviate the backlog of requests that is a constant struggle for any IT staff.

However, organizations must recognize that the data in these operational systems may not be consistent with each other, that the data quality of each source may vary widely, and that historical values may not be available. This is a risk many users are willing to take for “quick and dirty” analysis when the needed data is not contained in a formal data warehouse or data mart. In fact, many organizations use an EII approach to establish processes and programming logic that enable their users to transform and pull together data from multiple sources for purposes that include desktop analysis. EII is at its best when the quality of the data in the underlying operational systems is high.

EII solutions can also be successfully used to prototype or evaluate additional subject areas for possible inclusion in a data warehouse. Some organizations have initially deployed EII solutions when the data warehouses or data marts did not contain the needed data and later added this data to
their data warehouse content. In order to combine current and historical values, organizations can include an existing data warehouse or data mart as one of their sources.

data integration and data quality are the two key components of a successful data warehouse

DATA WAREHOUSE VARIANTS
PRIMARY USE       Operational & Tactical   Tactical and Strategic   Tactical and Strategic   Operational & Tactical
LEVEL OF DETAIL   Detailed                 Detailed & Summary       Detailed & Summary       Depends on Sources
DATA VOLATILITY   High-values added        Low-values added         Low-values added         Depends on Sources
An organization’s data warehousing architecture can consist of a variety of components that co-exist and complement each other.

providing appropriate aggregations or summary tables, and loading it into the data warehouse environment. This sounds simple enough but there are many complicating factors that must be considered.
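The distinction drawn earlier between an operational data store (existing values are updated) and a data warehouse (new time-stamped values are added and old ones retained) can be shown with a small sketch. The two classes, the customer key, and the balances are hypothetical and stand in for whatever database tables an organization actually uses.

```python
from datetime import date


class OperationalDataStore:
    """Holds only the current value per key; each load overwrites in place."""

    def __init__(self):
        self.current = {}

    def load(self, key, value):
        self.current[key] = value


class Warehouse:
    """Appends a time-stamped snapshot on each load and keeps earlier ones."""

    def __init__(self):
        self.snapshots = []

    def load(self, key, value, as_of):
        self.snapshots.append((as_of, key, value))

    def history(self, key):
        return [(as_of, value) for as_of, k, value in self.snapshots if k == key]


ods, dw = OperationalDataStore(), Warehouse()
month_end_balances = [(date(2010, 1, 31), 100.0), (date(2010, 2, 28), 250.0)]
for as_of, balance in month_end_balances:
    ods.load("cust-1", balance)         # the earlier balance is replaced
    dw.load("cust-1", balance, as_of)   # the earlier balance is kept
```

After both loads the operational data store can answer only "what is the balance now", while the warehouse can also answer "what was it at the end of January".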
[Figure. Column headings: BUSINESS PROCESSES, OPERATIONAL DATA, DATA INTEGRATION/DATA QUALITY, DATA DELIVERY, DATA PRESENTATION. Other labels: Partners, Distributed Enterprise, Enterprise, OLTP, EII, Data Warehouses, OLAP Servers, Reports, Spread Sheets, Graphics. CORE DATA SERVICES: Data Inspection, Data Profiling and Monitoring, Parsing, Identity Resolution, Geocoding, Routing.]
it is desirable to enhance an organization’s operational
data with information from outside sources
Energy Provider
Challenge: Collect and analyze large volumes of data in different formats from both internal and external sources to optimize business processes and make predictive calculations for planning purposes.

Data Transformation

The detailed data residing in the operational systems must frequently be consolidated in order to generate and store, for example, daily sales by product by retail store, rather than storing the individual line-item detail for each cash register transaction.
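As a rough illustration of this kind of consolidation, the sketch below rolls cash register line items up to one row per day, store, and product. The sample line items, field layout, and the `daily_sales` helper are assumptions made for the example; in practice this step is usually expressed in the data integration tool or in SQL.

```python
from collections import defaultdict
from datetime import date

# Hypothetical line items: (sale date, store, product, quantity, unit price).
line_items = [
    (date(2010, 3, 6), "store-1", "widget", 2, 5.00),
    (date(2010, 3, 6), "store-1", "widget", 1, 5.00),
    (date(2010, 3, 6), "store-2", "widget", 4, 5.00),
    (date(2010, 3, 7), "store-1", "gadget", 1, 12.50),
]


def daily_sales(items):
    """Aggregate line items to one summary row per (date, store, product)."""
    totals = defaultdict(lambda: {"units": 0, "revenue": 0.0})
    for sale_date, store, product, quantity, unit_price in items:
        row = totals[(sale_date, store, product)]
        row["units"] += quantity
        row["revenue"] += quantity * unit_price
    return dict(totals)


summary = daily_sales(line_items)
```

Four line items become three summary rows here; at retail volumes the reduction is far larger, at the cost of no longer being able to answer questions about individual transactions.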
Trends in Data Warehousing
Several trends are developing in the data warehouse market, many of which are directly concerned with data integration and data quality. These include:
• EAI and ETL will continue to converge due to the need to update the data warehouse with the most recent transactions
• The use of “active” data warehouses that directly feed analytical results back to operational systems will grow
• Pragmatic hybrid approaches to data warehousing will continue to win out over insistence on architectural purity
• Data quality gains additional recognition as an up-front requirement for both operational and analytical systems efforts, rather than an after-the-fact fix
• EII is recognized as complementary to, not a replacement for, traditional data warehouses and data marts
• Disparate data integration tools will give way to platforms with end-to-end data integration functionality
• Data integration platforms, callable both through direct application programming interfaces and as Web services, will also tap into leading-edge features, such as appliances, open source, and cloud (SaaS) computing

Data Volumes

It is frequently necessary to load very large data volumes into the warehouse in a short amount of time, thereby requiring parallel processing and memory-based processing. While the initial data loads are usually the most voluminous, organizations have a relatively long load window in which to accomplish this task since the initial load is done prior to the data warehouse being opened for production. After the data warehouse is in use, new data content must be loaded on a periodic basis. The load volume can be reduced if change data capture techniques are employed to capture only data that has changed since the prior data load. In some cases, Enterprise Application Integration (EAI) technology, frequently involving message queues, can be used to link enterprise applications to the data warehouse data integration processes in order to capture new data on a near-real-time basis.

Collaborative User/IT Development Efforts

Many data warehouses have been implemented only to quickly discover that the data content was not what the users had in mind. Much finger pointing and general ill will can be avoided if the data integration staff can work collaboratively with the end-user analysts. They should be able to view the results of the data transformation process on real data rather than trying to interpret somewhat abstract data flow diagrams and transformation descriptions. The ability to view live data and the associated transformation processes involved in the data integration process can help avoid nasty surprises when, for example, a field in the source system thought to contain telephone numbers actually contains text data.

Changing Requirements

A successful data warehouse builds user momentum and generates increased user demand that results in a larger user audience and new data requirements. The data integration processes must be able to quickly respond to new data sources, or changes to the underlying file structure of the existing source systems, without compromising existing processes or causing them to be rewritten. Increased user demand often translates into a narrower data warehouse load window, especially if new users are in geographic areas that now require access to the data warehouse at times during which there was previously little or no user demand.
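The change data capture idea mentioned under Data Volumes can be sketched by comparing the prior extract with the current one and keeping only the differences. This is a minimal sketch of one approach (snapshot comparison); other approaches, such as reading database logs or relying on last-modified timestamps, are not shown. The extracts, keys, and the `capture_changes` helper are hypothetical.

```python
def capture_changes(previous, current):
    """Compare two extracts keyed by primary key; return inserts, updates, deletes."""
    inserts = {k: v for k, v in current.items() if k not in previous}
    updates = {k: v for k, v in current.items() if k in previous and previous[k] != v}
    deletes = [k for k in previous if k not in current]
    return inserts, updates, deletes


# Hypothetical customer extracts: primary key -> (name, city).
previous_extract = {1: ("Ann", "Troy"), 2: ("Bob", "Toronto"), 3: ("Cy", "Windsor")}
current_extract = {1: ("Ann", "Troy"), 2: ("Bob", "Sydney"), 4: ("Dee", "Troy")}

inserts, updates, deletes = capture_changes(previous_extract, current_extract)
```

Only the changed rows (one insert, one update, one delete) would be passed to the load step; the unchanged row for key 1 is never touched, which is what shrinks the load volume.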
lack of data integration and poor data quality are the
most common causes of post-implementation data
warehouse failures
Metadata Integration

Most data integration tools store the metadata (or data about data) associated with their sources and targets in a metadata repository that is included with the product. At a minimum, this metadata includes information such as source and target data formats, transformation rules, business processes concerned with data flows from the production systems to the data warehouse (i.e., data lineage), and the formulas for computing the values of any derived data fields. While this metadata is needed by the data integration tool for use in defining and creating appropriate data transformation processes, its value is enhanced when shared with other tools utilized in designing the data warehouse tables and business intelligence tools that access the warehouse data. If the metadata also includes information about what analysis program uses which data element, it can be a valuable source for analyzing the ramifications of any change to the data element (i.e., impact analysis).

Additional Thoughts on Data Quality

Data quality is involved throughout the entire data warehousing environment and is an integral part of the data integration process. Data quality involves ensuring the accuracy, timeliness, completeness, and consistency of the data used by an organization while also making sure that all parties utilizing the data have a common understanding of what the data represents. For example, does sales data include or exclude internal sales and is it measured in units or dollars, or perhaps even Euros? In most data warehousing implementations data quality is applied in at least two phases. The first phase is concerned with ensuring that the source systems themselves contain high quality data while the second phase is concerned with ensuring that the data extracted from these sources can then be combined and loaded into the data warehouse. As mentioned earlier, even if the data residing in each of the sources is already accurate and clean, it is not simply a matter of directly combining the individual sources as the data in each source could exist in a different format and use a different value list or code set. One system might use the alphabetic codes (S, M, D) to represent “single,” “married,” and “divorced” while another might represent them with the numeric codes (1, 2, 3). The data loaded into the warehouse must conform to a single set of values; data cleansing and data transformation technology must work together to ensure that they do. Of course, duplicate occurrences of the same customer or vendor across multiple systems, or even in the same system, with different variations of the same name and/or address, are a well-known example of a data quality issue that was previously discussed.

As more decisions are distributed across the organizational hierarchy and as more information is processed and exposed to end-consumers, there is a growing need for visual representations of many aspects of business transactions, such as type, time, duration, and critically, business transaction locations. Enhancing the value and utility of business information for both operational and analytical applications requires a combination of sound data management practices, such as data governance, data quality assurance, data enhancement, and increasingly, location intelligence capabilities such as geocoding, mapping, and routing.

Corporate mergers and acquisitions, changes in regulatory compliance requirements, increased customer attrition, a noticeable increase in call center activity, system migrations, improvements to customer-facing business processes, introduction of CRM or ERP, or committing to MDM are all examples of business events that depend on organizational adjustments in data management practices in order to succeed.

Each of these events demonstrates a need for instituting solid business processes and the corresponding information management practices to improve the customer experience while simultaneously maintaining economies of scale. As
opposed to blindly throwing technology, perhaps multiple times, at a presumed problem and hoping that something will stick, a more thoughtful approach seeks a return on the technology investment that is achieved when the necessary information management practices are supported with core data services [2] implemented once but deployed multiple times in a consistent way.

The techniques employed in data warehouse development have become ubiquitous, especially as the need for system interoperability has grown. As more people want to reengineer their environments to allow sharing, or even consolidation, of siloed data sets, there is a growing need for the underlying data integration services to provide the bridging ability to seamlessly access data from many different sources and deliver that data to numerous targets. The ability for remote servers to access data from their original sources supports the efficient implementation and deployment of data management best practices that can be implemented using the core data services as illustrated.

Integration services rely on the ability to connect to commonly-used infrastructures and data environments coupled with the parsing and standardization services for transforming extracted data into formats that are suitable for data delivery. And in the modern enterprise, data sharing, federation, and synchronization are rapidly becoming the standard. As more organizations transition to a service-oriented architecture, seamless data integration becomes a fundamental building block for enterprise applications.

Approaches to DQ/DI

The “Do it Yourself” Approach

Many organizations have attempted to access and consolidate their data through in-house programming. After all, how difficult can it be to write a few programs to extract data from computer files? Assuming for a moment that the files are documented (and the documentation up-to-date), the programming team has in many cases succeeded, although usually after the originally estimated completion date. Unfortunately, the initial extract and load is usually the easy part!

It is a fact of life and systems that “things change.” Even when the initial data integration programs work as required, there will be a continuing need to maintain them and keep them up-to-date. This is one of the most overlooked costs of the “do it yourself” approach to data integration and one that is frequently ignored in estimating the magnitude of any in-house data integration effort. This approach frequently does not consider data quality. Even when it does, only the most obvious data quality issues are considered, as the organization’s programming staff does not have the time or experience to build strong data quality tools that are comparable to those offered by commercial data quality vendors.

Additionally, most packaged data integration software has a metadata repository component that allows for sharing of metadata with other data warehouse components such as database design and business intelligence tools. However, in-house software frequently does not provide for sharing its own metadata or leveraging the metadata collected by other data warehouse components. In fact, the metadata collected in the “do it yourself” approach is usually rather limited and may only be contained in COBOL file descriptions for the input and output formats or in the actual program code for the transformation and aggregation logic. In general, metadata residing in “home-grown” software cannot be readily shared with other data warehouse tools.

Commercial Data Integration Tools

Fortunately, the industry has recognized the need for data integration tools and a variety of offerings are commercially available. An appropriate tool should be capable of accessing the organization’s data sources and provide
connectors for packaged enterprise application software systems and XML data sources. It should include a powerful library of built-in data transformation and data aggregation functions that can be extended with the addition of new functions developed by the deploying organization. It must, of course, be able to populate the target data warehouse databases, be they relational databases or proprietary OLAP cubes.

The tool should also have sufficient performance not only for the initial data requirements, but also for the additional content, or additional generations of current content, that can be expected in the future as the data warehouse subject areas grow. This can be accomplished, perhaps, through the use of change data capture techniques. Sufficient headroom should be available to handle not only the current update frequency and data volumes but also anticipated future requirements. Both batch ETL loads and (should future plans require this capability) event-driven “near-real-time” EAI transactions should be supported.

The tool should provide a simple, yet powerful, user interface allowing users to visually perform the data integration tasks without having to write low-level code or even utilize a high level programming language or 4GL. It should be usable by a variety of audiences, not just data integration specialists. Users and data integration specialists should be able to collaborate in an iterative process, and ideally view the transformation process against live data samples prior to actually populating the data warehouse.

The data integration tool should itself be easy to integrate with other technology in use by the organization and should be callable through a variety of mechanisms including application programming interfaces (APIs) and Web services. The technology should also be deployable as a standalone tool. Of particular importance is the ability to perform data cleansing and validation or to be integrated with tools that can provide data quality. Strong metadata capabilities should be included, both for use by the data integration tool itself, and to facilitate the sharing of metadata with the business intelligence tools that will access the integrated data. If directly licensed, the product should operate in a “lights out” environment by setting up appropriate job steps and event-driven conditional error handling routines, with the ability to notify the operations staff if an error condition is encountered that cannot be resolved. It should be offered with several pricing models so it can be economically deployed, such as through an application service provider (ASP) or Software-as-a-Service (SaaS), for utilization by small organizations or smaller units of large enterprises.

Many organizations have attempted to resolve data quality issues through their own devices as well. While edit routines can be built into programs to check for proper formats, value ranges, and field value dependencies in the same record, these are relatively simple when compared, for example, to ensuring that a customer address is valid and up-to-date.

Data quality vendors, once best known for name and address correction and validation, have significantly expanded their capabilities in terms of the scope of the data they can handle (i.e., non-name-and-address data), the volumes they can support, and the databases they maintain in order to validate data. Many have expanded their offerings to include software licensing, ASP and SaaS delivery.

Data Integration Tools Supplied by Database Vendors

Most database vendors now offer data integration tools packaged with, or as options for, their database offerings. Although these tools should certainly be considered when evaluating data integration products, it is important to recognize that database vendors want their own database to be at the center of their customers’ data universe.
Consequently, a data integration tool from a database vendor could very well be optimized to populate the vendor’s own databases by, for example, taking advantage of proprietary features of that database. If used to populate other databases, it may not perform as well, or even at all.

However, if an organization has standardized on a particular database for all of its data warehousing projects, the data integration tools offered by that database vendor could be used as the basis against which other data integration tools are compared.

Summary

Today organizations recognize the significant advantages and value that data warehousing can provide both for pure analysis and as a complement to operational systems. While data warehouses exist in many forms, including enterprise-scale centralized monoliths, dependent and independent data marts, operational data stores, and EII implementations, they all benefit from complete, consistent, and accurate data.

While an organization’s overall data warehouse architecture can encompass a variety of forms, each organization must decide what is right for its own purposes and recognize that implementing a successful data warehousing environment is a continuous journey, not a one-time event.

Whatever the choice, two things are certain: data integration and data quality will be key components of, if not the enabling technology for, the organization’s data warehousing success. Data integration is an ongoing process that comes into play with each data load and with each subject area extension; the quality of the data in the warehouse must be continually monitored to ensure its accuracy. Organizations that ignore these requirements must be careful that instead of building a data warehouse that will be of benefit to their users, they do not inadvertently wind up creating a repository that provides suboptimal business value.

Many data warehousing industry vendors can provide robust data integration and data quality solutions. In addition to developing and marketing products, these vendors offer a wealth of experience and expertise that they can share with their customers. As a result, an organization is best served when it deploys a commercial, fully supported and maintained set of tools rather than trying to develop and maintain such a technology on its own.
strategic analysis offerings and apply our expertise to help you take action that
leads to better, more insightful decisions. You will get a more accurate view of your
customers, and integrate that intelligence into your daily business operations to
MAS Strategies specializes in helping vendors market and position their business
tactical and strategic product and marketing decisions. MAS Strategies also assists
analysis, and project implementations. For more information about MAS Strategies,
FOOTNOTES
1 W.H. Inmon, “Building the Data Warehouse,” John Wiley & Sons, Inc. (1992)
2 David Loshin, “Core Data Services: Basic Components for Establishing Business Value,” Pitney Bowes Business Insight, White Paper 2009
Every connection is a new opportunity ™
United States
One Global View
Troy, NY 12180
1.800.327.8627
pbbi.sales@pb.com
www.pbinsight.com
Canada
26 Wellington Street East
Suite 500
Toronto, ON M5E 1S2
1.800.268.3282
pbbi.canada.sales@pb.com
www.pbinsight.ca
EUROPE/UNITED KINGDOM
Minton Place
Victoria Street
Windsor, Berkshire SL4 1EG
+44.800.840.0001
pbbi.europe@pb.com
www.pbinsight.co.uk
Asia Pacific/AUSTRALIA
Level 7, 1 Elizabeth Plaza
North Sydney NSW 2060
+61.2.9437.6255
pbbi.australia@pb.com
pbbi.singapore@pb.com
www.pbinsight.com.au
92416 AM 1110