01 - Overview and Concepts


CS131-8:

Data Warehousing and Data Mining


01 – Overview and Concepts
Objectives
• Understand the desperate need for strategic information
• Recognize the information crisis at every enterprise
• Distinguish between operational and informational systems
• Learn why all past attempts to provide strategic information
failed
• Clearly see why data warehousing is the viable solution
• Understand business intelligence for an enterprise
A New Paradigm
• Companies started building and using computer systems in the
1960s and have become completely dependent on them. As an
enterprise grows larger, hundreds of computer applications are
needed to support the various business processes.
• In the 1990s, as businesses grew more complex, corporations
spread globally, and competition became fiercer, business
executives became desperate for information to stay competitive
and improve the bottom line.
Competitive Advantage

Figure 1. Organizations’ use of data warehousing.


Escalating Need for Strategic Information
• Who needs strategic information in an enterprise?
The executives and managers who are responsible for keeping the
enterprise competitive need information to make proper decisions.

• What do we mean by strategic information?
The combined essential information needed to make decisions in the
formulation and execution of business strategies and objectives.
Examples of Business Objectives
• Retain the present customer base
• Increase the customer base by 15% over the next 5 years
• Improve product quality levels in the top five product groups
• Gain market share by 10% in the next 3 years
• Enhance customer service level in shipments
• Bring three new products to market in 2 years
• Increase sales by 15% in the North East Division
Strategic Information
• Strategic information is not for running the day-to-day operations of
the business. It is not intended to produce an invoice, make a
shipment, settle a claim, or post a withdrawal from a bank account.
• Strategic information is far more important for the continued health
and survival of the corporation.
• Critical business decisions depend on the availability of proper
strategic information in an enterprise.
Characteristics of Strategic Information

Figure 2. Characteristics of Strategic Information


The Information Crisis
Whatever the size of the company, we are faced with two startling
facts:
• Organizations have lots of data
• Information technology resources and systems are not effective at
turning all that data into useful strategic information
Technology Trends

Figure 3. Explosive growth of information technology.


Opportunities
• A business unit of a leading long-distance telephone carrier empowers its
sales personnel to make better business decisions and thereby capture
more business in a highly competitive, multibillion-dollar market. A Web-
accessible solution gathers internal and external data to provide strategic
information.
• Availability of strategic information at one of the largest banks in the
United States with assets in the $250 billion range allows users to make
quick decisions to retain their valued customers.
• In the case of a large health management organization, significant
improvements in health care programs are realized, resulting in a 22%
decrease in emergency room visits, a 29% decrease in hospital admissions
for asthmatic children, potentially sight-saving screenings for hundreds of
diabetics, improved vaccination rates, and more than 100,000
performance reports created annually for physicians and pharmacists.
Opportunities
• At one of the top five U.S. retailers, strategic information combined
with Web-enabled analysis tools enables merchants to gain insights
into their customer base, manage inventories more tightly, and
keep the right products in front of the right people at the right place
at the right time.
• A community-based pharmacy that competes on a national scale
with more than 800 franchised pharmacies coast to coast gains in-
depth understanding of what customers buy, resulting in reduced
inventory levels, improved effectiveness of promotions and
marketing campaigns, and improved profitability for the company.
• A large electronics company saves millions of dollars a year because
of better management of inventory.
Risks
• With an average fleet of about 150,000 vehicles, a nationwide car
rental company can easily get into the red at the bottom line if fleet
management is not effective. The fleet is the biggest cost in that
business. With intensified competition, the potential for failure is
immense if the fleet is not managed effectively. Car idle time must
be kept to an absolute minimum. In attempting to accomplish this,
failure to have the right class of car available in the right place at the
right time, all washed and ready, can lead to serious loss of
business.
Risks
• For a world-leading supplier of systems and components to
automobile and light truck equipment manufacturers, the serious
challenges included inconsistent data computations across
nearly 100 plants, inability to benchmark quality metrics, and
time-consuming manual collection of data. Reports needed to support
decision making took weeks. It was never easy to get company-wide
integrated information.
Risks
• For a large utility company that provided electricity to about 25
million consumers in five mid-Atlantic states in the United States,
deregulation could result in a few winners and lots of losers.
Remaining competitive and perhaps even just surviving depended
on centralizing strategic information from various sources,
streamlining data access, and facilitating analysis of the information
by the business units.
Failures of Past Decision-Support Systems
Assume a specific scenario. The marketing department in your
company has been concerned about the performance of the West
Coast region and the sales numbers from the monthly report this
month are drastically low. The marketing vice president is agitated and
wants to get some reports from the IT department to analyze the
performance over the past two years, product by product, and
compared to monthly targets. He wants to make quick strategic
decisions to rectify the situation. The CIO wants your boss to deliver
the reports as soon as possible. Your boss runs to you and asks you to
stop everything and work on the reports. There are no regular reports
from any system to give the marketing department what they want.
You have to gather the data from multiple applications and start from
scratch. Does this sound familiar?
Inability to Provide Information

Figure 4. Inadequate attempts by IT to provide strategic information.


Operational vs. Decision-Support Systems

Figure 5. Operational Systems (left) vs. Decision-Support Systems (right)


The Need for Informational Systems
As companies face fiercer competition and businesses become
more complex, continuing the past practices will only lead to
disaster. We need to design and build informational systems
• That serve different purposes
• Whose scopes are different
• Whose data content is different
• Where the data usage patterns are different
• Where the data access types are different
Inability to Provide Information

Figure 6. Operational and informational systems.


Data Warehousing – The Only Viable Solution

Figure 7. General overview of the data warehouse


Data Warehouse Defined
A physical repository where relational data (current and historical)
are specially organized to provide enterprise-wide, cleansed data in
a standardized format.
The data warehouse is an informational environment that:
• Provides an integrated and total view of the enterprise.
• Makes the enterprise’s current and historical information easily
available for strategic decision making.
• Makes decision-support transactions possible without hindering
operational systems.
• Renders the organization’s information consistent.
• Presents a flexible and interactive source of strategic information.
Data Warehousing – The Only Viable Solution

Figure 8. A blend of technologies


Characteristics of Data Warehouse
• Subject oriented: data are organized by detailed subject and
contain only information relevant for decision support. This
provides a more comprehensive view of the organization
• Integrated: a data warehouse must place data from different
sources into a consistent format
• Time variant (time series): it contains historical data (daily, weekly,
and monthly) in addition to current (real-time) data
• Non-volatile: data cannot be changed or updated after they have
been entered into the data warehouse. Obsolete (old) data are
discarded, and changes are recorded as new data
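The non-volatile and time-variant properties can be illustrated with an append-only table. The following is a minimal sketch using Python's built-in sqlite3 module; the table product_price_history, its columns, and the sample prices are hypothetical and are not part of the course material.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE product_price_history (   -- hypothetical table name
        sku        TEXT NOT NULL,
        price      REAL NOT NULL,
        valid_from TEXT NOT NULL,          -- ISO date the price took effect
        PRIMARY KEY (sku, valid_from)
    )
    """
)

def record_price(sku, price, valid_from):
    """Record a change as a new row; existing rows are never updated."""
    conn.execute(
        "INSERT INTO product_price_history VALUES (?, ?, ?)",
        (sku, price, valid_from),
    )

def price_as_of(sku, date):
    """Return the price in effect on the given ISO date, or None."""
    row = conn.execute(
        """
        SELECT price FROM product_price_history
        WHERE sku = ? AND valid_from <= ?
        ORDER BY valid_from DESC LIMIT 1
        """,
        (sku, date),
    ).fetchone()
    return row[0] if row else None

# Made-up sample data: the second call adds history instead of overwriting.
record_price("SHIRT-001", 20.0, "2016-01-01")
record_price("SHIRT-001", 18.0, "2016-03-01")

history_rows = conn.execute(
    "SELECT COUNT(*) FROM product_price_history"
).fetchone()[0]
```

Because no row is ever updated in place, a question about any past date can still be answered from the same table.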
Characteristics of Data Warehouse
• Web based: designed for web-based applications
• Relational/multidimensional: its structure is either relational or
multidimensional
• Uses client/server: a client/server architecture makes it easy to access
• Real-time: a characteristic of newer data warehouses
• Includes metadata: data about data (how the data are
organized and how to use them)
Definitions and Concepts
Data mart
A logical and physical subset of a data warehouse; in its most
simplistic form, represents data from a single business process (e.g.,
retail sales, retail inventory, purchase orders). Basically, a
departmental data warehouse.
• Dependent data mart - A subset that is created directly from a
data warehouse
• Independent data mart - A small data warehouse designed for a
strategic business unit (SBU) or a department, whose source is not
the EDW (Enterprise Data Warehouse)
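A dependent data mart can be sketched as a subset defined directly on top of the warehouse. The example below is an illustration only, using Python's built-in sqlite3 module; the table warehouse_fact, the view retail_sales_mart, and all rows are made up for this sketch.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    -- Hypothetical enterprise-wide table covering several business processes.
    CREATE TABLE warehouse_fact (
        business_process TEXT,
        store_number     INTEGER,
        sale_date        TEXT,
        amount           REAL
    );
    INSERT INTO warehouse_fact VALUES
        ('retail sales',    1, '2016-03-09', 25.0),
        ('retail sales',    2, '2016-03-10', 40.0),
        ('purchase orders', 1, '2016-03-09', 300.0);

    -- Dependent data mart: a subset created directly from the warehouse,
    -- restricted to a single business process.
    CREATE VIEW retail_sales_mart AS
        SELECT store_number, sale_date, amount
        FROM warehouse_fact
        WHERE business_process = 'retail sales';
    """
)

mart_rows = conn.execute(
    "SELECT store_number, sale_date, amount "
    "FROM retail_sales_mart ORDER BY store_number"
).fetchall()
```

A view gives only a logical subset. A physical data mart would copy the same rows into its own table or database, and an independent data mart would be loaded from the source systems rather than from the warehouse.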
Definitions and Concepts
Operational data stores (ODS)
A type of database often used as an interim (temporary) staging area
for a data warehouse, especially for customer information files
Oper marts
An operational data mart. An oper mart is a small-scale data mart
typically used by a single department or functional area in an
organization that needs to analyze operational data
Definitions and Concepts
Enterprise data warehouse (EDW)
A large-scale data warehouse that integrates data from many source
systems and is used across the enterprise for decision support
Metadata
Data about data. In a data warehouse, metadata describe the
contents of a data warehouse and the manner of its use
Data Warehousing Process Overview
• Organizations continuously collect data, information, and knowledge at
an increasingly accelerated rate and store them in computerized systems
• The number of users needing to access the information continues to
increase as a result of improved reliability and availability of network
access, especially the Internet

Figure 9. Process Overview


Major Components
Data sources: internal systems, external data providers, OLTP and ERP
systems, and Web data
Data extraction: performed with custom-written or commercial software
called ETL (extract, transform, load) tools
Data loading: data are loaded into a staging area to be transformed and
cleansed, then loaded into the warehouse
Comprehensive database: the EDW, which supports all decision analysis
Metadata: maintained to ease indexing and search
Middleware tools: enable access to the data warehouse. They include data
mining tools, OLAP, reporting tools, and data visualization tools.
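The extraction, staging, and loading components can be sketched as three small functions. This is a simplified illustration in Python, not a description of any particular ETL product; the two sources, their field names, and the sales table are assumptions made up for the example.

```python
import csv
import io
import sqlite3

# Two hypothetical sources with different layouts and date formats.
internal_rows = [
    {"store": 1, "sold_on": "2016-03-09", "net": "25.00"},
    {"store": 2, "sold_on": "2016-03-10", "net": "40.50"},
]
external_csv = "StoreNo,Date,NetSales\n3,03/11/2016,12.25\n"

def extract():
    """Collect raw records from both sources into one staging list."""
    staging = [dict(r, source="internal") for r in internal_rows]
    for r in csv.DictReader(io.StringIO(external_csv)):
        staging.append({"store": r["StoreNo"], "sold_on": r["Date"],
                        "net": r["NetSales"], "source": "external"})
    return staging

def transform(staging):
    """Cleanse staged records into one standard format."""
    cleaned = []
    for r in staging:
        date = r["sold_on"]
        if "/" in date:  # convert MM/DD/YYYY to YYYY-MM-DD
            month, day, year = date.split("/")
            date = f"{year}-{month}-{day}"
        cleaned.append((int(r["store"]), date, float(r["net"])))
    return cleaned

def load(conn, rows):
    """Load cleansed rows into the warehouse table."""
    conn.execute(
        "CREATE TABLE sales (store_number INTEGER, sale_date TEXT, net_sales REAL)"
    )
    conn.executemany("INSERT INTO sales VALUES (?, ?, ?)", rows)

conn = sqlite3.connect(":memory:")
load(conn, transform(extract()))
loaded = conn.execute("SELECT * FROM sales ORDER BY store_number").fetchall()
```

Keeping the transform step separate from the load step mirrors the staging area described above: data are cleansed before they reach the warehouse.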
Cardinal Merch Case Study
A number of Mapua alumni formed ‘Cardinal Merch, Inc.’, one of the
world’s largest marketers of statement shirts.
Their statement shirts can be purchased at their stores throughout
Europe and Asia. They can also be purchased online at
CardinalMerch.com.
The company is headquartered in Singapore where its Asia Sales
Office is located. The European operation is headquartered in Paris.
Cardinal Merch shirts are manufactured at third party factories
throughout the world. It has warehouses in Asia and Europe
accepting product and shipping to store locations and online
customers.
Cardinal Merch Marketing
CMI has an innovative approach to understanding its customers.
As part of each sales transaction, it collects demographic
information.
Its products are frequently reviewed on Lazada.com. The company
attracts customers through a variety of promotional offers.
Its goal is to become the #1 seller of statement shirts, and it hired
us to build a data warehouse to increase sales and reduce
costs.
CMI Problem
CMI wants to answer the following question:
How many men's statement shirts were sold in the Philippines during the
week before the NCAA basketball semifinals in 2016, and what were the
total net sales?
Let's try to answer this using their transactional (OLTP) system
• Men statement shirts is a category type
• Philippines is a division
Tables in the transactional model (PK = primary key, FK = foreign key):
• DivisionList: PK DivisionNumber, DivisionName, DivisionDirector
• RegionList: PK RegionNumber, RegionName, FK DivisionNumber, RegionManager
• StoreList: PK StoreNumber, StoreName, FK RegionNumber, AddressLine1,
AddressLine2, City, State, Zip
• SaleTransaction: PK TransactionID, FK StoreNumber, TimestampOfSale,
RegisterNumber, TotalSales, PaymentMethod
• TransactionLedger: PK TransactionID, PK LineNumber, FK SKU, Price,
Discount, NetPrice
• CategoryList: PK CategoryCode, CategoryName, CategoryDescription
• BrandList: PK BrandCode, FK CategoryCode, BrandName, BrandManager
• ProductList: PK ProductNumber, FK BrandCode, ProductName, ManufacturerID
• SKUList: PK SKU, FK ProductNumber, Size, Color

Figure 10. The Transactional Data Model


The Query
-- Count the ledger lines sold and total their net price.
select count(TL.NetPrice) as Total_Items, sum(TL.NetPrice) as Total_Sales
from TransactionLedger TL
inner join SaleTransaction ST on (TL.TransactionID = ST.TransactionID)
inner join StoreList SL on (ST.StoreNumber = SL.StoreNumber)
inner join RegionList RL on (SL.RegionNumber = RL.RegionNumber)
inner join DivisionList DL on (RL.DivisionNumber = DL.DivisionNumber)
inner join SKUList SK on (TL.SKU = SK.SKU)
inner join ProductList PL on (SK.ProductNumber = PL.ProductNumber)
inner join BrandList BL on (PL.BrandCode = BL.BrandCode)
inner join CategoryList CL on (BL.CategoryCode = CL.CategoryCode)
where DL.DivisionName = 'Philippines'
and CL.CategoryName = 'Men Statement Shirts'
-- Sales from 2016-03-09 through 2016-03-16, written as a half-open range.
and ST.TimestampOfSale >= to_timestamp('2016-03-09 00:00:00', 'YYYY-MM-DD HH24:MI:SS')
and ST.TimestampOfSale < to_timestamp('2016-03-17 00:00:00', 'YYYY-MM-DD HH24:MI:SS')
This query is complex, and it is easy to commit errors when writing it.
The Solution
Use a separate system: a data warehouse is designed for analytics.
Transactional models, while efficient for transaction processing, are
not good for analytics.
• The goal of a transactional system is to capture data quickly. Users use
the system to do their jobs.
• Transactional systems are not designed to minimize the time or
complexity of retrieving large amounts of data for analysis.
Why Use a Separate System?
Performance
• Operational databases are designed and tuned for known
transactions and workloads.
• Complex decision-support queries would degrade performance
for operational transactions.
• Special data organization, access, and implementation methods are
needed for multidimensional views and queries.
Why Use a Separate System?
Function
• Historical Data: Decision support requires historical data, which
operational databases do not typically maintain.
• Data Consolidation: Decision support requires consolidation
(aggregation, summarization) of data from many heterogeneous
sources: operational databases, external sources.
• Data Quality: Different sources typically use inconsistent data
representations, codes, and formats which must be reconciled.
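Consolidation and reconciliation can be sketched together. In this illustrative Python example, two regional sources use different payment-method codes; the code mapping, the source lists, and the amounts are all assumptions made up for the sketch.

```python
from collections import defaultdict

# Hypothetical mapping from source-specific codes to one standard code.
PAYMENT_CODES = {
    "cc": "CARD", "credit": "CARD", "1": "CARD",
    "ca": "CASH", "cash": "CASH", "2": "CASH",
}

def reconcile(code):
    """Map a source-specific payment code to the standard code."""
    return PAYMENT_CODES.get(str(code).strip().lower(), "UNKNOWN")

def consolidate(*sources):
    """Aggregate amounts from several sources by standard payment code."""
    totals = defaultdict(float)
    for source in sources:
        for code, amount in source:
            totals[reconcile(code)] += amount
    return dict(totals)

# Made-up extracts from two operational systems with inconsistent codes.
asia_sales = [("CC", 100.0), ("CA", 50.0)]
europe_sales = [("credit", 70.0), ("2", 30.0), ("voucher", 5.0)]

totals = consolidate(asia_sales, europe_sales)
```

Codes that cannot be mapped are kept under UNKNOWN rather than dropped, so that data quality problems stay visible in the summarized result.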
Why Use a Separate System?
Security
• Confidentiality: Need to limit visibility to sensitive data.
• Privacy: Need to de-identify personal data.
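One common way to de-identify personal data before loading it is to replace direct identifiers with a keyed hash. The sketch below uses Python's standard hmac and hashlib modules; the record layout, the field names, and the key are hypothetical. Note that this is pseudonymization only, and it may not be sufficient on its own to meet a given privacy requirement.

```python
import hashlib
import hmac

# Hypothetical key; in practice it would be stored outside the warehouse.
SECRET_KEY = b"replace-with-a-managed-secret"

def pseudonymize(customer_id: str) -> str:
    """Replace an identifier with a keyed hash (HMAC-SHA256)."""
    return hmac.new(
        SECRET_KEY, customer_id.encode("utf-8"), hashlib.sha256
    ).hexdigest()

def de_identify(record: dict) -> dict:
    """Drop direct identifiers and keep only a pseudonymous key."""
    return {
        "customer_key": pseudonymize(record["customer_id"]),
        "age_band": record["age_band"],
        "net_price": record["net_price"],
    }

# Made-up sales record containing personal data.
raw = {
    "customer_id": "C-1001",
    "name": "Juan Dela Cruz",
    "age_band": "25-34",
    "net_price": 19.99,
}
safe = de_identify(raw)
```

The same customer always maps to the same customer_key, so purchases can still be analyzed per customer without exposing the name or the original identifier.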
Simple Warehouse Data Model
Tables in the warehouse model (PK = primary key, FK = foreign key):
• SalesFact: PK FK SKU, PK FK Date, PK FK StoreNumber, NetPrice, QuantitySold
• DateDim: PK Date, CharDate, (Other Date Info...)
• ProductDim: PK SKU, ProductNumber, (Product Info...), BrandID, BrandName,
(Other Brand Info...), CategoryCode, (Category Info...)
• StoreDim: PK StoreNumber, StoreName, (Store Info...), RegionNumber,
(Region Info...), DivisionNumber, DivisionName, (Other Division Info...)

Figure 11. A Simple Warehouse Data Model


The Query
select sum(SF.QuantitySold) as Total_Items, sum(SF.NetPrice) as Total_Sales
from SalesFact SF
inner join StoreDim SD on (SF.StoreNumber = SD.StoreNumber)
inner join ProductDim PD on (SF.SKU = PD.SKU)
inner join DateDim DD on (SF.Date = DD.Date)
-- DivisionName is a column of StoreDim, not DateDim.
where SD.DivisionName = 'Philippines'
and PD.CategoryName = 'Men Statement Shirts'
and DD.CharDate >= '2016-03-09'
and DD.CharDate <= '2016-03-16'
This query is simpler and easier to understand.
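To try the warehouse query, the model in Figure 11 can be reduced to the columns the query needs and loaded into an in-memory SQLite database. This is a minimal sketch: the sample rows are made up, the integer date keys are an assumption, and DivisionName is read from StoreDim, where Figure 11 places it.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE DateDim    (Date INTEGER PRIMARY KEY, CharDate TEXT);
    CREATE TABLE StoreDim   (StoreNumber INTEGER PRIMARY KEY, StoreName TEXT,
                             DivisionName TEXT);
    CREATE TABLE ProductDim (SKU TEXT PRIMARY KEY, CategoryName TEXT);
    CREATE TABLE SalesFact  (SKU TEXT, Date INTEGER, StoreNumber INTEGER,
                             NetPrice REAL, QuantitySold INTEGER,
                             PRIMARY KEY (SKU, Date, StoreNumber));

    -- Made-up sample rows, for illustration only.
    INSERT INTO DateDim VALUES
        (1, '2016-03-08'), (2, '2016-03-10'), (3, '2016-03-12');
    INSERT INTO StoreDim VALUES
        (10, 'Manila', 'Philippines'), (20, 'Paris', 'France');
    INSERT INTO ProductDim VALUES
        ('M-01', 'Men Statement Shirts'), ('W-01', 'Women Statement Shirts');
    INSERT INTO SalesFact VALUES
        ('M-01', 1, 10, 20.0, 1),   -- before the date range
        ('M-01', 2, 10, 40.0, 2),   -- counted
        ('M-01', 3, 10, 60.0, 3),   -- counted
        ('W-01', 2, 10, 25.0, 1),   -- wrong category
        ('M-01', 2, 20, 20.0, 1);   -- wrong division
    """
)

QUERY = """
    SELECT SUM(SF.QuantitySold) AS Total_Items, SUM(SF.NetPrice) AS Total_Sales
    FROM SalesFact SF
    INNER JOIN StoreDim   SD ON SF.StoreNumber = SD.StoreNumber
    INNER JOIN ProductDim PD ON SF.SKU = PD.SKU
    INNER JOIN DateDim    DD ON SF.Date = DD.Date
    WHERE SD.DivisionName = 'Philippines'
      AND PD.CategoryName = 'Men Statement Shirts'
      AND DD.CharDate >= '2016-03-09'
      AND DD.CharDate <= '2016-03-16'
"""

total_items, total_sales = conn.execute(QUERY).fetchone()
print(total_items, total_sales)
```

Only three joins are needed, one per dimension, compared with eight joins in the transactional version of the same question.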
The Solution
For the rest of the course, we will discuss how to design and
implement a data warehouse, best practices and other
considerations.
Data Warehouses (Analytical systems)
• Designed to extract and query data quickly
• Access speed is the main concern
• Hence, normalization, which is widely used for transactional databases, is
generally not appropriate for data warehouse design
• Design should reflect a multidimensional view
This is called a dimensional model or “star schema”
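The move away from normalization can be sketched by flattening the store, region, and division tables of Figure 10 into the single StoreDim table of Figure 11. The example uses Python's built-in sqlite3 module and only a subset of the columns; the store and region names are made up.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    -- Normalized transactional tables (a subset of the columns in Figure 10).
    CREATE TABLE DivisionList (DivisionNumber INTEGER PRIMARY KEY,
                               DivisionName TEXT);
    CREATE TABLE RegionList   (RegionNumber INTEGER PRIMARY KEY, RegionName TEXT,
                               DivisionNumber INTEGER REFERENCES DivisionList);
    CREATE TABLE StoreList    (StoreNumber INTEGER PRIMARY KEY, StoreName TEXT,
                               RegionNumber INTEGER REFERENCES RegionList);

    -- Made-up sample rows.
    INSERT INTO DivisionList VALUES (1, 'Philippines');
    INSERT INTO RegionList   VALUES (11, 'Metro Manila', 1);
    INSERT INTO StoreList    VALUES (101, 'Intramuros', 11), (102, 'Makati', 11);

    -- Denormalized dimension: one row per store, with the region and
    -- division attributes repeated on every store row.
    CREATE TABLE StoreDim AS
        SELECT SL.StoreNumber, SL.StoreName,
               RL.RegionNumber, RL.RegionName,
               DL.DivisionNumber, DL.DivisionName
        FROM StoreList SL
        INNER JOIN RegionList RL   ON SL.RegionNumber = RL.RegionNumber
        INNER JOIN DivisionList DL ON RL.DivisionNumber = DL.DivisionNumber;
    """
)

store_dim = conn.execute(
    "SELECT StoreNumber, RegionName, DivisionName "
    "FROM StoreDim ORDER BY StoreNumber"
).fetchall()
```

The joins are paid once, when the dimension is built, so analytical queries can filter on DivisionName with a single join from the fact table. The cost is redundancy: the region and division names are stored once per store.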
Questions?
