IT6702 - DATA WAREHOUSING AND DATA MINING
UNIT-1 DATA WAREHOUSING
Part – A
1. What is a data warehouse? (May/June 2010)
A data warehouse is a repository of multiple heterogeneous data sources organized
under a unified schema at a single site to facilitate management decision making.
A data warehouse is a subject-oriented, integrated, time-variant, and nonvolatile
collection of data in support of management’s decision-making process.
2. What are the uses of multifeature cubes? (Nov/Dec 2007)
Multifeature cubes compute complex queries involving multiple dependent
aggregates at multiple granularities. These cubes are very useful in practice: many complex
data mining queries can be answered by multifeature cubes without any significant increase
in computational cost, compared with cube computation for simple queries with standard
data cubes.
3. What is Data mart? (May/June 2013)
A data mart is a data store that is subsidiary to a data warehouse of integrated data.
The data mart is directed at a partition of data that is created for the use of a dedicated group
of users.
4. What is data warehouse metadata? (Apr/May 2008)
Metadata are data about data. When used in a data warehouse, metadata are the data
that define warehouse objects. Metadata are created for the data names and definitions of
the given warehouse. Additional metadata are created and captured for time stamping any
extracted data, the source of the extracted data, and missing fields that have been added by
data cleaning or integration processes.
5. In the context of data warehousing what is data transformation? (May/June 2009)
In data transformation, the data are transformed or consolidated into forms appropriate
for mining. Data transformation can involve the following:
Smoothing
Aggregation
Generalization
Normalization
Attribute construction
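As a small illustration of the normalization step listed above, here is a hedged Python sketch of min-max normalization (one standard technique; the income values are made up for illustration):

```python
# Min-max normalization: rescale an attribute's values to [new_min, new_max].
def min_max_normalize(values, new_min=0.0, new_max=1.0):
    old_min, old_max = min(values), max(values)
    span = old_max - old_min
    return [(v - old_min) / span * (new_max - new_min) + new_min for v in values]

incomes = [12000, 35000, 58000, 98000]   # illustrative attribute values
print(min_max_normalize(incomes))        # [0.0, 0.2674..., 0.5348..., 1.0]
```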
6. List the characteristics of a data warehouse. (Nov/Dec 2009)
There are four key characteristics which separate the data warehouse from other
major operational systems:
o Subject Orientation: Data organized by subject
o Integration: Consistency of defining parameters
o Non-volatility: Stable data storage medium
o Time-variance: Timeliness of data and access terms
7. What are the various sources for data warehouse? (Nov/Dec 2009)
Handling of relational and complex types of data: Because relational databases and
data warehouses are widely used, the development of efficient and effective data
mining systems for such data is important.
Mining information from heterogeneous databases and global information
systems: Local- and wide-area computer networks (such as the Internet) connect
many sources of data, forming huge, distributed, and heterogeneous databases.
8. What is bitmap indexing? (Nov/Dec 2009)
The bitmap indexing method is popular in OLAP products because it allows quick
searching in data cubes. The bitmap index is an alternative representation of the record ID
(RID) list.
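A minimal sketch of the idea in Python (the column values are invented for illustration): each distinct value of an attribute gets one bit vector, with bit i set when record i holds that value, so equality and OR queries become fast bitwise operations.

```python
# Build a bitmap index for one column: one bit vector per distinct value.
# Bit i of a value's bitmap is 1 when record i (RID i) holds that value.
def build_bitmap_index(column):
    index = {}
    for rid, value in enumerate(column):
        index.setdefault(value, 0)
        index[value] |= 1 << rid
    return index

region = ["Asia", "Europe", "Asia", "America", "Europe"]  # RIDs 0..4
idx = build_bitmap_index(region)

# "region = Asia OR region = Europe" is a single bitwise OR of two bitmaps.
mask = idx["Asia"] | idx["Europe"]
matches = [rid for rid in range(len(region)) if mask >> rid & 1]
print(matches)  # [0, 1, 2, 4]
```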
9. Differentiate fact table and dimension table. (May/June 2010)
A fact table contains the names of the facts (or measures) as well as keys to each of the
related dimension tables.
A dimension table describes a dimension. (e.g.) A dimension table for
item may contain the attributes item_name, brand and type.
10. Briefly discuss the schemas for multidimensional databases. (May/June 2010)
Star schema: The most common modeling paradigm is the star schema, in which
the data warehouse contains (1) a large central table (fact table) containing the bulk
of the data, with no redundancy, and (2) a set of smaller attendant tables (dimension
tables), one for each dimension.
Snowflake schema: The snowflake schema is a variant of the star schema model,
where some dimension tables are normalized, thereby further splitting the data into
additional tables. The resulting schema graph forms a shape similar to a snowflake.
Fact Constellations: Sophisticated applications may require multiple fact tables to
share dimension tables. This kind of schema can be viewed as a collection of stars,
and hence is called a galaxy schema or a fact constellation.
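To make the star schema concrete, here is a hedged Python sketch; all table contents, key names, and measures are invented for illustration. A central sales fact table holds foreign keys into time and item dimension tables plus numeric measures, and a query resolves the keys against the dimensions before aggregating:

```python
# A tiny star schema sketched as in-memory tables (all data made up):
# one central fact table keyed into two dimension tables.
time_dim = {1: {"day": "2024-01-01", "quarter": "Q1", "year": 2024},
            2: {"day": "2024-04-01", "quarter": "Q2", "year": 2024}}
item_dim = {10: {"item_name": "laptop", "brand": "A", "type": "electronics"},
            11: {"item_name": "desk",   "brand": "B", "type": "furniture"}}

# Fact table: foreign keys into each dimension plus numeric measures.
sales_fact = [
    {"time_key": 1, "item_key": 10, "units_sold": 3, "dollars_sold": 3600.0},
    {"time_key": 1, "item_key": 11, "units_sold": 1, "dollars_sold": 250.0},
    {"time_key": 2, "item_key": 10, "units_sold": 2, "dollars_sold": 2400.0},
]

# Star join: resolve each fact row's keys against the dimension tables,
# then aggregate a measure by a dimension attribute (sales per quarter).
sales_by_quarter = {}
for row in sales_fact:
    quarter = time_dim[row["time_key"]]["quarter"]
    sales_by_quarter[quarter] = sales_by_quarter.get(quarter, 0.0) + row["dollars_sold"]
print(sales_by_quarter)  # {'Q1': 3850.0, 'Q2': 2400.0}
```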
11. How is a data warehouse different from a database? How are they similar?
(Nov/Dec 2007, Nov/Dec 2010, May/June 2012)
Data warehouse is a repository of multiple heterogeneous data sources, organized
under a unified schema at a single site in order to facilitate management decision-making. A
relational database is a collection of tables, each of which is assigned a unique name. Each
table consists of a set of attributes (columns or fields) and usually stores a large set of tuples
(records or rows). Each tuple in a relational table represents an object identified by a unique
key and described by a set of attribute values. Both are used to store and manipulate the
data.
12. List out the functions of OLAP servers in the data warehouse architecture.
(Nov/Dec 2010)
The OLAP server performs multidimensional queries of data and stores the results in
its multidimensional storage. It speeds the analysis of fact tables into cubes, stores the cubes
until needed, and then quickly returns the data to clients.
13. Differentiate data mining and data warehousing. (Nov/Dec 2011)
Data mining refers to extracting or “mining” knowledge from large amounts of
data. The term is actually a misnomer. Remember that the mining of gold from rocks
or sand is referred to as gold mining rather than rock or sand mining. Thus, data
mining should have been more appropriately named “knowledge mining from data.”
A data warehouse is usually modeled by a multidimensional database structure,
where each dimension corresponds to an attribute or a set of attributes in the schema,
and each cell stores the value of some aggregate measure, such as count or sales
amount.
14. List out the logical steps needed to build a Data warehouse.
Collect and analyze business requirements.
Create a data model and a physical design for the data warehouse.
Define the data sources.
Choose the database technology and platform for the warehouse.
Extract the data from the operational databases, transform it, clean it up and
load it into the warehouse database (sketched below).
Choose database access and reporting tools.
Choose database connectivity software.
Choose data analysis and presentation software.
Update the data warehouse periodically.
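The extract-transform-load step above can be sketched in miniature; in this hedged Python example the source rows, field names, and cleanup rules are all assumptions made for illustration:

```python
# Minimal extract-transform-load (ETL) sketch; the source rows, field
# names, and cleanup rules are illustrative assumptions.
raw_orders = [  # "extracted" rows from an operational system
    {"id": "1", "amount": " 120.50", "region": "north"},
    {"id": "2", "amount": "80",      "region": "NORTH"},
    {"id": "2", "amount": "80",      "region": "NORTH"},  # duplicate record
]

def transform(rows):
    seen, clean = set(), []
    for r in rows:
        if r["id"] in seen:            # clean up: drop duplicate records
            continue
        seen.add(r["id"])
        clean.append({
            "id": int(r["id"]),
            "amount": float(r["amount"].strip()),   # normalize types
            "region": r["region"].strip().lower(),  # consistent coding
        })
    return clean

warehouse_table = []                   # stand-in for the warehouse load target
warehouse_table.extend(transform(raw_orders))
print(warehouse_table)
```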
15. Write note on shared-nothing architecture.
The data is partitioned across all disks, and the DBMS is partitioned across multiple
co-servers, each of which resides on an individual node of the parallel system and owns
its own disk and thus its own database partition.
A shared-nothing RDBMS parallelizes the execution of a SQL query across multiple
processing nodes.
Each processor has its own memory and disk and communicates with other processors
by exchanging messages and data over the interconnection network.
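A toy Python sketch of the shared-nothing idea (the node count and rows are invented): rows are hash-partitioned across nodes, each node aggregates only its own partition, and the partial results are exchanged and merged by a coordinator.

```python
# Shared-nothing parallelism in miniature: each "node" owns a disjoint
# partition of the rows and aggregates it independently; the coordinator
# merges the per-node partial results. Data and node count are made up.
rows = [("milk", 2), ("bread", 1), ("milk", 4), ("eggs", 3), ("bread", 2)]
NODES = 3

# Partition rows by hashing the key, so each node owns its own slice.
partitions = [[] for _ in range(NODES)]
for key, qty in rows:
    partitions[hash(key) % NODES].append((key, qty))

def node_aggregate(partition):        # runs independently on each node
    totals = {}
    for key, qty in partition:
        totals[key] = totals.get(key, 0) + qty
    return totals

# "Message exchange": merge the partial aggregates at the coordinator.
result = {}
for partial in map(node_aggregate, partitions):
    for key, qty in partial.items():
        result[key] = result.get(key, 0) + qty
print(result)  # {'milk': 6, 'bread': 3, 'eggs': 3}
```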
16. What are the access tools groups available?
Data query and reporting tools
Application development tools
Executive information system (EIS) tools
On-line analytical processing tools
Data mining tools
PART – B
1. With a neat sketch, describe in detail the data warehouse architecture. (Nov/Dec 2012)
(OR) List and discuss the characteristics and main functions performed by the
components of a data warehouse. Give a diagrammatic illustration. (May/June 2014,
May/June 2012)
2. List and discuss the steps involved in building a data warehouse. (Nov/Dec 2012)
3. Give detailed information about metadata in data warehousing. (May/June 2014)
4. List and discuss the steps involved in mapping the data warehouse to a
multiprocessor architecture. (May/June 2014, Nov/Dec 2011)
5. i) Explain the role played by sourcing, acquisition, clean-up and transformation tools
in data warehousing. (May/June 2013)
ii) Explain about STAR join and STAR index. (Nov/Dec 2012)
6. Describe in detail about DBMS schemas for decision support.
7. Explain about data extraction, clean-up and transformation tools.
8. Explain the following:
i) Implementation considerations in building a data warehouse
ii) Database architectures for parallel processing.
UNIT-2 BUSINESS ANALYSIS
PART – A
1. What are production reporting tools? Give examples. (May/June 2013)
Production reporting tools let companies generate regular operational
reports or support high-volume batch jobs, such as calculating and printing
paychecks.
Examples:
Third-generation languages such as COBOL
Specialized fourth-generation languages such as Information
Builders Inc.’s Focus
High-end client/server tools such as MITI’s SQR
2. Define data cube. (May/June 2013)
A data cube consists of a large set of facts or measures and a number of
dimensions. Facts are numerical measures, that is, quantities by which we can
analyze the relationships between dimensions. Dimensions are the entities or
perspectives with respect to which an organization keeps records, and they are
hierarchical in nature.
3. What is a Reporting tool? List out the two different types of reporting tools.
(May/June 2014,Nov/Dec 2012)
Reporting tools are software applications that make data extracted by a
query accessible to the user; that is, they are used to generate various types of
reports.
It can be divided into 2 types:
1. Production reporting tools
2. Desktop reporting tools
4. Define OLAP. (May/June 2014)
OLAP (online analytical processing) is computer processing that enables a
user to easily and selectively extract and view data from different points of
view.
OLAP is becoming an architecture that an increasing number of enterprises
are implementing to support analytical applications.
5. Briefly discuss the schemas for multidimensional databases.
(May/June 2010, Nov/Dec 2014, May/June 2011)
Star schema: The most common modeling paradigm is the star schema, in
which the data warehouse contains (1) a large central table (fact table)
containing the bulk of the data, with no redundancy, and (2) a set of smaller
attendant tables (dimension tables), one for each dimension.
Snowflake schema: The snowflake schema is a variant of the star schema
model, where some dimension tables are normalized, thereby further splitting
the data into additional tables. The resulting schema graph forms a shape
similar to a snowflake.
Fact Constellations: Sophisticated applications may require multiple fact
tables to share dimension tables. This kind of schema can be viewed as a
collection of stars, and hence is called a galaxy schema or a fact constellation.
6. Define the categories of tools in business analysis. (Nov/Dec 2014)
There are 5 categories of tools in business analysis.
i) Reporting tools – used to generate reports.
ii) Managed query tools – used to build SQL queries for accessing the
databases.
iii) Executive information systems – allow developers to build
customized, graphical decision support applications or “briefing
books”.
iv) On-line analytical processing tools – aggregate data along
common business subjects or dimensions and then let users navigate
the hierarchies and dimensions with the click of a mouse button.
v) Data mining tools – use a variety of statistical and artificial intelligence
algorithms to analyze the correlation of variables in the data and
extract interesting patterns and relationships to investigate.
7. Differentiate between MOLAP, ROLAP and HOLAP. (Nov/Dec 2013)
MOLAP: The MOLAP storage mode causes the aggregations of the partition and a
copy of its source data to be stored in a multidimensional structure in Analysis
Services when the partition is processed.
ROLAP: The ROLAP storage mode causes the aggregations of the partition to be
stored in indexed views in the relational database that was specified in the
partition’s data source.
HOLAP: The HOLAP storage mode combines attributes of both MOLAP and
ROLAP. Like MOLAP, HOLAP causes the aggregations of the partition to be stored
in a multidimensional structure in an SQL Server Analysis Services instance.
Brio technology
9. Classify OLAP Tools. (Apr/May 2011)
MOLAP – Multidimensional Online Analytical Processing
ROLAP – Multirelational Online Analytical Processing
MQE – Managed Query Environment
10. How is complex aggregation at multiple granularities achieved using
multi-feature cubes? (May/June 2012)
Multi-feature cubes compute complex queries involving multiple
dependent aggregates at multiple granularities. These cubes are very useful in
practice: many complex data mining queries can be answered by multi-feature
cubes without any significant increase in computational cost, compared with
cube computation for simple queries with standard data cubes.
11. Give examples for managed query tools. (Nov/Dec 2012)
IQ Software’s IQ/Objects
Andyne Computing Ltd.’s GQL
IBM’s Decision Server
Oracle Corp.’s Discoverer/2000
12. What is an apex cuboid? (Apr/May 2011, Nov/Dec 2011)
The apex cuboid, or 0-D cuboid, holds the highest level of summarization.
It is typically denoted by all.
13. What is multidimensional database? (Nov/Dec 2011)
Data warehouses and OLAP tools are based on a multidimensional data
model. This model is used for the design of corporate data warehouses and
department data marts. This model contains a star schema, snowflake schema and
fact constellation schemas. The core of multidimensional model is the data cube.
14. What are the applications of query tools? (Nov/Dec 2014)
The applications of query tools are
Multidimensional analysis
Decision making
In-depth analysis such as data classification
Clustering.
15. Compare OLTP and OLAP. (Apr/May 2008, May/June 2010)
Data Warehouse (OLAP): Involves historical processing of information. OLAP
systems are used by knowledge workers such as executives, managers and analysts.
Operational Database (OLTP): Involves day-to-day processing. OLTP systems are
used by clerks, DBAs, or database professionals.
16. List out OLAP operations in multidimensional data model. (May/June 2009)
Roll-up – performs aggregation on a data cube.
Drill-down – the reverse operation of roll-up.
Slice and dice – the slice operation selects one particular dimension from a
given cube and provides a new sub-cube; dice selects two or more
dimensions from a given cube and provides a new sub-cube.
Pivot (or rotate) – rotates the data axes in view in order to provide an
alternative presentation of the data.
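These operations can be mimicked on a toy cube in a few lines of Python; the dimensions, members, and sales figures below are invented for illustration:

```python
# Toy data cube as (time, item, location) -> sales; values are made up.
cube = {
    ("Q1", "milk",  "Chennai"): 10, ("Q1", "bread", "Chennai"): 5,
    ("Q2", "milk",  "Chennai"): 8,  ("Q2", "milk",  "Delhi"):   7,
}

# Roll-up: aggregate away the item dimension (total sales per time, location).
rollup = {}
for (time, item, loc), sales in cube.items():
    rollup[(time, loc)] = rollup.get((time, loc), 0) + sales
print(rollup)  # {('Q1','Chennai'): 15, ('Q2','Chennai'): 8, ('Q2','Delhi'): 7}

# Slice: fix one dimension (time = "Q1") to obtain a sub-cube.
slice_q1 = {k: v for k, v in cube.items() if k[0] == "Q1"}

# Dice: fix two or more dimensions (time in {"Q1","Q2"} and item = "milk").
dice = {k: v for k, v in cube.items() if k[0] in {"Q1", "Q2"} and k[1] == "milk"}
print(slice_q1, dice)
```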
17. Mention the functions of OLAP servers in the data warehousing architecture.
(Nov/Dec 2010)
The OLAP server performs multidimensional queries of data and stores the
results in its multidimensional storage. It speeds the analysis of fact tables into
cubes, stores the cubes until needed, and then quickly returns the data to clients.
18. What is Impromptu?
Impromptu from Cognos Corporation is positioned as an enterprise solution
for interactive database reporting that delivers 1 to 100+ seat scalability.
19. Mention some supported databases of Impromptu.
ORACLE
Microsoft SQL Server
SYBASE
Omni SQL Gateway
SYBASE Net Gateway
20. What is an enterprise warehouse?
An enterprise warehouse collects all of the information about subjects
spanning the entire organization. It provides corporate-wide data integration,
usually from one or more operational systems or external information
providers, and is cross-functional in scope.
PART – B
1. Explain in detail about the reporting and query tools. (May/June 2014)
2. Describe in detail about COGNOS IMPROMPTU. (May/June 2014)
3. Explain the categorization of OLAP tools with necessary diagrams.(May/June
2014)
4. i) List and explain the OLAP operation in multidimensional data model.
(Nov/Dec 2014)
ii) Differentiate between OLTP and OLAP. (Nov/Dec 2014)
5. i) List and discuss the features of Cognos Impromptu. (Nov/Dec 2012)
ii) List and discuss the basic features provided by reporting and query tools
used for business analysis. (Apr/May 2011)
6. i) What is a Multidimensional data model? Explain star schema with an example.
(May/June 2014)
ii) Write the difference between multidimensional OLAP (MOLAP) and
multirelational OLAP (ROLAP). (May/June 2014, Nov/Dec 2012)
7. Explain the following: (May/June 2012)
i) Different schemas for multidimensional databases.
UNIT-3 DATA MINING
PART – A
1. Define data mining. Give some alternative terms of data mining.
Data mining refers to extracting or “mining” knowledge from large amounts of data.
Data mining is a process of discovering interesting knowledge from large amounts of
data stored in databases, data warehouses, or other information repositories.
Alternative names are
Knowledge mining
Knowledge extraction
Data/pattern analysis
Data Archaeology
Data Dredging
2. What is KDD? What are the steps involved in KDD process?
Knowledge discovery in databases (KDD) is the process of discovering useful
knowledge from a collection of data. This widely used data mining technique is a process
that includes data preparation and selection, data cleansing, incorporating prior knowledge
on data sets and interpreting accurate solutions from the observed results.
The steps involved in KDD process are
Data Cleaning − In this step, the noise and inconsistent data is removed.
Data Integration − In this step, multiple data sources are combined.
Data Selection − In this step, data relevant to the analysis task are retrieved from
the database.
Data Transformation − In this step, data is transformed or consolidated into forms
appropriate for mining by performing summary or aggregation operations.
Data Mining − In this step, intelligent methods are applied in order to extract data
patterns.
Pattern Evaluation − In this step, to identify the truly interesting patterns
representing knowledge based on some interestingness measures.
Knowledge Presentation − In this step, visualization and knowledge representation
techniques are used to present the mined knowledge to the user.
3. What are the various forms of data preprocessing? (Apr/May 2008)
Data cleaning
Data integration
Data transformation
Data reduction
4. State why preprocessing is an important issue for data warehousing and data
mining. (Apr/May 2011)
Real-world data tend to be incomplete, noisy, and inconsistent, so
preprocessing is an important issue for both data warehousing and data mining.
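A small Python sketch of what such preprocessing looks like; the attribute values, the outlier cutoff, and the fill-with-mean rule are all illustrative assumptions:

```python
# Toy preprocessing pass over an "age" attribute with the three classic
# real-world defects: a missing value, a noisy outlier, and inconsistent
# codings of the same fact. Rules and data are illustrative assumptions.
raw_ages = [23, None, 25, 240, 24]          # None = missing, 240 = noise
raw_city = ["Chennai", "chennai ", "CHENNAI", "Delhi", "delhi"]

known = [a for a in raw_ages if a is not None and a <= 120]
mean_age = sum(known) / len(known)

clean_ages = [
    mean_age if a is None          # incomplete: fill with the attribute mean
    else mean_age if a > 120       # noisy: replace an impossible outlier
    else a
    for a in raw_ages
]
clean_city = [c.strip().title() for c in raw_city]  # inconsistent: unify coding
print(clean_ages, clean_city)
```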
PART-B
UNIT-4 ASSOCIATION RULE MINING AND CLASSIFICATION
PART-A
PART-B
1. Write and explain the algorithm for mining frequent item sets with candidate generation.
Give relevant example.
2. Write and explain the algorithm for mining frequent item sets without candidate
generation. Give relevant example.
3. Discuss the approaches for mining multi-level and multi-dimensional association rules
from transactional databases. Give a relevant example.
4. i) Explain the algorithm for constructing a decision tree from training samples. (12)
ii) Explain about Bayes Theorem. (4)
5. i) Apply the Apriori algorithm for discovering frequent item sets of the following
transaction database. Use 0.3 as the minimum support value. (12)
TID Items purchased
101 milk,bread,eggs
102 milk,juice
103 juice,butter
104 milk,bread,eggs
105 coffee,eggs
106 coffee
107 coffee,juice
108 milk,bread,cookies,eggs
109 cookies,butter
110 milk,bread
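For reference, here is one compact Python sketch of Apriori applied to exactly this transaction table with 0.3 minimum support; it is a straightforward illustrative implementation, not the only possible one:

```python
from itertools import combinations

# Transactions from the question above (TIDs 101-110).
transactions = [
    {"milk", "bread", "eggs"}, {"milk", "juice"}, {"juice", "butter"},
    {"milk", "bread", "eggs"}, {"coffee", "eggs"}, {"coffee"},
    {"coffee", "juice"}, {"milk", "bread", "cookies", "eggs"},
    {"cookies", "butter"}, {"milk", "bread"},
]
MIN_SUPPORT = 0.3  # minimum support from the question (3 of 10 transactions)

def apriori(transactions, min_support):
    n = len(transactions)
    # Candidate 1-itemsets: every item that occurs anywhere.
    items = {item for t in transactions for item in t}
    current = [frozenset([i]) for i in items]
    frequent, k = {}, 1
    while current:
        # Count support of each candidate k-itemset.
        counts = {c: sum(1 for t in transactions if c <= t) for c in current}
        level = {c: cnt / n for c, cnt in counts.items() if cnt / n >= min_support}
        frequent.update(level)
        # Join frequent k-itemsets into (k+1)-itemset candidates, pruning any
        # candidate with an infrequent k-subset (the Apriori property).
        keys = list(level)
        k += 1
        candidates = {a | b for a in keys for b in keys if len(a | b) == k}
        current = [c for c in candidates
                   if all(frozenset(s) in level for s in combinations(c, k - 1))]
    return frequent

for itemset, support in sorted(apriori(transactions, MIN_SUPPORT).items(),
                               key=lambda kv: (-len(kv[0]), -kv[1])):
    print(set(itemset), round(support, 2))
# Frequent itemsets found: {milk} 0.5, {bread} 0.4, {eggs} 0.4, {juice} 0.3,
# {coffee} 0.3, {milk,bread} 0.4, {milk,eggs} 0.3, {bread,eggs} 0.3,
# and {milk,bread,eggs} 0.3.
```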
UNIT-5 CLUSTERING AND APPLICATIONS AND TRENDS IN DATA MINING
PART – A
9. What is visual data mining?
Visual data mining can be viewed as an integration of two disciplines: data
visualization and data mining. It combines data and/or knowledge visualization techniques
with a data mining engine containing a large knowledge base. Visual data mining essentially
combines the power of these components, making it a highly attractive and effective tool for
the comprehension of data distributions, patterns, clusters, and outliers in data.
10. What is meant by the frequency of an itemset? (Nov/Dec 2008)
A set of items is referred to as an itemset. An itemset that contains k items is a k-itemset. The
set {computer, antivirus software} is a 2-itemset. The occurrence frequency of an itemset is the
number of transactions that contain the itemset. This is also known, simply, as the frequency, support
count, or count of the itemset.
11. Mention the advantages of hierarchical clustering. (Nov/Dec 2008)
Hierarchical clustering (or hierarchic clustering) outputs a hierarchy, a structure that is more
informative than the unstructured set of clusters returned by flat clustering. Hierarchical clustering
does not require us to prespecify the number of clusters and most hierarchical algorithms that have
been used in IR are deterministic. These advantages of hierarchical clustering come at the cost of
lower efficiency.
12. Define time series analysis. (May/June 2009)
Time series analysis comprises methods for analyzing time series data in order to extract
meaningful statistics and other characteristics of the data. Time series forecasting is the use of a model
to predict future values based on previously observed values. Time series are very frequently plotted
via line charts.
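As one tiny hedged example of such a method (the sales series and window size are invented), a moving-average forecast predicts the next value of a series as the mean of its last k observations:

```python
# Simple moving-average forecast: predict the next point of a series as
# the mean of its last k values. The series and window are made up.
def moving_average_forecast(series, k=3):
    window = series[-k:]
    return sum(window) / len(window)

monthly_sales = [12, 14, 13, 15, 16, 18]
print(moving_average_forecast(monthly_sales, k=3))  # (15+16+18)/3 = 16.33...
```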
13. What is meant by web content mining? (May/June 2009)
Web content mining, also known as text mining, is generally the second step in Web data
mining. Content mining is the scanning and mining of text, pictures and graphs of a Web page to
determine the relevance of the content to the search query. This scanning is completed after the
clustering of web pages through structure mining and provides the results based upon the level of
relevance to the suggested query. With the massive amount of information that is available on the
World Wide Web, content mining provides the results lists to search engines in order of highest
relevance to the keywords in the query.
14. Write down some applications of data mining. (Nov/Dec 2009)
Financial Data Analysis
Retail Industry
Telecommunication Industry
Biological Data Analysis
Scientific Applications
Intrusion Detection
15. List out the methods for information retrieval. (May/June 2010)
They generally either view the retrieval problem as a document selection problem or as a
document ranking problem. In document selection methods, the query is regarded as specifying
constraints for selecting relevant documents. A typical method of this category is the Boolean retrieval
model, in which a document is represented by a set of keywords and a user provides a Boolean
expression of keywords, such as “car and repair shops,” “tea or coffee” .
Document ranking methods use the query to rank all documents in the order of relevance. For
ordinary users and exploratory queries, these methods are more appropriate than document selection
methods.
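The Boolean document-selection model described above can be sketched in a few lines of Python; the documents and queries are made-up illustrations:

```python
# Boolean retrieval in miniature: a document is a set of keywords and a
# query is a Boolean condition over keyword membership. Data is made up.
docs = {
    "d1": {"car", "repair", "shops"},
    "d2": {"tea", "coffee"},
    "d3": {"car", "insurance"},
}

# Query "car AND repair": select documents containing both keywords.
print([d for d, kw in docs.items() if "car" in kw and "repair" in kw])  # ['d1']

# Query "tea OR coffee": select documents containing either keyword.
print([d for d, kw in docs.items() if kw & {"tea", "coffee"}])          # ['d2']
```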
16. What is a categorical variable? (Nov/Dec 2010)
A categorical variable is a generalization of the binary variable in that it can take on more than
two states. For example, map color is a categorical variable that may have, say, five states: red, yellow,
green, pink, and blue. Let the number of states of a categorical variable be M. The states can be
denoted by letters, symbols, or a set of integers, such as 1, 2, …, M. Notice that such integers are used
just for data handling and do not represent any specific ordering.
17. What is the difference between row scalability and column scalability? (Nov/Dec 2010)
Data mining has two kinds of scalability issues: row (or database size) scalability and column
(or dimension) scalability.
A data mining system is considered row scalable if, when the number of rows is enlarged 10
times, it takes no more than 10 times as long to execute the same data mining queries. A data mining system is
considered column scalable if the mining query execution time increases linearly with the number of
columns (or attributes or dimensions). Due to the curse of dimensionality, it is much more challenging
to make a system column scalable than row scalable.
18. What are the major challenges faced in bringing data mining research to market? (Nov/Dec
2010)
The diversity of data, data mining tasks, and data mining approaches poses many challenging
research issues in data mining. The development of efficient and effective data mining methods and
systems, the construction of interactive and integrated data mining environments, the design of data
mining languages, and the application of data mining techniques to solve large application problems
are important tasks for data mining researchers and data mining system and application developers.
19. What is meant by a multimedia database? (Nov/Dec 2011)
A multimedia database system stores and manages a large collection of multimedia data, such
as audio, video, image, graphics, speech, text, document, and hypertext data, which contain text, text
markups, and linkages. Multimedia database systems are increasingly common owing to the popular
use of audio, video equipment, digital cameras, CD-ROMs, and the Internet.
20. Define DBMiner. (Nov/Dec 2011)
DBMiner delivers business intelligence and performance management applications
powered by data mining, with new and insightful business patterns and knowledge revealed
by DBMiner. DBMiner Insight solutions are the world’s first server applications providing
powerful and highly scalable association, sequence and differential mining capabilities for
the Microsoft SQL Server Analysis Services platform, and they also provide market basket,
sequence discovery and profit optimization for Microsoft Accelerator for Business Intelligence.
21. Define: Dendrogram.
A tree structure called a dendrogram is commonly used to represent the process of
hierarchical clustering. It decomposes the data objects into several levels of nested
partitioning (a tree of clusters).
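A minimal sketch of producing a dendrogram with SciPy's agglomerative clustering; the five 2-D points are made up for illustration:

```python
# Hierarchical (agglomerative) clustering and its dendrogram via SciPy.
import numpy as np
from scipy.cluster.hierarchy import linkage, dendrogram
import matplotlib.pyplot as plt

points = np.array([[1.0, 1.0], [1.2, 0.8], [5.0, 5.0], [5.1, 4.9], [9.0, 1.0]])

# linkage() performs agglomerative clustering; 'single' merges the two
# clusters with the smallest minimum pairwise distance at each step.
Z = linkage(points, method="single")

dendrogram(Z)            # draw the nested tree of cluster merges
plt.xlabel("data object index")
plt.ylabel("merge distance")
plt.show()
```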
PART – B