
22621_SUM23_Model Answer

Download as pdf or txt
Download as pdf or txt
You are on page 1of 26

MAHARASHTRA STATE BOARD OF TECHNICAL EDUCATION

(Autonomous)
(ISO/IEC - 27001 - 2013 Certified)
__________________________________________________________________________________________________
SUMMER – 2023 EXAMINATION
Model Answer – Only for the Use of RAC Assessors

Subject Name: Data Warehouse Modeling Subject Code: 22621


Important Instructions to examiners:
1) The answers should be examined by key words and not as word-to-word as given in the model answer scheme.
2) The model answer and the answer written by candidate may vary but the examiner may try to assess the
understanding level of the candidate.
3) Language errors such as grammatical or spelling errors should not be given undue importance (not applicable for the subjects English and Communication Skills).
4) While assessing figures, examiner may give credit for principal components indicated in the figure. The figures
drawn by candidate and model answer may vary. The examiner may give credit for any equivalent figure drawn.
5) Credits may be given step wise for numerical problems. In some cases, the assumed constant values may vary and
there may be some difference in the candidate’s answers and model answer.
6) In case of some questions, credit may be given based on the examiner's judgement of the relevance of the answer and the candidate's understanding.
7) For programming language papers, credit may be given to any other program based on equivalent concept.
8) As per the policy decision of Maharashtra State Government, teaching in English/Marathi and Bilingual (English +
Marathi) medium is introduced at first year of AICTE diploma Programme from academic year 2021-2022. Hence if
the students in first year (first and second semesters) write answers in Marathi or bilingual language (English
+Marathi), the Examiner shall consider the same and assess the answer based on matching of concepts with model
answer.

Q. No. | Sub Q. N. | Answer | Marking Scheme
1 Attempt any FIVE of the following: 10 M
a) State any four Benefits of Data warehouse. 2M
Ans: Four Benefits of data warehouse: (Any four, 1/2 M each)
1. Delivers enhanced business intelligence
2. Saves times
3. Enhances data quality and consistency
4. Generates a high Return on Investment (ROI)
5. Provides competitive advantage
6. Enables organizations to forecast with confidence
7. Improves the decision-making process
b) Define Data cube used in Data warehouse modeling 2M
Ans: Data Cube (Definition: 2 M):
When data warehouse data is grouped or combined into multidimensional matrices, it is called a Data Cube. A data or OLAP cube is a data structure optimized for very quick data analysis.
A data cube is also called an OLAP cube or hypercube.

c) Describe the term HOLAP. 2M


Ans: HOLAP (Description: 2 M):
HOLAP: Hybrid Online Analytical Processing.
Hybrid OLAP is a combination of ROLAP and MOLAP. It offers the higher scalability of ROLAP and the faster computation of MOLAP. HOLAP stores aggregations in MOLAP for fast query performance, and detailed data in ROLAP to optimize cube-processing time.

d) Define the term Data mining. 2M
Ans: Data Mining (Definition: 2 M):
Data mining means searching for knowledge (interesting patterns or useful data) in data.
Data mining is the process of discovering interesting patterns and knowledge from large amounts of data.
Data mining is also known as Knowledge Discovery in Databases (KDD).
e) Describe requirements of cluster analysis. 2M
Ans: Requirements of Cluster Analysis: (Any four, 1/2 M each)
1. Scalability
2. Ability to deal with different kinds of attributes
3. Discovery of clusters with arbitrary shape
4. High dimensionality
5. Ability to deal with noisy data
6. Interpretability
f) Explain role of OLAP queries with example. 2M
Ans: Role of OLAP (Role: 1 M; Example: 1 M):
OLAP is a database technology that has been optimized for querying and reporting, instead of processing transactions.
OLAP queries can be used to identify and compute the specific values from a cube which are required for decision support.
Example: (consider other related examples also)

compute cube sales iceberg as
select month, city, customer_group, count(*)
from salesInfo
cube by month, city, customer_group
having count(*) >= min_sup
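The same idea can be sketched in Python with pandas (an illustrative aid, not part of the model answer); the salesInfo DataFrame, its columns and the min_sup threshold below are assumed names and values:

import pandas as pd

# Hypothetical transaction-level sales data (illustrative values only).
salesInfo = pd.DataFrame({
    "month": ["Jan", "Jan", "Feb", "Feb", "Feb"],
    "city": ["Pune", "Pune", "Mumbai", "Pune", "Pune"],
    "customer_group": ["Retail", "Retail", "Retail", "Corporate", "Retail"],
})
min_sup = 2  # assumed minimum support threshold

# Aggregate by the three dimensions and keep only cells meeting min_sup,
# mirroring the HAVING count(*) >= min_sup clause of the iceberg-cube query.
cell_counts = (salesInfo
               .groupby(["month", "city", "customer_group"])
               .size()
               .reset_index(name="count"))
iceberg_cells = cell_counts[cell_counts["count"] >= min_sup]
print(iceberg_cells)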
g) Explain Rollup OLAP operation. 2M
Ans: Roll-up operation (Explanation: 2 M):
Roll-up is also known as "consolidation" or "aggregation".
The roll-up operation aggregates the data in a data cube.
The roll-up operation can be performed in 2 ways:
a. Reducing dimensions
b. Climbing up a concept hierarchy. A concept hierarchy is a system of grouping things based on their order or level.
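As an illustration (not part of the model answer), a roll-up by climbing a Time concept hierarchy from month to quarter can be sketched in Python with pandas; the data values and the month-to-quarter mapping are assumptions:

import pandas as pd

# Hypothetical fact data at month granularity.
sales = pd.DataFrame({
    "month": ["Jan", "Feb", "Mar", "Apr", "May", "Jun"],
    "city": ["Pune"] * 6,
    "amount": [100, 120, 90, 150, 130, 110],
})

# Concept hierarchy: month -> quarter (assumed mapping).
month_to_quarter = {"Jan": "Q1", "Feb": "Q1", "Mar": "Q1",
                    "Apr": "Q2", "May": "Q2", "Jun": "Q2"}

# Roll-up: climb from month to quarter and aggregate the measure.
sales["quarter"] = sales["month"].map(month_to_quarter)
rolled_up = sales.groupby(["quarter", "city"])["amount"].sum().reset_index()
print(rolled_up)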

2. Attempt any THREE of the following: 12 M


a) Explain Metadata repository. 4M
Ans: Metadata Repository: (2 M)
(Metadata: data about data; repository: big container)
Metadata is the information about the structures that contain the actual data.
It is data about the structures that contain data. Metadata may describe the structure of any data, of any subject, stored in any format.
A metadata repository contains the structures of all data at one place, which provides plenty of data, more than is strictly required, for decision making.
A metadata repository is used for building, maintaining and managing a data warehouse.
Concept example: (1 M) a line in a sales database may contain: 4030 KJ732 299.90
This is meaningless data until we consult the metadata that tells us what it was.
The metadata of this line is:
• Model number: 4030
• Sales Agent ID: KJ732
• Total sales amount of $299.90
Therefore, Metadata are essential ingredients in the transformation of data into knowledge.
Example: Metadata of a Book Store: (1 M)
1. Name of book
2. Summary of book
3. Publication of book
4. Edition of book
5. Author of book
6. Date of publication
7. Availability of book
8. Reviews of book
Above information (metadata) helps to search the book, access the book, etc.
b) Explain following OLAP operation : 4M
(i) Drill down
(ii) Slice
(iii) Dice
(iv) Pivot
Ans: (Each operation: 1 M; consider any other relevant example and values)
i. Drill down:
In drill-down, data is fragmented (divided) into smaller parts. It is the opposite of the roll-up process. It can be done via:
a. Moving down in the concept hierarchy, and
b. Increasing a dimension.
Consider the following diagram:
In this overview section, the drill-down operation is performed by moving down in the concept hierarchy of the Time dimension (Quarter to Months).

ii. Slice:
In this operation, one dimension is selected, and a new sub-cube is created.
In the overview section, slice is performed on the dimension Time (Q1).


iii. Dice:
This operation is similar to a slice.
The difference in dice is that you can select 2 or more dimensions, which results in the creation of a sub-cube.
In the overview section, a sub-cube is selected by selecting Location (Pune or Mumbai) and Time (Q1 or Q2).

iv. Pivot:
In the pivot operation, you rotate the data axes to provide an alternative presentation of the data.
In this overview section, performing a pivot operation on the sub-cube obtained after the slice operation gives a new view of that slice.
Consider the result (slice) of the slice operation.
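The slice, dice and pivot operations can also be illustrated in Python with pandas (a sketch only, not part of the model answer); the cube below, its dimensions and values are hypothetical:

import pandas as pd

# Hypothetical cube data in flat (tabular) form: Time, Location, Item dimensions and a Sales measure.
cube = pd.DataFrame({
    "Time":     ["Q1", "Q1", "Q2", "Q2", "Q1", "Q2"],
    "Location": ["Pune", "Mumbai", "Pune", "Mumbai", "Nagpur", "Nagpur"],
    "Item":     ["Pen", "Pen", "Book", "Book", "Pen", "Book"],
    "Sales":    [100, 80, 120, 90, 60, 70],
})

# Slice: fix one dimension (Time = Q1) to obtain a sub-cube.
slice_q1 = cube[cube["Time"] == "Q1"]

# Dice: restrict two or more dimensions (Location in {Pune, Mumbai} and Time in {Q1, Q2}).
dice = cube[cube["Location"].isin(["Pune", "Mumbai"]) & cube["Time"].isin(["Q1", "Q2"])]

# Pivot: rotate the axes of the sliced data for an alternative presentation.
pivoted = slice_q1.pivot_table(index="Location", columns="Item", values="Sales", aggfunc="sum")
print(pivoted)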


c) Describe business framework for Data warehouse design. 4M


Ans: Business framework for DW design: (1 M)
The business analyst gets information from the data warehouse to measure performance and make critical adjustments in order to win over other business holders in the market.
Having a data warehouse offers the following advantages:
i. Since a data warehouse can gather information quickly and efficiently, it can enhance
business productivity.
ii. A data warehouse provides us a consistent view of customers and items; hence, it helps
us manage customer relationship.
iii. A data warehouse also helps in bringing down the costs by tracking trends, patterns
over a long period in a consistent and reliable manner.

To design an effective and efficient data warehouse, we need to understand and analyze the business needs and construct a business analysis framework. (3 M) Each person has different views regarding the design of a data warehouse. These views are as follows:
a. The top-down view: This view allows the selection of relevant information needed for
a data warehouse.
b. The data source view: This view presents the information being captured, stored, and
managed by the operational system.
c. The data warehouse view: This view includes the fact tables and dimension tables. It
represents the information stored inside the data warehouse.
d. The business query view: It is the view of the data from the viewpoint of the end user.

d) Explain major issues of Data mining. 4M
Ans: Major issues in Data mining: (Explain any two major issues: 2 M each)
A. Mining methodology and user interaction issues:
i. Mining different kinds of knowledge in databases:
Different user - different knowledge - different way.
That means different clients want different kinds of information, so it becomes difficult to cover the vast range of data that can meet the client requirements.
ii. Incorporation of background knowledge:
Background knowledge is used to guide discovery process and to express the discovered
patterns. So, in mining process to know the background of data is must for easy process.
iii. Query languages and ad hoc mining:
Relational query languages allow users to use ad-hoc queries for data retrieval.
The language of data mining query language and the query language of data warehouse
should be matched.
iv. Handling noisy or incomplete data:
In a large database, many of the attribute values will be incorrect.
This may be due to human error or instrument failure.
B. Performance issues:
i. Efficiency and scalability of data mining algorithms:
To effectively extract information from a huge amount of data in databases, data mining
algorithms must be efficient and scalable.
ii. Parallel, distributed, and incremental mining algorithms:
There are huge size of databases, the wide distribution of data, and complexity of some
data mining methods.
These factors should be considered during development of parallel and distributed data
mining algorithms.
C. Issues relating to the diversity of database types:
i. Handling of relational and complex types of data:
There are many kinds of data stored in databases and data warehouses.
It is not possible for one system to mine all these kinds of data. So, different data mining systems should be constructed for different kinds of data.
ii. Mining information from heterogeneous databases and global information systems:
Since data is fetched from different data sources on a Local Area Network (LAN) and a Wide Area Network (WAN), the discovery of knowledge from these different sources is a great challenge to data mining.

3. Attempt any THREE of the following: 12 M


a) Explain any two Data warehouse models. 4M
Ans: Data Warehouse models: (2 M for each model with explanation; any two models)
There are three models for Data warehouse:
1. Enterprise Data Warehouse
2. Data Marts
3. Virtual Warehouse
1. Enterprise Data Warehouse (EDW):
Enterprise Data Warehouse is a centralized warehouse, which aggregates the information
or data automatically.
It offers a unified approach for organizing and representing data.
It also provides the ability to classify data according to the subject and give access
accordingly to users.
It provides decision support service across the enterprise.
Example: All Polytechnic data available at MSBTE

2. Data Marts:
A data mart is a subset of the data warehouse.
It is specially designed for a particular line of business, such as sales or finance. In an independent data mart, data can be collected directly from sources.
Due to the large amount of data, a single warehouse can become overburdened.
So, to prevent the warehouse from becoming impossible to navigate, subdivisions are created, called Data Marts.
These data marts divide the information saved in the warehouse into categories or specific groups of users.
In simple words, a data mart is a subsidiary of a data warehouse.
Example: Five regions of MSBTE: One region may be referred as Data Mart

3. Virtual Warehouse:
The view over an operational data warehouse is known as a virtual warehouse.
A virtual warehouse is essentially a separate business database, which contains only the data required from the operational system.
The data found in a virtual warehouse is usually copied from multiple sources throughout an operational system.
A virtual warehouse is used to search data quickly, without accessing the entire system.
It speeds up the overall access process.
Example: It may contain data of only one or two Polytechnics.
b) Describe data objects & attribute type. 4M
Ans: Data Objects and Attribute Types: (2 M for data objects, 2 M for attribute types)
Data Objects:
• Data sets are made up of data objects.
• A data object represents an entity.
• Example :
- In a sales database, the objects may be customers, store items, and sales.
- In a medical database, the objects may be patients.
- In a university database, the objects may be students, professors, and courses.
• Data objects are typically described by attributes.
• Data objects can also be referred to as samples, examples, instances, data points, or
objects.
• If the data objects are stored in a database, they are data tuples. That is, the rows of a
database correspond to the data objects, and the columns correspond to the attributes.

Attribute :
• An attribute is a data field, representing a characteristic or feature of a data object.
• The nouns attribute, dimension, feature, and variable are often used interchangeably
in the literature.
• The term dimension is commonly used in data warehousing.
• Machine learning literature tends to use the term feature, while statisticians prefer the
term variable.
• Data mining and database professionals commonly use the term attribute.

• Example : Attributes describing a customer object can include, customer ID, name,
and address

Types of attributes:
1. Qualitative Attributes
2. Quantitative Attributes

1. Qualitative Attributes:
a. Nominal Attributes (N):
These attributes are related to names.
The values of a Nominal attribute are name of things, some kind of symbols.
The values of a nominal attribute represent some category or state; that is why nominal attributes are also referred to as categorical attributes, and there is no order (rank, position) among the values of a nominal attribute.
Example:
Attribute | Values
Colors | Black, Red, Green
Categorical Data | Lecturer, Professor

b. Binary Attributes (B):


Binary data has only 2 values/states.
Example: yes or no, affected or unaffected, true or false.
i. Symmetric: Both values are equally important (e.g., Gender).
ii. Asymmetric: Both values are not equally important (e.g., Result).

Attribute | Values
Gender | Male, Female
Result | Pass, Fail

c. Ordinal Attributes (O):


Ordinal attributes contain values that have a meaningful sequence or ranking (order) between them.

Attribute | Values
Grade | A, B, C, D, E
Income | low, medium, high
Age | teenage, young, old

2. Quantitative Attributes:
a. Numeric:
A numeric attribute is quantitative because it is a measurable quantity, represented in integer or real values.

Attribute | Values
Salary | 2000, 3000
Units sold | 10, 20
Age | 5, 10, 20, ...

b. Discrete:
Discrete data have finite values; they can be numerical and can also be in categorical form.
These attributes have a finite or countably infinite set of values.
Example:

Attribute | Values
Profession | Teacher, Businessman, Peon
Zip Code | 413736, 413713

c. Continuous:
Continuous data have an infinite number of states. Continuous data is of float type. There can be many values between 2 and 3.

Example:

Attribute | Values
Height | 2.3, 3, 6.3, ...
Weight | 40, 45.33, ...
c) Explain steps for generating association rule from frequent item sets. 4M
Ans: (4 M for steps and explanation)
Association rule generation is a two-step process.
The first step is to generate an itemset like {Bread, Egg, Milk}, and the second is to generate a rule from each itemset, like {Bread → Egg, Milk}, {Bread, Egg → Milk}, etc. Both steps are discussed below.
1. Generating itemsets from a list of items
First step in generation of association rules is to get all the frequent itemsets on which
binary partitions can be performed to get the antecedent and the consequent. For example, if
there are 6 items {Bread, Butter, Egg, Milk, Notebook, Toothbrush} on all the transactions
combined, itemsets will look like {Bread}, {Butter}, {Bread, Notebook}, {Milk,
Toothbrush}, {Milk, Egg, Vegetables} etc. Size of an itemset can vary from one to the total
number of items that we have. Now, we seek only frequent itemsets from this and not all so
as to put a check on the number of total itemsets generated.

2. Generating all possible rules from the frequent itemsets


Once the frequent itemsets are generated, identifying rules out of them is comparatively less
taxing. Rules are formed by binary partition of each itemset. If {Bread,Egg,Milk,Butter} is
the frequent itemset, candidate rules will look like:
(Egg, Milk, Butter → Bread), (Bread, Milk, Butter → Egg), (Bread, Egg → Milk, Butter),
(Egg, Milk → Bread, Butter), (Butter→ Bread, Egg, Milk)
From a list of all possible candidate rules, we aim to identify rules that fall above a
minimum confidence level (minconf). Just like the anti-monotone property of
support, confidence of rules generated from the same itemset also follows an anti-
monotone property. It is anti-monotone with respect to the number of elements in
consequent.
This means that confidence of (A,B,C→ D) ≥ (B,C → A,D) ≥ (C → A,B,D). To remind,
confidence for {X → Y} = support of {X,Y}/support of {X}
The support of all the rules generated from the same itemset remains the same; the difference occurs only in the denominator of the confidence calculation. As the number of items in X decreases, support{X} increases (as follows from the anti-monotone property of support) and hence the confidence value decreases.
An intuitive explanation for the above will be as follows. Consider F1 and F2:
F1 = fraction of transactions having (butter) also having (egg, milk, bread)
There will be many transactions having butter and all three of egg, milk and bread will be
able to find place only in a small number of those.
F2 = fraction of transactions having (milk, butter, bread) also having (egg)
There will only be a handful of transactions having all three of milk, butter and bread (as
compared to having just butter) and there will be high chances of having egg on those.
So it will be observed that F1 < F2. Using this property of confidence, pruning is done in a
similar way as was done while looking for frequent itemsets. It is illustrated in the figure
below.

We start with a frequent itemset {a,b,c,d} and start forming rules with just one consequent.
Remove the rules failing to satisfy the minconf condition. Now, start forming rules using a
combination of consequents from the remaining ones. Keep repeating until only one item is
left on antecedent. This process has to be done for all frequent itemsets.
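A minimal sketch of this rule-generation step in Python (not part of the model answer); the itemset supports, the minconf value and the function name are illustrative assumptions:

from itertools import combinations

# Hypothetical supports of frequent itemsets (fractions of transactions).
support = {
    frozenset(["Bread"]): 0.60,
    frozenset(["Egg"]): 0.50,
    frozenset(["Milk"]): 0.55,
    frozenset(["Bread", "Egg"]): 0.40,
    frozenset(["Bread", "Milk"]): 0.45,
    frozenset(["Egg", "Milk"]): 0.35,
    frozenset(["Bread", "Egg", "Milk"]): 0.30,
}
minconf = 0.7  # assumed minimum confidence level

def rules_from_itemset(itemset, support, minconf):
    """Form every binary partition of the itemset and keep rules meeting minconf."""
    itemset = frozenset(itemset)
    rules = []
    for r in range(1, len(itemset)):
        for antecedent in combinations(itemset, r):
            antecedent = frozenset(antecedent)
            consequent = itemset - antecedent
            # confidence(X -> Y) = support(X U Y) / support(X)
            conf = support[itemset] / support[antecedent]
            if conf >= minconf:
                rules.append((set(antecedent), set(consequent), conf))
    return rules

for a, c, conf in rules_from_itemset({"Bread", "Egg", "Milk"}, support, minconf):
    print(a, "->", c, round(conf, 2))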
d) Differentiate between ROLAP & MOLAP. 4M
Ans: ROLAP & MOLAP Difference: (1 M for each correct point; any 4 points)

Sr. No. | Parameters | ROLAP | MOLAP
1 | Acronym | Relational Online Analytical Processing | Multidimensional Online Analytical Processing
2 | Information retrieval | Slow | Fast
3 | Storage method | Relational tables | Sparse arrays to store data-sets
4 | Easy to use | Yes | No
5 | When to use | When the data warehouse contains relational data | When the data warehouse contains relational as well as non-relational data
6 | Implementation | Easy | Complex
7 | Response time required | More | Less
8 | Storage space | Less | More

4. Attempt any THREE of the following: 12 M


a) Explain ETL process. (Extraction Transformation & Load). 4M
Ans: ETL Process in Data Warehouse: (1 M for diagram, 3 M for explanation)
• ETL stands for Extract, Transform and Load.
• It is a process in which an ETL tool extracts the data from various data source systems, transforms it in the staging area, and then finally loads it into the Data Warehouse system.

1. Extraction:
The first step of the ETL process is extraction.
In this step, data from various source systems, which can be in various formats like relational databases, NoSQL, XML and flat files, is extracted into the staging area.
Raw source data cannot be loaded directly into the data warehouse; therefore, this is one of the most important steps of the ETL process.

2. Transformation:
The second step of the ETL process is transformation.
In this step, a set of rules or functions are applied on the extracted data to convert it into a
single standard format.
It may involve following processes/tasks:
• Filtering – loading only certain attributes into the data warehouse.
• Cleaning – filling up the NULL values and missing values.
• Joining – joining multiple attributes into one.
• Splitting – splitting a single attribute into multiple attributes.
• Sorting – sorting tuples on the basis of some attribute (generally key-attribute).
3. Loading:
The third and final step of the ETL process is loading.
In this step, the transformed data is finally loaded into the data warehouse.
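A minimal ETL sketch in Python (not part of the model answer), assuming the source is a CSV file named sales_source.csv and the warehouse is a local SQLite database warehouse.db; the file names, column names and cleaning rules are illustrative only:

import sqlite3
import pandas as pd

# Extract: read raw data from a hypothetical flat-file source.
raw = pd.read_csv("sales_source.csv")  # columns assumed: cust_id, amount, city

# Transform: clean and standardize the extracted data into a single format.
transformed = (raw
               .dropna(subset=["cust_id"])                                # cleaning: drop rows missing the key
               .assign(amount=lambda d: d["amount"].fillna(0),            # cleaning: fill missing measures
                       city=lambda d: d["city"].str.strip().str.title())  # standardize text values
               .sort_values("cust_id"))                                   # sorting on a key attribute

# Load: append the transformed data into a warehouse table.
with sqlite3.connect("warehouse.db") as conn:
    transformed.to_sql("sales_fact", conn, if_exists="append", index=False)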

b) Explain need of OLAP. 4M
Ans: Need of OLAP: (1/2 M for each point; any 8 points)
OLAP is needed for business data analysis and decision making. OLAP can be used/needed:
1. To support multidimensional view of data.
2. To provide fast and steady access to various views of information.
3. To process complex queries.
4. To analyze the information.
5. To pre-calculate and pre-aggregate the data.
6. For all type of business includes planning, budgeting, reporting, and analysis.
7. To quickly create and analyze "What if" scenarios
8. To easily search OLAP database for broad or specific terms.
9. To provide the building blocks for business modelling tools, Data mining tools,
performance reporting tools.
10. To do slice and dice cube data all by various dimensions, measures, and filters.
11. To analyze time series data.
12. To find some clusters and outliers.
13. To visualize online analytical process system which provides faster response times.
c) Explain OLAP data indexing with its type. 4M
Ans: OLAP Indexing: (2 M for each type with example: bitmap index, bitmap join index)
OLAP uses two indices: Bitmap Index and Bitmap Join Index.
1. Bitmap Index:
The bitmap index is an alternative representation of the record ID (RID) list.
Each distinct attribute value is represented by its own bit.
If the attribute's domain consists of n values, then n bits are needed for each entry in the bitmap index.
If the attribute value is present in the row then it is represented by 1 in the corresponding
row of the bitmap index and rest are 0 (zero).

Example: Bitmap Index


Company's customers table.
SELECT cust_id, cust_gender, cust_income FROM customers;
Table 1: Customer

Cust_id | Cust_gender | Cust_income
101 | M | 10000
102 | F | 20000
103 | M | 15000
104 | F | 21000
105 | F | 11000

Because cust_gender and cust_income are both low-cardinality columns (there are two possible values for gender and five for income), bitmap indexes are ideal for these columns.
Do not create a bitmap index on cust_id because this is a unique column.
The following table illustrates the bitmap index for the cust_gender column in this
example. It consists of two separate bitmaps, each one for gender.

Sample Bitmap Index on Gender

RID | Gender F | Gender M
1 | 0 | 1
2 | 1 | 0
3 | 0 | 1
4 | 1 | 0
5 | 1 | 0

2. Bitmap Join Index:


In addition to a bitmap index on a single table, we can create a bitmap join index, which is
a bitmap index for the join of two or more tables.
In a bitmap join index, the bitmap for the table to be indexed is built for values coming
from the joined tables.

Example: Bitmap Join Index:


Consider Table 1: Customer above and Table 2: Sales below

Table 2: Sales
Time_id | Cust_id | Amount_sold
Jan | 101 | 2000
Feb | 103 | 3000
Mar | 106 | 5000
Apr | 104 | 6000
May | 107 | 7000

Joining the two tables (Customer and Sales) gives the join result that is used to create the bitmaps that are stored in the bitmap join index:

Query to join two tables:


SELECT sales.time_id, customers.cust_gender, sales.amount_sold
FROM sales, customers WHERE sales.cust_id = customers.cust_id;

Base Table for Calculating Bitmap Join Index on Gender

Time_id | Cust_gender | Amount_sold
Jan | M | 2000
Feb | M | 3000
Apr | F | 6000

Bitmap Join Index Output

RID | Cust_gender M | Cust_gender F
1 | 1 | 0
2 | 1 | 0
3 | 0 | 1
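As an illustration (not part of the model answer), the gender bitmap of Table 1 can be built in Python; representing each bit vector as a plain list is an assumption made only for clarity:

# Customer rows from Table 1: (cust_id, cust_gender, cust_income).
customers = [
    (101, "M", 10000),
    (102, "F", 20000),
    (103, "M", 15000),
    (104, "F", 21000),
    (105, "F", 11000),
]

# One bit vector per distinct gender value: bit i is 1 when row i carries that value.
bitmap_index = {}
for value in {gender for _, gender, _ in customers}:
    bitmap_index[value] = [1 if gender == value else 0 for _, gender, _ in customers]

print(bitmap_index)            # e.g. {'M': [1, 0, 1, 0, 0], 'F': [0, 1, 0, 1, 0]}
print(sum(bitmap_index["F"]))  # counting female customers is a simple bit count: 3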
d) Describe Data cleaning process. 4M
Ans: Data Cleaning: (2 M for explanation, 2 M for methods)
• Data cleaning is a crucial process in Data Mining. It plays an important part in the building of a model.
• Data Cleaning can be regarded as a necessary process, but it is often neglected.
• Data quality is the main issue in quality information management. Data quality problems occur anywhere in information systems. These problems are solved by data cleaning.
• Data cleaning is fixing or removing incorrect, corrupted, incorrectly formatted, duplicate, or incomplete data within a dataset. If data is incorrect, outcomes and algorithms are unreliable, even though they may look correct. When combining multiple data sources, there are many opportunities for data to be duplicated or mislabeled.
Methods of Data Cleaning
There are many data cleaning methods through which the data should be run. The methods are described below:
• Ignore the tuples: This method is not very feasible, as it only comes into use when a tuple has several attributes with missing values.
• Fill the missing value: This approach is also not very effective or feasible. Moreover, it can be a time-consuming method. In this approach, one has to fill in the missing value. This is usually done manually, but it can also be done by attribute mean or by using the most probable value.
• Binning method: This approach is very simple to understand. The smoothing
of sorted data is done using the values around it. The data is then divided into
several segments of equal size. After that, the different methods are executed to
complete the task.
• Regression: The data is made smooth with the help of using the regression
function. The regression can be linear or multiple. Linear regression has only
one independent variable, and multiple regressions have more than one
independent variable.
• Clustering: This method mainly operates on the group. Clustering groups the
data in a cluster. Then, the outliers are detected with the help of clustering.
Next, the similar values are then arranged into a "group" or a "cluster".
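A brief sketch in Python (not part of the model answer) of two of the methods above, filling a missing value with the attribute mean and smoothing by bin means; the column name and data values are hypothetical:

import pandas as pd

# Hypothetical noisy attribute with one missing value.
df = pd.DataFrame({"price": [4, 8, None, 15, 21, 21, 24, 25, 28, 34]})

# Fill the missing value with the attribute mean.
df["price"] = df["price"].fillna(df["price"].mean())

# Binning method: sort the data, split it into equal-frequency bins, and smooth by bin means.
df = df.sort_values("price").reset_index(drop=True)
df["bin"] = pd.qcut(df["price"], q=3, labels=False, duplicates="drop")
df["smoothed"] = df.groupby("bin")["price"].transform("mean")
print(df)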
e) Explain Market Basket Analysis with example. 4M
Ans: Market Basket Analysis: (2 M for explanation, 2 M for example)
Market Basket Analysis is a modelling technique based upon the theory that if you buy a certain group of items, you are more (or less) likely to buy another group of items.
Ex: (Computer → Antivirus)
Market Basket Analysis is one of the key techniques used by large retailers to uncover
associations between items.
It works by looking for combinations of items that occur together frequently in
transactions. i.e it allows retailers to identify relationships between the items that people
buy.
Market basket analysis can be used in deciding the location and promotion of goods inside
a store.
Market Basket Analysis creates If-Then scenario rules, for example, if item A is purchased
then item B is likely to be purchased.


Example: How is it used?


As a first step, market basket analysis can be used in deciding the location and promotion
of goods inside a store.
If, it has been observed, purchasers of Barbie dolls have been more likely to buy candy,
then high-margin candy can be placed near to the Barbie doll display.
Customers who would have bought candy with their Barbie dolls had they thought of
it will now be suitably tempted.
But this is only the first level of analysis. Differential market basket analysis can find
interesting results and can also eliminate the problem of a potentially high volume of
trivial results.
In differential analysis, compare results between different stores, between customers in
different demographic groups, between different days of the week, different seasons of the
year, etc.
If we observe that a rule holds in one store, but not in any other (or does not hold in one
store, but holds in all others), then we know that there is something interesting about that
store.
Investigating such differences may yield useful insights which will improve company
sales.
Market Basket Analysis used for:
1. Analysis of credit card purchases.
2. Analysis of telephone calling patterns.
3. Identification of fraudulent medical insurance claims.
4. Analysis of telecom service purchases.

5. Attempt any TWO of the following: 12 M


a) Explain Data warehouse design process. 6M
Ans: (3 M for each approach)
A data warehouse is a single data repository where records from multiple data sources are integrated for online analytical processing (OLAP). This implies a data warehouse needs to meet the requirements of all the business stages within the entire organization. Thus, data warehouse design is a hugely complex, lengthy, and hence error-prone process.
Data warehouse design takes a method different from view materialization in the
industries. It sees data warehouses as database systems with particular needs such as
answering management related queries.
There are two approaches:
1. The "top-down" approach
2. The "bottom-up" approach

1. Top-Down Approach:
The top-down approach is a data-driven approach: the information is gathered and integrated first, and then the business requirements by subjects for building data marts are formulated.

Fig: DW Design: Top-Down Approach

i. External Sources:
External source is a source from where data is collected irrespective of the type of
data. Data can be structured, semi structured and unstructured as well.
ii. Stage Area:
Since the data, extracted from the external sources does not follow a particular
format, so there is a need to validate this data to load into Data warehouse.
For this purpose, it is recommended to use ETL tool.
• E (Extracted): Data is extracted from External data source.
• T (Transform): Data is transformed into the standard format.
• L (Load): Data is loaded into Data warehouse after transforming it into the
standard format.

iii. Data-warehouse:
After cleansing of data, it is stored in the Data warehouse as central repository.
It actually stores the meta data and the actual data gets stored in the data marts.

iv. Data Marts:


Data mart is also a part of storage component (subset of Data Warehouse).
It stores the information of a particular function of an organization which is handled
by single authority. There can be as many numbers of data marts in an organization

depending upon the functions.

v. Data Mining:
It is used to find the hidden patterns that are present in the database or in Data
warehouse with the help of algorithm of data mining.

2. Bottom-Up Design Approach.


In the "Bottom-Up" approach, a data warehouse is described as "a copy of transaction data
specifically architecture for query and analysis," term the star schema. In this approach, a
data mart is created first to necessary reporting and analytical capabilities for
particular business processes (or subjects). Bottom-up approach is opposite of Top-
Down approach.
The advantage of the "bottom-up" design approach is that it has quick ROI, as developing
a data mart, a data warehouse for a single subject, takes far less time and effort than
developing an enterprise-wide data warehouse.

Fig: DW Design: Bottom-Up Approach


b) List & explain schema used in Data warehouse modeling. 6M
Ans: Schemas used in Data warehouse modeling: (2 M for each schema with diagram)
1. Star Schema
2. Snowflake Schema
3. Fact Constellation or Galaxy Schema
(consider any other relevant diagram for all schemas)
1. Star Schema:
A star schema is the primary form of a dimensional model, in which data are organized
into facts and dimensions.
A fact is an event that is counted or measured, such as a sale.
A dimension includes all information about the fact, such as date, item, or customer.
The star schema is the simplest data warehouse schema.
It is known as a star schema because the entity-relationship diagram of this schema resembles a star, with points diverging from a central table.
The centre of the schema consists of a large fact table, and the points of the star are the
dimension tables.
Fact Table: (applicable for all schemas)
This table contains the primary keys of multiple dimension tables as foreign keys.
It contains facts or measures like quantity sold, amount sold, etc.
Dimension Table: (applicable for all schema)
This table provides descriptive information for all measures recorded in fact table, like
product, item, location, time, etc.

Fig: Star Schema
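As an illustration (not part of the model answer), a typical star-schema query, joining the central fact table to its dimension tables and aggregating a measure, can be sketched in Python with pandas; all table and column names here are hypothetical:

import pandas as pd

# Hypothetical dimension tables.
dim_time = pd.DataFrame({"time_key": [1, 2], "quarter": ["Q1", "Q2"]})
dim_location = pd.DataFrame({"loc_key": [10, 20], "city": ["Pune", "Mumbai"]})

# Hypothetical fact table: foreign keys to the dimensions plus a measure.
fact_sales = pd.DataFrame({
    "time_key": [1, 1, 2, 2],
    "loc_key":  [10, 20, 10, 20],
    "amount_sold": [200, 150, 300, 250],
})

# Star join: the fact table joined to each dimension on its key, then aggregated.
result = (fact_sales
          .merge(dim_time, on="time_key")
          .merge(dim_location, on="loc_key")
          .groupby(["quarter", "city"])["amount_sold"].sum()
          .reset_index())
print(result)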

2. Snowflake Schema:
A snowflake schema is refinement of the star schema.
"A schema is known as a snowflake where one or more-dimension tables do not connect
directly to the fact table, but must join through other dimension tables."
The snowflake schema is an expansion of the star schema where each point (dimension
table) of the star explodes into more points (more dimension tables).
Snowflaking is a method of normalizing the dimension tables in a STAR schema.
Snowflaking is used to develop the performance of specific queries.
The snowflake schema consists of one fact table which is linked to many dimension tables,
which can be linked to other dimension tables through a many-to-one relationship.
Tables in a snowflake schema are generally normalized to the third normal form.

Fig: Snowflake Schema

3. Fact Constellation Schema:
A Fact constellation means two or more fact tables sharing one or more dimensions. It is
also called Galaxy schema.
A Fact Constellation Schema is a sophisticated (advanced but difficult to understand) database design in which it is difficult to summarize information. A Fact Constellation Schema can be implemented between aggregate fact tables.

Fig: Fact Constellation/Galaxy Schema

c) Explain Apriori Algorithm with examples. 6M


Ans: Apriori Algorithm: (2 M for explanation of the Apriori Algorithm; 4 M for example with proper steps to find frequent itemsets)
The Apriori algorithm is used to find frequent itemsets, because it uses prior knowledge of frequent-itemset properties. We apply an iterative approach or level-wise search where k-frequent itemsets are used to find (k+1)-itemsets.
This algorithm uses two steps, "join" and "prune" (prune means delete), to reduce the search space.
It is an iterative approach to discover the most frequent itemsets.

Apriori says:
Item x is not frequent if:
• P(x) is less than the minimum support threshold.

The steps followed in the Apriori Algorithm of data mining are:


1. Join Step: This step generates (K+1) itemset from K-itemsets by joining each item
with itself.
2. Prune Step: This step scans the count of each item in the database. If the candidate
item does not meet minimum support, then it is denoted as infrequent and thus it is
removed. This step is performed to reduce the size of the candidate itemsets.

Example Apriori Method: (consider any other relevant example)
Consider the given database D and minimum support 50%. Apply the Apriori algorithm
and find frequent itemsets with confidence greater than 70%

TID | Items
1 | 1, 3, 4
2 | 2, 3, 5
3 | 1, 2, 3, 5
4 | 2, 5

Solution:
Calculate min_supp=0.5*4=2 (support count is 2)
(0.5: given minimum support in problem, 4: total transactions in database D)
Step 1: Generate candidate list C1 from D
C1=
Itemsets
1
2
3
4
5
Step 2: Scan D for count of each candidate and find the support.
C1=
Itemsets | Support count
1 | 2
2 | 3
3 | 3
4 | 1
5 | 3

Step 3: Compare candidate support count with min_supp (i.e. 2)


(prune or remove the itemset which have support count less than min_supp i.e. 2)
L1=
Itemsets | Support count
1 | 2
2 | 3
3 | 3
5 | 3

Step 4: Generate candidate list C2 from L1 (k-itemsets converted to k+1 itemsets)


C2=
Itemsets (k+1)
1,2
1,3
1,5
2,3
2,5
3,5
Step 5: Scan D for count of each candidate and find the support.
C2=
Itemsets | Support count
1,2 | 1
1,3 | 2
1,5 | 1
2,3 | 2
2,5 | 3
3,5 | 2

Step 6: Compare candidate support count with min_supp (i.e. 2)


(prune or remove the itemset which have support count less than min_supp i.e. 2)
L2=
Itemsets | Support count
1,3 | 2
2,3 | 2
2,5 | 3
3,5 | 2

Step 7: Generate candidate list C3 from L2


(k-itemsets converted to k+1 itemsets)
C3=
Itemsets (k+1)
1,2,3
1,2,5
1,3,5
2,3,5

Step 8: Scan D for count of each candidate and find the support.
C3=
Itemsets | Support count
1,2,3 | 1
1,2,5 | 1
1,3,5 | 1
2,3,5 | 2

Step 9: Compare candidate support count with min_supp (i.e. 2)


(prune or remove the itemset which have support count less than min_supp i.e. 2)
L3=
Itemsets | Support count
2,3,5 | 2

Here, {2,3,5} is the frequent itemset found by using Apriori method.
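A compact sketch of the Apriori level-wise search in Python (not part of the model answer); it repeats the join and prune steps on the four transactions of database D with a minimum support count of 2, but the function and variable names are illustrative only:

from itertools import combinations

# Transactions from database D above.
transactions = [{1, 3, 4}, {2, 3, 5}, {1, 2, 3, 5}, {2, 5}]
min_sup = 2  # minimum support count (50% of 4 transactions)

def support_count(itemset, transactions):
    """Number of transactions containing every item of the itemset."""
    return sum(1 for t in transactions if itemset <= t)

# L1: frequent 1-itemsets (prune step applied to the candidate singletons).
items = sorted({i for t in transactions for i in t})
frequent = [frozenset([i]) for i in items
            if support_count(frozenset([i]), transactions) >= min_sup]

all_frequent = list(frequent)
k = 2
while frequent:
    # Join step: candidate k-itemsets formed as unions of frequent (k-1)-itemsets.
    candidates = {a | b for a, b in combinations(frequent, 2) if len(a | b) == k}
    # Prune step: keep only candidates that meet the minimum support count.
    frequent = [c for c in candidates if support_count(c, transactions) >= min_sup]
    all_frequent.extend(frequent)
    k += 1

for itemset in all_frequent:
    print(sorted(itemset), support_count(itemset, transactions))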



6. Attempt any TWO of the following: 12 M


a) Describe steps in the process of KDD. 6M
Ans: Steps in KDD: (2 M for KDD diagram, 4 M for steps in KDD)
Diagram: (KDD process diagram)

Steps in KDD Process:

1. Data cleaning:
In data cleaning method, it removes the noise and inconsistent data. It can handle the
missing values, use binning method to remove noise and also uses regression and cluster
analysis for cleaning the data.

2. Data integration:
The data in data warehouse may be collected from different sources. This different data
can be integrated at a single location using loose and tight coupling and then send to data
selection step.

3. Data selection:
In this step, the data relevant to the analysis task are retrieved or selected from the
database.

4. Data transformation:
The data are transformed and consolidated into forms appropriate for mining by
performing summary or aggregation operations. i.e. the data from different data sources
which is of varied types can be converted into a single standard format.

5. Data mining:
Data mining is the process in which intelligent methods or algorithms are applied on data
to extract useful data patterns for decision support system.

6. Pattern evaluation:
This process identifies the truly interesting patterns representing actual knowledge based
on user requirements for analysis.

7. Knowledge presentation:
In this process, visualization and knowledge representation techniques are used to
represent mined knowledge to the end users for analysis and decision making.
b) Explain cluster analysis with its application. 6M
Ans: Cluster Analysis: (2 M for explanation, 2 M for requirements, 2 M for applications)
Clustering is a data mining technique used to place data elements into related groups without advance knowledge.
Clustering is the process of grouping a set of data objects into multiple groups or clusters so that objects within a cluster have high similarity.
Dissimilarities and similarities are assessed based on the attribute values describing the objects and often involve distance measures.
Cluster analysis or simply clustering is the process of partitioning a set of data objects (or
observations) into subsets.
Each subset is a cluster, such that objects in a cluster are similar to one another, yet
dissimilar to objects in other clusters. The set of clusters resulting from a cluster analysis
can be referred to as a clustering.

Requirements of Cluster Analysis:


i. Scalability: Need highly scalable clustering algorithms to deal with large databases.
ii. Ability to deal with different kinds of attributes: Algorithms should be capable to be
applied on any kind of data such as interval-based (numerical) data, categorical, and binary
data.
iii. Discovery of clusters with attribute shape: The clustering algorithm should be
capable of detecting clusters of arbitrary shape. They should not be bounded to only
distance measures that tend to find spherical cluster of small sizes.
iv. High dimensionality: the clustering algorithm should not only be able to handle low-
dimensional data but also the high dimensional space.
v. Ability to deal with noisy data: Databases contain noisy, missing or erroneous data.
Some algorithms are sensitive to such data and may lead to poor quality clusters.
vi. Interpretability: The clustering results should be interpretable, comprehensible, and
usable.

Applications of Clustering:
Clustering algorithms can be applied in many fields, for instance:
1. Marketing: finding groups of customers with similar behavior given a large database
of customer data containing their properties and past buying records;
2. Biology: classification of plants and animals given their features;
3. Libraries: book ordering;
4. Insurance: identifying groups of motor insurance policy holders with a high average
claim cost; identifying frauds;
5. City-planning: identifying groups of houses according to their house type, value and
geographical location;
6. Earthquake studies: clustering observed earthquake epicenters to identify dangerous
zones;
7. WWW: document classification; clustering weblog data to discover groups of similar
access patterns.
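A minimal clustering sketch in Python (not part of the model answer), using k-means from scikit-learn on a tiny two-dimensional customer data set; the data values, the choice of k = 2 and the use of scikit-learn are assumptions made for illustration:

from sklearn.cluster import KMeans
import numpy as np

# Hypothetical customer data: (annual spend, number of visits).
X = np.array([[200, 5], [220, 6], [210, 4],
              [900, 30], [950, 28], [880, 32]])

# Partition the objects into k = 2 clusters so that objects within a cluster
# are close (similar) to one another in Euclidean distance.
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)

print(kmeans.labels_)           # cluster membership of each object
print(kmeans.cluster_centers_)  # centroid of each cluster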
c) Explain major tasks in Data Preprocessing. 6M
Ans: Major tasks in Data preprocessing: (1 M for task diagram; 5 M for task explanation, 1 M each)
Data goes through a series of tasks during preprocessing:
1. Data Cleaning
2. Data Integration
3. Data Transformation
4. Data Reduction
5. Data Discretization
Diagram: (data preprocessing tasks diagram)

1. Data Cleaning in Data Mining:
Quality of your data is important for final analysis. Any data which is incomplete, noisy
and inconsistent can affect the results.
Data cleaning in data mining is the process of detecting and removing corrupt or
inaccurate records from a record set, table or database.
There are 4 methods to clean the data.
i. Handle the missing values
ii. Cleaning the noisy data using Binning method
iii. Regression
iv. Cluster Analysis

2. Data Integration in Data Mining:


Data Integration is a data preprocessing technique that combines data from multiple data
sources and provides a unified view of these data to users.
There are mainly 2 major approaches for data integration:
i. Tight Coupling
In tight coupling data is combined from different sources into a single physical location
through the process of ETL - Extraction, Transformation and Loading.
ii. Loose Coupling
In loose coupling data only remains in the actual source databases. In this approach, an
interface is provided that takes query from user and then sends the query directly to the
source databases to obtain the result.

3. Data Transformation in Data Mining:


In data transformation process data are transformed from one format to another format that
is more appropriate for data mining.
Ex: Original data: 1.2, 3.2, 4.6, 123
Transformed data: 120, 320, 460, 123

Some Data Transformation Strategies:


i. Smoothing:
Smoothing is a process of removing noise from the data using Binning method.
ii. Aggregation:
Aggregation in data mining is the process of finding, collecting, and presenting the data in
a summarized format to perform statistical analysis of business decisions.
Aggregated data help in finding useful information about a group after they are written as
reports.
iii. Generalization:
In generalization low-level data are replaced with high-level data by using concept
hierarchies climbing.

4. Data Reduction in Data Mining:


A database or data warehouse may store a large amount of data. So, it may take very long to perform data analysis and mining on such huge amounts of data.
Data reduction techniques can be applied to obtain a reduced representation of the data set that is much smaller in volume but still contains the critical information.

i. Data Cube Aggregation:
Aggregation operations are applied to the data in the construction of a data cube.
ii. Dimensionality Reduction:
In dimensionality reduction redundant attributes are detected and removed, which reduces
the data set size.

5. Data Discretization:
Data Discretization techniques can be used to divide the range of continuous attribute into
intervals. (Continuous values can be divided into discrete (finite) values)
i.e. it divides the large dataset into smaller parts.
Numerous continuous attribute values are replaced by small interval labels.
This leads to a brief, easy-to-use, knowledge-level representation of mining results.
Data mining on a reduced data set means fewer input/output operations and is more
efficient than mining on a larger data set.
Methods used for discretization are:
i. Binning method
ii. Cluster Analysis
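A short sketch in Python (not part of the model answer) showing discretization of a continuous attribute into interval labels with the binning method; the attribute, the number of bins and the labels are illustrative assumptions:

import pandas as pd

# Hypothetical continuous attribute (e.g., customer ages).
ages = pd.Series([5, 12, 19, 23, 31, 38, 44, 52, 61, 70])

# Discretization by binning: replace continuous values with small interval labels.
labels = ["young", "middle", "senior"]
age_group = pd.cut(ages, bins=3, labels=labels)

print(pd.DataFrame({"age": ages, "age_group": age_group}))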
