22621_SUM23_Model Answer
MAHARASHTRA STATE BOARD OF TECHNICAL EDUCATION
(Autonomous)
(ISO/IEC - 27001 - 2013 Certified)
__________________________________________________________________________________________________
SUMMER – 2023 EXAMINATION
Model Answer – Only for the Use of RAC Assessors
d) Define the term Data mining. 2M
Ans: Data Mining: (Definition: 2 M)
Data mining means searching for knowledge (interesting patterns or useful data) in data.
Data mining is the process of discovering interesting patterns and knowledge from large amounts of data.
Data mining is also known as Knowledge Discovery in Databases (KDD).
e) Describe requirements of cluster analysis. 2M
Ans: Requirements of Cluster Analysis: (Any four, 1/2 M each)
1. Scalability
2. Ability to deal with different kinds of attributes
3. Discovery of clusters with arbitrary shape
4. High dimensionality
5. Ability to deal with noisy data
6. Interpretability
f) Explain role of OLAP queries with example. 2M
Ans: Role of OLAP: (Role: 1 M, Example: 1 M)
OLAP is a database technology that has been optimized for querying and reporting, instead of processing transactions.
OLAP queries can be used to identify and compute the specific values from a cube which are required for decision support.
Example: (consider other related examples also)
compute cube sales_iceberg as
select month, city, customer_group, count(*)
from salesInfo
cube by month, city, customer_group
having count(*) >= min_sup
g) Explain Rollup OLAP operation. 2M
Ans: Roll-up operation: (Explanation: 2 M)
Roll-up is also known as "consolidation" or "aggregation."
The roll-up operation aggregates the data in a data cube.
The roll-up operation can be performed in two ways:
a. Reducing dimensions
b. Climbing up a concept hierarchy. A concept hierarchy is a system of grouping things based on their order or level.
A small aggregation sketch follows.
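A minimal sketch of roll-up as aggregation, assuming a hypothetical pandas DataFrame of sales (the column names, the city-to-state hierarchy, and the values are illustrative assumptions, not part of the model answer):

import pandas as pd

# Hypothetical sales data at (quarter, city) granularity.
sales = pd.DataFrame({
    "quarter": ["Q1", "Q1", "Q2", "Q2"],
    "city":    ["Pune", "Mumbai", "Pune", "Mumbai"],
    "amount":  [2000, 3000, 2500, 3500],
})

# (b) Climbing up the concept hierarchy: city -> state, aggregating
# the measure at the higher level of the hierarchy.
city_to_state = {"Pune": "Maharashtra", "Mumbai": "Maharashtra"}
sales["state"] = sales["city"].map(city_to_state)
rollup_by_state = sales.groupby(["quarter", "state"])["amount"].sum()

# (a) Reducing dimensions: drop the location dimension entirely.
rollup_total = sales.groupby("quarter")["amount"].sum()
print(rollup_by_state, rollup_total, sep="\n")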
ii. Slice:
In this operation, one dimension is selected, and a new sub-cube is created.
In the overview section, slice is performed on the dimension Time (Q1).
Page No: 3 | 26
MAHARASHTRA STATE BOARD OF TECHNICAL EDUCATION
(Autonomous)
(ISO/IEC - 27001 - 2013 Certified)
__________________________________________________________________________________________________
iii. Dice:
This operation is similar to a slice.
The difference is that in dice you can select two or more dimensions, resulting in the creation of a sub-cube.
In the overview section, a sub-cube is selected by selecting Location = Pune or Mumbai and Time = Q1 or Q2.
iv. Pivot:
In the pivot operation, you rotate the data axes to provide an alternative presentation of the data.
In the overview section, performing a pivot operation on the sub-cube obtained after the slice operation gives a new view of that slice.
Consider the result (slice) of the slice operation.
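An illustrative sketch of slice, dice, and pivot on a small hypothetical cube held as a pandas DataFrame (all names and values are assumed for the example):

import pandas as pd

cube = pd.DataFrame({
    "time":     ["Q1", "Q1", "Q2", "Q2"],
    "location": ["Pune", "Mumbai", "Pune", "Mumbai"],
    "item":     ["Mobile", "Modem", "Mobile", "Modem"],
    "amount":   [1000, 1500, 800, 1200],
})

# Slice: fix one dimension (Time = Q1) to obtain a sub-cube.
slice_q1 = cube[cube["time"] == "Q1"]

# Dice: select on two or more dimensions (Location and Time).
dice = cube[cube["location"].isin(["Pune", "Mumbai"])
            & cube["time"].isin(["Q1", "Q2"])]

# Pivot: rotate the axes of the slice for an alternative view.
pivoted = slice_q1.pivot_table(index="location", columns="item", values="amount")
print(pivoted)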
(3 M)
To design an effective and efficient data warehouse, we need to understand and analyze the business needs and construct a business analysis framework. Each person has different views regarding the design of a data warehouse. These views are as follows:
a. The top-down view: This view allows the selection of relevant information needed for
a data warehouse.
b. The data source view: This view presents the information being captured, stored, and
managed by the operational system.
c. The data warehouse view: This view includes the fact tables and dimension tables. It
represents the information stored inside the data warehouse.
d. The business query view: It is the view of the data from the viewpoint of the end user.
d) Explain major issues of Data mining. 4M
Ans: Major issues in Data mining: (Explain any two major issues: 2 M each)
A. Mining methodology and user interaction issues:
i. Mining different kinds of knowledge in databases:
Different users - different knowledge - different ways.
That means different clients want different kinds of information, so it becomes difficult to cover the vast range of data that can meet the client requirements.
ii. Incorporation of background knowledge:
Background knowledge is used to guide the discovery process and to express the discovered patterns. So, in the mining process, knowing the background of the data is a must for an easy process.
iii. Query languages and ad hoc mining:
Relational query languages allow users to pose ad-hoc queries for data retrieval.
The data mining query language should be matched with the query language of the data warehouse.
iv. Handling noisy or incomplete data:
In a large database, many of the attribute values will be incorrect.
This may be due to human error or instrument failure.
B. Performance issues:
i. Efficiency and scalability of data mining algorithms:
To effectively extract information from a huge amount of data in databases, data mining
algorithms must be efficient and scalable.
ii. Parallel, distributed, and incremental mining algorithms:
The huge size of databases, the wide distribution of data, and the complexity of some data mining methods are factors that should be considered during the development of parallel and distributed data mining algorithms.
C. Issues relating to the diversity of database types:
i. Handling of relational and complex types of data:
There are many kinds of data stored in databases and data warehouses.
It is not possible for one system to mine all these kinds of data, so different data mining systems should be constructed for different kinds of data.
ii. Mining information from heterogeneous databases and global information systems:
Since data is fetched from different data sources on Local Area Networks (LAN) and Wide Area Networks (WAN), the discovery of knowledge from different sources of structured, semi-structured, and unstructured data is a great challenge to data mining.
2. Data Marts:
A data mart is a subset of the data warehouse.
It is specially designed for a particular line of business, such as sales or finance. In an independent data mart, data can be collected directly from sources.
Due to the large amount of data, a single warehouse can become overburdened.
So, to prevent the warehouse from becoming impossible to navigate, subdivisions are created, called data marts.
These data marts divide the information saved in the warehouse into categories for specific groups of users.
In simple words, a data mart is a subsidiary of a data warehouse.
Example: Five regions of MSBTE: one region may be referred to as a data mart.
3. Virtual Warehouse:
The view over an operational data warehouse is known as a virtual warehouse.
A virtual warehouse is essentially a separate business database, which contains only the data required by the operational system.
The data found in a virtual warehouse is usually copied from multiple sources throughout an operational system.
A virtual warehouse is used to search data quickly, without accessing the entire system.
It speeds up the overall access process.
Example: It may contain the data of only one or two Polytechnics.
b) Describe data objects & attribute type. 4M
Ans: Data Objects and Attribute Types: (2 M for data objects, 2 M for attribute types)
Data Objects:
• Data sets are made up of data objects.
• A data object represents an entity.
• Example :
- In a sales database, the objects may be customers, store items, and sales.
- In a medical database, the objects may be patients.
- In a university database, the objects may be students, professors, and courses.
• Data objects are typically described by attributes.
• Data objects can also be referred to as samples, examples, instances, data points, or
objects.
• If the data objects are stored in a database, they are data tuples. That is, the rows of a
database correspond to the data objects, and the columns correspond to the attributes.
Attribute:
• An attribute is a data field, representing a characteristic or feature of a data object.
• The nouns attribute, dimension, feature, and variable are often used interchangeably
in the literature.
• The term dimension is commonly used in data warehousing.
• Machine learning literature tends to use the term feature, while statisticians prefer the
term variable.
• Data mining and database professionals commonly use the term attribute.
• Example: Attributes describing a customer object can include customer ID, name, and address.
Types of attributes:
1. Qualitative Attributes
2. Quantitative Attributes
1. Qualitative Attributes:
a. Nominal Attributes (N):
These attributes are related to names.
The values of a nominal attribute are names of things or some kind of symbols.
Values of nominal attributes represent some category or state; that is why nominal attributes are also referred to as categorical attributes, and there is no order (rank, position) among the values of a nominal attribute.
Example:
Attribute          Values
Colors             Black, Red, Green
Categorical Data   Lecturer, Professor
b. Ordinal Attributes (O):
The values of an ordinal attribute have a meaningful order (ranking) among them.
Example:
Attribute   Values
Grade       A, B, C, D, E
Income      low, medium, high
Age         teenage, young, old
2. Quantitative Attributes:
a. Numeric:
A numeric attribute is quantitative because it is a measurable quantity, represented in integer or real values.
Example:
Attribute    Values
Salary       2000, 3000
Units sold   10, 20
Age          5, 10, 20, ...
b. Discrete:
Discrete data have finite values; they can be numerical or categorical.
These attributes have a finite or countably infinite set of values.
Example:
Attribute    Values
Profession   Teacher, Businessman, Peon
Zip Code     413736, 413713
c. Continuous:
Continuous data have an infinite number of states. Continuous data are of float type. There can be many values between 2 and 3.
Example:
Attribute   Values
Height      2.3, 3, 6.3, ...
Weight      40, 45.33, ...
c) Explain steps for generating association rule from frequent item sets. 4M
Ans: (4 M for steps and explanation)
Association rule generation is a two-step process.
The first step is to generate itemsets like {Bread, Egg, Milk}, and the second is to generate a rule from each itemset, like {Bread → Egg, Milk}, {Bread, Egg → Milk}, etc. Both steps are discussed below.
1. Generating itemsets from a list of items:
The first step in the generation of association rules is to get all the frequent itemsets, on which binary partitions can be performed to get the antecedent and the consequent. For example, if there are 6 items {Bread, Butter, Egg, Milk, Notebook, Toothbrush} on all the transactions combined, itemsets will look like {Bread}, {Butter}, {Bread, Notebook}, {Milk, Toothbrush}, {Milk, Egg, Vegetables}, etc. The size of an itemset can vary from one to the total number of items that we have. We seek only the frequent itemsets, not all of them, so as to put a check on the total number of itemsets generated.
2. Generating rules from frequent itemsets:
We start with a frequent itemset {a,b,c,d} and start forming rules with just one consequent. Remove the rules failing to satisfy the minconf condition. Now, start forming rules using a combination of consequents from the remaining ones. Keep repeating until only one item is left in the antecedent. This process has to be done for all frequent itemsets. A brute-force version of this rule-generation step is sketched below.
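A brute-force sketch of the rule-generation step, assuming hypothetical support counts; it enumerates every antecedent/consequent split of one frequent itemset and prunes by minconf, rather than the level-wise consequent expansion described above:

from itertools import combinations

# Hypothetical support counts for the itemset {a,b,c,d} and its subsets.
support = {
    frozenset("abcd"): 3,
    frozenset("abc"): 3, frozenset("abd"): 4, frozenset("acd"): 4, frozenset("bcd"): 3,
    frozenset("ab"): 5, frozenset("ac"): 5, frozenset("ad"): 5,
    frozenset("bc"): 4, frozenset("bd"): 5, frozenset("cd"): 5,
    frozenset("a"): 6, frozenset("b"): 7, frozenset("c"): 6, frozenset("d"): 7,
}

def rules_from_itemset(itemset, minconf):
    itemset = frozenset(itemset)
    rules = []
    for r in range(1, len(itemset)):               # every non-empty proper subset
        for antecedent in map(frozenset, combinations(itemset, r)):
            consequent = itemset - antecedent
            conf = support[itemset] / support[antecedent]
            if conf >= minconf:                    # prune rules failing minconf
                rules.append((set(antecedent), set(consequent), conf))
    return rules

for a, c, conf in rules_from_itemset("abcd", minconf=0.6):
    print(a, "->", c, round(conf, 2))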
d) Differentiate between ROLAP & MOLAP. 4M
Ans: ROLAP & MOLAP Difference: (1 M for each correct point, any 4 points)
Sr. No. | Parameters             | ROLAP                                        | MOLAP
1       | Acronym                | Relational Online Analytical Processing      | Multidimensional Online Analytical Processing
2       | Information retrieval  | Slow                                         | Fast
3       | Storage method         | Relational tables                            | Sparse arrays to store data sets
4       | Easy to use            | Yes                                          | No
5       | When to use            | When data warehouse contains relational data | When data warehouse contains relational as well as non-relational data
6       | Implementation         | Easy                                         | Complex
7       | Response time required | More                                         | Less
8       | Storage space          | Less                                         | More
1. Extraction:
The first step of the ETL process is extraction.
In this step, data is extracted from various source systems, which can be in various formats like relational databases, NoSQL, XML, and flat files, into the staging area.
The extracted data cannot be loaded directly into the data warehouse; therefore, this is one of the most important steps of the ETL process.
2. Transformation:
The second step of the ETL process is transformation.
In this step, a set of rules or functions is applied to the extracted data to convert it into a single standard format.
It may involve following processes/tasks:
• Filtering – loading only certain attributes into the data warehouse.
• Cleaning – filling up the NULL values and missing values.
• Joining – joining multiple attributes into one.
• Splitting – splitting a single attribute into multiple attributes.
• Sorting – sorting tuples on the basis of some attribute (generally key-attribute).
3. Loading:
The third and final step of the ETL process is loading.
In this step, the transformed data is finally loaded into the data warehouse.
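A minimal ETL sketch in Python, assuming a hypothetical CSV source file and SQLite warehouse (the file, table, and column names are illustrative); it shows the filtering, cleaning, and sorting transformations listed above:

import pandas as pd
import sqlite3

# Extract: read raw data from an assumed flat-file source.
raw = pd.read_csv("sales_source.csv")

# Transform: filtering (keep only certain attributes), cleaning
# (fill NULL values), and sorting on the key attribute.
stage = raw[["order_id", "city", "amount"]]   # filtering
stage = stage.fillna({"amount": 0})           # cleaning
stage = stage.sort_values("order_id")         # sorting

# Load: write the transformed data into the warehouse table.
with sqlite3.connect("warehouse.db") as conn:
    stage.to_sql("sales_fact", conn, if_exists="append", index=False)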
b) Explain need of OLAP. 4M
Ans: Need of OLAP: (1/2 M for each point, any 8 points)
OLAP is needed for business data analysis and decision making. OLAP can be used/needed:
1. To support a multidimensional view of data.
2. To provide fast and steady access to various views of information.
3. To process complex queries.
4. To analyze the information.
5. To pre-calculate and pre-aggregate the data.
6. For all types of business activities, including planning, budgeting, reporting, and analysis.
7. To quickly create and analyze "what if" scenarios.
8. To easily search an OLAP database for broad or specific terms.
9. To provide the building blocks for business modelling tools, data mining tools, and performance reporting tools.
10. To slice and dice cube data by various dimensions, measures, and filters.
11. To analyze time series data.
12. To find clusters and outliers.
13. To visualize the online analytical processing system, which provides faster response times.
c) Explain OLAP data indexing with its type. 4M
Ans: OLAP Indexing: (2 M for each type with example: bitmap index, bitmap join index)
OLAP uses two indices: the Bitmap Index and the Bitmap Join Index.
1. Bitmap Index:
The bitmap index is an alternative representation of the record ID (RID) list.
Each attribute value is represented by a distinct bit vector.
If an attribute's domain consists of n values, then n bits are needed for each entry in the bitmap index.
If the attribute value is present in a row, then it is represented by 1 in the corresponding row of the bitmap index, and the rest are 0 (zero).
Because cust_gender and cust_income are both low-cardinality columns (there are two possible values for gender and five for income), bitmap indexes are ideal for these columns.
Do not create a bitmap index on cust_id, because this is a unique column.
The following table illustrates the bitmap index for the cust_gender column in this example. It consists of two separate bitmaps, one for each gender.
Sample Bitmap Index on Gender
Table 2: Sales
Time_id   Cust_id   Amount_sold
Jan       101       2000
Feb       103       3000
Mar       106       5000
Apr       104       6000
May       107       7000
By joining the two tables (Customer and Sales), the join result is used to create the bitmaps that are stored in the bitmap join index.
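A small sketch of how a bitmap index can be built for a low-cardinality column such as cust_gender (the rows are made up): one bit vector per distinct value, with bit i set when row i holds that value.

rows = ["M", "F", "F", "M", "F"]          # hypothetical cust_gender column

bitmap = {}
for i, value in enumerate(rows):
    bitmap.setdefault(value, [0] * len(rows))[i] = 1

print(bitmap)  # {'M': [1, 0, 0, 1, 0], 'F': [0, 1, 1, 0, 1]}

# A query such as cust_gender = 'F' reduces to reading one bit vector:
matching_rows = [i for i, bit in enumerate(bitmap["F"]) if bit]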
Data warehouse design is a lengthy and error-prone process.
Data warehouse design takes an approach different from view materialization in industry. It sees data warehouses as database systems with particular needs, such as answering management-related queries.
There are two approaches:
1. The "top-down" approach
2. The "bottom-up" approach
1. Top-Down Approach:
This is a data-driven approach: the information is gathered and integrated first, and then the business requirements by subjects for building data marts are formulated.
i. External Sources:
An external source is a source from where data is collected, irrespective of the type of data. Data can be structured, semi-structured, or unstructured.
ii. Stage Area:
Since the data extracted from the external sources does not follow a particular format, it needs to be validated before loading into the data warehouse.
For this purpose, it is recommended to use an ETL tool.
• E (Extracted): Data is extracted from External data source.
• T (Transform): Data is transformed into the standard format.
• L (Load): Data is loaded into Data warehouse after transforming it into the
standard format.
iii. Data-warehouse:
After cleansing, the data is stored in the data warehouse as a central repository.
The warehouse actually stores the metadata, while the actual data is stored in the data marts.
iv. Data Marts:
A data mart is a storage component that holds the data of a specific function of the organization; there can be several data marts, depending upon the functions.
v. Data Mining:
It is used to find the hidden patterns present in the database or in the data warehouse with the help of data mining algorithms.
2. Snowflake Schema:
A snowflake schema is a refinement of the star schema.
"A schema is known as a snowflake schema where one or more dimension tables do not connect directly to the fact table but must join through other dimension tables."
The snowflake schema is an expansion of the star schema where each point (dimension table) of the star explodes into more points (more dimension tables).
Snowflaking is a method of normalizing the dimension tables in a star schema.
Snowflaking is used to improve the performance of specific queries.
The snowflake schema consists of one fact table which is linked to many dimension tables, which can be linked to other dimension tables through a many-to-one relationship.
Tables in a snowflake schema are generally normalized to the third normal form.
3. Fact Constellation Schema:
A Fact constellation means two or more fact tables sharing one or more dimensions. It is
also called Galaxy schema.
The fact constellation schema is a sophisticated (advanced but difficult to understand) database design in which it is difficult to summarize information. A fact constellation schema can be implemented between aggregate fact tables.
Example Apriori Method: (consider any other relevant example)
Consider the given database D and minimum support 50%. Apply the Apriori algorithm to find the frequent itemsets and generate association rules with confidence greater than 70%.
TID   Items
1     1, 3, 4
2     2, 3, 5
3     1, 2, 3, 5
4     2, 5
Solution:
Calculate min_supp=0.5*4=2 (support count is 2)
(0.5: given minimum support in problem, 4: total transactions in database D)
Step 1: Generate candidate list C1 from D.
C1 =
Itemsets
1
2
3
4
5
Step 2: Scan D for the count of each candidate and find the support.
C1 =
Itemsets   Support count
1          2
2          3
3          3
4          1
5          3
Step 8: Scan D for the count of each candidate and find the support.
C3 =
Itemsets   Support count
1,2,3      1
1,2,5      1
1,3,5      1
2,3,5      2
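A sketch that reproduces the candidate counting for this example (database D and the support count of 2 are taken from above); for brevity it counts all k-subsets of each transaction instead of performing the full Apriori join step:

from itertools import combinations

D = [{1, 3, 4}, {2, 3, 5}, {1, 2, 3, 5}, {2, 5}]   # database D
min_sup = 2                                        # support count

def count_candidates(k):
    # Scan D and count every k-itemset occurring in a transaction.
    counts = {}
    for t in D:
        for cand in combinations(sorted(t), k):
            counts[cand] = counts.get(cand, 0) + 1
    return counts

print(count_candidates(1))   # C1 counts: item 4 falls below min_sup
L3 = {c: n for c, n in count_candidates(3).items() if n >= min_sup}
print(L3)                    # {(2, 3, 5): 2} -- the frequent 3-itemset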
(4 M for Steps in KDD)
1. Data cleaning:
The data cleaning step removes noise and inconsistent data. It can handle missing values, use the binning method to remove noise, and also use regression and cluster analysis for cleaning the data.
2. Data integration:
The data in a data warehouse may be collected from different sources. These different data can be integrated at a single location using loose and tight coupling and then sent to the data selection step.
3. Data selection:
In this step, the data relevant to the analysis task are retrieved or selected from the
database.
4. Data transformation:
The data are transformed and consolidated into forms appropriate for mining by performing summary or aggregation operations, i.e. the data from different data sources, which may be of varied types, is converted into a single standard format.
5. Data mining:
Data mining is the process in which intelligent methods or algorithms are applied to data to extract useful data patterns for decision support systems.
6. Pattern evaluation:
This process identifies the truly interesting patterns representing actual knowledge based
on user requirements for analysis.
7. Knowledge presentation:
In this process, visualization and knowledge representation techniques are used to
represent mined knowledge to the end users for analysis and decision making.
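A compact sketch chaining several KDD steps on a hypothetical pandas DataFrame (all column names and values are assumptions; the "mining" step is deliberately trivial):

import pandas as pd

# Hypothetical data already integrated from several sources.
data = pd.DataFrame({
    "age":    [23, None, 45, 31, 23],
    "city":   ["Pune", "Pune", "Mumbai", "Mumbai", "Pune"],
    "amount": [120.0, 80.0, None, 200.0, 120.0],
})

cleaned  = data.fillna(data.mean(numeric_only=True))     # data cleaning
selected = cleaned[["city", "amount"]]                   # data selection
selected = selected.assign(                              # data transformation
    amount_scaled=selected["amount"] / selected["amount"].max())
patterns = selected.groupby("city")["amount"].mean()     # a trivial mining step
print(patterns)                                          # presentation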
b) Explain cluster analysis with its application. 6M
Ans: Cluster Analysis: (2 M Explanation, 2 M for Requirements, 2 M for Applications)
Clustering is a data mining technique used to place data elements into related groups without advance knowledge.
Clustering is the process of grouping a set of data objects into multiple groups or clusters so that objects within a cluster have high similarity.
Dissimilarities and similarities are assessed based on the attribute values describing the objects and often involve distance measures.
Cluster analysis or simply clustering is the process of partitioning a set of data objects (or
observations) into subsets.
Each subset is a cluster, such that objects in a cluster are similar to one another, yet
dissimilar to objects in other clusters. The set of clusters resulting from a cluster analysis
can be referred to as a clustering.
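A minimal clustering sketch using scikit-learn's k-means (an assumed library choice; the points are made up), partitioning objects so that within-cluster similarity, measured by Euclidean distance, is high:

from sklearn.cluster import KMeans
import numpy as np

# Made-up 2-D points forming two loose groups.
X = np.array([[1.0, 2.0], [1.5, 1.8], [1.2, 2.2],
              [8.0, 8.0], [8.5, 7.7], [7.8, 8.3]])

km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print(km.labels_)           # cluster membership of each object
print(km.cluster_centers_)  # one centroid per cluster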
Applications of Clustering:
Clustering algorithms can be applied in many fields, for instance:
1. Marketing: finding groups of customers with similar behavior given a large database
of customer data containing their properties and past buying records;
2. Biology: classification of plants and animals given their features;
3. Libraries: book ordering;
4. Insurance: identifying groups of motor insurance policy holders with a high average
claim cost; identifying frauds;
5. City-planning: identifying groups of houses according to their house type, value and
geographical location;
6. Earthquake studies: clustering observed earthquake epicenters to identify dangerous
zones;
7. WWW: document classification; clustering weblog data to discover groups of similar
access patterns.
c) Explain major tasks in Data Preprocessing. 6M
Ans: Major tasks in Data preprocessing: (1 M for Task Diagram, 5 M for Task explanation, 1 M each)
Data goes through a series of tasks during preprocessing:
1. Data Cleaning
2. Data Integration
3. Data Transformation
4. Data Reduction
5. Data Discretization
Diagram: (diagram of the five preprocessing tasks)
1. Data Cleaning in Data Mining:
The quality of your data is important for the final analysis. Any data which is incomplete, noisy, or inconsistent can affect the results.
Data cleaning in data mining is the process of detecting and removing corrupt or inaccurate records from a record set, table, or database.
There are 4 methods to clean the data:
i. Handling the missing values
ii. Cleaning the noisy data using the binning method (see the sketch after this list)
iii. Regression
iv. Cluster analysis
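A small sketch of the binning method for smoothing noisy data (smoothing by bin means; the values are made up):

# Sort values, split into equal-frequency bins, and replace each
# value with the mean of its bin.
values = sorted([4, 8, 15, 21, 21, 24, 25, 28, 34])
bin_size = 3

smoothed = []
for i in range(0, len(values), bin_size):
    bin_ = values[i:i + bin_size]
    mean = sum(bin_) / len(bin_)
    smoothed.extend([round(mean, 1)] * len(bin_))

print(smoothed)  # [9.0, 9.0, 9.0, 22.0, 22.0, 22.0, 29.0, 29.0, 29.0]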
4. Data Reduction:
Methods used for data reduction include:
i. Data Cube Aggregation:
Aggregation operations are applied to the data in the construction of a data cube.
ii. Dimensionality Reduction:
In dimensionality reduction, redundant attributes are detected and removed, which reduces the data set size.
5. Data Discretization:
Data discretization techniques can be used to divide the range of a continuous attribute into intervals (continuous values can be divided into discrete, finite values), i.e. it divides the large dataset into smaller parts.
Numerous continuous attribute values are replaced by small interval labels.
This leads to a brief, easy-to-use, knowledge-level representation of mining results.
Data mining on a reduced data set means fewer input/output operations and is more
efficient than mining on a larger data set.
Methods used for discretization are:
i. Binning method (a short sketch follows this list)
ii. Cluster analysis
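A short discretization sketch using pandas (an assumed tool): continuous age values are replaced by small interval labels.

import pandas as pd

ages = pd.Series([5, 13, 17, 24, 31, 45, 58, 67])
labels = pd.cut(ages, bins=[0, 18, 40, 60, 100],
                labels=["child", "young", "middle-aged", "senior"])
print(labels.tolist())
# ['child', 'child', 'child', 'young', 'young',
#  'middle-aged', 'middle-aged', 'senior']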