Oral Questions LP-II: Star Schema

Data Mining
Oral Questions LP-II

1. What are the different Data Warehousing Schemas?
Schema is a logical description of the entire database. It includes the name and
description of records of all record types including all associated data-items and
aggregates. Much like a database, a data warehouse also requires to maintain a
schema. A database uses relational model, while a data warehouse uses Star,
Snowflake, and Fact Constellation schema. In this chapter, we will discuss the
schemas used in a data warehouse.
Star Schema
 Star Schema in data warehouse, in which the center of the star can have one fact table
and a number of associated dimension tables. It is known as star schema as its structure
resembles a star. The Star Schema data model is the simplest type of Data Warehouse
schema. It is also known as Star Join Schema and is optimized for querying large data
sets.
 The following diagram shows the sales data of a company with respect to the four
dimensions, namely time, item, branch, and location.
 There is a fact table at the center. It contains the keys to each of four dimensions.
 The fact table also contains the attributes, namely dollars sold and units sold.
Example: Each dimension has only one dimension table and each table holds a set of
attributes. For example, the location dimension table contains the attribute set {location_key,
street, city, province_or_state,country}. This constraint may cause data redundancy. For
example, "Vancouver" and "Victoria" both the cities are in the Canadian province of British
Columbia. The entries for such cities may cause data redundancy along the attributes
province_or_state and country.
Snowflake Schema
Snowflake Schema in data warehouse is a logical arrangement of tables in a multidimensional
database such that the ER diagram resembles a snowflake shape. A Snowflake Schema is an
extension of a Star Schema, and it adds additional dimensions. The dimension tables are
normalized which splits data into additional tables.
Unlike Star schema, the dimensions table in a snowflake schema are normalized. For example,
the item dimension table in star schema is normalized and split into two dimension tables,
namely item and supplier table.
Now the item dimension table contains the attributes item_key, item_name,
type, brand, and supplier-key.
 The supplier key is linked to the supplier dimension table. The supplier
dimension table contains the attributes supplier_key and supplier_type.
2. Explain Star Schema vs. Snowflake Schema
Sr. No Star Schema Snowflake Schema

1 In star schema, The fact tables and the While in snowflake schema, The fact
dimension tables are contained. tables, dimension tables as well as sub
dimension tables are contained.
2 Star schema is a top-down model. While it is a bottom-up model.
3 Star schema uses more space. While it uses less space.
4 It takes less time for the execution of While it takes more time than star schema
queries. for the execution of queries.
5 In star schema, Normalization is not While in this, Both normalization and
used. denormalization are used.
6 It’s design is very simple. While it’s design is complex.
7 The query complexity of star schema While the query complexity of snowflake
is low. schema is higher than star schema.
8 It’s understanding is very simple. While it’s understanding is difficult.
9 It has less number of foreign keys. While it has more number of foreign keys.
10 It has high data redundancy. While it has low data redundancy.
3. Mention what is the responsibility of a Data analyst?

 Interpret data, analyze results using statistical techniques and provide ongoing reports
 Develop and implement databases, data collection systems, data analytics and other
strategies that optimize statistical efficiency and quality
 Acquire data from primary or secondary data sources and maintain databases/data
systems
 Identify, analyze, and interpret trends or patterns in complex data sets
 Filter and “clean” data by reviewing computer reports, printouts, and performance
indicators to locate and correct code problems
 Work with management to prioritize business and information needs
 Locate and define new process improvement opportunities
4. Define Data mining? List out the applications of data mining.

In simple words, data mining is defined as a process used to extract usable data from a
larger set of any raw data. It implies analysing data patterns in large batches of data
using one or more software. Data mining has applications in multiple fields, like science
and research. As an application of data mining, businesses can learn more about their
customers and develop more effective strategies related to various business functions
and in turn leverage resources in a more optimal and insightful manner. This helps
businesses be closer to their objective and make better decisions. Data mining involves
effective data collection and warehousing as well as computer processing. For
segmenting the data and evaluating the probability of future events, data mining uses
sophisticated mathematical algorithms. Data mining is also known as Knowledge
Discovery in Data (KDD).
List out the applications of data mining.

a. Financial Data Analysis
b. Retail Industry
c. Telecommunication Industry
d. Biological Data Analysis
e. Intrusion Detection
f. Education
g. Fraud detection
5. Mention what is data cleaning?

Data cleaning is the process of fixing or removing incorrect, corrupted, incorrectly formatted,
duplicate, or incomplete data within a dataset. When combining multiple data sources, there
are many opportunities for data to be duplicated or mislabeled. If data is incorrect, outcomes
and algorithms are unreliable, even though they may look correct. There is no one absolute
way to prescribe the exact steps in the data cleaning process because the processes will vary
from dataset to dataset. But it is crucial to establish a template for your data cleaning process
so you know you are doing it the right way every time.
6. List out the data mining processing steps?
 Identifying the source information.
 Picking the data points that need to be analyzed.
 Extracting the relevant information from the data.
 Identifying the key values from the extracted data set.
 Interpreting and reporting the results.
7. What is difference between Supervised and Unsupervised Learning?
Parameters Supervised machine learning Unsupervised machine learning

technique technique
Process In a supervised learning model, In unsupervised learning model, only
input and output variables will be input data will be given
given.
Input Data Algorithms are trained using Algorithms are used against data which
labeled data. is not labeled
Algorithms Used Support vector machine, Neural Unsupervised algorithms can be
network, Linear and logistics divided into different categories: like
regression, random forest, and Cluster algorithms, K-means,
Classification trees. Hierarchical clustering, etc.
Computational Supervised learning is a simpler Unsupervised learning is
Complexity method. computationally complex
Use of Data Supervised learning model uses Unsupervised learning does not use
training data to learn a link output data.
between the input and the
outputs.
Accuracy of Highly accurate and trustworthy Less accurate and trustworthy method.
Results method.
Number of Number of classes is known. Number of classes is not known.
Classes
Main Drawback Classifying big data can be a real You cannot get precise information
challenge in Supervised regarding data sorting, and the output
Learning. as data used in unsupervised learning is
labeled and not known.
8. What are difference between Kmeans and KNN Algorithm?

 KNN represents a supervised classification algorithm that will give new data
points accordingly to the k number or the closest data points, while k-means
clustering is an unsupervised clustering algorithm that gathers and groups data
into k number of clusters.
 What is K-means clustering?
K- means clustering is used for analyzing and grouping data which does not include
pre-labeled class or even a class attribute at all.
How is the K-nearest neighbor algorithm different from K-means clustering?
 KNN Algorithm is based on feature similarity and K-means refers to the division of
objects into clusters (such that each object is in exactly one cluster, not several).
 KNN is a classification technique and K-means is a clustering technique.
Let’s understand the difference in a better way using an example of a crocodile and an alligator,
KNN Algorithm:
You can differentiate between a crocodile and an alligator based on their characteristics.
The features of the unknown animal are more a crocodile.
Therefore, it is a crocodile!
K-means clustering:
K-means performs division of objects into clusters which are “similar” between them and are
“dissimilar” to the objects belonging to another cluster.
9. What is Euclidean distance? Explain with Suitable example?

 The Euclidean distance between two points in either the plane or 3-dimensional
space measures the length of a segment connecting the two points. It is the most
obvious way of representing distance between two points.
 The Pythagorean Theorem can be used to calculate the distance between two points, as
shown in the figure below. If the points (x1,y1) and (x2,y2) are in 2-dimensional space,
then the Euclidean distance between them is
For points (x1,y1,z1)(x1,y1,z1) and (x2,y2,z2)(x2,y2,z2) in 3-dimensional space, the Euclidean
distance between them is
Example
the Euclidean distance between (−1,2,3) and (4,0,−3) is
10. What are different types of Clustering?

 Connectivity-based Clustering (Hierarchical clustering)
 Centroids-based Clustering (Partitioning methods)
 Distribution-based Clustering
 Density-based Clustering (Model-based methods)
 Fuzzy Clustering
 Constraint-based (Supervised Clustering)
11. Explain Association Rule

 Association rule mining finds interesting associations and relationships
among large sets of data items. This rule shows how frequently a itemset
occurs in a transaction. A typical example is Market Based Analysis.
 Market Based Analysis is one of the key techniques used by large
relations to show associations between items.It allows retailers to identify
relationships between the items that people buy together frequently.
 Given a set of transactions, we can find rules that will predict the
occurrence of an item based on the occurrences of other items in the
transaction.
 An antecedent is something that’s found in data, and a consequent is an
item that is found in combination with the antecedent. Have a look at this
rule for instance:
If a customer buys bread, he’s 70% likely of buying milk.”
• Association rules are created by thoroughly analyzing data and looking

for frequent if/then patterns. Then, depending on the following two
parameters, the important relationships are observed:
• Support: Support indicates how frequently the if/then relationship
appears in the database.
• Confidence: Confidence tells about the number of times these

relationships have been found to be true.
• So, in a given transaction with multiple items, Association Rule Mining

primarily tries to find the rules that govern how or why such
products/items are often bought together. For example, peanut butter and
jelly are frequently purchased together because a lot of people like to
make PB&J sandwiches.
• Rule Evaluation Metrics –
Support(s) –
The number of transactions that include items in the {X} and {Y} parts of
the rule as a percentage of the total number of transaction. It is a
measure of how frequently the collection of items occur together as a
percentage of all transactions
• Confidence(c) –
It is the ratio of the no of transactions that includes all items in {B} as well
as the no of transactions that includes all items in {A} to the no of
transactions that includes all items in {A}.
Example
ID ITEMS
1 Bread, Milk
2 Bread, Diaper, Beer, Eggs
3 Milk, Diaper, Beer, Cake
4 Bread, Milk, Diaper, Beer
5 Bread, Milk, Diaper, Cake
1. { Milk, Diaper} I want to check association of milk with diaper
How many times I buy milk alongwith Diaper

In this A- Milk
B- Diaper
 So, to calculate support we need to check how many times A and B comes together
in dataset means how many times milk and diaper comes together
So, (A+ B)= 3 times
 And dividied by total no of items in itemset
So, Total= 5
So, Support= 3/5= 0.6 = 60%
Confidence= ( A+B)
________
A
So, A+B means milk with diaper comes in itemset 3 times
A means only milk comes in itemset i.e 4 times and no of items
5 means A= 4/5= 0.8
Confidence= 0.6/0.8= 0.75= 75%
12. What is Market Basket Analysis? Explain with suitable example?

 A market basket analysis is a set of affinity calculations meant to determine which
items sell together. For example, a grocery store may use market basket analysis
to determine that consumers typically buy both hot dogs and hot dog buns
together.
 If you’ve ever gone onto an online retailer’s website, you’ve probably seen a
recommendation on a product’s page phrased as “Customers who bought this item
also bought” or “Customers buy these together”. More than likely, the online retailer
performed some sort of market basket analysis to link the products together.
 Examples of market basket analysis
 The Amazon website employs a well-known example of market basket analysis. On
a product page, Amazon presents users with related products, under the headings of
“Frequently bought together” and “Customers who bought this item also bought.”
 Market basket analysis also applies to bricks-and-mortar stores. If analysis showed

that magazine purchases often include the purchase of a bookmark (which could be
considered an unexpected combination, since the consumer did not purchase a
book), then the book store might place a selection of bookmarks near the magazine
rack.
13. Who propose A-Priori algorithm?
Apriori algorithm was the first algorithm that was proposed for frequent itemset

mining. It was later improved by R Agarwal and R Srikant and came to be known
as Apriori. This algorithm uses two steps “join” and “prune” to reduce the search
space.
14. Explain a-priori algorithm

• The best known algorithm
• Two steps:
– Find all item sets that have minimum support (frequent item sets, also called large
item sets).
– It Create Association rule with support and Confidence.
– E.g. if we buy tooth brush : it suggest Colgate and tongue cleaner
Data Set
T-Id Item Set
T-1000 M,O,N,K,E,Y
T-1001 D,O,N,K,E,Y
T-1002 M,A,K,E
T-1003 M,U,C,K,Y
T-1004 C,O,O,K,E
Given: Minimum Support = 60%

Minimum Confidence = 80%
3.9.1 Candidate Table C1: Now find support count of each item set
Item Set Support Count

M 3
O 4
N 2
E 4
Y 3
D 1
A 1
U 1
C 2
K 5
• Now find out minimum Support

• Support = 60/100*5
=3
• Where 5 is Number of entry
Compare Min Support with each item set
3.9.2 L1 Support Count
M 3
O 4
K 5
E 4
Y 3
3.9.3 Candidate Table C2:
MO 1
MK 3
ME 2
MY 2
OK 3
OE 3
OY 2
KE 4
KY 3
EY 2
• Now again Compare C2 with Min Support 3
MK 3
OK 3
OE 3
KE 4
KY 3
Table 3.5: L2 Support Count
• After satisfied minimum support criteria
• Make Pair to generate C3
3.9.5 Candidate Table C3
Item Set Support count
M,K,O 1
M,K,E 2
M,K,Y 2
O,K,E 3
O,K,Y 2
Table 3.6: Candidate Table C3
Now again compare the item set with min support 3
O,K,E 3
Table 3.7: L3 Support Count
3.9.7 Now create association rule with support and Confidence for {O,K,E}
• Confidence =Support/No. of time it Occurs
Association Rule Support Confidence Confidence (%)
O^K⇒E 3 3/3 = 1 1*100=100

OÊ⇒K 3 3/3 = 1 1*100=100
KÊ⇒O 3 3/4 = 0.75 0.75*100=75
E⇒ O ^ K 3 3/4 = 0.75 0.75*100=75
K⇒ O ^ E 3 3/5 = 0.6 0.6*100=60
O⇒ K ^ E 3 3/4 = 0.75 0.75*100=75

Table 3.8: Association Rule
• Compare this with Minimum Confidence=80%
Rule Support Confidence
O^K⇒E 3 100
OÊ⇒K 3 100
Table 3.9: Support and Confidence
Hence final Association rule are

{O ^ K ⇒ E}
{O ^ E ⇒ K}
 From first observation we predict that if the customer buy item O and item K then
defiantly he will by item E
 From Second observation we predict that the customer buy item O and item E then
defiantly he will by item K
https://youtu.be/LZii6N4vGDs
Also To solve the example of a-priori please check this video for a- priori algorithm.
15. What is frequent itemset?

• Frequent item sets are those items whose support is greater than the threshold value
or user-specified minimum support. It means if A & B are the frequent item sets together,
then individually A and B should also be the frequent item set.
• Suppose there are the two transactions: A= {1,2,3,4,5}, and B= {2,3,7}, in these two
transactions, 2 and 3 are the frequent item sets.
16. What is minimum support and minimum confidence?
1. Minimum support: A minimum support threshold is applied to find all frequent

itemsets in a database.
2. Minimum Confidence: A minimum confidence constraint is applied to these
frequent itemsets in order to form rules.
17. Define Data warehouse? Applications of datawarehouse

 A Data Warehousing (DW) is process for collecting and managing data from varied
sources to provide meaningful business insights. A Data warehouse is typically used to
connect and analyze business data from heterogeneous sources. The data warehouse is
the core of the BI system which is built for data analysis and reporting.
 It is a blend of technologies and components which aids the strategic use of data. It is
electronic storage of a large amount of information by a business which is designed for
query and analysis instead of transaction processing. It is a process of transforming
data into information and making it available to users in a timely manner to make a
difference.
Applications:
1. Airline:
In the Airline system, it is used for operation purpose like crew assignment, analyses
of route profitability, frequent flyer program promotions, etc.
2. Banking:
It is widely used in the banking sector to manage the resources available on desk
effectively. Few banks also used for the market research, performance analysis of the
product and operations.
3. Healthcare:
Healthcare sector also used Data warehouse to strategize and predict outcomes,
generate patient's treatment reports, share data with tie-in insurance companies,
medical aid services, etc.
4. Public sector:
In the public sector, data warehouse is used for intelligence gathering. It helps
government agencies to maintain and analyze tax records, health policy records, for
every individual.
5. Investment and Insurance sector:
In this sector, the warehouses are primarily used to analyze data patterns, customer
trends, and to track market movements.
6. Retail chain:
In retail chains, Data warehouse is widely used for distribution and marketing. It
also helps to track items, customer buying pattern, promotions and also used for
determining pricing policy.
7. Telecommunication:
A data warehouse is used in this sector for product promotions, sales decisions and
to make distribution decisions.
8. Hospitality Industry:
This Industry utilizes warehouse services to design as well as estimate their

advertising and promotion campaigns where they want to target clients based on their
feedback and travel patterns.
18. Define OLAP?
 Stands for "Online Analytical Processing." OLAP allows users to analyze database
information from multiple database systems at one time. While relational databases are
considered to be two-dimensional, OLAP data is multidimensional, meaning the
information can be compared in many different ways. For example, a company might
compare their computer sales in June with sales in July, then compare those results with
the sales from another location, which might be stored in a different database
 In order to process database information using OLAP, an OLAP server is required to
organize and compare the information. Clients can analyze different sets of data using
functions built into the OLAP server. Some popular OLAP server software programs
include Oracle Express Server and Hyperion Solutions Essbase. Because of its powerful
data analysis capabilities, OLAP processing is often used for data mining, which aims to
discover new relationships between different sets of data.
19. OLTP VS OLAP
Parameters OLTP OLAP
Process It is an online transactional system. It OLAP is an online analysis and

manages database modification. data retrieving process.
Characteristic It is characterized by large numbers of It is characterized by a large

short online transactions. volume of data.
Functionality OLTP is an online database modifying OLAP is an online database

system. query management system.
Method OLTP uses traditional DBMS. OLAP uses the data warehouse.
Query Insert, Update, and Delete information Mostly select operations

from the database.
Usefulness It helps to control and run It helps with planning, problem-

fundamental business tasks. solving, and decision support.
Audience It is a market orientated process. It is a customer orientated

process.
20. Define metadata
Metadata is "data that provides information about other data". In other words, it is "data
about data". Many distinct types of metadata exist, including descriptive metadata,
structural metadata, administrative metadata, reference metadata, statistical metadata
and legal metadata.
21. Explain data, information and knowledge?
1. What is a data?
 The concept of data commonly referred to as ‘raw’ data – a collection of text, numbers
and symbols with no meaning. Data therefore has to be processed, or provided with a
context, before it can have meaning.
 Example
• 3, 6, 9, 12
• cat, dog, gerbil, rabbit
• 161.2, 175.3, 166.4, 164.7, 169.3
These are meaningless sets of data.
2. What is information?
Information is the result of processing data, usually by computer. This results in facts,
which enables the processed data to be used in context and have meaning. Information is
data that has meaning.
2.3.2 When does data become information?
Data on its own has no meaning. It only takes on meaning and becomes information
when it is interpreted. Data consists of raw facts and figures. When that data is processed
into sets according to context, it provides information. Data refers to raw input that when
processed or arranged makes meaningful output. Information is usually the processed
outcome of data. When data is processed into information, it becomes interpretable and
gains significance. In IT, symbols, characters, images, or numbers are data. These are the
inputs an IT system needs to process in order to produce a meaningful interpretation. In
other words, data in a meaningful form becomes information. Information can be about
facts, things, concepts, or anything relevant to the topic concerned. It may provide
answers to questions like who, which, when, why, what, and how. If we put Information
into an equation it would look like this:
Data + Meaning = Information
Example
Looking at the examples given for data:
• 3, 6, 9, 12
• cat, dog, gerbil, rabbit, cockatoo
• 161.2, 175.3, 166.4, 164.7, 169.3
Only when we assign a context or meaning does the data become information. It all
becomes meaningful when we are told:
• 3, 6, 9 and 12 are the first four answers in the 3 x table
• cat, dog, gerbil, rabbit, cockatoo is a list of household pets
• 161.2, 175.3, 166.4, 164.7, 169.3 are the heights of 15-year-old students.
3. What is knowledge?
When someone memorises information this is often referred to as ‘rote-learning’ or

‘learning by heart’. We can then say that they have acquired some knowledge. Another
form of knowledge is produced as a result of understanding information that has been
given to us, and using that information to gain knowledge of how to solve problems.
21. What Are Cubes?

A data cube stores data in a summarized version which helps in a faster analysis of data.
The data is stored in such a way that it allows reporting easily.E.g. using a data cube A
user may want to analyze weekly, monthly performance of an employee. Here, month
and week could be considered as the dimensions of the cube.
22. Mention Some Of The Data Mining Techniques?:

1. Statistics
2. Machine learning
3. Decision Tree
4. Hidden markov models
5. Artificial Intelligence
6. Genetic Algorithm
7. Meta learning
23. What is Dimension Table?

Dimension table is a table which contain attributes of measurements stored in fact tables. This
table consists of hierarchies, categories and logic that can be used to traverse in nodes.
24. What is fact table?
A fact table is a primary table in a dimensional model. Fact table contains measurements,
metrics, and facts about a business process
24. What is ETL?
ETL is abbreviated as Extract, Transform and Load. ETL is a software which is used to reads the
data from the specified data source and extracts a desired subset of data. Next, it transform the
data using rules and lookup tables and convert it to a desired state. Then, load function is used to
load the resulting data to the target database.
25. Explain data mart.
Data mart contains the subset of organization-wide data. This subset of data is valuable to
specific groups of an organization. In other words, we can say that a data mart contains data
specific to a particular group.
26. Accuracy, Error Rate, precision, Recall, Sensitivity, Specificity in multiclass classification
 Accuracy: It gives you the overall accuracy of the model, meaning the fraction of the
total samples that were correctly classified by the classifier. To calculate accuracy, use
the following formula: (TP+TN)/(TP+TN+FP+FN).
 Misclassification Rate: It tells you what fraction of predictions were incorrect. It is also
known as Classification Error. You can calculate it
using (FP+FN)/(TP+TN+FP+FN) or (1-Accuracy).
 Precision: It tells you what fraction of predictions as a positive class were actually
positive. To calculate precision, use the following formula: TP/(TP+FP).
 Recall: It tells you what fraction of all positive samples were correctly predicted as
positive by the classifier. It is also known as True Positive Rate (TPR), Sensitivity,
Probability of Detection. To calculate Recall, use the following formula: TP/(TP+FN).
 Specificity: It tells you what fraction of all negative samples are correctly predicted as
negative by the classifier. It is also known as True Negative Rate (TNR). To calculate
specificity, use the following formula: TN/(TN+FP).
27. Explain the techniques to improve the efficiency of apriori algorithm

 Hash based technique
 Transaction
 Reduction
 Portioning
 Sampling
 Dynamic item counting
28. In the context of data warehousing what is data transformation? `
In data transformation, the data are transformed or consolidated into forms appropriate
for mining. Data transformation can involve the following: Smoothing, Aggregation,
Generalization, Normalization, Attribute construction.
29. What Are The Steps Involved In Kdd( Knowledge Discovery in Database)
Process?
1. Data cleaning
2. Data Mining
3. Pattern Evaluation
4. Knowledge Presentation
5. Data Integration
6. Data Selection
7. Data Transformation
30. Difference between ROLAP and MOLAP

ROLAP MOLAP
ROLAP stands for Relational Online While MOLAP stands
Analytical Processing. for Multidimensional Online Analytical
Processing.
ROLAP is used for large data volumes. While it is used for limited data volumes.
The access of ROLAP is slow. While the access of MOLAP is fast.
In ROLAP, Data is stored in relation While in MOLAP, Data is stored in
tables. multidimensional array.
In ROLAP, Data is fetched from data- While in MOLAP, Data is fetched from
warehouse. MDDBs database.
In ROLAP, Complicated sql queries are While in MOLAP, Sparse matrix is used.
used.
In ROLAP, Static multidimensional view While in MOLAP, Dynamic
of data is created. multidimensional view of data is
created.
31. Expalin Classification and Regression?

Classification and Regression are two major prediction problems which are usually dealt
with Data mining and machine learning.
1. Classification is the process of finding or discovering a model or function which helps in

separating the data into multiple categorical classes i.e. discrete values. In classification, data is
categorized under different labels according to some parameters given in input and then the
labels are predicted for the data.
The derived mapping function could be demonstrated in the form of “IF-THEN” rules. The
classification process deal with the problems where the data can be divided into binary or
multiple discrete labels.
Let’s take an example, suppose we want to predict the possibility of the wining of match
by Team A on the basis of some parameters recorded earlier. Then there would be two
labels Yes and No.
Fig : Binary Classification and Multiclass Classification
2. Regression: Regression is the process of finding a model or function for distinguishing the
data into continuous real values instead of using classes or discrete values. It can also identify
the distribution movement depending on the historical data. Because a regression predictive
model predicts a quantity, therefore, the skill of the model must be reported as an error in those
predictions
Let’s take a similar example in regression also, where we are finding the possibility of rain in
some particular regions with the help of some parameters recorded earlier. Then there is a
probability associated with the rain.

Oral Questions LP-II: Star Schema

Uploaded by

Copyright:

Available Formats

Oral Questions LP-II: Star Schema

Uploaded by

Document Information

Original Title

Copyright

Available Formats

Share this document

Share or Embed Document

Sharing Options

Did you find this document useful?

Is this content inappropriate?

Copyright:

Available Formats

Oral Questions LP-II: Star Schema

Uploaded by

Copyright:

Available Formats

Data Mining

Oral Questions LP-II

2. Explain Star Schema vs. Snowflake Schema

Sr. No Star Schema Snowflake Schema

3. Mention what is the responsibility of a Data analyst?

4. Define Data mining? List out the applications of data mining.

List out the applications of data mining.

5. Mention what is data cleaning?

7. What is difference between Supervised and Unsupervised Learning?

Parameters Supervised machine learning Unsupervised machine learning

8. What are difference between Kmeans and KNN Algorithm?

 KNN Algorithm is based on feature similarity and K-means refers to the division of

 KNN is a classification technique and K-means is a clustering technique.

“dissimilar” to the objects belonging to another cluster.

9. What is Euclidean distance? Explain with Suitable example?

distance between them is

10. What are different types of Clustering?

11. Explain Association Rule

If a customer buys bread, he’s 70% likely of buying milk.”

• Association rules are created by thoroughly analyzing data and looking

• Confidence: Confidence tells about the number of times these

• So, in a given transaction with multiple items, Association Rule Mining

• Rule Evaluation Metrics –

2 Bread, Diaper, Beer, Eggs

3 Milk, Diaper, Beer, Cake

4 Bread, Milk, Diaper, Beer

5 Bread, Milk, Diaper, Cake

1. { Milk, Diaper} I want to check association of milk with diaper

How many times I buy milk alongwith Diaper

12. What is Market Basket Analysis? Explain with suitable example?

 Market basket analysis also applies to bricks-and-mortar stores. If analysis showed

13. Who propose A-Priori algorithm?

Apriori algorithm was the first algorithm that was proposed for frequent itemset

14. Explain a-priori algorithm

T-Id Item Set

Given: Minimum Support = 60%

Item Set Support Count

• Now find out minimum Support

3.9.2 L1 Support Count

Item Set Support Count

Item Set Support Count

3.9.4 L2 Support Count

Item Set Support Count

Table 3.5: L2 Support Count

• After satisfied minimum support criteria

• Make Pair to generate C3

3.9.5 Candidate Table C3

Item Set Support count

Table 3.6: Candidate Table C3

3.9.6 L3 Support Count

Now again compare the item set with min support 3

Item Set Support Count

Association Rule Support Confidence Confidence (%)

O^K⇒E 3 3/3 = 1 1*100=100

K^E⇒O 3 3/4 = 0.75 0.75*100=75

E⇒ O ^ K 3 3/4 = 0.75 0.75*100=75

K⇒ O ^ E 3 3/5 = 0.6 0.6*100=60

O⇒ K ^ E 3 3/4 = 0.75 0.75*100=75

• Compare this with Minimum Confidence=80%