Bi Ut2 Answers
Ans:
Data pre-processing is a critical step in data analysis and machine learning, which involves
preparing raw data to be suitable for analysis by cleaning, transforming, and reducing its
complexity. The need for data pre-processing arises from the fact that raw data is often
messy, inconsistent, and unstructured, which can lead to incorrect or biased results if not
properly processed.
1. Data cleaning: Raw data may contain missing values, outliers, and errors that need to be
identified and removed to avoid bias and inconsistencies in the analysis.
2. Data transformation: The format of the raw data may not be suitable for analysis, and may
need to be transformed into a more useful format, such as converting categorical variables
into numerical values.
3. Data integration: Raw data may come from different sources and need to be integrated
into a single dataset for analysis.
4. Data reduction: Raw data may contain too many variables or observations, which can lead
to overfitting, slow analysis, and decreased accuracy. Data reduction techniques can be
used to reduce the complexity of the data while preserving its important features. Common
data reduction techniques include:
1. Sampling: Sampling techniques involve selecting a representative subset of the data for
analysis, which can reduce the complexity of the data and speed up the analysis.
2. Feature selection: Feature selection involves identifying the most important features in the
data and selecting only those features for analysis, while discarding the rest.
3. Principal Component Analysis (PCA): PCA is a statistical technique that reduces the
dimensionality of the data by identifying the most important features and creating new
variables that capture the majority of the variation in the data.
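As an illustration, here is a minimal PCA sketch using scikit-learn; the dataset and the choice of two components are assumptions made purely for demonstration.

# Minimal PCA sketch (illustrative; the data and n_components are assumed)
import numpy as np
from sklearn.decomposition import PCA

# Hypothetical dataset: 6 observations with 4 correlated features
X = np.array([
    [2.5, 2.4, 1.2, 0.5],
    [0.5, 0.7, 0.3, 0.1],
    [2.2, 2.9, 1.1, 0.6],
    [1.9, 2.2, 0.9, 0.4],
    [3.1, 3.0, 1.5, 0.7],
    [2.3, 2.7, 1.0, 0.5],
])

pca = PCA(n_components=2)             # keep the 2 components with the most variance
X_reduced = pca.fit_transform(X)      # reduced data, shape (6, 2)
print(pca.explained_variance_ratio_)  # share of variance captured by each component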
Overall, data preprocessing is essential to ensure that the data used for analysis is clean,
consistent, and properly formatted, which can lead to more accurate and reliable results.
Data reduction techniques can be particularly useful for reducing the complexity of the data
while preserving its important features.
Q1. [B] Apply Binning mean technique for given data for book price (in
dollars), for smoothing the noise 15, 21, 21, 29, 34, 24, 4, 8, 9, 25, 26, 28.
Ans:
Step 1: Sort the given data
Sorted List: 4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34
Step 2: Partition the data into equal-frequency bins of size 4
Bin1 = 4, 8, 9, 15
Bin2 = 21, 21, 24, 25
Bin3 = 26, 28, 29, 34
Step 3: Calculate the arithmetic mean of each bin
Bin1 = 9
Bin2 = 22.75
Bin3 = 29.25
Step 4: Replace each value in a bin with the bin's arithmetic mean
Bin1 = 9, 9, 9, 9
Bin2 = 22.75, 22.75, 22.75, 22.75
Bin3 = 29.25, 29.25, 29.25, 29.25
After applying the Binning mean technique, the resulting data points are 9, 9, 9, 9, 22.75,
22.75, 22.75, 22.75, 29.25, 29.25, 29.25, 29.25.
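The same smoothing can be expressed as a short Python sketch; it assumes plain lists and no particular library, and simply mirrors the four steps above.

# Equal-frequency binning with mean smoothing (mirrors the worked example)
prices = [15, 21, 21, 29, 34, 24, 4, 8, 9, 25, 26, 28]
prices.sort()                     # Step 1: sort the data
bin_size = 4                      # Step 2: equal-frequency bins of size 4

smoothed = []
for i in range(0, len(prices), bin_size):
    bin_values = prices[i:i + bin_size]
    mean = sum(bin_values) / len(bin_values)             # Step 3: mean of each bin
    smoothed.extend([round(mean, 2)] * len(bin_values))  # Step 4: replace values
print(smoothed)  # [9.0, 9.0, 9.0, 9.0, 22.75, 22.75, 22.75, 22.75, 29.25, 29.25, 29.25, 29.25]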
An outlier is a data object that deviates significantly from the rest of the objects, as if it were
generated by a different mechanism. For ease of presentation, data objects that are not
outliers can be defined as "normal" or expected data, and outliers as "abnormal" data.
Outliers are data objects that cannot be grouped into any class or cluster; their behaviour
differs from the general behaviour of the other data objects. Analysing this kind of data can
be important for mining knowledge.
Outlier detection poses several challenges:
1. Modeling normal objects and outliers effectively − Outlier detection quality depends largely
on how normal (non-outlier) objects and outliers are modeled. This is difficult partly because
it is hard to enumerate all possible normal behaviours in an application.
The border between data normality and abnormality (outliers) is often not clear-cut; instead,
there can be a wide grey area. Consequently, while some outlier detection techniques assign
each object in the input data set a label of either "normal" or "outlier", other approaches
assign each object a score measuring its "outlier-ness".
2. Handling noise in outlier detection − Outliers are different from noise. The quality of real
data sets tends to be poor, and noise unavoidably exists in data collected in many
applications. Noise may appear as deviations in attribute values or even as missing values.
Low data quality and the presence of noise pose a huge challenge to outlier detection. They
can distort the data, blurring the distinction between normal objects and outliers.
Furthermore, noise and missing values can "hide" outliers and reduce the effectiveness of
outlier detection: an outlier may appear "disguised" as a noise point, and an outlier detection
method may erroneously identify a noise point as an outlier.
3. Understandability − In some applications, a user may want not only to detect outliers but
also to understand why the detected objects are outliers. To meet this understandability
requirement, an outlier detection method has to provide some justification for the detection.
For instance, in a statistical approach, the degree to which an object is an outlier can be
justified by the likelihood that the object was generated by the same mechanism that
generated the majority of the data. The smaller the likelihood, the less likely the object was
produced by that mechanism, and the more plausible it is that the object is an outlier.
______________________________________________
Q2. [A] Interpret the need of data pre-processing. Explain various data
transformation techniques.
Ans:
Data pre-processing is a crucial step in data analysis and machine learning that involves
preparing raw data for analysis by cleaning, transforming, and reducing its complexity. The
need for data pre-processing arises from the fact that raw data is often messy, inconsistent,
and unstructured, which can lead to incorrect or biased results if not properly processed.
Data transformation is a vital aspect of data pre-processing that involves converting raw data
into a more suitable format for analysis. There are various data transformation techniques
used in data pre-processing, some of which include:
1. Standardization
Standardizing or normalizing the data refers to scaling the data values to a much smaller
range such as [-1, 1] or [0.0, 1.0]. There are different methods to normalize the data, as
discussed below.
Consider a numeric attribute A with n observed values V1, V2, V3, …, Vn.
a) Min-max normalization: This method applies a linear transformation to the original data.
Suppose minA and maxA are the minimum and maximum values observed for attribute A,
and Vi is the value of attribute A to be normalized into a new range [new_minA, new_maxA].
The normalized value is:
Vi' = ((Vi - minA) / (maxA - minA)) * (new_maxA - new_minA) + new_minA
For example, suppose $12,000 and $98,000 are the minimum and maximum values for the
attribute income, and [0.0, 1.0] is the range into which we have to map the value $73,600.
The normalized value is (73,600 - 12,000) / (98,000 - 12,000) * (1.0 - 0.0) + 0.0 ≈ 0.716.
b) Z-score normalization: This method normalizes the values of attribute A using the mean
and standard deviation of A. The following formula is used for Z-score normalization:
Vi' = (Vi - meanA) / std_devA
For example, suppose the mean and standard deviation for attribute A (income) are $54,000
and $16,000, and we have to normalize the value $73,600 using z-score normalization:
(73,600 - 54,000) / 16,000 = 1.225.
c) Decimal scaling: This method normalizes the values of attribute A by moving the decimal
point. The number of places the decimal point is moved depends on the maximum absolute
value of A: Vi' = Vi / 10^j, where j is the smallest integer such that max(|Vi'|) < 1.
For example, the observed values for attribute A range from -986 to 917, and the maximum
absolute value for attribute A is 986. Here, to normalize each value of attribute A using
decimal scaling, we have to divide each value of attribute A by 1000, i.e., j=3.
So, the value -986 would be normalized to -0.986, and 917 would be normalized to 0.917.
The normalization parameters, such as the mean, standard deviation, and maximum absolute
value, must be preserved so that future data can be normalized consistently (a short code
sketch of these methods follows).
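Below is a small sketch of the three normalization methods, using the income values from the examples above; the helper function names are just for illustration.

# Min-max, z-score, and decimal-scaling normalization (illustrative sketch)
def min_max(v, min_a, max_a, new_min=0.0, new_max=1.0):
    return (v - min_a) / (max_a - min_a) * (new_max - new_min) + new_min

def z_score(v, mean_a, std_a):
    return (v - mean_a) / std_a

def decimal_scaling(v, j):
    return v / (10 ** j)

print(round(min_max(73600, 12000, 98000), 3))  # 0.716
print(z_score(73600, 54000, 16000))            # 1.225
print(decimal_scaling(-986, 3))                # -0.986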
2. Log transformation: Log transformation involves taking the logarithm of numerical values.
This is useful when the data is highly skewed and the distribution is not normal. Log
transformation can help normalize the data and make it more suitable for analysis.
[Figure: the effect of log transformation on a skewed distribution]
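A minimal sketch of a log transformation on a right-skewed variable; numpy's log1p (log of 1 + x) is used so that zero values do not cause errors, and the data values are assumed.

import numpy as np

# Hypothetical right-skewed values (e.g., incomes)
values = np.array([200, 300, 400, 800, 1500, 9000, 25000])
log_values = np.log1p(values)   # compresses large values, spreads out small ones
print(log_values.round(2))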
3. Feature Extraction: Feature extraction involves creating new features from existing ones.
This is useful when the existing features are not informative enough or when the machine
learning algorithm benefits from additional or lower-dimensional features. Common
techniques include polynomial features, interaction features, and dimensionality reduction.
4. Encoding: Encoding involves converting categorical variables into numerical values. This
is important because most machine learning algorithms cannot handle categorical variables
directly. Encoding techniques include one-hot encoding, ordinal encoding, and binary
encoding.
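A short pandas sketch of one-hot and ordinal encoding; the column names and categories are assumptions for illustration.

import pandas as pd

df = pd.DataFrame({
    "city": ["Pune", "Mumbai", "Pune", "Nagpur"],   # nominal attribute
    "size": ["small", "large", "medium", "small"],  # ordinal attribute
})

# One-hot encoding: one binary column per category
one_hot = pd.get_dummies(df["city"], prefix="city")

# Ordinal encoding: map each category to an integer that preserves its order
size_order = {"small": 0, "medium": 1, "large": 2}
df["size_encoded"] = df["size"].map(size_order)

print(one_hot)
print(df)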
5. Data Aggregation:
Data aggregation is the process of gathering data and presenting it in a summarized format.
The data may be obtained from multiple sources and integrated into a single summary for
analysis. This is a crucial step, since the accuracy of the analysis insights depends heavily
on the quantity and quality of the data used. Aggregated data of sufficient quality and
quantity supports decisions about financing, business strategy, pricing, operations, and
marketing.
For example, we have a data set of sales reports of an enterprise that has quarterly sales of
each year. We can aggregate the data to get the enterprise's annual sales report.
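A minimal pandas sketch of the quarterly-to-annual aggregation described above; the sales figures are made up for illustration.

import pandas as pd

quarterly = pd.DataFrame({
    "year":    [2022, 2022, 2022, 2022, 2023, 2023, 2023, 2023],
    "quarter": ["Q1", "Q2", "Q3", "Q4", "Q1", "Q2", "Q3", "Q4"],
    "sales":   [120, 150, 130, 180, 140, 160, 155, 200],
})

# Aggregate quarterly sales into an annual sales report
annual = quarterly.groupby("year")["sales"].sum()
print(annual)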
Q2. [B] Apply Z-score normalization method to normalize income value
$73,600. Suppose for the attribute income the mean value is $54,000 and
the standard deviation of $16,000.
Ans:
Given values:
Mean income = 54,000
Standard Deviation = 16,000
V = 73,600
Calculation:
Z-score = (73600 - 54000) / 16000
= 19600 / 16000
Z-score = 1.225
Q2. [C] Demonstrate the various outlier detection methods with suitable
example.
Ans:
Outlier detection is an important technique in data analysis that involves identifying
observations or data points that deviate significantly from the rest of the data. Outliers can
occur due to measurement errors, data entry errors, or genuine extreme values. In this
section, I will demonstrate some common outlier detection methods with suitable examples.
1. Z-score method:
The Z-score method is a statistical technique that involves calculating the standard deviation
from the mean of a dataset and identifying observations that fall outside a certain threshold.
Observations that fall outside the threshold are considered outliers. For example, suppose
we have a dataset of exam scores, and we want to identify outliers. We can use the Z-score
method to identify students who scored significantly higher or lower than the average score.
2. Tukey's method:
Tukey's method is a technique used to identify outliers based on the interquartile range
(IQR). The IQR is calculated as the difference between the 75th percentile (Q3) and the 25th
percentile (Q1) of the dataset. According to Tukey's method, any observation that falls
outside the range [Q1 - 1.5 × IQR, Q3 + 1.5 × IQR] is considered an outlier. For example,
suppose we have a dataset of salaries and we want to identify outliers. We can use Tukey's
method to identify employees whose salaries are significantly higher or lower than the
median salary (see the sketch after the box-plot method below).
3. Box Plot:
The box plot, also known as the box-and-whisker plot, is a graphical representation of a
dataset that displays the median, quartiles, and extreme values. The box plot is useful for
identifying outliers because it displays the range of the data, the median, and the quartiles,
which can help identify values that fall significantly outside of the typical range.
In a box plot, outliers are identified as individual data points that are located outside the
whiskers of the box. The whiskers of the box plot represent the range of the data, and
observations that fall outside the whiskers are considered outliers. Outliers are usually
plotted as individual data points beyond the whiskers, although some box plots may use
other symbols to indicate outliers.
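Here is a small sketch of the Z-score and Tukey (IQR) methods described above, applied to a made-up list of salaries (in thousands of dollars).

import numpy as np

salaries = np.array([42, 45, 47, 50, 52, 55, 58, 60, 62, 250])  # 250 is an obvious outlier

# Z-score method: flag values more than 2 standard deviations from the mean
# (common thresholds are 2 or 3, depending on the application)
z = (salaries - salaries.mean()) / salaries.std()
z_outliers = salaries[np.abs(z) > 2]

# Tukey's method: flag values outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR]
q1, q3 = np.percentile(salaries, [25, 75])
iqr = q3 - q1
tukey_outliers = salaries[(salaries < q1 - 1.5 * iqr) | (salaries > q3 + 1.5 * iqr)]

print(z_outliers)      # [250]
print(tukey_outliers)  # [250]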
________________________________________________
Q3. [A] Demonstrate different data mining techniques. List different data
mining tools.
Ans:
Data mining refers to the process of extracting useful information from large datasets. There
are many different data mining techniques available, and the choice of technique depends
on the type of data and the specific problem being addressed. Here are some examples of
data mining techniques:
1. Clustering:
Clustering involves grouping similar data points together based on their attributes. This
technique is useful for identifying patterns and relationships in data. Examples of clustering
algorithms include k-means, hierarchical clustering, and DBSCAN.
2. Classification:
Classification involves assigning a label or category to a data point based on its attributes.
This technique is useful for making predictions and identifying patterns in data. Examples of
classification algorithms include decision trees, logistic regression, and support vector
machines (SVMs).
3. Regression analysis:
Regression analysis involves identifying the relationship between two or more variables in a
dataset. This technique is useful for making predictions and identifying patterns in data.
Examples of regression algorithms include linear regression, polynomial regression, and
logistic regression.
4. Apriori Algorithm:
The Apriori algorithm is a popular data mining technique used for identifying frequent
itemsets and association rules in large datasets. It is commonly used in market basket
analysis, where the goal is to identify the co-occurrence of items in transactions, such as in a
retail store. The Apriori algorithm works by scanning the dataset multiple times to identify
frequent itemsets, which are sets of items that occur together in a transaction above a
specified minimum support threshold. The algorithm uses a bottom-up approach.
5. Anomaly detection:
Anomaly detection involves identifying data points that are significantly different from the rest
of the dataset. This technique is useful for identifying outliers and detecting fraudulent
activity. Examples of anomaly detection algorithms include Isolation Forest and Local Outlier
Factor (LOF).
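As an illustration of anomaly detection, here is a minimal sketch using scikit-learn's Isolation Forest; the data, the injected anomalies, and the contamination value are all assumptions.

import numpy as np
from sklearn.ensemble import IsolationForest

# Mostly "normal" 2-D points plus two obvious anomalies
rng = np.random.RandomState(42)
X = np.vstack([rng.normal(0, 1, size=(100, 2)),
               [[8, 8], [-9, 7]]])

model = IsolationForest(contamination=0.02, random_state=42)
labels = model.fit_predict(X)  # -1 = anomaly, 1 = normal
print(X[labels == -1])         # should include the two injected points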
Some commonly used data mining tools are:
1. RapidMiner: RapidMiner is an open-source data mining software that offers a wide range
of data mining techniques and algorithms. It has a user-friendly interface and can handle
large datasets.
2. WEKA: WEKA is a collection of machine learning algorithms for data mining tasks. It is
open-source software and provides a graphical user interface for data analysis.
3. KNIME: KNIME is an open-source data analytics platform that allows users to create
visual workflows for data mining and analysis. It offers a wide range of tools and techniques
for data mining and analysis.
4. SAS: SAS is a commercial data mining software that provides advanced analytics and
data visualization tools. It is widely used in industries such as finance and healthcare.
5. IBM SPSS: IBM SPSS is a commercial data mining software that provides advanced
statistical analysis and predictive modeling tools. It is widely used in industries such as
market research and social sciences.
These are just a few examples of data mining techniques and tools, and there are many
others available depending on the specific problem being addressed.
Ans:
The decision tree algorithm is a popular machine learning technique used in Business
Intelligence (BI) for predictive modeling, data classification, and data exploration.
Here are a few ways that decision trees can be used in a BI perspective:
1. Predictive Modeling: Decision trees can be used to build predictive models for a variety of
applications. For example, a company may use a decision tree to predict which customers
are most likely to churn or which products are most likely to be returned. These predictions
can help businesses make data-driven decisions, such as developing targeted marketing
campaigns or improving product quality.
2. Data Classification: Decision trees can be used to classify data into different categories
based on their attributes. For example, a company may use a decision tree to classify
customer demographics into different segments based on age, income, and buying behavior.
This information can be used to develop targeted marketing campaigns that are tailored to
each segment.
3. Data Exploration: Decision trees can be used to explore data and identify patterns and
relationships between different attributes. For example, a company may use a decision tree
to explore the relationship between product features and customer satisfaction. This
information can be used to improve product design and customer service.
As an example, consider using a decision tree to predict which products are most likely to be
returned. The first step is to preprocess the data by cleaning, transforming, and selecting
relevant features; in this case, we might select features such as product category, price, and
customer age.
Next, we would use the decision tree algorithm to build a model that can predict which
products are most likely to be returned based on these features. The decision tree algorithm
would evaluate each feature to determine which one has the greatest impact on the outcome
(i.e., whether or not a product is returned). The algorithm would then split the dataset into
branches based on this feature, and continue evaluating each branch until it reaches a leaf
node, which represents a final decision.
Finally, we would use the decision tree model to make predictions on new data, such as
upcoming product releases. The model would use the features of each new product to
determine the likelihood of returns, which can help the company make data-driven decisions
about product design and marketing strategies.
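A hedged sketch of the product-return example with scikit-learn's DecisionTreeClassifier; the feature values and labels below are entirely made up.

from sklearn.tree import DecisionTreeClassifier

# Hypothetical training data: [product_category_code, price, customer_age]
X = [
    [0, 20.0, 25], [0, 35.0, 40], [1, 120.0, 31], [1, 90.0, 55],
    [2, 15.0, 22], [2, 18.0, 34], [0, 200.0, 29], [1, 250.0, 45],
]
y = [0, 0, 1, 0, 0, 0, 1, 1]   # 1 = product was returned, 0 = product was kept

tree = DecisionTreeClassifier(max_depth=3, random_state=0)
tree.fit(X, y)

# Estimate the return risk of a new (hypothetical) product/customer combination
print(tree.predict([[1, 230.0, 30]]))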
Ans:
Naive Bayes algorithm can be used in a business intelligence (BI) context to classify data
into different categories based on the probability of each category.
Here's an example of how it can be used:
Suppose a company wants to classify customer reviews of its products into different
sentiment categories (positive, neutral, or negative) to gain insights into customer
satisfaction. The company has collected a large dataset of customer reviews, each labeled
as positive, neutral, or negative.
To use the Naive Bayes algorithm for BI, the company would first preprocess the data by
cleaning and transforming the text data into numerical features. For example, it might
convert the text into bag-of-words representations, where each word is treated as a separate
feature.
Next, the company would use the Naive Bayes algorithm to train a model that can classify
new reviews into sentiment categories based on the probability of each category. The
algorithm would calculate the conditional probability of each word given each sentiment
category, and the prior probability of each category based on the frequency of each category
in the training data.
Once the Naive Bayes model has been trained, the company can use it to classify new
customer reviews into sentiment categories. For example, if a new review contains words
such as "excellent," "amazing," and "fantastic," the model would classify it as positive. If a
review contains words such as "okay," "average," and "adequate," the model would classify it
as neutral. And if a review contains words such as "terrible," "horrible," and "awful," the
model would classify it as negative.
By classifying customer reviews into sentiment categories, the company can gain insights
into customer satisfaction and identify areas for improvement. For example, if the model
reveals that customers are frequently mentioning specific product features in negative
reviews, the company can make data-driven decisions to improve those features.
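A minimal sketch of the review-sentiment example using a bag-of-words representation and scikit-learn's Multinomial Naive Bayes; the reviews and labels are invented for illustration.

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

reviews = [
    "excellent amazing fantastic product",
    "terrible horrible awful experience",
    "okay average adequate item",
    "amazing quality fantastic value",
    "awful quality terrible support",
]
labels = ["positive", "negative", "neutral", "positive", "negative"]

vectorizer = CountVectorizer()      # bag-of-words features
X = vectorizer.fit_transform(reviews)

model = MultinomialNB()
model.fit(X, labels)

new_review = vectorizer.transform(["fantastic product and amazing support"])
print(model.predict(new_review))    # expected: ['positive']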
_________________________________________________
Q4. [A] Interpret the various criteria on the basis of which classification
and prediction can be compared.
Ans:
Classification and prediction are both important tasks in machine learning, and there are
various criteria on which they can be compared. Here are some of the most important
criteria:
1. Accuracy: One of the most important criteria for comparing classification and prediction
models is accuracy. Accuracy measures how well the model performs at correctly classifying
or predicting the target variable. A higher accuracy score indicates a better-performing
model.
2. Precision and Recall: Precision and recall are two metrics commonly used to evaluate
classification models. Precision = TP / (TP + FP) measures the proportion of correctly
identified positive cases among all cases predicted as positive, while Recall = TP / (TP + FN)
measures the proportion of actual positive cases that were correctly identified. Both metrics
evaluate how well a model identifies cases in a given category.
3. F1 Score: The F1 score is a measure of a model's accuracy that takes both precision and
recall into account. It is the harmonic mean of the two, F1 = 2 × (Precision × Recall) /
(Precision + Recall), and provides a single score that summarizes the overall performance of
the model.
4. AUC-ROC Curve: The Area Under the Receiver Operating Characteristic (AUC-ROC)
curve is a measure of a model's ability to distinguish between positive and negative cases. It
is calculated by plotting the true positive rate against the false positive rate for different
threshold values and calculating the area under the resulting curve. A higher AUC-ROC
score indicates a better-performing model.
By considering these criteria, researchers and decision-makers can choose the best model
for a given task and make data-driven decisions to improve performance.
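A short sketch computing the metrics above with scikit-learn; the true labels, predicted labels, and predicted scores are made up.

from sklearn.metrics import (accuracy_score, precision_score, recall_score,
                             f1_score, roc_auc_score)

y_true  = [1, 0, 1, 1, 0, 1, 0, 0, 1, 0]
y_pred  = [1, 0, 1, 0, 0, 1, 1, 0, 1, 0]   # hard class predictions
y_score = [0.9, 0.2, 0.8, 0.4, 0.1, 0.7, 0.6, 0.3, 0.95, 0.35]  # predicted probabilities

print("Accuracy :", accuracy_score(y_true, y_pred))
print("Precision:", precision_score(y_true, y_pred))
print("Recall   :", recall_score(y_true, y_pred))
print("F1 score :", f1_score(y_true, y_pred))
print("AUC-ROC  :", roc_auc_score(y_true, y_score))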
Ans:
Clustering is a popular unsupervised machine learning technique used to group similar data
points together based on certain characteristics. There are several types of clustering
algorithms, including K-Means Clustering, K Medoids Clustering, Hierarchical Clustering,
Linkage Clustering, and Agglomerative Clustering.
1. K-Means Clustering:
K-Means Clustering is one of the most popular clustering algorithms used in machine
learning. It involves partitioning a dataset into K clusters, where K is a predetermined
number. The algorithm starts by randomly selecting K centroids and assigning each data
point to the closest centroid. The centroids are then recalculated based on the mean of the
data points in each cluster, and the process is repeated until the centroids no longer move or
the maximum number of iterations is reached (see the sketch after this list).
2. K Medoids Clustering:
K Medoids Clustering is similar to K-Means Clustering, but instead of using the mean of the
data points in each cluster, it uses the most centrally located point (medoid) as the cluster
representative. The algorithm starts by randomly selecting K medoids and assigning each
data point to the closest medoid. The medoids are then recalculated based on the sum of
the dissimilarity between each data point and its assigned medoid, and the process is
repeated until the medoids no longer move or the maximum number of iterations is reached.
3. Hierarchical Clustering:
Hierarchical Clustering builds a hierarchy of clusters rather than a single flat partition. It can
proceed bottom-up (agglomerative), by repeatedly merging the closest clusters, or top-down
(divisive), by repeatedly splitting clusters. The result is usually visualized as a dendrogram,
which can be cut at a chosen level to obtain the desired number of clusters.
4. Linkage Clustering:
Linkage Clustering is a type of hierarchical clustering that involves calculating the distance
between all pairs of data points and then iteratively merging the two closest points or
clusters until all data points belong to a single cluster. The distance between clusters can be
calculated using different methods, such as single linkage, complete linkage, or average
linkage.
5. Agglomerative Clustering:
Agglomerative Clustering is a type of hierarchical clustering that starts with each data point
as a separate cluster and iteratively merges the closest pairs of clusters until all data points
belong to a single cluster. The distance between clusters can be calculated using different
methods, such as single linkage, complete linkage, or average linkage.
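A minimal K-Means sketch with scikit-learn, as referenced in point 1 above; the 2-D points and the choice of K = 2 are assumptions.

import numpy as np
from sklearn.cluster import KMeans

# Two visually obvious groups of 2-D points
X = np.array([[1, 2], [1, 4], [2, 3],
              [8, 8], [9, 10], [10, 9]])

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0)
labels = kmeans.fit_predict(X)
print(labels)                   # cluster index assigned to each point
print(kmeans.cluster_centers_)  # final centroids (the mean of each cluster)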
Ans:
Let's say that a retail store wants to analyze the buying patterns of its customers to identify
items that are frequently purchased together. The store has a database of all the
transactions made by its customers, which contains the items purchased by each customer.
Here are the steps to apply an association rule algorithm to this example:
Data preparation: The transaction data needs to be cleaned and formatted in a way that is
suitable for analysis. This involves removing any duplicate transactions and converting the
data into a format that can be easily read by the algorithm.
Itemset identification: The next step is to identify the itemsets, i.e., sets of items that are
frequently purchased together. This involves calculating the frequency of occurrence of each
item and identifying the itemsets that occur above a certain threshold frequency.
Rule generation: Once the itemsets have been identified, the next step is to generate
association rules. These rules specify the likelihood of a particular item being purchased
given the purchase of another item. For example, if customers who buy bread are also likely
to buy butter, then the association rule could be "if bread is purchased, then butter is also
likely to be purchased."
Rule evaluation: The final step is to evaluate the generated association rules to determine
their usefulness. This involves calculating metrics such as support, confidence, and lift,
which indicate the strength of the association between items (these metrics are computed in
the sketch at the end of this answer).
Based on the generated association rules, the retail store can create targeted promotional
offers that bundle the frequently purchased items together. For example, if the association
rule "if bread is purchased, then butter is also likely to be purchased" has a high confidence
level, the store can create a promotional offer that bundles bread and butter together.
This can help the store to increase its sales and improve customer satisfaction by providing
relevant and useful promotional offers. Additionally, the store can use the results of the
association algorithm to optimize its product placement strategy and to identify cross-selling
opportunities.
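As referenced above, here is a small sketch that computes support, confidence, and lift for the rule "if bread is purchased, then butter is also purchased", using a made-up list of transactions.

transactions = [
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"bread", "jam"},
    {"butter", "milk"},
    {"bread", "butter", "jam"},
    {"milk", "eggs"},
]

n = len(transactions)
bread        = sum("bread" in t for t in transactions)
butter       = sum("butter" in t for t in transactions)
bread_butter = sum({"bread", "butter"} <= t for t in transactions)

support    = bread_butter / n           # P(bread and butter) = 3/6 = 0.50
confidence = bread_butter / bread       # P(butter | bread)   = 3/4 = 0.75
lift       = confidence / (butter / n)  # 0.75 / (4/6) = 1.125 (> 1: positive association)
print(support, confidence, lift)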
_________________________________________________