K-Nearest Neighbor is one of the most commonly used classifiers based on lazy learning. It is widely used in recommendation systems and document-similarity measures. It mainly uses Euclidean distance to measure the similarity between two data points.
Support Vector Machine ppt presentation - AyanaRukasar
The support vector machine (SVM) is a supervised machine learning algorithm used for both classification and regression problems, though it is primarily used for classification. The goal of SVM is to find the best decision boundary, known as a hyperplane, that separates clusters of data points. It chooses extreme data points as support vectors to define the hyperplane. SVM handles problems that are not linearly separable by transforming them into higher-dimensional spaces. It works well when there is a clear margin of separation between classes and is effective for high-dimensional data. An example use case in Python is presented.
In machine learning, support vector machines (SVMs, also support vector networks[1]) are supervised learning models with associated learning algorithms that analyze data and recognize patterns, used for classification and regression analysis. The basic SVM takes a set of input data and predicts, for each given input, which of two possible classes forms the output, making it a non-probabilistic binary linear classifier.
The document discusses the K-nearest neighbor (K-NN) classifier, a machine learning algorithm where data is classified based on its similarity to its nearest neighbors. K-NN is a lazy learning algorithm that assigns data points to the most common class among its K nearest neighbors. The value of K impacts the classification, with larger K values reducing noise but possibly oversmoothing boundaries. K-NN is simple, intuitive, and can handle non-linear decision boundaries, but has disadvantages such as computational expense and sensitivity to K value selection.
Support Vector Machine - How Support Vector Machine works | SVM in Machine Le... - Simplilearn
This document discusses support vector machines (SVM) and provides an example of using SVM for classification. It begins with common applications of SVM, such as face detection and image classification. It then provides an overview of SVM, explaining how it finds the optimal separating hyperplane between two classes by maximizing the margin between them. An example demonstrates SVM by classifying people as male or female based on height and weight data. It also discusses how kernels can be used to handle non-linearly separable data. The document concludes by showing an implementation of SVM on a zoo dataset to classify animals as crocodiles or alligators.
Decision tree is a type of supervised learning algorithm (having a pre-defined target variable) that is mostly used in classification problems. It is a tree in which each branch node represents a choice between a number of alternatives, and each leaf node represents a decision.
Machine Learning With Logistic Regression - Knoldus Inc.
Machine learning is the subfield of computer science that gives computers the ability to learn without being explicitly programmed. Logistic regression is a classification algorithm that builds on linear regression to evaluate the output and minimize the error.
This document discusses the K-nearest neighbors (KNN) algorithm, an instance-based learning method used for classification. KNN works by identifying the K training examples nearest to a new data point and assigning the most common class among those K neighbors to the new point. The document covers how KNN calculates distances between data points, chooses the value of K, handles feature normalization, and compares strengths and weaknesses of the approach. It also briefly discusses clustering, an unsupervised learning technique where data is grouped based on similarity.
K Means Clustering Algorithm | K Means Clustering Example | Machine Learning ... - Simplilearn
This K-Means clustering algorithm presentation will take you through an introduction to machine learning, types of clustering algorithms, k-means clustering, how K-Means clustering works, and finally explain K-Means clustering with a real-life use case. This machine learning algorithm tutorial is ideal for beginners who want to learn how K-Means clustering works.
Below topics are covered in this K-Means Clustering Algorithm presentation:
1. Types of Machine Learning?
2. What is K-Means Clustering?
3. Applications of K-Means Clustering
4. Common distance measure
5. How does K-Means Clustering work?
6. K-Means Clustering Algorithm
7. Demo: k-Means Clustering
8. Use case: Color compression
The document discusses gradient descent methods for unconstrained convex optimization problems. It introduces gradient descent as an iterative method to find the minimum of a differentiable function by taking steps proportional to the negative gradient. It describes the basic gradient descent update rule and discusses convergence conditions such as Lipschitz continuity, strong convexity, and condition number. It also covers techniques like exact line search, backtracking line search, coordinate descent, and steepest descent methods.
Classification techniques in data mining - Kamal Acharya
The document discusses classification algorithms in machine learning. It provides an overview of various classification algorithms including decision tree classifiers, rule-based classifiers, nearest neighbor classifiers, Bayesian classifiers, and artificial neural network classifiers. It then describes the supervised learning process for classification, which involves using a training set to construct a classification model and then applying the model to a test set to classify new data. Finally, it provides a detailed example of how a decision tree classifier is constructed from a training dataset and how it can be used to classify data in the test set.
K-means clustering is an algorithm that groups data points into k number of clusters based on their similarity. It works by randomly selecting k data points as initial cluster centroids and then assigning each remaining point to the closest centroid. It then recalculates the centroids and reassigns points in an iterative process until centroids stabilize. While efficient, k-means clustering has weaknesses in that it requires specifying k, can get stuck in local optima, and is not suitable for non-convex shaped clusters or noisy data.
Welcome to Supervised Machine Learning and Data Science.
Algorithms for building models: Support Vector Machines.
Classification algorithm explanation and code in Python (SVM).
k-Nearest Neighbors (k-NN) is a simple machine learning algorithm that classifies new data points based on their similarity to existing data points. It stores all available data and classifies new data based on a distance function measurement to find the k nearest neighbors. k-NN is a non-parametric lazy learning algorithm that is widely used for classification and pattern recognition problems. It performs well when there is a large amount of sample data but can be slow and the choice of k can impact performance.
PCA and LDA are dimensionality reduction techniques. PCA transforms variables into uncorrelated principal components while maximizing variance. It is unsupervised. LDA finds axes that maximize separation between classes while minimizing within-class variance. It is supervised and finds axes that separate classes well. The document provides mathematical explanations of how PCA and LDA work including calculating covariance matrices, eigenvalues, eigenvectors, and transformations.
This presentation introduces clustering analysis and the k-means clustering technique. It defines clustering as an unsupervised method to segment data into groups with similar traits. The presentation outlines different clustering types (hard vs soft), techniques (partitioning, hierarchical, etc.), and describes the k-means algorithm in detail through multiple steps. It discusses requirements for clustering, provides examples of applications, and reviews advantages and disadvantages of k-means clustering.
The document discusses machine learning techniques, including supervised learning methods like decision tree induction, k-nearest neighbors classification, and artificial neural networks. It provides details on how each technique works, such as how decision trees and k-NN classify new data, and how neural networks are trained through backpropagation to reduce error on training data. Risks like overfitting are also addressed.
Enhancing Classification Accuracy of K-Nearest Neighbors Algorithm using Gain... - IRJET Journal
1) The document proposes modifying the K-Nearest Neighbors (KNN) classification algorithm to improve its accuracy by incorporating the concept of attribute strength, as measured by entropy.
2) In the conventional KNN algorithm, the distance between data points is calculated without considering the strength of individual attributes. The proposed modification alters the distance calculation to take into account each attribute's entropy.
3) The accuracy of the modified KNN algorithm is evaluated on various datasets from the UCI Machine Learning Repository and compared to the accuracy of the conventional KNN algorithm.
This document discusses using unsupervised support vector analysis to increase the efficiency of simulation-based functional verification. It describes applying an unsupervised machine learning technique called support vector analysis to filter redundant tests from a set of verification tests. By clustering similar tests into regions of a similarity metric space, it aims to select the most important tests to verify a design while removing redundant tests, improving verification efficiency. The approach trains an unsupervised support vector model on an initial set of simulated tests and uses it to filter future tests by comparing them to support vectors that define regions in the similarity space.
This document provides an overview of the K-nearest neighbors (KNN) machine learning algorithm. It defines KNN as a supervised learning method used for both regression and classification. The document explains that KNN finds the k closest training examples to a test data point and assigns the test point the majority class of its neighbors (for classification) or the average of its neighbors (for regression). An illustrative example is provided. Key properties of KNN discussed include distance metrics, choosing k, and that it is a lazy learner. The pros and cons of KNN are summarized. Finally, the document states it will provide an implementation of KNN on a diabetes dataset.
The document discusses decision trees, which classify data by recursively splitting it based on attribute values. It describes how decision trees work, including building the tree by selecting the attribute that best splits the data at each node. The ID3 algorithm and information gain are discussed for selecting the splitting attributes. Pruning techniques like subtree replacement and raising are covered for reducing overfitting. Issues like error propagation in decision trees are also summarized.
Instance-based learning, also known as lazy learning, is a non-parametric learning method in which the training data is stored and a new instance is classified based on its similarity to the nearest stored instances; all of the data is kept in memory. The key aspects are setting the K value for the K-nearest neighbors algorithm and choosing the distance metric, such as Euclidean distance. Training involves storing all input data; classification involves finding the K nearest neighbors of each test instance and assigning the majority class of those neighbors.
This document provides an overview of clustering and k-means clustering algorithms. It begins by defining clustering as the process of grouping similar objects together and dissimilar objects separately. K-means clustering is introduced as an algorithm that partitions data points into k clusters by minimizing total intra-cluster variance, iteratively updating cluster means. The k-means algorithm and an example are described in detail. Weaknesses and applications are discussed. Finally, vector quantization and principal component analysis are briefly introduced.
k-Means is a rather simple but well-known algorithm for grouping objects, i.e., clustering. Again, all objects need to be represented as a set of numerical features. In addition, the user has to specify the number of groups (referred to as k) he wishes to identify. Each object can be thought of as being represented by a feature vector in an n-dimensional space, n being the number of features used to describe the objects to cluster. The algorithm then randomly chooses k points in that vector space; these points serve as the initial centers of the clusters. Afterwards, all objects are each assigned to the center they are closest to. Usually the distance measure is chosen by the user and determined by the learning task. After that, for each cluster a new center is computed by averaging the feature vectors of all objects assigned to it. The process of assigning objects and recomputing centers is repeated until the process converges. The algorithm can be proven to converge after a finite number of iterations. Several tweaks concerning the distance measure, initial center choice and computation of new average centers have been explored, as well as the estimation of the number of clusters k, yet the main principle always remains the same. In this project we will discuss the k-means clustering algorithm, its implementation and its application to the problem of unsupervised learning; a small illustrative sketch follows below.
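To make the assign-then-recompute loop described above concrete, here is a minimal k-means sketch in plain Python (not from the original summary; the toy data and function names are illustrative only):

import random

def kmeans(points, k, iterations=100):
    # Randomly choose k points as the initial cluster centers
    centers = random.sample(points, k)
    for _ in range(iterations):
        # Assignment step: attach each object to its closest center (squared Euclidean distance)
        clusters = [[] for _ in range(k)]
        for p in points:
            idx = min(range(k), key=lambda i: sum((a - b) ** 2 for a, b in zip(p, centers[i])))
            clusters[idx].append(p)
        # Update step: recompute each center as the mean of its assigned feature vectors
        new_centers = [
            tuple(sum(dim) / len(c) for dim in zip(*c)) if c else centers[i]
            for i, c in enumerate(clusters)
        ]
        if new_centers == centers:   # converged: the centers no longer move
            break
        centers = new_centers
    return centers, clusters

# Toy usage with two obvious groups
data = [(1.0, 1.0), (1.2, 0.8), (0.9, 1.1), (8.0, 8.0), (8.2, 7.9), (7.8, 8.1)]
centers, clusters = kmeans(data, k=2)
print(centers)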
Supervised learning uses labeled training data to predict outcomes for new data. Unsupervised learning uses unlabeled data to discover patterns. Some key machine learning algorithms are described, including decision trees, naive Bayes classification, k-nearest neighbors, and support vector machines. Performance metrics for classification problems like accuracy, precision, recall, F1 score, and specificity are discussed.
K-nearest neighbors (KNN) is a non-parametric lazy learning algorithm that is used for classification and regression. It stores all available cases and classifies new cases based on a similarity measure (e.g., distance functions). For classification, the k nearest neighbors of the new case are identified and the new case is assigned to the most common class among its k nearest neighbors. KNN can be used for continuous variable prediction by averaging the values of the k nearest neighbors. Techniques like condensing are used to improve the computational efficiency of KNN by reducing the size of the training set.
K-nearest neighbors (KNN) is a non-parametric classification technique where an unlabeled sample is classified based on the labels of its k nearest neighbors in the training set, as determined by a distance function. The basic steps are to 1) compute the distance between the test sample and all training samples, 2) select the k nearest neighbors based on distance, and 3) assign the test sample the most common label of its k nearest neighbors. Key aspects include choosing an appropriate value of k, distance metrics, handling high-dimensional data, and reducing computational complexity through techniques like condensing.
Lecture 11 - KNN and Clustering, a lecture in subject module Statistical & Ma... - Maninda Edirisooriya
The supervised ML technique K-Nearest Neighbor and unsupervised clustering techniques are covered in this lesson. This was one of the lectures of a full course I taught at the University of Moratuwa, Sri Lanka, in the second half of 2023.
This document summarizes a paper presentation on selecting the optimal number of clusters (K) for k-means clustering. The paper proposes a new evaluation measure to automatically select K without human intuition. It reviews existing methods, analyzes factors influencing K selection, describes the proposed measure, and applies it to real datasets. The method was validated on artificial and benchmark datasets. It aims to suggest multiple K values depending on the required detail level for clustering. However, it is computationally expensive for large datasets and the data used may not reflect real complexity.
The document discusses text categorization and compares several machine learning algorithms for this task, including Support Vector Machines (SVM), Transductive SVM (TSVM), and SVM combined with K-Nearest Neighbors (SVM-KNN). It provides an overview of text categorization and challenges. It then describes SVM, TSVM which uses unlabeled data to improve classification, and SVM-KNN which combines SVM with KNN to better handle unlabeled data. Pseudocode is presented for the algorithms.
1. K-nearest neighbors (k-NN) is a simple machine learning algorithm that stores all training data and classifies new data based on the majority class of its k nearest neighbors.
2. It is a lazy, non-parametric algorithm that makes no assumptions about the distribution of the data. Learning involves storing training examples, while classification assigns a class based on similarity to stored examples.
3. k-NN has applications in areas like credit ratings, political science, handwriting recognition, and image recognition. It works by finding the k closest training examples in feature space and assigning the new example the majority class of those neighbors.
The method of identifying similar groups of data in a data set is called clustering. Entities in each group are comparatively more similar to entities of that group than those of the other groups.
This document discusses k-nearest neighbors (KNN) classification, an instance-based machine learning algorithm. KNN works by finding the k training examples closest in distance to a new data point, and assigning the most common class among those k neighbors as the prediction for the new point. The document notes that KNN has high variance, since each data point acts as its own hypothesis. It suggests ways to reduce overfitting, such as using KNN with multiple neighbors (k>1), weighting neighbors by distance, and approximating KNN with data structures like k-d trees.
This document discusses clustering algorithms in machine learning. It explains that clustering aims to group unlabeled data points into natural clusters. Hierarchical clustering builds clusters iteratively by merging the closest pairs of clusters until all points are in one cluster, while k-means assigns points to k predefined clusters by iteratively updating cluster centroids. Choosing the right number of clusters k is important for k-means to produce meaningful results.
We propose an algorithm for training Multi-Layer Perceptrons for classification problems, which we named Hidden Layer Learning Vector Quantization (H-LVQ). It consists of applying Learning Vector Quantization to the last hidden layer of an MLP, and it gave very successful results on problems containing a large number of correlated inputs. It was applied with excellent results to the classification of Rutherford backscattering spectra and to a benchmark problem of image recognition. It may also be used for efficient feature extraction.
This document discusses instance-based classifiers and the k-nearest neighbors (k-NN) algorithm. It explains that k-NN classifiers store all available cases and classify new cases based on a similarity measure (e.g. distance functions). The k-NN algorithm identifies the k training examples nearest to the new case, and predicts the class as the most common class among those k examples. The document covers choosing the k value, distance measures, scaling issues, and using k-NN for regression problems.
2. Contents
Eager learners vs Lazy learners
What is KNN?
Discussion about categorical attributes
Discussion about missing values
How to choose k?
KNN algorithm – choosing distance measure and k
Solving an Example
Weka Demonstration
Advantages and Disadvantages of KNN
Applications of KNN
Comparison of various classifiers
Conclusion
References
3. Eager Learners vs Lazy Learners
Eager learners, when given a set of training tuples, construct a generalization model before receiving new (e.g., test) tuples to classify.
Lazy learners simply store the data (or do only a little minor processing) and wait until they are given a test tuple.
Because lazy learners store the training tuples or "instances," they are also referred to as instance-based learners, even though all learning is essentially based on instances.
Lazy learner: less time in training but more in predicting.
- k-Nearest Neighbor Classifier
- Case-Based Classifier
4. k-Nearest Neighbor Classifier
History
• It was first described in the early 1950s.
• The method is labor intensive when given large training sets.
• It gained popularity when increased computing power became available.
• It is used widely in the areas of pattern recognition and statistical estimation.
5. What is k-NN??
Nearest-neighbor classifiers are based on learning by analogy, that is, by comparing a given test tuple with training tuples that are similar to it.
The training tuples are described by n attributes.
When k = 1, the unknown tuple is assigned the class of the training tuple that is closest to it in pattern space.
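To make the analogy concrete, here is a minimal k-NN sketch in plain Python (not part of the original slides; the toy data and function names are illustrative only):

import math
from collections import Counter

def euclidean(a, b):
    # Distance between two equal-length numeric tuples
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def knn_predict(train_X, train_y, query, k=3):
    # Rank training tuples by distance to the query and vote among the k closest
    neighbors = sorted(zip(train_X, train_y), key=lambda t: euclidean(t[0], query))[:k]
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]

# Toy usage: two attributes, two classes
train_X = [(1.0, 2.0), (2.0, 1.0), (8.0, 9.0), (9.0, 8.0)]
train_y = ["A", "A", "B", "B"]
print(knn_predict(train_X, train_y, (1.5, 1.5), k=3))   # prints "A"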
7. Remarks!!
Similarity-function based.
Choose an odd value of k for a 2-class problem.
k must not be a multiple of the number of classes.
8. Closeness
The Euclidean distance between two points or tuples, say X1 = (x11, x12, ..., x1n) and X2 = (x21, x22, ..., x2n), is
$dist(X_1, X_2) = \sqrt{\sum_{i=1}^{n} (x_{1i} - x_{2i})^2}$
Min-max normalization can be used to transform a value v of a numeric attribute A to v' in the range [0, 1] by computing
$v' = \dfrac{v - \min_A}{\max_A - \min_A}$
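A short Python illustration of these two formulas (not from the original slides; the attribute ranges and tuples below are invented for the example):

import math

def min_max_normalize(v, min_a, max_a):
    # Map a raw attribute value v into [0, 1] given the attribute's observed range
    return (v - min_a) / (max_a - min_a)

def euclidean_distance(x1, x2):
    # Straight-line distance between two tuples described by n numeric attributes
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x1, x2)))

# Example: normalize two tuples attribute by attribute, then compare them
ranges = [(0, 10), (0, 100)]          # assumed (min, max) per attribute
t1, t2 = (3, 70), (7, 40)
n1 = tuple(min_max_normalize(v, lo, hi) for v, (lo, hi) in zip(t1, ranges))
n2 = tuple(min_max_normalize(v, lo, hi) for v, (lo, hi) in zip(t2, ranges))
print(euclidean_distance(n1, n2))     # 0.5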
9. What if attributes are categorical??
How can distance be computed for an attribute such as colour?
- Simple method: compare the corresponding values of the attribute (the difference is 0 if they are identical and 1 otherwise).
- Other method: differential grading.
10. What about missing values??
If the value of a given attribute A is missing in tuple X1 and/or in tuple X2, we assume the maximum possible difference.
For categorical attributes, we take the difference value to be 1 if either one or both of the corresponding values of A are missing.
If A is numeric and missing from both tuples X1 and X2, then the difference is also taken to be 1.
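Slides 9 and 10 can be combined into a single per-attribute difference function. A hedged sketch (my own illustration, assuming numeric attributes are already min-max normalized to [0, 1] and missing values are represented by None; the one-missing numeric case is my reading of the "maximum possible difference" rule on a normalized scale):

def attribute_difference(a, b, categorical=False):
    # Per-attribute difference, following slides 9-10.
    # Assumes numeric attributes are min-max normalized to [0, 1]; None means missing.
    if categorical:
        if a is None or b is None:
            return 1.0                      # one or both values missing
        return 0.0 if a == b else 1.0       # simple method: identical -> 0, different -> 1
    if a is None and b is None:
        return 1.0                          # numeric, missing from both tuples
    if a is None or b is None:
        v = a if a is not None else b
        return max(v, 1.0 - v)              # maximum possible difference on a [0, 1] scale
    return abs(a - b)

# Example with one numeric attribute and one categorical attribute (colour)
x1 = (0.3, "red")
x2 = (None, "blue")
diffs = [attribute_difference(x1[0], x2[0]),
         attribute_difference(x1[1], x2[1], categorical=True)]
print(diffs)   # [0.7, 1.0]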
11. How to determine a good value for k?
Starting with k = 1, we use a test set to estimate the error rate of the classifier.
The k value that gives the minimum error rate may be selected.
13. Distance Measures
Which distance measure to use?
We use Euclidean distance as it treats each feature as equally important.
Euclidean distance: $d(x, y) = \sqrt{\sum_i (x_i - y_i)^2}$
Squared Euclidean distance: $d(x, y) = \sum_i (x_i - y_i)^2$
Manhattan distance: $d(x, y) = \sum_i |x_i - y_i|$
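For comparison, the three measures side by side in Python (my own illustration, not from the slides):

import math

def squared_euclidean(x, y):
    return sum((a - b) ** 2 for a, b in zip(x, y))

def euclidean(x, y):
    return math.sqrt(squared_euclidean(x, y))

def manhattan(x, y):
    return sum(abs(a - b) for a, b in zip(x, y))

p, q = (3, 7), (7, 4)
print(euclidean(p, q), squared_euclidean(p, q), manhattan(p, q))   # 5.0 25 7

Because the square root is monotonic, ranking neighbours by squared Euclidean distance gives the same order as ranking by Euclidean distance, which is why the worked example later uses the cheaper squared form.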
14. How to choose k?
If an infinite number of samples were available, the larger k is, the better the classification.
k = 1 is often used for efficiency, but it is sensitive to "noise".
15. Larger k gives smoother boundaries and better generalization, but only if locality is preserved. Locality is not preserved if we end up looking at samples too far away, not from the same class.
An interesting rule of thumb for large sample data: k = sqrt(n)/2, where n is the number of examples.
k can also be chosen through cross-validation, as sketched below.
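A hedged sketch of choosing k by cross-validation (assumes scikit-learn is installed; the dataset and the range of k values are my own choices for the demonstration):

from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)

scores = {}
for k in range(1, 26, 2):                      # odd k values only, to avoid ties
    knn = KNeighborsClassifier(n_neighbors=k)
    scores[k] = cross_val_score(knn, X, y, cv=5).mean()

best_k = max(scores, key=scores.get)
print(f"best k = {best_k}, cv accuracy = {scores[best_k]:.3f}")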
17. Example
We have data from a questionnaire survey and objective testing with two attributes (acid durability and strength) to classify whether a special paper tissue is good or not. Here are the four training samples:

X1 = Acid Durability (seconds) | X2 = Strength (kg/square meter) | Y = Classification
7 | 7 | Bad
7 | 4 | Bad
3 | 4 | Good
1 | 4 | Good

Now the factory produces a new paper tissue that passes the laboratory test with X1 = 3 and X2 = 7. Guess the classification of this new tissue.
18. Step 1: Initialize and define k.
Let's say k = 3.
(Always choose k as an odd number if the number of classes is even, to avoid a tie in the class prediction.)

Step 2: Compute the distance between the input sample and each training sample.
- The coordinates of the input sample are (3, 7).
- Instead of calculating the Euclidean distance, we calculate the squared Euclidean distance.

X1 = Acid Durability (seconds) | X2 = Strength (kg/square meter) | Squared Euclidean distance
7 | 7 | (7-3)^2 + (7-7)^2 = 16
7 | 4 | (7-3)^2 + (4-7)^2 = 25
3 | 4 | (3-3)^2 + (4-7)^2 = 09
1 | 4 | (1-3)^2 + (4-7)^2 = 13
19. Step 3: Sort the distances and determine the nearest neighbours based on the k-th minimum distance.

X1 = Acid Durability (seconds) | X2 = Strength (kg/square meter) | Squared Euclidean distance | Rank (minimum distance) | Included in 3-Nearest Neighbours?
7 | 7 | 16 | 3 | Yes
7 | 4 | 25 | 4 | No
3 | 4 | 09 | 1 | Yes
1 | 4 | 13 | 2 | Yes
20. Step 4: Take the 3 nearest neighbours and gather the category Y of each.

X1 = Acid Durability (seconds) | X2 = Strength (kg/square meter) | Squared Euclidean distance | Rank (minimum distance) | Included in 3-Nearest Neighbours? | Y = Category of the nearest neighbour
7 | 7 | 16 | 3 | Yes | Bad
7 | 4 | 25 | 4 | No | -
3 | 4 | 09 | 1 | Yes | Good
1 | 4 | 13 | 2 | Yes | Good
21. Step 5: Apply a simple majority vote.
Use the simple majority of the categories of the nearest neighbours as the prediction for the query instance.
We have 2 "Good" and 1 "Bad". Thus we conclude that the new paper tissue that passes the laboratory test with X1 = 3 and X2 = 7 falls in the "Good" category.
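The five steps above can be reproduced in a few lines of Python (this script is not part of the original slides, but it uses exactly the training data and query from the example):

from collections import Counter

# Training samples: (acid durability, strength) -> class
train = [((7, 7), "Bad"), ((7, 4), "Bad"), ((3, 4), "Good"), ((1, 4), "Good")]
query = (3, 7)
k = 3

def squared_euclidean(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

# Steps 2-3: compute squared distances and sort by them
ranked = sorted(train, key=lambda item: squared_euclidean(item[0], query))
for point, label in ranked:
    print(point, label, squared_euclidean(point, query))     # distances 9, 13, 16, 25

# Steps 4-5: majority vote among the k nearest neighbours
votes = Counter(label for _, label in ranked[:k])
print("Prediction:", votes.most_common(1)[0][0])              # Prediction: Good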
22. Iris Dataset Example using Weka
The Iris dataset contains 150 sample instances belonging to 3 classes; 50 samples belong to each of the 3 classes.
Statistical observations:
Let's denote the true value of interest as $\theta$ (expected) and the value estimated by some algorithm as $\hat{\theta}$ (observed).
Kappa statistic: the kappa statistic measures the agreement of the prediction with the true class; 1.0 signifies complete agreement. It measures the significance of the classification with respect to the observed value and the expected value.
Mean absolute error: $MAE = \frac{1}{n}\sum_{i=1}^{n} |\theta_i - \hat{\theta}_i|$
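The Weka demonstration itself is not reproduced in the transcript; for readers without Weka, a rough Python analogue of the same evaluation (assuming scikit-learn; the 70/30 split and k = 3 are my own choices) would be:

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score, cohen_kappa_score

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

knn = KNeighborsClassifier(n_neighbors=3).fit(X_train, y_train)
pred = knn.predict(X_test)

print("Accuracy:       ", accuracy_score(y_test, pred))
print("Kappa statistic:", cohen_kappa_score(y_test, pred))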
24. Complexity
The basic kNN algorithm stores all examples.
Suppose we have n examples, each of dimension d:
- O(d) to compute the distance to one example
- O(nd) to compute the distances to all examples
- plus O(nk) time to find the k closest examples
- total time: O(nk + nd)
This is very expensive for a large number of samples, but we need a large number of samples for kNN to work well!
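One common way to soften this cost in practice (not covered on the slide) is to replace the brute-force scan with a space-partitioning index such as a k-d tree; scikit-learn, for example, exposes this as a parameter:

# Assumes scikit-learn; brute force scans all n examples per query,
# while a k-d tree prunes much of the search in low-dimensional data.
from sklearn.datasets import load_iris
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)

brute = KNeighborsClassifier(n_neighbors=3, algorithm="brute").fit(X, y)
kdtree = KNeighborsClassifier(n_neighbors=3, algorithm="kd_tree").fit(X, y)

print(brute.predict(X[:5]))
print(kdtree.predict(X[:5]))   # same predictions, different search strategy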
25. Advantages of the KNN classifier:
- Can be applied to data from any distribution; for example, the data does not have to be separable with a linear boundary
- Very simple and intuitive
- Good classification if the number of samples is large enough

Disadvantages of the KNN classifier:
- Choosing k may be tricky
- The test stage is computationally expensive
- There is no training stage; all the work is done during the test stage. This is actually the opposite of what we want: usually we can afford a long training step, but we want classification of new examples to be fast.
26. Applications of KNN Classifier
Used in classification
Used to get missing values
Used in pattern recognition
Used in gene expression
Used in protein-protein prediction
Used to get 3D structure of protein
Used to measure document similarity
27. Comparison of various classifiers

Algorithm | Features | Limitations
C4.5 Algorithm | Models built can be easily interpreted; easy to implement; can use both discrete and continuous values; deals with noise | Small variations in data can lead to different decision trees; does not work very well on a small training dataset; over-fitting
ID3 Algorithm | Produces more accuracy than C4.5; detection rate is increased and space consumption is reduced | Requires a large searching time; sometimes it may generate very long rules which are difficult to prune; requires a large amount of memory to store the tree
K-Nearest Neighbour Algorithm | Classes need not be linearly separable; zero cost of the learning process; sometimes it is robust with regard to noisy training data; well suited for multimodal ... | Time to find the nearest neighbours in a large training dataset can be excessive; it is sensitive to noisy or irrelevant attributes; performance of the algorithm depends on the number of ...
28. Comparison of various classifiers (continued)

Algorithm | Features | Limitations
Naïve Bayes Algorithm | Simple to implement; great computational efficiency and classification rate; predicts accurate results for most classification and prediction problems | The precision of the algorithm decreases if the amount of data is small; for obtaining good results, it requires a very large number of records
Support Vector Machine Algorithm | High accuracy; works well even if the data is not linearly separable in the base feature space | Speed and size requirements are high both in training and in testing; high complexity and extensive memory requirements for classification in many cases
Artificial Neural Networks Algorithm | Easy to use, with few parameters to adjust; a neural network learns, and reprogramming is not needed; easy to implement; applicable to a wide range of real-life problems | Requires high processing time if the neural network is large; difficult to know how many neurons and layers are necessary; learning can be slow
29. Conclusion
KNN is what we call lazy learning (vs. eager learning).
It is conceptually simple, easy to understand and explain.
Very flexible decision boundaries.
Not much learning at all!
It can be hard to find a good distance measure.
Irrelevant features and noise can be very detrimental.
Typically it cannot handle more than a few dozen attributes.
Computational cost: it requires a lot of computation.
30. References
- J. Han and J. Pei, "Data Mining: Concepts and Techniques", 2001.
- Sakshi and S. Khare, "A Comparative Analysis of Classification Techniques on Categorical Data in Data Mining", International Journal on Recent and Innovation Trends in Computing and Communication, Volume 3, Issue 8, ISSN: 2321-8169.
- Kanu Patel et al., "Comparison of various classification algorithms on iris datasets using WEKA", IJAERD, Volume 1, Issue 1, February 2014, ISSN: 2348-4470.