KNN Algorithm
DESCRIPTION
A nearest-neighbor classification object, where both distance metric ("nearest") and number of neighbors can be altered. The object classifies new observations using the predict method. The object contains the data used for training, so it can compute resubstitution predictions.
CLASSIFICATION
mdl = ClassificationKNN.fit(X,Y) creates a k-nearest neighbor classification model. For details, see ClassificationKNN.fit.
mdl = ClassificationKNN.fit(X,Y,Name,Value) creates a classifier with additional options specified by one or more Name,Value pair arguments. For details, see ClassificationKNN.fit.
Input Arguments

X
Matrix of predictor values. Each column of X represents one variable, and each row represents one observation.

Y
Grouping variable of response values with the same number of elements (rows) as X. Each entry in Y is the response to the data in the corresponding row of X.
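For example, a minimal construction-and-prediction sketch using the fisheriris sample data set that ships with Statistics Toolbox (the variables meas and species come from that file; the query point is made up for illustration):

load fisheriris
X = meas;                                % 150-by-4 numeric predictor matrix
Y = species;                             % 150-by-1 cell array of class labels
mdl = ClassificationKNN.fit(X,Y,'NumNeighbors',5);
label = predict(mdl,[5.9 3.0 5.1 1.8])   % classify one new observation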
Properties

BreakTies
String specifying the method predict uses to break ties if multiple classes have the same smallest cost. By default, ties occur when multiple classes have the same number of nearest points among the K nearest neighbors.

'nearest'   Use the class with the nearest neighbor among tied groups.
'random'    Use a random tiebreaker among tied groups.
'smallest'  Use the smallest index among tied groups.

'BreakTies' applies when 'IncludeTies' is false. Change BreakTies using dot addressing: mdl.BreakTies = newBreakTies
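A short sketch, continuing from the fisheriris model above (an even NumNeighbors makes tied votes more likely; the query point is made up):

mdl.NumNeighbors = 4;        % even K, so tied votes can occur
mdl.BreakTies = 'random';    % resolve ties at random instead of by distance
label = predict(mdl,[6.3 2.8 4.9 1.7])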
CategoricalPredictors
Specification of which predictors are categorical.

'all'  All predictors are categorical.
[]     No predictors are categorical.

ClassNames
List of elements in the training data Y with duplicates removed. ClassNames can be a numeric vector, vector of categorical variables (nominal or ordinal), logical vector, character array, or cell array of strings. ClassNames has the same data type as the data in the argument Y. Change ClassNames using dot addressing: mdl.ClassNames = newClassNames
Cost
Square matrix, where Cost(i,j) is the cost of classifying a point into class j if its true class is i. Cost is K-by-K, where K is the number of classes. Change a Cost matrix using dot addressing: mdl.Cost = costMatrix
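For example, with the three fisheriris classes, a sketch of a cost matrix that doubles the penalty for misclassifying the second class listed in mdl.ClassNames (the default cost is 1 off the diagonal and 0 on it):

costMatrix = ones(3) - eye(3);           % default: unit cost for every misclassification
costMatrix(2,:) = 2*costMatrix(2,:);     % true class 2 costs twice as much to misclassify
mdl.Cost = costMatrix;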
Distance
String or function handle specifying the distance metric. The allowable strings depend on the NSMethod parameter, which you set in ClassificationKNN.fit, and which exists as a field in ModelParams.

NSMethod     Distance Metric Names
exhaustive   Any distance metric of ExhaustiveSearcher
kdtree       'cityblock', 'chebychev', 'euclidean', or 'minkowski'

For definitions, see Distance Metrics. The distance metrics of ExhaustiveSearcher:

Value           Description
'cityblock'     City block distance.
'chebychev'     Chebychev distance (maximum coordinate difference).
'correlation'   One minus the sample linear correlation between observations (treated as sequences of values).
'cosine'        One minus the cosine of the included angle between observations (treated as vectors).
'euclidean'     Euclidean distance.
'hamming'       Hamming distance, the percentage of coordinates that differ.
'jaccard'       One minus the Jaccard coefficient, the percentage of nonzero coordinates that differ.
'mahalanobis'   Mahalanobis distance, computed using a positive definite covariance matrix C. The default value of C is the sample covariance matrix of X, as computed by nancov(X). To specify a different value for C, use the 'Cov' name-value pair.
'minkowski'     Minkowski distance. The default exponent is 2. To specify a different exponent, use the 'P' name-value pair.
'seuclidean'    Standardized Euclidean distance. Each coordinate difference between X and a query point is scaled, meaning divided by a scale value S. The default value of S is the standard deviation computed from X, S = nanstd(X). To specify another value for S, use the 'Scale' name-value pair.
'spearman'      One minus the sample Spearman's rank correlation between observations (treated as sequences of values).
@distfun        Distance function handle. distfun has the form

function D2 = DISTFUN(ZI,ZJ)
% calculation of distance
...

where
ZI is a 1-by-N vector containing one row of X or Y.
ZJ is an M2-by-N matrix containing multiple rows of X or Y.
D2 is an M2-by-1 vector of distances, and D2(k) is the distance between observations ZI and ZJ(k,:).
If NSMethod is kdtree, you can use dot addressing to change Distance only among the types 'cityblock', 'chebychev', 'euclidean', or 'minkowski'.
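As a sketch of the DISTFUN form above, a weighted Euclidean distance written as an anonymous function; the weight vector w is hypothetical and matches the four fisheriris predictors:

w = [1 1 0.5 0.5];                                      % hypothetical per-column weights
weuclid = @(ZI,ZJ) sqrt(bsxfun(@minus,ZJ,ZI).^2 * w');  % returns the M2-by-1 vector D2
mdl2 = ClassificationKNN.fit(X,Y,'Distance',weuclid);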
DistanceWeight
String or function handle specifying the distance weighting function.

DistanceWeight     Meaning
'equal'            No weighting
'inverse'          Weight is 1/distance
'squaredinverse'   Weight is 1/distance^2
@fcn               fcn is a function that accepts a matrix of nonnegative distances, and returns a matrix the same size containing nonnegative distance weights. For example, 'squaredinverse' is equivalent to @(d)d.^(-2).
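A sketch assuming DistanceWeight can be changed by dot addressing like the other properties here; the Gaussian-style kernel is an arbitrary choice for illustration:

mdl.DistanceWeight = @(d) exp(-d.^2);   % nonnegative weights, same size as d
label = predict(mdl,[6.0 2.9 4.5 1.5])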
DistParameter
Additional parameter for the distance metric.

Distance Metric   Parameter
'mahalanobis'     Positive definite covariance matrix C.
'minkowski'       Minkowski distance exponent, a positive scalar.
'seuclidean'      Vector of positive scale values with length equal to the number of columns of X.

For values of the distance metric other than those in the table, DistParameter must be []. Change DistParameter using dot addressing: mdl.DistParameter = newDistParameter
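For example, fitting with a Minkowski metric and a non-default exponent via the 'P' name-value pair mentioned above, then reading the exponent back:

mdl3 = ClassificationKNN.fit(X,Y,'Distance','minkowski','P',3);
mdl3.DistParameter    % returns the Minkowski exponent, 3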
IncludeTies
Logical value indicating whether predict includes all the neighbors whose distance values are equal to the Kth smallest distance. If IncludeTies is true, predict includes all these neighbors. Otherwise, predict uses exactly K neighbors (see 'BreakTies'). Change IncludeTies using dot addressing: mdl.IncludeTies = newIncludeTies
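A short sketch continuing the fisheriris model (the query point is made up):

mdl.IncludeTies = true;                  % count every neighbor tied at the Kth smallest distance
label = predict(mdl,[5.8 2.7 4.1 1.0])   % BreakTies applies only while IncludeTies is false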
ModelParams
Parameters used in training mdl.

NObservations
Number of observations used in training mdl. This can be less than the number of rows in the training data, because data rows containing NaN values are not part of the fit.
NumNeighbors
Positive integer specifying the number of nearest neighbors in X to find for classifying each point when predicting. Change NumNeighbors using dot addressing: mdl.NumNeighbors = newNumNeighbors
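For example, comparing the resubstitution error for two neighborhood sizes (resubLoss is listed under Methods below):

mdl.NumNeighbors = 3;
err3 = resubLoss(mdl);    % resubstitution classification error with K = 3
mdl.NumNeighbors = 9;
err9 = resubLoss(mdl);    % error with K = 9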
PredictorNames
Cell array of names for the predictor variables, in the order in which they appear in the training data X. Change PredictorNames using dot addressing: mdl.PredictorNames = newPredictorNames
Prior
Prior probabilities for each class. Prior is a numeric vector whose entries relate to the corresponding ClassNames property. Add or change a Prior vector using dot addressing: mdl.Prior = priorVector
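A sketch setting equal prior probabilities for the three fisheriris classes, in the order given by mdl.ClassNames:

mdl.Prior = [1/3 1/3 1/3];   % nonnegative entries, one per class in ClassNames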
ResponseName
String describing the response variable Y. Change ResponseName using dot addressing: mdl.ResponseName = newResponseName
W
Numeric vector of nonnegative weights with the same number of rows as Y. Each entry in W specifies the relative importance of the corresponding observation in Y. Change W using dot addressing: mdl.W = newW
X
Numeric matrix of predictor values. Each column of X represents one predictor (variable), and each row represents one observation.
Y
Numeric vector of response values with the same number of rows as X. Each entry in Y is the response to the data in the corresponding row of X.
Methods

crossval       Cross-validated k-nearest neighbor classifier
edge           Edge of k-nearest neighbor classifier
fit            Fit k-nearest neighbor classifier
loss           Loss of k-nearest neighbor classifier
margin         Margin of k-nearest neighbor classifier
predict        Predict k-nearest neighbor classification
resubEdge      Edge of k-nearest neighbor classifier by resubstitution
resubLoss      Loss of k-nearest neighbor classifier by resubstitution
resubMargin    Margin of k-nearest neighbor classifier by resubstitution
resubPredict   Predict resubstitution response of k-nearest neighbor classifier
template       k-nearest neighbor classifier template for ensemble
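For example, a cross-validation sketch using crossval from the table above; kfoldLoss is a method of the cross-validated object that crossval returns, not of ClassificationKNN itself:

cvmdl = crossval(mdl);       % 10-fold cross-validated model by default
cvErr = kfoldLoss(cvmdl);    % cross-validated classification error
resubErr = resubLoss(mdl);   % resubstitution error, usually optimistic by comparison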
Definitions

Prediction
ClassificationKNN predicts the classification of a point Xnew using a procedure equivalent to this:
1. Find the NumNeighbors points in the training set X that are nearest to Xnew.
2. Find the NumNeighbors response values Y to those nearest points.
3. Assign the classification label Ynew that has the smallest expected misclassification cost among the values in Y.
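A rough sketch of this procedure for the default equal misclassification costs and observation weights, using knnsearch and a simple majority vote (it ignores ties, so it only matches predict in the straightforward cases):

xnew = [5.9 3.0 5.1 1.8];                       % made-up query point
idx = knnsearch(X,xnew,'K',mdl.NumNeighbors);   % step 1: nearest training rows
neighborLabels = Y(idx);                        % step 2: their response values
[g,gn] = grp2idx(neighborLabels);               % map labels to group indices
ynew = gn{mode(g)}                              % step 3: most common class wins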