
Cheat Sheet


Binning:
1- Sort ascending
2- Partition data:
• Equal-width (interval/distance): Range = Max - Min, Interval Length (L) = Range / No. Bins
  Bins: Bin1 = [min, min+L), ..., BinMax = [max-L, max] → the bins end up holding different numbers of samples
• Equal-freq (equal-depth): each bin contains the same number of samples
3- Smooth: by bin mean, median, or boundaries (based on assumption)

Scaling: map u ∈ [u0, um] to v ∈ [v0, vm]
• 0-1 scaling: v = (u - u0) / (um - u0)
• Z-score normalization: v = (u - µ)/σ, z = (X - µ)/σ
• Decimal scaling: v = u / 10^k, with the smallest k such that max(|v|) ≤ 1, i.e. v ∈ [-1, 1]

Norms (Minkowski distances):
• L1 norm (p = 1) → Manhattan (city block) distance
• L2 norm (p = 2) → Euclidean distance

Dissimilarity matrix: symmetric matrix of the pairwise distances d(xi, xj) between all samples.
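A minimal Python sketch of the two partitioning schemes and z-score scaling above (the helper names equal_width_bins / equal_freq_bins and the toy data are illustrative, not from the sheet):

import numpy as np

def equal_width_bins(x, n_bins):
    # Interval length L = Range / No. Bins; edges at min, min+L, ..., max
    x = np.sort(np.asarray(x, dtype=float))
    edges = np.linspace(x.min(), x.max(), n_bins + 1)
    # assign each value to the bin bounded by consecutive edges
    return np.digitize(x, edges[1:-1]), edges

def equal_freq_bins(x, n_bins):
    # equal-depth: each bin receives (roughly) the same number of sorted samples
    return np.array_split(np.sort(np.asarray(x, dtype=float)), n_bins)

data = [4, 8, 15, 21, 21, 24, 25, 28, 34]
labels, edges = equal_width_bins(data, 3)
print(edges, labels)                      # equal-width bin boundaries and bin index per value
print(equal_freq_bins(data, 3))           # three bins of 3 samples each
print((np.array(data) - np.mean(data)) / np.std(data))   # z-score scaling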
Vector basics:
• Euclidean norm (length): ǁxǁ = (Σ xi²)^(1/2)
• Normalization to unit length (ǁx~ǁ = 1): x~ = x/ǁxǁ
• Projection of y on x: (uxᵀy) ux, where ux is the unit vector of x, ux = x/ǁxǁ
• If xᵀy = 0 → x and y are orthogonal (normal) vectors

Quartiles:
• Q2 = median
• Q1, Q3 = medians of the left/right halves around Q2:
  o If the total count is odd, take the middle value as Q2, then split into left and right halves (excluding the Q2 value)
  o If the total count is even, take the average of the middle 2 values as Q2, then split into left and right halves (including those 2 values)
• Python way: position of Qi = i(n+1)/4, then read off the real value at that position
  o If the position is a fraction (e.g. i.5), take the average of the values at positions i and i+1
• Median = mid-point; Range = Max(X) - Min(X)

Summary statistics:
• Empirical cumulative distribution function: F̂(x) = (1/n) Σ I(xi ≤ x), where I(.) is the binary indicator function
• Sample variance: σ² = Σ (xi - x̄)² / (n-1) → (n-1) for samples, (n) for populations
• Standard deviation: σ = √σ²
• Kronecker delta: δij = 1 if i = j, 0 otherwise

Entropy: the more random the data, the higher the information, the higher the entropy, the lower the probability.
H(S) = -p log2 p - q log2 q, where p, q are the probabilities of each class.

Entropy-based discretization: find the best split τ that maximizes info gain.
• Candidate τ values are the mid-points between consecutive sorted values; each τ splits S into S1: value ≤ τ and S2: value > τ
• Info of τ: H(S1, S2) = (|S1|/|S|)·H(S1) + (|S2|/|S|)·H(S2)
• Info gain: G(τ, S) = H(S) - H(S1, S2); pick the τ with the maximum info gain
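A short Python sketch of the entropy-based split search described above, assuming a binary class label; the function names entropy / best_split and the toy data are illustrative:

import math

def entropy(labels):
    # H(S) = -p log2 p - q log2 q over the class proportions in S
    n = len(labels)
    return -sum((labels.count(c) / n) * math.log2(labels.count(c) / n)
                for c in set(labels))

def best_split(values, labels):
    # try every mid-point tau and keep the maximum info gain G(tau, S)
    pairs = sorted(zip(values, labels))
    vs = [v for v, _ in pairs]
    ls = [l for _, l in pairs]
    base = entropy(ls)
    best_tau, best_gain = None, -1.0
    for i in range(1, len(vs)):
        if vs[i] == vs[i - 1]:
            continue
        tau = (vs[i] + vs[i - 1]) / 2            # candidate mid-point
        s1, s2 = ls[:i], ls[i:]                  # S1: value <= tau, S2: value > tau
        info = len(s1) / len(ls) * entropy(s1) + len(s2) / len(ls) * entropy(s2)
        gain = base - info                       # G(tau, S) = H(S) - H(S1, S2)
        if gain > best_gain:
            best_tau, best_gain = tau, gain
    return best_tau, best_gain

print(best_split([1, 2, 3, 10, 11, 12], ['a', 'a', 'a', 'b', 'b', 'b']))   # -> (6.5, 1.0)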
Probabilities and distributions:
• Empirical probability: Pr(X < 5) = number of samples satisfying the condition / total number of samples
• Probability Distribution Functions (Discrete):
  o Probability Mass Function (pmf): the set of probabilities of the possible values; they sum to 1
  o Cumulative Distribution Function (cdf): F(x) = Pr(X ≤ x)
• Bivariate joint distribution, empirical joint probability mass function: p̂(x, y) = count(X = x and Y = y) / n
• Statistical independence condition: p(x, y) = p(x)·p(y) for all x, y; hence the joint cdf and the joint pdf both factorize

• Linear independence: if y = αx for some α ≠ 0, then x and y are linearly dependent; otherwise they are linearly independent
• Orthogonal vectors are linearly independent (but not vice versa)

Multivariate statistics:
• Mean of a multivariate vector: µ = (µ1, ..., µm)ᵀ, the mean of each attribute
• Covariance: σXY = E[(X - µX)(Y - µY)]
• Covariance matrix: Σ with Σij = cov(Xi, Xj); the diagonal holds the variances
• Total variance: sum of the variances of X1, X2, ... (the trace of Σ)
• Pearson Correlation Coefficient: ρXY = σXY / (σX σY)
• Linear transformation (eigenvectors): given X with mean µX and covariance ΣX, Y = AᵀX has mean µY = AᵀµX and covariance ΣY = AᵀΣX A
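A brief Python sketch of the multivariate quantities above (the toy data matrix is illustrative):

import numpy as np

X = np.array([[2.0, 4.0], [3.0, 7.0], [5.0, 9.0], [6.0, 12.0]])   # rows = samples, columns = attributes

mu = X.mean(axis=0)                       # mean of the multivariate vector
Sigma = np.cov(X, rowvar=False)           # covariance matrix (sample version, n-1)
total_var = np.trace(Sigma)               # total variance = sum of the variances
r = np.corrcoef(X, rowvar=False)[0, 1]    # Pearson correlation coefficient

print(mu, total_var, r)
print(Sigma)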

Probability Distribution Functions (Continuous):
  o Cumulative Distribution Function (cdf): F(x) = Pr(X ≤ x)
  o Probability Density Function (pdf): f(x) = dF(x)/dx
To obtain probabilities → integrate the pdf: Pr(a ≤ X ≤ b) = ∫[a,b] f(x) dx
Ratio: if p(x1) > p(x2), then the probability that X falls near x1 is higher than that it falls near x2.

Calculate eigenvalues and eigenvectors of a 2x2 matrix A = [[a, b], [c, d]]:
T = a + d (trace), D = ad - bc (determinant)
Eigenvalues: λ1,2 = T/2 ± √(T²/4 - D); total: λ1 + λ2 = T, λ1·λ2 = D
Eigenvectors (for each λ):
  o c ≠ 0 → v = [λ - d, c]ᵀ
  o b ≠ 0 → v = [b, λ - a]ᵀ
  o b = 0 & c = 0 → v1 = [1, 0]ᵀ, v2 = [0, 1]ᵀ
Verify by Av = λv. Then normalize using the Euclidean norm: divide each eigenvector [a b]ᵀ by √(a² + b²).
Matrix inverse (2x2): A⁻¹ = (1/(ad - bc)) [[d, -b], [-c, a]]
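A small Python check of the 2x2 eigenvalue recipe above (the example matrix is illustrative):

import numpy as np

A = np.array([[4.0, 1.0], [2.0, 3.0]])    # a 2x2 matrix with c != 0
T = np.trace(A)                           # T = a + d
D = np.linalg.det(A)                      # D = ad - bc

for lam in (T / 2 + np.sqrt(T**2 / 4 - D), T / 2 - np.sqrt(T**2 / 4 - D)):
    v = np.array([lam - A[1, 1], A[1, 0]])      # c != 0 case: v = [lam - d, c]
    v = v / np.linalg.norm(v)                   # normalize by the Euclidean norm
    print(lam, v, np.allclose(A @ v, lam * v))  # verify Av = lam v

print(np.linalg.eig(A))                   # cross-check with the library routine
print(np.linalg.inv(A))                   # 2x2 inverse: (1/(ad - bc)) [[d, -b], [-c, a]]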
Normal Distributions (Gaussians):
X is a normal distribution if its pdf is Gaussian: f(x) = (1/(σ√(2π))) exp(-(x - µ)²/(2σ²)), i.e. X ~ N(µ, σ²).
Standard Normal Distribution: mean 0, variance 1; z = (x - µ)/σ, and a z-score of 0 is the mean.
Linear transformation: if Y = aX + b, then Y ~ N(aµ + b, a²σ²); sums: if X = X1 + X2 + ..., then µX = µX1 + µX2 + ...
Joint pdf of a multivariate normal RV: f(x) = (2π)^(-m/2) |Σ|^(-1/2) exp(-(1/2)(x - µ)ᵀΣ⁻¹(x - µ)).
Mahalanobis distance: the distance of x from the mean normalized by the covariance, (x - µ)ᵀΣ⁻¹(x - µ); if the covariance matrix is the identity matrix, it reduces to the (squared) Euclidean distance.

Univariate Bernoulli / Categorical Attribute:
Bernoulli variable X ∈ {0, 1}; pmf: P(X = 1) = p, P(X = 0) = 1 - p, where p is the probability of success.
Mean: µ = p; variance: σ² = p(1 - p).
Sample mean: p̂ = n1/n; sample variance: p̂(1 - p̂), where n1: count of xi = 1, n0: count of xi = 0, n: total.
K: success count in n trials; number of ways (possible combinations): C(n, K) = n! / (K!(n - K)!).

Multivariate Bernoulli (a categorical attribute with m symbols modeled as m Bernoulli variables):
Sample mean: the vector of per-symbol frequencies p̂i = ni/n.
Covariance between Xi, Xj (i ≠ j): E[XiXj] - p̂i p̂j = -p̂i p̂j, since E[XiXj] = 0 (the symbols never overlap).
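A minimal Python sketch of the Mahalanobis distance (the toy data and test point are illustrative):

import numpy as np

X = np.array([[1.0, 2.0], [2.0, 1.0], [3.0, 4.0], [4.0, 3.0], [5.0, 6.0]])   # rows = samples
mu = X.mean(axis=0)
Sigma_inv = np.linalg.inv(np.cov(X, rowvar=False))

def mahalanobis_sq(x, mu, Sigma_inv):
    # (x - mu)^T Sigma^-1 (x - mu): distance from the mean normalized by the covariance
    d = x - mu
    return float(d @ Sigma_inv @ d)

x = np.array([4.0, 5.0])
print(mahalanobis_sq(x, mu, Sigma_inv))
# with Sigma = I it reduces to the squared Euclidean distance
print(mahalanobis_sq(x, mu, np.eye(2)), np.sum((x - mu) ** 2))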
PCA:
Transforming x onto a unit vector u gives the coordinate y = uᵀx; approximating x → x~ = Σi (uiᵀx) ui is the projection of x onto the subspace spanned by ui (i = 1, ..., r).
Orthogonal projection matrix (m x m): P = Ur Urᵀ, where Ur holds the chosen eigenvectors as columns; error vector: ε = x - x~, orthogonal to the subspace.
Mean: µ = (1/n) Σ xi for data with m attributes; center the data: Z = X - µ.
Covariance matrix of the centered data: Σ = (1/n) ZᵀZ; total variance along u: uᵀΣu.
Eigenvectors of Σ: the eigenvector of the max eigenvalue → 1st pc, the 2nd max → 2nd pc; the eigenvector of the min eigenvalue → 1st minor component.
Projection of x onto u1: y1 = u1ᵀ(x - µ).
Mean square error of the approximation = the discarded (smallest) eigenvalues, i.e. the smallest eigenvalue when only one dimension is dropped.
Xr → the projected data expressed in the same (original) feature space; Yr → the reduced coordinates in a different feature space.

Binary (0/1) vectors of length m:
Squared norm of each point = its number of 1's; s = number of matching symbols between the two vectors.
Hamming distance: m - s; Euclidean distance: √(Hamming distance) for binary vectors.
Cosine similarity: xᵀy / (ǁxǁ ǁyǁ).
Jaccard similarity coefficient: (number of positions where both are 1) / (number of positions where at least one is 1).

Kernel Function:
K(x, y) = φ(x)ᵀφ(y), an inner product in feature space; the sum of any 2 kernel functions is a kernel function: K = K1 + K2.
Polynomial Kernel: K(x, y) = (c + xᵀy)^p.
Quadratic kernel (p = 2), for m = 2: the feature map is φ(x) = (1, √2·x1, √2·x2, x1², x2², √2·x1x2); in this case it is a six-dimension vector.
Gaussian (RBF) Kernel: K(x, y) = exp(-ǁx - yǁ² / (2σ²)), where ǁx - yǁ² is the squared Euclidean distance.
Mean in feature space: µφ = (1/n) Σi φ(xi); norm of the mean: ǁµφǁ² = (1/n²) Σi Σj K(xi, xj).
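A compact Python sketch of PCA via the covariance eigendecomposition described above (the toy data is illustrative):

import numpy as np

X = np.array([[2.5, 2.4], [0.5, 0.7], [2.2, 2.9], [1.9, 2.2],
              [3.1, 3.0], [2.3, 2.7], [2.0, 1.6], [1.0, 1.1]])

Z = X - X.mean(axis=0)                    # center the data
Sigma = (Z.T @ Z) / len(Z)                # covariance matrix of the centered data
vals, vecs = np.linalg.eigh(Sigma)        # eigenvalues in ascending order for symmetric matrices
order = np.argsort(vals)[::-1]            # max eigenvalue first -> 1st pc
vals, vecs = vals[order], vecs[:, order]

u1 = vecs[:, 0]                           # 1st principal component direction
y1 = Z @ u1                               # projection of each centered point onto u1
print(vals)                               # variance along each pc (total variance along u = u^T Sigma u)
print(vals / vals.sum())                  # fraction of total variance per pc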


Kernel PCA:
In feature space, norms and distances come directly from the kernel: ǁφ(x)ǁ² = K(x, x); distance in feature space: ǁφ(x) - φ(y)ǁ² = K(x, x) + K(y, y) - 2K(x, y); K(x, y) is a similarity measure in feature space.
When calculating the quadratic kernel with c = 1 from the linear kernel matrix: Kquad = (1 + Klin) ◦ (1 + Klin), where ◦ denotes the Hadamard (entry-wise) product.

Kernel PCA Algorithm:
1- Compute the kernel matrix: K = [Kij] → Kij = K(xi, xj); the kernel matrix is symmetric.
2- Center the matrix in feature space: K̄ = (I - (1/n)·1·1ᵀ) K (I - (1/n)·1·1ᵀ), where K̄ is the centered kernel matrix.
3- Eigenvalue problem for K̄: K̄ ci = ηi ci; c1 is the eigenvector of K̄ corresponding to u1, the principal eigenvector of Σφ, the covariance in feature space (λi = ηi / n).
4- Scale: rescale each ci so that ǁciǁ² = 1/ηi, so the corresponding feature-space direction has unit norm.
5- Fraction of total variance: choose r such that Σ(i ≤ r) λi / Σi λi reaches the desired fraction → reduce dimensionality (r < n).
The ith pc of a point xj: ciᵀ Kj, where Kj is the jth column of K̄.
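A sketch of the kernel PCA steps above in Python, using an RBF kernel; sigma, r and the toy data are illustrative choices:

import numpy as np

X = np.array([[0.0, 0.0], [1.0, 0.5], [0.5, 1.0], [3.0, 3.0], [3.5, 2.5], [2.5, 3.5]])
n, sigma, r = len(X), 1.0, 2

sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)   # pairwise squared Euclidean distances
K = np.exp(-sq / (2 * sigma ** 2))                    # kernel matrix K_ij = K(x_i, x_j)

J = np.eye(n) - np.ones((n, n)) / n
Kc = J @ K @ J                                        # center the kernel matrix in feature space

eta, C = np.linalg.eigh(Kc)                           # eigen-decompose the centered kernel matrix
order = np.argsort(eta)[::-1]
eta, C = eta[order], C[:, order]

A = C[:, :r] / np.sqrt(eta[:r])                       # scale so ||c_i||^2 = 1/eta_i
Y = Kc @ A                                            # projections onto the first r kernel pcs
print(eta[:r] / eta.sum())                            # fraction of total variance captured
print(Y)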


SVD:
Matrix X (m x n) is factorized into X = U Δ Vᵀ, with sizes (m x n) = (m x m)(m x n)(n x n).
U: left singular vectors; V: right singular vectors; Δ: diag(singular values); the number of non-zero singular values = the rank of matrix X.
Linear expansion of X using SVD: X = Σi δi ui viᵀ, a sum of rank-one matrices.
X can be approximated by keeping the best (largest) singular values δi in descending order: Xr = Σ(i ≤ r) δi ui viᵀ.
Ur can be used to project X onto the subspace: Y = Urᵀ X.

LDA (Fisher's LDA): finding the w vector
Find the vector w to project x onto (yi = wᵀxi) that maximizes the separation between classes ω1 and ω2.
m1, m2 = projected means of classes ω1, ω2; the distance between the means m1, m2 defines class separability.
B: between-class scatter; within-class scatter: the variances within each class.
For 2 classes only: s1², s2² = sums of squared deviations of the projected samples from their class means; total squared deviation from the means = s1² + s2²; maximize J(w) = (m1 - m2)² / (s1² + s2²).
The optimum LD vector is the dominant eigenvector of S⁻¹B → the eigenvector of the largest |λi|; for 2 classes this gives the LDA vector w = S⁻¹(µ1 - µ2), then normalize: w = w/ǁwǁ.
Then project the data on the selected eigenvector.
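A short Python sketch of the SVD factorization and the rank-r approximation above (the matrix and r are illustrative):

import numpy as np

X = np.array([[3.0, 1.0, 1.0],
              [-1.0, 3.0, 1.0],
              [2.0, 4.0, 0.0],
              [0.0, 2.0, 2.0]])

U, s, Vt = np.linalg.svd(X, full_matrices=False)   # X = U diag(s) V^T
r = 2
Xr = U[:, :r] @ np.diag(s[:r]) @ Vt[:r, :]         # keep the r largest singular values
print(s)                                           # singular values in descending order
print(np.linalg.norm(X - Xr))                      # approximation error
print(U[:, :r].T @ X)                              # Ur used to project X onto the subspace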
LDA Algorithm (example with two 2x4 class matrices X1, X2):
1- Join X1 and X2 (2x4 and 2x4 become 2x8).
2- Apply the kernel function to all pairs of samples → K = 8x8.
3- Means: m1 = mean of the X1 portion of the K rows, m2 = the same for X2, m = mean(m1, m2).
4- Between-class scatter B: apply its formula → 8x8.
5- Within-class scatter S: apply its formula → 8x8.
6- Compute the dominant eigenvector of S⁻¹B.
7- Project the data onto that eigenvector.

Multi-class LDA: given the global mean x̄, B = Σk nk (µk - x̄)(µk - x̄)ᵀ and S = Σk Σ(xi ∈ ωk) (xi - µk)(xi - µk)ᵀ, the within-class scatter matrix (m x m, symmetric); the LD vectors are the leading eigenvectors of S⁻¹B.

LDA Classifier: project x on the LDA vector, then apply a discriminant function in the projected space.

Discriminant Functions g(x):
Minimum Distance Classifier: find the minimum distance; we take the max g(x) → i.e. the min distance (g is the negated distance).
Nearest Centroid Classifier: squared Euclidean distance to each class mean, gi(x) = -ǁx - µiǁ².
Minimum Mahalanobis Distance Classifier: find the squared Mahalanobis distance between x and µi of class ωi, (x - µi)ᵀΣ⁻¹(x - µi).
Multiple classes: assign x to the class with the largest gi(x).
Binary: g(x) = g1(x) - g2(x); if + → x belongs to ω1, if - → x belongs to ω2.
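A minimal Python sketch of the nearest centroid discriminant above (class data and test points are illustrative):

import numpy as np

X1 = np.array([[1.0, 1.0], [1.5, 2.0], [1.0, 2.5]])   # class w1 samples
X2 = np.array([[5.0, 5.0], [6.0, 4.5], [5.5, 6.0]])   # class w2 samples
mus = [X1.mean(axis=0), X2.mean(axis=0)]

def classify(x, mus):
    # g_i(x) = -||x - mu_i||^2; take the max g(x), i.e. the minimum distance
    g = [-np.sum((x - mu) ** 2) for mu in mus]
    return int(np.argmax(g)) + 1                      # 1-based class label

print(classify(np.array([2.0, 2.0]), mus))            # -> 1
print(classify(np.array([5.0, 5.5]), mus))            # -> 2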
K-Nearest Neighbor (KNN) Classifier: x belongs to the class most common among its closest K neighbors.
Distance-Weighted KNN: rank the K neighbors by distance, furthest = 1, nearest = K; gi(x) = Sum(class ωi weights) / Sum(all weights); pick the class with the largest gi(x).

Bayes Rule: P(ωi | x) = p(x | ωi) P(ωi) / p(x).
Prior probabilities: πi = P(ωi); Maximum Likelihood estimates: π̂i = ni/n, with the class mean and covariance estimated from the samples of each class.
Posteriori Rule: assign x to the class with the maximum posterior P(ωi | x), equivalently the maximum log likelihood ln p(x | ωi) + ln P(ωi).
For binary: decide ω1 if P(ω1 | x) > P(ω2 | x), or equivalently if ln p(x | ω1) + ln P(ω1) - ln p(x | ω2) - ln P(ω2) > 0.

Optimal classifiers for normal patterns (multiclass):
Quadratic Discriminant Analysis (QDA): each class keeps its own Σi → quadratic discriminant functions.
The LDA classifier is a special case: Σ1 = Σ2 = S/n (the quadratic term Q = 0).
If Σi = Σ for all i (i = 1, ..., k) → linear discriminant function.
If Σi = σ²I and P(ωi) = π for all i → minimum Euclidean distance classifier.
If Σ1 = Σ2 = σ²I, the Minimum Mahalanobis Distance Classifier is reduced to the minimum (nearest) centroid classifier.
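A small Python sketch of distance-weighted KNN as described above (toy data, K = 3):

import numpy as np

X = np.array([[1.0, 1.0], [1.2, 0.8], [0.9, 1.1], [4.0, 4.0], [4.2, 3.8], [3.9, 4.1]])
y = np.array([1, 1, 1, 2, 2, 2])

def knn_weighted(x, X, y, K=3):
    d = np.sqrt(((X - x) ** 2).sum(axis=1))
    idx = np.argsort(d)[:K]                 # the K nearest neighbors, nearest first
    weights = np.arange(K, 0, -1)           # nearest = K, ..., furthest = 1
    classes = np.unique(y)
    # g_i(x) = sum of the weights of neighbors in class w_i / sum of all weights
    g = [weights[y[idx] == c].sum() / weights.sum() for c in classes]
    return classes[int(np.argmax(g))]

print(knn_weighted(np.array([1.1, 1.0]), X, y))   # -> 1
print(knn_weighted(np.array([3.8, 4.0]), X, y))   # -> 2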
Binary Classification of Gaussian Patterns:
If Σ1 = Σ2 (Gaussian) → the quadratic terms cancel and the discriminant is based on the Mahalanobis distance → a linear classifier.
If Σ1 = Σ2 = σ²I and P(ω1) = P(ω2) → minimum distance classifier.

Logistic Regression (Maximum entropy / maxent):
Logistic sigmoid function: σ(a) = 1 / (1 + e^(-a)).
Sigmoid output: given a training set {(xi, yi)}, model P(ω1 | x) = σ(wᵀx).
Weight vector w: obtained by minimizing the cross-entropy error E(w) (no closed form, so use gradient descent).
Gradient of E(w): ∇E(w) = Σi (σ(wᵀxi) - yi) xi.
→ More robust to outliers than least squares.

Softmax: multiclass logistic regression; the posterior is modeled with the softmax function: P(ωk | x) = exp(wkᵀx) / Σj exp(wjᵀx).
Cross-entropy error function for target output yi = [yi1, yi2, …, yik]: E(W) = -Σi Σk yik ln P(ωk | xi).
Gradient with respect to wk: ∇wk E = Σi (P(ωk | xi) - yik) xi.

Naïve Bayes Classifier: assumes the attributes are independent given the class, so p(x | ωi) = Πj p(xj | ωi); numeric attributes use a normal pdf per attribute and class.
1) Extract the class-specific data subset for each class.
2) Estimate the prior P(ωi) from the class counts.
3) Estimate each attribute's mean and variance within the class (for the normal pdf).
4) Classify x: choose the class with the maximum P(ωi) Πj p(xj | ωi).

Decision trees: at each node, find the thresholds that maximize the info gain for multiple classes (the same entropy-based splitting used for discretization).
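A bare-bones Python sketch of binary logistic regression trained with the gradient above (learning rate, iteration count and data are illustrative):

import numpy as np

X = np.array([[0.5, 1.0], [1.0, 1.5], [1.5, 0.5], [3.0, 3.5], [3.5, 3.0], [4.0, 4.0]])
y = np.array([0, 0, 0, 1, 1, 1])
Xb = np.hstack([X, np.ones((len(X), 1))])     # append a bias term

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

w = np.zeros(Xb.shape[1])
for _ in range(2000):
    p = sigmoid(Xb @ w)                       # sigmoid output P(w1 | x)
    w -= 0.1 * (Xb.T @ (p - y))               # gradient of the cross-entropy error E(w)

print(w)
print(sigmoid(Xb @ w).round(2))               # predicted probabilities for the training points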
Distances (between points x, y with d attributes; n: number of points):
Minkowski (Lp norm): d(x, y) = (Σj |xj - yj|^p)^(1/p)
City block (Manhattan, sum of absolute differences): Σj |xj - yj|
Euclidean distance (L2 norm): (Σj (xj - yj)²)^(1/2)
Chebychev distance (L∞ norm): maxj |xj - yj|
Canberra distance: Σj |xj - yj| / (|xj| + |yj|)
Quadratic Distance: d(x, y) = (x - y)ᵀ Q (x - y), for a weight matrix Q
Cosine: based on the cosine similarity xᵀy / (ǁxǁ ǁyǁ); distance = 1 - similarity
Pearson Correlation Coefficient Distance: 1 - ρ(x, y)

Hierarchical Clustering Strategies:
Agglomerative: bottom-up approach (most used)
Divisive: top-down (computationally intensive)

Linkage methods (the distance between the merged pair p, q and a third cluster k is recalculated with the chosen method whenever clusters merge; a generalized update formula covers all of them):
Single linkage: nearest neighbor method (minimum pairwise distance)
Complete linkage: furthest neighbor method (maximum pairwise distance)
Group average: average of all pairwise distances; unweighted and weighted versions
Centroid: distance between the cluster centroids
Median: like centroid, but the two merged clusters' distances are given equal weights
Ward's method: merge the pair giving the smallest increase in the sum of squared errors (the scatter of the error)
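The point-to-point distances above, written out in Python for two small vectors (the vectors are illustrative):

import numpy as np

x = np.array([1.0, 3.0, 5.0])
y = np.array([2.0, 1.0, 4.0])

def minkowski(x, y, p):
    return np.sum(np.abs(x - y) ** p) ** (1.0 / p)

print(minkowski(x, y, 1))                                    # city block (Manhattan)
print(minkowski(x, y, 2))                                    # Euclidean
print(np.max(np.abs(x - y)))                                 # Chebychev (L-infinity)
print(np.sum(np.abs(x - y) / (np.abs(x) + np.abs(y))))       # Canberra
print(1 - x @ y / (np.linalg.norm(x) * np.linalg.norm(y)))   # cosine distance
print(1 - np.corrcoef(x, y)[0, 1])                           # Pearson correlation distance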

Optimization Algorithms (partitional clustering):

K-Means:
1- Assign two initial means (K = 2 in this example).
2- Cluster into 2 groups based on the distance of each point to each mean.
3- Calculate the means of the new clusters → these become the new means.
4- Repeat from step 2 until the new means equal the old means.

Buckshot Algorithm:
1- Randomly select a subsample with N1 objects.
2- Apply group-average hierarchical clustering to the subsample.
3- Use the result as seeds for K-means.

Scatter matrices (µi is the mean of cluster Ci): within-class scatter W = Σi Σ(x ∈ Ci) (x - µi)(x - µi)ᵀ, between-class scatter B, and the mixture scatter matrix T = W + B.
Optimization criteria: minimize the within-class scatter and maximize the between-class scatter; the 1st to 4th criteria are different trace and determinant combinations of these scatter matrices.
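A bare-bones Python version of the K-means loop above for K = 2 (the initial means and data are illustrative):

import numpy as np

X = np.array([[1.0, 1.0], [1.5, 2.0], [1.2, 0.8],
              [5.0, 5.0], [5.5, 4.5], [4.8, 5.2]])
means = X[[0, 3]].astype(float)                      # 1- assign two initial means

while True:
    d = np.linalg.norm(X[:, None, :] - means[None, :, :], axis=2)
    labels = d.argmin(axis=1)                        # 2- cluster by distance to each mean
    new_means = np.array([X[labels == k].mean(axis=0) for k in range(2)])   # 3- new means
    if np.allclose(new_means, means):                # 4- stop when the means no longer change
        break
    means = new_means

print(labels, means)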
Associations: rules A → B
Support: P(A, B), i.e. count(A && B) / total number of transactions. Support in this context can also be given as a count, e.g. {A}:4, {B}:5, {A,B}:4.
Confidence: P(B | A), i.e. count(A && B) / count(A).
Minsup: minimum support; Minconf: minimum confidence.
Itemset/transaction mappings: one returns the items (Xs) that are common to all of the given transactions; the other returns the transactions that contain at least one X.

Apriori:
• F1 = {all 1-itemsets with support >= minsup}
• C2 = all possible 2-itemset combinations from F1
• F2 = {itemsets in C2 with support >= minsup}
• C3 = join F2 with itself (itemsets sharing the same 1st element and differing in the last element)
• F3 = pruning of C3 (downward closure): every 2-itemset combination of each C3 item must exist in F2; the surviving 3-itemsets (with support >= minsup) are added to F3
• Then, from F2 and F3 (i.e. all F >= F2), find all possible associations (2^m - 2 rules per m-itemset), calculate their confidence, and keep those with confidence > minconf

Class Association Rule (CAR):
X → y (X: itemset, y: class label); the same support and confidence rules apply.
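A simplified Python sketch of support/confidence counting for 2-itemset rules (minsup and minconf values and the transactions are illustrative; full Apriori would iterate the candidate generation and pruning for larger itemsets):

from itertools import combinations

transactions = [{'A', 'B', 'C'}, {'A', 'B'}, {'A', 'C'}, {'A', 'B', 'C'}, {'B', 'C'}]

def support_count(itemset):
    return sum(1 for t in transactions if set(itemset) <= t)

items = sorted({i for t in transactions for i in t})
F2 = [c for c in combinations(items, 2) if support_count(c) >= 2]   # frequent 2-itemsets, minsup = 2 (count form)

for itemset in F2:
    for k in range(1, len(itemset)):
        for A in combinations(itemset, k):
            B = tuple(sorted(set(itemset) - set(A)))
            conf = support_count(itemset) / support_count(A)        # confidence = count(A && B) / count(A)
            if conf > 0.6:                                          # minconf = 0.6
                print(A, '->', B, 'support:', support_count(itemset), 'confidence:', round(conf, 2))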
