Cheat Sheet
1- Sort ascending
2- Partition data:
   Equal-width (interval/distance): Range = Max - Min, Interval Length (L) = Range / No. Bins
   Bins: Bin1 = [min, min+L) … BinMax = [max-L, max) → different bin sizes
   Equal-freq (equal-depth): each bin contains (L) samples
3- Smooth: mean, median, boundary (based on assumption)

0-1 scaling: v = (u - u0)/(um - u0), u0 = min, um = max
Z-score normalization: v = (u - µ)/σ, z = (X - µ)/σ
Decimal scaling: v = u/10^k, smallest k such that max(|v|) ≤ 1, v ∈ [-1, 1]
Median = mid-point
Range = Max(X) - Min(X)

L1 norm (p=1): ǁxǁ1 = Ʃ|xi|
L2 norm (p=2): ǁxǁ2 = sqrt(Ʃ xi^2)
Minkowski distance: d(x, y) = (Ʃ|xi - yi|^p)^(1/p)
Manhattan (city block): d(x, y) = Ʃ|xi - yi|
Euclidean: d(x, y) = sqrt(Ʃ(xi - yi)^2)
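A minimal sketch of the equal-width and equal-frequency partitioning above, using NumPy; the bin count and sample data are illustrative, not from the source:

import numpy as np

def equal_width_bins(values, n_bins):
    # Equal-width: interval length L = Range / No. Bins; bins may hold different counts
    values = np.sort(np.asarray(values, dtype=float))      # 1- sort ascending
    lo, hi = values.min(), values.max()
    L = (hi - lo) / n_bins                                  # interval length
    edges = lo + L * np.arange(n_bins + 1)                  # Bin1=[min,min+L) ... BinMax=[max-L,max]
    labels = np.digitize(values, edges[1:-1])               # index of the bin each value falls in
    return labels, edges

def equal_freq_bins(values, n_bins):
    # Equal-frequency (equal-depth): each bin holds ~len(values)/n_bins samples
    values = np.sort(np.asarray(values, dtype=float))
    return np.array_split(values, n_bins)

data = [4, 8, 15, 21, 21, 24, 25, 28, 34]                   # illustrative sample
labels, edges = equal_width_bins(data, 3)
print("equal-width edges:", edges, "bin labels:", labels)
print("equal-frequency bins:", [b.tolist() for b in equal_freq_bins(data, 3)])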
Euclidean norm (length): ǁxǁ = sqrt(Ʃ xi^2)
Scaling:
Normalization (ǁxǁ=1): x~ = x/ǁxǁ

Quartiles:
• Q2 = median
• Q1,3 = median of the right/left half of Q2:
   o If total count is odd, we take the middle value to be Q2, then split into right and left (excluding the Q2 value)
   o If total count is even, we take the average of the middle 2 values to be Q2, then we split into right and left (including those 2 values)
• Python way: Qi = i(n+1)/4 => gives the position of Qi, then compensate with the real value at that position.
   o If the position is i.5 (or a fraction), take the average between positions i and i+1
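A small sketch of the "Python way" quartile rule above (position Qi = i(n+1)/4, averaging the two neighbouring values on a fractional position); the function name and data are illustrative:

def quartile(values, i):
    # Qi via position i*(n+1)/4 (1-indexed); average the two neighbours on a fractional position
    xs = sorted(values)
    pos = i * (len(xs) + 1) / 4
    lo = int(pos)
    if pos == lo:
        return xs[lo - 1]                      # exact position: take the real value there
    return (xs[lo - 1] + xs[lo]) / 2           # fractional position: average of positions lo and lo+1

data = [6, 7, 15, 36, 39, 40, 41, 42, 43, 47, 49]               # illustrative sample, n = 11
print(quartile(data, 1), quartile(data, 2), quartile(data, 3))  # Q1, Q2 (median), Q3 -> 15 40 43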
Empirical cumulative distribution function: F(x) = (1/n) Ʃ I(xi ≤ x), where I(.) is the binary indicator function
Sample variance: σ^2 = (1/(n-1)) Ʃ (xi - µ)^2, (n-1) for samples, (n) for populations
Standard deviation: σ = sqrt(σ^2)
If xᵀy = 0 → orthogonal (normal) vectors
Kronecker delta: δij = 1 if i = j, 0 otherwise
Projection of y on x: ux: unit vector of x, ux = x/ǁxǁ; projection p = (uxᵀy) ux = (xᵀy/ǁxǁ^2) x

Entropy: the more random the data, the higher the information, the higher the entropy, the lower the probability
p, q: probabilities of each class
H(S) = –p log2 p – q log2 q
Entropy-based discretization: best split τ to max info gain
Select best τ (mid-points) then split: S1: value ≤ τ, S2: value > τ
Info of τ: H(S1,S2) = (|S1|/|S|)H(S1) + (|S2|/|S|)H(S2)
Info gain: G(τ,S) = H(S) - H(S1,S2), find max Info gain
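A runnable sketch of the entropy-based split above: it scans candidate mid-points τ and keeps the one maximizing G(τ,S) = H(S) - H(S1,S2); the helper names and toy data are illustrative:

import math

def entropy(labels):
    # H(S) = -sum p_c log2 p_c over the classes present in S
    n = len(labels)
    return -sum((labels.count(c) / n) * math.log2(labels.count(c) / n) for c in set(labels))

def best_split(values, labels):
    # Try every mid-point tau; return the one with maximum info gain G(tau, S)
    pairs = sorted(zip(values, labels))
    H_S = entropy([c for _, c in pairs])
    best_tau, best_gain = None, -1.0
    for i in range(len(pairs) - 1):
        tau = (pairs[i][0] + pairs[i + 1][0]) / 2            # candidate mid-point
        S1 = [c for v, c in pairs if v <= tau]                # S1: value <= tau
        S2 = [c for v, c in pairs if v > tau]                 # S2: value > tau
        H_split = (len(S1) * entropy(S1) + len(S2) * entropy(S2)) / len(pairs)
        if H_S - H_split > best_gain:                         # info gain G(tau, S)
            best_tau, best_gain = tau, H_S - H_split
    return best_tau, best_gain

values = [1, 2, 3, 10, 11, 12]                                # illustrative attribute values
labels = ['a', 'a', 'a', 'b', 'b', 'b']                       # illustrative class labels
print(best_split(values, labels))                             # expect tau = 6.5, gain = 1.0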
Linear independence: Ʃ αi xi = 0 only when all αi = 0 (two vectors are linearly dependent if x2 = αx1 where α ≠ 0); orthogonal vectors are linearly independent (but not vice versa)
Probabilities: Pr(X < 5) = number of conditioned samples / total
Probability Distribution Functions (Discrete):
   o Probability Mass Function (pmf): set of probabilities, so the sum = 1
   o Cumulative Distribution Function (cdf)
To obtain probabilities from a continuous pdf -> integrate
Ratio: if p(x1) > p(x2), then the probability that X is closer to x1 is higher than to x2

Bivariate Joint Distribution
Empirical Probability Mass Function: f(x) = (1/n) Ʃ I(xi = x)
Statistical Independence condition: P(X, Y) = P(X)·P(Y); hence the joint (cdf) and (pdf) factorize as well
Mean of multivariate vector: µ = E[X] = [µ1, …, µd]ᵀ
Covariance: σXY = E[(X - µX)(Y - µY)] = E[XY] - µXµY
Covariance Matrix: Ʃ, entries σij (diagonal σii = variances); generalized: Ʃ = E[(X - µ)(X - µ)ᵀ]
Total variance: sum of variances of X1, X2… = trace(Ʃ)
Pearson Correlation Coefficient: ρXY = σXY / (σX σY)

Linear Transformation (Eigenvectors): given X, µX, ƩY, find the eigenvalues/eigenvectors of ƩY
Verify by Av = λv
Then normalize using the Euclidean norm: divide each eigenvector [a b] by sqrt(a^2 + b^2)
Matrix Inverse (2x2): inverse of [a b; c d] = 1/(ad - bc) · [d -b; -c a]
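A brief sketch tying together the covariance matrix, Pearson correlation, and the eigenvector steps above; the 2-attribute data is illustrative:

import numpy as np

X = np.array([[2.0, 0.0], [0.0, 2.0], [3.0, 1.0], [1.0, 3.0]])   # illustrative points (rows), 2 attributes

mu = X.mean(axis=0)                         # mean of the multivariate vector
Xc = X - mu                                 # center the data
Sigma = Xc.T @ Xc / (len(X) - 1)            # covariance matrix ((n-1) for a sample)
rho = Sigma[0, 1] / np.sqrt(Sigma[0, 0] * Sigma[1, 1])   # Pearson correlation coefficient
print("covariance matrix:\n", Sigma, "\nPearson rho:", rho)

vals, vecs = np.linalg.eigh(Sigma)          # eigenvalues/eigenvectors of the symmetric covariance matrix
v, lam = vecs[:, -1], vals[-1]              # largest eigenpair; columns of vecs are unit eigenvectors
print("Av = lambda*v:", np.allclose(Sigma @ v, lam * v))          # verify by Av = λv
print("total variance:", Sigma.trace(), "= sum of eigenvalues:", vals.sum())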
Normal Distributions (Gaussians)
X is a normal dist. if its pdf is gaussian: f(x) = (1/(σ sqrt(2π))) exp(-(x - µ)^2 / (2σ^2)), i.e. X ~ N(µ, σ^2)
Standard Normal Distribution: Z = (X - µ)/σ ~ N(0, 1); z-score = 0 is the mean

Univariate Categorical Attribute (Bernoulli Variable):
pmf: P(X = 1) = p, P(X = 0) = 1 - p
Mean: µ = p
Variance: σ^2 = p(1 - p)
Sample mean: p̂ = n1/n
Sample variance: p̂(1 - p̂), with n1: xi = 1, n0: xi = 0, n: total
p: prob. of success, K: success count

Multivariate Bernoulli
Sample mean: p̂i = ni/n
Covariance between Xi, Xj: σij = -pi pj, where E[Xi Xj] = 0 (never overlap)
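A quick numeric check of the Bernoulli / multivariate-Bernoulli sample statistics above, assuming a one-hot encoded categorical attribute; the data is illustrative:

import numpy as np

X = np.array([[1, 0, 0],        # illustrative one-hot categorical attribute, 3 symbols, n = 6
              [1, 0, 0],
              [0, 1, 0],
              [0, 1, 0],
              [0, 1, 0],
              [0, 0, 1]])

p_hat = X.mean(axis=0)                       # sample mean p̂i = ni / n
var_hat = p_hat * (1 - p_hat)                # variance of each component: p(1 - p)
cov = -np.outer(p_hat, p_hat)                # off-diagonal covariance σij = -pi pj (symbols never overlap)
np.fill_diagonal(cov, var_hat)               # diagonal holds the variances
print(p_hat, var_hat)
print(np.allclose(cov, np.cov(X, rowvar=False, bias=True)))   # matches the empirical covariance (n in denominator)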
Mahalanobis distance: distance of x from the mean normalized by the covariance:
d(x, µ) = sqrt((x - µ)ᵀ Ʃ^-1 (x - µ))
If the covariance matrix is the identity matrix, it reduces to the Euclidean distance.
X = X1 + X2 + …, µX = µX1 + µX2 + …

Linear transformation: transforming x onto u gives y = uᵀx
The projection of x onto the subspace spanned by ui (i = 1, …, r): x' = Ʃi (uiᵀx) ui
Orthogonal projection matrix (m x m): P = Ur Urᵀ
Error vector: ε = x - x' (orthogonal to the subspace)

PCA (data with m attributes):
Mean: µ
Center the data: Z = X - µ
Covariance matrix: Ʃ
Eigenvectors: Ʃu = λu; total variance along u: uᵀƩu
For 3x1 eigenvectors [a b c]: normalize by dividing by sqrt(a^2 + b^2 + c^2)

Kernel Function: K(xi, xj) = φ(xi)ᵀφ(xj)
Squared norm of each point: ǁφ(x)ǁ^2 = K(x, x)
Euclidean distance (in feature space): sqrt(K(x, x) + K(y, y) - 2K(x, y))

Binary vectors with m attributes: number of matching symbols = s
Hamming distance: m – s

Multiple classes: assign x to the class wi with the largest discriminant gi(x)
Binary (if + → x belongs to w1, if - → x belongs to w2): g(x) = g1(x) - g2(x)
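A compact sketch of the PCA steps and the Mahalanobis distance above (center, covariance, eigenvectors, project); the generated data and variable names are illustrative:

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3)) @ np.array([[3.0, 0, 0], [1, 1, 0], [0, 0, 0.2]])   # illustrative data, m = 3

mu = X.mean(axis=0)
Z = X - mu                                    # center the data
Sigma = Z.T @ Z / len(X)                      # covariance matrix
lam, U = np.linalg.eigh(Sigma)                # eigenvalues (ascending) and unit eigenvectors
Ur = U[:, ::-1][:, :2]                        # top r = 2 principal directions

Xp = Z @ Ur                                   # projection of each point onto span{u1, u2}
P = Ur @ Ur.T                                 # orthogonal projection matrix (m x m)
eps = Z[0] - P @ Z[0]                         # error vector of the first point
print("projected shape:", Xp.shape, "| error orthogonal to subspace:", np.allclose(Ur.T @ eps, 0))

d_M = np.sqrt(Z[0] @ np.linalg.inv(Sigma) @ Z[0])   # Mahalanobis distance of the first point from the mean
print("Mahalanobis distance:", d_M)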
Bayes Rule: P(wi | x) = P(x | wi) P(wi) / P(x), or P(wi | x) = P(x | wi) P(wi) / Ʃj P(x | wj) P(wj)
Binary Classification of Gaussian Patterns
Quadratic Discriminant Analysis (QDA)

K-Nearest Neighbor (KNN) Classifier: x belongs to the (majority) class of the closest K neighbors

Logistic Regression (Maximum entropy/maxent)
Logistic sigmoid function: σ(a) = 1/(1 + e^-a)
Softmax: multiclass logistic regression; posterior modeled with the softmax function:
P(wi | x) = exp(ai) / Ʃj exp(aj), where ai = wiᵀx
n: number of points, d: number of dimensions
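A tiny sketch of the KNN rule above: the majority class among the K closest training points (Euclidean distance); data, K, and names are illustrative:

from collections import Counter
import math

def knn_predict(train, x, K=3):
    # Assign x to the majority class among its K nearest neighbours (Euclidean distance)
    neighbours = sorted(train, key=lambda pt: math.dist(pt[0], x))[:K]
    return Counter(label for _, label in neighbours).most_common(1)[0][0]

train = [((1.0, 1.0), 'w1'), ((1.2, 0.8), 'w1'), ((0.9, 1.1), 'w1'),
         ((4.0, 4.0), 'w2'), ((4.2, 3.9), 'w2')]            # illustrative labelled points
print(knn_predict(train, (1.1, 1.0), K=3))                  # -> 'w1'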
Optimization Algorithms

K-Means (see the sketch after this block):
1- Assign two initial means
2- Cluster 2 groups based on their distances to each mean
3- Calculate the means of the new clusters → the new means
4- Repeat 2 until the new means = the old means

Buckshot Algorithm:
1- Randomly select a subsample with N1 objects
2- Apply group-average hierarchical clustering
3- Use the result as seeds for K-means

Hierarchical clustering: recalculate all distances when merging points using the specified method

µi is the mean of cluster Ci
Mixture scatter matrix (T):
Generalized Formula:
Optimization criteria: min. within-class scatter, max. between-class scatter
1st criterion:
2nd criterion:
3rd criterion:
4th criterion:
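A minimal sketch of the K-Means loop listed above, with K = 2 and 1-D points; the data and the two initial means are illustrative:

def kmeans_1d(points, m1, m2):
    # K = 2: reassign each point to the nearest mean, recompute means, stop when the means stop changing
    while True:
        c1 = [p for p in points if abs(p - m1) <= abs(p - m2)]   # 2- cluster by distance to each mean
        c2 = [p for p in points if abs(p - m1) > abs(p - m2)]
        new_m1, new_m2 = sum(c1) / len(c1), sum(c2) / len(c2)    # 3- means of the new clusters
        if (new_m1, new_m2) == (m1, m2):                         # 4- new means = old means -> stop
            return (c1, c2), (m1, m2)
        m1, m2 = new_m1, new_m2

points = [1, 2, 3, 10, 11, 12]                                   # illustrative data
print(kmeans_1d(points, m1=1, m2=2))                             # 1- two initial means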
Associations: A → B
Support: P(A, B), i.e. count(A && B) / total
Confidence: P(B | A), i.e. count(A && B) / count(A)
Minsup: minimum support
Minconf: minimum confidence
… returns the items (Xs) that are common in all transactions
… returns the transactions that have at least one X
Support in this context is the count, e.g. {A:4, B:5}, {A,B}:4

Apriori:
• F1 = {all 1-itemsets}.Support >= minsup
• C2 = all possible 2-itemset combinations from F1
• F2 = C2.Support >= minsup
• C3 = join F2 (using same 1st elements, different last element)
• F3 = pruning C3 (downward closure): each 2-itemset combination of each C3 item must exist in F2. Then add the 3-itemset to F3
• Then from F2 & F3 (i.e. all F >= F2), find all possible associations (2^m - 2 associations, m = itemset size), calculate confidence, keep those > minconf
Class Association Rule (CAR)
X → y (X: itemset, y: label), same rules apply
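A small sketch of the support/confidence bookkeeping above, stopping at 2-itemsets (the F1 → C2 → F2 step); the transactions and thresholds are illustrative, and support is used as a count as in the note above:

from itertools import combinations

transactions = [{'A', 'B', 'C'}, {'A', 'B'}, {'A', 'C'}, {'A', 'B', 'D'}, {'B', 'C'}]   # illustrative
minsup, minconf = 2, 0.6                                      # illustrative thresholds (support as a count)

def support(itemset):
    # Support as a count: transactions containing every item of the itemset
    return sum(1 for t in transactions if itemset <= t)

F1 = [frozenset({i}) for i in {'A', 'B', 'C', 'D'} if support({i}) >= minsup]   # frequent 1-itemsets
C2 = [a | b for a, b in combinations(F1, 2)]                  # candidate 2-itemsets from F1
F2 = [c for c in C2 if support(c) >= minsup]                  # frequent 2-itemsets

for itemset in F2:                                            # rules X -> Y from each frequent 2-itemset
    for x in itemset:
        X, Y = frozenset({x}), itemset - {x}
        conf = support(itemset) / support(X)                  # confidence = support(X ∪ Y) / support(X)
        if conf > minconf:
            print(set(X), '->', set(Y), '| support:', support(itemset), '| confidence:', round(conf, 2))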