Module 4
Understanding Data
Bivariate and Multivariate data, Multivariate statistics, Essential mathematics for Multivariate data,
Overview hypothesis, Feature engineering and dimensionality reduction techniques, Basics of Learning
Theory: Introduction to learning and its types, Introduction computation learning theory, Design of
learning system, Introduction concept learning. Similarity-based learning: Introduction to Similarity or
instance based learning, Nearest-neighbour learning, weighted k- Nearest - Neighbour algorithm.
CHAPTER 2
2.6 BIVARIATE DATA AND MULTIVARIATE DATA
Bivariate data involves two variables. Bivariate analysis deals with causes and relationships, and the aim is to find relationships among the data. Consider Table 2.3, which contains data on the temperature in a shop and the sales of sweaters.
Here, the aim of bivariate analysis is to find relationships among variables. The relationships can then be used in comparisons, in finding causes, and in further explorations. To do that, a graphical display of the data is necessary. One such graphical method is the scatter plot.
A scatter plot is used to visualize bivariate data. It is useful for plotting two variables, with or without nominal variables, to illustrate trends and to show differences. It is a plot between the explanatory and response variables, that is, a 2D graph showing the relationship between two variables. Line graphs are similar to scatter plots. The line chart for the sales data is shown in Figure 2.12.
The covariance between X and Y is given as:
COV(X, Y) = (1/N) Σ (xi - E(X))(yi - E(Y))
Here, xi and yi are data values from X and Y, E(X) and E(Y) are the mean values of xi and yi, and N is the number of given data points. Also, COV(X, Y) is the same as COV(Y, X).
If the given attributes are X = (x1, x2, … , xN) and Y = (y1, y2, … , yN), then the Pearson correlation coefficient, denoted as r, is given as:
r = COV(X, Y) / (σX σY)
where σX and σY are the standard deviations of X and Y.
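As a quick illustration, the following Python sketch computes COV(X, Y) and r for a small, assumed set of temperature and sweater-sales values (the numbers are illustrative, not the values in Table 2.3).

```python
import numpy as np

# Hypothetical shop data: temperature (X) and sweater sales (Y)
X = np.array([25, 22, 18, 15, 10, 8, 5], dtype=float)
Y = np.array([10, 15, 22, 35, 48, 60, 75], dtype=float)

N = len(X)
cov_xy = np.sum((X - X.mean()) * (Y - Y.mean())) / N   # COV(X, Y)
r = cov_xy / (X.std() * Y.std())                        # Pearson correlation coefficient

print("COV(X, Y) =", cov_xy)
print("r =", r)   # close to -1: strong negative relationship between temperature and sales
```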
Heatmap
A heat map is a graphical representation of data where individual values are represented by colors. Heat maps are often used in data analysis and visualization to show patterns, density, or intensity of data points in a two-dimensional grid.
Example: Let's consider a heat map to display the average temperatures (in °C) across different regions in
a country over a week. Each cell in the heat map will represent a temperature for a specific region on a
specific day. This is useful to quickly identify trends, such as higher temperatures in certain regions or
specific days with unusual weather patterns. The color gradient (from blue to red) indicates the
temperature range: cooler colors represent lower temperatures, while warmer colors represent higher
temperatures.
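A minimal sketch of such a heat map in Python is given below; the region names, day labels and temperature values are assumed for illustration.

```python
import numpy as np
import matplotlib.pyplot as plt

regions = ["North", "South", "East", "West"]
days = ["Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun"]
temps = np.random.uniform(10, 40, size=(len(regions), len(days)))  # average temperatures in deg C

fig, ax = plt.subplots()
im = ax.imshow(temps, cmap="coolwarm")   # blue (cool) to red (warm) gradient
ax.set_xticks(range(len(days)))
ax.set_xticklabels(days)
ax.set_yticks(range(len(regions)))
ax.set_yticklabels(regions)
fig.colorbar(im, label="Temperature (deg C)")
plt.show()
```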
Pairplot
A pairplot, or scatter matrix, is a data visualization technique for multivariate data. A scatter matrix consists of several pair-wise scatter plots of the variables of the multivariate data. A random matrix of three columns is chosen and the relationships of the columns are plotted as a pairplot (or scatter matrix), as shown in Figure 2.14.
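The following sketch shows how such a pairplot of a random three-column matrix can be produced; seaborn is used here for illustration and the column names are assumed.

```python
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# Random matrix of three columns
data = pd.DataFrame(np.random.randn(100, 3), columns=["col1", "col2", "col3"])

sns.pairplot(data)   # pair-wise scatter plots, with distributions on the diagonal
plt.show()
```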
Machine learning involves many mathematical concepts from the domain of Linear algebra, Statistics,
Probability and Information theory. The subsequent sections discuss important aspects of linear algebra
and probability.
If there is a unique solution, then the system is called consistent independent. If there are multiple solutions, then the system is called consistent dependent. If there are no solutions and the equations are contradictory, then the system is called inconsistent.
For solving a large system of equations, Gaussian elimination can be used. The procedure for applying Gaussian elimination is as follows (a code sketch of the procedure is given after the row operations below):
1. Write the given matrix A.
2. Append the vector y to the matrix A. This matrix is called the augmented matrix.
3. Keep the element a11 as the pivot and eliminate a21 in the second row using the row operation R2 - (a21/a11)R1, where R2 is the 2nd row and (a21/a11) is called the multiplier. The same logic can be used to eliminate the entries below the pivot in all the other rows.
4. Repeat the same logic to reduce the matrix to row echelon form. Then, the unknown variables are obtained by back substitution.
To facilitate the application of Gaussian elimination method, the following row operations are
applied:
1. Swapping the rows
2. Multiplying or dividing a row by a constant
3. Replacing a row by adding or subtracting a multiple of another row to it
These concepts are illustrated in Example 2.8.
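Below is a small Python sketch of the procedure, with forward elimination on the augmented matrix followed by back substitution; the system of equations is assumed for illustration and is not Example 2.8.

```python
import numpy as np

A = np.array([[ 2.0,  1.0, -1.0],
              [-3.0, -1.0,  2.0],
              [-2.0,  1.0,  2.0]])
y = np.array([8.0, -11.0, -3.0])

# Augmented matrix [A | y]
aug = np.hstack([A, y.reshape(-1, 1)])
n = len(y)

# Forward elimination: use a_kk as pivot and eliminate the entries below it
for k in range(n - 1):
    for i in range(k + 1, n):
        multiplier = aug[i, k] / aug[k, k]
        aug[i, :] -= multiplier * aug[k, :]

# Back substitution on the row echelon form
x = np.zeros(n)
for i in range(n - 1, -1, -1):
    x[i] = (aug[i, -1] - aug[i, i + 1:n] @ x[i + 1:n]) / aug[i, i]

print(x)   # solution of the assumed system: [ 2.  3. -1.]
```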
Any real symmetric matrix A can be factorized by eigen decomposition as A = QΛQ^T, where Q is the matrix of eigenvectors, Λ is the diagonal matrix of eigenvalues, and Q^T is the transpose of matrix Q.
LU Decomposition
One of the simplest matrix decompositions is LU decomposition, in which the matrix A can be decomposed into two matrices: A = LU. Here, L is the lower triangular matrix and U is the upper triangular matrix. The decomposition can be done using the Gaussian elimination method discussed in the previous section. First, an identity matrix is augmented to the given matrix. Then, row operations and Gaussian elimination are applied to reduce the given matrix and obtain the matrices L and U. Example 2.9 illustrates the application of Gaussian elimination to get L and U.
Now, it can be observed that the first matrix is L, the lower triangular matrix whose entries are the multipliers used in the reduction of the equations above, such as 3, 3 and 2/3. The second matrix is U, the upper triangular matrix whose entries are the values of the matrix reduced by Gaussian elimination.
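A minimal sketch of LU decomposition in Python is shown below; the example matrix is assumed (it is not the worked Example 2.9), and scipy's routine also returns a permutation matrix P such that A = P L U.

```python
import numpy as np
from scipy.linalg import lu

A = np.array([[4.0, 3.0],
              [6.0, 3.0]])

P, L, U = lu(A)
print("L =\n", L)                        # lower triangular matrix (unit diagonal)
print("U =\n", U)                        # upper triangular matrix
print("Reconstructed A =\n", P @ L @ U)  # equals the original matrix A
```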
Probability Distributions
Definition: A probability distribution describes the likelihood of the various outcomes of a variable X.
Types:
o Discrete Probability Distributions: For countable events (e.g., binomial, Poisson).
o Continuous Probability Distributions: For measurable events on a continuum (e.g., normal,
exponential).
1 Binomial Distribution
2 Poisson Distribution
3 Bernoulli Distribution
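The short sketch below evaluates the three discrete distributions listed above using scipy.stats; the parameter values are assumed for illustration.

```python
from scipy.stats import binom, poisson, bernoulli

# Binomial: probability of exactly 3 successes in 10 trials with success probability 0.5
print(binom.pmf(k=3, n=10, p=0.5))

# Poisson: probability of observing 2 events when the average rate is 4 events per interval
print(poisson.pmf(k=2, mu=4))

# Bernoulli: probability of a success (outcome 1) in a single trial with p = 0.3
print(bernoulli.pmf(k=1, p=0.3))
```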
Density Estimation
1 Parzen Window
Definition: A non-parametric technique that estimates the PDF based on local samples.
Example: Uses a kernel function like Gaussian around each data point.
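A minimal sketch of Parzen-window density estimation with a Gaussian kernel is given below; the sample data and window width h are assumed.

```python
import numpy as np

samples = np.array([1.0, 1.2, 2.5, 3.0, 3.1])   # observed data points
h = 0.5                                          # window (bandwidth) parameter

def parzen_pdf(x, data, h):
    """Estimate p(x) by averaging a Gaussian kernel centred at each sample."""
    kernels = np.exp(-((x - data) ** 2) / (2 * h ** 2)) / (np.sqrt(2 * np.pi) * h)
    return kernels.mean()

print(parzen_pdf(2.0, samples, h))   # estimated density at x = 2.0
```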
Overview of Hypothesis
Data collection alone is not enough; the data must be interpreted to draw a conclusion. An assumption about the outcome is called a hypothesis. Statistical methods are used to confirm or reject the hypothesis.
- Null Hypothesis (H0): The initial assumption or existing belief (often represents no effect or no
difference).
- Alternative Hypothesis (H1): Represents the hypothesis the researcher aims to establish.
1. Parametric Tests: Based on parameters like mean and standard deviation (e.g., t-test, Z-test).
2. Non-Parametric Tests: Not based on population parameters; they depend on the characteristics of the data, such as the independence of events or the type of distribution.
p-value
The p-value is the probability of obtaining results at least as extreme as those observed, assuming that the null hypothesis is true.
- If p-value ≤ α, reject H0 (null hypothesis).
- If p-value > α, fail to reject H0.
Confidence Intervals
Z-test
Paired t-test
Used when samples are dependent (e.g., pre and post tests for the same subjects).
- Formula: t = (d̄ ) / (sd / sqrt(n))
- d̄ : Mean difference between pairs, sd: Standard deviation of differences, n: Number of pairs.
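A minimal sketch of the paired t-test in Python is shown below, using assumed pre- and post-test scores for the same subjects.

```python
import numpy as np
from scipy.stats import ttest_rel

pre = np.array([72, 65, 80, 58, 74, 69])    # scores before the intervention (assumed)
post = np.array([78, 70, 79, 66, 80, 75])   # scores after the intervention (assumed)

t_stat, p_value = ttest_rel(post, pre)
print("t =", t_stat, "p-value =", p_value)

alpha = 0.05
if p_value <= alpha:
    print("Reject H0: the mean difference between pairs is significant")
else:
    print("Fail to reject H0")
```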
Chi-Square Test
Overview
The Chi-Square Test is a non-parametric test used to determine if there is a significant association
between observed and expected frequencies in categorical data. It’s often used for:
1. Goodness-of-fit: Testing if sample data matches an expected distribution.
2. Test of independence: Checking if two categorical variables are independent of each other.
Key Concepts
Observed Frequency (O): The actual count of occurrences in each category.
Expected Frequency (E): The count expected if the null hypothesis is true.
Degree of Freedom (df): The number of categories minus one (C - 1), where C is the number of
categories.
χ² = Σ [(O - E)² / E]
Hypotheses
In the Chi-Square Test, we set up two hypotheses:
- Null Hypothesis (H₀): There is no significant difference between the observed and expected frequencies.
- Alternative Hypothesis (H₁): There is a significant difference between the observed and expected frequencies.
Solution
1. Set Hypotheses: state H₀ and H₁ as defined above.
2. Calculate Expected Frequencies:
- Expected value for boys who registered = (Total boys × Total registered) / Grand Total.
- Repeat the process to calculate the expected frequencies for each cell.
3. Apply the Chi-Square Formula: Calculate the Chi-Square statistic using χ² = Σ [(O - E)² / E].
4. Degree of Freedom: df = C - 1 = 2 - 1 = 1.
Conclusion
Based on the Chi-Square Test, we conclude that there is a statistically significant difference between
boys and girls in terms of course registration for the machine learning class.
Summary Points
- Chi-Square Test helps compare observed vs. expected data frequencies.
- Use χ² = Σ [(O - E)² / E] formula for calculations.
- A p-value less than the significance level (e.g., 0.05) indicates a significant result.
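For reference, the following sketch computes a Chi-Square goodness-of-fit statistic with scipy; the observed and expected counts are assumed and are not the registration data of the example above.

```python
from scipy.stats import chisquare

observed = [60, 40]   # observed counts in two categories (assumed)
expected = [50, 50]   # counts expected under H0 (assumed)

chi2, p_value = chisquare(f_obs=observed, f_exp=expected)   # χ² = Σ [(O - E)² / E]
print("chi-square =", chi2, "p-value =", p_value)           # df = C - 1 = 1

if p_value < 0.05:
    print("Reject H0: observed and expected frequencies differ significantly")
else:
    print("Fail to reject H0")
```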
Features are attributes. Feature engineering is about determining the subset of features that forms an important part of the input and improves the performance of the model, be it classification or any other model in machine learning.
Feature engineering deals with two problems: feature transformation and feature selection. Feature transformation is the extraction of features and the creation of new features that may be helpful in increasing performance. For example, height and weight may give a new attribute called Body Mass Index (BMI). Feature subset selection is another important aspect of feature engineering that focuses on selecting features to reduce processing time, but not at the cost of reliability.
Filter-based selection uses statistical measures to assess features. In this approach, no learning algorithm is used. Correlation and information-gain measures such as mutual information and entropy are all examples of this approach.
Wrapper-based methods use classifiers to identify the best features. These are selected and evaluated by the learning algorithm. This procedure is computationally intensive but has superior performance.
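The sketch below illustrates filter-based selection using mutual information as the scoring measure on a synthetic dataset; all data and feature indices are assumed, and no learning algorithm is involved in the scoring.

```python
import numpy as np
from sklearn.feature_selection import mutual_info_classif

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))                    # four candidate features
y = (X[:, 0] + 0.5 * X[:, 1] > 0).astype(int)    # label depends mainly on features 0 and 1

scores = mutual_info_classif(X, y, random_state=0)
print("Mutual information scores:", scores)      # higher score = more informative feature
print("Feature ranking:", np.argsort(scores)[::-1])
```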
The operator E refers to the expected value of the population. This is calculated theoretically using the probability density functions (PDF) of the elements xi and the joint probability density functions between the elements xi and xj. From this, the covariance matrix can be calculated as:
C = E[(x - mx)(x - mx)^T], where mx = E[x] is the mean vector.
The mapping of the vectors x to y using the transformation can now be described as:
y = A(x - mx)
where A is the matrix whose rows are the eigenvectors of C, arranged in descending order of their eigenvalues. This transform is also called the Karhunen-Loeve or Hotelling transform. The original vector x can now be reconstructed as follows:
x = A^T y + mx
If only the K largest eigenvalues are used, with the corresponding eigenvectors packed as AK, the recovered information would be:
x̂ = AK^T y + mx
The new data is a dimensionally reduced matrix that represents the original data.
Figure 2.15 shows the scree plot, which indicates that only 6 out of 246 attributes are important.
From Figure 2.15, one can infer the relevance of the attributes. The scree plot indicates that
the first attribute is more important than all other attributes.
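A minimal sketch of the transform described above (covariance matrix, eigen decomposition, projection to K components, and reconstruction) is given below; the data matrix is randomly generated for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))             # 100 samples, 5 attributes (assumed data)

m = X.mean(axis=0)                        # mean vector mx
C = np.cov(X, rowvar=False)               # covariance matrix C

eig_vals, eig_vecs = np.linalg.eigh(C)    # eigen decomposition of C
order = np.argsort(eig_vals)[::-1]        # sort eigenvalues in descending order
eig_vals, eig_vecs = eig_vals[order], eig_vecs[:, order]

K = 2
A_K = eig_vecs[:, :K].T                   # rows = eigenvectors of the K largest eigenvalues
Y = (X - m) @ A_K.T                       # y = A_K (x - mx): dimensionally reduced data
X_hat = Y @ A_K + m                       # approximate reconstruction x̂ = A_K^T y + mx

print("Reduced shape:", Y.shape)          # (100, 2)
print("Reconstruction error:", np.mean((X - X_hat) ** 2))
```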
Here, A is the given matrix of dimension m × n, U is an orthogonal matrix of dimension m × n, S is the diagonal matrix of dimension n × n, and V is an orthogonal matrix. The procedure for finding the decomposition matrices is as follows:
1. For the given matrix, find AA^T.
2. Find the eigenvalues of AA^T.
3. Sort the eigenvalues in descending order and pack the corresponding eigenvectors as a matrix U.
4. Arrange the square roots of the eigenvalues on the diagonal. This diagonal matrix is S.
5. Find the eigenvalues and eigenvectors of A^TA and pack the eigenvectors as a matrix called V.
Thus, A = USV^T. Here, U and V are orthogonal matrices, and the columns of U and V are the left and right singular vectors, respectively. SVD is useful in compression, as one can decide to retain only the k largest singular values and the corresponding columns of U and V instead of the original matrix A, giving the rank-k approximation A ≈ Uk Sk Vk^T.
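A minimal sketch of SVD-based compression in Python is given below; the matrix and the number of retained components k are assumed.

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.normal(size=(6, 4))                         # assumed matrix of dimension m × n

U, s, Vt = np.linalg.svd(A, full_matrices=False)    # A = U S V^T (thin SVD)

k = 2                                               # number of singular values retained
A_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]         # rank-k approximation of A

print("Singular values:", s)
print("Approximation error:", np.linalg.norm(A - A_k))
```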