
Understanding Data

S. Sridhar, M. Vijayalakshmi, "Machine Learning", Oxford, 2021


2.1 What is Data?
• All facts are data.
• Data can be directly human-interpretable, or diffused data such as images or
video that can be interpreted only by a computer.
• Data is available in different data sources such as flat files, databases, or data
warehouses. It can be either operational or non-operational data.
• Operational data: data generated by business procedures and processes.
• Non-operational data: data used for decision making.
• Data has to be processed to generate information. Processed data is
called information; it includes patterns, associations, or relationships
among data.



Elements of Big Data
• Data whose volume is small enough to be stored and processed by a small-scale
computer is called 'Small Data'.
• Big Data is data whose volume is much larger than that of 'Small Data',
and it is characterized as follows:
• 1. Volume
• 2. Velocity
• 3. Variety -> forms, functions, and sources of data
• 4. Veracity of data
• 5. Validity
• 6. Value



2.1.1 Types of Data
• In Big Data, there are three kinds of data:
• Structured data
• Unstructured data
• Semi-structured data
• Structured Data: Data is stored in an organized manner, such as a database
where it is available in the form of a table. The data can also be retrieved in
an organized manner using tools like SQL.
– Record Data: A dataset is a collection of measurements taken from a process.
– Data Matrix: A variation of the record type that consists of numeric attributes.
– Graph Data: Involves the relationships among objects.
– Ordered Data: Ordered data objects involve attributes that have an implicit order among
them:
• Temporal data
• Sequence data
• Spatial data



Unstructured Data and Semi-Structured Data
• Unstructured data includes video, image, and audio. It also includes textual
documents, programs, and big data.
• Semi-Structured Data: partially structured and partially unstructured.



2.1.2 Data Storage and Representation
• Once the dataset is assembled, it must be stored in a structure that is suitable for
data analysis.
• The goal of data storage management is to make data available for analysis. There
are different approaches to organizing and managing data in storage files and systems,
from flat files to data warehouses.

– Flat Files: The most commonly available data source.

• CSV: comma-separated value files
• TSV: tab-separated value files

– Database systems
• Transactional database: a collection of transactional records
• Time-series database: time-related information, such as log files
• Spatial database: data in raster or vector form (e.g., images, maps, and
points)



2.1.2 Data Storage and Representation
• WWW: The objective of a data mining algorithm is to mine interesting patterns
of information present on the WWW.
• XML: A data format that is both human- and machine-interpretable and can be used to
represent data that needs to be shared across platforms.
• Data Stream: Dynamic data that flows in and out of the observing
environment.
• RSS (Really Simple Syndication): A format for sharing instant feeds across
services.
• JSON (JavaScript Object Notation): A data interchange format that is used in ML
algorithms.
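As a minimal sketch of the JSON interchange format mentioned above, the snippet below parses a small record with Python's standard `json` module; the field names (`sensor_id`, `readings`) are illustrative, not from the text.

```python
import json

# A hypothetical record shared between services (field names are illustrative).
raw = '{"sensor_id": 7, "readings": [21.5, 22.0, 21.8]}'

record = json.loads(raw)  # parse the JSON text into a Python dict
mean_reading = sum(record["readings"]) / len(record["readings"])
print(record["sensor_id"], round(mean_reading, 2))
```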



2.2 Big Data Analytics and Types of
Analytics
• The primary aim of data analysis is to assist business organizations in taking
decisions.
• Data analysis and data analytics are often used interchangeably, but data
analytics is the more general term and data analysis is a part of it. Data
analytics refers to the process of data collection, preprocessing, and analysis;
it deals with the complete cycle of data management:
– Descriptive analytics
– Diagnostic analytics
– Predictive analytics
– Prescriptive analytics



Big Data Analytics and Types of Analytics
• Descriptive Analytics (DA): Describes the main features of the data. After
data collection, DA deals with the collected data and quantifies it.
• Diagnostic Analytics: Deals with the question 'Why?'. It is also termed
causal analysis.
• Predictive Analytics: Deals with the future and the question: what
happens in the future, given this data?
• Prescriptive Analytics: Finds the best course of action for
business organizations. It goes beyond prediction and supports decisions
by recommending a set of actions.



2.3 Big Data Analysis Framework
• Many frameworks have been proposed for performing data analytics. All of the
proposed analytics frameworks share some common factors.
• A 4-layer architecture has the following layers:
– Data Connection Layer: Has data ingestion mechanisms and data
connectors. It performs the ETL process.
– Data Management Layer: Performs preprocessing of data.
– Data Analytics Layer: Has many functionalities, such as statistical tests and machine
learning algorithms, to understand the data and construct machine learning models.
– Presentation Layer: Has mechanisms such as dashboards and applications that
display the results of analytical engines and ML algorithms.

2.3.1 Data Collection
• The first task of gathering datasets is the collection of data. Key factors to
consider are:
1. Timeliness
2. Relevancy
3. Knowledge about the data
• Data sources can be classified into open/public data, social media data, and
multimodal data.
• Public Data: A data source that does not have any copyright rules or
restrictions. A government census is a good example of open-source data.
• E.g., digital libraries, scientific domains, health-care systems.



2.3.2 Data Preprocessing
• In the real world, the available data is 'dirty':
• Incomplete data
• Inaccurate data
• Duplicate data
• Data with missing values
1. Data preprocessing improves the quality of the data for data mining techniques.
The raw data must be preprocessed to give accurate results.
2. The process of detection and removal of errors is called 'data
cleaning'.
3. Data wrangling makes data processable for machine learning algorithms.



Missing Data Analysis
• The primary data cleaning process is missing data analysis. Data cleaning
routines attempt to fill in the missing values, smooth the noise while
identifying the outliers, and correct the inconsistencies in the data. Common strategies:
– Ignore the tuple
– Fill in the values manually
– Use a global constant to fill in the missing attribute values
– Use the attribute mean to fill in the missing value
– Use the attribute mean of all samples belonging to the same class
– Use the most probable value to fill in the missing value
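The attribute-mean strategy above can be sketched in a few lines; the data is illustrative, with missing entries marked as `None`.

```python
from statistics import mean

# Toy attribute with missing values marked as None (illustrative data).
values = [12.0, None, 15.0, 11.0, None, 14.0]

# Strategy: fill each missing entry with the mean of the observed values.
observed = [v for v in values if v is not None]
fill = mean(observed)  # (12 + 15 + 11 + 14) / 4 = 13.0
imputed = [fill if v is None else v for v in values]
print(imputed)
```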

Removal of Noisy or Outlier Data
• Noise is a random error or variance in a measured value. It can be removed
by binning: the given data values are sorted and distributed into equal-frequency
bins, also called 'buckets'.
• The binning methods then use the neighboring values to smooth the noisy
data:
• Smoothing by bin means, where the bin mean replaces the values in the
bin.
• Smoothing by bin medians, where the bin median replaces the bin values,
and smoothing by bin boundaries, where each bin value is replaced by the
closest bin boundary.
• The maximum and minimum values of a bin are called its 'bin boundaries'.
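A minimal sketch of the binning methods above, using illustrative data and a bin size of 3:

```python
from statistics import mean

# Sorted toy values distributed into equal-frequency bins of size 3
# (data and bin size are illustrative).
values = sorted([4, 8, 9, 15, 21, 21, 24, 25, 28])
bins = [values[i:i + 3] for i in range(0, len(values), 3)]

# Smoothing by bin means: every value in a bin is replaced by the bin mean.
by_means = [[mean(b)] * len(b) for b in bins]

# Smoothing by bin boundaries: each value is replaced by the closest of the
# bin's minimum and maximum (the 'bin boundaries').
by_bounds = [[min(b) if v - min(b) <= max(b) - v else max(b) for v in b]
             for b in bins]
print(by_means)
print(by_bounds)
```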



Binning technique



Data Integration and Data Transformation
• Data integration involves routines that merge data from multiple sources
into a single data source; this may lead to redundant data.
• The main goal of data integration is to detect and remove the redundancies
that arise from integration.
• Data transformation routines perform operations such as normalization to
improve the performance of the data mining algorithms.
• In normalization, the attribute values are scaled to fit in a range, improving
the performance of the data mining algorithm.



Min-Max Procedure

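The min-max procedure rescales each value v to v' = (v - min) / (max - min), optionally shifted into a new range [new_min, new_max]. A minimal sketch with illustrative data:

```python
def min_max(values, new_min=0.0, new_max=1.0):
    """Scale values into [new_min, new_max] using min-max normalization:
    v' = (v - min) / (max - min) * (new_max - new_min) + new_min."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) * (new_max - new_min) + new_min
            for v in values]

marks = [20, 40, 60, 80, 100]  # illustrative data
print(min_max(marks))
```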




Z-score Normalization

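Z-score normalization rescales each value v to v' = (v - mean) / std, so the transformed data has mean 0 and unit standard deviation. A minimal sketch using the population standard deviation, with illustrative data:

```python
from statistics import mean, pstdev

def z_scores(values):
    """Z-score normalization: v' = (v - mean) / std (population std)."""
    mu, sigma = mean(values), pstdev(values)
    return [(v - mu) / sigma for v in values]

data = [10, 20, 30]  # illustrative data
print([round(z, 3) for z in z_scores(data)])
```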


Data Reduction
• Data reduction reduces the data size but produces the same
analytical results. Data reduction can be carried out in different
ways, such as data aggregation, feature selection, and
dimensionality reduction.



Descriptive Statistics
• A branch of statistics that performs dataset summarization; it is
used to summarize and describe data.
• Data Visualization:
• A branch of study that is useful for investigating the given
data.
• Descriptive analytics and data visualization techniques help to
understand the nature of the data, which in turn helps to
determine the kinds of machine learning or data mining tasks
that can be applied to the data. This step is often known as
Exploratory Data Analysis (EDA).
• The focus of EDA is to understand the given data and to
prepare it for machine learning algorithms.



Dataset and Data Types
• A dataset can be assumed to be a collection of data objects. The
data objects may be records, points, vectors, patterns, events,
samples, or observations.
• An attribute is a property or characteristic of an object.



Types of Data



Types of Data based on variables



Univariate Data Analysis and Visualization
• Univariate analysis is the simplest form of statistical analysis; the dataset
has only one variable.
• A variable can also be called a category. The aim of univariate analysis is to
describe the data and find patterns.
• Univariate analysis involves finding the frequency distribution, central
tendency measures, dispersion or variation, and the shape of the data.

• Data Visualization:
• Data visualization helps to understand the data. It is used to present
information and data, summarize data, describe data, and make
comparisons of data.



Bar Chart
• A bar chart is used to display the frequency distribution of a variable.
Bar charts are used to illustrate discrete data.
• They also help in comparing the frequencies of different groups.



Pie Chart
• Pie charts are equally helpful in illustrating univariate data. The percentage
frequency distribution of the students' marks {20, 22, 40, 40, 70, 70, 70, 85, 90, 90}
is shown in the figure below.
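The percentage frequencies behind such a pie chart can be computed directly from the marks in the text:

```python
from collections import Counter

# The marks from the text; percentage frequency of each distinct value.
marks = [20, 22, 40, 40, 70, 70, 70, 85, 90, 90]
percent = {m: 100 * c / len(marks) for m, c in Counter(marks).items()}
print(percent)
```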



Histogram
• Histograms play an important role in data mining for showing frequency
distributions.
• A histogram conveys useful information such as the nature of the data and
its mode; the mode indicates the peak of the dataset.



Dot Plots
• Dot plots are similar to bar charts. They are less cluttered than bar
charts, as they illustrate the bars with single points only.
• The dot plot of English marks for five students with IDs {1, 2, 3, 4, 5} and
marks {45, 60, 60, 80, 85} is given in the figure. The advantage is that by visual
inspection one can find out who got the highest marks.



Central Tendency
• Central tendency is a summary of the data. It explains the characteristics
of the data and helps in comparisons.
• Mass data tends to concentrate at certain values, normally in the
central location. This is called a measure of central tendency, and it
represents the first-order measures. The measures are the mean, median, and mode.
• Mean:
• The arithmetic average is a measure of central tendency that represents the
'center' of the dataset.



Central Tendency
• Weighted Mean: Unlike the arithmetic mean, which weights all items
equally, the weighted mean gives different importance to different items,
as the item importance varies.
• Geometric Mean:



Central Tendency
• Median: The middle value in the distribution is called the median.
• If the total number of items in the distribution is odd, the middle value
is the median.
• If the number of items is even, the average of the two items in the
centre is the median.



Central Tendency
• Mode: The mode is the value that occurs most frequently in the
dataset; the value with the highest frequency is called the
mode.
• The mode is meaningful only for discrete data and is not applicable to
continuous data, as there are no repeated values in continuous
data.
• The procedure for finding the mode is to calculate the
frequencies of all the values in the data; the mode is the value
with the highest frequency.
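The three measures of central tendency can be computed with Python's standard `statistics` module, shown here on the marks from the dot-plot example:

```python
from statistics import mean, median, mode

# The English marks used in the dot-plot example from the text.
marks = [45, 60, 60, 80, 85]
print(mean(marks), median(marks), mode(marks))  # 66 60 60
```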



Dispersion
• The spread of a set of data around the central tendency is called
dispersion. Dispersion is represented in various ways, such as range,
variance, standard deviation, and standard error.
• Range: The range is the difference between the maximum and minimum
values of the given list of data.
• Standard Deviation: The mean does not convey much more than a middle
point. For example, the datasets {10, 20, 30} and {10, 50, 0} both have a
mean of 20; the difference between these two sets is the spread of the data.
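The example from the text can be checked directly: both datasets share a mean of 20, but their (population) standard deviations differ markedly.

```python
from statistics import mean, pstdev

# Both datasets from the text share a mean of 20 but differ in spread.
a, b = [10, 20, 30], [10, 50, 0]
print(mean(a), mean(b))  # both 20
print(round(pstdev(a), 3), round(pstdev(b), 3))
```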



Quartiles and Interquartile Range
• It is sometimes convenient to subdivide the dataset using coordinates.
A percentile is the value below which a given percentage of the data falls.
• Example: the median is the 50th percentile and can be denoted as Q2.
• The 25th percentile is called the first quartile (Q1) and the 75th percentile
is called the third quartile (Q3).
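Quartiles and the interquartile range (IQR = Q3 - Q1) can be sketched with the standard `statistics.quantiles` function; the `method='inclusive'` choice (interpolating between data points) is one of several quartile conventions, and the data reuses the marks from the pie-chart example.

```python
from statistics import quantiles

# Quartiles of the marks from the pie-chart example, using the
# inclusive (interpolating) quartile convention.
data = [20, 22, 40, 40, 70, 70, 70, 85, 90, 90]
q1, q2, q3 = quantiles(data, n=4, method='inclusive')
iqr = q3 - q1  # interquartile range: spread of the middle 50% of the data
print(q1, q2, q3, iqr)
```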



Numericals



Five-Point Summary and Box Plots
• The median, the quartiles Q1 and Q3, and the minimum and maximum, written in the
order <Minimum, Q1, Median, Q3, Maximum>, are known as the five-point summary.
• Box plots are suitable for continuous variables and a nominal variable. Box plots
can be used to illustrate data distributions and to summarize data.
• Also known as a box-and-whisker plot.
• The box contains the bulk of the data; these data lie between the first and third
quartiles.
• The line inside the box indicates the location (median) of the data.
• If the median is not equidistant from the quartiles, the data is skewed. The whiskers
that project from the ends of the box indicate the spread of the tails and the
maximum and minimum of the data values.
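The five-point summary that a box plot visualizes can be computed as a sketch, again using the inclusive quartile convention (one of several) and the marks from the dot-plot example:

```python
from statistics import quantiles

def five_point_summary(data):
    """Return <Minimum, Q1, Median, Q3, Maximum> for the dataset."""
    q1, q2, q3 = quantiles(data, n=4, method='inclusive')
    return min(data), q1, q2, q3, max(data)

marks = [45, 60, 60, 80, 85]  # marks from the dot-plot example
print(five_point_summary(marks))
```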



Examples



2.5.4 Shape
• Skewness and kurtosis indicate the symmetry/asymmetry
and the peak location of the data.
• Skewness: The measure of the direction and degree of
asymmetry; such measures are called measures of the third order.
• Skewness is zero for an ideal normal distribution.
Moreover, a given dataset will rarely have perfect symmetry.
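As a sketch of the third-order measure described above, the moment-based skewness below averages the cubed z-scores; this is one common definition (the text does not give a formula), and the datasets are illustrative.

```python
from statistics import mean, pstdev

def skewness(values):
    """Moment-based skewness: mean of ((v - mu) / sigma) ** 3.
    Zero for perfectly symmetric data; positive for a long right tail."""
    mu, sigma = mean(values), pstdev(values)
    return sum(((v - mu) / sigma) ** 3 for v in values) / len(values)

print(skewness([10, 20, 30]))        # symmetric data
print(skewness([1, 1, 1, 1, 10]))    # long right tail
```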



Skewness



Kurtosis
o Kurtosis also indicates the peaks of the data. If the data is high
peak, then it indicates higher kurtosis and vice versa.
o Kurtosis is the measure of whether the data is heavy tailed or
light tailed relative to normal distribution.
o It can be observed that normal distribution has bell shaped
curve with no long tails.
o Low Kurtosis tends to have light tails. The implication is that
there is no outlier data.



Mean Absolute Deviation (MAD)
• MAD is another dispersion measure and is robust to outliers.
• An outlier point is detected by computing its deviation from the
median and dividing by the MAD.
• The absolute deviation between the data and the mean is taken;
thus the absolute deviation is given by the average of the
absolute differences from the mean.

• Coefficient of Variation: The CV is used to compare datasets with
different units. The CV is the ratio of the standard deviation to the
mean, and %CV is the CV expressed as a percentage.
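Both measures above can be sketched in a few lines, following the text's definition of MAD as the mean absolute deviation about the mean; the data is illustrative.

```python
from statistics import mean, pstdev

def mad(values):
    """Mean absolute deviation about the mean: (1/n) * sum(|v - mean|)."""
    mu = mean(values)
    return sum(abs(v - mu) for v in values) / len(values)

def cv_percent(values):
    """%CV: 100 * standard deviation / mean (population std)."""
    return 100 * pstdev(values) / mean(values)

data = [10, 20, 30]  # illustrative data
print(mad(data))     # (10 + 0 + 10) / 3
print(round(cv_percent(data), 2))
```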



Special Univariate Plots
• The ideal way to check the shape of the dataset is a stem-and-leaf
plot.
• A stem-and-leaf plot is a display that helps us to know the
shape and distribution of the data.
• In this method, each value is split into a stem and a leaf. The
last digit is usually the leaf, and the digits to its left
form the stem.
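A minimal stem-and-leaf sketch for two-digit values (last digit as leaf, tens digit as stem), using the marks from the dot-plot example:

```python
from collections import defaultdict

def stem_and_leaf(values):
    """Split each value into a stem (leading digits) and a leaf (last digit)."""
    plot = defaultdict(list)
    for v in sorted(values):
        plot[v // 10].append(v % 10)
    return dict(plot)

marks = [45, 60, 60, 80, 85]  # marks from the dot-plot example
for stem, leaves in stem_and_leaf(marks).items():
    print(f"{stem} | {' '.join(map(str, leaves))}")
```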



Q-Q Plot

