0% found this document useful (0 votes)

70 views

Module 1

1. The document provides an overview of the topics covered in the CSE 2027 Fundamentals of Data Analysis course including introductions to data, data analysis, types of data, variables, central tendency, scales of data, sources of data, data preparation, and types of data analysis. 2. Key aspects of data discussed are structured vs unstructured data, the many "Vs" of data including volume, velocity, variety, value, variability, veracity, validity, vulnerability, volatility, and visualization. 3. The main types of data analysis covered are text analysis, statistical analysis, diagnostic analysis, predictive analysis, and prescriptive analysis. Descriptive and inferential statistical analysis are also introduced.

Uploaded by

Md Esteyak alam Khan

Available Formats

Download as PDF, TXT or read online on Scribd

0% found this document useful (0 votes)

70 views

Module 1

Uploaded by

Md Esteyak alam Khan

Available Formats

Download as PDF, TXT or read online on Scribd

You are on page 1/ 55

CSE 2027-Fundamental of Data Analysis

Module 1-Introduction to Data Analysis

Introducing Data, overview of data analysis: Data

in the Real World, Data vs. Information, Many “Vs”
of Data, Structured Data and Unstructured Data,
Types of Data, Data Analysis Defined, Types of
Variables, Central Tendency of Data, Scales of Data,
Sources of Data, Data preparation: Cleaning the
data, Removing variables, Data Transformations.
Introducing Data
• Facts and statistics collected together for reference or analysis

• Data has to be transformed into a form that is efficient for movement or

processing.

2
Over view of Data
Analysis
• Data analysis is defined as a process of cleaning,
transforming, and modeling data to discover
useful information for business decision-making.
• The purpose of Data Analysis is to extract useful
information from data and taking the decision
based upon the data analysis.
• A simple example of Data analysis is whenever
we take any decision in our day-to-day life is by
thinking about what happened last time or what
will happen by choosing that particular decision.

4
• This is nothing but analyzing our past or future
and making decisions based on it.
• For that, we gather memories of our past or
dreams of our future.
• So that is nothing but data analysis. Now same
thing analyst does for business purposes, is
called Data Analysis.

5
Data Analysis Tools

6
Data in the Real World

7
8
Many v’s of Data
A. Volume
• The term Volume is meant for the Magnitude or Scale of data.

• Massive amounts of data generated from multiple resources are not

possible to handle through the traditional ways like a database.

• This large volume data is a composition of multiple data types, which is

unstructured in nature.
• This kind of data can be either in the form of audio, video, tweets, likes
etc.
B. Velocity
• Velocity refers to the speed at which the gigantic amount of data is
being generated, collected and scrutinized.

• With every flip of second, data is being searched on the internet.

• On a day to day basis, social networking sites like Facebook, Twitter,

Linkedln etc, are sharing a large amount of data.

• For easy analysis of this high amount of constantly generating data

with keeping an eye on it speed and easy access.
C. Variety
• In terms of Big data, term Variety of data pretends to be a composition of
structured and unstructured kind of data.

• The data collected from different sources like mobile phones, laptops etc is not
homogenous in nature.

• Apart from text, audio ,video files, there may be some log files ,clicks or likes or
dislikes etc.
D. Value
• Value refers to convert our investigated data into values.
• Value is one of the most important characteristics of Big data with
a composition of collection and analyzing the same in order to
boost the performance of any organization along with a better
understanding of customers.
• With the access to this useful data, one must analyze great values
in order to get amazing benefits.
E. Variability
• Variability refers to unpredictable changes in the data.

• It may happen because of multiple data types & the speed with
which data is generating and being loaded into the database.
F. Veracity
• Veracity refers to the term trustworthiness with reference to accurate data.

• If the data is accurate, only then you could think of meaningful data.

• For example, consider a dataset of thirty students on which we have to make an

analysis about the reason they got distinction.

• Being an analyzer, you can ask questions like:

• what are the methodology you adopted to get good marks in all the subjects?
• How much time you devote to individual subject?

• Do you learn some subjects with the help of daily life activities like sports etc?

• Have you ever been a scholar?

• Be getting answers like this it would be easier to determine the accuracy of

information which could easily be maintained in statistical form.
G. Validity
• Two terms of big data veracity and validity seems to be alike but
are quite different.

• validity is meant for an accurate analysis in order to get optimized

results.
H. Vulnerability
• Vulnerability is one of the major challenge in big data as the data
generated from multiple sources with such an erratic speed has high
chances of being harmed by any intruder.

• Currently, in a case of facebook, where the Belgium court has

threatened to fine a high amount on breaking privacy recently.
I. Volatility
• Volatility refers to how long the perceived data remains to be
useful for us and how it is to be kept.

• For analyzing the same, it is necessary to develop some new rules

and techniques through which rapid access to information is
possible.
J. Visualization
• Data Visualization is one of the most complex challenge in big data.

• In this information age, data is not only going beyond the limits but also
is composed of different data types.

• So, there is a need of communicate the information by visualizing it

through some special ways with special functionalities like a web-based
approach, statistical analysis etc.
• Traditional tools of data visualization face severe challenges
like low response time, complex methods of scalability,
precision in reporting time etc.

• So, it is a challenge to work with the concept which way of

communication with data is most suitable in order to make
visualization more effective.
22
Typical human-generated unstructured
data includes
• Text files: Word processing, spreadsheets, presentations, email, logs.
• Email: Email has some internal structure thanks to its metadata, and we sometimes refer to
it as semi-structured. However, its message field is unstructured and traditional analytics
tools cannot parse it.
• Social Media: Data from Facebook, Twitter, LinkedIn.
• Website: YouTube, Instagram, photo sharing sites.
• Mobile data: Text messages, locations.
• Communications: Chat, IM, phone recordings, collaboration software.
• Media: MP3, digital photos, audio and video files.
• Business applications: MS Office documents, productivity applications

23
Typical machine-generated unstructured
data includes:
• Satellite imagery: Weather data, land forms, military movements.

• Scientific data: Oil and gas exploration, space exploration, seismic imagery,
atmospheric data.

• Digital surveillance: Surveillance photos and video.

• Sensor data: Traffic, weather, oceanographic sensors.

24
25
26
27
Types of Digital Data

28
Data Analysis-Types
• There are several types of Data Analysis techniques that
exist based on business and technology. However, the
major Data Analysis methods are:
• Text Analysis
• Statistical Analysis
• Diagnostic Analysis
• Predictive Analysis
• Prescriptive Analysis

29
30
Text Analysis
• Text Analysis is also referred to as Data Mining. It
is one of the methods of data analysis to discover
a pattern in large data sets using databases or
data mining tools.
• It used to transform raw data into business
information. Business Intelligence tools are
present in the market which is used to take
strategic business decisions. Overall it offers a
way to extract and examine data and deriving
patterns and finally interpretation of the data.

31
Statistical Analysis

• Statistical Analysis shows "What happen?" by

using past data in the form of dashboards.
Statistical Analysis includes collection, Analysis,
interpretation, presentation, and modeling of
data. It analyses a set of data or a sample of data.
• There are two categories of this type of Analysis -
Descriptive Analysis
Inferential Analysis.

32
Descriptive Analysis
• Analyses complete data or a sample of
summarized numerical data. It shows mean and
deviation for continuous data whereas
percentage and frequency for categorical data.
Inferential Analysis
• Analyses sample from complete data. In this type
of Analysis, you can find different conclusions
from the same data by selecting different
samples.

33
Diagnostic Analysis
• Diagnostic Analysis shows "Why did it happen?"
by finding the cause from the insight found in
Statistical Analysis. This Analysis is useful to
identify behavior patterns of data. If a new
problem arrives in your business process, then
you can look into this Analysis to find similar
patterns of that problem. And it may have
chances to use similar prescriptions for the new
problems.

34
Predictive Analysis
• Predictive Analysis shows "what is likely to happen" by using
previous data. The simplest data analysis example is like if last
year I bought two dresses based on my savings and if this year
my salary is increasing double then I can buy four dresses. But
of course it's not easy like this because you have to think about
other circumstances like chances of prices of clothes is
increased this year or maybe instead of dresses you want to
buy a new bike, or you need to buy a house!
• So here, this Analysis makes predictions about future
outcomes based on current or past data. Forecasting is just an
estimate. Its accuracy is based on how much detailed
information you have and how much you dig in it.

35
Prescriptive Analysis
• Prescriptive Analysis combines the insight from
all previous Analysis to determine which action
to take in a current problem or decision. Most
data-driven companies are utilizing Prescriptive
Analysis because predictive and descriptive
Analysis are not enough to improve data
performance. Based on current situations and
problems, they analyze the data and make
decisions.

36
Types of Variable
 Categorical (qualitative) variables have values that
can only be placed into categories, such as “yes”
and “no.”

 Numerical (quantitative) variables have values that

represent quantities.
 Discrete variables arise from a counting process
 Continuous variables arise from a measuring process

37
Types of Variables
Variables

Categorical Numerical

Examples:
 Marital Status Discrete Continuous
 Political Party
 Eye Color
Examples: Examples:
(Defined categories)
 Number of Children  Weight
 Defects per hour  Voltage
(Counted items) (Measured characteristics)

Copyright ©2011
2-38
Pearson Education
Central Tendency-Mode
• The mode is the most commonly reported value
for a particular variable.
• It is illustrated using the following variable
whose values are: 3, 4, 5, 6, 7, 7, 7, 8,8,9
• The mode would be the value 7 since there are
three occurrences of 7 (more than any other
value).
• The following values, both 7 and 8 are reported
three times: 3, 4, 5, 6, 7,7, 7, 8, 8, 8, 9 The mode
may be reported as {7, 8} or 7.5.

39
Median
• The median is the middle value of a variable
once it has been sorted from low to high. For
variables with an even number of values, the
mean of the two values closest to the middle is
selected (sum the two values and divide by 2).
• The following set of values will be used to
illustrate: 3, 4, 7, 2, 3, 7,4, 2, 4, 7, 4.
• Before identifying the median, the values must
be sorted: 2, 2, 3, 3, 4, 4, 4, 4, 7,7, 7

40
Mean

41
Source of Data
• Surveys or polls
A survey or poll can be useful for gathering data to answer
specific questions
• Experiments:
Experiments measure and collect data to answer a specific
question in a highly controlled manner. The data collected
should be reliably measured, that is, repeating the
measurement should not result in different values.
Experiments attempt to understand cause and affect
phenomena by controlling other factors that may be
important.

42
• Observational and other studies: In certain
situations it is impossible on either logistical or
ethical grounds to conduct a controlled
experiment. In these situations, a large number
of observations are measured and care taken
when interpreting the results.
• Operational databases: These databases contain
ongoing business transactions. They are accessed
constantly and updated regularly. Examples
include supply chain management systems,
customer relationship management (CRM)
databases and manufacturing production
databases.

43
• Data warehouses: A data warehouse is a copy of data
gathered from other sources within an organization that
has been cleaned, normalized, and optimized for making
decisions. It is not updated as frequently as operational
databases.
• Historical databases: Databases are often used to house
historical polls, surveys and experiments.
• Purchased data: In many cases data from in-house
sources may not be sufficient to answer the questions
now being asked of it. One approach is to combine this
internal data with data from other sources.

44
Scales of Data
• Nominal: Scale describing a variable with a
limited number of different values. This scale is
made up of the list of possible values that the
variable may take. It is not possible to determine
whether one value is larger than another.
• Ordinal: This scale describes a variable whose
values are ordered; however, the difference
between the values does not describe the
magnitude of the actual difference.

45
• Interval: Scales that describe values where the
interval between the values has meaning.
• Ratio: Scales that describe variables where the
same difference between values has the same
meaning (as in interval) but where a double,
tripling, etc. of the values implies a double,
tripling, etc. of the measurement.

46
Table

47
Cleaning the Data
• Since the data available for analysis may not have been
originally collected with this project’s goal in mind, it is
important to spend time cleaning the data.
• It is also beneficial to understand the accuracy with
which the data was collected as well as correcting any
errors.
• For variables measured on a nominal or ordinal scale
(where there are a fixed number of possible values), it is
useful to inspect all possible values to uncover mistakes
and/or inconsistencies.
• Any assumptions made concerning possible values that
the variable can take should be tested.

48
• For example, a variable Company may include a
number of different spellings for the same
company such as:
• General Electric Company
• General Elec. Co
• GE
• Gen. Electric Company
• General electric company
• G.E. Company

49
• These different terms, where they refer to the
same company, should be consolidated into one
for analysis.
• In addition, subject matter expertise may be
needed in cleaning these variables.
• For example, a company name may include one
of the divisions of the General Electric Company
and for the purpose of this specific project it
should be included as the ‘‘General Electric
Company.’’

50
Removing Variables
• On the basis of an initial categorization of the
variables, it may be possible to remove variables
from consideration at this point.
• For example, constants and variables with too
many missing data points should be considered
for removal.
• Further analysis of the correlations between
multiple variables may identify variables that
provide no additional information to the analysis
and hence could be removed.

51
Data Transformation
Normalization
• Normalization is a process where numeric columns are transformed
using a mathematical function to a new range. It is important for two
reasons.
• First, analysis of the data should treat all variables equally so that
one column does not have more influence over another because the
ranges are different.
• For example, when analyzing customer credit card data, the Credit
limit value is not given more weightage in the analysis than the
Customer’s age.
• Second, certain data analysis and data mining methods require the
data to be normalized prior to analysis, such as neural networks or k-
nearest neighbors

52
Problem

53
Solution

CMPG 111 Pec - 2024
No ratings yet
CMPG 111 Pec - 2024
7 pages
The Aircraft Environmental Flight Envelope
No ratings yet
The Aircraft Environmental Flight Envelope
11 pages
Alpine MRV f305
No ratings yet
Alpine MRV f305
2 pages
Capstone Proj2023
No ratings yet
Capstone Proj2023
20 pages
Focal Bathys
No ratings yet
Focal Bathys
1 page
CE155 Quantity Take-Off - Excavation Quantity Calculation
No ratings yet
CE155 Quantity Take-Off - Excavation Quantity Calculation
23 pages
ENGM115 Tutorial 9 Critical Writing Exercises This One
100% (1)
ENGM115 Tutorial 9 Critical Writing Exercises This One
20 pages
Cause Effect Essay Sample Grade 12
100% (1)
Cause Effect Essay Sample Grade 12
3 pages
Interpreting Tables and Charts
100% (1)
Interpreting Tables and Charts
3 pages
Kebritchi, Mansureh, Issue & Challenges For Teaching Successful Online Courses
No ratings yet
Kebritchi, Mansureh, Issue & Challenges For Teaching Successful Online Courses
26 pages
IRMT TERM Module 5
100% (1)
IRMT TERM Module 5
90 pages
Principles of Information Technology Management PDF
No ratings yet
Principles of Information Technology Management PDF
25 pages
Observation As A Tool For Collecting Data: June 2023
No ratings yet
Observation As A Tool For Collecting Data: June 2023
15 pages
Econ 1011 Course Outline
No ratings yet
Econ 1011 Course Outline
4 pages
Written Report (Module) Educational Technology and The Teacher
No ratings yet
Written Report (Module) Educational Technology and The Teacher
12 pages
The Impact of Social Media On Society
No ratings yet
The Impact of Social Media On Society
5 pages
Summative Exam Empowerment Technology Semester 1 Quarter 1
No ratings yet
Summative Exam Empowerment Technology Semester 1 Quarter 1
39 pages
CSCE-365: Please Make Sure The Exam Has 3 Parts and 4pages Including The Cover Page and Answer Sheet
No ratings yet
CSCE-365: Please Make Sure The Exam Has 3 Parts and 4pages Including The Cover Page and Answer Sheet
4 pages
Reading Process Worksheet Palencia
No ratings yet
Reading Process Worksheet Palencia
4 pages
04 - 05-AI-Knowledge and Reasoning
No ratings yet
04 - 05-AI-Knowledge and Reasoning
61 pages
Read Theory
No ratings yet
Read Theory
2 pages
Good Day!: Empowerment Technologies
No ratings yet
Good Day!: Empowerment Technologies
35 pages
H1F734 Professionalism and Ethics in Computing
No ratings yet
H1F734 Professionalism and Ethics in Computing
14 pages
SQA Week 3 (BSCS 8) .
No ratings yet
SQA Week 3 (BSCS 8) .
23 pages
Strengths and Weaknesses of Data Collection Methods
No ratings yet
Strengths and Weaknesses of Data Collection Methods
6 pages
Ict - 101 New Syllabus (2018)
No ratings yet
Ict - 101 New Syllabus (2018)
7 pages
Research Methods - Writing A Research Proposal
No ratings yet
Research Methods - Writing A Research Proposal
14 pages
Biology Grade 11 SBK Units 1-3
No ratings yet
Biology Grade 11 SBK Units 1-3
114 pages
Frame Narrative
No ratings yet
Frame Narrative
15 pages
The Analysis of Service Blueprint Application For Qantas Airways Passenger Handling in Departure Terminal Soekarno-Hatta International Airport, Banten
No ratings yet
The Analysis of Service Blueprint Application For Qantas Airways Passenger Handling in Departure Terminal Soekarno-Hatta International Airport, Banten
8 pages
Information Gap
No ratings yet
Information Gap
5 pages
Types and Example of Research
No ratings yet
Types and Example of Research
27 pages
Chapter 4 Systems Thinking and Organizational Innovation - Question
100% (1)
Chapter 4 Systems Thinking and Organizational Innovation - Question
2 pages
ISQOLS 2024 1st Regional Conference Africa October 2024 Summary & Highlights
100% (1)
ISQOLS 2024 1st Regional Conference Africa October 2024 Summary & Highlights
4 pages
Question Bank Marketing All With Answers
No ratings yet
Question Bank Marketing All With Answers
36 pages
Amsalu Neme MSC Thesis... Edited
No ratings yet
Amsalu Neme MSC Thesis... Edited
87 pages
Case Study Method or Secondary Data
No ratings yet
Case Study Method or Secondary Data
2 pages
Advanced Academic Reading and Writing
No ratings yet
Advanced Academic Reading and Writing
16 pages
Computer Literacy Syllabus
No ratings yet
Computer Literacy Syllabus
3 pages
Campus Journalism Reviewer
No ratings yet
Campus Journalism Reviewer
7 pages
Lesson Plan 2
No ratings yet
Lesson Plan 2
2 pages
Final - IT8761 Security Lab Manual
No ratings yet
Final - IT8761 Security Lab Manual
55 pages
BOPPPs
No ratings yet
BOPPPs
17 pages
UNIT 2 Emerging Technology
No ratings yet
UNIT 2 Emerging Technology
66 pages
Methods of Paragraph Development 3
No ratings yet
Methods of Paragraph Development 3
4 pages
REVIEW QUESTIONs FOR BRM FINAL EXAM
No ratings yet
REVIEW QUESTIONs FOR BRM FINAL EXAM
15 pages
Hiv Aids Special Examination
No ratings yet
Hiv Aids Special Examination
3 pages
AMS 101 Course Outlines
No ratings yet
AMS 101 Course Outlines
1 page
Technopreneurship 2 PDF
No ratings yet
Technopreneurship 2 PDF
11 pages
DBIS Lecture 4 - Slides (AI and Big Data)
No ratings yet
DBIS Lecture 4 - Slides (AI and Big Data)
84 pages
Concepts in Enterprise Resource Planning, Fourth Edition 1-1
No ratings yet
Concepts in Enterprise Resource Planning, Fourth Edition 1-1
21 pages
Chapter 5 - System Software
No ratings yet
Chapter 5 - System Software
46 pages
Computer System Servicing and Maintenance: By: Cherry Mae I. Marcelo
No ratings yet
Computer System Servicing and Maintenance: By: Cherry Mae I. Marcelo
150 pages
TN Board Class 12 Maths-I
100% (1)
TN Board Class 12 Maths-I
304 pages
302 Book FOR STUDENTS
No ratings yet
302 Book FOR STUDENTS
188 pages
Advertisement for Gulu University
No ratings yet
Advertisement for Gulu University
25 pages
Computer Competency
No ratings yet
Computer Competency
1 page
Eng8 Quarter 2 Module 4reviseddocx 1
No ratings yet
Eng8 Quarter 2 Module 4reviseddocx 1
77 pages
Grade 11: I. Instruction: Identification. Choose The Best Answer From The Box Below. Avoid Erasures
No ratings yet
Grade 11: I. Instruction: Identification. Choose The Best Answer From The Box Below. Avoid Erasures
2 pages
Data Structures and Algorithms Course Syllabus
No ratings yet
Data Structures and Algorithms Course Syllabus
3 pages
Cse 2027 Fda M1
No ratings yet
Cse 2027 Fda M1
55 pages
Screenshot 2024-11-08 at 11.01.05 AM
No ratings yet
Screenshot 2024-11-08 at 11.01.05 AM
54 pages
Introduction to Data Science_students
No ratings yet
Introduction to Data Science_students
237 pages
unit-1ppt-241202105748-ba1c594f
No ratings yet
unit-1ppt-241202105748-ba1c594f
30 pages
Corrosion Under Insulation (CUI) Key Indicators: Issued November 2020 Rev 1
No ratings yet
Corrosion Under Insulation (CUI) Key Indicators: Issued November 2020 Rev 1
6 pages
DP Biometric 14121 Drivers
No ratings yet
DP Biometric 14121 Drivers
160 pages
HLR Profile
0% (1)
HLR Profile
6 pages
Lunyiu SOP UT
No ratings yet
Lunyiu SOP UT
2 pages
Fortigate Cheatsheets
No ratings yet
Fortigate Cheatsheets
2 pages
LMI - PID-Robust PID Controller Design Via LMI Approach - (Ge2002)
No ratings yet
LMI - PID-Robust PID Controller Design Via LMI Approach - (Ge2002)
11 pages
Business Case
No ratings yet
Business Case
9 pages
MyFerrari - Ferrari F8 Tributo - Fn61uFC
No ratings yet
MyFerrari - Ferrari F8 Tributo - Fn61uFC
11 pages
Catalog Klauke
100% (1)
Catalog Klauke
696 pages
Iball Clickscan (A3) User Manual
No ratings yet
Iball Clickscan (A3) User Manual
24 pages
Procedures For Excavation, Pipe Laying & Jointing - Mechanical Engineering Notes & Pipe Laying Procedures
No ratings yet
Procedures For Excavation, Pipe Laying & Jointing - Mechanical Engineering Notes & Pipe Laying Procedures
15 pages
Evpn Vxlan Interop Between Nxos and Junos Os - 230521 - 053618
No ratings yet
Evpn Vxlan Interop Between Nxos and Junos Os - 230521 - 053618
68 pages
ĐỀ MOCK TEST 4
No ratings yet
ĐỀ MOCK TEST 4
14 pages
DOC-20250114-WA0008.
No ratings yet
DOC-20250114-WA0008.
7 pages
Final PPT On Supply Chain MGMT
100% (2)
Final PPT On Supply Chain MGMT
20 pages
Demo Script
No ratings yet
Demo Script
5 pages
Modified Vol I of Feasibility Report
100% (1)
Modified Vol I of Feasibility Report
212 pages
Science and Technology Competitiveness Rankings of The Philippines (2011 - 2017)
No ratings yet
Science and Technology Competitiveness Rankings of The Philippines (2011 - 2017)
41 pages
Wiring Comunicacion
No ratings yet
Wiring Comunicacion
4 pages
AIX Operating System Hardening Procedures
No ratings yet
AIX Operating System Hardening Procedures
10 pages
Homework 2 Solution
No ratings yet
Homework 2 Solution
5 pages
Artificial Intelligence (AI) Literacy in Early Childhood Education: The Challenges and Opportunities
No ratings yet
Artificial Intelligence (AI) Literacy in Early Childhood Education: The Challenges and Opportunities
66 pages
11.1.4.11 Lab - Working With File Explorerc990
No ratings yet
11.1.4.11 Lab - Working With File Explorerc990
4 pages
Comparative Analysis and Benefits of Digital Library PDF
No ratings yet
Comparative Analysis and Benefits of Digital Library PDF
7 pages
HMV01 ProjectPlanning
No ratings yet
HMV01 ProjectPlanning
92 pages
Download
No ratings yet
Download
8 pages