Data Exploration and Analysis With Python
Introduction
Not surprisingly, the role of a data scientist primarily involves data exploration and
analysis. The results of an analysis can form the basis of a report or a machine learning
model, but it all starts with the data. Python is the most popular programming
language among data scientists.
After decades of open source development, Python offers extensive functionality through
powerful statistical and numerical libraries such as NumPy and Pandas.
Typically, a data analysis project is designed to draw conclusions about a specific scenario
or to test a hypothesis.
For example, suppose a university professor collects data from their students, such as the
number of classes attended, the hours spent studying, and the final grade obtained on
the end-of-course exam. The professor could analyze the data to determine whether there is a
relationship between the amount of studying a student does and the final grade they receive,
or could use the data to test a hypothesis that only students who study a minimum
number of hours can expect to achieve a passing grade.
Prerequisites
Knowledge of basic mathematics
Previous experience with Python programming
Learning objectives
Data scientists can use various tools and techniques to explore, visualize, and manipulate
data. One of the most common ways data scientists work with data is using the Python
programming language and some specific data processing packages.
What are NumPy and Pandas?
NumPy is a Python library that provides fast, memory-efficient arrays along with a broad
set of numerical and statistical functions.
Pandas is a well-known Python library for data analysis and manipulation. Pandas is like
Python's Excel: it provides easy-to-use functionality for data tables.
Jupyter notebooks are a popular way to run scripts using your web browser. A notebook is
a single web page, divided into sections of text and sections of code that run on a
server rather than your local machine. This means you can get up and running quickly
without needing to install Python or other tools.
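The two libraries above can be combined in a few lines. The sketch below uses hypothetical sample data (the student names, study hours, and grades are invented for illustration) to show a NumPy array and a Pandas DataFrame side by side:

```python
import numpy as np
import pandas as pd

# Hypothetical sample data: final exam grades for ten students.
grades = np.array([50, 50, 47, 97, 49, 3, 53, 42, 26, 74])
print(grades.mean())  # NumPy arrays support fast vectorized math

# A Pandas DataFrame adds labeled rows and columns on top of arrays.
df = pd.DataFrame({
    'study_hours': [10.0, 11.5, 9.0, 16.0, 9.25, 1.0, 11.5, 9.0, 8.5, 14.5],
    'grade': grades,
})
print(df.describe())  # summary statistics for every numeric column
```

Running this in a Jupyter notebook cell would display the summary table inline, which is one reason notebooks are popular for this kind of work.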
Hypothesis testing
Data exploration and analysis is typically an iterative process, in which the data scientist
takes a sample of the data and performs the following tasks to analyze it and test a hypothesis:
Clean the data to handle errors, missing values, and other problems.
Apply statistical techniques to better understand the data and how well the
sample can be expected to represent the real-world population of data, allowing
for random variation.
Visualize the data to determine relationships between variables and, in the case
of a machine learning project, identify features that may be predictive of
the label.
Revise the hypothesis and repeat the process.
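These steps can be sketched in a few lines of Pandas. The data below is hypothetical (study hours and grades invented for illustration, with one deliberately missing value), and the cleaning strategy shown, dropping incomplete rows, is just one option among several:

```python
import pandas as pd

# Hypothetical sample: study hours and exam grades, with one missing value.
df = pd.DataFrame({
    'study_hours': [10.0, 11.5, 9.0, 16.0, None],
    'grade': [50, 50, 47, 97, 49],
})

# 1. Clean: count missing values, then drop (or impute) incomplete rows.
print(df.isnull().sum())
clean = df.dropna()

# 2. Apply statistics: summary measures of the cleaned sample.
print(clean['study_hours'].mean(), clean['grade'].mean())

# 3. Explore relationships: correlation between study time and grade.
print(clean['study_hours'].corr(clean['grade']))
```

A positive correlation here would support the hypothesis that more study time goes with higher grades, which you would then test again on new samples.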
Data visualization
Data scientists visualize data to better understand it. This may mean examining the
raw data, summary measures such as means, or plotting the data. Charts are a
powerful means of data visualization, as we can quickly discern moderately
complex patterns without needing to define summary mathematical measures.
Although we sometimes know in advance which type of chart will be most useful,
at other times we use charts in an exploratory way. To understand the power of data
visualization, consider the following data: the (x,y) location of a self-driving car. In
its raw form, it is hard to see any real pattern. The mean tells us that its
trajectory centered around x=0.2 and y=0.3, and the range of values seems to
lie between approximately -2 and 2.
If we now plot location X over time, we can see that we appear to have some
missing values between times 7 and 12.
If we plot X against Y, we end up with a map of where the car has moved. It is
immediately obvious that the car has been driving in a circle, but at some point it
drove towards the center of that circle.
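The two plots described above can be reproduced with matplotlib. The trajectory below is synthetic, a circular path generated from sine and cosine, standing in for the car data, which is not included in this module's text:

```python
import numpy as np
import matplotlib
matplotlib.use('Agg')  # non-interactive backend so the script runs anywhere
import matplotlib.pyplot as plt

# Hypothetical trajectory standing in for the self-driving car data:
# a circular path, so the patterns described above become visible.
t = np.linspace(0, 20, 200)
x = np.cos(t)
y = np.sin(t)

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))
ax1.plot(t, x)                      # location X over time: gaps appear as breaks
ax1.set(xlabel='time', ylabel='x')
ax2.plot(x, y)                      # X against Y: a map of where the car moved
ax2.set(xlabel='x', ylabel='y')
fig.savefig('trajectory.png')
```

In a notebook you would call `plt.show()` instead of saving to a file; either way, the X-versus-Y panel makes the circular motion obvious at a glance.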
Charts are not limited to 2D scatterplots like those above. They can be used to
explore other kinds of questions, such as proportions (pie charts, stacked bar
charts), how data is spread (histograms, box plots), and how two sets of data differ.
Often, when we are trying to understand raw data or results, we can experiment
with different types of graphs until we find one that explains the data visually.
The data presented in educational materials are often remarkably perfect, designed to show
students how to find clear relationships between variables. Real-world data is a little less
straightforward.
Due to the complexity of "real world" data, the best practice is to inspect raw data for
problems and process it before use, which reduces errors, usually by removing bad data
points or modifying the data to make it more useful.
Real world data problems
Real-world data can contain many different problems that can affect the usefulness of the
data and our interpretation of the results.
It is important to note that most real-world data is influenced by factors that were not
recorded at the time. For example, we might have a table of race car times along with
engine sizes, but other factors that weren't noted down, such as weather, probably played a
role as well. If problematic, the influence of these factors can often be reduced by
increasing the size of the data set.
In other situations, data points that fall clearly outside of expectations, also known as
outliers, can sometimes be safely removed from an analysis, although care should be taken
not to remove data points that provide real information.
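A simple way to flag outliers is to measure how many standard deviations each point sits from the mean. The lap times below are hypothetical, with one obvious recording error, and the 2-sigma threshold is an illustrative choice, not a universal rule:

```python
import pandas as pd

# Hypothetical race lap times in seconds; 542.0 is an obvious recording error.
times = pd.Series([71.2, 69.8, 70.5, 542.0, 72.1, 70.9])

# Flag points more than 2 standard deviations from the mean
# (a loose threshold, practical for this very small sample).
z = (times - times.mean()) / times.std()
outliers = times[z.abs() > 2]
print(outliers)

filtered = times[z.abs() <= 2]
print(filtered.mean())  # far more representative without the outlier
```

Note the caution from the text applies here too: before dropping the 542-second lap, you would want to confirm it is an error and not, say, a real pit stop.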
Another common problem in real-world data is bias. Bias refers to the tendency to select
certain types of values more frequently than others, thereby misrepresenting the
underlying population, or the "real world." Bias can sometimes be identified by exploring
the data and applying basic knowledge about where the data comes from.
Remember that real-world data will always have problems, but these are usually
surmountable.
Knowledge test
1.
You have a NumPy array with a shape of (2,20). What does this indicate about the elements
of the array?
The array is two-dimensional and consists of two arrays, each with 20 elements.
Correct. A shape of (2,20) indicates a multidimensional array of two arrays,
each containing 20 elements.
The array contains 2 elements, with values 2 and 20.
The array contains 20 elements, all of them with the value 2.
Incorrect. A shape of (2,20) indicates a multidimensional array of two
arrays, each containing 20 elements.
2.
You have a Pandas DataFrame object called df_sales that contains daily sales data.
The DataFrame object contains the following columns: year, month, day_of_month,
and sales_total. You want to determine the average value of sales_total. Which code
should you use?
df_sales['sales_total'].avg()
df_sales['sales_total'].mean()
Correct. This code will return the average of the values in the sales_total
column.
mean(df_sales['sales_total'])
3.
You have a DataFrame object that contains data about daily ice cream sales. You use the
corr method to compare the avg_temp and units_sold columns and obtain a result of
0.97. What does this result indicate?
On the day with the maximum value of units_sold, the value of avg_temp was 0.97.
Days with high avg_temp values tend to coincide with days with high units_sold
values.
Correct. The "corr" method returns the correlation, and a value close to 1
indicates a positive correlation.
The units_sold value is, on average, 97% of the avg_temp value.
That is incorrect. The "corr" method returns the correlation between two
numeric columns.
Summary
In this module, you learned how to use Python to explore, visualize, and manipulate data.
Data exploration is the foundation of data science and a key element in data analysis and
machine learning.
Machine learning is a subset of data science that deals with predictive modeling. In other
words, machine learning uses data to create predictive models, in order to predict unknown
values. You could use machine learning to predict how much food a supermarket should
order, or to identify plants in photographs.
What machine learning does is identify relationships between data values that describe the
properties of something, its features, such as the height and color of a plant, and the
value you want to predict, the label, such as the species of the plant. These relationships
are built into a model through a training process.
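At its simplest, such a training process is just fitting a function from features to labels. The sketch below is a deliberately minimal stand-in, a least-squares line fit with NumPy on invented data, not the full machine learning workflow the module alludes to:

```python
import numpy as np

# Hypothetical feature (study hours) and label (grade); the data is
# generated to follow grade = 10 * hours + 2 exactly, for illustration.
hours = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
grade = np.array([12.0, 22.0, 32.0, 42.0, 52.0])

# "Training": find slope w and intercept b minimizing squared error.
A = np.vstack([hours, np.ones_like(hours)]).T
w, b = np.linalg.lstsq(A, grade, rcond=None)[0]
print(w, b)

# "Prediction": apply the fitted model to an unseen feature value.
predicted = w * 6.0 + b
print(predicted)
```

Real models use many features, noisy labels, and more sophisticated algorithms, but the feature-to-label structure is the same as in this toy fit.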
If the exercises in this module have inspired you to try exploring data for yourself, why not
take on the challenge of a real-world data set containing flight records from the US
Department of Transportation? You will find the challenge in the notebook 01 - Flights
Challenge.ipynb.
Note
The time to complete this optional challenge is not included in the estimated time for this
module; you can spend as much time on it as you want.