Data Exploration and Analysis With Python
Introduction
Not surprisingly, the role of a data scientist primarily involves data exploration and
analysis. The results of an analysis can form the basis of a report or a machine learning
model, but it all starts with the data. Python is the most popular programming
language among data scientists.
After decades of open source development, Python offers extensive functionality through
powerful statistical and numerical libraries such as NumPy and Pandas.
Typically, a data analysis project is designed to draw conclusions about a specific scenario
or to test a hypothesis.
For example, suppose a university professor collects data from their students, such as the
number of classes attended, the hours spent studying, and the final grade obtained on
the end-of-course exam. The professor could analyze the data to determine whether there is a
relationship between the amount of studying a student does and the final grade they receive,
or could use the data to test a hypothesis that only students who study a minimum
number of hours can expect to achieve a passing grade.
Prerequisites
Knowledge of basic mathematics
Previous experience with Python programming
Learning objectives
Data scientists can use various tools and techniques to explore, visualize, and manipulate
data. One of the most common ways data scientists work with data is using the Python
programming language and some specific data processing packages.
What are NumPy and Pandas?
NumPy is a Python library that provides fast, memory-efficient arrays along with a broad
set of numerical and statistical functions.
Pandas is a well-known Python library for data analysis and manipulation. Pandas is like
Python's Excel: it provides easy-to-use functionality for data tables.
Jupyter notebooks are a popular way to run scripts using your web browser. A notebook is
a single web page, divided into sections of text and sections of code that run on a
server rather than your local machine. This means you can get up and running quickly
without needing to install Python or other tools.
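The two libraries above can be combined in a few lines. The sketch below uses hypothetical sample data (the student names, study hours, and grades are invented for illustration) to show a NumPy array and a Pandas DataFrame side by side:

```python
import numpy as np
import pandas as pd

# Hypothetical sample data: final exam grades for ten students.
grades = np.array([50, 50, 47, 97, 49, 3, 53, 42, 26, 74])
print(grades.mean())  # NumPy arrays support fast vectorized math

# A Pandas DataFrame adds labeled rows and columns on top of arrays.
df = pd.DataFrame({
    'study_hours': [10.0, 11.5, 9.0, 16.0, 9.25, 1.0, 11.5, 9.0, 8.5, 14.5],
    'grade': grades,
})
print(df.describe())  # summary statistics for every numeric column
```

Running this in a Jupyter notebook cell would display the summary table inline, which is one reason notebooks are popular for this kind of work.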
Hypothesis testing
Data exploration and analysis is typically an iterative process, in which the data scientist
takes a sample of the data and performs the following tasks to analyze it and test a hypothesis:
Clean the data to handle errors, missing values, and other problems.
Apply statistical techniques to better understand the data and how well the
sample can be expected to represent the real-world population of data, allowing
for random variation.
Visualize the data to determine relationships between variables and, in the case
of a machine learning project, identify features that may be predictive of
the label.
Revise the hypothesis and repeat the process.
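These steps can be sketched in a few lines of Pandas. The data below is hypothetical (study hours and grades invented for illustration, with one deliberately missing value), and the cleaning strategy shown, dropping incomplete rows, is just one option among several:

```python
import pandas as pd

# Hypothetical sample: study hours and exam grades, with one missing value.
df = pd.DataFrame({
    'study_hours': [10.0, 11.5, 9.0, 16.0, None],
    'grade': [50, 50, 47, 97, 49],
})

# 1. Clean: count missing values, then drop (or impute) incomplete rows.
print(df.isnull().sum())
clean = df.dropna()

# 2. Apply statistics: summary measures of the cleaned sample.
print(clean['study_hours'].mean(), clean['grade'].mean())

# 3. Explore relationships: correlation between study time and grade.
print(clean['study_hours'].corr(clean['grade']))
```

A positive correlation here would support the hypothesis that more study time goes with higher grades, which you would then test again on new samples.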
Data visualization
Data scientists visualize data to better understand it. This may mean examining the
raw data, summary measures such as means, or plotting the data. Charts are a
powerful means of data visualization, as we can quickly discern moderately
complex patterns without needing to define summary mathematical measures.
Although we sometimes know in advance which type of chart will be most useful,
at other times we use charts in an exploratory way. To understand the power of data
visualization, consider the following data: the (x,y) location of a self-driving car. In
its raw form, it is hard to see any real pattern. The mean tells us that its
trajectory centered around x=0.2 and y=0.3, and the range of values seems to
lie between approximately -2 and 2.
If we now plot location X over time, we can see that we appear to have some
missing values between times 7 and 12.
If we plot X against Y, we end up with a map of where the car has moved. It is
immediately obvious that the car has been driving in a circle, but at some point it
drove towards the center of that circle.
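The two plots described above can be reproduced with matplotlib. The trajectory below is synthetic, a circular path generated from sine and cosine, standing in for the car data, which is not included in this module's text:

```python
import numpy as np
import matplotlib
matplotlib.use('Agg')  # non-interactive backend so the script runs anywhere
import matplotlib.pyplot as plt

# Hypothetical trajectory standing in for the self-driving car data:
# a circular path, so the patterns described above become visible.
t = np.linspace(0, 20, 200)
x = np.cos(t)
y = np.sin(t)

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))
ax1.plot(t, x)                      # location X over time: gaps appear as breaks
ax1.set(xlabel='time', ylabel='x')
ax2.plot(x, y)                      # X against Y: a map of where the car moved
ax2.set(xlabel='x', ylabel='y')
fig.savefig('trajectory.png')
```

In a notebook you would call `plt.show()` instead of saving to a file; either way, the X-versus-Y panel makes the circular motion obvious at a glance.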
Charts are not limited to 2D scatterplots like those above. They can be used to
explore other kinds of questions, such as proportions (pie charts, stacked bar
charts), how data is spread (histograms, box plots), and how two sets of data differ.
Often, when we are trying to understand raw data or results, we can experiment
with different types of graphs until we find one that explains the data visually.
The data presented in educational materials are often remarkably perfect, designed to show
students how to find clear relationships between variables. Real-world data is a little less
straightforward.
Due to the complexity of "real world" data, the best practice is to inspect raw data for
problems and process it before use, which reduces errors, usually by removing bad data
points or modifying the data to make it more useful.
Real world data problems
Real-world data can contain many different problems that can affect the usefulness of the
data and our interpretation of the results.
It is important to note that most real-world data is influenced by factors that were not
recorded at the time. For example, we might have a table of race car times along with
engine sizes, but other factors that weren't noted down, such as weather, probably played a
role as well. If problematic, the influence of these factors can often be reduced by
increasing the size of the data set.
In other situations, data points that fall clearly outside of expectations, also known as
outliers, can sometimes be safely removed from an analysis, although care should be taken
not to remove data points that provide real information.
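A simple way to flag outliers is to measure how many standard deviations each point sits from the mean. The lap times below are hypothetical, with one obvious recording error, and the 2-sigma threshold is an illustrative choice, not a universal rule:

```python
import pandas as pd

# Hypothetical race lap times in seconds; 542.0 is an obvious recording error.
times = pd.Series([71.2, 69.8, 70.5, 542.0, 72.1, 70.9])

# Flag points more than 2 standard deviations from the mean
# (a loose threshold, practical for this very small sample).
z = (times - times.mean()) / times.std()
outliers = times[z.abs() > 2]
print(outliers)

filtered = times[z.abs() <= 2]
print(filtered.mean())  # far more representative without the outlier
```

Note the caution from the text applies here too: before dropping the 542-second lap, you would want to confirm it is an error and not, say, a real pit stop.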
Another common problem in real-world data is bias. Bias refers to the tendency to select
certain types of values more frequently than others, thereby misrepresenting the
underlying population, or the "real world." Bias can sometimes be identified by exploring
the data and applying basic knowledge about where the data comes from.
Remember that real-world data will always have problems, but these are usually
surmountable.
Knowledge test
1.
You have a NumPy array with a shape of (2,20). What does this indicate about the elements
of the array?
The array is two-dimensional and consists of two arrays, each with 20 elements.
Correct. A shape of (2,20) indicates a multidimensional array of two arrays,
each containing 20 elements.
The array contains 2 elements, with values 2 and 20.
The array contains 20 elements, all of them with the value 2.
Incorrect. A shape of (2,20) indicates a multidimensional array of two
arrays, each containing 20 elements.
2.
You have a Pandas DataFrame object called df_sales that contains daily sales data.
The DataFrame object contains the following columns: year, month, day_of_month,
and sales_total. You want to determine the average value of sales_total. Which code
should you use?
df_sales['sales_total'].avg()
df_sales['sales_total'].mean()
Correct. This code will return the average of the values in the sales_total
column.
mean(df_sales['sales_total'])
3.
You have a DataFrame object that contains data about daily ice cream sales. You use the
corr method to compare the avg_temp and units_sold columns and obtain a result of
0.97. What does this result indicate?
On the day with the maximum value of units_sold, the value of avg_temp was 0.97.
Days with high avg_temp values tend to coincide with days with high units_sold
values.
Correct. The "corr" method returns the correlation, and a value close to 1
indicates a positive correlation.
The units_sold value is, on average, 97% of the avg_temp value.
That is incorrect. The "corr" method returns the correlation between two
numeric columns.
Summary
In this module, you learned how to use Python to explore, visualize, and manipulate data.
Data exploration is the foundation of data science and a key element in data analysis and
machine learning.
Machine learning is a subset of data science that deals with predictive modeling. In other
words, machine learning uses data to create predictive models, in order to predict unknown
values. You could use machine learning to predict how much food a supermarket should
order, or to identify plants in photographs.
What machine learning does is identify relationships between data values that describe the
properties of something, its features, such as the height and color of a plant, and the
value you want to predict, the label, such as the species of the plant. These relationships
are built into a model through a training process.
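At its simplest, such a training process is just fitting a function from features to labels. The sketch below is a deliberately minimal stand-in, a least-squares line fit with NumPy on invented data, not the full machine learning workflow the module alludes to:

```python
import numpy as np

# Hypothetical feature (study hours) and label (grade); the data is
# generated to follow grade = 10 * hours + 2 exactly, for illustration.
hours = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
grade = np.array([12.0, 22.0, 32.0, 42.0, 52.0])

# "Training": find slope w and intercept b minimizing squared error.
A = np.vstack([hours, np.ones_like(hours)]).T
w, b = np.linalg.lstsq(A, grade, rcond=None)[0]
print(w, b)

# "Prediction": apply the fitted model to an unseen feature value.
predicted = w * 6.0 + b
print(predicted)
```

Real models use many features, noisy labels, and more sophisticated algorithms, but the feature-to-label structure is the same as in this toy fit.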
If the exercises in this module have inspired you to try exploring data for yourself, why not
take on the challenge of a real-world data set containing flight records from the US
Department of Transportation? You will find the challenge in the notebook 01 - Flights
Challenge.ipynb.
Note
The time to complete this optional challenge is not included in the estimated time for this
module; you can spend as much time on it as you want.