Data Science With Python
When we combine domain expertise and scientific methods with technology, we get Data Science.
Data Science
Data Scientists collect data and explore, analyze, and visualize it. They apply mathematical and statistical
models to find patterns and solutions in the data.
Analysis
• Descriptive: Study a dataset to decipher the details
• Predictive: Create a model based on existing information to predict outcome and behavior
• Prescriptive: Suggest actions for a given situation using the collected information
[Diagram: mathematical and statistical models combined with scientific tools and methods]
Data Processing and Analytics
Modern tools and technologies have made data processing and analytics faster and more efficient.
Note: Data analysis that uses only technology and domain knowledge, without mathematical and statistical knowledge, often leads to incorrect patterns and wrong interpretations. This can cause serious damage to businesses.
A Day in a Data Scientist’s Life
Basic Skills of a Data Scientist
Data Scientists work with different types of datasets for various purposes. Now that Big Data is generated every
second through different media, the role of Data Science has become more important.
The 3 Vs of Big Data
Big Data is a huge collection of data stored on distributed systems/machines popularly referred to as Hadoop clusters. Data Science helps extract information from the data and build information-driven enterprises.
Different Sectors Using Data Science
Various sectors use Data Science to extract the information they need to create different services and
products.
Using Data Science—Social Network Platforms
LinkedIn uses data points from its users to provide them with relevant digital services and data products.
[Diagram: LinkedIn data points, such as Profile, Groups, Location, Connections, Posts, and Likes, feed digital information services and data products]
Using Data Science—Search Engines
Google uses Data Science to provide relevant search recommendations as the user types a query.
Using Data Science—Wearable Devices
Wearable devices use Data Science to analyze data gathered by their biometric sensors.
[Diagram: biometric sensor data flows into data analytics and an engagement dashboard, helping users make informed decisions]
Using Data Science—Finance
A Loan Manager can easily access and sift through a loan applicant’s financial details using Data
Science.
The Real Challenge
Python deals with each stage of data analytics efficiently by applying different libraries and packages.
[Diagram: data analytics stages, Acquire, Wrangle, Explore, Model, and Visualize, each handled by Python libraries such as Bokeh]
Python Tools and Technologies
Python is a general-purpose, open-source programming language that lets you work quickly and integrate systems more effectively.
Benefits of Python
• Easy to learn
• Open source
Key Takeaways
Many datasets are freely available; Data Science can turn them into data services and data products.
Data Scientists are more in demand with the evolution of Big Data and
real-time analytics.
b. Acquires data
Explanation: A Data Scientist asks the right questions to the stakeholders, acquires data from various sources and data points,
performs data wrangling that makes the data available for analysis, and creates reports and plots for data visualization.
QUIZ 2
The Search Engine’s Autocomplete feature identifies unique and verifiable users who search for a particular keyword or phrase _____. Select all that apply.
Explanation: The Search Engine’s Autocomplete feature identifies unique and verifiable users who search for a particular
keyword or phrase to build a Query Volume. It also helps identify the users’ locations and tag them to the query, enabling it to be
location-specific.
QUIZ 3
What is the sequential flow of Data Analytics?
Explanation: In Data Analytics, the data is acquired from various sources and is then wrangled to ease its analysis. This is
followed by data exploration and data modeling. The final stage is data visualization, where the data is presented and the
patterns are identified.
This concludes “Data Science Overview.”
The next lesson is “Data Analytics Overview.”
Data Science with Python
Lesson 2 – Data Analytics Overview
What’s In It For Me
Data by itself is just an information source. But unless you can understand it, you will not be able to use
it effectively.
When the transaction details are presented as a line chart, the deposit and withdrawal patterns
become apparent.
[Line chart: Bank Transaction Details for April 2016, amount in $ per day of the month (1–30), showing the overall pattern]
Why Data Analytics (contd.)
When the transaction details are presented as a line chart, the deposit and withdrawal patterns become
apparent. It helps view and analyze general trends and discrepancies.
[Line chart: Bank Transaction Details for Mar-16, Apr-16, and May-16, amount in $ per day (1–30), with a discrepancy highlighted in May]
Introduction to Data Analytics
Data sources include sales and inventory systems. Twitter, Facebook, LinkedIn, and other social media and information sites provide streaming APIs.
Data Scientist Expertise:
• Database skills
• File handling and file formats
• Web scraping

Server logs can be extracted from enterprise system servers to analyze and optimize application performance.
Data Wrangling and Exploration
Data wrangling is the most important and most challenging phase of the data analytics process, taking up about 70% of a data scientist's time.
This phase includes data cleansing, data manipulation, data aggregation, data split, and reshaping of data.
Data Exploration—Model Selection
Model selection
• Based on the overall data analysis process
• Should be accurate to avoid iterations
• Depends on pattern identification and algorithms
• Depends on hypothesis building and testing
• Leads to building mathematical and statistical functions
Exploratory Data Analysis (EDA)
The quantitative EDA technique has two goals: measuring the central tendency and the spread of the data.
Measurement of Central Tendency
• Mean: the point that indicates how centralized the data points are. Suitable for symmetric distributions.

Measurement of Spread
• Variance: approximately the mean of the squares of the deviations.
• Standard Deviation: the square root of the variance.
• Inter-quartile Range: the distance between the 75th and 25th percentiles; essentially the middle 50% of the data.
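These measures are easy to compute with NumPy; a minimal sketch (the values are illustrative):

```python
import numpy as np

data = np.array([98, 95, 92, 90, 85, 81, 79, 70, 63, 55, 47, 42])
print(np.mean(data))           # central tendency
print(np.var(data, ddof=1))    # sample variance (n - 1 denominator)
print(np.std(data, ddof=1))    # standard deviation = sqrt(variance)
q75, q25 = np.percentile(data, [75, 25])
print(q75 - q25)               # inter-quartile range: middle 50% of the data
```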
EDA – Graphical Technique
Histograms and Scatter Plots are two popular graphical techniques to depict data.
It shows:
• the center or location of data (mean, median, or mode)
• the spread of data
a. Collect data from various data sources
d. Scrape the web through web APIs
Explanation: Data acquisition is a process to collect data from various data sources such as RDBMS, NoSQL databases, and web server logs, and also to scrape the web through web APIs.
KNOWLEDGE CHECK
What is the Exploratory Data Analysis technique? Select all that apply.
a. Analysis of data using quantitative techniques
d. Suggests models that best fit the data
Explanation: Most EDA techniques are graphical in nature, with a few quantitative techniques; they also suggest models that best fit the data. They use almost the entire dataset, with minimal or no assumptions.
Conclusion or Predictions
This step involves reaching a conclusion and making predictions based on the data
analysis.
There are three phases to hypothesis building: model building, model evaluation, and model deployment.
Draw two samples from the population and calculate the difference between their means.
Calculating the difference between the means of the two samples, S1 (mean μ1) and S2 (mean μ2), is “hypothesis testing.”
Hypothesis Testing
Draw two samples from the population and calculate the difference between their means.
Alternative Hypothesis
• The proposed model outcome is accurate and matches the data.
• There is a difference between the means of S1 and S2.

Null Hypothesis
• The opposite of the alternative hypothesis.
• There is no difference between the means of S1 and S2.
Hypothesis Testing Process
Choosing the training and test dataset and evaluating them with the null and alternative
hypothesis.
Usually the training dataset is between 60% and 80% of the full dataset, and the test dataset is between 20% and 40%.
Communication
Features of plotting:
• Plotting is like telling a story about
data using different colors, shapes,
and sizes.
• Plotting shows the relationship
between variables.
• Example:
o A change in the value of X results in a change in the value of Y.
o X is independent of Y.
Data Types for Plotting
Time Series: Data measured in time blocks such as date, month, year, and time (hours, minutes, and seconds).
Types of Plot
Skills and tools required for each step of the data analysis process.
KNOWLEDGE CHECK
Which plotting technique is used for continuous data? Select all that apply.
a. Regression plot
b. Line chart
c. Histogram
d. Heat map
Explanation: Line charts and histograms are used to plot continuous data.
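As a quick illustration of both techniques, here is a minimal matplotlib sketch (the data is randomly generated):

```python
import numpy as np
import matplotlib.pyplot as plt

values = np.random.randn(200).cumsum()   # made-up continuous series
fig, (ax1, ax2) = plt.subplots(1, 2)
ax1.plot(values)                         # line chart for continuous data
ax2.hist(values, bins=20)                # histogram of the same data
plt.show()
```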
Quiz
QUIZ 1
Which Python library is the main machine learning library?
a. Pandas
b. Matplotlib
c. Scikit-learn
d. NumPy
QUIZ 2
Which of the following includes data transformation, merging, aggregation, group by operation, and reshaping?
a. Data Acquisition
b. Data Visualization
c. Data Wrangling
d. Machine learning
QUIZ 3
Which measure of central tendency is used to catch outliers in the data?
a. Mean
b. Median
c. Mode
d. Variance
c. a small subset.
QUIZ 5
The Beautiful Soup library is used for _____.
a. data wrangling
b. web scraping
c. plotting
d. machine learning
Introduction to Statistics
Statistics is the study of the collection, analysis, interpretation, presentation, and organization of data.
[Diagram: data applied to complex problems leads to problems solved and well-informed decisions]
Although both forms of analysis provide results, quantitative analysis provides more insight
and a clearer picture. This is why statistical analysis is important for businesses.
Major Categories of Statistics
There are two major categories of statistics: descriptive analytics and inferential analytics.
Descriptive analytics organizes the data and focuses on the main characteristics of the data.
[Bar chart: number of students by math score (0–9), grouped into high, medium, and low]
Inferential analytics is valuable when it is not possible to examine each member of the
population.
Major Categories of Statistics – An Example
Study of the height of a population, categorized as “Tall,” “Medium,” and “Short”:

Inferential Method
• Take a sample from the population to study.

Descriptive Method
• Record the height of each and every person.
• Provide the tallest, shortest, and average height of the population.
Statistical Analysis Considerations
A sample is:
• The part/piece drawn from the population
• A subset of the population
• A random selection that represents the characteristics of the population
• The basis of a representative analysis of the entire population
Statistics and Parameters
“Statistics” are quantitative values calculated from the sample.
“Parameters” are the characteristics of the population.
A sample x0, x1, x2, …, xn is drawn from the population.
Sample statistics and population parameters, with their formulas:
• Mean: parameter $\mu$; sample statistic $\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$
• Variance: parameter $\sigma^2$; sample statistic $S^2 = \frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})^2$
• Standard Deviation: parameter $\sigma$; sample statistic $S = \sqrt{\frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})^2}$
Terms Used to Describe Data
Typical terms used in data analysis are:
• Range: the minimum and maximum values of the data.

Observations (sorted): 98, 95, 92, 90, 85, 81, 79, 70, 63, 55, 47, 42
• 75th percentile = 91 (Third Quartile)
• 50th percentile = 80 (Second Quartile or Median)
• 25th percentile = 59 (First Quartile)
Dispersion
Dispersion denotes how stretched or squeezed a distribution is. Using the same observations (98, 95, 92, 90, 85, 81, 79, 70, 63, 55, 47, 42; mean = 74.75):
• Range: the difference between the maximum and minimum values.
• Inter-quartile Range: the difference between the 25th and 75th percentiles (91 − 59).
• Variance: the spread of the data values around the mean.
• Standard Deviation: the square root of the variance, expressed in the same units as the data.
KNOWLEDGE CHECK
What does frequency indicate?
Features of a Histogram:
• It was first introduced by Karl Pearson.
• It plots the frequency of a variable.
• Bins are of equal size.
Normal Curve
The Bell curve is:
• Symmetric around the mean,
• Defined by the mean and standard deviation, and
• Known as the “Gaussian” curve.
[Figure: bell curve over −3 to +3 standard deviations, annotated with the share of data in each band and in the tails]
The Bell curve is fully characterized by the mean (μ) and standard deviation (σ).
The Bell Curve
The Bell curve is divided into three parts to understand data distribution better.
• Flanks: between one and two standard deviations from the mean
• Tails: beyond two standard deviations from the mean
Bell Curve – Left Skewed
Skewed data distribution indicates the tendency of the data distribution to be more spread out on one side.
Left Skewed
• The data is left skewed.
• The left tail contains a large part of the distribution.
[Histogram: frequency (0–80) vs. measurement (1–13), with a long left tail]
Bell Curve – Right Skewed
Skewed data distribution indicates the tendency of the data distribution to be more spread out on one
side.
Right Skewed
• The data is right skewed.
• The right tail contains a large part of the distribution.
[Histogram: frequency (0–80) vs. measurement (1–13), with a long right tail]
Kurtosis
Kurtosis describes the shape of a probability distribution.
Kurtosis measures the tendency of the data toward the center or toward the tails. There are:
• Different ways of quantifying kurtosis for a theoretical distribution.
• Corresponding ways of estimating it from a sample of a population.

The three shapes are:
• (+) Leptokurtic: positive kurtosis.
• (0) Mesokurtic: represents a normal distribution curve.
• (−) Platykurtic: negative kurtosis.
KNOWLEDGE CHECK
Which of the following is true for a normal distribution?
Explanation: For a bell curve, the mean, median, and mode are equal.
Hypothesis Testing
Hypothesis testing is an inferential statistical technique that determines if a certain condition is true for the
population.
• Step 1. Set Hypothesis: H0 (μ1 = μ2), equality; H1 (μ1 ≠ μ2), difference.
• Step 2. Set α: choose the significance level for the population.
• Step 3. Collect Data: collect a sample from the population.
• Step 4. Make Decision: reject H0 if p-value < α; otherwise fail to reject H0 (p-value ≥ α).
Example, Company A vs. Company B:
• Null Hypothesis: both medicines are equally effective.
• Alternative Hypothesis: both medicines are NOT equally effective.
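A minimal sketch of such a two-sample test with SciPy (the effectiveness scores are made-up illustration data):

```python
from scipy import stats

company_a = [8.2, 7.9, 8.5, 8.0, 8.3]   # made-up effectiveness scores
company_b = [7.1, 7.4, 6.9, 7.3, 7.0]
t_stat, p_value = stats.ttest_ind(company_a, company_b)
alpha = 0.05                             # significance level
print(p_value < alpha)                   # True -> reject the null hypothesis
```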
Data for Hypothesis Testing
There are three types of data on which you can perform hypothesis testing.
Association: two variables are associated with, or independent of, each other.
[Figure: example percentage splits for the two variables]
Chi-Square Test
It is a hypothesis test that compares the observed distribution of your data to an expected distribution of data.
Test of Association:
To determine whether one variable is associated with a different variable. For
example, determine whether the sales for different cellphones depends on the city
or country where they are sold.
Test of Independence:
To determine whether the observed value of one variable depends on the
observed value of a different variable. For example, determine whether the color
of the car that a person chooses is independent of the person’s gender.
The test is usually applied when there are two categorical variables from a single population.
Chi Square Test - Example
An example of Chi-Square test.
Null Hypothesis (observed frequencies fo: .55, .45)
Alternative Hypothesis (observed frequencies fo: .75, .25)
• There is an association between gender and purchase.
• The probability of a purchase over 500 dollars is different for females and males.
Types of Frequencies
Expected and observed frequencies are the two types of frequencies.
Observed Frequencies (fo): e.g., .75, .25

$$\chi^2 = \sum \frac{(f_e - f_o)^2}{f_e}$$
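A minimal sketch of the test with SciPy, assuming illustrative counts out of 100 shoppers:

```python
from scipy import stats

observed = [75, 25]   # observed frequencies (fo)
expected = [55, 45]   # expected frequencies (fe) under the null hypothesis
chi2, p = stats.chisquare(f_obs=observed, f_exp=expected)
print(chi2, p)        # a small p-value suggests an association
```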
The correlation coefficient measures the extent to which two variables tend to change together. It describes both the strength and the direction of the relationship.
[Figure: a 3 × 3 matrix of index pairs, a simple square matrix]
Correlation Matrix
A Correlation matrix is a square matrix that compares a large number of variables.
A correlation matrix that is calculated for the stock market will probably show the short-term,
medium-term, and long-term relationship between data variables.
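A minimal sketch with pandas (the stock returns are randomly generated for illustration):

```python
import numpy as np
import pandas as pd

# made-up daily returns for three stocks
returns = pd.DataFrame(np.random.randn(250, 3), columns=["AAA", "BBB", "CCC"])
print(returns.corr())   # square matrix of pairwise correlation coefficients
```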
Inferential Statistics
Inferential statistics uses a random sample from the data to make inferences about the population.
Inferential statistics can be used even if the data does not fully meet the criteria of random sampling:
• It can help determine the strength of the relationships within the
sample.
• If it is very difficult to obtain a population list and draw a random
sample, do the best you can with what you have.
Applications of Inferential Statistics
Inferential Statistics has its uses in almost every field such as business, medicine, data science, and so on.
Quiz
QUIZ 1
If a sample of five boxes weigh 90, 135, 160, 115, and 110 pounds, what will be the median weight of this sample?
a. 160
b. 115
c. 90
d. 135
QUIZ 2
Identify the parameters that characterize a bell curve. Select all that apply.
a. Variance
b. Mean
c. Standard deviation
d. Range
QUIZ 4
Identify the hypothesis decision rules. Select all that apply.
b. Is independent of p-value
d. Is independent of α
Why Anaconda
To use Python, we recommend that you download Anaconda. Following are some of the reasons why Anaconda is one of the best Data Science platforms:
• Interactive visualizations, governance, security, and operational support
Website URL:
https://www.continuum.io/downloads
Graphical Installer (Windows)
• Download the graphical installer.
• Double-click the .exe file to install Anaconda and
follow the instructions on the screen.
Click each tab to know how to install Python on those operating systems.
Installation of Anaconda Python Distribution (contd.)
You can install and run the Anaconda Python distribution on different platforms.
Website URL:
https://www.continuum.io/downloads
Graphical Installer (Mac OS)
• Download the graphical installer.
• Double-click the downloaded .pkg file and follow the instructions.
Click each tab to know how to install Python on those operating systems.
Installation of Anaconda Python Distribution (contd.)
You can install and run the Anaconda Python distribution on different platforms.
Website URL:
https://www.continuum.io/downloads
Python 2.7:
bash Anaconda2-4.0.0-Linux-x86_64.sh
Click each tab to know how to install Python on those operating systems.
Jupyter Notebook
Jupyter is an open source and interactive web-based Python interface for Data Science and scientific
computing. Some of its advantages are:
[Notebook screenshot: comment lines, a test string, assignment, accessing a variable without assignment, and multiple assignments]
Assignment and Reference
When a variable is assigned a value, it refers to the value’s memory location or address. It does not
equal the value itself.
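A minimal sketch of this reference behavior:

```python
a = [1, 2, 3]     # assignment: 'a' refers to the list's memory location
b = a             # 'b' references the same object; no copy is made
b.append(4)
print(a)          # [1, 2, 3, 4] - the change through b is visible via a
print(a is b)     # True - both names point to one object
```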
Basic Data Types: Numeric
Numeric types include Integer and Float values, stored as 32-bit or 64-bit numbers.
Basic Data Types: String
Python has extremely powerful and flexible built-in string processing capabilities.
Basic Data Types: Boolean
The Boolean type holds True or False values.
Type Casting
You can change the data type of a number using type casting, for example converting an integer to a float number.
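A minimal sketch of type casting:

```python
x = 7
y = float(x)    # int -> float: 7.0
z = int(3.9)    # float -> int truncates toward zero: 3
s = str(y)      # number -> string: '7.0'
print(y, z, s)
```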
Data Structure: Tuple
A tuple is a fixed-length, immutable sequence: you can create and view it, but trying to modify the tuple raises an error.
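A minimal sketch (the values are illustrative):

```python
t = (4, 11, 21)              # create a tuple
print(t)                     # view the tuple
try:
    t[0] = 9                 # try to modify the tuple
except TypeError as e:
    print("tuples are immutable:", e)
```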
Data Structures: List and Dictionary
You can view a list like any other sequence. A dictionary maps keys of any immutable type to values of any data type. You can create a dictionary, view the entire dictionary, view only its keys, or view only its values.
Data Structure: Access and Modify dict Elements
You can also access and modify individual elements in a dict.
You can update an existing entry or delete one.
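A minimal sketch (keys and values are illustrative):

```python
d = {"NYC": 8.4, "LA": 4.0}    # create a dictionary
print(d)                       # view the entire dictionary
print(d.keys(), d.values())    # view only keys / only values
d["NYC"] = 8.5                 # modify: update an entry
del d["LA"]                    # modify: delete an entry
print(d)
```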
Data Structure: Set
A set is an unordered collection of unique elements.
You can create a set, view its object type, and combine sets with operations such as union (the OR "|" operation).
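A minimal sketch:

```python
a = set([1, 2, 2, 3])   # create a set: duplicates are dropped
b = {3, 4, 5}           # literal syntax also creates a set
print(type(a))          # view the object type: <class 'set'>
print(a | b)            # OR - union set operation: {1, 2, 3, 4, 5}
```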
Basic Operator: “+”
The "+" operator concatenates sequences: adding two tuples, two lists, or two strings produces a new tuple, list, or concatenated string.
Basic Operator: “*”
The “multiplication” operator produces a new tuple, list, or string that “repeats” the original content.
The “*” operator does not actually multiply the values; it only repeats them the specified number of times.
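A minimal sketch of both operators:

```python
print((1, 2) + (3,))         # tuples: (1, 2, 3)
print([1, 2] + [3, 4])       # lists: [1, 2, 3, 4]
print("data" + " science")   # strings: 'data science'
print([0, 1] * 3)            # repeated, not multiplied: [0, 1, 0, 1, 0, 1]
print("ab" * 2)              # 'abab'
```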
Functions
Functions are the primary method of code organization and reuse in Python.
The syntax defines the function and its properties, including an optional return value; the function is then invoked with a call.
Functions: Returning Values
You can use a function to return a single value or multiple values.
You create the function once and call it; a single return statement can hand back one value or multiple values.
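A minimal sketch (the function name is illustrative):

```python
def min_max(values):
    # create function: return multiple values as a tuple
    return min(values), max(values)

lo, hi = min_max([4, 11, 21, 36])   # call function and unpack the returns
print(lo, hi)                       # 4 36
```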
Built-in Sequence Functions
The built-in sequence functions of Python are as follows:
enumerate
Indexes data to keep track of indices and corresponding data mapping
sorted
Returns the new sorted list for the given sequence
reversed
Iterates the data in reverse order
zip
Creates lists of tuples by pairing up elements of lists, tuples, or other sequences
Built-in Sequence Functions: enumerate
Example: take a list of food stores and create a data element and index map using dict and enumerate; sorted works on numbers as well as on a string value.
Built-in Sequence Functions: reversed and zip
Let us see how to use reversed and zip functions
Example: create a list of numbers for range 15, reverse it with reversed, and view the resulting object type.
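A minimal sketch of all four functions (the store names are illustrative):

```python
stores = ["deli", "bakery", "cafe"]             # list of food stores
mapping = {s: i for i, s in enumerate(stores)}  # element -> index map using dict
print(mapping)
print(sorted([3, 1, 2]))          # sort numbers: [1, 2, 3]
print(sorted("python"))           # sort a string value: its characters
nums = list(range(15))            # create a list of numbers for range 15
print(list(reversed(nums)))       # iterate the data in reverse order
print(list(zip(stores, nums)))    # pair up elements into tuples
```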
Control Flow: if, elif, else
The “if”, “elif,” and “else” statements are the most commonly used control flow statements.
The if condition runs its block when true, the else block otherwise; a while condition repeats its block until it becomes false.
Control Flow: Exception Handling
Handling Python errors or exceptions gracefully is an important part of building robust programs and
algorithms.
Create a function that wraps the risky operation and catches the error.
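A minimal sketch combining both ideas:

```python
def safe_div(a, b):
    try:
        return a / b
    except ZeroDivisionError:     # handle the error gracefully
        return float("nan")

x = 7
if x > 10:
    print("big")
elif x > 5:
    print("medium")
else:
    print("small")

print(safe_div(1, 0))             # nan instead of a crash
```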
a. Int
b. Float
c. String
QUIZ 2
Which of the data structures can be modified? Select all that apply.
a. tuple
b. list
c. dict
d. set
Explanation: Only a tuple is immutable and cannot be modified. All the other data structures can be modified.
QUIZ 3
What will be the output of the following code?
a. ['NYC', 'Madrid']
b. ['London', 'Madrid']
c. ['Miami', 'Madrid']
d. ['Miami', 'Paris']
QUIZ 4
Which of the following data structures is preferred to contain a unique collection of values?
a. dict
b. list
c. set
d. tuple
Download the Python 2.7 version of Anaconda and install the Jupyter notebook.
When you assign values to variables, you create references and not duplicates.
Integers, floats, strings, None, and Boolean are some of the data types supported by
Python.
Tuples, lists, dicts, and sets are some of the data structures of Python.
The “in”, “+”, and “*” are some of the basic operators.
Functions are the primary and the most important methods of code
organization and reuse in Python.
Why NumPy
Numerical Python (NumPy) supports multidimensional arrays over which you can easily apply mathematical
operations.
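A minimal sketch (the array values are illustrative):

```python
import numpy as np

a = np.array([[1, 2, 1],
              [0, 1, 2]])   # a two-dimensional ndarray
print(a * 10)               # the operation applies to every element at once
print(a + a)                # element-wise addition
```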
NumPy Overview
NumPy is the foundational package for mathematical computing in Python.
It has the following properties:
[Diagram: from question/problem to writing a program, algorithms, and sharing data, with ndarray examples: [1, 2, 1]; [[1, 0, 0], [0, 1, 2]]; [[2, 8, 0, 6], [4, 5, 1, 1], [8, 9, 3, 6]]]
Knowledge Check—Sequence it Right!
The code here is buggy. You have to correct its sequence to debug it.
Types of Arrays
Arrays can be one-dimensional, two-dimensional, three-dimensional, or multi-dimensional.
• 1 axis, rank 1, length 3: array([5, 7, 9])
• 2 axes, rank 2: array([[0, 1, 2], [5, 6, 7]])
• 3 axes, rank 3: array([[[0, 1, 2], [3, 4, 5], [6, 7, 8]], [[9, 10, 11], [12, 13, 14], [15, 16, 17]]])
KNOWLEDGE CHECK
How many elements will the following code print?
print(np.linspace(4,13,7))
a. 4
b. 7
c. 11
d. 13
Explanation: In the “linspace” function, “4” is the starting element and “13” is the end element. The last number “7” specifies that
a total of seven equally spaced elements should be created between “4” and “13,” both numbers inclusive. In this case, the
“linspace” function creates the following array: [ 4. 5.5 7. 8.5 10. 11.5 13. ]
Class and Attributes of ndarray
NumPy’s array class is “ndarray,” also referred to as “numpy.ndarray.” The attributes of ndarray are:

ndarray.ndim: The number of axes (dimensions) of the array. It is also called the rank of the array; a 2D array has two axes and a 3D array has three.
ndarray.shape: A tuple of integers showing the size of the array in each dimension. The length of the shape tuple is the rank, or ndim.
ndarray.size: The total number of elements in the array. It is equal to the product of the elements of the shape tuple.
ndarray.dtype: An object that describes the type of the elements in the array. It can be created or specified using Python. For example, two string arrays whose longest string is “Houston” (length 7) both get a 7-character string dtype.
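A minimal sketch of the four attributes (the city names are illustrative):

```python
import numpy as np

cities = np.array([["Houston", "Dallas"],
                   ["Austin", "El Paso"]])
print(cities.ndim)    # 2 - number of axes (rank)
print(cities.shape)   # (2, 2) - size along each dimension
print(cities.size)    # 4 - product of the shape tuple
print(cities.dtype)   # <U7 - string dtype sized by the longest string
```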
Basic Operations
Using the following operands, you can easily apply various mathematical, logical, and comparison operations
on an array.
Example, vector addition: first trial [10, 15, 17, 26] plus second trial [12, 11, 21, 24] gives total distance [22, 26, 38, 50]; elements at the same index (0–3) are added together.
Accessing Array Elements: Indexing
You can access an entire row of an array by referencing its axis index.
The shape of the array gives its dimensions, for example rows [10, 15, 17, 26], [12, 11, 21, 24], [5, 8, 10, 21] with 4 columns each.
Boolean indexing example, test scores for Tests 1–4: Student 1: [83, 71, 57, 63]; Student 2: [54, 68, 81, 45]. The condition "test score > 60" selects only the passing scores.
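A minimal sketch using the test scores above:

```python
import numpy as np

scores = np.array([[83, 71, 57, 63],
                   [54, 68, 81, 45]])
print(scores.shape)          # (2, 4): 2 rows, 4 columns
print(scores[0])             # index axis 0: the entire first row
print(scores[scores > 60])   # boolean indexing: only scores above 60
```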
Simple Assignment, View, and Copy
• Simple assignment: a variable is directly assigned the value of another variable; no new copy is made, so both names refer to the original dataset.
• View (shallow copy): a new array object that looks at the same underlying data.
• Copy: also called a "deep copy" because it entirely copies the original dataset. Any change in the copy will not affect the original dataset, and the copy and the original object are different.
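A minimal sketch contrasting the three:

```python
import numpy as np

a = np.arange(5)   # original dataset
b = a              # simple assignment: same object, no new copy
v = a.view()       # view/shallow copy: new object, shared data
c = a.copy()       # deep copy: fully independent data
a[0] = 99
print(b[0], v[0], c[0])   # 99 99 0 - only the copy is unaffected
print(c is a, v is a)     # False False - "copy" and original differ
```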
• sqrt: provides the square root of every element in the array.
• cos: gives cosine values for all elements in the array.
• floor: returns the largest integer less than or equal to each element in the array.
• exp: performs exponentiation on each element.

ufunc—Examples
Let’s look at some common ufunc examples, such as importing pi and applying trigonometric functions.
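A minimal sketch (the input values are illustrative):

```python
import numpy as np
from numpy import pi

a = np.array([0.25, 1.0, 2.25]) * pi   # import pi, build sample angles
print(np.sqrt(a))    # square root of every element
print(np.cos(a))     # cosine values for all elements
print(np.floor(a))   # largest integer <= each element
print(np.exp(a))     # exponentiation on each element
```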
Shape Manipulation
Arrays can be reshaped with functions that flatten, split, resize, stack, and reshape them.
Shape Manipulation—Example
You can use certain functions to manipulate the shape of an array to do the following:
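A minimal sketch of these manipulations:

```python
import numpy as np

a = np.arange(6).reshape(2, 3)   # reshape a flat range into 2x3
print(a.ravel())                 # flatten back to one dimension
print(np.hsplit(a, 3))           # split into three column arrays
print(np.vstack([a, a]))         # stack two arrays vertically
print(np.resize(a, (3, 2)))      # resize returns a new array
```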
Example:
• Element-wise multiplication: array_a = [2, 3, 5, 8] multiplied by array_b = [0.3, 0.3, 0.3, 0.3].
• Broadcasting: array_a = [2, 3, 5, 8] multiplied by scalar_c = 0.3; the scalar is stretched across the array and produces the same result.
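A minimal sketch of both cases:

```python
import numpy as np

array_a = np.array([2, 3, 5, 8])
array_b = np.array([0.3, 0.3, 0.3, 0.3])
print(array_a * array_b)   # element-wise multiplication
print(array_a * 0.3)       # broadcasting a scalar gives the same result
```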
Broadcasting—Constraints
Though broadcasting can help carry out mathematical operations between different-shaped arrays, it is subject to certain constraints: the operation remains element-wise, so the arrays' shapes must be compatible (each dimension must match or be 1), as in scaling an hourly-wage array.
transpose() interchanges the axes: the 2 × 4 array [[83, 71, 57, 63], [54, 68, 81, 45]] becomes a 4 × 2 array, with axis 0 and axis 1 swapped.
Linear Algebra—Inverse and Trace Functions
Using NumPy, you can also find the inverse of an array and add its diagonal data elements.
np.linalg.inv()
np.trace()
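A minimal sketch (the matrix is illustrative):

```python
import numpy as np

m = np.array([[2.0, 1.0],
              [1.0, 3.0]])
print(np.linalg.inv(m))   # inverse of the array
print(np.trace(m))        # sum of the diagonal elements: 5.0
```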
Problem Instructions
Common instructions:
• If you are new to Python, download the “Anaconda Installation Instructions” document from the
“Resources” tab to view the steps for installing Anaconda and the Jupyter notebook.
• Download the “Assignment 02” notebook and upload it on the Jupyter notebook to access it.
• Follow the cues provided to complete the assignment.
Assignment
Problem Instructions
Common instructions:
• If you are new to Python, download the “Anaconda Installation Instructions” document from the
“Resources” tab to view the steps for installing Anaconda and the Jupyter notebook.
• Download the “Assignment 01” notebook and upload it on the Jupyter notebook to access it.
• Follow the cues provided to complete the assignment.
a. ravel()
b. reshape()
Explanation: The function ravel() is used to convert a multidimensional array into a one-dimensional array. Though reshape()
also functions in a similar way, it creates a new array instead of transforming the input array.
QUIZ 3
The np.trace() method gives the sum of _____.
Explanation: The trace() function is used to find the sum of the diagonal elements in an array. It is carried out in an incremental
order of the indices. Therefore, it can only add diagonal values from left to right and not vice versa.
QUIZ 4
The function np.transpose() when applied on a one-dimensional array gives _____.
a. a reverse array
c. an inverse array
Explanation: Transposing a one-dimensional array does not change it in any way. It returns an unchanged view of the original
array.
QUIZ 5
Which statement will slice the highlighted data (the elements 21, 32, and 53) from [11, 14, 21, 32, 53, 64]?
a. [3 : 5]
b. [3 : 6]
c. [2 : 5]
d. [2 : 4]
Explanation: Let’s assume that the index of the first element is m and that of the second element is n. Then, you need to use the statement “[m : n + 1]” to slice the required dataset. In this case, the index of the element “21” is “2” and that of “53” is “4.” So, the correct statement to use would be [2 : 5].
Key Takeaways
NumPy is a very powerful Python library for mathematical and scientific computing.
You can create and print NumPy arrays using different methods.
NumPy uses basic operations, data access techniques, and copy and view
techniques for data wrangling.
[Diagram: scientific domains involve statistics, platform integration, and mathematical equations]
SciPy: The Solution
SciPy has built-in packages that help in handling the scientific domains.
• Mathematics: integration, linear algebra, and constants
• Statistics (normal distribution)
• Multidimensional image processing
• Language integration
SciPy and its Characteristics
Characteristics of SciPy are as follows:
• Simplifies scientific application development
• Efficient and fast data processing
cluster: Clustering algorithms
constants: Physical and mathematical constants
fftpack: Fast Fourier Transform routines
integrate: Integration and ordinary differential equation solvers
interpolate: Interpolation and smoothing splines
io: Input and Output
linalg: Linear algebra
ndimage: N-dimensional image processing
odr: Orthogonal distance regression
optimize: Optimization and root-finding routines
signal: Signal processing
sparse: Sparse matrices and associated routines
spatial: Spatial data structures and algorithms
special: Special functions
stats: Statistical distributions and functions
weave: C/C++ integration
SciPy Sub-package: Integration
SciPy provides integration techniques that solve mathematical sequences and series,
or perform function approximation.
• integrate.quad(f, a, b): single integration; pass the function and the limits as arguments
• integrate.dblquad(): double integration
• integrate.tplquad(): triple integration
• integrate.nquad(): n-fold integration
SciPy Sub-package: Integration
Multiple integration can be performed by passing lambda built-in functions for the inner limits.
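A minimal sketch of single and double integration (the integrands are illustrative):

```python
import numpy as np
from scipy import integrate

# single integral of sin(x) from 0 to pi (exact answer: 2)
result, error = integrate.quad(np.sin, 0, np.pi)
print(result)

# double integral of x*y over 0 <= x <= 1, 0 <= y <= 2, using lambdas (exact: 1)
area, err = integrate.dblquad(lambda y, x: x * y, 0, 1, lambda x: 0, lambda x: 2)
print(area)
```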
SciPy Sub-package: Optimization
Specify the starting point or the lower limit in a given range, then perform the optimize minimize function using the BFGS method and options.
Explanation: Both the upper and lower limit values should be specified for the optimize.curve_fit function.
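A minimal sketch of minimization with the BFGS method (the objective function is illustrative):

```python
from scipy import optimize

def f(x):
    return (x[0] - 3) ** 2 + 1   # a simple quadratic with its minimum at x = 3

res = optimize.minimize(f, x0=[0.0], method="BFGS")
print(res.x)                     # approximately [3.]
```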
SciPy Sub-package: Linear Algebra
SciPy provides rapid linear algebra capabilities and contains advanced algebraic
functions.
Click each tab to know more.
Inverse of matrix | Determinant | Linear systems | Singular value decomposition (SVD)
This function is used to compute the inverse of the given matrix. Let’s take a look at the inverse matrix
operation.
With this function you can compute the value of the determinant for the given matrix.
2x + 3y + z = 21
-x + 5y + 4z = 9
3x + 2y + 9z = 6
Use the solve method.
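A minimal sketch solving this system:

```python
import numpy as np
from scipy import linalg

A = np.array([[2, 3, 1],
              [-1, 5, 4],
              [3, 2, 9]])
b = np.array([21, 9, 6])
print(linalg.solve(A, b))   # the values of x, y, and z
print(linalg.det(A))        # determinant of the matrix
print(linalg.inv(A))        # inverse of the matrix
```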
SciPy Sub-package: Linear Algebra
SciPy provides very rapid linear algebra capabilities and contains advanced algebraic functions.
Click each tab to know more.
Import linalg and define the matrix.
KNOWLEDGE CHECK
Which of the following is used for inverting a matrix?
a. SciPy.special
b. SciPy.linalg
c. SciPy.signal
d. SciPy.stats
SciPy provides a very rich set of statistical functions.
Frequency table example:

Age Range | Frequency | Cumulative Frequency
0-10 | 19 | 19
10-20 | 55 | 74
21-30 | 23 | 97
31-40 | 36 | 133
41-50 | 10 | 143
51-60 | 17 | 160

The cumulative frequency gives the total number of persons up to and within each age range.
For a normal distribution, 68% of the data lies within one standard deviation of the mean, 95% within two, and 99.7% within three.
The Cumulative Distribution Function is F(x) = P(X ≤ x), accumulated from negative infinity.
SciPy Sub-package: Statistics
Probability Density Function, or PDF, of a continuous random variable is the derivative of its
Cumulative Distribution Function, or CDF.
Derivative of CDF
SciPy Sub-package: Statistics
loc and scale are used to adjust the location and scale of the data distribution.
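A minimal sketch with the normal distribution:

```python
from scipy import stats

print(stats.norm.cdf(0))                  # F(0) = P(X <= 0) = 0.5 for the standard normal
print(stats.norm.pdf(0))                  # PDF, the derivative of the CDF: ~0.3989
print(stats.norm.cdf(0, loc=1, scale=2))  # loc and scale shift/stretch the distribution
```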
SciPy Sub-package: Weave
The weave package provides ways to modify and extend any supported extension libraries.
The IO package provides a set of functions to deal with several kinds of file formats.
Problem Instructions
Common instructions:
• If you are new to Python, download the “Anaconda Installation Instructions”
document from the “Resources” tab to view the steps for installing Anaconda and the
Jupyter notebook.
• Download the “Assignment 01” notebook and upload it on the Jupyter notebook to
access it.
• Follow the cues provided to complete the assignment.
Quiz
Assignment 02
Problem Instructions
Use SciPy to generate 20 random values and perform the following:
1. CDF – Cumulative Distribution Function for 10 random variables.
2. PDF – Probability Density Function for 14 random variables.
Assignment 02
Problem Instructions
Common instructions:
• If you are new to Python, download the “Anaconda Installation Instructions”
document from the “Resources” tab to view the steps for installing Anaconda and the
Jupyter notebook.
• Download the “Assignment 02” notebook and upload it on the Jupyter notebook to
access it.
• Follow the cues provided to complete the assignment.
Quiz
QUIZ 1
Which of the following is performed using SciPy?
a. Website
b. Plot data
c. Scientific calculations
d. System administration
QUIZ 2
Which of the following functions is used to calculate minima?
a. optimize.minimize()
b. integrate.quad()
c. stats.linregress()
d. linalg.solve()
QUIZ 3
Which of the following syntaxes is used to generate 100 random variables from a t-distribution with df = 10?
a. stats.t.pmf(df=10, size=100)
b. stats.t.pdf(df=10, size=100)
c. stats.t.rvs(df=10, size=100)
d. stats.t.rand(df=10, size=100)
QUIZ 4
Which of the following functions is used to run C or C++ codes in SciPy?
a. io.loadmat()
b. weave.inline()
c. weave.blitz()
d. io.whosmat()
SciPy has multiple sub-packages, which prove useful for different scientific computing domains.
Integration can be used to solve mathematical sequences and series or
perform function approximation.
Why Pandas
Pandas builds on NumPy and offers intrinsic data alignment, powerful data structures, functions for data operations, and data handling for the major use cases.
The various features of Pandas makes it an efficient library for Data Scientists.
• Powerful data structures
• Intelligent and automated data alignment
• Easy data aggregation and transformation
Series is a one-dimensional array-like object containing data and labels (or index).
[Diagram: data values 4, 11, 21, 36 with labels (index) 0–3]
Data alignment is intrinsic and will not be broken until changed explicitly by the program.
Series
Data input: ndarray, dict, scalar, list
Data types: integer, string, floating point, Python object
[Diagram: data values 2, 3, 8, 4 with labels (index) 0–3]
How to Create a Series
Basic method: s = pd.Series(data, index=index)
Create Series from List
When a series is created from a list, you do not have to create an index; data alignment is done automatically.

Create Series from ndarray
An ndarray (for example, of country names) can be passed as the data, with an optional index; the data type is inferred.

Create Series from dict
A series can also be created with dict data input for faster operations, for example a dict of countries and their GDP; the dict keys become the index.

Create Series from Scalar
A scalar input is broadcast across every label of the given index.
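A minimal sketch of all four creation methods (the values are illustrative):

```python
import numpy as np
import pandas as pd

s_list = pd.Series([4, 11, 21, 36])                # from a list; index auto-created
s_arr = pd.Series(np.array([2, 3, 8, 4]),
                  index=["a", "b", "c", "d"])      # from an ndarray
gdp = pd.Series({"US": 19.4, "India": 2.7})        # from a dict: keys become the index
s_scalar = pd.Series(5.0, index=["x", "y", "z"])   # a scalar broadcast over the index
print(gdp["India"])
```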
Accessing Elements in Series
Data can be accessed through functions like loc and iloc by passing the data element's position or an index range.
Vectorized Operations in Series
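A minimal sketch of element access and vectorized arithmetic:

```python
import pandas as pd

s = pd.Series([4, 11, 21, 36], index=["a", "b", "c", "d"])
print(s.loc["b"])    # access by index label
print(s.iloc[2])     # access by element position
print(s + s)         # vectorized addition, aligned on the index
print(s * 2)         # vectorized scaling
```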
a. Created automatically
b. Needs to be assigned
d. Index is not applicable as a series is one-dimensional
Explanation: Data alignment is intrinsic in Pandas data structure and happens automatically. One can also assign index to data
elements.
KNOWLEDGE
CHECK
What will the result be in vector addition if a label is not found in a series?
a. Marked as zeros for missing labels
d. Will throw an exception: index not found
Explanation: The result will be marked as NaN (Not a Number) for missing labels.
DataFrame
DataFrame is a two-dimensional labeled data structure with columns of potentially different types.
Data input: ndarray, dict, scalar, list
Data types: integer, string, floating point, Python object
[Diagram: a 2 × 4 grid of values with column labels (index) 0–3]
Create DataFrame from Lists
Let’s see how you can create a DataFrame from lists:
An entire dict can be passed as the data.
View DataFrame
You can view a DataFrame by referring to a column name or with the describe function.
Create DataFrame from dict of Series
Create DataFrame from ndarray
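A minimal sketch of these creation and viewing methods (the values are illustrative):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"city": ["NYC", "Miami"],
                   "pop": [8.4, 0.45]})   # from a dict of lists
print(df["city"])                         # view one column by name
print(df.describe())                      # summary statistics

ser = {"a": pd.Series([1, 2]), "b": pd.Series([3, 4])}
print(pd.DataFrame(ser))                  # from a dict of Series

print(pd.DataFrame(np.arange(6).reshape(3, 2),
                   columns=["x", "y"]))   # from an ndarray
```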
Explanation: This is the DataFrame slicing technique, with indexing or selection on data elements. When a user passes the range 3:9, the rows in that range are sliced and displayed as output.
KNOWLEDGE
CHECK
What does the fillna() method do?
Explanation: fillna is one of the basic methods to fill NaN values in a dataset with a desired value, which is passed in the parentheses.
File Read and Write Support
read_csv / to_csv
read_excel / to_excel
read_hdf / to_hdf
read_html / to_html
read_json / to_json
read_pickle / to_pickle
read_sql / to_sql
read_stata / to_stata
read_sas / to_sas
read_clipboard / to_clipboard
Pandas SQL operation
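A minimal sketch of the read/write API, including a SQL round trip (the file name and in-memory database are illustrative):

```python
import sqlite3
import pandas as pd

df = pd.DataFrame({"borough": ["Manhattan", "Bronx"], "count": [10, 7]})
df.to_csv("facilities.csv", index=False)              # write a CSV file
print(pd.read_csv("facilities.csv"))                  # and read it back

conn = sqlite3.connect(":memory:")                    # toy SQL database
df.to_sql("facilities", conn, index=False)            # Pandas SQL write
print(pd.read_sql("SELECT * FROM facilities", conn))  # Pandas SQL read
```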
Activity—Sequence it Right!
The code here is buggy. You have to correct its sequence to debug it. To do that, click any two code
snippets, which you feel are out of place, to swap their places.
Problem Instructions
Analyze the Federal Aviation Authority (FAA) dataset using Pandas to do the following:
1. View
a. Aircraft make name
b. State name
c. Aircraft model name
d. Text information
e. Flight phase
f. Event description type
g. Fatal flag
2. Clean the dataset and replace the fatal flag NaN with “No”
3. Find the aircraft types and their occurrences in the dataset
4. Remove all the observations where aircraft names are not available
5. Display the observations where fatal flag is “Yes”
Assignment 01
Problem Instructions
Common instructions:
• If you are new to Python, download the “Anaconda Installation Instructions” document from the
“Resources” tab to view the steps for installing Anaconda and the Jupyter notebook.
• Download the “Assignment 01” notebook and upload it on the Jupyter notebook to access it.
• Follow the cues provided to complete the assignment.
Assignment
Assignment 02
Problem Instructions
A dataset in CSV format is given for the Fire Department of New York City. Analyze the dataset to
determine:
1. The total number of fire department facilities in New York city
2. The number of fire department facilities in each borough
3. The facility names in Manhattan
Assignment 02
Problem Instructions
Common instructions:
• If you are new to Python, download the “Anaconda Installation Instructions” document from the
“Resources” tab to view the steps for installing Anaconda and the Jupyter notebook.
• Download the “Assignment 02” notebook and upload it on the Jupyter notebook to access it.
• Follow the cues provided to complete the assignment.
Quiz
QUIZ 1
Which of the following data structures is used to store three-dimensional data?
a. Series
b. DataFrame
c. Panel
d. PanelND
QUIZ 2
Which method is used for label-location indexing by label?
a. iat
b. iloc
c. loc
d. std
Explanation: The loc method is used for label-location indexing by label; iat is strictly integer location, and iloc is integer-location-based indexing by position.
QUIZ 3
While viewing a DataFrame, the head() method will _____.
Explanation: The default value is 5 if nothing is passed in head method. So it will return the first five rows
of the DataFrame.
Key Takeaways
Machine learning
Purpose of Machine Learning
Machine learning is a great tool to analyze data, find hidden data patterns and relationships, and
extract information to enable information-driven decisions and provide insights.
[Diagram: data feeds machine learning, yielding insights into unknown data and information-driven decisions]
Machine Learning Terminology
These are some machine learning terminologies that you will come across in this lesson:
• Features: also called inputs or attributes
• Response: also called label, outcome, or target
• Observations: also called records, examples, or samples
Machine Learning Approach
The machine learning approach starts with either a problem that you need to solve or a given
dataset that you need to analyze.
1. Understand the problem/dataset
2. Extract the features from the dataset
3. Identify the problem type
4. Choose the right model
5. Train and test the model
6. Strive for accuracy
Steps 1 and 2: Understand the Dataset and Extract its Features
Let us look at a dataset and understand its features in terms of machine learning.
Features (attributes, also called predictors): Education (Yrs.), Professional Training (Yes/No)
Response (label): Hourly Rate (USD)
Observations (records):

Education (Yrs.) | Professional Training | Hourly Rate (USD)
16 | 1 | 90
15 | 0 | 65
12 | 1 | 70
18 | 1 | 130
16 | 0 | 110
16 | 1 | 100
15 | 1 | 105
31 | 0 | 70
Steps 3 and 4: Identify the Problem Type and Learning Model
Machine learning can either be supervised or unsupervised. The problem type should be selected
based on the type of learning model.
Supervised learning:
• The dataset used to train a model should have observations, features, and responses. The model is trained to predict the “right” response for a given set of data points.
• Supervised learning models are used to predict an outcome.
• The goal of this model is to “generalize” a dataset so that the “general rule” can be applied to new data as well.

Unsupervised learning:
• The response or the outcome of the data is not known.
• Unsupervised learning models are used to identify and visualize patterns in data by grouping similar types of data.
• The goal of this model is to “represent” data in a way that meaningful information can be extracted.
Steps 3 and 4: Identify the Problem Type and Learning Model (contd.)
Data can either be continuous or categorical. Based on whether it is supervised or unsupervised
learning, the problem type will differ.
Supervised example: categorizing news stories based on their topics. Unsupervised example: grouping similar stories from different news networks.
How it Works—Supervised Learning Model
In supervised learning, a known dataset with observations, features, and response is used to create and train
a machine learning algorithm. A predictive model, built on top of this algorithm, is then used to predict the
response for a new dataset that has the same features.
[Diagram: known data (observations, features, response) trains a machine learning algorithm; the resulting predictive model takes the features of new or unseen data and outputs the predicted response/label]
How it Works—Unsupervised Learning Model
In unsupervised learning, a known dataset has a set of observations with features. But the response is not
known. The predictive model uses these features to identify how to classify and represent the data points of
new or unseen data.
[Diagram: known data (observations, features, no response) feeds a machine learning algorithm; the predictive model produces a data representation for new or unseen data]
Steps 5 and 6: Train, Test, and Optimize the Model
To train supervised learning models, data analysts usually divide a known dataset into training and
testing sets.
[Diagram: known data (observations, features, response) is split into a training set (60%–80%) and a testing set (20%–40%) before being fed to the machine learning algorithm]
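A minimal sketch of such a split with scikit-learn (the feature values follow the earlier hourly-rate example):

```python
import numpy as np
from sklearn.model_selection import train_test_split

X = np.array([[16, 1], [15, 0], [12, 1], [18, 1],
              [16, 0], [16, 1], [15, 1], [15, 0]])   # features
y = np.array([90, 65, 70, 130, 110, 100, 105, 70])   # response
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0)            # 75% train / 25% test
print(X_train.shape, X_test.shape)
```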
Steps 5 and 6: Train, Test, and Optimize the Model (contd.)
Let us look at an example to see how the split approach works.
Model Training (training rows, with observation IDs):

Observation | Education (Yrs.) | Training | Hourly Rate (response)
45 | 18 | 1 | 130
54 | 16 | 0 | 110
67 | 16 | 1 | 100
71 | 15 | 1 | 105
31 | 15 | 0 | 70
Supervised Learning Model Considerations
Some considerations of supervised and unsupervised learning models are shown here.
[Diagram: considerations include features and response, model accuracy, generalization, and performance optimization]
Knowledge Check
KNOWLEDGE CHECK
In machine learning, which one of the following is an observation?
a. Features
b. Attributes
c. Records
d. Labels
Explanation: The regression algorithm belonging to the supervised learning model is best suited to analyze continuous data.
KNOWLEDGE CHECK
Identify the goal of unsupervised learning. Select all that apply.
Explanation: The goal of unsupervised learning is to understand the structure of the data and represent it. There is no right or
certain answer in unsupervised learning.
Scikit-Learn
Scikit-learn is a powerful and modern machine learning Python library for fully and semi-automated data analysis and information extraction.
Since features and response would be in the form of arrays, they would have shapes and
sizes.
KNOWLEDGE CHECK
The estimator instance in Scikit-learn is a _____.
a. model
b. feature
c. dataset
d. response
Supervised Learning Models: Linear Regression (contd.)
Linear regression is the most basic technique to predict the value of an attribute.

$y = \beta_0 + \beta_1 x + u$

Here y is the response, x is the predictor variable, $\beta_0$ is the intercept (the value of y at x = 0), $\beta_1 = dy/dx$ is the slope or gradient (the coefficient of x), and u is the residual, the difference between the actual and predicted values. The least square line minimizes these residuals.

Note: The attributes are usually fitted using the “least square” approach.
Supervised Learning Models: Linear Regression (contd.)
The smaller the value of SSR or SSE, the more accurate the prediction, which makes the model the best fit.

$\hat{y} = \beta_0 + \beta_1 x$ (the least square line)
$SSR = \sum (\hat{y}_i - \bar{y})^2$ (regression sum of squares)
$SSE = \sum (y_i - \hat{y}_i)^2$ (error sum of squares)

Note: The attributes are usually fitted using the “least square” approach.
Supervised Learning Models: Linear Regression (contd.)
Let us see how linear regression works in Scikit-Learn.
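A minimal sketch (the data points echo the hourly-rate example and are illustrative):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

X = np.array([[12], [15], [16], [18]])   # education years (predictor)
y = np.array([70, 65, 90, 130])          # hourly rate (response)
model = LinearRegression().fit(X, y)     # least-squares fit
print(model.intercept_, model.coef_)     # beta0 (intercept) and beta1 (slope)
print(model.predict([[16]]))             # predicted response for a new point
```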
Supervised Learning Models: Logistic Regression

$\pi = \Pr(y = 1 \mid x) = \dfrac{e^{\beta_0 + \beta_1 x}}{1 + e^{\beta_0 + \beta_1 x}}$ (the probability of y = 1, given x)

$\text{Odds} = \dfrac{\pi}{1 - \pi}$

$\log\left(\dfrac{\pi}{1 - \pi}\right) = \log\left(e^{\beta_0 + \beta_1 x}\right) = \beta_0 + \beta_1 x$

The logarithm of the odds is a linear regression in x; $\beta_1$ is the change in the log-odds for a unit change in x.
Supervised Learning Models: Logistic Regression (contd.)
Logistic regression is a generalization of the linear regression model used for classification problems.
Key parameters of the class sklearn.linear_model.LogisticRegression:
• penalty: specifies the norm used in penalization (some solvers are implemented only for the L2 penalty)
• C: inverse of regularization strength
• fit_intercept: calculates the intercept
Supervised Learning Models: K-Nearest Neighbors (K-NN)
[Diagram: K-NN classification with K = 3 and K = 6 neighborhoods]
If you are using this method for binary classification, choose an odd number for k to avoid the case of a "tied" distance between two classes.
Demo 03—K-NN and Logistic Regression Models
Demonstrate the use of K-NN and logistic regression models
Unsupervised Learning Models: Clustering
A cluster is a group of similar data points.
It is used:
• To extract the structure of the data
• To identify groups in the data

K-means clustering alternates two steps:
• Assign: find the number of clusters and assign the mean for each cluster's data points.
• Optimize: iterate and optimize the mean of each cluster for its respective data points.
Unsupervised Learning Models: K-means Clustering (contd.)
Let us see how the k-means algorithm works in Scikit-Learn.
sklearn.cluster.KMeans(n_clusters=8, init='k-means++', n_init=10, max_iter=300, tol=0.0001, precompute_distances='auto', verbose=0, random_state=None, copy_x=True, n_jobs=1)

• n_clusters: number of clusters to form and number of centroids to generate
• init: selects the initial cluster centers
• n_init: number of times the K-means algorithm will be run with different centroid seeds
• max_iter: maximum number of iterations of the K-means algorithm for a single run
• precompute_distances: pre-compute distances for faster operation
• copy_x: if true, does not modify the data while pre-computing
• n_jobs: number of jobs to run in parallel
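A minimal sketch of the class in action (the points are illustrative):

```python
import numpy as np
from sklearn.cluster import KMeans

X = np.array([[1, 2], [1, 4], [1, 0],
              [10, 2], [10, 4], [10, 0]])
km = KMeans(n_clusters=2, random_state=0).fit(X)
print(km.labels_)            # cluster assignment for each data point
print(km.cluster_centers_)   # the optimized cluster means
```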
Demo 04—K-means Clustering
Demonstrate how to use k-means clustering to classify data points
Unsupervised Learning Models: Dimensionality Reduction
It reduces a high-dimensional dataset into a dataset with fewer dimensions. This makes it easier and
faster for the algorithm to analyze the data.
Unsupervised Learning Models: Dimensionality Reduction (contd.)
These are some techniques used for dimensionality reduction:
sklearn.decomposition.PCA, with the key parameter n_components: the number of components to keep.
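A minimal sketch of PCA reducing five dimensions to two (the data is random):

```python
import numpy as np
from sklearn.decomposition import PCA

X = np.random.rand(20, 5)    # 20 samples, 5 dimensions
pca = PCA(n_components=2)    # number of components to keep
X2 = pca.fit_transform(X)    # transform applies the reduction
print(X2.shape)              # (20, 2)
```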
Model evaluation metrics:
• Classification: metrics.accuracy_score, metrics.average_precision_score
• Clustering: metrics.adjusted_rand_score
• Regression: metrics.mean_absolute_error, metrics.mean_squared_error, metrics.median_absolute_error
Knowledge Check
KNOWLEDGE
CHECK What is the best way to train a model?
b. Split the known dataset into separate training and testing sets
Explanation: The best way to train a model is to split the known dataset into training and testing sets. The testing set varies
from 20% to 40%.
Quiz
Assignment 01
Problem Instructions
The given dataset contains ad budgets for different media channels and the corresponding ad sales of
XYZ firm. Evaluate the dataset to:
• Find the features or media channels used by the firm
• Find the sales figures for each channel
• Create a model to predict the sales outcome
• Split as training and testing datasets for the model
• Calculate the Mean Square Error (MSE)
Assignment 01
Problem Instructions
Common instructions:
• If you are new to Python, download the “Anaconda Installation Instructions” document from the
“Resources” tab to view the steps for installing Anaconda and the Jupyter notebook.
• Download the “Assignment 01” notebook and upload it on the Jupyter notebook to access it.
• Follow the provided cues to complete the assignment.
Quiz
Assignment 02
Problem Instructions
The given dataset lists the glucose level readings of several pregnant women taken either during a
survey examination or routine medical care. It specifies if the 2 hour post-load plasma glucose was at
least 200 mg/dl. Analyze the dataset to:
1. Find the features of the dataset,
2. Find the response label of the dataset,
3. Create a model to predict the diabetes outcome,
4. Use training and testing datasets to train the model, and
5. Check the accuracy of the model.
Assignment 02
Problem Instructions
QUIZ 1
Which of the following is true with a greater value of SSR or SSE? Select all that apply.
a. The prediction will be more accurate, making it the best fit model.
d. The model will not be the best fit for the attributes.
Explanation: With higher SSR or SSE, the prediction will be less accurate and the model will not be the best fit for the attributes.
QUIZ 2
Class sklearn.linear_model.LogisticRegression, random_state _____.
a. indicates the seed of the pseudo random number generator used to shuffle data
Explanation: The class “sklearn.linear_model.LogisticRegression, random_state” indicates the seed of the pseudo random
number generator used to shuffle data.
QUIZ 3
What are the requirements of the K-means algorithm? Select all that apply.
Explanation: The K-means algorithm requires that the number of clusters be specified and that centroids that minimize inertia
be selected. It requires several iterations to fine tune itself and meet the required criteria to become the best fit model.
QUIZ 4
In Class sklearn.decomposition.PCA, the transform(X) method, where X is multi-dimensional, ______.
Explanation: In Class “sklearn.decomposition.PCA,” the transform(X) method applies the dimensionality reduction on X.
Key Takeaways
Scikit-learn has many built-in functions and algorithms which make it easy for Data
Scientists to build machine learning models.
Machine learning can easily and quickly extract information from large sources
of data.
Supervised and unsupervised machine learning models are two of the most
widely used learning models.
[Diagram: NLP challenges include extracting information and handling ambiguities]
Why Natural Language Processing (contd.)
In NLP, full automation can be easily achieved by using modern software libraries, modules, and
packages.
• Full automation
• Intelligent processing
Let us look at the Natural Language Processing approaches to analyze text data:
• Analyze sentence structure
• Extract information
Demo 01—NLP Environment Setup
Demonstrate the installation of the NLP environment
Demo 02—Sentence Analysis
Demonstrate sentence analysis
The NLP Applications
Let us take a look at the applications that use NLP:
• Speech Recognition
• Sentiment Analysis
Quiz
KNOWLEDGE CHECK
In NLP, tokenization is a way to _____.
c. Find ambiguities
Explanation: Splitting text data into words, phrases, and idioms is known as tokenization and each individual word is known as
token.
Major NLP Libraries
The major NLP libraries are NLTK, Scikit-learn, TextBlob, and spaCy.
The Scikit-Learn Approach
It is a very powerful library with a set of modules to process and analyze natural language data such as
texts and images and extract information using machine learning algorithms.
• Built-in modules load the dataset's content and categories.
• Feature extraction is a way to extract information from data, which can be text or images.
• The content is analyzed based on particular categories, and models are then trained accordingly.
The Scikit-Learn Approach (contd.)
It is a very powerful library with a set of modules to process and analyze natural language data such as texts and images and extract information using machine learning algorithms.
• Pipeline: a technique in the Scikit-learn approach to streamline the NLP process into stages.
• Model training: in this stage, we train the models to optimize the overall process.
• Grid search: a powerful way to search the parameters affecting the outcome for model-training purposes.
Modules to Load Content and Category
Scikit-learn has many built-in datasets. There are several methods to load these datasets with the help of
a data load object.
Diagram: a container folder holds one subfolder per category (Category 1, Category 2); the loaded content is represented as a NumPy array or SciPy matrix.
Modules to Load Content and Category (contd.)
Load dataset
Let us see how the type() function and the .data and .target attributes help in analyzing a dataset.
View data
View target
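A minimal sketch using one of scikit-learn's built-in dataset loaders, fetch_20newsgroups, to illustrate type, .data, and .target (the chosen categories are arbitrary):

from sklearn.datasets import fetch_20newsgroups

train = fetch_20newsgroups(subset='train', categories=['sci.space', 'rec.autos'])
print(type(train))          # the data load object
print(train.data[0][:200])  # view data: raw text of the first document
print(train.target[:10])    # view target: numeric category labels
print(train.target_names)   # the category names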
Feature Extraction
Feature extraction is a technique to convert content into numerical vectors on which machine learning can be performed.
The Bag of Words approach works in three steps:
1. Tokenizing: assign a fixed integer id to each word.
2. Counting: count the number of occurrences of each word.
3. Storing: store the count value as the feature.
For example, a document-term count matrix:
Document 1: 42 32 119 3
Document 3: 0 0 0 55
CountVectorizer Class Signature
class sklearn.feature_extraction.text.CountVectorizer(input='content', encoding='utf-8', decode_error='strict', strip_accents=None, lowercase=True, preprocessor=None, tokenizer=None, stop_words=None, token_pattern='(?u)\b\w\w+\b', ngram_range=(1, 1), analyzer='word', max_df=1.0, min_df=1, max_features=None, vocabulary=None, binary=False, dtype=<class 'numpy.int64'>)
Parameter notes:
• input: file name or sequence of strings
• encoding: encoding used to decode the input
• strip_accents: removes accents
• tokenizer: overrides the string tokenizer
• stop_words: built-in stop-words list
• max_df: maximum threshold
• min_df: minimum threshold
• max_features: specifies the number of features (components) to keep
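A minimal sketch of this class in use (the two documents are hypothetical):

from sklearn.feature_extraction.text import CountVectorizer

docs = ["the cat sat on the mat", "the dog sat on the log"]
vec = CountVectorizer()      # tokenizes and assigns a fixed integer id to each word
X = vec.fit_transform(docs)  # counts occurrences; X is stored as a sparse matrix
print(vec.vocabulary_)       # word -> integer id mapping
print(X.toarray())           # documents x words count matrix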
Demo 03—Bag of Words
Demonstrate the Bag of Words technique
Text Feature Extraction Considerations
Sparsity: This utility deals with sparse matrices while storing them in memory. Sparse data is commonly seen when extracting feature values, especially for large document datasets.
Decoding: This utility can decode text files if their encoding is specified.
Model Training
An important task in model training is to identify the right model for the given dataset. The choice of
model completely depends on the type of dataset.
Supervised: Models predict the outcome for new observations and datasets, and classify documents based on the features and response of a given dataset. Examples: Naïve Bayes, SVM, linear regression, k-nearest neighbors (k-NN).
Unsupervised: Models identify patterns in the data and extract its structure. They are also used to group documents using clustering algorithms. Example: K-means.
Naïve Bayes Classifier
Advantages:
• It is efficient, as it uses limited CPU and memory.
• It is fast, as model training takes less time.
Uses:
• Naïve Bayes is used for sentiment analysis, email spam detection, categorization of documents, and language detection.
• Multinomial Naïve Bayes is used when multiple occurrences of the words matter.
Naïve Bayes Classifier (contd.)
Let us take a look at the signature of the multinomial Naïve Bayes classifier:
class sklearn.naive_bayes.MultinomialNB(alpha=1.0, fit_prior=True, class_prior=None)
• alpha: smoothing parameter (0 for no smoothing)
• class_prior: prior probabilities of the classes
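A minimal sketch combining CountVectorizer with the multinomial Naïve Bayes classifier (the toy spam data and labels are hypothetical):

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

docs = ["free offer win money", "meeting schedule project",
        "win a free prize", "project deadline meeting"]
labels = [1, 0, 1, 0]  # 1 = spam, 0 = not spam

vec = CountVectorizer()
X = vec.fit_transform(docs)
clf = MultinomialNB(alpha=1.0)  # alpha: smoothing parameter (0 for no smoothing)
clf.fit(X, labels)
print(clf.predict(vec.transform(["free money offer"])))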
Grid Search and Multiple Parameters
Document classifiers can have many parameters and a Grid approach helps to search the best parameters
for model training and predicting the outcome accurately.
Diagram: documents pass through a document classifier and are assigned to Category 1 or Category 2.
Grid Search and Multiple Parameters (contd.)
In the grid search mechanism, the parameter space is divided into a grid, and the search can be run over the entire grid or a combination of grid points.
Diagram: a grid searcher evaluates Parameter 1, Parameter 2, and Parameter 3 and returns the best parameter.
Pipeline
A pipeline chains the stages of the NLP process:
• Extracts features around the word of interest
• Converts a collection of text documents into a numerical feature vector
• Helps the model predict accurately
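A minimal sketch of a pipeline searched by grid search (the toy data and the searched parameter values are hypothetical):

from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.naive_bayes import MultinomialNB
from sklearn.model_selection import GridSearchCV

docs = ["free offer win money", "meeting schedule project",
        "win a free prize now", "project deadline and meeting notes"]
labels = [1, 0, 1, 0]

pipe = Pipeline([
    ('vect', CountVectorizer()),    # text -> token counts
    ('tfidf', TfidfTransformer()),  # counts -> tf-idf weights
    ('clf', MultinomialNB()),       # classifier
])
params = {'vect__ngram_range': [(1, 1), (1, 2)], 'clf__alpha': [0.5, 1.0]}
search = GridSearchCV(pipe, params, cv=2, n_jobs=-1)  # n_jobs=-1 uses all cores
search.fit(docs, labels)
print(search.best_params_)  # the best parameters found on the grid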
Demo 04—Pipeline and Grid Search
Demonstrate the Pipeline and grid search technique
Quiz
Assignment 01
Problem Instructions
Common instructions:
• If you are new to Python, download the “Anaconda Installation Instructions” document from the
“Resources” tab to view the steps for installing Anaconda and the Jupyter notebook.
• Download the “Assignment 01” notebook and upload it on the Jupyter notebook to access it.
• Follow the provided cues to complete the assignment.
Quiz
Assignment 02
Problem Instructions
Common instructions:
• If you are new to Python, download the “Anaconda Installation Instructions” document from the
“Resources” tab to view the steps for installing Anaconda and the Jupyter notebook.
• Download the “Assignment 02” notebook and upload it on the Jupyter notebook to access it.
• Follow the provided cues to complete the assignment.
Quiz
QUIZ
What is the tf-idf value in a document?
1
Explanation: The tf-idf value reflects how important a word is to a document. It is directly proportional to the number of times the word appears in the document and is offset by the frequency of the word in the corpus.
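A minimal sketch of tf-idf in scikit-learn (the three documents are hypothetical):

from sklearn.feature_extraction.text import TfidfVectorizer

docs = ["the cat sat", "the cat ran", "dogs bark loudly"]
vec = TfidfVectorizer()
X = vec.fit_transform(docs)  # tf-idf: term frequency offset by corpus frequency
print(vec.vocabulary_)
print(X.toarray().round(2))  # words common across the corpus get lower weights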
QUIZ
In grid search if n_jobs = -1, then which of the following is correct?
2
Explanation: With n_jobs = -1, grid search detects all installed cores on the machine and uses all of them.
QUIZ
Identify the correct example of Topic Modeling from the following options:
3
a. Machine translation
b. Speech recognition
c. News aggregators
d. Sentiment analysis
Explanation: Topic modeling is a statistical modeling technique used to find latent groupings in documents based on their words. News aggregators are an example.
QUIZ
How do we save memory while operating on Bag of Words which typically contain
4 high-dimensional sparse datasets?
d. Store only the non-zero parts of the feature vectors
Explanation: In a feature vector, there will be several zero values. The best way to save memory is to store only the non-zero parts of the feature vectors.
QUIZ
What is the function of the sub-module feature_extraction.text.CountVectorizer?
5
Explain what data visualization is and its importance in our world today
Understand why Python is considered one of the best data visualization tools
List the types of plots and the steps involved in creating these plots
Data Visualization
You are a Sales Manager in a leading global organization. The organization plans to study the
sales details of each product across all regions and countries. This is to identify the product
which has the highest sales in a particular region and up the production. This research will
enable the organization to increase the manufacturing of that product in that particular
region.
Data Visualization
The main benefits of data visualization are as follows:
Data Visualization Considerations
Ensure the dataset is complete and relevant. This enables the Data Scientist to
use the new patterns obtained from the data in the relevant places.
Data Visualization Considerations (contd.)
Use efficient visualization techniques that highlight all the data points.
Data Visualization Factors
There are some basic factors that one needs to be aware of before visualizing the data:
The visual effect includes the usage of appropriate shapes, colors, and sizes to represent
the analyzed data.
Data Visualization Factors (contd.)
There are some basic factors that one needs to be aware of before visualizing the data:
The coordinate system helps organize the data points within the provided coordinates.
Data Visualization Factors (contd.)
There are some basic factors that one needs to be aware of before visualizing the data:
The data types and scale determine how the data is represented, for example, numeric or categorical.
Data Visualization Factors (contd.)
There are some basic factors that one needs to be aware of before visualizing the data:
The informative interpretation helps create visuals in an effective and easily interpretable
manner using labels, title, legends, and pointers.
Data Visualization Tool—Python
Many new Python data visualization libraries have been introduced recently, such as:
• matplotlib
• vispy
• pygal
• bokeh
• folium
• seaborn
• networkx
Python Libraries—matplotlib
Using Python's matplotlib, visualizing large and complex data becomes easy.
matplotlib
There are several advantages of using matplotlib to visualize data. They are as follows:
Figure: the "First Plot" example, a line plot with the Y-axis labeled "Numbers", the X-axis labeled "Range", a legend, and grid lines.
Steps to Create a Plot
Figure: the "First Plot" example showing Numbers plotted against Range.
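A minimal sketch of these steps (the plotted numbers are hypothetical):

import matplotlib.pyplot as plt

numbers = [0.3, 0.5, 0.4, 0.7, 0.6, 0.9, 1.0, 1.1]
plt.plot(numbers)        # plot the data
plt.title("First Plot")  # set the title
plt.xlabel("Range")      # label the X-axis
plt.ylabel("Numbers")    # label the Y-axis
plt.grid(True)           # show grid lines
plt.show()               # display the plot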
Steps to Create Plot – Example (contd.)
KNOWLEDGE
CHECK
Which of the following methods is used to set the title?
a. plot()
b. plt.title()
c. plot.title()
d. title()
A leading global organization wants to know how many people visit its website in a particular time. This
analysis helps it control and monitor the website traffic.
A 2D plot of users against time answers this question.
Plot With (X, Y)
The list of users is plotted on the Y-axis against time on the X-axis.
Figure: line plot of Number of users (0 to 1600) against Hrs (6 to 18).
Controlling Line Patterns and Colors
Figure: the same user-traffic plot drawn with a custom line pattern and color.
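A minimal sketch of controlling the line pattern and color (the traffic counts are hypothetical):

import matplotlib.pyplot as plt

hours = [6, 8, 10, 12, 14, 16, 18]
users = [150, 400, 900, 1200, 1000, 600, 250]
plt.plot(hours, users, color='green', linestyle='--', marker='o')  # dashed green line with markers
plt.xlabel("Hrs")
plt.ylabel("Number of users")
plt.show()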
Set Axis, Labels, and Legend Property
Using matplotlib, it is also possible to set the desired axis to interpret the result.
Figure: the user-traffic plot with the axis limited to Hrs 8 to 16 and Number of users 0 to 1000.
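A minimal sketch of setting the axis, labels, and legend (the data is hypothetical):

import matplotlib.pyplot as plt

hours = [8, 10, 12, 14, 16]
users = [300, 900, 1100, 800, 400]
plt.plot(hours, users, label="Users")
plt.axis([8, 16, 0, 1200])     # [xmin, xmax, ymin, ymax]
plt.xlabel("Hrs")
plt.ylabel("Number of users")
plt.legend(loc="upper right")  # legend built from the label keyword
plt.show()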
Alpha and Annotation
The alpha keyword controls the transparency of the plot line. The annotate() method is used to annotate the graph; it has several attributes which help annotate the plot.
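A minimal sketch of alpha and annotate() (the data and annotation text are hypothetical):

import matplotlib.pyplot as plt

hours = [8, 10, 12, 14, 16]
users = [300, 900, 1100, 800, 400]
plt.plot(hours, users, alpha=0.5)  # alpha sets the transparency of the plot line
plt.annotate("Peak traffic",       # annotation text
             xy=(12, 1100),        # the point being annotated
             xytext=(13, 900),     # where the text is placed
             arrowprops={'arrowstyle': '->'})
plt.show()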
Multiple Plots
Figure: Number of users against Hrs plotted for Monday, Tuesday, and Wednesday on the same axes.
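A minimal sketch of drawing multiple plots on the same axes (the daily counts are hypothetical):

import matplotlib.pyplot as plt

hours = [8, 10, 12, 14, 16]
monday = [300, 900, 1100, 800, 400]
tuesday = [250, 700, 1300, 900, 500]
wednesday = [400, 1000, 1600, 1200, 600]
plt.plot(hours, monday, label="Monday")  # one plt.plot call per line
plt.plot(hours, tuesday, label="Tuesday")
plt.plot(hours, wednesday, label="Wednesday")
plt.xlabel("Hrs")
plt.ylabel("Number of users")
plt.legend()
plt.show()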
Subplots
For example:
• subplot(2,1,1) and subplot(2,1,2): the grid is divided into two vertically stacked plots.
• subplot(2,2,1), subplot(2,2,2), subplot(2,2,3), and subplot(2,2,4): the grid is divided into four plots.
Layout
Layout and Spacing adjustments are two important factors to be considered while creating
subplots.
Use the plt.subplots_adjust() method with the parameters hspace and wspace to adjust the
distances between the subplots and move them around on the grid.
Diagram: hspace adjusts the vertical gap between the top and bottom subplots; wspace adjusts the horizontal gap.
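A minimal sketch of a 2x2 subplot grid with adjusted spacing (the plotted data is hypothetical):

import matplotlib.pyplot as plt

data = [1, 2, 3, 4]
plt.subplot(2, 2, 1); plt.plot(data)               # top left of a 2x2 grid
plt.subplot(2, 2, 2); plt.plot(data[::-1])         # top right
plt.subplot(2, 2, 3); plt.bar(range(4), data)      # bottom left
plt.subplot(2, 2, 4); plt.scatter(range(4), data)  # bottom right
plt.subplots_adjust(hspace=0.5, wspace=0.5)        # vertical and horizontal gaps
plt.show()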
Knowledge Check
KNOWLEDGE
CHECK
Which of the following methods is used to adjust the distances between the subplots?
a. plot.subplots_adjust()
b. plt.subplots_adjust()
c. subplots_adjust()
d. plt.subplots.adjust()
Types of Plots
• Histogram
• Scatter Plot
• Heat Map
• Pie Chart
• Error Bar
Types of Plots (contd.)
Histogram: A histogram plots the frequency distribution of the data.
Types of Plots (contd.)
Heat Map: A heat map is a way to visualize two-dimensional data. Using heat maps, you can gain deeper and faster insights about data than with other types of plots.
It has several advantages:
• Draws attention to the risk-prone areas
• Uses the entire dataset to draw meaningful insights
• Is used for cluster analysis and can deal with large datasets
Types of Plots (contd.)
Error Bar: An error bar is used to graphically represent the variability of data. It is used mainly to identify errors. It builds confidence about the data analysis by revealing the statistical difference between two groups of data.
It has several advantages:
• Shows the variability in data and indicates the errors
• Depicts the precision in the data analysis
• Demonstrates how well a function and model are used in the data analysis
• Describes the underlying data
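A minimal sketch of a histogram and an error bar side by side (the values are hypothetical):

import matplotlib.pyplot as plt
import numpy as np

values = np.random.randn(1000)  # sample values
plt.subplot(1, 2, 1)
plt.hist(values, bins=20)       # histogram: frequency per bin
plt.title("Histogram")

plt.subplot(1, 2, 2)
x = [1, 2, 3, 4]
y = [2.0, 3.5, 3.0, 4.5]
err = [0.3, 0.2, 0.4, 0.3]             # variability of each point
plt.errorbar(x, y, yerr=err, fmt='o')  # error bar: variability of the data
plt.title("Error Bar")
plt.show()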
Seaborn
Assignment 01
Problem Instructions
Analyze the “auto mpg data” and draw a pair plot using the seaborn library for mpg, weight, and origin.
Sources:
(a) Origin: This dataset was taken from the StatLib library maintained at Carnegie Mellon University.
• Number of Instances: 398
• Number of Attributes: 9 including the class attribute
• Attribute Information:
o mpg: continuous
o cylinders: multi-valued discrete
o displacement: continuous
o horsepower: continuous
o weight: continuous
o acceleration: continuous
o model year: multi-valued discrete
o origin: multi-valued discrete
o car name: string (unique for each instance)
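A minimal sketch of such a pair plot, assuming the dataset has been saved locally as auto-mpg.csv with the attribute names above (the file name is hypothetical):

import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

df = pd.read_csv("auto-mpg.csv")               # hypothetical local copy of the dataset
sns.pairplot(df[["mpg", "weight", "origin"]])  # pairwise plots of the three attributes
plt.show()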
Quiz
Assignment 02
Problem Instructions
You have been provided with a dataset that lists Ohio State’s leading causes of death from the year
2012.
Using the two data points:
• Cause of deaths and
• Percentile
Draw a pie chart to visualize the dataset.
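A minimal sketch of such a pie chart; the causes and percentages below are hypothetical stand-ins for the provided dataset:

import matplotlib.pyplot as plt

causes = ["Heart disease", "Cancer", "Respiratory disease", "Stroke", "Other"]
percentile = [25, 23, 7, 5, 40]  # hypothetical values
plt.pie(percentile, labels=causes, autopct="%1.1f%%")
plt.title("Leading causes of death, Ohio, 2012")
plt.show()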
QUIZ
Which of the following needs to be used to display the plot in the Jupyter
notebook?
1
a. %matplotlib
b. %matplotlib inline
c. import matplotlib
d. import style
QUIZ
Which of the following keywords is used to decide the transparency of the plot line?
2
a. Legend
b. Alpha
c. Animated
d. Annotation
QUIZ
Which of the following plots is used to represent data in a two-dimensional manner?
3
a. Histogram
b. Heat Map
c. Pie Chart
d. Scatter Plot
QUIZ
Which of the following statements limits both x and y axes to the interval [0, 6]?
4
a. plt.xlim(0, 6)
b. plt.ylim(0, 6)
c. plt.xylim(0, 6)
d. plt.axis([0, 6, 0, 6])
Explanation: plt.axis([0, 6, 0, 6]) statement limits both x and y axes to the interval [0, 6].
Key Takeaways
Data visualization is the technique to present the data in a pictorial or graphical format.
There are three major considerations for data visualization. They are clarity, accuracy, and
efficiency.
matplotlib is a Python 2D plotting library for data visualization and the creation of interactive graphics and plots.
A plot is a graphical representation of data which shows the relationship between
two variables or the distribution of data.
Describe basic terminologies such as parser, object, and tree associated with BeautifulSoup
Understand various operations such as searching, modifying, and navigating
the tree to yield the required result
What is Web Scraping
Every day, you find yourself in a situation where you need to extract data from the web.
Web Scraping Process—Basic Preparation
There are two basic things to consider before setting up the web scraping process.
Web Scraping Process
Once you have understood the target data and finalized the list of websites, you need to design the web scraping process. The steps involved in a typical web scraping process are as follows:
Step 1: A web request is sent to the targeted website to collect the required data.
Step 2: The information is retrieved from the targeted website in HTML or XML format.
Step 3: The retrieved information is passed to the appropriate parser based on the data format. Parsing is a technique to read data and extract information from the available document.
Step 4: The parsed data is stored in the desired format. You can follow the same process to scrape another targeted website.
Web Scraping Software vs. Web Browser
A web scraping software interacts with websites in the same way as your web browser. A web scraper is used to extract information from the web in a routine, automated manner.
Browser: displays the data. Web scraper: saves the data from the web page to a local file or database.
Web Scraping Considerations
It’s important to read and understand the legal information and terms and conditions
mentioned in the website.
Legal constraints:
• Copyright notice
• Trademark material
• Patented information
Web Scraping Tool—BeautifulSoup
BeautifulSoup is an easy, intuitive, and robust Python library designed for web scraping.
Efficient tool for dissecting documents and extracting information from the web pages
Powerful sets of built-in methods for navigating, searching, and modifying a parse tree
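A minimal sketch of fetching and parsing a page (the URL is a hypothetical placeholder; check a site's terms before scraping it):

import requests
from bs4 import BeautifulSoup

resp = requests.get("https://example.com")      # send the web request
soup = BeautifulSoup(resp.text, "html.parser")  # build the parse tree
print(soup.title)                               # view the title tag
print(soup.prettify()[:500])                    # print the content in a readable format
for link in soup.find_all("a"):                 # print all href links
    print(link.get("href"))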
Common Data/ Page Formats on The Web
HTML5 is a new HTML standard which gained popularity with mobile devices.
Common Data/ Page Formats on The Web (contd.)
Application Programming Interfaces, or APIs, have become a common way to extract information from the web.
Common Data/ Page Formats on The Web (contd.)
JavaScript Object Notation, or JSON, is a lightweight and popular format used for information exchange on the web.
The Parser
What is a parser?
A Parser is also used to validate the input information before processing it.
Parsing data is one of the most important steps in the web scraping process.
Failing to parse the data would eventually lead to a failure of the entire process.
Various Parsers
lxml.xml: the only XML parser available; it depends on C.
HTML parsers: html.parser, lxml, and html5lib.
Tree
KNOWLEDGE Which of the following object types represents a string or set of characters within a
CHECK tag?
a. Tag
b. NavigableString
c. BeautifulSoup
d. Comment
Diagram: an HTML tree with html at the root, head and body as its children, and a and li tags below body.
Understanding The Tree
• html tag
• body tag
• div: a division or a section
• CSS: cascading style sheets
Understanding The Tree (contd.)
Diagram: a parent div (organizationlist) with sibling li tags (HRmanager) beneath it.
With the help of the search filters technique, you can extract specific information from the parsed
document.
The filters can be treated as search criteria for extracting the information based on the elements
present in the document.
Searching The Tree – Filters (contd.)
There are various kinds of filters used for searching information in a tree:
List: a list filter matches strings against the search items in the list.
Searching methods and attributes: find_all() and find()
Searching the tree with find_all()
The find_all() method searches and retrieves all of a tag's descendants that match your filters.
The find() method has a syntax similar to that of the find_all() method; however, there are some
key differences.
find_all(): scans the entire document; returns a list with the matching values, or an empty list if nothing matches.
Searching the parse tree can also be performed by various other methods such as the following:
Tree Search
• find_parents() / find_parent()
• find_previous_siblings() / find_previous_sibling()
• find_all_next() / find_next()
• find_all_previous() / find_previous()
• find_next_siblings() / find_next_sibling()
The plural methods scan for all the matches; the singular methods scan only for the first match.
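A minimal sketch of these search methods on a small hypothetical document:

from bs4 import BeautifulSoup

html = """<div id="organizationlist">
<li class="manager">HR manager</li>
<li class="manager">IT manager</li>
</div>"""
soup = BeautifulSoup(html, "html.parser")
print(soup.find_all("li"))                    # all matches, returned as a list
print(soup.find("li"))                        # only the first match
print(soup.find_all("li", class_="manager"))  # filter on an attribute
print(soup.find_all(["div", "li"]))           # list filter: matches any item in the list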
Demo: 02—Searching in a Tree with Filters
This demo shows the ways to search in a tree using filters.
Knowledge Check
KNOWLEDGE
CHECK
The method get_text() is used to _________.
Navigating Options
With the help of BeautifulSoup, it is easy to navigate the parse tree based on the need:
• Navigating Down: extract information from children tags. The attributes used to navigate down include .contents and .children.
• Navigating Up: extract information from parent tags. The attribute used to navigate up is .parent.
• Navigating Sideways: extract information from the same level in the tree. The attributes used to navigate sideways are .next_sibling and .previous_sibling.
• Navigating Back and Forth: parse the tree back and forth. The attributes used are .next_element, .previous_element, .next_elements, and .previous_elements.
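A minimal sketch of these navigation attributes on a hypothetical snippet:

from bs4 import BeautifulSoup

soup = BeautifulSoup("<div><li>HR manager</li><li>IT manager</li></div>", "html.parser")
first = soup.li
print(first.parent.name)   # navigate up: 'div'
print(first.next_sibling)  # navigate sideways: the second <li>
print(first.next_element)  # navigate forward: the string 'HR manager'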
Demo: 03—Navigating a Tree
This demo shows how to navigate the web tree using various techniques.
Knowledge Check
KNOWLEDGE
CHECK
Which of the following attributes is used to navigate up?
a. .next_element
b. .parent
c. .previous_elements
d. .next_sibling
Modifying The Tree
With BeautifulSoup, you can also modify the tree and write your changes as a new HTML or
XML document.
• .string
• append()
• NavigableString()
• new_tag()
• insert()
• insert_before() and insert_after()
• clear()
• extract()
• decompose()
• replace_with()
• wrap()
• unwrap()
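A minimal sketch of a few of these operations on a hypothetical snippet:

from bs4 import BeautifulSoup

soup = BeautifulSoup("<p><b>old text</b></p>", "html.parser")
tag = soup.b
tag.string = "new text"       # replace the tag's string
new_tag = soup.new_tag("i")   # create a new tag
new_tag.string = "extra"
tag.insert_after(new_tag)     # insert it after <b>
tag.wrap(soup.new_tag("em"))  # wrap <b> in <em>
print(soup)                   # the changes written as a new HTML document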
Demo: 04—Modifying the Tree
This demo shows you ways to modify a web tree to get the desired result
with the help of an example.
Parsing Only Part of the Document
This feature of parsing a part of the document will not work with the html5lib parser.
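A minimal sketch using SoupStrainer to parse only part of a hypothetical document:

from bs4 import BeautifulSoup, SoupStrainer

html = "<html><body><a href='/a'>A</a><p>text</p><a href='/b'>B</a></body></html>"
only_links = SoupStrainer("a")  # consider only the <a> tags while parsing
soup = BeautifulSoup(html, "html.parser", parse_only=only_links)
print(soup)  # contains just the two links; this will not work with html5lib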
Demo: 05—Parsing part of the document
This demo shows you how to parse only a part of a document with the help of an example.
Output: Printing and Formatting
Printing: The prettify() method turns a parse tree into a nicely formatted Unicode string. The unicode() or str() method turns a parse tree into a plain string without decorative formatting.
Output: Printing and Formatting (contd.)
Formatters are used to generate different types of output with the desired formatting:
• minimal: processes content with valid HTML/XML tags.
• html and xml: convert Unicode characters into HTML and XML entities, respectively.
• None: does not modify the content or string on output.
• Uppercase and lowercase: convert string values to uppercase and lowercase, respectively.
Demo: 06—Formatting and Printing
This demo shows the ways to format, print, and encode the web
document.
Encoding
Problem Instructions
Scrape the Simplilearn website page and perform the following tasks:
• View and print the Simplilearn web page content in a proper format
• View the head and title
• Print all the href links present in the Simplilearn web page
Problem Instructions
Scrape the Simplilearn website resource page and perform the following tasks:
• View and print the Simplilearn web page content in a proper format
• View the head and title
• Print all the href links present in the Simplilearn web page
• Search and print the resource headers of the Simplilearn web page
• Search resource topics
• View the article names and navigate through them
Problem Instructions
Common instructions:
• If you are new to Python, download the “Anaconda Installation Instructions” document from the
“Resources” tab to view the steps for installing Anaconda and the Jupyter notebook.
• Download the “Assignment 02” notebook and upload it on the Jupyter notebook to access it.
• Follow the provided cues to complete the assignment.
Quiz
QUIZ
Which of the following is the only xml parser?
1
a. html.parser
b. lxml
c. lxml.xml
d. html5lib
QUIZ
In which of the following formats is the BeautifulSoup output encoded?
2
a. ASCII
b. Unicode
c. latin-1
d. UTF-8
QUIZ
Which of the following libraries is used to extract a web page?
3
a. Beautiful Soup
b. Pandas
c. Requests
d. NumPy
QUIZ
Which of the following is NOT an object in BeautifulSoup?
4
a. Tag
b. NextSibling
c. NavigableString
d. Comment
Objects are used to extract the required information from a tree structure by
searching or navigating through the parsed document.
A tree can be defined as a collection of simple and complex objects.
BeautifulSoup transforms a complex HTML document into a complex tree of Python objects.
This concludes “Web Scraping with BeautifulSoup”
The next lesson is “Python integration with Hadoop, MapReduce, and Spark”
Data Science with Python
Lesson 12—Python Integration with Hadoop MapReduce
and Spark
What You’ll Learn
Real-time analytics across the data analytics life cycle: Acquire, Wrangle, Explore, Model, and Visualize.
Disparity in Programming Languages
However, Big Data can only be accessed through Hadoop, which is completely developed and implemented in Java. Also, analytics platforms are coded in different programming languages.
Diagram: Python-based data science reaches the Big Data analytics platform (the Hadoop infrastructure, HDFS, and Spark) through Python APIs, bridging the multiple programming languages involved.
Hadoop
Hadoop has two core components: HDFS (Hadoop Distributed File System) and MapReduce.
This example illustrates the Hadoop system architecture and the way data is stored in a cluster.
Diagram: data sources feed a large file into the cluster; the name node (backed by a secondary name node) tracks the file blocks (64 MB or 128 MB) that are distributed across the data nodes of the Hadoop cluster.
MapReduce
The second core component of Hadoop is MapReduce, the primary framework of the HDFS architecture.
Diagram: input from HDFS is divided into splits (split 0, split 1, split 2), each processed by a map task; the intermediate output is copied, sorted, and merged, and the reduce tasks write the final parts (part 0, part 1) back to HDFS with replication.
MapReduce: The Mapper and Reducer
Let us discuss the MapReduce functions—mapper and reducer—in detail.
Mapper:
• Mappers run locally on the data nodes to avoid the network traffic.
• Multiple mappers run in parallel, each processing a portion of the input data.
• The mapper reads data in the form of key-value pairs.
• If the mapper generates an output, it is written in the form of key-value pairs.
Reducer:
• All intermediate values for a given intermediate key are combined together into a list and given to a reducer. This step is known as 'shuffle and sort'.
• The reducer outputs either zero or more final key-value pairs. These are written to HDFS.
Hadoop Streaming: Python API for Hadoop
Hadoop Streaming acts like a bridge between your Python code and the Java-based HDFS, and lets you seamlessly access Hadoop clusters and execute MapReduce tasks.
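A minimal word-count sketch of the two Streaming scripts (the file names and the jar path vary by distribution and are hypothetical):

# mapper.py: read lines from stdin and emit "word<TAB>1" pairs
import sys
for line in sys.stdin:
    for word in line.strip().split():
        print('%s\t%s' % (word, 1))

# reducer.py: sum the counts per word; Streaming delivers the input sorted by key
import sys
current, count = None, 0
for line in sys.stdin:
    word, num = line.strip().split('\t')
    if word == current:
        count += int(num)
    else:
        if current is not None:
            print('%s\t%s' % (current, count))
        current, count = word, int(num)
if current is not None:
    print('%s\t%s' % (current, count))

The job would then be submitted with the Hadoop Streaming jar, for example: hadoop jar /path/to/hadoop-streaming.jar -mapper mapper.py -reducer reducer.py -input <input dir> -output <output dir> (the paths are hypothetical).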
You can now sum the numbers using the reduce function:
from functools import reduce
sum_squared = reduce(lambda x, y: x + y, squared_nums)
Cloudera provides an enterprise-ready Hadoop Big Data platform which supports Python as well.
To set up the Cloudera Hadoop environment, visit the Cloudera link:
http://www.cloudera.com/downloads/quickstart_vms/5-7.html
Cloudera recommends that you use 7-Zip to extract these files. To download and install it, visit the link:
http://www.7-zip.org/
Cloudera QuickStart VM: Prerequisites
• These 64-bit VMs require a 64-bit host OS and a virtualization product that can support a 64-bit guest
OS.
• To use a VMware VM, you must use a player compatible with WorkStation 8.x or higher:
• Player 4.x or higher
• Fusion 4.x or higher
• Older versions of WorkStation can be used to create a new VM using the same virtual disk (VMDK file),
but some features in VMware Tools are not available.
• The amount of RAM required varies by the run-time option you choose
QuickStart VMware Player: Windows, Linux & VMware Fusion: Mac
https://www.vmware.com/products/player/ https://www.vmware.com/products/fusion/
playerpro-evaluation.html fusion-evaluation.html
QuickStart VMware Image
Launch VMware player with Cloudera VM
Launch Terminal
Account:
username: cloudera
password: cloudera
QuickStart VM Terminal
Step 01 Step 02
Unix commands:
• pwd to verify present working directory
• ls -lrt to list files and directories
Demo 01—Using Hadoop Streaming for Calculating Word Count
Demonstrate how to create a MapReduce program and use Hadoop
Streaming to determine the word count of a document
Knowledge Check
KNOWLEDGE
CHECK
What is the usual size of the data block on HDFS?
a. 32 MB
b. 64 MB
c. 100 MB
d. 1 GB
Apache Spark Uses In-Memory Instead of Disk I/O
Diagram: in MapReduce, each query reads the input from HDFS and writes its result back to disk; in Spark, the input is read from HDFS once, iterations run in memory (RAM), and queries are answered from distributed memory across the cluster's CPUs.
Apache Spark Resilient Distributed Datasets (RDD)
Some basic concepts about Resilient Distributed Datasets (RDD) are listed here:
PySpark is the Spark Python API, which enables data scientists to access the Spark programming model. An RDD supports two types of operations: transformations and actions.
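A minimal sketch, assuming a SparkContext named sc is already available (as in the PySpark shell or a configured notebook):

rdd = sc.parallelize(["a b a", "b c"])          # hypothetical input lines
words = rdd.flatMap(lambda line: line.split())  # transformation: lazily evaluated
pairs = words.map(lambda w: (w, 1))             # transformation: lazily evaluated
counts = pairs.reduceByKey(lambda x, y: x + y)  # transformation: lazily evaluated
print(counts.collect())                         # action: triggers the computation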
The Spark stack includes Spark SQL, Spark Streaming, MLlib (machine learning), and GraphX (graph processing).
Download Spark from http://spark.apache.org/downloads.html and extract it to [installed directory]\spark-1.6.1-bin-hadoop2.4\spark-1.6.1-bin-hadoop2.4.
Set up the PySpark notebook-specific variables, then check the SparkContext.
Demo 02—Using PySpark to Determine Word Count
Demonstrate how to use the Jupyter integrated PySpark API to
determine the word count of a given dataset
Knowledge Check
KNOWLEDGE What happens if the available memory is insufficient while performing RDD
CHECK transformations?
Problem Instructions
Click each tab to know more. Click the Resources tab to download the files for this
assignment.
Assignment 01
Problem Instructions
Special instructions:
• This assignment is done purely on Cloudera’s QuickStart VM. You may need to learn a few basic
UNIX commands to operate the program.
• For any cues, refer to the Hadoop Streaming demo provided in the lesson.
Quiz
Assignment 02
Problem Instructions
Use the given dataset to count and display all the airports based in New York using PySpark. Perform
the following steps:
• View all the airports listed in the dataset
• View only the first 10 records
• Filter the data for all airports located in New York
• Clean up the dataset, if required
Assignment 02
Problem Instructions
Common instructions:
• If you are new to Python, download the “Anaconda Installation Instructions” document from the
“Resources” tab to view the steps for installing Anaconda and the Jupyter notebook.
• Download the “Assignment 02” notebook and upload it on the Jupyter notebook to access it.
• Follow the provided cues to complete the assignment.
Quiz
QUIZ
What are the core components of Hadoop? Select all that apply.
1
a. MapReduce
b. HDFS
c. Spark
d. RDD
QUIZ
MapReduce is a data processing framework which gets executed _____.
2
a. at DataNode
b. at NameNode
c. on client side
d. in memory
Explanation: The MapReduce program is executed at the data node and the output is written to the disk.
QUIZ
Which of the following functions is responsible for consolidating the results produced
3 by each of the Map() functions/tasks?
a. Reducer
b. Mapper
c. Partitioner
QUIZ
What transforms input key-value pairs to a set of intermediate key-value pairs?
4
a. Mapper
b. Reducer
c. Combiner
d. Partitioner
Explanation: Mapper processes input data to intermediate key-value pairs which are in turn processed by reducers.
Key Takeaways
Both Hadoop and Spark provide Python APIs to help Data Scientists use the Big
Data platform.
After learning about Data Science in depth, it is now time to implement the knowledge gained through
this course in real-life scenarios. We will provide you with four scenarios where you need to implement
data science solutions. To perform these tasks, you can use different Python libraries such as
NumPy, SciPy, Pandas, scikit-learn, matplotlib, BeautifulSoup, and so on.
You will focus on acquiring stock data information for the companies listed.
• Import the financial data using Yahoo data reader for the following companies:
o Yahoo
o Apple
o Amazon
o Microsoft
o Google
• Perform Daily Return Analysis and show the relationship between different stocks
o Plot the percentage change plot for Apple’s stock
o Show a joint plot for Apple and Google
o Use PairPlot to show the correlation between all the stocks
We recommend that you first solve the project and then view the solution to assess your learning.
Project 01: Stock Market Data Analysis
Hope you had a good experience working on the project “Stock Market Data Analysis.”
Go to the next screen to assess your performance.
After learning about Data Science in depth, it is time to implement the knowledge gained through this
course in real-life scenarios. We are providing four real-life scenarios where you can implement data
science solutions. To develop solutions to these problems, you can use various Python libraries like
NumPy, SciPy, Pandas, Scikit-learn, Matplotlib, BeautifulSoup, and so on.
Project details are given below:
Project 02: Titanic Data Set Analysis
On April 15, 1912, the Titanic sank after colliding with an iceberg, killing 1502 out of 2224 passengers
and crew. This tragedy shocked the world and led to better safety regulations for ships. Here, we ask
you to perform the analysis through the exploratory data analysis technique. In particular, we want
you to apply the tools of machine learning to predict which passengers survived the tragedy.
The details of these projects and their scope are listed in the following sections.
Click each tab to know more. Click the Resources tab to download the files for this project.
Project 02: Titanic Data Set Analysis
We recommend that you first solve the project and then view the solution to assess your learning.
Project 02: Titanic Data Set Analysis
Hope you had a good experience working on the project “Titanic Data Set Analysis.”
Go to the next screen to assess your performance.