Data Science With Python - Lesson 02 - Data Analytics Overview
Why Data Analytics
Data by itself is just an information source; unless you understand it, you cannot use it effectively.
When the transaction details are presented as a line chart, the deposit and withdrawal patterns become apparent. The chart helps you view and analyze general trends and discrepancies.
[Figure: line chart of transaction details, annotated "Overall pattern" and "Discrepancy"]
Introduction to Data Analytics
Collect data from various sources for analysis to answer the question raised in step 1.
Twitter, Facebook, LinkedIn, and other social media and information sites provide streaming APIs.
Data wrangling is the most important phase of the data analytics process.
This phase includes data cleansing, data manipulation, data aggregation, data split, and reshaping of data.
It is also the most challenging phase and takes up 70% of a data scientist's time.
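The wrangling operations listed above can be sketched with the Python standard library alone. The records, field names, and the reading of "data split" as separating deposits from withdrawals are all assumptions made for illustration; in practice a library such as Pandas would typically do this work.

```python
from collections import defaultdict

# Hypothetical raw transaction records; amounts arrive as text.
raw_records = [
    {"account": "A", "amount": "100.0"},
    {"account": "B", "amount": ""},        # missing value
    {"account": "A", "amount": "-40.5"},
    {"account": "B", "amount": "25"},
]

# Cleansing and manipulation: drop records with a missing amount
# and convert the remaining amounts from text to float.
clean_records = [
    {"account": r["account"], "amount": float(r["amount"])}
    for r in raw_records
    if r["amount"].strip()
]

# Aggregation: total amount per account.
totals = defaultdict(float)
for record in clean_records:
    totals[record["account"]] += record["amount"]

# Split: separate deposits (non-negative) from withdrawals (negative).
deposits = [r for r in clean_records if r["amount"] >= 0]
withdrawals = [r for r in clean_records if r["amount"] < 0]

print(dict(totals))
```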
Data Exploration: Model Selection
Model selection
• Based on the overall data analysis process
• Should be accurate to avoid iterations
• Depends on pattern identification and algorithms
• Depends on hypothesis building and testing
• Leads to building mathematical and statistical functions
Exploratory Data Analysis (EDA)
The EDA approach studies the data to recommend suitable models that best fit the data.
The focus is on data: its structure, outliers, and models suggested by the data.
EDA techniques make minimal or no assumptions. They present and show all the underlying data without any data loss.
Quantitative: Provides numeric outputs for the inputted data.
Graphical: Uses statistical functions for graphical output.
EDA: Quantitative Technique
The quantitative technique has two goals: measuring the central tendency and measuring the spread of the data.
Measurement of Spread
Variance: approximately the mean of the squared deviations from the mean.
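A small worked example may make both goals concrete. The sample values below are hypothetical. The sketch computes the three usual measures of central tendency and then the variance two ways: dividing by n (population variance) and by n - 1 (sample variance), which is one reason the variance is described as "approximately" the mean of the squared deviations.

```python
import statistics

# Hypothetical sample: fuel economy readings in miles per gallon.
mpg = [18, 21, 21, 24, 26, 30, 35]

# Measures of central tendency.
mean = statistics.mean(mpg)
median = statistics.median(mpg)
mode = statistics.mode(mpg)

# Measure of spread: variance from the squared deviations.
squared_deviations = [(x - mean) ** 2 for x in mpg]
population_variance = sum(squared_deviations) / len(mpg)
sample_variance = sum(squared_deviations) / (len(mpg) - 1)

print(mean, median, mode)
print(population_variance, sample_variance)
```

The hand-computed values agree with `statistics.pvariance` and `statistics.variance` from the standard library.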
EDA: Graphical Technique
Histograms and scatter plots are two popular graphical techniques to depict data.
[Figure: histogram of a univariate dataset; x-axis: Miles Per Gallon, y-axis: Frequency]
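What a histogram computes can be sketched without a plotting library: group the values of one variable into equal-width bins and count the frequency in each bin. The readings and the bin width below are hypothetical, and the text bars only stand in for the chart that a library such as Matplotlib would draw.

```python
from collections import Counter

# Hypothetical miles-per-gallon readings (a univariate dataset).
mpg = [12, 14, 17, 18, 21, 22, 22, 24, 27, 31]
bin_width = 5  # assumed bin width


def bin_start(value, width):
    """Return the lower edge of the bin that contains value."""
    return (value // width) * width


# Frequency of readings per bin, keyed by the bin's lower edge.
frequencies = Counter(bin_start(value, bin_width) for value in mpg)

# Text rendering: one row per bin, one '#' per reading.
for start in sorted(frequencies):
    print(f"{start:>2}-{start + bin_width - 1:<2} {'#' * frequencies[start]}")
```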
This step involves reaching a conclusion and making predictions based on the data analysis.
A hypothesis is used to establish the relationship between dependent and independent variables.
Draw two samples from the population and calculate the difference between their means.
Calculating the difference between the two means is hypothesis testing.
[Figure: two samples, S1 and S2, with means μ1 and μ2]
Hypothesis Testing
Alternative Hypothesis
• Proposed model outcome is accurate and matches the data.
• There is a difference between the means of S1 and S2.
Null Hypothesis
• Opposite of the alternative hypothesis.
• There is no difference between the means of S1 and S2.
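The comparison of S1 and S2 can be illustrated with a short sketch. The sample values are hypothetical, and the choice of Welch's t statistic to scale the difference is an assumption, since the lesson does not name a specific test. A value of t far from zero counts as evidence against the null hypothesis; turning it into a decision requires comparing it against a t distribution, which is omitted here.

```python
import math
import statistics

# Hypothetical samples S1 and S2 drawn from the same population.
s1 = [20, 22, 24, 26, 28]
s2 = [23, 25, 27, 29, 31]

# Difference between the two sample means.
mean1 = statistics.mean(s1)
mean2 = statistics.mean(s2)
difference = mean2 - mean1

# Welch's t statistic (assumed test): the difference divided by
# its standard error, using each sample's own variance.
var1 = statistics.variance(s1)
var2 = statistics.variance(s2)
standard_error = math.sqrt(var1 / len(s1) + var2 / len(s2))
t_statistic = difference / standard_error

print(difference, t_statistic)
```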
Hypothesis Testing Process
The process involves choosing the training and test datasets and evaluating them against the null and alternative hypotheses.
Usually the training dataset is between 60% and 80% of the full dataset, and the test dataset is the remaining 20% to 40%.
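A minimal sketch of such a split, assuming a 70/30 ratio within the range given above. The helper name `split_dataset`, its parameters, and the fixed seed are hypothetical; machine learning libraries ship their own equivalents.

```python
import random


def split_dataset(rows, train_fraction=0.7, seed=0):
    """Shuffle a copy of rows and split it into (train, test) lists."""
    if not 0 < train_fraction < 1:
        raise ValueError("train_fraction must be between 0 and 1")
    shuffled = list(rows)
    # A seeded generator keeps the split reproducible between runs.
    random.Random(seed).shuffle(shuffled)
    cut = round(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


# Hypothetical dataset of 100 records, identified by index.
dataset = list(range(100))
train, test = split_dataset(dataset, train_fraction=0.7)

print(len(train), len(test))
```

Shuffling before cutting avoids a test set made up only of the last records, which matters when the data is stored in some order.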
Data Visualization
Communication
The last step of data analysis is communication, where the analyzed data is formally presented to stakeholders.
Plotting is a data visualization technique used to represent underlying data through graphics.
Features of plotting:
Data is measured in time blocks, such as date, month, year, and time (hours, minutes, and seconds).
Time Series
Types of Plot
Data acquisition is the process of collecting data from various sources, such as RDBMS, NoSQL databases, and web server logs, and of scraping the web through web APIs.
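Two of these sources can be sketched with the standard library. An in-memory SQLite database stands in for an RDBMS, and a regular expression parses one web server log line. The table, its rows, the log line, and the pattern are all hypothetical and only illustrate the idea.

```python
import re
import sqlite3

# Source 1: a relational database (in-memory SQLite as a stand-in).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (item TEXT, quantity INTEGER)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?)", [("pen", 3), ("book", 2)]
)
sales_rows = conn.execute(
    "SELECT item, quantity FROM sales ORDER BY item"
).fetchall()
conn.close()

# Source 2: a web server log line in a common access-log layout.
log_line = (
    '127.0.0.1 - - [10/Oct/2020:13:55:36 +0000] '
    '"GET /index.html HTTP/1.1" 200 2326'
)
log_pattern = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) [^"]*" '
    r'(?P<status>\d{3}) (?P<size>\d+)'
)
match = log_pattern.match(log_line)
log_record = match.groupdict() if match else {}

print(sales_rows)
print(log_record)
```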
Knowledge Check 2
What is the exploratory data analysis technique?
Select all that apply.
Most EDA techniques are graphical in nature, with a few quantitative techniques, and they suggest models that best fit the data.
They use almost the entire dataset with minimal or no assumptions.
Knowledge Check 3
Which plotting technique is used for continuous data?
Select all that apply.
a. Regression plot
b. Line chart
c. Histogram
d. Heat map
Knowledge Check 4
Which Python library is the main machine learning library?
a. Pandas
b. Matplotlib
c. Scikit-learn
d. NumPy
Knowledge Check 5
Which of the following includes data transformation, merging, aggregation, group by operation, and reshaping?
a. Data acquisition
b. Data visualization
c. Data wrangling
d. Machine learning
Data wrangling includes data transformation, merging, aggregation, group by operation, and reshaping.
Knowledge Check 6
Which measure of central tendency is used to catch outliers in the data?
a. Mean
b. Median
c. Mode
d. Variance
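The median is the exact middle value of the sorted data and is the most suitable measure for catching outliers. Because extreme values barely move it, a large gap between the median and the mean signals outliers.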
Knowledge Check 6
In hypothesis testing, the proposed model is built on:
a. Entire dataset
b. Test dataset
c. Small subset
d. Training dataset
Knowledge Check 7
The Beautiful Soup library is used for _____.
a. Data wrangling
b. Web scraping
c. Plotting
d. Machine learning
Beautiful Soup is used for web scraping, mainly in the data acquisition phase.
Key Takeaways