Notes Data Science With Python 1

This document discusses data science and analytics using Python. It covers the basic skills of a data scientist including asking questions, understanding data structures, interpreting data, applying statistical methods, visualizing data, and working as a team player. It also discusses the roles and responsibilities of a data scientist such as identifying data sources, preprocessing data, analyzing data to find trends, building predictive models, presenting findings visually, and collaborating with other teams. Finally, it outlines the five steps of the data analysis process: defining the purpose, collecting data, cleaning data, analyzing data, and interpreting the results.

Uploaded by

saurabh sharma
Copyright © All Rights Reserved

Data Science with Python

Data Scientists collect data and explore, analyze, and visualize it. They apply
mathematical and statistical models to find patterns and solutions in the data.

Basic Skills of a Data Scientist


A Data Scientist should be able to:
• Ask the right questions
• Understand data structures
• Interpret and wrangle data
• Apply statistical and mathematical methods
• Visualize data and communicate with stakeholders
• Work as a team player

Roles and Responsibilities of a Data Scientist


• Identify valuable data sources and automate collection processes
• Undertake preprocessing of structured and unstructured data
• Analyze large amounts of information to discover trends and patterns
• Build predictive models and machine-learning algorithms
• Combine models through ensemble modeling
• Present information using data visualization techniques
• Propose solutions and strategies to business challenges
• Collaborate with engineering and product development teams
Data Analytics and Python

Python supports each stage of data analytics efficiently through different libraries and packages.

Python is a general-purpose, open-source programming language that lets you work quickly and integrate systems more effectively.
Benefits of Python

Steps of the Data Analysis Process

1. Define why you need data analysis.
2. Begin collecting data from sources.
3. Clean the data of unnecessary or duplicate entries.
4. Analyze the data.
5. Interpret the results and apply them.

Step 1: Define why you need data analysis

Before getting into the nitty-gritty of data analysis, a business needs to define
why it's seeking one in the first place. This need typically stems from a business
problem or question. Some examples include:

• How can we reduce production costs without sacrificing quality?
• What are some ways to increase sales opportunities with our current resources?
• Do customers view our brand in a favorable way?
Step 2: Data collection

After a purpose has been defined, it’s time to begin collecting the data that will be
used in the analysis. This step is important because whichever sources of data are
chosen will determine how in-depth the analysis is.
Data collection starts with primary sources, also known as internal sources. This is
typically structured data gathered from CRM software, ERP systems, marketing
automation tools, and others. These sources contain information about customers,
finances, gaps in sales, and more.
Then comes secondary sources, also known as external sources. This is both
structured and unstructured data that can be gathered from many places.

Step 3: Data cleaning

Once data is collected from all the necessary sources, your data team will be tasked
with cleaning and sorting through it. Data cleaning is extremely important during
the data analysis process, simply because not all data is good data.
To generate accurate results, data scientists must identify and purge duplicate data,
anomalous data, and other inconsistencies that could skew the analysis. In fact, 60
percent of data scientists say most of their time is spent cleaning data.
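The purge step described above can be sketched in plain Python. The record layout and the anomaly threshold below are illustrative assumptions (in practice a library such as pandas would usually handle this):

```python
import statistics

# Hypothetical raw records: (customer_id, monthly_spend); the values are made up.
raw = [(1, 120.0), (2, 95.0), (2, 95.0),   # duplicate row
       (3, 110.0), (4, 9000.0),            # obvious anomaly
       (5, 105.0)]

# 1. Purge exact duplicates while preserving order.
seen, deduped = set(), []
for row in raw:
    if row not in seen:
        seen.add(row)
        deduped.append(row)

# 2. Flag anomalies with a robust (median-based) score, so a single extreme
#    value cannot hide itself by inflating the mean and standard deviation.
spend = [s for _, s in deduped]
med = statistics.median(spend)
mad = statistics.median(abs(s - med) for s in spend)  # median absolute deviation
clean = [(cid, s) for cid, s in deduped
         if mad == 0 or 0.6745 * abs(s - med) / mad < 3.5]

print(len(raw), len(deduped), len(clean))  # 6 5 4
```

The median-based rule is used here rather than a simple z-score because with so few rows a large outlier inflates the standard deviation enough to mask itself.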

Step 4: Data analysis

One of the last steps in the data analysis process is, you guessed it, analyzing and
manipulating the data. This can be done in a variety of ways.
One way is through data mining, which is defined as “knowledge discovery within
databases.” Data mining techniques like clustering analysis, anomaly detection,
association rule mining, and others could unveil hidden patterns in data that weren’t
previously visible.
There’s also business intelligence and data visualization software, both of which are
optimized for decision-makers and business users. These options generate easy-to-
understand reports, dashboards, scorecards, and charts.
Data scientists may also apply predictive analytics, which makes up one of
four types of data analytics used today. Predictive analyses look ahead to the future,
attempting to forecast what is likely to happen next with a business problem or
question.
Step 5: Interpret the results

The final step is interpreting the results from the data analysis. This part is important
because it’s how a business will gain actual value from the previous four steps.
Interpreting the data analysis should validate why you conducted one in the first
place, even if it’s not 100 percent conclusive. For example, “options A and B can be
explored and tested to reduce production costs without sacrificing quality.”
Analysts and business users should collaborate during this process. Also,
when interpreting results, consider any challenges or limitations that may not have
been apparent in the data. This will only bolster the confidence in your next steps.

Challenges of Data Analytics and How to Fix Them


1. The amount of data being collected
With today’s data-driven organizations and the introduction of big data, risk
managers and other employees are often overwhelmed with the amount of data that
is collected. An organization may receive information on every incident and
interaction that takes place on a daily basis, leaving analysts with thousands of
interlocking data sets.
There is a need for a data system that automatically collects and organizes
information. Manually performing this process is far too time-consuming and
unnecessary in today’s environment. An automated system will allow employees to
use the time spent processing data to act on it instead.

2. Collecting meaningful and real-time data


With so much data available, it’s difficult to dig down and access the insights that
are needed most. When employees are overwhelmed, they may not fully analyze
data or only focus on the measures that are easiest to collect instead of those that
truly add value. In addition, if an employee has to manually sift through data, it can
be impossible to gain real-time insights on what is currently happening. Outdated
data can have significant negative impacts on decision-making.
A data system that collects, organizes and automatically alerts users of trends will
help solve this issue. Employees can input their goals and easily create a report that
provides the answers to their most important questions. With real-time reports and
alerts, decision-makers can be confident they are basing any choices on complete
and accurate information.

3. Visual representation of data


To be understood and impactful, data often needs to be visually presented in graphs
or charts. While these tools are incredibly useful, it’s difficult to build them
manually. Taking the time to pull information from multiple areas and put it into a
reporting tool is frustrating and time-consuming.
Strong data systems enable report building at the click of a button. Employees and
decision-makers will have access to the real-time information they need in an
appealing and educational format.

4. Data from multiple sources


The next issue is trying to analyze data across multiple, disjointed sources. Different
pieces of data are often housed in different systems. Employees may not always
realize this, leading to incomplete or inaccurate analysis. Manually combining data
is time-consuming and can limit insights to what is easily viewed.
With a comprehensive and centralized system, employees will have access to all
types of information in one location. Not only does this free up time spent accessing
multiple sources, it allows cross-comparisons and ensures data is complete.

5. Inaccessible data
Moving data into one centralized system has little impact if it is not easily accessible
to the people that need it. Decision-makers and risk managers need access to all of
an organization’s data for insights on what is happening at any given moment, even
if they are working off-site. Accessing information should be the easiest part of data
analytics.
An effective database will eliminate any accessibility issues. Authorized employees
will be able to securely view or edit data from anywhere, illustrating organizational
changes and enabling high-speed decision making.

6. Poor quality data


Nothing is more harmful to data analytics than inaccurate data. Without good input,
output will be unreliable. A key cause of inaccurate data is manual errors made
during data entry. This can lead to significant negative consequences if the analysis
is used to influence decisions. Another issue is asymmetrical data: when information
in one system does not reflect the changes made in another system, leaving it
outdated.
A centralized system eliminates these issues. Data can be input automatically with
mandatory or drop-down fields, leaving little room for human error. System
integrations ensure that a change in one area is instantly reflected across the board.
7. Shortage of skills
Some organizations struggle with analysis due to a lack of talent. This is especially
true in those without formal risk departments. Employees may not have the
knowledge or capability to run in-depth data analysis.
This challenge is mitigated in two ways: by addressing analytical competency in the
hiring process and by having an analysis system that is easy to use. The first solution
ensures skills are on hand, while the second simplifies the analysis process for
everyone, regardless of skill level.

Types of Analytics
Data Visualization
Data visualization techniques are used for effective communication of data.

Benefits of data visualization:


• Simplifies quantitative information through visuals
• Shows the relationship between data points and variables
• Identifies patterns
• Establishes trends

Data Types for Plotting


There are various data types used for plotting.
Types of Plot
Different data types can be visualized using various plotting techniques.

Introduction to Statistics
Statistics is the study of the collection, analysis, interpretation, presentation, and
organization of data.
Tools available to analyze data:
• Statistical principles
• Functions
• Algorithms
What you can do using statistical tools:
• Analyze the primary data
• Build a statistical model
• Predict the future outcome
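As a small illustration of these tools using Python's built-in `statistics` module (the sample figures are invented):

```python
import statistics

# Hypothetical daily sales figures for one week.
sales = [210, 225, 198, 240, 232, 205, 218]

mean = statistics.mean(sales)     # central tendency
median = statistics.median(sales) # middle value, robust to outliers
stdev = statistics.stdev(sales)   # spread around the mean

print(f"mean={mean:.2f}, median={median}, stdev={stdev:.2f}")
# -> mean=218.29, median=218, stdev=...
```

With these summary statistics in hand, you can go on to build a statistical model of the data and use it to predict future outcomes.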

Statistical and Non-Statistical Analysis

Data Structures in Python


What is a Data Structure?

Organizing, managing, and storing data is important as it enables easier access and
efficient modifications. Data Structures allow you to organize your data in such a
way that you can store collections of data, relate them, and perform operations on
them accordingly.

Python has built-in support for Data Structures that enable you to store and access
data. These structures are called List, Dictionary, Tuple, and Set.

Python also allows its users to create their own Data Structures, giving them
full control over their functionality. The most prominent of these are Stack,
Queue, Tree, Linked List, and so on, which are also available in other
programming languages. Now that you know what types are available, let's move
ahead and implement them using Python.
Lists

Lists are used to store data of different data types in a sequential manner. Every
element of the list has an address assigned to it, called its index. Index values
start at 0 and go up to the last element (positive indexing). There is also negative
indexing, which starts at -1, enabling you to access elements from last to first.
Let us now understand lists better with the help of an example program.
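A minimal example program, as promised above (the values are arbitrary):

```python
# A list can hold mixed data types in sequence.
fruits = ["apple", "banana", "cherry", 42]

print(fruits[0])    # first element (positive index)  -> apple
print(fruits[-1])   # last element (negative index)   -> 42

fruits.append("date")     # add to the end
fruits[1] = "blueberry"   # lists are mutable: replace an element in place
print(fruits)             # -> ['apple', 'blueberry', 'cherry', 42, 'date']
```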

Dictionary

Dictionaries are used to store key-value pairs. To understand this better, think of a
phone directory where thousands of names and their corresponding numbers have been
added. Here each name acts as a key and the corresponding phone number as its value:
look up a key (a name) and you obtain its value (the phone number). That is what a
key-value pair is, and in Python this structure is stored using a Dictionary.
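A small sketch of the phone-directory idea (the names and numbers are made up):

```python
# A phone directory as a dictionary: names are keys, numbers are values.
directory = {"Asha": "555-0101", "Ravi": "555-0102"}

directory["Meera"] = "555-0103"            # add a new key-value pair
print(directory["Ravi"])                   # look up a value by key -> 555-0102
print(directory.get("Zoya", "not found"))  # safe lookup with a default
```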

Tuple
Tuples are the same as lists, except that data once entered into a tuple cannot be
changed. The only exception is when the data inside the tuple is itself mutable (for
example, a list inside a tuple); only then can that inner data be changed. The
example program will help you understand better.
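A short example program illustrating tuple immutability and the mutable-contents exception:

```python
point = (3, 4, "label")
print(point[0])  # indexing works like a list -> 3

try:
    point[0] = 99          # tuples are immutable, so this raises TypeError
except TypeError:
    print("cannot modify a tuple")

nested = (1, [2, 3])
nested[1].append(4)        # the inner list is mutable, so this is allowed
print(nested)              # -> (1, [2, 3, 4])
```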

Sets
Sets are unordered collections of unique elements: even if a value is added more
than once, it is stored in the set only once. They resemble the sets you have
learnt in mathematics, and the operations (union, intersection, difference) are
the same as with mathematical sets. An example program will help you understand
better.
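An example program showing deduplication and the set operations mentioned above:

```python
# Duplicates are stored only once.
tags = {"python", "data", "python", "science"}
print(len(tags))  # -> 3

a = {1, 2, 3}
b = {3, 4, 5}
print(a | b)   # union        -> {1, 2, 3, 4, 5}
print(a & b)   # intersection -> {3}
print(a - b)   # difference   -> {1, 2}
```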
Scipy

How do you handle multiple scientific domains? The solution is SciPy.

SciPy is a Python library that is useful for solving many mathematical equations and
algorithms. It is built on top of the NumPy library and extends it with routines for
scientific computations such as matrix rank, matrix inverse, polynomial equations,
LU decomposition, and more. Using its high-level functions significantly reduces the
complexity of the code and helps in better analysis of the data. As a data-processing
library, SciPy competes with environments such as MATLAB, Octave, and R. It has many
user-friendly, efficient, and easy-to-use functions that help solve problems like
numerical integration, interpolation, optimization, linear algebra, and statistics.

The benefit of using the SciPy library while building ML models is that it makes a
strong programming language available for developing less complex programs and
applications.
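A brief sketch of a few of the routines mentioned above, assuming NumPy and SciPy are installed:

```python
import numpy as np
from scipy import linalg, integrate

# Matrix inverse and LU decomposition, two of the linear-algebra routines above.
A = np.array([[4.0, 3.0],
              [6.0, 3.0]])
A_inv = linalg.inv(A)
P, L, U = linalg.lu(A)       # A == P @ L @ U

# Numerical integration: integrate x^2 from 0 to 1 (exact answer: 1/3).
area, err = integrate.quad(lambda x: x ** 2, 0, 1)

print(np.round(A @ A_inv, 6))  # -> identity matrix
print(round(area, 6))          # -> 0.333333
```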
Case-let

Let’s say you work for a social media company that has just launched in a new
city. Looking at weekly metrics, you see a slow decrease in the average number of
comments per user from January to March in this city.
The company has been consistently growing new users in the city from January to
March.

What are some reasons why the average number of comments per user would be
decreasing and what metrics would you look into?

Step 1: Ask Clarifying Questions Specific to the Case

Hint: This question is very vague. It’s all hypothetical, so we don’t know very much about users,
what the product is, and how people might be interacting. Be sure you ask questions upfront
about the product.

Answer. Before I jump into an answer, I’d like to ask a few questions:

• Who uses this social network? How do they interact with each other?
• Have there been any performance issues that might be causing the problem?
• What are the goals of this particular launch?
• Have there been any changes to the comment features in recent weeks?

For the sake of this example, let’s say we learn that it’s a social network similar to
Facebook with a young audience, and the goals of the launch are to grow the user base.
Also, there have been no performance issues and the commenting feature hasn’t been
changed since launch.

Step 2: Use the Case Question to Make Assumptions

Hint: Look for clues in the question. For example, this case gives you a metric, “average
number of comments per user.” Consider if the clue might be helpful in your solution. But
be careful, sometimes questions are designed to throw you off track.

Answer. From the question, we can hypothesize a little bit. For example, we know that
user count is increasing linearly. That means two things:

1. The decreasing-comments issue isn’t a result of a declining user base.
2. The cause isn’t loss of platform.

We can also model out the data to help us get a better picture of the average number of
comments per user metric:
• January: 10,000 users, 30,000 comments, 3 comments/user
• February: 20,000 users, 50,000 comments, 2.5 comments/user
• March: 30,000 users, 60,000 comments, 2 comments/user

One thing to note: Although this is an interesting metric, I’m not sure if it will help us solve
this question. For one, average comments per user doesn’t account for churn. We might
assume that during the three-month period users are churning off the platform. Let’s say
the churn rate is 25% in January, 20% in February and 15% in March.
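The modeled figures above can be verified with a few lines of Python (the numbers are the invented ones from this example):

```python
# Modeled monthly figures from the case (invented for illustration).
months = {
    "January":  {"users": 10_000, "comments": 30_000},
    "February": {"users": 20_000, "comments": 50_000},
    "March":    {"users": 30_000, "comments": 60_000},
}

for name, m in months.items():
    per_user = m["comments"] / m["users"]
    print(f"{name}: {per_user:.1f} comments/user")
# -> January: 3.0, February: 2.5, March: 2.0
```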

Step 3: Make a Hypothesis About the Data

Hint: Don’t worry too much about making a correct hypothesis. Instead, interviewers want
to get a sense of your product intuition and see that you’re on the right track. Also, be
prepared to measure your hypothesis.

Answer. I would say that average comments per user isn’t a great metric to use, because
it doesn’t reveal insights into what’s really causing this issue.

That’s because it doesn’t account for active users, which are the users who are actually
commenting. A better metric to investigate would be retained users and monthly active
users.

What I suspect is causing the issue is that active users are commenting frequently and
are responsible for the increase in comments month-to-month. New users, on the other
hand, aren’t as engaged and aren’t commenting as often.

Step 4: Provide Metrics and Data Analysis

Hint: Within your solution, include key metrics that you’d like to investigate that will help
you measure success.

Answer. I’d say there are a few ways we could investigate the cause of this problem, but
the one I’d be most interested in would be the engagement of monthly active users.

If the growth in comments is coming from active users, that would help us understand
how we’re doing at retaining users. Plus, it will also show if new users are less engaged
and commenting less frequently.
One way that we could dig into this would be to segment users by their onboarding date,
which would help us to visualize engagement and see how engaged some of our longest-
retained users are.

If engagement of new users is the issue, that will give us some options in terms of
strategies for addressing the problem. For example, we could test new onboarding or
commenting features designed to generate engagement.

Step 5: Propose a Solution for the Case Question

Hint: In the majority of cases, your initial assumptions might be incorrect, or the
interviewer might throw you a curveball. Be prepared to make new hypotheses or discuss
the pitfalls of your analysis.

Answer. If the cause wasn’t due to a lack of engagement among new users, then I’d want
to investigate active users. One potential cause would be active users commenting less.
In that case, we’d know that our earliest users were churning out, and that engagement
among new users was potentially growing.

Again, I think we’d want to focus on user engagement since the onboarding date. That
would help us understand if we were seeing higher levels of churn among active users,
and we could start to identify some solutions there.
