Notes Data Science With Python 1
Data Scientists collect data and explore, analyze, and visualize it. They apply
mathematical and statistical models to find patterns and solutions in the data.
Benefits of Python
Python is a general-purpose, open-source programming language that lets you work
quickly and integrate systems more effectively.
Step 1: Define the purpose
Before getting into the nitty-gritty of data analysis, a business needs to define
why it is seeking an analysis in the first place. This need typically stems from a
business problem or question, such as reducing production costs or closing gaps
in sales.
Step 2: Collect the data
After a purpose has been defined, it is time to begin collecting the data that will
be used in the analysis. This step is important because the chosen sources of data
determine how in-depth the analysis can be.
Data collection starts with primary sources, also known as internal sources. This is
typically structured data gathered from CRM software, ERP systems, marketing
automation tools, and others. These sources contain information about customers,
finances, gaps in sales, and more.
Then come secondary sources, also known as external sources. These comprise both
structured and unstructured data that can be gathered from many places.
Step 3: Clean the data
Once data is collected from all the necessary sources, your data team will be tasked
with cleaning and sorting through it. Data cleaning is extremely important during
the data analysis process, simply because not all data is good data.
To generate accurate results, data scientists must identify and purge duplicate data,
anomalous data, and other inconsistencies that could skew the analysis. In fact, 60
percent of data scientists say most of their time is spent cleaning data.
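As a small illustration of this step, duplicates and obvious anomalies can be purged with pandas (a minimal sketch; the column names, values, and the negative-price rule are invented for the example):

```python
import pandas as pd

# Invented sample data: one exact duplicate row and one anomalous price.
df = pd.DataFrame({
    "customer_id": [1, 2, 2, 3, 4],
    "price":       [9.99, 14.50, 14.50, 12.00, -500.0],
})

# Purge duplicate records.
df = df.drop_duplicates()

# Purge anomalous records -- here, a negative price is clearly bad data.
df = df[df["price"] > 0]
```

Real cleaning pipelines add domain-specific checks, but the pattern is the same: detect, then drop or correct.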
Step 4: Analyze the data
One of the last steps in the data analysis process is, you guessed it, analyzing and
manipulating the data. This can be done in a variety of ways.
One way is through data mining, which is defined as “knowledge discovery within
databases.” Data mining techniques like clustering analysis, anomaly detection,
association rule mining, and others could unveil hidden patterns in data that weren’t
previously visible.
There’s also business intelligence and data visualization software, both of which are
optimized for decision-makers and business users. These options generate easy-to-
understand reports, dashboards, scorecards, and charts.
Data scientists may also apply predictive analytics, which makes up one of
four types of data analytics used today. Predictive analyses look ahead to the future,
attempting to forecast what is likely to happen next with a business problem or
question.
Step 5: Interpret the results
The final step is interpreting the results from the data analysis. This part is important
because it’s how a business will gain actual value from the previous four steps.
Interpreting the data analysis should validate why you conducted one in the first
place, even if it’s not 100 percent conclusive. For example, “options A and B can be
explored and tested to reduce production costs without sacrificing quality.”
Analysts and business users should look to collaborate during this process. Also,
when interpreting results, consider any challenges or limitations that may not have
been apparent in the data. This will only bolster the confidence in your next steps.
Inaccessible data
Moving data into one centralized system has little impact if it is not easily accessible
to the people who need it. Decision-makers and risk managers need access to all of
an organization's data for insights on what is happening at any given moment, even
if they are working off-site. Accessing information should be the easiest part of data
analytics.
An effective database will eliminate any accessibility issues. Authorized employees
will be able to securely view or edit data from anywhere, supporting organizational
change and enabling high-speed decision making.
Types of Analytics
Data Visualization
Data visualization techniques are used for effective communication of data.
Introduction to Statistics
Statistics is the study of the collection, analysis, interpretation, presentation, and
organization of data.
Tools available to analyze data:
• Statistical principles
• Functions
• Algorithms
What you can do using statistical tools:
• Analyze the primary data
• Build a statistical model
• Predict the future outcome
Organizing, managing, and storing data is important as it enables easier access and
efficient modification. Data structures allow you to organize your data in such a
way that you can store collections of data, relate them, and perform operations on
them accordingly.
Python has built-in support for data structures that enable you to store and access
data. These structures are called List, Dictionary, Tuple, and Set.
Python also allows its users to create their own data structures, giving them full
control over their functionality. The most prominent of these are Stack, Queue,
Tree, Linked List, and so on, which are also available in other programming
languages. Now that you know what types are available to you, let us move on to
the data structures and implement them using Python.
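As a quick illustration of a user-defined structure, a stack can be sketched on top of a Python list (a minimal sketch; the class and method names are my own):

```python
class Stack:
    """A last-in, first-out (LIFO) stack built on a Python list."""

    def __init__(self):
        self._items = []

    def push(self, item):
        self._items.append(item)   # add to the top

    def pop(self):
        return self._items.pop()   # remove and return the top item

    def peek(self):
        return self._items[-1]     # look at the top without removing it

    def __len__(self):
        return len(self._items)


s = Stack()
s.push(1)
s.push(2)
s.push(3)
top = s.pop()   # the most recently pushed item (3) comes off first
```

A queue could be built the same way, popping from the front instead of the back (in practice, `collections.deque` is the idiomatic choice for that).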
Lists
Lists are used to store data of different data types in a sequential manner. An
address, called an index, is assigned to every element of the list. Positive index
values start at 0 for the first element and increase toward the last. There is also
negative indexing, which starts at -1 for the last element, enabling you to access
elements from last to first. Let us now understand lists better with the help of an
example program.
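A minimal sketch of list creation, indexing, and mutation (the values are arbitrary):

```python
# A list can mix data types and preserves insertion order.
my_list = [10, "hello", 3.14, True]

first = my_list[0]       # positive indexing starts at 0 -> 10
last = my_list[-1]       # negative indexing starts at -1 -> True

my_list.append("new")    # lists are mutable: add an element at the end
my_list[1] = "world"     # ...or replace an existing element

sliced = my_list[1:3]    # slicing returns a new list: ["world", 3.14]
```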
Dictionary
Dictionaries are used to store key-value pairs. To understand this better, think of a
phone directory where hundreds or thousands of names and their corresponding
numbers have been added. Here the names act as the keys, and the phone numbers
are the values attached to those keys. Looking up a key gives you back its value,
just as looking up a name gives you back a phone number. That is what a key-value
pair is, and in Python, this structure is stored using dictionaries.
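The phone-directory analogy translates directly into code (the names and numbers are invented):

```python
# Keys (names) map to values (phone numbers); keys must be unique.
phone_book = {
    "Alice": "555-0101",
    "Bob": "555-0102",
}

number = phone_book["Alice"]        # look up a value by its key
phone_book["Carol"] = "555-0103"    # add a new key-value pair
phone_book["Bob"] = "555-0199"      # update an existing value

names = list(phone_book.keys())     # all keys, in insertion order
```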
Tuple
Tuples are the same as lists, except that data, once entered into a tuple, cannot
be changed. The one caveat is that if an element inside the tuple is itself mutable
(such as a list), that element's contents can still change, even though the tuple's
structure cannot. The example program will help you understand better.
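A short sketch of tuple immutability and the mutable-element caveat:

```python
point = (3, 4)              # a tuple of two immutable integers
x, y = point                # tuples support unpacking

# point[0] = 5 would raise a TypeError: tuple elements cannot be reassigned.

# But a mutable element inside a tuple can still be modified in place:
mixed = (1, 2, ["a", "b"])
mixed[2].append("c")        # legal: the list inside the tuple changed
```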
Sets
Sets are collections of unordered, unique elements. Even if a value is added more
than once, it is stored in the set only once. A Python set resembles the sets you
learned about in mathematics, and it supports the same operations, such as union
and intersection. An example program will help you understand better.
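A minimal sketch of set deduplication and the familiar mathematical operations:

```python
# Duplicates are discarded automatically.
numbers = {1, 2, 2, 3, 3, 3}        # stored as {1, 2, 3}

evens = {2, 4, 6}

union = numbers | evens             # all elements from both sets
intersection = numbers & evens      # elements common to both sets
difference = numbers - evens        # elements in numbers but not in evens
```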
Scipy
How do you handle multiple scientific domains? One answer is SciPy.
SciPy is a Python library that is useful in solving many mathematical equations and
algorithms. It is built on top of the NumPy library and extends it with routines for
scientific computation such as matrix rank, matrix inversion, polynomial equations,
LU decomposition, and more. Using its high-level functions significantly reduces
the complexity of the code and helps in better analyzing the data. Combined with an
interactive Python session, SciPy provides a data-processing environment that
competes with the likes of MATLAB, Octave, and R-Lab. It has many user-friendly,
efficient, and easy-to-use functions that help solve problems such as numerical
integration, interpolation, optimization, linear algebra, and statistics.
A further benefit of using the SciPy library while building ML models is that it
makes a strong programming language available for developing less complex
programs and applications.
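A short sketch of the kinds of problems mentioned above, using SciPy's linear algebra and integration routines (the matrix and integrand are arbitrary examples):

```python
import numpy as np
from scipy import linalg, integrate

# Solve a linear system A x = b using SciPy's linear algebra routines.
A = np.array([[3.0, 2.0],
              [1.0, 4.0]])
b = np.array([7.0, 9.0])
x = linalg.solve(A, b)                 # x such that A @ x == b

# LU decomposition of A: P @ L @ U reconstructs A.
P, L, U = linalg.lu(A)

# Numerical integration: integrate t^2 from 0 to 1 (exact value is 1/3).
area, error = integrate.quad(lambda t: t**2, 0, 1)
```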
Case-let
Let’s say you work for a social media company that has just launched in a new
city. Looking at weekly metrics, you see a slow decrease in the average number of
comments per user from January to March in this city.
The company has been consistently growing new users in the city from January to
March.
What are some reasons why the average number of comments per user would be
decreasing and what metrics would you look into?
Hint: This question is very vague. It’s all hypothetical, so we don’t know very much about users,
what the product is, or how people might be interacting. Be sure to ask questions upfront
about the product.
Answer. Before I jump into an answer, I’d like to ask a few questions:
• Who uses this social network? How do they interact with each other?
• Have there been any performance issues that might be causing the problem?
• What are the goals of this particular launch?
• Have there been any changes to the comment features in recent weeks?
For the sake of this example, let’s say we learn that it’s a social network similar to
Facebook with a young audience, and the goals of the launch are to grow the user base.
Also, there have been no performance issues and the commenting feature hasn’t been
changed since launch.
Hint: Look for clues in the question. For example, this case gives you a metric, “average
number of comments per user.” Consider if the clue might be helpful in your solution. But
be careful, sometimes questions are designed to throw you off track.
Answer. From the question, we can hypothesize a little bit. For example, we know that
the user count is increasing linearly while the average number of comments per user is
falling, so total comments must be growing more slowly than users.
We can also model out the data to help us get a better picture of the average number of
comments per user metric:
• January: 10,000 users, 30,000 comments, 3 comments/user
• February: 20,000 users, 50,000 comments, 2.5 comments/user
• March: 30,000 users, 60,000 comments, 2 comments/user
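The arithmetic above can be reproduced in a few lines of Python (the figures are the hypothetical ones from the case, not real data):

```python
# Hypothetical monthly figures from the case above.
monthly = {
    "January":  {"users": 10_000, "comments": 30_000},
    "February": {"users": 20_000, "comments": 50_000},
    "March":    {"users": 30_000, "comments": 60_000},
}

# Average comments per user = total comments / total users, per month.
per_user = {
    month: m["comments"] / m["users"] for month, m in monthly.items()
}
# per_user == {"January": 3.0, "February": 2.5, "March": 2.0}
```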
One thing to note: Although this is an interesting metric, I’m not sure if it will help us solve
this question. For one, average comments per user doesn’t account for churn. We might
assume that during the three-month period users are churning off the platform. Let’s say
the churn rate is 25% in January, 20% in February and 15% in March.
Hint: Don’t worry too much about making a correct hypothesis. Instead, interviewers want
to get a sense of your product intuition and see that you’re on the right track. Also, be
prepared to measure your hypothesis.
Answer. I would say that average comments per user isn’t a great metric to use, because
it doesn’t reveal insights into what’s really causing this issue.
That’s because it doesn’t account for active users, who are the ones actually
commenting. Better metrics to investigate would be retained users and monthly active
users.
What I suspect is causing the issue is that active users are commenting frequently and
are responsible for the increase in comments month-to-month. New users, on the other
hand, aren’t as engaged and aren’t commenting as often.
Hint: Within your solution, include key metrics that you’d like to investigate that will help
you measure success.
Answer. I’d say there are a few ways we could investigate the cause of this problem, but
the one I’d be most interested in would be the engagement of monthly active users.
If the growth in comments is coming from active users, that would help us understand
how we’re doing at retaining users. Plus, it will also show if new users are less engaged
and commenting less frequently.
One way that we could dig into this would be to segment users by their onboarding date,
which would help us to visualize engagement and see how engaged some of our longest-
retained users are.
If engagement of new users is the issue, that will give us some options in terms of
strategies for addressing the problem. For example, we could test new onboarding or
commenting features designed to generate engagement.
Hint: In the majority of cases, your initial assumptions might be incorrect, or the
interviewer might throw you a curveball. Be prepared to make new hypotheses or discuss
the pitfalls of your analysis.
Answer. If the cause wasn’t due to a lack of engagement among new users, then I’d want
to investigate active users. One potential cause would be active users commenting less.
In that case, we’d know that our earliest users were churning out, and that engagement
among new users was potentially growing.
Again, I think we’d want to focus on user engagement since the onboarding date. That
would help us understand if we were seeing higher levels of churn among active users,
and we could start to identify some solutions there.