Data Science Tutorial
About the tutorial
Data science, also known as data-driven science, makes use of scientific methods, processes, and systems to extract knowledge or insights from data in various forms, whether structured or unstructured. Data science uses advanced hardware, programming systems, and algorithms to solve data-related problems, and it is closely tied to the future of artificial intelligence.
Audience
This tutorial is designed for anyone willing to start a career in data science. It covers what data science is, the important tools used in data science, courses, frequently asked interview questions, the salary of a data scientist, how to build a career in data science, and the future opportunities in the field.
Prerequisites
Before proceeding with this tutorial, you should have a basic knowledge of writing code in the Python programming language, using a Python IDE, and executing Python programs.
If you are completely new to Python then please refer to our Python tutorial, which is
available at https://www.tutorialspoint.com/Python/index.htm to get a sound
understanding of the language.
Here are some of the technical concepts you should be aware of before starting with Data Science: statistics, machine learning, data modelling, databases, and programming in Python.
Copyright & Disclaimer
All the content and graphics published in this e-book are the property of Tutorials Point (I) Pvt. Ltd. The user of this e-book is prohibited from reusing, retaining, copying, distributing, or republishing any contents or part of the contents of this e-book in any manner without the written consent of the publisher.
We strive to update the contents of our website and tutorials as timely and as precisely as possible; however, the contents may contain inaccuracies or errors. Tutorials Point (I) Pvt. Ltd. provides no guarantee regarding the accuracy, timeliness, or completeness of our website or its contents, including this tutorial. If you discover any errors on our website or in this tutorial, please notify us at contact@tutorialspoint.com.
Table of Contents
About the tutorial
Audience
Prerequisites
Copyright & Disclaimer
Table of Contents
1. Data Science — Getting Started
   History of data science
   Why data science?
   Need for Data Science
   Impact of data science
2. Data Science — What is data?
   What is data in data science?
   Types of data
   Types of Data collection
   Big data
   How do we use data in data science?
3. Data science - Lifecycle
   What is Data science lifecycle?
   How many phases are there in the data science life cycle?
   Who are all involved in Data Science lifecycle?
4. Data Science - Prerequisites
   Technical skills
   Non-Technical skills
5. Data Science – Applications
6. Data Science - Machine learning
   What is machine learning?
   Types of machine learning
   Data Science Vs Machine Learning
7. Data Science - Data Analysis
   What is Data Analysis in Data Science?
   Why Data analysis is important?
   Data analysis process
   Types of data analysis
8. Data Science - Tools in demand
   Best Data Science Tools
9. Data Science – Careers
   List of jobs related to data science
   Data science job trends 2022
10. Data Science – Scientists
   How to become a data scientist?
11. Data Scientist – Salary
   Top factors that decide data scientist salaries
   Data Scientist Salary in India
   Data Scientist Salary in USA
12. Data Science - Resources
   Top Data Science courses
   Top Data Science ebooks
13. Data Science - Interview Questions
Data Science — Getting Started
Data science is the process of extracting and analysing useful information from data to solve problems that are difficult to solve analytically. For example, when you visit an e-commerce site and look at a few categories and products before making a purchase, you are creating data that analysts can use to figure out how you make purchase decisions.
It involves different disciplines like mathematical and statistical modelling, extracting data
from its source and applying data visualization techniques. It also involves handling big
data technologies to gather both structured and unstructured data.
It helps you find patterns that are hidden in the raw data. The term "Data Science" has
evolved because mathematical statistics, data analysis, and "big data" have changed over
time.
Data science is an interdisciplinary field that lets you learn from both structured and unstructured data. With data science, you can turn a business problem into a research project and then turn that research into a practical, real-world solution.
History of data science:
Peter Naur suggested the phrase "data science" as an alternative name for computer science in 1974. In 1996, the International Federation of Classification Societies became the first conference to highlight data science as a distinct subject. Yet, the concept remained in flux. Following a 1985 lecture at the Chinese Academy of Sciences in Beijing, C. F. Jeff Wu again advocated renaming statistics to data science in 1997. He reasoned that a new name would help statistics shed inaccurate stereotypes and perceptions, such as being associated with accounting or confined to describing data.
Hayashi Chikio proposed data science in 1998 as a new, multidisciplinary concept with three components: data design, data collection, and data analysis.
In the 1990s, "knowledge discovery" and "data mining" were popular phrases for the
process of identifying patterns in datasets that were growing in size.
In 2012, Thomas H. Davenport and DJ Patil proclaimed "Data Scientist: The Sexiest Job of the 21st Century" in the Harvard Business Review, a phrase that was taken up by major metropolitan publications such as the New York Times and the Boston Globe. They repeated it a decade later, adding that "the position is in more demand than ever".
In 2008, DJ Patil and Jeff Hammerbacher adopted the professional designation of "data scientist." Although the term had been used by the National Science Board in its 2005 report "Long-Lived Digital Data Collections: Supporting Research and Teaching in the 21st Century," it then referred to any significant role in administering a digital data collection.
An agreement has not yet been reached on the meaning of data science, and some believe it to be a buzzword; big data is a similar marketing term. Data scientists are responsible for transforming massive amounts of data into useful information and for developing software and algorithms that help businesses and organisations determine optimal operations.
1) Data is the oil of the modern age. With the proper tools, technologies, and
algorithms, we can leverage data to create a unique competitive edge.
2) Data Science may assist in detecting fraud using sophisticated machine learning
techniques.
3) It helps you avoid severe financial losses.
4) Enables the development of intelligent machines
5) You may use sentiment analysis to determine the brand loyalty of your customers.
This helps you to make better and quicker choices.
6) It enables you to propose the appropriate product to the appropriate consumer in
order to grow your company.
3) Demand and average salary of a data scientist:
a) According to India Today, India is the second biggest centre for data science in
the world due to the fast digitalization of companies and services. By 2026,
analysts anticipate that the nation will have more than 11 million employment
opportunities. In fact, recruiting in the data science field has surged by 46% since
2019.
b) Bank of America was one of the first financial institutions to provide mobile banking to its consumers a decade ago. More recently, Bank of America introduced Erica, its first virtual financial assistant, which is regarded as one of the best financial innovations in the world.
Erica now serves as a client adviser for more than 45 million consumers worldwide. Erica uses voice recognition to receive client feedback, which represents a technical development driven by Data Science.
c) The learning curves for Data Science and Machine Learning are steep. Although India sees a massive influx of data scientists each year, relatively few possess the needed skill set and specialization. As a consequence, people with specialised data skills are in great demand.
Impact of data science:
The healthcare industry has benefited from the rise of data science. In 2008, Google employees realised that they could monitor influenza strains in near real time, whereas previous technologies could only provide weekly updates on cases. Using data science, Google was able to build one of the first systems for monitoring the spread of a disease.
The sports sector has similarly profited from data science. In 2019, a data scientist found ways to measure and calculate how goal attempts increase a soccer team's odds of winning. In fact, data science is used to compute statistics quickly in several sports.
Government agencies also use data science on a daily basis. Governments throughout the
globe employ databases to monitor information regarding social security, taxes, and other
data pertaining to their residents. The government's usage of emerging technologies
continues to develop.
Since the Internet has become the primary medium of human communication, the
popularity of e-commerce has also grown. With data science, online firms may monitor
the whole of the customer experience, including marketing efforts, purchases, and
consumer trends. Ads must be one of the greatest instances of eCommerce firms using
data science. Have you ever looked for anything online or visited an eCommerce product
website, only to be bombarded by advertisements for that product on social networking
sites and blogs?
Ad pixels are integral to the online gathering and analysis of user information. Companies
leverage online consumer behaviour to retarget prospective consumers throughout the
internet. This usage of client information extends beyond eCommerce. Apps such as Tinder
and Facebook use algorithms to help users locate precisely what they are seeking. The
Internet is a growing treasure trove of data, and the gathering and analysis of this data
will also continue to expand.
Data Science — What is data?
Data comes in many shapes and forms, but it can generally be thought of as the result of some random experiment — an experiment whose outcome cannot be determined in advance, but whose workings are still subject to analysis. Data from a random experiment are often stored in a table or spreadsheet. By statistical convention, the variables are usually called features or columns, and the individual items (or units) are called rows.
Types of data:
There are mainly two types of data: qualitative and quantitative.
Qualitative data:
Qualitative data consists of information that cannot be counted, quantified, or expressed
simply using numbers. It is gathered from text, audio, and pictures and distributed using
data visualization tools, including word clouds, concept maps, graph databases, timelines,
and infographics.
The objective of qualitative data analysis is to answer questions about the activities and motivations of individuals. Collecting and analyzing this kind of data may be time-consuming. A researcher or analyst who works with qualitative data is referred to as a qualitative researcher or analyst.
Qualitative data can give essential insights for any sector, user group, or product.
Nominal data
In statistics, nominal data (also known as nominal scale) is used to designate variables
without giving a numerical value. It is the most basic type of measuring scale. In contrast
to ordinal data, nominal data cannot be ordered or quantified.
For example, a person's name, hair colour, nationality, etc. Let's assume a girl named Aby; her hair is brown and she is from America.
Nominal data may be either qualitative or quantitative. Yet, there is no numerical value or relationship associated with quantitative labels (e.g., an identification number). Conversely, several qualitative data categories can be expressed in nominal form. These might consist of words, letters, and symbols. Names of individuals, gender, and nationality are some of the most prevalent instances of nominal data.
Using the grouping approach, nominal data can be analyzed. The variables may be sorted
into groups, and the frequency or percentage can be determined for each category. The
data may also be shown graphically, for example using a pie chart.
Although nominal data cannot be processed using mathematical operators, it may still be studied using statistical techniques. Hypothesis testing is one approach to assess and analyse the data.
With nominal data, nonparametric tests such as the chi-squared test may be used to test hypotheses. The purpose of the chi-squared test is to evaluate whether there is a statistically significant discrepancy between the expected frequency and the observed frequency of the provided values.
Ordinal data:
Ordinal data is a type of data in statistics where the values follow a natural order. One of the most important characteristics of ordinal data is that you cannot quantify the differences between the data values. Most of the time, the width of the data categories does not match the increments of the underlying attribute.
In some cases, the characteristics of interval or ratio data can be found by grouping the
values of the data. For instance, the ranges of income are ordinal data, while the actual
income is ratio data.
Ordinal data cannot be manipulated with mathematical operators the way interval or ratio data can. Because of this, the median is the only appropriate measure of central tendency for a set of ordinal data.
This data type is widely found in the fields of finance and economics. Consider an economic
study that examines the GDP levels of various nations. If the report rates the nations
based on their GDP, the rankings are ordinal statistics.
Using visualisation tools to evaluate ordinal data is the easiest method. For example, the
data may be displayed as a table where each row represents a separate category. In
addition, they may be represented graphically using different charts. The bar chart is the
most popular style of graph used to display these types of data.
Ordinal data may also be studied using more sophisticated statistical analysis methods such as hypothesis testing. Note that parametric procedures such as the t-test and ANOVA cannot be applied to these data sets. Only nonparametric tests, such as the Mann-Whitney U test or the Wilcoxon matched-pairs test, may be used to evaluate a null hypothesis about the data.
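For instance, the sketch below (an illustrative addition, with made-up ratings on a 1-5 ordinal scale) shows how such a nonparametric test can be run with SciPy.

from scipy.stats import mannwhitneyu

# Hypothetical ordinal data: satisfaction ratings (1 = worst, 5 = best) from two groups
group_a = [3, 4, 2, 5, 4, 3, 4]
group_b = [2, 3, 1, 2, 3, 2, 4]

# Mann-Whitney U test: do the two groups tend to give different ratings?
stat, p_value = mannwhitneyu(group_a, group_b, alternative="two-sided")
print(f"U statistic = {stat}, p-value = {p_value:.3f}")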
Some common ways of collecting qualitative data are:
1. Data records: Using already existing data as the data source is a good technique for qualitative research. Similar to visiting a library, you may examine books and other reference materials to obtain data that can be used for research.
2. Interviews: Personal interviews are one of the most common ways to get
deductive data for qualitative research. The interview may be casual and not have
a set plan. It is often like a conversation. The interviewer or researcher gets the
information straight from the interviewee.
3. Focus groups: Focus groups are made up of 6 to 10 people who talk to each other.
The moderator's job is to keep an eye on the conversation and direct it based on
the focus questions.
4. Case Studies: Case studies are in-depth analyses of an individual or group, with
an emphasis on the relationship between developmental characteristics and the
environment.
5. Observation: This is a technique where the researcher observes the subject and takes down transcript notes to capture innate responses and reactions without prompting.
Quantitative data:
Quantitative data consists of numerical values and has numerical properties, and mathematical operations such as addition can be performed on it. Because of its numerical character, quantitative data is mathematically verifiable and evaluable.
Discrete Data:
These are data that can only take on certain values, as opposed to a range. For instance,
data about the blood type or gender of a population is considered discrete data.
An example of discrete quantitative data is the number of visitors to your website: you could have 150 visits in one day, but not 150.6 visits. Usually, tally charts, bar charts, and pie charts are used to represent discrete data.
Since discrete data is simple to summarise and compute, it is often used in elementary statistical analysis.
Continuous Data:
These are data that may take any value within a certain range, including the greatest and lowest possible values. The difference between the greatest and least value is known as the data range. For instance, the height and weight of your school's children are considered continuous data. The tabular representation of continuous data is known as a frequency
distribution. These may be depicted visually using histograms.
Continuous data may be numeric or spread out over time and date. This data type calls for more advanced statistical analysis methods because there is an infinite number of possible values. The important characteristics of continuous data are:
1. Continuous data changes over time, and at different points in time, it can have
different values.
2. Random variables, which may or may not be whole numbers, make up continuous
data.
3. Data analysis tools such as line graphs and skewness measures are used to describe continuous data.
4. One type of continuous data analysis that is often used is regression analysis.
Some common ways of collecting quantitative data are:
1. Surveys and questionnaires: These types of research are good for getting detailed feedback from users and customers, especially about how people feel about a product, service, or experience.
2. Open-source datasets: There are a lot of public datasets that can be found
online and analysed for free. Researchers sometimes look at data that has
already been collected and try to figure out what it means in a way that fits their
own research project.
3. Experiments: A common method is an experiment, which usually has a control
group and an experimental group. The experiment is set up so that it can be
controlled and the conditions can be changed as needed.
4. Sampling: When there are a lot of data points, it may not be possible to survey
each person or data point. In this case, quantitative research is done with the help
of sampling. Sampling is the process of choosing a sample of data that is
representative of the whole. The two types of sampling are Random sampling (also
called probability sampling), and non-random sampling.
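To make the sampling idea above concrete, here is a minimal sketch (an illustrative addition, with a made-up population) of random sampling using pandas.

import pandas as pd

# Hypothetical population of 1,000 survey respondents
population = pd.DataFrame({
    "respondent_id": range(1000),
    "age": [20 + (i % 45) for i in range(1000)],
})

# Random (probability) sampling: every respondent has an equal chance of selection
sample = population.sample(n=100, random_state=42)
print(sample.shape)           # (100, 2)
print(sample["age"].mean())   # sample estimate of the population's average age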
Based on how and by whom the data was first gathered, data can also be classified as primary or secondary:
1) Primary Data: These are data that are acquired for the first time, for a particular purpose, by an investigator. Primary data are 'pure' in the sense that they have not been subjected to any statistical manipulations and are authentic. Examples of primary data include the Census of India.
2) Secondary Data: These are data that were initially gathered by some other entity. This means that this kind of data has already been collected by researchers or investigators and is accessible in either published or unpublished form. Such data is considered 'impure' because statistical computations may already have been performed on it. Examples include information accessible on the website of the Government of India or the Department of Finance, or in other archives, books, journals, etc.
Big data:
Big data is data of such a large volume that handling it requires overcoming logistical challenges. Big data refers to larger, more complicated data collections, particularly from novel data sources. Some data sets are so extensive that conventional data processing software is incapable of handling them. Yet these vast quantities of data can be used to solve business challenges that were previously unsolvable.
Data science is the study of how to analyse huge amounts of data and extract information from them. You can compare big data and data science to crude oil and an oil refinery. Data science and big data grew out of statistics and traditional ways of managing data, but they are now seen as separate fields.
People often use the three Vs (Volume, Velocity, and Variety) to describe the characteristics of big data.
Data Science – Lifecycle
A standard data science lifecycle approach comprises the use of machine learning algorithms and statistical procedures that result in more accurate prediction models. Data extraction, preparation, cleaning, modelling, assessment, etc., are some of the most important data science stages. In the field of data science, this process is known as the "Cross-Industry Standard Process for Data Mining" (CRISP-DM).
How many phases are there in the data science life cycle?
There are mainly six phases in the data science life cycle: identifying the problem and understanding the business, data collection, data processing, data analysis, data modelling, and model deployment.
Identifying problem and understanding the business:
The data science lifecycle starts with "why?", just like any other business lifecycle. One of the most important parts of the data science process is figuring out what the problem is. This helps to establish a clear goal around which all the other steps can be planned. In short, it is important to know the business goal as early as possible, because it will determine what the end goal of the analysis will be.
This phase should evaluate business trends, assess case studies of comparable analyses, and research the industry's domain. The group will evaluate the feasibility of the project given the available employees, equipment, time, and technology. Once these factors have been discovered and assessed, a preliminary hypothesis will be formulated to address the business issues resulting from the existing environment. This phase should:
1. Specify the issue, why the problem must be resolved immediately, and why it demands an answer.
2. Specify the business project's potential value.
3. Identify dangers, including ethical concerns, associated with the project.
4. Create and convey a flexible, highly integrated project plan.
Data collection:
The next step in the data science lifecycle is data collection, which means getting raw data
from the appropriate and reliable source. The data that is collected can be either organized
or unorganized. The data could be collected from website logs, social media data, online
data repositories, and even data that is streamed from online sources using APIs, web
scraping, or data that could be in Excel or any other source.
The person doing the job should know the difference between the different data sets that
are available and how an organization invests its data. Professionals find it hard to keep
track of where each piece of data comes from and whether it is up to date or not. During
the whole lifecycle of a data science project, it is important to keep track of this information
because it could help test hypotheses or run any other new experiments.
The information may be gathered through surveys or through the more prevalent method of automated data gathering, such as internet cookies, which are a primary source of unanalysed data.
We can also use secondary data from open-source datasets. There are many websites from which we can collect data, for example Kaggle (https://www.kaggle.com/datasets), the UCI Machine Learning Repository (http://archive.ics.uci.edu/ml/index.php), and Google Public Datasets (https://cloud.google.com/bigquery/public-data/). There are also some predefined datasets available in Python libraries such as scikit-learn. Let's import the Iris dataset and use it to illustrate the phases of data science.
# Load the Iris dataset into a dataframe
from sklearn.datasets import load_iris
import pandas as pd

iris = load_iris()

# Create a dataframe
df = pd.DataFrame(iris.data, columns=iris.feature_names)
df['target'] = iris.target
X = iris.data
Data processing
After collecting high-quality data from reliable sources, the next step is to process it. The purpose of data processing is to check whether there are any problems with the acquired data, so that they can be resolved before proceeding to the next phase. Without this step, we may produce mistakes or inaccurate findings.
There may be several difficulties with the obtained data. For instance, the data may have several missing values in multiple rows or columns. It may include several outliers, inaccurate numbers, timestamps with varying time zones, and so on. The data may also have problems with date formats: in certain nations the date is formatted as DD/MM/YYYY, and in others as MM/DD/YYYY. Numerous problems can occur during the data collection process; for instance, if data is gathered from many thermometers and any of them are defective, the data may need to be discarded or recollected.
At this phase, various concerns with the data must be resolved. Several of these problems
have multiple solutions, for example, if the data includes missing values, we can either
replace them with zero or the column's mean value. However, if the column is missing a
large number of values, it may be preferable to remove the column completely since it has
so little data that it cannot be used in our data science life cycle method to solve the issue.
When the time zones are all mixed up, we cannot utilize the data in those columns and
may have to remove them until we can define the time zones used in the supplied
timestamps. If we know the time zones in which each timestamp was gathered, we may
convert all timestamp data to a certain time zone. In this manner, there are a number of
strategies to address concerns that may exist in the obtained data.
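As a small illustration (an addition to the text, with hypothetical column names), the strategies just described can be applied with pandas as follows.

import pandas as pd
import numpy as np

# Hypothetical raw data with missing values and mixed-time-zone timestamps
raw = pd.DataFrame({
    "temperature": [21.5, np.nan, 23.1, np.nan, 22.0],
    "humidity": [np.nan, np.nan, np.nan, 40.0, np.nan],
    "recorded_at": pd.to_datetime(
        ["2023-01-01 10:00+05:30", "2023-01-01 11:00+00:00",
         "2023-01-01 12:00+05:30", "2023-01-01 13:00+00:00",
         "2023-01-01 14:00+05:30"], utc=True),
})

# Replace missing temperatures with the column's mean value
raw["temperature"] = raw["temperature"].fillna(raw["temperature"].mean())

# Drop columns that are mostly empty (keep only columns with a majority of non-missing values)
raw = raw.dropna(axis=1, thresh=len(raw) // 2 + 1)

# The timestamps were converted to a single time zone (UTC) when parsed above
print(raw)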
We will access the data and then store it in a dataframe using Python.
# Load Data
iris = load_iris()
# Create a dataframe
df = pd.DataFrame(iris.data, columns = iris.feature_names)
df['target'] = iris.target
X = iris.data
All data must be in numeric representation for machine learning models. This implies that
if a dataset includes categorical data, it must be converted to numeric values before the
model can be executed. So we will be implementing label encoding.
Label encoding:
import numpy as np
from sklearn.preprocessing import LabelEncoder

# Map the numeric target codes to species names
species = []
for i in range(len(df['target'])):
    if df['target'][i] == 0:
        species.append("setosa")
    elif df['target'][i] == 1:
        species.append('versicolor')
    else:
        species.append('virginica')
df['species'] = species

df.sample(10)

# Encode the species labels as numeric values
labels = np.asarray(df.species)
le = LabelEncoder()
le.fit(labels)
labels = le.transform(labels)

df_selected1 = df.drop(['sepal length (cm)', 'sepal width (cm)', "species"], axis=1)
Data analysis
Exploratory Data Analysis (EDA) is a set of visual techniques for analysing data. With this method, we can get specific details on the statistical summary of the data. We can also deal with duplicate values and outliers, and identify trends or patterns within the collection.
At this phase, we attempt to get a better understanding of the acquired and processed data. We apply statistical and analytical techniques to draw conclusions about the data and to determine the relationships between the columns in our dataset. Using graphs, charts, plots, etc., we can visualise the data to better comprehend and describe it.
Professionals use statistical techniques such as the mean and median to better comprehend the data. Using histograms, spectrum analysis, and population distributions, they also visualise data and evaluate its distribution patterns. The data will be analysed based on the problem at hand.
The code below is used to check whether there are any null values in the dataset:
df.isnull().sum()
Output:
From the above output, we can conclude that there are no null values in the dataset, as the sum of null values in each column is 0.
We will use the shape attribute to check the shape (rows, columns) of the dataset:
df.shape
Output:
(150, 5)
Now we will use info() to check the columns and their data types:
df.info()
Output:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 150 entries, 0 to 149
Data columns (total 5 columns):
 #   Column             Non-Null Count  Dtype
---  ------             --------------  -----
 0   sepal length (cm)  150 non-null    float64
 1   sepal width (cm)   150 non-null    float64
 2   petal length (cm)  150 non-null    float64
 3   petal width (cm)   150 non-null    float64
 4   target             150 non-null    int64
dtypes: float64(4), int64(1)
memory usage: 6.0 KB
Only the target column contains categorical information (encoded as integers), whereas the other columns contain non-null numeric values.
Now we will use describe() on the data. The describe() method performs fundamental statistical calculations on a dataset, such as extreme values, the number of data points, standard deviation, etc. Any missing or NaN values are automatically disregarded. The describe() method gives a good picture of the distribution of the data.
df.describe()
Output:
Data visualization:
Target column: Our target column will be the Species column since we will only want
results based on species in the end.
import seaborn as sns
import matplotlib.pyplot as plt
sns.countplot(x='species', data=df)
plt.show()
Output:
There are many other visualization plots in Data Science. To know more about them, refer to https://www.tutorialspoint.com/machine_learning_with_python/machine_learning_with_python_understanding_data_with_visualization.htm
Data modelling:
Data modelling is one of the most important aspects of data science and is sometimes referred to as the core of data analysis. The intended output of a model should be derived from prepared and analysed data. Before the specified criteria can be met, the environment required to execute the data model is chosen and constructed.
At this phase, we develop datasets for training and testing the model for production-related tasks. It also involves selecting the correct model type and determining whether the problem involves classification, regression, or clustering. After deciding on the model type, we must choose the appropriate algorithms to implement it. This must be performed with care, as it is crucial to extract the relevant insights from the provided data.
Here machine learning comes into the picture. Machine learning is broadly divided into classification, regression, and clustering models, and each model type has algorithms that are applied to the dataset to extract the relevant information. These models are used in this phase, and we will discuss them in detail in the machine learning chapter.
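Continuing the Iris example, here is a minimal sketch (an illustrative addition, not the original tutorial's code) of this phase: splitting the data into training and testing sets, fitting a classification model, and checking its accuracy.

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

# Training and testing datasets from the Iris data used earlier
iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.3, random_state=42)

# This is a classification problem, so we choose a classification algorithm
model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(X_train, y_train)

# Evaluate on the held-out test set
predictions = model.predict(X_test)
print("Accuracy:", accuracy_score(y_test, predictions))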
Model deployment:
We have reached the final stage of the data science lifecycle. The model is finally ready to be deployed in the desired format and through the chosen channel after a detailed review process. Note that a machine learning model has no utility unless it is deployed in production. Generally speaking, these models are associated and integrated with products and applications.
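A common way to prepare a model for deployment (one option among many, shown here as an illustrative sketch with made-up file names) is to serialize the trained model so that a production application can load it later.

import joblib
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression

# Train a small model as a stand-in for the model built in the previous phase
iris = load_iris()
model = LogisticRegression(max_iter=200).fit(iris.data, iris.target)

# Persist the trained model to disk
joblib.dump(model, "iris_model.joblib")

# Inside the deployed application (for example a web API), load and reuse it
loaded_model = joblib.load("iris_model.joblib")
print(loaded_model.predict([[5.1, 3.5, 1.4, 0.2]]))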
Who are all involved in Data Science lifecycle?
Data is being generated, collected, and stored on voluminous servers and data warehouses
from the individual level to the organisational level. But how will you access this massive
data repository? This is where the data scientist comes in, since he or she is a specialist
in extracting insights and patterns from unstructured text and statistics.
Below, we present the many job profiles of the data science team participating in the data
science lifecycle.
Data Science-Prerequisites
You need several technical and non-technical skills to become a successful Data Scientist. Some of these skills are essential to becoming a well-versed data scientist, while others simply make things easier for a data scientist. Different job roles determine the level of skill-specific proficiency you need to possess.
Given below are some skills you will require to become a data scientist.
Technical skills:
1. Python
Data Scientists use Python a lot because it is one of the most popular programming
languages, easy to learn and has extensive libraries that can be used for data manipulation
and data analysis. Since it is a flexible language, it can be used in all stages of Data
Science, such as data mining or running applications. Python has a huge open-source ecosystem with powerful Data Science libraries like NumPy, Pandas, Matplotlib, PyTorch, Keras, Scikit-learn, Seaborn, etc. These libraries help with different Data Science tasks,
such as reading large datasets, plotting and visualizing data and correlations, training and
fitting machine learning models to your data, evaluating the performance of the model,
etc.
2. SQL
SQL is an additional essential prerequisite before getting started with Data Science. SQL
is relatively simple compared to other programming languages, but is required to become
a Data Scientist. This programming language is used to manage and query relational
database-stored data. We can retrieve, insert, update, and remove data with SQL. To
extract insights from data, it is crucial to be able to write complex SQL queries that include joins, GROUP BY, HAVING, etc. Joins enable you to query many tables simultaneously. SQL also enables the execution of analytical operations and the transformation of database structures, as illustrated in the sketch below.
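The snippet below is an illustrative addition (the tables and values are made up) showing a join with GROUP BY and HAVING, run from Python using the built-in sqlite3 module.

import sqlite3

# In-memory database with two hypothetical tables: customers and orders
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INTEGER, amount REAL);
    INSERT INTO customers VALUES (1, 'Aby'), (2, 'Ravi');
    INSERT INTO orders VALUES (1, 1, 250.0), (2, 1, 100.0), (3, 2, 75.0);
""")

# A join with GROUP BY and HAVING: total spend per customer, keeping totals above 100
query = """
    SELECT c.name, SUM(o.amount) AS total_spent
    FROM customers AS c
    JOIN orders AS o ON o.customer_id = c.id
    GROUP BY c.name
    HAVING SUM(o.amount) > 100;
"""
for row in conn.execute(query):
    print(row)   # ('Aby', 350.0)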
3. R
R is an advanced language that is used to build complex statistical models. R also lets you work with arrays, matrices, and vectors. R is well known for its graphical libraries, which let users draw clear, attractive graphs that are easy to understand.
With R Shiny, programmers can make web applications using R, which is used to embed
visualizations in web pages and gives users a lot of ways to interact with them. Also, data
extraction is a key part of the science of data. R lets you connect your R code to database
management systems.
R also gives you a number of options for more advanced data analysis, such as building
prediction models, machine learning algorithms, etc. R also has a number of packages for
processing images.
4. Statistics
In data science, the advanced machine learning algorithms that store and translate data patterns for prediction rely heavily on statistics. Data scientists utilize statistics to collect, assess, analyze, and derive conclusions from data, as well as to apply relevant quantitative mathematical models and variables. Data scientists work as programmers, researchers, and business executives, among other roles, and all of these roles have a statistical foundation. The importance of statistics in data science is comparable to that of programming languages.
5. Hadoop
Data scientists perform operations on enormous amounts of data, but sometimes the memory of a single system cannot process such huge volumes. So how will data processing be performed on such huge amounts of data? This is where Hadoop comes into the picture. It is used to rapidly divide and transfer data to numerous servers for
data processing and other actions such as filtering. While Hadoop is based on the concept
of Distributed Computing, several firms require that Data Scientists have a fundamental understanding of distributed-system tools such as Pig, Hive, MapReduce, etc. Several firms have begun to use Hadoop-as-a-Service (HaaS), another name for Hadoop in the cloud, so that Data Scientists do not need to understand Hadoop's inner workings.
6. Spark
Spark is a framework for big data computation, like Hadoop, and it has gained considerable popularity in the Data Science world. Hadoop reads data from and writes data to disk, whereas Spark keeps intermediate computation results in system memory, making it comparatively easier to use and faster than Hadoop. Apache Spark is designed to speed up complex algorithms and is particularly well suited to data science. If the dataset is huge, it distributes the data processing, which saves a lot of time. The main reasons for using Apache Spark are its speed and the platform it provides for running data science tasks and processes. Spark can run on a single machine or on a cluster of machines, which makes it convenient to work with.
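As a rough illustration (an addition to the text, assuming PySpark is installed locally and that a file named sales.csv exists), a minimal Spark job might look like this.

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

# Start a local Spark session; computation is distributed across local cores
spark = SparkSession.builder.appName("DataScienceDemo").master("local[*]").getOrCreate()

# Read a (hypothetical) large CSV and run a distributed aggregation
sales = spark.read.csv("sales.csv", header=True, inferSchema=True)
summary = sales.groupBy("region").agg(F.sum("amount").alias("total_amount"))
summary.show()

spark.stop()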
7. Machine learning
Machine learning is a crucial component of Data Science. Machine Learning algorithms are an effective method for analysing massive volumes of data. They may assist in automating a variety of Data Science-related operations. Nevertheless, an in-depth understanding of
Machine Learning principles is not required to begin a career in this industry. The majority
of Data Scientists lack skills in Machine Learning. Just a tiny fraction of Data Scientists has
extensive knowledge and expertise in advanced topics such as Recommendation Engines,
Adversarial Learning, Reinforcement Learning, Natural Language Processing, Outlier
Detection, Time Series Analysis, Computer Vision, Survival Analysis, etc. These
competencies will consequently help you stand out in a Data Science profession.
Non-Technical skills
1) Understanding of business domain
The more understanding one has of a particular business area or domain, the easier it will be for a data scientist to analyse data from that domain.
2) Understanding of data:
Data science is all about data, so it is very important to understand what data is, how data is stored, and to have knowledge of tables, rows, and columns.
3) Critical thinking:
Critical thinking is the ability to think clearly and logically while figuring out and understanding how ideas fit together. In data science, you need to be able to think critically to derive useful insights and improve business operations. Critical thinking is probably one of the most important skills in data science: it makes it easier to dig deeper into information and find the most important things.
4) Product understanding:
Designing models isn't the entire job of a data scientist. Data scientists have to come up
with insights that can be used to improve the quality of products. With a systematic
approach, professionals can accelerate quickly if they understand the whole product. They
can help models get started (bootstrap) and improve feature engineering. This skill also
helps them improve their storytelling by revealing thoughts and insights about products
that they may not have thought of before.
5) Adaptability:
One of the most sought-after soft skills for data scientists in the modern talent acquisition
process is the ability to adapt. Because new technologies are being made and used more
quickly, professionals have to quickly learn how to use them. As a data scientist, you have
to keep up with changing business trends and be able to adapt.
Data Science – Applications
Data science involves different disciplines like mathematical and statistical modelling,
extracting data from its source and applying data visualization techniques. It also involves
handling big data technologies to gather both structured and unstructured data. Below,
we will see some applications of data science:
Gaming industry:
By establishing a presence on social media, gaming organizations deal with a number of issues. Zynga, a gaming corporation, has produced social media games like Zynga Poker, Farmville, Chess with Friends, Speed Guess Something, and Words with Friends. These games have generated many user connections and large volumes of data.
Here comes the necessity for data science within the game business in order to use the
data acquired from players across all social networks. Data analysis provides a captivating,
innovative diversion for players to keep ahead of the competition! One of the most
interesting applications of data science is inside the features and procedures of game
creation.
Healthcare:
Data science plays an important role in the field of healthcare. A Data Scientist's
responsibility is to integrate all Data Science methodologies into healthcare software. The
Data Scientist helps in collecting useful insights from the data in order to create prediction
models. The overall responsibilities of a Data Scientist in the field of healthcare are as
follows:
1. Medical image analysis:
Data Science helps to detect abnormalities in the human body by performing image analysis on scanned images, hence assisting physicians in developing an appropriate treatment plan. These image examinations include X-ray, sonography, MRI (Magnetic Resonance Imaging), and CT scans, among others. Doctors are able to provide patients with better care by gaining vital information from the study of these test images.
2. Predictive analysis:
The condition of a patient is predicted by the predictive analytics model developed using
Data Science. In addition, it facilitates the development of strategies for the patient's
suitable treatment. Predictive analytics is a highly important tool of data science that plays
a significant part in the healthcare business.
Image recognition:
Image recognition is a technique of image processing that identifies everything in an
image, including individuals, patterns, logos, items, locations, colors, and forms.
Data science techniques have begun to recognize human faces and match them against all the images in a database. In addition, mobile phones with cameras are generating an enormous number of digital images and videos. This vast amount of digital data is being utilized by businesses to provide customers with superior and more convenient services. Generally, an AI facial recognition system analyses all facial characteristics and compares them to its database to find a match.
Recommendation systems
As online shopping becomes more prevalent, e-commerce platforms are able to capture users' shopping preferences as well as the performance of various products in the market. This leads to the creation of recommendation systems, which build models predicting shoppers' needs and show the products a shopper is most likely to buy. Companies like Amazon and Netflix use recommendation systems to help their users find the movie or product they are looking for.
Similar predictive techniques also help with route planning, for example deciding whether a flight should go directly to the destination or take a halt in between.
Finance:
The importance and relevance of data science in the banking sector is comparable to that
of data science in other areas of corporate decision-making. Professionals in data science
for finance give support and assistance to relevant teams within the company, particularly
the investment and financial team, by assisting them in the development of tools and
dashboards to enhance the investment process.
Computer Vision
The advancement in recognizing an image by a computer involves processing large sets of image data from multiple objects of the same category, for example face recognition. These data sets are modelled, and algorithms are created to apply the model to newer images (a testing dataset) to get a satisfactory result. Processing these huge data sets and creating the models require various tools used in Data Science.
Internet search:
Several search engines use data science to understand user behaviour and search patterns. These search engines use diverse data science approaches to give each user the most relevant search results. Search engines such as Google, Yahoo, Bing, etc. are becoming increasingly competent at replying to searches within seconds.
Speech recognition:
Google's Voice Assistant, Apple's Siri, and Microsoft's Cortana all utilise large datasets and
are powered by data science and natural language processing (NLP) algorithms. Speech
recognition software improves and gains a deeper understanding of human nature due to
the application of data science as more data is analysed.
Education:
When the world experienced the COVID-19 pandemic, the majority of students had to carry on their learning on computers. Online courses, e-submissions of assignments and examinations, etc., have been adopted by the Indian education system. For the majority of us, doing everything "online" remains challenging. Technology and contemporary times have undergone a metamorphosis. As a result, Data Science in education is more crucial than ever as it enters our educational system.
Now, instructors' and students' everyday interactions are being recorded through a variety of platforms, and class participation and other factors are being evaluated. As a result, the rising number of online courses has increased the depth and value of educational data.
Data Science-Machine learning
Data science is the science of gaining useful insights from data in order to extract the most crucial and relevant information and, given a dependable stream of data, to generate predictions using machine learning.
Data science and machine learning are subfields of computer science that focus on
analyzing and making use of large amounts of data to improve the processes by which
products, services, infrastructural systems, and more are developed and introduced to the
market.
The two relate to each other in much the same way that squares relate to rectangles: squares are rectangles, but rectangles are not squares. Data science is the all-encompassing rectangle, while machine learning is a square that is its own entity. Both are commonly employed by data scientists in their work and are increasingly being adopted by practically every business.
Machine learning is a part of artificial intelligence that uses algorithms to find patterns in
data and then predict how those patterns will change in the future. This lets engineers use
statistical analysis to look for patterns in the data.
Facebook, Twitter, Instagram, YouTube, and TikTok collect information about their users; based on what you have done in the past, these platforms can guess your interests and requirements and suggest products, services, or articles that fit your needs.
Machine learning is a set of tools and concepts that are used in data science, but they also
show up in other fields. Data scientists often use machine learning in their work to help
them get more information faster or figure out trends.
Types of machine learning:
There are mainly three types of machine learning:
1) Supervised learning
2) Unsupervised learning
3) Reinforcement learning
Supervised learning
Supervised learning, also called "supervised machine learning," is a type of machine learning and artificial intelligence. It is defined by its use of labelled datasets to train algorithms to correctly classify data or predict outcomes. As data is fed into the model, its weights are adjusted until the model fits correctly; this is part of the cross-validation process. Supervised learning helps organisations find large-scale solutions to a wide range of real-world problems, such as classifying spam into a separate folder from your inbox, as Gmail does with its spam folder. Some commonly used supervised learning algorithms are:
1) Naive Bayes: Naive Bayes is a classification algorithm that is based on the Bayes Theorem's principle of class conditional independence. This means that the presence of one feature does not change the likelihood of another feature, and that each predictor has the same effect on the result/outcome.
2) Linear regression: Linear regression is used to find how a dependent variable is
related to one or more independent variables and to make predictions about what
will happen in the future. Simple linear regression is when there is only one
independent variable and one dependent variable.
3) Logistic regression: When the dependent variables are continuous, linear
regression is used. When the dependent variables are categorical, like "true" or
"false" or "yes" or "no," logistic regression is used. Both linear and logistic regression
seek to figure out the relationships between the data inputs. However, logistic
regression is mostly used to solve binary classification problems, like figuring out whether a particular mail is spam or not.
4) Support Vector Machines (SVM): A support vector machine is a popular supervised learning model developed by Vladimir Vapnik. It can be used both to classify and to predict data. It is usually used to solve classification problems by constructing a hyperplane where the distance between two groups of data points is the greatest. This hyperplane is called the "decision boundary" because it separates the groups of data points (for example, oranges and apples) on either side of the plane.
5) K-nearest neighbour: The KNN algorithm, which is also called the "k-nearest
neighbour" algorithm, groups data points based on how close they are to and related
to other data points. This algorithm works on the idea that data points that are
similar can be found close to each other. So, it tries to figure out how far apart the
data points are, using Euclidean distance and then assigns a category based on the
most common or average category. However, as the size of the test dataset grows,
the processing time increases, making it less useful for classification tasks.
6) Random forest: Random forest is another supervised machine learning algorithm
that is flexible and can be used for both classification and regression. The "forest" is
a group of decision trees that are not correlated to each other. These trees are then
combined to reduce variation and make more accurate data predictions.
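To make the algorithms listed above concrete, here is a minimal sketch (an illustrative addition using scikit-learn and the Iris dataset) that trains one of them, k-nearest neighbours, on labelled data.

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Labelled dataset: flower measurements (features) and species codes (target)
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# k-nearest neighbours: classify each flower by its 5 closest training points
knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(X_train, y_train)
print("Test accuracy:", knn.score(X_test, y_test))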
Unsupervised learning
Unsupervised learning, also called unsupervised machine learning, uses machine learning
algorithms to look at unlabelled datasets and group them together. These programmes
find hidden patterns or groups of data. Its ability to find similarities and differences in
information makes it perfect for exploratory data analysis, cross-selling strategies,
customer segmentation, and image recognition.
Common unsupervised learning approaches:
Unsupervised learning models are used for three main tasks: clustering, association, and dimensionality reduction. Below, we describe these learning methods and the common algorithms used:
1) Clustering:
Clustering is a method for data mining that organises unlabelled data based on their
similarities or differences. Clustering techniques are used to organise unclassified,
unprocessed data items into groups according to structures or patterns in the data. There
are many types of clustering algorithms, including exclusive, overlapping, hierarchical, and
probabilistic.
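As an illustrative addition, the sketch below clusters the (unlabelled) Iris measurements into three groups with k-means from scikit-learn; the choice of three clusters is an assumption.

from sklearn.datasets import load_iris
from sklearn.cluster import KMeans

# Treat the iris measurements as unlabelled data
X, _ = load_iris(return_X_y=True)

# Group the observations into 3 clusters based on similarity
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0)
labels = kmeans.fit_predict(X)
print(labels[:10])                 # cluster assignment of the first 10 rows
print(kmeans.cluster_centers_)     # coordinates of the cluster centres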
2) Dimensionality reduction:
Although more data typically produces more accurate findings, it may also affect the
effectiveness of machine learning algorithms (e.g., overfitting) and make it difficult to
visualize datasets. Dimensionality reduction is a strategy used when a dataset has an
excessive number of characteristics or dimensions. It decreases the quantity of data inputs
to a manageable level while retaining the integrity of the dataset to the greatest extent
feasible. Dimensionality reduction is often employed in the data pre-processing phase, and there are a number of approaches; one of the most common is principal component analysis (PCA).
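A minimal sketch of PCA with scikit-learn (an illustrative addition, reducing the four Iris features to two components) is shown below.

from sklearn.datasets import load_iris
from sklearn.decomposition import PCA

# Reduce the 4 iris features to 2 principal components
X, _ = load_iris(return_X_y=True)
pca = PCA(n_components=2)
X_reduced = pca.fit_transform(X)

print(X_reduced.shape)               # (150, 2)
print(pca.explained_variance_ratio_) # share of variance kept by each component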
Reinforcement learning:
Reinforcement Learning (RL) is a type of machine learning that allows an agent to learn in
an interactive setting via trial and error utilising feedback from its own actions and
experiences.
In recent years, machine learning and artificial intelligence (AI) have come to dominate
portions of data science, playing a crucial role in data analytics and business intelligence.
Machine learning automates data analysis and makes predictions based on the collection
and analysis of massive volumes of data about certain populations using models and
algorithms. Data science and machine learning are related to each other, but not identical.
Data science is a vast field that incorporates all aspects of deriving insights and information from data. It involves gathering, cleaning, analysing, and interpreting vast amounts of data to discover patterns, trends, and insights that may guide business choices.
In other words, data science encompasses machine learning as one of its numerous
methodologies. Machine learning is a strong tool for data analysis and prediction, but it is
just a subfield of data science as a whole.
3. Data science includes a wide range of techniques, such as data cleaning, data integration, data exploration, statistical analysis, data visualization, and machine learning. Machine learning, on the other hand, primarily focuses on building predictive models using algorithms such as regression, classification, and clustering.
4. Data science typically requires large and complex datasets that require significant processing and cleaning to derive insights. Machine learning, on the other hand, requires labelled data that can be used to train algorithms and models.
6. Data science techniques can be used for a variety of purposes beyond prediction, such as clustering, anomaly detection, and data visualization. Machine learning algorithms are primarily focused on making predictions or decisions based on data.
Data Science-Data Analysis
Data analysis converts raw data into meaningful insights and valuable information, which
helps in making informed decisions in fields like healthcare, education, and business.
Accurate data: We need accurate data to make informed decisions. Data analysis helps
businesses acquire relevant and accurate information that they can use to plan business
strategies, make informed decisions about future plans, and realign the company's vision
and goals.
Problem-solving
Typically, the data analysis process involves many iterative rounds. Let's examine each in
more detail.
1) Identify: Determine the business issue you want to address. What issue is the firm
attempting to address? What must be measured, and how will it be measured?
2) Collect: Get the raw data sets necessary to solve the identified question. Data may be
gathered from internal sources, such as customer relationship management (CRM)
software, or from secondary sources, such as government records or social media
application programming interfaces (APIs).
3) Clean: Prepare the data for analysis by cleansing it. This often entails removing
duplicate and anomalous records, resolving inconsistencies, standardizing data structure
and format, and fixing white space and other formatting problems (a short pandas sketch
appears after this list).
4) Analyze the data: You may begin to identify patterns, correlations, outliers, and
variations that tell a narrative by transforming the data using different data analysis
methods and tools. At this phase, you may utilize data mining to identify trends
within databases or data visualization tools to convert data into an easily digestible
graphical format.
5) Interpret: Determine how effectively the findings of your analysis addressed your
initial query by interpreting them. Based on the facts, what suggestions are possible?
What constraints do your conclusions have?
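As a small illustration of the cleaning step, here is a minimal pandas sketch; the file name and column names are hypothetical:

    # Minimal sketch of the "Clean" step with pandas.
    # The file name and column names below are hypothetical.
    import pandas as pd

    df = pd.read_csv("customers.csv")
    df = df.drop_duplicates()                        # remove duplicate rows
    df["name"] = df["name"].str.strip()              # trim stray white space
    df["signup_date"] = pd.to_datetime(df["signup_date"], errors="coerce")  # standardize format
    df = df.dropna(subset=["customer_id"])           # drop rows missing a key field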
Types of data analysis:
Data may be utilized to answer questions and assist decision making in several ways. To
choose the optimal method for analyzing your data, it helps to know the four types of data
analysis widely used in the field.
Descriptive analysis:
Descriptive analytics is the process of looking at both current and past data to find patterns
and trends. It's sometimes called the simplest way to look at data because it describes
trends and relationships without going into more detail.
Descriptive analytics is easy to use and is probably something almost every company does
every day. Simple statistical software like Microsoft Excel or data visualisation tools like
Google Charts and Tableau can help separate data, find trends and relationships between
variables, and show information visually.
Descriptive analytics is a good way to show how things have changed over time. It also
uses trends as a starting point for more analysis to help make decisions.
Some examples of descriptive analysis are financial statement analysis and survey reports.
Diagnostic analysis
Diagnostic analytics is the process of using data to figure out why trends and correlations
between variables occur. It is the next step after identifying trends using descriptive
analytics. You can do diagnostic analysis manually, with an algorithm, or with statistical
software (such as Microsoft Excel).
Before getting into diagnostic analytics, you should know how to test a hypothesis, what
the difference is between correlation and causation, and what diagnostic regression
analysis is.
This type of analysis answers the question, "Why did this happen?".
Some examples of diagnostic analysis are examining market demand and explaining
customer behavior.
Predictive analysis:
Predictive analytics is the process of using data to try to figure out what will happen in the
future. It uses data from the past to make predictions about possible future situations that
can help make strategic decisions.
The forecasts might be for the near term, such as anticipating the failure of a piece of
equipment later that day, or for the far future, such as projecting your company's cash
flows for the next year.
Predictive analysis can be done manually or with the help of algorithms for machine
learning. In either case, data from the past is used to make guesses or predictions about
what will happen in the future.
Regression analysis, which may detect the connection between two variables (linear
regression) or among three or more variables (multiple regression), is one predictive
analytics method. The connections between variables are expressed in a mathematical
equation that may be used to anticipate the result if one variable changes.
Regression allows us to gain insights into the structure of that relationship and provides
measures of how well the data fit that relationship. Such insights can be extremely useful
for assessing past patterns and formulating predictions. Forecasting can help us to build
data-driven plans and make more informed decisions.
This type of analysis answers the question, “What might happen in the future?”.
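As a minimal sketch of this idea, the following uses scikit-learn's linear regression on made-up advertising-spend figures to anticipate a future value:

    # Minimal sketch of predictive analysis with linear regression (scikit-learn).
    # The advertising-spend and sales figures are made up for illustration.
    import numpy as np
    from sklearn.linear_model import LinearRegression

    ad_spend = np.array([[10], [20], [30], [40], [50]])   # past values of one variable
    sales = np.array([25, 45, 62, 85, 105])               # observed outcomes

    model = LinearRegression().fit(ad_spend, sales)
    print(model.predict([[60]]))    # anticipated result if the variable changes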
Prescriptive analysis
Prescriptive analytics is the process of using data to figure out the best thing to do next.
This type of analysis looks at all the important factors and comes up with suggestions for
what to do next. This makes prescriptive analytics a useful tool for making decisions based
on data.
In prescriptive analytics, machine-learning algorithms are often used to sort through large
amounts of data faster and often more efficiently than a person can. Using "if" and "else"
statements, algorithms sort through data and make suggestions based on a certain set of
requirements. For example, if at least 50% of customers in a dataset said they were "very
unsatisfied" with your customer service team, the algorithm might suggest that your team
needs more training.
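A toy version of such a rule might look like this in Python; the survey responses and the 50% threshold are illustrative:

    # Toy prescriptive rule; the responses and the 50% threshold are illustrative.
    responses = ["very unsatisfied", "satisfied", "very unsatisfied", "neutral"]
    share_unsatisfied = responses.count("very unsatisfied") / len(responses)

    if share_unsatisfied >= 0.5:
        print("Recommendation: schedule additional customer service training.")
    else:
        print("Recommendation: no extra training needed at this time.")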
It's important to remember that algorithms can make suggestions based on data, but they
can't replace human judgement. Prescriptive analytics is a decision-support tool and should
be used as such, to help make decisions and come up with strategies. Your judgement is
still needed to give context and limits to what an algorithm comes up with.
Some examples of prescriptive analysis are investment decisions and lead scoring in sales.
Data Science-Tools in demand
Data science tools are used to dig deeper into raw and complicated data (unstructured or
structured data) and process, extract, and analyse it to find valuable insights by using
different data processing techniques like statistics, computer science, predictive modelling
and analysis, and deep learning.
Data scientists use a wide range of tools at different stages of the data science life cycle
to deal with zettabytes and yottabytes of structured and/or unstructured data every day
and get useful insights from it. The most important thing about these tools is that they
make it possible to do data science tasks without using complicated programming
languages, because they come with pre-built algorithms, functions, and graphical user
interfaces (GUIs).
1) SQL: Data Science is the comprehensive study of data. To access data and work with
it, data must be extracted from the database for which SQL will be needed. Data
Science relies heavily on Relational Database Management. With SQL commands and
queries, a Data Scientist may manage, define, alter, create, and query the database.
Several contemporary sectors have equipped their product data management with
NoSQL technology, yet SQL remains the best option for many business intelligence
tools and in-office processes.
2) DuckDB: DuckDB is a table-based relational database management system that also
lets you run SQL queries for analysis. It is free and open source, and it offers features
such as fast analytical queries and simple operation.
DuckDB also works with programming languages like Python, R, Java, etc. that are
used in Data Science. You can use these languages to create, register, and play with
a database.
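A minimal sketch of using DuckDB from Python might look like this; the file name and column names are hypothetical:

    # Minimal sketch of querying a CSV file with DuckDB from Python.
    # The file name and column names are hypothetical.
    import duckdb

    con = duckdb.connect()                  # in-memory database
    result = con.execute(
        "SELECT region, AVG(sales) AS avg_sales FROM 'sales.csv' GROUP BY region"
    ).fetchdf()                             # return the result as a pandas DataFrame
    print(result)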
3) Beautiful Soup: Beautiful Soup is a Python library that can pull or extract
information from HTML or XML files. It is an easy-to-use tool that lets you read the
HTML content of websites to get information from them.
This library can help Data Scientists or Data Engineers set up automatic Web scraping,
which is an important step in fully automated data pipelines.
It is mainly used for web scraping.
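A minimal Beautiful Soup sketch, using a made-up HTML snippet, looks like this:

    # Minimal sketch of extracting information from HTML with Beautiful Soup.
    # The HTML snippet is made up for illustration.
    from bs4 import BeautifulSoup

    html = "<html><body><h1>Prices</h1><p class='price'>19.99</p></body></html>"
    soup = BeautifulSoup(html, "html.parser")

    print(soup.h1.text)                               # "Prices"
    for tag in soup.find_all("p", class_="price"):    # every price paragraph
        print(tag.text)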
4) Scrapy: Scrapy is an open-source Python web crawling framework that is used to
scrape a lot of web pages. It is a web crawler that can both scrape and crawl the
web. It gives you all the tools you need to get data from websites quickly, process
them in the way you want, and store them in the structure and format you want.
5) Selenium: Selenium is a free, open-source testing tool that is used to test web apps
on different browsers. Selenium can only test web applications, so it can't be used to
test desktop or mobile apps. Appium and HP's QTP are two other tools that can be
used to test software and mobile apps.
6) Python: Python is the most popular programming language among Data Scientists. One
of the main reasons why Python is so popular in the field of Data Science is that it is
easy to use and has a simple syntax. This makes it
easy for people who don't have a background in engineering to learn and use. Also,
there are a lot of open-source libraries and online guides for putting Data Science
tasks like Machine Learning, Deep Learning, Data Visualization, etc. into action.
Some of the most commonly used libraries of python in data science are:
1) Numpy
2) Pandas
3) Matplotlib
4) SciPy
5) Plotly
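A small, self-contained taste of these libraries, with made-up numbers, might look like this:

    # A small, self-contained taste of these libraries; the numbers are made up.
    import numpy as np
    import pandas as pd
    import matplotlib.pyplot as plt

    data = pd.DataFrame({"month": ["Jan", "Feb", "Mar"],
                         "revenue": [120, 135, 150]})
    print(np.mean(data["revenue"]))                   # summary statistic with NumPy/pandas

    data.plot(x="month", y="revenue", kind="bar")     # quick chart with Matplotlib
    plt.show()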
7) R: R is the second most-used programming language in Data Science, after Python.
It was first made to solve problems with statistics, but it has since grown into a full
Data Science ecosystem.
Most people use the dplyr and readr libraries to load, transform, and augment data.
ggplot2 lets you visualize the data on a graph in different ways.
8) Tableau: Tableau is a visual analytics platform that is changing the way people and
organizations use data to solve problems. It gives people and organizations the tools
they need to extract the most out of their data.
When it comes to communication, Tableau is very important. Most of the time, Data
Scientists have to break down the information so that their teams, colleagues,
executives, and customers can understand it better. In these situations, the
information needs to be easy to see and understand.
Tableau helps teams dig deeper into data, find insights that are usually hidden, and
then show that data in a way that is both attractive and easy to understand. Tableau
also helps Data Scientists quickly look through the data, adding and removing things
as they go so that the end result is an interactive picture of everything that matters.
9) Tensorflow: TensorFlow is a free, open-source machine learning platform based on data
flow graphs. The nodes of the graph are mathematical operations, and the edges are the
multidimensional data arrays (tensors) that flow between them. The architecture is
flexible: machine learning algorithms can be described as a graph of operations that
work together, and models can be trained and run
on GPUs, CPUs, and TPUs on different platforms, like portable devices, desktops, and
high-end servers, without changing the code. This means that programmers from all
kinds of backgrounds can work together using the same tools, which makes them
much more productive. The Google Brain Team created the system to study machine
learning and deep neural networks (DNNs). However, the system is flexible enough
to be used in a wide range of other fields as well.
10) Scikit-learn: Scikit-learn is a popular open-source Python library for machine
learning that is easy to use. It has a wide range of supervised and unsupervised
learning algorithms, as well as tools for model selection, evaluation, and data
preprocessing. Scikit-learn is used a lot in both academia and business. It is known
for being fast, reliable, and easy to use.
It also has features for dimensionality reduction, feature selection, feature extraction,
ensemble techniques, and built-in example datasets.
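A minimal scikit-learn sketch, using the iris dataset that ships with the library, might look like this:

    # Minimal sketch of a supervised learning workflow with scikit-learn,
    # using the iris dataset bundled with the library.
    from sklearn.datasets import load_iris
    from sklearn.model_selection import train_test_split
    from sklearn.linear_model import LogisticRegression
    from sklearn.metrics import accuracy_score

    X, y = load_iris(return_X_y=True)
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

    model = LogisticRegression(max_iter=200).fit(X_train, y_train)
    print(accuracy_score(y_test, model.predict(X_test)))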
11) Keras: Google's Keras is a high-level deep learning API for creating neural
networks. It is built in Python and is used to facilitate neural network construction.
Moreover, different backend neural network computations are supported.
Since it offers a Python interface with a high degree of abstraction and numerous
backends for computation, Keras is reasonably simple to understand and use. This
makes Keras slower than other deep learning frameworks, but very user-friendly for
beginners.
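A minimal Keras sketch of a small neural network (the layer sizes are arbitrary) might look like this:

    # Minimal sketch of building a small neural network with the Keras API.
    # The layer sizes are arbitrary.
    from tensorflow import keras

    model = keras.Sequential([
        keras.layers.Dense(16, activation="relu", input_shape=(4,)),
        keras.layers.Dense(3, activation="softmax"),
    ])
    model.compile(optimizer="adam",
                  loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])
    model.summary()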
12) Jupyter Notebook: Jupyter Notebook is an open-source online application that
allows the creation and sharing of documents with live code, equations,
visualisations, and narrative texts. It is popular among data scientists and
practitioners of machine learning because it offers an interactive environment for
data exploration and analysis.
With Jupyter Notebook, you can write and run Python code (and code written in other
programming languages) right in your web browser. The results are shown in the same
document. This lets you put code, data, and text explanations all in one place, making
it easy to share and reproduce your analysis.
13) Dash: Dash is an important tool for data science because it lets you use Python
to create interactive web apps. It makes it easy and quick to create data visualisation
dashboards and apps without having to know how to code for the web.
14) SPSS: SPSS, which stands for "Statistical Package for the Social Sciences," is an
important tool for data science because it gives both new and experienced users a
full set of statistical and data analysis tools.
Data Science – Careers
There are several jobs available that are linked to or overlap with the field of data science.
1) Data analyst
2) Data scientist
3) Database administrator
4) Big data engineer
5) Data mining engineer
6) Machine learning engineer
7) Data architect
8) Hadoop engineer
9) Data warehouse architect
Data analyst:
A data analyst analyses data sets to identify solutions to customer-related issues. This
information is also communicated to management and other stakeholders by a data
analyst. These people work in a variety of fields, including business, banking, criminal
justice, science, medical, and government.
A data analyst is someone who has the expertise and abilities to transform raw data into
information and insight that can be utilized to make business choices.
Data scientist:
A Data Scientist is a professional who uses analytical, statistical, and programming abilities
to acquire enormous volumes of data. It is their obligation to utilize data to create solutions
that are personalized to the organization's specific demands.
Companies are increasingly relying on data in their day-to-day operations. A data scientist
examines raw data and pulls meaningful meaning from it. They then utilize this data to
identify trends and provide solutions that a business needs to grow and compete.
Database administrator:
Database administrators are responsible for managing and maintaining business
databases. They are responsible for enforcing a data management policy and ensuring that
corporate databases are operational and backed up in the event of data loss.
Big data engineer:
Big data engineers create, test, and maintain solutions for a company that use Big Data.
Their job is to gather a lot of data from many different sources and make sure that people
who use the data later can get to it quickly and easily. Big data engineers basically make
sure that the company's data pipelines are scalable, secure, and able to serve more than
one user.
The amount of data made and used today seems to be endless. The question is how this
information will be saved, analyzed, and shown. A big data engineer works on the methods
and techniques to deal with these problems.
Data mining engineer:
A data mining engineer sets up and runs the systems that are used to store and analyze
data. Overarching tasks include setting up data warehouses, organizing data so it's easy
to find, and installing conduits for data to flow through. A data mining engineer needs to
know where the data comes from, how it will be used, and who will use it. ETL, which
stands for "extract, transform, and load," is the key acronym for a data mining engineer.
Machine learning engineer:
Machine learning engineers can take on different roles. There is often some overlap between
the jobs of a data scientist and an AI (artificial intelligence) engineer, and sometimes the two
jobs are even confused with each other. Machine learning is a subfield of AI that focuses on
analysing data to find connections between the inputs and the desired outputs.
A machine learning developer makes sure that each problem has a solution that fits it
perfectly. Only by carefully processing the data and choosing the best algorithm for the
situation can you get the best results.
Data architect
Data architects build and manage a company's database by finding the best ways to set it
up and structure it. They work with database managers and analysts to make sure that
company data is easy to get to. Tasks include making database solutions, figuring out
what needs to be done, and making design reports.
A data architect is an expert who comes up with the organization's data strategy, which
includes standards for data quality, how data moves around the organisation, and how
data is kept safe. The way this professional in data management sees things is what turns
business needs into technical needs.
As the key link between business and technology, data architects are becoming more and
more in demand.
Hadoop engineer
Hadoop Developers are in charge of making and coding Hadoop applications. Hadoop is an
open-source framework for managing and storing applications that work with large
amounts of data and run on cluster systems. Basically, a Hadoop developer makes apps
that help a company manage and keep track of its big data.
A Hadoop Developer is the person in charge of writing the code for Hadoop applications.
This job is like being a Software Developer. The jobs are pretty similar, but the first one
is in the Big Data domain. Let's look at some of the things a Hadoop Developer has to do
to get a better idea of what this job is about.
Data warehouse architect:
So, just like a regular architect designs a building or a naval architect designs a ship, data
warehouse architects design and help launch data warehouses, customizing them to meet
the needs of the client.
Glassdoor lists Data Scientist as the top job on its site, and this is not expected to change
any time soon. Glassdoor also reports that data science job openings stay open for about
45 days, five days longer than the job market average.
IBM will work with schools and businesses to create a work-study environment for aspiring
data scientists. This will help close the skills gap.
The need for data scientists is growing rapidly as new jobs and industries are created, and
it is further amplified by the growing volume and variety of data.
In the future, there will only be more roles for data scientists. Related job titles include
data engineer, data science manager, and big data architect. Also,
the financial and insurance sectors are becoming some of the biggest employers of data
scientists.
As the number of institutes that train data scientists grows, it is likely that more and more
people will know how to use data.
Data Science – Scientists
A data scientist is a trained professional who analyzes and makes sense of data. They use
their knowledge of data science to help businesses make better decisions and run better.
Most data scientists have a lot of experience with math, statistics, and computer science.
They use this information to look at big sets of data and find trends or patterns. Data
scientists might also come up with new ways to collect and store data.
There are many ways to become a Data Scientist, but because it's usually a high-level job,
most Data Scientists have degrees in math, statistics, computer science, and other related
fields.
A Data Scientist is a high-level role; prior to attaining this level of expertise, you should
acquire a comprehensive knowledge foundation in a related topic. This might include
mathematics, engineering, statistics, data analysis, programming, or information
technology; some Data Scientists began their careers in banking or baseball scouting.
But regardless of the area you begin in, you should begin with Python, SQL, and Excel.
These abilities will be important for processing and organizing raw data. It is beneficial to
be acquainted with Tableau, a tool you will use often to build visuals.
Learn data science fundamentals such as how to gather and store data, analyze and model
data, and display and present data using every tool in the data science arsenal, such as
Tableau and PowerBI, among others.
You should be able to utilize Python and R to create models that assess behavior and
forecast unknowns, as well as repackage data in user-friendly formats, by the conclusion
of your training.
Several Data Science job listings state advanced degrees as a prerequisite. Sometimes this
is non-negotiable, but with demand exceeding supply, employers increasingly accept proof
of the necessary skills in place of credentials alone.
Hiring managers care most about how well you can show that you know the subject, and
more and more people are realizing that this doesn't have to be done in the traditional
ways.
Machine learning is being used most in data science. This refers to tools that use artificial
intelligence to give systems the ability to learn and get better without being specifically
programmed to do so.
Excel is generally used in this step. Even though the basic idea behind spreadsheets is
simple—making calculations or graphs by correlating the information in their cells—Excel
is still very useful after more than 30 years, and it is almost impossible to do data science
without it.
But making beautiful pictures is just the start. As a Data Scientist, you'll also need to be
able to use these visualizations to show your findings to a live audience. You may have
these communication skills already, but if not, don't worry. Anyone can get better with
practice. If you need to, start small by giving presentations to one friend or even your pet
before moving on to a group.
Step 5: Work on some data science projects that will help develop your
practical data skills
Once you know the basics of the programming languages and digital tools that Data
Scientists use, you can start using them to practice and improve your new skills. Try to
take on projects that require a wide range of skills, like using Excel and SQL to manage
and query databases and Python and R to analyze data using statistical methods, build
models that analyze behavior and give you new insights, and use statistical analysis to
predict things you don't know.
As you practice, try to cover different parts of the process. Start with researching a
company or market sector, then define and collect the right data for the task at hand.
Finally, clean and test that data to make it as useful as possible.
Lastly, you can make and use your own algorithms to analyze and model the data. You
can then put the results into easy-to-read visuals or dashboards that users can use to
interact with your data and ask questions about it. You could even try showing your
findings to other people to get better at communicating.
You should also get used to working with different kinds of data, like text, structured data,
images, audio, and even video. Every industry has its own types of data that help leaders
make better, more informed decisions.
As a working Data Scientist, you'll probably be an expert in just one or two, but as a
beginner building your skillset, you'll want to learn the basics of as many types as possible.
Taking on more complicated projects will give you the chance to see how data can be used
in different ways. Once you know how to use descriptive analytics to look for patterns in
data, you'll be better prepared to try more complicated statistical methods like data
mining, predictive modelling, and machine learning to predict future events or even make
suggestions.
In fact, your portfolio might be the most important thing you have when looking for a job.
If you want to be a Data Scientist, you might want to show off your work on GitHub instead
of (or in addition to) your own website. GitHub makes it easy to show your work, process,
and results while also raising your profile in a public network. Don't stop there, though.
Include a compelling story with your data and show the problems you're trying to solve so
the employer can see how good you are. You can show your code in a bigger picture on
GitHub instead of just by itself, which makes your contributions easier to understand.
Don't list all of your work when you're applying for a specific job. Highlight just a few
pieces that are most relevant to the job you're applying for and that best show your range
of skills throughout the whole data science process, from starting with a basic data set to
defining a problem, cleaning up, building a model, and finding a solution.
Your portfolio is your chance to show that you can do more than just crunch numbers and
communicate well.
Choose something that really interests you, ask a question about it, and try to answer that
question with data.
Document your journey and show off your technical skills and creativity by presenting your
findings in a beautiful way and explaining how you got there. Your data should be
accompanied by a compelling narrative that shows the problems you've solved,
highlighting your process and the creative steps you've taken, so that an employer can
see your worth.
Joining an online data science network like Kaggle is another great way to show that you're
involved in the community, show off your skills as an aspiring Data Scientist, and continue
to grow both your expertise and your reach.
Find out what a company values and what they're working on, and make sure it fits with
your skills, goals, and what you want to do in the future. And don't just look in Silicon
Valley. Cities like Boston, Chicago, and New York are having trouble finding technical
talent, so there are lots of opportunities.
Data Scientist – Salary
As digitalization has spread around the world, data science has become one of the best-
paying jobs in the world. In India, data scientists make between 1.8 Lakhs and 1 Crore a
year, depending on their qualifications, skills, and experience.
[Table: job title and average annual base salary in India]
The data in the table above is taken from Indeed.
The United States of America pays the highest data scientist salaries on average, followed
by Australia, Canada, and Germany.
According to Payscale, an entry-level Data Scientist with less than 1-year experience can
expect to earn an average total compensation (includes tips, bonus, and overtime pay) of
₹589,126 based on 498 salaries. An early career Data Scientist with 1-4 years of
experience earns an average total compensation of ₹830,781 based on 2,250 salaries. A
mid-career Data Scientist with 5-9 years of experience earns an average total
compensation of ₹1,477,290 based on 879 salaries. An experienced Data Scientist with
10-19 years of experience earns an average total compensation of ₹1,924,803 based on
218 salaries. In their late career (20 years and higher), employees earn an average total
compensation of ₹1,350,000.
In recent years, improvements in technology have made Data Science more important in
many different fields of work. Data Science is used for more than just collecting and
analyzing data. It is now a multidisciplinary field with many different roles. With high
salaries and guaranteed career growth, more and more people are getting into the field of
data science every day.
Data Science-Resources
This article lists the best programs and courses in data science that you can take to
improve your skills and get one of the best data scientist jobs in 2023. You should take
one of these online courses and certifications for data scientists to get started on the right
path to mastering data science.
A variety of factors/aspects were considered when producing the list of top data science
courses for 2023, including:
Curriculum Covered: The list is compiled with the breadth of the syllabus in mind, as
well as how effectively it has been tailored to fit varied levels of experience.
Course Features and Outcomes: We have also discussed the course outcomes and
other aspects, such as query resolution and hands-on projects, that will help students
obtain marketable skills.
Skills required: We have addressed the skills that applicants must have in order
to participate in the course.
Course Fees: Each course is graded based on its features and prices to ensure that you
get the most value for your money.
1) Covers all areas of data science, beginning with the fundamentals of programming
(binary, loops, number systems, etc.) and on through intermediate programming
subjects (arrays, OOPs, sorting, recursion, etc.) and ML Engineering (NLP,
Reinforcement Learning, TensorFlow, Keras, etc.).
2) Lifetime access.
3) 30-Days Money Back Guarantee.
4) After completion certificate.
1) This course will enable you to build a Data Science foundation, whether you have
basic Python skills or not. The code-along and well planned-out exercises will make
you comfortable with the Python syntax right from the outset. At the end of this
short course, you’ll be proficient in the fundamentals of Python programming for
Data Science and Data Analysis.
2) In this truly step-by-step course, every new tutorial video is built on what you have
already learned. The aim is to move you one extra step forward at a time, and then,
you are assigned a small task that is solved right at the beginning of the next video.
That is, you start by understanding the theoretical part of a new concept first. Then,
you master this concept by implementing everything practically using Python.
3) Become a Python developer and Data Scientist by enrolling in this course. Even if you
are a novice in Python and data science, you will find this illustrative course
informative, practical, and helpful. And if you aren’t new to Python and data science,
you’ll still find the hands-on projects in this course immensely helpful.
This Pandas course offers a complete view of this powerful tool for implementing data
analysis, data cleaning, data transformation, different data formats, text manipulation,
regular expressions, data I/O, data statistics, data visualization, time series and more.
This is a practical course with many examples, because the easiest way to learn is by
practicing! Then we'll integrate all the knowledge we have learned in a Capstone Project,
developing a preliminary analysis and cleaning, filtering, transforming, and visualizing data
using the famous IMDB dataset.
1) They will have a clear understanding of different types of data with examples, which is
very important for understanding data analysis.
2) They will understand relationships and dependency between variables by learning
Pearson's correlation coefficient, scatter diagrams, and linear regression analysis, and will
be able to make predictions.
3) Students will understand different methods of data analysis, such as measures of central
tendency (mean, median, mode), measures of dispersion (variance, standard deviation,
coefficient of variation), and how to calculate quartiles, skewness, and box plots.
4) They will have a clear understanding of the shape of data after learning skewness and
box plots, which is an important part of data analysis.
5) Students will have a basic understanding of probability and how to explain and
understand Bayes' theorem with the simplest example.
By the end of this book, you’ll feel confident using conda and Anaconda Navigator to
manage dependencies and gain a thorough understanding of the end-to-end data science
workflow.
The code has been broken down into small chunks (a few lines or a function at a time) to
enable thorough discussion.
As you progress, you will learn how to perform data analysis while exploring the
functionalities of key data science Python packages, including pandas, SciPy, and scikit-
learn. Finally, the book covers ethics and privacy concerns in data science and suggests
resources for improving data science skills, as well as ways to stay up to date on new data
science developments.
By the end of the book, you should be able to comfortably use Python for basic data science
projects and should have the skills to execute the data science process on any data source.
You will begin by looking at data ingestion of data formats such as JSON, CSV, SQL
RDBMSes, HDF5, NoSQL databases, files in image formats, and binary serialized data
structures. Further, the book provides numerous example data sets and data files, which
are available for download and independent exploration.
Moving on from formats, you will impute missing values, detect unreliable data and
statistical anomalies, and generate synthetic features that are necessary for successful
data analysis and visualization goals.
By the end of this book, you will have acquired a firm understanding of the data cleaning
process necessary to perform real-world data science and machine learning tasks.
Data Science - Interview Questions
What is data science and how is it different from other data-related fields?
Data science is the domain of study that uses computational and statistical methods to get
knowledge and insights from data. It utilizes techniques from mathematics, statistics,
computer science and domain-specific knowledge to analyse large datasets, find trends
and patterns from the data and make predictions for the future.
Data science is different from other data related fields because it is not only about
collecting and organising data. The data science process consists of analysing, modelling,
visualizing and evaluating the data set. Data science uses tools like machine learning
algorithms, data visualisation tools and statistical models to analyse data, make
predictions and find patterns and trends in the data.
Other data-related fields such as machine learning, data engineering, and data analytics
are more narrowly focused: the goal of a machine learning engineer is to design and create
algorithms that are capable of learning from data and making predictions; the goal of data
engineering is to design and manage data pipelines, infrastructure, and databases; and
data analysis is all about exploring and analysing data to find patterns and trends. Data
science, by contrast, covers collecting, exploring, modelling, visualizing, predicting, and
deploying the model.
Overall, data science is a more comprehensive way to analyse data because it includes the
whole process, from preparing the data to making predictions. Other fields that deal with
data have more specific areas of expertise.
What is the data science process and what are the key steps involved?
A data science process also known as data science lifecycle is a systematic approach to
find a solution for a data problem which shows the steps that are taken to develop, deliver,
and maintain a data science project.
A standard data science lifecycle approach comprises the use of machine learning
algorithms and statistical procedures that result in more accurate prediction models. Data
extraction, preparation, cleaning, modelling, assessment, etc., are some of the most
important data science stages. Key steps involved in data science process are:
1) Problem identification:
The data science lifecycle starts with "why?", just like any other business lifecycle. One of
the most important parts of the data science process is figuring out what the problems
are. This helps to define a clear goal around which all the other steps can be planned. In
short, it is important to know the business goal as early as possible, because it will
determine what the end goal of the analysis will be.
2) Data collection:
The next step in the data science lifecycle is data collection, which means getting raw data
from the appropriate and reliable source. The data that is collected can be either organized
or unorganized. The data could be collected from website logs, social media data, online
data repositories, and even data that is streamed from online sources using APIs, web
scraping, or data that could be in Excel or any other source.
3) Data processing
After collecting high-quality data from reliable sources, the next step is to process it. The
purpose of data processing is to ensure that any problems with the acquired data have
been resolved before proceeding to the next phase. Without this step, we may produce
mistakes or inaccurate findings.
4) Data analysis
Exploratory Data Analysis (EDA) is a set of visual techniques for analysing data. With this
method, we may get specific details on the statistical summary of the data. We will also
be able to deal with duplicate values and outliers, and identify trends or patterns within
the collection.
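A minimal EDA sketch with pandas might look like this; the file name is hypothetical:

    # Minimal sketch of exploratory data analysis with pandas; the file name is hypothetical.
    import pandas as pd

    df = pd.read_csv("survey.csv")
    print(df.describe())                  # statistical summary of numeric columns
    print(df.duplicated().sum())          # count of duplicate rows
    print(df.isnull().sum())              # missing values per column
    print(df.corr(numeric_only=True))     # a first look at relationships between variables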
5) Data visualization:
Data visualisation is the process of demonstrating information and data on a graph. Data
visualisation tools make it easy to understand trends, outliers, and patterns in data by
using visual elements like charts, graphs, and maps. It's also a great way for employees
or business owners to present data to people who aren't tech-savvy without making them
confused.
6) Data modelling:
Data Modelling is one of the most important aspects of data science and is sometimes
referred to as the core of data analysis. The intended output of a model should be derived
from prepared and analysed data.
At this phase, we develop datasets for training and testing the model for production-
related tasks. It also involves selecting the correct model type and determining if the
problem involves classification, regression, or clustering. After analysing the model type,
we must choose the appropriate implementation algorithms. It must be performed with
care, as it is crucial to extract the relevant insights from the provided data.
7) Model deployment:
Regularization penalizes large weights on the chosen features, shrinking them toward zero
so that no single feature dominates. In other words, regularisation discourages learning a
model that is overly flexible and has too many moving parts. A model with a lot of
flexibility is one that can fit as many data points as possible.
For cross-validation, we can use the k-fold cross-validation method. In k-fold cross-
validation, we divide the original data into k groups (also known as folds). We train
an ML model on all but one (k-1) of the subsets, and then we test the model on the subset
that wasn't used for training. This process is done k times, and each time a different subset
is set aside for evaluation (and not used for training).
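A minimal sketch of k-fold cross-validation with scikit-learn (k = 5, using the bundled iris dataset) might look like this:

    # Minimal sketch of k-fold cross-validation with scikit-learn (k = 5),
    # using the bundled iris dataset.
    from sklearn.datasets import load_iris
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import cross_val_score

    X, y = load_iris(return_X_y=True)
    model = LogisticRegression(max_iter=200)

    scores = cross_val_score(model, X, y, cv=5)   # train on 4 folds, test on the held-out fold, 5 times
    print(scores, scores.mean())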
A regression algorithm predicts a continuous value, such as a price or a temperature.
A classification algorithm predicts a discrete value, such as a class label, and it can also
output the value as a class label probability.
K-means clustering is a popular example of a clustering approach in which data points
are allocated to K groups based on their distance from each group's centroid. The data
points closest to a certain centroid will be grouped into the same category. A higher K
number indicates smaller groups with more granularity, while a lower K value indicates
bigger groupings with less granularity. Common applications of K-means clustering include
market segmentation, document clustering, picture segmentation, and image
compression.
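A minimal K-means sketch with scikit-learn, using six made-up points and K = 2, might look like this:

    # Minimal sketch of K-means clustering with scikit-learn.
    # The points and the choice of K = 2 are illustrative.
    import numpy as np
    from sklearn.cluster import KMeans

    points = np.array([[1, 2], [1, 4], [1, 0],
                       [10, 2], [10, 4], [10, 0]])

    kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(points)
    print(kmeans.labels_)             # cluster assigned to each point
    print(kmeans.cluster_centers_)    # the two centroids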
Can you explain overfitting and underfitting, and how to mitigate them?
Overfitting is a modelling error that arises when a function fits a limited set of data points
too closely. It is the outcome of a model that is excessively complex relative to its training
data, so it captures noise rather than the underlying pattern.
Underfitting is a modelling error that arises when a function does not properly match the
data points. It is the outcome of a model that is too simple or has been trained on too
little data.
There are a number of ways that researchers in machine learning can avoid overfitting.
These include: Cross-validation, Regularization, Pruning, Dropout.
There are a number of ways that researchers in machine learning can avoid underfitting.
These include: increasing model complexity, adding more or better features, reducing
regularization, and training for longer.
With these methods, you should be able to make your models better and fix any problems
with overfitting or underfitting.
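As one illustration, here is a minimal sketch of regularization with ridge regression in scikit-learn; the synthetic data are for demonstration only:

    # Minimal sketch of one overfitting remedy named above: regularization,
    # here with ridge regression from scikit-learn on synthetic data.
    import numpy as np
    from sklearn.linear_model import LinearRegression, Ridge

    rng = np.random.default_rng(0)
    X = rng.normal(size=(30, 10))                # few samples, many features
    y = X[:, 0] + 0.1 * rng.normal(size=30)      # only the first feature matters

    plain = LinearRegression().fit(X, y)
    regularized = Ridge(alpha=1.0).fit(X, y)     # the penalty shrinks the weights toward zero

    print(np.abs(plain.coef_).round(3))
    print(np.abs(regularized.coef_).round(3))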