Unit-I Introduction To Data Science
As an aspiring data scientist, you must be keen to understand how the life cycle
of data science projects works so that it is easier for you to implement your
own projects in a similar pattern. Here, we will discuss the step-by-step
implementation process of a data science project in a real-world scenario.
In simple terms, a data science life cycle is nothing but a repetitive set of steps
that you need to take to complete and deliver a project or product to your client.
Because the projects and the teams involved in developing and deploying the
model differ, the life cycle will be slightly different in every company.
However, most data science projects follow a broadly similar process.
Now that we have an idea of who is involved in a typical business project,
let's understand what a data science project is and how we define the life
cycle of a data science project in a real-world scenario such as a fake-news
identifier.
Why do we need to define the life cycle of a data science project?
The data we collect is what helps us solve the use case. However, for beginners,
many questions arise, such as:
In what format do we need the data? How do we get the data? What do we need to
do with the data?
There are many questions, and the answers might vary from person to person. To
address all these concerns up front, we follow a pre-defined flow termed the
Data Science Project Life Cycle. The process is fairly simple: the company first
gathers data, performs data cleaning, performs EDA to extract relevant features,
and prepares the data by performing feature engineering and feature scaling. In
the second phase, the model is built, evaluated, and deployed. This entire life
cycle is not a one-person job; it needs the whole team working together to
achieve the required level of efficiency for the project.
• Data Mining:
Data mining is the process of sorting through large data sets to identify patterns
and relationships that can help solve business problems through data analysis.
Data mining techniques and tools enable enterprises to predict future trends
and make more-informed business decisions (see
https://www.techtarget.com/searchbusinessanalytics/feature/Top-5-predictive-analytics-use-cases-in-enterprises).
After gaining clarity on the problem statement, we need to collect relevant
data and break the problem into small components.
The data science project starts with the identification of various data sources.
Data collection entails obtaining information from both internal and external
sources that can help address the business issue.
Normally, the data analyst team is responsible for gathering the data. They need
to figure out proper ways to source and collect the data to get the desired
results.
There are two ways to source the data:
• Through web scraping with Python
• Extracting data with the use of third-party APIs
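For illustration, here is a minimal Python sketch of both approaches; the page URL, the API endpoint, and its parameters are hypothetical placeholders, and the requests and BeautifulSoup libraries are just one common choice:

# A minimal sketch of the two sourcing approaches (the URLs below are
# hypothetical placeholders, not real endpoints).
import requests
from bs4 import BeautifulSoup

# 1) Web scraping with Python: download a page and pull out headline text.
page = requests.get("https://example.com/news")            # hypothetical page
soup = BeautifulSoup(page.text, "html.parser")
headlines = [h.get_text(strip=True) for h in soup.find_all("h2")]

# 2) Extracting data through a third-party API: request JSON records.
response = requests.get(
    "https://api.example.com/v1/articles",                 # hypothetical API
    params={"topic": "fake-news", "limit": 100},
)
records = response.json()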
• Data Cleaning:
This is a very important stage in the data science life cycle. It includes
data cleaning, data reduction, data transformation, and data integration. This
stage takes a lot of time, and data scientists spend a significant amount of it
preparing the data.
Data cleaning is handling the missing values in the data, filling them in
with appropriate values, and smoothing out the noisy data.
Data reduction is using various strategies to reduce the size of the data such
that the output remains the same while the processing time decreases.
Data transformation is converting the data from one type to another so that it
can be used efficiently for analysis and visualization.
Data integration is resolving any conflicts in the data and handling
redundancies. Data preparation is the most time-consuming process,
accounting for up to 90% of the total project duration, and it is the most
crucial step throughout the entire life cycle.
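As a minimal illustration of these preparation steps, the pandas sketch below assumes a hypothetical CSV file with age, city, and income columns; the file and column names are placeholders, not part of the original text:

import pandas as pd

# Load a hypothetical raw dataset.
df = pd.read_csv("raw_data.csv")

# Data cleaning: fill missing numeric values with the median and
# missing categorical values with the most frequent value.
df["age"] = df["age"].fillna(df["age"].median())
df["city"] = df["city"].fillna(df["city"].mode()[0])

# Data integration / redundancy handling: drop exact duplicate rows.
df = df.drop_duplicates()

# Data reduction: keep only the columns relevant to the analysis.
df = df[["age", "city", "income"]]

# Data transformation: convert a string column to a numeric type.
df["income"] = pd.to_numeric(df["income"], errors="coerce")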
• Exploratory Data Analysis:
This step involves getting some idea about the target variable and the factors
affecting it before building the actual model. The distribution of data within
the different variables is explored graphically using bar graphs, and relations
between different features are captured via graphical representations such as
scatter plots and heat maps. Many data visualization techniques are used
extensively to explore every feature individually and in combination with other
features.
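A small sketch of such exploration, assuming a pandas DataFrame df with hypothetical age, income, and city columns, might look as follows:

import matplotlib.pyplot as plt
import seaborn as sns

# Distribution of a single categorical variable as a bar graph.
df["city"].value_counts().plot(kind="bar")
plt.title("Records per city")
plt.show()

# Relation between two numeric features as a scatter plot.
plt.scatter(df["age"], df["income"])
plt.xlabel("age")
plt.ylabel("income")
plt.show()

# Correlations between all numeric features as a heat map.
sns.heatmap(df.corr(numeric_only=True), annot=True, cmap="coolwarm")
plt.show()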
• Feature Engineering:
Feature engineering is a machine learning technique that leverages data to
create new variables that aren't in the training set. It can produce new features
for both supervised and unsupervised learning, with the goal of simplifying and
speeding up data transformations while also enhancing model accuracy.
Feature engineering is required when working with machine learning models:
regardless of the data or architecture, a poorly designed feature will have a
direct impact on your model.
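As an illustrative sketch (the column names and derived features below are hypothetical), new variables can be derived from existing ones and scaled like this:

import pandas as pd

# A minimal feature-engineering sketch on a hypothetical DataFrame.
df = pd.DataFrame({
    "signup_date": pd.to_datetime(["2021-01-05", "2021-03-20"]),
    "total_spent": [250.0, 90.0],
    "num_orders": [5, 3],
})

# New numeric feature: average spend per order.
df["avg_order_value"] = df["total_spent"] / df["num_orders"]

# New date-derived features: month and day of week of signup.
df["signup_month"] = df["signup_date"].dt.month
df["signup_dayofweek"] = df["signup_date"].dt.dayofweek

# Feature scaling: standardize a numeric column (zero mean, unit variance).
df["total_spent_scaled"] = (
    (df["total_spent"] - df["total_spent"].mean()) / df["total_spent"].std()
)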
In most data science projects, data modeling is regarded as the core process. In
this process, we take the prepared data as the input and try to produce the
desired output.
We first select the appropriate type of model, depending on whether the problem
is a regression, classification, or clustering problem. Depending on the type of
data received, we choose the machine learning algorithm best suited to the
model. Once this is done, we tune the hyperparameters of the chosen models to
get a favorable outcome.
Finally, we evaluate the model by testing its accuracy and relevance. In
addition, we need to make sure there is a correct balance between specificity
and generalizability, i.e., the created model must be unbiased.
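One way this modeling phase might look in code, sketched with scikit-learn for a classification problem, is shown below; X and y are assumed to be the already-prepared feature matrix and labels, and the chosen model family and parameter grid are only examples:

from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

# X, y are assumed to be the prepared feature matrix and labels.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Choose a model family suited to a classification problem and
# tune its hyperparameters with a small grid search.
param_grid = {"n_estimators": [100, 300], "max_depth": [5, 10, None]}
search = GridSearchCV(RandomForestClassifier(random_state=42), param_grid, cv=5)
search.fit(X_train, y_train)

# Evaluate the tuned model on held-out data.
predictions = search.best_estimator_.predict(X_test)
print("Test accuracy:", accuracy_score(y_test, predictions))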
• Data visualization:
• Data Processing:
Data processing occurs when data is collected and translated into usable information.
Usually performed by a data scientist or team of data scientists, it is important for
data processing to be done correctly so as not to negatively affect the end product,
or data output.
• The second stage is Data Processing.
• Its purpose is to prepare the data so we can understand the business problem and extract
the information needed to solve it.
• Steps for preparing data:
• Select the data related to the problem.
• Combine the data sets; where data comes from multiple sources, integrate it.
• Clean the data to find the missing values.
• Handle the missing values by removing or imputing them.
• Deal with errors by removing them.
• Use box plots for detecting outliers and handling them (see the sketch after this list).
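A minimal sketch of the missing-value and box-plot outlier steps above, assuming a hypothetical prepared dataset with a numeric salary column:

import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv("prepared_data.csv")     # hypothetical prepared dataset

# Handle missing values: impute numeric gaps with the column median.
df["salary"] = df["salary"].fillna(df["salary"].median())

# Visual outlier check with a box plot.
df.boxplot(column="salary")
plt.show()

# Handle outliers using the same IQR rule a box plot uses.
q1, q3 = df["salary"].quantile([0.25, 0.75])
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
df = df[(df["salary"] >= lower) & (df["salary"] <= upper)]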
EDA (Exploratory Data Analysis):
Before actually developing the model, this step entails understanding the
solution and the variables that may affect it, in order to understand the data
and its features better. Around 70% of the data science project life is spent on
this step, and we can extract a lot of information with proper EDA.
• Data Modelling:
• This is the most important step of the data science project.
• This phase is about selecting the right model type, depending on whether the issue is
classification, regression, or clustering; following the selection of the model family, we
must carefully select and implement the algorithms to be used within that family.
• Model Deployment
• This is the end stage of the data science project.
• In this stage, the delivery method that will be used to distribute the model to users or another
system is created.
• For various projects, this step can mean many different things: it might be as simple as getting
your model results into a Tableau/Power BI dashboard,
• or as complicated as scaling it to millions of users on the cloud.
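As one possible illustration of the simpler end of this range, the sketch below serves a previously saved model over HTTP with Flask; the model file name, route, and input format are hypothetical:

# A minimal deployment sketch: serving a trained model over HTTP with Flask.
# The model file and the expected input format are hypothetical placeholders.
import pickle
from flask import Flask, request, jsonify

app = Flask(__name__)

with open("model.pkl", "rb") as f:        # previously trained and saved model
    model = pickle.load(f)

@app.route("/predict", methods=["POST"])
def predict():
    payload = request.get_json()          # e.g. {"features": [1.2, 3.4, 5.6]}
    prediction = model.predict([payload["features"]])
    return jsonify({"prediction": prediction.tolist()})

if __name__ == "__main__":
    app.run(port=5000)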
• Data Analyst
• Extracting data from primary and secondary sources using automated tools
• Developing and maintaining databases
• Performing data analysis and making reports with recommendations
• Analyzing data and forecasting trends that impact the organization/project
• Working with other team members to improve data collection and quality processes.
SQL, R, SAS, and Python are some of the sought-after technologies for data
analysis, so certification in these can easily give a boost to your job
applications. You should also have good problem-solving skills.
• Data Engineers
Data engineers build and test scalable Big Data ecosystems for the business
so that data scientists can run their algorithms on data systems that are
stable and highly optimized. Data engineers also update the existing systems
with newer or upgraded versions of the current technologies to improve the
efficiency of the databases.
• Database Administrator
• Data Architect
A data architect creates the blueprints for data management so that the
databases can be easily integrated, centralized, and protected with the best
security measures. They also ensure that the data engineers have the best tools
and systems to work with.
• Statistician
A statistician, as the name suggests, has a sound understanding of statistical
theories and data organization. Not only do they extract and offer valuable
insights from the data clusters, but they also help create new methodologies
for the engineers to apply.
• Business Analyst
The role of a business analyst is slightly different from other data science jobs.
While they do have a good understanding of how data-oriented technologies work
and how to handle large volumes of data, they also separate the high-value
data from the low-value data. In other words, they identify how Big
Data can be linked to actionable business insights for business growth.
• In Health Care
In the healthcare industry, data science acts as a boon. Data Science is used for:
• Detecting Tumor.
• Drug discoveries.
• Medical Image Analysis.
• Virtual Medical Bots.
• Genetics and Genomics.
• Predictive Modeling for Diagnosis etc.
• Image Recognition
Currently, Data Science is also used in image recognition. For
example, when we upload an image with a friend on Facebook, Facebook
suggests tagging whoever is in the picture. This is done with the help of
machine learning and Data Science: when an image is recognized, data
analysis is performed on one's Facebook friends, and if a face present in the
picture matches someone's profile, Facebook suggests auto-tagging.
• Targeting Recommendation
Targeted recommendation is one of the most important applications of Data
Science. Whatever a user searches for on the Internet, he or she will then see
related posts everywhere. This can be explained with an example: suppose I
want a mobile phone, so I search for it on Google and later decide to buy it
offline. Data Science helps the companies that are paying for advertisements
for that phone, so everywhere on the internet, in social media, on websites,
and in apps, I will see recommendations for the mobile phone I searched for,
which nudges me to buy it online.
• Airline Route Planning
With the help of Data Science, the airline sector is also growing: for
example, it becomes easier to predict flight delays. It also helps decide
whether to fly directly to the destination or take a halt in between;
a flight can have a direct route from Delhi to the U.S.A., or it can stop
en route before reaching the destination.
• Data Science in Gaming
In most games where a user plays against a computer opponent, data
science concepts are used together with machine learning, so that with the
help of past data the computer improves its performance. Many games, such as
Chess and EA Sports titles, use Data Science concepts.
• In Delivery Logistics
Various Logistics companies like DHL, FedEx, etc. make use of Data
Science. Data Science helps these companies to find the best route for the
Shipment of their Products, the best time suited for delivery, the best mode of
transport to reach the destination, etc.
• Autocomplete
The autocomplete feature is an important application of Data Science: the
user types just a few letters or words and gets the facility of having the rest
of the line completed automatically. In Gmail, when we are writing a formal
mail to someone, the data science concept behind the autocomplete feature
offers an efficient way to complete the whole line. Autocomplete is also widely
used in search engines, in social media, and in various apps.
• Speech Recognition
Some of the best examples of speech recognition products are Google
Voice, Siri, Cortana, etc. Using the speech-recognition feature, even if you
aren't in a position to type a message, your life wouldn't stop: simply speak
out the message and it will be converted to text. However, at times you will
notice that speech recognition doesn't perform accurately.
Observation Method
Whether you're collecting data for business or academic research, the first
step is to identify the type of data you need to collect and what method you'll
use to do so. In general, there are two data types, primary and secondary,
and you can gather both with a variety of effective collection methods.
Both primary and secondary data-collection methods have their pros, cons,
and particular use cases. Read on for an explanation of your options and a
list of some of the best methods to consider.
You can collect primary data using quantitative or qualitative methods. Let's
take a closer look at the two:
While researchers often use the terms "survey" and "questionnaire"
interchangeably, the two mean slightly different things.
• Interviews
• Focus groups
• Observation
• Newspapers
Newspapers often publish data they've collected from their own surveys.
Due to the volume of resources you'll have to sift through, some surveys may
be relevant to your niche but difficult to find on paper. Luckily, most newspapers
are also published online, so looking through their online archives for specific
data may be easier.
• Unpublished sources
These include diaries, letters, reports, records, and figures belonging to private
individuals; these sources aren't in the public domain. Since authoritative
bodies haven't vetted or published the data, it can often be unreliable.
Below are some of the benefits of secondary data-collection methods and their
advantages over primary methods.
• Non-Sampling Errors:
The errors related to the collection of data are known as Non-Sampling Errors. The
different types of Non-Sampling Errors are Error of Measurement, Error of
Non-response, Error of Misinterpretation, Error of Calculation or Arithmetical Error, and
Error of Sampling Bias.
• Error of Measurement:
The Error of Measurement may occur because of differences in the scale of
measurement or differences in the rounding-off procedure adopted by different
investigators.
• Error of Non-response:
These errors arise when the respondents do not offer the information required for the
study.
• Error of Misinterpretation:
These errors arise when the respondents misinterpret the questions given
in the questionnaire.
• Error of Calculation or Arithmetical Error:
These errors occur while adding, subtracting, or multiplying figures of data.
• Error of Sampling Bias:
These errors occur when, for one reason or another, a part of the target population
cannot be included in the sample.
Note: If the field of investigation is larger or the size of the population is larger,
then the possibility of errors related to the collection of data is higher.
Data Preprocessing
Data preprocessing is an important step in the data mining process. It refers to
the cleaning, transforming, and integrating of data in order to make it ready for
analysis. The goal of data preprocessing is to improve the quality of the data
and to make it more suitable for the specific data mining task.
Data Reduction: This involves reducing the size of the dataset while
preserving the important information. Data reduction can be achieved
through techniques such as feature selection and feature extraction. Feature
selection involves selecting a subset of relevant features from the dataset,
while feature extraction involves transforming the data into a
lower-dimensional space while preserving the important information.
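A brief scikit-learn sketch of both ideas, assuming X and y are an already-prepared numeric feature matrix and labels, and treating the chosen k and number of components as arbitrary examples:

from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.decomposition import PCA

# Feature selection: keep the 10 features most related to the target.
X_selected = SelectKBest(score_func=f_classif, k=10).fit_transform(X, y)

# Feature extraction: project the data onto 5 principal components.
X_reduced = PCA(n_components=5).fit_transform(X)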
• Binning:
This method works on sorted data in order to smooth it. The whole data
is divided into segments of equal size and then various methods are
performed to complete the task. Each segment is handled separately.
One can replace all data in a segment by its mean, or boundary values
can be used to complete the task.
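A small sketch of smoothing by bin means, using purely illustrative numbers:

import numpy as np
import pandas as pd

# Sorted data smoothed by equal-size bins, each value replaced by its bin mean.
values = pd.Series(sorted([4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34]))
bins = np.array_split(values, 4)                       # 4 equal-size segments
smoothed = pd.concat([pd.Series([b.mean()] * len(b)) for b in bins])
print(smoothed.tolist())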
• Regression:
Here the data can be made smooth by fitting it to a regression function. The
regression used may be linear (having one independent variable) or
multiple (having multiple independent variables).
• Clustering:
This approach groups similar data into clusters. The outliers may go
undetected, or they will fall outside the clusters.
• Data Transformation:
This step is taken in order to transform the data into forms appropriate for the
mining process. It involves the following ways:
• Normalization:
It is done in order to scale the data values into a specified range (-1.0 to 1.0 or 0.0 to 1.0); see the sketch after this list.
• Attribute Selection:
In this strategy, new attributes are constructed from the given set of
attributes to help the mining process.
• Discretization:
This is done to replace the raw values of a numeric attribute by interval levels or
conceptual levels.
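As referenced under Normalization above, here is a minimal sketch of min-max normalization and discretization on an illustrative numeric series; the bin edges and labels are arbitrary examples:

import pandas as pd

ages = pd.Series([18, 25, 32, 47, 51, 60])

# Normalization: min-max scale the values into the range 0.0 to 1.0.
ages_scaled = (ages - ages.min()) / (ages.max() - ages.min())

# Discretization: replace raw values with interval (conceptual) levels.
age_groups = pd.cut(ages, bins=[0, 30, 50, 100],
                    labels=["young", "middle", "senior"])
print(ages_scaled.tolist())
print(age_groups.tolist())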
• Data Reduction:
Data reduction is a technique used in data mining to reduce the size of a dataset
while still preserving the most important information. This can be beneficial
in situations where the dataset is too large to be processed efficiently, or where
the dataset contains a large amount of irrelevant or redundant information.
There are several different data reduction techniques that can be used
in data mining, including:
Discretization
Histogram analysis
Binning
Binning refers to a data smoothing technique that helps to group a huge number
of continuous values into a smaller number of bins. This technique can also be
used for data discretization and the development of concept hierarchies.
Cluster Analysis
When discretizing data by a linear regression technique, you can get the best
neighboring intervals, and the large intervals are then combined to develop
larger overlaps and form the final overlapping intervals. It is a supervised
procedure.
Whenever we talk about data analysis, the term "outliers" often comes to mind.
As the name suggests, outliers refer to the data points that lie outside of
what is expected. The major question about outliers is what you do with
them. Whenever you set out to analyze a data set, you will always
have some assumptions about how the data was generated. If you find some
data points that are likely to contain some form of error, then these are
definitely outliers, and depending on the context, you will want to deal with those
errors. The data mining process involves the analysis and prediction of the
information that the data holds. In 1969, Grubbs introduced the first definition of outliers.
Types of Outliers
Global Outliers
Global outliers are also called point outliers and are taken as the
simplest form of outliers. When a data point deviates from all the rest of the
data points in a given data set, it is known as a global outlier. In most cases,
outlier detection procedures are aimed at determining global outliers.
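One simple way to flag such global outliers is a z-score rule; the numbers and threshold below are purely illustrative:

import numpy as np

# Flag global (point) outliers with a simple z-score rule.
values = np.array([10.2, 9.8, 10.5, 9.9, 10.1, 10.3, 9.7, 10.0, 42.0])
z_scores = (values - values.mean()) / values.std()
outliers = values[np.abs(z_scores) > 2.5]
print(outliers)    # the value far from all the rest (42.0) is flagged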
Collective Outliers
In a given data set, when a group of data points deviates from the rest of the
data set, it is called a collective outlier. Here, the individual data objects may
not be outliers, but when you consider them as a whole, they may
behave as outliers. To identify these outliers, you need background
information about the relationship between the behavior of the different data
objects. For example, in an intrusion detection system, a DoS (denial-of-service)
packet sent from one system to another may be taken as normal behavior; however,
if this happens with various computers simultaneously, it is considered abnormal
behavior, and as a whole the data points are called collective outliers.
Contextual Outliers
Outliers are discarded in many places when data mining is applied, but they are
still used in many applications like fraud detection, medical analysis, etc. This
is usually because events that occur rarely can carry much more significant
information than events that occur more regularly.
Other applications where outlier detection plays a vital role are given below.
Any unusual response that occurs due to medical treatment can be analyzed
through outlier analysis in data mining.
Machine Learning is one of the booming technologies across the world that
enables computers/machines to turn a huge amount of data into predictions.
However, these predictions highly depend on the quality of the data, and if we
are not using the right data for our model, then it will not generate the expected
result. In machine learning projects, we generally divide the original dataset
into training data and test data. We train our model over a subset of the original
dataset, i.e., the training dataset, and then evaluate whether it can generalize
well to the new or unseen dataset or test set. Therefore, train and test datasets
are the two key concepts of machine learning, where the training dataset is
used to fit the model, and the test dataset is used to evaluate the model.
In this topic, we are going to discuss train and test datasets along with the
difference between both of them. So, let's start with the introduction of the
training dataset and test dataset in Machine Learning.
The training data is the largest (in size) subset of the original dataset and
is used to train or fit the machine learning model. First, the training data is
fed to the ML algorithm, which lets it learn how to make predictions for
the given task.
For example, for training a sentiment analysis model, the training data would consist of text samples labelled with their sentiment.
Splitting the dataset into train and test sets is one of the important parts of data
pre-processing, as by doing so we can improve the performance of our model
and hence obtain better predictability.
If we train our model with one training set and then test it with a completely
different dataset, our model will not be able to understand the correlations
between the features.
Once the model is trained enough with the relevant training data, it is tested
with the test data. We can understand the whole process of training and testing
in three steps, which are as follows:
• Feed: Firstly, we need to train the model by feeding it with training input data.
• Define: Now, training data is tagged with the corresponding outputs (in
Supervised Learning), and the model transforms the training data into text vectors or
a number of data features.
• Test: In the last step, we test the model by feeding it with the test data/unseen
dataset. This step ensures that the model is trained efficiently and can generalize well.
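A compact sketch of this feed/define/test flow with scikit-learn follows; the texts and labels are made-up placeholders standing in for a real sentiment dataset, and the vectorizer and model are just one reasonable choice:

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# Made-up labelled examples standing in for a real sentiment dataset.
texts = ["great product", "terrible service", "loved it", "very bad",
         "works well", "awful experience", "highly recommend", "not good"]
labels = [1, 0, 1, 0, 1, 0, 1, 0]   # 1 = positive, 0 = negative

# Split into training data (to fit the model) and test data (to evaluate it).
X_train, X_test, y_train, y_test = train_test_split(
    texts, labels, test_size=0.25, random_state=0
)

# Define: turn the tagged training texts into numeric feature vectors.
vectorizer = CountVectorizer()
X_train_vec = vectorizer.fit_transform(X_train)

# Feed: train the model on the training vectors and labels.
model = LogisticRegression()
model.fit(X_train_vec, y_train)

# Test: evaluate on the unseen test set.
X_test_vec = vectorizer.transform(X_test)
print("Test accuracy:", accuracy_score(y_test, model.predict(X_test_vec)))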