
Data Science Report


INDUSTRIAL TRAINING REPORT

ON
“DATA SCIENCE”
Completed at
Teachnook
Duration
1 April to 31 June 2023
3rd Year (5th Sem)
Submitted By:
Krishna Soni

Enrolment Roll No.: ECB 2021/10/19


Department of Artificial Intelligence & Data Science
Engineering College of Bikaner
Bikaner, Rajasthan
DECLARATION

I hereby declare that the Industrial Training Report on Data Science
completed at Teachnook is an authentic record of my own work, carried out
as a requirement of Industrial Training as part of the V semester syllabus
during the period from April 2023 to June 2023, and submitted to the
Department of Artificial Intelligence and Data Science, Engineering
College Bikaner, for the award of the degree of B.Tech. in Artificial
Intelligence and Data Science by Bikaner Technical University, Bikaner.

Krishna
21EEBAD019
Table of Contents
Certificate
Student Declaration

1. Introduction
1.1) Data Science
1.2) Data Science Process

2. My Learnings
2.1) Introduction to Data Science
2.2) Python for Data Science
2.3) Understanding the Statistics for Data Science
2.4) Predictive Modelling and Basics of Machine Learning

3. Introduction to Data Science
3.1) Data Science
3.2) Example
3.3) Computer Vision
3.4) Application of Data Science
3.5) Reason for Choosing Data Science

4. Python Introduction
4.1) History of Python
4.2) Features of Python
4.3) Python for Data Science
4.4) Why Python

5. Statistics
5.1) Descriptive Statistics
5.2) Types of Variables
5.3) Outliers
5.4) Interquartile Range
5.5) Histogram
5.6) Inferential Statistics
5.7) Hypothesis Testing
5.8) T-Test
5.9) Z Score
5.10) Chi-Squared Test

6. Predictive Modelling
6.1) Types
6.2) Stages of Predictive Modelling
6.3) Problem Definition
6.4) Hypothesis Generation
6.5) Data Extraction and Collection
6.6) Data Exploration and Transformation
6.6.1) Variable Treatment
6.6.2) Univariate Analysis
6.6.3) Bivariate Analysis
6.6.4) Missing Value Treatment
6.7) Types of Outliers
6.7.1) Univariate
6.7.2) Bivariate

7. Model Building
7.1) Algorithms
7.2) Algorithms of Machine Learning

8. Methodology

9. Result
10. References
INTRODUCTION
OBJECTIVES
To explore, sort and analyse large volumes of data from various sources in order to
draw conclusions that optimize business processes and support decision making.
Examples include predictive maintenance of machines and, in marketing and sales,
sales forecasting based on weather.

1.1) DATA SCIENCE:


Data Science is a multi-disciplinary subject that uses mathematics, statistics, and
computer science to study and evaluate data. The key objective of Data Science is to
extract valuable information for use in strategic decision making, product
development, trend analysis, and forecasting.
Data Science concepts and processes are mostly derived from data engineering,
statistics, programming, social engineering, data warehousing, machine learning, and
natural language processing. The key techniques in use are data mining, big data
analysis, data extraction and data retrieval.
Data science is the field of study that combines domain expertise, programming skills,
and knowledge of mathematics and statistics to extract meaningful insights from data.
Data science practitioners apply machine learning algorithms to numbers, text,
images, video, audio, and more to produce artificial intelligence (AI) systems to
perform tasks that ordinarily require human intelligence. In turn, these systems
generate insights which analysts and business users can translate into tangible
business value.
1.2) DATA SCIENCE PROCESS:
1. The first step of this process is setting a research goal. The main purpose here
is making sure all the stakeholders understand the what, how, and why of the
project.
2. The second phase is data retrieval. You want to have data available for analysis,
so this step includes finding suitable data and getting access to the data from
the data owner. The result is data in its raw form, which probably needs
polishing and transformation before it becomes usable.
3. Now that you have the raw data, it’s time to prepare it. This includes
transforming the data from a raw form into data that’s directly usable in your
models. To achieve this, you’ll detect and correct different kinds of errors in the
data, combine data from different data sources, and transform it. If you have
successfully completed this step, you can progress to data visualization and
modelling.
4. The fourth step is data exploration. The goal of this step is to gain a deep
understanding of the data. You’ll look for patterns, correlations, and deviations
based on visual and descriptive techniques. The insights you gain from this
phase will enable you to start modelling.
5. The fifth step is the most attractive part: model building (often referred to as
“data modelling”). It is now that you attempt to gain the
insights or make the predictions stated in your project charter. Now is the time
to bring out the heavy guns, but remember research has taught us that often
(but not always) a combination of simple models tends to outperform one
complicated model. If you’ve done this phase right, you’re almost done.
6. The last step of the data science process is presenting your results and
automating the analysis, if needed. One goal of a project is to change a process
and/or make better decisions. You may still need to convince the business that
your findings will indeed change the business process as expected. This is
where you can shine in your influencer role. The importance of this step is
more apparent in projects on a strategic and tactical level. Certain projects
require you to perform the business process over and over again, so automating
the project will save time.

MY LEARNINGS
2.1) INTRODUCTION TO DATA SCIENCE
• Overview and Terminologies in Data Science
• Applications of Data Science
➢ Anomaly detection (fraud, disease, etc.)
➢ Automation and decision-making (creditworthiness, etc.)
➢ Classification (classifying emails as “important” or “junk”)
➢ Forecasting (sales, revenue, etc.)
➢ Pattern detection (weather patterns, financial market patterns, etc.)
➢ Recognition (facial, voice, text, etc.)
➢ Recommendations (based on learned preferences, recommendation
engines can refer you to movies, restaurants and books you may like)

2.2) PYTHON FOR DATA SCIENCE


Introduction to Python, Understanding Operators, Variables and Data Types,
Conditional Statements, Looping Constructs, Functions, Data Structure, Lists,
Dictionaries, Understanding Standard Libraries in Python, reading a CSV File in
Python, Data Frames and basic operations with Data Frames, Indexing Data Frame.

2.3) UNDERSTANDING THE STATISTICS FOR DATA SCIENCE


Introduction to Statistics, Measures of Central Tendency, Understanding the spread of
data, Data Distribution, Introduction to Probability, Probabilities of Discrete and
Continuous Variables, Normal Distribution, Introduction to Inferential Statistics,
Understanding the Confidence Interval and margin of error, Hypothesis Testing,
Various Tests, Correlation.
2.4) PREDICTIVE MODELLING AND BASICS OF MACHINE LEARNING
Introduction to Predictive Modelling, Types and Stages of Predictive Models,
Hypothesis Generation, Data Extraction and Exploration, Variable Identification,
Univariate Analysis for Continuous Variables and Categorical Variables, Bivariate
Analysis, Treating Missing Values and Outliers, Transforming the Variables, Basics of
Model Building, Linear and Logistic Regression, Decision Trees, and the K-means
Algorithm in Python.
Summary of Procedure of Analysing Data:
Data science generally has a five-stage life cycle that consists of:
• Capture: Data entry, signal reception, data extraction
• Maintain: Data cleansing, data staging, data processing
• Process: Data mining, clustering/classification, data modelling
• Analyse: Predictive analysis, regression
• Communicate: Data reporting, data visualization
Introduction to Data Science
3.1) Data Science
The field of deriving insights from data using scientific techniques is called data science.

3.2) Applications:

Amazon Go – No checkout lines

3.3) Computer Vision - Advances in image recognition by computers involve processing large
sets of image data covering multiple objects of the same category, for example, face
recognition.

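Spectrum of Business Analysis

[Figure: Spectrum of Business Analysis. Stages plotted by value added to the organization
(horizontal axis) against complexity (vertical axis), from lowest to highest:
Reporting (What happened?), Detective Analysis (Why did it happen?),
Dashboards (What’s happening now?), Predictive Analysis (What is likely to happen?),
Big Data (What can happen, given data is collected and used?)]
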
• Reporting / Management Information System

To track what is happening in the organization.

• Detective Analysis

Asking questions based on the data we are seeing, such as: why did something happen?

• Dashboard / Business Intelligence

The utopia of reporting: every action in the business is reflected on screen.

• Predictive Modelling

Using past data to predict what will happen at a granular level.

• Big Data

The stage where the complexity of handling data goes beyond what traditional systems can manage.

This can be caused by the volume, variety or velocity of the data. Specific tools are needed to
analyse data at such scale.

3.4) Application of Data Science


• Recommendation System
Example: on Amazon, recommendations differ from user to user according to their past
searches.

• Social Media
1. Recommendation Engine
2. Ad placement
3. Sentiment Analysis
• Deciding the right credit limit for credit card customers.
• Suggesting the right products on e-commerce sites
1. Recommendation System
2. Past Data Searched
3. Discount Price Optimization
• How Google and other search engines determine the most relevant results for a
search query
1. Apply ML and Data Science
2. Fraud Detection
3. Ad placement
3.5) Reason for choosing data science
Data Science has become a revolutionary technology that everyone seems to talk about. It has
been hailed as the ‘sexiest job of the 21st century’, yet it remains a buzzword, with very few
people understanding the technology in its true sense.
While many people wish to become Data Scientists, it is essential to weigh the pros and cons of
data science to get a realistic picture. The main points are listed below.
Advantages: -
1. It’s in Demand
2. Abundance of Positions
3. A Highly Paid Career
4. Data Science is Versatile
Disadvantages: -
1. Mastering Data Science is nearly impossible
2. A large amount of domain knowledge is required
3. Arbitrary data may yield unexpected results
4. The problem of data privacy

Python Introduction
PYTHON
Python is an interpreted, high-level, general-purpose programming language. It has efficient high-
level data structures and a simple but effective approach to object-oriented programming. Python’s
elegant syntax and dynamic typing, together with its interpreted nature, make it an ideal language
for scripting and rapid application development in many areas on most platforms.

4.1) HISTORY OF PYTHON


Python was developed by Guido van Rossum in the late eighties and early nineties at the National
Research Institute for Mathematics and Computer Science in the Netherlands. Python is derived
from many other languages, including ABC, Modula-3, C, C++, Algol-68, Smalltalk, and Unix
shell and other scripting languages. Python is copyrighted. Its source code is available under the
Python Software Foundation License, an open-source licence that is compatible with the GNU
General Public License (GPL). Python is now maintained by a core development team; Guido van
Rossum held a vital role in directing its progress for many years.
4.2) PYTHON FEATURES
Python's features include:
• Easy-to-learn:
Python has few keywords, a simple structure, and a clearly defined syntax. This
allows the student to pick up the language quickly.

• Easy-to-read:
Python code is clearly defined and easy on the eyes.

• Easy-to-maintain:
Python's source code is fairly easy to maintain.

• A broad standard library:
The bulk of Python's library is very portable and cross-platform compatible on
UNIX, Windows, and Macintosh.

• Interactive Mode:
Python has support for an interactive mode which allows interactive testing
and debugging of snippets of code.

• Portable:
Python can run on a wide variety of hardware platforms and has the same
interface on all platforms.

• Extendable:
You can add low-level modules to the Python interpreter. These modules
enable programmers to add to or customize their tools to be more efficient.

• Databases:
Python provides interfaces to all major commercial databases.

• GUI Programming:
Python supports GUI applications that can be created and ported to many system calls,
libraries and window systems, such as Windows MFC, Macintosh, and the X
Window system of Unix.

• Scalable:
Python provides a better structure and support for large programs than shell
scripting.

Apart from the features above, Python has a long list of other good features:

o It supports functional and structured programming methods as well as OOP.

o It can be used as a scripting language or can be compiled to byte-code for
building large applications.

o It provides very high-level dynamic data types and supports dynamic type
checking.

o It supports automatic garbage collection.

o It can be easily integrated with C, C++, COM, ActiveX, CORBA, and Java.

4.3) Python for Data science:


This section explores Python and its wide-ranging applications in data science, along with a
variety of data science techniques that can be implemented in the Python programming language.

Data science is applied to gather multiple data sets, extract information and insights from
them, and interpret these to make effective business decisions. Being a data scientist
requires learning some of the most widely used programming languages, such
as Java, C++, R and Python. Among these, Python is the preferred choice of many
data scientists around the globe.

4.4) Why Python?

1. Python is an open-source language.
2. Its syntax is as simple as English.
3. It has a very large and collaborative developer community.
4. It has extensive packages.
• UNDERSTANDING OPERATORS:
Theory of operators: Operators are symbolic representations of mathematical and logical operations.
• VARIABLES AND DATA TYPES:
Variables are names bound to objects. Basic data types in Python include int (integer), float,
bool (Boolean) and str (string).
• CONDITIONAL STATEMENTS:
If-else statements (single condition)
If-elif-else statements (multiple conditions)
• LOOPING CONSTRUCTS:
For loop
• FUNCTIONS:
Functions are reusable pieces of code, created to solve a specific problem.
Two types: built-in functions and user-defined functions.
Once defined, a function can be called as many times as needed.
• DATA STRUCTURES:

Two types of data structures:

LISTS: A list is an ordered data structure with elements separated by commas and enclosed
within square brackets.

DICTIONARY: A dictionary is a data structure with elements separated by commas and stored
as key: value pairs, enclosed within curly braces {}. Since Python 3.7, dictionaries preserve
insertion order.
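
The constructs listed above can be seen together in one short sketch. The variable names, marks and grade boundaries below are hypothetical and chosen only for illustration.

```python
# Variables are names bound to objects of different types.
marks = 82            # int
weight = 0.5          # float
passed = marks >= 40  # bool, produced by a comparison operator
subject = "Maths"     # str


def grade(score):
    """User-defined function: map a numeric score to a letter grade."""
    # If-elif-else handles multiple conditions.
    if score >= 75:
        return "A"
    elif score >= 50:
        return "B"
    else:
        return "C"


# List: ordered, enclosed in square brackets.
scores = [91, 62, 48]

# Dictionary: key: value pairs, enclosed in curly braces.
grades = {}

# For loop: the same function is reused for every element of the list.
for s in scores:
    grades[s] = grade(s)

print(grades)
```

The function `grade` is called three times inside the loop, which is the sense in which functions are reusable.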

Statistics:
5.1) Descriptive Statistics
• Mode
The mode is the value that occurs most frequently in the data series.
It is robust and is generally not affected much by the addition of a couple of
new values.
Code:
import pandas as pd
data = pd.read_csv("Mode.csv")  # read data from the CSV file
print(data.head())  # print the first five rows
mode_data = data["Subject"].mode()  # mode of the Subject column
print(mode_data)
• Mean
The mean is the arithmetic average of the values.
Code:
import pandas as pd
data = pd.read_csv("mean.csv")  # read data from the CSV file
print(data.head())  # print the first five rows
mean_data = data["Overallmarks"].mean()  # mean of the Overallmarks column
print(mean_data)
• Median
The median is the central value of the sorted data set.
Code:
import pandas as pd
data = pd.read_csv("data.csv")  # read data from the CSV file
print(data.head())  # print the first five rows
median_data = data["Overallmarks"].median()  # median of the Overallmarks column
print(median_data)
5.2) Types of variables
• Continuous – takes continuous numeric values. Eg- marks
• Categorical – takes discrete values. Eg- gender
• Ordinal – ordered categorical variable. Eg- teacher feedback
• Nominal – unordered categorical variable. Eg- gender
5.3) Outliers
Any value that falls far outside the usual range of the data is termed an outlier. Eg- 9700
instead of 97.
Reasons for Outliers
• Typos – made during data collection. Eg- adding an extra zero by mistake.
• Measurement error – outliers caused by a faulty measuring instrument.
• Intentional error – errors that are introduced intentionally. Eg- claiming a smaller amount of
alcohol consumed than actual.
• Legitimate outlier – values that are not errors but are present in the data for legitimate
reasons. Eg- a CEO’s salary might actually be high compared to other employees.
5.4) Interquartile Range (IQR):
It is the difference between the third and first quartiles (IQR = Q3 – Q1). It is robust to outliers.
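
As a minimal sketch of how the IQR is computed and why it is robust, the example below uses only Python's standard `statistics` module. The marks are made-up values, with 9700 standing in for the typo described above, and the 1.5 × IQR fences are the usual box-plot rule.

```python
import statistics

# Hypothetical marks; 9700 is a typo for 97.
marks = [55, 60, 62, 65, 70, 72, 75, 80, 9700]

# quantiles(..., n=4) returns the three cut points Q1, Q2 (median), Q3.
q1, q2, q3 = statistics.quantiles(marks, n=4)
iqr = q3 - q1

# Box-plot rule: values beyond 1.5 * IQR from the quartiles are outliers.
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = [m for m in marks if m < lower_fence or m > upper_fence]

print("Q1:", q1, "Q3:", q3, "IQR:", iqr)
print("Outliers:", outliers)
```

The single extreme value changes the mean of the marks drastically, but it barely moves the quartiles, so the IQR stays small and the typo is flagged.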
5.5) Histograms:
Histograms depict the underlying frequency of a set of discrete or continuous data that are
measured on an interval scale.
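import pandas as pd
import matplotlib.pyplot as plt
histogram = pd.read_csv("histogram.csv")  # read data from the CSV file
# In a Jupyter notebook, run "%matplotlib inline" first to show plots inline.
plt.hist(x="Overall Marks", data=histogram)  # histogram of the Overall Marks column
plt.show()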
5.6) Inferential Statistics
Inferential statistics allows us to make inferences about the population from sample data.
5.7) Hypothesis Testing:
Hypothesis testing is a kind of statistical inference that involves asking a question, collecting data,
and then examining what the data tells us about how to proceed. The hypothesis to be tested is
called the null hypothesis and is given the symbol H0. We test the null hypothesis against an
alternative hypothesis, which is given the symbol Ha.

5.8) T Tests:
T tests are used when we have only a sample and not the population statistics.
The sample standard deviation is used to estimate the population standard deviation.
Because it relies only on sample estimates, the t test is more prone to error than a test based on known population values.
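
A one-sample t statistic can be sketched with the standard library alone. The sample marks and the hypothesised population mean of 70 are assumptions made up for this illustration, and the function name `one_sample_t` is hypothetical.

```python
import math
import statistics


def one_sample_t(sample, population_mean):
    """Return the t statistic for H0: the true mean equals population_mean."""
    n = len(sample)
    sample_mean = statistics.mean(sample)
    # stdev() uses the n - 1 denominator, i.e. the sample standard deviation
    # that estimates the unknown population standard deviation.
    sample_sd = statistics.stdev(sample)
    standard_error = sample_sd / math.sqrt(n)
    return (sample_mean - population_mean) / standard_error


marks = [72, 68, 75, 71, 69, 74, 70, 73]
t_stat = one_sample_t(marks, 70)
degrees_of_freedom = len(marks) - 1

print(round(t_stat, 3), "with", degrees_of_freedom, "degrees of freedom")
```

The statistic is then compared against the t distribution with n - 1 degrees of freedom to decide whether to reject H0.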

5.9) Z Score:
The z score, or standard score, is the distance of an observed value from the mean, measured in
number of standard deviations: z = (x – mean) / standard deviation.

+Z – value is above mean.
-Z – value is below mean.
The distribution, once converted to z scores, always has the same shape as the original
distribution.
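
The conversion to z scores can be sketched as follows, using made-up marks and the population standard deviation from the standard `statistics` module.

```python
import statistics

marks = [50, 60, 70, 80, 90]  # hypothetical marks

mean = statistics.mean(marks)
sd = statistics.pstdev(marks)  # population standard deviation

# z = (x - mean) / standard deviation, applied to every value.
z_scores = [(x - mean) / sd for x in marks]

for x, z in zip(marks, z_scores):
    side = "above" if z > 0 else "below" if z < 0 else "at"
    print(x, round(z, 3), side, "the mean")
```

After the conversion the values have mean 0 and standard deviation 1, while their relative spacing (the shape of the distribution) is unchanged.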

5.10) Chi Squared Test:


To test categorical variables.
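
As an illustrative sketch, the chi-squared statistic for a table of counts can be computed by comparing observed counts with the counts expected if the two categorical variables were independent. The 2 × 2 table below is invented, the function name `chi_squared` is hypothetical, and no continuity correction is applied.

```python
def chi_squared(observed):
    """Return (statistic, degrees of freedom) for a table of observed counts."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    total = sum(row_totals)

    statistic = 0.0
    for i, row in enumerate(observed):
        for j, obs in enumerate(row):
            # Expected count under independence of the two variables.
            expected = row_totals[i] * col_totals[j] / total
            statistic += (obs - expected) ** 2 / expected

    dof = (len(observed) - 1) * (len(observed[0]) - 1)
    return statistic, dof


# Rows: two groups of customers; columns: bought / did not buy (made-up counts).
table = [[30, 10],
         [20, 40]]
statistic, dof = chi_squared(table)
print(round(statistic, 3), dof)
```

A large statistic relative to the chi-squared distribution with `dof` degrees of freedom suggests the two variables are related.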
Correlation:
Correlation determines the relationship between two variables.
It is denoted by r. The value ranges from -1 to +1, where 0 means no linear relation.
Syntax:
import pandas as pd
data = pd.read_csv("data.csv")  # read data from the CSV file
print(data.corr(numeric_only=True))  # pairwise correlation of the numeric columns
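
To show what the correlation coefficient r measures, here is a small sketch that computes Pearson's r by hand. The study hours and marks are made-up values and the function name `pearson_r` is hypothetical.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of products of deviations, and sums of squared deviations.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


hours = [1, 2, 3, 4, 5]
marks = [50, 55, 65, 70, 80]

r = pearson_r(hours, marks)
print(round(r, 4))
```

Values of r close to +1 indicate that marks rise with study hours; reversing the sign of one variable flips the sign of r.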
Predictive Modelling
A data model helps organizations capture all the points of information necessary to perform
operations and enact policy based on the data they collect. This can be explained with an example
of a sales transaction which is broken down into related groups of data points, describing the
customer, the seller, the item sold, and the payment mechanism. For instance, if the sales
transactions were recorded without the date on which they occurred, it would be impossible to
enforce certain return policies. Data modelling in data science is also performed to help
organizations ensure that they are collecting all the necessary items of information in the first place.
Predictive modelling makes use of past data and its attributes to predict the future.
e.g.-
Past: horror movies watched
Future: unwatched horror movies

Predicting stock price movement:
1. Analysing past stock prices.
2. Analysing similar stocks.
3. Future stock price required.
6.1) Types:
1. Supervised Learning:
Supervised learning is a type of algorithm that uses a known dataset (called the training
dataset) to make predictions. The training dataset includes input data and response values.
• Regression – the response takes continuous values. Eg- marks
• Classification – the response takes discrete class values, in the simplest case only two.
Eg- cancer prediction is either 0 or 1.
2. Unsupervised Learning:
Unsupervised learning is the training of a machine using information that is neither classified
nor labelled. Here the task of the machine is to group unsorted information according to
similarities, patterns and differences without any prior training on the data.
• Clustering: A clustering problem is where you want to discover the inherent
groupings in the data, such as grouping customers by purchasing behaviour.
• Association: An association rule learning problem is where you want to discover
rules that describe large portions of your data, such as people that buy X also tend
to buy Y.

6.2) Stages of Predictive Modelling
1. Problem definition
2. Hypothesis Generation
3. Data Extraction/Collection
4. Data Exploration and Transformation
5. Predictive Modelling
6. Model Deployment/Implementation

6.3) Problem Definition:


Identify the right problem statement and, ideally, formulate the problem mathematically.
6.4) Hypothesis Generation:
List all possible variables that might influence the problem objective. These variables should
be free from personal bias and preferences.
The quality of the model is directly proportional to the quality of the hypotheses.

6.5) Data Extraction/Collection

Collect data from different sources and combine them for exploration and model building.
While looking at the data we might come across new hypotheses.

6.6) Data Exploration and Transformation


Data extraction is a process that involves retrieval of data from various sources for further data
processing or data storage.
Steps of Data Exploration and Transformation
• Reading the data, e.g. from a CSV file
• Variable identification
• Univariate Analysis
• Bivariate Analysis
• Missing value treatment
• Outlier treatment
• Variable Transformation

6.6.1) Variable Treatment:


Variable identification is the process of identifying whether a variable is
1. An independent or dependent variable
2. A continuous or categorical variable
Why do we perform variable identification?
1. Techniques like supervised learning require identification of the dependent variable.
2. Different data processing techniques apply to categorical and continuous data.
Categorical variable – stored as object.
Continuous variable – stored as int or float.
6.6.2) Univariate Analysis:
1. Explore one variable at a time.
2. Summarize the variable.
3. Make sense out of that summary to discover insights, anomalies, etc.
6.6.3) Bivariate Analysis:
• When two variables are studied together for their empirical relationship.
• When you want to see whether the two variables are associated with each other.
• It helps in prediction and detecting anomalies.
6.6.4) Missing Value Treatment:
Reasons for missing values:
1. Non-response – Eg- when you collect data on people’s income and many choose not to
answer.
2. Error in data collection. Eg- faulty data
3. Error in data reading.
Types:
1. MCAR (Missing completely at random): Missing values have no relation to the variable in
which the missing values exist or to the other variables in the dataset.
2. MAR (Missing at random): Missing values have no relation to the variable in which the
missing values exist, but are related to the other variables in the dataset.
3. MNAR (Missing not at random): Missing values have a relation to the variable in which
the missing values exist.
Identifying
Syntax: -
1. describe()
2. isnull()
The output of isnull() will be True or False.
Different methods to deal with missing values
1. Imputation
Continuous – impute with the help of the mean, median or a regression model.
Categorical – impute with the mode or a classification model.
2. Deletion
Row-wise or column-wise deletion. However, it leads to loss of data.
Outlier Treatment
Reasons for Outliers:
1. Data entry errors
2. Measurement errors
3. Processing errors
4. Change in underlying population
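
Returning to the missing-value treatment described earlier in this section (identification with isnull(), imputation and deletion), the following is a minimal pandas sketch. The column names and values are hypothetical, and median/mode imputation is only one of the options listed.

```python
import pandas as pd

# Hypothetical records with gaps in a continuous and a categorical column.
data = pd.DataFrame({
    "income": [30000.0, None, 45000.0, 52000.0, None],
    "job": ["clerk", "manager", None, "clerk", "clerk"],
})

# Identify: isnull() marks each cell True/False; sum() counts them per column.
missing_counts = data.isnull().sum()

# Imputation: median for the continuous column, mode for the categorical one.
imputed = data.copy()
imputed["income"] = imputed["income"].fillna(imputed["income"].median())
imputed["job"] = imputed["job"].fillna(imputed["job"].mode()[0])

# Deletion: drop every row that has any missing value (this loses data).
deleted = data.dropna()

print(missing_counts)
print(imputed)
print(len(data), "rows before deletion,", len(deleted), "after")
```

Imputation keeps all five rows, whereas row-wise deletion keeps only the rows that were complete to begin with.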
6.7) Types of Outliers:
6.7.1) Univariate
Analysing only one variable for outliers.
Example: In a box plot of height and weight, weight alone will be analysed for outliers.
6.7.2) Bivariate
Analysing both variables together for outliers.
Eg- In a scatter plot of height and weight, both will be analysed.
Identifying Outliers
Graphical Method
• Box Plot
• Scatter Plot

Formula Method
Using the box plot rule, a value is an outlier if it is
< Q1 - 1.5 * IQR or > Q3 + 1.5 * IQR
where IQR = Q3 – Q1
Q3 = value of the 3rd quartile
Q1 = value of the 1st quartile
Treating Outliers
1. Deleting observations
2. Transforming and binning values
3. Imputing outliers like missing values
4. Treating them separately
Variable Transformation
It is the process by which:
1. We replace a variable with some function of that variable. Eg – replacing a
variable x with its log.
2. We change the distribution or relationship of a variable with others.
It is used to:
1. Change the scale of a variable
2. Transform nonlinear relationships into linear relationships
3. Create a symmetric distribution from a skewed distribution.
Common methods of Variable Transformation – Logarithm, Square root, Cube root, Binning, etc.
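
The effect of a logarithm transformation on a skewed variable can be sketched as below. The data are invented (each value is double the previous one) and the helper `skewness` is a hypothetical name for the standard moment-based skewness.

```python
import math


def skewness(values):
    """Moment-based skewness: 0 for a symmetric distribution."""
    n = len(values)
    mean = sum(values) / n
    variance = sum((v - mean) ** 2 for v in values) / n
    third_moment = sum((v - mean) ** 3 for v in values) / n
    return third_moment / variance ** 1.5


# Right-skewed variable: a few very large values stretch the upper tail.
x = [1, 2, 4, 8, 16, 32, 64, 128, 256]

# Variable transformation: replace x with its (base 2) logarithm.
log_x = [math.log2(v) for v in x]

print("skewness before:", round(skewness(x), 3))
print("skewness after: ", round(skewness(log_x), 3))
```

Because the original values grow multiplicatively, their logarithms are evenly spaced, which turns the skewed distribution into a symmetric one.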

Model Building
It is the process of creating a mathematical model for estimating or predicting the future based
on past data.
Example-
A retail bank wants to know the default behaviour of its credit card customers. It wants to predict
the probability of default for each customer in the next three months.
• The probability of default lies between 0 and 1.
• Assume every customer starts with a 10% default rate:
Probability of default for each customer in the next 3 months = 0.1
The model moves this probability towards one of the extremes based on attributes from past information.
A customer with a volatile income is more likely to default (closer to 1).
A customer with a healthy credit history over the last few years has a low chance of default (closer to 0).

Steps in Model Building


1. Algorithm Selection
2. Training Model
3. Prediction / Scoring

Algorithm Selection
Example-

Eg- Predict whether the customer will buy the product or not.

7.1) Algorithms
• Logistic Regression
• Decision Tree
• Random Forest

Training Model
It is the process of learning the relationship / correlation between the independent and dependent variables.
We use the dependent variable of the train data set to learn how to predict/estimate.
Dataset
• Train
Past data (known dependent variable).
Used to train the model.
• Test
Future data (unknown dependent variable).
Used to score.
Prediction / Scoring
It is the process of estimating/predicting the dependent variable of the test data set by applying the model rules.
We apply what was learned in training to the test data set for prediction/estimation.
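
The train/score split can be illustrated with a deliberately tiny model: a one-rule decision tree ("decision stump") that learns a single income threshold from the train data and then scores the test data. The incomes, labels and function names are hypothetical; real projects would use a library implementation of the algorithms listed above.

```python
def train_stump(rows):
    """Learn the income threshold that misclassifies the fewest training rows.

    Each row is (income, bought), where bought is 1 or 0.
    The learned rule is: predict 1 when income >= threshold.
    """
    best_threshold, best_errors = None, len(rows) + 1
    for threshold, _ in rows:
        errors = sum(
            (income >= threshold) != bool(bought) for income, bought in rows
        )
        if errors < best_errors:
            best_threshold, best_errors = threshold, errors
    return best_threshold


def predict(threshold, incomes):
    """Score new customers by applying the learned rule."""
    return [1 if income >= threshold else 0 for income in incomes]


# Train: past data with a known dependent variable (bought or not).
train = [(20, 0), (25, 0), (32, 0), (41, 1), (48, 1), (60, 1)]
# Test: new data whose dependent variable is unknown.
test_incomes = [22, 45, 70]

threshold = train_stump(train)
predictions = predict(threshold, test_incomes)
print(threshold, predictions)
```

Training only ever sees the train rows; the test incomes are used solely for scoring, which mirrors the split described above.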

7.2) Algorithms of Machine Learning

• Linear Regression
Linear regression is a statistical approach for modelling the relationship between a dependent
variable and a given set of independent variables.
It is assumed that the two variables are linearly related. Hence, we try to find a linear function
that predicts the response value (y) as accurately as possible as a function of the feature or
independent variable (x).
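The equation of the regression line is represented as:
y = b0 + b1 * x
where b0 is the intercept and b1 is the slope.
The squared error or cost function, J, is:
J(b0, b1) = (1 / 2n) * sum over i of (y_i - (b0 + b1 * x_i))^2
[Figure: scatter plot of Y-values]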
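
A least-squares fit of a straight line y = b0 + b1 * x can be sketched without any libraries. The data points are invented, the function names are hypothetical, and the cost is assumed here to be the sum of squared residuals divided by 2n.

```python
def fit_line(xs, ys):
    """Return the intercept b0 and slope b1 that minimise squared error."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    b1 = sxy / sxx
    b0 = mean_y - b1 * mean_x
    return b0, b1


def cost(b0, b1, xs, ys):
    """Squared-error cost J: sum of squared residuals divided by 2n."""
    n = len(xs)
    return sum((y - (b0 + b1 * x)) ** 2 for x, y in zip(xs, ys)) / (2 * n)


xs = [0, 1, 2, 3, 4]
ys = [1, 3, 5, 7, 9]  # generated from y = 1 + 2x, so the fit is exact

b0, b1 = fit_line(xs, ys)
print(b0, b1, cost(b0, b1, xs, ys))
```

Any other choice of b0 and b1 gives a higher cost on this data, which is what "as accurately as possible" means for linear regression.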

• Logistic Regression

Logistic regression is a statistical model that in its basic form uses a logistic function to model a
binary dependent variable, although many more complex extensions exist.

The cost for a single observation with true label y (0 or 1) and predicted probability p is:
C = -(y log(p) + (1-y) log(1-p))
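
The logistic function and the usual log-loss (cross-entropy) cost for one observation can be sketched as follows. Here `p` denotes the predicted probability and `y` the true label; the input value 2.0 is arbitrary and the function names are hypothetical.

```python
import math


def sigmoid(z):
    """Logistic function: maps any real number to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def log_loss(y, p):
    """Cost for one observation with true label y (0 or 1) and prediction p."""
    return -(y * math.log(p) + (1 - y) * math.log(1 - p))


p = sigmoid(2.0)  # a fairly confident prediction of class 1

print("predicted probability:", round(p, 4))
print("cost if the true label is 1:", round(log_loss(1, p), 4))
print("cost if the true label is 0:", round(log_loss(0, p), 4))
```

The cost is small when the predicted probability agrees with the true label and grows quickly when a confident prediction is wrong.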

• K-Means Clustering (Unsupervised Learning)

K-means clustering is a type of unsupervised learning, which is used when you have unlabelled
data (i.e., data without defined categories or groups). The goal of this algorithm is to find groups in
the data, with the number of groups represented by the variable K. The algorithm works iteratively
to assign each data point to one of K groups based on the features that are provided. Data points
are clustered based on feature similarity.
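
The iterative assign-and-update loop of K-means can be sketched in a few lines for one-dimensional data. The points, the initial centroids and the fixed number of iterations are assumptions for illustration; library implementations add smarter initialisation and a convergence check.

```python
def k_means(points, centroids, iterations=10):
    """Cluster 1-D points; `centroids` holds the K initial guesses."""
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid.
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(
                range(len(centroids)), key=lambda i: abs(p - centroids[i])
            )
            clusters[nearest].append(p)
        # Update step: move each centroid to the mean of its cluster.
        centroids = [
            sum(c) / len(c) if c else centroids[i]
            for i, c in enumerate(clusters)
        ]
    return centroids, clusters


points = [1.0, 1.5, 2.0, 9.0, 10.0, 11.0]
centroids, clusters = k_means(points, centroids=[0.0, 5.0])  # K = 2

print(centroids)
print(clusters)
```

No labels are supplied: the two groups emerge purely from how close the points are to one another.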

METHODOLOGY
PREDICTING IF CUSTOMER BUYS TERM DEPOSIT
Problem Statement:

Your client is a retail banking institution. Term deposits are a major source of income
for a bank.
A term deposit is a cash investment held at a financial institution. Your money is
invested for an agreed rate of interest over a fixed amount of time, or term. The bank
has various outreach plans to sell term deposits to their customers such as email
marketing, advertisements, telephonic marketing and digital marketing.
Telephonic marketing campaigns still remain one of the most effective ways to reach
out to people. However, they require huge investment as large call centres are hired
to actually execute these campaigns. Hence, it is crucial to identify the customers
most likely to convert beforehand so that they can be specifically targeted via call.
You are provided with the client data such as: age of the client, their job type, their
marital status, etc. Along with the client data, you are also provided with the
information of the call such as the duration of the call, day and month of the call,
etc. Given this information, your task is to predict if the client will subscribe to
a term deposit.
Data Dictionary: -

Prerequisites:
We have the following files:
• train.csv: This dataset will be used to train the model. This file contains all the
client and call details as well as the target variable “subscribed”.
• test.csv: The trained model will be used to predict whether the new set of clients
in this dataset will subscribe to the term deposit or not.
• TEST.csv file: -
• TRAIN.csv file: -

Problem Description
Provided with the following files: train.csv and test.csv.
Use the train.csv dataset to train the model. This file contains all the client and call details as
well as the target variable “subscribed”. Then use the trained model to predict whether a new set
of clients will subscribe to the term deposit.

RESULTS
During this six-week training I learnt about data science and am now able to
perform data analysis using Python. I also attempted the various quizzes
and assignments provided for periodic evaluation during the six weeks, and
completed the training with a score of 82% in the final test.

REFERENCES
1) WIKIPEDIA.COM: We have used Wikipedia to define certain terms including the history and
basics of python and data visualisation.
2) TEACHNOOK.REPORT: We have used teachnook site for data science in order to predict the
optimum site data.
3) SCRIBB.NET We have used SCRIBB.NET to learn about certain functions about how they work
and how to define in code, we have also used geeks for geeks to learn more about libraries used
in this project in data wrangling, data collection, web scraping, data visualisation, machine
learning and many more.
