Data Science Ppt1 Update


Data Science

AN INTERDISCIPLINARY FIELD, DATA SCIENCE DEALS WITH PROCESSES AND SYSTEMS THAT ARE USED TO EXTRACT KNOWLEDGE OR INSIGHTS FROM LARGE AMOUNTS OF DATA.
THE DATA CAN BE EITHER STRUCTURED OR UNSTRUCTURED.
• Data science is primarily the science of uncovering hidden patterns in data. Those patterns or insights can go a long way toward achieving ground-breaking results in several fields and improving people’s lives.
• Data science is an interdisciplinary field (it consists of more than one branch of study) that uses:
• statistics,
• computer science, and
• machine learning algorithms
to gain insights from both structured and unstructured data.
Main Components of Data Science

• The main steps of the process are as follows:

1. Data Exploration

• This is the most important step, and also the one that consumes the most time.
• Around 70 percent of the time is spent on data exploration.
• The main ingredient of data science is data, and when we first get the data it is seldom in a correct, structured form.
Removing Noise in Data

• As data scientists, we usually don’t think about how our data is collected. We focus on analysis, not measurement.
• This can be dangerous if we are dealing with noisy data.
• A dirty dataset can be a bottleneck that reduces the quality of the entire analysis pipeline.
• There is often a lot of noise present in the data.
• Noise here means unwanted data that is not required. So what do we do in this step?
• This step involves sampling and transforming the data: we check the observations (rows) and features (columns) and remove the noise using statistical methods.
• We also check the relationship between features, that is, whether the features (columns) are dependent on each other or independent of each other,
• and whether there are missing values in the data.
• In short, the data is transformed for further use.
• Hence this is one of the most time-consuming steps.
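The cleaning step described above can be sketched in a few lines. This is only an illustration under stated assumptions: the rows are made up, and the interquartile-range rule is one common statistical method for flagging outliers, not necessarily the one the slides have in mind.

```python
import statistics

# Hypothetical raw observations: each row is (age, income); None marks a missing value.
rows = [
    (25, 32000), (31, 41000), (29, None), (31, 41000),
    (42, 58000), (38, 52000), (27, 35000), (35, 990000),
]

# 1. Drop rows that contain missing values.
complete = [r for r in rows if None not in r]

# 2. Drop exact duplicate rows while keeping the original order.
deduped = list(dict.fromkeys(complete))

# 3. Remove outliers in the income column with the interquartile-range rule.
incomes = [income for _, income in deduped]
q1, _, q3 = statistics.quantiles(incomes, n=4)
iqr = q3 - q1
low, high = q1 - 1.5 * iqr, q3 + 1.5 * iqr
clean = [r for r in deduped if low <= r[1] <= high]

print(len(rows), "rows in,", len(clean), "rows out")
```

Each step removes a different kind of noise (missing values, duplicates, extreme values), so in practice they are usually inspected one at a time rather than applied blindly.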
2. Modeling
• This is the second step, where we actually use machine learning algorithms.
• Here we fit the model to the data.
• The selection of a model depends on the type of data we have and on the business requirement.
3. Testing the Model
• The model is evaluated on test data to check its accuracy and other characteristics, and the required changes are made to the model to get the desired result.
• If we do not get the desired accuracy, we can go back to step 2 (modeling), select a different model, repeat step 3, and choose the model that gives the best result for the business requirement.
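The loop between modeling and testing can be sketched as follows. The data, the two toy models, and the fixed every-third-row split are all assumptions made for illustration; a real project would shuffle the data and would normally use a machine learning library.

```python
# Hypothetical labelled data: (hours_studied, passed) pairs, made up for illustration.
data = [(h, int(h >= 5)) for h in range(1, 11)] * 3

# Hold out every third row as test data (a real project would shuffle first).
test = data[::3]
train = [row for i, row in enumerate(data) if i % 3 != 0]

def fit_majority(rows):
    """Baseline model: always predict the most common training label."""
    labels = [label for _, label in rows]
    majority = max(set(labels), key=labels.count)
    return lambda x: majority

def fit_threshold(rows):
    """Pick the cut-off on x that gives the best training accuracy."""
    best = max(
        sorted({x for x, _ in rows}),
        key=lambda t: sum(int(x >= t) == y for x, y in rows),
    )
    return lambda x: int(x >= best)

def accuracy(model, rows):
    """Fraction of rows whose label the model predicts correctly."""
    return sum(model(x) == y for x, y in rows) / len(rows)

# Step 3: test every candidate model on the held-out data and keep the best.
scores = {
    "majority": accuracy(fit_majority(train), test),
    "threshold": accuracy(fit_threshold(train), test),
}
best_name = max(scores, key=scores.get)
print(scores, "->", best_name)
```

The key design point is that accuracy is measured on rows the model never saw during fitting, which is what makes the comparison between models meaningful.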
4. Deploying Models
• Once proper testing gives the desired result for the business requirements, we finalize the model that performed best in testing and deploy it in the production environment.
Characteristics of Data Science

• The characteristics are as follows:


1. Business Understanding
• This is the most important characteristic: unless you understand the business, you cannot build a good model, even with good knowledge of machine learning algorithms or statistical skills.
• A data scientist needs to understand the business requirement and develop analytics according to it.
• Domain knowledge of the business is therefore important and helpful.
2. Intuition
• Although the math involved is proven and foundational, a data scientist needs to pick the right model with the right accuracy, because not all models give exactly the same results.
• A data scientist needs to sense when a model is ready for production deployment.
• They also need the intuition to know when the production model has become stale and needs refactoring to respond to a changing business environment.
3. Curiosity
• Data science is not a new field.
• It has existed for some time, but progress in the field is very fast, and new methods for solving familiar problems are being developed constantly.
• So, for a data scientist, the curiosity to learn emerging technologies is very important.
Applications
• The range of applications of data science is huge; it is needed in almost every field. Here are a few sectors where data science can be used or is being used actively.
1. Marketing
• There is huge scope in marketing, for example in improved pricing strategy. Companies like Uber and e-commerce companies can use data-science-driven pricing, which allows them to increase their profits.
2. Healthcare
• Wearable data can be used to prevent and monitor health problems. The data generated from the body can be used in healthcare to prevent future emergencies.
3. Banking and Finance

4. Government Policies
• Governments can use data science to prepare policies that better match the needs and wishes of the people, using data obtained from surveys and other official sources.
Advantages and Disadvantages of Data Science
Advantages
• Some of the advantages are as follows:
• It helps us get insights from historical data with its powerful tools.
• It helps to optimize the business, hire the right people, and generate more revenue, because data science helps you make better decisions for the future of the business.
• Companies can develop and market their products better because they can select their target customers better.
• Data science also helps consumers search for better goods, especially on e-commerce sites, through data-driven recommendation systems.
Disadvantages
• The disadvantages generally arise when data science is used for customer profiling, which can infringe on customer privacy.
• Customers’ information, such as transactions, purchases, and subscriptions, is visible to their parent companies.
• The information obtained using data science can be used against a certain group, individual, country, or community.
• Data science combines domain knowledge, statistics, and coding skills to give the desired results. The broad field of data science involves the application of machine learning and deep learning, because studying the past to predict the future, or studying behavioral traits, requires data that could not be analyzed without data science.
Structured vs Unstructured Data

What is Structured Data?

• Structured data usually resides in relational databases (RDBMS). Typical fields hold values such as phone numbers, Social Security numbers, or ZIP codes.
• Even variable-length text strings such as names are contained in records, which makes the data simple to search.
• Data may be human- or machine-generated, as long as it is created within an RDBMS structure.
• This format is searchable both with human-generated queries and via algorithms.
• Common relational database applications with structured data include airline reservation systems, inventory control, and sales transactions.
• Structured Query Language (SQL) enables queries on this type of structured data within relational databases.
What is Unstructured Data?

• Unstructured data is essentially everything else.
• Unstructured data has internal structure but is not structured via pre-defined data models or schemas.
• It may be textual or non-textual, and human- or machine-generated. It may also be stored in a non-relational (NoSQL) database.
Typical human-generated unstructured data includes:
• Text files: Word processing, spreadsheets, presentations, email, logs.
• Email: Email has some internal structure thanks to its metadata, and we sometimes refer to it as semi-structured. However, its message field is unstructured.
• Social Media: Data from Facebook, Twitter, LinkedIn.
• Website: YouTube, Instagram, photo sharing sites.
• Mobile data: Text messages, locations.
• Communications: Chat, phone recordings.
• Media: MP3, digital photos, audio and video files.
• Business applications: MS Office documents, productivity applications.
Typical machine-generated unstructured data includes:
• Satellite imagery: Weather data, military movements.
• Scientific data: Oil and gas exploration, space exploration, atmospheric data.
• Digital surveillance: Surveillance photos and video.
• Sensor data: Traffic, oceanographic sensors.
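The point about email being semi-structured can be made concrete with Python's standard `email` package. The message below is made up for illustration: its headers parse into predictable fields, while the body stays free text.

```python
from email import message_from_string

# A made-up message used only for illustration.
raw = """\
From: alice@example.com
To: bob@example.com
Subject: Quarterly report

Hi Bob, the numbers look good. Can we talk on Friday?
"""

msg = message_from_string(raw)

# The headers are metadata with a predictable key/value structure.
sender = msg["From"]
subject = msg["Subject"]

# The body is free text with no predefined schema.
body = msg.get_payload()

print(sender, "|", subject)
print(body)
```

The header fields could be stored directly in table columns, whereas extracting meaning from the body (for example, that a meeting is proposed for Friday) needs text analysis.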
Understanding Data Science

Data science is primarily the science used to uncover hidden patterns in data. Those hidden patterns can be used to achieve optimal results in several fields and hence improve people’s lives.
Working with Data Science

• Data science work can be divided into the following categories.


• Understanding the Problem – It is essential that the problem statement is clear before you dive into the actual implementation. Knowing what to find out is crucial for getting the right data and deriving the right solution.
• Getting the right data – Once the problem is understood, it’s essential to get the right data to work with.
• As a data scientist, you will often have to use a variety of methods to extract data sets.
• The most traditional way of obtaining data is directly from files stored in CSV (Comma-Separated Values) or TSV (Tab-Separated Values) format. These files are flat text files.
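Reading such flat files can be sketched with Python's built-in `csv` module. The data and column names here are made up, and an in-memory string stands in for a file on disk.

```python
import csv
import io

# Made-up CSV text standing in for a file on disk; with a real file you would
# pass open("some_file.csv", newline="") to DictReader instead.
csv_text = "name,city,salary\nAsha,Pune,52000\nRavi,Delhi,48000\n"

reader = csv.DictReader(io.StringIO(csv_text))
rows = list(reader)

# Every value is read as a string, so numeric columns need converting.
salaries = [int(row["salary"]) for row in rows]

# A TSV file is read the same way with a tab delimiter.
tsv_text = csv_text.replace(",", "\t")
tsv_rows = list(csv.DictReader(io.StringIO(tsv_text), delimiter="\t"))

print(rows[0]["name"], salaries)
```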
• You might be using publicly available data, data available via an API, data found in a database, or in many cases a combination of these methods.
SQL
• If you need to obtain data from a relational database, you will need to use SQL. You can connect a Jupyter Notebook to most common database types using a library called SQLAlchemy.
• For example, with an open SQLAlchemy connection (text is imported from sqlalchemy):
result = connection.execute(text("SELECT * FROM my_table"))
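To try the same kind of query without installing anything, the sketch below swaps SQLAlchemy for Python's built-in `sqlite3` module and an in-memory database. The table contents are made up; only the table name `my_table` comes from the line above.

```python
import sqlite3

# In-memory database with a made-up table, used only for illustration.
connection = sqlite3.connect(":memory:")
connection.execute(
    "CREATE TABLE my_table (id INTEGER PRIMARY KEY, name TEXT, zip TEXT)"
)
connection.executemany(
    "INSERT INTO my_table (name, zip) VALUES (?, ?)",
    [("Asha", "411001"), ("Ravi", "110001")],
)

# Run the query and fetch every row as a tuple.
result = connection.execute("SELECT * FROM my_table ORDER BY id").fetchall()
connection.close()

print(result)
```

The `?` placeholders let the database driver insert values safely instead of building the SQL string by hand.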
Scraping
• Web scraping is used to download pages from a website and extract the required information from them.
• There are a number of Python libraries that can be used for this, but one of the simplest to use is Beautiful Soup.
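The extraction half of scraping can be sketched with the standard library alone. Note the swap: this uses `html.parser` rather than Beautiful Soup so that it runs without extra installs, and the page below is made up. The `LinkCollector` class is a hypothetical helper written for this example.

```python
from html.parser import HTMLParser

# A made-up page; a real scraper would first download the HTML over HTTP.
html = """
<html><body>
  <h1>Products</h1>
  <a href="/item/1">Laptop</a>
  <a href="/item/2">Phone</a>
</body></html>
"""

class LinkCollector(HTMLParser):
    """Collect (href, link text) pairs from every <a> tag."""

    def __init__(self):
        super().__init__()
        self.links = []
        self._href = None  # href of the <a> tag currently open, if any

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")

    def handle_data(self, data):
        if self._href is not None and data.strip():
            self.links.append((self._href, data.strip()))

    def handle_endtag(self, tag):
        if tag == "a":
            self._href = None

parser = LinkCollector()
parser.feed(html)
links = parser.links

print(links)
```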

• Another popular way to gather data is to connect to web APIs. Websites such as Facebook and Twitter allow users to connect to their web servers and access their data.
API
• An API (application programming interface) is, in this context, a web-based system that provides an endpoint for data, which you can connect to programmatically. Typically the data is returned in JSON or XML format.
• JSON stands for JavaScript Object Notation.
• JSON is a lightweight format for storing and transporting data.
• JSON is often used when data is sent from a server to a web page.
• XML stands for eXtensible Markup Language.
• XML is often used for distributing data over the Internet.
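To show how the two formats differ in practice, the sketch below parses the same made-up record from a JSON string and from an XML string using only the standard library. The field names are assumptions chosen for the example.

```python
import json
import xml.etree.ElementTree as ET

# Two made-up API responses carrying the same record.
json_text = '{"city": "Pune", "temperature_c": 31.5}'
xml_text = "<weather><city>Pune</city><temperature_c>31.5</temperature_c></weather>"

# JSON maps directly onto Python dicts, lists, strings and numbers.
from_json = json.loads(json_text)

# XML is parsed into a tree of elements whose text is always a string.
root = ET.fromstring(xml_text)
from_xml = {
    "city": root.findtext("city"),
    "temperature_c": float(root.findtext("temperature_c")),
}

print(from_json)
print(from_xml)
```

JSON carries basic types (numbers, strings, lists) itself, while with XML the reader has to convert text to the intended type, as the `float(...)` call shows.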
• In machine learning, you may need to obtain data using this method. A simple example is getting weather data from a publicly available API known as Dark Sky.
• Exploratory Data Analysis – Exploratory Data Analysis (EDA) is an approach to analyzing datasets to summarize their main characteristics, often with visual methods.
• It is not easy to look at a column of numbers or a whole spreadsheet and determine important characteristics of the data.
• Data analysis is defined as a process of cleaning and transforming data to discover useful information for business decision-making. Its purpose is to extract useful information from data and to make decisions based on that analysis.
• The steps involve checking for duplicate data, missing values, and NULL values.
• Exploratory data analysis methods are generally classified in two ways. First, each method is either non-graphical or graphical. Second, each method is either univariate or multivariate (usually just bivariate).
• Univariate analysis is the simplest form of data analysis, where the data being analyzed contains only one variable.
• Univariate is a term commonly used in statistics to describe data that consists of observations on only a single characteristic or attribute.
• A simple example of univariate data would be the salaries of workers in an industry.
• Bivariate analysis is used to find out whether there is a relationship between two different variables.
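A non-graphical version of both kinds of analysis can be sketched with the standard library. The salary and experience figures are made up, and Pearson correlation is used here as one common way to measure a bivariate relationship; the slides do not name a specific measure.

```python
import statistics
from math import sqrt

# Made-up observations for five workers.
salaries = [32000, 41000, 38000, 52000, 47000]
experience_years = [1, 4, 3, 8, 6]

# Univariate, non-graphical: summarise one variable at a time.
summary = {
    "mean": statistics.mean(salaries),
    "median": statistics.median(salaries),
    "stdev": statistics.stdev(salaries),
}

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    mx, my = statistics.mean(xs), statistics.mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    spread = sqrt(sum((x - mx) ** 2 for x in xs) * sum((y - my) ** 2 for y in ys))
    return cov / spread

# Bivariate: is salary related to years of experience?
r = pearson(experience_years, salaries)

print(summary)
print("correlation:", round(r, 3))
```

A value of `r` close to 1 or -1 suggests a strong linear relationship, while a value near 0 suggests little or none.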
• Data Visualization – Once the data is cleaned and pre-processed, it’s necessary to visualize the data to find the right features or columns to use for our model.

• Model Selection – Model building is the core activity of a data science project. It is carried out using either statistical methods or machine learning techniques. Selecting the right model for a particular problem statement is essential, because no single model fits every data set perfectly.
Model Selection

• This is the stage where we can finally start evaluating our complete data science system.
• The end of modeling is characterized by model evaluation, where you measure:
• Accuracy: how well the model performs, i.e. does it describe the data accurately?
• Relevance: does the model answer the original question that you set out to answer?
• Deployment – Once the model is built and the business is satisfied with the findings, the model can be deployed to production and used in the product.
• Once the model is used by people, you get feedback.
• The more accurately you capture the feedback, the more effective the changes you make to your model will be, and the more accurate your final results will be.
What can you do with Data Science?

Let’s look at some uses of data science that have made our lives easier in recent times.
Example 1
• YouTube is a favorite source of entertainment, knowledge, and news in our daily lives. We prefer watching videos to going through slides or long articles. But how did we become so addicted to YouTube? What has made YouTube so unique and different?
• YouTube uses our data to recommend the videos we would like to see next. It uses a recommender system algorithm to track our search patterns, and based on that its intelligent system shows us videos that are related to the ones we have seen.
• So basically, it saves us the time and energy of manually looking for videos that might be helpful to us based on our liking.
Example 2
• Similar to YouTube, recommender systems are also used by streaming and e-commerce services such as Netflix and Amazon.
• In the case of Netflix, we are shown TV shows or movies that are related to the ones we have watched, which saves us the time of looking for similar videos.
• Additionally, Amazon recommends products based on our buying patterns: it displays products that other buyers have bought along with a given product, or that we might buy based on our shopping habits.
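The "bought along with that product" idea can be sketched as simple co-occurrence counting. This is a toy illustration on made-up orders, not a description of how Amazon, Netflix, or YouTube actually build their recommender systems; `bought_together` is a hypothetical helper.

```python
from collections import Counter
from itertools import combinations

# Made-up purchase baskets, one set of products per order.
orders = [
    {"phone", "case", "charger"},
    {"phone", "case"},
    {"laptop", "mouse"},
    {"phone", "charger"},
    {"laptop", "mouse", "case"},
]

# Count how often each pair of products appears in the same order.
pair_counts = Counter()
for basket in orders:
    for a, b in combinations(sorted(basket), 2):
        pair_counts[(a, b)] += 1
        pair_counts[(b, a)] += 1

def bought_together(product, top_n=2):
    """Return the products most often bought with `product`."""
    scores = Counter({b: n for (a, b), n in pair_counts.items() if a == product})
    return [name for name, _ in scores.most_common(top_n)]

print(bought_together("phone"))
print(bought_together("laptop", top_n=1))
```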
Example 3
• One of the major breakthroughs in data science is virtual assistants such as Amazon’s Alexa or Apple’s Siri. We often find it tedious to search through our phone for contacts, or feel too lazy to set alarms or reminders.
• Virtual assistants do these things for us simply by listening to our commands. We tell Alexa or Siri what we want, and the system converts our speech to text and uses Natural Language Processing to extract insights from that text and solve our problems.
• In layman’s terms, these intelligent systems use speech-to-text technology to save time and solve our problems.
Example 4
• Data science has also made life easier for athletes and other people involved in sports. The enormous amount of data available these days can be used to analyze an athlete’s health and mental condition in order to prepare for a game.
• The data can also be used to devise strategies to outplay the opponent even before the match starts.
Example 5
• Data science has made life easier in the healthcare sector as well. Medics and researchers can use deep learning to analyze cells and help stop a disease from occurring in the first place.
• They can also prescribe suitable medication for a patient based on predictions from the data.
Python Overview

• Introduction
• Data Types, Expression and Variables
• String
• Conditions and Branching
• Loops
• Functions
• List, Tuples, Dictionaries and Sets
What is Python?

• Python is a popular programming language. It was created by Guido van Rossum and released in 1991.
• It is used for:
• web development (server-side),
• software development,
• mathematics,
• system scripting.
What can Python do?

• Python can be used on a server to create web applications.
• Python can be used alongside software to create workflows.
• Python can connect to database systems. It can also read and modify files.
• Python can be used to handle big data and perform complex mathematics.
Why Python?

• Python works on different platforms (Windows, Mac, Linux, Raspberry Pi, etc.).
• Python has a simple syntax similar to the English language.
• Python has syntax that allows developers to write programs with fewer lines than some other programming languages.
• Python runs on an interpreter system, meaning that code can be executed as soon as it is written. This means that prototyping can be very quick.
• Python can be treated in a procedural way, an object-oriented way, or a functional way.
Good to know
• The most recent major version of Python is Python 3.
• It is possible to write Python in an Integrated Development Environment, such as:
• Thonny,
• PyCharm,
• NetBeans, or
• Eclipse,
which are particularly useful when managing larger collections of Python files.
Python Syntax compared to other programming languages
• Python uses new lines to complete a command, as opposed to other programming languages, which often use semicolons or parentheses.
• Python relies on indentation, using whitespace, to define scope, such as the scope of loops, functions, and classes. Other programming languages often use curly brackets for this purpose.
Python Indentation

• Indentation refers to the spaces at the beginning of a code line.
• Where in other programming languages the indentation in code is for readability only, the indentation in Python is very important.
• Python uses indentation to indicate a block of code.
Example
if 5 > 2:
  print("Five is greater than two!")
Indentation Cont..

• Python will give you an error if you skip the indentation:

Example
Syntax Error (reported by Python as an IndentationError):
if 5 > 2:
print("Five is greater than two!")
Indentation Cont..

The number of spaces is up to you as a programmer, but it has to be at least one.


Example

if 5 > 2:
 print("Five is greater than two!") 
if 5 > 2:
        print("Five is greater than two!") 
You have to use the same number of spaces in the same block of code, otherwise Python will give you an
error:
Example
Syntax Error:

if 5 > 2:
 print("Five is greater than two!")
        print("Five is greater than two!")
Python Variables

• In Python, a variable is created when you assign a value to it:


Example
Variables in Python:
x = 5
x = "Hello, World!"

• Python has no command for declaring a variable.

Assigning Values to Variables
• Python variables do not need explicit declaration to reserve memory space. The declaration happens automatically when you assign a value to a variable. The equal sign (=) is used to assign values to variables.
• The operand to the left of the = operator is the name of the variable, and the operand to the right of the = operator is the value stored in the variable.
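A minimal sketch of assignment as described above. The variable names and values are illustrative and not taken from the original slides.

```python
# Illustrative names and values, not from the original slides.
counter = 100        # an integer
miles = 1000.0       # a floating-point number
name = "John"        # a string

# The same name can later be bound to a value of a different type.
x = 5
x = "Hello, World!"

print(counter, miles, name, x)
```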
Standard Data Types

• The data stored in memory can be of many types. For example, a person's age is stored as a numeric value and his or her address is stored as alphanumeric characters. Python has various standard data types that define the operations possible on them and the storage method for each of them.
• Python has five standard data types −
• Numbers
• String
• List
• Tuple
• Dictionary
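The five types listed above can be seen side by side in a short sketch. All values are made up for illustration.

```python
# One made-up value for each of the five standard data types listed above.
age = 30                                # Number (int; float and complex also exist)
address = "12 Main Street"              # String
scores = [88, 92, 75]                   # List: ordered and mutable
point = (3, 4)                          # Tuple: ordered and immutable
person = {"name": "Asha", "age": age}   # Dictionary: key/value pairs

# type() reports which data type a value has.
for value in (age, address, scores, point, person):
    print(type(value).__name__, value)
```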
Python Strings
• Strings in Python are contiguous sets of characters enclosed in quotation marks.
• The plus (+) sign is the string concatenation operator and the asterisk (*) is the repetition operator. For example:

text = "Hello World"
print(text)           # prints the complete string
print(text[0])        # prints the first character: H
print(text[2:5])      # prints the 3rd to 5th characters: llo
print(text * 2)       # prints the string two times
print(text + "test")  # prints the concatenated string
What is Python If Statement?
• The Python if statement is used for decision-making. It contains a body of code that runs only when the condition given in the if statement is true. If the condition is false, the optional else block runs instead.
Syntax:
if expression:
    statement
else:
    statement
Conditional Statement
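A small runnable instance of the if/else syntax. The function name `describe` and the 30-degree cut-off are arbitrary choices made for this example.

```python
def describe(temperature_c):
    """Return a label for a temperature; the 30-degree cut-off is arbitrary."""
    if temperature_c > 30:
        label = "hot"        # runs only when the condition is true
    else:
        label = "not hot"    # runs only when the condition is false
    return label

print(describe(35))
print(describe(18))
```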
What is Loop?

• Loops can execute a block of code a number of times until a certain condition is met. Their usage is fairly common in programming. Unlike other programming languages, which have for, while, do-while, and other loops, Python has only two: the for loop and the while loop.
What is For Loop?
• A for loop is used to iterate over the elements of a sequence. It is often used when you have a piece of code that you want to repeat "n" number of times.
What is While Loop?
• A while loop is used to repeat a block of code. Instead of running the code block once, it executes the code block multiple times until a certain condition is met.
• Syntax for while loop:

while expression:
    statement
Example:
x = 0
while x < 4:
    print(x)
    x = x + 1
How to use "For Loop"
• In Python, a for loop works with iterators: it steps through the elements of an iterable one at a time.
• Just like the while loop, the for loop is used to repeat a block of code.
• But unlike the while loop, which depends on a condition being true or false, the for loop depends on the elements it has to iterate over.
Example:
for x in range(2, 10):
    print(x)
• The for loop iterates over the numbers declared in the range.
For example, for x in range(2, 7):
• When this code is executed, it prints the numbers from 2 up to but not including 7 (2, 3, 4, 5, 6). The number 7 is not included in the range.
• For loops can also be used to iterate over other things, not just numbers.
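A short sketch of iterating over things other than numbers. The list, string, and dictionary below are made up for illustration.

```python
# Made-up collections used only for illustration.
months = ["Jan", "Feb", "Mar"]
stock = {"pens": 12, "books": 3}

# Iterating over a list yields its elements in order.
for month in months:
    print(month)

# Iterating over a string yields its characters.
letters = [ch for ch in "abc"]

# Iterating over dict.items() yields (key, value) pairs.
low_stock = []
for item, count in stock.items():
    if count < 5:
        low_stock.append(item)

print(letters, low_stock)
```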
n = 10
i = 1
total = 0                # named total so the built-in sum() is not shadowed
while i <= n:
    total = total + i
    i += 1               # Python has no ++ operator
print("The sum is =", total)
Comments

• Python has commenting capability for the purpose of in-code documentation.
• Comments start with a #, and Python will treat the rest of the line as a comment:
Example 1
Comments in Python:
#This is a comment.
print("Hello, World!")
Example 2
#print("Hello, World!")
print("Cheers, Mate!")
n = 5
f = 1
# Multiply f by n, n-1, ..., 2 so that f ends up as the factorial of 5.
while n > 1:
    f = f * n
    n = n - 1
