
UNIT 1

TOPICS
• Introduction to Data Science
• Role of data scientist
• Types of Data
• Tool boxes for data scientists
• Introduction to R studio
What is Data Science?
• Data Science is a combination of mathematics,
statistics, machine learning, and computer science. It is
the practice of collecting, analyzing, and interpreting
data to extract insights that help decision-makers make
informed decisions.
• Data Science is used in almost every industry today to
predict customer behavior and trends and to identify new
opportunities. Businesses use it to make informed decisions
about product development and marketing, as a tool to detect
fraud, and to optimize processes. Governments also use Data
Science to improve the efficiency of public service delivery.
Importance of Data Science
• Nowadays, organizations are overwhelmed
with data. Data Science helps extract meaningful
insights from it by combining various methods,
technologies, and tools. In fields such as e-commerce,
finance, medicine, and human resources, businesses
come across huge amounts of data, and Data Science
tools and technologies help them process all of it.
What is the Data Science process?
• Obtaining the data
• The first step is to identify what type of data needs to be
analyzed; this data then needs to be exported to an Excel or
CSV file.
• Scrubbing the data
• It is essential because before you can read the data, you
must ensure it is in a perfectly readable state, without any
mistakes and with no missing or wrong values.
• Exploratory Analysis
• Analyzing the data is done by visualizing the data in various
ways and identifying patterns to spot anything out of the
ordinary. To analyze the data, you must have excellent
attention to detail to identify if anything is out of place.
• Modeling or Machine Learning
• A data engineer or scientist writes down
instructions for the Machine Learning
algorithm to follow based on the Data that has
to be analyzed. The algorithm iteratively uses
these instructions to come up with the correct
output.
• Interpreting the data
• In this step, you uncover your findings and
present them to the organization. The most
critical skill in this would be your ability to
explain your results.
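A minimal sketch of these five steps in R (the file name sales.csv and its month and amount columns are hypothetical):

  sales <- read.csv("sales.csv")               # Obtain: load the exported CSV
  sales <- na.omit(sales)                      # Scrub: drop rows with missing values
  summary(sales)                               # Explore: summary statistics
  hist(sales$amount)                           # Explore: spot anything out of the ordinary
  model <- lm(amount ~ month, data = sales)    # Model: fit a simple linear model
  summary(model)                               # Interpret: examine and explain the results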
Types of data
• The data is classified into four categories:
• Nominal data.
• Ordinal data.
• Discrete data.
• Continuous data.
Qualitative or Categorical Data
• Qualitative or Categorical Data is data that can’t be
measured or counted in the form of numbers. These types
of data are sorted by category, not by number, which is
why they are also known as Categorical Data. Such data
consist of audio, images, symbols, or text. The gender of
a person, i.e., male, female, or other, is an example of
qualitative data.
• Other examples of qualitative data are:
• What language do you speak?
• Favorite holiday destination
• Opinion on something (agree, disagree, or neutral)
• Colors
Qualitative data is further classified into two parts:
• Nominal Data
• Nominal Data is used to label variables without any order or
quantitative value. The color of hair can be considered nominal
data, as one color can’t be compared with another.
• The name “nominal” comes from the Latin word “nomen,” which
means “name.” With nominal data, we can’t perform numerical
operations or sort the data into any order.
• Examples of Nominal Data :
• Colour of hair (Blonde, Red, Brown, Black, etc.)
• Marital status (Single, Widowed, Married)
• Nationality (Indian, German, American)
• Gender (Male, Female, Others)
• Eye Color (Black, Brown, etc.)
Ordinal Data
• Ordinal data have a natural ordering, in which values are placed in some
kind of order by their position on a scale. These data are used for
observations like customer satisfaction, happiness, etc., but we can’t
perform arithmetic on them.
• Ordinal data is qualitative data whose values have some kind of
relative position. These kinds of data can be considered “in-between”
qualitative and quantitative data. Ordinal data only shows the
sequence and cannot be used for arithmetic-based statistical analysis.
Compared to nominal data, ordinal data have a kind of order that is
not present in nominal data.
• Examples of Ordinal Data :
• When companies ask for feedback, experience, or satisfaction on a scale
of 1 to 10
• Letter grades in the exam (A, B, C, D, etc.)
• Ranking of people in a competition (First, Second, Third, etc.)
• Economic Status (High, Medium, and Low)
• Education Level (Higher, Secondary, Primary)
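In R (introduced later in this unit), ordinal data is naturally represented as an ordered factor; a small illustrative sketch with made-up ratings:

  satisfaction <- factor(c("Low", "High", "Medium", "Low"),
                         levels = c("Low", "Medium", "High"),
                         ordered = TRUE)
  satisfaction[1] < satisfaction[2]   # TRUE: comparisons respect the ordering
  # mean(satisfaction) is not meaningful: arithmetic is undefined for ordinal data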
Quantitative Data
• Quantitative data can be expressed in numerical values, which
makes it countable and suitable for statistical data analysis. These
kinds of data are also known as Numerical data. It answers questions
like “how much,” “how many,” and “how often.” For example, the
price of a phone, a computer’s RAM, and the height or weight of a
person all fall under quantitative data.
• Quantitative data can be used for statistical manipulation. These
data can be represented on a wide variety of graphs and charts,
such as bar graphs, histograms, scatter plots, boxplots, pie charts,
line graphs, etc.
• Examples of Quantitative Data :
• Height or weight of a person or object
• Room Temperature
• Scores and Marks (Ex: 59, 80, 60, etc.)
• Time
Quantitative data is further classified into two parts:
• Discrete Data
• The term discrete means distinct or separate. The discrete data
contain the values that fall under integers or whole numbers. The
total number of students in a class is an example of discrete data.
These data can’t be broken into decimal or fraction values.
• The discrete data are countable and have finite values; their
subdivision is not possible. These data are represented mainly by a
bar graph, number line, or frequency table.
• Examples of Discrete Data :
• Total number of students present in a class
• Cost of a cell phone
• Numbers of employees in a company
• The total number of players who participated in a competition
• Days in a week
Continuous Data
• Continuous data are in the form of fractional numbers. Examples include
the version of an Android phone, the height of a person, the length of an
object, etc. Continuous data represents information that can be divided
into smaller levels, and a continuous variable can take any value within
a range.
• The key difference between discrete and continuous data is that discrete
data contains integers or whole numbers, while continuous data stores
fractional numbers to record things such as temperature, height, width,
time, speed, etc.
• Examples of Continuous Data :
• Height of a person
• Speed of a vehicle
• “Time-taken” to finish the work
• Wi-Fi Frequency
• Market share price
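This distinction maps onto R's integer and double types; an illustrative sketch:

  students <- c(30L, 28L, 31L)         # discrete: whole-number counts (L makes an integer)
  is.integer(students)                 # TRUE: cannot be meaningfully subdivided
  heights <- c(165.4, 172.81, 158.2)   # continuous: any value within a range (in cm)
  is.double(heights)                   # TRUE: arbitrarily fine subdivision is possible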
Data Scientist Roles and
Responsibilities
• Data scientists collaborate closely with business leaders and
other key players to comprehend company objectives and
identify data-driven strategies for achieving those objectives.
A data scientist’s job is to gather a large amount of data,
analyze it, separate out the essential information, and then
utilize tools like SAS, R programming, Python, etc. to extract
insights that may be used to increase the productivity and
efficiency of the business. Depending on an organization’s
needs, data scientists have a wide range of roles and
responsibilities.
The following is a list of some of the data
scientist roles and responsibilities:
1. Collect data and identify data sources
2. Analyze huge amounts of data, both structured and unstructured
3. Create solutions and strategies to business problems
4. Work with team members and leaders to develop data strategy
5. To discover trends and patterns, combine various algorithms and
modules
6. Present data using various data visualization techniques and tools
7. Investigate additional technologies and tools for developing innovative
data strategies
8. Create comprehensive analytical solutions, from data gathering to
display; assist in the construction of data engineering pipelines
9. Support the data scientists, BI developers, and analysts as needed
on their projects; work with the sales and pre-sales teams on
cost reduction, effort estimation, and cost optimization
10. To boost general effectiveness and performance, stay current with the
newest tools, trends, and technologies
11. Collaborate with the product team and partners to provide
data-driven solutions built on original concepts
12. Create analytics solutions for businesses by combining various tools,
applied statistics, and machine learning
13. Lead discussions and assess the feasibility of AI/ML solutions for business
processes and outcomes
14. Architect, implement, and monitor data pipelines, as well as conduct
knowledge sharing sessions with peers to ensure effective data use
Data scientist requirements
• Each industry has its own big data profile for a data scientist to analyze.
Here are some of the more common forms of big data in each industry, as
well as the kinds of analysis a data scientist will likely be required to
perform, according to the Bureau of Labor Statistics (BLS).
• Business:
• Today, data shapes the business strategy for nearly every company,
but businesses need data scientists to make sense of the information.
Analysis of business data can inform decisions around efficiency,
inventory, production errors, customer loyalty and more.
• E-commerce:
• Now that websites collect more than purchase data, data scientists help
e-commerce businesses improve customer service, find trends and
develop services or products.
• Finance:
• In the finance industry, data on accounts, credit and debit transactions
and similar financial data are vital to a functioning business. But for data
scientists in this field, security and compliance, including fraud detection,
are also major concerns.
• Government:
• Big data helps governments form decisions, support constituents and monitor overall
satisfaction. Like the finance sector, security and compliance are a paramount concern for
data scientists.
• Science:
• Scientists have always handled data, but now with technology, they can better collect, share
and analyze data from experiments. Data scientists can help with this process.
• Social networking:
• Social networking data helps inform targeted advertising, improve
customer satisfaction, establish trends in location data and enhance
features and services. Ongoing data analysis of posts, tweets, blogs and
other social media can help businesses constantly improve their
services.
• Healthcare:
• Electronic medical records are now the standard for healthcare facilities, which requires a
dedication to big data, security and compliance. Here, data scientists can help improve
health services and uncover trends that might go unnoticed otherwise.
• Telecommunications:
• All electronics collect data, and all that data needs to be stored, managed, maintained and
analyzed. Data scientists help companies squash bugs, improve products
and keep customers happy by delivering the features they want.
7 essential skills for a data scientist
• 1. Programming
• Programming languages, such as Python or R, are
necessary for data scientists to sort, analyze, and manage
large amounts of data (commonly referred to as “big
data”). As a data scientist just starting out, you should
know the basic concepts of data science and begin
familiarizing yourself with how to use Python. Popular
programming languages include:
• Python
• R
• SAS
• SQL
2. Statistics and probability
• In order to write high-quality machine learning models and
algorithms, data scientists need to learn statistics and
probability. For machine learning, it is essential to
use statistical analysis concepts like linear regression. Data
scientists need to be able to collect, interpret, organize, and
present data, and to fully comprehend concepts like mean,
median, mode, variance, and standard deviation. Here are
different types of statistical techniques you should know:
• Probability distributions
• Over- and under-sampling
• Bayesian and frequentist statistics
• Dimension reduction
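Most of these concepts are single function calls in R; an illustrative sketch with made-up scores:

  scores <- c(59, 80, 60, 75, 80, 68)
  mean(scores); median(scores)        # measures of central tendency
  var(scores); sd(scores)             # measures of spread
  # R's built-in mode() reports storage mode, not the statistical mode;
  # a common workaround for the most frequent value:
  names(which.max(table(scores)))     # "80"
  pnorm(1.96)                         # P(Z <= 1.96) for a standard normal, about 0.975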
3. Data wrangling and database
management
• Data wrangling is the process of cleaning and organizing complex data sets to make
them easier to access and analyze. Manipulating the data to categorize it by patterns
and trends, and to correct and fill in data values, can be time-consuming but is
necessary for making data-driven decisions. This is closely related to
database management: you’re expected to extract data from
different sources, transform it into a suitable format for query and analysis, and
then load it into a data warehouse system. Useful tools for data wrangling include:
• Altair
• Talend
• Alteryx
• Trifacta
• Tamr
• And database management tools include:
• MySQL
• MongoDB
• Oracle
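A small wrangling sketch in base R (the customer table, its columns, and its values are made up):

  customers <- data.frame(
    id   = c(1, 2, 2, 3),
    age  = c(34, NA, NA, 51),
    city = c(" Pune", "Mumbai", "Mumbai", "pune")
  )
  customers <- customers[!duplicated(customers$id), ]   # drop duplicate records
  customers$city <- tolower(trimws(customers$city))     # standardize text values
  customers$age[is.na(customers$age)] <-
    median(customers$age, na.rm = TRUE)                 # fill in missing ages
  customers   # cleaned and ready for query, analysis, or loading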
4. Machine learning and deep learning
• As a data scientist, you’ll want to immerse yourself in machine learning
and deep learning. Incorporating these techniques helps you improve as a
data scientist because you’ll be able to gather and synthesize data more
efficiently, while also predicting the outcomes of future data sets. For
example, you can forecast how many clients your company will have
based on the previous month’s data using linear regression. Later on, you
can boost your knowledge to include more sophisticated models like
Random Forest. Some machine learning algorithms to know include:
• Linear regression
• Logistic regression
• Naive Bayes
• Decision tree
• Random forest algorithm
• K-nearest neighbor (KNN)
• K means algorithm
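The client-forecasting example above, sketched in R with made-up monthly counts:

  history <- data.frame(month   = 1:6,
                        clients = c(120, 135, 150, 160, 178, 190))
  fit <- lm(clients ~ month, data = history)      # fit a straight-line trend
  predict(fit, newdata = data.frame(month = 7))   # forecast next month's clients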
5. Data visualization
• Not only do you need to know how to analyze,
organize, and categorize data, but you’ll also want to
build your skills in data visualization. Being able to
create charts and graphs is important to being a data
scientist. With strong visualization skills, you can
present your work to stakeholders so that the data
tells a compelling story of the business insights.
Familiarity with the following tools should prepare you
well:
• Tableau
• Microsoft Excel
• Power BI
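Beyond these dedicated tools, the same basic chart types can be sketched directly in R's base graphics (the revenue figures are made up):

  revenue <- c(Q1 = 120, Q2 = 150, Q3 = 90, Q4 = 180)
  barplot(revenue, main = "Quarterly Revenue", ylab = "Revenue")   # bar graph
  pie(revenue, main = "Revenue Share by Quarter")                  # pie chart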
6. Cloud computing
• As a data scientist, you'll most likely need to use cloud
computing tools that help you analyze and visualize data
that are stored in cloud platforms. Some certifications will
specifically focus on cloud services such as:
• Amazon Web Services (AWS)
• Microsoft Azure
• Google Cloud
• These tools provide data professionals access to
cloud-based databases and frameworks that are key for
advancing technology. They are used in many industries
now, so it is important in data science to become familiar
with the concepts behind cloud computing.
7. Interpersonal skills
• You’ll want to develop workplace skills such as communication in
order to form strong working relationships with your team
members and be able to present your findings to stakeholders.
Just as data visualization is important for communicating the data
insights you uncover as a data scientist, so is being able to
collaborate with teams successfully. Here are interpersonal skills
you can build upon:
• Active listening
• Effective communication skills
• Sharing feedback
• Attention to detail
• Leadership
• Empathy
• Public speaking
Difference Between Data Scientist, Data
Analyst, and Data Engineer

Focus
• Data Scientist: The focus is on the futuristic display of data.
• Data Analyst: The main focus is on optimization of scenarios, for
example how an employee can enhance the company’s product growth.
• Data Engineer: The focus is on optimization techniques and the
construction of data in a conventional manner; the purpose of a data
engineer is continuously advancing data consumption.

Nature of work
• Data Scientist: Applies both supervised and unsupervised learning to
data, e.g., regression and classification of data, neural networks, etc.
• Data Analyst: Performs data formation and cleaning of raw data, and
interprets and visualizes data to carry out the analysis and produce a
technical summary of the data.
• Data Engineer: Frequently operates at the back end, using optimized
machine learning algorithms to keep data prepared as accurately as
possible.

Skills
• Data Scientist: Python, R, SQL, Pig, SAS, Apache Hadoop, Java, Perl,
Spark.
• Data Analyst: Python, R, SQL, SAS.
• Data Engineer: MapReduce, Hive, Pig, Hadoop techniques.
Most Frequent Used Tools For Data
Science
• 1. Apache Hadoop
• Apache Hadoop is a free, open-source framework from the Apache
Software Foundation, licensed under the Apache License 2.0,
that can manage and store tons and tons of data. It is used for
high-level computations and data processing. Thanks to its parallel-
processing nature, we can work across many clusters of nodes. It
also facilitates solving highly complex computational problems and
data-intensive tasks.
• Hadoop offers standard libraries and functions for the subsystems.
• Effectively scale large data on thousands of Hadoop clusters.
• It speeds up disk-powered performance by up to 10 times per
project.
• Provides the functionalities of modules like Hadoop Common,
Hadoop YARN, Hadoop MapReduce.
• 2. SAS (Statistical Analysis System)
• SAS is a statistical tool developed by SAS Institute. It is
a closed source proprietary software that is used by
large organizations to analyze data. It is one of the
oldest tools developed for Data Science. It is used in
areas like Data Mining, Statistical Analysis, Business
Intelligence Applications, Clinical Trial Analysis,
Econometrics & Time-Series Analysis.
– It is a suite of well-defined tools.
– It has a simple but most effective GUI.
– It provides a Granular analysis of textual content.
– Easy to learn and execute, as there are many tutorials available
with appropriate knowledge.
– Can make visually appealing reports with seamless and
dedicated technical support.
• 3. Apache Spark
• Apache Spark is a data science tool developed by the Apache
Software Foundation and used for analyzing and working on
large-scale data. It is a unified analytics engine for large-scale
data processing, specially designed to handle both batch
processing and stream processing. It allows you to distribute a
program across clusters for processing data, with built-in data
parallelism and fault tolerance. It inherits some of the features
of Hadoop like YARN, MapReduce, and HDFS.
– It offers data cleansing, transformation, model building & evaluation.
– Its ability to work in-memory makes it extremely fast for
processing data and writing to disk.
– It provides many APIs that facilitate repeated access to data.
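Spark can also be driven from R; a minimal local-mode sketch, assuming the sparklyr package and a local Spark installation are available:

  library(sparklyr)
  sc <- spark_connect(master = "local")             # start a local Spark session
  cars_sdf <- copy_to(sc, mtcars)                   # ship a built-in data set to Spark
  fit <- ml_linear_regression(cars_sdf, mpg ~ wt)   # model trained by Spark
  summary(fit)
  spark_disconnect(sc)                              # release the session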
• 4. Data Robot
• DataRobot, founded in 2012, is a leader in
enterprise AI that aids in developing accurate
predictive models for the real-world problems of
any organization. It provides an environment to
automate the end-to-end process of building,
deploying, and maintaining your AI. DataRobot’s
Prediction Explanations help you understand the
reasons behind your machine learning model's
results.
– Highly interpretable.
– It makes the model’s predictions easy to
explain to anyone.
– It makes it possible to implement the whole
Data Science process at a large scale.
• 5. Tableau
• Tableau is the most popular data visualization
tool on the market. The company behind it, an American
interactive data visualization software company founded
in January 2003, was recently acquired by Salesforce.
Tableau provides the facilities to break down raw,
unformatted data into a processable and understandable
format, and it has the ability to visualize geographical
data by plotting longitudes and latitudes on maps.
– It offers comprehensive end-to-end analytics.
– It is a fully protected system that minimizes
security risks.
– It provides a responsive user interface that fits all
types of devices and screen dimensions.
• 6. BigML
• BigML, founded in 2011, is a Data Science tool
that provides a fully interactive, cloud-based GUI
environment that you can use for processing
complex Machine Learning algorithms. The main
goal of BigML is to make building and
sharing datasets and models easier for everyone.
It provides an environment with just one
framework, for reduced dependencies.
– It specializes in predictive modeling.
– Its ability to export models via JSON and
PMML makes for a seamless transition from one
platform to another.
– It provides an easy-to-use web interface over REST
APIs.
• 7. TensorFlow
• TensorFlow, developed by the Google Brain team, is a free
and open-source software library for dataflow and
differentiable programming across a range of tasks. It
provides an environment for building and training
models and for deploying them on platforms such as
computers, smartphones, and servers, achieving maximum
potential with finite resources. It is one of the most
useful tools in the fields of Artificial
Intelligence, Deep Learning, & Machine Learning.
– It provides good performance and high computational
abilities.
– Can run on both CPUs and GPUs.
– Its models are easily trainable and its constructs are
responsive.
• 8. Jupyter
• Jupyter, first released by Project Jupyter in February 2015,
provides open-source software, open standards, and services for
interactive computing across dozens of programming
languages. It is a web-based application tool running on a
kernel, used for writing live code, visualizations, and
presentations. It is one of the best tools for beginner
programmers & data science aspirants, who can use it to
easily learn and adapt the functionalities related to the
Data Science field.
• It provides an environment to perform data cleaning,
statistical computation, and visualization, and to create
predictive machine learning models.
– It has the ability to display plots that are the output of running
code cells.
– It is quite extensible, supports many programming languages, and is
easily hosted on almost any server.
Introduction to R studio
• R Studio is an integrated development
environment (IDE) for R. An IDE is a GUI where you
can write your code, see the results and also see
the variables that are generated during the course
of programming.
• R Studio is available as both Open source and
Commercial software.
• R Studio is also available as both Desktop and
Server versions.
• R Studio is also available for various platforms
such as Windows, Linux, and macOS.
Introduction to R studio for beginners
• RStudio is an open-source tool that provides an
IDE for the R language, as well as enterprise-ready
professional software for data science teams
to develop and share their work.
• After the installation process is over, the R Studio interface is divided into several panels:
• The console panel (left panel) is the place where R waits for you to tell it what
to do and where you see the results generated when you type in commands.
• Environment tab: It shows the variables that are generated during the course of
programming, in a temporary workspace.
• History tab: In this tab, you’ll see all the commands used so far in your
R Studio session.
• To the right bottom, you have another panel, which
contains multiple tabs, such as files,
plots, packages, help, and viewer.
– The Files tab shows the files and directories that are
available within the default workspace of R.
– The Plots tab shows the plots that are generated during
the course of programming.
– The Packages tab helps you to look at what are the
packages that are already installed in the R Studio and it
also gives a user interface to install new packages.
– The Help tab is the most important one, where you can get
help from the R documentation on the functions that are
built into R.
– The final tab is the Viewer tab, which can be
used to see the local web content that’s generated using R.
• Set the working directory in R Studio
• R is always pointed at a directory on our computer. We can find out which
directory by running the getwd() function. Note: this function has no arguments.
We can set the working directory manually in two ways:
• The first way is to use the console and the command
setwd(“directorypath”).
You can use the setwd() function and give the path of the directory that
you want to be the working directory for R Studio, in double quotes.
• The second way is to set the working directory from the GUI.
To do this, click the three-dots (browse) button in the Files pane.
This opens up a file browser, which helps you choose
your working directory.
• Once you choose your working directory, click the settings
button in the More tab; a popup menu appears, where
you need to select “Set As Working Directory”.
• This selects the directory you chose in the file
browser as your working directory. Once you set the working directory,
you are ready to program in R Studio.
• Create an RStudio project
• Step 1: Select the FILE menu and choose the Create
option.
• Step 2: Then select the New Project option.
• Step 3: Then choose the path and directory
name.
• Finally, the project is created in the specified
location.
• Creating your first R script
• Here we are adding two numbers in R Studio, as in the short script below.
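A first script of this kind might look like:

  # first_script.R: adding two numbers
  a <- 5
  b <- 7
  total <- a + b
  print(total)   # prints 12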
• Navigating directories in R studio
• getwd(): Returns the current working directory.
• setwd(): Sets the working directory.
• dir(): Returns the contents of the current directory.
• sessionInfo(): Returns information about the current R session,
such as the R version, OS, and loaded packages.
• date(): Returns the current date.
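A quick console session exercising these functions (the paths and dates shown are illustrative):

  getwd()                      # e.g. "C:/Users/student/Documents"
  setwd("C:/Users/student")    # hypothetical path on your machine
  dir()                        # files in the new working directory
  date()                       # e.g. "Mon Jan 01 10:00:00 2024"
  sessionInfo()                # R version, OS, and loaded packages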
R Programming Language –
Introduction
• R is an open-source programming language that is widely
used as a statistical software and data analysis tool. R
generally comes with the Command-line interface. R is
available across widely used platforms like Windows, Linux,
and macOS. Also, the R programming language is regarded as a
cutting-edge statistical tool.
• It was designed by Ross Ihaka and Robert Gentleman at the
University of Auckland, New Zealand, and is currently
developed by the R Development Core Team. The R
programming language is an implementation of the S
programming language, combined with lexical
scoping semantics inspired by Scheme. The project
was conceived in 1992, with an initial version released in
1995 and a stable beta version in 2000.
Why R Programming Language?
• R programming is used as a leading tool for machine
learning, statistics, and data analysis. Objects, functions,
and packages can easily be created by R.
• It’s a platform-independent language, which means it can be
run on all operating systems.
• It’s an open-source free language. That means anyone can
install it in any organization without purchasing a license.
• R programming language is not only a statistic package but
also allows us to integrate with other languages (C, C++).
Thus, you can easily interact with many data sources and
statistical packages.
• The R programming language has a vast community of users
and it’s growing day by day.
• R is currently one of the most requested programming
languages in the Data Science job market, which makes it one
of the hottest trends nowadays.
• Features of R Programming Language
• Statistical Features of R:
• Basic Statistics: The most common basic statistics terms are
the mean, mode, and median. These are all known as
“Measures of Central Tendency.” So using the R language
we can measure central tendency very easily.
• Static graphics: R is rich with facilities for creating and
developing interesting static graphics. R contains
functionality for many plot types including graphic maps,
mosaic plots, biplots, and the list goes on.
• Probability distributions: Probability distributions play a
vital role in statistics and by using R we can easily handle
various types of probability distribution such as Binomial
Distribution, Normal Distribution, Chi-squared Distribution
and many more.
• Data analysis: It provides a large, coherent and integrated
collection of tools for data analysis.
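For example, R's d/p/q/r function families cover these distributions directly:

  dbinom(3, size = 10, prob = 0.5)   # P(X = 3) for a Binomial(10, 0.5)
  pnorm(1.65)                        # P(Z <= 1.65) for a standard normal
  qchisq(0.95, df = 4)               # 95th percentile of a Chi-squared(4)
  rnorm(5)                           # five random draws from N(0, 1)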
Programming Features of R:
• R Packages: One of the major features of R is the
wide availability of libraries. R has
CRAN (Comprehensive R Archive Network), which is a
repository holding more than 10,000 packages.
• Distributed Computing: Distributed computing is a
model in which components of a software system are
shared among multiple computers to improve
efficiency and performance. Two new packages ddR
and multidplyr used for distributed programming in R
were released in November 2015.
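Installing a CRAN package and loading it into a session takes two calls (ggplot2 here is just an example):

  install.packages("ggplot2")   # download and install from CRAN
  library(ggplot2)              # attach the package to the current session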
Advantages of R:
• R is the most comprehensive statistical analysis
package, as new technology and concepts often
appear first in R.
• The R programming language is open source, so
you can run R anywhere and at any time.
• The R programming language is suitable for
GNU/Linux and Windows operating systems.
• R programming is cross-platform and runs on
any operating system.
• In R, everyone is welcome to provide new
packages, bug fixes, and code enhancements.
Disadvantages of R:
• In the R programming language, the standard of
some packages is less than perfect.
• R commands pay little attention to memory
management, so the R programming language may
consume all available memory.
• In R, there is basically nobody to complain to if
something doesn’t work.
• The R programming language is much slower than
other programming languages such as Python and
MATLAB.
Applications of R:
• We use R for Data Science. It gives us a broad variety
of libraries related to statistics. It also provides the
environment for statistical computing and design.
• R is used by many quantitative analysts as its
programming tool. Thus, it helps in data importing and
cleaning.
• R is a highly prevalent language, so many data
analysts and research programmers use it. Hence, it is
used as a fundamental tool in finance.
• Tech giants like Google, Facebook, Bing, Twitter,
Accenture, Wipro and many more use R nowadays.
