FDS Unit 1 Notes
TOPICS
• Introduction to Data Science
• Role of data scientist
• Types of Data
• Tool boxes for data scientists
• Introduction to R studio
What is Data Science?
• Data Science is a combination of mathematics,
statistics, machine learning, and computer science.
Data Science is collecting, analyzing and interpreting
data to gather insights into the data that can help
decision-makers make informed decisions.
• Data Science is used in almost every industry today
to predict customer behavior and trends and to
identify new opportunities. Businesses can use it to
make informed decisions about product development
and marketing. It is also used to detect fraud and
optimize processes. Governments use Data Science
to improve efficiency in the delivery of public
services.
Importance of Data Science
• Nowadays, organizations are overwhelmed
with data. Data Science helps extract
meaningful insights from that data by combining
various methods, technologies, and tools. In
fields such as e-commerce, finance, medicine,
and human resources, businesses come across
huge amounts of data. Data Science tools and
technologies help them process all of it.
What is the Data Science process?
• Obtaining the data
• The first step is to identify what type of data needs to be
analyzed and to export this data to an Excel or
CSV file.
• Scrubbing the data
• Scrubbing is essential because, before you can analyze the
data, you must ensure it is in a readable state, with no
mistakes and no missing or wrong values.
• Exploratory Analysis
• Analyzing the data is done by visualizing the data in various
ways and identifying patterns to spot anything out of the
ordinary. To analyze the data, you must have excellent
attention to detail to identify if anything is out of place.
• Modeling or Machine Learning
• A data engineer or scientist chooses a Machine
Learning algorithm and specifies how it should
learn from the data being analyzed. The
algorithm applies these instructions iteratively,
adjusting itself until its output is as accurate
as possible.
• Interpreting the data
• In this step, you uncover your findings and
present them to the organization. The most
critical skill in this would be your ability to
explain your results.
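The five steps above can be sketched end to end in a few lines of Python. This is a minimal illustration only, using the standard library; the inline CSV text, the column names (month, ad_spend, sales), and the function names are all hypothetical stand-ins, and the "model" is a simple least-squares line rather than any particular method from these notes.

```python
import csv
import io
import statistics

# Hypothetical raw export; in practice this would be read from a CSV file.
RAW_CSV = """month,ad_spend,sales
1,10,25
2,20,45
3,,60
4,40,85
5,50,n/a
6,60,125
"""

def obtain(text):
    """Step 1: read the exported CSV into a list of dict rows."""
    return list(csv.DictReader(io.StringIO(text)))

def scrub(rows):
    """Step 2: keep only rows whose fields all parse as numbers."""
    clean = []
    for row in rows:
        try:
            clean.append({key: float(value) for key, value in row.items()})
        except ValueError:
            continue  # missing or malformed value: drop the row
    return clean

def explore(rows):
    """Step 3: report (min, max, mean) per column to spot anything unusual."""
    summary = {}
    for column in rows[0]:
        values = [row[column] for row in rows]
        summary[column] = (min(values), max(values), statistics.mean(values))
    return summary

def model(rows):
    """Step 4: fit sales = slope * ad_spend + intercept by least squares."""
    xs = [row["ad_spend"] for row in rows]
    ys = [row["sales"] for row in rows]
    mean_x, mean_y = statistics.mean(xs), statistics.mean(ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

def interpret(slope):
    """Step 5: state the finding in plain language for stakeholders."""
    return f"Each extra unit of ad spend is associated with about {slope:.2f} more sales."

rows = scrub(obtain(RAW_CSV))
summary = explore(rows)
slope, intercept = model(rows)
report = interpret(slope)
print(report)
```

Two of the six hypothetical rows are dropped during scrubbing because they contain a blank or a non-numeric value, which is why cleaning comes before any analysis.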
Types of data
• The data is classified into four categories:
• Nominal data.
• Ordinal data.
• Discrete data.
• Continuous data.
Qualitative or Categorical Data
• Qualitative or Categorical Data is data that can’t be
measured or counted in the form of numbers. These
types of data are sorted by category, not by number,
which is why they are also known as Categorical Data.
Such data may consist of audio, images, symbols, or text.
The gender of a person, i.e., male, female, or others, is
qualitative data.
• Other examples of qualitative data are:
• Language spoken
• Favorite holiday destination
• Opinion on something (agree, disagree, or neutral)
• Colors
The Qualitative data are further classified into two parts:
• Nominal Data
• Nominal Data is used to label variables without any order or
quantitative value. The color of hair can be considered nominal
data, as one color can’t be compared with another color.
• The name “nominal” comes from the Latin word “nomen,” which
means “name.” With nominal data, we can’t do any
numerical tasks, and we can’t sort the values into a meaningful order.
• Examples of Nominal Data :
• Colour of hair (Blonde, red, Brown, Black, etc.)
• Marital status (Single, Widowed, Married)
• Nationality (Indian, German, American)
• Gender (Male, Female, Others)
• Eye Color (Black, Brown, etc.)
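As a small illustration of what can be done with nominal data, the sketch below counts labels and finds the most frequent one. The list of responses is made up for the example; only counting and equality checks are used, since the labels have no order or numeric value.

```python
from collections import Counter

# Hypothetical survey responses for a nominal variable (hair color).
hair_colors = ["Black", "Brown", "Black", "Blonde", "Red", "Brown", "Black"]

# Counting how often each label occurs is a valid operation on nominal data.
counts = Counter(hair_colors)

# The most frequent label (the mode) is the only sensible "average".
mode_label, mode_count = counts.most_common(1)[0]

# Labels can be tested for equality, but not ranked or averaged.
same_color = hair_colors[0] == hair_colors[2]

print(mode_label, mode_count)
```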
Ordinal Data
• Ordinal data have a natural ordering: each value holds a position on a
scale. These data are used for observations such as customer satisfaction
or happiness, but we can’t do arithmetic on them because the gaps
between positions are not defined.
• Ordinal data is qualitative data whose values have a relative position.
It can be considered “in-between” qualitative and quantitative data.
Ordinal data shows only sequence, so it supports order-based summaries
such as the median or mode but not calculations such as the mean.
Unlike nominal data, ordinal data have an order.
• Examples of Ordinal Data :
• When companies ask for feedback, experience, or satisfaction on a scale
of 1 to 10
• Letter grades in the exam (A, B, C, D, etc.)
• Ranking of people in a competition (First, Second, Third, etc.)
• Economic Status (High, Medium, and Low)
• Education Level (Higher, Secondary, Primary)
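The difference from nominal data can be sketched in code: once the order of the labels is written down explicitly, the values can be sorted and a middle value picked out. The scale and responses below are hypothetical, and the rank numbers are used only for ordering, not for arithmetic.

```python
import statistics

# Explicit order for a hypothetical ordinal scale, lowest to highest.
LEVELS = ["Low", "Medium", "High"]
RANK = {label: position for position, label in enumerate(LEVELS)}

responses = ["High", "Low", "Medium", "Medium", "High", "Low", "Medium"]

# Sorting is meaningful because the labels have a defined order.
ordered = sorted(responses, key=RANK.__getitem__)

# median_low returns an actual observed rank, so it maps back to a label.
median_rank = statistics.median_low(RANK[r] for r in responses)
median_label = LEVELS[median_rank]

print(ordered)
print(median_label)
```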
Quantitative Data
• Quantitative data can be expressed in numerical values, which makes
it countable or measurable and suitable for statistical analysis.
These data are also known as Numerical data. They answer questions
like “how much,” “how many,” and “how often.” For example, the
price of a phone, a computer’s RAM, and the height or weight of a
person all fall under quantitative data.
• Quantitative data can be used for statistical manipulation. These
data can be represented on a wide variety of graphs and charts,
such as bar graphs, histograms, scatter plots, boxplots, pie charts,
line graphs, etc.
• Examples of Quantitative Data :
• Height or weight of a person or object
• Room Temperature
• Scores and Marks (Ex: 59, 80, 60, etc.)
• Time
The Quantitative data are further classified into two parts:
• Discrete Data
• The term discrete means distinct or separate. Discrete data
take integer or whole-number values. The total number of
students in a class is an example of discrete data.
These data can’t be broken into decimal or fraction values.
• Discrete data are countable and take separate values that
cannot be subdivided. These data are represented mainly by a
bar graph, number line, or frequency table.
• Examples of Discrete Data :
• Total numbers of students present in a class
• Cost of a cell phone
• Numbers of employees in a company
• The total number of players who participated in a competition
• Days in a week
Continuous Data
• Continuous data are measured values that can include fractions.
Examples are the height of a person, the length of an
object, etc. Continuous data represents information that can be divided
into smaller and smaller units. A continuous variable can take any value
within a range.
• The key difference between discrete and continuous data is that discrete
data contains only integers or whole numbers, whereas continuous data
can hold fractional values and is used to record measurements such as
temperature, height, width, time, speed, etc.
• Examples of Continuous Data :
• Height of a person
• Speed of a vehicle
• “Time-taken” to finish the work
• Wi-Fi Frequency
• Market share price
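The contrast between the two quantitative types can be shown with a short sketch. Discrete counts go straight into a frequency table, while continuous measurements are first grouped into intervals. All numbers and the 10 cm bin width are hypothetical.

```python
from collections import Counter

# Discrete: number of students present each day (whole numbers only).
students_present = [28, 30, 28, 29, 30, 30]
frequency_table = Counter(students_present)

# Continuous: heights in cm can take any value within a range,
# so they are grouped into intervals (bins) before counting.
heights_cm = [150.2, 161.7, 158.4, 172.9, 165.0, 169.3]
BIN_WIDTH = 10  # hypothetical bin width in cm

def bin_start(value, width=BIN_WIDTH):
    """Return the lower edge of the interval that contains value."""
    return int(value // width) * width

binned = Counter(bin_start(h) for h in heights_cm)

print(dict(frequency_table))
print(dict(binned))
```

The binned counts are what a histogram would plot, while the frequency table corresponds to a bar graph.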
Data Scientist Roles and Responsibilities
• Data scientists collaborate closely with business leaders and
other key players to comprehend company objectives and
identify data-driven strategies for achieving those objectives.
A data scientist’s job is to gather a large amount of data,
analyze it, separate out the essential information, and then
utilize tools like SAS, R programming, Python, etc. to extract
insights that may be used to increase the productivity and
efficiency of the business. Depending on an organization’s
needs, data scientists have a wide range of roles and
responsibilities.
The following is a list of some of the data scientist roles and responsibilities:
1. Collect data and identify data sources
2. Analyze huge amounts of data, both structured and unstructured
3. Create solutions and strategies to business problems
4. Work with team members and leaders to develop data strategy
5. To discover trends and patterns, combine various algorithms and
modules
6. Present data using various data visualization techniques and tools
7. Investigate additional technologies and tools for developing innovative
data strategies
8. Create comprehensive analytical solutions, from data gathering to
display; assist in the construction of data engineering pipelines
9. Support the data scientist, BI developer, and analyst teams as
needed for their projects; work with the sales and pre-sales teams on
cost reduction, effort estimation, and cost optimization.
10. To boost general effectiveness and performance, stay current with the
newest tools, trends, and technologies
11. Collaborate with the product team and partners to provide
data-driven solutions created with original concepts
12. Create analytics solutions for businesses by combining various tools,
applied statistics, and machine learning
13. Lead discussions and assess the feasibility of AI/ML solutions for business
processes and outcomes
14. Architect, implement, and monitor data pipelines, as well as conduct
knowledge sharing sessions with peers to ensure effective data use
Data scientist requirements
• Each industry has its own big data profile for a data scientist to analyze.
Here are some of the more common forms of big data in each industry, as
well as the kinds of analysis a data scientist will likely be required to
perform, according to the Bureau of Labor Statistics (BLS).
• Business:
• Today, data shapes the business strategy for nearly every company,
but businesses need data scientists to make sense of the information.
Data analysis of business data can inform decisions around efficiency,
inventory, production errors, customer loyalty and more.
• E-commerce:
• Now that websites collect more than purchase data, data scientists help
e-commerce businesses improve customer service, find trends and
develop services or products.
• Finance:
• In the finance industry, data on accounts, credit and debit transactions
and similar financial data are vital to a functioning business. But for data
scientists in this field, security and compliance, including fraud detection,
are also major concerns.
• Government:
• Big data helps governments form decisions, support constituents and monitor overall
satisfaction. As in the finance sector, security and compliance are paramount concerns for
data scientists.
• Science:
• Scientists have always handled data, but now with technology, they can better collect, share
and analyze data from experiments. Data scientists can help with this process.
• Social networking:
• Social networking data helps inform targeted advertising, improve customer satisfaction,
establish trends in location data and enhance features and services. Ongoing data analysis of
posts, tweets, blogs and other social media can help businesses constantly improve their
services.
• Healthcare:
• Electronic medical records are now the standard for healthcare facilities, which requires a
dedication to big data, security and compliance. Here, data scientists can help improve
health services and uncover trends that might go unnoticed otherwise.
• Telecommunications:
• All electronics collect data, and all that data needs to be stored, managed, maintained and
analyzed. Data scientists help companies squash bugs, improve products and keep customers
happy by delivering the features they want.
7 essential skills for a data scientist
1. Programming
• Programming languages, such as Python or R, are
necessary for data scientists to sort, analyze, and manage
large amounts of data (commonly referred to as “big
data”). As a data scientist just starting out, you should
know the basic concepts of data science and begin
familiarizing yourself with how to use Python. Popular
programming languages include:
• Python
• R
• SAS
• SQL
2. Statistics and probability
• In order to write high-quality machine learning models and
algorithms, data scientists need to learn statistics and
probability. For machine learning, it is essential to
use statistical analysis concepts like linear regression. Data
scientists need to be able to collect, interpret, organize, and
present data, and to fully comprehend concepts like mean,
median, mode, variance, and standard deviation. Here are
different types of statistical techniques you should know:
• Probability distributions
• Over and under sampling
• Bayesian and frequentist statistics
• Dimension reduction
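The descriptive measures named above (mean, median, mode, variance, and standard deviation) can be computed with Python's built-in statistics module. This is a minimal sketch on made-up exam marks; it uses the population forms (pvariance, pstdev), which is an assumption, since the notes do not say whether sample or population formulas are intended.

```python
import statistics

scores = [59, 80, 60, 80, 71]  # hypothetical exam marks

mean = statistics.mean(scores)            # arithmetic average
median = statistics.median(scores)        # middle value after sorting
mode = statistics.mode(scores)            # most frequent value
variance = statistics.pvariance(scores)   # population variance
std_dev = statistics.pstdev(scores)       # population standard deviation

print(mean, median, mode, variance, round(std_dev, 3))
```

For a sample drawn from a larger population, statistics.variance and statistics.stdev divide by n - 1 instead of n.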
3. Data wrangling and database management
• Data wrangling is the process of cleaning and organizing complex data sets to make
them easier to access and analyze. Manipulating the data to categorize it by patterns
and trends, and to correct and fill in data values, can be time-consuming but is
necessary for making data-driven decisions. This is also related to
understanding database management: you’re expected to extract data from
different sources, transform it into a suitable format for query and analysis, and
then load it into a data warehouse system. Useful tools for data wrangling include:
• Altair
• Talend
• Alteryx
• Trifacta
• Tamr
• And database management tools include:
• MySQL
• MongoDB
• Oracle
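The extract, transform, and load sequence described above can be sketched with Python's standard library. The example is an assumption-laden stand-in: the source text, the column names, the orders table, and the in-memory SQLite database (used here in place of a real data warehouse or any of the tools listed) are all hypothetical.

```python
import csv
import io
import sqlite3

# Hypothetical source export with inconsistent formatting.
SOURCE = """name,city,amount
 alice ,Pune,120.50
BOB,pune,
Carol,Mumbai,75
"""

def extract(text):
    """Extract: read the raw rows from the source."""
    return list(csv.DictReader(io.StringIO(text)))

def transform(rows):
    """Transform: trim whitespace, normalize case, drop rows with no amount."""
    cleaned = []
    for row in rows:
        amount = row["amount"].strip()
        if not amount:
            continue  # missing value: skip the record
        cleaned.append((row["name"].strip().title(),
                        row["city"].strip().title(),
                        float(amount)))
    return cleaned

def load(records):
    """Load: insert into an in-memory SQLite table standing in for a warehouse."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE orders (name TEXT, city TEXT, amount REAL)")
    conn.executemany("INSERT INTO orders VALUES (?, ?, ?)", records)
    conn.commit()
    return conn

conn = load(transform(extract(SOURCE)))

# Once loaded, the data can be queried with SQL.
totals = conn.execute(
    "SELECT city, SUM(amount) FROM orders GROUP BY city ORDER BY city"
).fetchall()
print(totals)
```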
4. Machine learning and deep learning
• As a data scientist, you’ll want to immerse yourself in machine learning
and deep learning. Incorporating these techniques helps you improve as a
data scientist because you’ll be able to gather and synthesize data more
efficiently, while also predicting the outcomes of future data sets. For
example, you can forecast how many clients your company will have
based on the previous month’s data using linear regression. Later on, you
can boost your knowledge to include more sophisticated models like
Random Forest. Some machine learning algorithms to know include:
• Linear regression
• Logistic regression
• Naive Bayes
• Decision tree
• Random forest algorithm
• K-nearest neighbor (KNN)
• K-means algorithm
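The client-forecasting example mentioned above can be sketched as a simple linear regression fitted by ordinary least squares. The monthly client counts are invented for illustration, and fit_line is a hypothetical helper written from the textbook formula, not a library function.

```python
# Hypothetical client counts for the past six months.
months = [1, 2, 3, 4, 5, 6]
clients = [110, 125, 138, 151, 167, 180]

def fit_line(xs, ys):
    """Return (slope, intercept) of the least-squares line through the points."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

slope, intercept = fit_line(months, clients)

# Forecast the next month by extending the fitted line.
forecast_month_7 = slope * 7 + intercept
print(round(forecast_month_7))
```

In practice a library such as scikit-learn would usually be used for this, and the forecast is only as reliable as the assumption that the trend stays linear.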
5. Data visualization
• Not only do you need to know how to analyze,
organize, and categorize data, but you’ll also want to
build your skills in data visualization. Being able to
create charts and graphs is important to being a data
scientist. With strong visualization skills, you can
present your work to stakeholders so that the data
tells a compelling story of the business insights.
Familiarity with the following tools should prepare you
well:
• Tableau
• Microsoft Excel
• Power BI
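The idea behind a bar chart can be sketched without any of the tools above: each value is scaled to a bar length relative to the largest value. The function text_bar_chart and the sales figures are hypothetical, and the plain-text output is only a stand-in for what a charting tool would draw.

```python
def text_bar_chart(data, width=20):
    """Render a {label: value} mapping as horizontal bars of '#' characters."""
    largest = max(data.values())
    label_width = max(len(label) for label in data)
    lines = []
    for label, value in data.items():
        # Scale each bar so the largest value fills the full width.
        bar = "#" * round(value / largest * width)
        lines.append(f"{label.ljust(label_width)} | {bar} {value}")
    return "\n".join(lines)

sales_by_region = {"North": 40, "South": 20, "East": 10}  # hypothetical data
chart = text_bar_chart(sales_by_region)
print(chart)
```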
6. Cloud computing
• As a data scientist, you'll most likely need to use cloud
computing tools that help you analyze and visualize data
that are stored in cloud platforms. Some certifications will
specifically focus on cloud services such as:
• Amazon Web Services (AWS)
• Microsoft Azure
• Google Cloud
• These tools provide data professionals access to
cloud-based databases and frameworks that are key for
advancing technology. They are used in many industries
now, so it is important in data science to become familiar
with the concepts behind cloud computing.
7. Interpersonal skills
• You’ll want to develop workplace skills such as communication in
order to form strong working relationships with your team
members and be able to present your findings to stakeholders.
Just as data visualization is important for communicating the data
insights you uncover as a data scientist, so is being able to
collaborate with teams successfully. Here are interpersonal skills
you can build upon:
• Active listening
• Effective communication skills
• Sharing feedback
• Attention to detail
• Leadership
• Empathy
• Public speaking
Difference Between Data Scientist, Data Analyst, and Data Engineer