Data science is a multidisciplinary field that uses statistics, programming, and machine learning to extract knowledge and insights from large amounts of data. It has various applications like email spam detection, medical diagnosis, predicting stock prices, and self-driving cars. The document discusses how the size of data is rapidly increasing and will continue to do so, with an estimated 463 exabytes of new data generated daily by 2025. It also outlines common tasks performed by data scientists like understanding business problems, analyzing and visualizing data, making recommendations, and predicting future values. Theoretical and practical aspects of data science are also covered, along with examples of how it relates to other fields.
2. Introduction: What is Data Science?
• Data science is one of the most promising and in-demand career
paths for skilled professionals.
• The term “data scientist” was coined as recently as 2008.
• It blends three areas: data, business, and statistics.
3. Size of the Data is increasing
• The amount of data in the world was
estimated to be 44 zettabytes at the
dawn of 2020.
• By 2025, the amount of data generated
each day is expected to reach 463
exabytes globally.
• Google, Facebook, Microsoft, and
Amazon store at least 1,200 petabytes of
information.
• The world spends almost $1 million per
minute on commodities on the Internet.
• Electronic Arts processes roughly 50
terabytes of data every day.
• By 2025, there are expected to be 75 billion
Internet-of-Things (IoT) devices in the world.
• By 2030, nine out of every ten people
aged six and above are expected to be
digitally active.
4. What Does a Data Scientist Do?
• Understand the business problem
• How can I improve the sales of an e-commerce platform?
• Analyze the data provided by the company
• The sales data: what products customers buy and when (e.g. before a festival), how much
time they spend on the app (app time), etc.
• Visualize the data and get an intuition
• Plot the data and look for patterns.
• Recommend a product/service based on the past data
• Which product to recommend to the user (if they are buying a phone, recommend
its accessories, a warranty, or similar phones)
• Predict future uncertainties/values
• Predict the sales, revenue, or stock price of the company
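The recommendation step above can be sketched with simple co-purchase counting. This is a minimal illustration, not a production recommender: the function `recommend`, the `orders` data, and the product names are all made up for this example.

```python
from collections import Counter

def recommend(orders, product, top_n=2):
    """Return the items most often bought together with `product`."""
    together = Counter()
    for basket in orders:
        if product in basket:
            # Count every other item that shared a basket with `product`.
            together.update(item for item in basket if item != product)
    return [item for item, _ in together.most_common(top_n)]

# Hypothetical past orders; each set is one customer's basket.
orders = [
    {"phone", "case", "charger"},
    {"phone", "case"},
    {"laptop", "mouse"},
    {"phone", "warranty", "case"},
    {"phone", "charger"},
]

suggestions = recommend(orders, "phone")
print(suggestions)
```

Counting co-occurrences is only the simplest way to use past data; it ignores who the customer is and when the purchase happened.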
6. Data Science as a multidisciplinary domain
• Data Science is a multidisciplinary domain which consists of many
other domains.
• The following Venn diagram illustrates this. Three main
domains primarily make up Data Science:
• For theory:
• Theoretical Computer Science
• Statistics (why)
• For Application:
• Application Oriented Computer Science
• Different tools: Python/R/Java
• Software Development
7. Where does data science
come from?
• The core of Data Science is Statistics,
wrapped in a layer of
computer science
• Statistics is the other domain that
handles data, but it historically worked
without computers, as they were
unavailable until recently
10. Example-2: Medical
Diagnosis
• Input: Symptoms (Fever, Cough,
nausea, pain)
• Output: Diagnosis (Covid-19, Common
cold, pneumonia)
• Assume that these are the only
possible diseases.
• An example of multiclass
classification
• The prediction carries some uncertainty, e.g. 20%
sure of Covid-19 and 80% sure of a
common cold.
• Probabilistic or soft classification
(soft computing)
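A minimal sketch of soft versus hard classification, assuming a model that produces one raw score per disease. The scores below are invented for illustration, and softmax is just one common way to turn scores into probabilities; the slides do not name a specific method.

```python
import math

def softmax(scores):
    """Turn raw class scores into probabilities that sum to 1."""
    highest = max(scores.values())  # subtract the max for numerical stability
    exps = {label: math.exp(s - highest) for label, s in scores.items()}
    total = sum(exps.values())
    return {label: e / total for label, e in exps.items()}

# Hypothetical raw scores for one patient's symptoms.
scores = {"Covid-19": 0.2, "Common cold": 1.6, "Pneumonia": -1.0}
probs = softmax(scores)

# Soft classification keeps the whole probability distribution;
# hard classification keeps only the most probable label.
hard_label = max(probs, key=probs.get)

for label, p in probs.items():
    print(f"{label}: {p:.0%}")
print("Hard decision:", hard_label)
```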
11. Example-3: predicting
a stock price
• Input: History of
stock prices
• Output: Price of
the stock in the
near future
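To make the input/output framing concrete, here is a deliberately naive baseline: predict the next price as the average of the last few prices. The function name, the window size, and the prices are assumptions for illustration; real stock prediction is far harder than this sketch suggests.

```python
def moving_average_forecast(prices, window=3):
    """Predict the next price as the mean of the last `window` prices."""
    if window <= 0 or len(prices) < window:
        raise ValueError("need at least `window` past prices")
    recent = prices[-window:]
    return sum(recent) / window

# Input: history of stock prices (made-up closing prices).
history = [101.0, 103.0, 102.0, 105.0, 107.0]

# Output: estimated price for the next time step.
next_price = moving_average_forecast(history, window=3)
print(round(next_price, 2))
```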
12. Example-4: Self-driving car
• Input: Road
conditions/traffic signals/
crowd
• Sensors: Camera, IR,
radar etc.
• Output: direction of the
vehicle, speed, acceleration
etc.
13. Essence of Data Science
1. Exploratory analysis: Discover the structure within the data. E.g.:
Experience (in years in a company) and salary are correlated.
• Unsupervised learning
2. Predictive Analysis: This is sometimes described as “learn from
the past to predict the future”.
• Supervised learning
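Both ideas can be sketched on the experience-and-salary example. The numbers are invented, and the straight-line fit is only one simple choice of predictive model; the helper names are hypothetical.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

def fit_line(xs, ys):
    """Least-squares slope and intercept for y = slope * x + intercept."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
             / sum((x - mean_x) ** 2 for x in xs))
    return slope, mean_y - slope * mean_x

experience = [1, 2, 3, 5, 8]    # years in the company (made up)
salary = [30, 35, 41, 52, 70]   # salary in thousands (made up)

# Exploratory analysis: how strongly are the two variables related?
r = pearson(experience, salary)

# Predictive analysis: learn from past data, then predict an unseen case.
slope, intercept = fit_line(experience, salary)
predicted = slope * 6 + intercept  # estimated salary at 6 years
print(round(r, 3), round(predicted, 1))
```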
21. A word of caution: Don’t Use ML everywhere
Not all problems are machine learning problems. It is important to know when to
(not) use machine learning.
1. If you have a deterministic logic that solves it with 100% accuracy, then obviously that is
cheaper, easier, and more accurate than any ML model one can make.
2. If you have some stable heuristic rules that solve it most of the time, the extra
work/complexity of ML might not be worth it.
3. If your heuristics do not reach the desired accuracy and require constant
updates, then ML can be a good bet.
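One way to act on this checklist is to measure a heuristic against labeled examples before reaching for ML. The sketch below assumes a hypothetical keyword rule for spam, a tiny hand-made dataset, and an arbitrary 95% target; none of these come from the slides.

```python
def keyword_rule(message):
    """Hypothetical heuristic: flag a message as spam if it has a trigger phrase."""
    triggers = ("free money", "winner", "click here")
    text = message.lower()
    return any(trigger in text for trigger in triggers)

def accuracy(rule, labeled):
    """Fraction of labeled messages the rule classifies correctly."""
    correct = sum(1 for message, is_spam in labeled if rule(message) == is_spam)
    return correct / len(labeled)

# Made-up labeled examples: (message, is_spam).
labeled = [
    ("You are a WINNER, click here", True),
    ("Free money waiting for you", True),
    ("Meeting moved to 3 pm", False),
    ("Lunch tomorrow?", False),
    ("Limited offer, act now", True),  # the rule misses this one
]

target = 0.95  # desired accuracy, chosen arbitrarily for this sketch
acc = accuracy(keyword_rule, labeled)
consider_ml = acc < target
print(f"rule accuracy = {acc:.0%}, consider ML: {consider_ml}")
```

If the rule had met the target and stayed stable over time, keeping it would be the cheaper choice.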