Module 1
DATA SCIENCE
Presenter’s Name
Dr. Shital Bhatt
Associate Professor
School of Computational and Data Sciences
www.vidyashilpuniversity.com
DATA SCIENCE IN VARIOUS DOMAINS
Why we are talking about Data Science?
Source: https://bit.ly/31HBHuQ
What is DATA SCIENCE?
- Hector Garcia-Molina
Source: https://bit.ly/30dekJB
Source: https://bit.ly/2z5sYqf
APPLICATIONS OF DATA SCIENCE
• Security
• Sports
• Internet Search
• Digital Advertisements
• Recommender System
• Image Processing
• Speech Recognition
• Gaming
• Delivery Logistics
• Health Care
• Augmented Reality
• Self-Driving Cars
• Robots
IMPACT OF DATA SCIENCE ON SOCIETY
• Saving Energy
• Data-Driven Hospitals
• A Cleaner Environment
IMPORTANCE of data science
Road to become a Data Scientist
DATA SCIENCE TERMINOLOGY
DATA SCIENCE TERMINOLOGY
Application Programming Interface (API)
Collection of software routines, protocols, and tools which provide a programmer with all the building blocks for developing an application program for a specific platform (environment). An API also provides an interface that allows a program to communicate with other programs, running in the same environment. (BusinessDictionary.com)
Deep Learning
Subset of machine learning that uses multi-layer neural networks, loosely modeled on the workings of
the human brain, to process data and improve performance. Typically, a multi-level algorithm that
gradually identifies things at higher levels of abstraction.
Python
A programming language first released in 1991 (version 1.0 followed in 1994) that is popular with people
doing data science. Python is noted for ease of use among beginners, and great power when used by
advanced users, especially when taking advantage of specialized libraries such as those designed for
machine learning and graph generation.
R
An open-source programming language and environment for statistical computing and graph
generation available for Linux, Windows, and Mac.
Semantic
Semantics can address meaning at the levels of words, phrases, sentences, or larger units of
discourse. In machine learning, semantic analysis of a corpus is the task of building structures
that approximate concepts from a large set of documents.
Stochastic optimization
Stochastic optimization methods are optimization methods that generate and use random variables.
For stochastic problems, the random variables appear in the formulation of the optimization problem
itself, which involves random objective functions or random constraints.
Supervised Learning
A type of machine learning algorithm in which a system is taught via examples. For instance, a
supervised learning algorithm can be taught to classify input into specific, known classes. The classic
example is sorting email into spam versus non-spam. (datascienceglossary.org)
Unsupervised Learning
A class of machine learning algorithms designed to identify groupings of data without knowing in
advance what the groups will be. (datascienceglossary.org)
Web scraping
Web scraping is a term for various methods used to collect information from across the Internet.
Acquiring Data
Be cynical about your data
Is the data relevant to your problem?
Where did this data come from?
Who collected it?
Why? What for?
Do they have biases that might show up in the data?
Are there holes in the data (demographic, geographical, political etc)?
Do you have supporting data? Is it *really* from a different source?
Can you use this data (are there privacy or copyright issues with using it)?
Data Sources
Data warehouses and catalogues
open government data
NGO websites
web searches
online documents, images, maps etc
Creating your own data: People
Creating your own data: Sensors
Data analysis
Data analysis is an aspect of data science and data analytics that is all about
analyzing data for different kinds of purposes. The data analysis process involves
inspecting, cleaning, transforming and modeling data to draw useful insights from
it.
Descriptive Analytics
Descriptive analytics is a type of data analysis that focuses on describing and summarizing data
to gain insights into what has happened in the past. It is commonly used to answer questions
such as “What happened?” and “How many?”.
Descriptive analytics can help businesses and organizations understand their data and identify
patterns and trends that can inform decision-making.
Here are some real-life examples of descriptive analytics:
• A retail store might analyze historical sales data to identify popular products and trends. For
example, people tend to buy more candy in February.
• Patient data can be summarized to identify common health issues. For example, most people get
the flu from October to June.
• Student performance data can be analyzed to identify areas for improvement. For example, most
students who fail Calculus are frequently late to class.
To use descriptive analytics effectively, you need to ensure that your data is accurate and of high
quality. It’s also crucial to use clear and concise visualizations to communicate insights
effectively.
Diagnostic Analysis
Diagnostic analytics is a type of data analysis that goes beyond descriptive analytics
to identify the root cause of an issue or problem. It answers questions such as “Why did it happen?” and
“What caused it?”. For example, you can use diagnostic analysis to determine why your January sales
dropped by 50%.
Diagnostic analytics involves exploring and analyzing data to identify relationships and correlations that
can help explain an issue or problem. This can be done using techniques such as regression analysis,
hypothesis testing, and causal analysis.
Real-life examples include:
• You can use diagnostic analysis to identify the root cause of a quality issue in your production process.
• You can also use it to identify the cause behind a customer’s complaint and provide a targeted solution.
• In case of a cyber threat, you can also use it to identify the source of a security breach and prevent future
attacks.
There are many benefits to using diagnostic analytics, such as identifying the underlying causes of issues
and problems and developing targeted solutions. But, as with descriptive analytics,
there are some challenges to consider. For one, acquiring high-quality data and ensuring accurate analysis
and insights can be difficult. Secondly, the analysis techniques can be quite complex and may require
specialized skills and knowledge to be implemented effectively.
Predictive Analytics
Predictive analytics uses statistical and machine learning techniques to analyze historical data and predict future events. It
is commonly used to answer questions such as “What is likely to happen?” and “What if?”.
Predictive analytics is useful as it can help you plan ahead. It can help improve business operations, reduce costs, and
increase revenue. For example, you can predict how sales will likely behave based on seasonality and previous sales
figures. If your predictive analysis tells you that sales will likely decrease in winter, you can use this information to design
an effective marketing campaign for this season.
Here are some practical examples of predictive analytics in action:
• A bank might use predictive analytics to assess credit risk and determine whether to grant a loan to a customer. In open
banking, predictive analytics can help build highly personalized behavioral models specific to each customer and identify
their creditworthiness in new ways. For customers, this may mean better and cheaper access to bank accounts, credit cards,
and mortgages.
• In marketing, predictive analytics can help identify which customers are most likely to respond to a particular offer.
• In healthcare, predictive analytics can be used to identify patients at risk of developing a particular disease.
• In manufacturing, predictive analytics can be used to forecast demand and optimize supply chain management.
However, there are also some challenges to using predictive analytics effectively. One challenge is the availability of high-
quality data essential for accurate predictions. Another challenge is selecting appropriate modeling techniques to analyze
the data and make accurate predictions. Finally, communicating predictive analytics results to decision-makers can be
challenging, as the techniques used can be complex and difficult to understand.
Prescriptive Analytics
Prescriptive analytics is a type of data analysis that goes beyond descriptive and
predictive analytics to provide recommendations for actions you should take. In
other words, this approach involves using optimization techniques to
identify the best course of action, given a set of constraints and objectives.
It is commonly used to answer questions such as “What should we do?” and
“How can we improve?”
To be effective, it requires a deep understanding of the data being analyzed and
the ability to model and simulate different scenarios to identify the best course of
action. As such, this is the most complex approach of the four methods.
Prescriptive analytics can help you solve various problems, including product
mix, workforce planning, marketing mix, capital budgeting, and capacity
management.
A familiar example of prescriptive analytics in action is using Google Maps for
directions during peak hours. The software considers all modes of transport and
traffic conditions to calculate the best route possible. A transportation company
might use prescriptive analytics in this way to optimize delivery routes and
minimize fuel costs. This is important especially when you consider the rising
cost of fuel.
However, like with predictive analytics, there are some challenges to using
prescriptive analytics effectively. The first challenge is the
availability of high-quality data essential for accurate analysis and optimization.
Another challenge is the complexity of the optimization algorithms used, which
can require specialized skills and knowledge to implement effectively.
Introduction to Statistics and
Probability
STATISTICS
• It is the science of collecting, organizing, analyzing and interpreting data.
Inferential Statistics: It is about using sample data drawn from a population to make inferences
and draw conclusions about that population using probability theory.
Descriptive Statistics: It is used to summarize and represent the data in an accurate way
using charts, tables and graphs.
For example, you might stand in a mall and ask a sample of 100 people if they like
shopping at Sears. You could make a bar chart of yes or no answers (that would be
descriptive statistics) or you could use your research (and inferential statistics) to reason
that around 75%-80% of the population likes shopping at Sears.
DESCRIPTIVE STATISTICS
The following measures are used to represent the data set:
• Measure of Position
• Measure of Spread
• Measure of Shape
MEASURE OF POSITION
• Also known as measure of Central Tendency.
• A measure of central tendency is a single value that attempts to describe a set of data by
identifying the central position within that set of data.
• There are three measures of central tendencies: Mean, Median and Mode.
Mean: It is the point at which the mass of the distribution of data balances.
Median: It is the point that divides the data into two equal halves, while being less susceptible
to outliers compared to the mean.
Mode: It refers to the data item that occurs most frequently in a given data set.
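The three measures can be illustrated with a minimal sketch using Python's standard `statistics` module. The marks below are hypothetical values chosen for illustration and are not taken from the slides.

```python
import statistics

# Hypothetical ungrouped data: marks of nine students (illustration only).
marks = [4, 7, 7, 8, 9, 10, 10, 10, 16]

mean_value = statistics.mean(marks)      # arithmetic average (balance point)
median_value = statistics.median(marks)  # middle value of the sorted data
mode_value = statistics.mode(marks)      # most frequent value

print("mean:", mean_value, "median:", median_value, "mode:", mode_value)

# Replace the largest mark with an outlier: the mean shifts, the median does not.
with_outlier = marks[:-1] + [160]
mean_with_outlier = statistics.mean(with_outlier)
median_with_outlier = statistics.median(with_outlier)
print("with outlier -> mean:", mean_with_outlier, "median:", median_with_outlier)
```

The second half of the sketch shows why the median is described as less susceptible to outliers than the mean.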
Example for ungrouped data:
Question:
MEASURE OF DISPERSION
• It refers to how the data deviates from the position measure i.e. gives an indication of
the amount of variation in the process.
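As a rough illustration, the sketch below computes three common measures of dispersion (range, variance and standard deviation) with the standard `statistics` module. The data set is hypothetical, and the population forms of variance and standard deviation are an assumption; the slides do not say which form is intended.

```python
import statistics

# Hypothetical data set (illustration only).
data = [4, 7, 7, 8, 9, 10, 10, 10, 16]

data_range = max(data) - min(data)     # spread between the extremes
variance = statistics.pvariance(data)  # mean squared deviation from the mean
std_dev = statistics.pstdev(data)      # square root of the variance

print("range:", data_range)
print("variance:", round(variance, 3))
print("standard deviation:", round(std_dev, 3))
```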
MEASURE OF SHAPE
Skewness: It measures the asymmetry of a distribution, i.e. the horizontal displacement of
the curve relative to the normal curve about the mean position. Skewness for a normal distribution is zero.
Kurtosis: It measures the vertical distortion of the curve relative to the normal curve,
without disturbing its symmetry. The kurtosis for a normal distribution
is three.
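The two reference values quoted above (skewness 0 and kurtosis 3 for a normal distribution) can be checked numerically. This is a minimal sketch that assumes the moment-based definitions, skewness = m3 / m2^1.5 and kurtosis = m4 / m2^2, where mk is the k-th central moment. The function name and the sample data are hypothetical.

```python
import random

def skewness_and_kurtosis(values):
    """Return (skewness, kurtosis) computed from central moments."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((x - mean) ** 2 for x in values) / n
    m3 = sum((x - mean) ** 3 for x in values) / n
    m4 = sum((x - mean) ** 4 for x in values) / n
    return m3 / m2 ** 1.5, m4 / m2 ** 2

# A large simulated sample from a standard normal distribution.
random.seed(0)
normal_sample = [random.gauss(0, 1) for _ in range(100_000)]
normal_skew, normal_kurt = skewness_and_kurtosis(normal_sample)
print("normal sample:", round(normal_skew, 2), round(normal_kurt, 2))

# A small data set with a long right tail has positive skewness.
right_skewed = [1, 1, 2, 2, 3, 10]
right_skew, _ = skewness_and_kurtosis(right_skewed)
print("right-skewed skewness:", round(right_skew, 2))
```

For the simulated normal sample the results should land close to 0 and 3, with small sampling error.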
PROBABILITY
Probability is a numerical description of how likely an event is to occur or how likely
it is that a proposition is true.
Sample space: The set S of all possible outcomes of an experiment, e.g. the sample space for a
die roll is {1, 2, 3, 4, 5, 6}
Equally likely outcomes: Two outcomes of a random experiment are said to be equally
likely, if upon performing the experiment a (very) large number of times, the relative
occurrences of the two outcomes turn out to be equal.
Exhaustive Events: A set of events is called exhaustive if all the events together cover
the entire sample space. E.g., if A and B are mutually exclusive and exhaustive events, then
A ∪ B = S
Where,
S = sample space
Mutually Exclusive Events: Events are mutually exclusive if the occurrence of one event excludes
the occurrence of another, i.e. no two of the events can occur simultaneously.
Addition Theorem
Theorem 1: If A and B are two mutually exclusive events, then
P(A ∪ B) = P(A) + P(B) = n1/n + n2/n
Where,
n = Total number of exhaustive (equally likely) cases
n1 = Number of cases favorable to A.
n2 = Number of cases favorable to B.
Theorem 2: If A and B are two events that are not mutually exclusive, then
P(A ∪ B) = P(A) + P(B) - P(A ∩ B)
Where,
P(A ∩ B) = Probability of the cases favorable to both A and B
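Both forms of the addition theorem can be verified by counting outcomes of a single die roll. The sketch below is illustrative only: the events and the helper `prob` are hypothetical names, and all six outcomes are assumed equally likely.

```python
from fractions import Fraction

sample_space = {1, 2, 3, 4, 5, 6}  # one roll of a fair die

def prob(event):
    """Probability of an event as (favorable cases) / (total cases)."""
    return Fraction(len(event & sample_space), len(sample_space))

even = {2, 4, 6}
odd = {1, 3, 5}
greater_than_three = {4, 5, 6}

# Theorem 1: 'even' and 'odd' are mutually exclusive (no common outcome).
theorem1_lhs = prob(even | odd)
theorem1_rhs = prob(even) + prob(odd)

# Theorem 2: 'even' and 'greater than three' overlap in {4, 6}.
theorem2_lhs = prob(even | greater_than_three)
theorem2_rhs = (prob(even) + prob(greater_than_three)
                - prob(even & greater_than_three))

print(theorem1_lhs, theorem1_rhs)
print(theorem2_lhs, theorem2_rhs)
```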
Multiplication Theorem
If A and B are two independent events, then the probability that both will
occur is equal to the product of their individual probabilities:
P(A ∩ B) = P(A) × P(B)
Example:
The probability of appointing a lecturer who is a B.Com, an MBA, and a PhD, when these
qualifications occur independently with probabilities 1/20, 1/25 and 1/40, is given by:
1/20 × 1/25 × 1/40 = 1/20000
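The product rule extends to any number of independent events. As a minimal sketch, the lecturer example can be evaluated with exact fractions. The function name `joint_probability` is hypothetical, and independence of the three qualifications is an assumption of the example.

```python
from fractions import Fraction
from functools import reduce
from operator import mul

def joint_probability(*probabilities):
    """Probability that all of several independent events occur."""
    return reduce(mul, probabilities, Fraction(1))

# Probabilities from the example: B.Com, MBA and PhD.
p_bcom = Fraction(1, 20)
p_mba = Fraction(1, 25)
p_phd = Fraction(1, 40)

p_all_three = joint_probability(p_bcom, p_mba, p_phd)
print(p_all_three)
```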
Poisson Distribution
A discrete distribution representing the number of events occurring within a fixed interval
of time or space.
Defined by the average rate (λ).
Applicable for rare events, e.g., number of phone calls at a call center in an hour.
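For a distribution of counts with average rate λ, the Poisson probability of observing exactly k events is P(X = k) = e^(-λ) λ^k / k!. The sketch below evaluates this formula for a hypothetical call center that receives 3 calls per hour on average. The rate and the function name are illustrative assumptions.

```python
import math

def poisson_pmf(k, lam):
    """P(X = k) for a Poisson distribution with average rate lam."""
    return math.exp(-lam) * lam ** k / math.factorial(k)

calls_per_hour = 3.0  # hypothetical average rate (lambda)

# Probabilities of 0..30 calls in one hour; the tail beyond 30 is negligible here.
probabilities = [poisson_pmf(k, calls_per_hour) for k in range(31)]
total = sum(probabilities)
expected_calls = sum(k * p for k, p in enumerate(probabilities))

print("P(no calls):", round(probabilities[0], 4))
print("sum of probabilities:", round(total, 6))
print("expected number of calls:", round(expected_calls, 6))
```

The expected value recovers λ, which is why the distribution is fully described by its average rate.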
Exponential Distribution
A continuous distribution used to model the time between events in a Poisson process.
Defined by the rate parameter (λ).
Memoryless property: the probability of an event occurring in the next interval is independent of
past intervals.
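The memoryless property can be checked from the survival function P(T > t) = e^(-λt): the chance of waiting a further t units, given that s units have already passed, equals the chance of waiting t units from the start. The rate and the times below are hypothetical values for illustration.

```python
import math

def survival(t, rate):
    """P(T > t) for an exponential distribution with the given rate."""
    return math.exp(-rate * t)

rate = 0.5        # hypothetical rate parameter (lambda)
s, t = 2.0, 3.0   # time already waited, additional waiting time

# P(T > s + t | T > s) = P(T > s + t) / P(T > s)
conditional = survival(s + t, rate) / survival(s, rate)
unconditional = survival(t, rate)

print(round(conditional, 6), round(unconditional, 6))
```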
Types of Data
• The term variable means a quality or quantity which varies
from one member of a sample or population to another.
Nominal Variable: A qualitative variable that categorizes (or
describes, or names) an element of a population.
Qualitative and Quantitative Data
Qualitative Data: Consists of attributes, labels, or nonnumerical entries.
Quantitative Data: Consists of numerical measurements or counts.
Example:
The grade point averages of five students are listed in the table. Which data
are qualitative data and which are quantitative data?
Student GPA
Sally 3.22
Bob 3.98
Cindy 2.75
Mark 2.24
Kathy 3.84
Levels of Measurement (lowest to highest)
• Nominal
• Ordinal
• Interval
• Ratio
Nominal Level of Measurement
Data at the nominal level of measurement are qualitative only.
Categorized using names, labels, or qualities. No mathematical
computations can be made at this level.
Ordinal Level of Measurement
Arranged in order, but differences between data entries are not meaningful.
Interval Level of Measurement
Arranged in order, and the differences between data entries can be calculated.
Ratio
Level      Categorize   Order   Subtract   Determine multiples
Nominal    Yes          No      No         No
Ordinal    Yes          Yes     No         No
Interval   Yes          Yes     Yes        No
Ratio      Yes          Yes     Yes        Yes
What is a Data Format?
• A structured way to represent and store data
• Ensures consistency and interoperability
• Different formats for different data types (text, numbers, images, etc.)
Common Data Formats
• Text formats (TXT, CSV, JSON) - Plain text for basic data
• Document formats (DOC, PDF, ODT) - Structured documents with layout and formatting
• Spreadsheet formats (XLS, XLSX) - Data organized in rows and columns
• Image formats (JPG, PNG, GIF) - Digital representations of pictures
• Audio formats (MP3, WAV) - Storage of sound recordings
• Video formats (MP4, AVI) - Combination of images and audio for moving pictures
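To make the idea of text formats concrete, the sketch below represents the same records first as CSV and then as JSON, using only Python's standard library. The records reuse the student GPA table shown earlier; the lowercase field names `student` and `gpa` are hypothetical choices.

```python
import csv
import io
import json

# The same data in CSV form: one header row, then one row per record.
csv_text = "Student,GPA\nSally,3.22\nBob,3.98\nCindy,2.75\nMark,2.24\nKathy,3.84\n"

# Parse the CSV text; every value is read as a string.
rows = list(csv.DictReader(io.StringIO(csv_text)))

# Convert to typed records: names stay text, GPAs become numbers.
records = [{"student": row["Student"], "gpa": float(row["GPA"])} for row in rows]

# Serialize to JSON, which preserves the text/number distinction.
json_text = json.dumps(records, indent=2)
restored = json.loads(json_text)

print(json_text)
```

CSV stores every field as plain text, so the conversion to numbers is explicit, whereas JSON keeps numbers and strings distinct in the file itself.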
Working with Categorical Data
Introduction to Categorical Data
Categorical data represents characteristics and can be divided into groups or categories.
Examples: Gender, Marital Status, Yes/No responses.
Types of Categorical Data
Nominal Data: Categories with no inherent order. Examples: Colors, Types of Pets.
Ordinal Data: Categories with a meaningful order. Examples: Movie Ratings, Education
Levels.
Methods of Representing Categorical Data
Various methods are used to visually represent categorical data for better understanding
and analysis.
Bar Charts
A bar chart uses rectangular bars to represent the frequency or count of categories.
Advantages: Easy to understand, compare categories.
Disadvantages: Not suitable for a large number of categories.
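A bar chart is built from category counts. As a minimal sketch, the code below counts hypothetical survey responses (types of pets, a nominal variable) and prints a text-only bar chart. A plotting library would draw the same counts graphically.

```python
from collections import Counter

# Hypothetical survey responses (illustration only).
responses = ["Dog", "Cat", "Dog", "Fish", "Cat", "Dog", "Bird", "Dog", "Cat"]

# Frequency of each category.
counts = Counter(responses)

# One bar per category, longest first; bar length equals the count.
for category, count in counts.most_common():
    print(f"{category:<5} {'#' * count} ({count})")
```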
Sentence Embeddings
Explanation
- Represents entire sentences or phrases in vector space.
- Captures semantic meaning beyond individual words.
Models
- Sentence-BERT: Adapts BERT for sentence-level tasks.
Example
- Similar sentences have closer vectors.
Pros
- Captures context and sentence meaning.
Cons
- Computationally intensive.
Applications
Text Classification
- Classifying emails as spam or not spam.
Sentiment Analysis
- Determining sentiment from social media posts.
Machine Translation
- Translating text from one language to another.
Information Retrieval
- Search engines retrieving relevant documents.
Challenges in Text Representation
Challenges
- High dimensionality
- Sparsity
- Semantic understanding
Basic Text Representation: Bag of Words (BoW)
Definition
- Represents text by counting the occurrence of each word
Advantages
- Simple and easy to implement
Disadvantages
- Ignores word order and context
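The following is a minimal Bag of Words sketch in plain Python. The two example sentences, the whitespace tokenizer and the function name are illustrative assumptions; NLP libraries provide more robust tokenization.

```python
from collections import Counter

documents = ["the cat sat on the mat", "the dog sat on the log"]

# Tokenize by lowercasing and splitting on whitespace (a simplification).
tokenized = [doc.lower().split() for doc in documents]

# The vocabulary is the sorted set of all words seen in the corpus.
vocabulary = sorted({word for tokens in tokenized for word in tokens})

def bag_of_words(tokens):
    """Count how often each vocabulary word occurs in one document."""
    counts = Counter(tokens)
    return [counts[word] for word in vocabulary]

vectors = [bag_of_words(tokens) for tokens in tokenized]

print(vocabulary)
for vector in vectors:
    print(vector)

# Word order is ignored: a reordered sentence gets the same vector.
same_words_reordered = bag_of_words("mat the on sat cat the".split())
```

The last line illustrates the disadvantage listed above: shuffling the words of the first sentence leaves its vector unchanged.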
Basic Text Representation: Term Frequency-Inverse Document Frequency (TF-IDF)
Definition
- Adjusts word counts by their frequency in the document and across all documents
Formula
- TF-IDF(t,d) = TF(t,d) × IDF(t)
Advantages
- Reduces the impact of commonly occurring words
Disadvantages
- Still ignores word order
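The formula can be sketched in a few lines. Several TF and IDF variants are in use; this sketch assumes TF(t,d) = (count of t in d) / (number of words in d) and IDF(t) = ln(N / number of documents containing t), which is only one common choice. The corpus and function names are hypothetical.

```python
import math

documents = [
    "the cat sat on the mat",
    "the dog sat on the log",
    "the bird sang",
]
tokenized = [doc.split() for doc in documents]

def term_frequency(term, tokens):
    """Share of the document's words that are the given term."""
    return tokens.count(term) / len(tokens)

def inverse_document_frequency(term, corpus):
    """ln(N / df); assumes the term occurs in at least one document."""
    containing = sum(1 for tokens in corpus if term in tokens)
    return math.log(len(corpus) / containing)

def tf_idf(term, tokens, corpus):
    return term_frequency(term, tokens) * inverse_document_frequency(term, corpus)

# "the" occurs in every document, so its weight drops to zero.
weight_the = tf_idf("the", tokenized[0], tokenized)
# "cat" occurs in only one document, so it keeps a positive weight.
weight_cat = tf_idf("cat", tokenized[0], tokenized)

print("the:", weight_the)
print("cat:", round(weight_cat, 4))
```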
Word Embeddings: Word2Vec
Definition
- Uses neural networks to generate dense vector representations of words
Models
- CBOW (Continuous Bag of Words)
- Skip-gram
Advantages
- Captures semantic meaning
Applications
- Similarity comparisons
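Similarity between word vectors is commonly measured with cosine similarity. The vectors below are small hand-made values for illustration only; they are not produced by Word2Vec, whose real embeddings typically have hundreds of dimensions.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hypothetical 3-dimensional "embeddings" (illustration only).
king = [0.8, 0.6, 0.1]
queen = [0.7, 0.7, 0.1]
apple = [0.1, 0.0, 0.9]

similarity_king_queen = cosine_similarity(king, queen)
similarity_king_apple = cosine_similarity(king, apple)

print("king vs queen:", round(similarity_king_queen, 3))
print("king vs apple:", round(similarity_king_apple, 3))
```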
Word Embeddings: GloVe
Definition
- Combines the benefits of count-based methods (global co-occurrence counts) and Word2Vec (local context)
Methodology
- Uses global word-word co-occurrence statistics
Advantages
- Efficient in capturing semantic relationships
Word Embeddings: FastText
Definition
- Extends Word2Vec by representing words as bags of character n-grams
Advantages
- Handles out-of-vocabulary words
Applications
- Morphologically rich languages
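The handling of out-of-vocabulary words rests on shared character n-grams. The sketch below only extracts the n-grams, using `<` and `>` as word-boundary markers; it does not train or load any embeddings. The function name and example words are hypothetical.

```python
def char_ngrams(word, n=3):
    """Character n-grams of a word wrapped in boundary markers < and >."""
    wrapped = f"<{word}>"
    return [wrapped[i:i + n] for i in range(len(wrapped) - n + 1)]

where_trigrams = char_ngrams("where", 3)
print(where_trigrams)

# Suppose 'playing' was seen in training but 'replaying' was not:
# the unseen word still shares many n-grams with the known one.
shared = set(char_ngrams("playing")) & set(char_ngrams("replaying"))
print(sorted(shared))
```

Because an unseen word can be described by n-grams learned from known words, a vector can still be assembled for it.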
Contextual Embeddings: ELMo
Definition
- Generates word representations using deep bi-directional LSTM
Advantages
- Context-sensitive
Applications
- Improves performance on downstream NLP tasks
Contextual Embeddings: BERT
Definition
- Uses transformers to provide context-aware embeddings
Advantages
- Handles context more effectively
Applications
- Wide range of NLP tasks
Contextual Embeddings: GPT-based Embeddings
Definition
- Uses transformer architecture to generate embeddings
Advantages
- Generates coherent and contextually relevant text
Applications
- Language generation tasks
Advanced Techniques: Sentence and Document Embeddings
Sentence Embeddings
- Captures the meaning of entire sentences
- Examples: Universal Sentence Encoder, Sentence-BERT
Document Embeddings
- Aggregates sentence embeddings for documents
Advanced Techniques: Transformer Models
Transformers
- Use attention mechanisms
- Examples: GPT-3, T5
Advantages
- State-of-the-art results in many NLP tasks
Applications of Text Data Representation
Sentiment Analysis
- Determine the sentiment of text (positive, negative, neutral)
Text Classification
- Categorize text into predefined categories
Machine Translation
- Translate text from one language to another
Tools and Libraries
NLTK
- Comprehensive library for NLP
SpaCy
- Industrial-strength NLP
Gensim
- Topic modeling and document similarity
Future Trends
- Increasing use of transformer models
- Improved contextual understanding
Summary
- Effective text representation is crucial for NLP
- Ongoing advancements in embeddings and models
Time Series Data Representation
Introduction to Time Series Data
Definition:
A sequence of data points collected or recorded at successive points in time.
Examples:
- Stock Prices
- Weather Data
- Sales Data
Components of Time Series Data
Trend:
Long-term movement in the data.
Seasonality:
Regular pattern repeating over a known, fixed period.
Cyclic Patterns:
Long-term oscillations with no fixed period.
Irregular Components:
Random noise or anomalies.
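One simple way to separate trend from seasonality is a moving average whose window equals the seasonal period. The sketch below uses a hypothetical sales series with a steady upward trend and a pattern repeating every 3 periods; the function name is also an assumption.

```python
def moving_average(series, window):
    """Average of each run of `window` consecutive values."""
    return [
        sum(series[i:i + window]) / window
        for i in range(len(series) - window + 1)
    ]

# Hypothetical sales: rising trend plus a pattern that repeats every 3 periods.
sales = [10, 12, 9, 13, 15, 12, 16, 18, 15, 19, 21, 18]

# A window equal to the seasonal period averages the seasonality out.
trend = moving_average(sales, window=3)

print([round(value, 2) for value in trend])
```

The raw series goes up and down, while the smoothed series rises steadily, which exposes the long-term movement.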
Types of Time Series Data
Continuous vs Discrete:
- Continuous: Recorded continuously over time.
- Discrete: Recorded at specific time intervals.
Univariate vs Multivariate:
- Univariate: Single variable recorded over time.
- Multivariate: Multiple variables recorded over time.
Time Series Data Representation Techniques
Line Plots:
Visualize data points connected by lines.
Scatter Plots:
Individual data points displayed without connecting lines.
Seasonal Plots:
Show seasonality in data.
Heatmaps:
Visualize data intensity.
Applications of Time Series Analysis
Forecasting:
Predict future values based on past data.
Anomaly Detection:
Identify unusual patterns or outliers.
Trend Analysis:
Understand long-term movement.
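As an illustration of anomaly detection, the sketch below flags readings whose z-score (distance from the mean in standard deviations) exceeds a threshold. The readings and the threshold of 2 are hypothetical choices; practical systems often use more robust methods.

```python
import statistics

# Hypothetical sensor readings with one unusual value.
readings = [20.1, 19.8, 20.3, 20.0, 35.0, 19.9, 20.2, 20.1]

mean = statistics.mean(readings)
std_dev = statistics.pstdev(readings)
threshold = 2  # assumed cut-off, in standard deviations

# Keep (position, value) pairs that lie far from the mean.
anomalies = [
    (index, value)
    for index, value in enumerate(readings)
    if abs(value - mean) / std_dev > threshold
]

print(anomalies)
```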
Tools and Libraries for Time Series Analysis
Python Libraries:
- Pandas
- Matplotlib
- Seaborn
- Statsmodels
Representing Information in Different Modalities
Audio, Image, and Video
Introduction to Different Modalities
Audio Representation
File Formats:
- WAV (Waveform Audio File Format)
- MP3 (MPEG-1 Audio Layer III)
- AAC (Advanced Audio Coding)
Visualization Techniques:
- Waveforms
- Spectrograms
Image Representation
File Formats:
- JPEG
- PNG
- GIF
Visualization Techniques:
- Histograms
- Color Channels
Video Representation
File Formats:
- MP4
- AVI
- MKV
Visualization Techniques:
- Frame Extraction
- Motion Vectors
Applications
Audio:
- Music Streaming
- Speech Recognition
Image:
- Digital Photography
- Medical Imaging
Video:
- Streaming Services
- Video Surveillance
Data Visualization
What is Data Visualization
Data visualization is the graphical representation of data points and information, so that
users can understand them quickly and easily. A data
visualization is good if it has a clear meaning and purpose and is easy to
interpret without requiring additional context. Data visualization tools provide an
accessible way to see and understand trends, outliers, and patterns in data by
using visual elements such as charts, graphs, and maps.
Data visualization is also the process of translating large data sets and metrics into
charts, graphs and other visuals. The resulting visual representation of data makes it
easier to identify and share real-time trends, outliers, and new insights about the
information represented in the data.
Challenges of big data visualization
• Visual noise: Most objects in the dataset are too close to one another, so users
cannot distinguish them as separate objects on the screen.
• Information loss: Reducing the visible data set helps, but it leads to
information loss.
• Large image perception: Data visualization methods are limited not only by the
aspect ratio and resolution of the device, but also by physical perception limits.
• High rate of image change: Users observing the data cannot react to the rate or
intensity of data changes on the display.
2. Understanding the data: One solution is to have the proper domain expertise in
place.
3. Addressing data quality: It is necessary to ensure the data is clean through the
process of data governance or information management.
4. Displaying meaningful results: One way is to cluster data into a higher-level view
where smaller groups of data are visible and the data can be effectively visualized.
5. Dealing with outliers: Possible solutions are to remove the outliers from the data or
create a separate chart for the outliers.
Plots And Diagrams
Contents
Introduction
Graph
Pie Chart
Bar graph
Dot Plot
Box Plots
Stem Plot
Histogram
Scatter Plot
Time series Plot
28-11-2019
Introduction
Graph
Graphs serve two main purposes:
• Extract information from data
• Communicate the information to others
Graphing is therefore one of the statistical methods of analysing data.
Pie Chart
A pie chart is a type of graph in which a circle is divided into sectors that
each represents a proportion of the whole. Pie charts are a useful way to
visualize information that might be presented in a small table.
[Figure: pie chart "Juices" with sectors Mango, Strawberry, Cherries and Banana; sector sizes 12%, 19%, 34% and 35%]
Advantages
Good for showing the relative sizes of the parts of a whole
Several groups can be represented and compared
Good for categorical variables
One piece of the pie can be moved to another location without changing the
meaning of the chart
Disadvantages
[Figure: pie chart with many small sectors, including Kiwifruit, Lemon, Melon, Papaya and Pineapple, with sizes between 2% and 9%]
Bar graph
A bar chart is a chart with rectangular bars with
lengths proportional to the values that they represent. The bars can
be plotted vertically or horizontally. Bar charts can be used to show
comparisons among categories.
[Figure: bar chart "Different brand Mango Juices" with bars for Mango, Mango(1), Mango(2) and Mango(3); vertical axis from 0 to 5]
Source: Brilliant.org
Advantages
Africa Geography Blog,26th August 2015
Disadvantages
Dot Plot
A dot plot is a graphic display using dots and a scale to compare the
frequency within categories or groups.
Dot plots clearly display clusters, scatters and outliers.
It clearly shows that most of the women were in their middle to late twenties and
early thirties, with a scattering between 35 and 60.
Advantages : Disadvantages :
Box Plots
[Figure: box plot marking Min, Q1, Q2, Q3 and Max; each of the four sections holds 25% of the data]
A box plot summarizes data using the median, upper and lower
quartiles, and the extreme (least and greatest) values. It allows
you to see important characteristics of the data at a glance.
The data values found inside the box represent the middle half
(50%) of the data.
The line segment inside the box represents the median.
[Figure: box plot with the values 19, 24, 27, 32 and 60 marked]
Normal distribution: Q3 - Q2 = Q2 - Q1, Mode = median = mean
Positive skew: Q3 - Q2 > Q2 - Q1, Mode < median < mean
Negative skew: Q3 - Q2 < Q2 - Q1, Mode > median > mean
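The five values a box plot displays, and the quartile comparison used to judge skew, can be computed with the standard `statistics` module. The ages below are hypothetical, chosen so that the summary matches the values 19, 24, 27, 32 and 60 marked on the slide. The "inclusive" quartile method is an assumption, since several quartile conventions exist.

```python
import statistics

# Hypothetical ages (illustration only).
ages = [19, 21, 24, 25, 27, 28, 32, 35, 60]

# Quartiles Q1, Q2 (median) and Q3.
q1, q2, q3 = statistics.quantiles(ages, n=4, method="inclusive")
five_number_summary = (min(ages), q1, q2, q3, max(ages))

# Compare the two halves of the box to judge the direction of skew.
upper_half = q3 - q2
lower_half = q2 - q1
if upper_half > lower_half:
    shape = "positive skew"
elif upper_half < lower_half:
    shape = "negative skew"
else:
    shape = "symmetric"

print(five_number_summary, shape)
```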
Advantages : Disadvantages :
Advantages : Disadvantages :
Scatterplot
• A common way to display data on two variables.
• Consists of two axes, a horizontal axis and a vertical axis.
• Each pair of observations of the two variables is shown as a point in the graph.
• Dots representing data points are scattered on the diagram.
• The two variables can be a Critical-To-Quality (CTQ) characteristic and a
factor affecting it, two factors affecting a CTQ,
or two related quality characteristics.
• The extent to which the dots cluster together in a line across the
diagram shows the strength with which the two
factors are related (Moore et al., 2013, p. 99).
Graphing two metric variables contd..
Here is a scatter plot showing the relationship between students' scores on a History
Test and a Math Test.
Is there a relationship between the scores?
Describe the relationship.
A student earned 75% on the History Test. What score would he get on the Math Test?
The line of best fit will help you make a prediction as to what score the
student would get on their math test if they earned a 75% on their history
test: about 77%.
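A line of best fit is usually obtained by least squares. The sketch below fits such a line to hypothetical (History, Math) score pairs and predicts the Math score for a History score of 75. Because the data are invented, the prediction will not match the slide's value exactly.

```python
# Hypothetical test scores in percent (illustration only).
history = [50, 60, 70, 80, 90]
math_scores = [55, 63, 73, 81, 88]

n = len(history)
mean_x = sum(history) / n
mean_y = sum(math_scores) / n

# Least-squares slope and intercept of the line y = intercept + slope * x.
s_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(history, math_scores))
s_xx = sum((x - mean_x) ** 2 for x in history)
slope = s_xy / s_xx
intercept = mean_y - slope * mean_x

def predict_math(history_score):
    """Predicted Math score for a given History score."""
    return intercept + slope * history_score

predicted = predict_math(75)
print(round(slope, 2), round(intercept, 2), round(predicted, 1))
```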
Scatterplot
Advantages of the scatter plot:
• Clearly reflects the relation between two variables.
• No numerical information is lost and simplification of the data is gained.
• Easy to create and to interpret.
• Shows the minimum, maximum and outliers of the data set.
Disadvantages of the scatter plot:
• No grid lines to see exactly where a point is.
• Helps to see the relationship, but it is difficult to tell if the relation is positive, negative, or
if there isn't one.
(Box et al ,1997,p.34)