Module 1

Exploratory Data Analysis: INTRODUCTION TO
DATA SCIENCE
Presenter’s Name
Dr. Shital Bhatt
Associate Professor
School of Computational and Data Sciences
www.vidyashilpuniversity.com
DATA SCIENCE IN VARIOUS
DOMAINS
Why are we talking about Data Science?
Source: https://bit.ly/31HBHuQ
What is DATA SCIENCE?
• “Data Science is a new term. But in the same sense as Columbus discovered NEW Continent 1000 years ago.”
- Hector Garcia-Molina, Professor in the Departments of Computer Science and
• A multi-disciplinary field that uses scientific methods, processes, algorithms and systems to extract knowledge and insights from structured and unstructured data.
Source: https://bit.ly/30dekJB
Big Data Science Tasks

• Facebook
• Amazon
• Google
• LinkedIn
• Netflix
• Microsoft
What do people look for in a data scientist?

Data Science Roles

Roles Required in Data Science Project
Source: https://bit.ly/2z5sYqf
How to become a data scientist?

• Data Scientists need to know how to “CODE”

• Other languages, tools, platforms and visualization
Learning Data Science with Python - Libraries

Learning Data Science with Python - Tools


• Learn to code
Data scientists need to be comfortable with:

Data scientists need to learn machine learning and software engineering
Who are the Data Scientists?
APPLICATIONS OF DATA SCIENCE
• Security
• Sports
• Banking and Finance

• Internet Search
• Digital Advertisements
• Recommender System
• Image Processing
• Speech Recognition
• Gaming
• Price Comparison Websites

• Airline Routing Planning

• Fraud and Risk Detection

• Delivery Logistics
• Internet of Things (IoT)

• Health Care
• Augmented Reality
• Self-Driving Cars
• Robots
IMPACT of data science on society
• Saving Energy
• Data-Driven Hospitals
• A Cleaner Environment
IMPORTANCE of data science
Road to become a Data Scientist
DATA SCIENCE TERMINOLOGY
Application Programming Interface (API)
Collection of software routines, protocols, and tools which provide a programmer with all the building blocks for developing an application program for a specific platform (environment). An API also provides an interface that allows a program to communicate with other programs, running in the same environment. (BusinessDictionary.com)
Artificial Intelligence (AI)

Artificial intelligence is a field of computer science dedicated to solving cognitive problems commonly associated with human intelligence such as learning, problem solving, visual perception and speech and pattern recognition.
Artificial Intelligence System

A technological system that uses a model to make inferences to generate output, including predictions, recommendations or decisions.
Data Science
Data Science is an interdisciplinary field that uses scientific methods and algorithms to extract information
and insights from diverse data types. It combines domain expertise, programming skills and knowledge of
mathematics and statistics to solve analytically complex problems.
Deep Learning
Subset of machine learning that imitates the workings of the human brain in processing data and
improves performance. Typically, a multi-level algorithm that gradually identifies things at higher levels of
abstraction.
Machine Learning (ML)

"Machine learning is the science of getting computers to automatically learn from experience instead of
relying on explicitly programmed rules, and generalize the acquired knowledge to new settings." In
essence, Machine Learning automates analytical model building through optimization algorithms and
parameters that can be modified and fine-tuned.
Python
A programming language first released in 1991 that is popular with people doing data science. Python is
noted for ease of use among beginners, and great power when used by advanced users, especially when
taking advantage of specialized libraries such as those designed for machine learning and graph
generation.
R
An open-source programming language and environment for statistical computing and graph
generation available for Linux, Windows, and Mac.
Reinforcement Learning (RL)

Reinforcement Learning (RL) is a sub-field of Machine Learning involving a controller (termed
an agent) capable of taking actions in the form of decisions within a system. After each
decision is made by the controller, the system evolves to a new state and the controller
receives a measure of utility. By trial and error, the controller learns from its experience to
optimize an action selection strategy that maximizes the expected cumulative utility within
the system. RL is typically used to solve problems that can be modelled as sequential decision
processes.
Semantic
Semantics can address meaning at the levels of words, phrases, sentences, or larger units of
discourse. In machine learning, semantic analysis of a corpus is the task of building structures
that approximate concepts from a large set of documents.
Stochastic optimization
Stochastic optimization methods are optimization methods that generate and use random variables.
For stochastic problems, the random variables appear in the formulation of the optimization problem
itself, which involves random objective functions or random constraints.
Supervised Learning
A type of machine learning algorithm in which a system is taught via examples. For instance, a
supervised learning algorithm can be taught to classify input into specific, known classes. The classic
example is sorting email into spam versus non-spam. (datascienceglossary.org)
Unsupervised Learning
A class of machine learning algorithms designed to identify groupings of data without knowing in
advance what the groups will be. (datascienceglossary.org)
Web scraping
Web scraping is a term for various methods used to collect information from across the Internet.
Acquiring Data
Be cynical about your data
 Is the data relevant to your problem?
 Where did this data come from?
 Who collected it?
 Why? What for?
 Do they have biases that might show up in the data?
 Are there holes in the data (demographic, geographical, political etc)?
 Do you have supporting data? Is it *really* from a different source?
 Can you use this data (are there privacy or copyright issues with using it)?
Data Sources
 Data warehouses and catalogues
 open government data
 NGO websites
 web searches
 online documents, images, maps etc
Creating your own data: People
Creating your own data: Sensors
Data analysis
 Data analysis is an aspect of data science and data analytics that is all about
analyzing data for different kinds of purposes. The data analysis process involves
inspecting, cleaning, transforming and modeling data to draw useful insights from
it.
Descriptive Analytics
 Descriptive analytics is a type of data analysis that focuses on describing and summarizing data
to gain insights into what has happened in the past. It is commonly used to answer questions
such as “What happened?” and “How many?”.
 Descriptive analytics can help businesses and organizations understand their data and identify
patterns and trends that can inform decision-making.
 Here are some real-life examples of descriptive analytics:
• A retail store might analyze historical sales data to identify popular products and trends. For
example, people tend to buy more candy in February.
• Patient data can be summarized to identify common health issues. For example, most people get
the flu from October to June.
• Student performance data can be analyzed to identify areas for improvement. For example, most
students who fail Calculus are frequently late to class.
 To use descriptive analytics effectively, you need to ensure that your data is accurate and of high
quality. It’s also crucial to use clear and concise visualizations to communicate insights
effectively.
Diagnostic Analysis
 Diagnostic analytics is a type of data analysis that goes beyond descriptive analytics
to identify the root cause of an issue or problem. It answers questions such as “Why did it happen?” and
“What caused it?”. For example, you can use diagnostic analysis to determine why your January sales
dropped by 50%.
 Diagnostic analytics involves exploring and analyzing data to identify relationships and correlations that
can help explain an issue or problem. This can be done using techniques such as regression analysis,
hypothesis testing, and causal analysis.
 Real-life examples include:
• You can use diagnostic analysis to identify the root cause of a quality issue in your production process.
• You can also use it to identify the cause behind a customer’s complaint and provide a targeted solution.
• In case of a cyber threat, you can also use it to identify the source of a security breach and prevent future
attacks.
 There are many benefits to using diagnostic analytics, such as identifying the underlying causes of issues
and problems and developing targeted solutions. But, as with other data analytics methods,
there are some challenges to consider. First, acquiring high-quality data and ensuring accurate analysis
and insights can be difficult. Second, the analysis techniques can be quite complex and may require
specialized skills and knowledge to be implemented effectively.
Predictive Analytics
 Predictive analytics uses statistical and machine learning techniques to analyze historical data and predict future events. It
is commonly used to answer questions such as “What is likely to happen?” and “What if?”.
 Predictive analytics is useful as it can help you plan ahead. It can help improve business operations, reduce costs, and
increase revenue. For example, you can predict how sales will likely behave based on seasonality and previous sales
figures. If your predictive analysis tells you that sales will likely decrease in winter, you can use this information to design
an effective marketing campaign for this season.
 Here are some practical examples of predictive analytics in action:
• A bank might use predictive analytics to assess credit risk and determine whether to grant a loan to a customer. In open
banking, predictive analytics can help build highly personalized behavioral models specific to each customer and identify
their creditworthiness in new ways. For customers, this may mean better and cheaper access to bank accounts, credit cards,
and mortgages.
• In marketing, predictive analytics can help identify which customers are most likely to respond to a particular offer.
• In healthcare, predictive analytics can be used to identify patients at risk of developing a particular disease.
• In manufacturing, predictive analytics can be used to forecast demand and optimize supply chain management.
 However, there are also some challenges to using predictive analytics effectively. One challenge is the availability of high-
quality data essential for accurate predictions. Another challenge is selecting appropriate modeling techniques to analyze
the data and make accurate predictions. Finally, communicating predictive analytics results to decision-makers can be
challenging, as the techniques used can be complex and difficult to understand.
Prescriptive Analytics
 Prescriptive analytics is a type of data analysis that goes beyond descriptive and
predictive analytics to provide recommendations for actions you should take. In
other words, this approach involves using optimization techniques to
identify the best course of action, given a set of constraints and objectives.
 It is commonly used to answer questions such as “What should we do?” and
“How can we improve?”
 To be effective, it requires a deep understanding of the data being analyzed and
the ability to model and simulate different scenarios to identify the best course of
action. As such, this is the most complex approach of the four methods.
 Prescriptive analytics can help you solve various problems, including product
mix, workforce planning, marketing mix, capital budgeting, and capacity
management.
 A familiar example of prescriptive analytics in action is using Google Maps for
directions during peak hours. The software considers all modes of transport and
traffic conditions to calculate the best route possible. A transportation company
might use prescriptive analytics in the same way to optimize delivery routes and
minimize fuel costs, which matters especially when fuel costs are rising.
 However, as with predictive analytics, there are some challenges to using
prescriptive analytics effectively. The first challenge is the
availability of high-quality data essential for accurate analysis and optimization.
Another challenge is the complexity of the optimization algorithms used, which
can require specialized skills and knowledge to implement effectively.
Introduction to Statistics and
Probability
STATISTICS
• It is the science of collecting, organizing, analyzing and interpreting data.
• There are two types of Statistics:
Inferential Statistics: It uses sample data to make inferences and draw conclusions
about a population using probability theory.
Descriptive Statistics: It is used to summarize and represent the data in an accurate way
using charts, tables and graphs.
For example, you might stand in a mall and ask a sample of 100 people if they like
shopping at Sears. You could make a bar chart of yes or no answers (that would be
descriptive statistics) or you could use your research (and inferential statistics) to reason
that around 75%-80% of the population likes shopping at Sears.
DESCRIPTIVE STATISTICS
The following measures are used to represent the data set:
• Measure of Position
• Measure of Spread
• Measure of Shape
MEASURE OF POSITION
• Also known as measure of Central Tendency.
• A measure of central tendency is a single value that attempts to describe a set of data by
identifying the central position within that set of data.
• There are three measures of central tendencies: Mean, Median and Mode.
Mean: It is the point where the mass of the distribution of data balances (the arithmetic average).
Median: It is the point that divides the ordered data into two equal halves; it is less susceptible
to outliers than the mean.
Mode: It refers to the data item that occurs most frequently in a given data set.
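As a small illustration of the three measures, the sketch below uses Python's standard `statistics` module. The list of scores is made up for this example (it is not course data) and deliberately contains one outlier to show how the mean and the median react differently.

```python
import statistics

# Hypothetical ungrouped data: nine exam scores, with one outlier (170).
scores = [55, 60, 60, 65, 70, 75, 80, 85, 170]

mean = statistics.mean(scores)      # balance point of the data
median = statistics.median(scores)  # middle value of the sorted data
mode = statistics.mode(scores)      # most frequent value

print("mean:", mean)
print("median:", median)
print("mode:", mode)
```

Here the single outlier pulls the mean above the median, while the median and the mode are unaffected by it.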
Example for ungrouped data:
Question:
MEASURE OF DISPERSION
• It refers to how the data deviates from the position measure, i.e. it gives an indication of
the amount of variation in the process.
• Dispersion of the data set can be described by:

Range: It is the difference between the highest and the lowest values.
Standard Deviation: It measures the typical distance between each value and the
mean (the square root of the average squared deviation), i.e. how far the data is spread
out from the mean. The higher the standard deviation, the more the data is spread out from the mean.
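A minimal sketch of both dispersion measures, again with the standard `statistics` module. The values are invented for illustration; note that `pstdev` divides by n (population) while `stdev` divides by n - 1 (sample).

```python
import statistics

# Hypothetical measurements (illustrative only).
values = [2, 4, 4, 4, 5, 5, 7, 9]

data_range = max(values) - min(values)   # highest value minus lowest value
pop_std = statistics.pstdev(values)      # population standard deviation (divide by n)
sample_std = statistics.stdev(values)    # sample standard deviation (divide by n - 1)

print("range:", data_range)
print("population standard deviation:", pop_std)
print("sample standard deviation:", round(sample_std, 3))
```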
MEASURE OF SHAPE
• It is used to characterize the shape of the distribution of a data set, beyond its location and variability.
• Two common statistics that measure the shape of the data are:
Skewness and Kurtosis
Skewness: It measures the asymmetry of a distribution about its mean.
Skewness for a normal distribution is zero.
Kurtosis: It measures how heavy the tails of a distribution are (often described
as its peakedness) relative to a normal curve, without affecting its symmetry.
The kurtosis for a standard normal distribution is three.
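One common way to compute both statistics is from central moments. The sketch below assumes the simple population (moment-based) definitions, skewness = m3 / m2^1.5 and kurtosis = m4 / m2^2; statistical packages often apply small-sample corrections, so their numbers can differ slightly. The data sets are made up.

```python
def central_moment(data, k):
    """k-th central moment: average of (x - mean) ** k."""
    mean = sum(data) / len(data)
    return sum((x - mean) ** k for x in data) / len(data)

def skewness(data):
    """Moment-based skewness: zero for perfectly symmetric data."""
    return central_moment(data, 3) / central_moment(data, 2) ** 1.5

def kurtosis(data):
    """Moment-based kurtosis (not 'excess' kurtosis): three for a normal curve."""
    return central_moment(data, 4) / central_moment(data, 2) ** 2

symmetric = [1, 2, 3, 4, 5]          # hypothetical symmetric data
right_skewed = [1, 1, 2, 2, 3, 10]   # hypothetical data with a long right tail

print("skewness (symmetric):", skewness(symmetric))
print("skewness (right-skewed):", round(skewness(right_skewed), 3))
print("kurtosis (symmetric):", round(kurtosis(symmetric), 3))
```

A positive skewness indicates a longer right tail; a kurtosis below three indicates lighter tails than a normal curve.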
PROBABILITY
Probability is a numerical description of how likely an event is to occur or how likely
it is that a proposition is true.
Some examples are:

Tossing a coin: When a fair coin is tossed, there are two
possible outcomes: Heads (H) or Tails (T). Thus, the probability
of the coin landing H is ½ and the probability of the coin
landing T is ½.
Rolling a die: When a single fair die is thrown, there are six
possible outcomes: 1, 2, 3, 4, 5, 6. The probability of any
one of them is 1/6.
TERMINOLOGY
Experiment: A process by which an outcome is obtained.
Sample space: The set S of all possible outcomes of an experiment, e.g. the sample space for a
die roll is {1, 2, 3, 4, 5, 6}.
Event: Any subset E of the sample space, e.g. for a die roll, let
E1 = An even number is rolled = {2, 4, 6}.
E2 = A number less than three is rolled = {1, 2}.
Outcome: Result of a single trial.
Equally likely outcomes: Two outcomes of a random experiment are said to be equally
likely, if upon performing the experiment a (very) large number of times, the relative
occurrences of the two outcomes turn out to be equal.
Trial: Performing a random experiment.
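The terminology maps directly onto Python sets. This is a minimal sketch for one roll of a fair die; the helper name `probability` is chosen for this example, and it assumes all outcomes are equally likely.

```python
from fractions import Fraction

# Sample space for one roll of a fair six-sided die.
sample_space = {1, 2, 3, 4, 5, 6}

# Events are subsets of the sample space.
E1 = {x for x in sample_space if x % 2 == 0}  # an even number is rolled
E2 = {x for x in sample_space if x < 3}       # a number less than three is rolled

def probability(event):
    """P(event) for equally likely outcomes: favorable cases / total cases."""
    return Fraction(len(event), len(sample_space))

print(sorted(E1), probability(E1))
print(sorted(E2), probability(E2))
```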

EVENTS
Simple Events: If the event E has only a single element of the sample
space, it is called a simple event. E.g. if S = {56, 78, 96, 54, 89} and
E = {78}, then E is a simple event.
Compound Events: Any event consisting of more than one element of
the sample space. E.g. if S = {56, 78, 96, 54, 89}, E1 = {56, 54}, E2 =
{78, 56, 89}, then E1 and E2 represent two compound events.
Independent Events and Dependent Events:

If the occurrence of an event is completely unaffected by the occurrence
of any other event, such events are Independent Events.
The probability of two independent events both occurring is given by
P(A ∩ B) = P(A) × P(B)
Events which are affected by other events are Dependent Events.
The probability of two dependent events both occurring is given by
P(A ∩ B) = P(A) × P(B | A)
Exhaustive Events: A set of events is called exhaustive if all the events together cover
the entire sample space. E.g. if A and B are mutually exclusive events that are also exhaustive, then
A ∪ B = S
Where,
S = sample space
Mutually Exclusive Events: Events are mutually exclusive if the occurrence of one event excludes
the occurrence of another, i.e. no two of them can occur simultaneously (A ∩ B = ∅).
Addition Theorem
Theorem 1: If A and B are two mutually exclusive events, then
P(A ∪ B) = P(A) + P(B) = n1/n + n2/n
Where,
n = Total number of exhaustive cases
n1 = Number of cases favorable to A.
n2 = Number of cases favorable to B.
Theorem 2: If A and B are two events that are not mutually exclusive, then
P(A ∪ B) = P(A) + P(B) - P(A ∩ B)
Where,
P(A ∩ B) = Probability of both A and B occurring
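Both forms of the addition theorem can be checked by counting outcomes. The sketch below uses one roll of a fair die; the particular events are chosen for illustration only.

```python
from fractions import Fraction

S = {1, 2, 3, 4, 5, 6}  # sample space: one roll of a fair die

def P(event):
    """Probability of an event when all outcomes in S are equally likely."""
    return Fraction(len(event), len(S))

# Theorem 1: mutually exclusive events share no outcomes.
A = {1, 2}
B = {5, 6}
exclusive_ok = (A & B == set()) and (P(A | B) == P(A) + P(B))

# Theorem 2: overlapping events, so the shared outcomes are subtracted once.
C = {2, 4, 6}   # an even number
D = {1, 2}      # a number less than three
overlap_ok = P(C | D) == P(C) + P(D) - P(C & D)

print(exclusive_ok, overlap_ok, P(C | D))
```

In the code, `A | B` is set union (A ∪ B) and `C & D` is set intersection (C ∩ D).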
Multiplication Theorem
If A and B are two independent events, then the probability that both will
occur is equal to the product of their individual probabilities:
P(A ∩ B) = P(A) × P(B)
Example:
Suppose the probabilities that a candidate lecturer is a B.Com, an MBA, and a PhD are
1/20, 1/25 and 1/40 respectively, and that these events are independent.
Using the multiplication theorem for independent events, the probability of appointing a lecturer with all three is
P = 1/20 × 1/25 × 1/40 = 1/20000
Conditional Probability
The conditional probability of an event B is the probability that the event will
occur given the knowledge that an event A has already occurred. It is
represented as P(B | A).
P(B | A) = P(A ∩ B) ⁄ P(A)
Where A and B are two dependent events and P(A) > 0.
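A small counting example of conditional probability, assuming one roll of a fair die. The events are chosen for illustration: A is "an even number is rolled" and B is "a number greater than three is rolled".

```python
from fractions import Fraction

S = {1, 2, 3, 4, 5, 6}  # sample space: one roll of a fair die
A = {2, 4, 6}           # an even number is rolled
B = {4, 5, 6}           # a number greater than three is rolled

def P(event):
    """Probability of an event when all outcomes in S are equally likely."""
    return Fraction(len(event), len(S))

# Probability of B given that A has occurred: restrict attention to outcomes in A.
p_b_given_a = P(A & B) / P(A)

print("P(B) =", P(B))
print("P(B | A) =", p_b_given_a)
```

Because P(B | A) differs from P(B), knowing that A occurred changes the probability of B, so these two events are dependent.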

BAYES’ THEOREM
It is a mathematical formula for determining conditional probability.
P(A | B) = P(B | A) × P(A) ⁄ P(B)
In the above formula, the posterior probability P(A | B) is equal to the conditional
probability of event B given A multiplied by the prior probability of A, all
divided by the prior probability of B.
Science itself is a special case of Bayes’ theorem because we are revising a prior
probability (hypothesis) in the light of observation or experience that confirms our
hypothesis (experimental evidence) to develop a posterior probability (conclusion).
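A minimal sketch of this prior-to-posterior update, using a diagnostic test as the evidence. All three input numbers are hypothetical and chosen only to make the arithmetic easy to follow.

```python
from fractions import Fraction

# Hypothetical inputs (assumptions for illustration, not real medical data).
prior = Fraction(1, 100)                  # P(disease): prior probability of the hypothesis
p_pos_given_disease = Fraction(95, 100)   # P(positive | disease)
p_pos_given_healthy = Fraction(5, 100)    # P(positive | no disease)

# Total probability of the evidence: P(positive).
p_positive = p_pos_given_disease * prior + p_pos_given_healthy * (1 - prior)

# Bayes' theorem: P(disease | positive) = P(positive | disease) * P(disease) / P(positive).
posterior = p_pos_given_disease * prior / p_positive

print("P(positive) =", p_positive)
print("P(disease | positive) =", posterior, "=", round(float(posterior), 3))
```

With these assumed numbers the posterior is much larger than the 1% prior but still far from certain, because false positives among the many healthy people are common relative to true positives.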
Data Distributions in Exploratory Data
Analysis
 Understanding and Visualizing Data
Types of Data Distributions
 Common types of data distributions in EDA include:

 Normal Distribution
 Uniform Distribution
 Binomial Distribution
 Poisson Distribution
 Exponential Distribution
Normal Distribution
 A continuous probability distribution characterized by a bell-shaped curve.

 Symmetrical around the mean.
 Defined by mean (μ) and standard deviation (σ).
 68% of data within 1σ, 95% within 2σ, 99.7% within 3σ.
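The 68%, 95% and 99.7% figures can be reproduced from the normal cumulative distribution function. This sketch uses `statistics.NormalDist` from the Python standard library (Python 3.8 or later); the helper name `share_within` is chosen for this example.

```python
from statistics import NormalDist

standard_normal = NormalDist(mu=0, sigma=1)

def share_within(k):
    """Probability that a normal value lies within k standard deviations of the mean."""
    return standard_normal.cdf(k) - standard_normal.cdf(-k)

for k in (1, 2, 3):
    print(f"within {k} sigma: {share_within(k):.4f}")
```

The same proportions hold for any mean and standard deviation, since they depend only on the distance from the mean measured in units of σ.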
Uniform Distribution
 A distribution where all outcomes are equally likely.

 Continuous and discrete types.
 For a continuous uniform distribution, every interval of the same length is equally probable.
 For a discrete uniform distribution, each value is equally likely.
Binomial Distribution
 A discrete distribution representing the number of successes in a fixed number of

independent Bernoulli trials.
 Defined by number of trials (n) and probability of success (p).
 Used in scenarios like coin tosses, yes/no surveys.
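A short sketch of the binomial probability mass function, P(X = k) = C(n, k) p^k (1 - p)^(n - k), applied to the coin-toss scenario. The function name `binomial_pmf` and the choice of 10 tosses are assumptions made for this example.

```python
from math import comb

def binomial_pmf(k, n, p):
    """Probability of exactly k successes in n independent trials with success probability p."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)

n, p = 10, 0.5  # ten tosses of a fair coin
p_five_heads = binomial_pmf(5, n, p)
total = sum(binomial_pmf(k, n, p) for k in range(n + 1))

print("P(exactly 5 heads in 10 tosses) =", p_five_heads)
print("sum over all k =", total)
```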
Poisson Distribution
 A discrete distribution representing the number of events occurring within a fixed interval
of time or space.
 Defined by the average rate (λ).
 Applicable for rare events, e.g., number of phone calls at a call center in an hour.
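A short sketch of the Poisson probability mass function, P(X = k) = λ^k e^(-λ) / k!. The rate of 4 calls per hour is a hypothetical value for the call-center example, and the function name `poisson_pmf` is chosen for this sketch.

```python
from math import exp, factorial

def poisson_pmf(k, lam):
    """Probability of exactly k events in an interval when the average rate is lam."""
    return lam**k * exp(-lam) / factorial(k)

lam = 4  # hypothetical average of 4 calls per hour

for k in range(0, 9, 2):
    print(f"P({k} calls in an hour) = {poisson_pmf(k, lam):.4f}")
```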
Exponential Distribution
 A continuous distribution used to model the time between events in a Poisson process.
 Defined by the rate parameter (λ).
 Memoryless property: the probability of waiting a further time t for the next event does not depend on
how long you have already waited, i.e. P(T > s + t | T > s) = P(T > t).
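The memoryless property can be checked numerically from the survival function P(T > t) = e^(-λt). The rate and the two time values below are arbitrary numbers picked for illustration.

```python
from math import exp

def survival(t, lam):
    """P(T > t) for an exponential waiting time with rate lam."""
    return exp(-lam * t)

lam = 2.0        # hypothetical rate: 2 events per unit of time
s, t = 1.5, 0.5  # already waited s; ask about a further t

# P(T > s + t | T > s) = P(T > s + t) / P(T > s)
conditional = survival(s + t, lam) / survival(s, lam)
unconditional = survival(t, lam)

print("P(T > s + t | T > s) =", round(conditional, 6))
print("P(T > t)             =", round(unconditional, 6))
```

The two printed values agree: having already waited s does not change the distribution of the remaining waiting time.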
Types of
Data
• The term variable means a quality or quantity which varies
from one member of a sample or population to another.
Nominal Variable: A qualitative variable that categorizes (or
describes, or names) an element of a population.
Ordinal Variable: A qualitative variable that incorporates an

ordered position, or ranking.
Discrete Variable: A quantitative variable that can assume a

countable number of values. Intuitively, a discrete variable can
assume values corresponding to isolated points along a line interval.
That is, there is a gap between any two values.
Continuous Variable: A quantitative variable that can assume an

uncountable number of values. Intuitively, a continuous variable can
assume any value along a line interval, including every possible value
between any two values.
Classification by the number of variables
• Univariate - data that describes a single
characteristic of the population
• Bivariate - data that describes two characteristics
of the population
• Multivariate - data that describes more than two
characteristics (beyond the scope of this course)
Identify the following variables:
1. the income of adults in your city: Numerical
2. the color of M&M candies selected at random from a bag: Categorical
3. the number of speeding tickets each student in AP Statistics has received: Numerical
4. the area code of an individual: Categorical
5. the birth weights of female babies born at a large hospital over the course of a year: Numerical
Exercises
• Identify the type of data (nominal, ordinal, interval and ratio) represented
by each of the following. Confirm your answers by giving your own
examples.
1. Blood group
2. Temperature (Celsius)
3. Ethnic group
4. Job satisfaction index (1-5)
5. Number of heart attacks
6. Calendar year
7. Serum uric acid (mg/100ml)
8. Number of accidents in 3 - year period
9. Number of cases of each reportable disease reported by a health worker
10. The average weight gain of six 1-year-old dogs (with a special diet
supplement) was 950 grams last month.
Types of Data
Data sets can consist of two types of data: qualitative
data and quantitative data.
Qualitative Data: Consists of attributes, labels, or nonnumerical entries.
Quantitative Data: Consists of numerical measurements or counts.
Qualitative and Quantitative Data
 Example:
 The grade point averages of five students are listed in the table. Which data
are qualitative data and which are quantitative data?
Student GPA
Sally 3.22
Bob 3.98
Cindy 2.75
Mark 2.24
Kathy 3.84
Qualitative data: the student names. Quantitative data: the grade point averages.

Levels of Measurement
The level of measurement determines which statistical
calculations are meaningful. The four levels of
measurement are: nominal, ordinal, interval, and ratio.
Levels of Measurement (lowest to highest): Nominal, Ordinal, Interval, Ratio
Nominal Level of Measurement
Data at the nominal level of measurement are
qualitative only.
Nominal: Categorized using names, labels, or qualities. No mathematical
computations can be made at this level.
Examples: Colors in the US flag; Names of students in your class; Textbooks you are using this semester
Ordinal Level of Measurement
Data at the ordinal level of measurement are qualitative
or quantitative.
Ordinal: Arranged in order, but differences
between data entries are not meaningful.
Examples: Class standings (freshman, sophomore, junior, senior); Numbers on the back of each player’s shirt; Top 50 songs played on the radio
Interval Level of Measurement
Data at the interval level of measurement are
quantitative. A zero entry simply represents a position on
a scale; the entry is not an inherent zero.
Interval: Arranged in order; the differences between data
entries can be calculated.
Examples: Temperatures; Years on a timeline; Atlanta Braves World Series victories
Ratio Level of Measurement
Data at the ratio level of measurement are similar to the
interval level, but a zero entry is meaningful.
Ratio: A ratio of two data values can be formed so one
data value can be expressed as a multiple of another.
Examples: Ages; Grade point averages; Weights
Summary of Levels of Measurement
Level of measurement   Put data in categories   Arrange data in order   Subtract data values   Determine if one data value is a multiple of another
Nominal                Yes                      No                      No                     No
Ordinal                Yes                      Yes                     No                     No
Interval               Yes                      Yes                     Yes                    No
Ratio                  Yes                      Yes                     Yes                    Yes
What is a Data Format?
• A structured way to represent and store data
• Ensures consistency and interoperability
• Different formats for different data types (text, numbers, images, etc.)
Common Data Formats
• Text formats (TXT, CSV, JSON) - Plain text for basic data
• Document formats (DOC, PDF, ODT) - Structured documents with layout and formatting
• Spreadsheet formats (XLS, XLSX) - Data organized in rows and columns
• Image formats (JPG, PNG, GIF) - Digital representations of pictures
• Audio formats (MP3, WAV) - Storage of sound recordings
• Video formats (MP4, AVI) - Combination of images and audio for moving pictures
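To make the idea of a text format concrete, the sketch below reads two records as CSV and writes the same records as JSON, using only the Python standard library. The two rows are taken from the GPA example earlier in this module; the column names `student` and `gpa` are assumptions made for this example.

```python
import csv
import io
import json

# The same data as CSV text: one header line, then one line per record.
csv_text = "student,gpa\nSally,3.22\nBob,3.98\n"

# CSV stores everything as text, so numeric columns must be converted explicitly.
rows = list(csv.DictReader(io.StringIO(csv_text)))
for row in rows:
    row["gpa"] = float(row["gpa"])

# JSON keeps the distinction between strings and numbers.
json_text = json.dumps(rows, indent=2)
round_trip = json.loads(json_text)

print(json_text)
```

The explicit `float(...)` step shows one practical difference between the formats: CSV carries no type information, while JSON distinguishes numbers from strings.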
Working with Categorical Data
Introduction to Categorical Data
 Categorical data represents characteristics and can be divided into groups or categories.
 Examples: Gender, Marital Status, Yes/No responses.
Types of Categorical Data
 Nominal Data: Categories with no inherent order. Examples: Colors, Types of Pets.
 Ordinal Data: Categories with a meaningful order. Examples: Movie Ratings, Education
Levels.
Methods of Representing Categorical Data
 Various methods are used to visually represent categorical data for better understanding
and analysis.
Bar Charts
 A bar chart uses rectangular bars to represent the frequency or count of categories.
 Advantages: Easy to understand, compare categories.
 Disadvantages: Not suitable for a large number of categories.
A bar plot is a common way to display a single categorical variable. A

bar plot where proportions instead of frequencies are shown is called a
relative frequency bar plot.
Pie Charts
 A pie chart is a circular chart divided into slices to illustrate numerical proportions.
 Advantages: Shows part-to-whole relationships.
 Disadvantages: Hard to compare slices, not effective for many categories.
Frequency Tables
 A table that lists categories and their corresponding frequencies.
 Advantages: Simple and precise.
 Disadvantages: Not visual, harder to interpret for large datasets.
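A frequency table, a relative frequency column and a rough text bar chart can all be built with `collections.Counter`. The survey responses below are invented for illustration.

```python
from collections import Counter

# Hypothetical Yes/No survey responses (illustrative only).
responses = ["Yes", "No", "Yes", "Yes", "No", "Yes", "Yes", "No", "Yes", "Yes"]

frequency = Counter(responses)    # category -> count
total = sum(frequency.values())
relative = {category: count / total for category, count in frequency.items()}

# One row per category: name, frequency, relative frequency, and a text bar.
for category, count in frequency.most_common():
    print(f"{category:<5} {count:>3} {relative[category]:>6.0%} {'#' * count}")
```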
Mosaic Plots
 A mosaic plot is a graphical representation of contingency tables.
 Advantages: Shows relationships between two or more categorical variables.
 Disadvantages: Can be complex to interpret.
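A mosaic plot is drawn from a contingency table, so building that table is the first step. The sketch below counts pairs of categories with `collections.Counter`; the education and employment records are hypothetical, and drawing the plot itself is left to a plotting library.

```python
from collections import Counter

# Hypothetical (education level, employment status) records.
records = [
    ("School", "Employed"), ("School", "Unemployed"), ("School", "Unemployed"),
    ("Graduate", "Employed"), ("Graduate", "Employed"), ("Graduate", "Unemployed"),
    ("Postgraduate", "Employed"), ("Postgraduate", "Employed"),
]

table = Counter(records)  # (row category, column category) -> count
rows = sorted({education for education, _ in records})
cols = sorted({status for _, status in records})

# Print the contingency table: one row per education level, one column per status.
print(f"{'':14}" + "".join(f"{c:>12}" for c in cols))
for r in rows:
    print(f"{r:14}" + "".join(f"{table[(r, c)]:>12}" for c in cols))
```

In a mosaic plot, each cell of this table becomes a rectangle whose area is proportional to the cell count.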
Advantages and Disadvantages of Different
Methods
 Bar Charts: + Easy to understand, - Not suitable for many categories.
 Pie Charts: + Good for proportions, - Hard to compare slices.
 Frequency Tables: + Precise, - Not visual.
 Mosaic Plots: + Shows relationships, - Complex.
Practical Examples
 Example 1: Bar chart showing favorite colors among students.

 Example 2: Pie chart showing market share of smartphone brands.
 Example 3: Frequency table of survey responses.
 Example 4: Mosaic plot of education level and employment status.
Representing Text Data
Introduction
 - The rise of digital text data.

 - Importance in data analysis and machine learning.
 - Common applications: NLP, search engines, sentiment analysis.
Types of Text Data
 - **Unstructured Text:** Raw text without specific formatting.

 - Examples: Articles, social media posts, literature.
 - **Semi-structured Text:** Text with some structure but not in a database.
 - Examples: Emails, HTML pages, JSON files.
 - **Structured Text:** Text data organized in a predefined manner.
 - Examples: Spreadsheets, databases.
Text Representation Methods
Bag-of-Words (BoW)
- Converts text into vectors of word counts.

 - Treats text as a collection of individual words.
 - **Example:**
 - Document: 'I love machine learning.'
 - BoW: [1, 1, 1, 1] (assuming the vocabulary is ['I', 'love', 'machine', 'learning'])
 - **Pros:**
 - Simple and easy to implement.
 - **Cons:**
 - Ignores word order and context.
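A minimal Bag-of-Words sketch that reproduces the example vector. It assumes a fixed vocabulary and a very simple tokenizer (strip full stops, split on spaces, case-sensitive); real pipelines usually lowercase text and handle punctuation more carefully.

```python
from collections import Counter

vocabulary = ["I", "love", "machine", "learning"]

def bag_of_words(text, vocab):
    """Return the count of each vocabulary word in the text, in vocabulary order."""
    tokens = text.replace(".", "").split()  # naive tokenizer, for illustration only
    counts = Counter(tokens)
    return [counts[word] for word in vocab]

vector = bag_of_words("I love machine learning.", vocabulary)
repeated = bag_of_words("I love learning. I love machine learning.", vocabulary)

print(vector)
print(repeated)
```

The second sentence shows the main limitation: the vector records how often each word occurs but not the order in which the words appear.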
Term Frequency-Inverse Document Frequency
(TF-IDF)
- Weighs terms based on their frequency in a document and their rarity across all documents.
 - **Formulas:**
 - TF(t) = (Number of times term t appears in a document) / (Total number of terms in the
document)
 - IDF(t) = log_e(Total number of documents / Number of documents with term t in it)
 - **Example:**
 - Term 'machine' appears 3 times in a 100-term document and in 10 of 1000 documents: TF = 3/100 = 0.03, IDF = log_e(1000/10) ≈ 4.605, so TF-IDF ≈ 0.138
 - **Pros:**
 - Highlights important words.
 - **Cons:**
 - Still ignores word order.
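The two formulas can be evaluated directly for the 'machine' example (3 occurrences in a 100-term document, present in 10 of 1000 documents). This sketch uses the natural logarithm, as in the IDF formula above; library implementations often add smoothing terms, so their values can differ.

```python
import math

def tf(term_count, total_terms):
    """Term frequency: share of the document's terms that are this term."""
    return term_count / total_terms

def idf(n_documents, n_documents_with_term):
    """Inverse document frequency with a natural logarithm and no smoothing."""
    return math.log(n_documents / n_documents_with_term)

tf_value = tf(3, 100)
idf_value = idf(1000, 10)
tfidf = tf_value * idf_value

print("TF =", tf_value)
print("IDF =", round(idf_value, 3))
print("TF-IDF =", round(tfidf, 3))
```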
Word Embeddings
- Represents words in continuous vector space.

 - Captures semantic meaning.
 - **Models:**
 - **Word2Vec:** Predicts context from words or vice versa.
 - **GloVe:** Factorizes word co-occurrence matrix.
 - **Example:**
 - 'king' - 'man' + 'woman' ≈ 'queen'
 - **Pros:**
 - Captures semantic relationships.
 - **Cons:**
 - Requires large corpora for training.
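The analogy arithmetic can be illustrated with hand-made toy vectors. The three dimensions below (royalty, male, female) and all the values are assumptions invented for this sketch; they are not trained Word2Vec or GloVe embeddings, which have hundreds of dimensions with no such direct interpretation.

```python
import math

# Hand-made toy vectors, dimensions assumed to be [royalty, male, female].
vectors = {
    "king":  [1.0, 1.0, 0.0],
    "queen": [1.0, 0.0, 1.0],
    "man":   [0.0, 1.0, 0.0],
    "woman": [0.0, 0.0, 1.0],
    "child": [0.0, 0.5, 0.5],
}

def cosine(u, v):
    """Cosine similarity: 1.0 means the vectors point in the same direction."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.hypot(*u) * math.hypot(*v))

# 'king' - 'man' + 'woman', computed component by component.
target = [k - m + w for k, m, w in zip(vectors["king"], vectors["man"], vectors["woman"])]

# Find the closest remaining word, excluding the three words used in the query.
candidates = [w for w in vectors if w not in ("king", "man", "woman")]
best = max(candidates, key=lambda w: cosine(target, vectors[w]))

print(target, "->", best)
```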
Sentence Embeddings
 - **Explanation:**
 - Represents entire sentences or phrases in vector space.
 - Captures semantic meaning beyond individual words.
 - **Models:**
 - **Sentence-BERT:** Adapts BERT for sentence-level tasks.
 - **Example:**
 - Similar sentences have closer vectors.
 - **Pros:**
 - Captures context and sentence meaning.
 - **Cons:**
 - Computationally intensive.
Applications
 - **Text Classification:**
 - Classifying emails as spam or not spam.
 - **Sentiment Analysis:**
 - Determining sentiment from social media posts.
 - **Machine Translation:**
 - Translating text from one language to another.
 - **Information Retrieval:**
 - Search engines retrieving relevant documents.
Challenges in Text Representation
 - **Ambiguity and Polysemy:**

 - Words with multiple meanings.
 - **High Dimensionality:**
 - Large vocabulary leads to high-dimensional vectors.
 - **Contextual Meaning:**
 - Capturing the context in which words are used.
 - **Data Sparsity:**
 - Many words appear infrequently.
Representing Text Data
Introduction to Text Data Representation
 Importance of Text Data

 - Widely used in Natural Language Processing (NLP)
 - Applications include search engines, chatbots, etc.
 Challenges
 - High dimensionality
 - Sparsity
 - Semantic understanding
Basic Text Representation: Bag of Words
(BoW)
 Definition
 - Represents text by counting the occurrence of each word
 Advantages
 - Simple and easy to implement
 Disadvantages
 - Ignores word order and context
Basic Text Representation: Term Frequency-
Inverse Document Frequency (TF-IDF)
 Definition
 - Adjusts word counts by their frequency in the document and across all documents
 Formula
 - TF-IDF(t,d) = TF(t,d) × IDF(t)
 Advantages
 - Reduces the impact of commonly occurring words
 Disadvantages
 - Still ignores word order
Word Embeddings: Word2Vec
 Definition
 - Uses neural networks to generate dense vector representations of words
 Models
 - CBOW (Continuous Bag of Words)
 - Skip-gram
 Advantages
 - Captures semantic meaning
 Applications
 - Similarity comparisons
Word Embeddings: GloVe
 Definition
 - Combines the benefits of global co-occurrence counts (matrix factorization) and local context-window methods such as Word2Vec
 Methodology
 - Uses global word-word co-occurrence statistics
 Advantages
 - Efficient in capturing semantic relationships
Word Embeddings: FastText
 Definition
 - Extends Word2Vec by representing each word as a bag of character n-grams
 Advantages
 - Handles out-of-vocabulary words
 Applications
 - Morphologically rich languages
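The sketch below shows how a word can be split into character n-grams with boundary markers, which is the idea that lets subword models build a vector for a word they have never seen. It is a simplification: it uses a single n-gram length of 3, whereas FastText typically uses a range of lengths, and the function name `char_ngrams` is chosen for this example.

```python
def char_ngrams(word, n=3):
    """Character n-grams of a word, with '<' and '>' marking its start and end."""
    padded = f"<{word}>"
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]

print(char_ngrams("where"))

# A word missing from the training vocabulary still shares n-grams with known words.
known = set(char_ngrams("learning"))
unseen = set(char_ngrams("relearning"))
shared = known & unseen

print(sorted(shared))
```

Because the unseen word shares most of its n-grams with a known word, a model that stores one vector per n-gram can still compose a reasonable vector for it.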
Contextual Embeddings: ELMo
 Definition
 - Generates word representations using deep bi-directional LSTM
 Advantages
 - Context-sensitive
 Applications
 - Improves performance on downstream NLP tasks
Contextual Embeddings: BERT
 Definition
 - Uses transformers to provide context-aware embeddings
 Advantages
 - Handles context more effectively
 Applications
 - Wide range of NLP tasks
Contextual Embeddings: GPT-based Embeddings
 Definition
 - Uses transformer architecture to generate embeddings
 Advantages
 - Generates coherent and contextually relevant text
 Applications
 - Language generation tasks
Advanced Techniques: Sentence and Document Embeddings
 Sentence Embeddings
 - Captures the meaning of entire sentences
 - Examples: Universal Sentence Encoder, Sentence-BERT
 Document Embeddings
 - Aggregates sentence embeddings for documents
Advanced Techniques: Transformer Models
 Transformers
 - Use attention mechanisms
 - Examples: GPT-3, T5
 Advantages
 - State-of-the-art results in many NLP tasks
Applications of Text Data Representation
 Sentiment Analysis
 - Determine the sentiment of text (positive, negative, neutral)
 Text Classification
 - Categorize text into predefined categories
 Named Entity Recognition (NER)

 - Identify entities (names, dates, etc.) in text
 Machine Translation
 - Translate text from one language to another
Tools and Libraries
 NLTK
 - Comprehensive library for NLP
 SpaCy
 - Industrial-strength NLP
 Gensim
 - Topic modeling and document similarity
 Transformers (Hugging Face)

 - Pre-trained transformer models
Conclusion
 Future Trends
 - Increasing use of transformer models
 - Improved contextual understanding
 Summary
 - Effective text representation is crucial for NLP
 - Ongoing advancements in embeddings and models
Time Series Data Representation
Introduction to Time Series Data
 Definition:
 A sequence of data points collected or recorded at successive points in time.
 Examples:
 - Stock Prices
 - Weather Data
 - Sales Data
Components of Time Series Data
 Trend:
 Long-term movement in the data.
 Seasonality:
 Regular pattern repeating over a known, fixed period.
 Cyclic Patterns:
 Long-term oscillations with no fixed period.
 Irregular Components:
 Random noise or anomalies.
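One simple way to separate a trend from a seasonal pattern is a moving average whose window equals the seasonal period. The sketch below is an illustration under stated assumptions: the series is synthetic (a linear trend plus a period-4 pattern that sums to zero), and `moving_average` is a hypothetical helper, not a library function.

```python
def moving_average(values, window):
    """Mean of each run of `window` consecutive values."""
    return [
        sum(values[i:i + window]) / window
        for i in range(len(values) - window + 1)
    ]

# Synthetic data: trend 2*t plus a seasonal pattern with period 4.
seasonal = [3, -1, -3, 1]
series = [2 * t + seasonal[t % 4] for t in range(12)]

# Averaging over one full period cancels the seasonal pattern.
trend = moving_average(series, window=4)

print(series)
print(trend)
```

The smoothed values rise by a constant step, which recovers the linear trend hidden under the seasonal swings.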
Types of Time Series Data
 Continuous vs Discrete:
 - Continuous: Recorded continuously over time.
 - Discrete: Recorded at specific time intervals.
 Univariate vs Multivariate:
 - Univariate: Single variable recorded over time.
 - Multivariate: Multiple variables recorded over time.
Time Series Data Representation Techniques
 Line Plots:
 Visualize data points connected by lines.
 Scatter Plots:
 Individual data points displayed without connecting lines.
 Seasonal Plots:
 Show seasonality in data.
 Heatmaps:
 Visualize data intensity.
Applications of Time Series Analysis
 Forecasting:
 Predict future values based on past data.
 Anomaly Detection:
 Identify unusual patterns or outliers.
 Trend Analysis:
 Understand long-term movement.
Tools and Libraries for Time Series Analysis
 Python Libraries:
 - Pandas
 - Matplotlib
 - Seaborn
 - Statsmodels
Representing Information in Different Modalities
Audio, Image, and Video
Introduction to Different Modalities
  In multimedia, information can be represented in various modalities, including audio,
image, and video.
  Each modality has its own characteristics and uses, making it suitable for different
types of data representation and analysis.
Audio Representation
 Basics of Audio Representation:

 Audio data is typically represented as waveforms, which show how the sound amplitude varies with time.
 File Formats:
 - WAV (Waveform Audio File Format)
 - MP3 (MPEG-1 Audio Layer III)
 - AAC (Advanced Audio Coding)
 Visualization Techniques:
 - Waveforms
 - Spectrograms
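To make the waveform representation concrete, the sketch below generates amplitude samples of a pure tone. The sample rate, frequency, and duration are arbitrary assumptions chosen for illustration.

```python
import math

sample_rate = 8000   # samples per second (assumed)
frequency = 440.0    # tone frequency in Hz (assumed)
duration = 0.01      # length of the clip in seconds (assumed)

num_samples = round(sample_rate * duration)

# Each entry is the amplitude of the sound at one point in time.
waveform = [
    math.sin(2 * math.pi * frequency * n / sample_rate)
    for n in range(num_samples)
]

print(num_samples)
print(waveform[:5])
```

Plotting `waveform` against the sample index gives the familiar waveform view; a spectrogram would instead show how the frequency content changes over time.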
Image Representation
 Basics of Image Representation:

 Images are represented as a grid of pixels, each having a specific color value.
 File Formats:
 - JPEG
 - PNG
 - GIF
  Visualization Techniques:
 - Histograms
 - Color Channels
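The pixel-grid view of an image, together with a colour channel and its histogram, can be sketched with plain nested lists. The tiny 2×3 image below is made up, and RGB tuples with values from 0 to 255 are an assumed encoding.

```python
from collections import Counter

# A hypothetical 2x3 image: each pixel is an (R, G, B) tuple.
image = [
    [(255, 0, 0), (0, 255, 0), (0, 0, 255)],
    [(255, 255, 255), (0, 0, 0), (255, 0, 0)],
]

height = len(image)
width = len(image[0])

# Extract one colour channel: the red value of every pixel.
red_channel = [[pixel[0] for pixel in row] for row in image]

# Histogram: how many pixels have each red intensity.
red_histogram = Counter(value for row in red_channel for value in row)

print(height, width)
print(red_channel)
print(red_histogram)
```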
Video Representation
 Basics of Video Representation:

 Videos are sequences of images (frames) displayed at a specific frame rate to create the illusion of motion.
 File Formats:
 - MP4
 - AVI
 - MKV
  Visualization Techniques:
 - Frame Extraction
 - Motion Vectors
Applications
 Audio:
 - Music Streaming
 - Speech Recognition
 Image:
 - Digital Photography
 - Medical Imaging
 Video:
 - Streaming Services
 - Video Surveillance
Data Visualization
What is Data Visualization
  Data visualization is the graphical representation of data points and
information so that users can understand them quickly and easily. A good
visualization has a clear meaning and purpose and is easy to interpret without
requiring extra context. Data visualization tools provide an accessible way to
see and understand trends, outliers, and patterns in data by using visual
elements such as charts, graphs, and maps.
  Data visualization is also the process of translating large data sets and metrics into
charts, graphs, and other visuals. The resulting visual representation makes it
easier to identify and share real-time trends, outliers, and new insights about the
information represented in the data.
Challenges of big data visualization
• Visual noise: Most of the objects in the data set are too close to each other, so
users cannot distinguish them as separate objects on the screen.
• Information loss: Reducing the visible data set is possible, but it leads to
information loss.
• Large image perception: Data visualization methods are limited not only by the
aspect ratio and resolution of the device, but also by physical perception limits.
• High rate of image change: Users observing the data cannot react to the rate or
intensity of data changes on the display.
• High performance requirements: These are hardly noticeable in static visualization,
which has lower speed requirements, but they matter when the display must
update quickly.
Solutions to these challenges
1. Meeting the need for speed: One possible solution is hardware: increased memory
and powerful parallel processing. Another is to keep the data in memory
while using a grid computing approach, where many machines are used.
2. Understanding the data: One solution is to have the proper domain expertise in
place.
3. Addressing data quality: It is necessary to ensure the data is clean through the
process of data governance or information management.
4. Displaying meaningful results: One way is to cluster data into a higher-level view
where smaller groups of data are visible and the data can be effectively visualized.
5. Dealing with outliers: Possible solutions are to remove the outliers from the data or
to create a separate chart for the outliers.
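The slides do not say how outliers are identified. As one possible approach, the sketch below assumes the common 1.5 × IQR rule and splits a data set into the values to chart normally and the outliers to chart separately. The data and the helper name `split_outliers` are hypothetical.

```python
import statistics

def split_outliers(values, k=1.5):
    """Split values into (inliers, outliers) using the k * IQR rule (assumed)."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    inliers = [v for v in values if low <= v <= high]
    outliers = [v for v in values if v < low or v > high]
    return inliers, outliers

# Hypothetical measurements with one extreme value.
data = [10, 12, 11, 13, 12, 11, 95]
inliers, outliers = split_outliers(data)

print(inliers)   # values for the main chart
print(outliers)  # values for a separate chart
```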
Plots And Diagrams
Contents
 Introduction
 Graph
 Pie Chart
 Bar graph
 Dot Plot
 Box Plots
 Stem Plot
 Histogram
 Scatter Plot
 Time series Plot
28-11-2019
Introduction
  There are many ways of collecting data.
  Data analysis consists of one or more of three activities:
  Graphing the data
  Tabulating the data
  Computing something from the data
  Simplification makes the data easier to understand and makes it easier to
extract information from the new form.
  Data simplification cannot preserve all of the original information.
  There is always some loss of information when we analyse the
data.
Graph
  Graphs are made for two main purposes:
  Extract information from data
  Communicate the information to others
  Graphing is therefore one of the statistical methods of analysing data.
Pie Chart
 A pie chart is a type of graph in which a circle is divided into sectors that
each represents a proportion of the whole. Pie charts are a useful way to
visualize information that might be presented in a small table.
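[Figure: pie chart "Juices" with four sectors (Mango, Strawberry, Cherries, Banana) of 12%, 19%, 34%, and 35%; the pairing of percentages with categories is not recoverable.]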
Advantages
  Good for showing the relative sizes of groups
  Several groups can be represented and compared
  Good for categorical variables
  A piece of the pie can be moved to another location without changing the
meaning of the chart
Disadvantages
  Not suitable for showing many observations
  Not useful for representing a large number of groups
[Figure: pie chart "Juices" with 19 categories (Mango, Strawberry, Cherries, Banana, Blueberries, Carrot, Guava, Tomato, Apple, Avocado, Mixed Fruit, Grape, Orange, Beetroot, Kiwifruit, Lemon, Melon, Papaya, Pineapple), with sectors between 1% and 10%.]
Bar graph
 A bar chart is a chart with rectangular bars with
lengths proportional to the values that they represent. The bars can
be plotted vertically or horizontally. Bar charts can be used to show
comparisons among categories.
[Figure: bar chart "Different brand Mango Juices" with bars for Mango, Mango(1), Mango(2), and Mango(3); vertical axis from 0 to 5.]
Source: Brilliant.org
Advantages
 Show each data category in a frequency distribution

 Display relative numbers/proportions of multiple categories
 Summarize a large amount of data in a visual, easily interpretable form
 Make trends easier to highlight than tables do
 Estimates can be made quickly and accurately
 Accessible to a wide audience
Africa Geography Blog,26th August 2015
Disadvantages
 Can be easily manipulated to give false impressions
Africa Geography Blog,26th August 2015
Dot Plot
 A dot plot is a graphic display using dots and a scale to compare the
frequency within categories or groups.
 Dot plots clearly display clusters, scatters and outliers.
Source: Cpalms.org

Example:
(Iversen and Gergen, 1997, p. 80)

Dot Plot
It clearly shows that most of the women were in their middle to late twenties and
early thirties, with a scattering between 35 and 60.
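A dot plot can be approximated in plain text by printing one mark per observation next to each value. The ages below are hypothetical and are not the data from the cited example.

```python
from collections import Counter

# Hypothetical ages, for illustration only.
ages = [24, 25, 25, 27, 27, 27, 29, 31, 31, 38, 52]

counts = Counter(ages)

# One line per distinct value, one "*" per observation.
lines = [f"{age:>3} | {'*' * counts[age]}" for age in sorted(counts)]

print("\n".join(lines))
```

Every original value remains visible, which is the main strength of the dot plot for small data sets.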

Dot Plots
Advantages:
  Distribution of the variable is shown clearly.
  Original values are shown without losing any of them.
Disadvantages:
  Not suitable for large data sets.

Box plot
  The box plot is a standardized way of displaying the distribution of data
based on the minimum, first quartile, median, third quartile, and maximum
of the data set.
[Figure: box plot marking Min, Q1, Q2, Q3, and Max; each of the four segments between them holds 25% of the data.]
 A box plot summarizes data using the median, upper and lower
quartiles, and the extreme (least and greatest) values. It allows
you to see important characteristics of the data at a glance.
 The data values found inside the box represent the middle half
(50%) of the data.
  The line segment inside the box represents the median.
[Figure: box plot labelled with the values 19, 24, 27, 32, and 60.]
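The five values a box plot is built from can be computed with the standard library. This is a sketch on hypothetical data; quartile conventions differ between textbooks and tools, and the "inclusive" method used here is one assumed choice.

```python
import statistics

def five_number_summary(values):
    """Return (minimum, Q1, median, Q3, maximum) of the values."""
    q1, q2, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return min(values), q1, q2, q3, max(values)

# Hypothetical data set.
data = [1, 2, 3, 4, 5, 6, 7, 8, 9]
summary = five_number_summary(data)

print(summary)
```

The box spans Q1 to Q3, the line inside it is the median, and the whiskers extend toward the minimum and maximum.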

Skewness in boxplot
  Normal distribution: Q3-Q2 = Q2-Q1, Mode = median = mean
  Positive skew: Q3-Q2 > Q2-Q1, Mode < median < mean
  Negative skew: Q3-Q2 < Q2-Q1, Mode > median > mean
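The quartile comparisons above translate directly into a small check. The function name `boxplot_skew` and the sample quartiles are hypothetical; real data rarely gives exact equality, so in practice a tolerance would likely be needed.

```python
def boxplot_skew(q1, q2, q3):
    """Classify skew by comparing the two halves of the box."""
    upper = q3 - q2  # distance from median to third quartile
    lower = q2 - q1  # distance from first quartile to median
    if upper > lower:
        return "positive skew"
    if upper < lower:
        return "negative skew"
    return "symmetric"

print(boxplot_skew(24, 27, 32))  # upper half of the box is longer
print(boxplot_skew(3, 5, 7))     # both halves are equal
print(boxplot_skew(10, 18, 20))  # lower half of the box is longer
```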

Box Plots
Advantages:
  Shows the 5-number summary
  Makes comparison easy
  Identifies outliers
  Handles extremely large data sets
Disadvantages:
  Individual data is lost
  Can be confusing to read

Stem Plot
  Has a vertical stem with branches growing out on both sides
  Branches at left: first digit
  Branches at right: remaining digits
(Iversen and Gergen, 1997, p. 80)
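A stem plot for two-digit numbers can be sketched by grouping each value's last digit under its first digit. The values and the helper name `stem_and_leaf` are hypothetical.

```python
from collections import defaultdict

def stem_and_leaf(values):
    """Group two-digit values: tens digit as stem, units digit as leaf."""
    stems = defaultdict(list)
    for v in sorted(values):
        stems[v // 10].append(v % 10)
    return dict(sorted(stems.items()))

# Hypothetical two-digit values.
ages = [19, 24, 27, 27, 32, 35, 41, 60]
plot = stem_and_leaf(ages)

for stem, leaves in plot.items():
    print(f"{stem} | {' '.join(str(leaf) for leaf in leaves)}")
```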

Histogram
  A histogram is a type of graph that shows the frequency distribution of
data within equal intervals (thus, there are no spaces between the bars).
  Most commonly used graph
(Iversen and Gergen, 1997, p. 80)
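The frequencies behind a histogram come from counting how many values fall into each equal-width interval. The sketch below uses hypothetical ages and a hypothetical helper `histogram_counts`; the bin start and width are arbitrary choices.

```python
def histogram_counts(values, start, width, num_bins):
    """Count values in bins [start + i*width, start + (i+1)*width)."""
    counts = [0] * num_bins
    for v in values:
        index = int((v - start) // width)
        if 0 <= index < num_bins:
            counts[index] += 1
    return counts

# Hypothetical ages, binned into decades from 10 to 70.
ages = [19, 22, 24, 25, 27, 27, 29, 31, 32, 38, 45, 60]
counts = histogram_counts(ages, start=10, width=10, num_bins=6)

for i, count in enumerate(counts):
    low = 10 + i * 10
    print(f"{low}-{low + 9}: {'#' * count}")
```

Only the counts per interval survive, which is why individual values are lost in a histogram.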
Histogram
Advantages:
  Suitable for large data sets
  Distribution of the variable is shown
Disadvantages:
  Individual data is lost

Graphing two metric variables (contd.)
Scatterplot
  A common way to display data on two variables.
  Consists of two axes, a horizontal axis and a
vertical axis.
  Each pair of observations of the two variables is shown
as a point in the graph.
  Dots representing data points are scattered
on the diagram.
  Typical pairs are a Critical-To-Quality (CTQ) characteristic and a
factor affecting it, two factors affecting a CTQ,
or two related quality characteristics.
  How closely the dots cluster together in a line across the
diagram shows the strength with which the two
factors are related.
(Moore et al., 2013, p. 99)
WHAT DATA IS NECESSARY TO MAKE A SCATTER PLOT?

 Multiple (two) sets of numerical data
 An independent variable
 A dependent variable
 In some cases, an independent variable and dependent variable can be
interchanged
When Would You Use a Scatter Plot?
 to investigate a relationship between two quantities
 identify the trend in the data set. (negative correlation, positive correlation,
or no correlation)
(Arteaga et al., 2012, p. 266)
(Arteaga et al., 2012, p. 268)
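Whether a scatter plot shows positive, negative, or no correlation can be checked numerically with Pearson's correlation coefficient. This is a sketch on hypothetical data; the 0.3 cut-off for "no correlation" is an arbitrary assumption, not a standard.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

def describe(r, threshold=0.3):
    """Label the trend; the threshold is an assumed cut-off."""
    if r > threshold:
        return "positive correlation"
    if r < -threshold:
        return "negative correlation"
    return "no correlation"

# Hypothetical data.
x = [1, 2, 3, 4]
r_up = pearson_r(x, [2, 4, 6, 8])
r_down = pearson_r(x, [8, 6, 4, 2])

print(describe(r_up), describe(r_down))
```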

History Vs. Math
  Here is a scatter plot showing the relationship between students' scores on a History
test and a Math test.
  Is there a relationship between the scores?
  Describe the relationship.
History Vs. Math
  Since the data show a positive correlation,
predict what a student who earned a 75% on their
history test earned on their math test.
  What can I draw that will help me make that
prediction?
[Figure: History vs. Math scatter plot annotated "75% on History Test".]
  The line of best fit will help you predict what score the
student would get on their math test if they earned a 75% on their history
test.
  What score would the student get on the math test? About 77%.
Scatterplot
Advantages of the scatter plot:
  Clearly reflects the relation between two variables.
  No numerical information is lost and simplification of the data is gained.
  Easy to create and to interpret.
  Shows minimum, maximum, and outliers of the data set.
Disadvantages of the scatter plot:
  No graph lines to see exactly where the point is.
  Helps to see the relationship, but it is difficult to tell if the relation is
positive, negative, or if there isn't one.
(Iversen and Gergen, 1997, p. 94)
Time series plot

Introduction
 A time series or time sequence is a data set in which the observations are
recorded in the order in which they occur.
 A time series plot is a graph in which the vertical axis denotes the
observed value of the variable (say x) and the horizontal axis denotes the
time (which could be minutes, days, years, etc.).
 When measurements are plotted as a time series, we often see
• trends,
• series, or
• other broad features of the data
(Box et al., 1997, p. 34)
