Emerging Tech Notes - Module1
Introduction to Data Science, AI & Machine Learning
Learning Objectives
Define and differentiate between machine learning and generative AI, understand their key
concepts, and explore their applications across various fields.
Data comes from various sources, such as customer transactions, sales records, website logs,
and social media interactions.
Types of Data
Product Recommendations
This segment has an iPhone, indicating a preference for high-end technology. The inclusion
of beer suggests a lifestyle that might involve socialising or relaxation after work. Cornflakes
are often associated with a quick, convenient breakfast, which is common for busy
individuals. Overall, this customer likely enjoys tech, convenience, and leisure activities.
Like P2, the diapers indicate that this customer likely has a child. The iPhone suggests they
are tech-savvy, and biscuits could be a snack for either the child or the parent. This segment
appears to be focused more on family-oriented products, possibly indicating a parent or
caregiver who is attentive to both the child's needs and their own.
These guesses are based on the product combinations and typical consumer behaviour
patterns associated with those products.
The example provided involves analyzing customer purchase data to infer customer profiles
and understand purchasing behaviours.
1. Helps retailers understand product affinities, optimize product placements, design
cross-selling strategies, and improve inventory management.
2. Classifying customers into different segments based on their purchase behavior and
inferring their profiles (e.g., young adult, parent).
3. Enables targeted marketing, personalised recommendations, and improved customer
satisfaction by addressing specific needs and preferences of each segment.
4. Predicting what other products a customer might be interested in based on current
purchasing patterns.
5. Using insights from data analysis to inform marketing campaigns, product placement,
and inventory decisions.
The example provided illustrates a simplified scenario of how data science is applied in
real-world contexts to understand and predict customer behaviour.
Data Science is a multidisciplinary field that focuses on finding actionable insights from large
sets of structured and unstructured data.
A data scientist is a professional who uses scientific methods, algorithms, and systems to
extract insights and knowledge from structured and unstructured data. They apply a
combination of statistical analysis, machine learning, data mining, and programming skills to
interpret complex data, identify patterns, and make data-driven decisions.
Use Case 1: Is it A or B? Will the applicant be able to repay the loan or not?
The objective is to predict whether a given applicant belongs to the "repay" class or the
"default" class based on various input features (such as income, credit score, existing debt,
etc.).
The model might output a probability score that represents the likelihood of the applicant
repaying the loan. For example, a score of 0.85 might indicate an 85% chance of repayment.
This type of problem is fundamental in credit risk assessment, where data science models
help financial institutions make informed lending decisions.
Use Case 2: Is it weird (something weird means an anomaly)? I am getting so many spam
emails in my inbox.
The sudden increase in spam emails represents a deviation from the norm. In data science,
this could be detected using anomaly detection techniques, such as statistical methods (e.g.,
z-score), machine learning algorithms (e.g., Isolation Forest, Autoencoders), or rule-based
systems.
By applying anomaly detection techniques, the root cause can be identified and addressed,
improving the efficiency of spam filters and enhancing email security.
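The z-score method mentioned above can be sketched in a few lines of Python. The daily spam counts below are made-up illustration data, not from any real mailbox:

```python
import statistics

# Daily counts of spam emails reaching the inbox (illustrative data):
# a stable baseline followed by a sudden spike on the last day.
daily_spam = [4, 5, 3, 6, 4, 5, 4, 3, 5, 42]

mean = statistics.mean(daily_spam)
stdev = statistics.stdev(daily_spam)

# Flag any day whose z-score exceeds 2 as an anomaly.
anomalies = [x for x in daily_spam if abs(x - mean) / stdev > 2]
print(anomalies)  # the spike of 42 stands out
```

The threshold of 2 standard deviations is a common rule of thumb; stricter or looser cut-offs can be chosen depending on how costly false alarms are.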
Learning Objectives
Learn about various data collection methods, understand data quality issues and
preprocessing techniques, and explore data cleaning and transformation.
Raw Data
Raw Data refers to unprocessed information that is collected and stored in its original format.
It is the most fundamental form of data, captured directly from various sources, such as
sensors, devices, or databases.
Raw Data is typically characterized by its lack of structure, organization, or meaningful
interpretation.
It may include text files, log files, images, audio recordings, or numeric data.
Raw Data is acquired from different sources and stored as-is without any transformations or
modifications. It can be collected manually or automatically through various methods, such
as data extraction tools, IoT devices, or data streaming technologies.
Once the Raw Data is collected, it can be stored in databases, data warehouses, or data
lakes, where it awaits further processing and analysis.
By preserving data in its original format, Raw Data ensures data integrity and enables
retrospective analysis.
Related Terms
Data Lake: A data lake is a centralized repository that stores Raw Data in its native format,
facilitating data exploration, analysis, and processing.
Data Pipeline: A data pipeline refers to the set of processes and tools used to extract,
transform, and load (ETL) Raw Data into a destination system for further processing.
Data Warehouse: A data warehouse stores cleaned and processed data in a centralized
system, using hierarchical dimensions and tables. Data warehouses can be used to source
analytic or operational reporting, and for business intelligence (BI) use cases.
machine learning models. Recently added to Azure, it's the latest big data tool for the
Microsoft cloud.
As a data scientist, you'll require data to address the problems you're working on.
Occasionally, your organization might already have the data you need. However, if the
necessary data isn't being collected, you'll need to collaborate with a data engineering team to
develop a system that begins gathering the required data.
Storage systems
The choice of storage system for raw data depends on various factors, including the volume
of data, access patterns, scalability requirements, and budget.
1. Cloud storage solutions like Amazon S3 or Google Cloud Storage are popular for
their scalability and integration with analytics tools.
2. Distributed file systems like HDFS are preferred for big data applications.
3. In contrast, databases and data lakes offer structured environments for specific use
cases.
Their work ensures that data is collected, stored, and made accessible in a reliable and
efficient manner for use by data scientists, analysts, and other stakeholders.
Data Cleaning
Data Cleaning is the process of identifying, diagnosing, and correcting the inconsistencies
present in the data, in line with the problem at hand and the business processes.
Dirty data is data that is incomplete, incorrect, incorrectly formatted, erroneous, or irrelevant
to the problem we are trying to solve.
How do we know the data is dirty, or more precisely, how do we recognize the dirt, and what
are the main types of data defects?
Case 1
In an e-commerce domain, every box is weighed and its dimensions measured by the pickup
executive at the time of completing the pickup process.
Since this is a manual process in most cases, there are high chances of incorrect weights and
dimensions being captured in the system. This error can then propagate through the whole
system and impact other operations like load planning, vehicle optimization, and even
invoicing.
Can you guess what kinds of errors to expect in the weight and dimension fields of the
consignment data? Let me help you with some:
Take a pause and think of some more possibilities. If you are wondering which syntax or
code can help detect such anomalies, don't worry for now.
Some of these errors are fixable and some are not. Non-fixable errors can sometimes be
corrected or analyzed by data engineering teams, or by changing the flow or mechanism of
data collection to improve data quality.
Standard errors have a pre-defined set of detection and correction techniques that you can
learn, building your understanding so you can keep your eyes open for any dirt present.
a) If standard errors are repeating or occurring frequently, they can also be fixed at the
source by data engineering teams or through semi-automation using Python scripts.
b) Data combined from multiple sources, web sources, or APIs can be cleaned by
codifying its identified anomalies, string patterns, types, and formats.
Let’s understand the technique involved by taking the scenario of duplicate records. A
duplicate record is when you have two (or more) rows with the same information.
In a call center framework, automatic calls are made and assigned to agents; if for some
reason a call gets triggered twice and recorded in the database tables, there is a high chance
that duplicate records are created.
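In Pandas, detecting and dropping such duplicate rows takes one call each. The call-centre table below is a made-up example with illustrative column names:

```python
import pandas as pd

# Hypothetical call-centre records: call 102 was triggered twice
# and written to the table two times.
calls = pd.DataFrame({
    "call_id":  [101, 102, 102, 103],
    "agent":    ["A", "B", "B", "C"],
    "duration": [120, 45, 45, 300],
})

# Flag rows that exactly repeat an earlier row...
dupes = calls.duplicated()
# ...and keep only the first occurrence of each record.
clean = calls.drop_duplicates()
print(dupes.sum(), len(clean))
```

In practice you often pass a subset of columns (e.g. `subset=["call_id"]`) so that rows counting as duplicates are defined by the business key rather than by every field.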
Data Preprocessing
Data preprocessing is a crucial step in the data science pipeline, involving the transformation
of raw data into a format suitable for analysis and modelling.
The primary goal is to enhance data quality and ensure that it is clean, consistent, and ready
for subsequent analysis or machine learning tasks.
1. Handling Missing Values: Identifying and addressing missing data through
imputation, deletion, or filling with default values.
2. Removing Duplicates: Detecting and eliminating duplicate records to avoid
redundancy.
3. Correcting Errors: Fixing inconsistencies and errors in the data, such as incorrect
values or typos.
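The three steps above can be sketched together in Pandas. The toy dataset and the typo correction mapping are made up for illustration:

```python
import pandas as pd
import numpy as np

# Toy dataset with a missing value, a duplicate row, and a typo.
df = pd.DataFrame({
    "city":  ["Pune", "Pune", "Mumbai", "Dehli"],
    "sales": [100.0, 100.0, np.nan, 250.0],
})

# 1. Handle missing values: impute with the column mean.
df["sales"] = df["sales"].fillna(df["sales"].mean())
# 2. Remove duplicates.
df = df.drop_duplicates().reset_index(drop=True)
# 3. Correct errors: fix a known typo.
df["city"] = df["city"].replace({"Dehli": "Delhi"})
print(df)
```

Mean imputation is only one choice; depending on the problem you might instead drop the row, fill with the median, or use a default value, as listed in step 1.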
Learning Objectives
Perform exploratory data analysis (EDA), interpret data patterns, and gain insights through
visualization and statistical methods.
You’ve just decided to take a spontaneous trip. Excitement kicks in, and you sit down to book
a flight. But instead of rushing into the first deal that pops up, you become a savvy traveller.
You open multiple tabs—comparing airlines, ticket prices, and perks. One airline offers free
WiFi, another has complimentary meals, and yet another has glowing reviews from happy
travellers.
You start making mental notes, weighing your options. Should you go with the cheaper flight
or the one with better service?
This decision-making journey is exactly what Exploratory Data Analysis (EDA) is all
about—taking raw information, exploring it from different angles, and finding the best
insights before making a choice.
Exploratory Data Analysis (EDA) is an approach to analyzing data sets to summarize their
main characteristics and uncover patterns, relationships, and anomalies. It is an essential step
in the data analysis process that helps to understand the data before applying more complex
statistical or machine learning techniques.
You're the data analyst for a retail store, and you want to understand the sales patterns from
the past month to make informed decisions about product restocking.
How do we Analyse?
1) You begin by understanding basic stats about the data:
a) Total Revenue: By summing the revenue column, you find the store earned
$18,100 in 10 days.
b) Top-Selling Product: The product "Shoes" has sold 310 units vs.
"Jackets" at 165 units.
2) Visualisations: You can create a simple bar chart showing the units sold per
product:
a) Chart: Shoes vs. Jackets sales comparison. Shoes: 310 units; Jackets:
165 units. This tells you that shoes are selling almost double compared to
jackets.
3) Identifying Patterns: Regional Insights:
a) The South and East regions are showing strong sales for both products.
b) The West and North regions lag slightly, especially in jacket sales.
Through this simple EDA, you've identified which products are selling better, which regions
need more focus, and how revenue trends are moving. This will help you decide stock levels
and marketing strategies for each region.
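The basic stats in step 1 can be computed directly with Pandas. The per-region rows below are made up so that they add up to the totals quoted above ($18,100 revenue, 310 shoes, 165 jackets):

```python
import pandas as pd

# Simplified, made-up sales records for the month.
sales = pd.DataFrame({
    "product": ["Shoes", "Shoes", "Jackets", "Jackets"],
    "region":  ["South", "West", "East", "North"],
    "units":   [180, 130, 95, 70],
    "revenue": [9000, 6100, 1800, 1200],
})

total_revenue = sales["revenue"].sum()
units_by_product = sales.groupby("product")["units"].sum()
top_seller = units_by_product.idxmax()
print(total_revenue, top_seller)
# The bar chart in step 2 would be: units_by_product.plot(kind="bar")
```

The same `groupby` pattern with `"region"` as the key gives the regional insights of step 3.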
Objective of EDA
1. Communicating information about the data set: how summary tables and graphs
can be used for communicating information about the data.
a. Tables can be used to present both detailed and summary level information
about a data set.
b. Graphs visually communicate information about variables in data sets and the
relationships between them.
In data science and machine learning, models and algorithms are fundamental components
for solving problems and making predictions based on data. Here’s a brief explanation of
each concept and their interplay.
Models : A model is a mathematical or computational representation of a real-world process
or system. In machine learning, it refers to a trained algorithm that can make predictions or
decisions based on input data.
Different models can be used to approach a problem from various angles. This is often done
to find the best-performing model for a specific task.
Types of Models:
a. Regression Models: Predict continuous outcomes (e.g., Linear Regression).
b. Classification Models: Predict categorical outcomes (e.g., Logistic Regression,
Decision Trees).
c. Clustering Models: Group similar data points together (e.g., K-Means Clustering).
d. Ensemble Models: Combine multiple models to improve performance (e.g., Random
Forest, Gradient Boosting Machines).
Key Algorithms:
1. Linear Regression: Used for predicting a continuous outcome based on linear
relationships between features.
2. Decision Trees: Used for classification and regression by splitting data based on
feature values.
3. Support Vector Machines (SVM): Used for classification by finding the optimal
hyperplane that separates different classes.
4. Neural Networks: Used for complex tasks like image recognition and natural
language processing by mimicking the structure of the human brain.
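To make the first of these algorithms concrete, here is linear regression fitted by ordinary least squares using only NumPy. The data points are made up and lie exactly on the line y = 2x + 1, so the fitted slope and intercept recover those values:

```python
import numpy as np

# Made-up points lying exactly on the line y = 2x + 1.
x = np.array([0.0, 1.0, 2.0, 3.0])
y = np.array([1.0, 3.0, 5.0, 7.0])

# Design matrix: one column for x, one bias column of ones.
X = np.column_stack([x, np.ones_like(x)])

# Solve min ||Xw - y||^2 for the slope w and intercept b.
(w, b), *_ = np.linalg.lstsq(X, y, rcond=None)
print(round(w, 3), round(b, 3))
```

Real data is noisy, so the fit would only approximate the underlying relationship; the least-squares machinery, however, is identical.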
Learning Objectives
Learn effective data visualization techniques, understand the importance of storytelling with
data, and explore various data visualization tools, e.g., Tableau, Power BI, or Alteryx.
Communicating Insights
Communication and visualization are essential for translating complex data analysis into
clear, actionable insights. Effective communication ensures that findings are presented in a
way that is understandable and relevant to stakeholders, while visualization helps to visually
represent data, making it easier to identify patterns and make informed decisions. Both steps
are integral to making data science outcomes accessible and useful for decision-making.
3. Avoid clutter
4. Seeking audience attention
5. Designer approach
6. Organizing storyline
Learning Objectives
Understand the characteristics of big data, explore relevant technologies, and delve into the
challenges and opportunities it brings.
Facebook generates approximately 500 terabytes of data per day, airlines generate about 10
terabytes of sensor data every 30 minutes, and the NSE stock exchange generates
approximately 1 terabyte of data per day; these are a few examples of Big Data.
Introduction to Big Data
Big data is a large, diverse set of information that can grow at ever-increasing speed and
cannot be loaded on a single machine due to its size.
Increasing memory can be one alternative, but it is not the ideal option due to the required
data processing and computational requirements.
Big data is defined by the 5 V's: volume, variety, value, velocity, and veracity.
Let's discuss each term individually.
Although storing raw data is not difficult, converting unstructured data into a structured
format and making it accessible for business use is practically complex.
Smart sensors, smart metering, and RFID tags make it necessary to deal with huge data influx
in almost real-time.
Big data can be of various formats of data either in structured as well as unstructured form,
and comes from various different sources. The main sources of big data can be of the
following types:
a. Social Media
b. Cloud Platforms
c. IoT, Smart Sensors, RFID
d. Web Pages
e. Financial Transactions
f. Healthcare and Medical Data
g. Satellite
Big Data can be categorized as structured, unstructured, and semi-structured data. It is also
helpful in areas as diverse as stock market analysis, medicine & healthcare, agriculture,
gambling, environmental protection, etc.
The scope of big data is vast: it is not limited to handling voluminous data; it also covers
optimizing how data is stored in a structured format to enable easy analysis.
Big Data Infrastructure
Hadoop is an open-source framework based on Java that manages the storage and processing
of large amounts of data for applications. Hadoop uses distributed storage and parallel
processing to handle big data and analytics jobs, breaking workloads down into smaller
workloads that can be run at the same time.
1. Hadoop Distributed File System (HDFS, for storage): HDFS distributes data across
the cluster so that nodes operate on data that resides in their local storage. This removes
network latency, providing high-throughput access to application data.
2. Yet Another Resource Negotiator: YARN is a resource-management platform
responsible for managing compute resources in clusters and using them to schedule
users’ applications. It performs scheduling and resource allocation across the Hadoop
system.
3. Map Reduce (for processing): In the MapReduce model, subsets of larger datasets
and instructions for processing the subsets are dispatched to multiple different nodes,
where each subset is processed by a node in parallel with other processing jobs. After
processing, the individual results are combined into a smaller, more manageable
dataset.
4. Hadoop Common: Hadoop Common includes the libraries and utilities used and
shared by other Hadoop modules.
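The map, shuffle, and reduce steps described in the MapReduce model can be illustrated in plain Python on a classic word-count task. This is a single-machine sketch of the idea, not the Hadoop API; in real MapReduce the chunks would live on different nodes:

```python
from collections import defaultdict

# Two "subsets" of a larger text, as if dispatched to two nodes.
chunks = [
    "big data big jobs",
    "big clusters process data",
]

# Map: each node emits (word, 1) pairs for its own chunk.
mapped = [(word, 1) for chunk in chunks for word in chunk.split()]

# Shuffle: group the emitted pairs by key (the word).
groups = defaultdict(list)
for word, count in mapped:
    groups[word].append(count)

# Reduce: combine each group into a single count.
word_counts = {word: sum(counts) for word, counts in groups.items()}
print(word_counts)
```

Because each map call touches only its own chunk and each reduce call only its own key, both phases parallelise naturally, which is exactly what Hadoop exploits at cluster scale.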
Hadoop tools
Hadoop has a large ecosystem of open source tools that can augment and extend the
capabilities of the core module.
Beyond HDFS, YARN, and MapReduce, the entire Hadoop open source ecosystem continues
to grow and includes many tools and applications to help collect, store, process, analyze, and
manage big data.
Some of the main software tools used with Hadoop include:
a. Apache Hive: A data warehouse that allows programmers to work with data in HDFS
using a query language called HiveQL, which is similar to SQL.
b. Apache HBase: An open source non-relational distributed database often paired with
Hadoop.
c. Apache Pig: A tool used as an abstraction layer over MapReduce to analyze large sets
of data, enabling functions like filter, sort, load, and join.
d. Apache Impala: An open source, massively parallel processing SQL query engine often
used with Hadoop.
Statistics is a branch of mathematics that deals with collecting, analyzing, interpreting, and
presenting data. It helps in making informed decisions based on data. Statistics is broadly
divided into descriptive and inferential statistics.
Methods for making statements about data with confidence: inferential statistics
cover ways of making confident statements about populations using sample data.
a) Confidence intervals: allow us to make statements concerning the likely
range that a population parameter (such as the mean) lies within.
b) Hypothesis tests: a hypothesis test determines whether the data collected
supports a specific claim.
c) Chi-square tests: a procedure to understand whether a relationship exists
between pairs of categorical variables.
d) Analysis of variance (ANOVA): determines whether a relationship exists
between three or more group means.
3) Comparative statistics allow us to understand relationships between variables.
If you want to know the average height of students at your university, you could measure the
height of a sample of students (e.g., 100 students) and use inferential statistics to estimate the
average height of all students (the population).
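Sticking with the height example, a 95% confidence interval for the population mean can be computed from a sample with only the standard library. The sample below is synthetic, and the interval uses the normal approximation (z = 1.96), which is reasonable for a sample of 100:

```python
import math
import statistics

# Made-up sample of 100 student heights in cm, spread around ~165.
heights = [165 + (i % 21) - 10 for i in range(100)]

n = len(heights)
mean = statistics.mean(heights)
sem = statistics.stdev(heights) / math.sqrt(n)  # standard error of the mean

# 95% confidence interval via the normal approximation.
low, high = mean - 1.96 * sem, mean + 1.96 * sem
print(round(low, 2), round(high, 2))
```

Reading the result: we would expect the true average height of all students to fall inside this range in about 95% of samples drawn this way. For small samples, a t-distribution critical value should replace 1.96.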
Frequency Distributions
The concept of frequency distribution stands as a fundamental tool for data analysis, offering
a window into the underlying patterns and trends hidden within raw data.
Frequency distribution is a methodical arrangement of data that reveals how often each value
in a dataset occurs.
Leveraging Python to construct and analyze frequency distributions adds a layer of efficiency
and flexibility to the statistical analysis process. With libraries such as Pandas for data
manipulation, and Matplotlib and Seaborn for data visualization, Python transforms the way
data scientists and statisticians approach frequency distribution, making it easier to manage
data, perform calculations, and generate insightful visualizations.
Histograms
A histogram is a graphical representation of the distribution of numerical data. It consists of
bars where the height of each bar represents the frequency of data points within a specific
range or bin.
Pandas is exceptionally well-suited for creating frequency distributions, especially with its
DataFrame structure that makes data manipulation intuitive.
Loading Data
First, load your dataset into a Pandas DataFrame. For example, let's use the Iris dataset
available through Seaborn.
For discrete data, use the `value_counts()` method to generate a frequency distribution. For
continuous data, you can categorize the data into bins using the `cut()` function and then apply
`value_counts()`.
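Both calls can be sketched in a few lines. To stay self-contained, the snippet uses a small made-up stand-in for an Iris-style petal-length column, and the bin edges and labels are illustrative choices:

```python
import pandas as pd

# Small stand-in for a continuous Iris-style column (petal length, cm).
petal = pd.Series([1.4, 1.3, 4.7, 4.5, 5.9, 6.0, 1.5, 5.1])

# Discrete case: frequency of each distinct value.
counts = petal.value_counts()

# Continuous case: bin the values with cut(), then count per bin.
bins = pd.cut(petal, bins=[0, 2, 5, 7], labels=["short", "medium", "long"])
freq = bins.value_counts()
print(freq.to_dict())
```

With the real Iris data you would replace the hand-typed Series with `sns.load_dataset("iris")["petal_length"]`.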
A population includes all of the elements from a set of data while a sample consists of one or
more observations from the population.
A measurable characteristic of a population, such as a mean or standard deviation, is called a
parameter, while a measurable characteristic of a sample is called a statistic.
Sampling is done because one usually cannot gather data from the entire population. Data
may be needed urgently, and including everyone in the population in your data collection may
take too long.
More than 500 million people voted in India in the 2024 general elections. If any agency had
to conduct an exit poll survey, it could not possibly do so by reaching out to all the voters.
Learning Objectives
Define machine learning and generative AI, grasp their key concepts and distinctions, and
explore their applications across different fields.
Machine Learning (ML) is a subset of artificial intelligence (AI) that involves the
development of algorithms and statistical models that enable computers to perform tasks
without explicit instructions. Instead of being programmed with specific rules to follow, these
models learn from data and identify patterns, make decisions, or predictions based on new
data.
It enables systems to improve over time and make intelligent decisions based on data.
Generative AI uses machine learning models trained on large datasets to generate new
content like text, images, videos, music, and code.
1. Apps like Google Maps use machine learning to provide real-time traffic updates,
suggest optimal routes, and estimate travel times based on historical and current
traffic data.
2. Platforms like Netflix and Spotify use machine learning to analyze your viewing or
listening history and recommend movies, TV shows, or music tailored to your
preferences.
3. E-Commerce: Websites like Amazon and Flipkart use algorithms to suggest products
based on your browsing history and previous purchases.
4. Virtual Assistants: Siri, Google Assistant, and Alexa utilize natural language
processing (NLP) and machine learning to understand and respond to your voice
commands, answer questions, and perform tasks like setting reminders or controlling
smart home devices.
5. Spam Detection: Email services use machine learning to identify and filter out spam
or phishing emails, based on patterns and characteristics learned from previous
emails.
6. Social Media: Platforms like Facebook, Instagram, and Twitter use machine
learning to analyze your interactions and display posts, ads, and stories that match
your interests.
7. Smartwatches and fitness trackers use machine learning to analyze your activity data,
monitor health metrics, and provide insights into your fitness and well-being.
8. Chatbots: Customer service bots on websites and apps use machine learning to
understand and respond to customer inquiries, providing instant support and
information.
9. Photo Organization: Apps like Google Photos use machine learning to categorize and
search for photos based on content, such as people, places, or objects.
10. Self-Driving Cars: Companies like Tesla use machine learning to enable autonomous
vehicles to navigate roads, detect obstacles, and make driving decisions based on
sensor data.
11. Duolingo : Duolingo is a popular language learning app that leverages machine
learning to enhance its language education services
12. Security Systems: Smartphones and security cameras use deep learning for facial
recognition to unlock devices or identify individuals in surveillance footage.
13. Autocorrect and Predictive Text: Smartphones and email applications use NLP to
suggest words or correct spelling and grammar as you type, enhancing typing
efficiency.
Machine Learning is like teaching a computer to learn from examples and make decisions
or predictions based on that learning.
For example, if you have pictures of cats and dogs, a machine learning model can learn to
distinguish between them by analyzing features like fur texture, ear shape, etc.
Once the model has learned from many examples, you can give it new, unseen data (like a
new picture of a cat it has never seen). The model uses what it learned to make predictions or
decisions, like predicting whether a given image is of a cat or a dog.
The generalised flow can be understood as depicted in the following diagram:
Say a Multinational Bank wants to improve its loan approval process by using machine
learning.
………………..
Types of Machine Learning
Machine Learning can be supervised, semi-supervised, unsupervised, or reinforcement learning.
a. A loss function
b. An optimization criterion based on the loss function (a cost function, for example)
Machine learning (ML) is the process of creating systems that can learn from data and make
predictions or decisions. ML is a branch of artificial intelligence (AI) that has many
applications in various domains, such as computer vision, natural language processing,
recommender systems, and more.
To develop ML models, data scientists and engineers need tools that can simplify the
complex algorithms and computations involved in ML. These tools are called ML
frameworks, and they provide high-level interfaces, libraries, and pre-trained models that
can help with data processing, model building, training, evaluation, and deployment.
Most of these are Python machine learning frameworks, primarily because Python is the most
popular machine learning programming language.
1. Azure ML Studio : Azure ML Studio allows Microsoft Azure users to create and
train models, then turn them into APIs that can be consumed by other services. Users
get up to 10GB of storage per account for model data, although you can also connect
your own Azure storage to the service for larger models. A wide range of algorithms
are available, courtesy of both Microsoft and third parties.
https://www.datacamp.com/tutorial/azure-machine-learning-guide
https://www.linkedin.com/pulse/popular-ml-frameworks-train-your-models-what-choose-vishnuvaradhan-v-q98oc
What is MLOps?
MLOps stands for Machine Learning Operations: the set of practices for deploying,
monitoring, and maintaining machine learning models in production reliably and efficiently.
Components of MLOps
MLOps Cycle
An MLOps Engineer automates, deploys, monitors, and maintains machine learning models
in production, ensuring scalability, performance, security, and compliance while collaborating
with cross-functional teams to streamline the ML lifecycle and optimize infrastructure and
costs.
In contrast, a Machine Learning Engineer primarily develops, trains, and fine-tunes models,
focusing on the core algorithms, data preprocessing, and feature engineering. While MLOps
Engineers handle the operationalization and lifecycle management of models, ML Engineers
concentrate on model development and experimentation.
Learning Objectives
Understand supervised learning algorithms, including regression and classification problems,
and explore evaluation metrics for these models.
Training a machine learning model where every input has a corresponding target is called
supervised learning.
In supervised learning, the dataset is a collection of labeled examples, with feature and
target variables.
The goal of a supervised learning algorithm is to use the dataset to produce a model that takes
a feature vector x as input and outputs information that allows deducing the label for this
feature vector.
There are some very practical applications of supervised learning algorithms in real life,
including:
Regression Problems
For a used car price prediction model, labelled data refers to a dataset where each of the
records includes both the features (attributes) of the used cars and the corresponding target
variable, which is the price.
Features (Attributes): These are the input variables that the model uses to make predictions.
Typical features for used car price prediction range from Name to Seats.
The price at which the car is being sold, usually represented as a continuous numerical value,
is the target variable here. This is the output that the model is trying to predict for
unseen features.
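What this labelled data looks like in code can be sketched with a tiny DataFrame. The records and column names below are hypothetical, chosen to echo the used-car example:

```python
import pandas as pd

# Hypothetical labelled records: features plus the known selling price.
cars = pd.DataFrame({
    "year":      [2015, 2018, 2012],
    "km_driven": [60000, 30000, 110000],
    "seats":     [5, 5, 7],
    "price":     [450000, 820000, 300000],  # target variable
})

# Split into the feature matrix X and the target vector y
# that a regression model would be trained on.
X = cars.drop(columns=["price"])
y = cars["price"]
print(list(X.columns), len(y))
```

Every supervised regression library expects exactly this shape: one row per labelled example, features separated from the continuous target.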
Classification Problem
The prediction task is a classification when the target variable is discrete. There are four main
classification tasks in Machine learning: binary, multi-class, multi-label, and imbalanced
classifications.
1. Binary Classification
In a binary classification task, the goal is to classify the input data into two mutually
exclusive categories. The training data in such a situation is labeled in a binary
format: true and false; positive and negative; 0 and 1; spam and not spam, etc.
2. Multi-Class Classification
Multi-class classification, on the other hand, has more than two mutually exclusive
class labels, where the goal is to predict the class to which a given input example
belongs.
3. Multi-label classification
Multi-label classification is a classification problem where each instance can be
assigned to one or more classes. For example, in text classification, an article can be
about 'Technology,' 'Health,' and 'Travel' simultaneously.
4. Imbalanced Classification
For imbalanced classification, the number of examples is unevenly distributed across the classes, meaning that we can have more of one class than the others in the training data.
Let's consider the following 3-class classification scenario, where the training data contains 60% trucks, 25% planes, and 15% boats.
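For the 60/25/15 scenario, a common remedy is to weight each class inversely to its frequency, so that rare classes contribute more to the training loss. A small sketch with invented labels:

```python
from collections import Counter

# Illustrative training labels matching the 60/25/15 split above.
labels = ["truck"] * 60 + ["plane"] * 25 + ["boat"] * 15

counts = Counter(labels)
n, k = len(labels), len(counts)

# Inverse-frequency weighting: rare classes ("boat") get larger weights,
# so misclassifying them costs more during training.
class_weights = {c: n / (k * cnt) for c, cnt in counts.items()}
print(class_weights)
```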
Learning Objectives
Understand unsupervised learning algorithms, explore clustering and dimensionality
reduction techniques, and examine their applications.
Again, x is a feature vector, and the goal of an unsupervised learning algorithm is to create a
model that takes a feature vector x as input and either transforms it into another vector or into
a value that can be used to solve a practical problem.
For example:
In clustering, the model returns the id of the cluster for each feature vector in the dataset; customer segmentation is a typical application.
In dimensionality reduction, the output of the model is a feature vector that has fewer features than the input x.
In outlier detection, the output is a real number that indicates how x is different from a typical
example in the dataset.
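The dimensionality reduction case can be sketched with principal component analysis via the SVD; the dataset below is synthetic, built so that most of its variance lies in two directions:

```python
import numpy as np

rng = np.random.default_rng(0)

# 100 synthetic samples with 5 correlated features: they are generated
# from 2 latent factors, so most variance lies in 2 directions.
latent = rng.normal(size=(100, 2))
X = latent @ rng.normal(size=(2, 5)) + 0.05 * rng.normal(size=(100, 5))

# Centre the data, then project onto the top-2 principal components.
Xc = X - X.mean(axis=0)
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
X_reduced = Xc @ Vt[:2].T  # each feature vector now has 2 features

print(X.shape, "->", X_reduced.shape)
```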
Consider customer segmentation for targeting and promotions: the data does not tell us that customer A belongs to Class 1 (no labels are given).
Suppose the algorithm has created three clusters C1, C2, and C3 based on the feature variables: C1 customers buy low-priced products, whereas C2 customers buy expensive products, which yields higher revenue per transaction but a lower sales quantity.
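A segmentation of this kind can be sketched with k-means clustering; the customer features below (average item price, monthly purchase quantity) are invented stand-ins for the variables behind C1, C2, and C3:

```python
import numpy as np

rng = np.random.default_rng(42)

# Invented customers: [average item price, monthly purchase quantity].
customers = np.vstack([
    rng.normal([5, 40], [1, 5], size=(20, 2)),    # low price, high quantity
    rng.normal([80, 4], [10, 1], size=(20, 2)),   # high price, low quantity
    rng.normal([30, 15], [5, 3], size=(20, 2)),   # mid range
])

def kmeans(X, k, iters=50):
    # Start from k randomly chosen data points as centres.
    centers = X[rng.choice(len(X), k, replace=False)]
    for _ in range(iters):
        # Assign every point to its nearest centre.
        dists = np.linalg.norm(X[:, None] - centers[None], axis=2)
        labels = dists.argmin(axis=1)
        # Move each centre to the mean of its assigned points
        # (keeping the old centre if a cluster ends up empty).
        centers = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
            for j in range(k)
        ])
    return labels, centers

labels, centers = kmeans(customers, k=3)
print(np.bincount(labels, minlength=3))  # cluster sizes
```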
Reinforcement Learning
In reinforcement learning, an agent learns by trial and error, receiving rewards or penalties for its actions in an environment. A classic example is the Pac-Man game, where the agent learns which moves maximise its cumulative score.
Learning Objectives
Understand the concept of deep learning, explore neural networks and their architectures, and
examine various deep learning applications.
Deep learning, a subset of machine learning, embraces the mechanism of neural networks for complex problem solving.
Deep learning becomes particularly relevant for problems involving large and complex datasets, where traditional machine learning algorithms fall short, and for problems involving audio, images, and similar data, where the features need to be learned automatically.
Deep learning is an advanced machine learning technique that requires more computation and large amounts of training data to scale and generalise well on complex data representations. The concept takes inspiration from the human brain: it uses a vast number of layered algorithms, hence the term “deep”, to simulate the brain's intricate structure.
Self-driving vehicles
Autonomous vehicles use deep learning to learn how to operate and handle different
situations while driving, and it allows vehicles to detect traffic lights, recognize signs, and
avoid pedestrians.
Algorithms such as LaneNet are quite popular in research for extracting lane lines, and algorithms such as YOLO or SSD are very popular for object detection in this field.
Neural Networks
Neural networks are often referred to as “artificial brains”; they are a vital part of machine learning and AI technology. Inspired by the biological neural networks that make up human brains, these powerful computational models leverage interconnected layers of artificial neurons, or “nodes”.
Nodes (artificial neurons) receive input data, process it with a set of predefined rules, and pass the result to the next layer in the neural network.
A neural network is typically organised into three essential layers; each layer contains multiple nodes, which process the incoming data before passing it on. Each connection between nodes carries a numerical weight that adjusts as the network learns during the training stage, affecting the importance of the input value.
1. Input Layer: receives the raw data attributes as input. The nodes of the input layer are passive, meaning they do not change the data before it is sent on to the hidden layers.
2. Hidden Layer(s): these layers do most of the computation required. They take the data from the input layer, process it, and pass it to the next layer.
3. Output Layer: generates the final outputs.
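The three layers can be sketched as a tiny forward pass; the layer sizes, activation choice, and random weights below are arbitrary assumptions, and a real network would learn its weights during training:

```python
import numpy as np

rng = np.random.default_rng(1)

# A tiny network: 4 input nodes -> 5 hidden nodes -> 2 output nodes.
# The connection weights here are random; training would adjust them.
W1, b1 = rng.normal(size=(4, 5)), np.zeros(5)
W2, b2 = rng.normal(size=(5, 2)), np.zeros(2)

def relu(z):
    return np.maximum(0.0, z)

def forward(x):
    hidden = relu(x @ W1 + b1)  # hidden layer does the computation
    return hidden @ W2 + b2     # output layer produces the final outputs

x = np.array([0.5, -1.2, 3.0, 0.7])  # raw input attributes
print(forward(x))
```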
Learning Objectives
Understand the principles of large language models (LLMs), explore their architectures, and
examine their applications in natural language processing and AI.
Generative AI
Generative AI is a type of AI that can create new content, such as text, images, voice, and code, from scratch based on natural language inputs or “prompts”. It uses machine learning models powered by advanced algorithms and neural networks to learn from large datasets of existing content.
Once a model is trained, it can generate new content similar to the content it was trained on.
What are LLMs?
Large Language Models are foundational machine learning models that use deep learning algorithms to process and understand natural language.
These models are trained on a massive amount of text data to learn patterns and identify entity relationships in the language.
Language models are computational systems designed to understand, generate, and manipulate human language. At their core, language models learn the intricate patterns, semantics, and contextual relationships within language through extensive training over vast volumes of text data.
These models are equipped with the capacity to predict the next word in a sentence based on the preceding words, with coherence and relevance.
Source : https://www.coursera.org/learn/generative-ai-with-llms
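The next-word idea can be illustrated with a toy bigram model, a deliberately simple stand-in for how modern LLMs model context:

```python
from collections import Counter, defaultdict

# A toy corpus; real LLMs train on vastly larger text collections.
corpus = "the cat sat on the mat . the cat ate the fish .".split()

# Count how often each word follows each preceding word.
bigrams = defaultdict(Counter)
for prev, nxt in zip(corpus, corpus[1:]):
    bigrams[prev][nxt] += 1

def predict_next(word):
    # Return the most frequent continuation observed during training.
    return bigrams[word].most_common(1)[0][0]

print(predict_next("the"))  # → cat
```

Here "cat" wins because it follows "the" twice in the corpus, while "mat" and "fish" each follow it once.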
Before the advent of transformers, language models used more traditional approaches to predict the next word given some context.
f. Speech Recognition
g. Speaker Identification
h. Spelling Corrector
Learning Objectives
Understand the principles of prompt engineering, explore techniques for crafting effective prompts, and examine their impact on model performance and output.
Prompt engineering is a relatively new discipline for developing and optimizing prompts to
efficiently use language models (LMs) for a wide variety of applications and research topics.
Prompt engineering skills help to better understand the capabilities and limitations of large
language models (LLMs).
Prompt engineering is not just about designing and developing prompts. It encompasses a
wide range of skills and techniques that are useful for interacting and developing with LLMs.
It's an important skill to interface, build with, and understand capabilities of LLMs.
LLM Settings
When designing and testing prompts, you typically interact with the LLM via an API.
You can configure a few parameters to get different results from your prompts. Tweaking these settings is important to improve the reliability and desirability of responses, and it takes a bit of experimentation to figure out the proper settings for your use cases. Below are the common settings you will come across when using different LLM providers:
a. Temperature
b. Top P
c. Max Length
d. Stop Sequences
e. Frequency Penalty
f. Presence Penalty
a. Temperature
In short, the lower the temperature, the more deterministic the results, in the sense that the highest-probability next token is always picked. Increasing the temperature leads to more randomness, which encourages more diverse or creative outputs: you are essentially increasing the weights of the other possible tokens.
In terms of application, you might want to use a lower temperature value for tasks like fact-based QA to encourage more factual and concise responses. For poem generation or other creative tasks, it might be beneficial to increase the temperature value.
b. Top P
Top P, a sampling technique used with temperature and called nucleus sampling, lets you control how deterministic the model is. If you are looking for exact and factual answers, keep this value low; if you are looking for more diverse responses, increase it. Using Top P means that only the tokens comprising the top_p probability mass are considered for responses, so a low top_p value selects only the most confident responses, while a high top_p value enables the model to look at more possible words, including less likely ones, leading to more diverse outputs.
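A rough sketch of how these two settings shape sampling (provider implementations are internal and may differ; the logits below are invented): temperature rescales the scores before the softmax, and Top P then restricts sampling to the smallest set of tokens whose cumulative probability reaches top_p.

```python
import numpy as np

def softmax_with_temperature(logits, temperature):
    # Lower temperature sharpens the distribution toward the top token;
    # higher temperature flattens it, boosting less likely tokens.
    z = np.asarray(logits, dtype=float) / temperature
    z -= z.max()  # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum()

def top_p_filter(probs, top_p):
    # Keep the smallest set of tokens whose cumulative probability
    # reaches top_p, then renormalise; all other tokens get 0.
    order = np.argsort(probs)[::-1]
    cum = np.cumsum(probs[order])
    keep = order[:np.searchsorted(cum, top_p) + 1]
    filtered = np.zeros_like(probs)
    filtered[keep] = probs[keep]
    return filtered / filtered.sum()

logits = np.array([2.0, 1.0, 0.5, -1.0])  # hypothetical next-token scores

low_t = softmax_with_temperature(logits, 0.1)   # near-deterministic
high_t = softmax_with_temperature(logits, 2.0)  # flatter, more diverse

nucleus = top_p_filter(high_t, 0.9)  # sample only from the top-p mass
print(low_t.round(3))
print(high_t.round(3))
print(nucleus.round(3))
```

With temperature 0.1 the top token gets essentially all of the probability mass; with temperature 2.0 the mass spreads out, and the 0.9 Top P cut then drops the least likely token before sampling.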
Prompting an LLM
A prompt can contain information like the instruction or question you are passing to the
model and include other details such as context, inputs, or examples. You can use these
elements to instruct the model more effectively to improve the quality of results.
Text Summarization
Example:
Explain antibiotics
A:
Antibiotics are a type of medication used to treat bacterial infections. They work by either
killing the bacteria or preventing them from reproducing, allowing the body’s immune system
to fight off the infection. Antibiotics are usually taken orally in the form of pills, capsules, or
liquid solutions, or sometimes administered intravenously. They are not effective against viral
infections, and using them inappropriately can lead to antibiotic resistance.
Text Classification
Code Generation:
One application where LLMs are quite effective is code generation
Example:
Prompt:
"""
Table departments, columns = [DepartmentId, DepartmentName]
Table students, columns = [DepartmentId, StudentId, StudentName]
Create a MySQL query for all students in the Computer Science Department
"""
Output:
SELECT StudentId, StudentName
FROM students
WHERE DepartmentId IN (SELECT DepartmentId FROM departments WHERE DepartmentName = 'Computer Science');
Elements of a Prompt
A prompt can contain any of the following elements: an instruction, context, input data, and an output indicator.
Example
Classify the text into neutral, negative, or positive
Text: I think the food was okay.
Sentiment:
In the prompt example above, the instruction corresponds to the classification task, “Classify the text into neutral, negative, or positive”. The input data corresponds to the “I think the food was okay.” part, and the output indicator used is “Sentiment:”.
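These elements can be assembled programmatically; the helper below is a hypothetical illustration, not part of any provider's API:

```python
# Build a prompt from the common elements: instruction, context,
# input data, and output indicator. Optional parts are skipped.
def build_prompt(instruction, context=None, input_data=None,
                 output_indicator=None):
    parts = [instruction]
    for part in (context, input_data, output_indicator):
        if part:
            parts.append(part)
    return "\n".join(parts)

prompt = build_prompt(
    instruction="Classify the text into neutral, negative, or positive.",
    input_data="Text: I think the food was okay.",
    output_indicator="Sentiment:",
)
print(prompt)
```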
Standard Tips for Designing Prompts
You can start with simple prompts and keep adding more elements and context as you aim for better results. Iterating on your prompt along the way is vital for this reason.
a. Instruction
You can design effective prompts for various simple tasks by using commands to
instruct the model on what you want to achieve, such as "Write", "Classify",
"Summarize", "Translate", "Order", etc.
### Instruction ###
Translate the text below to Spanish:
Text: "hello!"
b. Specificity
Be very specific about the instruction and task you want the model to perform.
When designing prompts, you should also keep in mind the length of the prompt as
there are limitations regarding how long the prompt can be.
Example:
Extract the name of places in the following text.
Desired format:
Place: <comma_separated_list_of_places>
Input: "Although these developments are encouraging to researchers, much is still a
mystery. “We often have a black box between the brain and the effect we see in the
Attention Is All You Need
The Illustrated Transformer
Transformer (Google AI blog post)