
Emerging Trends in Technology

Introduction to Data Science, AI & Machine Learning

Kunal Kishore Kanak Jaiswal

Module 1 : Data Science & Analytics

1.1. Introduction to Data Science

Learning Objectives
Define and differentiate between machine learning and generative AI, understand their key
concepts, and explore their applications across various fields.

Data is everywhere nowadays. In general, data is a collection of facts, information, and
statistics, and it can take various forms such as numbers, text, sound, images, or any other
format.
Data can be generated by:
a) Humans
b) Machines
c) Human–machine combinations

Data comes from various sources, such as customer transactions, sales records, website logs,
and social media interactions.

Why is data important?


a) Data enables better decision-making.
b) Data identifies the causes of underperformance, aiding in problem-solving.
c) Data allows for performance evaluation.
d) Data facilitates process improvement.
e) Data provides insights into consumers and market trends.

Types of Data


Product Recommendations

Guess the type of Customers

Customer Segment Guesses

P1 (iPhone, Beer, Cornflakes):


Guess: Young Adult or Single Professional

This segment has an iPhone, indicating a preference for high-end technology. The inclusion
of beer suggests a lifestyle that might involve socialising or relaxation after work. Cornflakes


are often associated with a quick, convenient breakfast, which is common for busy
individuals. Overall, this customer likely enjoys tech, convenience, and leisure activities.

P2 (iPhone, Diaper, Beer):


Guess: Young Parent
Explanation: The presence of an iPhone again indicates an affinity for technology. Diapers
suggest that this customer has a baby or young child. The beer might indicate a need for
relaxation or socialising, perhaps after dealing with the challenges of parenthood. This
segment likely represents a young parent balancing family life with some personal relaxation.

P3 (iPhone, Diaper, Biscuits):


Guess: Parent or Family-Oriented Individual

Like P2, the diapers indicate that this customer likely has a child. The iPhone suggests they
are tech-savvy, and biscuits could be a snack for either the child or the parent. This segment
appears to be focused more on family-oriented products, possibly indicating a parent or
caregiver who is attentive to both the child's needs and their own.

These guesses are based on the product combinations and typical consumer behaviour
patterns associated with those products.

Let’s see its association with Data Science

The example provided involves analyzing customer purchase data to infer customer profiles
and understand purchasing behaviours.
1. Helps retailers understand product affinities, optimize product placements, design
cross-selling strategies, and improve inventory management.
2. Classifying customers into different segments based on their purchase behavior and
inferring their profiles (e.g., young adult, parent).
3. Enables targeted marketing, personalised recommendations, and improved customer
satisfaction by addressing the specific needs and preferences of each segment.
4. Predicting what other products a customer might be interested in based on current
purchasing patterns.
5. Using insights from data analysis to inform marketing campaigns, product placement,
and inventory decisions.

The example provided illustrates a simplified scenario of how data science is applied in
real-world contexts to understand and predict customer behaviour.

Why This Is Data Science

Let's first understand what Data Science is.


Data Science is a multidisciplinary field that focuses on finding actionable insights from large
sets of structured and unstructured data.

Data Science experts integrate computer science, predictive analytics, statistics, and Machine
Learning to mine very large data sets, with the goal of discovering relevant insights that can
help the organisation move forward and identifying specific future events.

Who is a Data Scientist?

A data scientist is a professional who uses scientific methods, algorithms, and systems to
extract insights and knowledge from structured and unstructured data. They apply a
combination of statistical analysis, machine learning, data mining, and programming skills to
interpret complex data, identify patterns, and make data-driven decisions.

Skills of a Data Scientist:

1. Programming: Proficiency in languages like Python, R, and SQL for data
manipulation and analysis.
2. Statistics and Mathematics: A strong foundation in statistical methods,
probability, and linear algebra.
3. Machine Learning: Knowledge of algorithms, model development, and
evaluation techniques.
4. Data Visualization: Ability to create visualisations using tools like Matplotlib,
Seaborn, Tableau, or Power BI to communicate insights.
5. Domain Knowledge: Understanding the specific industry or business context to
apply data science effectively.
6. Critical Thinking: Strong analytical skills to identify patterns, solve complex
problems, and make data-driven decisions.

Problems that data scientists solve (types of problems)

Use Case 1: Is it A or B? Will the applicant be able to repay the loan or not?

The problem of determining whether an applicant will be able to repay a loan is a binary
classification problem:
● Repay the loan (Yes)
● Default on the loan (No)


The objective is to predict whether a given applicant belongs to the "repay" class or the
"default" class based on various input features (such as income, credit score, existing debt,
etc.).
The model might output a probability score that represents the likelihood of the applicant
repaying the loan. For example, a score of 0.85 might indicate an 85% chance of repayment.

Implications of the Decision:

Approval: If the prediction is "Yes," the loan may be approved, potentially with terms that
reflect the applicant's risk profile (e.g., interest rate, loan amount).
Rejection: If the prediction is "No," the loan may be declined, or approved only with stricter
terms and additional checks.

This type of problem is fundamental in credit risk assessment, where data science models
help financial institutions make informed lending decisions.
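
To make this concrete, here is a minimal sketch of such a binary classifier using scikit-learn's
logistic regression; the column names (income, credit_score, existing_debt) and the tiny
in-memory dataset are hypothetical, purely for illustration.

```python
# Minimal sketch of a loan-repayment classifier (hypothetical data and column names).
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Hypothetical applicant records: features plus a 'repaid' label (1 = repaid, 0 = defaulted).
data = pd.DataFrame({
    "income":        [35000, 82000, 27000, 64000, 45000, 91000, 30000, 58000],
    "credit_score":  [620,   780,   540,   710,   660,   800,   580,   700],
    "existing_debt": [12000, 5000,  18000, 8000,  9000,  2000,  15000, 7000],
    "repaid":        [0,     1,     0,     1,     1,     1,     0,     1],
})

X = data[["income", "credit_score", "existing_debt"]]
y = data["repaid"]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42)

model = LogisticRegression()
model.fit(X_train, y_train)

# predict_proba returns the probability of each class; column 1 is P(repaid).
probabilities = model.predict_proba(X_test)[:, 1]
print(probabilities)  # e.g. a score of 0.85 means an estimated 85% chance of repayment
```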

Use Case 2: Is it weird? (Anomaly detection): Why am I getting so many spam emails in my
inbox?

Normally, your inbox receives a certain number of emails per day, with only a small
percentage being spam. This is the expected or "normal" behaviour.

The sudden increase in spam emails represents a deviation from the norm. In data science,
this could be detected using anomaly detection techniques, such as statistical methods (e.g.,
z-score), machine learning algorithms (e.g., Isolation Forest, Autoencoders), or rule-based
systems.

By applying anomaly detection techniques, the root cause can be identified and addressed,
improving the efficiency of spam filters and enhancing email security.
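
As one illustration of the simplest of these techniques, the sketch below flags unusually high
daily spam counts using a z-score rule; the daily counts are made-up example values.

```python
# Hedged sketch: flag anomalous daily spam counts with a simple z-score rule.
import numpy as np

# Hypothetical spam emails received per day over two weeks.
daily_spam = np.array([3, 2, 4, 3, 5, 2, 3, 4, 2, 3, 28, 31, 4, 3])

mean, std = daily_spam.mean(), daily_spam.std()
z_scores = (daily_spam - mean) / std

# Days whose count is more than 2 standard deviations above the mean are flagged.
anomalous_days = np.where(z_scores > 2)[0]
print(anomalous_days, daily_spam[anomalous_days])  # the sudden spike stands out
```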

Industry-specific Use Cases / Problems


Data Science Process


1.2. Data Collection and Preprocessing

Learning Objectives
Learn about various data collection methods, understand data quality issues and
preprocessing techniques, and explore data cleaning and transformation.

Raw Data

Raw Data refers to unprocessed information that is collected and stored in its original format.
It is the most fundamental form of data, captured directly from various sources, such as
sensors, devices, or databases.
Raw Data is typically characterized by its lack of structure, organization, or meaningful
interpretation.
It may include text files, log files, images, audio recordings, or numeric data.

Collecting Raw Data

Raw Data is acquired from different sources and stored as-is without any transformations or
modifications. It can be collected manually or automatically through various methods, such
as data extraction tools, IoT devices, or data streaming technologies.

Once the Raw Data is collected, it can be stored in databases, data warehouses, or data
lakes, where it awaits further processing and analysis.
By preserving data in its original format, Raw Data ensures data integrity and enables
retrospective analysis.

Related Terms

Data Lake: A data lake is a centralized repository that stores Raw Data in its native format,
facilitating data exploration, analysis, and processing.
Data Pipeline: A data pipeline refers to the set of processes and tools used to extract,
transform, and load (ETL) Raw Data into a destination system for further processing.

Data Preprocessing: Data preprocessing involves transforming Raw Data into a standardized,
clean format by applying techniques such as cleaning, filtering, and normalization.

Data Warehouses: Store cleaned and processed data in a centralized system. Data warehouses
use hierarchical dimensions and tables to store data. They can be used to source analytic or
operational reporting, and for business intelligence (BI) use cases.

Databricks: Databricks is an industry-leading, cloud-based data engineering tool used for
processing and transforming massive quantities of data and exploring the data through
machine learning models. Recently added to Azure, it is the latest big data tool for the
Microsoft cloud.

As a data scientist, you'll require data to address the problems you're working on.
Occasionally, your organization might already have the data you need. However, if the
necessary data isn't being collected, you'll need to collaborate with a data engineering team to
develop a system that begins gathering the required data.

Storage systems

The choice of storage system for raw data depends on various factors, including the volume
of data, access patterns, scalability requirements, and budget.
1. Cloud storage solutions like Amazon S3 or Google Cloud Storage are popular for
their scalability and integration with analytics tools.
2. Distributed file systems like HDFS are preferred for big data applications.
3. Databases and data lakes offer structured environments for specific use cases.


Who is a Data Engineer?

A data engineer is a professional who designs, builds, and maintains the infrastructure and
pipelines that collect, store, and process data. Their work ensures that data is collected,
stored, and made accessible in a reliable and efficient manner for use by data scientists,
analysts, and other stakeholders.

Key responsibilities include:


1. Designing Data Systems: Creating and implementing data architectures that
support data storage, processing, and retrieval.
2. Database Design: Designing relational and NoSQL databases to handle structured
and unstructured data.
3. ETL Processes: Building and managing ETL (Extract, Transform, Load) pipelines
to move data from various sources into data warehouses or lakes.
4. Data Integration: Combining data from different sources to create unified datasets.
5. Building Data Warehouses: Designing and maintaining data warehouses that
consolidate data from multiple sources for analysis and reporting.
6. Handling Large-Scale Data: Using technologies like Hadoop, Spark, and Kafka to
process and analyze big data.

Data Cleaning

Data Cleaning is identifying, diagnosing, and correcting the inconsistencies present in the
data, in line with the problem at hand and the business processes.

Dirty data is data that is incomplete, incorrect, incorrectly formatted, erroneous, or irrelevant
to the problem we are trying to solve.


How do we know the data is dirty, or more precisely how do we recognize the dirt, and what
are the main types of data dirt or defects?

Types of defects in data

1) Duplicate Records - any record that shows up more than once.
2) Missing Data - any data that is missing important fields (zero, null, or extra space).
3) Data Type/Format Errors - data with inconsistent or incorrect data types/formats
(date vs. string, numbers stored as strings, separator issues).
4) Incorrect Data - any field that is prone to human intervention/error in the process,
such as open-ended fields with no check against a predefined pattern/type.
5) Not-Updated Fields - fields that have not been updated for some reason
(cron-related issues or unnotified changes in the table structure or field names).
6) Outliers/Inconsistency - data with extraordinary values or outliers that do not follow
any pattern.
7) Typo Errors - misspellings introduced while collecting/capturing the record.

Case 1

In the e-commerce domain, every box is weighed and its dimensions measured by the pickup
executive at the time of completing the pickup process.

Since this is a manual process in most cases, there is a high chance of incorrect weights and
dimensions being captured in the system. This error can propagate through the whole system
and impact other operations such as load planning, vehicle optimization, and even invoicing.

Can you guess what kind of error you are going to expect in the weight and dimensions
datasets in consignment data? Let me help you with some:

1) Weight can be entered wrongly — 100 kg instead of 10 kg (incorrect data)
2) A length of 10.8 inches can be recorded as 108.0 inches (wrong placement of the decimal)
3) 100 cm can be recorded as 100 inches or in any other unit (unit mismatch)


Take a pause and think of some more possibilities. If you are wondering which syntax or
code can help detect such anomalies, don't worry for now.

Some of these defects are fixable and some are not. Non-fixable defects can sometimes be
corrected or analyzed by data engineering teams, or by changing the flow or mechanism of
collecting the data to improve data quality.

The standard set of errors has a pre-defined set of techniques that you can learn, building
your understanding so you keep your eyes open for any dirt present.
a) If standard errors repeat or occur frequently, they can also be fixed at the source by
data engineering teams or by semi-automation using Python scripts.
b) Data combined from multiple sources/web sources/APIs can be cleaned by codifying
the identified anomalies or string patterns, types, and formats.

Case 2 : Data Cleaning with SQL

Let's understand the technique involved by taking the scenario of duplicate records. A
duplicate record is when you have two (or more) rows with the same information.

In a call center framework, automatic calls are made and assigned to agents. If for some
reason a call gets triggered twice and is recorded twice in the database tables, there is a high
chance that duplicate records are created.
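
In SQL this is typically handled by grouping on the identifying columns (or using a window
function such as ROW_NUMBER) and keeping one row per group. The same idea, sketched in
Python/pandas with hypothetical call-record columns:

```python
# Hedged sketch: detecting and removing duplicate call records with pandas
# (column names agent_id, customer_phone, call_time are hypothetical).
import pandas as pd

calls = pd.DataFrame({
    "agent_id":       [101, 101, 102, 103, 103],
    "customer_phone": ["555-0101", "555-0101", "555-0102", "555-0103", "555-0103"],
    "call_time":      ["2024-05-01 10:00", "2024-05-01 10:00",
                       "2024-05-01 10:05", "2024-05-01 10:07", "2024-05-01 10:07"],
})

# Rows that are exact duplicates on the identifying columns.
dupes = calls[calls.duplicated(subset=["agent_id", "customer_phone", "call_time"], keep=False)]
print(dupes)

# Keep only the first occurrence of each duplicated call.
clean = calls.drop_duplicates(subset=["agent_id", "customer_phone", "call_time"], keep="first")
print(clean)
```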

Data Preprocessing

Data preprocessing is a crucial step in the data science pipeline, involving the transformation
of raw data into a format suitable for analysis and modelling.

The primary goal is to enhance data quality and ensure that it is clean, consistent, and ready
for subsequent analysis or machine learning tasks.
1. Handling Missing Values: Identifying and addressing missing data through
imputation, deletion, or filling with default values.
2. Removing Duplicates: Detecting and eliminating duplicate records to avoid
redundancy.
3. Correcting Errors: Fixing inconsistencies and errors in the data, such as incorrect
values or typos.


4. Normalization/Scaling: Adjusting numerical values to a common scale or range,
which is essential for many machine learning algorithms.
5. Encoding Categorical Variables: Converting categorical data into numerical format
using techniques like one-hot encoding or label encoding.
6. Feature Engineering: Creating new features or modifying existing ones to improve
the model's performance.
7. Merging Datasets: Combining data from different sources or tables to create a
unified dataset.
8. Joining Data: Using keys to integrate related datasets, ensuring consistency and
completeness.
9. Dimensionality Reduction: Reducing the number of features or variables through
methods like Principal Component Analysis (PCA) to simplify the dataset and
improve model efficiency.
10. Aggregation: Summarizing data by grouping and calculating statistics to reduce data
volume while retaining essential information.
11. Split into Training and Test Sets: Dividing the dataset into training and test subsets
to evaluate the performance of machine learning models and prevent overfitting.
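
A minimal sketch of a few of these steps using pandas and scikit-learn; the dataset and
column names are hypothetical and only meant to illustrate the flow.

```python
# Hedged sketch of common preprocessing steps (hypothetical columns: age, city, income, churned).
import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split

df = pd.DataFrame({
    "age":     [25, 32, None, 41, 29, 29],
    "city":    ["Delhi", "Mumbai", "Delhi", "Pune", "Mumbai", "Mumbai"],
    "income":  [30000, 52000, 45000, 78000, None, 61000],
    "churned": [0, 1, 0, 1, 0, 1],
})

# 1. Handle missing values (impute with the column median).
df["age"] = df["age"].fillna(df["age"].median())
df["income"] = df["income"].fillna(df["income"].median())

# 2. Remove duplicate rows.
df = df.drop_duplicates()

# 3. Encode the categorical variable with one-hot encoding.
df = pd.get_dummies(df, columns=["city"])

# 4. Scale the numerical features to zero mean and unit variance.
scaler = StandardScaler()
df[["age", "income"]] = scaler.fit_transform(df[["age", "income"]])

# 5. Split into training and test sets.
X = df.drop(columns="churned")
y = df["churned"]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)
print(X_train.shape, X_test.shape)
```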


1.3. Exploratory Data Analysis

Learning Objectives
Perform exploratory data analysis (EDA), interpret data patterns, and gain insights through
visualization and statistical methods.

You’ve just decided to take a spontaneous trip. Excitement kicks in, and you sit down to book
a flight. But instead of rushing into the first deal that pops up, you become a savvy traveller.
You open multiple tabs—comparing airlines, ticket prices, and perks. One airline offers free
WiFi, another has complimentary meals, and yet another has glowing reviews from happy
travellers.

You start making mental notes, weighing your options. Should you go with the cheaper flight
or the one with better service?

This decision-making journey is exactly what Exploratory Data Analysis (EDA) is all
about—taking raw information, exploring it from different angles, and finding the best
insights before making a choice.

Exploratory Data Analysis (EDA) is an approach to analyzing data sets to summarize their
main characteristics and uncover patterns, relationships, and anomalies. It is an essential step
in the data analysis process that helps to understand the data before applying more complex
statistical or machine learning techniques.

Case Study: Exploring Sales Data for a Retail Store

You're the data analyst for a retail store, and you want to understand the sales patterns from
the past month to make informed decisions about product restocking.

Here’s a small sample of sales data for the last 10 days:


How do we analyse?
1) You begin by understanding basic stats about the data:
a) Total Revenue: By summing the revenue column, you find the store earned
$18,100 in 10 days.
b) Top-Selling Product: The product "Shoes" has sold 310 units vs.
"Jackets" at 165 units.
2) Visualisations: You can create a simple bar chart showing the units sold per
product:
a) Chart: Shoes vs. Jackets sales comparison. Shoes: 310 units and Jackets:
165 units. This tells you that shoes are selling almost double compared to
jackets.
3) Identifying Patterns - Regional Insights:
a) The South and East regions show strong sales for both products.
b) The West and North regions lag slightly, especially in jacket sales.

Through this simple EDA, you've identified which products are selling better, which regions
need more focus, and how revenue trends are moving. This will help you decide stock levels
and marketing strategies for each region.
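
A hedged sketch of how this analysis might look in pandas; since the sample table is not
reproduced here, the records below are hypothetical values chosen to be consistent with the
summary figures above.

```python
# Hedged EDA sketch on hypothetical daily sales records (columns: product, region, units, revenue).
import pandas as pd
import matplotlib.pyplot as plt

sales = pd.DataFrame({
    "product": ["Shoes", "Jackets", "Shoes", "Jackets", "Shoes", "Jackets"],
    "region":  ["South", "East",    "East",  "West",    "North", "North"],
    "units":   [120,     60,        110,     55,        80,      50],
    "revenue": [3200,    3600,      2900,    3300,      2100,    3000],
})

# Basic statistics.
print("Total revenue:", sales["revenue"].sum())
print("Units sold per product:\n", sales.groupby("product")["units"].sum())

# Regional breakdown.
print("Units by region and product:\n",
      sales.pivot_table(index="region", columns="product", values="units", aggfunc="sum"))

# Simple bar chart of units sold per product.
sales.groupby("product")["units"].sum().plot(kind="bar", title="Units sold per product")
plt.show()
```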

Objective of EDA

1. Communicating information about the data set: how summary tables and graphs
can be used for communicating information about the data.
a. Tables can be used to present both detailed and summary level information
about a data set.
b. Graphs visually communicate information about variables in data sets and the
relationship between them

2. Summarizing the data and relationships: Statistical approaches to summarizing
the data and relationships within the data, as well as making statements about the
data with confidence.
a. Summarizing the data: Statistics not only provide us with methods for
summarizing sample data sets, they also allow us to make confident
statements about the dataset (entire populations).
b. Characterizing the data: Prior to building a predictive model or looking for
hidden trends in the data, it is important to characterize the variables and the
relationships between them and statistics gives us many tools to accomplish
this.
c. Making statements about "hidden" facts - Once a group of observations
within the data has been defined as interesting through the use of data mining
techniques, statistics give us the ability to make confident statements about
these groups.


3. Answering analytical questions
a. A series of methods for grouping and organizing data to answer analytical
questions (including clustering, association rules, decision trees, etc.).
b. Building predictive models: the process and methods used in building models,
including simple regression, k-nearest neighbours, classification and
regression trees, and neural networks.

Models & Algorithms

In data science and machine learning, models and algorithms are fundamental components
for solving problems and making predictions based on data. Here’s a brief explanation of
each concept and their interplay.
Models : A model is a mathematical or computational representation of a real-world process
or system. In machine learning, it refers to a trained algorithm that can make predictions or
decisions based on input data.
Different models can be used to approach a problem from various angles. This is often done
to find the best-performing model for a specific task.

Types of Models:
a. Regression Models: Predict continuous outcomes (e.g., Linear Regression).
b. Classification Models: Predict categorical outcomes (e.g., Logistic Regression,
Decision Trees).
c. Clustering Models: Group similar data points together (e.g., K-Means Clustering).
d. Ensemble Models: Combine multiple models to improve performance (e.g., Random
Forest, Gradient Boosting Machines).

Algorithm: An algorithm is a step-by-step procedure or formula used to train a model. It
defines how the model learns from data and updates its parameters to make accurate
predictions.

Key Algorithms:
1. Linear Regression: Used for predicting a continuous outcome based on linear
relationships between features.
2. Decision Trees: Used for classification and regression by splitting data based on
feature values
3. Support Vector Machines (SVM): Used for classification by finding the optimal
hyperplane that separates different classes.
4. Neural Networks: Used for complex tasks like image recognition and natural
language processing by mimicking the structure of the human brain.


Performance & Optimizations

Once multiple models are created, they need to be evaluated and optimized to ensure the best
performance. This involves:
a. Performance Metrics: Assessing models using metrics like accuracy, precision,
recall, F1-score, or mean squared error (MSE) to determine how well they perform.
b. Cross-Validation: Splitting data into training and testing subsets multiple times to
ensure that the model generalizes well to unseen data.
c. Hyperparameter Tuning: Adjusting the settings of the algorithm (e.g., learning rate,
number of trees) to improve model performance.
d. Feature Selection: Choosing the most relevant features to include in the model to
enhance accuracy and reduce overfitting.
Optimization Techniques:
Grid Search: Systematically testing a range of hyperparameter values to find the best
combination.
Random Search: Testing a random subset of hyperparameter values to find optimal settings
more efficiently.
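
A minimal sketch of cross-validation and grid search with scikit-learn; the model and
parameter grid are illustrative choices, not recommendations.

```python
# Hedged sketch: evaluating and tuning a model with cross-validation and grid search.
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, cross_val_score

X, y = load_iris(return_X_y=True)

model = RandomForestClassifier(random_state=0)

# Cross-validation: estimate how well the model generalizes across 5 different splits.
scores = cross_val_score(model, X, y, cv=5, scoring="accuracy")
print("CV accuracy per fold:", scores)

# Grid search: systematically try hyperparameter combinations and keep the best one.
param_grid = {"n_estimators": [50, 100], "max_depth": [2, 4, None]}
search = GridSearchCV(model, param_grid, cv=5, scoring="accuracy")
search.fit(X, y)
print("Best parameters:", search.best_params_)
print("Best CV accuracy:", search.best_score_)
```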


1.4. Data Visualization & Storytelling

Learning Objectives
Learn effective data visualization techniques, understand the importance of storytelling with
data, and explore various data visualization tools, e.g., Tableau, Power BI, or Alteryx.

Communicating Insights

Communication and visualization are essential for translating complex data analysis into
clear, actionable insights. Effective communication ensures that findings are presented in a
way that is understandable and relevant to stakeholders, while visualization helps to visually
represent data, making it easier to identify patterns and make informed decisions. Both steps
are integral to making data science outcomes accessible and useful for decision-making.

a. Reports: Structured documents that summarize analysis, methodologies, results, and
recommendations. These can be formal reports or executive summaries.
b. Presentations: Visual and verbal summaries of the analysis, often delivered to
stakeholders, decision-makers, or team members. Tools like PowerPoint or Google
Slides are commonly used.
c. Storytelling: Crafting a narrative around the data to make the findings more relatable
and understandable. This involves framing data insights in a way that resonates with
the audience's needs and interests.
d. Meetings and Discussions: Engaging in conversations with stakeholders to explain
findings, answer questions, and discuss implications and next steps.

Components of Storytelling & Data Visualizations


1. Context
2. Selecting effective visuals


3. Avoid clutter
4. Seeking audience attention
5. Designer approach
6. Organizing storyline


1.5. Introduction to BigData

Learning Objectives
Understand the characteristics of big data, explore relevant technologies, and delve into the
challenges and opportunities it brings.

Facebook generates approximately 500 terabytes of data per day, airlines generate about 10
terabytes of sensor data every 30 minutes, and the NSE Stock Exchange generates
approximately 1 terabyte of data per day; these are a few examples of Big Data.

Introduction to BigData
Big Data is a large, diverse set of information that can grow at ever-increasing speed and
cannot be loaded on a single machine due to its size.
Increasing a machine's memory can be one alternative, but it is not the ideal option given the
data processing and computational requirements.

5V's in Big Data

Big data is defined by the 5 V's, which refer to Volume, Variety, Value, Velocity, and Veracity.
Let's discuss each term individually.

Data comes from various sources such as social media sites, e-commerce platforms, news
sites, and financial transactions. It can be audio, video, text, emails, transactions, and much
more.


Although storing raw data is not difficult, converting unstructured data into a structured
format and making them accessible for business uses is practically complex.
Smart sensors, smart metering, and RFID tags make it necessary to deal with huge data influx
in almost real-time.

Sources of data in Big Data

Big data can be of various formats of data either in structured as well as unstructured form,
and comes from various different sources. The main sources of big data can be of the
following types:
a. Social Media
b. Cloud Platforms
c. IoT, Smart Sensors, RFID
d. Web Pages
e. Financial Transactions
f. Healthcare and Medical Data
g. Satellite

Big Data can be categorized as structured, unstructured, and semi-structured data. It is also
helpful in areas as diverse as stock market analysis, medicine & healthcare, agriculture,
gambling, environmental protection, etc.
The scope of big data is very vast, as it will not be limited to just handling voluminous data;
instead, it will be used for optimizing the data stored in a structured format to enable easy
analysis.

BigData Infrastructure

Hadoop is an open source framework based on Java that manages the storage and processing
of large amounts of data for applications. Hadoop uses distributed storage and parallel
processing to handle big data and analytics jobs, breaking workloads down into smaller
workloads that can be run at the same time.

Four modules comprise the primary Hadoop framework and work collectively to form the
Hadoop ecosystem:
● Hadoop Distributed File System (HDFS)
● Yet Another Resource Negotiator (YARN)
● MapReduce
● Hadoop Common

1. HDFS (Hadoop Distributed File System) (for storing): HDFS is a distributed file
system in which individual Hadoop nodes operate on data that resides in their local
storage. This removes network latency, providing high-throughput access to
application data.
2. Yet Another Resource Negotiator: YARN is a resource-management platform
responsible for managing compute resources in clusters and using them to schedule
users’ applications. It performs scheduling and resource allocation across the Hadoop
system.
3. Map Reduce (for processing): In the MapReduce model, subsets of larger datasets
and instructions for processing the subsets are dispatched to multiple different nodes,
where each subset is processed by a node in parallel with other processing jobs. After
processing the results, individual subsets are combined into a smaller, more
manageable dataset
4. Hadoop Common: Hadoop Common includes the libraries and utilities used and
shared by other Hadoop modules.

Hadoop tools

Hadoop has a large ecosystem of open source tools that can augment and extend the
capabilities of the core module.
Beyond HDFS, YARN, and MapReduce, the entire Hadoop open source ecosystem continues
to grow and includes many tools and applications to help collect, store, process, analyze, and
manage big data.
Some of the main software tools used with Hadoop include:

a. Apache Hive: A data warehouse that allows programmers to work with data in HDFS
using a query language called HiveQL, which is similar to SQL

b. Apache HBase: An open source non-relational distributed database often paired with
Hadoop
c. Apache Pig: A tool used as an abstraction layer over MapReduce to analyze large sets
of data and enables functions like filter, sort, load, and join
d. Apache Impala: Open source, massively parallel processing SQL query engine often
used with Hadoop


e. Apache Sqoop: A command-line interface application for efficiently transferring bulk
data between relational databases and Hadoop
f. Apache ZooKeeper: An open source server that enables reliable distributed
coordination in Hadoop; a service for, "maintaining configuration information,
naming, providing distributed synchronization, and providing group services"
g. Apache Oozie: A workflow scheduler for Hadoop jobs

BigData Storage Architecture


1.6. Descriptive Statistics


Learning Objectives
Learn about basic statistical concepts, understand the different types of descriptive statistics,
and explore how descriptive statistics are used in data analysis.

Statistics is a branch of mathematics that deals with collecting, analyzing, interpreting, and
presenting data. It helps in making informed decisions based on data. Statistics is broadly
divided into descriptive and inferential statistics.

1) Descriptive Statistics: Descriptive statistics summarize various attributes of a
variable, such as the average value or the range of values, the mean, mode, and
median, and trends and distributions.
a) A survey of students' grades shows an average score (mean) of 75%.
Descriptive statistics help summarize this information without drawing
conclusions about the entire population.
2) Inferential Statistics: Involves making predictions or inferences about a population
based on a sample. It includes methods like hypothesis testing, confidence intervals,
and regression analysis.
a) After surveying 100 students, you infer that the average score of all students at
the university is likely 75%

Methods for making statements about data with confidence: Inferential statistics
cover ways of making confident statements about populations using sample data.
a) Confidence intervals: allow us to make statements concerning the likely
range that a population parameter (such as the mean) lies within.
b) Hypothesis tests: a hypothesis test determines whether the data collected
supports a specific claim.
c) Chi-square: a procedure to understand whether a relationship exists between
pairs of categorical variables.
d) Analysis of variance (ANOVA): determines whether a relationship exists
between three or more group means.
3) Comparative statistics allows us to understand relationships between variables.

If you want to know the average height of students at your university, you could measure the
height of a sample of students (e.g., 100 students) and use inferential statistics to estimate the
average height of all students (the population).

Frequency Distributions

The concept of frequency distribution stands as a fundamental tool for data analysis, offering
a window into the underlying patterns and trends hidden within raw data.


Frequency distribution is a methodical arrangement of data that reveals how often each value
in a dataset occurs

Leveraging Python to construct and analyze frequency distributions adds a layer of efficiency
and flexibility to the statistical analysis process. With libraries such as Pandas for data
manipulation, Matplotlib and Seaborn for data visualization, Python transforms the way data
scientists and statisticians approach frequency distribution, making it easier to manage data,
perform calculations, and generate insightful visualizations.

Histograms
A histogram is a graphical representation of the distribution of numerical data. It consists of
bars where the height of each bar represents the frequency of data points within a specific
range or bin.

Analyzing Frequency Distribution with Pandas

Pandas is exceptionally well-suited for creating frequency distributions, especially with its
DataFrame structure that makes data manipulation intuitive.

Loading Data
First, load your dataset into a Pandas DataFrame. For example, let's use the Iris dataset
available through Seaborn.

Creating Frequency Distributions


For discrete data, use the `value_counts()` method to generate a frequency distribution. For
continuous data, you can categorize the data into bins using the `cut()` function and then apply
`value_counts()`.
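
A short sketch of both cases using the Iris dataset loaded via Seaborn (assuming pandas,
seaborn, and matplotlib are installed):

```python
# Hedged sketch: frequency distributions with pandas on the Iris dataset.
import pandas as pd
import seaborn as sns

iris = sns.load_dataset("iris")  # loads the Iris dataset as a DataFrame

# Discrete data: frequency of each species.
print(iris["species"].value_counts())

# Continuous data: bin petal lengths into 5 intervals, then count values per bin.
bins = pd.cut(iris["petal_length"], bins=5)
print(bins.value_counts().sort_index())

# Histogram of petal lengths (graphical view of the same distribution).
iris["petal_length"].hist(bins=5)
```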

Sample & Population

A population includes all of the elements from a set of data while a sample consists of one or
more observations from the population.
A measurable characteristic of a population, such as a mean or standard deviation, is called a
parameter, whereas a measurable characteristic of a sample is called a statistic.

Why do we need samples?

Sampling is done because one usually cannot gather data from the entire population. Data
may be needed urgently, and including everyone in the population in your data collection may
take too long.

More than 500 million people voted in India in the 2024 general elections. If any agency had
to conduct an exit poll survey, by no means they can do it by reaching out to all the voters.


Module 2 : AI and Machine Learning


2.1. Introduction to AI & Machine Learning

Learning Objectives
Define machine learning and generative AI, grasp their key concepts and distinctions, and
explore their applications across different fields

Machine Learning (ML) is a subset of artificial intelligence (AI) that involves the
development of algorithms and statistical models that enable computers to perform tasks
without explicit instructions. Instead of being programmed with specific rules to follow, these
models learn from data and identify patterns, make decisions, or predictions based on new
data.
It enables systems to improve over time and make intelligent decisions based on data

AI, Machine Learning and Deep Learning

Deep learning, a subset of machine learning, embraces the mechanism of neural networks for
more complex problem-solving. It takes inspiration from the human brain: it uses vast
numbers of layered algorithms, hence the term "deep", to simulate the intricate structure of
the human brain.

Generative AI uses machine learning models trained on large datasets to generate new
content like text, images, videos, music, and code.

Everyday Examples of Machine Learning in Our Lives

1. Apps like Google Maps use machine learning to provide real-time traffic updates,
suggest optimal routes, and estimate travel times based on historical and current
traffic data.


2. Platforms like Netflix and Spotify use machine learning to analyze your viewing or
listening history and recommend movies, TV shows, or music tailored to your
preferences.
3. E-Commerce: Websites like Amazon and Flipkart use algorithms to suggest products
based on your browsing history and previous purchases.
4. Virtual Assistants: Siri, Google Assistant, and Alexa utilize natural language
processing (NLP) and machine learning to understand and respond to your voice
commands, answer questions, and perform tasks like setting reminders or controlling
smart home devices.
5. Spam Detection: Email services use machine learning to identify and filter out spam
or phishing emails, based on patterns and characteristics learned from previous
emails.
6. Social Media: Platforms like Facebook, Instagram, and Twitter use machine
learning to analyze your interactions and display posts, ads, and stories that match
your interests
7. Smartwatches and fitness trackers use machine learning to analyze your activity data,
monitor health metrics, and provide insights into your fitness and well-being.
8. Chatbots: Customer service bots on websites and apps use machine learning to
understand and respond to customer inquiries, providing instant support and
information
9. Photo Organization: Apps like Google Photos use machine learning to categorize and
search for photos based on content, such as people, places, or objects.


10. Self-Driving Cars: Companies like Tesla use machine learning to enable autonomous
vehicles to navigate roads, detect obstacles, and make driving decisions based on
sensor data.
11. Duolingo : Duolingo is a popular language learning app that leverages machine
learning to enhance its language education services
12. Security Systems: Smartphones and security cameras use deep learning for facial
recognition to unlock devices or identify individuals in surveillance footage.
13. Autocorrect and Predictive Text: Smartphones and email applications use NLP to
suggest words or correct spelling and grammar as you type, enhancing typing
efficiency

What the heck is a Machine Learning Model?

A machine learning model is like a computer program that learns from data. Instead of being
explicitly programmed to do a task, it learns by finding patterns in the data.

Machine Learning is like teaching a computer to learn from examples and make decisions
or predictions based on that learning.

For example, if you have pictures of cats and dogs, a machine learning model can learn to
distinguish between them by analyzing features like fur texture, ear shape, etc.


Once the model has learned from many examples, you can give it new, unseen data (like a
new picture of a cat it has never seen). The model uses what it learned to make predictions or
decisions, like predicting whether a given image is of a cat or a dog.
The generalised flow can be understood as depicted in the following diagram:

How Machine Learning Models Work

Say a Multinational Bank wants to improve its loan approval process by using machine
learning.

………………..
Types of Machine Learning
Machine Learning can be supervised, semi-supervised, unsupervised, or reinforcement learning.

Building Blocks of Machine Learning Algorithms

a. A Loss Function
b. An Optimization Criterion based on the loss function (a cost function, for example)


c. An Optimization routine leveraging training data to find a solution to the optimization criterion

Machine Learning Frameworks

Machine learning (ML) is the process of creating systems that can learn from data and make
predictions or decisions. ML is a branch of artificial intelligence (AI) that has many
applications in various domains, such as computer vision, natural language processing,
recommender systems, and more.

To develop ML models, data scientists and engineers need tools that can simplify the
complex algorithms and computations involved in ML. These tools are called ML
frameworks, and they provide high-level interfaces, libraries, and pre-trained models that
can help with data processing, model building, training, evaluation, and deployment.

Most of these are Python machine learning frameworks, primarily because Python is the most
popular machine learning programming language

1. Azure ML Studio : Azure ML Studio allows Microsoft Azure users to create and
train models, then turn them into APIs that can be consumed by other services. Users
get up to 10GB of storage per account for model data, although you can also connect
your own Azure storage to the service for larger models. A wide range of algorithms
are available, courtesy of both Microsoft and third parties.

https://www.datacamp.com/tutorial/azure-machine-learning-guide


2. Scikit-learn: A Python library that supports both unsupervised and supervised
learning. If you're new to machine learning, Scikit-learn is a great choice. It's
effective for predictive data analysis and feature engineering.
3. PyTorch: A customizable option that uses building classes. If you’re a Python
developer searching for a framework with a shorter learning curve, Pytorch, when
stacked against other frameworks, might be the one for you. Additionally, it’s an
open-source, deep-learning framework.
4. TensorFlow: A popular end-to-end machine learning platform that offers feature
engineering and model serving. Finally, if you need a framework with robust
scalability that works across a wide range of data sets, TensorFlow is a good choice.

https://www.linkedin.com/pulse/popular-ml-frameworks-train-your-models-what-choose-vishnuvaradhan-v-q98oc

What is the difference between library and framework?

A library performs specific, well-defined operations, whereas a framework is a skeleton
where the application defines the "meat" of the operation by filling out the skeleton. The
skeleton still has code to link up the parts, but the most important work is done by the
application.

What is MLOps?
MLOps stands for Machine Learning Operations.

MLOps is a core function of Machine Learning engineering, focused on streamlining the
process of taking machine learning models to production, and then maintaining and
monitoring them.
MLOps is a collaborative function, often comprising data scientists, DevOps engineers, and
IT.

Components of MLOps


MLOps Cycle

Role of MLOps and ML Engineer

An MLOps Engineer automates, deploys, monitors, and maintains machine learning models
in production, ensuring scalability, performance, security, and compliance while collaborating
with cross-functional teams to streamline the ML lifecycle and optimize infrastructure and
costs.
In contrast, a Machine Learning Engineer primarily develops, trains, and fine-tunes models,
focusing on the core algorithms, data preprocessing, and feature engineering. While MLOps
Engineers handle the operationalization and lifecycle management of models, ML Engineers
concentrate on model development and experimentation.


2.2. Supervised Learning

Learning Objectives
Understand supervised learning algorithms, including regression and classification problems,
and explore evaluation metrics for these models.

Training a machine learning model where every input has a corresponding target is called
supervised learning.
In supervised learning, the dataset is a collection of labeled examples, made up of feature and
target variables.

A feature vector is a vector in which each dimension j = 1, ..., D contains a value that
describes the example. That value is called a feature and is denoted as x(j).

The goal of a supervised learning algorithm is to use the dataset to produce a model that takes
a feature vector x as input and outputs information that allows deducing the label for this
feature vector.

There are some very practical applications of supervised learning algorithms in real life,
includes

Input | Output | Application
Emails | Spam (0/1) | Spam Detection
Audio | Text Transcripts | Speech Recognition
English | French | Machine Translation
Image, Radar Info | Position of the objects | Self-Driving Cars
Specifications (Km, Fuel Type, …) | Price | Price Prediction

Regression Problems

For a used car price prediction model, labelled data refers to a dataset where each of the
records includes both the features (attributes) of the used cars and the corresponding target
variable, which is the price.
Features (Attributes): These are the input variables that the model uses to make predictions.
Typical features for used car price prediction range from the car's name to the number of seats.

The price at which the car is being sold, usually represented as a continuous numerical value,
is called the target variable here. This is the output that the model is trying to predict for
unseen features.

Factors affecting the price of the car

Predicting the price of a used car involves the following components:
a. Independent variables, i.e. predictor or feature variables
b. Dependent variable, i.e. the target variable or outcome

Predicting the values of a continuous dependent variable using independent explanatory
variables is called Regression.
Linear regression is the form of regression that models a linear relationship between the
dependent and independent variables.
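
A minimal sketch of such a regression with scikit-learn; the feature columns (km_driven,
age_years, engine_cc) and the tiny dataset are hypothetical stand-ins for a real used-car
dataset.

```python
# Hedged sketch: predicting used-car prices with linear regression (hypothetical data).
import pandas as pd
from sklearn.linear_model import LinearRegression

cars = pd.DataFrame({
    "km_driven": [45000, 120000, 30000, 80000, 60000, 15000],
    "age_years": [3, 8, 2, 6, 5, 1],
    "engine_cc": [1200, 1500, 1000, 1400, 1200, 1200],
    "price":     [550000, 280000, 620000, 380000, 450000, 700000],  # target variable
})

X = cars[["km_driven", "age_years", "engine_cc"]]
y = cars["price"]

model = LinearRegression()
model.fit(X, y)

# Predict the price of an unseen car: 50,000 km, 4 years old, 1200 cc engine.
new_car = pd.DataFrame({"km_driven": [50000], "age_years": [4], "engine_cc": [1200]})
print(model.predict(new_car))
```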


Classification Problem

A classification problem in machine learning is a predictive modeling task that involves
predicting a class label for a specific example of input data.

Here are some examples of classification problems:


1. Spam filtering: Classifying an email as spam or not spam
2. Handwriting recognition: Identifying a handwritten character as one of the
recognized characters
3. Image classification: Classifying images into predefined classes, such as cats and
dogs, or oranges, apples, and pears

The prediction task is a classification when the target variable is discrete. There are four main
classification tasks in Machine learning: binary, multi-class, multi-label, and imbalanced
classifications.

1. Binary Classification
In a binary classification task, the goal is to classify the input data into two mutually
exclusive categories. The training data in such a situation is labeled in a binary
format: true and false; positive and negative; 0 and 1; spam and not spam, etc.
2. Multi-Class Classification
The multi-class classification, on the other hand, has at least two mutually exclusive
class labels, where the goal is to predict to which class a given input example belongs
to.


3. Multi-label classification
Multi-label classification is a classification problem where each instance can be
assigned to one or more classes. For example, in text classification, an article can be
about 'Technology,' 'Health,' and 'Travel' simultaneously

4. Imbalanced Classification
For the imbalanced classification, the number of examples is unevenly distributed in
each class, meaning that we can have more of one class than the others in the training
data.
Let’s consider the following 3-class classification scenario where the training data
contains: 60% of trucks, 25% of planes, and 15% of boats

The imbalanced classification problem could occur in the following scenario:

a. Fraudulent transaction detection in the financial industry
b. Rare disease diagnosis
c. Customer churn analysis

2.3. Unsupervised Learning

Learning Objectives
Understand unsupervised learning algorithms, explore clustering and dimensionality
reduction techniques, and examine their applications.

In unsupervised learning, the dataset is a collection of unlabeled examples.


Again, x is a feature vector, and the goal of an unsupervised learning algorithm is to create a
model that takes a feature vector x as input and either transforms it into another vector or into
a value that can be used to solve a practical problem.

Some algorithms for unsupervised machine learning include:
1. Clustering (k-means, hierarchical, DBSCAN)
2. Dimensionality Reduction
3. Latent Semantic Analysis (LSA)

For example,
In clustering, the model returns the id of the cluster for each feature vector in the dataset.

Use cases of Clustering in Business

Customer Segmentation:

In dimensionality reduction, the output of the model is a feature vector that has fewer features
than the input x.
In outlier detection, the output is a real number that indicates how x differs from a typical
example in the dataset.
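
A small sketch of clustering for customer segmentation with scikit-learn's k-means; the two
features (annual spend and purchases per year) and their values are hypothetical.

```python
# Hedged sketch: customer segmentation with k-means (hypothetical spend/frequency data).
import numpy as np
from sklearn.cluster import KMeans

# Each row is a customer: [annual_spend, purchases_per_year].
customers = np.array([
    [2000, 40], [2500, 35], [1800, 45],      # low-priced, frequent buyers
    [30000, 5], [28000, 4], [35000, 6],      # expensive, infrequent buyers
    [12000, 20], [11000, 18], [13000, 22],   # mid-range buyers
])

kmeans = KMeans(n_clusters=3, n_init=10, random_state=0)
labels = kmeans.fit_predict(customers)

# The model returns a cluster id for each feature vector; the ids carry no business meaning
# by themselves, so interpreting them (e.g. "budget" vs "premium" segment) is up to us.
print(labels)
print(kmeans.cluster_centers_)
```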


What is the difference between supervised and unsupervised machine learning?

A supervised model has a label; it makes predictions on the basis of the feature variables
(independent variables).
An unsupervised model doesn't have labels.

For example, when grouping customers for targeting and promotions, the model doesn't tell
us that customer A belongs to Class 1 (no label is given).
Suppose it has created three clusters C1, C2, and C3 on the basis of the feature variables:
C1 customers buy low-priced products, whereas C2 customers buy expensive products,
resulting in higher revenue along with a lower sales quantity.

In regression we predict continuous values, whereas in classification we predict discrete /
binary values.
Regression example: predicting the price of a used car on the basis of some independent
explanatory variables (mileage, km driven, model, etc.). Price is a continuous variable and
can take any value.
Classification example: Yes/No, such as approval of a loan.

Reinforcement Learning

Reinforcement learning is a subfield of machine learning where the machine "lives" in an
environment and is capable of perceiving the state of that environment as a vector of features.

Example: the Pac-Man game.


2.4. Deep Learning

Learning Objectives
Understand the concept of deep learning, explore neural networks and their architectures, and
examine various deep learning applications.

Deep Learning , a subset of machine learning, embraces the mechanism of neural networks
for complex problem solving.

Deep Learning becomes particularly relevant for problems involving large and complex
datasets, where traditional machine learning algorithms fall short.
This includes problems involving audio, images, etc., where the features of the data need to
be learned automatically.

Deep Learning is an advanced machine learning technique that requires more computation
and large amounts of training data to be able to scale and generalise well on complex data
representations. The concept of deep learning takes inspiration from the human brain. It uses
a vast number of layered algorithms, hence the term "deep", to simulate the intricate structure
of the human brain.

Natural language processing


Natural language processing is an important part of deep learning applications that rely on
interpreting text and speech. Customer service chatbots, language translators, and
sentiment analysis are all examples of applications benefitting from natural language
processing.

Self-driving vehicles
Autonomous vehicles use deep learning to learn how to operate and handle different
situations while driving, and it allows vehicles to detect traffic lights, recognize signs, and
avoid pedestrians.


Algorithms such as LaneNet are quite popular in research for extracting lane lines, and
algorithms such as YOLO or SSD are very popular for object detection in this field.

Neural Networks
Often referred to as "artificial brains", neural networks are a vital part of machine learning
and AI technology. Inspired by the biological neural network that makes up the human brain,
these powerful computational models leverage interconnected layers of artificial neurons or
'nodes'.

Nodes (artificial neurons) receive input data, process it with a set of predefined rules, and
pass the result to the next layer in the neural network.

A neural network is typically organised into three essential layers. Each layer contains
multiple nodes which process the incoming data before passing it on. Each of the
connections between nodes carries a numerical weight that adjusts as the network learns
during the training stage, affecting the importance of the input value.
1. Input Layer: The input layer receives the raw data attributes as input. Its nodes are
passive, meaning they do not change the data; they simply send it on to the hidden layers.


2. Hidden Layer(s): These layers do most of the computations required. They take the
data from the input layer, process it, and pass it to the next layer.
3. Output Layer: It generates the final outputs.

Neural networks adjust the weights between nodes through backpropagation to optimize
performance over time.
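
A compact sketch of such a layered network using scikit-learn's MLPClassifier (a multi-layer
perceptron) on the Iris dataset; the hidden-layer sizes are arbitrary illustrative choices.

```python
# Hedged sketch: a small feed-forward neural network (input -> hidden layers -> output).
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

# Two hidden layers of 16 and 8 nodes; weights are adjusted by backpropagation during fit().
model = MLPClassifier(hidden_layer_sizes=(16, 8), max_iter=2000, random_state=0)
model.fit(X_train, y_train)

print("Test accuracy:", model.score(X_test, y_test))
```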
The Spectrum of ML and Data Science Roles

a. Data Analysis / Modeling: Data analysis, feature engineering, model development
and training, statistical analysis, experiment design.
b. ML Services and Infrastructure: Training and inference services, scalability, model
deployment, API integration.
c. Area of Specialization
i. Generalist: Work on a variety of problem spaces, employ a broad range of ML
techniques, and adapt to the different requirements of the team.
ii. Specialist: Deep expertise in the chosen domain (such as Natural Language
Processing (NLP), Computer Vision (CV), or industry-specific areas like
self-driving cars and robotics), with advanced knowledge of domain-specific tools.

2.5. LLMs (Large Language Models)

Learning Objectives


Understand the principles of large language models (LLMs), explore their architectures, and
examine their applications in natural language processing and AI.

Generative AI
Generative AI is a type of AI that can create new content such as text, images, voice, and
code from scratch based on natural language inputs or "prompts". It uses machine learning
models powered by advanced algorithms and neural networks to learn from large datasets of
existing content.

Once a model is trained, it can generate new content similar to the content it was trained on.

What are LLMs?
Large Language Models are foundational machine learning models that use Deep Learning
algorithms to process and understand natural language.
These models are trained on a massive amount of text data to learn patterns and identify the
entity relationships in the language.
Language models are computational systems designed to understand, generate, and
manipulate human language. At their core, language models learn the intricate patterns,
semantics, and contextual relationships within language through extensive training over vast
volumes of text data.

These models are equipped with the capacity to predict the next word in a sentence based on
the preceding words (with coherence and relevance).

Source : https://www.coursera.org/learn/generative-ai-with-llms

Examples of LLMs include: GPT-3, GPT-4, BLOOM, FLAN UL2, Claude, LaMDA, GATO,
ChatGLM, LLaMA, LLaMA 2, LLaMA 3, MT-NLG, Pathways Language Model (PaLM),
FALCON, Stanford Alpaca, and Mistral 7B.


Text Generation before LLMs

Before the advent of transformers, there were traditional ways of implementing next-word
prediction given some context in a language model:

N-grams, Markov Models, RNNs, and LSTMs (Long Short-Term Memory networks)
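
To make the idea concrete, here is a toy bigram (2-gram) sketch that predicts the next word
from counts over a tiny made-up corpus; real n-gram models use far larger corpora and
smoothing.

```python
# Hedged sketch: a toy bigram model that predicts the next word from simple counts.
from collections import Counter, defaultdict

corpus = "the cat sat on the mat the cat ate the fish".split()

# Count how often each word follows each preceding word.
bigram_counts = defaultdict(Counter)
for prev_word, next_word in zip(corpus, corpus[1:]):
    bigram_counts[prev_word][next_word] += 1

def predict_next(word):
    """Return the most frequent word observed after `word` in the corpus."""
    followers = bigram_counts.get(word)
    return followers.most_common(1)[0][0] if followers else None

print(predict_next("the"))  # 'cat' is the most common continuation of "the" in this corpus
```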

In 2017, everything changed with the introduction of the transformer architecture, as
described in the influential paper titled "Attention is All You Need" by Google and the
University of Toronto. The transformer revolutionized generative AI by enabling efficient
scaling on multi-core GPUs, parallel processing of input data, and harnessing larger training
datasets. Its key breakthrough was the ability to learn and utilize attention mechanisms,
allowing the model to focus on the meaning of the words being processed.

What can LLMs be used for?

LLMs can perform a wide range of tasks, such as text-to-text generation, text-to-image and
image-to-text generation, and code generation. Common applications include (a short sketch
of two of these tasks follows the list):
a. Text Classification
b. Text Generation
c. Text Summarization
d. Conversational AI - Chatbots
e. Question Answering


f. Speech Recognition
g. Speech Identification
h. Spelling Correction
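As a hedged sketch of two of these tasks, the Hugging Face transformers pipelines can be used off the shelf (the default models the library downloads and the example texts are assumptions made for illustration, not something this module prescribes):

# Illustrative sketch: text summarization and text classification with ready-made pipelines
from transformers import pipeline

summarizer = pipeline("summarization")       # uses the library's default summarization model
classifier = pipeline("sentiment-analysis")  # uses the library's default sentiment model

text = ("Antibiotics are a type of medication used to treat bacterial infections. "
        "They are not effective against viral infections, and overuse can lead to resistance.")

print(summarizer(text, max_length=30, min_length=10)[0]["summary_text"])
print(classifier("I think the food was okay.")[0])  # e.g. {'label': 'POSITIVE', 'score': ...}
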


2.6 Prompt Engineering

Learning Objectives
Understand the principles of prompt engineering, explore techniques for crafting effective
prompts, and examine their impact on model performance and output.

Prompt engineering is a relatively new discipline for developing and optimizing prompts to
efficiently use language models (LMs) for a wide variety of applications and research topics.
Prompt engineering skills help to better understand the capabilities and limitations of large
language models (LLMs).
Prompt engineering is not just about designing and developing prompts. It encompasses a
wide range of skills and techniques that are useful for interacting and developing with LLMs.
It is an important skill for interfacing with, building with, and understanding the capabilities of LLMs.

LLM Settings

When designing and testing prompts, you typically interact with the LLM via an API.

You can configure a few parameters to get different results for your prompts. Tweaking these
settings is important to improve the reliability and desirability of responses, and it takes a bit
of experimentation to figure out the proper settings for your use cases. Below are the common
settings you will come across when using different LLM providers (two short code sketches
follow the detailed descriptions of these settings):
a. Temperature
b. Top P
c. Max Length
d. Stop Sequences
e. Frequency Penalty
f. Presence Penalty


a. Temperature
In short, the lower the temperature, the more deterministic the results, in the sense that
the most probable next token is always picked. Increasing the temperature could lead to
more randomness, which encourages more diverse or creative outputs; you are
essentially increasing the weights of the other possible tokens. In terms of application,
you might want to use a lower temperature value for tasks like fact-based QA to
encourage more factual and concise responses. For poem generation or other creative
tasks, it might be beneficial to increase the temperature value.
b. Top P
Top P is a sampling technique used together with temperature, called nucleus
sampling, with which you can control how deterministic the model is. If you are
looking for exact and factual answers, keep this low. If you are looking for more
diverse responses, increase it to a higher value. If you use Top P, only the tokens
comprising the top_p probability mass are considered for responses, so a low top_p
value selects the most confident responses. This means that a high top_p value will
enable the model to look at more possible words, including less likely ones, leading to
more diverse outputs.

The general recommendation is to alter temperature or Top P but not both.


c. Max Length - You can manage the number of tokens the model generates by
adjusting the max length. Specifying a max length helps you prevent long or
irrelevant responses and control costs.
d. Stop Sequences - A stop sequence is a string that stops the model from generating
tokens. Specifying stop sequences is another way to control the length and structure of
the model's response. For example, you can tell the model to generate lists that have
no more than 10 items by adding "11" as a stop sequence.
e. Frequency Penalty - The frequency penalty applies a penalty on the next token
proportional to how many times that token already appeared in the response and
prompt. The higher the frequency penalty, the less likely a word will appear again.
This setting reduces the repetition of words in the model's response by giving tokens
that appear more a higher penalty.
f. Presence Penalty - The presence penalty also applies a penalty on repeated tokens
but, unlike the frequency penalty, the penalty is the same for all repeated tokens. A
token that appears twice and a token that appears 10 times are penalized the same.
This setting prevents the model from repeating phrases too often in its response. If
you want the model to generate diverse or creative text, you might want to use a
higher presence penalty. Or, if you need the model to stay focused, try using a lower
presence penalty.
Similar to temperature and top_p, the general recommendation is to alter the
frequency or presence penalty but not both.
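To make Temperature and Top P concrete, here is a small NumPy sketch of temperature scaling and top-p (nucleus) filtering applied to a toy next-token distribution; the token list and scores are made-up illustrative values, not output from any real model:

# Illustrative sketch: how temperature and top_p reshape a next-token distribution
import numpy as np

tokens = ["cat", "dog", "car", "banana", "quantum"]
logits = np.array([3.0, 2.5, 1.0, 0.2, -1.0])  # made-up scores for the next token

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def apply_temperature(logits, temperature):
    # Lower temperature sharpens the distribution (more deterministic);
    # higher temperature flattens it (more random / creative).
    return softmax(logits / temperature)

def top_p_filter(probs, top_p):
    # Keep only the smallest set of tokens whose cumulative probability reaches top_p
    order = np.argsort(probs)[::-1]
    cumulative = np.cumsum(probs[order])
    keep = order[: np.searchsorted(cumulative, top_p) + 1]
    filtered = np.zeros_like(probs)
    filtered[keep] = probs[keep]
    return filtered / filtered.sum()

for t in (0.2, 1.0, 2.0):
    print("temperature", t, dict(zip(tokens, apply_temperature(logits, t).round(3))))

print("top_p 0.9", dict(zip(tokens, top_p_filter(softmax(logits), 0.9).round(3))))

And here is a hedged sketch of passing these settings through an OpenAI-style chat-completions API; the model name, prompt, and parameter values are illustrative assumptions, and other providers expose similarly named parameters:

# Illustrative sketch: configuring common LLM settings via an OpenAI-style API
from openai import OpenAI

client = OpenAI()  # assumes an API key is already configured in the environment

response = client.chat.completions.create(
    model="gpt-4o-mini",  # illustrative model name
    messages=[{"role": "user", "content": "List 10 healthy breakfast ideas."}],
    temperature=0.2,        # low temperature -> more factual, deterministic answers
    max_tokens=200,         # cap the length of the response (and the cost)
    stop=["11"],            # stop sequence: cut the list off after 10 numbered items
    frequency_penalty=0.0,  # no extra penalty for frequently repeated tokens
    presence_penalty=0.0,   # no flat penalty for tokens that already appeared
)
print(response.choices[0].message.content)

Note that top_p is left at its default here, following the advice above to alter temperature or Top P but not both.
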

Prompting an LLM


A prompt can contain information like the instruction or question you are passing to the
model and include other details such as context, inputs, or examples. You can use these
elements to instruct the model more effectively to improve the quality of results.

A few examples of prompts for:
a. Text Summarization
b. Information Extraction
c. Question Answering
d. Text Classification
e. Conversation
f. Code Generation
g. Reasoning

Text Summarization
Example:
Explain antibiotics
A:
Antibiotics are a type of medication used to treat bacterial infections. They work by either
killing the bacteria or preventing them from reproducing, allowing the body’s immune system
to fight off the infection. Antibiotics are usually taken orally in the form of pills, capsules, or
liquid solutions, or sometimes administered intravenously. They are not effective against viral
infections, and using them inappropriately can lead to antibiotic resistance.

Explain the above in one sentence:


Output:
Antibiotics are medications used to treat bacterial infections by either killing the bacteria or
stopping them from reproducing, but they are not effective against viruses and overuse can
lead to antibiotic resistance.

Text Classification

Classify the text into neutral, negative or positive.


Text: I think the food was okay.
Sentiment:
Output: Neutral

Code Generation:
One application where LLMs are quite effective is code generation.

Example:
Prompt:
"""
Table departments, columns = [DepartmentId, DepartmentName]
Table students, columns = [DepartmentId, StudentId, StudentName]
Create a MySQL query for all students in the Computer Science Department
"""

Output:
SELECT StudentId, StudentName
FROM students
WHERE DepartmentId IN (SELECT DepartmentId FROM departments WHERE
DepartmentName = 'Computer Science');
Elements of a Prompt
A prompt contains any of the following elements:

a. Instruction - a specific task or instruction you want the model to perform
b. Context - external information or additional context that can steer the model to better
responses
c. Input Data - the input or question that we are interested in finding a response for
d. Output Indicator - the type or format of the output.

Example
Classify the text into neutral, negative, or positive
Text: I think the food was okay.
Sentiment:

In the prompt example above, the instruction corresponds to the classification task,
"Classify the text into neutral, negative, or positive". The input data corresponds to
the "I think the food was okay." part, and the output indicator used is "Sentiment:".
Standard Tips for Designing Prompts
You can start with simple prompts and keep adding more elements and context as you aim for
better results. Iterating your prompt along the way is vital for this reason.
a. Instruction
You can design effective prompts for various simple tasks by using commands to
instruct the model on what you want to achieve, such as "Write", "Classify",
"Summarize", "Translate", "Order", etc.
### Instruction ###
Translate the text below to Spanish:
Text: "hello!"

b. Specificity
Be very specific about the instruction and task you want the model to perform.
When designing prompts, you should also keep in mind the length of the prompt as
there are limitations regarding how long the prompt can be.
Example:
Extract the name of places in the following text.
Desired format:
Place: <comma_separated_list_of_places>
Input: "Although these developments are encouraging to researchers, much is still a
mystery. “We often have a black box between the brain and the effect we see in the

52
Introduction to Data Science, AI & Machine Learning

periphery,” says Henrique Veiga-Fernandes, a neuroimmunologist at the


Champalimaud Centre for the Unknown in Lisbon. “If we want to use it in the
therapeutic context, we actually need to understand the mechanism."
c. Avoid Impreciseness (lack of exactness)
The more direct the prompt, the more effectively the message gets across.
Prompt:
Explain the concept of prompt engineering. Keep the explanation short, only a few
sentences, and don't be too descriptive.


References & Suggested Readings

[1] https://www.linkedin.com/pulse/popular-ml-frameworks-train-your-models-what-choose-vishnuvaradhan-v-q98oc
[2] https://www.datacamp.com/tutorial/azure-machine-learning-guide
[3] https://www.linkedin.com/jobs/view/3992431030/
[4] https://www.thinkautonomous.ai/blog/deep-learning-in-self-driving-cars/
[5] ML Role Spectrum - Karthik Singhal, Medium, Meta
[6] Attention Is All You Need
[7] The Illustrated Transformer
[8] Transformer (Google AI blog post)
