Building Machine Learning: Program Notes
Problem framing
1. Most books, courses, and tutorials focus on algorithms, but engineers spend most of
their time working with data. Building models is only a small portion of creating
working systems.
2. Pablo Picasso once said “Computers are useless. They can only give you
answers.” Asking the right questions will prevent you from solving the wrong
problem or trying to solve the right problem using the wrong solution.
3. Before starting a project, determine what problem you are solving. This
question is essential to understanding the project's scope. Focus on a
high-level breakdown of the critical path.
4. Before starting a project, determine what problems you are ignoring. The
problems you avoid are what differentiate a mediocre solution from an
exceptional one.
5. Before starting a project, determine who your customers are and why they
care. Avoid the telephone game. Cut the middleman. Identify who cares and
why this is important to them. Understanding the problem's importance
provides motivation and context for the entire project. It helps prioritize
resources and justifies the investment in developing a machine-learning
solution. Without clarity on the problem's significance, it's challenging to rally
support and commitment.
6. Before starting a project, determine what existing solutions look like. Those
who cannot remember the past are condemned to repeat it. Assessing existing
solutions allows for learning from past experiences and avoiding reinventing
the wheel. It helps identify best practices, pitfalls, and potential opportunities
for improvement.
7. Before starting a project, determine how you can measure success. You can’t
solve a problem without a benchmark to assess progress and guide future
improvements.
9. Lawyers don't take cases they can't win. Be like a lawyer. Don't start working on a project until you're confident you can deliver a successful solution.
11. The haystack principle is a powerful mental model when deciding how to solve
a problem. Instead of trying to find the needle in a haystack, make the
haystack as small as possible.
12. Inversion is a mental model in which you turn a problem upside down to think
about it differently. It's a great strategy for identifying new ways to tackle a
problem.
3. Simple ideas are better than clever ideas. Machine learning is clever and powerful, but it isn't free. A better alternative is to start with simple heuristics. For example, use a regular expression instead of a large language model, or build a simple function with conditional logic instead of a neural network.
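A minimal sketch of that idea, assuming a hypothetical task of flagging support emails that mention an order number; a regular expression stands in for a model, with no training data, labeling effort, or serving costs:

```python
import re

# Hypothetical task: flag support emails that reference an order number.
# A regular expression is often enough; there is no model to train or serve.
ORDER_ID = re.compile(r"\border[-\s]?#?\d{6,10}\b", re.IGNORECASE)

def mentions_order(text: str) -> bool:
    """Heuristic stand-in for a classifier: True if the text references an order ID."""
    return ORDER_ID.search(text) is not None

print(mentions_order("Hi, my order #12345678 never arrived."))  # True
print(mentions_order("Just wanted to say thanks!"))             # False
```

If the heuristic turns out to be good enough, the problem may not need machine learning at all; if it isn't, it still becomes a useful baseline.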
4. In most situations, if you don’t know how to build a good application using a
manual approach, machine learning will be a waste of time.
2. There are two ways to gather data. You can go and collect it, or you can let the
data come to you by building a product that your users use. The latter is the
best approach to getting access to real data.
4. The three key elements of a good dataset are quality, diversity, and quantity.
5. More data is not always better—and it’s often worse. You need to maintain
data, deal with privacy implications, and pay for processing and storage costs.
Better data is better than more data.
8. A few common causes of selection bias are time, the location where we collect the data, demographic bias, response bias, and availability bias.
10. An example of selection bias related to location occurs when training data is
collected from one location, but the model is deployed in a different place. This
may lead to poor predictions due to different regional characteristics.
12. An example of selection bias related to response bias occurs when survey
response rates are low or when the wording of questions and available answer
choices introduce bias, affecting the reliability of the data.
14. An example of selection bias related to long tail bias occurs when the
mechanisms for collecting data prevent rare events from appearing in the
dataset, leading to a model that may not accurately predict these events when
they do occur.
15. Collect as much metadata as possible to help explain your data. This information will become critical for evaluating your model and countering the effects of selection bias.
Labeling strategies
1. Models need ground truth labels to learn.
2. Human annotations consist of subject matter experts providing labels for each sample in a dataset. This produces high-quality labels, but creating manual labels is expensive and doesn't scale.
3. Many people feel like labeling is beneath them and see it as grunt work, but we can’t
build a good model without a robust strategy to label data.
4. Some problems have built-in, natural labels. In these problems, we can automatically
evaluate the model’s predictions without needing manual labels.
5. Some examples of problems with natural labels are predicting the arrival of planes, the
price of a stock, or whether it will snow in the future.
6. The lack of labels is one of the biggest bottlenecks when building machine learning
systems.
7. Having the wrong labels is worse than having no labels at all. You can't out-train bad
labels.
8. Weak supervision can help us scale labeling. We can use it to generate labels
automatically by using high-level and often noisier sources of supervision.
9. Weak supervision is fast and inexpensive compared to manually creating labels. It can also adapt to changes in the data or requirements. If you have human annotations, you can use them to validate the labels that weak supervision produces.
10. Weak supervision has the drawback of producing noisy or inaccurate labels.
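A minimal weak-supervision sketch, assuming a hypothetical spam-filtering task: several noisy labeling functions vote on each sample and the majority becomes the (noisy) label. Real projects often lean on a dedicated library such as Snorkel.

```python
# Weak-supervision sketch: noisy labeling functions vote on each sample,
# and the majority vote becomes the (noisy) label. All rules here are made up.
SPAM, HAM, ABSTAIN = 1, 0, -1

def lf_has_link(text):
    return SPAM if "http://" in text or "https://" in text else ABSTAIN

def lf_all_caps(text):
    return SPAM if text.isupper() else ABSTAIN

def lf_short_message(text):
    return HAM if len(text.split()) < 4 else ABSTAIN

def weak_label(text, lfs=(lf_has_link, lf_all_caps, lf_short_message)):
    votes = [vote for vote in (lf(text) for lf in lfs) if vote != ABSTAIN]
    if not votes:
        return ABSTAIN                          # no labeling function fired
    return max(set(votes), key=votes.count)     # majority vote

print(weak_label("WIN CASH NOW https://spam.example"))  # 1 (spam)
print(weak_label("see you soon"))                       # 0 (ham)
```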
11. Active learning is an efficient strategy to label data. It iteratively trains a model and
uses it to find which data points are the most valuable to label next.
13. Uncertainty sampling and diversity sampling are the two most common
strategies to combine with Active Learning.
14. Uncertainty sampling identifies data points near a decision boundary. These samples
have a larger chance of being misclassified by the model.
15. Diversity sampling identifies any unlabeled samples that are unusual,
underrepresented, or unknown to the model in its current state.
16. Uncertainty and diversity work best together. Combine both strategies when
building a selection function to decide which data points to label next.
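A minimal sketch of such a selection function, assuming a scikit-learn-style classifier that exposes predict_proba and 2-D NumPy feature matrices (all names and the weighting are illustrative):

```python
import numpy as np

def select_to_label(model, X_unlabeled, X_labeled, budget=10, alpha=0.5):
    """Rank unlabeled samples by a blend of uncertainty and diversity.

    Assumes a scikit-learn-style classifier exposing predict_proba.
    `alpha` weights uncertainty against diversity."""
    proba = model.predict_proba(X_unlabeled)
    uncertainty = 1.0 - proba.max(axis=1)                  # close to the decision boundary
    # Diversity: distance from each unlabeled point to its nearest labeled point.
    dists = np.linalg.norm(X_unlabeled[:, None, :] - X_labeled[None, :, :], axis=-1)
    diversity = dists.min(axis=1)
    diversity = diversity / (diversity.max() + 1e-12)      # rescale to [0, 1]
    score = alpha * uncertainty + (1 - alpha) * diversity
    return np.argsort(score)[::-1][:budget]                # indices of the best candidates
```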
Feature engineering
1. Better data is crucial for building better models. The single biggest impact on your model's performance will come from very good features [1].
3. Vectorization is the process of converting data from its original format into
tensors or vectors of numbers. Label encoding, one-hot encoding, and target
encoding are examples of vectorization.
6. One-hot encoding replaces the original feature with a binary column for each
categorical value.
8. Target encoding replaces each categorical value with the mean of the target
column for that category.
9. Target encoding can inadvertently introduce future information into the model
training process, causing overfitting, especially in cases where the number of
data points for certain categories is small.
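A minimal pandas sketch of one-hot and target encoding for a hypothetical city feature; note that target encoding must be computed on training folds only, or it leaks the target into the features:

```python
import pandas as pd

df = pd.DataFrame({
    "city":  ["paris", "tokyo", "paris", "lima", "tokyo"],
    "price": [120, 95, 130, 60, 100],               # hypothetical target column
})

# One-hot encoding: one binary column per category.
one_hot = pd.get_dummies(df["city"], prefix="city")

# Target encoding: replace each category with the mean target for that category.
# In practice, fit these means on the training folds only to avoid leaking the target.
target_means = df.groupby("city")["price"].mean()
df["city_target_enc"] = df["city"].map(target_means)

print(pd.concat([df, one_hot], axis=1))
```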
[1] Konrad Banachewicz and Luca Massaron. "The Kaggle Book: Data Analysis and Machine Learning for Competitive Data Science." Interview with Bojan Tunguz, a Kaggle Quadruple Grandmaster.
11. Normalization and standardization are techniques that turn numerical data
into a consistent range. Models work best when their features have small
values with similar ranges.
12. Use normalization when you need a bounded range and the data is not Gaussian (normally distributed). Normalization is particularly helpful for algorithms that compute distances between data points, such as k-nearest neighbors, and for models that expect inputs in a specific range, like neural networks.
13. Normalization is sensitive to outliers. Outliers can skew the minimum and
maximum values, distorting the range for other data points.
14. Use standardization when the feature’s data follows a Gaussian distribution or
when the algorithm assumes normally distributed data, such as linear or
logistic regression.
15. Standardization does not bind values to a specific range, which might be
necessary for some algorithms. It also retains the effect of outliers since it
merely re-centers and rescales the data.
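A minimal scikit-learn sketch contrasting the two techniques on a feature with an outlier; in a real pipeline the scalers are fit on training data only:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

X_train = np.array([[1.0], [2.0], [3.0], [4.0], [100.0]])    # note the outlier

# Normalization: rescale to a bounded range, here [0, 1].
# The outlier squeezes every other value toward 0.
normalized = MinMaxScaler().fit_transform(X_train)

# Standardization: zero mean, unit variance. Not bounded; it only
# re-centers and rescales the data.
standardized = StandardScaler().fit_transform(X_train)

print(normalized.ravel())
print(standardized.ravel())
```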
16. Most models don’t work with incomplete data. Handling missing values is a
critical step for building reliable, unbiased, and accurate models.
17. You can handle missing values by removing the feature containing the missing
values, removing the individual samples, or replacing the missing values with
the most frequent feature value, its mean, or median.
18. Be careful when removing missing values. The distribution and frequency of
missing values may uncover key insights about your data or process.
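A minimal sketch of those options using scikit-learn's SimpleImputer on a tiny, made-up matrix:

```python
import numpy as np
from sklearn.impute import SimpleImputer

X = np.array([[7.0, 2.0], [np.nan, 3.0], [5.0, np.nan], [8.0, 4.0]])

# Replace missing values with the column mean; "median" and "most_frequent"
# are the other strategies mentioned above.
X_filled = SimpleImputer(strategy="mean").fit_transform(X)
print(X_filled)

# Alternatives: drop the samples (rows) or the features (columns) with missing values.
X_drop_rows = X[~np.isnan(X).any(axis=1)]
```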
19. Feature engineering is the process of transforming and improving the original
data into more useful information to train a model. The quality of the features
in your training data will have the single biggest impact on the quality of your
model.
21. Deep learning didn’t kill feature engineering. Feature engineering can enhance
a model by providing additional information that raw data doesn’t convey.
2. A model that can’t beat a rudimentary baseline hasn’t learned any significant
patterns from the data.
6. We can also measure people’s performance solving the same task or the
performance of an existing process and use these values as the baseline.
7. Scope creep will eventually turn any rule-based system into a nightmare. To
replace a complex set of heuristics, the best solution is a simple model.
10. Simplicity is not only about the architecture of the model but about how easy
the model is to use.
11. When choosing a model, consider its general capabilities. Consider what your model can and can't do. Understand the patterns it can capture, its sensitivity to noise and outliers, and its ability to generalize.
12. When choosing a model, consider hardware, time, and costs. Edge devices have
limited resources and require fast inference times. Large models cost more.
13. When choosing a model, consider how the model scales to more data. Your
model may work today, but the volume and complexity of data will increase
over time.
15. When choosing a model, consider how familiar your team is with it. Prioritize using the models you know best. Lack of expertise is a huge roadblock.
16. Algorithms come and go. The more experiments you run, the better your
chances of choosing the best model for your problem.
17. Ignore state-of-the-art models as much as possible. Most of these models are the result of a popularity contest and often struggle with real-world data.
5. Model-centric changes aim to build the best model for a given dataset.
Data-centric changes aim to produce the best dataset to feed a given model.
2. Data parallelism and model parallelism (with pipeline execution) are the two
distributed training techniques used in production applications.
3. Data parallelism replicates the same model across multiple nodes and trains
each replica on a different subset of the data.
4. Model parallelism splits the model across different nodes and trains each
portion using all of the data. Model parallelism only works when we can split a
model into independent components that we can run in parallel.
6. Data parallelism is easier to implement and it works with every type of model.
It has the downside that each processor needs to hold the entire model in
memory, limiting the size of the model that can be trained.
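A minimal NumPy sketch of the idea behind data parallelism: every "worker" holds a full copy of the model, computes gradients on its own shard of the batch, and the gradients are averaged before a shared update. Real systems delegate this to frameworks such as PyTorch's DistributedDataParallel.

```python
import numpy as np

rng = np.random.default_rng(0)
X, y = rng.normal(size=(128, 3)), rng.normal(size=128)
w = np.zeros(3)                                   # the "model": linear regression weights

def gradient(w, X_shard, y_shard):
    """Mean-squared-error gradient computed on one worker's shard of the batch."""
    return 2 * X_shard.T @ (X_shard @ w - y_shard) / len(y_shard)

n_workers = 4
for step in range(100):
    shards = zip(np.array_split(X, n_workers), np.array_split(y, n_workers))
    grads = [gradient(w, Xs, ys) for Xs, ys in shards]   # each replica sees 1/4 of the batch
    w -= 0.05 * np.mean(grads, axis=0)                   # "all-reduce": average, then update

print(w)
```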
Evaluation strategies
1. Evaluation is critical for building working models. A good evaluation strategy
is the difference between a model that works and a waste of electricity.
2. You may have a model that performs well with your small dataset, but without
a solid evaluation strategy, it may not work with real production data.
5. The easiest way to evaluate a model is to test it on a holdout set, a small sample
of the data the model has never seen before. Use this holdout set to infer the
performance of the model on future, unseen data.
6. A holdout set may lead to a biased model assessment. It’s also an inefficient
way to take advantage of the data, especially when working with small
datasets.
8. Cross-validation splits the available data into k folds, and for k iterations,
evaluates the model using one of the k partitions and trains with the rest of the
data.
9. After using cross-validation, you can train a new model using the entire
dataset. This is a great strategy that works best when training is cheap and
data is valuable but scarce.
10. After using cross-validation, you can use every model generated during each iteration and average their results. This is a good strategy when training a new model on the entire dataset is too expensive.
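A minimal scikit-learn sketch of k-fold cross-validation, followed by the first option above (retraining a final model on the entire dataset):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = load_breast_cancer(return_X_y=True)
model = LogisticRegression(max_iter=5000)

# 5-fold cross-validation: each fold is used once for evaluation, the rest for training.
scores = cross_val_score(model, X, y, cv=5)
print(scores.mean(), scores.std())

# Option 1 from above: train one final model on the entire dataset.
final_model = LogisticRegression(max_iter=5000).fit(X, y)
```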
3. Data leakages are one of the most common issues when building machine
learning models.
4. A leaky validation strategy happens when information from the training data
“leaks” into the validation data when building a predictive model.
7. A plane's life span isn't measured in years but rather in pressurization cycles. Every time a plane takes flight, it is pressurized, which puts stress on the fuselage and the wings. Similarly, we can measure the quality of a test set in "tuning cycles" [2]. Every time we use the test set in any way, we put pressure on it and overfit our model.
[2] François Chollet. February 13, 2024. https://twitter.com/fchollet/status/1757573833677308132.
6. Identify and split your data into relevant subsets. Evaluate your model's
performance on each.
8. Look for disproportionately important samples. They are often critical for the
business, but global metrics hide how models perform on them. For example,
some of the customers of the model might be big spenders compared with the
rest. These power users are disproportionately important.
10. A simple strategy is to look at the predictions the model gets wrong, and apply
a clustering algorithm to find any common characteristics among them. You
can focus on these results to improve the data and the model results.
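A minimal sketch of that strategy: collect the misclassified samples and cluster them to look for shared characteristics (the dataset, model, and cluster count are arbitrary):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

model = LogisticRegression(max_iter=5000).fit(X_train, y_train)
wrong = X_test[model.predict(X_test) != y_test]            # the samples the model gets wrong

# Cluster the mistakes and inspect each group for shared characteristics.
n_clusters = min(3, len(wrong))
clusters = KMeans(n_clusters=n_clusters, n_init=10, random_state=0).fit_predict(wrong)
for c in np.unique(clusters):
    members = wrong[clusters == c]
    print(f"cluster {c}: {len(members)} errors, first feature means:", members.mean(axis=0)[:3])
```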
3. Undersampling reduces the size of the majority class to balance the dataset.
This is done by randomly removing instances from the majority class until the
class distribution is more balanced.
9. You can handle imbalances by using the appropriate metric for your problem,
more robust algorithms, changing the class weights, cost-sensitive learning, or
threshold moving.
10. When using threshold moving, don’t use fixed thresholds. Instead, a more
robust strategy is to learn the appropriate thresholds by evaluating using the
holdout validation data.
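A minimal sketch of learning a decision threshold from holdout validation data instead of using a fixed 0.5, assuming a classifier with predict_proba and F1 as the metric to optimize (both are illustrative choices):

```python
import numpy as np
from sklearn.metrics import f1_score

def best_threshold(model, X_val, y_val):
    """Pick the probability threshold that maximizes F1 on the holdout validation set."""
    proba = model.predict_proba(X_val)[:, 1]
    thresholds = np.linspace(0.05, 0.95, 19)
    scores = [f1_score(y_val, (proba >= t).astype(int)) for t in thresholds]
    return thresholds[int(np.argmax(scores))]

# Usage with a hypothetical fitted classifier and validation split:
# threshold = best_threshold(model, X_val, y_val)
# y_pred = (model.predict_proba(X_new)[:, 1] >= threshold).astype(int)
```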
[3] Ruben van den Goorbergh, Maarten van Smeden, Dirk Timmerman, Ben Van Calster. "The harm of class imbalance corrections for risk prediction models: illustration and simulation using logistic regression." Journal of the American Medical Informatics Association, Volume 29, Issue 9, September 2022, Pages 1525–1534. https://doi.org/10.1093/jamia/ocac093.
[4] Yotam Elor, Hadar Averbuch-Elor. "To SMOTE, or not to SMOTE?" arXiv:2201.08528.
12. Some common data augmentation techniques in computer vision are rotation,
flipping, cropping, and scaling.
13. MixUp [5] is a technique that helps deep neural networks avoid memorization and sensitivity to adversarial examples. It augments a dataset of images by blending existing samples and turning discrete labels into continuous labels.
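A minimal NumPy sketch of MixUp: blend random pairs of samples and their one-hot labels with a Beta-distributed coefficient (see reference [5] below); the batch shapes in the usage note are hypothetical:

```python
import numpy as np

def mixup(X, y_onehot, alpha=0.2, seed=0):
    """MixUp: convex combinations of random pairs of samples and their one-hot labels."""
    rng = np.random.default_rng(seed)
    lam = rng.beta(alpha, alpha, size=len(X))           # one blending factor per pair
    idx = rng.permutation(len(X))                       # a random partner for each sample
    lam_x = lam.reshape(-1, *([1] * (X.ndim - 1)))      # broadcast over the sample dimensions
    X_mix = lam_x * X + (1 - lam_x) * X[idx]
    y_mix = lam[:, None] * y_onehot + (1 - lam[:, None]) * y_onehot[idx]
    return X_mix, y_mix                                 # labels become continuous, not discrete

# Usage with a hypothetical batch: images of shape (batch, H, W, C) and
# labels_onehot of shape (batch, num_classes):
# X_mix, y_mix = mixup(images, labels_onehot)
```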
14. A clever technique for augmenting text data is to use back translations, which
automatically translate text back and forth between different languages to
generate alternatives.
[5] Hongyi Zhang, Moustapha Cisse, Yann N. Dauphin, David Lopez-Paz. "mixup: Beyond Empirical Risk Minimization." arXiv:1710.09412.
Model versioning
1. Building models is not only about writing code. We need a different process to
store, version, share, maintain, and serve model predictions at scale.
3. The model registry serves as a central versioning hub. From here, we can
export a specific version of the model and deploy it to start serving
predictions.
Serving strategies
1. We can deploy and serve the predictions of a model by wrapping it with a RESTful API, or by using a dedicated serving tool like TensorFlow Serving, TorchServe, or NVIDIA Triton.
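A minimal sketch of the first option, wrapping a model with a RESTful API using FastAPI; the model file name, module name, and feature schema are hypothetical, and the dedicated serving tools above are usually the better choice at scale:

```python
# A minimal RESTful wrapper around a trained model. Assumes a scikit-learn model
# saved to "model.joblib" (hypothetical path) and a file named serve.py.
# Run with: uvicorn serve:app
import joblib
from typing import List
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()
model = joblib.load("model.joblib")          # load the trained model once, at startup

class Features(BaseModel):
    values: List[float]                      # the feature vector for one sample

@app.post("/predict")
def predict(features: Features):
    prediction = model.predict([features.values])[0]
    return {"prediction": float(prediction)}
```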
2. We can host a model on the cloud for scalability. We can also host a model
on-premises when working with sensitive data that can’t leave local
infrastructure.
3. We all want to serve good predictions fast and cheaply. Quality, speed, and
costs are the key tradeoffs to consider when serving model predictions.
4. The unattainable priority triangle consists of quality, cost, and speed. You can
only achieve two of these at the same time. For example, you can have a cheap
and fast model with low quality, a good and cheap model that’s slow, or a good
and fast model that’s expensive.
9. Dynamic serving is inefficient, in terms of latency and costs, when processing large volumes of data.
10. Static serving pre-computes predictions ahead of time. It has higher storage costs but low latency. It works best for problems with a small number of potential input values (low cardinality). An example of a low-cardinality problem is predicting the price of a luxury bag given its brand, model, and condition.
11. Static serving is more efficient for processing a lot of data and reduces latency,
but it requires knowing which predictions to generate in advance. It makes the
system less responsive to changes.
12. A hybrid system that uses static and dynamic serving is often the best
approach. Pre-compute and serve low-cardinality predictions from a cache,
and serve other predictions in real time.
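A minimal sketch of that hybrid approach: check a cache of pre-computed predictions first and fall back to real-time inference for everything else (the cache contents and fallback below are placeholders):

```python
# Hybrid serving sketch: serve low-cardinality inputs from a pre-computed cache and
# fall back to real-time inference for everything else. Cache contents are made up.
precomputed = {
    ("gucci", "marmont", "good"): 1450.0,
    ("prada", "galleria", "fair"): 980.0,
}

def serve_price(key, fallback):
    """Return a cached prediction if available; otherwise compute it on demand."""
    if key in precomputed:
        return precomputed[key]              # static: cheap and instant
    return fallback(key)                     # dynamic: real-time model inference

# Usage with a stand-in for the real model:
print(serve_price(("gucci", "marmont", "good"), fallback=lambda k: 0.0))   # cache hit
print(serve_price(("chanel", "flap", "new"), fallback=lambda k: 0.0))      # falls back to the model
```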
13. Deploying models on edge devices requires creativity. We can run large,
complex models by breaking a problem down and using two-phase
predictions.
18. Evaluate and monitor each pipeline model individually. A pipeline amplifies
errors and hides each model's individual strengths and weaknesses.
19. A technique for evaluating the individual potential of every model in a pipeline
is to compute the performance of the end-to-end pipeline and iteratively
replace each individual model with perfect predictions before reevaluating the
performance. We can then focus on the model with the potential to improve
the pipeline the most.
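A minimal sketch of that technique, assuming the pipeline stages, their ground-truth "oracle" versions, and the end-to-end scoring function are supplied by the project (every name here is hypothetical):

```python
def most_promising_stage(stages, oracles, data, score):
    """Swap each stage for a perfect 'oracle' version and measure the end-to-end gain.

    `stages` and `oracles` map stage names to callables, and `score(stages, data)`
    is the project's end-to-end evaluation function. All names are hypothetical."""
    base = score(stages, data)
    gains = {}
    for name in stages:
        with_oracle = {**stages, name: oracles[name]}    # one stage gets perfect predictions
        gains[name] = score(with_oracle, data) - base
    return max(gains, key=gains.get), gains
```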
Human-in-the-loop
1. Implementing human-in-the-loop systems is one of the most powerful ways to
design applications that augment human judgment for critical decisions.
Model compression
9. Model compression reduces a model's size without significantly sacrificing its accuracy. Smaller models take less space and run faster. The three most common model compression techniques used in practice are pruning, model quantization, and knowledge distillation.
10. Pruning is a model compression technique that identifies the parameters that
have no or negligible influence on the final predictions and sets them to zero.
11. Pruning increases the sparsity of the model, making it faster. Increased sparsity makes models more computationally efficient and reduces the space we need to store them.
12. Pruning increases the risk of overfitting. Finding which parameters to prune and how much pruning is enough is a difficult and expensive process.
13. Quantization reduces the precision of the model parameters by using fewer
bits to represent them. This reduces the model size and increases its speed.
14. Quantized models take less space and run faster, but they can represent a
smaller range of values, leading to more rounding errors and worse
performance.
15. Post-training quantization is the most popular and easiest way to compress
models. Quantization-aware training is better for accuracy preservation.
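A minimal PyTorch sketch of pruning and post-training dynamic quantization on a small stand-in model; the exact APIs vary across PyTorch versions:

```python
import torch
import torch.nn as nn
from torch.nn.utils import prune

model = nn.Sequential(nn.Linear(128, 64), nn.ReLU(), nn.Linear(64, 2))

# Pruning: zero out the 30% of weights with the smallest magnitude in each Linear layer.
for module in model:
    if isinstance(module, nn.Linear):
        prune.l1_unstructured(module, name="weight", amount=0.3)
        prune.remove(module, "weight")       # make the pruning permanent

# Post-training dynamic quantization: store Linear weights as 8-bit integers.
quantized = torch.quantization.quantize_dynamic(model, {nn.Linear}, dtype=torch.qint8)
print(quantized)
```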
17. Knowledge distillation can significantly reduce the size and computational
complexity of a model.
2. Models degrade over time. Every model has to deal with edge cases, positive
feedback loops, and data distribution shifts.
5. Edge cases aren’t the same as outliers. An outlier might be an edge case, but
not all outliers are edge cases. An edge case refers to an example where the
model performs significantly worse. An outlier refers to an example that’s
significantly different from the rest of the data, but the model might handle it
well.
6. Edge cases are inevitable. They can occur because of damaged sensors,
malicious input, bad data collection, corrupt data, or rare events.
11. Remove hidden feedback loops whenever possible. Add variability using
exploration/exploitation or randomization, or add positional features to the
data.
Distribution shifts
1. You can’t out-train bad data. If your training and production data come from
different distributions, your model will suck.
3. To run adversarial validation, join your train and test set and replace the target
column with a new binary feature. Set the value of this feature to 0 for every
sample from your train set and 1 for every sample from the test set. Train a
simple binary classification model on this new dataset. If it can tell the train and test samples apart, your data comes from different distributions.
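A minimal scikit-learn sketch of that recipe; an AUC close to 0.5 means the classifier cannot separate the two sets, while an AUC near 1.0 signals a distribution mismatch:

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import cross_val_score

def adversarial_validation(X_train, X_test):
    """Score how well a classifier separates train (0) from test (1) samples."""
    X = np.vstack([X_train, X_test])
    is_test = np.concatenate([np.zeros(len(X_train)), np.ones(len(X_test))])
    clf = GradientBoostingClassifier(random_state=0)
    return cross_val_score(clf, X, is_test, cv=5, scoring="roc_auc").mean()

# AUC ~0.5: the sets look alike. AUC near 1.0: the distributions are different.
```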
4. Data distribution shifts happen when production data diverges from the
training data. They are the biggest challenge any production model faces and
degrade its performance over time.
5. Concept drift occurs whenever the relationship between model inputs and
outputs changes because the model’s underlying assumptions aren’t the same.
6. An example of concept drift is how credit card fraud has changed over time as providers figure out ways to prevent illegitimate transactions. In 2009, we had magnetic stripes that were easy to hack, and most of the fraud happened physically at the point of sale. In 2024, we have security chips, so most of the fraud happens online.
7. Data drift occurs whenever the input data used by the model to make
predictions changes as compared to the data used to train the model.
8. An example of data drift is when a model that you trained on data from young
users starts processing data coming from older people.
9. Data drift can happen because of bugs, changes in the input data schema, changes in the distribution of a feature, or changes in the meaning of the data over time.
Monitoring strategies
1. Unit testing of individual components and end-to-end system testing are
important but not sufficient. The best way to identify model degradation is to
continuously monitor your model’s performance over time.
4. Any changes in the input data used by a model are useful signals for
understanding the system's health. Monitoring should validate the schema of
input data.
5. Track that input features are within acceptable ranges, monitor their statistics, and check that categorical values belong to a predefined set and follow the correct format. For example, compute the minimum, maximum, mean, and median values of a feature and ensure they are within an acceptable range, or ensure that the value of one feature is consistent with the value of another feature.
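A minimal sketch of such a check; the expected ranges and categorical values are hypothetical and would normally come from statistics computed on the training data:

```python
# A minimal input-validation check for monitoring. The expectations are made up
# and would normally be derived from the training data.
EXPECTATIONS = {
    "age":     {"min": 18, "max": 100},
    "country": {"allowed": {"US", "CA", "MX"}},
}

def validate(record: dict) -> list:
    """Return a list of violations for one incoming prediction request."""
    problems = []
    for feature, rules in EXPECTATIONS.items():
        value = record.get(feature)
        if value is None:
            problems.append(f"{feature}: missing")
        elif "min" in rules and not (rules["min"] <= value <= rules["max"]):
            problems.append(f"{feature}: {value} outside [{rules['min']}, {rules['max']}]")
        elif "allowed" in rules and value not in rules["allowed"]:
            problems.append(f"{feature}: unexpected value {value!r}")
    return problems

print(validate({"age": 130, "country": "US"}))   # ['age: 130 outside [18, 100]']
```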
6. Operational metrics convey the health of a system. It doesn’t matter how good
a model is if you can’t keep the system up and running serving its predictions.
7. Some operational metrics you can monitor are latency, throughput, hardware
utilization, number of requests, and how many of them were successful. For
example, you can monitor the percentage of CPU and GPU usage and the
memory utilization of a deployed model.
9. Assuming your model hasn’t changed, any changes in the distribution of the
model predictions generally indicate a distribution shift in the inputs.
10. Ground truth data is essential for model monitoring. The system should store
and label model requests to determine how model performance changes over
time.
3. The more data you use to train a model, the more likely the distribution of
production data will match the distribution of the training data.
Continual improvements
1. Real-world applications don't work with static data. Production machine learning systems must be capable of adapting and learning from new data.
2. The machine learning cycle starts with data collection, labeling, training a
model, and evaluating it. If the model is good enough, we deploy it and
monitor it. Whenever the model's performance dips, we collect more data and
repeat the cycle.
3. Most people retrain models whenever they feel like it. Continual learning is the process of automatically updating models in a production system.
4. Never ask how frequently you should retrain a model. Instead, start retraining as frequently as you can. Make it work, make it right, make it fast.
13. By comparing the performance of different models trained with data of varying
recency, we can estimate the rate at which a model’s performance falls off over
time and how often it might be necessary to retrain it.
15. Data is only valuable if it increases diversity and improves the representation of real-world scenarios. Trim anything that doesn't lead to better predictions.
17. Labels are fundamental for continual learning. Labeling will likely become the bottleneck that will limit the retraining frequency of your models.
Retraining strategies
1. The most popular option to retrain a model is doing it from scratch using all of the data you have available. This strategy is also known as stateless training.
5. Stateful training is faster and cheaper because we only need to use new data to retrain the model. Its main disadvantage is the potential for the model to suffer from catastrophic forgetting.
Testing in production
1. Offline evaluation is not enough to test a model. Continual learning requires a different testing strategy to ensure continuous improvements.
2. Offline evaluation cannot determine how a model will react to production data it has never seen before. We need to test the model in production.
[6] Alex Egg. "Online Learning for Recommendations at Grubhub." arXiv:2107.07106.
5. Canary releases are great for ensuring the operational stability and reliability of the entire system. We can control who is exposed to the candidate model.
6. A canary release will also expose some users to a potentially inferior model.
The feedback we collect from these users may not represent the larger user
base.