Module 2 Own Notes
Task:
Data:
Model Type:
Performance Measure:
Company and Task: You are working at Machine Learning Housing Corporation. Your first
task is to build a model to predict housing prices in California.
Data Source: The model will be trained using California census data.
Data Details: This data includes various metrics for each district (block group), such as
population, median income, and most importantly, the median housing price you're trying to
predict.
Data Unit: The data refers to block groups, the smallest unit with census data (typically
600 to 3,000 people). These will be called "districts" for convenience.
Model Goal: The model should learn from the data and use that knowledge to predict the
median housing price in any district, considering the other available metrics (population,
income, etc.).
The passage highlights that building a model is just a means to an end. You
need to understand the company's business objective for using the model.
This objective will influence various aspects of your project, such as:
o Problem Framing: How you define the problem (supervised learning,
regression, etc.) depends on the desired outcome.
o Algorithm Selection: The best algorithms for your model depend on the
problem type.
o Performance Measure: You'll choose a metric (like RMSE) that reflects how
well the model meets the business goal.
o Development Effort: The amount of time spent refining the model depends
on its impact on the business.
The passage also briefly introduces data pipelines, which are common in
Machine Learning.
o They involve a sequence of components that process and transform data.
o These components can operate independently and communicate through
data storage.
o This modular design makes the system scalable and easier to manage.
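The points above can be illustrated with a minimal sketch, assuming a
temporary directory stands in for the data store. The component names
(ingest, clean, add_features), the file names, and the sample records are
hypothetical and only serve the illustration. Each component reads only what
the previous one wrote to the store, so any of them could be replaced or
rerun on its own.

```python
import json
import tempfile
from pathlib import Path


def ingest(store: Path) -> None:
    """Component 1: write raw records to the data store (made-up values)."""
    raw = [
        {"district": "A", "population": 1200, "median_income": 3.5},
        {"district": "B", "population": None, "median_income": 5.1},
        {"district": "C", "population": 800, "median_income": 2.2},
    ]
    (store / "raw.json").write_text(json.dumps(raw))


def clean(store: Path) -> None:
    """Component 2: read raw records, drop incomplete ones, write the result."""
    raw = json.loads((store / "raw.json").read_text())
    cleaned = [r for r in raw if all(v is not None for v in r.values())]
    (store / "clean.json").write_text(json.dumps(cleaned))


def add_features(store: Path) -> None:
    """Component 3: read cleaned records and add a derived column."""
    cleaned = json.loads((store / "clean.json").read_text())
    for r in cleaned:
        r["population_thousands"] = r["population"] / 1000
    (store / "features.json").write_text(json.dumps(cleaned))


def run_pipeline(store: Path) -> list:
    """Run the components in order; they communicate only through the store."""
    for component in (ingest, clean, add_features):
        component(store)
    return json.loads((store / "features.json").read_text())


with tempfile.TemporaryDirectory() as tmp:
    result = run_pipeline(Path(tmp))
    print([r["district"] for r in result])
```

In a real system each component would typically run as its own process on
its own schedule; the sequential loop here is a simplification.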
By understanding the business objective and current state, you can now
define the problem precisely and design a Machine Learning model that
effectively addresses the company's needs.
This passage explains data pipelines, which are a fundamental concept in
machine learning systems that handle large amounts of data. Here's a
breakdown of the key points:
Machine learning often deals with vast datasets that require cleaning,
manipulation, and transformation before they can be used to train models.
Data pipelines automate these processes, making them more efficient and
reliable.
Monitoring: It's crucial to monitor the pipelines to ensure data isn't getting
stuck or corrupted somewhere. Unidentified issues can lead to stale or
inaccurate data, impacting the overall system's performance.
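One simple form of monitoring is a freshness check on each component's
output. The sketch below is an assumption about how such a check might look,
not a description of any particular tool: the function names, the component
names, the timestamps, and the one-hour threshold are all hypothetical.

```python
import time
from typing import Optional


def is_stale(last_updated: float, max_age_seconds: float,
             now: Optional[float] = None) -> bool:
    """Return True if a component's output is older than the allowed age."""
    if now is None:
        now = time.time()
    return (now - last_updated) > max_age_seconds


def stale_components(last_updates: dict, max_age_seconds: float,
                     now: Optional[float] = None) -> list:
    """List the components whose latest output is too old."""
    return sorted(name for name, ts in last_updates.items()
                  if is_stale(ts, max_age_seconds, now))


# Hypothetical timestamps (in seconds) of each component's last output.
now = 10_000.0
last_updates = {"ingest": 9_990.0, "clean": 9_950.0, "add_features": 6_000.0}
print(stale_components(last_updates, max_age_seconds=3_600, now=now))
```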
Examining the current approach can provide valuable insights for building
your machine learning model.
It can offer a baseline for performance comparison and suggest ways to
improve upon the existing solution.
The California census data appears to be a suitable source for training the
model.
This data includes both the median housing prices (desired output) and other
relevant features (population, income, etc.) for many districts.
With this understanding of the current solution and available data, you can
now define the machine learning problem precisely.
The passage asks you to consider the type of learning (supervised,
unsupervised, etc.) and the specific task (classification, regression, etc.)
based on the information provided.
This passage dives into defining the machine learning problem you'll be
tackling based on the information gathered so far. Here's a breakdown:
Before designing the system (your machine learning model), you need to
clearly define the problem it's trying to solve. This involves specifying several
aspects:
Regression Specifics:
The California census dataset is small enough to fit in memory, so it can be
processed all at once.
There is no continuous flow of new data and no need for real-time updates on
housing prices, so batch learning is sufficient: the model is trained on the
entire dataset and then used for predictions.
The passage asks you to analyze the information and identify these aspects
of the machine learning problem yourself before revealing the answers.
Understanding these aspects helps choose the right algorithms, performance
measures, and development approach for building your model.
This passage covers two key aspects of building your machine learning
model: selecting a performance measure and checking assumptions.
You'll need a metric to evaluate how well your model performs in predicting
housing prices.
A common choice for regression problems is the Root Mean Squared Error
(RMSE).
The formula for RMSE is provided (Equation 2-1): it is the square root of the
average squared difference between predicted and actual housing prices.
Because the errors are squared before averaging, larger errors carry more
weight.
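For reference, the standard definition of RMSE can be written as follows.
This is a reconstruction, since the equation itself is not reproduced in
these notes. Here m is the number of districts, x^(i) is the feature vector
of district i, y^(i) is its actual median housing price, and h is the model's
prediction function.

```latex
\mathrm{RMSE}(\mathbf{X}, h) =
  \sqrt{\frac{1}{m} \sum_{i=1}^{m} \left( h\!\left(\mathbf{x}^{(i)}\right) - y^{(i)} \right)^{2}}
```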
The passage acknowledges that while RMSE is common, other options might
be better suited in specific situations.
For example, if there are many outliers (extreme data points) in your data,
Mean Absolute Error (MAE) might be a better choice (Equation 2-2). MAE
averages the absolute differences between predictions and actual values, so
a single large error does not dominate the result, making it less sensitive
to outliers.
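A small numeric sketch can make the difference concrete. The prices below
are made up for illustration (in thousands of dollars); the functions follow
the usual definitions, with MAE being the mean of |prediction - actual|.

```python
import math


def rmse(actual, predicted):
    """Root mean squared error: sqrt of the mean of squared differences."""
    squared = [(p - a) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared) / len(squared))


def mae(actual, predicted):
    """Mean absolute error: mean of the absolute differences."""
    absolute = [abs(p - a) for a, p in zip(actual, predicted)]
    return sum(absolute) / len(absolute)


# Made-up prices in thousands of dollars.
actual = [200, 150, 300, 250]
pred_no_outlier = [210, 140, 310, 240]    # every prediction is off by 10
pred_with_outlier = [210, 140, 310, 350]  # the last prediction is off by 100

print("no outlier:   RMSE =", rmse(actual, pred_no_outlier),
      " MAE =", mae(actual, pred_no_outlier))
print("with outlier: RMSE =", rmse(actual, pred_with_outlier),
      " MAE =", mae(actual, pred_with_outlier))
```

When all errors have the same size, the two measures agree. With one large
error, RMSE grows more than MAE because that error is squared before
averaging.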
Both RMSE and MAE can be viewed as ways to measure the "distance"
between two sets of values: predicted prices and actual prices.
These distances are calculated using different mathematical norms:
o RMSE uses the Euclidean norm (ℓ2 norm), which is the familiar distance
formula based on squares.
o MAE uses the ℓ1 norm (Manhattan norm), which considers the sum of
absolute differences, similar to navigating a city grid.
Generally, the higher the norm index, the more the norm focuses on large
values and neglects small ones. This is why RMSE (ℓ2) is more sensitive to
outliers than MAE (ℓ1). So, if outliers are a concern, MAE might be
preferable.
The passage also briefly mentions other norms (ℓ0, ℓ∞) but focuses on ℓ1 and
ℓ2 for simplicity.
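The norms mentioned above can be sketched in a few lines. The helper name
lk_norm and the error values are hypothetical. The ℓk norm of a vector v is
(sum of |v_i|^k)^(1/k); ℓ0 counts the nonzero elements and ℓ∞ gives the
largest absolute value. The sketch also checks how RMSE and MAE relate to the
ℓ2 and ℓ1 norms of the error vector.

```python
import math


def lk_norm(v, k):
    """l_k norm of a vector v, including the special cases k = 0 and k = inf."""
    if k == 0:
        return float(sum(1 for x in v if x != 0))  # number of nonzero elements
    if k == math.inf:
        return float(max(abs(x) for x in v))       # largest absolute value
    return sum(abs(x) ** k for x in v) ** (1.0 / k)


# Made-up prediction errors (prediction minus actual value).
errors = [10, -10, 10, -100]
m = len(errors)

mae_value = lk_norm(errors, 1) / m              # MAE  = l1 norm / m
rmse_value = lk_norm(errors, 2) / math.sqrt(m)  # RMSE = l2 norm / sqrt(m)

print("l0:", lk_norm(errors, 0), " linf:", lk_norm(errors, math.inf))
print("MAE:", mae_value, " RMSE:", rmse_value)
```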
Checking Assumptions
It's crucial to verify the assumptions made throughout the process to avoid
building a model that doesn't meet the actual needs.
The passage provides a good example: You initially assumed the downstream
system uses the predicted prices directly. But what if it categorizes them
(cheap, medium, expensive) instead?
In that case, predicting the exact price wouldn't be important. You'd actually
need a classification model, not a regression model like you're building.
Fortunately, you confirm with the downstream team that they do require actual
prices.
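To illustrate the categorization scenario, here is a minimal sketch. The
function name and the two price thresholds are hypothetical, chosen only for
the example. If the downstream system only looked at the category, two quite
different price predictions could be equally "correct".

```python
def price_category(price, cheap_below=150_000, expensive_from=350_000):
    """Map a predicted price to a category using hypothetical thresholds."""
    if price < cheap_below:
        return "cheap"
    if price < expensive_from:
        return "medium"
    return "expensive"


# Two predictions that differ by $100,000 land in the same category.
for predicted_price in (200_000, 300_000):
    print(predicted_price, "->", price_category(predicted_price))
```

Under that scenario, a regression metric such as RMSE would penalize
differences that the downstream system never sees.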
Key Takeaways
Choosing the right performance measure depends on factors like the type of
problem (regression) and the presence of outliers.
Verifying assumptions early helps ensure your model is designed to address
the true business objective.