Machine Learning Process Lifecycle: talat@amii.ca, luke@amii.ca, shazan@amii.ca, sankalp@amii.ca
Machine learning has been gaining popularity because of its usefulness and its ability to learn from data. Unlike conventional software development, whose development lifecycle has been well studied, machine learning solution development is a fairly new and less understood process. Developing a machine learning solution is usually exploratory, and closely tied to the problem it addresses and the associated data.
What is MLPL?
MLPL stands for ‘Machine Learning Process Lifecycle’. It is a framework that captures the iterative process of developing a machine learning solution for a specific problem. Defining and understanding the business domain and the data related to that problem plays a key role in arriving at a good ML solution. As previously mentioned, ML solution development is very much an exploratory and experimental process in which different learning algorithms and methods are tried before arriving at a satisfactory solution. Almost always, back-and-forth passes between the stages of this process, from understanding the problem to producing an ML solution, are required to meet the business expectations. MLPL captures this dynamic workflow between stages and the sequence in which the stages are carried out.
At the end of this exploration, if one is able to answer the above questions, then the exploration is headed in the right direction.
Risk Mitigation: MLPL standardizes the stages of an ML project and defines standard modules for each stage, thereby minimizing the risk of missing important ML practices. A checklist is always handy for verifying that each module has been implemented, or at least not missed.
Tracking: This is probably one of the most important motivations for introducing MLPL into your ML exploration workflow. MLPL allows you to track the different stages and the modules inside each stage. Because this is an exploratory task, there are many throwaways in different stages that will never be used in the final ML solution but that have been invested in. These throwaways need to be tracked to document the resources spent on them and to capture the lessons learned for any future iteration.
MLPL - AN OVERVIEW
By the end of this stage, we will have identified and defined our goals and other aspects that help us understand the problem better and dive deeper into subsequent stages of MLPL. Any tools or worksheets that are helpful can be used at this stage.
● Acquisition: Acquiring the data is an important task; after all, the whole reason for using machine learning is the data. At this point, we should have identified the data sources and gathered the data. Once the data sources have been identified, we want to combine the sources that help answer the questions defined in the objectives and then consolidate the raw data. In some cases, combining the data sources is not trivial and requires in-depth domain knowledge and expertise to work out how to align the data and combine it into one data source.
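As a minimal sketch of this consolidation step, the snippet below joins two hypothetical data sources on a shared key. The field names (site_id, temp, region) are illustrative only; real alignment rules should come from a domain expert.

```python
def combine_sources(readings, metadata, key="site_id"):
    """Join raw readings with per-site metadata on a shared key."""
    meta_by_key = {row[key]: row for row in metadata}
    combined = []
    for reading in readings:
        meta = meta_by_key.get(reading[key])
        if meta is None:
            # No matching metadata; a domain expert decides how to handle this.
            continue
        # Reading values take precedence on any conflicting keys.
        combined.append({**meta, **reading})
    return combined

# Hypothetical raw sources to be consolidated.
readings = [{"site_id": 1, "temp": 21.5}, {"site_id": 2, "temp": 19.0}]
metadata = [{"site_id": 1, "region": "north"}, {"site_id": 2, "region": "south"}]
rows = combine_sources(readings, metadata)
```

In practice a dataframe library's join operations would replace this hand-rolled loop, but the shape of the task is the same: pick a key, decide a conflict rule, and decide what happens to unmatched rows.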
● Cleaning: In the real world, data is usually corrupt for various reasons. Inaccurate sensor readings, inconsistencies across readings, and invalid values are some of the issues you might find. A thorough analysis of how to fix these values should be carried out with the help of a domain and data expert.
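One simple cleaning strategy, sketched below, is to treat missing or out-of-range sensor readings as invalid and impute them with the median of the valid ones. The valid range and the choice of imputation are assumptions that should be confirmed with a domain expert, not defaults to apply blindly.

```python
def clean_readings(values, low, high):
    """Replace missing or out-of-range readings with the median of valid ones."""
    valid = sorted(v for v in values if v is not None and low <= v <= high)
    if not valid:
        raise ValueError("no valid readings to impute from")
    mid = len(valid) // 2
    median = valid[mid] if len(valid) % 2 else (valid[mid - 1] + valid[mid]) / 2
    return [v if v is not None and low <= v <= high else median for v in values]

# Hypothetical temperature stream: one missing value, one sensor glitch (999.0).
cleaned = clean_readings([20.0, None, 999.0, 22.0], low=-40.0, high=60.0)
```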
● Processing: Data that has been read in is still not consumable by the machine learning process. Data pre-processing might involve techniques such as normalization, standardization, or scaling of the data.
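The two techniques named above can be sketched in a few lines: min-max scaling maps values into [0, 1], while standardization centers them to zero mean and unit standard deviation. Libraries such as scikit-learn provide production versions of both; this is only to make the arithmetic concrete.

```python
from statistics import mean, pstdev

def min_max_scale(values):
    """Rescale values linearly into the [0, 1] range."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

def standardize(values):
    """Center to zero mean and unit (population) standard deviation."""
    mu, sigma = mean(values), pstdev(values)
    return [(v - mu) / sigma for v in values]
```

Note that whichever transformation is chosen, its parameters (min/max or mean/std) must be computed on the training data and reused unchanged when scoring new data.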
● Pipeline: Set up a process to score new data or refresh the data regularly as part of an ongoing learning process.
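One way to realize such a pipeline, sketched here under the assumption that each preparation step is a plain function, is to compose the steps into a single callable so that new or refreshed data always flows through exactly the same process. The step functions are illustrative placeholders.

```python
def make_pipeline(*steps):
    """Compose data-preparation steps into one callable applied in order."""
    def run(data):
        for step in steps:
            data = step(data)
        return data
    return run

# Hypothetical pipeline: drop missing rows, then round to one decimal place.
pipeline = make_pipeline(
    lambda rows: [r for r in rows if r is not None],
    lambda rows: [round(r, 1) for r in rows],
)
```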
● Exploratory Data Analysis: Engage in exploratory data analysis to gain an understanding of the data. Understanding the data is very important and can lead to better design and selection of the machine learning process. It also gives an in-depth understanding of what information could be useful for further steps.
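A typical first move in EDA is to compute descriptive statistics per column, as in the small sketch below; in practice one would reach for a dataframe library's summary functions and plots, but the underlying quantities are the same.

```python
from statistics import mean, pstdev

def summarize(column):
    """Basic descriptive statistics for one numeric column."""
    return {
        "count": len(column),
        "mean": mean(column),
        "std": pstdev(column),
        "min": min(column),
        "max": max(column),
    }
```

Comparing these summaries across subsets of the data (for example, per season) is often what surfaces the coverage gaps discussed later in this document.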
MLME
This is the stage where we confirm whether the machine learning problem addresses the business problem. Having a conversation with the client is very important for verifying that it does. From the delivery perspective, an ML solution is to be delivered to the client, in one or more of the three forms below.
1. Prototype: The source code of the prototype is provided along with readme and dependency files explaining how to use it. The prototype need not be production-level code, but it should be clean, commented, and relatively stable, so that the engineering teams can take it and build a product around it.
2. Documentation: Good documentation always accompanies a prototype. Some of the technical details should be listed and explained.
3. Project Report: This is a complete list of the methodologies used and the decisions taken over the lifetime of the project, along with the reasons behind those decisions. It gives a high-level picture of what was achieved in the project.
What we expect is shown above, but what typically happens is shown in the picture below. The picture shows how frequently we have to apply the brakes on what we are doing and go back to a previous stage, or to an earlier module in the same stage. This switch to a different stage or module is what we call a lifecycle switch. A lifecycle switch forces you to revisit some of the modules you have already completed because, remember, it is all about the original business problem: if anything changes midway, there is a high chance that components already visited will change too. In one example, shown below, EDA revealed that data was available for only one season, even though the goal was a universal model for the whole year. In such a case, we would have to acquire more data, but the additional data may change many properties of the dataset. How we compute missing values, and the assumptions we made earlier for data processing and cleaning, would also change. Therefore, we must move through the modules sequentially without skipping any.
Reasons for a Life-cycle Switch
Final words