MC4301 - ML Unit 1 (Introduction)
Human Learning
Learning is the process of acquiring new understanding, knowledge, behaviors, skills,
values, attitudes, and preferences.
Learning consists of complex information processing, problem-solving,
decision-making in uncertainty and the urge to transfer knowledge and skills
into new, unknown settings.
The process of learning is continuous which starts right from the time of birth of an
individual and continues till the death. We all are engaged in the learning endeavours
in order to develop our adaptive capabilities as per the requirements of the changing
environment.
John B. Watson was one of the first thinkers to demonstrate that behavioural changes occur as a result of learning. Watson is regarded as the founder of the behavioural school of thought, which gained prominence during the first half of the 20th century.
Crow and Crow defined learning as the process of acquisition of knowledge, habits
and attitudes.
According to E. A. Peel, learning can be described as a change in the individual which takes place as a result of changes in the environment.
Types of Learning
6. Attitude Learning: Attitude shapes our behaviour to a very great
extent, as our positive or negative behaviour is based on our attitudinal
predisposition.
The behavioural school of thought, founded by John B. Watson and set out in his seminal work, "Psychology as the Behaviorist Views It", stressed that psychology is an objective science; mental processes should therefore not be the main object of study, since such processes cannot be objectively measured or observed.
Watson tried to prove his theory with his famous Little Albert experiment, in which he conditioned a young child to be scared of a white rat.
Behavioural psychology describes three types of learning: classical conditioning, observational learning and operant conditioning.
Classical conditioning is explained with the help of Pavlov's classic experiment, in which food was used as the natural stimulus and paired with a previously neutral stimulus, in this case a bell. By establishing an association between the natural stimulus (food) and the neutral stimulus (the sound of the bell), the desired response can be elicited. This theory is discussed in more detail later.
Machine Learning
Machine learning is a branch of artificial intelligence that enables systems to learn from data and improve with experience rather than being explicitly programmed. Artificial intelligence systems are used to perform complex tasks in a way that is similar to how humans solve problems.
Machine learning is used in internet search engines, email filters to sort out spam,
websites to make personalised recommendations, banking software to detect unusual
transactions, and lots of apps on our phones such as voice recognition.
Types
As with any method, there are different ways to train machine learning algorithms,
each with their own advantages and disadvantages. To understand the pros and cons
of each type of machine learning, we must first look at what kind of data they ingest.
In ML, there are two kinds of data — labeled data and unlabeled data.
Labeled data has both the input and output parameters in a completely
machine-readable pattern, but requires a lot of human labor to label the data, to begin
with. Unlabeled data only has one or none of the parameters in a machine-readable
form. This negates the need for human labor but requires more complex solutions.
There are also some types of machine learning algorithms that are used in very
specific use-cases, but three main methods are used today.
Supervised Learning
Supervised learning is one of the most basic types of machine learning. In this type,
the machine learning algorithm is trained on labeled data. Even though the data needs
to be labeled accurately for this method to work, supervised learning is extremely
powerful when used in the right circumstances.
The algorithm then finds relationships between the parameters given, essentially establishing a mapping between the input and output variables in the dataset. At the end of training, the algorithm has an idea of how the data works and of the relationship between the input and the output.
This solution is then deployed for use with the final dataset, which it learns from in the same way as the training dataset. This means that supervised machine learning algorithms can continue to improve even after being deployed, discovering new patterns and relationships as they train on new data.
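As a minimal sketch of supervised learning, the following example trains a classifier on a small labeled dataset using scikit-learn; the dataset, the train/test split and the choice of model are illustrative assumptions, not anything prescribed above.

    # Minimal supervised learning sketch (illustrative; assumes scikit-learn is installed).
    from sklearn.datasets import load_iris
    from sklearn.model_selection import train_test_split
    from sklearn.linear_model import LogisticRegression
    from sklearn.metrics import accuracy_score

    # Labeled data: inputs X and known outputs y.
    X, y = load_iris(return_X_y=True)

    # Hold out part of the labeled data to check generalisation.
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

    # Train the model on the labeled training set.
    model = LogisticRegression(max_iter=1000)
    model.fit(X_train, y_train)

    # Evaluate how well the learned input-output relationship generalises.
    print("Test accuracy:", accuracy_score(y_test, model.predict(X_test)))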
Unsupervised Learning
Unsupervised machine learning holds the advantage of being able to work with
unlabeled data. This means that human labor is not required to make the dataset
machine-readable, allowing much larger datasets to be worked on by the program.
In supervised learning, the labels allow the algorithm to find the exact nature of the
relationship between any two data points. However, unsupervised learning does not
have labels to work off of, resulting in the creation of hidden structures. Relationships
between data points are perceived by the algorithm in an abstract manner, with no
input required from human beings.
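A minimal sketch of unsupervised learning, assuming scikit-learn: k-means clustering groups unlabeled points purely from their structure, with no human-provided labels. The synthetic data and the choice of three clusters are assumptions made for the example.

    # Minimal unsupervised learning sketch (illustrative; assumes scikit-learn is installed).
    import numpy as np
    from sklearn.cluster import KMeans

    # Unlabeled data: only inputs, no output labels.
    rng = np.random.default_rng(0)
    X = np.vstack([
        rng.normal(loc=0.0, scale=0.5, size=(50, 2)),
        rng.normal(loc=5.0, scale=0.5, size=(50, 2)),
        rng.normal(loc=10.0, scale=0.5, size=(50, 2)),
    ])

    # The algorithm finds hidden structure (clusters) on its own.
    kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)

    print("Cluster centres:\n", kmeans.cluster_centers_)
    print("First ten cluster assignments:", kmeans.labels_[:10])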
Reinforcement Learning
Reinforcement learning directly takes inspiration from how human beings learn from
data in their lives. It features an algorithm that improves upon itself and learns from
new situations using a trial-and-error method. Favorable outputs are encouraged or
‘reinforced’, and non-favorable outputs are discouraged or ‘punished’.
Based on the psychological concept of conditioning, reinforcement learning works by
putting the algorithm in a work environment with an interpreter and a reward system.
In every iteration of the algorithm, the output result is given to the interpreter, which
decides whether the outcome is favorable or not.
In case of the program finding the correct solution, the interpreter reinforces the
solution by providing a reward to the algorithm. If the outcome is not favorable, the
algorithm is forced to reiterate until it finds a better result. In most cases, the reward
system is directly tied to the effectiveness of the result.
In typical reinforcement learning use-cases, such as finding the shortest route between
two points on a map, the solution is not an absolute value. Instead, it takes on a score
of effectiveness, expressed in a percentage value. The higher this percentage value is,
the more reward is given to the algorithm. Thus, the program is trained to give the
best possible solution for the best possible reward.
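As a rough illustration of the reward-driven, trial-and-error loop described above, here is a tiny tabular Q-learning sketch; the toy "walk to the goal" environment, reward values and hyperparameters are assumptions made purely for demonstration.

    # Minimal tabular Q-learning sketch (illustrative toy environment).
    import random

    N_STATES = 5            # states 0..4, the goal is state 4
    ACTIONS = [-1, +1]      # move left or right
    alpha, gamma, epsilon = 0.5, 0.9, 0.2

    # Q-table: estimated future reward for each (state, action) pair.
    Q = [[0.0 for _ in ACTIONS] for _ in range(N_STATES)]

    for episode in range(200):
        state = 0
        while state != N_STATES - 1:
            # Trial and error: mostly exploit the best known action, sometimes explore.
            a = random.randrange(len(ACTIONS)) if random.random() < epsilon \
                else max(range(len(ACTIONS)), key=lambda i: Q[state][i])
            next_state = min(max(state + ACTIONS[a], 0), N_STATES - 1)
            # The "interpreter"/reward system: the favourable outcome is reaching the goal.
            reward = 1.0 if next_state == N_STATES - 1 else 0.0
            # Reinforce: move the estimate towards reward + discounted future value.
            Q[state][a] += alpha * (reward + gamma * max(Q[next_state]) - Q[state][a])
            state = next_state

    print("Learned policy (0=left, 1=right):",
          [max(range(2), key=lambda i: Q[s][i]) for s in range(N_STATES)])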
We are always amazed at how machine learning has made such an impact on our lives.
There is no doubt that ML will completely change the face of various industries, as
well as job profiles. While it offers a promising future, there are some inherent
problems at the heart of ML and AI advancements that put these technologies at a
disadvantage. While it can solve a plethora of challenges, there are a few tasks which
ML fails to answer.
1. Reasoning Power
One area where ML has not mastered successfully is reasoning power, a distinctly
human trait. Algorithms available today are mainly oriented towards specific
use-cases and are narrowed down when it comes to applicability. They cannot think as
to why a particular method is happening that way or ‘introspect’ their own outcomes.
In other words, ML algorithms lack the ability to reason beyond their intended
application.
2. Contextual Limitation
If we consider the area of natural language processing (NLP), text and speech information are the means by which NLP algorithms understand language. They may learn letters, words, sentences or even the syntax, but where they fall short is the context of the language. Algorithms do not understand the context in which language is used. A classic example of this is the "Chinese room" argument given by philosopher John Searle, which says that computer programs or algorithms manipulate mere 'symbols' without grasping the context behind them.
So, ML does not have an overall idea of the situation. It is limited by mnemonic
interpretations rather than thinking to see what is actually going on.
3. Scalability
In addition, growing volumes of data have to be handled the right way when shared on an ML platform, which again requires examination through knowledge and intuition that current ML apparently lacks.
ML usually needs considerable (in fact, massive) amounts of data in stages such as training, cross-validation, etc. Sometimes this data includes private as well as general information, and this is where it gets complicated. Most tech companies hold privatised data, and these data are the ones that are actually useful for ML applications. But there comes the risk of wrong usage of data, especially in critical areas such as medical research and health insurance.
Even though data is sometimes anonymised, it can still be vulnerable. This is the reason regulatory rules are imposed heavily when it comes to using private data.
4. Internal Working of Deep Learning
Deep learning, a sub-field of ML, is actually responsible for today's AI growth. What was once just a theory has turned out to be the most powerful aspect of ML. Deep Learning (DL) now powers applications such as voice recognition, image recognition and so on through artificial neural networks.
But the internal working of DL is still largely unknown and yet to be fully understood. Advanced DL algorithms still baffle researchers in terms of their working and efficiency. The millions of neurons that form the neural networks in DL increase abstraction at every level, which cannot easily be comprehended. This is why deep learning is dubbed a 'black box': its internal workings are unknown.
Applications
1. Social Media Features
Social media platforms use machine learning algorithms and approaches to create some attractive and excellent features. For instance, Facebook notices and records your activities, chats, likes, comments, and the time you spend on specific kinds of posts. Machine learning learns from your own experience and makes friend and page suggestions for your profile.
2. Product Recommendations
3. Image Recognition
4. Sentiment Analysis
Machine learning algorithms are used to develop behavior models for endangered
cetaceans and other marine species, helping scientists regulate and monitor their
populations.
7. Regulating Healthcare Efficiency and Medical Services
9. Banking Domain
Banks are now using the latest advanced technology machine learning has to offer to
help prevent fraud and protect accounts from hackers. The algorithms determine what
factors to consider to create a filter to keep harm at bay. Various sites that are
unauthentic will be automatically filtered out and restricted from initiating
transactions.
Your model will need to be taught what you want it to learn. Feeding in relevant past data will help the machine draw patterns and act accordingly. It is imperative to provide relevant data and feed files to help the machine learn what is expected. In this case, with machine learning, the results you strive for depend on the contents of the files that are being recorded.
Languages/Tools
1. Python Programming Language
With over 8.2 million developers across the world using Python for coding, Python ranks first in the latest annual ranking of popular programming languages by IEEE Spectrum with a score of 100. Stack Overflow programming language trends clearly show that it is the only language that has been on the rise for the last five years.
Python is the preferred programming language of choice for machine learning for
some of the giants in the IT world including Google, Instagram, Facebook, Dropbox,
Netflix, Walt Disney, YouTube, Uber, Amazon, and Reddit. Python is an indisputable
leader and by far the best language for machine learning today and here’s why:
In-built Libraries and Packages
Python's in-built libraries and packages provide base-level code so machine learning engineers don't have to start writing from scratch. Machine learning requires continuous data processing, and Python has in-built libraries and packages for almost every task. This helps machine learning engineers reduce development time and improve productivity when working with complex machine learning applications. The best part of these libraries and packages is that there is almost no learning curve: once you know the basics of Python programming, you can start using them.
Code Readability
The math behind machine learning is usually complicated and unobvious. Thus, code
readability is extremely important to successfully implement complicated machine
learning algorithms and versatile workflows. Python’s simple syntax and the
importance it puts on code readability makes it easy for machine learning engineers to
focus on what to write instead of thinking about how to write. Code readability makes
it easier for machine learning practitioners to easily exchange ideas, algorithms, and
tools with their peers. Python is not only popular among machine learning engineers, but it is also one of the most popular programming languages among data scientists.
Flexibility
The multiparadigm and flexible nature of Python makes it easy for machine learning
engineers to approach a problem in the simplest way possible. It supports the
procedural, functional, object-oriented, and imperative style of programming allowing
machine learning experts to work comfortably on what approach fits best. The
flexibility Python offers helps machine learning engineers choose the programming style based on the type of problem – sometimes it is beneficial to capture state in an object, while at other times the problem might require passing functions around as parameters. Python provides flexibility in choosing either approach and minimises the likelihood of errors. Python also offers a lot of flexibility when it comes to implementing changes, not only in terms of programming styles: machine learning practitioners need not recompile the source code to see their changes.
2. R Programming Language
With more than 2 million R users, 12000 packages in the CRAN open-source
repository, close to 206 R Meetup groups, over 4000 R programming questions asked
every month, and 40K+ members on LinkedIn’s R group – R is an incredible
programming language for machine learning written by a statistician for statisticians.
The R language can also be used by non-programmers, including data miners, data analysts, and statisticians.
Machine learning practitioners can mix tools – choose the best tool for each task and also enjoy the benefits of other tools along with R.
3. Java
Java has plenty of third-party libraries for machine learning. Java-ML is a machine learning library that provides a collection of machine learning algorithms implemented in Java. You can also use the Arbiter Java library for hyperparameter tuning, which is an integral part of making ML algorithms run effectively, or the Deeplearning4j library, which supports popular machine learning algorithms such as k-nearest neighbours and lets you create neural networks, or Neuroph for neural networks.
Scalability is an important feature that every machine learning engineer must
consider before beginning a project. Java makes application scaling easier for
machine learning engineers, making it a great choice for the development of
large and complex machine learning applications from scratch.
The Java Virtual Machine is one of the best platforms for machine learning, as engineers can write code once and run it on multiple platforms. The JVM also helps machine learning engineers create custom tools at a rapid pace and has various IDEs that help improve overall productivity. Java works best for speed-critical machine learning projects as it executes quickly.
4. Julia
Julia is a high-performance, dynamic programming language designed for numerical and scientific computing, so machine learning engineers do not have to rely on handcrafted profiling techniques or optimisation techniques to solve performance problems.
Julia's code is universally executable: once a machine learning application is written, it can be compiled natively in Julia and can also be called from other languages such as Python or R through wrappers like PyCall or RCall.
Scalability, as discussed, is crucial for machine learning engineers, and Julia makes it easy to deploy quickly to large clusters. With powerful tools like TensorFlow, MLBase.jl, Flux.jl, ScikitLearn.jl, and many others that utilise the scalability provided by Julia, it is an apt choice for machine learning applications.
Julia offers support for editors like Emacs and Vim as well as IDEs like Visual Studio and Juno.
5. LISP
Created in 1958 by John McCarthy, LISP (List Processing) is the second-oldest programming language still in use and was developed mainly for AI-centric applications. LISP is a dynamically typed programming language that has influenced the creation of many machine learning programming languages like Python, Julia, and Java. LISP works through a Read-Eval-Print Loop (REPL) and has the capability to code, compile, and run code interactively.
LISP is considered one of the most efficient and flexible machine learning languages for solving specific problems, as it adapts to the solution a programmer is coding for. This is what makes LISP different from other machine learning languages. Today, it is particularly used for inductive logic problems and machine learning. The first AI chatbot, ELIZA,
was developed using LISP and even today machine learning practitioners can use it to
create chatbots for eCommerce. LISP definitely deserves a mention on the list of best
language for machine learning because even today developers rely on LISP for
artificial intelligence projects that are heavy on machine learning as LISP offers –
Despite being flexible for machine learning, LISP lacks the support of well-known machine learning libraries. LISP is neither a beginner-friendly machine learning language (it is difficult to learn) nor does it have a large user community like that of Python or R.
The best language for machine learning depends on the area in which it is going to be
applied, the scope of the machine learning project, which programming languages are
used in your industry/company, and several other factors. Experimentation, testing,
and experience help a machine learning practitioner decide on an optimal choice of
programming language for any given machine learning problem. Of course, the best
thing would be to learn at least two programming languages for machine learning as
this will help you put your machine learning resume at the top of the stack. Once you
are proficient in one machine learning language, learning another one is easy.
1. Microsoft Azure Machine Learning
Azure Machine Learning is a cloud platform that allows developers to build, train, and deploy AI models. Microsoft is constantly making updates and improvements to its machine learning tools and has recently announced changes to Azure Machine Learning, retiring the Azure Machine Learning Workbench.
2. IBM Watson
No, IBM’s Watson Machine Learning isn’t something out of Sherlock Holmes.
Watson Machine Learning is an IBM cloud service that uses data to put machine
learning and deep learning models into production. This machine learning tool allows
users to perform training and scoring, two fundamental machine learning operations.
Keep in mind, IBM Watson is best suited for building machine learning applications
through API connections.
3. Google TensorFlow
5. OpenNN
OpenNN, short for Open Neural Networks Library, is a software library that
implements neural networks. Written in C++ programming language, OpenNN offers
you the perk of downloading its entire library for free from GitHub or SourceForge.
Issues
Although machine learning is being used in every industry and helps organizations make more informed and data-driven choices that are more effective than classical methodologies, it still has many problems that cannot be ignored. Here are some common issues in machine learning that professionals face while building ML skills and creating an application from scratch.
The major issue that arises while using machine learning algorithms is the lack of quality as well as quantity of data. Although data plays a vital role in the processing of machine learning algorithms, many data scientists report that inadequate, noisy, and unclean data severely hamper machine learning algorithms. For example, a simple task may require thousands of sample data points, while an advanced task such as speech or image recognition needs millions of sample data examples. Further, data quality is also important for the algorithms to work ideally, but poor data quality is commonly found in machine learning applications. Data quality can be affected by factors such as the following:
As we have discussed above, data plays a significant role in machine learning, and it
must be of good quality as well. Noisy data, incomplete data, inaccurate data, and
unclean data lead to less accuracy in classification and low-quality results. Hence,
data quality can also be considered as a major common problem while processing
machine learning algorithms.
To make sure our model generalizes well, we have to ensure that the sample training data is representative of the new cases we need to generalize to. The training data must cover the cases that have already occurred as well as those that are newly occurring. If we use non-representative training data in the model, it results in less accurate predictions. A machine learning model is said to be ideal if it predicts well for generalized cases and provides accurate decisions. If there is too little training data, there will be sampling noise in the model, called a non-representative training set; it will not be accurate in its predictions and will be biased towards one class or group.
Hence, we should use representative data in training to protect against being biased
and make accurate predictions without any drift.
Overfitting:
Overfitting is one of the most common issues faced by machine learning engineers and data scientists. Whenever a machine learning model is trained on a huge amount of data, it can start capturing the noise and inaccuracies present in the training data set, which negatively affects the performance of the model. Let's understand this with a simple example where the training data set contains 1000 mangoes, 1000 apples, 1000 bananas, and 5000 papayas. There is then a considerable probability of an apple being identified as a papaya, because we have a massive amount of biased data in the training data set; hence the predictions are negatively affected. One commonly cited reason for overfitting is the use of highly flexible non-linear methods, as they can build unrealistic data models; overfitting can be reduced by using simpler linear and parametric algorithms in machine learning models.
Underfitting:
Underfitting occurs when our model is too simple to capture the underlying structure of the data, just like an undersized pant. This generally happens when we have limited data in the data set and we try to build a linear model with non-linear data. In such scenarios, the model cannot capture the complexity of the data, the rules of the machine learning model become too simple to apply to this data set, and the model starts making wrong predictions.
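As a hedged illustration of over- and underfitting (not part of the text above), the following sketch fits polynomial models of different complexity to noisy data with scikit-learn; the synthetic sine-shaped data and the chosen degrees are assumptions made for the example.

    # Illustrative over/underfitting sketch (assumes scikit-learn and NumPy).
    import numpy as np
    from sklearn.pipeline import make_pipeline
    from sklearn.preprocessing import PolynomialFeatures
    from sklearn.linear_model import LinearRegression
    from sklearn.model_selection import train_test_split
    from sklearn.metrics import mean_squared_error

    rng = np.random.default_rng(0)
    X = rng.uniform(0, 6, size=(80, 1))
    y = np.sin(X).ravel() + rng.normal(scale=0.2, size=80)   # noisy non-linear data

    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

    for degree in (1, 4, 15):   # too simple, reasonable, too flexible
        model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
        model.fit(X_train, y_train)
        print(f"degree={degree:2d}  "
              f"train MSE={mean_squared_error(y_train, model.predict(X_train)):.3f}  "
              f"test MSE={mean_squared_error(y_test, model.predict(X_test)):.3f}")
    # Degree 1 underfits (high error everywhere); degree 15 overfits
    # (low training error, noticeably higher test error).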
5. Monitoring and maintenance
As generalized output is mandatory for any machine learning model, regular monitoring and maintenance become compulsory. Different results for different actions require changes to the data; hence editing of code as well as resources for monitoring the model also become necessary.
A machine learning model operates in a specific context, and when that context shifts, the model can produce bad recommendations because of concept drift. Consider an example where, at a specific time, a customer is looking for some gadgets; the customer's requirements change over time, but the machine learning model keeps showing the same recommendations even though the customer's expectations have changed. This phenomenon is called data drift. It generally occurs when new data is introduced or the interpretation of the data changes. However, we can overcome this by regularly updating and monitoring the data according to expectations.
8. Customer Segmentation
The machine learning process itself is very complex, which is another major issue faced by machine learning engineers and data scientists. Machine learning and artificial intelligence are relatively new technologies, still in an experimental phase and continuously changing over time. Much of the work proceeds by hit-and-trial experimentation; hence the probability of error is higher than expected. The process also includes analyzing the data, removing data bias, training the data, applying complex mathematical calculations, etc., which makes the procedure more complicated and quite tedious.
Data bias is also a big challenge in machine learning. These errors occur when certain elements of the dataset are heavily weighted or given more importance than others. Biased data leads to inaccurate results, skewed outcomes, and other analytical errors. However, we can resolve this error by determining where the data is actually biased in the dataset and then taking the necessary steps to reduce it.
This issue is also very commonly seen in machine learning models. Machine learning models are highly effective at producing accurate results but can be time-consuming. Slow programming, excessive requirements, and overloaded data mean it takes more time than expected to provide accurate results. This calls for continuous maintenance and monitoring of the model to keep delivering accurate results.
Although machine learning models are intended to give the best possible outcome, if we feed garbage data as input, then the result will also be garbage. Hence, we should use relevant features in our training sample. A machine learning model is considered good if the training data has a good set of features and few to no irrelevant features.
Getting the data right is the first step in any AI or machine learning project -- and it's
often more time-consuming and complex than crafting the machine learning
algorithms themselves. Advanced planning to help streamline and improve data
preparation in machine learning can save considerable work down the road. It can also
lead to more accurate and adaptable algorithms.
"Data preparation is the action of gathering the data you need, massaging it into a
format that's computer-readable and understandable, and asking hard questions of it to
check it for completeness and bias," said Eli Finkelshteyn, founder and CEO of
Constructor.io, which makes an AI-driven search engine for product websites.
It's tempting to focus only on the data itself, but it's a good idea to first consider the
problem you're trying to solve. That can help simplify considerations about what kind
of data to gather, how to ensure it fits the intended purpose and how to transform it
into the appropriate format for a specific type of algorithm.
Good data preparation can lead to more accurate and efficient algorithms, while
making it easier to pivot to new analytics problems, adapt when model accuracy drifts
and save data scientists and business users considerable time and effort down the line.
"Being a great data scientist is like being a great chef," surmised Donncha Carroll, a
partner at consultancy Axiom Consulting Partners. "To create an exceptional meal,
you must build a detailed understanding of each ingredient and think through how
they'll complement one another to produce a balanced and memorable dish. For a data
scientist, this process of discovery creates the knowledge needed to understand more
complex relationships, what matters and what doesn't, and how to tailor the data
preparation approach necessary to lay the groundwork for a great ML model."
Managers need to appreciate the ways in which data shapes machine learning
application development differently compared to customary application development.
"Unlike traditional rule-based programming, machine learning consists of two parts
that make up the final executable algorithm -- the ML algorithm itself and the data to
learn from," explained Felix Wick, corporate vice president of data science at supply
chain management platform provider Blue Yonder. "But raw data are often not ready
to be used in ML models. So, data preparation is at the heart of ML."
Data preparation consists of several steps, which consume more time than other
aspects of machine learning application development. A 2021 study by data science
platform vendor Anaconda found that data scientists spend an average of 22% of their
time on data preparation, which is more than the average time spent on other tasks
like deploying models, model training and creating data visualizations.
1. Problem formulation
Data preparation for building machine learning models is a lot more than just cleaning
and structuring data. In many cases, it's helpful to begin by stepping back from the
data to think about the underlying problem you're trying to solve. "To build a
successful ML model," Carroll advised, "you must develop a detailed understanding
of the problem to inform what you do and how you do it."
Start by spending time with the people that operate within the domain and have a
good understanding of the problem space, synthesizing what you learn through
conversations with them and using your experience to create a set of hypotheses that
describes the factors and forces involved. This simple step is often skipped or
underinvested in, Carroll noted, even though it can make a significant difference in
deciding what data to capture. It can also provide useful guidance on how the data
should be transformed and prepared for the machine learning model.
An Axiom legal client, for example, wanted to know how different elements of
service delivery impact account retention and growth. Carroll's team collaborated with
the attorneys to develop a hypothesis that accounts served by legal professionals
experienced in their industry tend to be happier and continue as clients longer. To
provide that information as an input to a machine learning model, they looked back
over the course of each professional's career and used billing data to determine how
much time they spent serving clients in that industry.
"Ultimately," Carroll added, "it became one of the most important predictors of client
retention and something we would never have calculated without spending the time
upfront to understand what matters and how it matters."
Once a data science team has formulated the machine learning problem to be solved,
it needs to inventory potential data sources within the enterprise and from external
third parties. The data collection process must consider not only what the data is
purported to represent, but also why it was collected and what it might mean,
particularly when used in a different context. It's also essential to consider factors that
may have biased the data.
"To reduce and mitigate bias in machine learning models," said Sophia Yang, a senior
data scientist at Anaconda, "data scientists need to ask themselves where and how the
data was collected to determine if there were significant biases that might have been
captured." To train a machine learning model that predicts customer behavior, for
example, look at the data and ensure the data set was collected from diverse people,
geographical areas and perspectives.
"The most important step often missed in data preparation for machine learning is
asking critical questions of data that otherwise looks technically correct,"
Finkelshteyn said. In addition to investigating bias, he recommended determining if
there's reason to believe that important missing data may lead to a partial picture of
the analysis being done. In some cases, analytics teams use data that works
technically but produces inaccurate or incomplete results, and people who use the
resulting models build on these faulty learnings without knowing something is wrong.
3. Data exploration
Data scientists need to fully understand the data they're working with early in the
process to cultivate insights into its meaning and applicability. "A common mistake is
to launch into model building without taking the time to really understand the data
you've wrangled," Carroll said.
Data exploration means reviewing such things as the type and distribution of data
contained within each variable, the relationships between variables and how they vary
relative to the outcome you're predicting or interested in achieving.
This step can highlight problems like collinearity -- variables that move together -- or
situations where standardization of data sets and other data transformations are
necessary. It can also surface opportunities to improve model performance, like
reducing the dimensionality of a data set.
Data visualizations can also help improve this process. "This might seem like an
added step that isn't needed," Yang conjectured, "but our brains are great at spotting
patterns along with data that doesn't match the pattern." Data scientists can easily see
trends and explore the data correctly by creating suitable visualizations before
drawing conclusions. Popular data visualization tools include Tableau, Microsoft
Power BI, D3.js and Python libraries such as Matplotlib, Bokeh and the HoloViz
stack.
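A minimal data exploration sketch along these lines, assuming pandas and Matplotlib; the small synthetic dataset and its column names are made up purely for illustration.

    # Illustrative data exploration sketch (assumes pandas, NumPy and matplotlib).
    import numpy as np
    import pandas as pd
    import matplotlib.pyplot as plt

    rng = np.random.default_rng(0)
    df = pd.DataFrame({
        "ad_spend": rng.uniform(100, 1000, 60),
        "visits": rng.integers(50, 500, 60),
    })
    df["revenue"] = 2.5 * df["ad_spend"] + rng.normal(scale=100, size=60)

    # Type and distribution of each variable.
    print(df.dtypes)
    print(df.describe())

    # Relationships between variables (can reveal collinearity).
    print(df.corr())

    # Quick visual checks: distributions and a pairwise relationship.
    df.hist(figsize=(8, 4))
    df.plot.scatter(x="ad_spend", y="revenue")
    plt.show()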
Various data cleansing and validation techniques can help analytics teams identify
and rectify inconsistencies, outliers, anomalies, missing data and other issues. Missing
data values, for example, can often be addressed with imputation tools that fill empty
fields with statistically relevant substitutes.
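As a small sketch of the imputation idea mentioned above, assuming scikit-learn; the toy array with missing values is invented for the example.

    # Illustrative missing-value imputation sketch (assumes scikit-learn and NumPy).
    import numpy as np
    from sklearn.impute import SimpleImputer

    # A tiny feature matrix with missing entries (NaN).
    X = np.array([
        [25.0, 50000.0],
        [np.nan, 62000.0],
        [40.0, np.nan],
        [35.0, 58000.0],
    ])

    # Fill empty fields with a statistically relevant substitute (the column median).
    imputer = SimpleImputer(strategy="median")
    X_clean = imputer.fit_transform(X)
    print(X_clean)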
But Blue Yonder's Wick cautioned that semantic meaning is an often overlooked
aspect of missing data. In many cases, creating a dedicated category for capturing the
significance of missing values can help. In others, teams may consider explicitly
setting missing values as neutral to minimize their impact on machine learning
models.
A wide range of commercial and open source tools can be used to cleanse and
validate data for machine learning and ensure good quality data. Open source
technologies such as Great Expectations and Pandera, for example, are designed to
validate the data frames commonly used to organize analytics data into
two-dimensional tables. Tools that validate code and data processing workflows are
also available. One of them is pytest, which, Yang said, data scientists can use to
apply a software development unit-test mindset and manually write tests of their
workflows.
5. Data structuring
Once data science teams are satisfied with their data, they need to consider the
machine learning algorithms being used. Most algorithms, for example, work better
when data is broken into categories, such as age ranges, rather than left as raw
numbers.
Two often-missed data preprocessing tricks, Wick said, are data binning and
smoothing continuous features. These data regularization methods can reduce a
machine learning model's variance by preventing it from being misled by minor
statistical fluctuations in a data set.
Binning data into different groups can be done either in an equidistant manner, with
the same "width" for each bin, or equi-statistical method, with approximately the
same number of samples in each bin. It can also serve as a prerequisite for local
optimization of the data in each bin to help produce low-bias machine learning
models.
Smoothing continuous features can help in "denoising" raw data. It can also be used
to impose causal assumptions about the data-generating process by representing
relationships in ordered data sets as monotonic functions that preserve the order
among data elements.
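A hedged sketch of the two preprocessing tricks described above, equidistant versus equi-statistical binning and smoothing of a continuous feature, using pandas; the random data, bin counts and window size are assumptions for illustration.

    # Illustrative binning and smoothing sketch (assumes pandas and NumPy).
    import numpy as np
    import pandas as pd

    rng = np.random.default_rng(1)
    s = pd.Series(rng.normal(loc=50, scale=10, size=1000), name="feature")

    # Equidistant binning: every bin has the same width.
    equal_width = pd.cut(s, bins=5)

    # Equi-statistical binning: every bin holds roughly the same number of samples.
    equal_freq = pd.qcut(s, q=5)

    print(equal_width.value_counts().sort_index())
    print(equal_freq.value_counts().sort_index())

    # Smoothing a continuous (here, ordered) feature with a rolling mean to denoise it.
    smoothed = s.rolling(window=20, min_periods=1).mean()
    print(smoothed.head())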
Other actions that data scientists often take in structuring data for machine learning
include the following:
The last stage in data preparation before developing a machine learning model is
feature engineering and feature selection.
Wick said feature engineering, which involves adding or creating new variables to
improve a model's output, is the main craft of data scientists and comes in various
forms. Examples include extracting the days of the week or other variables from a
data set, decomposing variables into separate features, aggregating variables and
transforming features based on probability distributions.
Data scientists also must address feature selection -- choosing relevant features to
analyze and eliminating nonrelevant ones. Many features may look promising but lead
to problems like extended model training and overfitting, which limits a model's
ability to accurately analyze new data. Methods such as lasso regression and
automatic relevance determination can help with feature selection.
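The following sketch illustrates both ideas in this subsection: extracting a day-of-week feature and letting lasso regression shrink irrelevant features towards zero. The synthetic data, column names and regularisation strength are assumptions made for the example.

    # Illustrative feature engineering and lasso-based feature selection sketch
    # (assumes pandas, NumPy and scikit-learn; all data here is synthetic).
    import numpy as np
    import pandas as pd
    from sklearn.linear_model import Lasso
    from sklearn.preprocessing import StandardScaler

    rng = np.random.default_rng(0)
    df = pd.DataFrame({
        "order_date": pd.date_range("2023-01-01", periods=200, freq="D"),
        "price": rng.uniform(5, 50, 200),
        "noise_feature": rng.normal(size=200),
    })

    # Feature engineering: extract the day of the week from a date column.
    df["day_of_week"] = df["order_date"].dt.dayofweek

    # A target that truly depends on price and day_of_week, but not on noise_feature.
    y = 3 * df["price"] + 10 * (df["day_of_week"] >= 5) + rng.normal(scale=2, size=200)

    X = df[["price", "day_of_week", "noise_feature"]]
    X_scaled = StandardScaler().fit_transform(X)

    # Feature selection: lasso drives the coefficient of the irrelevant feature towards zero.
    lasso = Lasso(alpha=0.5).fit(X_scaled, y)
    print(dict(zip(X.columns, lasso.coef_.round(3))))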
Facial recognition
Targeted advertising
Voice recognition
SPAM filters
Machine translation
Detecting credit card fraud
Virtual Personal Assistants
Self-driving cars
… and lots more.
To fully understand the opportunities and consequences of the machine learning filled
future, everyone needs to be able to …
Here is an example of a robot with a machine learning brain. It reacts just to the tone
of voice – it doesn’t understand the words. It learnt very much like a dog does. It was
‘rewarded’ when it reacted in an appropriate way and was ‘punished’ when it reacted
in an inappropriate way. Eventually it learnt to behave like this.
There are several ways to try to make a machine do tasks ‘intelligently’. For example:
Genetic algorithms (copying the way evolution improves species to fit their
environment)
Bayesian Networks (building in existing expert knowledge)
Types of data
Machine learning data analysis uses algorithms to continuously improve itself over
time, but quality data is necessary for these models to operate efficiently.
A single row of data is called an instance. Datasets are a collection of instances that
all share a common attribute. Machine learning models will generally contain a few
different datasets, each used to fulfill various roles in the system.
For machine learning models to understand how to perform various actions, training
datasets must first be fed into the machine learning algorithm, followed by validation
datasets (or testing datasets) to ensure that the model is interpreting this data
accurately.
Once you feed these training and validation sets into the system, subsequent datasets
can then be used to sculpt your machine learning model going forward. The more data
you provide to the ML system, the faster that model can learn and improve.
Data can come in many forms, but machine learning models rely on four primary data
types. These include numerical data, categorical data, time series data, and text data.
Numerical data
Numerical data, or quantitative data, is any form of measurable data, such as your height, weight, or the cost of your phone bill. You can determine whether a set of data is numerical by attempting to average the numbers or sort them in ascending or descending order. Exact or whole numbers (e.g. 26 students in a class) are considered discrete numbers, while those that can fall anywhere within a range (e.g. a 3.6 percent interest rate) are considered continuous numbers. While working with this type of data, keep in mind that numerical data is not tied to any specific point in time; it is simply raw numbers.
Categorical data
Categorical data is sorted by defining characteristics. This can include gender, social
class, ethnicity, hometown, the industry you work in, or a variety of other labels.
While learning this data type, keep in mind that it is non-numerical, meaning you are
unable to add them together, average them out, or sort them in any chronological
order. Categorical data is great for grouping individuals or ideas that share similar
attributes, helping your machine learning model streamline its data analysis.
Time series data
Time series data consists of data points that are indexed at specific points in time.
More often than not, this data is collected at consistent intervals. Learning and
utilizing time series data makes it easy to compare data from week to week, month to
month, year to year, or according to any other time-based metric you desire. The
distinct difference between time series data and numerical data is that time series data
has established starting and ending points, while numerical data is simply a collection
of numbers that aren’t rooted in particular time periods.
Text data
Text data is simply words, sentences, or paragraphs that can provide some level of
insight to your machine learning models. Since these words can be difficult for
models to interpret on their own, they are most often grouped together or analyzed
using various methods such as word frequency, text classification, or sentiment
analysis.
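To make the four data types concrete, here is a small hedged sketch using pandas; the columns and values are invented solely for illustration.

    # Illustrative sketch of the four primary data types (assumes pandas).
    import pandas as pd

    df = pd.DataFrame({
        # Numerical data: measurable quantities.
        "monthly_bill": [42.5, 61.0, 35.2],
        # Categorical data: defining characteristics, not meaningful to average.
        "industry": pd.Categorical(["retail", "finance", "retail"]),
        # Time series data: values indexed to points in time.
        "signup_date": pd.to_datetime(["2023-01-05", "2023-02-11", "2023-03-20"]),
        # Text data: free-form words or sentences.
        "feedback": ["great service", "too expensive", "will recommend"],
    })

    print(df.dtypes)
    print(df["monthly_bill"].mean())          # numerical data can be averaged
    print(df["industry"].value_counts())      # categorical data is grouped and counted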
There is an abundance of places you can find machine learning data, but we have
compiled five of the most popular ML dataset resources to help get you started:
Exploring structure of data
The data structures used for machine learning are quite similar to those used in other fields of software development. Machine learning is a subset of artificial intelligence that involves various complex algorithms for solving mathematical problems, and data structures help to build and understand these complex algorithms. Understanding data structures also helps you to build ML models and algorithms much more efficiently than other ML professionals.
The data structure is defined as the basic building block of computer programming
that helps us to organize, manage and store data for efficient search and retrieval.
In other words, the data structure is the collection of data type 'values' which are
stored and organized in such a way that it allows for efficient access and modification.
A data structure is an ordered arrangement of data, and it tells the compiler how the programmer intends to use the data, e.g. as an Integer, String, Boolean, etc.
There are two different types of data structures: Linear and Non-linear data structures.
The linear data structure is a special type of data structure that helps to organize and
manage data in a specific order where the elements are attached adjacently.
Array:
An array is one of the most basic and common data structures used in Machine
Learning. It is also used in linear algebra to solve complex mathematical problems.
You will use arrays constantly in machine learning, whether it's:
An array contains index numbers to represent an element starting from 0. The lowest
index is arr[0] and corresponds to the first element.
Let's take the example of the Python array as used in machine learning. Although the Python array type is quite different from arrays in other programming languages, the Python list is more popular because it offers flexibility in the data types it can hold and in its length. If you are using Python for ML algorithms, it is a good idea to start your journey with arrays and lists.
Common Python list methods:
append() – Adds an element at the end of the list.
clear() – Removes all elements from the list.
copy() – Returns a copy of the list.
count() – Returns the number of elements with the specified value.
extend() – Adds the elements of another list (or iterable) to the end of the current list.
index() – Returns the index of the first element with the specified value.
insert() – Adds an element at a specific position given by an index number.
pop() – Removes (and returns) an element from a specified position given by an index number.
remove() – Removes the first element with the specified value.
reverse() – Reverses the order of the list.
sort() – Sorts the list.
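A quick sketch exercising a few of these list methods (the values are arbitrary):

    # Illustrative Python list operations.
    features = [0.5, 2.0, 1.5]

    features.append(3.0)        # add at the end -> [0.5, 2.0, 1.5, 3.0]
    features.insert(0, 0.1)     # add at index 0 -> [0.1, 0.5, 2.0, 1.5, 3.0]
    features.extend([4.0, 5.0]) # append another list's elements
    features.remove(2.0)        # drop the first occurrence of 2.0
    last = features.pop()       # remove and return the last element (5.0)
    features.sort(reverse=True) # sort in descending order

    print(features, "popped:", last)
    print("index of 1.5:", features.index(1.5), "count of 0.1:", features.count(0.1))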
Stacks:
Stacks are based on the concept of LIFO (last in, first out), also described as FILO (first in, last out). Stacks are used in some deep learning tasks such as binary classification. Although stacks are easy to learn and implement in ML models, having a good grasp of them helps in many areas of computer science, such as parsing grammars.
Stacks enable the undo and redo buttons on your computer, as they function like a stack of blog posts: there is no sense in adding a post at the bottom of the stack, and we can only look at the most recent one that has been added. Addition and removal occur at the top of the stack.
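A minimal stack sketch using a Python list (the values are arbitrary):

    # Illustrative LIFO stack using a Python list.
    stack = []

    stack.append("edit 1")   # push
    stack.append("edit 2")   # push
    stack.append("edit 3")   # push

    # "Undo" pops the most recently added item first (last in, first out).
    print(stack.pop())       # edit 3
    print(stack.pop())       # edit 2
    print(stack)             # ['edit 1'] - addition and removal happen at the top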
Linked List:
A linked list is a collection of separately allocated nodes. In other words, it is a collection of data elements where each node consists of a value and a pointer that points to the next node in the list.
In a linked list, insertion and deletion are constant-time operations and are very efficient, but accessing a value is slow and often requires scanning the list. A linked list is therefore a good alternative to a dynamic array when frequent insertions and deletions would otherwise require shifting elements. Although an element can be inserted at the head, middle or tail position, inserting in the middle or at the tail can be relatively costly because the position must first be reached by traversal. However, linked lists are easy to splice together and split apart, and a list can be converted to a fixed-length array for fast access.
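A small hedged sketch of a singly linked list in Python, with node values chosen only for illustration:

    # Illustrative singly linked list: each node stores a value and a pointer to the next node.
    class Node:
        def __init__(self, value):
            self.value = value
            self.next = None

    class LinkedList:
        def __init__(self):
            self.head = None

        def push_front(self, value):
            """Insert at the head: a constant-time operation."""
            node = Node(value)
            node.next = self.head
            self.head = node

        def to_list(self):
            """Accessing values requires scanning node by node."""
            out, cur = [], self.head
            while cur is not None:
                out.append(cur.value)
                cur = cur.next
            return out

    lst = LinkedList()
    for v in (3, 2, 1):
        lst.push_front(v)
    print(lst.to_list())   # [1, 2, 3]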
Queue:
A queue follows the FIFO (first in, first out) principle. It is useful for modelling queuing scenarios in real-time programs, such as people waiting in line to withdraw cash from a bank. Hence, queues are significant in programs where multiple pieces of work need to be processed in order.
The queue data structure can be used to record the split time of a car in F1 racing.
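A minimal queue sketch using collections.deque (the lap-time values are invented for illustration):

    # Illustrative FIFO queue using collections.deque.
    from collections import deque

    lap_times = deque()

    lap_times.append(92.3)   # enqueue at the back
    lap_times.append(91.7)
    lap_times.append(93.1)

    # Items are processed in the order they arrived (first in, first out).
    print(lap_times.popleft())  # 92.3
    print(lap_times.popleft())  # 91.7
    print(list(lap_times))      # [93.1]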
As the name suggests, in non-linear data structures the elements are not arranged in any sequence. Instead, all the elements are arranged and linked with each other in a hierarchical manner, where one element can be linked to one or more elements.
1) Trees
Binary Tree:
The concept of a binary tree is very similar to that of a linked list; the only difference lies in the nodes and their pointers. In a linked list, each node contains a data value and a pointer that points to the next node in the list, whereas in a binary tree each node has two pointers to subsequent nodes instead of just one.
In a binary search tree the nodes are kept sorted, so insertion and deletion operations can be done with O(log N) time complexity on average. Similar to a linked list, a binary tree can also be converted to an array on the basis of tree sorting.
In such a tree there are parent and child nodes, where the value of the left child node is always less than the value of the parent node, while the value of the right child node is always greater than the parent node. Hence, in this tree structure the data is kept sorted automatically, which makes insertion, deletion and search efficient.
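A compact sketch of a binary search tree insert and in-order traversal, as a hedged illustration of the ordering property described above:

    # Illustrative binary search tree: left child < parent < right child.
    class TreeNode:
        def __init__(self, value):
            self.value = value
            self.left = None
            self.right = None

    def insert(root, value):
        """Insert a value, keeping the ordering property."""
        if root is None:
            return TreeNode(value)
        if value < root.value:
            root.left = insert(root.left, value)
        else:
            root.right = insert(root.right, value)
        return root

    def in_order(root):
        """In-order traversal visits the values in sorted order."""
        return [] if root is None else in_order(root.left) + [root.value] + in_order(root.right)

    root = None
    for v in (8, 3, 10, 1, 6):
        root = insert(root, v)
    print(in_order(root))   # [1, 3, 6, 8, 10]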
2) Graphs
A graph data structure is also very useful in machine learning, for example for link prediction. A graph consists of nodes connected by edges, which may be directed (ordered pairs of nodes) or undirected (unordered pairs). Hence, you should have good exposure to the graph data structure for machine learning and deep learning.
3) Maps
Maps are a popular data structure in the programming world, mostly used to reduce run time and to search data quickly. A map stores data in the form of (key, value) pairs, where the key must be unique while the value can be duplicated. Each key corresponds to, or maps to, a value; hence the name Map.
In different programming languages, core libraries have built-in maps or, rather,
HashMaps with different names for each implementation.
In Java: Maps
In Python: Dictionaries
C++: hash_map, unordered_map, etc.
Python dictionaries are very useful in machine learning and data science, as various functions and algorithms return a dictionary as output. Dictionaries are also widely used for implementing sparse matrices, which are very common in machine learning.
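A short sketch of a dictionary used as a simple sparse representation, with made-up keys and values:

    # Illustrative dictionary (hash map): unique keys mapping to values.
    # A sparse vector can be stored as {index: value}, keeping only non-zero entries.
    sparse_vector = {0: 3.5, 7: 1.2, 42: -0.8}

    sparse_vector[100] = 0.5          # insert / update by key
    print(sparse_vector.get(7))       # fast lookup by key -> 1.2
    print(sparse_vector.get(5, 0.0))  # missing indices default to 0.0

    # Iterate over stored (key, value) pairs only.
    for index, value in sparse_vector.items():
        print(index, value)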
4) Heap
A heap is a hierarchically ordered data structure. It is similar to a tree, but it enforces a vertical ordering rather than a horizontal one: the ordering is applied along the hierarchy but not across it, with (in a max-heap) the value of the parent node always greater than that of its child nodes on either the left or right side.
Here, the insertion and deletion operations are performed on the basis of promotion: the element is first inserted at the next available position, then compared with its parent and promoted until it reaches the correct position in the ordering. Most heap data structures can be stored in an array along with the relationships between the elements.
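A brief heap sketch using Python's heapq module, which implements a min-heap stored in a plain list (the numbers are arbitrary):

    # Illustrative heap usage via heapq (a binary min-heap stored in a plain list).
    import heapq

    scores = [0.42, 0.91, 0.17, 0.65]
    heapq.heapify(scores)          # reorder the list in place into heap order

    heapq.heappush(scores, 0.05)   # insert, then "promote" to the right position
    print(heapq.heappop(scores))   # 0.05 - the smallest element is always at the root
    print(heapq.nsmallest(2, scores))  # peek at the two smallest without removing them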
5) Vectors and Matrices
Vectors and matrices are among the most important data structures used in linear algebra, covering 1-D, 2-D, 3-D and even 4-D arrays for matrix arithmetic. Working with them requires good exposure to Python libraries such as NumPy when programming for deep learning.
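A minimal NumPy sketch of the array and matrix arithmetic referred to above:

    # Illustrative NumPy arrays for linear algebra.
    import numpy as np

    v = np.array([1.0, 2.0, 3.0])             # 1-D array (vector)
    M = np.array([[1.0, 0.0, 2.0],
                  [0.0, 3.0, 1.0]])           # 2-D array (matrix), shape (2, 3)

    print(M @ v)            # matrix-vector product -> shape (2,)
    print(M.T @ M)          # matrix-matrix product -> shape (3, 3)
    print(M.reshape(3, 2))  # reshaping to a different 2-D layout

    T = np.zeros((2, 3, 4))  # 3-D array (e.g. a stack of matrices)
    print(T.shape)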
For a machine learning professional, apart from machine learning skills, it is also necessary to have a good grasp of data structures and algorithms.
When we use machine learning for solving a problem, we need to evaluate the model
performance, i.e., which model is fastest and requires the smallest amount of space
and resources with accuracy. Moreover, if a model is built using algorithms,
comparing and contrasting two algorithms to determine the best for the job is crucial
to the machine learning professional. For such cases, skills in data structures become
important for ML professionals.
With the knowledge of data structure and algorithms with ML, we can answer the
following questions easily:
Data quality
The machine learns statistical associations from historical data and is only as good as the data it is trained on. Hence, good quality data is imperative and a basic building block of an ML pipeline; the ML model can only be as good as its training data.
The machine learning algorithms need training data in a single view i.e. a flat
structure. As most organizations maintain multiple sources of data, the data
preparation by combining multiple data sources to bring all necessary attributes in a
single flat file is a time and resource (domain expertise) expensive process.
The data gets exposed to multiple sources of error at this step and requires strict peer
review to ensure that the domain-established logic has been communicated,
understood, programmed, and implemented well.
Since data warehouses integrate data from multiple sources, quality issues related to
data acquisition, cleaning, transformations, linking, and integration become critical.
A very popular notion among data scientists is that data preparation, cleaning, and transformation take up the majority of the model-building time – and it is absolutely true. Hence, it is advisable not to rush the data into the model, and to perform extensive data quality checks instead. Though the number and type of checks one can perform on the data can be quite subjective, we will discuss some of the key factors to check in the data while preparing a data quality score and assessing the goodness of the data:
Techniques to maintain data quality:
o All labelers are not the same: Data is gathered from multiple sources.
Multiple vendors have different approaches to collecting and labeling
data with a different understanding of the end-use of the data. Within
the same vendor for data labeling, there are myriad ways data
inconsistency can crop up as the supervisor gets requirements and
shares the guidelines to different team members, all of whom can label
based on their understanding.
A quality check on the vendor side, together with validation of adherence to the shared guidelines on the consumer side, will help bring homogeneous labeling.
o Distinct Record: Identifying the group of attributes that uniquely
identify a single record is very important and needs validation from a
domain expert. Removing duplicates on this group leaves you with
distinct records necessary for model training. This group acts as a key
to performing multiple aggregate and transformations operations on the
dataset like calculating rolling mean, backfilling null values, missing
value imputation (details on this in next point), etc.
o What to do with the missing data? Systematic missingness of data
leads to the origin of a biased dataset and calls for deeper investigation.
Also, removing the observations from the data with more null/missing
values can lead to the elimination of data representing certain groups
of people (e.g. gender, or race). Hence, misrepresented data will
produce biased results and is not only flawed at the model output level
but is also against the fairness principles of ethical and responsible use
of AI. Another way you may find the missing attributes is “at
random”. Blindly removing a certain important attribute due to a high
missingness quotient can harm the model by reducing its predictive
power.
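As a hedged sketch of how one might check for the kinds of missingness discussed above before deciding what to drop or impute, assuming pandas and a hypothetical DataFrame with a demographic column:

    # Illustrative missingness check (assumes pandas; the DataFrame and column
    # names are hypothetical).
    import numpy as np
    import pandas as pd

    df = pd.DataFrame({
        "gender": ["F", "M", "F", "M", "F", "M"],
        "income": [52000, np.nan, 61000, np.nan, 58000, 49000],
        "age":    [34, 41, np.nan, 29, 38, 45],
    })

    # Overall share of missing values per attribute.
    print(df.isna().mean())

    # Systematic missingness: does the missing rate differ across groups?
    print(df.groupby("gender")["income"].apply(lambda s: s.isna().mean()))

    # Rather than dropping rows (which could under-represent a group), impute instead.
    df["income"] = df["income"].fillna(df["income"].median())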
Data Remediation
Data remediation is the process of cleansing, organizing and migrating data so that it’s
properly protected and best serves its intended purpose. There is a misconception that
data remediation simply means deleting business data that is no longer needed. It’s
important to remember that the key word “remediation” derives from the word
“remedy,” which is to correct a mistake. Since the core initiative is to correct data, the
data remediation process typically involves replacing, modifying, cleansing or
deleting any “dirty” data.
Data remediation terminology
As you explore the data remediation process, you will come across unique
terminology. These are common terms related to data remediation that you should get
acquainted with.
Data Migration – The process of moving data between two or more systems,
data formats or servers.
Data Discovery – A manual or automated process of searching for patterns in
data sets to identify structured and unstructured data in an organization’s
systems.
ROT – An acronym that stands for redundant, obsolete and trivial data.
According to the Association for Intelligent Information Management, ROT
data accounts for nearly 80 percent of the unstructured data that is beyond its
recommended retention period and no longer useful to an organization.
Dark Data – Any information that businesses collect, process and store, but do
not use for other purposes. Some examples include customer call records, raw
survey data or email correspondences. Often, the storing and securing of this
type of data incurs more expense and sometimes even greater risk than it does
value.
Dirty Data – Data that damages the integrity of the organization’s complete
dataset. This can include data that is unnecessarily duplicated, outdated,
incomplete or inaccurate.
Data Overload – This is when an organization has acquired too much data,
including low-quality or dark data. Data overload makes the tasks of
identifying, classifying and remediating data laborious.
Data Cleansing – Transforming data in its native state to a predefined
standardized format.
Data Governance – Management of the availability, usability, integrity and
security of the data stored within an organization.
Data remediation is an involved process. After all, it’s more than simply purging your
organization’s systems of dirty data. It requires knowledgeable assessment on how to
most effectively resolve unclean data.
Assessment
Before you take any action on your company’s data, you need to have a complete
understanding of the data you possess. How valuable is this data to the company? Is
this data sensitive? Does this data actually require specialized storage, or is it trivial
information? Identifying the quantity and type of data you’re dealing with, even if it’s
just a ballpark estimate to start, will help your team get a general sense of how much
time and resources need to be dedicated for successful data remediation.
Not all data is created equally, which means that not all pieces of data require the
same level of protection or storage features. For instance, it isn’t cost-efficient for a
company to store all data, ranging from information that is publicly facing to sensitive
data, all in the same high-security vault. This is why organizing and creating segments
based on the information’s purpose is critical during the data remediation process.
Once your data is segmented, you can move on to indexing and classification. These steps build on the data segments you have created and help you determine action steps. In this step, organizations focus on segments containing non-ROT data and classify the level of sensitivity of the remaining data.
Migrating
If an organization’s end goal is to consolidate their data into a new, cleansed storage
environment, then migration is an essential step in the data remediation process. A
common scenario is an organization who needs to find a new secure location for
storing data because their legacy system has reached its end of life. Some
organizations may also prefer moving their data to cloud-based platforms, like
SharePoint or Office 365, so that information is more accessible for their internal
teams.
Data cleansing
The final task for your organization’s data may not always involve migration. There
may be other actions better suited for the data depending on what segmentation group
it falls under and its classification. A few vital actions that a team may proceed with
include shredding, redacting, quarantining, ACL removal and script execution to
clean up data.
Data remediation is a big effort, but it comes with big benefits for businesses as well.
These are the top benefits that most organizations realize after data remediation.
Reduced data storage costs — Although data remediation isn’t solely about
deletion of data, it is a common remediation action and less data means less
storage required. Additionally, many organizations realize that they have
lumped trivial information into the same high-security storage platform as
sensitive information, instead of only paying for the storage space that’s
actually necessary.
Protection for unstructured sensitive data — Once sensitive data is
discovered and classified, remediation is where you determine and execute the
actions that mitigate risk. This could look like finding a secure area to store
sensitive data or deleting what is necessary from a compliance perspective.
Reduced sensitive data footprint — By removing sensitive data that is
beyond its recommended retention period and is no longer necessary for compliance,
you’ve reduced your organization’s sensitive data footprint and decreased risk
of potential data breaches or leaks of highly sensitive data.
Adherence to compliance laws and regulations — Hanging on to data that
is beyond its recommended retention period can create greater risks. By
cleaning up data, your organization reduces data exposure which supports
compliance initiatives.
Increased staff productivity — Data that your team uses should be available,
usable and trustworthy. By streamlining your organization’s network with data
remediation, information should be easier to find and usable for its intended
purpose.
Minimized cyberattack risks — By continuously engaging in data
remediation, your organization is proactively minimizing data loss risks and
potential financial or reputational damage of successful cyberattacks.
Improved overall data security — Data remediation and data governance
work hand in hand. In order to properly remediate data, your organization will
need to establish data governance policies, which is significant for the overall
management and protection of your organization’s data.
Business changes
If a company has changed software or systems they use, or even moved to a new
office or data center location, that is a case to buckle down on data remediation
immediately. Sometimes companies switch to new software or systems because they
need to phase out their legacy system that has reached its end of life. Change of any
kind is rarely ever 100 percent smooth, and data could become corrupted or exposed
during the shuffle of changing environments — whether it be digital or physical.
Newly enacted laws or regulations, either on a state or federal level, could be another
major driver for data remediation. Data privacy and protection laws are continuously
being updated and improved upon, like the more recent California Consumer
Privacy Act of 2018 (CCPA). Sometimes new policies may be enacted by the
leadership team at your organization as well.
Human error
Drivers for data remediation aren’t always necessarily as grand as a new business
acquisition or legal regulation. Sometimes, instances as simple as human error can be
a catalyst for data remediation. For instance, let’s say that your organization discovers
one of its employees has unintentionally downloaded sensitive corporate data on their
personal mobile phone. Or, perhaps a couple of employees accidentally opened up a
malicious spam email. Actions as innocent as these examples could put the integrity
of your organization’s data at risk and are cause for taking immediate action with
data remediation.
Many other scenarios can trigger the need to remediate data. At the same time,
several common obstacles keep organizations from getting started.
Lack of information
A common reason that organizations ignore data remediation is a lack of information
about what, where, how and why data is stored in the company. An organization may
not even realize the expanse of data they have collected or where it is stored.
Awareness is a common issue, and since such a large percentage of sensitive data falls
under the unstructured category, locating and maintaining awareness of all of this
data is difficult.
It’s recommended that organizations, especially those who belong to industries that
interact with high volumes of sensitive data (like the medical, financial or education
industries), regularly perform sensitive data discovery and data classification to
prepare for data remediation. All of these steps are essential to a healthy data lifecycle
and depend on one another to keep a company’s data security in good standing.
Another factor that may prevent an organization from getting started with data
remediation is a fear of deleting data. The permanency of the action can be
intimidating, and some businesses may be concerned that they may need the data at
hand at some point in the future. However, hanging on to unnecessary data, or leaving
dirty data unmodified or uncleansed, can pose greater risk to an organization —
especially when it comes to compliance laws and regulations.
Lastly, some organizations may not have established clear data ownership. If there
aren’t clear roles and responsibilities for each member of your organization’s security
team, then important tasks like data remediation can easily slip through the cracks.
It’s essential to determine each person’s key responsibilities when it comes to
maintaining data security, and to make those duties transparent across the
organization so that everyone knows who to turn to for specific security questions,
and to keep the team accountable.
Whether you’ve put data remediation on the back-burner or are realizing for the first
time the benefits of steady data remediation, here are several steps your team should
take to prepare for data remediation.
1. Data remediation teams – First, create data remediation teams. In doing this,
your organization will need to establish data ownership roles and
responsibilities, so everyone on your security team knows how they are
contributing and who to go to with questions or concerns.
2. Data governance policies – From there, you will need to establish company
policies that enforce data governance. An effective data governance plan will
ensure that the company’s data is trustworthy and does not get misused.
Typically, data governance is a process largely based on the company’s
internal data standards and policies that control data usage in order to maintain
the availability, usability, integrity and security of data.
3. Prioritize data remediation areas – Once you have your organization’s
policies and data remediation team assembled, you should begin prioritizing
which areas may require more immediate data remediation. If any of the
drivers we mentioned above have occurred, such as your organization
switching to a new platform or an urgent need to eliminate PII, those are great
starting points for prioritizing the order of business areas that need data
remediation.
4. Budget for data-related issues – After compiling a prioritized list, it’s time to
budget for any data-related issues that may occur during the remediation
process. This includes estimating the hours of labor for the process and
factoring in costs for any special tools that may be needed for remediation.
5. Discuss data remediation expectations – Either after or alongside the
budgeting process, your team should sit down and discuss general
expectations of the data remediation process. Are there any types of sensitive
data your team expects to find? Are there any recent overarching data security
issues or changes that could have an impact on the remediation process? During the
discussion, important details that only one person was aware of may be brought to
light, helping the team reach success.
6. Track progress and ROI – All companies want to understand their ROI on
big projects and initiatives, and this applies to data security measures too.
Your organization’s IT data security lead should create a progress reporting
mechanism that can inform company stakeholders on the data remediation
progress, including key performance indicators like the number of issues resolved
or how resolved issues translate into money and risk saved.
Data pre-processing
Companies can use data from nearly endless sources – internal information, customer
service interactions, and all over the internet – to help inform their choices and
improve their business.
But you can’t simply take raw data and run it through machine learning and analytics
programs right away. You first need to preprocess your data, so it can be successfully
“read” or understood by machines.
Data preprocessing is a step in the data mining and data analysis process that takes
raw data and transforms it into a format that can be understood and analyzed by
computers and machine learning.
Raw, real-world data in the form of text, images, video, etc., is messy. Not only may
it contain errors and inconsistencies, but it is often incomplete, and doesn’t have a
regular, uniform design.
Machines like to process nice and tidy information – they read data as 1s and 0s. So
calculating structured data, like whole numbers and percentages, is easy. However,
unstructured data, in the form of text and images, must first be cleaned and formatted
before analysis.
Data Preprocessing Importance
When using data sets to train machine learning models, you’ll often hear the phrase
“garbage in, garbage out.” This means that if you use bad or “dirty” data to train
your model, you’ll end up with a bad, improperly trained model that won’t actually be
relevant to your analysis.
Good, preprocessed data is even more important than the most powerful algorithms,
to the point that machine learning models trained with bad data could actually be
harmful to the analysis you’re trying to do – giving you “garbage” results.
Depending on your data gathering techniques and sources, you may end up with data
that’s out of range or includes an incorrect feature, like household income below zero
or an image from a set of “zoo animals” that is actually a tree. Your set could have
missing values or fields. Text data, for example, will often have misspelled words
and irrelevant symbols, URLs, etc.
When you properly preprocess and clean your data, you’ll set yourself up for much
more accurate downstream processes. We often hear about the importance of
“data-driven decision making,” but if these decisions are driven by bad data, they’re
simply bad decisions.
Data sets can be described in terms of the “features” that make them up, such as
size, location, age, time, or color. Features appear as columns in datasets and are
also known as attributes, variables, fields, and characteristics.
It’s important to understand what “features” are when preprocessing your data
because you’ll need to choose which ones to focus on depending on what your
business goals are. Later, we’ll explain how you can improve the quality of your
dataset’s features and the insights you gain with processes like feature selection.
First, let’s go over the two different types of features that are used to describe data:
categorical and numerical. Categorical features take their values from a fixed set of
categories (for example, color, city or product type), while numerical features take
numeric values (for example, age, income or temperature).
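To make this concrete, here is a minimal, illustrative Python sketch (using the pandas library, with made-up values) showing how categorical and numerical features typically appear as columns in a dataset:

# Illustrative only: a toy dataset mixing categorical and numerical features.
import pandas as pd

df = pd.DataFrame({
    "name": ["Asha", "Ben", "Carla"],        # categorical (nominal)
    "city": ["Chennai", "Delhi", "Mumbai"],  # categorical (nominal)
    "age": [34, 29, 41],                     # numerical (discrete)
    "income": [52000.0, 61000.0, 48500.0],   # numerical (continuous)
})

# pandas infers the column types; object columns usually hold categorical
# features, while int/float columns hold numerical features.
print(df.dtypes)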
The diagram below shows how features are used to train machine learning text
analysis models. Text is run through a feature extractor (to pull out or highlight words
or phrases) and these pieces of text are classified or tagged by their features. Once the
model is properly trained, text can be run through it, and it will make predictions on
the features of the text or “tag” the text itself.
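As a rough illustration of the feature extraction step described above, the sketch below uses scikit-learn’s CountVectorizer to turn two short, made-up texts into bag-of-words count features. This is just one of many possible feature extractors, not necessarily the one shown in the diagram.

# Illustrative sketch of extracting word-count features from raw text.
from sklearn.feature_extraction.text import CountVectorizer

texts = [
    "great product, fast delivery",      # hypothetical example documents
    "terrible support, slow delivery",
]

vectorizer = CountVectorizer()
features = vectorizer.fit_transform(texts)    # sparse matrix of word counts

print(vectorizer.get_feature_names_out())     # the extracted word features
print(features.toarray())                     # one row of counts per document

A classifier trained on features like these can then tag new, unseen text.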
Let’s take a look at the established steps you’ll need to go through to make sure your
data is successfully preprocessed.
1. Data quality assessment
Take a good look at your data and get an idea of its overall quality, relevance to your
project, and consistency. There are a number of data anomalies and inherent problems
to look out for in almost any data set (a quick programmatic check is sketched after
this list), for example:
Mismatched data types: When you collect data from many different sources,
it may come to you in different formats. While the ultimate goal of this entire
process is to reformat your data for machines, you still need to begin with
similarly formatted data. For example, if part of your analysis involves family
income from multiple countries, you’ll have to convert each income amount
into a single currency.
Mixed data values: Perhaps different sources use different descriptors for
features – for example, man or male. These value descriptors should all be
made uniform.
Data outliers: Outliers can have a huge impact on data analysis results. For
example, if you're averaging test scores for a class, and one student didn’t
respond to any of the questions, their 0% could greatly skew the results.
Missing data: Take a look for missing data fields, blank spaces in text, or
unanswered survey questions. This could be due to human error or incomplete
data. To take care of missing data, you’ll have to perform data cleaning.
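As mentioned above, a quick programmatic pass can surface many of these anomalies before any cleaning begins. The sketch below is a minimal example using pandas; the file name and column names are hypothetical.

# A rough first-pass data quality check (illustrative).
import pandas as pd

df = pd.read_csv("survey_data.csv")   # hypothetical input file

print(df.dtypes)                      # mismatched data types show up here
print(df["gender"].unique())          # mixed value descriptors, e.g. "man" vs "male"
print(df.describe())                  # min/max values help reveal outliers
print(df.isna().sum())                # count of missing values per column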
2. Data cleaning
Data cleaning is the process of adding missing data and correcting, repairing, or
removing incorrect or irrelevant data from a data set. Data cleaning is the most
important step of preprocessing because it will ensure that your data is ready to go for
your downstream needs.
Data cleaning will correct all of the inconsistent data you uncovered in your data
quality assessment. Depending on the kind of data you’re working with, there are a
number of possible cleaners you’ll need to run your data through.
Missing data
There are a number of ways to correct for missing data, but the two most common
are to ignore the incomplete records (dropping the rows or columns that contain
missing values) or to fill in the missing values, either manually or with an estimate
such as the column mean or median.
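As a minimal sketch with made-up values, both options can be expressed in pandas as follows:

# Two common ways to handle missing values (illustrative).
import pandas as pd

df = pd.DataFrame({"age": [34, None, 41], "income": [52000, 61000, None]})

dropped = df.dropna()                            # option 1: drop incomplete rows
imputed = df.fillna(df.mean(numeric_only=True))  # option 2: fill with column means

print(dropped)
print(imputed)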
Noisy data
Data cleaning also includes fixing “noisy” data. This is data that includes unnecessary
data points, irrelevant data, and data that’s more difficult to group together.
Binning: Binning sorts the data of a wide data set into smaller groups of more
similar data. It’s often used when analyzing demographics; income, for example,
could be grouped into ranges such as $35,000-$50,000 or $50,000-$75,000 (a short
binning sketch follows this list).
Regression: Regression analysis is used to smooth large amounts of data by fitting
it to a function, and can help you decide which variables actually apply to your
analysis. This will help you get a handle on your data, so you’re not overburdened
with unnecessary data.
Clustering: Clustering algorithms are used to properly group data, so that it
can be analyzed with like data. They’re generally used in unsupervised
learning, when not a lot is known about the relationships within your data.
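The binning sketch referred to above, grouping made-up income values into the example ranges, could look like this with pandas:

# Binning incomes into ranges with pandas.cut (illustrative thresholds).
import pandas as pd

incomes = pd.Series([36000, 48000, 52000, 68000, 74000])
bins = [35000, 50000, 75000]
labels = ["$35,000-$50,000", "$50,000-$75,000"]

income_groups = pd.cut(incomes, bins=bins, labels=labels)
print(income_groups.value_counts())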
If you’re working with text data, for example, some things you should consider when
cleaning your data are listed below (a small cleaning sketch follows the list):
Remove URLs, symbols, emojis, etc., that aren’t relevant to your analysis
Translate all text into the language you’ll be working in
Remove HTML tags
Remove boilerplate email text
Remove unnecessary blank text between words
Remove duplicate data
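A small cleaning sketch covering several of the points above, using only Python’s standard library; the patterns are deliberately simplified and would need tuning for real data.

# Simplified text-cleaning sketch: strip HTML, URLs, symbols, extra spaces,
# and duplicate documents.
import re

def clean_text(text: str) -> str:
    text = re.sub(r"<[^>]+>", " ", text)        # remove HTML tags
    text = re.sub(r"https?://\S+", " ", text)   # remove URLs
    text = re.sub(r"[^\w\s.,!?']", " ", text)   # drop stray symbols/emojis
    text = re.sub(r"\s+", " ", text)            # collapse blank runs
    return text.strip()

docs = [
    "Check this out: https://example.com <b>now</b>!!",   # hypothetical inputs
    "Check this out: https://example.com <b>now</b>!!",
]
docs = [clean_text(d) for d in docs]
docs = list(dict.fromkeys(docs))                # remove exact duplicates
print(docs)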
After data cleaning, you may realize you have insufficient data for the task at hand. At
this point you can also perform data wrangling or data enrichment to add new data
sets and run them through quality assessment and cleaning again before adding them
to your original data.
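If you do need to add new data sets at this stage, a minimal pandas sketch of appending an enrichment set and re-checking it might look like the following (both file names are hypothetical):

# Appending an additional data set and re-running basic checks (illustrative).
import pandas as pd

original = pd.read_csv("customers_2022.csv")     # hypothetical existing data
enrichment = pd.read_csv("customers_2023.csv")   # hypothetical new data set

combined = pd.concat([original, enrichment], ignore_index=True)
combined = combined.drop_duplicates()            # basic cleaning on the merged set
print(combined.isna().sum())                     # re-assess quality before use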
3. Data transformation
With data cleaning, we’ve already begun to modify our data, but data transformation
will begin the process of turning the data into the proper format(s) you’ll need for
analysis and other downstream processes.
1. Aggregation
2. Normalization
3. Feature selection
4. Discretization
5. Concept hierarchy generation
Feature selection is the process of deciding which of your dataset’s features are
actually important to the analysis at hand. Keep in mind that the more features you
choose to use, the longer the training process and, sometimes, the less accurate your
results, because some feature characteristics may overlap or be less present in the
data. A minimal feature selection sketch follows.
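One very simple form of feature selection, for illustration, is dropping features that barely vary and therefore carry little information. The sketch below uses scikit-learn’s VarianceThreshold on made-up data; real projects usually rely on more task-specific criteria.

# Minimal feature selection sketch: remove zero-variance features.
import numpy as np
from sklearn.feature_selection import VarianceThreshold

X = np.array([
    [0.0, 2.1, 5.0],
    [0.0, 1.9, 7.0],
    [0.0, 2.0, 6.0],
])  # the first column never changes, so it carries no information

selector = VarianceThreshold(threshold=0.0)   # drop zero-variance columns
X_selected = selector.fit_transform(X)

print(selector.get_support())   # [False  True  True]
print(X_selected.shape)         # (3, 2)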
4. Data reduction
The more data you’re working with, the harder it will be to analyze, even after
cleaning and transforming it. Depending on your task at hand, you may actually have
more data than you need. Especially when working with text analysis, much of
regular human speech is superfluous or irrelevant to the needs of the researcher. Data
reduction not only makes the analysis easier and more accurate, but cuts down on data
storage.
It will also help identify the most important features to the process at hand.
Numerosity reduction: This will help with data storage and transmission.
You can use a regression model, for example, to use only the data and
variables that are relevant to your analysis.
Dimensionality reduction: This, again, reduces the amount of data used to
help facilitate analysis and downstream processes. Techniques such as principal
component analysis (PCA) combine overlapping features into a smaller set of new
features while preserving most of the information, which makes the data more
manageable (a short PCA sketch follows).
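As one common example of dimensionality reduction, the sketch below applies PCA from scikit-learn to randomly generated data, reducing ten features down to two components:

# Dimensionality reduction with PCA (illustrative; the data is random).
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 10))        # 100 samples, 10 features

pca = PCA(n_components=2)             # keep the 2 strongest directions
X_reduced = pca.fit_transform(X)

print(X_reduced.shape)                # (100, 2)
print(pca.explained_variance_ratio_)  # variance captured by each component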
Consider the following example of how preprocessing works. Suppose we have three
variables: name, age, and company, and we can tell that rows #2 and #3 have been
assigned the incorrect companies.
We can use data cleaning to simply remove these rows, as we know the data was
improperly entered or is otherwise corrupted.
Or, we can perform data transformation, in this case manually, in order to fix the
problem.
Once the issue is fixed, we can perform data reduction, in this case by sorting in
descending order of age, to choose which age range we want to focus on.
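The same workflow can be sketched in pandas; the names, ages and companies below are hypothetical placeholders rather than values from the original example.

# Worked sketch of the cleaning / transformation / reduction choices above.
import pandas as pd

df = pd.DataFrame({
    "name": ["Alice", "Bob", "Carol"],           # hypothetical records
    "age": [58, 34, 46],
    "company": ["Acme", "Initech", "Globex"],    # suppose rows #2 and #3 are swapped
})

# Data cleaning: simply remove the rows known to be entered incorrectly ...
cleaned = df.drop(index=[1, 2])

# ... or data transformation: fix the values manually instead.
fixed = df.copy()
fixed.loc[[1, 2], "company"] = ["Globex", "Initech"]

# Data reduction: sort by descending age to focus on the age range of interest.
reduced = fixed.sort_values("age", ascending=False)
print(reduced)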