
C. Abdul Hakeem College of Engineering & Technology


Department of Master of Computer Applications
MC4301 - Machine Learning
Unit - 1
Introduction

Human Learning
Learning is the process of acquiring new understanding, knowledge, behaviors, skills,
values, attitudes, and preferences.
Learning consists of complex information processing, problem-solving, decision-making under uncertainty, and the urge to transfer knowledge and skills into new, unknown settings.

The process of learning is continuous: it starts right from the birth of an individual and continues till death. We are all engaged in learning endeavours in order to develop our adaptive capabilities as per the requirements of the changing environment.

For learning to occur, two things are important:

1. The presence of a stimulus in the environment, and
2. Innate dispositions like emotional and instinctual dispositions.

A person keeps on learning across all the stages of life, by constructing or reconstructing experiences under the influence of emotional and instinctual dispositions.

Psychologists in general define Learning as a relatively permanent behavioural modification which takes place as a result of experience. This definition of learning stresses three important elements of learning:

 Learning involves a behavioural change which can be better or worse.
 This behavioural change should take place as a result of practice and experience. Changes resulting from maturity or growth cannot be considered as learning.
 This behavioural change must be relatively permanent and last for a relatively long time.

John B Watson is one of the first thinkers to have proven that behavioural changes occur as a result of learning. Watson is believed to be the founder of the Behavioural school of thought, which gained prominence around the first half of the 20th century.

Gales defined Learning as the behavioural modification which occurs as a result of experience as well as training.

Crow and Crow defined learning as the process of acquisition of knowledge, habits
and attitudes.

According to E.A. Peel, Learning can be described as a change in the individual which takes place as a result of environmental change.

H.J. Klausmeir described Learning as a process which leads to some behavioural change as a result of some experience, training, observation, activity, etc.

The key characteristics of the learning process are:

1. Described in the simplest possible manner, learning is an experience acquisition process.
2. In its more complex form, learning can be described as a process of acquisition, retention and modification of experience.
3. It re-establishes the relationship between a stimulus and a response.
4. It is a method of problem solving and is concerned with making adjustments to the environment.
5. It involves the whole gamut of activities which may have a relatively permanent effect on the individual.
6. The process of learning is concerned with experience acquisition, retention of experiences, development of experiences in a step-by-step manner, and synthesis of old and new experiences to create a new pattern.
7. Learning is concerned with cognitive, conative and affective aspects. Knowledge acquisition is cognitive, any change in the emotions is affective, and the acquisition of new habits or skills is conative.

Types of Learning

1. Motor Learning: Our day-to-day activities like walking, running, driving, etc., must be learnt to ensure a good life. These activities to a great extent involve muscular coordination.
2. Verbal Learning: This is related to the language we use to communicate and to various other forms of verbal communication such as symbols, words, languages, sounds, figures and signs.
3. Concept Learning: This form of learning is associated with higher-order cognitive processes like intelligence, thinking, reasoning, etc., which we learn right from our childhood. Concept learning involves the processes of abstraction and generalization, which are very useful for identifying or recognizing things.
4. Discrimination Learning: Learning which distinguishes between various stimuli and responds to each appropriately and differently is regarded as discrimination learning.
5. Learning of Principles: Learning which is based on principles helps in managing work most effectively. Principle-based learning explains the relationship between various concepts.

6. Attitude Learning: Attitude shapes our behaviour to a very great
extent, as our positive or negative behaviour is based on our attitudinal
predisposition.

3 Types of Behavioural Learning

The Behavioural School of Thought, founded by John B Watson and highlighted in his seminal work "Psychology as the Behaviorist Views It", stressed that Psychology is an objective science; hence, mere emphasis on mental processes should not be considered, as such processes cannot be objectively measured or observed.

Watson tried to prove his theory with the help of his famous Little Albert Experiment, by way of which he conditioned a small child to be scared of a white rat. Behavioural psychology describes three types of learning: Classical Conditioning, Observational Learning and Operant Conditioning.

1. Classical Conditioning: In Classical Conditioning, the process of learning is described as a Stimulus-Response connection or association.

Classical Conditioning theory has been explained with the help of Pavlov's classic experiment, in which food was used as the natural stimulus and was paired with a previously neutral stimulus, in this case the sound of a bell. By establishing an association between the natural stimulus (food) and the neutral stimulus (sound of the bell), the desired response can be elicited. This theory will be discussed in detail in the following sections.

2. Operant Conditioning: Propounded first by Edward Thorndike and later by B.F. Skinner, this theory stresses that the consequences of actions shape behaviour.

The theory explains that the intensity of a response is either increased or decreased as a result of reinforcement or punishment. Skinner explained how, with the help of reinforcement, one can strengthen behaviour, and with punishment reduce or curb behaviour. It was also observed that behavioural change strongly depends on the schedules of reinforcement, with a focus on the timing and rate of reinforcement.

3. Observational Learning: The Observational Learning process was propounded by Albert Bandura in his Social Learning Theory, which focused on learning by imitation or by observing people's behaviour. For observational learning to take place effectively, four elements are essential: Motivation, Attention, Memory and Motor Skills.

Machine Learning

Machine learning is a subfield of artificial intelligence, which is broadly defined as the capability of a machine to imitate intelligent human behavior. Artificial intelligence systems are used to perform complex tasks in a way that is similar to how humans solve problems.
Machine learning is used in internet search engines, email filters to sort out spam,
websites to make personalised recommendations, banking software to detect unusual
transactions, and lots of apps on our phones such as voice recognition.

Types

As with any method, there are different ways to train machine learning algorithms,
each with their own advantages and disadvantages. To understand the pros and cons
of each type of machine learning, we must first look at what kind of data they ingest.
In ML, there are two kinds of data — labeled data and unlabeled data.

Labeled data has both the input and output parameters in a completely machine-readable form, but requires a lot of human labor to label the data to begin with. Unlabeled data has only one or none of the parameters in a machine-readable form. This removes the need for human labor but requires more complex solutions.
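As a small illustration of the difference, the sketch below builds a tiny labeled set and a tiny unlabeled set with NumPy; the numbers and label names are made up purely for illustration.

import numpy as np

# Labeled data: every input row comes with a known output (the label).
X_labeled = np.array([[5.1, 3.5], [6.2, 2.9], [4.8, 3.1]])   # input parameters
y_labels = np.array(["class_a", "class_b", "class_a"])       # output parameter

# Unlabeled data: only the inputs are available, with no outputs attached.
X_unlabeled = np.array([[5.9, 3.0], [5.0, 3.4]])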

There are also some types of machine learning algorithms that are used in very
specific use-cases, but three main methods are used today.

Supervised Learning
Supervised learning is one of the most basic types of machine learning. In this type,
the machine learning algorithm is trained on labeled data. Even though the data needs
to be labeled accurately for this method to work, supervised learning is extremely
powerful when used in the right circumstances.

In supervised learning, the ML algorithm is given a small training dataset to work with. This training dataset is a smaller part of the bigger dataset and serves to give the algorithm a basic idea of the problem, solution, and data points to be dealt with. The training dataset is also very similar to the final dataset in its characteristics and provides the algorithm with the labeled parameters required for the problem.

The algorithm then finds relationships between the parameters given, essentially
establishing a cause and effect relationship between the variables in the dataset. At the
end of the training, the algorithm has an idea of how the data works and the
relationship between the input and the output.

This solution is then deployed for use with the final dataset, which it learns from in the same way as the training dataset. This means that supervised machine learning algorithms will continue to improve even after being deployed, discovering new patterns and relationships as they train themselves on new data.
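A minimal supervised-learning sketch in Python is shown below, assuming scikit-learn (discussed later in this unit) and its bundled, labeled Iris dataset; the held-out portion simply plays the role of the "final" unseen data described above.

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression

X, y = load_iris(return_X_y=True)                 # labeled data: inputs X, outputs y

# Keep part of the data aside to stand in for the unseen "final" dataset.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

model = LogisticRegression(max_iter=1000)         # learns the input-output relationship
model.fit(X_train, y_train)                       # training on the labeled training set

print("Accuracy on unseen data:", model.score(X_test, y_test))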

Unsupervised Learning
Unsupervised machine learning holds the advantage of being able to work with
unlabeled data. This means that human labor is not required to make the dataset
machine-readable, allowing much larger datasets to be worked on by the program.

In supervised learning, the labels allow the algorithm to find the exact nature of the
relationship between any two data points. However, unsupervised learning does not
have labels to work off of, resulting in the creation of hidden structures. Relationships
between data points are perceived by the algorithm in an abstract manner, with no
input required from human beings.

The creation of these hidden structures is what makes unsupervised learning algorithms versatile. Instead of a defined and set problem statement, unsupervised learning algorithms can adapt to the data by dynamically changing hidden structures. This offers more post-deployment development than supervised learning algorithms.
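The sketch below is one hedged illustration of unsupervised learning: k-means clustering from scikit-learn groups a handful of made-up, unlabeled points without any human-provided labels.

import numpy as np
from sklearn.cluster import KMeans

# Unlabeled data: input features only, no output column.
X = np.array([[1.0, 1.1], [0.9, 1.0], [1.2, 0.8],
              [8.0, 8.2], [7.9, 8.1], [8.3, 7.8]])

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print(kmeans.labels_)   # the grouping ("hidden structure") the algorithm discovered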

Reinforcement Learning

Reinforcement learning directly takes inspiration from how human beings learn from
data in their lives. It features an algorithm that improves upon itself and learns from
new situations using a trial-and-error method. Favorable outputs are encouraged or
‘reinforced’, and non-favorable outputs are discouraged or ‘punished’.

Based on the psychological concept of conditioning, reinforcement learning works by
putting the algorithm in a work environment with an interpreter and a reward system.
In every iteration of the algorithm, the output result is given to the interpreter, which
decides whether the outcome is favorable or not.

If the program finds the correct solution, the interpreter reinforces the solution by providing a reward to the algorithm. If the outcome is not favorable, the algorithm is forced to reiterate until it finds a better result. In most cases, the reward system is directly tied to the effectiveness of the result.

In typical reinforcement learning use-cases, such as finding the shortest route between
two points on a map, the solution is not an absolute value. Instead, it takes on a score
of effectiveness, expressed in a percentage value. The higher this percentage value is,
the more reward is given to the algorithm. Thus, the program is trained to give the
best possible solution for the best possible reward.
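As a toy illustration of this trial-and-error loop, the sketch below runs tabular Q-learning on a made-up five-cell corridor in which reaching the rightmost cell earns a reward; the environment, rewards and hyperparameters are illustrative assumptions, not part of the text above.

import random

n_states, n_actions = 5, 2          # actions: 0 = left, 1 = right
Q = [[0.0] * n_actions for _ in range(n_states)]
alpha, gamma, epsilon = 0.1, 0.9, 0.2

def step(state, action):
    # Interpreter / reward system: +1 for reaching the goal, 0 otherwise.
    next_state = max(0, state - 1) if action == 0 else min(n_states - 1, state + 1)
    reward = 1.0 if next_state == n_states - 1 else 0.0
    return next_state, reward

for _ in range(500):                 # episodes of trial and error
    state = 0
    while state != n_states - 1:
        action = random.randrange(n_actions) if random.random() < epsilon \
                 else max(range(n_actions), key=lambda a: Q[state][a])
        next_state, reward = step(state, action)
        # Reinforce favourable outcomes: update the action-value estimate.
        Q[state][action] += alpha * (reward + gamma * max(Q[next_state]) - Q[state][action])
        state = next_state

print(Q)   # after training, "right" should carry the higher value in every state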

Problems not to be solved

We are always amazed at how machine learning has made such an impact on our lives.
There is no doubt that ML will completely change the face of various industries, as
well as job profiles. While it offers a promising future, there are some inherent
problems at the heart of ML and AI advancements that put these technologies at a
disadvantage. While it can solve a plethora of challenges, there are a few tasks which
ML fails to answer.

1. Reasoning Power

One area that ML has not successfully mastered is reasoning power, a distinctly human trait. The algorithms available today are mainly oriented towards specific use-cases and are narrow in their applicability. They cannot reason about why a particular result occurs, or 'introspect' their own outcomes.

For instance, if an image recognition algorithm identifies apples and oranges in a given scenario, it cannot say whether the apple (or orange) has gone bad or not, or why that fruit is an apple or an orange. Mathematically, all of this learning process can be explained by us, but from an algorithmic perspective the innate property cannot be articulated by the algorithms, or even by us.

In other words, ML algorithms lack the ability to reason beyond their intended
application.

2. Contextual Limitation

If we consider the area of natural language processing (NLP), text and speech information are the means by which NLP algorithms understand languages. They may learn letters, words, sentences or even the syntax, but where they fall short is the context of the language. Algorithms do not understand the context in which language is used. A classic example of this is the "Chinese room" argument given by philosopher John Searle, which says that computer programs or algorithms grasp ideas merely through 'symbols' rather than the context given.

So, ML does not have an overall idea of the situation. It is limited to memorised, symbol-level interpretations rather than understanding what is actually going on.

3. Scalability

Although we see ML implementations being deployed on a significant scale, everything depends on the data as well as its scalability. Data is growing at an enormous rate and in many forms, which largely affects the scalability of an ML project. Algorithms cannot do much about this unless they are constantly updated to handle new changes in the data. This is where ML regularly requires human intervention, and scalability remains a mostly unsolved problem.

In addition, growing data has to be dealt with the right way if shared on an ML platform, which again needs examination through knowledge and intuition that current ML apparently lacks.

4. Regulatory Restriction For Data In ML

ML usually needs considerable (in fact, massive) amounts of data in stages such as training, cross-validation, etc. Sometimes the data includes private as well as general information. This is where it gets complicated. Most tech companies have privatised data, and these data are the ones which are actually useful for ML applications. But there comes the risk of wrong usage of data, especially in critical areas such as medical research, health insurance, etc.

Even though data is anonymised at times, it can still be vulnerable. Hence, this is the reason regulatory rules are imposed heavily when it comes to using private data.

5. Internal Working Of Deep Learning

This sub-field of ML is actually responsible for today's AI growth. What was once just a theory has turned out to be the most powerful aspect of ML. Deep Learning (DL) now powers applications such as voice recognition, image recognition and so on through artificial neural networks.

But the internal working of DL is still unknown and yet to be solved. Advanced DL algorithms still baffle researchers in terms of their working and efficiency. The millions of neurons that form the neural networks in DL increase abstraction at every level, which cannot be easily comprehended. This is why deep learning is dubbed a 'black box', since its internal workings are unknown.

Applications

Popular Machine Learning Applications and Examples

1. Social Media Features

Social media platforms use machine learning algorithms and approaches to create
some attractive and excellent features. For instance, Facebook notices and records
your activities, chats, likes, and comments, and the time you spend on specific kinds
of posts. The machine learning model learns from this activity and makes friend and page suggestions for your profile.

2. Product Recommendations

Product recommendation is one of the most popular and well-known applications of machine learning. It is one of the standout features of almost every e-commerce website today and is an advanced application of machine learning techniques. Using machine learning and AI, websites track your behavior based on your previous purchases, search patterns, and cart history, and then make product recommendations.

3. Image Recognition

Image recognition, which is an approach for cataloging and detecting a feature or an object in a digital image, is one of the most significant and notable machine learning and AI techniques. This technique is being adopted for further analysis, such as pattern recognition, face detection, and face recognition.

4. Sentiment Analysis

Sentiment analysis is one of the most essential applications of machine learning. It is a real-time machine learning application that determines the emotion or opinion of the speaker or the writer. For instance, if someone has written a review or email (or any form of document), a sentiment analyzer will instantly find out the actual thought and tone of the text. This sentiment analysis application can be used to analyze review-based websites, decision-making applications, etc.
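A toy sentiment-analysis sketch follows, using a bag-of-words classifier from scikit-learn trained on a handful of invented reviews; a real sentiment analyzer would need far more data and careful evaluation.

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

reviews = ["great product, loved it", "terrible, waste of money",
           "really happy with this", "awful experience, very bad"]
labels = ["positive", "negative", "positive", "negative"]

model = make_pipeline(CountVectorizer(), MultinomialNB())   # words -> counts -> classifier
model.fit(reviews, labels)

print(model.predict(["terrible waste of money"]))           # expected: ['negative']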

5. Automating Employee Access Control

Organizations are actively implementing machine learning algorithms to determine the level of access employees would need in various areas, depending on their job profiles. This is one of the coolest applications of machine learning.

6. Marine Wildlife Preservation

Machine learning algorithms are used to develop behavior models for endangered
cetaceans and other marine species, helping scientists regulate and monitor their
populations.

7. Regulating Healthcare Efficiency and Medical Services

Significant healthcare sectors are actively looking at using machine learning algorithms to manage operations better. They predict the waiting times of patients in the emergency waiting rooms across various departments of hospitals. The models use vital factors that help define the algorithm: details of staff at various times of day, records of patients, and complete logs of department chats and the layout of emergency rooms. Machine learning algorithms also come into play when detecting a disease, planning therapy, and predicting the disease situation. This is one of the most essential machine learning applications.

8. Predict Potential Heart Failure

An algorithm designed to scan a doctor's free-form e-notes and identify patterns in a patient's cardiovascular history is making waves in medicine. Instead of a physician digging through multiple health records to arrive at a sound diagnosis, redundancy is now reduced with computers making an analysis based on available information.

9. Banking Domain

Banks are now using the latest advanced technology machine learning has to offer to
help prevent fraud and protect accounts from hackers. The algorithms determine what
factors to consider to create a filter to keep harm at bay. Various sites that are
unauthentic will be automatically filtered out and restricted from initiating
transactions.

10. Language Translation

One of the most common machine learning applications is language translation. Machine learning plays a significant role in the translation of one language to another. We are amazed at how websites can translate from one language to another effortlessly and give contextual meaning as well. The technology behind the translation tool is called 'machine translation.' It has enabled people to interact with others from all around the world; without it, life would not be as easy as it is now. It has provided confidence to travelers and business associates to safely venture into foreign lands with the conviction that language will no longer be a barrier.

Your model will need to be taught what you want it to learn. Feeding relevant data back will help the machine draw patterns and act accordingly. It is imperative to provide relevant data and feed files to help the machine learn what is expected. In this case, with machine learning, the results you strive for depend on the contents of the files that are being recorded.

Languages/Tools

Regardless of individual preferences for a particular programming language, we have profiled five of the best programming languages for machine learning:

1. Python Programming Language

With over 8.2 million developers across the world using Python for coding, Python
ranks first in the latest annual ranking of popular programming languages by IEEE
Spectrum with a score of 100. Stack Overflow programming language trends clearly show that it is the only language that has been on the rise for the last five years.

The increasing adoption of machine learning worldwide is a major factor contributing to its growing popularity. An estimated 69% of machine learning engineers use Python, and it has become the favourite choice for data analytics, data science, machine learning, and AI – all thanks to its vast library ecosystem that lets machine learning practitioners access, handle, transform, and process data with ease. Python wins the hearts of machine learning engineers for its platform independence, low complexity, and better readability. "The Zen of Python", an interesting poem written by Tim Peters, beautifully describes why Python is gaining popularity as the best language for machine learning.

Python is the preferred programming language of choice for machine learning for
some of the giants in the IT world including Google, Instagram, Facebook, Dropbox,
Netflix, Walt Disney, YouTube, Uber, Amazon, and Reddit. Python is an indisputable
leader and by far the best language for machine learning today and here’s why:

 Extensive Collection of Libraries and Packages

Python's in-built libraries and packages provide base-level code so machine learning engineers don't have to start writing from scratch. Machine learning requires continuous data processing, and Python has libraries and packages for almost every task. This helps machine learning engineers reduce development time and improve productivity when working with complex machine learning applications. The best part of these libraries and packages is that there is almost no learning curve: once you know the basics of Python programming, you can start using them.

1. Working with textual data – use NLTK, SciKit, and NumPy
2. Working with images – use scikit-image and OpenCV
3. Working with audio – use Librosa
4. Implementing deep learning – use TensorFlow, Keras, PyTorch
5. Implementing basic machine learning algorithms – use scikit-learn
6. Want to do scientific computing – use SciPy
7. Want to visualise the data clearly – use Matplotlib, Sci-Kit, and Seaborn.
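As a hedged illustration of how a few of these libraries combine, the sketch below uses NumPy to create synthetic data, scikit-learn to fit a basic model, and Matplotlib to visualise the fit.

import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(50, 1))          # NumPy: generate synthetic inputs
y = 3.0 * X.ravel() + rng.normal(0, 2, 50)    # noisy linear target

model = LinearRegression().fit(X, y)          # scikit-learn: fit a model in one line
print("Learned slope:", model.coef_[0])

plt.scatter(X, y, label="data")               # Matplotlib: visualise the fit
plt.plot(X, model.predict(X), color="red", label="fitted line")
plt.legend()
plt.show()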

 Code Readability

The joy of coding in Python should be in seeing short, concise, readable classes that express a lot of action in a small amount of clear code — not in reams of trivial code that bores the reader to death. – Guido van Rossum

The math behind machine learning is usually complicated and unobvious. Thus, code readability is extremely important for successfully implementing complicated machine learning algorithms and versatile workflows. Python's simple syntax and the importance it places on code readability make it easy for machine learning engineers to focus on what to write instead of how to write it. Code readability makes it easier for machine learning practitioners to exchange ideas, algorithms, and tools with their peers. Python is not only popular among machine learning engineers; it is also one of the most popular programming languages among data scientists.

 Flexibility

The multiparadigm and flexible nature of Python makes it easy for machine learning engineers to approach a problem in the simplest way possible. It supports the procedural, functional, object-oriented, and imperative styles of programming, allowing machine learning experts to work comfortably with whichever approach fits best. The flexibility Python offers helps machine learning engineers choose a programming style based on the type of problem – sometimes it is beneficial to capture state in an object, while other times the problem might require passing functions around as parameters. Python provides the flexibility to choose either approach and minimises the likelihood of errors. Not only in terms of programming styles: Python also has a lot to offer in terms of flexibility when it comes to implementing changes, as machine learning practitioners need not recompile the source code to see the changes.

2. R Programming Language

With more than 2 million R users, 12000 packages in the CRAN open-source
repository, close to 206 R Meetup groups, over 4000 R programming questions asked
every month, and 40K+ members on LinkedIn’s R group – R is an incredible
programming language for machine learning, written by statisticians for statisticians. The R language can also be used by non-programmers, including data miners, data analysts, and statisticians.

A critical part of a machine learning engineer's day-to-day job is understanding statistical principles so they can apply those principles to big data. The R programming language is a fantastic choice when it comes to crunching large numbers and is the preferred choice for machine learning applications that use a lot of statistical data. With user-friendly IDEs like RStudio and various tools to draw graphs and manage libraries, R is a must-have programming language in a machine learning engineer's toolkit. Here's what makes R one of the most effective machine learning languages for cracking business problems –

 Machine learning engineers need to train algorithms and bring in automation to make accurate predictions. The R language provides a variety of tools to train and evaluate machine learning algorithms for predicting future events, making machine learning easy and approachable. R has an exhaustive list of packages for machine learning –

1. MICE for dealing with missing values.
2. CARET for working with classification and regression problems.
3. party and rpart for creating data partitions (recursive partitioning).
4. randomForest for building random forests (ensembles of decision trees).
5. dplyr and tidyr for data manipulation.
6. ggplot2 for creating beautiful visualisations.
7. R Markdown and Shiny for communicating insights through reports.

 R is an open-source programming language, making it a highly cost-effective choice for machine learning projects of any size.
 R supports the natural implementation of matrix arithmetic and other data structures like vectors, which Python does not. For a similar implementation in Python, machine learning engineers have to use the NumPy package, which is a clumsier implementation when compared to R.
 R is considered a powerful choice for machine learning because of the breadth of machine learning techniques it provides. Be it data visualisation, data sampling, data analysis, model evaluation, or supervised/unsupervised machine learning – R has a diverse array of techniques to offer.
 The style of programming in the R language is quite easy.
 R is highly flexible and also offers cross-platform compatibility. R does not impose restrictions on how a task is performed, so machine learning practitioners can mix tools – choosing the best tool for each task while also enjoying the benefits of other tools along with R.

3. Java and JavaScript

Though Python and R continue to be the favourites of machine learning enthusiasts, Java is gaining popularity among machine learning engineers who come from a Java development background, as they don't need to learn a new programming language like Python or R to implement machine learning. Many organisations already have huge Java codebases, and most of the open-source tools for big data processing, such as Hadoop and Spark, run on the JVM. Using Java for machine learning projects makes it easier for machine learning engineers to integrate with existing code repositories. Features like ease of use, package services, better user interaction, easy debugging, and graphical representation of data make it a machine learning language of choice –

 Java has plenty of third-party libraries for machine learning. Java-ML is one such library that provides a collection of machine learning algorithms implemented in Java. You can also use the Arbiter Java library for hyperparameter tuning, which is an integral part of making ML algorithms run effectively, or the Deeplearning4j library, which supports popular machine learning algorithms such as K-Nearest Neighbor and lets you create neural networks; Neuroph can also be used for neural networks.
 Scalability is an important feature that every machine learning engineer must consider before beginning a project. Java makes application scaling easier for machine learning engineers, making it a great choice for the development of large and complex machine learning applications from scratch.
 The Java Virtual Machine is one of the best platforms for machine learning, as engineers can run the same code on multiple platforms. The JVM also helps machine learning engineers create custom tools at a rapid pace and has various IDEs that help improve overall productivity. Java works best for speed-critical machine learning projects as it executes quickly.

4. Julia

Julia is a high-performance, general-purpose dynamic programming language emerging as a potential competitor to Python and R, with many predominant features exclusively for machine learning. While it is a general-purpose programming language and can be used for the development of all kinds of applications, it works best for high-performance numerical analysis and computational science. With support for all types of hardware, including TPUs and GPUs on every cloud, Julia powers machine learning applications at big corporations like Apple, Disney, Oracle, and NASA.

Why use Julia for machine learning?

 Julia is particularly designed for implementing the basic mathematics and scientific computing that underlie most machine learning algorithms.
 Julia code is compiled just in time (at run time) using the LLVM framework. This gives machine learning engineers great speed without any handcrafted profiling or optimisation techniques, solving many of the performance problems.
 Julia code is universally executable. So, once a machine learning application is written, it can be compiled natively in Julia from other languages like Python or R using a wrapper like PyCall or RCall.
 Scalability, as discussed, is crucial for machine learning engineers, and Julia makes it easy to deploy quickly on large clusters. With powerful tools like TensorFlow.jl, MLBase.jl, Flux.jl, ScikitLearn.jl, and many others that utilise the scalability provided by Julia, it is an apt choice for machine learning applications.
 Julia offers support for editors like Emacs and Vim, as well as IDEs like Visual Studio and Juno.

5. LISP

Created in 1958 by John McCarthy, LISP (List Processing) is the second-oldest programming language still in use and was mainly developed for AI-centric applications. LISP is a dynamically typed programming language that has influenced the creation of many machine learning programming languages like Python, Julia, and Java. LISP works on a Read-Eval-Print Loop (REPL) and has the capability to code, compile, and run code in 30+ programming languages.

Lisp is a language for doing what you've been told is impossible. – Kent Pitman

LISP is considered the most efficient and flexible machine learning language for solving specific problems, as it adapts to the solution a programmer is coding for. This is what makes LISP different from other machine learning languages. Today, it is particularly used for inductive logic problems and machine learning. The first AI chatbot, ELIZA, was developed using LISP, and even today machine learning practitioners can use it to create chatbots for eCommerce. LISP definitely deserves a mention in the list of best languages for machine learning because even today developers rely on LISP for artificial intelligence projects that are heavy on machine learning, as LISP offers –

 Rapid prototyping capabilities
 Dynamic object creation
 Automatic garbage collection
 Flexibility
 Support for symbolic expressions

Despite being flexible for machine learning, LISP lacks the support of well-known machine learning libraries. LISP is neither a beginner-friendly machine learning language (it is difficult to learn) nor does it have a large user community like that of Python or R.

The best language for machine learning depends on the area in which it is going to be
applied, the scope of the machine learning project, which programming languages are
used in your industry/company, and several other factors. Experimentation, testing,
and experience help a machine learning practitioner decide on an optimal choice of
programming language for any given machine learning problem. Of course, the best thing would be to learn at least two programming languages for machine learning, as this will help you put your machine learning resume at the top of the stack. Once you are proficient in one machine learning language, learning another one is easy.

Machine Learning Tools

1. Microsoft Azure Machine Learning

Azure Machine Learning is a cloud platform that allows developers to build, train,
and deploy AI models. Microsoft is constantly making updates and improvements to
its machine learning tools and has recently announced changes to Azure Machine
Learning, retiring the Azure Machine Learning Workbench.

2. IBM Watson

No, IBM’s Watson Machine Learning isn’t something out of Sherlock Holmes.
Watson Machine Learning is an IBM cloud service that uses data to put machine
learning and deep learning models into production. This machine learning tool allows
users to perform training and scoring, two fundamental machine learning operations.
Keep in mind, IBM Watson is best suited for building machine learning applications
through API connections.

3. Google TensorFlow

TensorFlow, which is used for research and production at Google, is an open-source software library for dataflow programming. The bottom line: TensorFlow is a machine learning framework. This machine learning tool is relatively new to the market and is evolving quickly. TensorFlow's easy visualization of neural networks is likely the most attractive feature to developers.

4. Amazon Machine Learning

It should come as no surprise that Amazon offers an impressive number of machine learning tools. According to the AWS website, Amazon Machine Learning is a managed service for building Machine Learning models and generating predictions. Amazon Machine Learning includes an automatic data transformation tool, simplifying the machine learning tool even further for the user. In addition, Amazon also offers other machine learning tools such as Amazon SageMaker, which is a fully-managed platform that makes it easy for developers and data scientists to utilize machine learning models.

5. OpenNN

OpenNN, short for Open Neural Networks Library, is a software library that
implements neural networks. Written in C++ programming language, OpenNN offers
you the perk of downloading its entire library for free from GitHub or SourceForge.

Issues

Although machine learning is being used in every industry and helps organizations make more informed, data-driven choices that are more effective than classical methodologies, it still has many problems that cannot be ignored. Here are some common issues in Machine Learning that professionals face while building ML skills and creating applications from scratch.

1. Inadequate Training Data

The major issue that arises while using machine learning algorithms is the lack of both quality and quantity of data. Although data plays a vital role in the processing of machine learning algorithms, many data scientists claim that inadequate, noisy, and unclean data make training machine learning algorithms extremely difficult. For example, a simple task requires thousands of sample data points, and an advanced task such as speech or image recognition needs millions of sample data examples. Further, data quality is also important for the algorithms to work ideally, but an absence of data quality is also found in Machine Learning applications. Data quality can be affected by the following factors:

 Noisy data – It is responsible for inaccurate predictions that affect the decision as well as accuracy in classification tasks.
 Incorrect data – It is also responsible for faulty results obtained from machine learning models. Hence, incorrect data may affect the accuracy of the results.
 Generalizing of output data – Sometimes it is also found that generalizing output data becomes complex, which results in comparatively poor future actions.

2. Poor quality of data

As we have discussed above, data plays a significant role in machine learning, and it
must be of good quality as well. Noisy data, incomplete data, inaccurate data, and
unclean data lead to less accuracy in classification and low-quality results. Hence,
data quality can also be considered as a major common problem while processing
machine learning algorithms.

3. Non-representative training data

To make sure our model generalizes well, we have to ensure that the sample training data is representative of the new cases that we need to generalize to. The training data must cover all cases that have already occurred as well as those that are occurring.

Further, if we use non-representative training data in the model, it results in less accurate predictions. A machine learning model is said to be ideal if it predicts well for generalized cases and provides accurate decisions. If there is too little training data, there will be sampling noise in the model, producing a non-representative training set; the model won't be accurate in its predictions and will be biased towards one class or group.

Hence, we should use representative data in training to protect against being biased
and make accurate predictions without any drift.

4. Overfitting and Underfitting

Overfitting:

Overfitting is one of the most common issues faced by Machine Learning engineers and data scientists. When a machine learning model is trained, it can start capturing the noise and inaccuracies present in the training data set, which negatively affects the performance of the model. Let's understand this with a simple example where the training data set contains 1000 mangoes, 1000 apples, 1000 bananas, and 5000 papayas. There is then a considerable probability of identifying an apple as a papaya because we have a massive amount of biased data in the training data set; hence the prediction is negatively affected. A common reason behind overfitting is the use of highly flexible non-linear methods in machine learning algorithms, as they can build unrealistic data models; overfitting can be reduced by using simpler linear and parametric algorithms in the machine learning models.

Methods to reduce overfitting:

 Increase the training data in the dataset.
 Reduce model complexity by simplifying the model and selecting one with fewer parameters.
 Apply Ridge regularization or Lasso regularization (a brief sketch follows this list).
 Use early stopping during the training phase.
 Reduce the noise.
 Reduce the number of attributes in the training data.
 Constrain the model.
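A small sketch of two of these remedies, reduced model complexity and Ridge regularization, is shown below using scikit-learn on synthetic data; the polynomial degree, alpha value and scores are purely illustrative.

import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(1)
X = np.sort(rng.uniform(0, 1, size=(30, 1)), axis=0)
y = np.sin(2 * np.pi * X).ravel() + rng.normal(0, 0.2, 30)   # noisy signal

# An over-complex, unregularized model tends to chase the noise ...
overfit = make_pipeline(PolynomialFeatures(degree=15), LinearRegression())
# ... while a Ridge penalty constrains the model and usually generalizes better.
regularized = make_pipeline(PolynomialFeatures(degree=15), Ridge(alpha=1.0))

for name, model in [("unregularized", overfit), ("ridge", regularized)]:
    score = cross_val_score(model, X, y, cv=5).mean()
    print(name, "cross-validated R^2:", round(score, 3))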

Underfitting:

Underfitting is just the opposite of overfitting. Whenever a machine learning model is trained with too little data, it provides incomplete and inaccurate predictions and destroys the accuracy of the machine learning model.

Underfitting occurs when our model is too simple to capture the underlying structure of the data, just like an undersized pant. This generally happens when we have limited data in the data set and we try to build a linear model with non-linear data. In such scenarios, the model is not complex enough, its rules are too simple to be applied to the data set, and the model starts making wrong predictions.

Methods to reduce Underfitting:

 Increase model complexity.
 Remove noise from the data.
 Train on more and better features.
 Reduce the constraints.
 Increase the number of epochs to get better results.

5. Monitoring and maintenance

As we know, generalized output data is mandatory for any machine learning model; hence, regular monitoring and maintenance become compulsory. As data and desired actions change, the code as well as the resources used to monitor the model must also be updated.

6. Getting bad recommendations

A machine learning model is trained in a specific context; when that context changes, it can produce bad recommendations as a result of concept drift. Let's understand this with an example where, at a specific time, a customer is looking for some gadgets; the customer's requirements change over time, but the machine learning model keeps showing the same recommendations even though the customer's expectations have changed. This phenomenon is called data drift. It generally occurs when new data is introduced or the interpretation of data changes. However, we can overcome this by regularly updating and monitoring data according to expectations.

7. Lack of skilled resources

Although Machine Learning and Artificial Intelligence are continuously growing in the market, these fields are still young in comparison to others. The absence of skilled resources in the form of manpower is also an issue. Hence, we need manpower with in-depth knowledge of mathematics, science, and technology for developing and managing scientific substance for machine learning.

8. Customer Segmentation

Customer segmentation is also an important issue while developing a machine learning algorithm. It is difficult to identify the customers who acted on the recommendations shown by the model and those who did not even check them. Hence, an algorithm is necessary to recognize customer behavior and trigger relevant recommendations for the user based on past experience.

9. Process Complexity of Machine Learning

The machine learning process is very complex, which is another major issue faced by machine learning engineers and data scientists. Machine Learning and Artificial Intelligence are very new technologies that are still in an experimental phase and continuously changing over time. There is a great deal of hit-and-trial experimentation; hence, the probability of error is higher than expected. Further, the process also includes analyzing the data, removing data bias, training the data, applying complex mathematical calculations, etc., making the procedure more complicated and quite tedious.

10. Data Bias

Data bias is also a big challenge in Machine Learning. These errors exist when certain elements of the dataset are heavily weighted or given more importance than others. Biased data leads to inaccurate results, skewed outcomes, and other analytical errors. However, we can resolve this error by determining where the data is actually biased in the dataset and then taking the necessary steps to reduce it.

Methods to remove Data Bias:

 Research more for customer segmentation.
 Be aware of your general use cases and potential outliers.
 Combine inputs from multiple sources to ensure data diversity.
 Include bias testing in the development process.
 Analyze data regularly and keep tracking errors to resolve them easily.
 Review the collected and annotated data.
 Use multi-pass annotation such as sentiment analysis, content moderation, and intent recognition.

11. Lack of Explainability

This basically means that the outputs cannot be easily comprehended, as the model is programmed in specific ways to deliver outputs for certain conditions. Hence, a lack of explainability is also found in machine learning algorithms, which reduces the credibility of the algorithms.

12. Slow implementations and results

This issue is also very commonly seen in machine learning models. Machine learning models can be highly efficient in producing accurate results, but doing so is time-consuming. Slow programs, excessive requirements, and overloaded data take more time than expected to provide accurate results. This demands continuous maintenance and monitoring of the model in order to deliver accurate results.

13. Irrelevant features

Although machine learning models are intended to give the best possible outcome, if
we feed garbage data as input, then the result will also be garbage. Hence, we should
use relevant features in our training sample. A machine learning model is said to be good if the training data has a good set of features with few to no irrelevant features.

Preparing to Model - Introduction

Getting the data right is the first step in any AI or machine learning project -- and it's
often more time-consuming and complex than crafting the machine learning
algorithms themselves. Advanced planning to help streamline and improve data
preparation in machine learning can save considerable work down the road. It can also
lead to more accurate and adaptable algorithms.

"Data preparation is the action of gathering the data you need, massaging it into a
format that's computer-readable and understandable, and asking hard questions of it to
check it for completeness and bias," said Eli Finkelshteyn, founder and CEO of
Constructor.io, which makes an AI-driven search engine for product websites.

It's tempting to focus only on the data itself, but it's a good idea to first consider the
problem you're trying to solve. That can help simplify considerations about what kind
of data to gather, how to ensure it fits the intended purpose and how to transform it
into the appropriate format for a specific type of algorithm.

Good data preparation can lead to more accurate and efficient algorithms, while
making it easier to pivot to new analytics problems, adapt when model accuracy drifts
and save data scientists and business users considerable time and effort down the line.

The importance of data preparation in machine learning

"Being a great data scientist is like being a great chef," surmised Donncha Carroll, a
partner at consultancy Axiom Consulting Partners. "To create an exceptional meal,
you must build a detailed understanding of each ingredient and think through how
they'll complement one another to produce a balanced and memorable dish. For a data
scientist, this process of discovery creates the knowledge needed to understand more
complex relationships, what matters and what doesn't, and how to tailor the data
preparation approach necessary to lay the groundwork for a great ML model."

Managers need to appreciate the ways in which data shapes machine learning
application development differently compared to customary application development.
"Unlike traditional rule-based programming, machine learning consists of two parts
that make up the final executable algorithm -- the ML algorithm itself and the data to
learn from," explained Felix Wick, corporate vice president of data science at supply
chain management platform provider Blue Yonder. "But raw data are often not ready
to be used in ML models. So, data preparation is at the heart of ML."

Data preparation consists of several steps, which consume more time than other
aspects of machine learning application development. A 2021 study by data science
platform vendor Anaconda found that data scientists spend an average of 22% of their
time on data preparation, which is more than the average time spent on other tasks
like deploying models, model training and creating data visualizations.

Although it is a time-intensive process, data scientists must pay attention to various considerations when preparing data for machine learning. Following are six key steps that are part of the process.

1. Problem formulation

Data preparation for building machine learning models is a lot more than just cleaning
and structuring data. In many cases, it's helpful to begin by stepping back from the
data to think about the underlying problem you're trying to solve. "To build a
successful ML model," Carroll advised, "you must develop a detailed understanding
of the problem to inform what you do and how you do it."

Start by spending time with the people that operate within the domain and have a
good understanding of the problem space, synthesizing what you learn through
conversations with them and using your experience to create a set of hypotheses that
describes the factors and forces involved. This simple step is often skipped or
underinvested in, Carroll noted, even though it can make a significant difference in deciding what data to capture. It can also provide useful guidance on how the data should be transformed and prepared for the machine learning model.

An Axiom legal client, for example, wanted to know how different elements of
service delivery impact account retention and growth. Carroll's team collaborated with
the attorneys to develop a hypothesis that accounts served by legal professionals
experienced in their industry tend to be happier and continue as clients longer. To
provide that information as an input to a machine learning model, they looked back
over the course of each professional's career and used billing data to determine how
much time they spent serving clients in that industry.

"Ultimately," Carroll added, "it became one of the most important predictors of client
retention and something we would never have calculated without spending the time
upfront to understand what matters and how it matters."

2. Data collection and discovery

Once a data science team has formulated the machine learning problem to be solved,
it needs to inventory potential data sources within the enterprise and from external
third parties. The data collection process must consider not only what the data is
purported to represent, but also why it was collected and what it might mean,
particularly when used in a different context. It's also essential to consider factors that
may have biased the data.

"To reduce and mitigate bias in machine learning models," said Sophia Yang, a senior
data scientist at Anaconda, "data scientists need to ask themselves where and how the
data was collected to determine if there were significant biases that might have been
captured." To train a machine learning model that predicts customer behavior, for
example, look at the data and ensure the data set was collected from diverse people,
geographical areas and perspectives.

"The most important step often missed in data preparation for machine learning is
asking critical questions of data that otherwise looks technically correct,"
Finkelshteyn said. In addition to investigating bias, he recommended determining if
there's reason to believe that important missing data may lead to a partial picture of
the analysis being done. In some cases, analytics teams use data that works
technically but produces inaccurate or incomplete results, and people who use the
resulting models build on these faulty learnings without knowing something is wrong.

3. Data exploration

Data scientists need to fully understand the data they're working with early in the
process to cultivate insights into its meaning and applicability. "A common mistake is
to launch into model building without taking the time to really understand the data
you've wrangled," Carroll said.

Data exploration means reviewing such things as the type and distribution of data
contained within each variable, the relationships between variables and how they vary
relative to the outcome you're predicting or interested in achieving.

This step can highlight problems like collinearity -- variables that move together -- or
situations where standardization of data sets and other data transformations are
necessary. It can also surface opportunities to improve model performance, like
reducing the dimensionality of a data set.

Data visualizations can also help improve this process. "This might seem like an
added step that isn't needed," Yang conjectured, "but our brains are great at spotting
patterns along with data that doesn't match the pattern." Data scientists can easily see
trends and explore the data correctly by creating suitable visualizations before
drawing conclusions. Popular data visualization tools include Tableau, Microsoft
Power BI, D3.js and Python libraries such as Matplotlib, Bokeh and the HoloViz
stack.
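A brief exploration sketch in Python follows, assuming a hypothetical customers.csv file; the calls mirror the checks described above (types, distributions, relationships and a quick visual inspection).

import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv("customers.csv")        # hypothetical data set

print(df.dtypes)                         # type of data in each variable
print(df.describe())                     # distribution of each numeric variable
print(df.corr(numeric_only=True))        # pairwise relationships, e.g. collinearity

df.hist(figsize=(8, 6))                  # quick visual check of distributions
plt.tight_layout()
plt.show()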

4. Data cleansing and validation

Various data cleansing and validation techniques can help analytics teams identify
and rectify inconsistencies, outliers, anomalies, missing data and other issues. Missing
data values, for example, can often be addressed with imputation tools that fill empty
fields with statistically relevant substitutes.

But Blue Yonder's Wick cautioned that semantic meaning is an often overlooked
aspect of missing data. In many cases, creating a dedicated category for capturing the
significance of missing values can help. In others, teams may consider explicitly
setting missing values as neutral to minimize their impact on machine learning
models.

A wide range of commercial and open source tools can be used to cleanse and
validate data for machine learning and ensure good quality data. Open source
technologies such as Great Expectations and Pandera, for example, are designed to
validate the data frames commonly used to organize analytics data into
two-dimensional tables. Tools that validate code and data processing workflows are
also available. One of them is pytest, which, Yang said, data scientists can use to
apply a software development unit-test mindset and manually write tests of their
workflows.
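As a minimal illustration of the imputation step mentioned above, the sketch below fills missing values with the column median using scikit-learn's SimpleImputer on a small, made-up data frame.

import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

df = pd.DataFrame({"age": [34, np.nan, 29, 41],
                   "income": [52000, 61000, np.nan, 58000]})

imputer = SimpleImputer(strategy="median")        # statistically relevant substitute
cleaned = pd.DataFrame(imputer.fit_transform(df), columns=df.columns)
print(cleaned)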

5. Data structuring

Once data science teams are satisfied with their data, they need to consider the
machine learning algorithms being used. Most algorithms, for example, work better
when data is broken into categories, such as age ranges, rather than left as raw
numbers.

Two often-missed data preprocessing tricks, Wick said, are data binning and
smoothing continuous features. These data regularization methods can reduce a
machine learning model's variance by preventing it from being misled by minor
statistical fluctuations in a data set.

Binning data into different groups can be done either in an equidistant manner, with
the same "width" for each bin, or equi-statistical method, with approximately the
same number of samples in each bin. It can also serve as a prerequisite for local

22
optimization of the data in each bin to help produce low-bias machine learning
models.

Smoothing continuous features can help in "denoising" raw data. It can also be used
to impose causal assumptions about the data-generating process by representing
relationships in ordered data sets as monotonic functions that preserve the order
among data elements.
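The sketch below illustrates both tricks with pandas on made-up values: equidistant binning with pd.cut, equi-statistical binning with pd.qcut, and a rolling mean as one simple way to smooth a continuous feature.

import numpy as np
import pandas as pd

ages = pd.Series([18, 22, 25, 31, 38, 45, 52, 61, 70])

equal_width = pd.cut(ages, bins=3)       # equidistant: same "width" per bin
equal_freq = pd.qcut(ages, q=3)          # equi-statistical: ~same count per bin
print(equal_width.value_counts().sort_index())
print(equal_freq.value_counts().sort_index())

# Smoothing a continuous feature: a centred rolling mean as simple "denoising".
rng = np.random.default_rng(0)
noisy = pd.Series(np.sin(np.linspace(0, 3, 50)) + rng.normal(0, 0.1, 50))
smoothed = noisy.rolling(window=5, center=True).mean()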

Other actions that data scientists often take in structuring data for machine learning
include the following:

 data reduction, through techniques such as attribute or record sampling and data aggregation;
 data normalization, which includes dimensionality reduction and data rescaling; and
 creating separate data sets for training and testing machine learning models (a brief sketch follows this list).
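A compact sketch of two of these actions, data rescaling and creating separate training and test sets, is shown below using scikit-learn on synthetic data.

import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = rng.uniform(0, 100, size=(200, 3))          # synthetic features on raw scales
y = (X[:, 0] + X[:, 1] > 100).astype(int)       # synthetic target

# Separate data sets for training and testing the model.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

# Rescale features using statistics learned from the training set only.
scaler = StandardScaler().fit(X_train)
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)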

6. Feature engineering and selection

The last stage in data preparation before developing a machine learning model is
feature engineering and feature selection.

Wick said feature engineering, which involves adding or creating new variables to
improve a model's output, is the main craft of data scientists and comes in various
forms. Examples include extracting the days of the week or other variables from a
data set, decomposing variables into separate features, aggregating variables and
transforming features based on probability distributions.

Data scientists also must address feature selection -- choosing relevant features to
analyze and eliminating nonrelevant ones. Many features may look promising but lead
to problems like extended model training and overfitting, which limits a model's
ability to accurately analyze new data. Methods such as lasso regression and
automatic relevance determination can help with feature selection.
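A minimal lasso-based feature selection sketch with scikit-learn, run here on a built-in sample dataset rather than real project data:

from sklearn.datasets import load_diabetes
from sklearn.linear_model import Lasso
from sklearn.feature_selection import SelectFromModel

X, y = load_diabetes(return_X_y=True)

# L1 regularization drives the coefficients of weak features towards zero,
# so the surviving (non-zero) features are the "selected" ones.
selector = SelectFromModel(Lasso(alpha=1.0))
selector.fit(X, y)

X_reduced = selector.transform(X)
print("kept", X_reduced.shape[1], "of", X.shape[1], "features")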

Machine Learning Activities

Machine Learning technology is already a part of all of our lives. It is making decisions both for us and about us. It is the technology behind:

 Facial recognition
 Targeted advertising
 Voice recognition
 SPAM filters
 Machine translation
 Detecting credit card fraud
 Virtual Personal Assistants
 Self-driving cars
 … and lots more.

To fully understand the opportunities and consequences of the machine learning filled
future, everyone needs to be able to …

 Understand the basics of how machine learning works.
 Develop applications by training a machine learning engine.
 Use machine learning applications.
 Understand the ethical and societal issues.

What is Machine Learning?

Machine Learning is a technology that “allows computers to perform specific tasks intelligently, by learning from examples”. Rather than crafting an algorithm to do a job step by step, you craft an algorithm that learns how to do the job itself and then train it on large amounts of data. It is all about spotting patterns in massive amounts of data.

In practice creating machine learning tools is done in several steps.

1. First create a machine learning engine. It is a program implementing an algorithm of how to learn in general. (This step is for experts!)
2. Next you train it on relevant data (e.g. images of animals). The more data it sees, the better it gets at recognising things or making decisions (e.g. identifying animals).
3. You package up the newly trained tool in a user interface to make it easy for anyone to use it.
4. Your users then use the new machine learning application by giving it new data (e.g. you show it pictures of animals and it tells you what kind of animal they are).

Consider, as an example, a robot with a machine learning brain. It reacts just to the tone of voice – it doesn’t understand the words. It learnt very much like a dog does: it was ‘rewarded’ when it reacted in an appropriate way and ‘punished’ when it reacted in an inappropriate way. Eventually it learnt to behave appropriately on its own.

Understanding how machine learning works

There are several ways to try to make a machine do tasks ‘intelligently’. For example:

 Rule-based systems (writing rules explicitly)
 Neural networks (copying the way our brains learn)
 Genetic algorithms (copying the way evolution improves species to fit their environment)
 Bayesian networks (building in existing expert knowledge)


Types of data

Why is machine learning important?

Machine learning is a form of artificial intelligence (AI) that teaches computers to think in a similar way to humans: learning and improving upon past experiences. Almost any task that can be completed with a data-defined pattern or set of rules can be automated with machine learning.

So, why is machine learning important? It allows companies to transform processes that were previously only possible for humans to perform, such as responding to customer service calls, bookkeeping, and reviewing resumes for everyday businesses. Machine learning can also scale to handle larger problems and technical questions, such as image detection for self-driving cars, predicting natural disaster locations and timelines, and understanding the potential interaction of drugs with medical conditions before clinical trials. That’s why machine learning is important.

Why is data important for machine learning?

Machine learning data analysis uses algorithms that continuously improve over time, but quality data is necessary for these models to operate efficiently.

What is a dataset in machine learning?

A single row of data is called an instance. Datasets are a collection of instances that
all share a common attribute. Machine learning models will generally contain a few
different datasets, each used to fulfill various roles in the system.

For machine learning models to understand how to perform various actions, training datasets must first be fed into the machine learning algorithm, followed by validation datasets (or testing datasets) to ensure that the model is interpreting this data accurately.

Once you feed these training and validation sets into the system, subsequent datasets
can then be used to sculpt your machine learning model going forward. The more data
you provide to the ML system, the faster that model can learn and improve.

What type of data does machine learning need?

Data can come in many forms, but machine learning models rely on four primary data
types. These include numerical data, categorical data, time series data, and text data.

Numerical data

Numerical data, or quantitative data, is any form of measurable data such as your
height, weight, or the cost of your phone bill. You can determine if a set of data is
numerical by attempting to average out the numbers or sort them in ascending or
descending order. Exact or whole numbers (e.g., 26 students in a class) are considered discrete numbers, while those which fall within a continuous range (e.g., a 3.6 percent interest rate) are considered continuous numbers. While working with this type of data, keep in mind that numerical data is not tied to any specific point in time; it is simply raw numbers.

Categorical data

Categorical data is sorted by defining characteristics. This can include gender, social
class, ethnicity, hometown, the industry you work in, or a variety of other labels.
While working with this data type, keep in mind that it is non-numerical, meaning you are unable to add the values together, average them out, or sort them in any meaningful numerical order. Categorical data is great for grouping individuals or ideas that share similar attributes, helping your machine learning model streamline its data analysis.

Time series data

Time series data consists of data points that are indexed at specific points in time.
More often than not, this data is collected at consistent intervals. Learning and
utilizing time series data makes it easy to compare data from week to week, month to
month, year to year, or according to any other time-based metric you desire. The
distinct difference between time series data and numerical data is that time series data
has established starting and ending points, while numerical data is simply a collection
of numbers that aren’t rooted in particular time periods.

Text data

Text data is simply words, sentences, or paragraphs that can provide some level of
insight to your machine learning models. Since these words can be difficult for
models to interpret on their own, they are most often grouped together or analyzed
using various methods such as word frequency, text classification, or sentiment
analysis.

Where do engineers get datasets for machine learning?

There is an abundance of places where you can find machine learning data; popular starting points include public dataset repositories, competition platforms and government open data portals.

Exploring structure of data

The data structures used for machine learning are quite similar to those used in other fields of software development. Machine learning is a subset of artificial intelligence that relies on various complex algorithms to solve mathematical problems. Data structures help to build and understand these complex algorithms, and understanding them also helps you build ML models and pipelines more efficiently.

What is Data Structure?

The data structure is defined as the basic building block of computer programming
that helps us to organize, manage and store data for efficient search and retrieval.

In other words, the data structure is the collection of data type 'values' which are
stored and organized in such a way that it allows for efficient access and modification.

Types of Data Structure

A data structure is an ordered arrangement of data, while the data type, such as Integer, String or Boolean, tells the compiler or interpreter how a programmer is using the data.

There are two different types of data structures: Linear and Non-linear data structures.

1. Linear Data structure:

The linear data structure is a special type of data structure that helps to organize and
manage data in a specific order where the elements are attached adjacently.

There are mainly 4 types of linear data structure as follows:

Array:

An array is one of the most basic and common data structures used in Machine
Learning. It is also used in linear algebra to solve complex mathematical problems.
You will use arrays constantly in machine learning, whether it's:

 To convert the column of a data frame into a list format in pre-processing analysis.
 To order the frequency of words present in datasets.
 Using a list of tokenized words to begin clustering topics.
 In word embedding, by creating multi-dimensional matrices.

An array contains index numbers to represent an element starting from 0. The lowest
index is arr[0] and corresponds to the first element.

Let's take an example of a Python array used in machine learning. Although the
Python array is quite different from than array in other programming languages, the
Python list is more popular as it includes the flexibility of data types and their length.
If anyone is using Python in ML algorithms, then it's better to kick your journey from
array initially.

Python Array method:

Method      Description
append()    Adds an element at the end of the list.
clear()     Removes all elements from the list.
copy()      Returns a copy of the list.
count()     Returns the number of elements with the specified value.
extend()    Adds the elements of another list to the end of the current list.
index()     Returns the index of the first element with the specified value.
insert()    Adds an element at a specified position using an index number.
pop()       Removes and returns the element at a specified position (index number).
remove()    Removes the first element with the specified value.
reverse()   Reverses the order of the list.
sort()      Sorts the list.
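A short usage sketch of these list methods:

scores = [42, 17, 8]

scores.append(23)          # [42, 17, 8, 23]
scores.insert(1, 99)       # [42, 99, 17, 8, 23]
scores.extend([5, 5])      # [42, 99, 17, 8, 23, 5, 5]
print(scores.count(5))     # 2
print(scores.index(17))    # 2
scores.remove(99)          # removes the first 99
last = scores.pop()        # removes and returns the last element (5)
scores.sort()              # ascending order
scores.reverse()           # descending order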

Stacks:

Stacks are based on the concept of LIFO (Last In, First Out), also described as FILO (First In, Last Out). They are used, for example, for binary classification in deep learning. Although stacks are easy to learn and implement in ML models, having a good grasp of them also helps in many other computer science tasks, such as parsing grammars.

Stacks enable the undo and redo buttons on your computer, as they function like a stack of blog posts: there is no sense in adding a post at the bottom of the stack, and we can only check the most recent one that has been added. Addition and removal both occur at the top of the stack.
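A minimal sketch of an undo stack built on a plain Python list (the action names are made up):

# A Python list already behaves like a stack: append() pushes onto the top
# and pop() removes from the top (LIFO).
undo_stack = []

undo_stack.append("type 'hello'")   # push
undo_stack.append("bold text")      # push
undo_stack.append("insert image")   # push

last_action = undo_stack.pop()      # 'insert image' is undone first
print(last_action, undo_stack)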

Linked List:

A linked list is a collection of several separately allocated nodes. In other words, it is a collection of data elements in which each element consists of a value and a pointer that points to the next node in the list.

In a linked list, insertion and deletion are constant-time operations and are very efficient, but accessing a value is slow and often requires scanning the list. A linked list is therefore very useful where a dynamic array would require costly shifting of elements. Insertion can be done at the head, middle or tail position, though reaching a middle position requires a traversal. Linked lists are easy to splice together and split apart, and a list can also be converted to a fixed-length array for fast access.
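A bare-bones linked list node in Python, just to illustrate the constant-time head insertion and the linear scan needed for access:

class Node:
    """A single node holding a value and a pointer to the next node."""
    def __init__(self, value, next_node=None):
        self.value = value
        self.next = next_node

# Build a small list: 3 -> 7 -> 9
head = Node(3, Node(7, Node(9)))

# Constant-time insertion at the head (no shifting of elements needed).
head = Node(1, head)                 # 1 -> 3 -> 7 -> 9

# Accessing values requires scanning from the head.
node, total = head, 0
while node is not None:
    total += node.value
    node = node.next
print(total)                         # 20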

Queue:

A queue follows the “FIFO” (first in, first out) principle. It is useful for modelling queuing scenarios in real-time programs, such as people waiting in line to withdraw cash at a bank. Hence, the queue is significant in a program where multiple pieces of work need to be processed in the order they arrive.

The queue data structure can also be used to record the split times of a car in F1 racing.
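A small sketch using collections.deque, which is a common way to get an efficient FIFO queue in Python (the customer names are invented):

from collections import deque

# deque gives O(1) appends and pops from both ends, so it works well as a FIFO queue.
atm_queue = deque()

atm_queue.append("customer 1")   # enqueue at the back
atm_queue.append("customer 2")
atm_queue.append("customer 3")

served = atm_queue.popleft()     # dequeue from the front: "customer 1"
print(served, list(atm_queue))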

2. Non-linear Data Structures

As the name suggests, in non-linear data structures, elements are not arranged in a sequence. Instead, all the elements are arranged and linked with each other in a hierarchical manner, where one element can be linked to one or more elements.

1) Trees

Binary Tree:

The concept of a binary tree is very similar to that of a linked list; the difference lies in the nodes and their pointers. In a linked list, each node contains a data value with a pointer that points to the next node in the list, whereas in a binary tree, each node has two pointers to subsequent (child) nodes instead of just one.

In a binary search tree the nodes are kept sorted, so insertion and deletion operations can be done with O(log N) time complexity when the tree is balanced. Similar to a linked list, a binary tree can also be converted to an array on the basis of tree sorting.

A binary tree consists of parent and child nodes. In a binary search tree, the value of the left child node is always less than the value of its parent node, while the value of the right-side child node is always greater than the parent node. Hence, in a binary search tree, the sorted order is maintained automatically, which makes insertion and deletion efficient.

2) Graphs

A graph data structure is also very useful in machine learning, for example for link prediction. A graph consists of nodes connected by edges; the edges can be directed or undirected, corresponding to ordered or unordered pairs of nodes. Hence, you should have good exposure to the graph data structure for machine learning and deep learning.

3) Maps

Maps are a popular data structure in the programming world, mostly useful for fast data lookups and for reducing the running time of algorithms. A map stores data in the form of (key, value) pairs, where each key must be unique; however, values can be duplicated. Each key corresponds to, or maps to, a value; hence the name Map.

In different programming languages, core libraries have built-in maps, or rather hash maps, with different names for each implementation:

 In Java: Map / HashMap
 In Python: dictionaries (dict)
 In C++: std::map, std::unordered_map, etc.

Python dictionaries are very useful in machine learning and data science, as various functions and algorithms return a dictionary as their output. Dictionaries are also widely used for implementing sparse matrices, which are very common in machine learning.
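As a tiny illustration of the sparse representation idea, a dictionary can hold only the non-zero entries of a bag-of-words vector (the vocabulary here is invented):

# A dense bag-of-words vector would mostly contain zeros; a dictionary keeps
# only the non-zero entries, which is how sparse representations save memory.
vocabulary = ["data", "model", "training", "pipeline", "bias"]
dense_counts = [3, 0, 1, 0, 0]

sparse_counts = {word: count
                 for word, count in zip(vocabulary, dense_counts)
                 if count != 0}

print(sparse_counts)                   # {'data': 3, 'training': 1}
print(sparse_counts.get("model", 0))   # missing keys default to 0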

4) Heap data structure:

A heap is a hierarchically ordered data structure. It is similar to a tree, but the ordering is vertical rather than horizontal: ordering is applied along the hierarchy, not across it. In a max-heap the value of a parent node is always greater than or equal to the values of its children (in a min-heap it is always smaller).

Insertion and deletion are performed on the basis of promotion: a new element is first placed at the next available position at the bottom of the heap, then repeatedly compared with its parent and promoted (sifted up) until it reaches its correct position. Most heap data structures can be stored in an array, with the parent/child relationships implied by the positions of the elements.
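Python’s heapq module provides a min-heap (the smallest value sits at the root, the mirror image of the max-heap ordering described above); a short sketch:

import heapq

# heapq implements a min-heap on top of a plain list: the smallest element
# is always at index 0.
errors = [0.42, 0.07, 0.31, 0.18]
heapq.heapify(errors)

heapq.heappush(errors, 0.02)     # sifted up to its correct position
smallest = heapq.heappop(errors)
print(smallest, errors[0])       # 0.02, then the next-smallest at the root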

Dynamic array data structure:

This is one of the most important data structures used in linear algebra, handling 1-D, 2-D, 3-D and even 4-D arrays for matrix arithmetic. Using it effectively requires good exposure to Python libraries such as NumPy, which is widely used for programming in deep learning.
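A brief NumPy sketch of the kind of multi-dimensional arrays and matrix arithmetic referred to here:

import numpy as np

# A 2-D array (matrix) and some basic matrix arithmetic.
A = np.array([[1.0, 2.0],
              [3.0, 4.0]])
b = np.array([0.5, -1.0])

print(A.shape)           # (2, 2)
print(A @ b)             # matrix-vector product
print(A.T)               # transpose
print(np.linalg.inv(A))  # inverse

# Higher-dimensional arrays work the same way, e.g. a 3-D tensor:
T = np.zeros((2, 3, 4))
print(T.ndim, T.shape)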

How is Data Structure used in Machine Learning?

For a machine learning professional, mastery of data structures and algorithms is required in addition to machine learning skills.

When we use machine learning for solving a problem, we need to evaluate the model
performance, i.e., which model is fastest and requires the smallest amount of space
and resources with accuracy. Moreover, if a model is built using algorithms,
comparing and contrasting two algorithms to determine the best for the job is crucial
to the machine learning professional. For such cases, skills in data structures become
important for ML professionals.

With the knowledge of data structure and algorithms with ML, we can answer the
following questions easily:

 How much memory is required to execute?
 How long will it take to run?
 With the business case on hand, which algorithm will offer the best performance?

Data quality and remediation

Data quality

The machine learns statistical associations from historical data and is only as good as the data it is trained on. Hence, good quality data is imperative and a basic building block of an ML pipeline: the ML model can only be as good as its training data.

Data Quality Assessment

The machine learning algorithms need training data in a single view i.e. a flat
structure. As most organizations maintain multiple sources of data, the data
preparation by combining multiple data sources to bring all necessary attributes in a
single flat file is a time and resource (domain expertise) expensive process.

The data gets exposed to multiple sources of error at this step and requires strict peer
review to ensure that the domain-established logic has been communicated,
understood, programmed, and implemented well.

Since data warehouses integrate data from multiple sources, quality issues related to
data acquisition, cleaning, transformations, linking, and integration become critical.

A very popular notion among most data scientists is that data preparation, cleaning, and transformation take up the majority of the model building time, and it is true. Hence, it is advised not to rush the data into the model but to perform extensive data quality checks first. Though the number and type of checks one can perform on the data can be very subjective, we will discuss some of the key factors to be checked in the data while preparing a data quality score and assessing the goodness of data:

Techniques to maintain data quality:

 missing data imputation
 outlier detection
 data transformations
 dimensionality reduction
 cross-validation
 bootstrapping

Let’s check how we can improve the data quality:

o All labelers are not the same: Data is gathered from multiple sources.
Multiple vendors have different approaches to collecting and labeling
data with a different understanding of the end-use of the data. Within
the same vendor for data labeling, there are myriad ways data
inconsistency can crop up as the supervisor gets requirements and
shares the guidelines to different team members, all of whom can label
based on their understanding.
 A quality check on the vendor side and validation of adherence to the shared guidelines on the consumer side will help bring homogeneous labeling.

o Distinct Record: Identifying the group of attributes that uniquely
identify a single record is very important and needs validation from a
domain expert. Removing duplicates on this group leaves you with
distinct records necessary for model training. This group acts as a key
to performing multiple aggregate and transformations operations on the
dataset like calculating rolling mean, backfilling null values, missing
value imputation (details on this in next point), etc.
o What to do with the missing data? Systematic missingness of data
leads to the origin of a biased dataset and calls for deeper investigation.
Also, removing the observations from the data with more null/missing
values can lead to the elimination of data representing certain groups
of people (e.g. gender, or race). Hence, misrepresented data will
produce biased results and is not only flawed at the model output level
but is also against the fairness principles of ethical and responsible use
of AI. Another way you may find the missing attributes is “at
random”. Blindly removing a certain important attribute due to a high
missingness quotient can harm the model by reducing its predictive
power.

 The most common way to impute missing values is with mean values at a particular dimension level. For example, the average number of conversions on the Delhi to Bengaluru route can be used to impute a missing conversions value for that route on a given day. On a similar note, one may calculate the average across all high-running routes, like Delhi to Mumbai, Delhi to Kolkata and Delhi to Chennai, for imputing a missing conversions value (see the group-wise imputation sketch after this list).

 Flattened Structure: Most organizations do not have a centralized data warehouse and encounter a lack of structured data as one of the key problems in preparing a machine learning model for decision-making. For example, cybersecurity solutions need data from multiple resources like network, cloud, and endpoint to be normalized into one single view to train the algorithm on previously seen attacks/threats.
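The group-wise imputation mentioned above can be sketched with a pandas groupby; the route names and figures below are purely illustrative:

import pandas as pd
import numpy as np

bookings = pd.DataFrame({
    "route": ["DEL-BLR", "DEL-BLR", "DEL-MUM", "DEL-MUM", "DEL-BLR"],
    "conversions": [120, np.nan, 95, 98, 130],
})

# Impute missing conversions with the mean at the route ("dimension") level.
bookings["conversions"] = (
    bookings.groupby("route")["conversions"]
            .transform(lambda s: s.fillna(s.mean()))
)
print(bookings)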

Data Remediation

What is data remediation?

Data remediation is the process of cleansing, organizing and migrating data so that it’s
properly protected and best serves its intended purpose. There is a misconception that
data remediation simply means deleting business data that is no longer needed. It’s
important to remember that the key word “remediation” derives from the word
“remedy,” which is to correct a mistake. Since the core initiative is to correct data, the
data remediation process typically involves replacing, modifying, cleansing or
deleting any “dirty” data.

Data remediation terminology

As you explore the data remediation process, you will come across unique
terminology. These are common terms related to data remediation that you should get
acquainted with.

 Data Migration – The process of moving data between two or more systems,
data formats or servers.
 Data Discovery – A manual or automated process of searching for patterns in
data sets to identify structured and unstructured data in an organization’s
systems.
 ROT – An acronym that stands for redundant, obsolete and trivial data.
According to the Association for Intelligent Information Management, ROT
data accounts for nearly 80 percent of the unstructured data that is beyond its
recommended retention period and no longer useful to an organization.
 Dark Data – Any information that businesses collect, process and store, but do
not use for other purposes. Some examples include customer call records, raw
survey data or email correspondences. Often, the storing and securing of this
type of data incurs more expense and sometimes even greater risk than it does
value.
 Dirty Data – Data that damages the integrity of the organization’s complete
dataset. This can include data that is unnecessarily duplicated, outdated,
incomplete or inaccurate.
 Data Overload – This is when an organization has acquired too much data,
including low-quality or dark data. Data overload makes the tasks of
identifying, classifying and remediating data laborious.
 Data Cleansing – Transforming data in its native state to a predefined
standardized format.
 Data Governance – Management of the availability, usability, integrity and
security of the data stored within an organization.

Stages of data remediation

Data remediation is an involved process. After all, it’s more than simply purging your
organization’s systems of dirty data. It requires knowledgeable assessment on how to
most effectively resolve unclean data.

Assessment

Before you take any action on your company’s data, you need to have a complete
understanding of the data you possess. How valuable is this data to the company? Is
this data sensitive? Does this data actually require specialized storage, or is it trivial
information? Identifying the quantity and type of data you’re dealing with, even if it’s
just a ballpark estimate to start, will help your team get a general sense of how much
time and resources need to be dedicated for successful data remediation.

Organizing and segmentation

Not all data is created equally, which means that not all pieces of data require the
same level of protection or storage features. For instance, it isn’t cost-efficient for a
company to store all data, ranging from information that is publicly facing to sensitive
data, all in the same high-security vault. This is why organizing and creating segments
based on the information’s purpose is critical during the data remediation process.

Accessibility is a big factor to consider when it comes to segmenting data. There’s data that needs to be easily accessed by team members for day-to-day tasks, and then there’s data that needs to have higher security measures for legal or regulatory purposes. For the data that needs to be regularly accessed, a cloud-based storage platform makes sense. For sensitive data that has greater privacy requirements, organizations will probably want to separate that data and store it on another platform with advanced security features. This is one example of two segments an organization may create.

Another important consideration when creating segments is determining which historical data is essential to business operations and needs to be stored in an archive system versus data that can be safely deleted. ROT data is a good example of information that can be safely deleted, while other business records that are still within a recommended retention period could be stored in an archive system.

Indexation and classification

Once your data is segmented, you can move on to indexing and classification. These steps build off of the data segments you have created and help you determine action steps. In this step, organizations will focus on segments containing non-ROT data and classify the level of sensitivity of the remaining data.

Regulated data like personally identifiable information (PII), personal health information (PHI) and financial information will need to be classified with the company’s terminology for the highest degree of sensitivity. “Restricted data” is a common classification term for data of this nature. Then, there’s unregulated and unstructured data that may contain sensitive information, and could be classified as internal, confidential or restricted data, depending on its level of sensitivity.

Migrating

If an organization’s end goal is to consolidate their data into a new, cleansed storage
environment, then migration is an essential step in the data remediation process. A
common scenario is an organization who needs to find a new secure location for
storing data because their legacy system has reached its end of life. Some
organizations may also prefer moving their data to cloud-based platforms, like
SharePoint or Office 365, so that information is more accessible for their internal
teams.

Data cleansing

The final task for your organization’s data may not always involve migration. There may be other actions better suited for the data depending on what segmentation group it falls under and its classification. A few vital actions that a team may proceed with include shredding, redacting, quarantining, ACL removal and script execution to clean up data.

Business benefits of data remediation

Data remediation is a big effort, but it comes with big benefits for businesses as well.
These are the top benefits that most organizations realize after data remediation.

 Reduced data storage costs — Although data remediation isn’t solely about
deletion of data, it is a common remediation action and less data means less
storage required. Additionally, many organizations realize that they have
lumped trivial information in the same high-security storage platform for
sensitive information, instead of only paying for the storage space that’s
actually necessary.
 Protection for unstructured sensitive data — Once sensitive data is
discovered and classified, remediation is where you determine and execute the
actions that mitigate risk. This could look like finding a secure area to store
sensitive data or deleting what is necessary from a compliance perspective.
 Reduced sensitive data footprint — By removing sensitive data that is beyond its recommended retention period and no longer necessary for compliance, you’ve reduced your organization’s sensitive data footprint and decreased the risk of potential data breaches or leaks of highly sensitive data.
 Adherence to compliance laws and regulations — Hanging on to data that
is beyond its recommended retention period can create greater risks. By
cleaning up data, your organization reduces data exposure which supports
compliance initiatives.
 Increased staff productivity — Data that your team uses should be available,
usable and trustworthy. By streamlining your organization’s network with data
remediation, information should be easier to find and usable for its intended
purpose.
 Minimized cyberattack risks — By continuously engaging in data
remediation, your organization is proactively minimizing data loss risks and
potential financial or reputational damage of successful cyberattacks.
 Improved overall data security — Data remediation and data governance
work hand in hand. In order to properly remediate data, your organization will
need to establish data governance policies, which is significant for the overall
management and protection of your organization’s data.

When is data remediation necessary?

Data remediation is an essential process for any organization to ensure optimal hygiene and legal compliance standing. It’s recommended for any company to stay consistent with data remediation, but there are some specific instances that may occur and become a strong driver for prioritizing data remediation.

Business changes

If a company has changed the software or systems it uses, or even moved to a new office or data center location, that is a case to buckle down on data remediation immediately. Sometimes companies switch to new software or systems because they need to phase out a legacy system that has reached its end of life. Change of any kind is rarely ever 100 percent smooth, and data could become corrupted or exposed during the shuffle of changing environments — whether it be digital or physical.

Another event that may be a motivator to conduct data remediation is a company merger or acquisition. Similar to a change in systems or location, the organization is likely experiencing major changes in leadership, staff, work processes, and more. Even if your organization’s data is pristine, you cannot say the same about the new company that is joining forces with you until you take the time to discover, classify and, eventually, remediate data.

Laws and regulations

Newly enacted laws or regulations, either at a state or federal level, could be another major driver for data remediation. Data privacy and protection laws are continuously being updated and improved upon, like the California Consumer Privacy Act of 2018 (CCPA). Sometimes new policies may be enacted by the leadership team at your organization as well.

Human error

Drivers for data remediation aren’t always necessarily as grand as a new business
acquisition or legal regulation. Sometimes, instances as simple as human error can be
a catalyst for data remediation. For instance, let’s say that your organization discovers
one of its employees has unintentionally downloaded sensitive corporate data on their
personal mobile phone. Or, perhaps a couple of employees accidentally opened up a
malicious spam email. Actions as innocent as these examples could put the integrity of your organization’s data at risk and are cause for immediately taking action with data remediation.

More examples of scenarios that may trigger the need to remediate data include:

 Preparing legal documentation for an investor portfolio sale
 Eliminating personally identifiable information (PII) or personal healthcare information (PHI)
 Enterprise resource planning (ERP)
 Master Data Management (MDM) implementation

What prevents organizations from performing data remediation?

As important as data remediation is, many organizations bypass this process. Oftentimes, other activities like data migration may seem to be an adequate replacement for the exhaustive task of comprehensive data remediation. However, projects like that are typically one-time endeavors that aren’t a continuous effort of cleansing and validating an organization’s data.

Lack of information

A common reason that organizations ignore data remediation is a lack of information
about what, where, how and why data is stored in the company. An organization may
not even realize the expanse of data they have collected or where it’s even stored.
Awareness is a common issue, and since such a large percentage of sensitive data falls under the unstructured category, locating and maintaining awareness of all of this data is difficult.
It’s recommended that organizations, especially those who belong to industries that
interact with high volumes of sensitive data (like the medical, financial or education
industries), regularly perform sensitive data discovery and data classification to
prepare for data remediation. All of these steps are essential to a healthy data lifecycle
and depend on one another to keep a company’s data security in good standing.

Fear of deleting data

Another factor that may prevent an organization from getting started with data
remediation is a fear of deleting data. The permanency of the action can be
intimidating, and some businesses may be concerned that they may need the data at
hand at some point in the future. However, hanging on to unnecessary data, or leaving
dirty data unmodified or uncleansed, can pose greater risk to an organization —
especially when it comes to compliance laws and regulations.

Unclear data ownership

Lastly, some organizations may not have established clear data ownership. If there
aren’t clear roles and responsibilities for each member of your organization’s security
team, then important tasks like data remediation can easily slip through the cracks.
It’s essential to determine each person’s key responsibilities when it comes to
maintaining data security, and to make those duties transparent across the
organization so that everyone knows who to turn to for specific security questions,
and to keep the team accountable.

How to prepare your business for data remediation

Whether you’ve put data remediation on the back-burner or are realizing for the first
time the benefits of steady data remediation, here are several steps your team should
take to prepare for data remediation.

1. Data remediation teams – First, create data remediation teams. In doing this,
your organization will need to establish data ownership roles and
responsibilities, so everyone on your security team knows how they are contributing and who to go to with questions or concerns.
2. Data governance policies – From there, you will need to establish company
policies that enforce data governance. An effective data governance plan will
ensure that the company’s data is trustworthy and does not get misused.
Typically, data governance is a process largely based on the company’s
internal data standards and policies that control data usage in order to maintain
the availability, usability, integrity and security of data.
3. Prioritize data remediation areas – Once you have your organization’s
policies and data remediation team assembled, you should begin prioritizing
which areas may require more immediate data remediation. If any of the
drivers we mentioned above have occurred, such as your organization switching to a new platform or an urgent need to eliminate PII, those are great starting points for prioritizing the order of business areas that need data remediation.
4. Budget for data-related issues – After compiling a prioritized list, it’s time to
budget for any data-related issues that may occur during the remediation
process. This includes estimating the hours of labor for the process and
factoring in costs for any special tools that may be needed for remediation.
5. Discuss data remediation expectations – Either after or alongside the
budgeting process, your team should sit down and discuss general
expectations of the data remediation process. Are there any types of sensitive
data your team expects to find? Are there any recent overarching data security
issues or changes that could have an impact or effect on the remediation
process? During the discussion, important details may be brought to light for
the team that only one person was aware of and help the team reach success.
6. Track progress and ROI – All companies want to understand their ROI on big projects and initiatives, and this applies to data security measures too.
Your organization’s IT data security lead should create a progress reporting
mechanism that can inform company stakeholders on the data remediation
progress, including key performance indicators like amount of issues resolved
or how resolved issues translate into money and risk saved.

Data pre-processing

Companies can use data from nearly endless sources – internal information, customer
service interactions, and all over the internet – to help inform their choices and
improve their business.

But you can’t simply take raw data and run it through machine learning and analytics
programs right away. You first need to preprocess your data, so it can be successfully
“read” or understood by machines.

What Is Data Preprocessing?

Data preprocessing is a step in the data mining and data analysis process that takes
raw data and transforms it into a format that can be understood and analyzed by
computers and machine learning.

Raw, real-world data in the form of text, images, video, etc., is messy. Not only may
it contain errors and inconsistencies, but it is often incomplete, and doesn’t have a
regular, uniform design.

Machines like to process nice and tidy information – they read data as 1s and 0s. So calculating with structured data, like whole numbers and percentages, is easy. However, unstructured data, in the form of text and images, must first be cleaned and formatted before analysis.

Data Preprocessing Importance

When using data sets to train machine learning models, you’ll often hear the phrase “garbage in, garbage out.” This means that if you use bad or “dirty” data to train your model, you’ll end up with a bad, improperly trained model that won’t actually be relevant to your analysis.

Good, preprocessed data is even more important than the most powerful algorithms,
to the point that machine learning models trained with bad data could actually be
harmful to the analysis you’re trying to do – giving you “garbage” results.

Depending on your data gathering techniques and sources, you may end up with data
that’s out of range or includes an incorrect feature, like household income below zero
or an image from a set of “zoo animals” that is actually a tree. Your set could have
missing values or fields. Or text data, for example, will often have misspelled words
and irrelevant symbols, URLs, etc.

When you properly preprocess and clean your data, you’ll set yourself up for much
more accurate downstream processes. We often hear about the importance of
“data-driven decision making,” but if these decisions are driven by bad data, they’re
simply bad decisions.

Understanding Machine Learning Data Features

Data sets can be explained with or communicated as the “features” that make them up.
This can be by size, location, age, time, color, etc. Features appear as columns in
datasets and are also known as attributes, variables, fields, and characteristics.

Wikipedia describes a machine learning data feature as “an individual measurable property or characteristic of a phenomenon being observed”.

It’s important to understand what “features” are when preprocessing your data
because you’ll need to choose which ones to focus on depending on what your
business goals are. Later, we’ll explain how you can improve the quality of your
dataset’s features and the insights you gain with processes like feature selection.

First, let’s go over the two different types of features that are used to describe data:
categorical and numerical:

 Categorical features: Features whose explanations or values are taken from a defined set of possible explanations or values. Categorical values can be colors of a house; types of animals; months of the year; True/False; positive, negative, neutral; etc. The set of possible categories that the features can fit into is predetermined.
 Numerical features: Features with values that are continuous on a scale,
statistical, or integer-related. Numerical values are represented by whole
numbers, fractions, or percentages. Numerical features can be house prices,
word counts in a document, time it takes to travel somewhere, etc.

The diagram below shows how features are used to train machine learning text
analysis models. Text is run through a feature extractor (to pull out or highlight words
or phrases) and these pieces of text are classified or tagged by their features. Once the
model is properly trained, text can be run through it, and it will make predictions on
the features of the text or “tag” the text itself.

Data Preprocessing Steps

Let’s take a look at the established steps you’ll need to go through to make sure your
data is successfully preprocessed.

1. Data quality assessment
2. Data cleaning
3. Data transformation
4. Data reduction

1. Data quality assessment

Take a good look at your data and get an idea of its overall quality, relevance to your
project, and consistency. There are a number of data anomalies and inherent problems
to look out for in almost any data set, for example:

 Mismatched data types: When you collect data from many different sources,
it may come to you in different formats. While the ultimate goal of this entire
process is to reformat your data for machines, you still need to begin with
similarly formatted data. For example, if part of your analysis involves family
income from multiple countries, you’ll have to convert each income amount
into a single currency.
 Mixed data values: Perhaps different sources use different descriptors for
features – for example, man or male. These value descriptors should all be
made uniform.
 Data outliers: Outliers can have a huge impact on data analysis results. For
example if you're averaging test scores for a class, and one student didn’t
respond to any of the questions, their 0% could greatly skew the results.
 Missing data: Take a look for missing data fields, blank spaces in text, or
unanswered survey questions. This could be due to human error or incomplete
data. To take care of missing data, you’ll have to perform data cleaning.

2. Data cleaning

Data cleaning is the process of adding missing data and correcting, repairing, or removing incorrect or irrelevant data from a data set. Data cleaning is the most important step of preprocessing because it will ensure that your data is ready to go for your downstream needs.

Data cleaning will correct all of the inconsistent data you uncovered in your data
quality assessment. Depending on the kind of data you’re working with, there are a
number of possible cleaners you’ll need to run your data through.

Missing data

There are a number of ways to correct for missing data, but the two most common
are:

 Ignore the tuples: A tuple is an ordered list or sequence of numbers or entities. If multiple values are missing within tuples, you may simply discard the tuples with that missing information. This is only recommended for large data sets, when a few ignored tuples won’t harm further analysis.
 Manually fill in missing data: This can be tedious, but is definitely necessary when working with smaller data sets.

Noisy data

Data cleaning also includes fixing “noisy” data. This is data that includes unnecessary
data points, irrelevant data, and data that’s more difficult to group together.

 Binning: Binning sorts data of a wide data set into smaller groups of more
similar data. It’s often used when analyzing demographics. Income, for
example, could be grouped: $35,000-$50,000, $50,000-$75,000, etc.
 Regression: Regression is used to decide which variables will actually apply
to your analysis. Regression analysis is used to smooth large amounts of data.
This will help you get a handle on your data, so you’re not overburdened with
unnecessary data.
 Clustering: Clustering algorithms are used to properly group data, so that it
can be analyzed with like data. They’re generally used in unsupervised
learning, when not a lot is known about the relationships within your data.

If you’re working with text data, for example, some things you should consider when
cleaning your data are:

 Remove URLs, symbols, emojis, etc., that aren’t relevant to your analysis
 Translate all text into the language you’ll be working in
 Remove HTML tags
 Remove boilerplate email text
 Remove unnecessary blank text between words
 Remove duplicate data
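A rough regular-expression sketch of some of these text clean-up steps (the sample string is invented, and real pipelines usually need more careful rules):

import re

raw = "Great product!!! 😀 Visit https://example.com <br> Best   regards"

text = re.sub(r"https?://\S+", " ", raw)     # remove URLs
text = re.sub(r"<[^>]+>", " ", text)         # remove HTML tags
text = re.sub(r"[^\w\s.,!?]", " ", text)     # drop emojis and stray symbols
text = re.sub(r"\s+", " ", text).strip()     # collapse blank runs

print(text)   # "Great product!!! Visit Best regards"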

After data cleaning, you may realize you have insufficient data for the task at hand. At
this point you can also perform data wrangling or data enrichment to add new data
sets and run them through quality assessment and cleaning again before adding them
to your original data.

3. Data transformation

With data cleaning, we’ve already begun to modify our data, but data transformation
will begin the process of turning the data into the proper format(s) you’ll need for
analysis and other downstream processes.

This generally happens in one or more of the below:

1. Aggregation
2. Normalization
3. Feature selection
4. Discretization
5. Concept hierarchy generation

 Aggregation: Data aggregation combines all of your data together in a uniform format.
 Normalization: Normalization scales your data into a regularized range so that you can compare it more accurately. For example, if you’re comparing employee loss or gain within a number of companies (some with just a dozen employees and some with 200+), you’ll have to scale them within a specified range, like -1.0 to 1.0 or 0.0 to 1.0 (see the scaling sketch after this list).
 Feature selection: Feature selection is the process of deciding which variables (features, characteristics, categories, etc.) are most important to your analysis. These features will be used to train ML models. It’s important to remember that the more features you choose to use, the longer the training process and, sometimes, the less accurate your results, because some feature characteristics may overlap or be less present in the data.

 Discretization: Discretization pools data into smaller intervals. It’s somewhat similar to binning, but usually happens after data has been cleaned. For example, when calculating average daily exercise, rather than using the exact minutes and seconds, you could join together data to fall into 0-15 minutes, 15-30, etc.

 Concept hierarchy generation: Concept hierarchy generation can add a hierarchy within and between your features that wasn’t present in the original data. If your analysis contains wolves and coyotes, for example, you could add the hierarchy for their genus: Canis.
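A small sketch of the normalization step using scikit-learn's MinMaxScaler on invented employee-change figures:

import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Employee change figures for companies of very different sizes.
employee_change = np.array([[-12.0], [3.0], [250.0], [40.0]])

scaler = MinMaxScaler(feature_range=(0.0, 1.0))
scaled = scaler.fit_transform(employee_change)

print(scaled.ravel())   # every value now lies between 0.0 and 1.0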

4. Data reduction

The more data you’re working with, the harder it will be to analyze, even after
cleaning and transforming it. Depending on your task at hand, you may actually have
more data than you need. Especially when working with text analysis, much of
regular human speech is superfluous or irrelevant to the needs of the researcher. Data
reduction not only makes the analysis easier and more accurate, but cuts down on data
storage.

It will also help identify the most important features to the process at hand.

 Attribute selection: Similar to discretization, attribute selection can fit your data into smaller pools. It essentially combines tags or features, so that tags like male/female and professor could be combined into male professor/female professor.
 Numerosity reduction: This will help with data storage and transmission. You can use a regression model, for example, to use only the data and variables that are relevant to your analysis.
 Dimensionality reduction: This, again, reduces the amount of data used to help facilitate analysis and downstream processes. Techniques such as principal component analysis (PCA) combine correlated features into a smaller set of components to make the data more manageable.
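As an illustrative sketch of dimensionality reduction, PCA from scikit-learn can project a built-in sample dataset onto a handful of components:

from sklearn.datasets import load_digits
from sklearn.decomposition import PCA

X, _ = load_digits(return_X_y=True)   # 64 pixel features per image

# Project onto the 10 directions that retain the most variance.
pca = PCA(n_components=10)
X_reduced = pca.fit_transform(X)

print(X.shape, "->", X_reduced.shape)                  # (1797, 64) -> (1797, 10)
print(round(pca.explained_variance_ratio_.sum(), 2))   # share of variance kept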

Data Preprocessing Examples

Take a look at the table below to see how preprocessing works. In this example, we
have three variables: name, age, and company. In the first example we can tell that #2
and #3 have been assigned the incorrect companies.

Name          Age    Company
Karen Lynch   57     CVS Health
Elon Musk     49     Amazon
Jeff Bezos    57     Tesla
Tim Cook      60     Apple

We can use data cleaning to simply remove these rows, as we know the data was
improperly entered or is otherwise corrupted.

Name          Age    Company
Karen Lynch   57     CVS Health
Tim Cook      60     Apple

Or, we can perform data transformation, in this case, manually, in order to fix the
problem:

Name          Age    Company
Karen Lynch   57     CVS Health
Elon Musk     49     Tesla
Jeff Bezos    57     Amazon
Tim Cook      60     Apple

Once the issue is fixed, we can perform data reduction, in this case by sorting on descending age, to choose which age range we want to focus on:

Name          Age    Company
Tim Cook      60     Apple
Karen Lynch   57     CVS Health
Jeff Bezos    57     Amazon
Elon Musk     49     Tesla

