Unit-5 Mahout

Introduction to Mahout
o Apache Mahout is a library of scalable machine-learning algorithms. Because it runs on top of
Hadoop, whose mascot is an elephant, it is named Mahout, a word for an elephant driver.
o Mahout started as a subproject of Apache Lucene, a search library that helps in
searching and sorting out relevant data from heterogeneous data sources, such as
XML files, MySQL databases, Excel sheets etc., at a very fast speed.
o With time, Mahout acquired functionalities of its own and emerged as an
independent package of scalable and fast-executing machine-learning algorithms.
o Apache Mahout makes use of the MapReduce paradigm and Hadoop framework;
however, it can also run algorithms in the standalone mode.
Some features of Apache Mahout are listed below:
❖ Mahout algorithms are written on top of Hadoop and can run against data in the Hadoop
Distributed File System (HDFS) as well as on standalone platforms.
❖ Mahout algorithms are highly scalable and can be deployed on cloud infrastructure.
❖ Mahout provides a ready-to-use framework that lets developers perform data
mining on huge volumes of data.
❖ Mahout algorithms are designed to run fast and efficiently on large data sets.
❖ Mahout includes many MapReduce-enabled clustering implementations,
such as K-Means, Fuzzy K-Means, Canopy, Dirichlet and Mean-Shift.
❖ Mahout supports implementations of Distributed Naïve Bayes and Complementary
Naïve Bayes.
❖ Mahout supports matrix and vector libraries.
Mahout's machine-learning framework processes data on the basis of the following 3Cs:
➢ Collaborative filtering (also known as Recommendation)
➢ Clustering
➢ Classification
Machine Learning
▪ Machine learning is a branch of Artificial Intelligence (AI) that deals with
programming computer systems to learn from data.
▪ Algorithms based on machine-learning techniques enable computer systems to
learn and improve with experience.
▪ Machine learning algorithms target the development of computer systems
and programs that can analyse, grow and adapt themselves to any
changes in the volume and variety of data.
▪ The concept of machine learning is similar to that of data mining, as both search
for and establish patterns in data.
Working of Machine learning algorithm
[Figure: Training Data, Machine Learning Algorithms, Test Data, Hypothesis, Performance]
• A set of training data is sent to a machine learning algorithm, which builds an
instance (a trained model) from it.
• The instance is then checked at the hypothesis stage, using test data, to find
whether the algorithm suits the data or not.
• If the algorithm is acceptable at the hypothesis stage, it moves on to the performance stage.
• Performance is the actual behaviour of the algorithm while it processes data.
• If the performance is good, a feedback signal tells the machine learning
algorithm that the model generated from the training data is performing
well and can be used with other input data and new data.
Some fields that use machine-learning algorithms in their
applications are as follows:
❑ Vision processing
❑ Data mining
❑ Language processing
❑ Stock market forecasting
❑ Pattern recognition
❑ Robotics
❑ Games
❑ Advanced human artificial limbs
Supervised Learning
A supervised learning algorithm analyses labelled data gathered from various
sources and generates a function that can be used to map new data onto
similar patterns.
Neural networks, Support Vector Machines (SVMs), and Naïve Bayes classifier are
some examples of supervised learning algorithms.
The fields where supervised learning algorithms are used include:
1. Classification of tweets as private
2. Voice analysis
3. Labelling webpages based on their content
[Figure: Supervised machine learning algorithm]
o Feature vectors are extracted from the training data (for example, "txt" files) and
provided to the machine learning algorithm.
o Labels are supplied to the machine learning algorithm along with the feature
vectors, so that each item can be tracked in the huge stack of data.
o A predictive model is generated from the result of the machine learning
algorithm.
o Finally, the feature vectors of the data to be labelled are passed through the
predictive model, which generates the label for that data.
Unsupervised Learning
An unsupervised learning algorithm transforms unlabelled data into some
meaningful form.
It works by searching for patterns and trends in raw data.
It is basically used for clustering similar data into logical groups.
Some examples of unsupervised learning techniques are the K-Means and
hierarchical clustering algorithms.
▪ In unsupervised learning, the raw data is first scaled and then converted into a data model.
▪ This data model is then sent for validation.
▪ If the data model passes the validation process, it can be used for further input data or new data.
▪ If the validation fails, the data model is sent back to the building stage to be restructured into a new model.
Collaborative Filtering (Recommendation)
When you search for a particular item on the Internet, the side panel starts
displaying other similar items suggesting that people who searched for this item
also showed interest in these items. This is called Collaborative filtering in which
the user behaviour is mined to derive certain patterns.
This information is then used as a suggestion/recommendation for other users
with similar likes and dislikes.
These recommendations are produced with the help of a recommendation engine.
Some common recommendation engines supported by Mahout include:
o User-based recommendation
o Item-based recommendation
User-based Recommendation
In a user-based recommendation, the inferences are drawn on the basis of the
likes, dislikes and preferences of users.
The data for this analysis is collected by studying the search patterns of one
or more users over a period of time.
The concept is simple: to calculate recommendations for a target user, the
search data of all other users with similar search patterns is analyzed.
One of the most commonly used methods for doing this analysis is to calculate the
correlation coefficient between the search patterns of different users.
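A common choice of correlation coefficient is Pearson's. As a sketch, with notation that is assumed here rather than taken from the slides: r_{u,i} is the preference of user u for item i, \bar{r}_u is the mean of user u's preferences over the shared items, and I_{uv} is the set of items for which both users u and v have a recorded preference.

```latex
\[
\mathrm{sim}(u,v) =
\frac{\sum_{i \in I_{uv}} (r_{u,i} - \bar{r}_u)(r_{v,i} - \bar{r}_v)}
     {\sqrt{\sum_{i \in I_{uv}} (r_{u,i} - \bar{r}_u)^2}\;
      \sqrt{\sum_{i \in I_{uv}} (r_{v,i} - \bar{r}_v)^2}}
\]
```

The value lies between -1 and 1; values near 1 indicate users whose preferences rise and fall together.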
In a user-based recommendation, a user profile is created on the basis of the user's
current search patterns.
The algorithm also generates and displays the content profile of the item selected
by the user.
The user profile is then matched with the content profile of the selected item.
If the profiles match, the algorithm recommends all items with a
similar content profile.
➢ In Mahout, we apply user-based recommendation to a given dataset (loaded into a DataModel
named model) by first computing the similarity between users:
UserSimilarity similarity = new PearsonCorrelationSimilarity(model);
➢ To decide which users count as similar to the target user, we keep those with a
similarity greater than 0.1, which is implemented through a ThresholdUserNeighborhood
as:
UserNeighborhood neighborhood = new ThresholdUserNeighborhood(0.1,
similarity, model);
➢ With the similarity and the neighborhood in place, we can create the recommender
with the following statement:
UserBasedRecommender recommender = new GenericUserBasedRecommender(model,
neighborhood, similarity);
➢ The recommender can now return one or more recommended items for a user.
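The three steps above (similarity, neighbourhood, recommender) can be illustrated without Mahout. The following is a minimal plain-Java sketch, not the Mahout implementation: the class name UserBasedSketch, its methods, the rating matrix and the convention that 0 means "not rated" are all assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative only: plain Java, not the Mahout API. A rating of 0 means "not rated".
final class UserBasedSketch {

    // Pearson correlation over the items that both users have rated.
    static double pearson(double[] a, double[] b) {
        int n = 0;
        double sumA = 0, sumB = 0;
        for (int i = 0; i < a.length; i++) {
            if (a[i] > 0 && b[i] > 0) {
                n++;
                sumA += a[i];
                sumB += b[i];
            }
        }
        if (n < 2) {
            return 0.0; // too little overlap to correlate
        }
        double meanA = sumA / n, meanB = sumB / n;
        double cov = 0, varA = 0, varB = 0;
        for (int i = 0; i < a.length; i++) {
            if (a[i] > 0 && b[i] > 0) {
                double da = a[i] - meanA, db = b[i] - meanB;
                cov += da * db;
                varA += da * da;
                varB += db * db;
            }
        }
        double denom = Math.sqrt(varA * varB);
        return denom == 0 ? 0.0 : cov / denom;
    }

    // Users (other than the target) whose similarity exceeds the threshold.
    static List<Integer> neighbours(double[][] ratings, int target, double threshold) {
        List<Integer> result = new ArrayList<>();
        for (int u = 0; u < ratings.length; u++) {
            if (u != target && pearson(ratings[target], ratings[u]) > threshold) {
                result.add(u);
            }
        }
        return result;
    }

    // Similarity-weighted average of the neighbours' ratings for one item.
    static double estimate(double[][] ratings, int target, int item, double threshold) {
        double weighted = 0, weights = 0;
        for (int u : neighbours(ratings, target, threshold)) {
            if (ratings[u][item] > 0) {
                double s = pearson(ratings[target], ratings[u]);
                weighted += s * ratings[u][item];
                weights += s;
            }
        }
        return weights == 0 ? Double.NaN : weighted / weights;
    }

    public static void main(String[] args) {
        double[][] ratings = {
            {5, 3, 4, 0}, // user 0 has not rated item 3
            {4, 2, 3, 5}, // user 1 rates like user 0
            {1, 5, 2, 1}, // user 2 has opposite tastes
        };
        System.out.println("neighbours of user 0: " + neighbours(ratings, 0, 0.1));
        System.out.println("estimate for item 3: " + estimate(ratings, 0, 3, 0.1));
    }
}
```

With the 0.1 threshold, only users who correlate positively with the target contribute to the estimate, which is the role the threshold-based neighbourhood plays in the Mahout statements.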
Item-based Recommendation
➢ Item-based recommendation is a flexible and easy-to-execute algorithm that can
be applied in a variety of applications.
➢ The algorithm is based on the input of the customer about the items searched or
purchased, generating an output that recommends similar items with a score
reflecting whether a particular customer will “like” the recommended item or not.
➢ One of the advantages of item-based recommendation algorithms is that they are
so flexible that they can adapt themselves to diverse working conditions and
research interests.
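As an illustration of how such a score might be produced, the sketch below compares items by the cosine similarity of their rating columns and scores an unrated item from the customer's own ratings of similar items. Cosine similarity is one common choice and is assumed here; the slides do not name a specific measure. The class ItemBasedSketch, its methods and the sample matrix are hypothetical.

```java
// Illustrative only: plain Java, not the Mahout API.
// Rows are customers, columns are items; 0 means "no rating or purchase".
final class ItemBasedSketch {

    // Cosine similarity between two item columns.
    static double cosine(double[][] m, int itemA, int itemB) {
        double dot = 0, normA = 0, normB = 0;
        for (double[] row : m) {
            dot += row[itemA] * row[itemB];
            normA += row[itemA] * row[itemA];
            normB += row[itemB] * row[itemB];
        }
        double denom = Math.sqrt(normA) * Math.sqrt(normB);
        return denom == 0 ? 0.0 : dot / denom;
    }

    // Score for one item: average of the customer's own ratings,
    // weighted by how similar each rated item is to the candidate item.
    static double score(double[][] m, int user, int item) {
        double weighted = 0, weights = 0;
        for (int j = 0; j < m[user].length; j++) {
            if (j == item || m[user][j] == 0) {
                continue; // skip the candidate itself and unrated items
            }
            double s = cosine(m, item, j);
            weighted += s * m[user][j];
            weights += Math.abs(s);
        }
        return weights == 0 ? 0.0 : weighted / weights;
    }

    public static void main(String[] args) {
        double[][] ratings = {
            {5, 0, 1}, // customer 0 has not rated item 1
            {4, 4, 1},
            {1, 1, 5},
        };
        System.out.println("score of item 1 for customer 0: " + score(ratings, 0, 1));
    }
}
```

A higher score suggests the customer is more likely to "like" the item, because the items it resembles most are the ones the customer rated highly.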
For example, suppose we want to calculate the total number of orders
received for a particular product from each customer. In this case, the
item-based recommenders filter the recommendations by:
❑ Removing low-volume or low-revenue products from consideration
❑ Grouping customers by segment or market rather than using
customer-level data
❑ Removing zero-dollar transactions, paybacks and other such order types
The output generated by item-based recommenders serves as an input
to downstream applications, such as ERP systems, websites, etc., for
further analysis.
Clustering
Clustering algorithms act upon data and form groups of data units having similar
characteristics.
For example, customers are grouped together on the basis of their demographic
information, gender, buying patterns, etc.
Clustering is a form of unsupervised learning that involves the following three steps:
1. Similarity and Dissimilarity: Whenever a new element arrives, the
clustering process verifies whether that element belongs to an existing group or
not. If the element does not belong to any existing group, then a new cluster
is formed for it.
2. Algorithm: In this step, the element is inserted into the group to which it belongs by
using the appropriate algorithm.
3. Stopping Condition: The stopping condition defines the point at which no
further clustering or classification is required. At this point, the clustering process ends.
The following are some common applications of
clustering:
▪ Clustering is used in various applications that involve recognizing patterns,
image processing, rendering, data analytics, etc.
▪ In the IT field, clustering is used to club several computer systems involved in
the same project for easy sharing of documents, resources, etc.
▪ Clustering is used in search engines such as Google and Bing to group data with
similar characteristics.
▪ Newsgroups also use clustering algorithms to group various articles based on
related topics.
Mahout supports the following types of clustering
algorithms:
▪ K-Means clustering: It is one of the simplest and most commonly used machine
learning algorithms for grouping objects. It is based on vector quantization,
in which the objects to be clustered are represented as numeric vectors. The user
chooses the number of groups (referred to as k) into which the data set is to be divided. A separate
centroid is defined for each of the k clusters. In the next step, each point of the given data set
is associated with its nearest centroid. The first level of clustering
is completed when every point has been joined to a centroid. The k centroids are then
recomputed from the points assigned to them, and the procedure is repeated in a loop until
there is no more movement of the k centroids.
▪ Fuzzy K-Means (FKM): It is also known as Fuzzy-C-Means (FCM), and is basically an
extension of the K-Means algorithm. However, unlike K-Means, FKM allows one data
point to belong to two or more clusters. In other words, in FKM, a data point can be
associated with more than one centroid. Similar to K-Means, FKM does not cause any
changes in the input directories and works on objects that can be represented in
n-dimensional vector space.
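The K-Means loop described above (assign each point to its nearest centroid, recompute the centroids, repeat until nothing moves) can be sketched in plain Java. This is a minimal illustration, not Mahout's MapReduce implementation: the class KMeansSketch, its methods, the sample points and the choice to pass in the initial centroids are assumptions.

```java
import java.util.Arrays;

// Illustrative only: single-machine K-Means, not Mahout's implementation.
final class KMeansSketch {

    // Index of the centroid closest to the point (squared Euclidean distance).
    static int nearest(double[] point, double[][] centroids) {
        int best = 0;
        double bestDist = Double.MAX_VALUE;
        for (int c = 0; c < centroids.length; c++) {
            double dist = 0;
            for (int d = 0; d < point.length; d++) {
                double diff = point[d] - centroids[c][d];
                dist += diff * diff;
            }
            if (dist < bestDist) {
                bestDist = dist;
                best = c;
            }
        }
        return best;
    }

    // Runs assignment and update steps until assignments stop changing.
    static double[][] fit(double[][] points, double[][] initial, int maxIterations) {
        int k = initial.length, dim = points[0].length;
        double[][] centroids = new double[k][];
        for (int c = 0; c < k; c++) {
            centroids[c] = initial[c].clone();
        }
        int[] assignment = new int[points.length];
        Arrays.fill(assignment, -1);
        for (int iter = 0; iter < maxIterations; iter++) {
            boolean changed = false;
            for (int p = 0; p < points.length; p++) {
                int best = nearest(points[p], centroids);
                if (best != assignment[p]) {
                    assignment[p] = best;
                    changed = true;
                }
            }
            if (!changed) {
                break; // stopping condition: no point switched cluster
            }
            double[][] sums = new double[k][dim];
            int[] counts = new int[k];
            for (int p = 0; p < points.length; p++) {
                counts[assignment[p]]++;
                for (int d = 0; d < dim; d++) {
                    sums[assignment[p]][d] += points[p][d];
                }
            }
            for (int c = 0; c < k; c++) {
                if (counts[c] > 0) { // keep an empty cluster's centroid where it was
                    for (int d = 0; d < dim; d++) {
                        centroids[c][d] = sums[c][d] / counts[c];
                    }
                }
            }
        }
        return centroids;
    }

    public static void main(String[] args) {
        double[][] points = {{1, 1}, {1, 2}, {2, 1}, {8, 8}, {9, 8}, {8, 9}};
        double[][] initial = {{1, 1}, {9, 8}};
        System.out.println(Arrays.deepToString(fit(points, initial, 100)));
    }
}
```

Fuzzy K-Means differs in the assignment step: instead of a single nearest centroid, each point receives a degree of membership in every cluster.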
Classification
▪ Classification is a machine learning technique that uses the knowledge of known data to
determine how new data should be arranged or classified into the existing categories of the data.
▪ Classification involves identification of new datasets and categorization on the basis of certain
predefined criteria.
▪ The following are some widely-used applications of classification:
▪ If we download a new audio/video file to our phone, then the installed audio/video player
automatically classifies the new content and places it under the relevant category or creates a
new playlist.
▪ Mail service providers also use classification algorithms to classify mails into different categories.
These algorithms study the user's pattern of marking and segregating mails and then decide
whether a mail should be delivered to the spam, junk or inbox folder.
▪ Classification algorithms are also widely used in financial and insurance organizations to
analyze various risk factors related to monetary transactions or claims.
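The mail example can be illustrated with a small Naïve Bayes text classifier, the family of algorithm that the features list names as supported by Mahout. The sketch below is plain Java written for illustration, not Mahout's distributed implementation: the class NaiveBayesSketch, the whitespace tokenizer, add-one smoothing and the sample messages are all assumptions.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Illustrative only: multinomial Naive Bayes with add-one smoothing.
final class NaiveBayesSketch {
    private final Map<String, Map<String, Integer>> wordCounts = new HashMap<>();
    private final Map<String, Integer> totalWords = new HashMap<>();
    private final Map<String, Integer> docCounts = new HashMap<>();
    private final Set<String> vocabulary = new HashSet<>();
    private int totalDocs = 0;

    // Learn from one labelled message.
    void train(String label, String text) {
        docCounts.merge(label, 1, Integer::sum);
        totalDocs++;
        Map<String, Integer> counts = wordCounts.computeIfAbsent(label, k -> new HashMap<>());
        for (String word : text.toLowerCase().split("\\s+")) {
            counts.merge(word, 1, Integer::sum);
            totalWords.merge(label, 1, Integer::sum);
            vocabulary.add(word);
        }
    }

    // Return the label with the highest log-probability for the message.
    String classify(String text) {
        String best = null;
        double bestScore = Double.NEGATIVE_INFINITY;
        for (String label : docCounts.keySet()) {
            // log prior: share of training messages carrying this label
            double score = Math.log((double) docCounts.get(label) / totalDocs);
            for (String word : text.toLowerCase().split("\\s+")) {
                int count = wordCounts.get(label).getOrDefault(word, 0);
                int total = totalWords.getOrDefault(label, 0);
                // add-one smoothing keeps unseen words from zeroing the product
                score += Math.log((count + 1.0) / (total + vocabulary.size()));
            }
            if (score > bestScore) {
                bestScore = score;
                best = label;
            }
        }
        return best;
    }

    public static void main(String[] args) {
        NaiveBayesSketch nb = new NaiveBayesSketch();
        nb.train("spam", "win free prize now");
        nb.train("spam", "free money offer");
        nb.train("inbox", "meeting schedule for monday");
        nb.train("inbox", "project report attached");
        System.out.println(nb.classify("free prize offer"));
        System.out.println(nb.classify("monday project meeting"));
    }
}
```

The classifier learns from mails the user has already sorted and then assigns each new mail to the category under which its words are most probable.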
Mahout algorithms
Algorithm | Category | Description
Distributed item-based collaborative filtering | Collaborative filtering | Estimates a user's preference for one item by looking at his/her preferences for similar items
Collaborative filtering using a parallel matrix factorization | Collaborative filtering | Among a matrix of items that a user has not yet seen, predicts which items the user might prefer
Canopy clustering | Clustering | For preprocessing data before using the k-means or hierarchical clustering algorithm
Dirichlet process clustering | Clustering | Performs Bayesian mixture modeling
Fuzzy K-means | Clustering | Discovers soft clusters where a particular point can belong to more than one cluster
Hierarchical clustering | Clustering | Builds a hierarchy of clusters using either an agglomerative "bottom-up" or a divisive "top-down" approach
K-means clustering | Clustering | Aims to partition n observations into k clusters in which each observation belongs to the cluster with the nearest mean
Latent Dirichlet Allocation | Clustering | Automatically and jointly clusters words into "topics" and documents into mixtures of topics
Mean shift clustering | Clustering | Finds modes or clusters in 2-dimensional space, where the number of clusters is unknown
Spectral clustering | Clustering | Clusters points using eigenvectors of matrices derived from the data
Bayesian | Classification | Classifies objects into binary categories
Random Forest | Classification | Refers to an ensemble learning method for classification (and regression) that operates by constructing a multitude of decision trees
Minhash clustering | Clustering | Quickly estimates similarity between 2 datasets
