Implementing K-Means Clustering Algorithm Using Mapreduce Paradigm
Abstract: Clustering is a useful data mining technique which groups data points such that the points within a single group have similar characteristics, while the points in different groups are dissimilar. Partitioning methods such as the k-means algorithm are among the most widely used clustering algorithms. As an increasing number of applications deal with vast amounts of data, clustering such big data is a challenging problem. Recently, partitioning clustering algorithms running on large clusters of commodity machines using the MapReduce framework have received a lot of attention. The traditional way of clustering text documents is the vector space model, in which tf-idf is used for the k-means algorithm with a supportive similarity measure. This paper presents an approach to clustering text documents; the results obtained by executing the MapReduce k-means algorithm on a single-node cluster show that the performance of the algorithm improves as the text corpus grows.
Keywords: Vector space model, map reduce, text, clustering, map reduce k-means, Hadoop
C. Hadoop
Hadoop is a platform that provides both distributed storage
and computational capabilities. It is an open source software
project that enables the distributed processing of large data
sets across clusters of commodity servers. It is designed to
scale up from a single server to thousands of machines, with
a very high degree of fault tolerance. Hadoop [7] is a distributed master-slave architecture that consists of the Hadoop Distributed File System (HDFS) for storage and MapReduce for computation. Hadoop can handle all types of data from disparate systems: structured, unstructured, log files, pictures, audio files, communication records, email, just about anything you can think of, regardless of its native format. Even when different types of data have been stored in unrelated systems, you can dump it all into your Hadoop cluster with no prior need for a schema.

Hadoop Distributed File System (HDFS) [5] is a file system that spans all the nodes in a Hadoop cluster for data storage. HDFS splits large data files into chunks that are managed by different nodes in the cluster. Each chunk is replicated across several nodes to address single-node outage or fencing scenarios.

D. MapReduce Programming
MapReduce [6] runs as a series of jobs, with each job essentially a separate Java application that goes out into the data and starts pulling out information as needed. Based on the MapReduce design, records are processed in isolation via tasks called Mappers. The output from the Mapper tasks is further processed by a second set of tasks, the Reducers, where the results from the different Mapper tasks are merged together. Using MapReduce instead of a query gives data seekers a lot of power and flexibility, but also adds a lot of complexity. The Map and Reduce functions of MapReduce are both defined with respect to data structured in (key, value) pairs. Map takes one pair of data with a type in one data domain, and returns a list of pairs in a different domain:

Map(k1, v1) → list(k2, v2)

The Map function is applied in parallel to every pair in the input dataset. This produces a list of pairs for each call. After that, the MapReduce framework collects all pairs with the same key from all lists and groups them together, creating one group for each key. The Reduce function is then applied in parallel to each group, which in turn produces a collection of values in the same domain:

Reduce(k2, list(v2)) → list(v3)

Each Reduce call typically produces either one value v3 or an empty return, though one call is allowed to return more than one value.

Figure 1: MapReduce Tasks

3. MapReduce K-Means Clustering

The MapReduce k-means clustering approach for processing a big text corpus [4] consists of the following steps:
1) Give the sequence file created from a directory of text documents as input.
2) Tokenize and generate a TF-IDF vector for each document from the sequence file.
3) Apply the MapReduce k-means algorithm to form k clusters.

A. Sequence File from Directory of Text Documents
MapReduce programming is designed to process huge data sets in a parallel and distributed environment. Suppose we select the input data from a document set where the text files in the directory are small in size. Since HDFS and MapReduce are optimized for large files, the small text files are converted into one larger file, i.e., the SequenceFile format. SequenceFile is a Hadoop class which allows us to write document data as binary <key, value> pairs, where the key is a Text holding a unique document id and the value is the Text content of the document in UTF-8 format. The SequenceFile packs the small files and processes each whole file as a record. Since the SequenceFile is in binary format, its content cannot be read directly, but read/write operations on it are faster. A minimal sketch of this packaging step is shown after Step 2 below.

B. Creating TF-IDF Vectors
The sequence file from the previous step is fed as input to create vectors. The TF-IDF vectors are calculated in MapReduce by the following steps:

Step 1: Tokenization: The input fed to the map function is in the format of <key, value> pairs, where the key is the document name and the value is the document content. The outcome of the reduce function is also a <key, value> pair, where the key is the document name and the value is the list of tokens (words) present in that document.
Ex: Key: /acq1.txt: Value: [macandrews, forbes, holdings, bids, revlon, mcandrews, forbes, holdings, inc, said, offer, dlrs, per, share, all, revlon, group]

Step 2: Dictionary file: This step assigns a unique number to each token across all documents. The input format for the map function is <document name, wordlist> and the output of the reduce function is <word, unique id>.
Ex: Key: accounts: Value: 152.
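The SequenceFile packaging described in Section A can be sketched with the standard Hadoop API. The following minimal example is illustrative only and is not taken from the paper; the class name and the "docs"/"docs.seq" paths are assumptions. It writes each small text file as one <Text docId, Text content> record.

import java.io.File;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

// Packs a directory of small text files into one SequenceFile:
// key = unique document id (file name), value = document content in UTF-8.
public class SequenceFileFromDirectory {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Path output = new Path("docs.seq");                     // assumed output path
        try (SequenceFile.Writer writer = SequenceFile.createWriter(conf,
                SequenceFile.Writer.file(output),
                SequenceFile.Writer.keyClass(Text.class),
                SequenceFile.Writer.valueClass(Text.class))) {
            for (File doc : new File("docs").listFiles()) {     // assumed input directory
                String content = new String(Files.readAllBytes(doc.toPath()),
                        StandardCharsets.UTF_8);
                writer.append(new Text("/" + doc.getName()), new Text(content));
            }
        }
    }
}

A similar conversion is also provided by tools such as Apache Mahout's seqdirectory job; the hand-rolled version above simply makes the <key, value> layout explicit.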
Step 3: Frequency count: The number of times each word appears globally across all documents is calculated in this step. The input to the map function is <docname, words> and the output format is <word id, 1>. The output values of the map function are accumulated and summed in the reduce function. The output format of the reduce function is <word id, count>.
Ex: Key: 50: Value: 2
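As a concrete illustration of Step 3 (and of the generic Map and Reduce signatures from Section D), a word-count style job can compute these global counts. This is a minimal sketch, not the paper's code; the class names and the assumed input line layout "<docname><TAB><word id> <word id> ..." are illustrative assumptions.

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class GlobalFrequencyCount {

    // Mapper: emit <word id, 1> for every word id occurring in a document.
    public static class FrequencyMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            String[] parts = value.toString().split("\t", 2);   // "<docname>\t<word ids>"
            if (parts.length < 2) return;
            for (String wordId : parts[1].split("\\s+")) {
                context.write(new Text(wordId), ONE);           // <word id, 1>
            }
        }
    }

    // Reducer: sum the 1s emitted by the mappers; output is <word id, count>.
    public static class FrequencyReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text wordId, Iterable<IntWritable> ones, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable one : ones) sum += one.get();
            context.write(wordId, new IntWritable(sum));
        }
    }
}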
Step 4: Calculate term frequency: The MapReduce job in this step takes <docname, wordlist> as input and counts the number of times each word or term ti occurs in that document dj. The outcome of this step is in the format <docname, {ti: count}>.
Ex: In the example below, acq1.txt is the document name and the values are a list of (wordid: count) pairs.
Key: /acq1.txt: Value: {3258:1.0, 3257:1.0, 157:2.0 ...}

Step 5: Calculate tf-idf value: The output of Step 4 is taken as input to the map function of this step, which calculates the weight of each term ti in each document dj as its tf × idf value, as specified in equation (1). The output format of the result is <docname, {t1: tf-idf1, t2: tf-idf2 ... ti: tf-idfi}>.
For example, the output of this step is as follows:
Key: acq1.txt: Value: {3258:0.12728, 3257:0.12728, 462:0.08060 ...}
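The paper's equation (1) is not reproduced in this excerpt; the weight referred to in Step 5 is the standard tf-idf product, which in one common form is

tf-idf(ti, dj) = tf(ti, dj) × log( N / df(ti) )

where tf(ti, dj) is the term count from Step 4, df(ti) is the number of documents that contain ti, and N is the total number of documents in the corpus.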
C. MapReduce K-Means Algorithm
The implementation of MapReduce k-means accepts two input files. One input file contains the documents with each term and its tf-idf value, and the second is a file of k initial centroids. The set of k initial centroids is selected randomly. In every iteration, the MapReduce framework splits the input data into M splits, which are then processed in parallel as shown in Figure 2.

Mapper
Input: A set of objects X = {x1, x2, ..., xn}; a set of initial centroids C = {c1, c2, ..., ck}
Output: An output list containing pairs (ci, xj), where 1 ≤ i ≤ k and 1 ≤ j ≤ n
Procedure
  M1 ← {x1, x2, ..., xm}            (the input split assigned to this mapper)
  current_centroids ← C
  distance(p, q) = sqrt( Σ i=1..d (pi - qi)^2 ), where pi (or qi) is the coordinate of p (or q) in dimension i
  for all xi ∈ M1 such that 1 ≤ i ≤ m do
    bestCentroid ← null
    minDist ← ∞
    for all c ∈ current_centroids do
      dist ← distance(xi, c)
      if (bestCentroid = null or dist < minDist) then
        minDist ← dist
        bestCentroid ← c
      end if
    end for
    emit (bestCentroid, xi)
  end for

Reducer
Input: The output list from the mappers, i.e., (bestCentroid, object) pairs
Output: (key, value), where key = oldCentroid and value = newBestCentroid, the new centroid value calculated for that bestCentroid
Procedure
  outputlist ← output list from the mappers
  ν ← {}
  for all β ∈ outputlist do
    centroid ← β.key
    object ← β.value
    ν[centroid] ← ν[centroid] ∪ {object}
  end for
  for all centroid ∈ ν do
    newCentroid, sumOfObjects, numOfObjects ← null
    for all object ∈ ν[centroid] do
      sumOfObjects ← sumOfObjects + object
      numOfObjects ← numOfObjects + 1
    end for
    newCentroid ← sumOfObjects / numOfObjects
    emit (centroid, newCentroid)
  end for
The final output of the program will be, for each cluster, the cluster name and the number of text documents (file names) that belong to that cluster.
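A minimal Java sketch of the mapper and reducer described by the pseudocode above follows. It is illustrative rather than the paper's implementation: vectors are read as comma-separated doubles, and the class names and the "kmeans.centroids.path" configuration key are assumptions.

import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.util.ArrayList;
import java.util.List;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class KMeansMapReduce {

    // Parses a line such as "0.1,0.2,0.3" into a vector.
    static double[] parse(String line) {
        String[] parts = line.split(",");
        double[] v = new double[parts.length];
        for (int i = 0; i < parts.length; i++) v[i] = Double.parseDouble(parts[i]);
        return v;
    }

    static double distance(double[] p, double[] q) {
        double sum = 0;
        for (int i = 0; i < p.length; i++) sum += (p[i] - q[i]) * (p[i] - q[i]);
        return Math.sqrt(sum);
    }

    // Mapper: assign each document vector to its nearest current centroid.
    public static class AssignMapper extends Mapper<Object, Text, Text, Text> {
        private final List<double[]> centroids = new ArrayList<>();

        @Override
        protected void setup(Context context) throws IOException {
            // Load the current centroids file (path passed through the job configuration).
            Configuration conf = context.getConfiguration();
            Path path = new Path(conf.get("kmeans.centroids.path"));
            try (BufferedReader reader = new BufferedReader(
                    new InputStreamReader(FileSystem.get(conf).open(path)))) {
                String line;
                while ((line = reader.readLine()) != null) centroids.add(parse(line));
            }
        }

        @Override
        protected void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            double[] x = parse(value.toString());
            int best = 0;
            double minDist = Double.MAX_VALUE;
            for (int i = 0; i < centroids.size(); i++) {
                double d = distance(x, centroids.get(i));
                if (d < minDist) { minDist = d; best = i; }
            }
            // key = index of the best centroid, value = the object itself
            context.write(new Text(Integer.toString(best)), value);
        }
    }

    // Reducer: recompute each centroid as the mean of the objects assigned to it.
    public static class RecomputeReducer extends Reducer<Text, Text, Text, Text> {
        @Override
        protected void reduce(Text centroidId, Iterable<Text> objects, Context context)
                throws IOException, InterruptedException {
            double[] sum = null;
            long count = 0;
            for (Text t : objects) {
                double[] x = parse(t.toString());
                if (sum == null) sum = new double[x.length];
                for (int i = 0; i < x.length; i++) sum[i] += x[i];
                count++;
            }
            StringBuilder out = new StringBuilder();
            for (int i = 0; i < sum.length; i++) {
                if (i > 0) out.append(',');
                out.append(sum[i] / count);
            }
            // key = old centroid id, value = newly computed centroid
            context.write(centroidId, new Text(out.toString()));
        }
    }
}

Each iteration of the algorithm is one such MapReduce job; the reducer output becomes the centroid file for the next iteration.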
4. Experimental Results
This section presents the results obtained by executing the MapReduce k-means clustering algorithm on a cluster of machines. The experimentation with the 20_newsgroups dataset is explained in detail in the sections below.
A. Environment setup
The experimentation is conducted on a single-node cluster. The node is configured with an i7 processor, 4 GB of memory and a 64 GB hard disk, along with JDK 1.7.0 and Hadoop 2.4.1. The operating system used is Ubuntu 14.04 LTS.
B. Dataset description
The 20_Newsgroups data set is a collection of approximately
20,000 newsgroup documents, partitioned (nearly) evenly
across 20 different newsgroups. It was originally collected by Ken Lang, probably for his "Newsweeder: Learning to filter netnews" paper, though he does not explicitly mention this
collection. The 20 newsgroups collection has become a
popular data set for experiments in text applications of
machine learning techniques, such as text classification and
text clustering. The data is organized into 20 different
newsgroups, each corresponding to a different topic. Some of
the newsgroups are very closely related to each other
(e.g. comp.sys.ibm.pc.hardware /
comp.sys.mac.hardware), while others are highly unrelated
(e.g. misc.forsale / soc.religion.christian).
C. Data pre-processing
The data available are in .tar.gz bundles; tar and gunzip are needed to open them. Each subdirectory in the
bundle represents a newsgroup; each file in a subdirectory is
the text of some newsgroup document that was posted to that
newsgroup. A sequence file is created from the 20_newsgroups text documents. The sequence file is passed as input to find the tf-idf value of each term in every text document. The tf-idf file is then fed as input to the MapReduce k-means algorithm to form k clusters. A sketch of a driver that chains these jobs is given below.
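The sketch below shows how the iterations can be wired together as Hadoop jobs. It reuses the hypothetical KMeansMapReduce classes from the earlier sketch; the paths, the fixed iteration cap, and the "kmeans.centroids.path" key are assumptions, not the paper's configuration.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

// Driver: each pass runs one MapReduce job that reassigns documents and
// rewrites the centroids, until a maximum number of iterations is reached.
public class KMeansDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        String centroids = "centroids/iter0";          // k randomly chosen initial centroids
        for (int iter = 0; iter < 10; iter++) {         // fixed cap instead of a convergence test
            conf.set("kmeans.centroids.path", centroids);
            Job job = Job.getInstance(conf, "kmeans-iteration-" + iter);
            job.setJarByClass(KMeansMapReduce.class);
            job.setMapperClass(KMeansMapReduce.AssignMapper.class);
            job.setReducerClass(KMeansMapReduce.RecomputeReducer.class);
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(Text.class);
            FileInputFormat.addInputPath(job, new Path("tfidf-vectors"));      // tf-idf input
            FileOutputFormat.setOutputPath(job, new Path("centroids/iter" + (iter + 1)));
            if (!job.waitForCompletion(true)) System.exit(1);
            centroids = "centroids/iter" + (iter + 1) + "/part-r-00000";
        }
    }
}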
D. Results
In a single-node cluster, the MapReduce k-means algorithm is executed on the 20_newsgroups dataset with a selected number of 20,000 text files from 20 different topics. The algorithm

Figures: Time taken for iteration 0, iteration 1 and iteration 2.
5. Conclusions and Future Work

Information retrieval techniques are widely used in most search engines to efficiently organize and retrieve information. Most of the data on the internet is in unstructured or semi-structured format. Currently, clustering techniques are used to organize and group similar data objects so that search results are retrieved faster. The traditional way of clustering text documents is the vector space model, in which tf-idf is used for the k-means algorithm with a supportive similarity measure. As data is increasing enormously day by day, elastic resources are required to store and process it. The Hadoop framework supports storing and computing big data on a parallel and distributed platform with the help of HDFS and MapReduce.

References

[1] T. White. Hadoop: The Definitive Guide. O'Reilly Media.
[2] G. Salton, A. Wong, and C. S. Yang. A vector space model for automatic indexing. Communications of the ACM, vol. 18, no. 11, pp. 613-620, Nov. 1975.
[3] G. Tsatsaronis and V. Panagiotopoulou. A generalized vector space model for text retrieval based on semantic relatedness. Proceedings of the EACL 2009 Student Research Workshop, pp. 70-78, April 2009.
[4] J. Dittrich and J.-A. Quiané-Ruiz. Efficient big data processing in Hadoop MapReduce. Proceedings of the VLDB Endowment, vol. 5, no. 12, August 2012.
[5] S. Ghemawat, H. Gobioff, and S.-T. Leung. The Google file system. Proceedings of the Nineteenth ACM Symposium on Operating Systems Principles, Bolton Landing, NY, USA, 2003.
[6] J. Dean and S. Ghemawat. MapReduce: simplified data processing on large clusters. Communications of the ACM, vol. 51, no. 1, pp. 107-113, 2008.
[7] Apache Hadoop. http://hadoop.apache.org/.

Author Profile

Botcha Chandrasekhara Rao is currently pursuing his 2-year M.Tech in CSE at Pydah Kaushik College of Engineering, Gambheeram Village, Anandapuram Mandalam, Visakhapatnam. His areas of interest include Cloud Computing.

Medara Rambabu is currently working as HOD & Associate Professor in the Department of Computer Science & Engineering at Pydah Kaushik College of Engineering, Gambheeram Village, Anandapuram Mandalam, Visakhapatnam. He completed his M.Tech from JNTUK in 2010 and his B.Tech from AU in 2004. He has more than 6 years of teaching experience in various engineering colleges. His research interests include Computer Organization, Data Communication Systems, Computer Networks, and DM & DW.