Big Data Analytics

Dr. M. Vijayalakshmi
Professor, Department of Information Technology, VESIT
Vivekanand Education Society Institute of Technology
Affiliated to Mumbai University
Big Data Analytics
Copyright © 2016 by Wiley India Pvt. Ltd., 4435-36/7, Ansari Road, Daryaganj, New Delhi-110002.
Cover Image: © yienkeat/Shutterstock
All rights reserved. No part of this book may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, electronic, mechanical, photocopying, recording or scanning without the written permission of the publisher.
Limits of Liability: While the publisher and the author have used their best efforts in preparing this book, Wiley and the author make no representation or warranties with respect to the accuracy or completeness of the contents of this book, and specifically disclaim any implied warranties of merchantability or fitness for any particular purpose. There are no warranties which extend beyond the descriptions contained in this paragraph. No warranty may be created or extended by sales representatives or written sales materials.
The accuracy and completeness of the information provided herein and the opinions stated herein are not guaranteed or warranted to produce any particular results, and the advice and strategies contained herein may not be suitable for every individual. Neither Wiley India nor the author shall be liable for any loss of profit or any other commercial damages, including but not limited to special, incidental, consequential, or other damages.
Trademarks: All brand names and product names used in this book are trademarks, registered trademarks, or trade names of their respective holders. Wiley is not associated with any product or vendor mentioned in this book.
Edition: 2016
ISBN: 978-81-265-5865-0
ISBN: 978-81-265-8224-2 (ebk)
www.wileyindia.com
Printed at:
Dedicated to my husband, Shankaramani,
and son Rohit who were kind enough to
understand my busy schedule and patiently waited
for a holiday together.
—Radha Shankarmani
—M. Vijayalakshmi
Preface
software framework called Hadoop. A NoSQL database is used to capture and store the reference data
that are diverse in format and also change frequently.
With the advent of Big Data, applying existing traditional data mining algorithms to current real-world problems faces several tough challenges due to the inadequate scalability and other limitations of these algorithms. The biggest limitation is the inability of these existing algorithms to match the three Vs of emerging big data. Not only is the scale of data generated today unprecedented, but the data is also often generated continuously in the form of high-dimensional streams that require decisions just in time. Further, these algorithms were not designed to be applicable to current areas like web-based analytics, social network analysis, etc.
Thus, even though big data bears greater value (i.e., hidden knowledge and more valuable insights), it poses tremendous challenges in extracting this hidden knowledge and these insights, since the established process of knowledge discovery and data mining from conventional datasets was neither designed for nor works well with big data.
One solution to the problem is to improve existing techniques by applying massive parallel processing architectures and novel distributed storage systems, which help faster storage and retrieval of data. But this is not sufficient for mining these new data forms. The true solution lies in designing newer and innovative mining techniques which can handle the three Vs effectively.
given at the end of the chapters can be used to test the reader's understanding of the content provided in the chapter. Further, a list of suggested programming assignments can be used by a mature reader to gain expertise in this field.
1. Chapter 1 contains an introduction to Big Data, Big Data Characteristics, Types of Big Data, a comparison of the Traditional and Big Data Business Approaches, and a Case Study of Big Data Solutions.
2. Chapter 2 contains introduction to Hadoop, Core Hadoop Components, Hadoop Ecosystem,
Physical Architecture, and Hadoop Limitations.
3. Chapter 3 discusses NoSQL, NoSQL Business Drivers, Case Studies on NoSQL, NoSQL Data Architecture Patterns, Variations of NoSQL Architectural Patterns, Using NoSQL to Manage Big Data, Understanding Types of Big Data Problems, Analyzing Big Data with a Shared-Nothing Architecture, Choosing Distribution Models, Master−Slave vs Peer-to-Peer, and the way NoSQL Systems Handle Big Data Problems.
4. Chapter 4 covers MapReduce and Distributed File Systems; Map Reduce: The Map Tasks and
The Reduce Tasks; MapReduce Execution, Coping with Node Failures, Algorithms Using
MapReduce: Matrix-Vector Multiplication and Relational Algebra operations.
5. Chapter 5 introduces the concept of similarity between items in a large dataset which is the
foundation for several big data mining algorithms like clustering and frequent itemset mining.
Different measures are introduced so that the reader can apply the appropriate distance measure
to the given application.
6. Chapter 6 introduces the concept of a data stream and the challenges it poses. The chapter
looks at a generic model for a stream-based management system. Several Sampling and Filtering
techniques, which form the heart of any stream mining technique, are discussed; among them the most popularly used is the Bloom filter. Several popular stream-based algorithms like Counting Distinct Elements in a Stream, Counting Ones in a Window, and Query Processing in a Stream are discussed.
7. Chapter 7 introduces the concept of looking at the web in the form of a huge webgraph. This chapter discusses the ill effects of “Spam” and looks at Link Analysis as a way to combat text-based “Spam”. The chapter discusses Google’s PageRank algorithm and its variants in detail. The alternate ranking algorithm HITS is also discussed. A brief overview of Link Spam and techniques to overcome it is also provided.
8. Chapter 8 covers very comprehensively algorithms for Frequent Itemset Mining which is at the
heart of any analytics effort. The chapter reviews basic concepts and discusses improvements to
the popular A-priori algorithm to make it more efficient. Several newer big data frequent itemset
mining algorithms like PCY, Multihash, Multistage algorithms are discussed. Sampling-based
algorithms are also dealt with. The chapter concludes with a brief overview of identifying fre-
quent itemsets in a data stream.
9. Chapter 9 covers clustering which is another important data mining technique. Traditional clus-
tering algorithms like partition-based and hierarchical are insufficient to handle the challenges
posed by Big Data clustering. This chapter discusses two newer algorithms, BFR and CURE,
which can cluster big data effectively. The chapter provides a brief overview of stream clustering.
10. Chapter 10 discusses Recommendation Systems, A Model for Recommendation Systems, Con-
tent-Based Recommendations and Collaborative Filtering.
11. Chapter 11 introduces the social network and enumerates different types of networks and their
applications. The concept of representing a Social Network as a Graph is introduced. Algorithms
for identifying communities in a social graph and counting triangles in a social graph are dis-
cussed. The chapter introduces the concept of SimRank to identify similar entities in a social
network.
12. Appendix: This book also provides a rather comprehensive list of websites which contain open
datasets that the reader can use to understand the concept and use in their research on Big Data
Analytics.
13. Additionally each chapter provides several exercises based on the chapters and also several pro-
gramming assignments that can be used to demonstrate the concepts discussed in the chapters.
14. References are given for detailed reading of the concepts in most of the chapters.
Audience
This book can be used to teach a first course on Big Data Analytics in any senior undergraduate or
graduate course in any field of Computer Science or Information Technology. Further it can also be
used by practitioners and researchers as a single source of Big Data Information.
Acknowledgements
First and foremost, I would like to thank my mother for standing beside me throughout my career and the writing of this book. My sincere thanks to our Principal, Dr. Prachi Gharpure, too. She has been my inspiration and motivation for continuing to improve my knowledge and move my career forward. My thanks to the M.E. research students for writing the installation procedures for the laboratory exercises.
Radha Shankarmani
Several people deserve my gratitude for their help and guidance in making this book a reality. Foremost among them is Prof. Radha Shankarmani, my co-author, who pushed and motivated me to start this venture. My sincere thanks to my principal, Dr. J.M. Nair (VESIT), who has supported me wholeheartedly in this venture. My thanks to Amey Patankar and Raman Kandpal of Wiley India for mooting the idea of this book in the first place.
M.Vijayalakshmi
Together,
We would like to express our gratitude to the many people who inspired us and provided support.
Our sincere thanks to the Dean, Ad hoc Board of Studies, Information Technology, Dr. Bakal, for introducing the course in the undergraduate program and providing us an opportunity to take up this venture. Our sincere thanks to the publishers, Wiley India, and the editorial team for their continuing support in publishing this book.
Radha Shankarmani
M. Vijayalakshmi
About the Authors
Preface vii
Acknowledgements xi
Learning Objectives 1
1.1 Introduction to Big Data 1
1.1.1 So What is Big Data? 1
1.2 Big Data Characteristics 2
1.2.1 Volume of Data 2
1.3 Types of Big Data 3
1.4 Traditional Versus Big Data Approach 4
1.4.1 Traditional Data Warehouse Approach 4
1.4.2 Big Data Approach 5
1.4.3 Advantage of “Big Data” Analytics 5
1.5 Technologies Available for Big Data 6
1.6 Case Study of Big Data Solutions 7
1.6.1 Case Study 1 7
1.6.2 Case Study 2 7
Summary 8
Exercises 8
Chapter 2 Hadoop 11
Learning Objectives 11
2.1 Introduction 11
2.2 What is Hadoop? 11
2.2.1 Why Hadoop? 12
2.2.2 Hadoop Goals 12
2.2.3 Hadoop Assumptions 13
2.3 Core Hadoop Components 13
2.3.1 Hadoop Common Package 14
2.3.2 Hadoop Distributed File System (HDFS) 14
2.3.3 MapReduce 16
2.3.4 Yet Another Resource Negotiator (YARN) 18
2.4 Hadoop Ecosystem 18
2.4.1 HBase 19
2.4.2 Hive 19
2.4.3 HCatalog 20
2.4.4 Pig 20
2.4.5 Sqoop 20
2.4.6 Oozie 20
2.4.7 Mahout 20
2.4.8 ZooKeeper 21
2.5 Physical Architecture 21
2.6 Hadoop Limitations 23
2.6.1 Security Concerns 23
2.6.2 Vulnerable By Nature 24
2.6.3 Not Fit for Small Data 24
2.6.4 Potential Stability Issues 24
2.6.5 General Limitations 24
Summary 24
Review Questions 25
Laboratory Exercise 25
Learning Objectives 37
3.1 What is NoSQL? 37
3.1.1 Why NoSQL? 38
3.1.2 CAP Theorem 38
Chapter 4 MapReduce 69
Learning Objectives 69
4.1 MapReduce and The New Software Stack 69
4.1.1 Distributed File Systems 70
4.1.2 Physical Organization of Compute Nodes 71
4.2 MapReduce 75
4.2.1 The Map Tasks 76
4.2.2 Grouping by Key 76
4.2.3 The Reduce Tasks 76
4.2.4 Combiners 76
4.2.5 Details of MapReduce Execution 78
4.2.6 Coping with Node Failures 80
4.3 Algorithms Using MapReduce 81
4.3.1 Matrix-Vector Multiplication by MapReduce 82
4.3.2 MapReduce and Relational Operators 83
4.3.3 Computing Selections by MapReduce 83
4.3.4 Computing Projections by MapReduce 84
4.3.5 Union, Intersection and Difference by MapReduce 85
4.3.6 Computing Natural Join by MapReduce 87
4.3.7 Grouping and Aggregation by MapReduce 88
4.3.8 Matrix Multiplication of Large Matrices 89
4.3.9 MapReduce Job Structure 90
Summary 91
Review Questions 92
Laboratory Exercise 92
Summary 123
Exercises 124
Programming Assignments 125
References 125
HDFS stores large files in the range of gigabytes to terabytes across multiple machines. It
achieves reliability by replicating the data across multiple hosts. Data is replicated on three nodes:
two on the same rack and one on a different rack. Data nodes can communicate with each other
to re-balance data and to move copies around. HDFS is not fully POSIX-compliant to achieve
increased performance for data throughput and support for non-POSIX operations such as
Append.
The HDFS file system includes a so-called secondary NameNode, which regularly connects with
the primary NameNode and builds snapshots of the primary NameNode directory information, which
the system then saves to local or remote directories. These check-pointed images can be used to restart a failed primary NameNode without having to replay the entire journal of file-system actions and then edit the log to create an up-to-date directory structure.
An advantage of using HDFS is data awareness between the JobTracker and TaskTracker. The Job-
Tracker schedules map or reduce jobs to TaskTrackers with an awareness of the data location. For
example, if node A contains data ( x, y, z ) and node B contains data (a, b, c ), the JobTracker schedules
node B to perform map or reduce tasks on (a,b,c ) and node A would be scheduled to perform map
or reduce tasks on ( x,y,z ). This reduces the amount of traffic that goes over the network and prevents
unnecessary data transfer.
When Hadoop is used with other file systems, this advantage is not always available. This can have a
significant impact on job-completion times, which has been demonstrated when running data-intensive
jobs. HDFS was designed for mostly immutable files and may not be suitable for systems requiring
concurrent write-operations.
HDFS cannot be mounted directly by an existing operating system, so getting data into and out of the HDFS file system can be inconvenient. In Linux and other Unix systems, a Filesystem in Userspace (FUSE) virtual file system has been developed to address this problem.
File access can be achieved through the native Java API, through a generated client in the language of the user’s choice (C++, Java, Python, PHP, Ruby, etc.), through the command-line interface, or by browsing through the HDFS-UI web app over HTTP.
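As an illustration of the Java route, the following minimal sketch (not taken from the book) opens a file on HDFS and prints it line by line; the path is the one used in the laboratory exercise later in this chapter.

import java.io.BufferedReader;
import java.io.InputStreamReader;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsRead {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();          // picks up core-site.xml, hdfs-site.xml
        FileSystem fs = FileSystem.get(conf);              // handle to the configured file system
        Path file = new Path("/user/cloudera/input/wordcount.txt");
        try (BufferedReader reader =
                new BufferedReader(new InputStreamReader(fs.open(file)))) {
            String line;
            while ((line = reader.readLine()) != null) {   // stream the file line by line
                System.out.println(line);
            }
        }
        fs.close();
    }
}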
Summary
• MapReduce brings compute to the data, in contrast to traditional distributed systems, which bring data to the compute resources.
• Hadoop stores data in a replicated and distributed way on HDFS. HDFS stores files in chunks which are physically stored on multiple compute nodes.
• MapReduce is ideal for operating on very large, unstructured datasets when aggregation across large datasets is required, and this is accomplished by using the power of Reducers.
• Hadoop jobs go through a map stage and a reduce stage where
  - the mapper transforms the input data into key–value pairs where multiple values for the same key may occur.
  - the reducer transforms all of the key–value pairs sharing a common key into a single key–value pair.
• There are specialized services that form the Hadoop ecosystem to complement the Hadoop modules. These are HBase, Hive, Pig, Sqoop, Mahout, Oozie, Spark and Ambari, to name a few.
Review Questions
Laboratory Exercise
• Click on the Next button. A new window will open. Select the RAM size and click on the Next button.
• Here you have three options, out of which select “Use an existing virtual hard drive file”. Browse your Cloudera folder for the file with the .vmdk extension. Select that file and press ENTER.
Now that we have successfully created the VM, we can start Cloudera by clicking on the Start button. It will take some time to open; wait for 2 to 3 minutes. Here the operating system is CentOS.
Once the system is loaded, we will start with a simple program called “wordcount” using the MapReduce function, which is a simple “hello world” kind of program for Hadoop.
STEPS FOR RUNNING WORDCOUNT PROGRAM:
1. OPEN the Terminal. Install a package “wget” by typing the following command:
$ sudo yum -y install wget
2. Make directory:
$ mkdir temp
3. Go to the temp directory:
$ cd temp
4. Create a file with some content in it:
$ echo "This is SPIT and you can call me Sushil. I am good at statistical modeling and data analysis" > wordcount.txt
5. Make an input directory in the HDFS file system:
$ hdfs dfs -mkdir /user/cloudera/input
6. Copy the file from the local directory to the HDFS file system:
$ hdfs dfs -put /home/cloudera/temp/wordcount.txt /user/cloudera/input/
7. To check if your file is successfully copied or not, use:
$ hdfs dfs -ls /user/cloudera/input/
8. To check hadoop-mapreduce-examples, use:
$ hadoop jar /usr/lib/hadoop-mapreduce/hadoop-mapreduce-examples.jar
9. Run the wordcount program by typing the following command:
$ hadoop jar /usr/lib/hadoop-mapreduce/hadoop-mapreduce-examples.jar wordcount /user/
cloudera/input/wordcount.txt /user/cloudera/output
Note: The output will be generated in the output directory in HDFS file system and stored in
part file “part-r-00000”.
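To view the result (assuming the output path used in step 9), the part file can be printed with:
$ hdfs dfs -cat /user/cloudera/output/part-r-00000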
B. Guidelines to Install Hadoop 2.5.2 on top of Ubuntu 14.04 and write WordCount Program
in Java using MapReduce structure and test it over HDFS
Pre-requisite: Apache, Java and ssh packages must be installed. If not, then follow the steps below.
1. Before installing the above packages, create a new user to run Hadoop (hduser or huser) and give
it sudo rights:
• Create a group named hadoop:
$ sudo addgroup hadoop
• To create the user and add it to the group named hadoop, use
$ sudo adduser --ingroup hadoop hduser
• To give sudo rights to hduser, use
$ sudo adduser hduser sudo
• To switch to user hduser, use
$ su hduser
2. Install the following software:
# Update the source list
$ sudo apt-get update
2.1 Apache
$ sudo apt-get install apache2
# The OpenJDK project is the default version of Java.
# It is provided from a supported Ubuntu repository.
2.2 Java
$ sudo apt-get install default-jdk
$ java -version
2.3 Installing SSH: ssh has two main components, namely,
• ssh: The command we use to connect to remote machines − the client.
• sshd: The daemon that is running on the server and allows clients to connect to the server.
The ssh is pre-enabled on Linux, but in order to start sshd daemon, we need to install ssh first. Use the
following command to do so:
$ sudo apt-get install ssh
This will install ssh on our machine. Verify if ssh is installed properly with the which command:
$ which ssh
o/p: /usr/bin/ssh
$ which sshd
o/p: /usr/sbin/sshd
Create and Setup SSH Certificates: Hadoop requires SSH access to manage its nodes, that is, remote
machines plus our local machine. For our single-node setup of Hadoop, we therefore need to configure
SSH access to local host. So, we need to have SSH up and running on our machine and configured to
allow SSH public key authentication. Hadoop uses SSH (to access its nodes) which would normally
require the user to enter a password. However, this requirement can be eliminated by creating and set-
ting up SSH certificates using the following commands. If asked for a filename just leave it blank and
press the enter key to continue.
$ ssh-keygen -t rsa -P ""
Note: After typing the above command just press Enter two times.
$ cat $HOME/.ssh/id_rsa.pub >> $HOME/.ssh/authorized_keys
The second command adds the newly created key to the list of authorized keys so that Hadoop can use
ssh without prompting for a password.
We can check if ssh works using the following command:
$ ssh localhost
o/p:
The authenticity of host ‘localhost (127.0.0.1)’ cannot be established.
ECDSA key fingerprint is e1:8b:a0:a5:75:ef:f4:b4:5e:a9:ed:be:64:be:5c:2f.
Are you sure you want to continue connecting (yes/no)? yes
Warning: Permanently added ‘localhost’ (ECDSA) to the list of known hosts.
Welcome to Ubuntu 14.04.1 LTS (GNU/Linux 3.13.0-40-generic x86_64)
1. ~/.bashrc: Before editing the .bashrc file in our home directory, we need to find the path where
Java has been installed to set the JAVA_HOME environment variable using the following
command:
$ update-alternatives --config java
Now we can append the following to the end of ~/.bashrc:
$ vi ~/.bashrc
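The lines to be appended are not reproduced at this point in the excerpt; a typical set, assuming Hadoop is unpacked under /usr/local/hadoop and OpenJDK 7 is the installed Java (the exact variable names may differ from the original), is:

#HADOOP VARIABLES START
export JAVA_HOME=/usr/lib/jvm/java-7-openjdk-amd64
export HADOOP_INSTALL=/usr/local/hadoop
export PATH=$PATH:$HADOOP_INSTALL/bin:$HADOOP_INSTALL/sbin
export HADOOP_MAPRED_HOME=$HADOOP_INSTALL
export HADOOP_COMMON_HOME=$HADOOP_INSTALL
export HADOOP_HDFS_HOME=$HADOOP_INSTALL
export YARN_HOME=$HADOOP_INSTALL
#HADOOP VARIABLES END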
$ source ~/.bashrc
Note that the JAVA_HOME should be set as the path just before the ‘.../bin/’:
$ javac -version
$ which javac
$ readlink -f /usr/bin/javac
2. /usr/local/hadoop/etc/hadoop/hadoop-env.sh: We need to set JAVA_HOME by modifying
hadoop-env.sh file.
$ vi /usr/local/hadoop/etc/hadoop/hadoop-env.sh
Add the following configuration:
export JAVA_HOME=/usr/lib/jvm/java-7-openjdk-amd64
3. /usr/local/hadoop/etc/hadoop/core-site.xml: This file contains configuration properties that
Hadoop uses when starting up. This file can be used to override the default settings that Hadoop
starts with.
$ sudo mkdir -p /app/hadoop/tmp
$ sudo chown hduser:hadoop /app/hadoop/tmp
Open the file and enter the following in between the <configuration></configuration> tag:
$ vi /usr/local/hadoop/etc/hadoop/core-site.xml
<configuration>
  <property>
    <name>hadoop.tmp.dir</name>
    <value>/app/hadoop/tmp</value>
  </property>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://localhost:54310</value>
  </property>
</configuration>
4. /usr/local/hadoop/etc/hadoop/mapred-site.xml: Enter the following in between the <configuration></configuration> tags:
<configuration>
  <property>
    <name>mapred.job.tracker</name>
    <value>localhost:54311</value>
    <description>The host and port that the MapReduce job tracker runs at.</description>
  </property>
</configuration>
5. /usr/local/hadoop/etc/hadoop/hdfs-site.xml: Enter the following in between the <configuration></configuration> tags:
<configuration>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
    <description>The actual number of replications can be specified when the file is created.
    The default is used if replication is not specified at create time.</description>
  </property>
  <property>
    <name>dfs.namenode.name.dir</name>
    <value>file:/usr/local/hadoop_store/hdfs/namenode</value>
  </property>
  <property>
    <name>dfs.datanode.data.dir</name>
    <value>file:/usr/local/hadoop_store/hdfs/datanode</value>
  </property>
</configuration>
B4. Format the New Hadoop File System: Now the Hadoop file system needs to be formatted so
that we can start to use it. The format command should be issued with write permission since it
creates the current directory under the /usr/local/hadoop_store/hdfs/namenode folder:
$ hadoop namenode -format
Note that the hadoop namenode -format command should be executed once, before we start using
Hadoop. If this command is executed again after Hadoop has been used, it will destroy all the
data on the Hadoop file system.
Starting Hadoop: Now it is time to start the newly installed single node cluster. We can use
start-all.sh or (start-dfs.sh and start-yarn.sh)
$ cd /usr/local/hadoop/sbin
$ start-all.sh
We can check if it is really up and running using the following command:
$ jps
o/p:
9026 NodeManager
7348 NameNode
9766 Jps
8887 ResourceManager
7507 DataNode
The output means that we now have a functional instance of Hadoop running on our VPS
(Virtual private server).
$ netstat -plten | grep java
Stopping Hadoop
$ cd /usr/local/hadoop/sbin
$ stop-all.sh
B5. Running Wordcount on Hadoop 2.5.2: Wordcount is the “hello world” program for MapReduce.
The code for the wordcount program is:
package org.myorg;

import java.io.IOException;
import java.util.*;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.conf.*;
import org.apache.hadoop.io.*;
import org.apache.hadoop.util.*;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapred.JobConf;

public class myWordCount {

    public static class Map extends Mapper<LongWritable, Text, Text, IntWritable> {
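The listing breaks off here in this excerpt. A minimal sketch of the remaining map and reduce logic under the standard Hadoop 2.x API (the bodies below are an illustrative completion, not reproduced from the book) is:

        // illustrative completion of the Map class begun above
        private final static IntWritable one = new IntWritable(1);
        private Text word = new Text();

        public void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer tokenizer = new StringTokenizer(value.toString());
            while (tokenizer.hasMoreTokens()) {
                word.set(tokenizer.nextToken());
                context.write(word, one);              // emit (word, 1) for every token
            }
        }
    }

    public static class Reduce extends Reducer<Text, IntWritable, Text, IntWritable> {
        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {           // add up the counts for this word
                sum += val.get();
            }
            context.write(key, new IntWritable(sum));
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "wordcount");
        job.setJarByClass(myWordCount.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        job.setMapperClass(Map.class);
        job.setReducerClass(Reduce.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}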
LEARNING OBJECTIVES
After reading this chapter, you will be able to:
• Understand NoSQL business drivers.
• Learn the desirable features of NoSQL that drive business.
• Learn the need for NoSQL through case studies.
• Learn NoSQL data architectural patterns.
• Learn the variations in NoSQL architectural patterns.
• Learn how NoSQL is used to manage big data.
• Learn how NoSQL systems handle big data problems.
NoSQL is a database management system that provides a mechanism for the storage and retrieval of massive amounts of unstructured data in a distributed environment on virtual servers, with the focus on providing high scalability, performance, availability and agility.
In other words, NoSQL was developed in response to the large volume of data stored about users, objects and products that needs to be frequently accessed and processed. Relational databases are not designed to scale and change easily enough to cope with the needs of the modern industry. Also, they do not take advantage of the cheap storage and processing power available today through commodity hardware.
A NoSQL database is also referred to as “Not only SQL”. Most NoSQL systems are entirely non-relational; they do not have fixed schemas or JOIN operations. Instead they use objects, key−value pairs, or tuples. Some NoSQL implementations are SimpleDB, Google BigTable, Apache Hadoop, MapReduce and MemcacheDB. There are approximately 150 NoSQL databases available in the market. Companies that use NoSQL heavily include Netflix, LinkedIn and Twitter, which use it for analyzing their social network data.
In short:
1. NoSQL is a next-generation database approach which is quite different from the traditional database.
2. NoSQL stands for “Not only SQL”: SQL as well as other query languages can be used with NoSQL databases.
1. Consistency guarantees all storage and their replicated nodes have the same data at the same time.
2. Availability means every request is guaranteed to receive a success or failure response.
3. Partition tolerance guarantees that the system continues to operate in spite of arbitrary partition-
ing due to network failures.
cope with the speed at which information needs to be extracted. Businesses have to capture and analyze a large amount of variable data, and make immediate changes in their business based on their findings.
Figure 3.1 shows how the business drivers velocity, volume, variability and agility pressure the RDBMS and necessitate the emergence of NoSQL solutions. All of these drivers apply pressure to the single-CPU relational model and eventually make the system less stable.
3.2.1 Volume
There are two ways to look at data processing to improve performance. If the key factor is only speed, a faster processor could be used. If the processing involves complex (heavy) computation, a Graphics Processing Unit (GPU) could be used along with the CPU. But the volume of data that can be handled this way is limited by the on-board GPU memory. The main reason for organizations to look for an alternative to their current RDBMSs is the need to query big data. The need for horizontal scaling made organizations move from serial to distributed parallel processing, where big data is fragmented and processed using clusters of commodity machines. This has been made possible by the development of technologies like Apache Hadoop, HDFS, MapR, HBase, etc.
3.2.2 Velocity
Velocity becomes the key factor when online queries to the database made by social networking and e-commerce websites have to be read and written in real time. Many single-CPU RDBMS systems are unable to cope with the demands of real-time inserts. RDBMS systems frequently index many columns, which decreases system performance. For example, when online shopping sites introduce great discount schemes, the random bursts in web traffic slow down the response for every user, and tuning these systems as demand increases can be costly when both high read and write throughput is required.
3.2.3 Variability
Organizations that need to capture and report on certain uncommon data struggle when attempting to use the fixed schema of an RDBMS. For example, if a business process wants to store a few special attributes for a small set of customers, it needs to alter its schema definition. If a change is made to the schema, all customer rows within the database will also have this column. If there is no value for most of the customers, the row−column representation becomes a sparse matrix. In addition, adding new columns to an RDBMS requires executing an ALTER TABLE command. This cannot be done on the fly, since the presently executing transactions have to complete and the database has to be closed before the schema can be altered. This process affects system availability, which means losing business.
3.2.4 Agility
The process of storing and retrieving data for complex queries in an RDBMS is quite cumbersome. If it is a nested query, the data will have nested and repeated subgroups of data structures that are included in an object-relational mapping layer. This layer is responsible for generating the exact combination of SELECT, INSERT, DELETE and UPDATE SQL statements to move the object data from and to the backend RDBMS layer. The process is not simple and requires experienced developers with knowledge of object-relational frameworks such as Java Hibernate. Even then, these change requests can cause slowdowns in implementation and testing.
Desirable features of NoSQL that drive business are listed below:
1. 24 × 7 Data availability: In the highly competitive world today, downtime is equated to real
dollars lost and is deadly to a company’s reputation. Hardware failures are bound to occur. Care
has to be taken that there is no single point of failure and system needs to show fault tolerance.
For this, both function and data are to be replicated so that if database servers or “nodes” fail,
the other nodes in the system are able to continue with operations without data loss. NoSQL
database environments are able to provide this facility. System updates can be made dynamically
without having to take the database offline.
2. Location transparency: The ability to read and write to a storage node regardless of where that I/O operation physically occurs is termed “location transparency” or “location independence”. Customers in many different geographies need to keep data local at those sites for fast access. Any write that updates a node in one location is propagated out from that location so that it is available to users and systems at other locations.
3. Schema-less data model: Most business data is unstructured and unpredictable, which an RDBMS cannot cater to. A NoSQL database system offers a schema-free, flexible data model that can easily accept all types of structured, semi-structured and unstructured data. The relational model, in contrast, has scalability and performance problems when it has to manage large data volumes; the NoSQL data model handles these easily and delivers very fast performance for both read and write operations.
4. Modern day transaction analysis: Most of the transaction details relate to customer profile,
reviews on products, branding, reputation, building business strategy, trading decisions, etc. that
do not require ACID transactions. The data consistency denoted by “C” in ACID property in
RDBMSs is enforced via foreign keys/referential integrity constraints. This type of consistency is not required in progressive data management systems such as NoSQL databases, since there is no JOIN operation. Here, “consistency” refers to the notion stated in the CAP theorem, which signifies the immediate or eventual consistency of data across all nodes that participate in a distributed database.
5. Architecture that suits big data: NoSQL solutions provide modern architectures for
applications that require high degrees of scale, data distribution and continuous availability.
For this, multiple data center support, with which a NoSQL environment complies, is one of the requirements. The solution should not only address today’s big data needs but also suit longer time horizons. Hence big data brings four major considerations into enterprise architecture, which are as follows:
• Scale of data sources: Many companies work with multi-terabyte and even petabyte-scale data.
• Speed is essential: Overnight extract-transform-load (ETL) batches are insufficient and real-time streaming is required.
• Change in storage models: Solutions like the Hadoop Distributed File System (HDFS) and unstructured data stores like Apache Cassandra, MongoDB and Neo4j provide new options.
• Multiple compute methods for Big Data Analytics must be supported.
Figure 3.2 shows the architecture that suits big data.
(Figure 3.2: an architecture that suits big data, in which site activity data flows into a storage engine that serves online queries.)
6. Analytics and business intelligence: A key business strategic driver that suggests the implemen-
tation of a NoSQL database environment is the need to mine the data that is being collected
in order to derive insights and gain competitive advantage. Traditional relational database systems pose great difficulty in extracting meaningful business intelligence from very high volumes of data. NoSQL database systems not only provide storage and management of big data but also deliver integrated data analytics that provides instant understanding of complex datasets and facilitates various options for easy decision-making.
(b) To check the different commands available in MongoDB type the command as shown below:
>db.help()
O/P:
(c) DB methods:
db.adminCommand(nameOrDocument) - switches to the 'admin' db and runs the command [just calls db.runCommand(...)]
db.auth(username, password)
db.cloneDatabase(fromhost)
db.commandHelp(name) - returns the help for the command
db.copyDatabase(fromdb, todb, fromhost)
db.createCollection(name, { size : ..., capped : ..., max : ... } )
db.createUser(userDocument)
db.currentOp() - displays currently executing operations in the db
db.dropDatabase()
db.eval(func, args) - runs code server-side
db.fsyncLock() - flushes data to disk and locks the server for backups
db.fsyncUnlock() - unlocks the server following a db.fsyncLock()
db.getCollection(cname) - same as db['cname'] or db.cname
db.getCollectionInfos()
db.getCollectionNames()
db.getLastError() - just returns the err msg string
(d) To check the current statistic of database type the command as follows:
>db.stats()
O/P:
{
    "db" : "test",
    "collections" : 0,
    "objects" : 0,
    "avgObjSize" : 0,
    "dataSize" : 0,
    "storageSize" : 0,
    "numExtents" : 0,
    "indexes" : 0,
    "indexSize" : 0,
    "fileSize" : 0,
    "dataFileVersion" : {
    },
    "ok" : 1
}
>
Note: In the output we can see that everything is “0”. This is because we haven’t yet created any collection.
Some considerations while designing a schema in MongoDB:
For example, let us say that a client needs a database design for his blog, and let us see the differences between the RDBMS and MongoDB schema designs. The website has the following requirements:
The RDBMS schema design for the above requirements will have a minimum of three tables:
Comment(comment_id,post_id,by_user,date_time,likes,messages)
post(id,title,description,like,url,post_by)
tag_list(id,post_id,tag)
The MongoDB schema design, on the other hand, will have one collection (i.e., Post) with the following structure:
{
_id: POST_ID
title: TITLE_OF_POST,
description: POST_DESCRIPTION,
by: POST_BY,
url: URL_OF_POST,
tags: [TAG1, TAG2, TAG3],
likes: TOTAL_LIKES,
comments: [
{
user:’COMMENT_BY’,
message: TEXT,
dateCreated: DATE_TIME,
like: LIKES
},
{
user:’COMMENT_BY’,
message: TEXT,
dateCreated: DATE_TIME,
like: LIKES
}
]
}
The table given below shows the basic terminology in MongoDB in relation with RDBMS:
RDBMS MongoDB
Database Database
Table Collection
Tuple/Row Document
Column Field
Table Join Embedded Documents
Primary Key Primary Key (Default key _id provided by MongoDB itself)
1. The use command: In MongoDB, the use command is used to create a new database. The command creates a new database if it does not exist; otherwise it returns the existing database.
Syntax:
use DATABASE_NAME
Example: If you want to create a database with the name <mydatabase1>, then use the use DATABASE statement as follows:
> use mydatabase1
switched to db mydatabase1
To check your currently selected database type “db”.
>db
mydatabase1
To check your database list type the following command:
> show dbs
admin (empty)
local 0.078GB
test (empty)
Our newly created database (mydatabase1) is not present in the list. To display the database, we need to insert at least one document into it:
>db.students.insert({"name":"Sushil","place":"Mumbai"})
WriteResult({ “nInserted” : 1 })
> show dbs
admin (empty)
local 0.078GB
mydatabase1 0.078GB
test (empty)
2. The dropDatabase() Method: In MongoDB, db.dropDatabase() command is used to drop an
existing database.
Syntax:
db.dropDatabase()
Basically it will delete the selected database, but if you have not selected any database, then it will
delete the default test database.
For example, to do so, first check the list of available databases by typing the command:
> show dbs
admin (empty)
local 0.078GB
mydatabase1 0.078GB
test (empty)
Suppose you want to delete the newly created database (i.e., mydatabase1); then:
> use mydatabase1
switched to db mydatabase1
>db.dropDatabase()
{ "dropped" : "mydatabase1", "ok" : 1 }
Now just check the list of databases. You will find that the database name mydatabase1 is not
present in the list. This is because it got deleted.
> show dbs
admin (empty)
local 0.078GB
test (empty)
3. The createCollection() Method: In MongoDB, db.createCollection(name, options) is used to
create collection.
Syntax:
db.createCollection(name,options)
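No example call appears at this point in the excerpt; an illustrative call (using the collection name that shows up under the drop() method below) would be:
> db.createCollection("mycollection1")
{ "ok" : 1 }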
4. The drop() Method: MongoDB’s db.collection.drop() is used to drop a collection from the
database.
Syntax:
db.COLLECTION_NAME.drop()
For example,
-First check the available collections:
> show collections
mycollection1
students
system.indexes
-Delete the collection named mycollection1:
>db.mycollection1.drop()
true
-Again check the list of collection:
> show collections
students
system.indexes
>
LEARNING OBJECTIVES
After reading this chapter, you will be able to:
• Learn the need for MapReduce.
• Understand the Map task, Reducer task and Combiner task.
• Learn various MapReduce functions.
• Learn MapReduce algorithms for relational algebra operations.
• Learn the MapReduce algorithm for matrix multiplication.
(Figure: Hadoop physical architecture. A Client interacts with a Master node running the MapReduce JobTracker and the HDFS NameNode; each Slave node runs a TaskTracker and a DataNode.)
1. The NameNode coordinates and monitors the data storage function (HDFS), while the
JobTracker coordinates the parallel processing of data using MapReduce.
2. SlaveNode does the actual work of storing the data and running the computations. Master-
Nodes give instructions to their SlaveNodes. Each slave runs both a DataNode and a TaskTracker
daemon that communicate with their respective MasterNodes.
3. The DataNode is a slave to the NameNode.
4. The TaskTracker is a slave to the JobTracker.
It is a simple word count exercise. The client will load the data into the cluster (Feedback.txt),
submit a job describing how to analyze that data (word count), the cluster will store the results in a new
file (Returned.txt), and the client will read the results file.
The client is going to break the data file into smaller “blocks” and place those blocks on different machines throughout the cluster. Every block of data is kept on multiple machines at once to avoid data loss, so each block will be replicated in the cluster as it is loaded. The standard setting for Hadoop is to keep three copies of each block in the cluster. This can be configured with the dfs.replication parameter in the file hdfs-site.xml.
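As a sketch, the corresponding entry has the following form (3 is the value Hadoop ships with as its default; the single-node setup in the Chapter 2 laboratory exercise sets it to 1):

<configuration>
  <property>
    <name>dfs.replication</name>
    <value>3</value>  <!-- number of copies kept for each block -->
  </property>
</configuration>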
The client breaks Feedback.txt into three blocks. For each block, the client consults the NameNode
and receives a list of three DataNodes that should have a copy of this block. The client then writes the
block directly to the DataNode. The receiving DataNode replicates the block to other DataNodes, and
the cycle repeats for the remaining blocks. Two of these DataNodes, where the data is replicated, are
in the same rack and the third one is in another rack in the network topology to prevent loss due to
network failure. The NameNode as it is seen is not in the data path. The NameNode only provides the
metadata, that is, the map of where data is and where data should be in the cluster (such as IP address,
port number, Host names and rack numbers).
The client will initiate a TCP connection to DataNode 1 and send DataNode 1 the location details of the other two DataNodes. DataNode 1 will initiate a TCP connection to DataNode 2, handshake, and also provide DataNode 2 information about DataNode 3. DataNode 2 ACKs and will initiate a TCP connection to DataNode 3, handshake, and provide DataNode 3 information about the client, which DataNode 3 ACKs.
On successful completion of the three replications, “Block Received” report is sent to the NameNode.
“Success” message is also sent to the Client to close down the TCP sessions. The Client informs the
NameNode that the block was successfully written. The NameNode updates its metadata info with the
node locations of Block A in Feedback.txt. The Client is ready to start the process once again for the
next block of data.
The above process shows that Hadoop uses a lot of network bandwidth and storage.
The NameNode not only holds all the file system metadata for the cluster, but also oversees the
health of DataNodes and coordinates access to data. The NameNode acts as the central controller of
HDFS. DataNodes send heartbeats to the NameNode every 3 seconds via a TCP handshake using the
same port number defined for the NameNode daemon. Every 10th heartbeat is a Block Report, where
the DataNode tells the NameNode about all the blocks it has.
Every hour, by default, the Secondary NameNode connects to the NameNode and copies the in-memory metadata information contained in the NameNode and the files used to store metadata (both may or may not be in sync). The Secondary NameNode combines this information into a fresh set of files and delivers them back to the NameNode, while keeping a copy for itself.
1. Client machine submits the MapReduce job to the JobTracker, asking “How many times does
Refund occur in Feedback.txt?”
2. The JobTracker finds from the NameNode which DataNodes have blocks of Feedback.txt.
3. The JobTracker then provides the TaskTracker running on those nodes with the required Java
code to execute the Map computation on their local data.
4. The TaskTracker starts a Map task and monitors the task’s progress.
5. The TaskTracker provides heartbeats and task status back to the JobTracker.
6. As each Map task completes, each node stores the result of its local computation as “intermediate
data” in temporary local storage.
7. This intermediate data is sent over the network to a node running a Reduce task for final
computation.
Note: If the nodes with local data already have too many other tasks running and cannot accept any more, then the JobTracker will consult the NameNode, whose Rack Awareness knowledge can suggest other nodes in the same rack. In-rack switching ensures a single hop and hence high bandwidth.
Google was the pioneer in this field with the use of a PageRank measure for ranking Web pages with
respect to a user query. Spammers responded with ways to manipulate PageRank too with what is called
Link Spam. Techniques like TrustRank were used for detecting Link Spam. Further, various variants of
PageRank are also in use to evaluate the Web pages.
This chapter provides the reader with a comprehensive overview of Link Analysis techniques.
1. Full-text index search engines such as AltaVista and Lycos, which presented the user with a keyword search interface. Given the scale of the Web and its growth rate, creating indexes becomes a herculean task.
2. Taxonomy-based search engines, where Web pages were organized in a hierarchical way based on category labels, for example Yahoo!. Creating accurate taxonomies requires accurate classification techniques, and this becomes impossible given the size of the Web and its rate of growth.
As the Web became increasingly used in applications like e-selling, opinion forming, information pushing,
etc., web search engines began to play a major role in connecting users to information they require. In these
situations, in addition to fast searching, the quality of results returned by a search engine also is extremely
important. Web page owners thus have a strong incentive to create Web pages that rank highly in a search
query. This led to the first generation of spam, which means “manipulation of Web page content for the
purpose of appearing high up in search results for selected keywords”. Earlier search engines came up with
several techniques to detect spam, and spammers responded with a richer set of spam techniques.
Spamdexing is the practice of search engine spamming. It is a combination of Spamming with Indexing.
Search Engine Optimization (SEO) is an industry that attempts to make a Website attractive to the major
search engines and thus increase their ranking. Most SEO providers resort to Spamdexing, which is the
practice of creating Websites that will be illegitimately indexed with a high position in the search engines.
Two popular techniques of Spamdexing include “Cloaking” and use of “Doorway” pages.
1. Cloaking is the technique of returning different pages to search engines than what is being
returned to the people. When a person requests the page of a particular URL from the Website,
the site’s normal page is returned, but when a search engine crawler makes the same request, a
special page that has been created for the engine is returned, and the normal page for the URL is
hidden from the engine − it is cloaked. This results in the Web page being indexed by the search
engine under misleading keywords. For every page in the site that needs to be cloaked, another
page is created that will cause this page to be ranked highly in a search engine. If more than one
search engine is being targeted, then a page is created for each engine, based on the criteria used
by the different engines to rank pages. Thus, when the user searches for these keywords and views
the selected pages, the user is actually seeing a Web page that has a totally different content than
that indexed by the search engine. Figure 7.1 illustrates the process of cloaking.
(Figure 7.1: Cloaking. The Web server checks whether a request comes from a human user or from a search engine crawler, and returns a different page accordingly.)
Other techniques used by spammers include meta-tag stuffing, scraper sites, article spinning, etc.
The interested reader can go through the references for more information on term spam.
The techniques used by spammers to fool search engines into ranking useless pages higher are called “Term Spam”. Term spam refers to spam perpetrated because search engines use the visibility of terms or content in a Web page to rank them.
As a concerted effort to defeat spammers who manipulate the text of their Web pages, newer search
engines try to exploit the link structure of the Web − a technique known as link analysis . The first Web
search engine known to apply link analysis on a large scale was Google, although almost all current
Web search engines make use of it. But the war between search engines and spammers is far from over
as spammers now invest considerable effort in trying to manipulate the link structure too, which is now
termed link spam.
7.3 PageRank
One of the key concepts for improving Web search has been to analyze the hyperlinks and the graph
structure of the Web. Such link analysis is one of many factors considered by Web search engines in
computing a composite score for a Web page on any given query.
For the purpose of better search results and especially to make search engines resistant against term
spam, the concept of link-based analysis was developed. Here, the Web is treated as one giant graph:
The Web page being a node and edges being links pointing to this Web page. Following this concept,
the number of inbound links for a Web page gives a measure of its importance. Hence, a Web page is
generally more important if many other Web pages link to it. Google, the pioneer in the field of search
engines, came up with two innovations based on link analysis to combat term spam:
1. Consider a random surfer who begins at a Web page (a node of the Web graph) and executes a
random walk on the Web as follows. At each time step, the surfer proceeds from his current page
A to a randomly chosen Web page that A has hyperlinks to. As the surfer proceeds in this random
walk from node to node, some nodes will be visited more often than others; intuitively, these are
nodes with many links coming in from other frequently visited nodes. As an extension to this
idea, consider a set of such random surfers and after a period of time find which Web pages had
large number of surfers visiting it. The idea of PageRank is that pages with large number of visits
are more important than those with few visits.
2. The ranking of a Web page is not dependent only on terms appearing on that page, but some
weightage is also given to the terms used in or near the links to that page. This helps to avoid
term spam because even though a spammer may add false terms to one Website, it is difficult to
identify and stuff keywords into pages pointing to a particular Web page as that Web page may
not be owned by the spammer.
The algorithm based on the above two concepts first initiated by Google is known as PageRank.
Since both the number and quality of links matter, spammers cannot simply create a set of dummy low-quality Web pages and have them increase the number of in-links to a favored Web page.
In Google’s own words: PageRank works by counting the number and quality of links to a page to deter-
mine a rough estimate of how important the Website is. The underlying assumption is that more important
Websites are likely to receive more links from other Websites.
This section discusses the PageRank algorithm in detail.
(Figure 7.2: an example Web graph with five nodes, numbered 1 to 5.)
Let us consider a random surfer who begins at a Web page (a node of the Web graph) and executes a random walk on the Web as follows. At each time step, the surfer proceeds from his current page to a randomly chosen Web page that it has hyperlinks to. So in our figure, if the surfer is at node 1, from which there are two hyperlinks, to nodes 3 and 5, the surfer proceeds at the next time step to one of these two nodes, each with probability 1/2. The surfer has zero probability of reaching nodes 2 and 4.
We can create a “transition matrix” M of the Web, similar to an adjacency matrix representation of a graph, except that instead of using Boolean values to indicate the presence of links, we indicate the probability of a random surfer reaching that node from the current node. The matrix M is an n × n matrix if there are n Web pages. For a Web page pair (P_i, P_j), the corresponding entry of M (row i, column j) is

M(i, j) = 1/k

where k is the number of outlinks from P_j and one of these is to page P_i; otherwise M(i, j) = 0. Thus,
for the Web graph of Fig. 7.2, the following will be the matrix M 5:
M_5 = \begin{bmatrix}
0 & 1 & 0 & 0 & 0 \\
0 & 0 & 0 & 1/3 & 1/2 \\
1/2 & 0 & 0 & 1/3 & 0 \\
0 & 0 & 0 & 0 & 1/2 \\
1/2 & 0 & 1 & 1/3 & 0
\end{bmatrix}   (7.1)
We see that column 2 represents node 2, and since it has only one outlink to node 1, only first row has
a 1 and all others are zeroes. Similarly, node 4 has outlinks to node 2, 3, and 5 and thus has value 1/3
to these nodes and zeroes to node 1 and 4.
1. Initially the surfer can be at any of the n pages with probability 1/n. We denote this by the column vector

v_0 = \begin{bmatrix} 1/n \\ 1/n \\ \vdots \\ 1/n \end{bmatrix}
2. Consider M, the transition matrix. When we look at the matrix M 5 in Eq. (7.1), we notice two
facts: the sum of entries of any column of matrix M is always equal to 1. Further, all entries
have values greater than or equal to zero. Any matrix possessing the above two properties is called the matrix of a Markov chain process, also called a Markov transition matrix. At any given instant of time, a process in a Markov chain can be in one of N states (in a Web setting, a state is a node or Web page). Then the entry m_ij of the matrix M gives us the probability that i will be the next node visited by the surfer, provided the surfer is currently at node j. Because of the Markov property, the next node of the surfer depends only on the current node he is visiting. Recall that this
is exactly the way we have designed the Transition matrix in Section 7.3.1.
3. If the vector v shows the probability distribution for the current location, we can use v and M to get the distribution vector for the next step as x = Mv; component-wise,

x_i = \sum_{j} m_{ij} v_j   (7.2)

Here v_j, the j-th entry of the column vector v, gives the probability that the current location is node j, for every node 1 to n. Thus after the first step, the distribution vector will be Mv_0. After two steps, it will be M(Mv_0) = M^2 v_0. Continuing in this fashion, after k steps the distribution vector for the location of the random surfer will be M^k v_0.
4. This process cannot continue indefinitely. If a Markov chain is allowed to run for many time
steps, the surfer starts to visit certain Web pages (say, a popular stock price indicator site) more
often than other pages and slowly the visit frequency converges to fixed, steady-state quantity.
Thus, the distribution vector v remains the same across several steps. This final equilibrium state
value in v is the PageRank value of every node.
5. For a Markov chain to reach equilibrium, two conditions have to be satisfied, the graph must
be strongly connected and there must not exist any dead ends, that is, every node in the graph
should have at least one outlink. For the WWW this is normally true. When these two condi-
tions are satisfied, for such a Markov chain, there is a unique steady-state probability vector, that
is, the principal left eigenvector of the matrix representing the Markov chain.
In our case we have v as the principal eigenvector of the matrix M. (An eigenvector of a matrix M is a vector v that satisfies Mv = λv for some constant λ, the eigenvalue.) Further, because all columns of the matrix M total to 1, the eigenvalue associated with this principal eigenvector is 1.
Thus to compute the PageRank values of a set of WebPages, we must compute the principal left eigenvector of the matrix M with eigenvalue 1. There are many algorithms available for computing left eigenvectors. But when we are ranking the entire Web, the matrix M could contain a billion rows and columns. So a simple iterative algorithm called the Power method is used to compute this eigenvector: the multiplication is repeated until the values converge. We can use the following equation to perform the iterative computation of PageRank:

x_{k+1} = M x_k, with x_0 = v_0

After a large number of steps, the values in x_k settle down, in the sense that the difference in values between two successive iterations falls below a set threshold. At this stage, the values in the vector x_k are the PageRank values of the different pages. Empirical studies have shown that about 60–80 iterations cause the values in x_k to converge.
Example 1
Let us apply these concepts to the graph of Fig. 7.2 represented by the matrix M 5 as shown before.
As our graph has five nodes,

v_0 = \begin{bmatrix} 1/5 \\ 1/5 \\ 1/5 \\ 1/5 \\ 1/5 \end{bmatrix}
If we multiply v_0 by the matrix M_5 repeatedly, after about 60 iterations we get converging values:

\begin{bmatrix} 1/5 \\ 1/5 \\ 1/5 \\ 1/5 \\ 1/5 \end{bmatrix},\;
\begin{bmatrix} 1/5 \\ 1/6 \\ 1/6 \\ 1/10 \\ 11/30 \end{bmatrix},\;
\begin{bmatrix} 1/6 \\ 13/60 \\ 2/15 \\ 11/60 \\ 3/10 \end{bmatrix},\;
\ldots,\;
\begin{bmatrix} 0.4313 \\ 0.4313 \\ 0.3235 \\ 0.3235 \\ 0.6470 \end{bmatrix}

Thus Page 1 has PageRank 0.4313, as does Page 2. Pages 3 and 4 have PageRank 0.3235, and Page 5 has the highest PageRank of 0.6470.
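As a compact illustration (not from the book), the following Java sketch iterates x ← Mx on the matrix M_5; the vector it converges to is proportional to the PageRank values quoted above (rescaling it to unit length gives 0.4313, 0.4313, 0.3235, 0.3235, 0.6470).

public class PageRankPowerMethod {
    public static void main(String[] args) {
        double[][] m = {                         // column j holds the outlinks of page j+1
            {0,   1, 0, 0,       0  },
            {0,   0, 0, 1.0 / 3, 0.5},
            {0.5, 0, 0, 1.0 / 3, 0  },
            {0,   0, 0, 0,       0.5},
            {0.5, 0, 1, 1.0 / 3, 0  }
        };
        int n = m.length;
        double[] v = new double[n];
        java.util.Arrays.fill(v, 1.0 / n);       // v0: surfer equally likely to start anywhere
        for (int step = 0; step < 80; step++) {  // 60-80 iterations are usually enough
            double[] next = new double[n];
            for (int i = 0; i < n; i++)
                for (int j = 0; j < n; j++)
                    next[i] += m[i][j] * v[j];   // x = M v
            v = next;
        }
        for (double rank : v) System.out.println(rank);
    }
}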
1. One large portion, called the “Core”, which is more or less strongly connected so as to form an SCC. Web surfers in the Core can reach any Webpage in the Core from any other Webpage in the Core. Mostly, this is the region of the Web that surfers visit frequently.
2. A portion of the Web consisting of Web pages that have links that can lead to the SCC, but with no path from the SCC leading back to them. This region is called the IN-component and its pages are called IN pages or “Origination” pages. New Web pages or Web pages forming closed communities belong to this component.
3. Another set of nodes exist that can be reached from the SCC but do not have links that can
ultimately lead to a Webpage in the SCC. This is called the “Out” component and the Web Pages
“Out” pages or “termination” pages. Many corporate sites, e-commerce sites, etc. expect the SCC
to have links to reach them but do not really need links back to the core.
Figure 7.3 shows the original image from the study conducted by Broder et al . Because of the visual
impact the picture made, they termed this as the “bow-tie picture” of the Web, with the SCC as the
central “knot”. Figure 7.3 also shows some pages that belong to none of IN, OUT, or SCC. These are
further classified into:
1. Tendrils: These are pages that do not have any inlinks from or outlinks to the SCC. Some
tendrils consist of pages reachable from the IN-component but unable to reach the SCC, and
other tendrils can reach the OUT-component but cannot be reached from the SCC.
2. Tubes: These are pages that reach from in-component to the out-component without linking to
any pages in the SCC.
Figure 7.3 The "bow-tie" picture of the Web: the central SCC, the IN and OUT components, tendrils, tubes, and disconnected pages.
The study also discussed the size of each region, and perhaps the most surprising finding concerns
these sizes. Intuitively, one would expect the core to be the largest component of the Web. It is,
but it makes up only about one-third of the total. Origination and termination pages each make up
about a quarter of the Web, and disconnected pages about one-fifth.
As a result of the bow-tie structure of the Web, assumptions made for the convergence of the
Markov process do not hold true causing problems with the way the PageRank is computed. For
example, consider the OUT-component and also the out-tendrils of the IN-component; if a surfer lands
in either of these components he can never leave, so the probability of the surfer visiting the SCC or the
IN-component from this point is zero. This means that eventually pages in the SCC and the IN-component
would end up with very low PageRank. This indicates that the PageRank computation must take the structure
of the Web into consideration.
There are two scenarios to be taken care of as shown in Fig. 7.4:
1. Dead ends: These are pages with no outlinks. Effectively any page that can lead to a dead end
means it will lose all its PageRank eventually because once a surfer reaches a page that is a dead
end no other page has a probability of being reached.
2. Spider traps: These are a set of pages whose outlinks lead only to pages within that set. Eventually,
only these pages will retain any PageRank.
Figure 7.4 A dead end and a spider trap.
In both the above scenarios, a method called “taxation” can help. Taxation allows a surfer to leave the
Web at any step and start randomly at a new page.
A simple example can illustrate this. Consider the 3-node network P, Q and R of Fig. 7.5 and its associated
transition matrix. R links only to itself, so starting from the uniform vector, repeated multiplication gives

    x_P :  1/3   2/6   3/12   5/24   …   0
    x_Q :  1/3   1/6   2/12   3/24   …   0
    x_R :  1/3   3/6   7/12   16/24  …   1
All the PageRank is trapped in R. Once a random surfer reaches R, he can never leave. Figure 7.5 illus-
trates this.
We now propose modifications to the basic PageRank algorithm that can avoid the above two
scenarios as described in the following subsections.
         P     Q     R
    P   1/2   1/2    0
    Q   1/2    0     0
    R    0    1/2    1

Figure 7.5 A simple Web graph and its associated transition matrix.
Example 2
Consider Fig. 7.6.
1. Part (a) shows a portion of a Web graph where A is part of the SCC and B is a dead end. The
self-loop of A indicates that A has several links to pages in SCC.
2. In part (b), the dead end B and its links are removed. PageRank of A is computed using any
method.
3. In part (c), the dead end last removed, that is B, is put back with its connections. B will use A to
get its PageRank. Since A now has two outlinks, its PageRank is divided in two, and half of this rank
is propagated to B.
4. In part (d), A has two outlinks and C has three outlinks, so they propagate 1/2 and 1/3 of their
PageRank values, respectively, to B. Thus, B gets the final PageRank value as shown: A has a PR
value of 2/5 and C has 2/7, leading to B obtaining a PR value of (1/2)(2/5) + (1/3)(2/7) = 31/105.
Figure 7.6 Handling a dead end: (a) A (PR = 1) in the SCC with dead end B; (b) B and its links removed; (c) B put back, getting (1/2) of A's value; (d) with C (PR = 2/7) also linking to B, B gets (1/2) of A's value plus (1/3) of C's value.
1. When a node has no outlinks (a dead end), the surfer invokes the teleport operation and jumps to a random page.
2. If a node has outgoing links, the surfer can follow the standard random walk policy of choosing
any one of the outlinks with probability 0 < β < 1 and can invoke the teleport operation with
probability 1 – β , where β is a fixed parameter chosen in advance.
Typical values for β might be 0.8–0.9. So now the modified equation for computing PageRank
iteratively from the current PR value will be given by
v ′ = β Mv + (1 – β )e /n
where M is the transition matrix as defined earlier, v is the current PageRank estimate, e is a vector of
all ones and n is the number of nodes in the Web graph.
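The taxed update can be written directly from the equation above. The Python sketch below (illustrative only, not the book's implementation) iterates v' = βMv + (1 − β)e/n with β = 0.8; the transition matrix is the P, Q, R example used in this section.

import numpy as np

def pagerank_with_teleport(M, beta=0.8, tol=1e-12, max_iter=200):
    """Iterate v' = beta*M*v + (1 - beta)*e/n until v stops changing."""
    n = M.shape[0]
    v = np.full(n, 1.0 / n)
    teleport = np.full(n, (1.0 - beta) / n)      # the (1 - beta) e / n term
    for _ in range(max_iter):
        v_next = beta * (M @ v) + teleport
        if np.abs(v_next - v).sum() < tol:
            break
        v = v_next
    return v_next

# Transition matrix of the P, Q, R example (R is the spider trap).
M = np.array([[0.5, 0.5, 0.0],
              [0.5, 0.0, 0.0],
              [0.0, 0.5, 1.0]])
print(pagerank_with_teleport(M))   # approximately [7/33, 5/33, 21/33]

R still receives the largest share, but P and Q retain some PageRank, which is the effect described in Example 3 below.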
Example 3
Consider the same Web graph example shown earlier in Fig. 7.5 (we repeat the transition matrix for clarity).
The following shows the new PageRank computation with β = 0.8.

         P     Q     R
    P   1/2   1/2    0
    Q   1/2    0     0
    R    0    1/2    1

The modified matrix is

    0.8 × | 1/2  1/2   0 |   +   0.2 × | 1/3  1/3  1/3 |   =   | 7/15  7/15   1/15 |
          | 1/2   0    0 |             | 1/3  1/3  1/3 |       | 7/15  1/15   1/15 |
          |  0   1/2   1 |             | 1/3  1/3  1/3 |       | 1/15  7/15  13/15 |
Eventually,

    x_P :  1   1.00   0.84   0.776   …   7/33
    x_Q :  1   0.60   0.60   0.536   …   5/33
    x_R :  1   1.40   1.56   1.688   …   21/33
This indicates that the spider trap has been taken care of. Even though R has the highest PageRank,
its effect has been muted as other pages have also received some PageRank.
The ranking of Web pages by the Google search engine was determined by three factors: page-specific factors, anchor
text of inbound links, and PageRank.
Page-specific factors include the body text, for instance, the content of the title tag or the URL of
the document. In order to provide search results, Google computes an IR score out of page-specific fac-
tors and the anchor text of inbound links of a page. The position of the search term and its weightage
within the document are some of the factors used to compute the score. This helps to evaluate the rel-
evance of a document for a particular query. The IR-score is then combined with PageRank to compute
an overall importance of that page.
In general, for queries consisting of two or more search terms, there is a far bigger influence of
the content-related ranking criteria, whereas the impact of PageRank is more for unspecific single
word queries. For example, a query for “Harvard” may return any number of Web pages which
mention Harvard on a conventional search engine, but using PageRank, the university home page
is listed first.
Currently, it is estimated that Google uses about 250 page-specific properties with updated versions
of the PageRank to compute the final ranking of pages with respect to a query.
Generally, after a number of iterations, the authority and hub scores do not vary much and can be
considered to have "converged".
The HITS algorithm and the PageRank algorithm both make use of the link structure of the Web
graph to decide the relevance of the pages. The difference is that PageRank is query-independent
and works on a large portion of the Web, whereas HITS only operates on a small subgraph (the seed S_Q)
from the Web graph.
The most obvious strength of HITS is the two separate vectors it returns, which allow the applica-
tion to decide on which score it is most interested in. The highest ranking pages are then displayed to
the user by the query engine.
This sub-graph generated as seed is query dependent; whenever we search with a different query
phrase, the seed changes as well. Thus, the major disadvantage of HITS is that the query graph must be
regenerated dynamically for each query.
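As an illustration (not the book's code), the Python sketch below runs the hub/authority updates on the adjacency matrix of a small made-up seed subgraph: authorities are recomputed as A^T h, hubs as A a, and each vector is rescaled after every round so that its largest component is 1, which is one common way of keeping the scores bounded.

import numpy as np

def hits(A, k=20):
    """Run k rounds of the hub/authority updates on adjacency matrix A.

    A[i][j] = 1 means page i links to page j.  Returns (hubs, authorities).
    """
    n = A.shape[0]
    h = np.ones(n)
    for _ in range(k):
        a = A.T @ h            # authority: sum of hub scores of pages linking in
        a = a / a.max()
        h = A @ a              # hub: sum of authority scores of pages linked to
        h = h / h.max()
    return h, a

# A made-up 4-page seed subgraph.
A = np.array([[0, 1, 1, 0],
              [0, 0, 1, 0],
              [1, 0, 0, 0],
              [0, 0, 1, 0]])
hubs, auths = hits(A, k=6)
print("hub scores:      ", hubs)
print("authority scores:", auths)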
Using a query-based system can also sometimes lead to link spam. Spammers who want their Web
page to appear higher in a search query can make spam farm pages that link to the original site to give
it an artificially high authority score.
Summary
• As search engines become more and more sophisticated to avoid being victims of spam, spammers also are finding innovative ways of defeating the purpose of these search engines. One such technique used by modern search engines to avoid spam is to analyze the hyperlinks and the graph structure of the Web for ranking of Web search results. This is called Link Analysis.
• Early search engines were mostly text based and susceptible to spam attacks. Spam means "manipulation of Web page content for the purpose of appearing high up in search results for selected keywords".
• To attack text-based search engines, spammers resorted to term-based spam attacks like cloaking and the use of doorway pages.
• Google, the pioneer in the field of search engines, came up with two innovations based on Link Analysis to combat term spam and called their algorithm PageRank.
• The basic idea behind PageRank is that the ranking of a Web page is not dependent only on terms appearing on that page, but some weightage is also given to the terms used in or near the links to that page. Further, pages with a large number of visits are more important than those with few visits.
• To compute PageRank, the "Random Surfer Model" was used. Calculation of PageRank can be thought of as simulating the behavior of many random surfers, who each start at a random page and at any step move, at random, to one of the pages to which their current page links. The limiting probability of a surfer being at a given page is the PageRank of that page.
• An iterative matrix-based algorithm was proposed to compute the PageRank of a page efficiently.
• The PageRank algorithm could be compromised due to the bow-tie structure of the Web, which leads to two types of problems: dead ends and spider traps.
• Using a scheme of random teleportation, PageRank can be modified to take care of dead ends and spider traps.
• To compute the PageRank of pages on the Web efficiently, use of MapReduce is advocated. Further schemes of efficiently storing the transition matrix and the PageRank vector are described.
• In Topic-Sensitive PageRank, we bias the random walker to teleport to a set of topic-specific relevant nodes. The topic is determined by the context of the search query. A set of PageRank vectors, biased using a set of representative topics, helps to capture more accurately the notion of importance with respect to a particular topic. This in turn yields more accurate search results specific to a query.
• Link spam can be formally stated as a class of spam techniques that try to increase the link-based score of a target Web page by creating lots of spurious hyperlinks directed towards it. These spurious hyperlinks may originate from a set of Web pages called a Link farm and controlled by the spammer. They may be created from a set of partner Web sites known as a link exchange. Sometimes such links could also be placed in some unrelated Websites like blogs or marketplaces. These structures are called Spam farms.
• Search engines can respond to link spam by mining the Web graph for anomalies and propagating a chain of distrust from spurious pages, which will effectively lower the PageRank of such pages. TrustRank and Spam mass are two techniques used to combat link spam.
• In a parallel development along with PageRank, another algorithm to rank pages in relation to a query posed by a user was proposed. This algorithm also used the link structure of the Web in order to discover and rank pages relevant for a particular topic. The idea was to associate two scores with each Web page, contributions coming from two different types of pages called "hubs" and "authorities". This algorithm is called hyperlink-induced topic search (HITS). HITS presently is used by the Ask search engine (www.Ask.com). Further, it is believed that modern information retrieval engines use a combination of PageRank and HITS for query answering.
• Calculation of the hubs and authorities scores for pages depends on solving the recursive equations: "a hub links to many authorities, and an authority is linked to by many hubs". The solution to these equations is essentially an iterated matrix–vector multiplication, just like PageRank's.
Exercises
1. Consider the portion of a Web graph shown below.

   [Figure: a Web graph on the nodes A; B, C, D; E, F, G]

3. Let the adjacency matrix for a graph of four vertices (n1 to n4) be as follows:

         0 1 1 1
    A =  0 0 1 1
         1 0 0 1
         0 0 0 1

   Calculate the authority and hub scores for this graph using the HITS algorithm with k = 6, and identify the best authority and hub nodes.
Programming Assignments
1. Implement the following algorithms on standard datasets available on the web. The input will normally be in the form of sparse matrices representing the webgraph.
   (a) Simple PageRank algorithm
   (b) PageRank algorithm with a teleportation factor to avoid dead-ends and spider traps.
2. Describe how you stored the connectivity matrix on disk and how you computed the transition matrix. List the top-10 pages as returned by the algorithms in each case.
3. Now rewrite portions of the code to implement the TrustRank algorithm. The user will specify which pages (indices) correspond to trustworthy pages. It might be good to look at the URLs and identify reasonable candidates for trustworthy pages.
4. Implement Assignments 1 and 3 using MapReduce.
5. Implement the HITS algorithm on any webgraph using MapReduce.
References
2. D. Easley, J. Kleinberg (2010). Networks, Crowds, and Markets: Reasoning about a Highly Connected World. Cambridge University Press. Complete preprint on-line at http://www.cs.cornell.edu/home/kleinber/networks-book/.
3. L. Page, S. Brin, R. Motwani, T. Winograd (1999). The PageRank Citation Ranking: Bringing Order to the Web. Technical Report, Stanford InfoLab.
4. T. Haveliwala (1999). Efficient Computation of PageRank. Technical Report, Stanford University.
5. T. Haveliwala (2002). Topic-Sensitive PageRank. In Proceedings of the Eleventh International Conference on World Wide Web.
6. J. Kleinberg (1998). Authoritative Sources in a Hyperlinked Environment. In Proc. ACM-SIAM Symposium on Discrete Algorithms.
7. A. Broder, R. Kumar, F. Maghoul et al. (2000). Graph Structure in the Web. Computer Networks, 33:1–6, pp. 309–320.
8 Frequent Itemset Mining
LEARNING OBJECTIVES
After reading this chapter, you will be able to:
• Review your knowledge about frequent itemsets and basic algorithms to identify them.
• Learn about different memory-efficient techniques to execute the traditional FIM algorithms.
• Understand how these algorithms are insufficient to handle larger datasets.
• Learn about the algorithm of Park, Chen and Yu, and its variants.
• Understand the sampling-based SON algorithm and how it can be parallelized using MapReduce.
• Learn some simple stream-based frequent itemset mining methods.
8.1 Introduction
Frequent itemsets play an essential role in many data mining tasks where one tries to find interesting
patterns from databases, such as association rules, correlations, sequences, episodes, classifiers, clusters
and many more. One of the most popular applications of frequent itemset mining is discovery of asso-
ciation rules. The identification of sets of items, products, symptoms, characteristics and so forth that
often occur together in the given database can be seen as one of the most basic tasks in data mining.
This chapter discusses a host of algorithms that can be effectively used to mine frequent itemsets from
very massive datasets.
This chapter begins with a conceptual description of the “market-basket” model of data. The
problem of deriving associations from data was first introduced using the “market-basket” model of
data, which is essentially a many-many relationship between two kinds of elements, called “items” and
“baskets”. The frequent-itemsets problem is that of finding sets of items that appear in (are related to)
many of the same baskets.
The problem of finding frequent itemsets differs from the similarity search discussed in Chapter 5.
In the frequent itemset scenario, we attempt to discover sets of items that are found in the same bas-
kets frequently. Further, we need the number of such baskets where these items appear together to
be sufficiently large so as to be statistically significant. In similarity search we searched for items that
have a large fraction of their baskets in common, even if the absolute number of such baskets is small
in number.
Many techniques have been invented to mine databases for frequent events. These techniques work
well in practice on smaller datasets, but are not suitable for truly big data. Applying frequent itemset
mining to large databases is a challenge. First of all, very large databases do not fit into main memory.
For example consider the well-known Apriori algorithm, where frequency counting is achieved by
reading the dataset over and over again for each size of candidate itemsets. Unfortunately, the memory
requirements for handling the complete set of candidate itemsets blow up fast and render Apriori-
based schemes very inefficient to use on large data.
This chapter proposes several changes to the basic Apriori algorithm to render it useful for large
datasets. These algorithms take into account the size of the main memory available.
Since exact solutions are costly and impractical to find in large data, a class of approximate algo-
rithms is discussed which exploit parallelism, especially the Map−Reduce concept. This chapter also
gives a brief overview of finding frequent itemsets in a data stream.
identifier, called TID. Let A be a set of items. A transaction T is said to contain A if and only if
A ⊆ T.
A set of items is referred to as an itemset. An itemset that contains k items is a k -itemset. For
example, consider a computer store with computer-related items in its inventory. The set {computer,
anti-virus software, printer, flash-drive} is a 4-itemset. The occurrence frequency of an itemset is the
number of transactions that contain the itemset. This is also known, simply, as the frequency, support
count, or count of the itemset. We can call an itemset I a “frequent itemset” only if its support count is
sufficiently large. We prescribe a minimum support s and any I which has support greater than or equal
to s is a frequent itemset.
Example 1
Items = {milk (m), coke (c), pepsi (p), beer (b), juice (j)}
Minimum support s = 3
Transactions
1. T1 = {m, c, b}
2. T2 = {m, p, j}
3. T3 = {m, b}
4. T4 = {c, j}
5. T5 = {m, p, b}
6. T6 = {m, c, b, j}
7. T7 = {c, b, j}
8. T8 = {b, c}
Frequent itemsets: {m}, {c}, {b}, {j}, {m, b}, {c, b}, {j, c}.
8.2.2 Applications
A supermarket chain may have 10,000 different items in its inventory. Daily millions of customers will
push their shopping carts (“market-baskets”) to the checkout section where the cash register records
the set of items they purchased and give out a bill. Each bill thus represents one market-basket or one
transaction. In this scenario, the identity of the customer is not strictly necessary to get useful informa-
tion from the data. Retail organizations analyze the market-basket data to learn what typical customers
buy together.
Example 2
Consider a retail organization that spans several floors, where soaps are in floor 1 and items like
towels and other similar goods are in floor 10. Analysis of the market-basket shows a large number of
baskets containing both soaps and towels. This information can be used by the supermarket manager
in several ways:
1. Apparently, many people walk from where the soaps are to where the towels are, which means they
have to move from floor 1 and catch the elevator to floor 10. The manager could choose to
put a small shelf on floor 1 consisting of an assortment of towels and some other bathing acces-
sories that might also be bought along with soaps and towels, for example, shampoos, bath mats,
etc. Doing so can generate additional "on the spot" sales.
2. The store can run a sale on soaps and at the same time raise the price of towels (without adver-
tising that fact, of course). People will come to the store for the cheap soaps, and many will
need towels too. It is not worth the trouble to go to another store for cheaper towels, so they
buy that too. The store makes back on towels what it loses on soaps, and gets more customers
into the store.
While the relationship between soaps and towels seems somewhat obvious, market-basket analysis may
identify several pairs of items that occur together frequently but the connections between them may be
less obvious. For example, the analysis could show chocolates being bought with movie CDs. But we
need some rules to decide when a fact about co-occurrence of sets of items can be useful. Firstly any
useful set (need not be only pairs) of items must be bought by a large number of customers. It is not
even necessary that there be any connection between purchases of the items, as long as we know that
lots of customers buy them.
An E-Retail store like E-bay or Amazon.com offers several million different items for sale through
its websites and also caters to millions of customers. While normal offline stores, such as the super-
market discussed above, can only make productive decisions when combinations of items are pur-
chased by very large numbers of customers, online sellers have the means to tailor their offers even
to a single customer. Thus, an interesting question is to find pairs of items that many customers
have bought together. Then, if one customer has bought one of these items but not the other, it
might be good for Amazon or E-bay to advertise the second item when this customer next logs in.
We can treat the purchase data as a market-basket problem, where each “basket” is the set of items
that one particular customer has ever bought. But there is another way online sellers can use the
same data. This approach, called “collaborative filtering”, finds sets of customers with similar pur-
chase behavior. For example, these businesses look for pairs, or even larger sets, of customers who
Example 7
One example of a market-basket file could look like:
{23, 45, 11001} {13, 48, 92, 145, 222} {…
Here, the character “{” begins a basket and the character “}” ends it. The items in a basket are repre-
sented by integers, and are separated by commas.
Since such a file (Example 7) is typically large, we can use MapReduce or a similar tool to divide the
work among many machines. But non-trivial changes need to be made to the frequent itemset count-
ing algorithm to get the exact collection of itemsets that meet a global support threshold. This will be
addressed in Section 8.4.3.
For now we shall assume that the data is stored in a conventional file and also that the size of the
file of baskets is sufficiently large that it does not fit in the main memory. Thus, the principal cost is
the time it takes to read data (baskets) from the disk. Once a disk block full of baskets is read into the
main memory, it can be explored, generating all the subsets of size k. It is necessary to point out that it
is logical to assume that the average size of a basket is small compared to the total number of all items.
Thus, generating all the pairs of items from the market-baskets in the main memory should take less
time than the time it takes to read the baskets from disk.
For example, if there are 25 items in a basket, then there are C(25, 2) = 300 pairs of items in the
basket, and these can be generated easily in a pair of nested for-loops. But as the size of the subsets we
want to generate gets larger, the time required grows larger; it takes approximately n^k / k! time to gen-
erate all the subsets of size k for a basket with n items. So if k is very large then the subset generation
time will dominate the time needed to transfer the data from the disk.
However, surveys have indicated that in most applications we need only small frequent itemsets. Fur-
ther, when we do need the itemsets for a large size k, it is usually possible to eliminate many of the items
in each basket as not able to participate in a frequent itemset, so the value of n reduces as k increases.
Thus, the time taken to examine each of the baskets can usually be assumed proportional to the
size of the file. We can thus measure the running time of a frequent-itemset algorithm by the number
of times each disk block of the data file is read. This, in turn, is characterized by the number of passes
through the basket file that they make, and their running time is proportional to the product of the
number of passes they make through the basket file and the size of that file.
Since the amount of data is fixed, we focus only on the number of passes taken by the algorithm.
This gives us a measure of what the running time of a frequent-itemset algorithm will be.
of the data. For example, we might need to count the number of times that each pair of items occurs
in baskets in pass 2. In the next pass along with maintaining the counts of 2-itemsets, we have to
now compute frequency of 3-itemsets and so on. Thus, we need main memory space to maintain
these counts.
If we do not have enough main memory to store each of the counts at any pass then adding 1 to
count of any previous itemset may involve loading the relevant page with the counts from secondary
memory. In the worst case, this swapping of pages may occur for several counts, which will result in
thrashing. This would make the algorithm several orders of magnitude slower than if we were certain to
find each count in main memory. In conclusion, we need counts to be maintained in the main memory.
This sets a limit on how many items a frequent-itemset algorithm can ultimately deal with. This num-
ber of different things we can count is, thus, limited by the main memory.
The naive way of counting a frequent k-itemset is to read the file once and count in main memory
the occurrences of each k-itemset. Let n be the number of items. The number of itemsets of size k,
for 1 ≤ k ≤ n, is given by

    C(n, k) = n! / (k!(n − k)!)
Example 8
Suppose we need to count all pairs of items (2-itemset) in some step, and there are n items. We thus
need space to store n(n − 1)/2 pairs. This algorithm will fail if (#items)^2 exceeds main memory.
Consider an e-commerce enterprise like Amazon. The number of items can be around 100 K or
10 B (Web pages). Thus, assuming 10^5 items and 4-byte integer counts, the number of pairs of
items is

    10^5 × (10^5 − 1) / 2 ≈ 5 × 10^9

Therefore, 2 × 10^10 bytes (20 GB) of memory is needed. Thus, in general, if integers take 4 bytes, we require
approximately 2n^2 bytes. If our machine has 2 GB, or 2^31 bytes of main memory, then we require
n ≤ 2^15 or approximately n < 33,000.
It is important to point out here that it is sufficient to focus on counting pairs, because the probability
of an itemset being frequent drops exponentially with size while the number of itemsets grows more
slowly with size. This argument is quite logical. The number of items, while possibly very large, is rarely
so large we cannot count all the singleton sets in main memory at the same time. For larger sets like
triples, quadruples, for frequent-itemset analysis to make sense, the result has to be a small number of
sets, or these itemsets will lose their significance. Thus, in practice, the support threshold is set high
enough that it is only a rare set that is k-frequent (k ≥ 2). Thus, we expect to find more frequent pairs
than frequent triples, more frequent triples than frequent quadruples, and so on. Thus, we can safely
conclude that maximum main memory space is required for counting frequent pairs. We shall, thus,
only concentrate on algorithms for counting pairs.
8.3.3 Approaches for Main Memory Counting
Before we can discuss approaches for counting of pairs in the main memory, we first have to discuss
how items in the baskets are represented in the memory. As mentioned earlier, it is more space-efficient
to represent items by consecutive positive integers from 1 to n, where n is the number of distinct items.
But items will mostly be names or strings of the form “pencil”, “pen”, “crayons”, etc. We will, there-
fore, need a hash table that translates items as they appear in the file to integers. That is, each time we
see an item in the file, we hash it. If it is already in the hash table, we can obtain its integer code from
its entry in the table. If the item is not there, we assign it the next available number (from a count of
the number of distinct items seen so far) and enter the item and its code into the table.
Pair:      {1,2}  {1,3}  {1,4}  {1,5}  {2,3}  {2,4}  {2,5}  {3,4}  {3,5}  {4,5}
Position:    1      2      3      4      5      6      7      8      9     10
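One common way to lay such a triangular matrix out in a one-dimensional array (a sketch of the usual convention; the exact formula is not spelled out in the text) is to store the count of the pair {i, j}, with 1 ≤ i < j ≤ n, at position (i − 1)(n − i/2) + (j − i). The Python snippet below reproduces the layout of the table above for n = 5.

def triangular_index(i, j, n):
    """1-based position of the pair {i, j}, i < j, in the triangular layout."""
    assert 1 <= i < j <= n
    return (i - 1) * (2 * n - i) // 2 + (j - i)

n = 5
counts = [0] * (n * (n - 1) // 2)     # one integer count per pair

# Reproduce the Pair -> Position table for n = 5.
for i in range(1, n + 1):
    for j in range(i + 1, n + 1):
        print((i, j), "->", triangular_index(i, j, n))

# Adding 1 to the count of pair {2, 5} seen in some basket:
counts[triangular_index(2, 5, n) - 1] += 1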
Example 9
Suppose there are 100,000 items and 10,000,000 baskets of 10 items each. Then the integer counts
required by the triangular matrix method are

    C(100,000, 2) ≈ 5 × 10^9

On the other hand, the total number of pairs among all the baskets is

    10^7 × C(10, 2) = 4.5 × 10^8

Even in the extreme case that every pair of items appeared only once, there could be only 4.5 × 10^8 pairs
with non-zero counts. If we used the triples method to store counts, we would need only three times
that number of integers, or 1.35 × 10^9 integers. Thus, in this case, the triples method will surely take
much less space than the triangular matrix.
However, even if there were 10 or 100 times as many baskets, it would be normal for there to be a
sufficiently uneven distribution of items that we might still be better off using the triples method. That
is, some pairs would have very high counts, and the number of different pairs that occurred in one or
more baskets would be much less than the theoretical maximum number of such pairs.
Instead, we could limit ourselves to those sets that occur at least once in the database by generating
only those subsets of all transactions in the database. Of course, for large transactions, this number
could still be too large. As an optimization, we could generate only those subsets of at most a given
maximum size. This technique also suffers from massive memory requirements for even a medium
sized database. Most other efficient solutions perform a more directed search through the search space.
During such a search, several collections of candidate sets are generated and their supports computed
until all frequent sets have been generated. Obviously, the size of a collection of candidate sets must
not exceed the size of available main memory. Moreover, it is important to generate as few candidate
sets as possible, since computing the supports of a collection of sets is a time-consuming procedure. In
the best case, only the frequent sets are generated and counted. Unfortunately, this ideal is impossible
in general. The main underlying property exploited by most algorithms is that support is monotone
decreasing with respect to extension of a set.
Property 1 (Support Monotonicity): Given a database of transactions D over I and two sets X, Y ⊆ I. Then,

    X ⊆ Y ⇒ support(Y) ≤ support(X)

Hence, if a set is infrequent, all of its supersets must be infrequent, and vice versa: if a set is frequent,
all of its subsets must be frequent too. In the literature, this monotonicity property is also called the
downward-closure property, since the set of frequent sets is downward closed with respect to set inclu-
sion. Similarly, the set of infrequent sets is upward closed.
The downward-closure property of support also allows us to compact the information about fre-
quent itemsets. First, some definitions are given below:
1. An itemset is closed if none of its immediate supersets has the same count as the itemset.
2. An itemset is closed frequent if it is frequent and closed.
3. An itemset is maximal frequent if it is frequent and none of its immediate supersets is frequent.
For example, assume we have items = {apple, beer, carrot} and the following baskets:
1. {apple, beer}
2. {apple, beer}
3. {beer, carrot}
4. {apple, beer, carrot}
5. {apple, beer, carrot}
Table 8.4 Support counts and classification of the itemsets

Itemset                  Count   Frequent?   Closed?   Closed Frequent?   Maximal Frequent?
{apple}                    4       Yes         No            No                  No
{beer}                     5       Yes         Yes           Yes                 No
{carrot}                   3       Yes         No            No                  No
{apple, beer}              4       Yes         Yes           Yes                 Yes
{apple, carrot}            2       No          No            No                  No
{beer, carrot}             3       Yes         Yes           Yes                 Yes
{apple, beer, carrot}      2       No          Yes           No                  No
From Table 8.4, we see that there are five frequent itemsets, of which only three are closed frequent,
of which in turn only two are maximal frequent. The set of all maximal frequent itemsets is a subset
of the set of all closed frequent itemsets, which in turn is a subset of the set of all frequent itemsets.
Thus, maximal frequent itemsets are the most compact representation of frequent itemsets. In practice,
however, closed frequent itemsets may be preferred since they contain not just the frequent itemset
information, but also the exact count.
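The classification in Table 8.4 can be recomputed mechanically from the definitions. The Python sketch below is only an illustration: the baskets are the five listed above, and the support threshold of 3 is implied by the statement that five itemsets are frequent.

from itertools import combinations

baskets = [{"apple", "beer"}, {"apple", "beer"}, {"beer", "carrot"},
           {"apple", "beer", "carrot"}, {"apple", "beer", "carrot"}]
items = sorted(set().union(*baskets))
s = 3                                        # implied minimum support

def support(itemset):
    return sum(1 for b in baskets if itemset <= b)

all_itemsets = [frozenset(c) for k in range(1, len(items) + 1)
                for c in combinations(items, k)]
supp = {X: support(X) for X in all_itemsets}

for X in all_itemsets:
    immediate_supersets = [Y for Y in all_itemsets if X < Y and len(Y) == len(X) + 1]
    frequent = supp[X] >= s
    closed = all(supp[Y] != supp[X] for Y in immediate_supersets)
    maximal = frequent and all(supp[Y] < s for Y in immediate_supersets)
    print(sorted(X), supp[X], frequent, closed, frequent and closed, maximal)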
generate a pair, we add 1 to its count. At the end, we examine all pairs to see which have counts that are
equal to or greater than the support threshold s ; these are the frequent pairs.
However, this naive approach fails if there are too many pairs of items to count them all in the main
memory. The Apriori algorithm which is discussed in this section uses the monotonicity property to
reduce the number of pairs that must be counted, at the expense of performing two passes over data,
rather than one pass.
The Apriori algorithm for finding frequent pairs is a two-pass algorithm that limits the amount of
main memory needed by using the downward-closure property of support to avoid counting pairs that
will turn out to be infrequent at the end.
Let s be the minimum support required. Let n be the number of items. In the first pass, we read
the baskets and count in main memory the occurrences of each item. We then remove all items whose
frequency is less than s to get the set of frequent items. This requires memory proportional to n.
In the second pass, we read the baskets again and count in main memory only those pairs in which
both items are frequent. This pass requires memory proportional to the square of the number of frequent
items (for the counts), plus a list of the frequent items (so you know what must be counted). Figure 8.3
indicates the main memory in the two passes of the Apriori algorithm.
Figure 8.3 Main memory in the two passes of Apriori: item counts in Pass 1; the frequent-items list and the counts of candidate pairs in Pass 2.
Finally, at the end of the second pass, examine the structure of counts to determine which pairs are
frequent.
Figure 8.4 Main memory map: item counts (Pass 1); the frequent-items table indexed by old item numbers (Pass 2).
The pattern of moving from one set to the next and one size to the next is depicted in Fig. 8.5.
Figure 8.5 Alternating between candidate sets and frequent sets: count all items, then count all pairs of items constructed from L1, then all triples constructed from L2, and so on.
Example 10
Let C1 = { {b}, {c}, {j}, {m}, {n}, {p} }. Then
1. Count the support of itemsets in C1.
2. Prune non-frequent: L1 = { {b}, {c}, {j}, {m} }.
3. Generate C2 = { {b,c}, {b,j}, {b,m}, {c,j}, {c,m}, {j,m} }.
4. Count the support of itemsets in C2.
5. Prune non-frequent: L2 = { {b,m}, {b,c}, {c,m}, {c,j} }.
6. Generate C3 = { {b,c,m}, {b,c,j}, {b,m,j}, {c,m,j} }.
7. Count the support of itemsets in C3.
8. Prune non-frequent: L3 = { {b,c,m} }.
Example 11
Assume we have items = {a, b, c, d, e} and the following baskets:
1. {a, b}
2. {a, b, c}
3. {a, b, d}
4. {b, c, d}
5. {a, b, c, d}
6. {a, b, d, e}
Let the support threshold s = 3. The Apriori algorithm passes as follows:
1.
   (a) Construct C1 = { {a}, {b}, {c}, {d}, {e} }.
   (b) Count the support of itemsets in C1.
   (c) Remove infrequent itemsets to get L1 = { {a}, {b}, {c}, {d} }.
2.
   (a) Construct C2 = { {a, b}, {a, c}, {a, d}, {b, c}, {b, d}, {c, d} }.
   (b) Count the support of itemsets in C2.
   (c) Remove infrequent itemsets to get L2 = { {a, b}, {a, d}, {b, c}, {b, d} }.
3.
   (a) Construct C3 = { {a, b, c}, {a, b, d}, {b, c, d} }. Note that we can be more careful here with the
       candidate generation. For example, we know {b, c, d} cannot be frequent since {c, d} is not frequent.
       That is, {b, c, d} should not be in C3 since {c, d} is not in L2.
   (b) Count the support of itemsets in C3.
   (c) Remove infrequent itemsets to get L3 = { {a, b, d} }.
4. Construct C4 = { } (the empty set), and the algorithm stops.
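A minimal Python sketch of these passes (illustrative, not the author's implementation) applied to the baskets of Example 11 is shown below; it generates C_{k+1} from L_k, prunes candidates with an infrequent subset, and stops when no candidates remain.

from itertools import combinations

baskets = [{"a", "b"}, {"a", "b", "c"}, {"a", "b", "d"},
           {"b", "c", "d"}, {"a", "b", "c", "d"}, {"a", "b", "d", "e"}]
s = 3                                           # support threshold of Example 11

def keep_frequent(candidates):
    """Count each candidate over all baskets and keep those with support >= s."""
    counts = {c: sum(1 for b in baskets if c <= b) for c in candidates}
    return {c for c, cnt in counts.items() if cnt >= s}

C = {frozenset([item]) for b in baskets for item in b}     # C1
L = keep_frequent(C)                                       # L1
k = 1
while L:
    print("L%d:" % k, sorted(sorted(x) for x in L))
    # C_{k+1}: unions of frequent k-itemsets of size k+1 whose k-subsets are all frequent.
    C = {a | b for a in L for b in L if len(a | b) == k + 1}
    C = {c for c in C if all(frozenset(sub) in L for sub in combinations(c, k))}
    L = keep_frequent(C)
    k += 1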
Algorithm
FOR (each basket):
    FOR (each item in the basket):
        add 1 to item's count;
    FOR (each pair of items):
        hash the pair to a bucket;
        add 1 to the count for that bucket;
At the end of the first pass, each bucket has a count, which is the sum of the counts of all the
pairs that hash to that bucket. If the count of a bucket is at least as great as the support thresh-
old s , it is called a frequent bucket. We can say nothing about the pairs that hash to a frequent
bucket; they could all be frequent pairs from the information available to us. But if the count of
the bucket is less than s (an infrequent bucket), we know no pair that hashes to this bucket can
be frequent, even if the pair consists of two frequent items. This fact gives us an advantage on
the second pass.
In Pass 2 we only count pairs that hash to frequent buckets.
Algorithm
Count all pairs {i, j} that meet the conditions for being a candidate pair:
1. Both i and j are frequent items
2. The pair {i, j} hashes to a bucket whose bit in the bit vector is 1 (i.e., a frequent bucket )
Both the above conditions are necessary for the pair to have a chance of being frequent. Figure 8.6
indicates the memory map.
Depending on the data and the amount of available main memory, there may or may not be a ben-
efit in using the hash table on pass 1. In the worst case, all buckets are frequent, and the PCY algorithm
counts exactly the same pairs as Apriori does on the second pass. However, typically most of the buckets
will be infrequent. In that case, PCY reduces the memory requirements of the second pass.
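The two passes can be put together in a few lines of Python. The sketch below is illustrative only: the baskets are made up, and the number of buckets and the hash function are arbitrary choices rather than anything prescribed by the text.

from itertools import combinations

baskets = [{1, 2, 3}, {1, 2}, {2, 3}, {1, 2, 3, 4}]   # made-up data
s = 2                                                 # minimum support
NUM_BUCKETS = 7                                       # assumed, for illustration

def bucket(pair):
    i, j = pair
    return (i * j) % NUM_BUCKETS                      # any hash of the pair will do

# Pass 1: count items and fill the hash table of bucket counts.
item_counts, bucket_counts = {}, [0] * NUM_BUCKETS
for b in baskets:
    for item in b:
        item_counts[item] = item_counts.get(item, 0) + 1
    for pair in combinations(sorted(b), 2):
        bucket_counts[bucket(pair)] += 1

frequent_items = {i for i, c in item_counts.items() if c >= s}
bitmap = [c >= s for c in bucket_counts]              # one bit per bucket between passes

# Pass 2: count only candidate pairs (both items frequent and a frequent bucket).
pair_counts = {}
for b in baskets:
    for pair in combinations(sorted(b), 2):
        if set(pair) <= frequent_items and bitmap[bucket(pair)]:
            pair_counts[pair] = pair_counts.get(pair, 0) + 1

print({p: c for p, c in pair_counts.items() if c >= s})   # the frequent pairs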
Let us consider a typical situation. Suppose we have 1 GB of main memory available for the hash
table on the first pass. Let us say the dataset has a billion baskets, each with 10 items. A bucket is an
Figure 8.6 Main memory map of the PCY algorithm: a hash table for pair counts in Pass 1; the bitmap and the counts of candidate pairs in Pass 2.
integer, typically 4 bytes, so we can maintain a quarter of a billion buckets. The number of pairs in all
the baskets is
    10^9 × C(10, 2) = 4.5 × 10^10 pairs
This number is also the sum of the counts in the buckets. Thus, the average count is about 180.
If the support threshold s is around 180 or less, we might expect few buckets to be infrequent.
However, if s is much larger, say 1000, then it must be that the great majority of the buckets are
infrequent. The greatest possible number of frequent buckets is, thus, about 45 million out of the
250 million buckets.
Between the passes of PCY, the hash table is reduced to a bitmap, with one bit for each bucket. The
bit is 1 if the bucket is frequent and 0 if it is not. Thus, integers of 4 bytes are replaced by single bits.
Thus, the bitmap occupies only 1/32 of the space that would otherwise be available to store counts.
However, if most buckets are infrequent, we expect that the number of pairs being counted on the sec-
ond pass will be much smaller than the total number of pairs of frequent items. Thus, PCY can handle
large datasets without thrashing during the second pass, while the Apriori algorithm would have run
out of main memory space resulting in thrashing.
There is one more issue that could affect the space requirement of the PCY algorithm in the second
pass. In the PCY algorithm, the set of candidate pairs is sufficiently irregular, and hence we cannot use
the triangular-matrix method for organizing counts; we must use a table of counts. Thus, it does not
make sense to use PCY unless the number of candidate pairs is reduced to at most one-third of all pos-
sible pairs. Passes of the PCY algorithm after the second can proceed just as in the Apriori algorithm,
if they are needed.
Further, in order for PCY to be an improvement over Apriori, a good fraction of the buckets on the
first pass must not be frequent. For if most buckets are frequent, the algorithm does not eliminate many
pairs. Any bucket to which even one frequent pair hashes will itself be frequent. However, buckets to
which no frequent pair hashes could still be frequent if the sum of the counts of the pairs that do hash
there exceeds the threshold s .
To a first approximation, if the average count of a bucket is less than s, we can expect at least half
the buckets not to be frequent, which suggests some benefit from the PCY approach. However, if the
average bucket has a count above s , then most buckets will be frequent.
Suppose the total number of occurrences of pairs of items among all the baskets in the dataset is P .
Since most of the main memory M can be devoted to buckets, the number of buckets will be approxi-
mately M /4. The average count of a bucket will then be 4P/M . In order that there be many buckets that
are not frequent, we need
    4P/M < s,   or equivalently   M > 4P/s
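As a quick check of this bound (an illustrative calculation only), plugging in the figures used earlier, P = 4.5 × 10^10 pair occurrences and a support threshold of s = 1000, gives the minimum useful amount of main memory:

P = 4.5e10       # total occurrences of pairs among all baskets (figure used above)
s = 1000         # support threshold considered above
min_memory = 4 * P / s
print("Need M > %.2e bytes, i.e., roughly %.0f MB, for the buckets"
      % (min_memory, min_memory / 2**20))   # about 1.8e8 bytes (~172 MB)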
An example illustrates some of the steps of the PCY algorithm.
Example 12
Given: Database D ; minimum support = 2 and the following data.
TID Items
1 1,3,4
2 2,3,5
3 1,2,3,5
4 2,5
Pass 1:
Step 1: Scan D along with counts. Also form possible pairs and hash them to the buckets.
For example, {1,3}:2 means pair {1,3} hashes to bucket 2.
Itemset Sup
{1} 2
{2} 3
{3} 3
{4} 1
{5} 3
# First Map-Reduce phase (map body, continued): emit the itemsets found frequent in the chunk
if supp(itemset) >= p*s:
    emit(itemset, null)

reduce(key, values):
    emit(key, null)                  # key is a candidate itemset

# Second Map-Reduce phase (map body): emit each candidate with its count in the chunk
emit(itemset, supp(itemset))

reduce(key, values):
    result = 0
    for value in values:             # sum the per-chunk counts
        result += value
    if result >= s:
        emit(key, result)
the threshold be set to something less than its proportional value. That is, if the support threshold for
the whole dataset is s , and the sample size is fraction p, then when looking for frequent itemsets in
the sample, use a threshold such as 0.9 ps or 0.7 ps . The smaller we make the threshold, the more main
memory we need for computing all itemsets that are frequent in the sample, but the more likely we are
to avoid the situation where the algorithm fails to provide an answer.
Having constructed the collection of frequent itemsets for the sample, we next construct the nega-
tive border. This is the collection of itemsets that are not frequent in the sample, but all of their imme-
diate subsets (subsets constructed by deleting exactly one item) are frequent in the sample.
We shall define the concept of “negative border” before we explain the algorithm.
The negative border with respect to a collection of frequent itemsets, S, and a set of items, I, is the set of minimal
itemsets contained in PowerSet(I) and not in S. The basic idea is that the negative border of a set
of frequent itemsets contains the closest itemsets that could also be frequent. Consider the case
where a set X is not contained in the frequent itemsets. If all subsets of X are contained in the set
of frequent itemsets, then X would be in the negative border. We illustrate this with the following
example.
Example 14
Consider the set of items I = { A, B, C, D, E } and let the combined frequent itemsets of size 1 to 3 be
S = {{A}, {B}, {C}, {D}, {AB}, {AC}, {BC}, {AD}, {CD}, {ABC}}
1. The negative border is {{E }, {BD }, { ACD }}.
2. The set {E } is the only 1-itemset not contained in S.
3. {BD } is the only 2-itemset not in S but whose 1-itemset subsets are.
4. { ACD } is the only 3-itemset whose 2-itemset subsets are all in S.
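The negative border of Example 14 can be recomputed directly from the definition, as the Python sketch below illustrates (it is not the book's code; it simply checks, for every itemset not in S, whether all of its immediate subsets are in S, treating the empty set as frequent).

from itertools import combinations

I = {"A", "B", "C", "D", "E"}
S = {frozenset(x) for x in [{"A"}, {"B"}, {"C"}, {"D"}, {"A", "B"}, {"A", "C"},
                            {"B", "C"}, {"A", "D"}, {"C", "D"}, {"A", "B", "C"}]}

negative_border = set()
for k in range(1, len(I) + 1):
    for cand in map(frozenset, combinations(I, k)):
        if cand in S:
            continue
        # immediate subsets are obtained by deleting exactly one item;
        # for 1-itemsets the only immediate subset is the empty set.
        immediate = [cand - {x} for x in cand]
        if all(len(sub) == 0 or sub in S for sub in immediate):
            negative_border.add(cand)

print(sorted((sorted(x) for x in negative_border), key=len))
# expected: [['E'], ['B', 'D'], ['A', 'C', 'D']]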
The negative border is important since it is necessary to determine the support for those itemsets
in the negative border to ensure that no large itemsets are missed from analyzing the sample data.
Support for the negative border is determined when the remainder of the database is scanned. If we
find that an itemset X in the negative border belongs in the set of all frequent itemsets, then there is a
potential for a superset of X to also be frequent. If this happens, then a second pass over the database
is needed to make sure that all frequent itemsets are found.
To complete Toivonen’s algorithm, we make a pass through the entire dataset, counting all the
itemsets that are frequent in the sample or are in the negative border. There are two possible out-
comes.
Outcome 1: No member of the negative border is frequent in the whole dataset. In this case, we
already have the correct set of frequent itemsets, which were found in the sample and also were
found to be frequent in the whole.
Outcome 2: Some member of the negative border is frequent in the whole. Then we cannot be sure
whether there exist still larger sets that would also turn out to be frequent if we considered a larger
sample. Thus, the algorithm terminates without a result, and it must be repeated with a new random
sample.
It is easy to see that Toivonen's algorithm never produces a false positive, since it only outputs those
itemsets that have been counted and found to be frequent in the whole. To prove that it never produces
a false negative, we must show that when no member of the negative border is frequent in the whole,
then there can be no itemset whatsoever, that is:
Frequent in the whole, but in neither the negative border nor the collection of frequent itemsets for
the sample.
Suppose the above is not true. This means, there is a set S that is frequent in the whole, but not in the
negative border and not frequent in the sample. Also, this round of Toivonen’s algorithm produced an
answer, which would certainly not include S among the frequent itemsets. By monotonicity, all subsets
of S are also frequent in the whole.
Let T be a subset of S that is of the smallest possible size among all subsets of S that are not fre-
quent in the sample. Surely T meets one of the conditions for being in the negative border: It is not
frequent in the sample. It also meets the other condition for being in the negative border: Each of its
immediate subsets is frequent in the sample. For if some immediate subset of T were not frequent in
the sample, then there would be a subset of S that is smaller than T and not frequent in the sample,
contradicting our selection of T as a subset of S that was not frequent in the sample, yet as small as
any such set.
Now we see that T is both in the negative border and frequent in the whole dataset. Consequently,
this round of Toivonen’s algorithm did not produce an answer.
We have seen in Chapter 6 that recently there has been much interest in data arriving in the form of
continuous and infinite data streams, which arise in several application domains like high-speed net-
working, financial services, e-commerce and sensor networks.
We have seen that data streams possess distinct computational characteristics, such as unknown or
unbounded length, possibly very fast arrival rate, inability to backtrack over previously arrived items
(only one sequential pass over the data is permitted), and a lack of system control over the order in
which the data arrive. As data streams are of unbounded length, it is intractable to store the entire data
into main memory.
Finding frequent itemsets in data streams lends itself to many applications of Big Data. In many
such applications, one is normally interested in the frequent itemsets in the recent period of time.
pairs, doing so wastes half the space. Thus, we use the single dimension representation of the triangular matrix. If fewer than one-third of the possible pairs actually occur in baskets, then it is more space-efficient to store counts of pairs as triples (i, j, c), where c is the count of the pair {i, j} and i < j. An index structure such as a hash table allows us to find the triple for (i, j) efficiently.
• Monotonicity of frequent itemsets: An important property of itemsets is that if a set of items is frequent, then so are all its subsets. We exploit this property to eliminate the need to count certain itemsets by using its contrapositive: If an itemset is not frequent, then neither are its supersets.
• The Apriori algorithm for pairs: We can find all frequent pairs by making two passes over the baskets. On the first pass, we count the items themselves and then determine which items are frequent. On the second pass, we count only the pairs of items both of which are found frequent on the first pass. Monotonicity justifies our ignoring other pairs.
• The PCY algorithm: This algorithm improves on Apriori by creating a hash table on the first pass, using all main-memory space that is not needed to count the items. Pairs of items are hashed, and the hash-table buckets are used as integer counts of the number of times a pair has hashed to that bucket. Then, on the second pass, we only have to count pairs of frequent items that hashed to a frequent bucket (one whose count is at least the support threshold) on the first pass.
• The multistage and multihash algorithms: These are extensions to the PCY algorithm, inserting additional passes between the first and second pass of the PCY algorithm to hash pairs to other, independent hash tables. Alternatively, we can modify the first pass of the PCY algorithm to divide available main memory into several hash tables. On the second pass, we only have to count a pair of frequent items if they hashed to frequent buckets in all hash tables.
• Randomized algorithms: Instead of making passes through all the data, we may choose a random sample of the baskets, small enough that it is possible to store both the sample and the needed counts of itemsets in the main memory. While this method uses at most one pass through the whole dataset, it is subject to false positives (itemsets that are frequent in the sample but not the whole) and false negatives (itemsets that are frequent in the whole but not the sample).
• The SON algorithm: This algorithm divides the entire file of baskets into segments small enough that all frequent itemsets for the segment can be found in main memory. Candidate itemsets are those found frequent for at least one segment. A second pass allows us to count all the candidates and find the exact collection of frequent itemsets. This algorithm is especially appropriate for a MapReduce setting, making it ideal for Big Data.
• Toivonen's algorithm: This algorithm improves on the random sampling algorithm by avoiding both false positives and negatives. To achieve this, it searches for frequent itemsets in a sample, but with the threshold lowered so there is little chance of missing an itemset that is frequent in the whole. Next, we examine the entire file of baskets, counting not only the itemsets that are frequent in the sample, but also the negative border. If no member of the negative border is found frequent in the whole, then the answer is exact. But if a member of the negative border is found frequent, then the whole process has to repeat with another sample.
• Frequent itemsets in streams: We present an overview of a few techniques that can be used to count frequent items in a stream. We also present a few techniques that use the concept of decaying windows for finding more recent frequent itemsets.
Exercises
1. Imagine there are 100 baskets, numbered 1, for each pair of items (i, j ) where i < j ) and a
2, …, 100, and 100 items, similarly numbered. hash table of item-item-count triples. In the
Item i is in basket j if and only if i divides j first case neglect the space needed to translate
evenly. For example, basket 24 is the set of between original item numbers and numbers
items {1, 2, 3, 4, 6, 8, 12, 24}. Describe all the for the frequent items, and in the second case
association rules that have 100% confidence. neglect the space needed for the hash table.
Assume that item numbers and counts are
2. Suppose we have transactions that satisfy the
always 4-byte integers.
following assumptions:
As a function of N and M , what is the mini-
• The support threshold s is 10,000.
mum number of bytes of main memory
• There are one million items, which needed to execute the Apriori algorithm on
are represented by the integers 0, 1, ..., this data?
999999.
3. If we use a triangular matrix to count pairs,
• There are N frequent items, that is, items
and n, the number of items, is 20, what pair’s
that occur 10,000 times or more.
count is in a [100]?
• There are one million pairs that occur
10,000 times or more. 4. Let there be I items in a market-basket data-
• There are 2 M pairs that occur exactly set of B baskets. Suppose that every basket
once. M of these pairs consist of two contains exactly K items. As a function of I ,
frequent items, the other M each have at B , and K :
least one non-frequent item. (a) How much space does the triangular-
• No other pairs occur at all. matrix method take to store the counts
of all pairs of items, assuming four bytes
• Integers are always represented by 4
per array element?
bytes.
(b) What is the largest possible number of
Suppose we run the Apriori algorithm to find pairs with a non-zero count?
frequent pairs and can choose on the second
(c) Under what circumstances can we be
pass between the triangular-matrix method
certain that the triples method will use
for counting candidate pairs (a triangular
less space than the triangular array?
array count[i ][ j ] that holds an integer count
EXERCISES • 237
5. Imagine that there are 1100 items, of which 100 are "big" and 1000 are "little". A basket is formed by adding each big item with probability 1/10, and each little item with probability 1/100. Assume the number of baskets is large enough that each itemset appears in a fraction of the baskets that equals its probability of being in any given basket. For example, every pair consisting of a big item and a little item appears in 1/1000 of the baskets. Let s be the support threshold, but expressed as a fraction of the total number of baskets rather than as an absolute number. Give, as a function of s ranging from 0 to 1, the number of frequent items on Pass 1 of the Apriori algorithm. Also, give the number of candidate pairs on the second pass.
6. Consider running the PCY algorithm on the data of Exercise 5, with 100,000 buckets on the first pass. Assume that the hash function used distributes the pairs to buckets in a conveniently random fashion. Specifically, the 499,500 little-little pairs are divided as evenly as possible (approximately 5 to a bucket). One of the 100,000 big-little pairs is in each bucket, and the 4950 big-big pairs each go into a different bucket.
(a) As a function of s, the ratio of the support threshold to the total number of baskets (as in Exercise 5), how many frequent buckets are there on the first pass?
(b) As a function of s, how many pairs must be counted on the second pass?
7. Here is a collection of 12 baskets. Each contains three of the six items 1 through 6.
{1,2,3} {2,3,4} {3,4,5} {4,5,6} {1,3,5} {2,4,6} {1,3,4} {2,4,5} {3,5,6} {1,2,4} {2,3,5} {3,4,6}
Suppose the support threshold is 3. On the first pass of the PCY algorithm we use a hash table with 10 buckets, and the set {i, j} is hashed to bucket i × j mod 10.
(a) By any method, compute the support for each item and each pair of items.
(b) Which pairs hash to which buckets?
(c) Which buckets are frequent?
(d) Which pairs are counted on the second pass of the PCY algorithm?
(A small sketch following these exercises can be used to check these computations.)
8. Suppose we run the Multistage algorithm on the data of Exercise 7, with the same support threshold of 3. The first pass is the same as in that exercise, and for the second pass, we hash pairs to nine buckets, using the hash function that hashes {i, j} to bucket i + j mod 9. Determine the counts of the buckets on the second pass. Does the second pass reduce the set of candidate pairs?
9. Suppose we run the Multihash algorithm on the data of Exercise 7. We shall use two hash tables with five buckets each. For one, the set {i, j} is hashed to bucket 2i + 3j + 4 mod 5, and for the other, the set is hashed to i + 4j mod 5. Since these hash functions are not symmetric in i and j, order the items so that i < j when evaluating each hash function. Determine the counts of each of the 10 buckets. How large does the support threshold have to be for the Multihash algorithm to eliminate more pairs than the PCY algorithm would, using the hash table and function described in Exercise 7?
10. During a run of Toivonen's algorithm with set of items {A,B,C,D,E,F,G,H}, a sample is found to have the following maximal frequent itemsets: {A,B}, {A,C}, {A,D}, {B,C}, {E}, {F}. Compute the negative border. Then, identify in the list below the set that is NOT in the negative border.
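The first pass of Exercise 7 can be checked with a small Python sketch (ours, not part of the exercise): it counts item supports, hashes every pair of each basket to one of the 10 buckets using i × j mod 10, and then lists the candidate pairs for the second pass.

from itertools import combinations
from collections import Counter

baskets = [
    {1, 2, 3}, {2, 3, 4}, {3, 4, 5}, {4, 5, 6}, {1, 3, 5}, {2, 4, 6},
    {1, 3, 4}, {2, 4, 5}, {3, 5, 6}, {1, 2, 4}, {2, 3, 5}, {3, 4, 6},
]
support_threshold = 3

item_counts = Counter()
bucket_counts = Counter()
for basket in baskets:
    item_counts.update(basket)
    for i, j in combinations(sorted(basket), 2):
        bucket_counts[(i * j) % 10] += 1   # PCY first-pass hash: {i, j} -> i*j mod 10

frequent_items = {i for i, c in item_counts.items() if c >= support_threshold}
frequent_buckets = {b for b, c in bucket_counts.items() if c >= support_threshold}

# Candidate pairs for the second pass: both items frequent and the pair's bucket frequent
candidates = [(i, j) for i, j in combinations(sorted(frequent_items), 2)
              if (i * j) % 10 in frequent_buckets]
print("item counts:", dict(item_counts))
print("bucket counts:", dict(bucket_counts))
print("candidate pairs:", candidates)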
are sometimes represented by more complicated data structures than the vectors of attributes. Good examples include text documents, images, or graphs. Determining the similarity (or differences) of two objects in such a situation is more complicated, but if a reasonable similarity (dissimilarity) measure exists, then a clustering analysis can still be performed. Such measures, which were also discussed in Chapter 5, include the Jaccard distance, Cosine distance, Hamming distance, and Edit distance.
Finding topics:
1. Represent a document by a vector (x1, x2, …, xk), where xi = 1 iff the i-th word (in some order) appears in the document.
2. Documents with similar sets of words may be about the same topic.
3. We have a choice when we think of documents as sets of words:
• Sets as vectors: Measure similarity by the cosine distance.
• Sets as sets: Measure similarity by the Jaccard distance.
• Sets as points: Measure similarity by the Euclidean distance.
A short sketch after this list computes all three measures for a toy pair of documents.
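The following Python sketch makes the three choices concrete; the two tiny documents are invented purely for illustration.

import math

doc1 = "big data needs new mining algorithms".split()
doc2 = "new algorithms for mining big graphs".split()

vocab = sorted(set(doc1) | set(doc2))
# 0/1 vectors: component i is 1 iff the i-th word of the vocabulary appears in the document
v1 = [1 if w in doc1 else 0 for w in vocab]
v2 = [1 if w in doc2 else 0 for w in vocab]

s1, s2 = set(doc1), set(doc2)
jaccard_distance = 1 - len(s1 & s2) / len(s1 | s2)            # sets as sets

dot = sum(a * b for a, b in zip(v1, v2))
cosine_distance = 1 - dot / (math.sqrt(sum(a * a for a in v1)) *
                             math.sqrt(sum(b * b for b in v2)))  # sets as vectors

euclidean_distance = math.sqrt(sum((a - b) ** 2 for a, b in zip(v1, v2)))  # sets as points

print(jaccard_distance, cosine_distance, euclidean_distance)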
1. Intermediate step for other fundamental data mining problems: Since a clustering can be
considered a form of data summarization, it often serves as a key intermediate step for many fun-
damental data mining problems, such as classification or outlier analysis. A compact summar y of
the data is often useful for different kinds of application-specific insights.
2. Collaborative filtering: In collaborative filtering methods, clustering provides a summarization of like-minded users. The ratings provided by the different users are used in order to perform the collaborative filtering. This can be used to provide recommendations in a variety of applications.
3. Customer segmentation: This application is quite similar to collaborative filtering, since it cre-
ates groups of similar customers in the data. The major difference from collaborative filtering is
that instead of using rating information, arbitrary attributes about the objects may be used for
clustering purposes.
4. Data summarization: Many clustering methods are closely related to dimensionality reduction
methods. Such methods can be considered as a form of data summarization. Data summarization
can be helpful in creating compact data representations that are easier to process and interpret in
a wide variety of applications.
5. Dynamic trend detection: Many forms of dynamic and streaming algorithms can be used to
perform trend detection in a wide variety of social networking applications. In such applications,
the data is dynamically clustered in a streaming fashion and can be used in order to determine
the important patterns of changes. Examples of such streaming data could be mu lti-dimensional
data, text streams, streaming time-series data, and trajectory data. Key trends and events in the
data can be discovered with the use of clustering methods.
6. Multimedia data analysis: A variety of different kinds of documents, such as images, audio, or video, fall in the general category of multimedia data. The determination of similar segments has numerous applications, such as the determination of similar snippets of music or similar photographs. In many cases, the data may be multi-modal and may contain different types. In such cases, the problem becomes even more challenging.
7. Biological data analysis: Biological data has become pervasive in the last few years, because of the success of the human genome effort and the increasing ability to collect different kinds of gene expression data. Biological data is usually structured either as sequences or as networks. Clustering algorithms provide good ideas of the key trends in the data, as well as the unusual sequences.
8. Social network analysis: In these applications, the structure of a social network is used in order
to determine the important communities in the underlying network. Community detection
has important applications in social network analysis, because it provides an important under-
standing of the community structure in the network. Clustering also has applications to social
network summarization, which is useful in a number of applications.
The above-mentioned list of applications represents a good cross-section of the wide diversity of problems that can be addressed with clustering algorithms.
1. Hierarchical techniques produce a nested arrangement of partitions, with a single cluster at the top consisting of all data points and singleton clusters of individual points at the bottom. Each intermediate level can be viewed as combining (splitting) two clusters from the next lower (next higher) level. Hierarchical clustering techniques which start with one cluster of all the points and then keep progressively splitting the clusters till singleton clusters are reached are called "divisive" clustering. On the other hand, approaches that start with singleton clusters and go on merging close clusters at every step until they reach one cluster consisting of the entire dataset are called "agglomerative" methods. While most hierarchical algorithms just join two clusters or split a cluster into two sub-clusters at every step, there exist hierarchical algorithms that can join more than two clusters in one step or split a cluster into more than
2. Partitional techniques create a one-level (un-nested) partitioning of the data points. If K is the desired number of clusters, then partitional approaches typically find all K clusters in one step. The important issue is that we need a predefined value of K, the number of clusters we propose to identify in the dataset.
Of course, a hierarchical approach can be used to generate a flat partition of K clusters, and likewise, the repeated application of a partitional scheme can provide a hierarchical clustering.
There are also other important distinctions between clustering algorithms as discussed below:
1. Does a clustering algorithm use all attributes simultaneously (polythetic) or use only one attribute at a time (monothetic) to compute the distance?
2. Does a clustering technique use one object at a time (incremental) or does the algorithm consider all objects at once (non-incremental)?
3. Does the clustering method allow a point to belong to multiple clusters (overlapping) or does it
insist that each object can belong to one cluster only (non-overlapping)? Overlapping clusters
are not to be confused with fuzzy clusters, as in fuzzy clusters objects actually belong to multiple
classes with varying levels of membership.
1. Whether the dataset is treated as a Euclidean space, and whether the algorithm can work for any arbitrary distance measure. In a Euclidean space, where data is represented as a vector of real numbers, the notion of a Centroid which can be used to summarize a collection of data points is very natural: it is simply the mean value of the points. In a non-Euclidean space, for example images or documents, where data is a set of words or a group of pixels, there is no notion of a Centroid, and we are forced to develop another way to summarize clusters.
2. Whether the algorithm is based on the assumption that data will fit in main memory, or whether data must reside primarily in secondary memory. Algorithms for large amounts of data often must take shortcuts, since it is infeasible to look at all pairs of points. It is also necessary to summarize the clusters in main memory itself, as is common with most big data algorithms.
Thus, it is often said, “in high-dimensional spaces, distances between points become relatively
uniform”. In such cases, the notion of the nearest neighbor of a point is meaningless. To understand
this in a more geometrical way, consider a hyper-sphere whose center is the selected point and whose
radius is the distance to the nearest data point. Then, if the relative difference between the distance to
the nearest and farthest neighbors is small, expanding the radius of the sphere "slightly" will include many more points.
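A quick numerical experiment (a sketch using NumPy, with arbitrarily chosen sample sizes) illustrates this concentration of distances: as the dimensionality grows, the gap between the nearest and the farthest neighbor of a query point shrinks relative to the distances themselves.

import numpy as np

rng = np.random.default_rng(0)
for d in (2, 10, 100, 1000):                 # increasing dimensionality
    points = rng.random((1000, d))           # 1000 random points in the unit cube
    query = rng.random(d)
    dists = np.linalg.norm(points - query, axis=1)
    relative_gap = (dists.max() - dists.min()) / dists.min()
    print(f"d={d:5d}  relative gap between farthest and nearest: {relative_gap:.3f}")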
9.3.1 Hierarchical Clustering in Euclidean Space
These methods construct the clusters by recursively partitioning the instances in either a top-down or
bottom-up fashion. These methods can be subdivided as follows:
1. Agglomerative hierarchical clustering: Each object initially represents a cluster of its own.
Then clusters are successively merged until the desired clustering is obtained.
2. Divisive hierarchical clustering: All objects initially belong to one cluster. Then the cluster is
divided into sub-clusters, which are successively divided into their own sub-clusters. This process
continues until the desired cluster structure is obtained.
The result of the hierarchical methods is a dendrogram, representing the nested grouping of objects and
similarity levels at which groupings change. A clustering of the data objects is obtained by cutting the
dendrogram at the desired similarity level. Figure 9.1 shows a simple example of hierarchical clustering.
The merging or division of clusters is to be performed according to some similarity measure, chosen so
as to optimize some error criterion (such as a sum of squares).
The hierarchical clustering methods can be further divided according to the manner in which inter-
cluster distances for merging are calculated.
Figure 9.1 Agglomerative versus divisive hierarchical clustering of five points a, b, c, d, e, showing the nested clusters ab, de, cde, and abcde.
1. Single-link clustering: Here the distance between the two clusters is taken as the shortest distance from any member of one cluster to any member of the other cluster. If the data consist of similarities, the similarity between a pair of clusters is considered to be equal to the greatest similarity from any member of one cluster to any member of the other cluster.
2. Complete-link clustering: Here the distance between the two clusters is the longest distance
from any member of one cluster to any member of the other cluster.
3. Average-link clustering: Here the distance between two clusters is the average of all distances computed between every pair of points, one from each cluster.
4. Centroid link clustering: Here the distance between the clusters is computed as the distance between the two mean data points (average points) of the clusters. This average point of a cluster is called its Centroid. At each step of the clustering process we combine the two clusters that have the smallest Centroid distance. The notion of a Centroid is relevant for Euclidean space only, since all the data points have attributes with real values. Figure 9.2 shows this process, and a short sketch following the figure illustrates the four linkage criteria in code.
Figure 9.2 Centroid-link clustering in a 2-D Euclidean space. The data points (o) are (0, 0), (1, 2), (2, 1), (4, 1), (5, 0), and (5, 3); the centroids (x) formed as clusters merge include (1.5, 1.5), (1, 1), (4.5, 0.5), and (4.7, 1.3). The merge order can also be drawn as a dendrogram.
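Using the six data points of Figure 9.2, the following sketch builds the hierarchy under each of the four linkage criteria listed above. It relies on SciPy's hierarchical clustering routines, which is our choice of tool rather than something the chapter prescribes.

import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# The data points shown in Figure 9.2
points = np.array([(0, 0), (1, 2), (2, 1), (4, 1), (5, 0), (5, 3)], dtype=float)

for method in ("single", "complete", "average", "centroid"):
    Z = linkage(points, method=method)               # merge history (the dendrogram)
    labels = fcluster(Z, t=2, criterion="maxclust")  # cut the dendrogram into 2 clusters
    print(method, labels)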
2. If across a series of merges or splits, very little change occurs to the clustering, it means clustering
has reached some stable structure.
3. If the maximum distance between any two points in a cluster becomes greater than a pre-specified
value or threshold we can stop further steps.
4. Combination of the above conditions.
1. Inability to scale well – The time complexity of hierarchical algorithms is at least O(m²) (where m is the total number of instances), which is non-linear in the number of objects.
Example 3
Suppose we are using Jaccard distances and at some intermediate stage we want to merge two docu-
ments with Jaccard distance within a threshold value. However, we cannot find a document that rep-
resents their average, which could be used as its Centroid. Given that we cannot perform the average
operation on points in a cluster when the space is non-Euclidean, our only choice is to pick one of
the points of the cluster itself as a representative or a prototype of the cluster. Ideally, this point is
similar to most points of the cluster, so it is in some sense the "center" point.
This representative point is called the “Clustroid” of the cluster. We can select the Clustroid in various
ways. Common choices include selecting as the Clustroid the point that minimizes:
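One common choice is the point that minimizes the sum of the distances to the other points of the cluster. The sketch below, written under that assumption, picks such a Clustroid for a small set of documents treated as word sets.

def jaccard_distance(a, b):
    """1 minus the Jaccard similarity of two sets."""
    return 1 - len(a & b) / len(a | b)

def clustroid(cluster):
    """Pick the point with the smallest total distance to all other points.
    (Minimizing the sum of distances is one common choice; others include
    the maximum distance or the sum of squared distances.)"""
    return min(cluster, key=lambda p: sum(jaccard_distance(p, q) for q in cluster))

docs = [frozenset(s.split()) for s in
        ("big data mining", "data mining algorithms", "big data algorithms", "graph theory")]
print(clustroid(docs))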
the data set and compare their clustering characteristics to come up with the ideal clustering. Because this is not computationally feasible, certain greedy heuristics are used together with an iterative optimization algorithm.
The simplest and most popular algorithm in this class is the K-means algorithm. But its memory requirements dictate that it can only be used on small datasets. For big datasets we discuss a variant of K-means called the BFR algorithm.
Algorithm
Input: S (data points), K (number of clusters)
Output: K clusters
1. Choose initial K cluster Centroids randomly.
2. while termination condition is not satisfied do
(a) Assign data points to the closest Centroid.
(b) Recompute Centroids based on current cluster points.
end while
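The pseudocode above can be turned into a minimal Python sketch. The termination condition used here, stopping when the Centroids no longer move (or after a fixed number of iterations), is one common choice and is our assumption rather than part of the pseudocode.

import random

def kmeans(points, k, max_iters=100):
    """Minimal K-means for points given as tuples of numbers."""
    centroids = random.sample(points, k)              # step 1: random initial Centroids
    clusters = []
    for _ in range(max_iters):
        clusters = [[] for _ in range(k)]
        for p in points:                              # step 2(a): assign to the closest Centroid
            idx = min(range(k),
                      key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centroids[c])))
            clusters[idx].append(p)
        new_centroids = [
            tuple(sum(coords) / len(cluster) for coords in zip(*cluster)) if cluster else centroids[i]
            for i, cluster in enumerate(clusters)     # step 2(b): recompute Centroids
        ]
        if new_centroids == centroids:                # termination: Centroids stopped moving
            break
        centroids = new_centroids
    return centroids, clusters

centroids, clusters = kmeans([(0, 0), (1, 2), (2, 1), (4, 1), (5, 0), (5, 3)], k=2)
print(centroids)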
9.4 PARTITIONING METHODS • 251
One such algorithm is the K-medoids or Partitioning Around Medoids (PAM) algorithm. Each cluster is represented by the central object in the cluster, rather than by the mean, which may not even belong to the cluster. The K-medoids method is more robust than the K-means algorithm in the presence of noise and outliers because a medoid is less influenced by outliers or other extreme values than a mean. However, its processing is more costly than the K-means method. Both methods require the user to specify K, the number of clusters.
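A tiny 1-D illustration of this robustness (the data values are invented): a single outlier drags the mean far away from the bulk of the cluster, while the medoid, which must itself be one of the data points, barely moves.

def mean(xs):
    return sum(xs) / len(xs)

def medoid(xs):
    # The medoid minimizes the total distance to all other points in the cluster
    return min(xs, key=lambda x: sum(abs(x - y) for y in xs))

cluster = [1, 2, 2, 3, 3, 4]
with_outlier = cluster + [100]
print(mean(cluster), medoid(cluster))            # 2.5 and 2: both near the bulk of the data
print(mean(with_outlier), medoid(with_outlier))  # mean jumps to about 16.4; medoid only moves to 3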
The reasoning behind using these parameters is that they are easy to compute when we merge two clusters: we need only add the corresponding values from the two clusters. Similarly, we can compute the Centroid and variance very easily from these values, as follows (a short sketch after this list shows the same computations in code):
1. The i-th coordinate of the Centroid is SUMi/N.
2. The variance in the i-th dimension is SUMSQi/N − (SUMi/N)².
3. The standard deviation in the i-th dimension is the square root of the variance in that dimension.
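A minimal sketch of these summary statistics; the names N, SUM, and SUMSQ follow the text, while the ClusterSummary wrapper class is our own. It shows how two clusters are merged by component-wise addition and how the Centroid and variance are recovered.

class ClusterSummary:
    """BFR-style summary of a cluster: N, SUM and SUMSQ per dimension."""
    def __init__(self, dims):
        self.N = 0
        self.SUM = [0.0] * dims
        self.SUMSQ = [0.0] * dims

    def add_point(self, point):
        self.N += 1
        for i, x in enumerate(point):
            self.SUM[i] += x
            self.SUMSQ[i] += x * x

    def merge(self, other):
        # Merging two clusters is just component-wise addition of the statistics
        self.N += other.N
        for i in range(len(self.SUM)):
            self.SUM[i] += other.SUM[i]
            self.SUMSQ[i] += other.SUMSQ[i]

    def centroid(self):
        return [s / self.N for s in self.SUM]

    def variance(self):
        # variance in dimension i = SUMSQ_i/N - (SUM_i/N)^2
        return [sq / self.N - (s / self.N) ** 2 for s, sq in zip(self.SUM, self.SUMSQ)]

c = ClusterSummary(dims=2)
for p in [(2, 2), (3, 4), (5, 2)]:
    c.add_point(p)
print(c.centroid(), c.variance())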
Initially, the BFR algorithm selects k points, either randomly or using some preprocessing methods to
make better choices. Next, the data file containing the points of the dataset is read in chunks.
These chunks could be from data stored in a distributed file system or there may be one monolithic
huge file which is then divided into chunks of the appropriate size. Each chunk consists of just so
many points as can be processed in the main memory. Further, some amount of main memory is also
required to store the summaries of the k clusters and other data, so the entire memory is not available
to store a chunk.
The data stored in the main-memory other than the chunk from the input consists of three types
of objects:
1. The discard set: The points already assigned to a cluster. These points do not appear in main
memory. They are represented only by the summary statistics for their cluster.
2. The compressed set: There are several groups of points that are sufficiently close to each other
for us to believe they belong in the same cluster, but at present they are not close to any current
Centroid. In this case we cannot assign a cluster to these points as we cannot ascertain to which
cluster they belong. Each such group is represented by its summary statistics, just like the clusters
are, and the points themselves do not appear in main memory.
3. The retained set: These points are not close to any other points; they are “outliers.” They will
eventually be assigned to the nearest cluster, but for the moment we have to retain each such
point in main memory.
These sets will change as we bring in successive chunks of data into the main memory. Figure 9.4
indicates the state of the data after a few chunks of data have been processed by the BFR algorithm.
Figure 9.4 Points in the retained set (RS) are held individually in main memory, while the points of the compressed sets are represented only by their summaries in the CS.
1. For each point (x1, x2, …, xn) that is "sufficiently close" (based on a distance threshold) to the Centroid of a cluster, add the point to that cluster; the point then joins the discard set. We add 1 to the value N in the summary statistics for that cluster, indicating that the cluster has grown by one point. We also add xj to SUMj and xj² to SUMSQj, for each dimension j, for that cluster. (A sketch of this step appears after the list.)
2. If this is the last chunk of data, merge each group from the compressed set and each point of
the retained set into its nearest cluster. We have seen earlier that it is very simple and easy
to merge clusters and groups using their summary statistics. Just add the counts N , and add
corresponding components of the SUM and SUMSQ vectors. The algorithm ends at this
point.
3. Otherwise (this was not the last chunk), use any main-memory clustering algorithm to cluster
the remaining points from this chunk, along with all points in the current retained set. Set a
threshold on the distance values that can occur in the cluster, so we do not merge points unless
they are reasonably close.
4. Those points that remain isolated as clusters of size 1 (i.e., they are not near any other point)
become the new retained set. Clusters of more than one point become groups in the compressed
set and are replaced by their summary statistics.
5. Further we can consider merging groups in the compressed set. Use some threshold to decide
whether groups are close enough; the following section outlines a method to do this. If they can
be merged, then it is easy to combine their summary statistics, as in (2) above.
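A sketch of step 1 for a single chunk, reusing the ClusterSummary wrapper sketched earlier. It assumes, as a simplification, that "sufficiently close" means Euclidean distance to the Centroid below a fixed threshold; the text leaves the exact criterion open.

import math

def process_chunk(chunk, summaries, threshold):
    """Assign each point of the chunk to the nearest cluster if it is close enough.
    `summaries` is a list of ClusterSummary objects as sketched earlier.
    Returns the points that could not be assigned (handled by steps 3-5)."""
    leftovers = []
    for point in chunk:
        dists = [math.dist(point, s.centroid()) for s in summaries]
        best = min(range(len(summaries)), key=lambda i: dists[i])
        if dists[best] <= threshold:
            summaries[best].add_point(point)   # the point joins the discard set
        else:
            leftovers.append(point)            # candidate for the compressed or retained sets
    return leftovers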
Exercises
Suppose initially we assign A1, B1, and C1 as the center of each cluster, respectively. Use the K-means algorithm to show only (a) the three cluster centers after the first round of execution and (b) the final three clusters.
2. Given a 1-D dataset {1, 5, 8, 10, 2}, use the agglomerative clustering algorithms with Euclidean distance to establish a hierarchical grouping relationship. Draw the dendrogram.
3. Both K-means and K-medoids algorithms can perform effective clustering. Illustrate the strength and weakness of K-means in comparison with the K-medoids algorithm. The goal of the K-medoids algorithm is the same as K-means: minimize the total sum of the distances from each data point to its cluster center. Construct a simple 1-D example where K-medoids gives an output that is different from the result returned by the K-means algorithm. (Starting with the same initial clustering in both cases.)
4. Suppose a cluster of 3-D points has standard deviations of 2, 3, and 5, in the three dimensions, in that order. Compute the Mahalanobis distance between the origin (0, 0, 0) and the point (1, −3, 4).
5. For the 2-D dataset (2, 2), (3, 4), (4, 8), (4, 10), (5, 2), (6, 8), (7, 10), (9, 3), (10, 5), (11, 4), (12, 3), (12, 6), we can make three clusters:
• Compute the representation of the cluster as in the BFR algorithm. That is, compute N, SUM, and SUMSQ.
• Compute the variance and standard deviation of each cluster in each of the two dimensions.
6. Execute the BDMO algorithm with p = 3 on the following 1-D, Euclidean data: 1, 45, 80, 24, 56, 71, 17, 40, 66, 32, 48, 96, 9, 41, 75, 11, 58, 93, 28, 39, 77. The clustering algorithm is k-means with k = 3. Only the centroid of a cluster, along with its count, is needed to represent a cluster.
7. Using your clusters from Exercise 6, produce the best centroids in response to a query asking for a clustering of the last 10 points.
8. In certain clustering algorithms, such as CURE, we need to pick a representative set of points in a supposed cluster, and these points should be as far away from each other as possible. That is, begin with the two furthest points, and at each step add the point whose minimum distance to any of the previously selected points is maximum. Suppose you are given the following points in 2-D Euclidean space:
x = (0,0); y = (10,10); a = (1,6); b = (3,7); c = (4,3); d = (7,7); e = (8,2); f = (9,5).
Obviously, x and y are furthest apart, so start with these. You must add five more points, which we shall refer to as the first, second, …, fifth points in what follows. The distance measure is the normal Euclidean distance. Identify the order in which the five points will be added. (The sketch following these exercises can be used to check the order.)
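The greedy selection of Exercise 8 can be checked with the following sketch on the eight given points; the first five points it adds after x and y answer the exercise.

import math

points = {"x": (0, 0), "y": (10, 10), "a": (1, 6), "b": (3, 7),
          "c": (4, 3), "d": (7, 7), "e": (8, 2), "f": (9, 5)}

chosen = ["x", "y"]                      # start with the two furthest points
remaining = set(points) - set(chosen)
while remaining:
    # Add the point whose minimum distance to the already chosen points is maximum
    nxt = max(remaining,
              key=lambda p: min(math.dist(points[p], points[q]) for q in chosen))
    chosen.append(nxt)
    remaining.remove(nxt)
print(chosen)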
Programming Assignments
1. Implement the BFR (modified K-means) algorithm on a large set of 3-D points using MapReduce.
2. Implement the CURE algorithm using MapReduce on the same dataset used in Problem 1.
LEARNING OBJECTIVES
After reading this chapter, you will be able to:
• Learn the use of a Recommender system.
• Understand various models of Recommender systems.
• Learn the collaborative filtering approach for Recommender systems.
• Learn the content-based approach for Recommender systems.
• Understand the methods to improve the prediction function.
10.1 Introduction
10.1.1 What is the Use of Recommender System?
For business, the recommender system can increase sales. The customer service can be personalized and thereby gain customer trust and loyalty. It increases the knowledge about the customers. The recommender system can also give opportunities to persuade the customers and decide on the discount offers.
For customers, the recommender system can help to narrow down their choices, find things of interest, make navigation through the lists easy, and discover new things.
We further look at algorithms for two very interesting problems in social networks. The “SimRank”
algorithm provides a way to discover similarities among the nodes of a graph. We also explore triangle
counting as a way to measure the connectedness of a community.
1. Viral marketing is an application of social network mining that explores how individuals can
influence the buying behavior of others. Viral marketing aims to optimize the positive word-
of-mouth effect among customers. Social network mining can identify strong communities and
influential nodes. It can choose to spend more money marketing to an individual if that person
has many social connections.
2. Similarly in the e-commerce domain, the grouping together of customers with similar buying
profiles enables more personalized recommendation engines. Community discovery in mobile
ad-hoc networks can enable efficient message routing and posting.
3. Social network analysis is used extensively in a wide range of applications, which include data
aggregation and mining, network propagation modeling, network modeling and sampling, user
attribute and behavior analysis, community-maintained resource support, location-based inter-
action analysis, social sharing and filtering, recommender systems, etc.
4. Many businesses use social network analysis to support activities such as customer interaction
and analysis, information system development analysis, targeted marketing, etc.
Finding communities or clustering social networks all leads to standard graph algorithms
when the social network is modeled as a graph. This section will give an overview of how a social
network can be modeled as a graph. A small discussion of the different types of graphs is also
provided.
1. In a social network scenario, the nodes are typically people. But there could be other entities like
companies, documents, computers, etc.
2. A social network can be considered as a heterogeneous and multi-relational dataset represented
by a graph. Both nodes and edges can have attributes. Objects may have class labels.
3. There is at least one relationship between entities of the network. For example, social networks
like Facebook connect entities through a relationship called friends. In LinkedIn, one relation-
ship is “endorse” where people can endorse other people for their skills.
4. In many social networks, this relationship need not be yes or no (binary) but can have a degree.
That means in LinkedIn, we can have a degree of endorsement of a skill like say novice, expert,
etc. Degree can also be a real number.
5. We assume that social networks exhibit the property of non-randomness, often called locality.
Locality is the property of social networks that says nodes and edges of the graph tend to cluster
in communities. This condition is the hardest to formalize, but the intuition is that the relation-
ships tend to cluster. That is, if entity A is related to both B and C, then there is a higher prob-
ability than average that B and C are related. The idea is that most relationships in the real world
tend to cluster around a small set of individuals.
Example 1
Figure 11.1 shows a small graph of the "followers" network of Twitter. The relationship represented by the edges is the "follows" relationship. Jack follows Kris and Pete, shown by the direction of the edges.
Jack and Mary follow each other shown by the bi-directional edges. Bob and Tim follow each other
as do Bob and Kris, Eve and Tim. Pete follows Eve and Bob, Mary follows Pete and Bob follows Alex.
Notice that the edges are not labeled, thus follows is a binary connection. Either a person follows
somebody or does not.
Figure 11.1 The Twitter "follows" graph of Example 1, with nodes Jack, Mary, Kris, Pete, Bob, Alex, Eve, and Tim and a directed edge for each follows relationship.
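The follows relationships of Example 1 can be written down directly as an adjacency structure. The sketch below builds the directed graph and lists the pairs that follow each other.

follows = {
    "Jack": {"Kris", "Pete", "Mary"},
    "Mary": {"Jack", "Pete"},
    "Bob":  {"Tim", "Kris", "Alex"},
    "Tim":  {"Bob", "Eve"},
    "Kris": {"Bob"},
    "Eve":  {"Tim"},
    "Pete": {"Eve", "Bob"},
    "Alex": set(),
}

# Two people follow each other iff the edge exists in both directions
mutual = {frozenset((a, b)) for a, nbrs in follows.items()
          for b in nbrs if a in follows.get(b, set())}
print([tuple(sorted(p)) for p in mutual])   # Jack-Mary, Bob-Tim, Bob-Kris, Eve-Tim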
Example 2
Consider LiveJournal which is a free on-line blogging community where users declare friendship to
each other. LiveJournal also allows users to form a group which other members can then join. The
graph depicted in Fig. 11.2 shows a portion of such a graph. Notice that the edges are undirected,
indicating that the “friendship” relation is commutative.
Figure 11.2 A portion of such a friendship graph, with nodes labeled X1, X2, X3, P1, P2, B, and C.
But the existence of even one high-degree vertex will make this algorithm quadratic. Since it is highly likely to come across such high-degree nodes in massive social graphs, this algorithm too is not practical. Further, this algorithm counts each triangle {x, y, z} six times (once each as {x, y, z}, {x, z, y}, {y, x, z}, {y, z, x}, {z, x, y} and {z, y, x}).
One optimization is to count each triangle only once. We can use another trick to make the algorithm more efficient. The key idea is that "only the lowest-degree vertex of a triangle is responsible for counting it". Further, we also use an ordering of vertices, from the vertices most likely to form triangles to those least likely to do so.
The algorithm can be described as follows: Identify "massive" vertices in the graph. Let the social graph have n vertices and m edges. We call a vertex massive if its degree is at least √m. If a triangle has all its vertices massive, then it is a massive triangle. The algorithm counts massive triangles and non-massive triangles separately. We can easily see that the maximum number of massive vertices a graph can have is 2√m (since the sum of all degrees is 2m). Now represent the graph as a list of its m edges.
1. Compute the degree of each vertex. Examine each edge and add 1 to the count of each of its two
end vertices. The total time required is O(m).
2. Create an index on edges using a hash table, with its vertex pair as a key. So we can check whether
an edge exists given a pair of vertices in constant O(1) time.
3. Create one more hash table for the edges, this time keyed on a single vertex. Given a vertex, we can then easily identify all vertices adjacent to it.
4. Number the vertices and order them in ascending order of their degree: lower-degree vertices first, followed by higher-degree vertices. If two vertices have the same degree, then order them based on their number.
To find massive triangles we notice the following. There can be only O(√m) massive vertices, and so if we check all possible three-element subsets of this set of vertices, we have O(m^(3/2)) possible massive triangles. To check the existence of all three edges we need only constant time because of the hash tables. Thus this part of the implementation takes only O(m^(3/2)) time.
To find the non-massive triangles we adopt the following procedure:
1. Consider each edge (v1, v2). Ignore this edge when both v1 and v2 are massive.
2. Suppose v1 is not massive and v1 appears before v2 in the vertex ordering. We enumerate all vertices adjacent to v1 and call them u1, u2, …, uk, where k will be less than √m. It is easy to find all the u's because of the second hash table, which has single vertices as keys (O(1) time per lookup).
3. Now for each ui we check whether the edge (ui, v2) exists and whether v1 appears before ui in the ordering. If so, we count this triangle. (Thus we avoid counting the same triangle several times.) This operation benefits from the first hash table, whose keys are edges.
Thus, the time to process all the vertices adjacent to v1 is O(√m). Since there are m edges, the total time spent counting the other triangles is O(m^(3/2)).
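The core trick, making the lowest-ordered vertex responsible for each triangle, can be sketched as follows. This simplified version orders vertices by degree (breaking ties by label), keeps the edges in a hash-based set for O(1) membership tests, and, unlike the full algorithm above, does not treat massive vertices separately.

from collections import defaultdict
from itertools import combinations

def count_triangles(edges):
    """Count each triangle exactly once using a degree-based vertex ordering."""
    adj = defaultdict(set)
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)
    # Order vertices by (degree, label); rank[v] is v's position in that order
    order = sorted(adj, key=lambda v: (len(adj[v]), v))
    rank = {v: i for i, v in enumerate(order)}
    edge_set = {frozenset(e) for e in edges}   # hash table keyed by vertex pairs

    count = 0
    for v in order:
        # Only neighbors that come after v in the ordering; v is "responsible" for the triangle
        later = [u for u in adj[v] if rank[u] > rank[v]]
        for a, b in combinations(later, 2):
            if frozenset((a, b)) in edge_set:
                count += 1
    return count

# The Twitter graph of Example 1, taken as undirected
edges = [("Jack", "Kris"), ("Jack", "Pete"), ("Jack", "Mary"), ("Mary", "Pete"),
         ("Bob", "Tim"), ("Bob", "Kris"), ("Bob", "Alex"), ("Eve", "Tim"),
         ("Pete", "Eve"), ("Pete", "Bob")]
print(count_triangles(edges))

Run on that graph, the sketch finds the single triangle formed by Jack, Mary, and Pete.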