
Cs8080 - Irt - Notes All









Information Retrieval – Early Developments – The IR Problem – The User's Task –
Information versus Data Retrieval - The IR System – The Software Architecture of the IR
System – The Retrieval and Ranking Processes - The Web – The e-Publishing Era – How
the web changed Search – Practical Issues on the Web – How People Search – Search
Interfaces Today – Visualization in Search Interfaces.


 Cookie Monster’s definition: news or facts about something.

Types of Information
 Text
 XML and structured documents
 Images
 Audio
 Video
 Source Code
 Applications/Web services

 “Fetch something” that’s been stored

Information Retrieval

Information Retrieval - Calvin Mooers definition

“Information retrieval is a field concerned with the structure, analysis, organization, storage,
searching, and retrieval of information.”
It is the activity of obtaining information resources relevant to an information need from a collection of
information resources.

Information retrieval (IR) is finding material (usually documents) of an unstructured nature
(usually text) that satisfies an information need from within large collections (usually stored on
computers).

 The amount of available information is growing at an incredible rate, for example the Internet and
World Wide Web.
 Information is stored in many forms, e.g., images, text, video, and audio.
 Information retrieval is a way to separate relevant data from irrelevant data.
 The IR field has developed successful methods to deal effectively with huge amounts of information.
 Common methods include the Boolean, Vector Space and Probabilistic models.

Main objective of IR
 Provide the users with effective access to and interaction with information resources.

Goal of IR
 The goal is to search large document collections to retrieve small subsets relevant to the
user’s information need.

Purpose/role of an IR system
 An information retrieval system is designed to retrieve the documents or information required
by the user community.
 It should make the right information available to the right user.
 Thus, an information retrieval system aims at collecting and organizing information in one or
more subject areas in order to provide it to the user as soon as possible.
 Thus it serves as a bridge between the world of creators or generators of information and the users
of that information.

Information retrieval (IR) is concerned with representing, searching, and manipulating large
collections of electronic text and other human-language data.

Web search engines — Google, Bing, and others — are by far the most popular and heavily used IR
services, providing access to up-to-date technical information, locating people and organizations,
summarizing news and events, and simplifying comparison shopping.

Web Search:
Regular users of Web search engines casually expect to receive accurate and near-
instantaneous answers to questions and requests merely by entering a short query — a few
words — into a text box and clicking on a search button. Underlying this simple and intuitive
interface are clusters of computers, comprising thousands of machines, working cooperatively
to generate a ranked list of those Web pages that are likely to satisfy the information need
embodied in the query.
These machines identify a set of Web pages containing the terms in the query, compute a score
for each page, eliminate duplicate and redundant pages, generate summaries of the remaining
pages, and finally return the summaries and links back to the user for browsing.

Consider a simple example.
If you have a computer connected to the Internet nearby, pause for a minute to launch a browser
and try the query “information retrieval” on one of the major commercial Web search engines.
It is likely that the search engine responded in well under a second. Take some time to review
the top ten results. Each result lists the URL for a Web page and usually provides a title and a
short snippet of text extracted from the body of the page.

Overall, the results are drawn from a variety of different Web sites and include sites associated
with leading textbooks, journals, conferences, and researchers. As is common for informational
queries such as this one, the Wikipedia article may be present.

Other Search Applications:

Desktop and file system search provides another example of a widely used IR application. A
desktop search engine provides search and browsing facilities for files stored on a local hard
disk and possibly on disks connected over a local network. In contrast to Web search engines,
these systems require greater awareness of file formats and creation times.
For example, a user may wish to search only within their e-mail or may know the general time
frame in which a file was created or downloaded. Since files may change rapidly, these systems
must interface directly with the file system layer of the operating system and must be engineered
to handle a heavy update load.

Other IR Applications:
1) Document routing, filtering, and selective distribution reverse the typical IR process.
2) Summarization systems reduce documents to a few key paragraphs, sentences, or phrases
describing their content. The snippets of text displayed with Web search results represent one
example.
3) Information extraction systems identify named entities, such as places and dates, and combine
this information into structured records that describe relationships between these entities — for
example, creating lists of books and their authors from Web data.

Application areas within IR

 Cross language retrieval
 Speech/broadcast retrieval
 Text categorization
 Text summarization
 Structured document element retrieval (XML)

Kinds of information retrieval systems

Two broad categories of information retrieval system can be identified:

 In-house: In-house information retrieval systems are set up by a particular library or
information center to serve mainly the users within the organization. One particular type of
in-house database is the library catalogue.
 Online: Online IR retrieves data from web sites, web pages, and servers that
may include databases, images, text, tables, and other types.

Features of an information retrieval system

Liston and Schoene suggest that an effective information retrieval system must have provisions for:
 Prompt dissemination of information
 Filtering of information
 The right amount of information at the right time
 Active switching of information
 Receiving information in an economical way
 Browsing
 Getting information in an economical way
 Current literature
 Access to other information systems
 Interpersonal communications, and
 Personalized help.

IR and Related Areas

1. Database Management
2. Library and Information Science
3. Artificial Intelligence
4. Natural Language Processing
5. Machine Learning

1. Database Management
• Focused on structured data stored in relational tables rather than free-form text.
• Focused on efficient processing of well-defined queries in a formal language (SQL).
• Clearer semantics for both data and queries.
• Recent move towards semi-structured data (XML) brings it closer to IR.
2. Library and Information Science
• Focused on the human user aspects of information retrieval (human-computer interaction,
user interface, visualization).
• Concerned with effective categorization of human knowledge.
• Concerned with citation analysis and bibliometrics (structure of information).
• Recent work on digital libraries brings it closer to CS & IR.
3. Artificial Intelligence
• Focused on the representation of knowledge, reasoning, and intelligent action.
• Formalisms for representing knowledge and queries:

– First-order Predicate Logic
– Bayesian Networks
• Recent work on web ontologies and intelligent information agents brings it closer to IR.
4. Natural Language Processing
• Focused on the syntactic, semantic, and pragmatic analysis of natural language text and
discourse.
• Ability to analyze syntax (phrase structure) and semantics could allow retrieval based on
meaning rather than keywords.
Natural Language Processing: IR Directions
• Methods for determining the sense of an ambiguous word based on context
(word sense disambiguation).
• Methods for identifying specific pieces of information in a document (information extraction).
• Methods for answering specific NL questions from document corpora or structured data like
FreeBase or Google’s Knowledge Graph.
5. Machine Learning
• Focused on the development of computational systems that improve their performance with
experience.
• Automated classification of examples based on learning concepts from labeled training
examples (supervised learning).
• Automated methods for clustering unlabeled examples into meaningful groups (unsupervised
learning).
Machine Learning: IR Directions
• Text Categorization
– Automatic hierarchical classification (Yahoo).
– Adaptive filtering/routing/recommending.
– Automated spam filtering.
• Text Clustering
– Clustering of IR query results.
– Automatic formation of hierarchies (Yahoo).
• Learning for Information Extraction
• Text Mining
• Learning to Rank


2.1 History of Information Retrieval

1950: The term "information retrieval" was coined by Calvin Mooers.
1951: Philip Bagley conducted the earliest experiment in computerized document retrieval in a master's
thesis at MIT.
1955: Allen Kent joined Western Reserve University and, with colleagues, published a paper in American
Documentation describing the precision and recall measures as well as detailing a proposed
"framework" for evaluating an IR system, which included statistical sampling methods for determining
the number of relevant documents not retrieved.
1959: Hans Peter Luhn published "Auto-encoding of documents for information retrieval."


 Initial exploration of text retrieval systems for “small” corpora of scientific abstracts, and law
and business documents.
 Development of the basic Boolean and vector-space models of retrieval.
 Prof. Salton and his students at Cornell University are the leading researchers in the area.
 early 1960s: Gerard Salton began work on IR at Harvard, later moved to Cornell.
 1963: Joseph Becker and Robert M. Hayes published a text on information retrieval. Becker,
Joseph; Hayes, Robert Mayo. Information storage and retrieval: tools, elements, theories. New
York, Wiley (1963).
 Karen Spärck Jones finished her thesis at Cambridge, Synonymy and Semantic Classification,
and continued work on computational linguistics as it applies to IR.

 The National Bureau of Standards sponsored a symposium titled "Statistical Association
Methods for Mechanized Documentation." Several highly significant papers were presented, including G.
Salton's first published reference (we believe) to the SMART system.

National Library of Medicine developed MEDLARS Medical Literature Analysis and Retrieval System,
the first major machine-readable database and batch-retrieval system.
Project Intrex at MIT.
1965: J. C. R. Licklider published Libraries of the Future.

late 1960s: F. Wilfrid Lancaster completed evaluation studies of the MEDLARS system and published
the first edition of his text on information retrieval.
1968: Gerard Salton published Automatic Information Organization and Retrieval. John W. Sammon,
Jr.'s RADC Tech report "Some Mathematics of Information Storage and Retrieval..." outlined the
vector model.
1969: Sammon's "A nonlinear mapping for data structure analysis" (IEEE Transactions on Computers)
was the first proposal for visualization interface to an IR system.

Early 1970s: First online systems: NLM's AIM-TWX, MEDLINE; Lockheed's Dialog; SDC's ORBIT.
1971: Nicholas Jardine and Cornelis J. van Rijsbergen published "The use of hierarchic clustering in
information retrieval", which articulated the "cluster hypothesis."
1975: Three highly influential publications by Salton fully articulated his vector processing framework
and term discrimination model, including A Theory of Indexing (Society for Industrial and Applied Mathematics).
1979: C. J. van Rijsbergen published Information Retrieval (Butterworths). Heavy emphasis on
probabilistic models.
1979: Tamas Doszkocs implemented the CITE natural language user interface for MEDLINE at the
National Library of Medicine. The CITE system supported free form query input, ranked output and
relevance feedback.


 Large document database systems, many run by companies:

o Lexis-Nexis
o Dialog
1982: Nicholas J. Belkin, Robert N. Oddy, and Helen M. Brooks proposed the ASK (Anomalous State
of Knowledge) viewpoint for information retrieval. This was an important concept, though their
automated analysis tool proved ultimately disappointing.
1983: Salton (and Michael J. McGill) published Introduction to Modern Information Retrieval
(McGraw-Hill), with heavy emphasis on vector space models.

Mid-1980s: Efforts to develop end-user versions of commercial IR systems.
1989: First World Wide Web proposals by Tim Berners-Lee at CERN.


 Searching FTPable documents on the Internet

a) Archie
 Searching the World Wide Web
a) Lycos
b) Yahoo
c) Altavista
 Organized Competitions
 Recommender Systems
a) Ringo
b) Amazon
c) NetPerceptions
 Automated Text Categorization & Clustering

1992: First TREC conference.

1997: Publication of Korfhage's Information Storage and Retrieval with emphasis on visualization and
multi-reference point systems.
Late 1990s: Web search engines implemented many features formerly found only in experimental IR
systems. Search engines became the most common and maybe best instantiation of IR models.

More applications, especially Web search and interactions with other fields like Learning to rank,
Scalability (e.g., MapReduce), Real-time search
 Link analysis for Web Search
o Google
 Automated Information Extraction
o Whizbang
o Fetch
o Burning Glass

 Question Answering

o TREC Q/A track

 Multimedia IR
o Image
o Video
o Audio and music
 Cross-Language IR
o DARPA Tides
 Document Summarization
 Learning to Rank

 The IR problem: the primary goal of an IR system is to retrieve all the documents that are
relevant to a user query while retrieving as few non-relevant documents as possible.
 The difficulty is knowing not only how to extract information from the documents but also
how to use it to decide relevance.
 That is, the notion of relevance is of central importance in IR.

ISSUES IN IR (Bruce Croft)

Information retrieval is concerned with representing, searching, and manipulating large
collections of electronic text and other human-language data.

Three Big Issues in IR

1. Relevance


 One main issue is that relevance is a personal assessment that depends on the task being solved
and its context.
For example:
Relevance can change
a) with time (e.g., new information becomes available),
b) with location (e.g., the most relevant answer is the closest one), or
c) even with the device (e.g., the best answer is a short document that is easier to download and
visualize).
 It is the fundamental concept in IR.
 A relevant document contains the information that a person was looking for when she
submitted a query to the search engine.
 There are many factors that go into a person’s decision as to whether a document is relevant.
 These factors must be taken into account when designing algorithms for comparing text
and ranking documents.
 Simply comparing the text of a query with the text of a document and looking for an exact
match, as might be done in a database system, produces very poor results in terms of relevance.
To address the issue of relevance, retrieval models are used.

 A retrieval model is a formal representation of the process of matching a query and a
document. It is the basis of the ranking algorithm that is used in a search engine to produce the
ranked list of documents.
 A good retrieval model will find documents that are likely to be considered relevant by the
person who submitted the query.
 The retrieval models used in IR typically model the statistical properties of text rather than the
linguistic structure. For example, the ranking algorithms are more concerned with the counts of
word occurrences than with whether the word is a noun or an adjective.

2. Evaluation
 Two of the evaluation measures are precision and recall.
Precision is the proportion of retrieved documents that are relevant.
Recall is the proportion of relevant documents that are retrieved.
Precision = |Relevant documents ∩ Retrieved documents| / |Retrieved documents|
Recall = |Relevant documents ∩ Retrieved documents| / |Relevant documents|
 When the recall measure is used, there is an assumption that all the relevant documents for a
given query are known. Such an assumption is clearly problematic in a web search
environment, but with smaller test collection of documents, this measure can be useful. It is
not suitable for large volumes of log data.
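
The two measures above can be computed directly from sets of document identifiers. The following is a minimal sketch; the function name precision_recall and the document IDs are made up for illustration, and it assumes the full set of relevant documents is known, as it would be in a small test collection.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for one query, given collections of document IDs."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = retrieved & relevant  # relevant documents that were actually retrieved
    precision = len(hits) / len(retrieved) if retrieved else 0.0
    recall = len(hits) / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical example: 4 documents retrieved, 3 documents judged relevant.
retrieved = ["d1", "d2", "d3", "d4"]
relevant = ["d2", "d4", "d7"]
p, r = precision_recall(retrieved, relevant)
print(f"precision={p:.2f} recall={r:.2f}")  # precision=0.50 recall=0.67
```

Here two of the four retrieved documents are relevant (precision 2/4), and two of the three relevant documents were found (recall 2/3).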

3. Emphasis on users and their information needs

 The users of a search engine are the ultimate judges of quality. This has led to numerous
studies on how people interact with search engines and in particular, to the development of
techniques to help people express their information needs.
 Text queries are often poor descriptions of what the user actually wants compared to the
request to a database system, such as for the balance of a bank account.
 Despite their lack of specificity, one-word queries are very common in web search. A one-
word query such as “cats” could be a request for information on where to buy cats or for a
description of the Cats (musical).
 Techniques such as query suggestion, query expansion and relevance feedback use interaction
and context to refine the initial query in order to produce better ranked results.

Main problems
 Document and query indexing
o How to represent their contents?
 Query evaluation
o To what extent does a document correspond to a query?
 System evaluation
o How good is a system?
o Are the retrieved documents relevant? (precision)
o Are all the relevant documents retrieved? (recall)

Why is IR difficult?
 Vocabularies mismatching
o Language can express the same concepts in many different ways, with
different words. This is referred to as the vocabulary mismatch problem in information
retrieval.
o E.g. Synonymy: car vs automobile
 Queries are ambiguous

 Content representation may be inadequate and incomplete
 The user is the ultimate judge, but we don’t know how the judge judges.

Challenges in IR
 Scale, distribution of documents
 Controversy over the unit of indexing
 High heterogeneity
 Retrieval strategies


4.1 The User's Task

The User Task: The user of a retrieval system has to translate his information need into a query in the
language provided by the system.

With an information retrieval system, this normally implies specifying a set of words which convey the
semantics of the information need.

a) Consider a user who seeks information on a topic of their interest. This user first translates their
information need into a query, which requires specifying the words that compose the query. In this case,
we say that the user is searching or querying for information of their interest.

b) Consider now a user who has an interest that is either poorly defined or inherently broad

For instance, the user has an interest in car racing and wants to browse documents on Formula 1 and
Formula Indy. In this case, we say that the user is browsing or navigating the documents of the
collection.
 The general objective of an Information Retrieval System is to minimize the time it takes for a
user to locate the information they need.

 The goal is to provide the information needed to satisfy the user's question. Satisfaction does not
necessarily mean finding all information on a particular issue.

 The user of a retrieval system has to translate his information need into a query in the language
provided by the system.

 With an information retrieval system, this normally implies specifying a set of words which
convey the semantics of the information need.

 With a data retrieval system, a query expression (such as, for instance, a regular expression) is
used to convey the constraints that must be satisfied by objects in the answer set.

 In both cases, we say that the user searches for useful information executing a retrieval task.

 Consider now a user who has an interest which is either poorly defined or which is inherently
broad.

 For instance, the user might be interested in documents about car racing in general.

 In this situation, the user might use an interactive interface to simply look around in the
collection for documents related to car racing.


 For instance, he might find interesting documents about Formula 1 racing, about car
manufacturers, or about the `24 Hours of Le Mans.'

 We say that the user is browsing or navigating the documents in the collection, not
searching.

 It is still a process of retrieving information, but one whose main objectives are less
clearly defined in the beginning.

 Furthermore, while reading about the `24 Hours of Le Mans', he might turn his attention to a
document which provides directions to Le Mans and, from there, to documents which cover
tourism in France.

 In this situation, we say that the user is browsing the documents in the collection, not searching.

 It is still a process of retrieving information, but one whose main objectives are not clearly
defined in the beginning and whose purpose might change during the interaction with the
system.

 The task in this case is more related to exploratory search and resembles a process of quasi-
sequential search for information of interest.
 Here we make a clear distinction between the different tasks the user of the retrieval system
might be engaged in.
 The task might be then of two distinct types: searching and browsing, as illustrated in Figure:

 When the main objectives are not clearly defined in the beginning and the purpose might
change during the interaction with the system, the user task may consist of browsing only.

User Choice of Information Retrieval:

a) Push

b) Pull

 Both retrieval and browsing are, in the language of the World Wide Web, `pulling' actions.

 That is, the user requests the information in an interactive manner.

 An alternative is to do retrieval in an automatic and permanent fashion using software agents
which push the information towards the user.

 For instance, information useful to a user could be extracted periodically from a news service.

 In this case, we say that the IR system is executing a particular retrieval task which consists
of filtering relevant information for later inspection by the user.


What is Retrieval?
IR deals mostly with unstructured data, but this does not mean that there is no structure in the data:
 Document structure (headings, paragraphs, lists, ...)
 Explicit markup formatting (e.g., in HTML, XML, ...)
 Linguistic structure (latent, hidden)
A data retrieval query, in contrast, is stated formally, e.g., in SQL:
SELECT * FROM business_catalogue WHERE category = 'florist' AND city_zip = 'cb1'

 In an IR system the retrieved objects might be inaccurate and small errors are likely to go
unnoticed.
 In a data retrieval system, on the contrary, a single erroneous object among a thousand
retrieved objects means total failure.
 A data retrieval system, such as a relational database, deals with data that has a well-defined
structure and semantics.

Difference between data retrieval and information retrieval

Information Retrieval vs Information Extraction

 Information Retrieval: Given a set of terms and a set of document terms, select only the
most relevant documents (precision), and preferably all the relevant ones (recall).
 Information Extraction: Extract from the text what the document means.
 Data retrieval: the task of determining which documents of a collection contain the keywords
in the user query

Data retrieval system

Ex: relational databases
Deals with data that has a well defined structure and semantics
Data retrieval does not solve the problem of retrieving information about a subject or topic

Parameters              Databases/Data Retrieval               Information Retrieval
Example                 Database query                         WWW search
What we are retrieving  Structured data                        Mostly unstructured
Queries we are posing   Formally defined queries, unambiguous  Expressed in natural language
Matching                Exact                                  Partial match, best match
Inference               Deduction                              Induction
Model                   Deterministic                          Probabilistic

a) Text operations
b) Indexing
c) Searching
d) Ranking
e) User Interface
f) Query operations

The above figure shows the architecture of IR System with the Specified Components

a) Text operations:
Text operations form index words (tokens).
Examples: stop word removal, stemming.

b) Indexing:
Indexing constructs an inverted index of word to document pointers.

c) Searching:
Searching retrieves documents that contain a given query token from the inverted index.

d) Ranking :
Ranking scores all retrieved documents according to a relevance metric.

e) User Interface:
User Interface manages interaction with the user:
Query input and document output.
Relevance feedback.
Visualization of results.

f) Query Operations:
Query Operations transform the query to improve retrieval:
 Query expansion using a thesaurus.
 Query transformation using relevance feedback.

First of all, before the retrieval process can even be initiated, it is necessary to define the text database.
This is usually done by the manager of the database, who specifies the following:
(a) the documents to be used,
(b) the operations to be performed on the text, and
(c) the text model (i.e., the text structure and what elements can be retrieved).

 The text operations transform the original documents and generate a logical view of them.
 Once the logical view of the documents is defined, the database manager builds an index of the
text.
 An index is a critical data structure because it allows fast searching over large volumes of data.
Different index structures might be used, but the most popular one is the inverted file.
 The resources (time and storage space) spent on defining the text database and building the index
are amortized by querying the retrieval system many times.
 Given that the document database is indexed, the retrieval process can be initiated.
 The user first specifies a user need which is then parsed and transformed by the same text
operations applied to the text.
 Then, query operations might be applied before the actual query, which provides a system
representation for the user need, is generated.
 The query is then processed to obtain the retrieved documents. Fast query processing is made
possible by the index structure previously built.
 Before being sent to the user, the retrieved documents are ranked according to a likelihood of
relevance. The user then examines the set of ranked documents in the search for useful
information.
 At this point, he might pinpoint a subset of the documents seen as definitely of interest and
initiate a user feedback cycle.
 In such a cycle, the system uses the documents selected by the user to change the query
formulation. Hopefully, this modified query is a better representation of the real user need.
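
One simple way to picture the user feedback cycle is to add to the query the terms that occur most often in the documents the user marked as interesting. The sketch below uses plain term counting, which is an assumption made for illustration; it is not the method of any specific system (classical approaches such as Rocchio's reweight the query vector instead). The function name feedback_query and the sample texts are hypothetical.

```python
from collections import Counter


def feedback_query(query_terms, selected_docs, n_new_terms=2):
    """Return the query extended with the most frequent new terms
    found in the documents the user selected as relevant."""
    counts = Counter()
    for text in selected_docs:
        counts.update(t for t in text.lower().split() if t not in query_terms)
    new_terms = [term for term, _ in counts.most_common(n_new_terms)]
    return list(query_terms) + new_terms


# Hypothetical documents the user marked as definitely of interest.
selected = [
    "formula racing car season formula",
    "formula car drivers season",
]
modified_query = feedback_query(["car", "racing"], selected)
print(modified_query)  # ['car', 'racing', 'formula', 'season']
```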


Basic IR System Architecture (Stefan Buettcher)

Information retrieval is concerned with representing, searching, and manipulating large
collections of electronic text and other human-language data.

 Before conducting a search, a user has an information need, which underlies and drives the
search process.
 This information need is sometimes referred to as a topic, particularly when it is presented in
written form as part of a test collection for IR evaluation.
 As a result of the information need, the user constructs and issues a query to the IR system. This
query consists of a small number of terms, with two or three terms being typical for a Web
search.

Figure illustrates the major components in an IR system.

 The primary data structure of most IR systems is the inverted index.
 An inverted index is a data structure that lists, for every word, all documents that
contain it and the frequency of its occurrences in each document.
 It makes it easy to search for 'hits' of a query word.
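
A minimal sketch of such an inverted index follows: a dictionary that maps each word to the documents containing it, together with the word's frequency in each document. The function name build_inverted_index and the three sample documents are assumptions for illustration; tokenization is reduced to whitespace splitting.

```python
from collections import defaultdict


def build_inverted_index(docs):
    """Map each word to {doc_id: number of occurrences in that document}."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for word in text.lower().split():
            index[word][doc_id] = index[word].get(doc_id, 0) + 1
    return index


# Hypothetical three-document collection.
docs = {
    1: "information retrieval and web search",
    2: "web search engines rank web pages",
    3: "data retrieval uses exact match",
}
index = build_inverted_index(docs)
print(index["web"])                # {1: 1, 2: 2}
print(sorted(index["retrieval"]))  # [1, 3]
```

Finding the 'hits' for a query word is then a single dictionary lookup rather than a scan over every document.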

 Depending on the information need, a query term may be a date, a number, a musical note, or a
phrase. Wildcard operators and other partial-match operators may also be permitted in query
terms.
 For example, the term “inform*” might match any word starting with that prefix (“inform”,
“informs”, “informal”, “informant”, “informative”, etc.).
 Although users typically issue simple keyword queries, IR systems often support a richer query
syntax, frequently with complex Boolean and pattern matching operators.
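
The trailing-wildcard example above can be sketched as a prefix lookup in a sorted vocabulary. This is one possible approach, chosen here as an assumption because a sorted term list allows a binary search to the first candidate; the function name prefix_matches and the vocabulary are invented for the example.

```python
from bisect import bisect_left


def prefix_matches(sorted_vocab, pattern):
    """Expand a trailing-wildcard pattern such as 'inform*' against a sorted vocabulary."""
    if not pattern.endswith("*"):
        return [pattern] if pattern in sorted_vocab else []
    prefix = pattern[:-1]
    start = bisect_left(sorted_vocab, prefix)  # first term >= prefix
    matches = []
    for term in sorted_vocab[start:]:
        if not term.startswith(prefix):
            break  # sorted order: no later term can share the prefix
        matches.append(term)
    return matches


# Hypothetical vocabulary of index terms.
vocab = sorted(["index", "inform", "informal", "informant",
                "informative", "informs", "ink", "retrieval"])
print(prefix_matches(vocab, "inform*"))
# ['inform', 'informal', 'informant', 'informative', 'informs']
```

Each matching term would then be looked up in the inverted index and the resulting document lists merged.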

 These facilities may be used to limit a search to a particular Web site, to specify constraints on
fields such as author and title, or to apply other filters, restricting the search to a subset of the
collection.
 A user interface mediates between the user and the IR system, simplifying the query creation
process when these richer query facilities are required.

 The user’s query is processed by a search engine, which may be running on the user’s local
machine, on a large cluster of machines in a remote geographic location, or anywhere in
between.
 A major task of a search engine is to maintain and manipulate an inverted index for a
document collection.
 This index forms the principal data structure used by the engine for searching and relevance
ranking. As its basic function, an inverted index provides a mapping between terms and the
locations in the collection in which they occur.

 To support relevance ranking algorithms, the search engine maintains collection statistics
associated with the index, such as the number of documents containing each term and the length
of each document.
 In addition the search engine usually has access to the original content of the documents in order
to report meaningful results back to the user.
 Using the inverted index, collection statistics, and other data, the search engine accepts queries
from its users, processes these queries, and returns ranked lists of results.
 To perform relevance ranking, the search engine computes a score, sometimes called a retrieval
status value (RSV), for each document.
 After sorting documents according to their scores, the result list must be subjected to further
processing, such as the removal of duplicate or redundant results.
 For example, a Web search engine might report only one or two results from a single host or domain,
eliminating the others in favor of pages from different sources.
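
The scoring and post-processing steps above can be sketched as follows. The score used here is a simple length-normalised TF-IDF sum, which is an assumption chosen for illustration and not the formula of any particular engine; it relies on the two collection statistics mentioned above (number of documents containing each term, and document length). The URLs, the function names, and the limit of two results per host are hypothetical.

```python
import math
from collections import Counter
from urllib.parse import urlparse


def score_documents(query_terms, docs):
    """Return {url: score}, where the score plays the role of a retrieval status value."""
    n_docs = len(docs)
    tokens = {url: text.lower().split() for url, text in docs.items()}
    # Collection statistic: number of documents containing each term.
    doc_freq = Counter(t for toks in tokens.values() for t in set(toks))
    scores = {}
    for url, toks in tokens.items():
        tf = Counter(toks)
        score = sum(
            (tf[t] / len(toks)) * math.log(n_docs / doc_freq[t])
            for t in query_terms if tf[t] > 0
        )
        if score > 0:
            scores[url] = score
    return scores


def rank_with_host_limit(scores, max_per_host=2):
    """Sort by score, then keep at most max_per_host results from any one host."""
    ranked = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
    per_host = Counter()
    results = []
    for url, _ in ranked:
        host = urlparse(url).netloc
        if per_host[host] < max_per_host:
            per_host[host] += 1
            results.append(url)
    return results


# Hypothetical collection keyed by URL.
docs = {
    "http://a.example/1": "information retrieval systems",
    "http://a.example/2": "information retrieval information",
    "http://a.example/3": "retrieval of information from text",
    "http://b.example/1": "web search and information retrieval on the web",
    "http://c.example/1": "cooking recipes",
}
results = rank_with_host_limit(score_documents(["information", "retrieval"], docs))
print(results)
# ['http://a.example/2', 'http://a.example/1', 'http://b.example/1']
```

The third page from a.example scores higher than the page from b.example, but it is dropped by the per-host limit so that a different source can appear in the result list.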


7.1 The Retrieval Process:

An information retrieval system thus has three major components: the document subsystem, the user
subsystem, and the searching/retrieval subsystem.

These divisions are quite broad and each one is designed to serve one or more functions, such as:

• Analysis of documents and organization of information (creation of a document database)

• Analysis of user’s queries, preparation of a strategy to search the database

• Actual searching or matching of users queries with the database, and finally

• Retrieval of items that fully or partially match the search statement.

 To describe the retrieval process, we use a simple and generic software architecture as shown in
the below Figure:

 First of all, before the retrieval process can even be initiated, it is necessary to define the text
database.

Problem (user subsystem)
 Related to users’ task, situation
o vary in specificity, clarity
 Produces information need
o ultimate criterion for effectiveness of retrieval
 how well was the need met?
 Information need for the same problem may change, evolve, shift during the IR process
o adjustment in searching
o often more than one search for same problem over time
Representation (user subsystem)
 Converting a concept to query.
 What we search for.
 These are stemmed and corrected using dictionary.
 Focus toward a good result
 Subject to feedback changes
Query - search statement (user & system)
 Translation into systems requirements & limits
o start of human-computer interaction
 query is the thing that goes into the computer
 Selection of files, resources
 Search strategy - selection of:
o search terms & logic
o possible fields, delimiters
o controlled & uncontrolled vocabulary
o variations in effectiveness tactics
 Reiterations from feedback
o several feedback types: relevance feedback, magnitude feedback, etc.
o query expansion & modification
Matching - searching (Searching subsystem)
 Process of matching, comparing
o search: what documents in the file match the query as stated?
 Various search algorithms:
o exact match - Boolean
 still available in most, if not all systems
o best match - ranking by relevance
 increasingly used e.g. on the web
o hybrids incorporating both
 e.g. Target, Rank in DIALOG
 Each has strengths, weaknesses
o No ‘perfect’ method exists and probably never will
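The difference between exact match and best match can be sketched on a toy collection. This is only an illustration under simple assumptions: documents are split on whitespace, the best-match score is just the number of distinct query terms a document contains, and the names `boolean_and` and `best_match` are hypothetical.

```python
docs = {
    1: "web search engines rank pages",
    2: "boolean search in library catalogs",
    3: "library catalogs on the web",
}

def boolean_and(query, docs):
    """Exact match: ids of documents that contain every query term."""
    terms = set(query.split())
    return sorted(d for d, text in docs.items() if terms <= set(text.split()))

def best_match(query, docs):
    """Best match: ids of documents sharing at least one query term,
    ordered by how many distinct query terms they contain."""
    terms = set(query.split())
    overlap = {d: len(terms & set(text.split())) for d, text in docs.items()}
    hits = [d for d in overlap if overlap[d] > 0]
    return sorted(hits, key=lambda d: (-overlap[d], d))

query = "web library catalogs"
print(boolean_and(query, docs))  # only the document with all three terms
print(best_match(query, docs))   # partial matches are kept, but ranked lower
```

The Boolean search returns a small unordered set and silently drops partial matches, while the best-match search returns more documents and relies on the ordering to surface the good ones.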
Retrieved documents - from system to user (IR subsystem)
 Various order of output:
o Last In First Out (LIFO); sorted
o ranked by relevance
o ranked by other characteristics
 Various forms of output
 When citations only: possible links to document delivery
 Base for relevance, utility evaluation by users
 Relevance feedback

Document Retrieval
 Defining the text database is usually done by the manager of the database, who specifies the following:
 the documents to be used,
 the operations to be performed on the text, and
 the text model (i.e., the text structure and what elements can be retrieved).
 The text operations transform the original documents and generate a logical view of them.

 Once the logical view of the documents is defined, the database manager (using the DB
Manager Module) builds an index of the text.

 An index is a critical data structure because it allows fast searching over large volumes of data.
Different index structures might be used, but the most popular one is the inverted index as
indicated in Figure.

 The resources (time and storage space) spent on defining the text database and building the
index are amortized by querying the retrieval system many times. Given that the document
database is indexed, the retrieval process can be initiated.

 The user first specifies a user need, which is then parsed and transformed by the same text
operations applied to the text. Then, query operations might be applied before the actual query,
which provides a system representation for the user need, is generated.

 The query is then processed to obtain the retrieved documents. Fast query processing is made
possible by the index structure previously built.

 Before being sent to the user, the retrieved documents are ranked according to a likelihood of
relevance. The user then examines the set of ranked documents in the search for useful information.

 At this point, he might pinpoint a subset of the documents seen as definitely of interest and
initiate a user feedback cycle.

 In such a cycle, the system uses the documents selected by the user to change the query
formulation. Hopefully, this modified query is a better representation of the real user need.
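One simple way to picture the feedback cycle is query expansion: terms that occur often in the documents the user selected are added to the query. The sketch below is a deliberately simplified illustration and not the procedure of any specific system; the function name `expand_query`, the `extra_terms` parameter, and the sample documents are assumptions.

```python
from collections import Counter

def expand_query(query_terms, selected_docs, extra_terms=2):
    """Add the most frequent new terms from the user-selected documents
    to the original query (ties are broken alphabetically)."""
    counts = Counter()
    for text in selected_docs:
        counts.update(t for t in text.lower().split() if t not in query_terms)
    ordered = sorted(counts, key=lambda t: (-counts[t], t))
    return list(query_terms) + ordered[:extra_terms]

# Hypothetical session: the user marked two documents as of interest.
selected = ["jaguar car engine", "jaguar car engine speed"]
new_query = expand_query(["jaguar"], selected)
print(new_query)
```

Real systems usually weight the added terms rather than treating them equally with the original query terms; that refinement is left out here to keep the sketch short.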



Tim Berners-Lee conceived the conceptual Web in 1989, tested it successfully in December
of 1990, and released the first Web server early in 1991.
It was called the World Wide Web, and is referred to as the Web. At that time, no one could have
imagined the impact that the Web would have.
The Web boom, characterized by exponential growth in the volume of data and information,
implies that various daily tasks such as e-commerce, banking, research, entertainment, and personal
communication can no longer be done outside the Web if convenience and low cost are to be granted.
The amount of textual data available on the Web is estimated to be in the order of petabytes.
In addition, other media, such as images, audio, and video, are also available in even greater
volumes. Thus, the Web can be seen as a very large, public and unstructured but ubiquitous data
repository, which triggers the need for efficient tools to manage, retrieve, and filter information
from the Web.
As a result, Web search engines have become one of the most used tools on the Internet.
Additionally, information finding is also becoming more important in large Intranets, in which one
might need to extract or infer new information to support a decision process, a task called data mining
(or Web mining for the particular case of the Web).
The very large volume of data available, combined with the fast pace of change, makes the
retrieval of relevant information from the Web a really hard task.
To cope with the fast pace of change, efficient crawling of the Web has become essential.
In spite of recent progress in image and non-textual data search in general, the existing
techniques do not scale up well on the Web.

Exploring the Web

There are basically two main forms of exploring the Web.

 Issue a word-based query to a search engine that indexes a portion of the Web
 Browse the Web, which can be seen as a sequential search process of following
hyperlinks, as embodied for example, in Web directories that classify selected Web
documents by subject.
Additional methods exist, such as taking advantage of the hyperlink structure of the Web, yet
they are not fully available, and likely less well known and also much more complex.

A Challenging Problem
Let us now consider the main challenges posed by the Web with respect to search. We can
divide them into two classes: those that relate to the data itself, which we refer to as data-centric, and
those that relate to the users and their interaction with the data, which we refer to as interaction-centric.

Data-centric challenges
a) Distributed data.
• Due to the intrinsic nature of the Web, data spans over a large number of computers and
platforms. These computers are interconnected with no predefined topology and the
available bandwidth and reliability on the network interconnections vary widely.
b) High percentage of volatile data.
• Due to Internet dynamics, new computers and data can be added or removed easily. To
illustrate, early estimations showed that 50% of the Web changes in a few months. Search
engines are also confronted with dangling (or broken) links and relocation problems when
domain or file names change or disappear.
c) Large volume of data.
• The fast growth of the Web poses scaling issues that are difficult to cope with, as well as
dynamic Web pages, which are in practice unbounded.
d) Unstructured and redundant data.
• The Web is not a huge distributed hypertext system, as some might think, because it does
not follow a strict underlying conceptual model that would guarantee consistency. Indeed,
the Web is not well structured either at the global or at the individual HTML page level.
HTML pages are considered by some as semi-structured data in the best case. Moreover, a
great deal of Web data are duplicated either loosely (such as the case of news originating
from the same news wire) or strictly, via mirrors or copies. Approximately 30% of Web
pages are (near) duplicates. Semantic redundancy is probably much larger.
e) Quality of data
The Web can be considered as a new publishing medium. However, there is, in most cases, no
editorial process. So, data can be inaccurate, plain wrong, obsolete, invalid, poorly written or,
as is often the case, full of errors, either innocent (typos, grammatical mistakes, OCR errors,
etc.) or malicious. Typos and mistakes, especially in foreign names, are quite common.
f) Heterogeneous data
 Data not only originates from various media types, each coming in different formats, but it is
also expressed in a variety of languages, with various alphabets and scripts (e.g. India), which
can be pretty large (e.g. Chinese or Japanese Kanji).
Many of these challenges, such as the variety of data types and poor data quality, cannot be
solved by devising better algorithms and software, and will remain a reality simply because they are
problems and issues (consider, for instance, language diversity) that are intrinsic to human nature.
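The near-duplicate problem mentioned under item (d) is often illustrated with word shingles compared by Jaccard similarity. The sketch below shows that general idea only; the shingle length `k=3`, the function names, and the sample pages are assumptions, and large-scale systems use more compact fingerprints than full shingle sets.

```python
def shingles(text, k=3):
    """Set of all runs of k consecutive words in the text."""
    words = text.lower().split()
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}

def jaccard(a, b):
    """Size of the intersection divided by size of the union."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)

page_a = "the quick brown fox jumps over the lazy dog"
page_b = "the quick brown fox jumps over the sleepy dog"
page_c = "search engines crawl and index the web"

sim_ab = jaccard(shingles(page_a), shingles(page_b))
sim_ac = jaccard(shingles(page_a), shingles(page_c))
print(round(sim_ab, 3), sim_ac)
```

Pages whose similarity exceeds some chosen threshold would be treated as near duplicates; the threshold itself is a tuning decision that the text does not specify.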
Interaction-centric challenges
1) Expressing a query
 Human beings have needs or tasks to accomplish, which are frequently not easy to express as
queries.
 Queries, even when expressed in a more natural manner, are just a reflection of
information needs and are thus, by definition, imperfect. This phenomenon could be compared
to Plato’s cave metaphor, where shadows are mistaken for reality.
2) Interpreting results
 Even if the user is able to perfectly express a query, the answer might be split over thousands
or millions of Web pages or not exist at all. In this context, numerous questions need to be
addressed.
 In the current state of the Web, search engines need to deal with plain HTML and text, as well
as with other data types, such as multimedia objects, XML data and associated semantic
information, which can be dynamically generated and are inherently more complex.
 In this hypothetical world, IR would become easier, and even multimedia search would be
simplified.
 Spam would be much easier to avoid as well, as it would be easier to recognize good content.
 On the other hand, new retrieval problems would appear, such as XML processing and
retrieval, and Web mining on structured data, both at a very large scale.
IR Versus Web Search

 Traditional IR systems normally index a closed collection of documents, which are mainly
text-based and usually offer little linkage between documents.
 Traditional IR systems are often referred to as full-text retrieval systems. Libraries were among
the first to adopt IR to index their catalogs and later, to search through information which was
typically imprinted onto CD-ROMs.
 The main aim of traditional IR was to return relevant documents that satisfy the user’s
information need.
 Although the main goal of satisfying the user’s need is still the central issue in web IR (or
web search), there are some very specific challenges that web search poses that have required
new and innovative solutions.
 The first important difference is the scale of web search, as we have seen that the current size
of the web is approximately 600 billion pages.
 This is well beyond the size of traditional document collections.
 The Web is dynamic in a way that was unimaginable to traditional IR in terms of its rate of
change and the different types of web pages, ranging from static types (HTML, portable
document format (PDF), DOC, Postscript, XLS) to a growing number of dynamic pages written
in scripting languages such as JSP, PHP or Flash. We also mention that a large number of
images, videos, and a growing number of programs are delivered to us through the Web.
 The Web also contains an enormous amount of duplication, estimated at about 30%. Such
redundancy is not present in traditional corpora and makes the search engine’s task even more
difficult.
 The quality of web pages varies dramatically; for example, some web sites create web pages
with the sole intention of manipulating the search engine’s ranking, documents may contain
misleading information, the information on some pages is just out of date, and the overall
quality of a web page may be poor in terms of its use of language and the amount of useful
information it contains. The issue of quality is of prime importance to web search engines as
they would very quickly lose their audience if, in the top- ranked positions, they presented to
users poor quality pages.
 The range of topics covered on the Web is completely open, as opposed to the closed
collections indexed by traditional IR systems, where the topics such as in library catalogues,
are much better defined and constrained.
 Another aspect of the Web is that it is globally distributed. This poses serious logistic
problems to search engines in building their indexes, and moreover, in delivering a service that
is being used from all over the globe. The sheer size of the problem is daunting, considering
that users will not tolerate anything but an immediate response to their query.
 Users also vary in their level of expertise, interests, information-seeking tasks, the
language(s) they understand, and in many other ways.
 Users also tend to submit short queries (two to three keywords), avoid the use of
anything but the basic search engine syntax, and when the results list is returned, most users do
not look at more than the top 10 results, and are unlikely to modify their query. This is all
contrary to typical usage of traditional IR.
 The hypertextual nature of the Web is also different from traditional document collections, in
giving users the ability to surf by following links.
 On the positive side (for the Web), there are many roads (or paths of links) that “lead to
Rome” and you need only find one of them, but often, users lose their way in the myriad of
choices they have to make.
 Another positive aspect of the Web is that it has provided and is providing impetus for the
development of many new tools, whose aim is to improve the user’s experience.

                      Classical IR            Web IR
Volume                Large                   Huge
Data quality          Clean, no duplicates    Noisy, duplicates
Data change rate      Infrequent              In flux
Data accessibility    Accessible              Partially accessible
Format diversity      Homogeneous             Widely diverse
Documents             Text                    HTML
No. of matches        Small                   Large
IR techniques         Content based           Link based


No. Differentiator          Web Search                            IR

1   Languages               Documents in many different           Databases usually cover only one
                            languages. Usually search engines     language or indexing of documents
                            use full text indexing; no            written in different languages with the
                            additional subject analysis.          same vocabulary.

2   File types              Several file types, some hard to      Usually all indexed documents have the
                            index because of a lack of textual    same format (e.g. PDF) or only
                            information.                          bibliographic information is provided.

3   Document length         Wide range from very short to         Document length varies, but not to such
                            very long. Longer documents are       a high degree as with the Web
                            often divided into parts.             documents.

4   Document structure      HTML documents are                    Structured documents allow
                            semi-structured.                      complex field searching.

5   Spam                    Search engines have to decide         Suitable document types are defined in
                            which documents are suitable for      the process of database design.
                            indexing.

6   Amount of data,         The actual size of the Web is         Exact amount of data can be determined
    size of databases       unknown. Complete indexing of         when using formal criteria.
                            the whole Web is impossible.

7   Type of queries         Users have little knowledge how       Users know the retrieval language;
                            to search; very short queries (2-3    longer, exact queries.
                            words).

8   User interface          Easy to use interfaces suitable for   Normally complex interfaces; practice
                            laypersons.                           needed to conduct searches.

9   Ranking                 Due to the large amount of hits       Relevance ranking is often not needed
                            relevance ranking is the norm.        because the users know how to constrain
                                                                  the amount of hits.

10  Search functions        Limited possibilities.                Complex query languages allow
                                                                  narrowing searches.


A search engine is the practical application of information retrieval techniques to large-scale
text collections. Search engines come in a number of configurations that reflect the applications they
are designed for.
Web search engines, such as Google and Yahoo!, must be able to capture, or crawl, many
terabytes of data, and then provide subsecond response times to millions of queries submitted every
day from around the world.
Search engine components support two major functions:

1. Indexing process
2. Query process

1. Indexing process

The indexing process builds the structures that enable searching, and the query process uses
those structures and a person’s query to produce a ranked list of documents. Figure 2.1 shows the
high-level “building blocks” of the indexing process.
These major components are
a) Text acquisition
b) Text transformation
c) Index creation

a) Text acquisition
 The task of the text acquisition component is to identify and make available the documents that
will be searched.
 Although in some cases this will involve simply using an existing collection, text acquisition
will more often require building a collection by crawling or scanning the Web, a corporate
intranet, a desktop, or other sources of information.
 In addition to passing documents to the next component in the indexing process, the text
acquisition component creates a document data store, which contains the text and metadata for all
the documents.
 Metadata is information about a document that is not part of the text content, such as the
document type (e.g., email or web page), document structure, and other features, such as
document length.

b) Text transformation
The text transformation component transforms documents into index terms or features.
Index terms, as the name implies, are the parts of a document that are stored in the index and
used in searching. The simplest index term is a word, but not every word may be used for searching.
A “feature” is more often used in the field of machine learning to refer to a part of a text
document that is used to represent its content, which also describes an index term. Examples of other
types of index terms or features are phrases, names of people, dates, and links in a web page. Index
terms are sometimes simply referred to as “terms.” The set of all the terms that are indexed for a
document collection is called the index vocabulary.
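A minimal sketch of text transformation, assuming that index terms are lowercased words and that a small stopword list removes the words not used for searching. The stopword list, the regular expression, and the function name `to_index_terms` are illustrative assumptions, not part of any specific engine.

```python
import re

# Tiny illustrative stopword list; real lists are much longer.
STOPWORDS = {"a", "an", "and", "is", "of", "the", "to"}

def to_index_terms(text):
    """Lowercase the text, split it into alphanumeric tokens,
    and drop stopwords."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return [t for t in tokens if t not in STOPWORDS]

docs = [
    "The Web is a repository of documents.",
    "Search engines index the Web.",
]

# The index vocabulary is the set of all terms indexed for the collection.
vocabulary = sorted({term for d in docs for term in to_index_terms(d)})
print(vocabulary)
```

Phrases, names, dates, and links would need additional extraction steps that this sketch leaves out.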

c) Index creation
The index creation component takes the output of the text transformation component and
creates the indexes or data structures that enable fast searching. Given the large number of
documents in many search applications, index creation must be efficient, both in terms of time and
space. Indexes must also be able to be efficiently updated when new documents are acquired.
Inverted indexes, or sometimes inverted files, are by far the most common form of index used
by search engines. An inverted index, very simply, contains a list for every index term of the
documents that contain that index term. It is inverted in the sense of being the opposite of a
document file that lists, for every document, the index terms it contains. There are many variations
of inverted indexes, and the particular form of index used is one of the most important aspects of a
search engine.
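The inversion described above can be sketched in a few lines: the input is a document file (each document with its index terms) and the output maps each term to the list of documents containing it. The function name `build_inverted_index` and the sample data are assumptions for illustration; production indexes also store positions, counts, and compressed postings.

```python
from collections import defaultdict

def build_inverted_index(doc_terms):
    """Turn {doc_id: [terms]} into {term: [doc_ids]} with sorted postings."""
    index = defaultdict(list)
    for doc_id in sorted(doc_terms):
        for term in sorted(set(doc_terms[doc_id])):
            index[term].append(doc_id)
    return dict(index)

# Document file: for every document, the index terms it contains.
doc_terms = {
    1: ["web", "search", "engine"],
    2: ["web", "crawler"],
    3: ["search", "index"],
}

index = build_inverted_index(doc_terms)
print(index["web"], index["search"])
```

Looking up a term is now a single dictionary access instead of a scan over every document, which is what makes fast searching possible.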

2. Query process

Figure 2.2 shows the building blocks of the query process.

The major components are

a) User interaction
b) Ranking
c) Evaluation

a) User interaction
 The user interaction component provides the interface between the person doing the
searching and the search engine. One task for this component is accepting the user’s query
and transforming it into index terms.
 Another task is to take the ranked list of documents from the search engine and organize it
into the results shown to the user.
 This includes, for example, generating the snippets used to summarize documents.
 The document data store is one of the sources of information used in generating the results.
Finally, this component also provides a range of techniques for refining the query so that it
better represents the information need.
b) Ranking
The ranking component is the core of the search engine. It takes the transformed query
from the user interaction component and generates a ranked list of documents using scores based
on a retrieval model. Ranking must be both efficient, since many queries may need to be
processed in a short time, and effective, since the quality of the ranking determines whether the
search engine accomplishes the goal of finding relevant information. The efficiency of ranking
depends on the indexes, and the effectiveness depends on the retrieval model.
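As one concrete instance of scoring with a retrieval model, the sketch below ranks documents by a basic TF-IDF sum. TF-IDF is chosen here only as a familiar example; the text does not say which model is used, and the function name `tf_idf_rank` and the sample documents are assumptions.

```python
import math
from collections import Counter

def tf_idf_rank(query_terms, docs):
    """Score each document as the sum over query terms of
    term frequency * log(N / document frequency), then sort by score."""
    n = len(docs)
    tfs = {d: Counter(terms) for d, terms in docs.items()}
    df = Counter(t for counts in tfs.values() for t in counts)
    scores = {}
    for d, counts in tfs.items():
        score = sum(counts[t] * math.log(n / df[t])
                    for t in query_terms if t in counts)
        if score > 0:
            scores[d] = score
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))

docs = {
    1: ["web", "search", "engine"],
    2: ["web", "crawler", "web"],
    3: ["search", "index"],
}
ranking = tf_idf_rank(["web", "crawler"], docs)
print([doc_id for doc_id, _ in ranking])
```

This version scans every document for clarity; an efficient implementation would read only the postings of the query terms from the inverted index, which is the sense in which efficiency depends on the indexes.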

c) Evaluation
The task of the evaluation component is to measure and monitor effectiveness and
efficiency. An important part of that is to record and analyze user behavior using log data. The
results of evaluation are used to tune and improve the ranking component. Most of the evaluation
component is not part of the online search engine, apart from logging user and system data.
Evaluation is primarily an offline activity, but it is a critical part of any search application.

 Since its inception, the Web has become a huge success. Well over 20 billion pages are now
available and accessible on the Web, and more than one fourth of humanity now accesses the Web on a
regular basis.

Why is the Web such a success?

What is the single most important characteristic of the Web that makes it so revolutionary?

In search of an answer, let us delve into the life of a writer who lived at the end of the 18th
century.

She finished the first draft of her novel in 1796.

The first attempt at publication was refused without a reading.

The novel was only published 15 years later! She got a flat fee of £110, which meant that she was
not paid anything for the many subsequent editions. Further, her authorship was anonymized under
the reference “By a Lady”.

• Pride and Prejudice is the second or third best loved novel in the UK ever, after The Lord of
the Rings and Harry Potter. It has been the subject of six TV series and five film versions. The last
of these, starring Keira Knightley and Matthew Macfadyen, grossed over 100 million dollars.

• Jane Austen published anonymously her entire life. Throughout the 20th century, her novels
have never been out of print. Jane Austen was discriminated against because there was no freedom to
publish at the beginning of the 19th century.

• The Web, unleashed by the inventiveness of Tim Berners-Lee, changed this once and for all.
It did so by universalizing freedom to publish. The Web moved mankind into a new era, into a new
time, into the e-Publishing Era.

The term "electronic publishing" is primarily used in the 2010s to refer to online and web-based
publishers, but the term has a history of being used to describe the development of new forms of
production, distribution, and user interaction in regard to computer-based production of text and
other interactive media.

The first digitization projects were transferring physical content into digital content. Electronic
publishing is aiming to integrate the whole process of editing and publishing (production, layout,
publication) in the digital world.

Traditional publishing, and especially the creation part, was first revolutionized by new
desktop publishing software appearing in the 1980s, and by the text databases created for
encyclopedias and directories.

At the same time, multimedia was developing quickly, combining book, audiovisual and
computer science characteristics. CDs and DVDs appeared, permitting the visualization of these
dictionaries and encyclopedias on computers.
The arrival and democratization of the Internet is slowly giving small publishing houses the opportunity
to publish their books directly online.

Some websites, like Amazon, let their users buy eBooks; Internet users can also find many
educational platforms (free or not), encyclopedic websites like Wikipedia, and even digital magazines.

The eBook then becomes more and more accessible through many different devices, such as
e-readers and even smartphones.

The digital book had, and still has, an important impact on publishing houses and their economic
models; it is still a moving domain, and they have yet to master the new ways of publishing in a
digital era.


Web search is today the most prominent application of IR and its techniques—the ranking and
indexing components of any search engine are fundamentally IR pieces of technology

The first major impact of the Web on search is related to the characteristics of the document
collection itself

• The Web is composed of pages distributed over millions of sites and connected
through hyperlinks
• This requires collecting all documents and storing copies of them in a central
repository, prior to indexing
• This new phase in the IR process, introduced by the Web, is called crawling

The second major impact of the Web on search is related to:

• The size of the collection

• The volume of user queries submitted on a daily basis
• As a consequence, performance and scalability have become critical characteristics of
the IR system
The third major impact: in a very large collection, predicting relevance is much harder than before

• Fortunately, the Web also includes new sources of evidence

• Ex: hyperlinks and user clicks in documents in the answer set

The fourth major impact derives from the fact that the Web is also a medium to do business

• Search problem has been extended beyond the seeking of text information to also
encompass other user needs
• Ex: the price of a book, the phone number of a hotel, the link for downloading a
The fifth major impact of the Web on search is Web spam

• Web spam: abusive availability of commercial information disguised in the form of
informational content
• This difficulty is so large that today we talk of Adversarial Web Retrieval.


Commercial transactions over the Internet are not yet a completely safe procedure.
Frequently, people are willing to exchange information as long as it does not become
public.

•Copyright and patent rights

It is far from clear how the wide spread of data on the Web affects copyright and patent laws in the
various countries.

•Log In Issue
One of the most common problems faced by online businesses is the inability to log in to the
control panel. You need easy access to the control panel for additions and deletions of content and
for other purposes.

•Frequent Technical Breakdown:

Running a website business effectively is only possible when all the functional parameters respond
to your input quickly and smoothly. Unfortunately, most of the time, this does not happen.

•Slow Performance of Web Server

A slow web server is one of the biggest headaches that businesses have to deal with. When your
customers encounter pages that load slowly, they tend to abandon their search and look elsewhere.
•Server Limitations:

A few hosting companies follow the undesirable business practice of not disclosing their limit in
terms of space and bandwidth. They try to serve more customers with their limited resources which
can result in major performance issues in the long term.


User interaction with search interfaces differs depending on

• the type of task
• the domain expertise of the information seeker
• the amount of time and effort available to invest in the process

Marchionini makes a distinction between information lookup and exploratory search

Information lookup tasks

1. are akin to fact retrieval or question answering
2. can be satisfied by discrete pieces of information: numbers, dates, names, or Web sites
3. can work well for standard Web search interactions

Exploratory search is divided into learning and investigating tasks

Learning search
1. requires more than single query-response pairs
2. requires the searcher to spend time
• scanning and reading multiple information items
• synthesizing content to form new understanding

Investigating refers to a longer-term process which

• involves multiple iterations that take place over perhaps very long periods of time
• may return results that are critically assessed before being integrated into personal and
professional knowledge bases
• may be concerned with finding a large proportion of the relevant information available

Classic × Dynamic Model

Classic notion of the information seeking process:
1. problem identification
2. articulation of information need(s)
3. query formulation
4. results evaluation
More recent models emphasize the dynamic nature of the search process
o The users learn as they search
o Their information needs adjust as they see retrieval results and other document surrogates
This dynamic process is sometimes referred to as the berry picking model of search

Navigation × Search

Navigation: the searcher looks at an information structure and browses among the available
information
This browsing strategy is preferable when the information structure is well-matched to the user’s
information need
o it is mentally less taxing to recognize a piece of information than it is to recall it
o it works well only so long as appropriate links are available
If the links are not available, then the browsing experience might be frustrating

Search Process

 Numerous studies have been made of people engaged in the search process
 The results of these studies can help guide the design of search interfaces
 One common observation is that users often reformulate their queries with slight modifications
 Another is that searchers often search for information that they have previously accessed. The
users’ search strategies differ when searching over previously seen materials.
 Researchers have developed search interfaces that support both query history and revisitation
 Studies also show that it is difficult for people to determine whether or not a document is
relevant to a topic. Other studies found that searchers tend to look at only the top-ranked
retrieved results. Further, they are biased towards thinking the top one or two results are
better than those beneath them.
 Studies also show that people are poor at estimating how much of the relevant material they
have found. Other studies have assessed the effects of knowledge of the search process itself.
 These studies have observed that experts use different strategies than novice searchers.


How does an information seeking session begin in online information systems?

• The most common way is to use a Web search engine
• Another method is to select a Web site from a personal collection of already-visited sites
• Online bookmark systems are popular among a smaller segment of users Ex: Delicious.com
• Web directories are also used as a common starting point, but have been largely replaced by
search engines
The primary methods for a searcher to express their information need are either
1. entering words into a search entry form

2. selecting links from a directory or other information organization display

For Web search engines, the query is specified in textual form. Typically, Web queries today are
very short, consisting of one to three words

Query Specification
Short queries reflect the standard usage scenario in which the user tests the waters:
• If the results do not look relevant, then the user reformulates their query
• If the results are promising, then the user navigates to the most relevant-looking web site
Query Specification Interface
The standard interface for a textual query is a search box entry form
Studies suggest a relationship between query length and the width of the entry form: either small
forms discourage long queries or wide forms encourage longer queries

On yelp.com, the user can refine the search by location using a second form

Notice that the yelp.com form also shows the user’s home location, if it has been specified

For instance, in zvents.com search, the first box is labeled “what are you looking for?”
Some interfaces show a list of query suggestions as the user types the query - this is referred to as
auto-complete, auto-suggest, or dynamic query suggestions
[Figure: Dynamic query suggestions, from Netflix.com]
[Figure: Dynamic query suggestions, grouped by type, from NextBio.com]
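Dynamic query suggestions can be sketched as a prefix lookup over a sorted list of known queries. This is only an illustration of the idea: the query list, the `limit` parameter, and the function name `suggest` are assumptions, and real interfaces typically rank suggestions by popularity rather than alphabetically.

```python
import bisect

def suggest(prefix, sorted_queries, limit=3):
    """Return up to `limit` queries starting with `prefix`,
    using binary search to find the first candidate."""
    start = bisect.bisect_left(sorted_queries, prefix)
    suggestions = []
    for query in sorted_queries[start:]:
        if not query.startswith(prefix) or len(suggestions) == limit:
            break
        suggestions.append(query)
    return suggestions

# Hypothetical list of previously seen queries, kept in sorted order.
queries = sorted(["rainbow", "rain radar", "rainfall", "railway", "raincoat"])
print(suggest("rain", queries))
```

Because the list is sorted, all queries sharing a prefix sit next to each other, so each keystroke costs one binary search plus a short scan.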


When displaying search results, either the documents must be shown in full, or else the searcher
must be presented with some kind of representation of the content of those documents
For example, a query on a term like “rainbow” may return sample images as one entry in the
results listing
A query on the name of a sports team might retrieve the latest game scores and a link to buy
tickets
There are tools to help users reformulate their query
• One technique consists of showing terms related to the query or to the documents retrieved
in response to the query
A special case of this is spelling corrections or suggestions
• Usually only one suggested alternative is shown: clicking on that alternative re-executes the
query
• In past years, the search results were shown using the purportedly incorrect spelling
 Relevance feedback is another method whose goal is to aid in query reformulation
 The main idea is to have the user indicate which documents are relevant to their query
• In some variations, users also indicate which terms extracted from those documents are
relevant
 The system then computes a new query from this information and shows a new
retrieval set.
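
The notes do not say how the new query is computed. One classic way to do it is Rocchio-style reweighting, sketched below as an illustration only: the function name `rocchio`, the `alpha`/`beta`/`gamma` defaults, and the toy term weights are all assumptions, not part of these notes.

```python
from collections import Counter

def rocchio(query, relevant, nonrelevant, alpha=1.0, beta=0.75, gamma=0.15):
    """Return a reweighted query as a {term: weight} dict.

    query, and each item of relevant / nonrelevant, is a {term: weight} dict.
    """
    new_query = Counter()
    for term, w in query.items():
        new_query[term] += alpha * w
    # Move the query toward the documents the user marked relevant ...
    for doc in relevant:
        for term, w in doc.items():
            new_query[term] += beta * w / len(relevant)
    # ... and away from the ones marked non-relevant.
    for doc in nonrelevant:
        for term, w in doc.items():
            new_query[term] -= gamma * w / len(nonrelevant)
    # Terms whose weight drops to zero or below are removed from the query.
    return {t: w for t, w in new_query.items() if w > 0}

# Hypothetical feedback round for the ambiguous query "jaguar".
expanded = rocchio(
    {"jaguar": 1.0},
    relevant=[{"jaguar": 1.0, "car": 2.0}],
    nonrelevant=[{"jaguar": 1.0, "cat": 2.0}],
)
```

Here the expanded query gains the term "car" from the relevant document, while "cat" ends with a negative weight and is dropped.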


 Organizing results into meaningful groups can help users understand the results and decide
what to do next
 Popular methods for grouping search results: category systems and clustering
 Category system: meaningful labels organized in such a way as to reflect the concepts
relevant to a domain
 The most commonly used category structures are flat, hierarchical, and faceted
categories. Most Web sites organize their information into general categories
 Clustering refers to the grouping of items according to some measure of similarity
 It groups together documents that are similar to one another but different from the rest of the
collection
 The greatest advantage of clustering is that it is fully automatable
 The disadvantages of clustering include unpredictability in the form and quality of results
and the difficulty of labeling the groups


The main components of a search engine are

1) Crawler
2) Indexer
3) Search index
4) Query engine
5) Search interface.

1) Crawler:
A web crawler is a software program that traverses web pages, downloads them for indexing, and
follows the hyperlinks that are referenced on the downloaded pages; a web crawler is also known as
a spider, a wanderer or a software robot.
2) Indexer:
The second component is the indexer, which is responsible for creating the search
index from the web pages it receives from the crawler.
3) Search Index:
The search index is a data repository containing all the information the search engine needs to
match and retrieve web pages. The type of data structure used to organize the index is known as an
inverted file.
4) Query Engine:
The query engine is the algorithmic heart of the search engine. The inner workings of a commercial
query engine are a well-guarded secret, since search engines fear web sites
that try to increase their ranking by unfairly exploiting the algorithms the search engine
uses to rank result pages.
5) Search Interface:
Once the query is processed, the query engine sends the results list to the search interface, which
displays the results on the user’s screen. The user interface provides the look and feel of the search
engine, allowing the user to submit queries, browse the results list, and click on chosen web pages
for further browsing.
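
To make the flow between the five components concrete, here is a toy sketch in which everything is simulated in memory. The page data, the URLs, and the function names `crawl`, `build_index`, and `search` are hypothetical; a real crawler would fetch pages over HTTP and a real query engine would also rank the results.

```python
import re
from collections import defaultdict

# Hypothetical in-memory "web": URL -> (page text, outgoing links).
PAGES = {
    "a.example": ("information retrieval basics", ["b.example"]),
    "b.example": ("web search and retrieval", ["a.example", "c.example"]),
    "c.example": ("cooking recipes", []),
}

def crawl(seed):
    """Crawler: follow hyperlinks from the seed and yield each page once."""
    seen, frontier = set(), [seed]
    while frontier:
        url = frontier.pop()
        if url in seen or url not in PAGES:
            continue
        seen.add(url)
        text, links = PAGES[url]
        frontier.extend(links)
        yield url, text

def build_index(pages):
    """Indexer: build the search index as an inverted file (term -> URLs)."""
    index = defaultdict(set)
    for url, text in pages:
        for term in re.findall(r"[a-z]+", text.lower()):
            index[term].add(url)
    return index

def search(index, query):
    """Query engine: return the URLs that contain every query term."""
    postings = [index.get(term, set()) for term in query.lower().split()]
    return sorted(set.intersection(*postings)) if postings else []

# The search interface would display this list to the user.
index = build_index(crawl("a.example"))
results = search(index, "web retrieval")
```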


Experimentation with visualization for search has been primarily applied in the following ways:
 Visualizing Boolean syntax
 Visualizing query terms within retrieval results
 Visualizing relationships among words and documents
 Visualization for text mining

Visualizing Boolean syntax

Boolean query syntax is difficult for most users and is rarely used in Web search. For
many years, researchers have experimented with how to visualize Boolean query specification.
A common approach is to show Venn diagrams. A more flexible version of this idea was seen
in the VQuery system, proposed by Steve Jones.

Visualizing Query Terms

Understanding the role of the query terms within the retrieved docs can help relevance
assessment. Experimental visualizations have been designed that make this role more explicit.
In the TileBars interface, for instance, documents are shown as horizontal glyphs, with the locations
of the query term hits marked along the glyph.
The user is encouraged to break the query into its different facets, with one concept per
line. Then, the lines show the frequency of occurrence of query terms within each topic.
The TileBars interface Representation

Words and Docs Relationships

 Numerous works proposed variations on the idea of placing words and docs on a two-
dimensional canvas
 In these works, proximity of glyphs represents semantic relationships among the
terms or documents
 An early version of this idea is the VIBE interface
 Documents containing combinations of the query terms are placed midway between
the icons representing those terms
 The Aduna Autofocus and the Lyberworld projects presented a 3D version of the ideas


Another idea is to map docs or words from a very high-dimensional term space down
into a 2D plane. The docs or words fall within that plane, using 2D or 3D

Visualization for Text Mining
 Visualization is also used for purposes of analysis and exploration of textual data
 Visualizations such as the Word Tree show a piece of a text concordance
 It allows the user to view which words and phrases commonly precede or follow a
given word
 The Word Tree visualization of Martin Luther King’s “I have a dream”
speech, from Wattenberg et al.

Visualization is also used in search interfaces intended for analysts. An example is
the TRIST information triage system, from Proulx et al. In this system, search results are
represented as document icons, so thousands of documents can be viewed in one display.





Basic IR Models - Boolean Model - TF-IDF (Term Frequency/Inverse Document
Frequency) Weighting - Vector Model – Probabilistic Model – Latent Semantic
Indexing Model – Neural Network Model – Retrieval Evaluation – Retrieval
Metrics – Precision and Recall – Reference Collection – User-based Evaluation –
Relevance Feedback and Query Expansion – Explicit Relevance Feedback.


1.1 Introduction to Modeling

What is modeling?
 Modeling in IR is a complex process aimed at producing a ranking function.
 Ranking function: a function that assigns scores to documents with regard to a given
query.

This process consists of two main tasks:

a) The conception of a logical framework for representing documents and queries
b) The definition of a ranking function that allows quantifying the similarities among
documents and queries.
IR systems usually adopt index terms to index and retrieve documents.

What is an information retrieval model?

Definition: A model of information retrieval (IR) selects and ranks the relevant documents
with respect to a user's query.
 Web information retrieval models are ways of integrating many sources of
evidence about documents, such as the links, the structure of the document, the actual
content of the document, and the quality of the document, so that an effective Web
search engine can be achieved.

A simple IR model has the following structure:

How do IR models differ from one another?
IR models can be distinguished by how they represent the documents and the query, and by:
a) how the system matches the query with the documents in the corpus to find the
related ones
b) how the system ranks these documents
An IR model defines the following aspects of the retrieval procedure of a search engine:
a) How the documents in the collection and the user’s queries are transformed
b) How the system identifies the relevancy of the documents based on the query
word/phrase given by the user
c) How the system ranks the retrieved documents based on the relevancy

Why IR Models?
 Mathematical models are used in many scientific areas with the objective of
understanding some phenomenon in the real world.
 A model of information retrieval predicts and explains what a user will find relevant
to the given query.

1.2 Constructing an IR Model

An IR model is basically a pattern that defines the mathematical aspects of the retrieval procedure and
consists of the following −
a) A model for documents.
b) A model for queries.
c) A matching function that compares queries to documents.

Mathematically, a retrieval model consists of −

D − Representation for documents.
Q − Representation for queries.
F − The modeling framework for D and Q, along with the relationship between them.
R(q, di) − A similarity function which orders the documents with respect to the query.
It is also called ranking.

An IR model is a quadruple [ D, Q, F, R(qi , dj)]
a) D is a set of logical views for the documents in the collection
b) Q is a set of logical views for the user queries
c) F is a framework for modeling documents and queries
d) R(qi,dj) is a Ranking function.
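
As a rough illustration of the quadruple, the sketch below wires D, Q, F and R together in code. The choices made here are assumptions for the example only: the logical view is a plain set of lowercase words, and R simply counts shared terms; the class name `IRModel` and the helper names are hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, FrozenSet, List

# Assumed logical view: a document or query is just a set of index terms.
View = FrozenSet[str]

def to_view(text: str) -> View:
    """F (framework): map raw text to its logical view."""
    return frozenset(text.lower().split())

def overlap(q: View, d: View) -> float:
    """R(q, d): an illustrative ranking function (number of shared terms)."""
    return float(len(q & d))

@dataclass
class IRModel:
    D: List[View]                      # logical views of the documents
    Q: List[View]                      # logical views of the user queries
    F: Callable[[str], View]           # framework used to build the views
    R: Callable[[View, View], float]   # ranking function

    def rank(self, q: View) -> List[int]:
        """Return document positions ordered by decreasing R(q, d)."""
        return sorted(range(len(self.D)), key=lambda j: -self.R(q, self.D[j]))

docs = ["boolean retrieval model", "vector space retrieval model", "cooking recipes"]
model = IRModel(D=[to_view(t) for t in docs], Q=[], F=to_view, R=overlap)
ranking = model.rank(model.F("vector retrieval"))
```

Swapping in a different `F` and `R` (for example tf-idf vectors and cosine similarity) gives a different model without changing the surrounding structure.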

1.3 Categories of Information Retrieval (IR) Model

An information retrieval (IR) model can be classified into the following three categories:
a) Classical IR Model
b) Non-Classical IR Model
c) Alternative IR Model

a) Classical IR Model
 It is the simplest IR model and is easy to implement.
 This model is based on mathematical knowledge that is easily recognized and
understood.

Types of Classical IR Models:

a) Boolean IR models
b) Vector IR models
c) Probabilistic IR models

A classical model, when it is part of an IR system, works as follows:
 Each document is described by a set of representative keywords called index
terms.
 Numerical weights are assigned to index terms to capture their differing relevance.
Thus each model adopts the basic mathematical functionalities depicted below:

Overall, any traditional IR system holds a set of functionalities derived from the following
mathematical concepts:

b) Non-Classical IR Model
 It is the opposite of the classical IR model.
 Such IR models are based on principles other than similarity, probability, and
Boolean operations, on which the classical retrieval models are based.
 The information logic model, the situation theory model and the interaction model are examples
of non-classical IR models.

c) Alternative IR Model
 Alternative models are enhanced classical IR models.
 They extend the classical IR models by making use of specific techniques
from other fields.
Types of Alternative IR Models are:
a) Cluster model,
b) Fuzzy model and
c) Latent semantic indexing (LSI) model.

 Boolean Model is the oldest information retrieval (IR) model.
 This is the simplest retrieval model which retrieves the information on the basis
of the query given in Boolean expression.
 Boolean queries are queries that use the AND, OR and NOT Boolean operators to
join the query terms.

 The Boolean retrieval model is a model for information retrieval in which we can pose
any query which is in the form of a Boolean expression of terms, that is, in which
terms are combined with the operators AND, OR, and NOT.

 The model views each document as just a set of words.

 Based on a binary decision criterion without any notion of a grading scale.
 Boolean expressions have precise semantics.

How the information is retrieved from Boolean model?

 The BIR is based on Boolean logic and classical set theory in that both the
documents to be searched and the user's query are conceived as sets of terms (a
bag-of-words model).
 Retrieval is based on whether or not the documents contain the query terms.

• Document collection:
• d1 = “Sachin scores hundred.”
• d2 = “Dravid is the most technical batsman of the era.”
• d3 = “Sachin, Dravid duo is the best to watch.”
• d4 = “India wins courtesy to Dravid, Sachin partnership”

• Lexicon and inverted index:

• Sachin → {d1, d3, d4}
• score → {d1}
• hundred → {d1}
• Dravid → {d2, d3, d4}
• technical → {d2}
• batsman → {d2}
• watch → {d3}
• India → {d4}
• partnership → {d4}
• win → {d4}

Result set for the query Sachin AND Dravid:
{d1, d3, d4} ∩ {d2, d3, d4} = {d3, d4}
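
The cricket example can be reproduced with a small inverted index. This is a minimal sketch: it lowercases and splits on non-letters but does no stemming or stop-word removal, so its lexicon has "scores" and "wins" rather than the stemmed "score" and "win" listed in the example. The function name `build_inverted_index` is an assumption.

```python
import re
from collections import defaultdict

DOCS = {
    "d1": "Sachin scores hundred.",
    "d2": "Dravid is the most technical batsman of the era.",
    "d3": "Sachin, Dravid duo is the best to watch.",
    "d4": "India wins courtesy to Dravid, Sachin partnership",
}

def build_inverted_index(docs):
    """Map each lowercased token to the set of documents containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for token in re.findall(r"[a-z]+", text.lower()):
            index[token].add(doc_id)
    return index

index = build_inverted_index(DOCS)

# AND is set intersection, OR is set union, NOT is set difference.
sachin_and_dravid = index["sachin"] & index["dravid"]
sachin_or_dravid = index["sachin"] | index["dravid"]
sachin_not_dravid = index["sachin"] - index["dravid"]
print(sorted(sachin_and_dravid))  # ['d3', 'd4']
```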

 In Boolean model, the IR system retrieves the documents based on the occurrence of
query key words in the document.
 It doesn’t provide any ranking of documents based on the relevancy.
 The model is based on set theory and the Boolean algebra, where documents are sets of
terms and queries are Boolean expressions on terms.

The Boolean model can be defined as −

 D − A set of words, i.e., the indexing terms present in a document. Here, each term is
either present (1) or absent (0).
 Q − A Boolean expression, where terms are the index terms and operators are logical
products − AND, logical sum − OR and logical difference − NOT
 F − Boolean algebra over sets of terms as well as over sets of documents.

 In the Boolean IR model, the relevance prediction can be defined as follows −
o R − A document is predicted as relevant to the query expression if and only if it
satisfies the query expression, for example −
((text ∨ information) ∧ retrieval ∧ ¬theory)
 Each query term can be seen as an unambiguous definition of a set of
documents.
 For example, the query term “economic” defines the set of documents that are indexed
with the term “economic”.

What would be the result after combining terms with the Boolean AND operator?
 It will define a document set that is smaller than or equal to the document sets of any of
the single terms.
 For example, the query with terms “social” AND “economic” will produce the set
of documents that are indexed with both terms.
 In other words, the document set is the intersection of both sets.

What would be the result after combining terms with the Boolean OR operator?
 It will define a document set that is bigger than or equal to the document sets of any of
the single terms.
 For example, the query with terms “social” OR “economic” will produce the set
of documents that are indexed with either the term “social” or the term “economic”.
 In other words, the document set is the union of both sets.

BIR Example
 A way to avoid linearly scanning the texts for each query is to index the
documents in advance.
 Let us stick with Shakespeare’s Collected Works, and use it to introduce the basics of
the Boolean retrieval model.
 Suppose we record for each document – here a play of Shakespeare’s –
o whether it contains each word out of all the words Shakespeare used
(Shakespeare used about 32,000 different words).
 The result is a binary term-document incidence matrix, as in the figure.
 Terms are the indexed units; they are usually words, and for the moment you can think
of them as words.

Figure : A term-document incidence matrix. Matrix element (t, d) is 1 if the play in
column d contains the word in row t, and is 0 otherwise.

To answer the query Brutus AND Caesar AND NOT Calpurnia, we take the vectors for Brutus,
Caesar and Calpurnia, complement the last, and then do a bitwise AND:
110100 AND 110111 AND 101111 = 100100
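
The same bitwise computation can be sketched in code. The three incidence vectors are the ones used above (the Calpurnia vector is recovered from its complement 101111). The list of play names and their column order are an assumption, since the figure itself is not reproduced in these notes.

```python
# Assumed column order of the incidence matrix (the figure is not reproduced here).
PLAYS = ["Antony and Cleopatra", "Julius Caesar", "The Tempest",
         "Hamlet", "Othello", "Macbeth"]

# Incidence vectors, one bit per play, leftmost bit = first play.
BRUTUS = 0b110100
CAESAR = 0b110111
CALPURNIA = 0b010000   # its complement is 101111, as in the text

MASK = (1 << len(PLAYS)) - 1   # keep only the six meaningful bits after NOT
result = BRUTUS & CAESAR & (~CALPURNIA & MASK)

# Turn the result back into a bit string and read off the matching plays.
bits = format(result, f"0{len(PLAYS)}b")
answers = [play for play, bit in zip(PLAYS, bits) if bit == "1"]
print(bits, answers)
```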
Answer:
The answers for this query are thus Antony and Cleopatra and Hamlet.

Let us now consider a more realistic scenario, simultaneously using the opportunity to
introduce some terminology and notation.
Suppose we have N = 1 million documents.
 By documents we mean whatever units we have decided to build a retrieval system over.
 They might be individual memos or chapters of a book.
 We will refer to the group of documents over which we perform retrieval as the
collection.
 It is sometimes also referred to as a corpus.
 If each document is about 1,000 words long and we assume an average of 6 bytes per
word including spaces and punctuation, then this is a document collection about 6 GB in size.
 Typically, there might be about M = 500,000 distinct terms in these documents.
 There is nothing special about the numbers we have chosen, and they might vary by an
order of magnitude or more, but they give us some idea of the dimensions of the kinds of
problems we need to handle.
Advantages of the Boolean Model

The advantages of the Boolean model are as follows −
a) The simplest model, which is based on sets.
b) Easy to understand and implement.
c) It retrieves only exact matches.
d) It gives the user a sense of control over the system.

Disadvantages of the Boolean Model

The disadvantages of the Boolean model are as follows −
 The model’s similarity function is Boolean. Hence, there would be no partial matches.
This can be annoying for the users.
 In this model, the Boolean operator usage has much more influence than a critical word.
 The query language is expressive, but it is complicated too.
 No ranking for retrieved documents.

3. TF-IDF (Term Frequency/Inverse Document Frequency)
3.1 Term Frequency (tfij)
 It may be defined as the number of occurrences of wi in dj.
 The information that is captured by term frequency is how salient a word is within the
given document or in other words we can say that the higher the term frequency the
more that word is a good description of the content of that document.

3.2 Document Frequency (dfi)

 It may be defined as the total number of documents in the collection in which wi occurs.
 It is an indicator of informativeness.
 Semantically focused words will occur several times in the document unlike the
semantically unfocused words.

 Assign to each term in a document a weight for that term, that depends on the number of
occurrences of the term in the document.
 We would like to compute a score between a query term t and a document d, based on
the weight of t in d.

 The simplest approach is to assign the weight to be equal to the number of occurrences
of term t in document d.
 This weighting scheme is referred to as term frequency and is denoted tf_{t,d}, with the
subscripts denoting the term and the document in order.

3.3 Inverse document frequency

 This is another form of document frequency weighting, often called idf weighting or
inverse document frequency weighting.
 The important point of idf weighting is that the term’s scarcity across the collection is a
measure of its importance, and importance is inversely proportional to frequency of
occurrence.
 Raw term frequency as above suffers from a critical problem:
all terms are considered equally important when it comes to assessing relevancy on a
query.
 In fact certain terms have little or no discriminating power in determining relevance.

 For instance, a collection of documents on the auto industry is likely to have the term
auto in almost every document.
 We therefore need a mechanism for attenuating the effect of terms that occur too often in the collection to
be meaningful for relevance determination.
 An immediate idea is to scale down the term weights of terms with high collection
frequency, defined to be the total number of occurrences of a term in the collection.
 The idea would be to reduce the tf weight of a term by a factor that grows with its
collection frequency.
 In practice the document frequency is used for this purpose, which gives the inverse document frequency:

\[ \mathrm{idf}_t = \log \frac{N}{n_t} \]

where N = number of documents in the collection and n_t = number of documents containing term t.

3.4 Tf-idf weighting

 We now combine the definitions of term frequency and inverse document frequency, to
produce a composite weight for each term in each document.

 The tf-idf weighting scheme assigns to term t a weight in document d given by

\[ \text{tf-idf}_{t,d} = \mathrm{tf}_{t,d} \times \mathrm{idf}_t \]


The need of TF-IDF model

 The reason we need IDF is to help correct for words like “of”, “as”, “the”, etc. since they
appear frequently in an English corpus.
 Thus by taking inverse document frequency, we can minimize the weighting of frequent
terms while making infrequent terms have a higher impact.
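
A minimal sketch of the weighting, assuming whitespace tokenization, raw counts for tf, and a base-10 logarithm for idf (the notes do not fix the base). The function name `tf_idf` and the three-document corpus are hypothetical.

```python
import math
from collections import Counter

def tf_idf(docs):
    """Return one {term: weight} dict per document, using tf * log10(N / n_t)."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # n_t: number of documents that contain term t (each document counted once).
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    weights = []
    for tokens in tokenized:
        tf = Counter(tokens)
        weights.append({t: tf[t] * math.log10(n_docs / doc_freq[t]) for t in tf})
    return weights

corpus = ["the auto industry", "the auto market", "the rare orchid orchid"]
w = tf_idf(corpus)
```

In this toy corpus "the" occurs in every document, so its weight is zero everywhere, while "orchid" (frequent in one document, absent from the others) gets the largest weight.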


4.1 Introduction to Vector Model

 In vector space model documents and queries are represented as vectors of features
representing terms.
 Features are assigned some numerical value that is usually some function of frequency
of terms.
 Vector Space Model of Information Retrieval provides rankings of the resulted
documents based on the similarity of the query vector with the documents vector.
 That is, it provides documents in the order of relevance with the user query.

4.2 Vector Space Model

Definition : Vector space model

Vector space model or term vector model is an algebraic model for representing text documents
as vectors of identifiers.
It is used in information filtering, information retrieval, indexing and relevancy rankings. Its
first use was in the SMART Information Retrieval System.


 The vector space model represents the documents and queries as vectors in a
multidimensional space, whose dimensions are the terms used to build an index to
represent the documents [Salton 1983].
 The creation of an index involves lexical scanning to identify the significant terms,
where morphological analysis reduces different word forms to common "stems", and the
occurrence of those stems is computed.
 Query and document surrogates are compared by comparing their vectors, using, for
example, the cosine similarity measure.
 In this model, the terms of a query surrogate can be weighted to take into account their
importance, and they are computed by using the statistical distributions of the terms in
the collection and in the documents [Salton 1983].
 The vector space model can assign a high ranking score to a document that contains only
a few of the query terms if these terms occur infrequently in the collection but frequently
in the document.

The vector space model makes the following assumptions:

1) The more similar a document vector is to a query vector, the more likely it is that the
document is relevant to that query.

2) The words used to define the dimensions of the space are orthogonal or independent. While
it is a reasonable first approximation, the assumption that words are pair wise independent is
not realistic.

Vector Model example

 In VSM, each document d is viewed as a vector of tf-idf values, one component for each
term.
 So we have a vector space where
a. terms are axes; and
b. documents live in this space.

Representation of document and query features as vectors
 The ranking algorithm computes the similarity between document and query vectors to assign a
retrieval score to each document.
 The Postulate is: Documents related to the same information are close together in the
vector space.
 Assign non-binary weights to index terms in queries and in documents. Compute the
similarity between documents and query.
 More precise than Boolean model.

 Due to the disadvantages of the Boolean model, Gerard Salton and his colleagues
suggested a model, which is based on Luhn’s similarity criterion.
 The similarity criterion formulated by Luhn states, “the more two representations agreed
in given elements and their distribution, the higher would be the probability of their
representing similar information.”

Consider the following important points to understand more about the Vector Space
Model −
 The index representations (documents) and the queries are considered as vectors
embedded in a high dimensional Euclidean space.
 The similarity measure of a document vector to a query vector is usually the cosine of
the angle between them.

Cosine Similarity Measure Formula

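Cosine is a normalized dot product, which can be calculated with the help of the
following formula –

\[ \cos(q, d) = \frac{q \cdot d}{\lVert q \rVert \, \lVert d \rVert} = \frac{\sum_{i} q_i d_i}{\sqrt{\sum_{i} q_i^2} \, \sqrt{\sum_{i} d_i^2}} \]

where q_i and d_i are the weights of term i in the query and in the document.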

Vector Space Representation with Query and Document
 The query and documents are represented by a two-dimensional vector space.
 The terms are car and insurance.
 There is one query and three documents in the vector space.

 The top ranked document in response to the terms car and insurance will be the
document d2 because the angle between q and d2 is the smallest.
 The reason behind this is that both the concepts car and insurance are salient in d2 and
hence have the high weights.
 On the other side, d1 and d3 also mention both the terms but in each case, one of them is
not a centrally important term in the document.
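
The car/insurance example can be sketched numerically. The weights below are hypothetical, chosen only to match the description (both terms salient in d2, one dominant term in each of d1 and d3); the actual values behind the figure are not given in these notes.

```python
import math

def cosine(u, v):
    """Cosine of the angle between two equal-length weight vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

# Hypothetical (car, insurance) weights.
query = (1.0, 1.0)
documents = {
    "d1": (0.9, 0.2),   # mostly about "car"
    "d2": (0.7, 0.7),   # both terms salient
    "d3": (0.1, 0.9),   # mostly about "insurance"
}

# Rank documents by decreasing cosine similarity to the query.
ranking = sorted(documents, key=lambda name: cosine(query, documents[name]), reverse=True)
print(ranking)
```

With these weights d2 points in the same direction as the query, so its cosine is 1 and it is ranked first.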

Vector Space Model: Pros

• Automatic selection of index terms
• Partial matching of queries and documents (dealing with the case where no document
contains all search terms)
• Ranking according to similarity score(dealing with large result sets)
• Term weighting schemes (improves retrieval performance)
• Various extensions
• Document clustering
• Relevance feedback (modifying query vector)
• Geometric foundation

Disadvantages or Problems with Vector Space Models

• Ambiguity and association in natural language
• Polysemy: Words often have a multitude of meanings and different types of usage (more
urgent for very heterogeneous collections).
• The vector space model is unable to discriminate between different meanings of the same
word.
• Synonymy: Different terms may have an identical or a similar meaning (weaker: words
indicating the same topic).
• No associations between words are made in the vector space representation.


5.1 Introduction to Probabilistic Model

• The probabilistic retrieval model is based on the Probability Ranking Principle, which
states that an information retrieval system is supposed to rank the documents based on their
probability of relevance to the query, given all the evidence available [Belkin and Croft 1992].
• The principle takes into account that there is uncertainty in the representation of the
information need and the documents.
• There can be a variety of sources of evidence that are used by the probabilistic retrieval
methods, and the most common one is the statistical distribution of the terms in both the
relevant and non-relevant documents.

What is probabilistic model in information retrieval?

• It is a formalism of information retrieval useful to derive ranking functions used by
search engines and web search engines in order to rank matching documents according to their
relevance to a given search query. It is a theoretical model estimating the probability that a
document dj is relevant to a query q.

This model works in following phases:

1) In first phase, some set of documents is retrieved by using Vector Space Model or
Boolean model.
2) In next step, the user reviews these documents produced in phase 1, to look for the
relevant ones and gives his feedback.
3) Finally the Information Retrieval system then uses this feedback information to refine
the searching criteria and rankings of the retrieved documents.
4) This process is repeated,
until the user gets the desired information in response to his needs.

• The probabilistic model tries to estimate the probability that the user will find the
document dj relevant with ratio
P(dj relevant to q) / P(dj non relevant to q)

• Given a user query q, and the ideal answer set R of the relevant documents, the problem
is to specify the properties for this set. Assumption (probabilistic principle):

the probability of relevance depends on the query and document representations only;
ideal answer set R should maximize the overall probability of relevance.

• Given a query q, there exists a subset R of the documents which are relevant to q, but
membership of R is uncertain.
• A probabilistic retrieval model ranks documents in decreasing order of probability of
relevance to the information need:
P(R | q, di).

• Users start with information needs, which they translate into query representations.
Similarly, there are documents, which are converted into document representations.

Given only a query, an IR system has an uncertain understanding of the information need.

So IR is an uncertain process, because of uncertainty in:

• Translating the information need into a query
• Converting documents into index terms
• The mismatch between query terms and index terms

• Probability theory provides a principled foundation for such reasoning under
uncertainty.

• This model estimates how likely a document is to be relevant to an information need.

• A document can be relevant or non-relevant; we can estimate the probability of
a term t appearing in a relevant document, P(t | R=1).
• Probabilistic methods are one of the oldest but also one of the currently hottest topics in
IR.

5.2 Basic probability theory

• For an event A, the probability of the event satisfies 0 ≤ P(A) ≤ 1. For two events A and B:
• Joint probability P(A, B) of both events occurring
• Conditional probability P(A|B) of event A occurring given that event B has occurred
• The chain rule gives the fundamental relationship between joint and conditional probabilities:

\[ P(A, B) = P(A \cap B) = P(A \mid B)\,P(B) = P(B \mid A)\,P(A) \]

Similarly for the complement of an event, written \bar{A}:

\[ P(\bar{A}, B) = P(B \mid \bar{A})\,P(\bar{A}) \]

Partition rule: if B can be divided into an exhaustive set of disjoint sub cases, then P(B) is the
sum of the probabilities of the sub cases.

A special case of this rule gives:

\[ P(B) = P(A, B) + P(\bar{A}, B) \]

Bayes’ rule for inverting conditional probabilities:

\[ P(A \mid B) = \frac{P(B \mid A)\,P(A)}{P(B)} = \left[ \frac{P(B \mid A)}{\sum_{X \in \{A, \bar{A}\}} P(B \mid X)\,P(X)} \right] P(A) \]

It can be thought of as a way of updating probabilities:

 Start off with the prior probability P(A) (the initial estimate of how likely event A is in the
absence of any other information).
 Derive a posterior probability P(A|B) after having seen the evidence B, based on the
likelihood of B occurring in the two cases that A does or does not hold.

The odds of an event (the ratio of the probability of an event to the probability of its complement)
provide a kind of multiplier for how probabilities change:

\[ O(A) = \frac{P(A)}{P(\bar{A})} = \frac{P(A)}{1 - P(A)} \]
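
A small numeric sketch of Bayes' rule and odds, using made-up probabilities. The function names `posterior` and `odds` and all three input numbers are assumptions chosen for illustration.

```python
def posterior(prior, likelihood_if_a, likelihood_if_not_a):
    """P(A|B) from P(A), P(B|A) and P(B|not A); P(B) is expanded by the partition rule."""
    evidence = likelihood_if_a * prior + likelihood_if_not_a * (1.0 - prior)
    return likelihood_if_a * prior / evidence

def odds(p):
    """O(A) = P(A) / (1 - P(A))."""
    return p / (1.0 - p)

# Hypothetical numbers: A = "document is relevant", B = "document contains the term".
p_a = 0.2               # prior P(A)
p_b_given_a = 0.9       # P(B | A)
p_b_given_not_a = 0.3   # P(B | not A)

p_a_given_b = posterior(p_a, p_b_given_a, p_b_given_not_a)
print(round(p_a_given_b, 3), round(odds(p_a), 3), round(odds(p_a_given_b), 3))
```

Seeing the evidence multiplies the odds of A by the likelihood ratio P(B|A) / P(B|not A), which is 3 for these numbers.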

What is the benefit of probabilistic modeling?

The statistical approaches have the following strengths:
1) They provide users with a relevance ranking of the retrieved documents. Hence, they enable
users to control the output by setting a relevance threshold or by specifying a certain number of
documents to display.
2) Queries can be easier to formulate because users do not have to learn a query language and
can use natural language.
3) The uncertainty inherent in the choice of query concepts can be represented.

• In fact, probabilistic modeling is extremely useful as an exploratory decision making
tool.
• It allows managers to capture and incorporate in a structured way their insights into the
businesses they run and the risks and uncertainties they face.

Advantages of the probabilistic model:
a) The claimed advantage of the probabilistic model is that it is entirely based on
probability theory.
b) The implication is that other models have a certain arbitrary characteristic.
c) They might perform well experimentally, but they lack a sound theoretical basis because
the parameters are not easy to estimate.

Disadvantages of the probabilistic model:
a) It needs to guess the initial relevant and non-relevant sets.
b) Term frequency is not considered.
c) It assumes that index terms are independent.

Probabilistic Model Example

Using the same example we used previously with the vector space model, we now show how
the four different weights can be used for relevance ranking.
Again, the documents and the query are:
Q: "gold silver truck"
D1: "Shipment of gold damaged in a fire."
D2: "Delivery of silver arrived in a silver truck."
D3: "Shipment of gold arrived in a truck."

Since training data are needed for the probabilistic model, we assume that these three
documents are the training data and we deem documents D2 and D3 as relevant to the query. To
compute the similarity coefficient, we assign term weights to each term
in the query. We then sum the weights of matching terms.

There are four quantities we are interested in:

N=number of documents in the collection
n =number of documents indexed by a given term
R =number of relevant documents for the query

r =number of relevant documents indexed by the given term

Table: Frequencies of each term

 Note that with our collection, the weight for silver is infinite, since (n − r) = 0.
 This is because "silver" only appears in relevant documents. Since we are using this
procedure in a predictive manner, Robertson and Sparck Jones recommended adding
constants to each quantity [Robertson and Sparck Jones, 1976].

The new weights are:

Term Weight:

Document Weight:

The similarity coefficient for a given document is obtained by summing the weights of
the terms present. The table gives the similarity coefficients for each of the four different weighting
schemes.
For D1, gold is the only term to appear, so the weight for D1 is just the weight for gold,
which is -0.079.
For D2, silver and truck appear, so the weight for D2 is the sum of the weights for silver
and truck, which is 0.097 + 0.143 = 0.240.
For D3, gold and truck appear, so the weight for D3 is the sum for gold and truck, which
is -0.079 + 0.143 = 0.064.
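
The weight formulas and tables are missing from these notes, so the sketch below rests on an assumption: the term weight is log10 of ((r + 0.5) / (R + 1)) / ((n + 1) / (N + 2)), i.e. the first of the four weights with the added constants. That form is used here because it reproduces the quoted values for gold, silver and truck to three decimals.

```python
import math
import re

QUERY = "gold silver truck"
DOCS = {
    "D1": "Shipment of gold damaged in a fire.",
    "D2": "Delivery of silver arrived in a silver truck.",
    "D3": "Shipment of gold arrived in a truck.",
}
RELEVANT = {"D2", "D3"}   # training judgments stated in the text

def terms(text):
    """Set of lowercased words in a document."""
    return set(re.findall(r"[a-z]+", text.lower()))

DOC_TERMS = {name: terms(text) for name, text in DOCS.items()}
N = len(DOCS)        # number of documents in the collection
R = len(RELEVANT)    # number of relevant documents for the query

def weight(term):
    """Assumed weight: log10(((r + 0.5) / (R + 1)) / ((n + 1) / (N + 2)))."""
    n = sum(term in doc for doc in DOC_TERMS.values())       # docs indexed by term
    r = sum(term in DOC_TERMS[name] for name in RELEVANT)    # relevant docs indexed by term
    return math.log10(((r + 0.5) / (R + 1)) / ((n + 1) / (N + 2)))

term_weights = {t: weight(t) for t in QUERY.split()}

# Similarity coefficient: sum of the weights of the query terms present in the document.
doc_scores = {
    name: sum(w for t, w in term_weights.items() if t in doc)
    for name, doc in DOC_TERMS.items()
}
```

The unrounded sum for D3 is about 0.0635; the 0.064 in the text comes from adding the already rounded term weights.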

6. Latent Semantic Indexing (LSI) Model
6.1 Introduction to LSI
• Several statistical and AI techniques have been used in association with domain
semantics to extend the vector space model to help overcome some of the retrieval
problems.

• One such method is Latent Semantic Indexing (LSI).

What is Latent Semantic Indexing (LSI) and explain with an example?


• Synonymy refers to the fact that many words can describe the same thing. A person
searching for “flapjack recipes” is effectively searching for “pancake recipes” (outside of the
UK) because flapjacks and pancakes are synonymous.
• Latent semantic indexing (LSI) is an indexing and retrieval method that uses a
mathematical technique called singular value decomposition (SVD) to identify patterns
in the relationships between the terms and concepts contained in an unstructured
collection of text.

6.2 The importance of latent semantic indexing for search

• Latent semantic indexing uses natural language processing (NLP) to help a search
engine determine relevant content for a specific search query.

• LSI is based on the principle that words that are used in the same contexts tend to have
similar meanings.

• A key feature of LSI is its ability to extract the conceptual content of a body of
text by establishing associations between those terms that occur in similar contexts.
• In LSI the associations among terms and documents are calculated and exploited in the
retrieval process.
• The assumption is that there is some "latent" structure in the pattern of word usage
across documents and that statistical techniques can be used to estimate this latent
structure. An advantage of this approach is that queries can retrieve documents even if
they have no words in common.
• The LSI technique captures deeper associative structure than simple term-to-term
correlations and is completely automatic.
• The only difference between LSI and vector space methods is that LSI represents terms
and documents in a reduced dimensional space of the derived indexing dimensions. As
with the vector space method, differential term weighting and relevance feedback can
improve LSI performance substantially.

• The LSI match-document profile method combines the advantages of both LSI and the
document profile.
• The document profile provides a simple, but effective, representation of the user's
• Indicating just a few documents that are of interest is as effective as generating a long
list of words and phrases that describe one's interest.
• Document profiles have an added advantage over word profiles: users can just indicate
documents they find relevant without having to generate a description of their interests.

• The words that searchers use to describe their information needs are often not the
same words used by authors to describe the same information.
• I.e., index terms and user search terms often do NOT match
1. Synonymy
2. Polysemy

How LSI Works

• Start with a matrix of terms by documents
• Analyze the matrix using SVD to derive a particular “latent semantic structure model”
• Two-Mode factor analysis, unlike conventional factor analysis, permits an arbitrary
rectangular matrix with different entities on the rows and columns
• Such as Terms and Documents

• The rectangular matrix is decomposed into three other matrices of a special form by SVD
• The resulting matrices contain “singular vectors” and “singular values”

• The matrices show a breakdown of the original relationships into linearly independent
components or factors
• Many of these components are very small and can be ignored – leading to an
approximate model that contains many fewer dimensions

• In the reduced model all of the term-term, document-document and term-document
similarities are now approximated by values on the smaller number of dimensions
• The result can still be represented geometrically by a spatial configuration in which the
dot product or cosine between vectors representing two objects corresponds to their
estimated similarity
• Typically the original term-document matrix is approximated using 50-100 factors.
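
The steps above can be sketched with NumPy. This is a minimal illustration, not a production LSI pipeline: the 5-term by 4-document count matrix is made up, k = 2 is chosen only because the matrix is tiny (real collections use far more factors), and the function name lsi_approximation is an assumption of this sketch.

```python
import numpy as np

# Hypothetical term-by-document count matrix (rows: terms, columns: documents).
terms = ["gold", "silver", "truck", "shipment", "delivery"]
X = np.array([
    [1, 0, 1, 0],
    [0, 2, 0, 1],
    [0, 1, 1, 1],
    [1, 0, 1, 0],
    [0, 1, 0, 1],
], dtype=float)

def lsi_approximation(X, k):
    """Decompose X by SVD and keep only the k largest singular values."""
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    U_k, s_k, Vt_k = U[:, :k], s[:k], Vt[:k, :]
    X_hat = U_k @ np.diag(s_k) @ Vt_k   # rank-k approximation of X
    return U_k, s_k, Vt_k, X_hat

U_k, s_k, Vt_k, X_hat = lsi_approximation(X, k=2)

# Each document becomes a k-dimensional vector in the reduced space.
doc_vectors = (np.diag(s_k) @ Vt_k).T

def cosine(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

print(np.round(X_hat, 2))
print("cosine(doc 0, doc 2) =", round(cosine(doc_vectors[0], doc_vectors[2]), 3))
```

Keeping all singular values reproduces X exactly; dropping the small ones gives the approximate model with fewer dimensions described above.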

6.4 Comparisons in LSI

a) Comparing two terms
b) Comparing two documents
c) Comparing a term and a document

In the original matrix these amount to:

a) Comparing two rows
b) Comparing two columns
c) Examining a single cell in the table

a) Comparing Two Terms

Dot product between the row vectors of X(hat) reflects the extent to which two terms
have a similar pattern of occurrence across the set of documents
b) Comparing Two Documents
The dot product between two column vectors of the matrix X(hat) tells the extent
to which two documents have a similar profile of terms
c) Comparing a term and a document
Treat the query as a pseudo-document and calculate the cosine between the pseudo-
document and the other documents
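
The three comparisons can be sketched in plain Python on a small matrix. The values of X_hat below are hypothetical, and the query is compared directly as a column over the same terms; a full LSI system would first project the query into the reduced space, which is omitted here for brevity.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def cosine(u, v):
    norm = math.sqrt(dot(u, u)) * math.sqrt(dot(v, v))
    return dot(u, v) / norm if norm else 0.0

def column(matrix, j):
    return [row[j] for row in matrix]

# Hypothetical reduced-rank matrix X_hat: rows are terms, columns are documents.
X_hat = [
    [0.9, 0.1, 0.8],   # term 0
    [0.8, 0.2, 0.9],   # term 1
    [0.1, 1.0, 0.2],   # term 2
]

term_term = dot(X_hat[0], X_hat[1])                  # a) compare two rows
doc_doc = dot(column(X_hat, 0), column(X_hat, 2))    # b) compare two columns
term_doc = X_hat[2][1]                               # c) read a single cell

# Query treated as a pseudo-document: a column of weights over the same terms.
query = [1.0, 1.0, 0.0]
ranking = sorted(range(3), key=lambda j: cosine(query, column(X_hat, j)), reverse=True)
print(term_term, doc_doc, term_doc, ranking)
```

With these made-up values, documents 0 and 2 rank ahead of document 1 because they share the query's terms.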

What is Latent Semantic Analysis used for?

 Latent Semantic Analysis is an efficient way of analysing the text and finding the hidden
topics by understanding the context of the text.
• Latent Semantic Analysis (LSA) is used to find the hidden topics represented by a
document or text. These hidden topics are then used for clustering similar documents.
 LSI has been tested and found to be “modestly effective” with traditional test
 Permits compact storage/representation (vectors are typically 50-150 elements instead of

6.5 Advantages of LSI

• LSI overcomes two of the most problematic constraints of Boolean keyword queries:
a) multiple words that have similar meanings (synonymy)
b) words that have more than one meaning (polysemy).

• Text does not need to be in sentence form for LSI to be effective. It can work with lists,
free-form notes, email, web content, etc.
• LSI is also used to perform automated document categorization and clustering.
• In fact, several experiments have demonstrated that there are a number of correlations
between the way LSI and humans process and categorize text.


7.1 Introduction to Neural Network

What Is a Neural Network?

• A neural network is a series of algorithms that endeavors to recognize underlying
relationships in a set of data through a process that mimics the way the human brain
operates.
• In this sense, neural networks refer to systems of neurons, either organic or artificial in
nature.
• Neural networks can adapt to changing input; so the network generates the best possible
result without needing to redesign the output criteria.
• The concept of neural networks, which has its roots in artificial intelligence, is swiftly
gaining popularity in the development of trading systems.

Components of a Neural Network

There are three main components:

a) an input layer,
b) a processing layer, and
c) an output layer.

The inputs may be weighted based on various criteria.

Types of Neural Network

a) Convolutional Neural Network

A convolutional neural network is one adapted for analyzing and identifying visual data
such as digital images or photographs.

b) Recurrent Neural Network

A recurrent neural network is one adapted for analyzing time series data, event history, or
temporal ordering.

c) Deep Neural Network

Also known as a deep learning network, a deep neural network, at its most basic, is one that
involves two or more processing layers.

Why are neural networks important?

• Neural networks are also ideally suited to help people solve complex problems in real-life
situations.
• They can learn and model the relationships between inputs and outputs that are nonlinear
and complex; make generalizations and inferences; reveal hidden relationships, patterns and
predictions; and model highly volatile data (such as financial time series data) and
variances needed to predict rare events (such as fraud detection).

7.3 The Neural Network Model
What is a neural network model?

• A neural network is a simplified model of the way the human brain processes information.
• It works by simulating a large number of interconnected processing units that resemble
abstract versions of neurons. The processing units are arranged in layers.
• A neural network is a method in artificial intelligence that teaches computers to process
data in a way that is inspired by the human brain.
• It is a type of machine learning process, called deep learning, that uses interconnected
nodes or neurons in a layered structure that resembles the human brain.

• Neural ranking models for information retrieval (IR) use shallow or deep neural networks
to rank search results in response to a query.
• Traditional learning to rank models employ supervised machine learning (ML)
techniques—including neural networks—over hand-crafted IR features.

• Neural networks are not themselves algorithms, but rather frameworks for many different
machine learning algorithms that work together.
• The algorithms process complex data.
• A neural network is an example of machine learning, where software can change as it
learns to solve a problem.

Fig: Retrieval with two neural networks (labels: query words, 1. neural network, keywords, 2. neural network, documents)

• The first neural network (a multilayer perceptron of the back-propagation type) consists of three
layers: an input layer, a hidden layer and an output layer.

Query Keywords

• The input layer consists of N input neurons x1, …, xN, where each neuron represents one
character of a query, i.e. the input layer represents one word.

 Hidden layer is created by M neurons y1, …, yM, which express the inner query

 Output layer is created by L neurons k1, …, kL, where each neuron represents one

 For learning this neural network a back propagation algorithm was used.

7.4 The Neural Network Model Example
• The problem that we are going to solve is pretty simple.
• Suppose we have some information about obesity, smoking habits, and exercise habits of
five people.
• We also know whether these people are diabetic or not.
• Our dataset looks like this:
Person     Smoking   Obesity   Exercise   Diabetic
Person 1   0         1         0          1
Person 2   0         0         1          0
Person 3   1         0         0          0
Person 4   1         1         0          1
Person 5   1         1         1          1

• In the above table, we have five columns: Person, Smoking, Obesity, Exercise, and
Diabetic.
• Here 1 refers to true and 0 refers to false.
• For instance, the first person has values of 0, 1, 0 which means that the person doesn't
smoke, is obese, and doesn't exercise.
• The person is also diabetic.
• It is clearly evident from the dataset that a person's obesity is indicative of the person being
diabetic.
• Our task is to create a neural network that is able to predict whether an unknown person is
diabetic or not given data about his exercise habits, obesity, and smoking habits.
• This is a type of supervised learning problem where we are given inputs and corresponding
correct outputs and our task is to find the mapping between the inputs and the outputs.
• Note: This is just a fictional dataset; in real life, obese people are not necessarily always
diabetic.

• The Solution
We will create a very simple neural network with one input layer and one output layer.

Neural Network Theory

• A neural network is a supervised learning algorithm, which means that we provide it the
input data containing the independent variables and the output data that contains the
dependent variable.
• For instance, in our example our independent variables are smoking, obesity and exercise.
The dependent variable is whether a person is diabetic or not.
• In the beginning, the neural network makes some random predictions, these predictions are
matched with the correct output and the error or the difference between the predicted values
and the actual values is calculated.
• The function that finds the difference between the actual values and the predicted values is
called the cost function. The cost here refers to the error.
• Our objective is to minimize the cost function.
• Training a neural network basically refers to minimizing the cost function.
• We will see how we can perform this task.
• The neural network that we are going to create has the following visual representation.

 A neural network executes in two steps:

a) Feed Forward and

b) Back Propagation.

Feed Forward

 In the feed-forward part of a neural network, predictions are made based on the values in
the input nodes and the weights.

 If you look at the neural network in the above figure, you will see that we have three
features in the dataset: smoking, obesity, and exercise, therefore we have three nodes in
the first layer, also known as the input layer. We have replaced our feature names with
the variable x, for generality in the figure above.

 The weights of a neural network are basically the strings that we have to adjust in order
to be able to correctly predict our output.

 For now, just remember that for each input feature, we have one weight.

The following are the steps that execute during the feed forward phase of a neural
network:

Step 1: (Calculate the dot product between inputs and weights)

• The nodes in the input layer are connected with the output layer via three weight
parameters. In the output layer, the values in the input nodes are multiplied with their
corresponding weights and are added together. Finally, the bias term is added to the sum.
The b in the above figure refers to the bias term.
• The bias term is very important here. Suppose we have a person who doesn't smoke, is
not obese, and doesn't exercise; the sum of the products of input nodes and weights will be
zero. In that case, the weighted sum will always be zero no matter how much we train the
algorithm. Therefore, in order to be able to make predictions, even if we do not have any
non-zero information about the person, we need a bias term. The bias term is necessary to
make a robust neural network.
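• Mathematically, in step 1, we perform the following calculation:

$$z = x_1 w_1 + x_2 w_2 + x_3 w_3 + b$$

where x1, x2, x3 are the input values, w1, w2, w3 are their weights, and b is the bias term.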

Step 2: (Pass the result from step 1 through an activation function)

• The result from Step 1 can be a set of any values. However, in our output we have the
values in the form of 1 and 0.
• We want our output to be in the same format. To do so we need an activation function,
which squashes input values between 1 and 0.
• One such activation function is the sigmoid function.
• The sigmoid function returns 0.5 when the input is 0. It returns a value close to 1 if the
input is a large positive number. In case of a large negative input, the sigmoid function outputs a
value close to zero.
• Mathematically, the sigmoid function can be represented as:

$$\sigma(x) = \frac{1}{1 + e^{-x}}$$
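
The two feed-forward steps can be sketched in plain Python. The weights and bias below are hypothetical, untrained values chosen only to show the arithmetic, and the function names are assumptions of this sketch.

```python
import math

def sigmoid(z):
    """Squash any real number into the range (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def feed_forward(inputs, weights, bias):
    # Step 1: dot product of inputs and weights, plus the bias term.
    z = sum(x * w for x, w in zip(inputs, weights)) + bias
    # Step 2: pass the result through the activation function.
    return sigmoid(z)

# Hypothetical parameters, one weight per feature: smoking, obesity, exercise.
weights = [0.5, 2.0, -1.0]
bias = -1.0

person_1 = [0, 1, 0]   # doesn't smoke, is obese, doesn't exercise
prediction = feed_forward(person_1, weights, bias)
print(round(prediction, 3))

# With all inputs zero, only the bias determines the output.
print(round(feed_forward([0, 0, 0], weights, bias), 3))
```

Training would then adjust the weights and bias so that the predictions move toward the Diabetic column of the dataset.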


8.1 Text retrieval in IR

• In text retrieval, the user enters a text query and the system returns a ranked
list of search results.

 Search results may be passages of text or full text documents.

 The system’s goal is to rank the user’s preferred search results at the top.

 This problem is a central one in the IR literature, with well-understood challenges and

8.2 Evaluation in IR

• The quality of the document ranking produced by a matching model is essential.

• Indeed, the user usually considers only the first page of documents ranked by a web
search engine (the top 10 or 20) [64].
• If relevant documents are not present on the first page, the user will not be satisfied by
the results returned by the system. Several measures and practical benchmarks have been
proposed in order to evaluate how effective an IR approach is.

8.3 Performance Evaluation

• The most common measures of system performance are time and space.
• Time: how fast does the system run?
• Space: what fraction of the available resources does the system consume?
• Time x Space: good metrics for data retrieval systems and for IR systems.
• But, since answers in an IR system are only approximate, we must also evaluate
the quality of those answers!

8.4 Retrieval Performance Evaluation

 To evaluate the quality of the approximate answers, we compare them with a set of ideal
answers (provided by specialists).
 Clearly, we can only do this for a set of pre-defined example information requests, also
referred to as reference topics.
 For each reference topic, the ideal answer set is provided.
• The documents used for generating the various ideal answer sets form a reference
collection.

 The evaluation of the quality of a ranking algorithm involves then:

a) a reference collection
b) a set of reference topics
c) an ideal answer set for each reference topic
• The answers generated by a ranking algorithm (such as the vector model) are compared
with the ideal answer sets to determine how good the ranking is.
 This process of evaluating the quality of a ranking is usually referred to as retrieval
performance evaluation.

Retrieval performance evaluation is often measured in terms of two metrics:

a) precision and
b) recall.

I : an example information request (topic)
R : the ideal answer set for the topic I
|R| : number of docs in the set R
A : the answer set generated by a ranking strategy we wish to evaluate
|A| : the number of docs in the set A.

Relationship between the sets R and A, given I.
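
The figure relating R and A is not reproduced here, so the two metrics are restated as a hedged reconstruction. It assumes Ra denotes the set of relevant documents that also appear in the answer set, which matches how Ra is used in the text that follows:

```latex
R_a = R \cap A, \qquad
\text{Precision} = \frac{|R_a|}{|A|}, \qquad
\text{Recall} = \frac{|R_a|}{|R|}
```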

 The viewpoint using the sets R, A, and Ra, does not consider that documents presented
to the user are ordered (i.e., ranked).
 User sees a ranked set of documents and examines them starting from the top.
 Thus, precision and recall vary as the user proceeds with his examination of the set A.
 Most appropriate then is to plot a curve of precision versus recall.

Let Rq be the set of relevant docs for a query q:

Rq = {d3, d5, d9, d25, d39, d44, d56, d71, d89, d123}
Consider a new retrieval algorithm that yields the following set of docs as answers to the
query q:

01. d123    06. d9      11. d38
02. d84     07. d511    12. d48
03. d56     08. d129    13. d250
04. d6      09. d187    14. d113
05. d8      10. d25     15. d3
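
The points of the precision versus recall curve for this ranking can be computed with a short sketch. It uses the set Rq and the ranking exactly as listed above; the function name is illustrative.

```python
relevant = {"d3", "d5", "d9", "d25", "d39", "d44", "d56", "d71", "d89", "d123"}
ranking = ["d123", "d84", "d56", "d6", "d8", "d9", "d511", "d129",
           "d187", "d25", "d38", "d48", "d250", "d113", "d3"]

def precision_recall_points(ranking, relevant):
    """Return (rank, precision, recall) each time a relevant document is seen."""
    points, hits = [], 0
    for rank, doc in enumerate(ranking, start=1):
        if doc in relevant:
            hits += 1
            points.append((rank, hits / rank, hits / len(relevant)))
    return points

points = precision_recall_points(ranking, relevant)
for rank, p, r in points:
    print(f"rank {rank:2d}: precision = {p:.2f}, recall = {r:.2f}")
```

Relevant documents appear at ranks 1, 3, 6, 10 and 15, so precision falls from 1.00 to 0.33 while recall rises from 0.10 to 0.50; the remaining five relevant documents are never retrieved.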



9.1 Introduction to Evaluation Measures

 Evaluation measures for an information retrieval system are used to assess how well the
search results satisfied the user's query intent.

• Such metrics are often split into two kinds: online metrics look at users' interactions with the
search system, while offline metrics measure relevance, in other words how likely each
result, or the search engine results page (SERP) as a whole, is to meet the information
needs of the user.

 Several evaluation measures are used in order to assess the effectiveness of an IRS.

Some of the most widely used measures in IR are presented here.

We group them according to their use, results-based or ranking-based:


Results-based evaluation measures evaluate the overall set of returned documents
per query. The objective is to measure how well an IRS is able to find all
relevant documents and reject all irrelevant documents.

• Recall & Precision.

Given the set of documents retrieved by a given system, precision P is the fraction of
retrieved documents that are relevant, while recall R is the fraction of all relevant
documents that have been retrieved.

• Acc. The Accuracy is one of the metrics used to evaluate binary classifiers.

While in ranking tasks, the objective is to evaluate the ordering of the relevant elements in a list
of results, in classification tasks, the evaluation objective is to assess the systems’ ability to
correctly categorize a set of instances.

For example, as a binary classification application in text matching, we would like the system
to predict the correct label (e.g., 0 or 1) that reflects an element’s relevance. Hence, given a
dataset having S positive elements and N negative elements, the accuracy (Acc) of a model M
could be defined as in equation 1.4:

$$\mathrm{Acc}(M) = \frac{TS + TN}{S + N} \qquad (1.4)$$

where TS and TN are respectively the number of elements that are correctly classified as
positive ones, and the number of elements that are correctly classified as negative ones.

Hence, for a total population (evaluation dataset), the closer the model’s accuracy is to 1, the
better it is.

• Recall, precision and Acc assume that the relevance of each document
can be judged in isolation, independently from other documents [66].


10.1 Introduction to Precision and recall

 Offline metrics are generally created from relevance judgment sessions where the judges
score the quality of the search results.
 Both binary (relevant/non-relevant) and multi-level (e.g., relevance from 0 to 5) scales
can be used to score each document returned in response to a query.
 In practice, queries may be ill-posed, and there may be different shades of relevance.
 For instance, there is ambiguity in the query "mars": the judge does not know if the user
is searching for the planet Mars, the Mars chocolate bar, or the singer Bruno Mars.
I : an information request
R: the set of relevant documents for I
A: the answer set for I, generated by an IR System.

R∩ A : the intersection of the sets R and A.

10.2 Precision
 Precision is the fraction of the documents retrieved that are relevant to the user's
information need.

In binary classification, precision is analogous to positive predictive value. Precision takes
all retrieved documents into account. It can also be evaluated considering only the topmost
results returned by the system using Precision@k.
Note that the meaning and usage of "precision" in the field of information retrieval differs
from the definition of accuracy and precision within other branches of science and statistics.

10.3 Recall
Recall is the fraction of the documents that are relevant to the query that are successfully
retrieved.

In binary classification, recall is often called sensitivity. So it can be looked at as the
probability that a relevant document is retrieved by the query.
It is trivial to achieve recall of 100% by returning all documents in response to any
query. Therefore, recall alone is not enough; one also needs to measure the number of
non-relevant documents, for example by computing the precision.
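
Using the sets R and A defined in 10.1, precision is |R ∩ A| / |A| and recall is |R ∩ A| / |R|. The sketch below computes both, together with Precision@k, on a made-up collection of nine documents; the document identifiers and function names are hypothetical.

```python
def precision(relevant, answer):
    """Fraction of the retrieved documents that are relevant."""
    return len(relevant & answer) / len(answer) if answer else 0.0

def recall(relevant, answer):
    """Fraction of the relevant documents that were retrieved."""
    return len(relevant & answer) / len(relevant) if relevant else 0.0

def precision_at_k(relevant, ranked, k):
    """Precision computed over the top k results only."""
    return sum(1 for doc in ranked[:k] if doc in relevant) / k

collection = {f"d{i}" for i in range(1, 10)}   # hypothetical: d1 .. d9
R = {"d1", "d2", "d3", "d4"}                   # relevant documents
ranked = ["d1", "d9", "d2", "d7", "d8"]        # ranked answer from the system
A = set(ranked)

print(precision(R, A), recall(R, A), precision_at_k(R, ranked, 3))

# Returning every document gives perfect recall but poor precision.
print(recall(R, collection), precision(R, collection))
```

The last line illustrates why recall must be read together with precision.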


• A collection of documents used for testing information retrieval models and algorithms.
• A reference collection usually includes a set of documents, a set of test queries, and a set
of documents known to be relevant to each query.

11.1 Reference Collections

• Reference collections, which are based on the foundations established by the Cranfield
experiments, constitute the most used evaluation method in IR.

• A reference collection is composed of:

• A set D of pre-selected documents
• A set I of information need descriptions used for testing
• A set of relevance judgments associated with each pair [im, dj]

• The relevance judgment has a value of 0 if document dj is non-relevant to im, and 1
otherwise.

• These judgments are produced by human specialists.

• With small collections one can apply the Cranfield evaluation paradigm to provide
relevance assessments.

• With large collections, however, not all documents can be evaluated relative to a given
information need.
• The alternative is to consider only the top k documents produced by various ranking
algorithms for a given information need.
• This is called the pooling method.
• The method works for reference collections of a few million documents such as the
TREC collections.
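
The pooling method can be sketched as the union of the top k documents from each ranking; only the documents in the pool are then shown to the human assessors. The three system rankings below are hypothetical.

```python
def build_pool(rankings, k):
    """Union of the top-k documents returned by each ranking algorithm."""
    pool = set()
    for ranking in rankings:
        pool.update(ranking[:k])
    return pool

# Hypothetical rankings from three systems for one information need.
rankings = [
    ["d1", "d2", "d3", "d4"],
    ["d3", "d5", "d1", "d6"],
    ["d7", "d1", "d2", "d8"],
]

pool = build_pool(rankings, k=2)
print(sorted(pool))
```

Documents outside the pool are never judged, which is what keeps the assessment effort manageable on large collections.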


12.1 User-based Evaluation

• User preferences are affected by the characteristics of the user interface (UI).

 For instance, the users of search engines look first at the upper corner of the results page.

 Thus, changing the layout is likely to affect the assessment made by the users and their

 Proper evaluation of the user interface requires going beyond the framework of the
Cranfield experiments.

12.2 User-Centered Evaluation
• User-based evaluation is the most common evaluation system advocated by many
information scientists.

• Criteria for the evaluation of an information retrieval system include:

a) Recall
b) Precision
c) Fallout
d) Generality
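
Fallout and generality are not defined in these notes. The sketch below assumes the usual definitions: fallout is the fraction of the non-relevant documents in the collection that were retrieved, and generality is the fraction of the collection that is relevant to the query. The collection and answer set are made up.

```python
def fallout(relevant, answer, collection):
    """Share of all non-relevant documents that were (wrongly) retrieved."""
    non_relevant = collection - relevant
    return len(answer - relevant) / len(non_relevant) if non_relevant else 0.0

def generality(relevant, collection):
    """Share of the whole collection that is relevant to the query."""
    return len(relevant) / len(collection)

collection = {f"d{i}" for i in range(10)}       # hypothetical: d0 .. d9
relevant = {"d1", "d2", "d3", "d4"}
answer = {"d0", "d1", "d2", "d8", "d9"}

print(fallout(relevant, answer, collection))    # 3 of the 6 non-relevant docs
print(generality(relevant, collection))         # 4 of the 10 docs
```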

• Qualitative methods of evaluation such as case studies, focus groups or in-depth
interviews can be combined with objective measures to produce more effective
information retrieval research and evaluation.
• The key to the future of information systems and searching processes lies not in
increased sophistication of technology, but in increased understanding of human
involvement with information. (Saracevic & Kantor, 1988)
• User-centered evaluation goes one step beyond user-oriented evaluation.
• A user-centered study looks at the user in various settings, possibly not even library
settings, to determine how the user behaves.
• The user-centered approach examines the information-seeking task in the context of
human behaviour in order to understand more completely the nature of user interaction
with an information system.
• User-centered evaluation is based on the premise that understanding user behavior
facilitates more effective system design and establishes criteria to use in evaluating the
user's interaction with the system.
• These studies examine the user from a behavioural science perspective using methods
common to psychology, sociology, and anthropology.
• While empirical methods such as experimentation are frequently employed, there has
been an increased interest in qualitative methods that capture the complexity and
diversity of human experience.
• In addition to observing behaviour, a user-centered approach attempts to probe beneath
the surface to get at subjective and affective factors. Concern for the user and the context
of information seeking and retrieval is not new, nor is it confined to library and
information science.


13.1 Relevance feedback and query expansion:

 Relevance feedback is a query expansion and refinement technique with a long history.
 First proposed in the 1960s, it relies on user interaction to identify relevant documents in a
ranking based on the initial query.
• Unlike other semi-automatic techniques, in which the user chooses from lists of terms or alternative
queries, in relevance feedback the user indicates which documents are interesting (i.e.,
relevant) and possibly which documents are completely off-topic (i.e., non-relevant).
• Based on this information, the system automatically reformulates the query by adding terms
and reweighting the original terms, and a new ranking is generated using this modified
query.
 This process is a simple example of using machine learning in information retrieval, where
training data (the identified relevant and non-relevant documents) is used to improve the
system’s performance.
 Modifying the query is in fact equivalent to learning a classifier that distinguishes between
relevant and non-relevant documents.

Fig: Relevance Feedback Process

• For example, an initial query "find information surrounding the various conspiracy theories
about the assassination of John F. Kennedy" has both useful keywords and noise. The most
useful keyword is probably assassination.
 Like many queries (in terms of retrieval) there is some meaningless information. Terms
such as various and information are probably not stop words (i.e., frequently used words
that are typically ignored by an information retrieval system such as a, an, and, the), but
they are more than likely not going to help retrieve relevant documents.
 The idea is to use all terms in the initial query and ask the user if the top ranked documents
are relevant.
 The hope is that the terms in the top ranked documents that are said to be relevant will be
"good" terms to use in a subsequent query.
 Assume a highly ranked document contains the term Oswald.
 It is reasonable to expect that adding the term Oswald to the initial query would improve
both precision and recall. Similarly, if a top ranked document that is deemed relevant by the
user contains many occurrences of the term assassination, the weight used in the initial
query for this term should be increased.
 With the vector space model, the addition of new terms to the original query, the deletion of
terms from the query, and the modification of existing term weights has been done.
 With the probabilistic model, relevance feedback initially was only able to re-weight
existing terms, and there was no accepted means of adding terms to the original query.
 The exact means by which relevance feedback is implemented is fairly dependent on the
retrieval strategy being employed.

Relevance Feedback in the Vector Space Model (the Rocchio algorithm for relevance feedback)
• Rocchio's approach used the vector space model to rank documents.
• The query is represented by a vector Q, each document is represented by a vector Di, and
a measure of relevance between the query and the document vector is computed as SC(Q,
Di), where SC is the similarity coefficient.
• The SC is computed as an inner product of the document and query vector or the cosine
of the angle between the two vectors.
• The basic assumption is that the user has issued a query Q and retrieved a set of
documents.
• The user is then asked whether or not the documents are relevant.
• After the user responds, the set R contains the n1 relevant document vectors, and the set
S contains the n2 non-relevant document vectors.

• Rocchio builds the new query Q' from the old query Q using the equation given below:

$$Q' = Q + \frac{1}{n_1}\sum_{i=1}^{n_1} R_i - \frac{1}{n_2}\sum_{i=1}^{n_2} S_i$$

Ri and Si are individual components of R and S, respectively.

 The document vectors from the relevant documents are added to the initial query vector, and
the vectors from the non-relevant documents are subtracted.
 If all documents are relevant, the third term does not appear.
 To ensure that the new information does not completely override the original query, all
vector modifications are normalized by the number of relevant and non-relevant documents.
• The process can be repeated such that Qi+1 is derived from Qi for as many iterations as
desired.
• The idea is that the relevant documents have terms matching those in the original query.
The weights corresponding to these terms are increased by adding the relevant document
vectors.
• Terms in the query that are in the non-relevant documents have their weights decreased.
Also, terms that are not in the original query (had an initial component value of zero) are
now added to the original query.
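In addition to using values n1 and n2, it is possible to use arbitrary weights. The equation
now becomes:

$$Q' = \alpha Q + \frac{\beta}{n_1}\sum_{i=1}^{n_1} R_i - \frac{\gamma}{n_2}\sum_{i=1}^{n_2} S_i$$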

Not all of the relevant or non-relevant documents must be used. Adding thresholds na
and nb to indicate the thresholds for relevant and non-relevant vectors results in:

The weights α, β and γ are referred to as Rocchio weights and are
frequently mentioned in the annual proceedings of TREC. The optimal values were
experimentally obtained, but it is considered common today to drop the use of
non-relevant documents (assign zero to γ) and only use the relevant documents. This basic theme
was used by Ide in follow-up research to Rocchio where the following equation was defined:

Only the top ranked non-relevant document is used, instead of the sum of all
non-relevant documents. Ide refers to this as the Dec-Hi (decrease using highest ranking
non-relevant document) approach. Also, a more simplistic weight is described in which the
normalization, based on the number of document vectors, is removed, and α, β and γ are set to
one [Salton, 1971a].
This new equation is:

$$Q' = Q + \sum_{i=1}^{n_1} R_i - S_1$$

where S1 is the top ranked non-relevant document vector.

• An interesting case occurs when the original query retrieves only non-relevant
documents.
 Kelly addresses this case in [Salton, 1971b]. The approach suggests that an arbitrary
weight should be added to the most frequently occurring concept in the document
collection. This can be generalized to increase the component with the highest weight.
• The hope is that the term was important, but it was drowned out by all of the surrounding
noise.
 By increasing the weight, the term now rings true and yields some relevant documents.
 Note that this approach is applied only in manual relevance feedback approaches.
 It is not applicable to automatic feedback as the top n documents are assumed, by
definition, to be relevant.
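
The weighted Rocchio update described in this section can be sketched in plain Python: relevant document vectors are added, non-relevant ones are subtracted, and each sum is normalized by the number of vectors in it. The four-term vocabulary and all weights below are hypothetical, and the function signature is an assumption of this sketch.

```python
def rocchio(query, relevant, non_relevant, alpha=1.0, beta=1.0, gamma=1.0):
    """Return alpha*Q + beta*mean(relevant) - gamma*mean(non_relevant)."""
    new_query = [alpha * q for q in query]
    for vectors, weight, sign in ((relevant, beta, 1), (non_relevant, gamma, -1)):
        if not vectors:
            continue  # e.g. all documents relevant: the third term does not appear
        for vec in vectors:
            for i, value in enumerate(vec):
                new_query[i] += sign * weight * value / len(vectors)
    return new_query

# Hypothetical vocabulary: kennedy, assassination, oswald, information
query = [1.0, 1.0, 0.0, 1.0]
relevant = [[1.0, 2.0, 1.0, 0.0], [1.0, 1.0, 1.0, 0.0]]
non_relevant = [[0.0, 0.0, 0.0, 2.0]]

print(rocchio(query, relevant, non_relevant))
print(rocchio(query, relevant, non_relevant, gamma=0.0))  # ignore non-relevant docs
```

In this toy run the term oswald, absent from the original query, receives a positive weight, while information is pushed down by the non-relevant document.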



• In an explicit relevance feedback cycle, the feedback information is provided directly
by the users.
• However, collecting feedback information is expensive and time consuming.
• In the Web, user clicks on search results constitute a new source of feedback information.
• A click indicates a document that is of interest to the user in the context of the current
query.
• Notice that a click does not necessarily indicate a document that is relevant to the query.

• In an implicit relevance feedback cycle, the feedback information is derived
implicitly by the system.
• There are two basic approaches for compiling implicit feedback information:
• local analysis, which derives the feedback information from the top ranked documents in
the result set
• global analysis, which derives the feedback information from external
sources such as a thesaurus.

Classic Relevance Feedback
• In a classic relevance feedback cycle, the user is presented with a list of the retrieved
documents.
• Then, the user examines them and marks those that are relevant. In practice, only the top
10 (or 20) ranked documents need to be examined.
• The main idea consists of selecting important terms from the documents that have been
identified as relevant, and enhancing the importance of these terms in a new query.



A Characterization of Text Classification – Unsupervised Algorithms: Clustering –
Naïve Text Classification – Supervised Algorithms – Decision Tree – k-NN
Classifier – SVM Classifier – Feature Selection or Dimensionality Reduction –
Evaluation metrics – Accuracy and Error – Organizing the classes – Indexing and
Searching – Inverted Indexes – Sequential Searching – Multi-dimensional Indexing.


1.1 Text Classification

What is Classification?
• Classification is:
– the data mining process of
– finding a model (or function) that
– describes and distinguishes data classes or concepts,
– for the purpose of being able to use the model to predict
the class of objects whose class label is unknown.
• That is, it predicts categorical class labels (discrete or nominal).
• It classifies the data (constructs a model) based on the
training set.
• It predicts group membership for data instances.

Classification and Prediction:

Classification and prediction are two forms of data analysis that can be used to extract
models describing important data classes or to predict future data trends. Such analysis can help
provide us with a better understanding of the data at large. Whereas classification predicts
categorical (discrete, unordered) labels, prediction models continuous valued functions.

A model or classifier is constructed to predict categorical labels, such as “safe” or
“risky” for the loan application data; “yes” or “no” for the marketing data; or “treatment A,”
“treatment B,” or “treatment C” for the medical data. These categories can be represented by
discrete values, where the ordering among values has no meaning.

In prediction, the model constructed predicts a continuous-valued function, or
ordered value, as opposed to a categorical label. Such a model is called a predictor.

Classification and numeric prediction are the two major types of prediction problems.


Data classification is a two-step process:

In the first step, a classifier is built describing a predetermined set of data classes or
concepts. This is the learning step (or training phase), where a classification algorithm builds
the classifier by analyzing or “learning from” a training set made up of database tuples and
their associated class labels.

A tuple, X, is represented by an n-dimensional attribute vector, X = (x1, x2, ..., xn),
depicting n measurements made on the tuple from n database attributes, respectively, A1, A2, ..., An.

Figure: The data classification process. (a) Learning: Training data are analyzed
by a classification algorithm. Here, the class label attribute is loan decision, and the learned
model or classifier is represented in the form of classification rules. (b) Classification: Test
data are used to estimate the accuracy of the classification rules. If the accuracy is considered
acceptable, the rules can be applied to the classification of new data tuples.

Each tuple, X, is assumed to belong to a predefined class as determined by another
database attribute called the class label attribute.

The individual tuples making up the training set are referred to as training tuples and
are selected from the database under analysis.

Because the class label of each training tuple is provided, this step is also known as
supervised learning (i.e., the learning of the classifier is “supervised” in that it is told
to which class each training tuple belongs).

It contrasts with unsupervised learning (or clustering), in which the class label of each
training tuple is not known, and the number or set of classes to be learned may not be known in
advance.

This first step of the classification process can also be viewed as the learning of a
mapping or function, y = f (X), that can predict the associated class label y of a given tuple X.

This mapping is represented in the form of classification rules, decision trees, or
mathematical formulae.

In the second step, the model is used for classification. First, the predictive accuracy of the
classifier is estimated. If we were to use the training set to measure the accuracy of the
classifier, this estimate would likely be optimistic, because the classifier tends to overfit the
data (i.e., during learning it may incorporate some particular anomalies of the training data that
are not present in the general data set overall). Therefore, a test set is used, made up of test
tuples and their associated class labels. These tuples are randomly selected from the general data set.

The accuracy of a classifier on a given test set is the percentage of test set tuples that are
correctly classified by the classifier. The associated class label of each test tuple is compared
with the learned classifier’s class prediction for that tuple.
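
The accuracy definition above can be illustrated with a short Python sketch. The function name `accuracy` and the label lists are hypothetical; the sketch only assumes that the true and predicted class labels of the test tuples are available as two lists of equal length.

```python
def accuracy(true_labels, predicted_labels):
    """Percentage of test tuples whose predicted label equals the true label."""
    if len(true_labels) != len(predicted_labels):
        raise ValueError("label lists must have the same length")
    if not true_labels:
        raise ValueError("the test set must not be empty")
    correct = sum(1 for t, p in zip(true_labels, predicted_labels) if t == p)
    return 100.0 * correct / len(true_labels)


# Hypothetical loan-decision labels for five test tuples.
y_true = ["safe", "risky", "safe", "safe", "risky"]
y_pred = ["safe", "safe", "safe", "safe", "risky"]
acc = accuracy(y_true, y_pred)  # four of the five predictions match
print(acc)
```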


Data prediction is a two-step process, similar to that of data classification.

However, for prediction, we lose the terminology of “class label attribute” because the
attribute for which values are being predicted is continuous-valued (ordered) rather than
categorical (discrete-valued and unordered). The attribute can be referred to simply as the
predicted attribute.

Note that prediction can also be viewed as a mapping or function, y= f (X), where X is the input
(e.g., a tuple describing a loan applicant), and the output y is a continuous or ordered value
(such as the predicted amount that the bank can safely loan the applicant); That is, we wish to
learn a mapping or function that models the relationship between X and y.



Preparing the Data for Classification and Prediction

The following preprocessing steps may be applied to the data to help improve the
accuracy, efficiency, and scalability of the classification or prediction process.

Data cleaning: This refers to the preprocessing of data in order to remove or reduce
noise (by applying smoothing techniques, for example) and the treatment of missing values
(e.g., by replacing a missing value with the most commonly occurring value for that attribute,
or with the most probable value based on statistics).

Relevance analysis: Many of the attributes in the data may be redundant. Correlation
analysis can be used to identify whether any two given attributes are statistically related.
Attribute subset selection can be used in these cases to find a reduced set of attributes such
that the resulting probability distribution of the data classes is as close as possible to the
original distribution obtained using all attributes.

Data transformation and reduction: The data may be transformed by normalization,
particularly when neural networks or methods involving distance measurements are used in the
learning step. Normalization involves scaling all values for a given attribute so that they fall
within a small specified range, such as -1.0 to 1.0, or 0.0 to 1.0.
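
As an illustration of normalization, the sketch below applies min-max scaling, which is one common way (assumed here, the notes do not name a specific method) to map an attribute's values into a target range. The function name and the income values are hypothetical.

```python
def min_max_normalize(values, new_min=0.0, new_max=1.0):
    """Linearly rescale values so they fall within [new_min, new_max]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # A constant attribute carries no spread; map everything to new_min.
        return [new_min for _ in values]
    scale = (new_max - new_min) / (hi - lo)
    return [new_min + (v - lo) * scale for v in values]


incomes = [12000, 35000, 58000, 98000]       # hypothetical attribute values
scaled = min_max_normalize(incomes)          # range 0.0 to 1.0
centered = min_max_normalize([0, 5, 10], -1.0, 1.0)  # range -1.0 to 1.0
print(scaled, centered)
```

Scaling in this way keeps attributes with large raw ranges (such as income) from dominating distance computations over attributes with small ranges.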

The data can also be transformed by generalizing it to higher-level concepts.

Data can also be reduced by applying many other methods, ranging from wavelet
transformation and principal components analysis to discretization techniques, such as binning,
histogram analysis, and clustering.

4.2.2 Comparing Classification and Prediction Methods :

Classification and prediction methods can be compared and evaluated according to the
following criteria:

Accuracy: The accuracy of a classifier refers to the ability of a given classifier to
correctly predict the class label of new or previously unseen data (i.e., tuples without class label
information). Similarly, the accuracy of a predictor refers to how well a given predictor can
guess the value of the predicted attribute for new or previously unseen data.

Speed: This refers to the computational costs involved in generating and using the given
classifier or predictor.

Robustness: This is the ability of the classifier or predictor to make correct predictions
given noisy data or data with missing values.

Scalability: This refers to the ability to construct the classifier or predictor efficiently
given large amounts of data.

Interpretability: This refers to the level of understanding and insight that is provided by
the classifier or predictor.


Decision tree induction is the learning of decision trees from class-labeled training
tuples. A decision tree is a flowchart-like tree structure, where each internal node (nonleaf
node) denotes a test on an attribute, each branch represents an outcome of the test, and each
leaf node (or terminal node) holds a class label. The topmost node in a tree is the root node.

Fig. A decision tree for the concept buys computer

How are decision trees used for classification?

Given a tuple, X, for which the associated class label is unknown, the attribute values of
the tuple are tested against the decision tree. A path is traced from the root to a leaf node, which
holds the class prediction for that tuple. Decision trees can easily be converted to classification
rules.
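
Tracing a path from the root to a leaf can be sketched in a few lines of Python. The nested-dictionary representation and the particular tree below are assumptions made for illustration (a small tree for the concept buys computer); they are not reproduced from the figure.

```python
# Internal nodes are dicts with the tested attribute and one branch per outcome;
# leaves are plain class-label strings. The tree itself is hypothetical.
tree = {
    "attribute": "age",
    "branches": {
        "youth": {"attribute": "student",
                  "branches": {"no": "no", "yes": "yes"}},
        "middle_aged": "yes",
        "senior": {"attribute": "credit_rating",
                   "branches": {"excellent": "no", "fair": "yes"}},
    },
}


def classify(node, tuple_x):
    """Follow the branch matching each tested attribute until a leaf is reached."""
    while isinstance(node, dict):
        value = tuple_x[node["attribute"]]
        node = node["branches"][value]
    return node


x = {"age": "youth", "income": "medium", "student": "yes", "credit_rating": "fair"}
label = classify(tree, x)
print(label)
```

Each root-to-leaf path also reads as a rule, for example: IF age = youth AND student = yes THEN buys computer = yes.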

Why are decision tree classifiers so popular?

The construction of decision tree classifiers does not require any domain knowledge or
parameter setting, and therefore is appropriate for exploratory knowledge discovery. Decision
trees can handle high dimensional data. Their representation of acquired knowledge in tree
form is intuitive and generally easy to assimilate by humans. The learning and classification
steps of decision tree induction are simple and fast. In general, decision tree classifiers have
good accuracy.

Decision Tree Induction:

J. Ross Quinlan developed a decision tree algorithm known as ID3 (Iterative Dichotomiser),
which expanded on earlier work on concept learning systems.

Quinlan later presented C4.5 (a successor of ID3), which became a benchmark to which newer
supervised learning algorithms are often compared.

The book Classification and Regression Trees (CART) described the generation of binary
decision trees. ID3 and CART were invented independently of one another at around the same
time, yet follow a similar approach for learning decision trees from training tuples.

ID3, C4.5, and CART adopt a greedy (i.e., nonbacktracking) approach in which decision
trees are constructed in a top-down recursive divide-and-conquer manner. Most algorithms for
decision tree induction also follow such a top-down approach, which starts with a training set
of tuples and their associated class labels. The training set is recursively partitioned into smaller
subsets as the tree is being built. A basic decision tree algorithm is summarized here.

Fig. Basic algorithm for inducing a decision tree from training tuples.

The strategy is as follows.

The algorithm is called with three parameters: D, attribute list, and Attribute selection
method. We refer to D as a data partition. Initially, it is the complete set of training tuples and
their associated class labels. The parameter attribute list is a list of attributes describing the
tuples. Attribute selection method specifies a heuristic procedure for selecting the attribute that
“best” discriminates the given tuples according to class. This procedure employs an attribute
selection measure, such as information gain or the gini index. Whether the tree is strictly binary
is generally driven by the attribute selection measure. Some attribute selection measures, such
as the gini index, force the resulting tree to be binary. Others, like information gain, do not,
thereby allowing multiway splits (i.e., two or more branches to be grown from a node).

The tree starts as a single node, N, representing the training tuples in D (step 1).

If the tuples in D are all of the same class, then node N becomes a leaf and is labeled
with that class (steps 2 and 3). Note that steps 4 and 5 are terminating conditions. All of the
terminating conditions are explained at the end of the algorithm.

Otherwise, the algorithm calls Attribute selection method to determine the splitting
criterion. The splitting criterion tells us which attribute to test at node N by determining the
“best” way to separate or partition the tuples in D into individual classes (step 6). The splitting
criterion also tells us which branches to grow from node N with respect to the outcomes of the
chosen test. More specifically, the splitting criterion indicates the splitting attribute and may
also indicate either a split-point or a splitting subset. The splitting criterion is determined so
that, ideally, the resulting partitions at each branch are as “pure” as possible.

A partition is pure if all of the tuples in it belong to the same class. In other words, if we
were to split up the tuples in D according to the mutually exclusive outcomes of the splitting
criterion, we hope for the resulting partitions to be as pure as possible.

The node N is labeled with the splitting criterion, which serves as a test at the node (step
7). A branch is grown from node N for each of the outcomes of the splitting criterion. The
tuples in D are partitioned accordingly (steps 10 to 11). There are three possible scenarios, as
illustrated in Figure 6.4. Let A be the splitting attribute. A has v distinct values, {a1, a2, ...,
av}, based on the training data.

1. A is discrete-valued: In this case, the outcomes of the test at node N correspond
directly to the known values of A. A branch is created for each known value, aj, of A and
labeled with that value (Figure 6.4(a)). Partition Dj is the subset of class-labeled tuples in D
having value aj of A. Because all of the tuples in a given partition have the same value for A,
A need not be considered in any future partitioning of the tuples. Therefore, it is removed
from attribute list (steps 8 to 9).

2. A is continuous-valued: In this case, the test at node N has two possible outcomes,
corresponding to the conditions A ≤ split point and A > split point, respectively, where split
point is the split-point returned by Attribute selection method as part of the splitting criterion.
(In practice, the split-point, a, is often taken as the midpoint of two known adjacent values of A
and therefore may not actually be a pre-existing value of A from the training data.) Two
branches are grown from N and labeled according to the above outcomes (Figure 6.4(b)). The
tuples are partitioned such that D1 holds the subset of class-labeled tuples in D for which
A ≤ split point, while D2 holds the rest.

3. A is discrete-valued and a binary tree must be produced (as dictated by the attribute
selection measure or algorithm being used): The test at node N is of the form “A ∈ SA?”. SA is
the splitting subset for A, returned by Attribute selection method as part of the splitting
criterion. It is a subset of the known values of A. If a given tuple has value aj of A and if aj ∈
SA, then the test at node N is satisfied. Two branches are grown from N. By convention, the
left branch out of N is labeled yes so that D1 corresponds to the subset of class-labeled tuples in
D that satisfy the test. The right branch out of N is labeled no so that D2 corresponds to the
subset of class-labeled tuples from D that do not satisfy the test.

The algorithm uses the same process recursively to form a decision tree for the tuples
at each resulting partition, Dj, of D (step 14).

The recursive partitioning stops only when any one of the following terminating
conditions is true:

1. All of the tuples in partition D (represented at node N) belong to the same class
(steps 2 and 3), or

2. There are no remaining attributes on which the tuples may be further partitioned
(step 4). In this case, majority voting is employed (step 5). This involves converting node N
into a leaf and labeling it with the most common class in D. Alternatively, the class distribution
of the node tuples may be stored.

3. There are no tuples for a given branch, that is, a partition Dj is empty (step 12).
In this case, a leaf is created with the majority class in D (step 13).

The figure shows three possibilities for partitioning tuples based on the splitting
criterion, shown with examples. Let A be the splitting attribute. (a) If A is discrete-valued, then
one branch is grown for each known value of A. (b) If A is continuous-valued, then two
branches are grown, corresponding to A ≤ split point and A > split point. (c) If A is discrete-valued
and a binary tree must be produced, then the test is of the form A ∈ SA, where SA is the
splitting subset for A.

The resulting decision tree is returned (step 15).
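
The strategy above can be sketched in Python. This is a simplified illustration, not the algorithm from the figure: it assumes discrete-valued attributes only, uses information gain as the attribute selection measure, and takes a hypothetical `domains` argument that lists the known values of each attribute. The step numbers in the comments follow the numbering used in the text.

```python
import math
from collections import Counter


def entropy(rows, target):
    counts = Counter(r[target] for r in rows)
    total = len(rows)
    return -sum(c / total * math.log2(c / total) for c in counts.values())


def info_gain(rows, attr, target):
    total = len(rows)
    remainder = 0.0
    for value in {r[attr] for r in rows}:
        subset = [r for r in rows if r[attr] == value]
        remainder += len(subset) / total * entropy(subset, target)
    return entropy(rows, target) - remainder


def majority_class(rows, target):
    return Counter(r[target] for r in rows).most_common(1)[0][0]


def generate_tree(rows, attributes, target, domains):
    classes = {r[target] for r in rows}
    if len(classes) == 1:                       # steps 2-3: all tuples in one class
        return classes.pop()
    if not attributes:                          # steps 4-5: majority voting
        return majority_class(rows, target)
    best = max(attributes, key=lambda a: info_gain(rows, a, target))  # step 6
    node = {"attribute": best, "branches": {}}  # step 7: label node with the test
    remaining = [a for a in attributes if a != best]                  # steps 8-9
    for value in domains[best]:                 # steps 10-11: one branch per outcome
        subset = [r for r in rows if r[best] == value]
        if not subset:                          # steps 12-13: empty partition
            node["branches"][value] = majority_class(rows, target)
        else:                                   # step 14: recurse on the partition
            node["branches"][value] = generate_tree(subset, remaining, target, domains)
    return node                                 # step 15


# Tiny hypothetical training set.
data = [
    {"student": "no", "credit": "fair", "buys": "no"},
    {"student": "no", "credit": "excellent", "buys": "no"},
    {"student": "yes", "credit": "fair", "buys": "yes"},
    {"student": "yes", "credit": "excellent", "buys": "yes"},
]
domains = {"student": ["no", "yes"], "credit": ["fair", "excellent"]}
tree = generate_tree(data, ["student", "credit"], "buys", domains)
print(tree)
```

Because the approach is greedy, the attribute chosen at a node is never revisited; in the toy data, student separates the classes perfectly and credit is never tested.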

Attribute Selection Measures :

Attribute selection measures are used to select the attribute that best partitions the tuples
into distinct classes.

An attribute selection measure is a heuristic for selecting the splitting criterion that
“best” separates a given data partition, D, of class-labeled training tuples into individual
classes. If we were to split D into smaller partitions according to the outcomes of the splitting
criterion, ideally each partition would be pure (i.e., all of the tuples that fall into a given
partition would belong to the same class).

Attribute selection measures are also known as splitting rules because they determine
how the tuples at a given node are to be split.

The attribute having the best score for the measure is chosen as the splitting attribute
for the given tuples. If the splitting attribute is continuous-valued or if we are restricted to
binary trees then, respectively, either a split point or a splitting subset must also be determined
as part of the splitting criterion. The tree node created for partition D is labeled with the
splitting criterion, branches are grown for each outcome of the criterion, and the tuples are
partitioned accordingly.

Three popular attribute selection measures are

information gain,

gain ratio, and

gini index.

Information gain

ID3 uses information gain as its attribute selection measure. This measure is based on
pioneering work by Claude Shannon on information theory, which studied the value or
“information content” of messages.

Let node N represent or hold the tuples of partition D. The attribute with the highest
information gain is chosen as the splitting attribute for node N.

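The expected information needed to classify a tuple in D is given by

$$\mathrm{Info}(D) = -\sum_{i=1}^{m} p_i \log_2(p_i)$$

where pi is the probability that an arbitrary tuple in D belongs to class Ci, estimated by
|Ci,D| / |D|, and m is the number of classes. Info(D) is also known as the entropy of D.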

Now, suppose we were to partition the tuples in D on some attribute A having v distinct
values, {a1, a2, ..., av}, as observed from the training data. If A is discrete-valued, these values
correspond directly to the v outcomes of a test on A. Attribute A can be used to split D into v
partitions or subsets, {D1, D2, ..., Dv}, where Dj contains those tuples in D that have outcome
aj of A. These partitions would correspond to the branches grown from node N. The information
still needed after this partitioning to arrive at an exact classification is measured by

$$\mathrm{Info}_A(D) = \sum_{j=1}^{v} \frac{|D_j|}{|D|} \times \mathrm{Info}(D_j)$$

The term |Dj| / |D| acts as the weight of the jth partition. InfoA(D) is the expected
information required to classify a tuple from D based on the partitioning by A. The smaller the
expected information (still) required, the greater the purity of the partitions.

Information gain is defined as the difference between the original information requirement
(i.e., based on just the proportion of classes) and the new requirement (i.e., obtained after
partitioning on A). That is,

$$\mathrm{Gain}(A) = \mathrm{Info}(D) - \mathrm{Info}_A(D)$$

Example: Induction of a decision tree using information gain. The following table presents a
training set, D, of class-labeled tuples randomly selected from the AllElectronics customer database.

The class label attribute, buys computer, has two distinct values (namely, {yes, no});
therefore, there are two distinct classes (that is, m = 2). Let class C1 correspond to yes and class
C2 correspond to no. There are nine tuples of class yes and five tuples of class no. A (root)
node N is created for the tuples in D. To find the splitting criterion for these tuples, we must
compute the information gain of each attribute.

Table : Class-labeled training tuples from the AllElectronics customer database.

Using the equation for Info(D), we compute the expected information needed to classify a tuple in
D. The total number of tuples is 14, of which 9 have class “yes” and 5 have class “no”. Therefore,

$$\mathrm{Info}(D) = -\frac{9}{14}\log_2\frac{9}{14} - \frac{5}{14}\log_2\frac{5}{14} = 0.940 \text{ bits}$$

Next, we need to compute the expected information requirement for each attribute. Let’s start
with the attribute age. We need to look at the distribution of yes and no tuples for each category
of age. For the age category youth, there are two yes tuples and three no tuples. For the
category middle aged, there are four yes tuples and zero no tuples. For the category senior,
there are three yes tuples and two no tuples. Using the InfoA(D) equation, the expected information
needed to classify a tuple in D if the tuples are partitioned according to age is

$$\mathrm{Info}_{age}(D) = \frac{5}{14}\left(-\frac{2}{5}\log_2\frac{2}{5} - \frac{3}{5}\log_2\frac{3}{5}\right) + \frac{4}{14}\left(-\frac{4}{4}\log_2\frac{4}{4} - \frac{0}{4}\log_2\frac{0}{4}\right) + \frac{5}{14}\left(-\frac{3}{5}\log_2\frac{3}{5} - \frac{2}{5}\log_2\frac{2}{5}\right) = 0.694 \text{ bits}$$

where the term 0 log2 0 is taken to be 0. Each partition is weighted by its share of the tuples
(5/14 for youth, 4/14 for middle aged, 5/14 for senior).

Hence, the gain in information from such a partitioning would be

$$\mathrm{Gain}(age) = \mathrm{Info}(D) - \mathrm{Info}_{age}(D) = 0.940 - 0.694 = 0.246 \text{ bits}$$

Similarly, we can compute Gain(income) = 0.029 bits, Gain(student) = 0.151 bits, and
Gain(credit rating) = 0.048 bits. Because age has the highest information gain among the
attributes, it is selected as the splitting attribute. Node N is labeled with age, and branches are
grown for each of the attribute’s values.

The attribute age has the highest information gain and therefore becomes the splitting
attribute at the root node of the decision tree. Branches are grown for each outcome of age. The
tuples are shown partitioned accordingly.
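
The information gain of age can be checked numerically from the class counts quoted in the example (9 yes and 5 no overall; youth 2 yes and 3 no, middle aged 4 and 0, senior 3 and 2). The helper names `info` and `expected_info` are hypothetical and chosen only for this sketch.

```python
import math


def info(counts):
    """Entropy in bits of a class distribution given as a list of counts."""
    total = sum(counts)
    # Zero counts are skipped, which matches the convention 0 * log2(0) = 0.
    return -sum(c / total * math.log2(c / total) for c in counts if c > 0)


def expected_info(partitions):
    """Weighted entropy after splitting; each partition is a list of class counts."""
    total = sum(sum(p) for p in partitions)
    return sum(sum(p) / total * info(p) for p in partitions)


class_counts = [9, 5]                       # yes, no over all 14 tuples
age_partitions = [[2, 3], [4, 0], [3, 2]]   # youth, middle aged, senior

info_d = info(class_counts)
info_age = expected_info(age_partitions)
gain_age = info_d - info_age
print(round(info_d, 3), round(info_age, 3), round(gain_age, 4))
```

Rounded to three places the two terms are 0.940 and 0.694 bits; the unrounded gain is about 0.2467 bits, which is why subtracting the rounded terms gives 0.246.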

Suppose, instead, that we have an attribute A that is continuous-valued, rather than discrete-valued.

For such a scenario, we must determine the “best” split-point for A, where the split-
point is a threshold on A. We first sort the values of A in increasing order. Typically, the
midpoint between each pair of adjacent values is considered as a possible split-point.

The point with the minimum expected information requirement for A is selected as the
split point for A. D1 is the set of tuples in D satisfying A <= split point, and D2 is the set of
tuples in D satisfying A > split point.
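
A minimal sketch of the split-point search for a continuous-valued attribute follows. It assumes the midpoint rule described above and evaluates every midpoint between adjacent distinct values; the function name and the age values are hypothetical.

```python
import math
from collections import Counter


def entropy(labels):
    total = len(labels)
    return -sum(c / total * math.log2(c / total) for c in Counter(labels).values())


def best_split_point(values, labels):
    """Return (split_point, expected_info) minimizing the expected information."""
    pairs = sorted(zip(values, labels))
    distinct = sorted({v for v, _ in pairs})
    best_point, best_info = None, float("inf")
    # Candidate split-points are midpoints of adjacent sorted values.
    for lo, hi in zip(distinct, distinct[1:]):
        point = (lo + hi) / 2
        left = [lab for v, lab in pairs if v <= point]    # D1: A <= split point
        right = [lab for v, lab in pairs if v > point]    # D2: A > split point
        expected = (len(left) * entropy(left) + len(right) * entropy(right)) / len(pairs)
        if expected < best_info:
            best_point, best_info = point, expected
    return best_point, best_info


ages = [22, 25, 31, 38, 45, 52]                   # hypothetical values of A
buys = ["no", "no", "yes", "yes", "yes", "yes"]   # class labels
point, expected = best_split_point(ages, buys)
print(point, expected)
```

In the toy data the midpoint between 25 and 31 separates the two classes completely, so both partitions are pure and the expected information is zero.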

Gain ratio :

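The information gain measure is biased toward tests with many outcomes. That is, it
prefers to select attributes having a large number of values. For example, consider an
attribute that acts as a unique identifier, such as a product ID. A split on such an attribute
would result in a large number of partitions, each one containing just one tuple. Because each
partition is pure, the information required to classify a tuple after the split is zero.

Therefore, the information gained by partitioning on this attribute is maximal. Clearly,
such a partitioning is useless for classification.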

C4.5, a successor of ID3, uses an extension to information gain known as gain ratio,
which attempts to overcome this bias. It applies a kind of normalization to information gain
using a “split information” value defined analogously with Info(D) as

$$\mathrm{SplitInfo}_A(D) = -\sum_{j=1}^{v} \frac{|D_j|}{|D|} \log_2\left(\frac{|D_j|}{|D|}\right)$$

This value represents the potential information generated by splitting the training data
set, D, into v partitions, corresponding to the v outcomes of a test on attribute A.

It differs from information gain, which measures the information with respect to
classification that is acquired based on the same partitioning. The gain ratio is defined as

$$\mathrm{GainRatio}(A) = \frac{\mathrm{Gain}(A)}{\mathrm{SplitInfo}_A(D)}$$

The attribute with the maximum gain ratio is selected as the splitting attribute.

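Example: Computation of gain ratio for the attribute income. A test on income splits the
data of Table 6.1 into three partitions, namely low, medium, and high, containing four, six, and
four tuples, respectively. To compute the gain ratio of income, we first compute the split information:

$$\mathrm{SplitInfo}_{income}(D) = -\frac{4}{14}\log_2\frac{4}{14} - \frac{6}{14}\log_2\frac{6}{14} - \frac{4}{14}\log_2\frac{4}{14} = 1.557$$

From the earlier example, we have Gain(income) = 0.029. Therefore, GainRatio(income) = 0.029/1.557 = 0.019.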
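
The split information for income can be verified with a short Python sketch. The partition sizes (four, six, four) and Gain(income) = 0.029 come from the example; the function names are hypothetical. For these sizes the split information evaluates to about 1.557, giving a gain ratio of about 0.019.

```python
import math


def split_info(partition_sizes):
    """Potential information generated by splitting D into the given partitions."""
    total = sum(partition_sizes)
    return -sum(s / total * math.log2(s / total) for s in partition_sizes if s > 0)


def gain_ratio(gain, partition_sizes):
    """Information gain normalized by the split information."""
    si = split_info(partition_sizes)
    if si == 0:
        raise ValueError("split information is zero (single partition)")
    return gain / si


income_sizes = [4, 6, 4]                 # low, medium, high
si_income = split_info(income_sizes)
gr_income = gain_ratio(0.029, income_sizes)
print(round(si_income, 3), round(gr_income, 3))
```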

Gini index :

The Gini index is used in CART. Using the notation described above, the Gini index
measures the impurity of D, a data partition or set of training tuples, as

$$\mathrm{Gini}(D) = 1 - \sum_{i=1}^{m} p_i^2$$

where pi is the probability that a tuple in D belongs to class Ci and is estimated by
|Ci,D| / |D|. The sum is computed over m classes.

The Gini index considers a binary split for each attribute. Let’s first consider the case
where A is a discrete-valued attribute having v distinct values, {a1, a2, ..., av}, occurring in
D. To determine the best binary split on A, we examine all of the possible subsets that can be
formed using known values of A.

If A has v possible values, then there are 2^v possible subsets. For example, if income has
three possible values, namely {low, medium, high}, then the possible subsets are {low,
medium, high}, {low, medium}, {low, high}, {medium, high}, {low}, {medium}, {high}, and {}.
We exclude the full set, {low, medium, high}, and the empty set from consideration since,
conceptually, they do not represent a split. Therefore, there are 2^v - 2 possible ways to form two
partitions of the data, D, based on a binary split on A.

For each attribute, each of the possible binary splits is considered. For a discrete-valued
attribute, the subset that gives the minimum gini index for that attribute is selected as its
splitting subset.

For continuous-valued attributes, each possible split-point must be considered. The
strategy is similar to that described above for information gain, where the midpoint between
each pair of (sorted) adjacent values is taken as a possible split-point.

The point giving the minimum Gini index for a given (continuous-valued) attribute is
taken as the split-point of that attribute. Recall that for a possible split-point of A, D1 is the set
of tuples in D satisfying A <= split point, and D2 is the set of tuples in D satisfying A > split
point.

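When considering a binary split that partitions D into D1 and D2, the Gini index of D given
that partitioning is the weighted sum of the impurity of each resulting partition:

$$\mathrm{Gini}_A(D) = \frac{|D_1|}{|D|}\mathrm{Gini}(D_1) + \frac{|D_2|}{|D|}\mathrm{Gini}(D_2)$$

The reduction in impurity that would be incurred by a binary split on a discrete- or
continuous-valued attribute A is

$$\Delta\mathrm{Gini}(A) = \mathrm{Gini}(D) - \mathrm{Gini}_A(D)$$

The attribute that maximizes the reduction in impurity (or, equivalently, has the minimum
Gini index) is selected as the splitting attribute.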

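For attribute income in the AllElectronics data, the class counts (yes, no) are 3 and 1 for low,
4 and 2 for medium, and 2 and 2 for high. The split on the subsets {low, medium} and {high}
gives a Gini index of 10/14 × 0.420 + 4/14 × 0.500 = 0.443. Similarly, the Gini index values
for splits on the remaining subsets are 0.458 (for the subsets {low, high} and {medium}) and
0.450 (for the subsets {medium, high} and {low}). Therefore, the best binary split for attribute
income is on {low, medium} (or {high}) because it minimizes the Gini index.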
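
The binary-split search can be sketched as an enumeration of the 2^v - 2 proper, non-empty subsets of an attribute's values. The class counts per income value used below (low: 3 yes, 1 no; medium: 4 yes, 2 no; high: 2 yes, 2 no) are an assumption taken from the standard 14-tuple AllElectronics data, since the table itself is not reproduced here; the function names are hypothetical.

```python
from itertools import combinations


def gini(counts):
    """Gini impurity of a class distribution given as counts."""
    total = sum(counts)
    return 1.0 - sum((c / total) ** 2 for c in counts)


def gini_binary_splits(counts_by_value):
    """Map each proper non-empty value subset to the weighted Gini of its split."""
    values = sorted(counts_by_value)
    n_classes = len(next(iter(counts_by_value.values())))
    total = sum(sum(c) for c in counts_by_value.values())
    results = {}
    for size in range(1, len(values)):
        for subset in combinations(values, size):
            rest = [v for v in values if v not in subset]
            d1 = [sum(counts_by_value[v][k] for v in subset) for k in range(n_classes)]
            d2 = [sum(counts_by_value[v][k] for v in rest) for k in range(n_classes)]
            results[subset] = (sum(d1) * gini(d1) + sum(d2) * gini(d2)) / total
    return results


# Assumed (yes, no) counts per income value.
income_counts = {"low": (3, 1), "medium": (4, 2), "high": (2, 2)}
splits = gini_binary_splits(income_counts)
best = min(splits, key=splits.get)
for subset, value in sorted(splits.items(), key=lambda kv: kv[1]):
    print(subset, round(value, 3))
```

Each split appears twice in the result, once for a subset and once for its complement, which is why three values yield six entries but only three distinct splits.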

Many other attribute selection measures have been proposed. CHAID, a decision tree
algorithm that is popular in marketing, uses an attribute selection measure that is based on the
statistical χ² test for independence. Other measures include C-SEP (which performs better than
information gain and Gini index in certain cases) and G-statistic (an information theoretic
measure that is a close approximation to the χ² distribution).

Attribute selection measures based on the Minimum Description Length (MDL)
principle have the least bias toward multivalued attributes. MDL-based measures use encoding
techniques to define the “best” decision tree as the one that requires the fewest number of bits
to both (1) encode the tree and (2) encode the exceptions to the tree (i.e., cases that are not
correctly classified by the tree). Its main idea is that the simplest of solutions is preferred.

Other attribute selection measures consider multivariate splits (i.e., where the
partitioning of tuples is based on a combination of attributes, rather than on a single attribute).
The CART system, for example, can find multivariate splits based on a linear combination of
attributes. Multivariate splits are a form of attribute (or feature) construction, where new
attributes are created based on the existing ones.

Tree Pruning :

When a decision tree is built, many of the branches will reflect anomalies in the training
data due to noise or outliers. Tree pruning methods address this problem of overfitting the data.

Pruned trees tend to be smaller and less complex and, thus, easier to comprehend. They
are usually faster and better at correctly classifying independent test data (i.e., of previously
unseen tuples) than unpruned trees.

“How does tree pruning work?” There are two common approaches to tree pruning:

prepruning and

postpruning.

In the prepruning approach, a tree is “pruned” by halting its construction early (e.g., by
deciding not to further split or partition the subset of training tuples at a given node).
Upon halting, the node becomes a leaf.

Figure: An unpruned decision tree and a pruned version of it.

When constructing a tree, measures such as statistical significance, information gain,
Gini index, and so on can be used to assess the goodness of a split. If partitioning the tuples
at a node would result in a split that falls below a prespecified threshold, then further
partitioning of the given subset is halted.

The second and more common approach is postpruning, which removes subtrees from a
“fully grown” tree. A subtree at a given node is pruned by removing its branches and replacing
it with a leaf. The leaf is labeled with the most frequent class among the subtree being replaced.

C4.5 uses a method called pessimistic pruning, which is similar to the cost complexity
method used in CART in that it also uses error rate estimates to make decisions regarding
subtree pruning. Pessimistic pruning, however, does not require the use of a separate prune set.
Instead, it uses the training set to estimate error rates.

Scalability and Decision Tree Induction :

More recent decision tree algorithms that address the scalability issue have been
proposed. Algorithms for the induction of decision trees from very large training sets include
SLIQ and SPRINT, both of which can handle categorical and continuous valued attributes.

SLIQ employs disk-resident attribute lists and a single memory-resident class list. The
attribute lists and class list generated by SLIQ for the tuple data of the table below are shown
in the figure that follows it.

Table: Tuple data for the class buys computer.

Figure: Attribute list and class list data structures used in SLIQ for the tuple data of the
above table.

Table: Attribute list data structure used in SPRINT for the tuple data of the above table.

The use of data structures to hold aggregate information regarding the training data is
one approach to improving the scalability of decision tree induction.

While both SLIQ and SPRINT handle disk-resident data sets that are too large to fit into
memory, the scalability of SLIQ is limited by the use of its memory-resident data structure.

The RainForest method maintains an AVC-set (where AVC stands for “Attribute-Value,
Classlabel”) for each attribute, at each tree node, describing the training tuples at the node.

BOAT (Bootstrapped Optimistic Algorithm for Tree Construction) is a decision tree
algorithm that takes a completely different approach to scalability: it is not based on the use of
any special data structures. Instead, it uses a statistical technique known as “bootstrapping” to
create several smaller samples (or subsets) of the given training data, each of which fits in
memory.

Bayesian classifiers are statistical classifiers. They can predict class membership
probabilities, such as the probability that a given tuple belongs to a particular class.

Bayesian classification is based on Bayes’ theorem, described below. Studies comparing
classification algorithms have found a simple Bayesian classifier known as the naïve Bayesian
classifier to be comparable in performance with decision tree and selected neural network
classifiers. Bayesian classifiers have also exhibited high accuracy and speed when applied to
large databases.

 In general, machine learning algorithms need to be trained, whether for supervised
learning tasks such as classification and prediction or for unsupervised learning tasks such as
clustering.
 Training means fitting the algorithm on particular inputs so that later it can be tested
on unknown inputs (which it has never seen before), which it classifies or predicts
(in the case of supervised learning) based on what it has learned.
 This is what most machine learning techniques, such as neural networks, SVM and
Bayesian classifiers, are based upon.

 So in a general machine learning project, you divide your input set into
a Development Set (Training Set + Dev-Test Set) and a Test Set (or Evaluation Set).
 Remember that the basic objective is for the system to learn to classify new
inputs that it has never seen before in either the development set or the test set.

 The test set typically has the same format as the training set.
 However, it is very important that the test set be distinct from the training corpus: if we
simply reused the training set as the test set, then a model that simply memorized its
input, without learning how to generalize to new examples, would receive misleadingly
high scores.

 As a rule of thumb, for example, 70% of the cases can go to the training set. Also remember to partition the original set into the training and test sets randomly.
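The random 70/30 partition described above can be sketched as follows. This is only an illustration: the function name `train_test_split`, the fixed `seed`, and the 100 stand-in records are assumptions, not part of the notes.

```python
import random

def train_test_split(records, train_fraction=0.7, seed=0):
    """Randomly partition records into a training set and a test set."""
    shuffled = list(records)                 # copy so the caller's list is untouched
    random.Random(seed).shuffle(shuffled)    # random order, reproducible via the seed
    cut = int(round(train_fraction * len(shuffled)))
    return shuffled[:cut], shuffled[cut:]

records = list(range(100))                   # stand-in for 100 labelled tuples
train, test = train_test_split(records)
```

Because the two lists are slices of one shuffled copy, no record appears in both sets, which is the property the text asks for.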

Concept of Naïve Bayes Classification

Naive Bayes Classifier Introductory Overview

 The Naive Bayes Classifier technique is based on Bayes' theorem and is particularly suited when the dimensionality of the inputs is high.
 Despite its simplicity, Naive Bayes can often outperform more sophisticated classification methods.

 To demonstrate the concept of Naïve Bayes Classification, consider the example displayed in the illustration (a scatter plot of GREEN and RED objects).
 As indicated, the objects can be classified as either GREEN or RED.
 Our task is to classify new cases as they arrive, i.e., decide to which class label they belong, based on the currently existing objects.

 Since there are twice as many GREEN objects as RED, it is reasonable to believe that a
new case (which hasn't been observed yet) is twice as likely to have membership
GREEN rather than RED.
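 In Bayesian analysis, this belief is known as the prior probability. Prior probabilities are based on previous experience, in this case the percentage of GREEN and RED objects, and are often used to predict outcomes before they actually happen.

Thus, we can write:

Prior Probability of GREEN: number of GREEN objects / total number of objects

Prior Probability of RED: number of RED objects / total number of objects

 Since there is a total of 60 objects, 40 of which are GREEN and 20 RED, our prior probabilities for class membership are:

Prior Probability for GREEN: 40 / 60

Prior Probability for RED: 20 / 60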

 Having formulated our prior probability, we are now ready to classify a new object
(WHITE circle).
 Since the objects are well clustered, it is reasonable to assume that the more GREEN (or RED) objects in the vicinity of X, the more likely that the new case belongs to that particular color.
 To measure this likelihood, we draw a circle around X which encompasses a number (to be chosen a priori) of points irrespective of their class labels.
 Then we calculate the number of points in the circle belonging to each class label.
 From this we calculate the likelihood:

Likelihood of X given GREEN: number of GREEN in the vicinity of X / total number of GREEN cases

Likelihood of X given RED: number of RED in the vicinity of X / total number of RED cases

 From the illustration, it is clear that the Likelihood of X given GREEN is smaller than the Likelihood of X given RED, since the circle encompasses 1 GREEN object and 3 RED ones. Thus:

Probability of X given GREEN: 1 / 40

Probability of X given RED: 3 / 20

 Although the prior probabilities indicate that X may belong to GREEN (given that there
are twice as many GREEN compared to RED) the likelihood indicates otherwise; that
the class membership of X is RED (given that there are more RED objects in the
vicinity of X than GREEN).

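 In Bayesian analysis, the final classification is produced by combining both sources of information, i.e., the prior and the likelihood, to form a posterior probability using Bayes' rule (named after Rev. Thomas Bayes, 1702-1761).

Posterior probability of X being GREEN: prior probability of GREEN × likelihood of X given GREEN = 4/6 × 1/40 = 1/60

Posterior probability of X being RED: prior probability of RED × likelihood of X given RED = 2/6 × 3/20 = 1/20

 Finally, we classify X as RED since its class membership achieves the largest posterior probability.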

 Note. The above probabilities are not normalized. However, this does not affect the
classification outcome since their normalizing constants are the same.
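The GREEN/RED calculation can be checked with a few lines of Python. This is a minimal sketch: the counts come from the text (40 GREEN, 20 RED, and 1 GREEN and 3 RED inside the circle around X), while the variable names are illustrative. The likelihood is taken as the number of nearby objects of a class divided by the total number of objects of that class.

```python
from fractions import Fraction

total = {"GREEN": 40, "RED": 20}     # objects per class
nearby = {"GREEN": 1, "RED": 3}      # objects of each class inside the circle around X

n = sum(total.values())
posterior = {}
for label in total:
    prior = Fraction(total[label], n)                     # e.g. 40 / 60 for GREEN
    likelihood = Fraction(nearby[label], total[label])    # e.g. 1 / 40 for GREEN
    posterior[label] = prior * likelihood                 # unnormalized posterior

predicted = max(posterior, key=posterior.get)             # class with the largest posterior
```

Exact fractions are used so that the two unnormalized posteriors can be compared without rounding.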



Naïve Bayesian classifiers assume that the effect of an attribute value on a given class is
independent of the values of the other attributes. This assumption is called class conditional
independence. It is made to simplify the computations involved and, in this sense, is considered
“naïve.” Bayesian belief networks are graphical models, which unlike naïve Bayesian
classifiers, allow the representation of dependencies among subsets of attributes. Bayesian
belief networks can also be used for classification.


P(H|X) is the posterior probability, or a posteriori probability, of H conditioned on X.

In contrast, P(H) is the prior probability, or a priori probability, of H.

The posterior probability, P(H|X), is based on more information (e.g., customer information) than the prior probability, P(H), which is independent of X.

Similarly, P(X|H) is the posterior probability of X conditioned on H.

P(X) is the prior probability of X.

P(H), P(X|H), and P(X) may be estimated from the given data, as we shall see below. Bayes’ theorem is useful in that it provides a way of calculating the posterior probability, P(H|X), from P(H), P(X|H), and P(X). Bayes’ theorem is

P(H|X) = P(X|H) P(H) / P(X).

4.2.2.Naïve Bayesian Classification :

The naïve Bayesian classifier, or simple Bayesian classifier, works as follows:

1. Let D be a training set of tuples and their associated class labels. As usual, each tuple is represented by an n-dimensional attribute vector, X = (x1, x2, ..., xn), depicting n measurements made on the tuple from n attributes, respectively, A1, A2, ..., An.

2. Suppose that there are m classes, C1, C2, ..., Cm. Given a tuple, X, the classifier will predict that X belongs to the class having the highest posterior probability, conditioned on X. That is, the naïve Bayesian classifier predicts that tuple X belongs to the class Ci if and only if

P(Ci|X) > P(Cj|X) for 1 ≤ j ≤ m, j ≠ i.

Thus we maximize P(Ci|X). The class Ci for which P(Ci|X) is maximized is called the maximum posteriori hypothesis. By Bayes’ theorem,

P(Ci|X) = P(X|Ci) P(Ci) / P(X).

3. As P(X) is constant for all classes, only P(X|Ci) P(Ci) need be maximized. If the class
prior probabilities are not known, then it is commonly assumed that the classes are equally
likely, that is, P(C1) = P(C2) =…. = P(Cm), and we would therefore maximize P(X|Ci).

4. Given data sets with many attributes, it would be extremely computationally expensive to compute P(X|Ci). In order to reduce computation in evaluating P(X|Ci), the naive assumption of class conditional independence is made. This presumes that the values of the attributes are conditionally independent of one another, given the class label of the tuple (i.e., that there are no dependence relationships among the attributes). Thus,

P(X|Ci) = P(x1|Ci) × P(x2|Ci) × ... × P(xn|Ci).

For each attribute, we look at whether the attribute is categorical or continuous-valued. For
instance, to compute P(X|Ci), we consider the following:

(a) If Ak is categorical, then P(xk|Ci) is the number of tuples of class Ci in D having the
value xk for Ak, divided by |Ci,D|, the number of tuples of class Ci in D.

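(b) If Ak is continuous-valued, then we need to do a bit more work, but the calculation is pretty straightforward. A continuous-valued attribute is typically assumed to have a Gaussian distribution with a mean μ and standard deviation σ, defined by

g(x, μ, σ) = (1 / (sqrt(2π) σ)) exp(−(x − μ)² / (2σ²)),

so that P(xk|Ci) = g(xk, μCi, σCi), where μCi and σCi are the mean and standard deviation of the values of attribute Ak for the training tuples of class Ci.

5. In order to predict the class label of X, P(X|Ci)P(Ci) is evaluated for each class Ci. The classifier predicts that the class label of tuple X is the class Ci if and only if

P(X|Ci) P(Ci) > P(X|Cj) P(Cj) for 1 ≤ j ≤ m, j ≠ i.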
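Steps 1 to 5 can be sketched for categorical attributes as follows. The function names and the six training tuples (attributes age and student, class label buys a computer yes/no) are made up for illustration. No smoothing is applied, so an attribute value never seen with a class yields a probability of zero for that class.

```python
from collections import Counter

def train_naive_bayes(tuples, labels):
    """Count classes and (class, attribute index, value) combinations."""
    class_counts = Counter(labels)
    value_counts = Counter()
    for x, c in zip(tuples, labels):
        for k, value in enumerate(x):
            value_counts[(c, k, value)] += 1
    return class_counts, value_counts

def predict(x, class_counts, value_counts):
    """Return the class Ci that maximizes P(X|Ci) P(Ci)."""
    n = sum(class_counts.values())
    best_class, best_score = None, -1.0
    for c, count in class_counts.items():
        score = count / n                                  # P(Ci)
        for k, value in enumerate(x):
            score *= value_counts[(c, k, value)] / count   # P(xk|Ci)
        if score > best_score:
            best_class, best_score = c, score
    return best_class

# Hypothetical training data: (age, student) -> buys a computer?
tuples = [("youth", "yes"), ("youth", "no"), ("senior", "yes"),
          ("senior", "no"), ("youth", "yes"), ("senior", "no")]
labels = ["yes", "no", "yes", "no", "yes", "no"]
model = train_naive_bayes(tuples, labels)
```

Calling `predict(("youth", "yes"), *model)` evaluates P(X|Ci)P(Ci) for both classes and returns the larger one. P(X) is never computed because it is the same for every class.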

4.2.3.Bayesian Belief Networks :

The naïve Bayesian classifier makes the assumption of class conditional independence, that is, given the class label of a tuple, the values of the attributes are assumed to be independent of one another. This simplifies computation.

Bayesian belief networks specify joint conditional probability distributions. They allow
class conditional independencies to be defined between subsets of variables. They provide a
graphical model of causal relationships, on which learning can be performed. Trained Bayesian
belief networks can be used for classification. Bayesian belief networks are also known as
belief networks, Bayesian networks, and probabilistic networks. For brevity, we will refer to
them as belief networks.

A belief network is defined by two components: a directed acyclic graph and a set of conditional probability tables.

Each node in the directed acyclic graph represents a random variable. The variables may be discrete or continuous-valued. They may correspond to actual attributes given in the data or to “hidden variables” believed to form a relationship (e.g., in the case of medical data, a hidden variable may indicate a syndrome, representing a number of symptoms that, together, characterize a specific disease). Each arc represents a probabilistic dependence. If an arc is drawn from a node Y to a node Z, then Y is a parent or immediate predecessor of Z, and Z is a descendant of Y. Each variable is conditionally independent of its nondescendants in the graph, given its parents.

Figure 6.11 A simple Bayesian belief network: (a) A proposed causal model, represented by a
directed acyclic graph. (b) The conditional probability table for the values of the variable Lung
Cancer (LC) showing each possible combination of the values of its parent nodes, Family
History (FH) and Smoker (S).

A belief network has one conditional probability table (CPT) for each variable. The
CPT for a variable Y specifies the conditional distribution P(Y |Parents (Y)), where Parents(Y)
are the parents of Y.
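Because each variable depends only on its parents, the joint probability of a full assignment is the product of one CPT entry per variable. The sketch below illustrates this for a three-node network in which Lung Cancer (LC) has parents Family History (FH) and Smoker (S). Treating FH and S as root nodes, and every probability value shown, are assumptions made only for illustration.

```python
# Parents of each variable (FH and S are assumed to be root nodes).
parents = {"FH": (), "S": (), "LC": ("FH", "S")}

# One CPT per variable: cpt[var][parent values] = P(var = True | parents).
# All numbers are made up for illustration.
cpt = {
    "FH": {(): 0.2},
    "S": {(): 0.3},
    "LC": {(True, True): 0.8, (True, False): 0.5,
           (False, True): 0.7, (False, False): 0.1},
}

def joint_probability(assignment):
    """Multiply P(variable | Parents(variable)) over all variables."""
    p = 1.0
    for var, pars in parents.items():
        p_true = cpt[var][tuple(assignment[q] for q in pars)]
        p *= p_true if assignment[var] else 1.0 - p_true
    return p

p = joint_probability({"FH": True, "S": False, "LC": True})
```

Here `p` is P(FH) × P(not S) × P(LC | FH, not S) under the assumed tables.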

4.2.4.Training Bayesian Belief Networks :

The network topology (or “layout” of nodes and arcs) may be given in advance or
inferred from the data. The network variables may be observable or hidden in all or some of the
training tuples. The case of hidden data is also referred to as missing values or incomplete data.

If the network topology is known and the variables are observable, then training the
network is straightforward. It consists of computing the CPT entries, as is similarly done when
computing the probabilities involved in naive Bayesian classification. When the network
topology is given and some of the variables are hidden, there are various methods to choose
from for training the belief network.

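A gradient descent strategy is used to search for the wijk values that best model the data, based on the assumption that each possible setting of wijk is equally likely. Here wijk denotes the CPT entry for variable Yi taking the value yij when its parents Ui take the values uik, that is, wijk = P(Yi = yij | Ui = uik).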

The gradient descent method performs greedy hill-climbing in that, at each iteration or
step along the way, the algorithm moves toward what appears to be the best solution at the
moment, without backtracking. The weights are updated at each iteration. Eventually, they
converge to a local optimum solution.

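1. Compute the gradients: For each i, j, k, compute

∂ ln Pw(D) / ∂wijk = Σ (d = 1 to |D|) P(Yi = yij, Ui = uik | Xd) / wijk,

where Pw(D) is the probability of the training data D under the current weights and Xd is the d-th training tuple.

2. Take a small step in the direction of the gradient: The weights are updated by

wijk ← wijk + (l) ∂ ln Pw(D) / ∂wijk,

where l is the learning rate representing the step size and ∂ ln Pw(D) / ∂wijk is computed as in step 1.

3. Renormalize the weights: Because the weights wijk are probability values, they must be between 0.0 and 1.0, and Σj wijk must equal 1 for all i, k. Algorithms that follow this form of learning are called Adaptive Probabilistic Networks.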


In rule-based classifiers, the learned model is represented as a set of IF-THEN rules.

4.3.1. Using IF-THEN Rules for Classification :

Rules are a good way of representing information or bits of knowledge. A rule-based classifier uses a set of IF-THEN rules for classification. An IF-THEN rule is an expression of the form

IF condition THEN conclusion.

An example is rule R1,

R1: IF age = youth AND student = yes THEN buys_computer = yes.

The “IF”-part (or left-hand side) of a rule is known as the rule antecedent or precondition. The “THEN”-part (or right-hand side) is the rule consequent. In the rule antecedent, the condition consists of one or more attribute tests (such as age = youth, and student = yes) that are logically ANDed. The rule’s consequent contains a class prediction (in this case, we are predicting whether a customer will buy a computer). R1 can also be written as

R1: (age = youth) ∧ (student = yes) ⇒ (buys_computer = yes).

If the condition (that is, all of the attribute tests) in a rule antecedent holds true for a given tuple, we say that the rule antecedent is satisfied (or simply, that the rule is satisfied) and that the rule covers the tuple.

A rule R can be assessed by its coverage and accuracy. Given a tuple, X, from a class-labeled data set, D, let ncovers be the number of tuples covered by R; ncorrect be the number of tuples correctly classified by R; and |D| be the number of tuples in D. We can define the coverage and accuracy of R as

coverage(R) = ncovers / |D|

accuracy(R) = ncorrect / ncovers.

That is, a rule’s coverage is the percentage of tuples that are covered by the rule (i.e., whose
attribute values hold true for the rule’s antecedent). For a rule’s accuracy, we look at the tuples
that it covers and see what percentage of them the rule can correctly classify.
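Coverage and accuracy can be computed directly from these definitions. In this sketch a rule antecedent is represented as a dictionary of attribute tests. That representation, the function name, and the four example tuples are assumptions made for illustration.

```python
def coverage_and_accuracy(antecedent, predicted_class, data):
    """data is a list of (attribute dict, class label) pairs."""
    # A tuple is covered when every attribute test in the antecedent holds.
    covered = [(x, y) for x, y in data
               if all(x.get(a) == v for a, v in antecedent.items())]
    n_covers = len(covered)
    n_correct = sum(1 for _, y in covered if y == predicted_class)
    coverage = n_covers / len(data)                        # ncovers / |D|
    accuracy = n_correct / n_covers if n_covers else 0.0   # ncorrect / ncovers
    return coverage, accuracy

# Hypothetical class-labeled data set D.
data = [
    ({"age": "youth", "student": "yes"}, "yes"),
    ({"age": "youth", "student": "yes"}, "no"),
    ({"age": "youth", "student": "no"}, "no"),
    ({"age": "senior", "student": "yes"}, "yes"),
]
r1 = {"age": "youth", "student": "yes"}
coverage, accuracy = coverage_and_accuracy(r1, "yes", data)
```

With this data the rule covers two of the four tuples and classifies one of those two correctly.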

Let’s see how we can use rule-based classification to predict the class label of a given tuple, X. If a rule is satisfied by X, the rule is said to be triggered. For example, suppose X satisfies R1, which triggers the rule.

If R1 is the only rule satisfied, then the rule fires by returning the class prediction for X.

If more than one rule is triggered, we need a conflict resolution strategy to figure out which rule gets to fire and assign its class prediction to X. There are many possible strategies; two of them are size ordering and rule ordering.

The size ordering scheme assigns the highest priority to the triggering rule that has the “toughest” requirements, where toughness is measured by the rule antecedent size. That is, the triggering rule with the most attribute tests is fired.

The rule ordering scheme prioritizes the rules beforehand. The ordering may be class-based or rule-based.

With class-based ordering, the classes are sorted in order of decreasing “importance,”
such as by decreasing order of prevalence.

With rule-based ordering, the rules are organized into one long priority list, according
to some measure of rule quality such as accuracy, coverage, or size (number of attribute tests
in the rule antecedent), or based on advice from domain experts.
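Size ordering can be sketched as follows: among the triggered rules, the one with the most attribute tests fires. The rule representation (antecedent dictionary plus class), the `default` argument used when no rule is triggered, and the two example rules are assumptions for illustration.

```python
def classify(x, rules, default=None):
    """Fire the triggered rule with the largest antecedent (size ordering)."""
    triggered = [(antecedent, cls) for antecedent, cls in rules
                 if all(x.get(a) == v for a, v in antecedent.items())]
    if not triggered:
        return default                 # no rule is satisfied by x
    # Ties are broken in favor of the rule listed first.
    _, cls = max(triggered, key=lambda rule: len(rule[0]))
    return cls

# Hypothetical rule set in which both rules are triggered by the same tuple.
rules = [
    ({"age": "youth"}, "no"),
    ({"age": "youth", "student": "yes"}, "yes"),
]
x = {"age": "youth", "student": "yes", "income": "medium"}
label = classify(x, rules)
```

A rule-based ordering would instead keep `rules` sorted by priority and return the class of the first triggered rule.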

4.3.2. Rule Extraction from a Decision Tree :

To extract rules from a decision tree, one rule is created for each path from the root to a
leaf node. Each splitting criterion along a given path is logically ANDed to form the rule
antecedent (“IF” part). The leaf node holds the class prediction, forming the rule consequent
(“THEN” part).

A disjunction (logical OR) is implied between each of the extracted rules. Because the
rules are extracted directly from the tree, they are mutually exclusive and exhaustive. By
mutually exclusive, this means that we cannot have rule conflicts here because no two rules will
be triggered for the same tuple. (We have one rule per leaf, and any tuple can map to only one
leaf.) By exhaustive, there is one rule for each possible attribute-value combination, so that this
set of rules does not require a default rule. Therefore, the order of the rules does not matter—
they are unordered.
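Extracting one rule per root-to-leaf path can be sketched with a short recursive function. The tree encoding (an internal node is an `(attribute, branches)` tuple, a leaf is a class label string) and the example tree are hypothetical and chosen only to illustrate the idea.

```python
def extract_rules(tree, conditions=()):
    """Return a list of (antecedent, class) pairs, one per root-to-leaf path."""
    if not isinstance(tree, tuple):                  # leaf node: holds the class prediction
        return [(list(conditions), tree)]
    attribute, branches = tree
    rules = []
    for value, subtree in branches.items():
        # Each splitting criterion on the path is ANDed into the antecedent.
        rules.extend(extract_rules(subtree, conditions + ((attribute, value),)))
    return rules

# Hypothetical decision tree for a class label with values "yes" and "no".
tree = ("age", {
    "youth": ("student", {"no": "no", "yes": "yes"}),
    "middle_aged": "yes",
    "senior": ("credit_rating", {"excellent": "no", "fair": "yes"}),
})
rules = extract_rules(tree)
```

The tree has five leaves, so five rules are produced, and every tuple matches exactly one of them.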

The training tuples and their associated class labels are used to estimate rule accuracy.

Other problems arise during rule pruning, however, as the rules will no longer be mutually exclusive and exhaustive. For conflict resolution, C4.5 adopts a class-based ordering scheme. It groups all rules for a single class together, and then determines a ranking of these class rule sets. Within a rule set, the rules are not ordered. C4.5 orders the class rule sets so as to minimize the number of false-positive errors (i.e., where a rule predicts a class, C, but the actual class is not C). The class rule set with the least number of false positives is examined first. Once pruning is complete, a final check is done to remove any duplicates.

4.3.3.Rule Induction Using a Sequential Covering Algorithm :

IF-THEN rules can be extracted directly from the training data (i.e., without having to
generate a decision tree first) using a sequential covering algorithm.

Sequential covering algorithms are the most widely used approach to mining disjunctive
sets of classification rules, and form the topic of this subsection.

The basic sequential covering algorithm learns rules for one class at a time. Ideally, when learning a rule for a class, Ci, we would like the rule to cover all (or many) of the training tuples of class Ci and none (or few) of the tuples from other classes. In this way, the rules learned should be of high accuracy. The rules need not necessarily be of high coverage. This is because we can have more than one rule for a class, so that different rules may cover different tuples within the same class. Each time a rule is learned, the tuples it covers are removed, and the process repeats on the remaining tuples. The process continues until the terminating condition is met, such as when there are no more training tuples or the quality of a rule returned is below a user-specified threshold. The Learn One Rule procedure finds the “best” rule for the current class, given the current set of training tuples.
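The loop described above can be sketched as follows. This is a simplified illustration, not the exact procedure from the notes: rule quality is measured by plain accuracy, the tuple set is reset for each class, and the names `learn_one_rule`, `sequential_covering`, `min_accuracy`, and the four loan tuples are assumptions.

```python
def covers(antecedent, x):
    return all(x.get(a) == v for a, v in antecedent.items())

def learn_one_rule(data, target):
    """Greedily add the attribute test that most improves accuracy for `target`."""
    antecedent, remaining = {}, data

    def accuracy(rows):
        return sum(y == target for _, y in rows) / len(rows) if rows else 0.0

    while accuracy(remaining) < 1.0:
        candidates = {(a, v) for x, _ in remaining for a, v in x.items()
                      if a not in antecedent}
        best = None
        for a, v in sorted(candidates):
            rows = [(x, y) for x, y in remaining if x.get(a) == v]
            score = (accuracy(rows), len(rows))     # prefer accuracy, then coverage
            if best is None or score > best[0]:
                best = (score, a, v, rows)
        if best is None or best[0][0] <= accuracy(remaining):
            break                                   # no test improves the rule
        _, a, v, remaining = best
        antecedent[a] = v
    return antecedent, accuracy(remaining)

def sequential_covering(data, classes, min_accuracy=0.8):
    rules = []
    for c in classes:                               # rules are learned one class at a time
        rows = list(data)
        while any(y == c for _, y in rows):
            antecedent, acc = learn_one_rule(rows, c)
            if acc < min_accuracy:                  # quality below the threshold: stop
                break
            rules.append((antecedent, c))
            # Remove the tuples covered by the new rule and repeat on the rest.
            rows = [(x, y) for x, y in rows if not covers(antecedent, x)]
    return rules

# Hypothetical loan data.
data = [
    ({"income": "high", "credit": "excellent"}, "accept"),
    ({"income": "high", "credit": "fair"}, "accept"),
    ({"income": "low", "credit": "excellent"}, "reject"),
    ({"income": "low", "credit": "fair"}, "reject"),
]
rules = sequential_covering(data, ["accept", "reject"])
```

On this data one rule per class is enough, because income alone separates the two classes.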

Typically, rules are grown in a general-to-specific manner. Consider an example in which the classifying attribute is loan decision, which indicates whether a loan is accepted (considered safe) or rejected (considered risky). To learn a rule for the class “accept,” we start off with the most general rule possible, that is, the condition of the rule antecedent is empty. The rule is:

IF THEN loan decision = accept.

Figure: a general-to-specific search through rule space.

We then consider each attribute test that may be added to the rule. Suppose that income = high is selected as the best test, so that the rule becomes

IF income = high THEN loan decision = accept.

Each time we add an attribute test to a rule, the resulting rule should cover relatively more of the “accept” tuples, that is, a higher proportion of the tuples it covers should belong to class “accept.” During the next iteration, we again consider the possible attribute tests and end up selecting credit rating = excellent. Our current rule grows to become

IF income = high AND credit rating = excellent THEN loan decision = accept.

The process repeats, where at each step, we continue to greedily grow rules until the resulting rule meets an acceptable quality level.

4.3.4.Rule Quality Measures :

Learn One Rule needs a measure of rule quality. Every time it considers an attribute test, it must check to see if appending such a test to the current rule’s condition will result in an improved rule.

Choosing between two rules based on accuracy. Consider the two rules as illustrated in
Figure 6.14. Both are for the class loan decision = accept. We use “a” to represent the tuples of
class “accept” and “r” for the tuples of class “reject.” Rule R1 correctly classifies 38 of the 40
tuples it covers. Rule R2 covers only two tuples, which it correctly classifies. Their respective
accuracies are 95% and 100%. Thus, R2 has greater accuracy than R1, but it is not the better
rule because of its small coverage.

Figure 6.14: Rules for the class loan decision = accept, showing accept (a) and reject (r) tuples.

Another measure is based on information gain and was proposed in FOIL (First Order
Inductive Learner), a sequential covering algorithm that learns first-order logic rules.

Learning first-order rules is more complex because such rules contain variables, whereas
the rules we are concerned with in this section are propositional.

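FOIL assesses the information gained by extending the rule's condition. Let pos and neg be the number of positive and negative tuples covered by the current rule R, and let pos' and neg' be the corresponding numbers for the extended rule R'. The gain is

FOIL_Gain = pos' × (log2(pos' / (pos' + neg')) − log2(pos / (pos + neg))).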
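The FOIL gain formula translates directly into code. In this sketch the function name, the handling of `pos_new == 0`, and the example counts are assumptions for illustration.

```python
from math import log2

def foil_gain(pos, neg, pos_new, neg_new):
    """Information gained by extending a rule.

    pos, neg:         positive/negative tuples covered by the current rule
    pos_new, neg_new: positive/negative tuples covered by the extended rule
    """
    if pos_new == 0:
        return 0.0          # assumption: an extension covering no positives gains nothing
    return pos_new * (log2(pos_new / (pos_new + neg_new)) - log2(pos / (pos + neg)))

# Hypothetical counts: the current rule covers 50 positive and 50 negative tuples,
# the extended rule covers 40 positive and 10 negative tuples.
gain = foil_gain(50, 50, 40, 10)
```

The gain is positive when the extended rule covers a higher proportion of positive tuples, and it is scaled by how many positive tuples the extended rule still covers, so the measure favors rules that are both accurate and cover many positive tuples.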

4.3.5.Rule Pruning :

The rules may perform well on the training data, but less well on subsequent data. To
compensate for this, we can prune the rules. A rule is pruned by removing a conjunct (attribute
test). We choose to prune a rule, R, if the pruned version of R has greater quality, as assessed on
an independent set of tuples. FOIL uses a simple yet effective method. Given a rule, R,

FOIL_Prune(R) = (pos − neg) / (pos + neg),

where pos and neg are the number of positive and negative tuples covered by R, respectively.

This value will increase with the accuracy of R on a pruning set. Therefore, if the FOIL_Prune value is higher for the pruned version of R, then we prune R.
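The pruning test can be sketched in a few lines. The function names and the example counts on the pruning set are hypothetical.

```python
def foil_prune(pos, neg):
    """FOIL_Prune value of a rule covering pos positive and neg negative tuples."""
    return (pos - neg) / (pos + neg)

def should_prune(full_counts, pruned_counts):
    """Prune when the pruned rule scores higher on the pruning set."""
    return foil_prune(*pruned_counts) > foil_prune(*full_counts)

# Hypothetical (pos, neg) counts measured on an independent pruning set.
full_rule = (18, 6)      # rule with all its attribute tests
pruned_rule = (30, 6)    # same rule with one conjunct removed
decision = should_prune(full_rule, pruned_rule)
```

Both rules are evaluated on the same independent set of tuples, as the text requires.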

Classification by Back Propagation:

Classification is a data mining (machine learning) technique used to predict group membership for data instances. Classification means evaluating a function that assigns a class label to a data item. It is a supervised learning process: it uses a training set that contains the correct answers (the class label attribute). Classification proceeds in these steps: first, create a model by running the algorithm on the training data; then test the model. If the accuracy is low, regenerate the model after changing features or reconsidering samples. Finally, use the model to identify a class label for incoming new data. So the problem here is to develop the classification model using the available training set, which needs to be normalized. This data is then given to the back propagation algorithm for classification. After applying the back propagation algorithm, a genetic algorithm is applied for weight adjustment. The developed model can then be applied to classify the unknown tuples from the given database, and this information may be used by a decision maker to make useful decisions. If one can write down a flow chart or a formula that accurately describes the problem, then a traditional programming method is the better choice. There are many data mining tasks, however, that are not solved efficiently with simple mathematical formulas. Large-scale data mining applications involving complex decision making can access billions of bytes of data. Hence, the efficiency of such applications is paramount. Classification is a key data mining technique.

An Artificial Neural Network (ANN) is a computational model based on the biological neural network. An Artificial Neural Network is often simply called a Neural Network (NN). To build an artificial neural network, artificial neurons, also called nodes, are interconnected. The architecture of the NN is very important for performing a particular computation. Some neurons are arranged to take inputs from the outside environment. These neurons are not connected with each other, so they are arranged in a layer, called the input layer. All the neurons of the input layer produce some output, which is the input to the next layer. The architecture of an NN can be single layer or multilayer. A single-layer neural network has only one input layer and one output layer, while a multilayer neural network can also have one or more hidden layers.

An artificial neuron is an abstraction of biological neurons and the basic unit in an ANN. The artificial neuron receives one or more inputs and sums them to produce an output. Usually the inputs to each node are weighted, and the sum is passed through a function known as an activation or transfer function. The objective here is to develop a data classification algorithm that will be used as a general-purpose classifier. To classify any database, it is first required to train the model. The proposed training algorithm used here is a hybrid BP-GA. After successful training, the user can give unlabeled data to classify.

An artificial neuron consists of the following parts.

The synapses or connecting links: these provide weights, wj, to the input values, xj, for j = 1, ..., m.

An adder: this sums the weighted input values to compute the input to the activation function,

v = w0 + w1 x1 + w2 x2 + ... + wm xm.

w0, called the bias, is a numerical value associated with the neuron. It is convenient to think of the bias as the weight for an input x0 whose value is always equal to one, so that

v = w0 x0 + w1 x1 + ... + wm xm, with x0 = 1.

An activation function g: this maps v to g(v), the output value of the neuron. This function is a monotone function. A commonly used choice is the logistic (sigmoid) function, g(v) = e^v / (1 + e^v). The practical value of the logistic function arises from the fact that it is almost linear in the range where g is between 0.1 and 0.9 but has a squashing effect on very small or very large values of v.
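A single artificial neuron with a logistic activation can be sketched as follows. The function names and the convention that `weights[0]` holds the bias w0 are assumptions for illustration.

```python
import math

def logistic(v):
    """Logistic activation: e^v / (1 + e^v), written in the equivalent form 1 / (1 + e^-v)."""
    return 1.0 / (1.0 + math.exp(-v))

def neuron_output(weights, inputs):
    """weights[0] is the bias w0; inputs holds x1..xm (the constant x0 = 1 is implicit)."""
    v = weights[0] + sum(w * x for w, x in zip(weights[1:], inputs))   # the adder
    return logistic(v)                                                 # the activation

out = neuron_output([0.0, 1.0, -1.0], [2.0, 2.0])   # weighted sum v is 0 here
```

The output always lies strictly between 0 and 1, which is the squashing effect mentioned above.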

ANN Learning: Back Propagation Algorithm

The back propagation algorithm cycles through two distinct passes, a forward pass
followed by a backward pass through the layers of the network. The algorithm alternates
between these passes several times as it scans the training data.

Forward Pass: Computation of outputs of all the neurons in the network

• The algorithm starts with the first hidden layer using as input values the independent
variables of a case from the training data set.

• The neuron outputs are computed for all neurons in the first hidden layer by performing the
relevant sum and activation function evaluations.

• These outputs are the inputs for neurons in the second hidden layer. Again the relevant sum and activation function calculations are performed to compute the outputs of the second layer neurons.

Backward pass: Propagation of error and adjustment of weights

• This phase begins with the computation of error at each neuron in the output layer. A popular
error function is the squared difference between ok the output of node k and yk the target value
for that node.

• The target value is just 1 for the output node corresponding to the class of the exemplar and
zero for other output nodes.

• The new value of the weight wjk of the connection from node j to node k is given by:

wjk(new) = wjk(old) + η oj δk,

where oj is the output of node j and δk is the error term computed for node k. Here η is an important tuning parameter that is chosen by trial and error by repeated runs on the training data. Typical values for η are in the range 0.1 to 0.9.

• The backward propagation of weight adjustments along these lines continues until we reach the input layer.

• At this time we have a new set of weights on which we can make a new forward pass when presented with a training data observation.
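The forward and backward passes can be sketched for a network with one hidden layer. This is a minimal illustration under stated assumptions: logistic activations throughout, squared error, error terms of the form o(1 − o)(y − o) at the output nodes, and weights updated after every training case. The layer sizes, the initial weight range, and the single training case are made up.

```python
import math
import random

def logistic(v):
    return 1.0 / (1.0 + math.exp(-v))

def forward(x, w_hidden, w_out):
    # Each weight row starts with the bias, matched to a constant input of 1.
    h = [logistic(sum(w * a for w, a in zip(row, [1.0] + x))) for row in w_hidden]
    o = [logistic(sum(w * a for w, a in zip(row, [1.0] + h))) for row in w_out]
    return h, o

def train_step(x, y, w_hidden, w_out, eta=0.5):
    """One forward pass followed by one backward pass; returns the squared error."""
    h, o = forward(x, w_hidden, w_out)
    # Error terms at the output layer, then propagated back to the hidden layer.
    delta_out = [ok * (1 - ok) * (yk - ok) for ok, yk in zip(o, y)]
    delta_hid = [hj * (1 - hj) * sum(w_out[k][j + 1] * delta_out[k] for k in range(len(o)))
                 for j, hj in enumerate(h)]
    # Weight update: w_new = w_old + eta * (output of source node) * (delta of target node).
    for k, row in enumerate(w_out):
        for j, a in enumerate([1.0] + h):
            row[j] += eta * a * delta_out[k]
    for j, row in enumerate(w_hidden):
        for i, a in enumerate([1.0] + x):
            row[i] += eta * a * delta_hid[j]
    return sum((yk - ok) ** 2 for ok, yk in zip(o, y))

rng = random.Random(0)
w_hidden = [[rng.uniform(-0.5, 0.5) for _ in range(3)] for _ in range(2)]  # 2 inputs, 2 hidden nodes
w_out = [[rng.uniform(-0.5, 0.5) for _ in range(3)]]                       # 1 output node
errors = [train_step([1.0, 0.0], [1.0], w_hidden, w_out) for _ in range(200)]
```

The list `errors` records the squared error before each update, so comparing its first and last entries shows whether repeated passes over the same case reduce the error.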

Parameters to be considered to build BP algorithm

Initial weight range (r): Weights are initialized to values in the range [-r, r].

Number of hidden layers: Up to four hidden layers can be specified; see the overview section for more detail on layers in a neural network (input, hidden and output). Let us specify the number to be 1.

Number of Nodes in Hidden Layer: Specify the number of nodes in each hidden layer. Selecting the number of hidden layers and the number of nodes is largely a matter of trial and error.
Number of Epochs: An epoch is one sweep through all the records in the training set.
Increasing this number will likely improve the accuracy of the model, but at the cost of time,
and decreasing this number will likely decrease the accuracy, but take less time.

Step size (Learning rate) for gradient descent: This is the multiplying factor for the error
correction during back propagation; it is roughly equivalent to the learning rate for the neural
network. A low value produces slow but steady learning; a high value produces rapid but
erratic learning. Values for the step size typically range from 0.1 to 0.9.

Error tolerance: The error in a particular iteration is back propagated only if it is greater than
the error tolerance. Typically error tolerance is a small value in the range 0 to 1.

Hidden layer sigmoid: The output of every hidden node passes through a sigmoid function.
Standard sigmoid function is logistic; the range is between 0 and 1.

Why choose a Back Propagation Neural Network?

• A study comparing the feed forward network, the recurrent neural network and the time-delay neural network shows that the highest correct classification rate is achieved by the fully connected feed forward neural network.

• From table 3.1, it can be seen that results obtained from BPNN are better than those obtained from MLC.

• From the results in table 3.3, it can be seen that MLP has the highest classification rate, with reasonable error, and the time taken to classify is also reasonably low.

• From table 3.3, it can be seen that, considering some of the performance parameters, BPNN is better than the other methods: GA, KNN and MLC.

Classification consists of examining the properties of a newly presented observation and assigning it to a predefined class.
Assigning customers to predefined customer segments (good vs. bad)
Assigning keywords to articles
Classifying credit applicants as low, medium, or high risk
Classifying instructor rating as excellent, very good, good, fair, or poor
Classification means that, based on the properties of existing data, we have made groups, i.e. we have made a classification. The concept can be well understood through a very simple example of student grouping. A student can be grouped as either good or bad depending on his previous record. Similarly, an employee can be grouped as excellent, good, fair, etc. based on his track record in the organization. So how were the students or employees classified? The answer is: using historical data. Yes, history is the best predictor of the future. When an organization conducts interviews with candidate employees, their performance is compared with that of the existing employees. This knowledge can be used to predict how well a candidate would perform if employed. So we are doing classification, here absolute classification, i.e. either good or bad; in other words, we are doing binary classification. Either you are in this group or in that one. Each entity is assigned to one of the groups or classes. An example where classification can prove useful is customer segmentation. Businesses can classify their customers as either good or bad; this knowledge can then be used for executing targeted marketing plans. Another example is a news site, where there are a number of visitors and also many content developers. Now where should a specific news item be placed on the web site? What should be the hierarchical position of the item, and what should be its chapter or category? Should it be in the sports section, the weather section, and so on? What is the problem in doing all this? The problem is that it is not a matter of placing a single news item. The site, as already mentioned, has a number of content developers and also many categories. If the sorting is performed by humans, it is time consuming. That is why classification techniques are used to scan and process a document and decide its category or class. How, and what sort of processing, will be discussed in the next lecture. There can be flaws in assigning a category to a news document based on a keyword alone. The occurrence of the keyword cricket in a document does not necessarily mean that the document should be placed in the sports category. The document may actually be political in nature.

Same as classification or estimation except records are classified according to some predicted
future behavior or estimated value.
Using class ification or estimation on a training example with known predicted values and
historical data a model is built.
Then explain the known values, and use the model to predict future.
Predicting how much customers will spend during next 6 months.
Prediction here is not like a palmist's approach of "if this line, then this". Prediction means
estimating the probability that an item/event/customer falls in a specific class. In other words,
prediction tells us in which class a specific item would lie in the future, or to which class a
specific event can be assigned at any time in the future, say after six years. How does prediction actually
work? First of all a model is built using existing data. The existing data set is divided into two
subsets: one is called the training set and the other is called the test set. The training set is used to
form the model and the associated rules. Once the model is built and the rules are defined, the test set is used for
grouping. It must be noted that the test set groupings are already known, but they are put in the model
to test its accuracy. Accuracy, which we will discuss in detail in the following slides, depends on
many factors such as the model, the selection and sizes of the training and test data, and many more.
So, the accuracy gives the confidence level, i.e. the degree to which the rules are accurate.
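
The training/test procedure described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the data set (past spend and a high-spender label), the 30% test fraction and the one-rule threshold model are all hypothetical and are not taken from the lecture.

```python
import random

def train_test_split(records, test_fraction=0.3, seed=0):
    """Shuffle the records and split them into a training set and a test set."""
    rng = random.Random(seed)
    shuffled = records[:]
    rng.shuffle(shuffled)
    n_test = int(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]

def fit_threshold(train):
    """Form a one-rule model: the spend threshold that classifies the training set best."""
    best_threshold, best_correct = None, -1
    for threshold, _ in train:
        correct = sum((spend >= threshold) == label for spend, label in train)
        if correct > best_correct:
            best_threshold, best_correct = threshold, correct
    return best_threshold

def accuracy(threshold, test):
    """Fraction of test records whose known class matches the model's answer."""
    correct = sum((spend >= threshold) == label for spend, label in test)
    return correct / len(test)

# Hypothetical records: (past spend, is a high spender?) with a known class for each.
records = [(spend, spend >= 50) for spend in range(0, 100, 5)]
train, test = train_test_split(records)
model = fit_threshold(train)            # built from the training set only
test_accuracy = accuracy(model, test)   # checked against the known test classes
```

The test set plays no part in building the rule, so `test_accuracy` is an estimate of how the rule behaves on unseen records.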

Prediction can be well understood by considering a simple example. Suppose a business wants to
know about its customers' propensity to buy/spend/purchase. In other words, how much
will the customer spend in the next 6 months? Similarly, a mobile phone company can install a new
tower based on knowledge of the spending habits of its customers in the surroundings. It is not the
case that companies install facilities or invest money because of their gut feelings. If you think
like this you are absolutely wrong. Why should companies bother about their customers? Because
if they know their customers, their interests, their likes and dislikes, and their buying patterns, then it is
possible to run targeted marketing campaigns and thus increase profit.

Task of segmenting a heterogeneous population into a number of more homogenous sub-
groups or clusters.
Unlike classification, it does NOT depend on predefined classes.
It is up to you to determine what meaning, if any, to attach to the resulting clusters.
It could be the first step to the market segmentation effort.
What else can data mining do? We can do clustering with DM. Clustering is the technique of
reshuffling and relocating existing segments in given data, which is mostly heterogeneous, so that the
new segments have more homogeneous data items. This can be very easily understood by a
simple example. Suppose some items have been segmented on the basis of color in the given data.
Suppose the items are fruits, then the green segment may contain all green fruits like apple,
grapes etc. thus a heterogeneous mixture of items. Clustering segregates such items and brings
apples in one segment or cluster although it may contain apples of different colors red, green,
yellow etc. thus a more homogeneous cluster than the previous cluster.
Clustering is a difficult task. Why? In the case of classification we already know the number of
classes: good or bad, yes or no, or any number of classes. We also have knowledge of the
classes' properties, so it is easy to segment data into known classes. However, in the case of clustering
we don't know the number of clusters a priori. Once clusters are found in the data, business
intelligence and domain knowledge are needed to analyze the found clusters. Clustering can be the
first step towards market segmentation, i.e. we can use clustering to discover the possible
clusters in the data. Once clusters are found and analyzed, classification can be applied, thus gaining
more accuracy than any standalone technique. Thus clustering is at a higher level than classification,
not only because of its complexity but also because it leads to classification.
Examples of Clustering Applications
Marketing: Discovering distinct groups in customer databases, such as customers who
make a lot of long-distance calls and don't have a job. Who are they? Students. Marketers
use this knowledge to develop targeted marketing programs.
Insurance: Identifying groups of crop insurance policy holders with a high average claim
rate. Farmers crash crops, when it is "profitable".
Land use: Identification of areas of similar land use in a GIS database.
Seismic studies: Identifying probable areas for oil/gas exploration based on seismic data.
We discussed what clustering is and how it works. Now, to know the real spirit of it, let's look
at some real-world examples that show the blessings of clustering:
1. Knowing or discovering about your market segment: Suppose a telecom company's
data, when clustered, revealed that there is a group or cluster of customers who make
a large number of long-distance calls. Is it a discovery that such a group exists? No,
not really. The real discovery is analyzing the cluster, the real fun part. Why are these people
in a cluster? That is important to know. Analysis of the cluster reveals that all the people in the
group are unemployed! How is it possible that unemployed people are making
expensive long-distance calls? The excitement led to further analysis, which ultimately
revealed that the people in the cluster were mostly students, students like you living away
from home in universities, colleges and hostels. They are making calls back home. So this is
a real example of clustering. Now the same question: what is the benefit of knowing all this?
The answer is that the customer is an asset for any organization. Knowing the customer is crucial
for any organization/company so as to satisfy the customer, which is key to any company's
success in terms of profit. The company can run targeted sales promotions and marketing
efforts aimed at these customers, i.e. students.
2. Insurance: Now let's have a look at how clustering plays a role in the insurance sector. Insurance
companies are interested in knowing which people have higher insurance claims. You may
be astonished that clustering has successfully been used in a developed country to detect farmer
insurance abuses. Some malicious farmers used to crash their crops intentionally to
gain insurance money, which presumably was higher than the amount of profit and effort from
their crops. The farmer was happy but the loss was to be borne by the insurance company. The
company successfully used clustering techniques to identify such farmers, thus saving a
lot of money.
Clustering thus has a wider scope in real life applications. Other areas where clustering is being
used are for city planning, GIS (Land use management), seismic data for mining (real mining)
and the list goes on.
Ambiguity in Clustering
How many clusters?
o Two clusters
o Four clusters
o Six clusters
Figure-30.2: Ambiguity in Clustering
As we mentioned, the spirit of clustering lies in its analysis. A common ambiguity in clustering is
regarding the number of clusters, since the clusters are not known in advance. To understand the
problem, consider the example in Figure 30.2. The black dots represent individual data records or
tuples, and they are placed as a result of a clustering algorithm. Now can you tell how many clusters
are there?
Yes, two clusters. But look at your screens again and tell how many clusters now?
Yes, four clusters now, you are absolutely right. Now look again and tell how many clusters?
6 clusters, as shown in Figure 30.2. What does all this show? It shows that deciding upon the
number of clusters is a complex task depending on factors like level of detail, application
etc. By level of detail I mean that either the black point represents a single record or an
The thing which is important is to know how many clusters solve our problem. Understanding
this solves the problem.
Describe what is going on in a complicated database so as to increase our understanding.
A good description of a behavior will suggest an explanation as well.
Another application of DM is description. To know what is happening in our databases is
beneficial. How? The OLAP cubes provide ample amount of information, which is otherwise
distributed in the haystack. We can move the cube in different angles to get to the information
of interest. However, we might miss the angle which might have given us some useful
Description is used to describe such things.


Comparing Methods (1)

Predictive accuracy: this refers to the ability of the model to correctly predict the class
label of new or previously unseen data
Speed: this refers to the computation costs involved in generating and using the method.
Robustness: this is the ability of the method to make correct predictions/groupings given
noisy data or data with missing values
We discussed different data mining techniques. Now the question: which technique is good and
which is bad? Or, put differently, which is the best technique for a given problem? Thus we need
evaluation criteria, like the data metrics we used in the data quality lecture. The metrics we use for
comparison of DM techniques are:
Accuracy: Accuracy is the measure of correctness of your model. E.g., in classification we have
two data sets, the training and test sets. A classification model is built based on the data properties
and relationships in the training data. Once built, the model is tested for accuracy in terms of %
correct results, as the classification of the test data is already known. So we specify the accuracy
or confidence level of the technique in terms of % accuracy.
Speed: In previous lectures we discussed the term "Need for Speed". Yes, speed is a crucial
aspect of DM techniques. Speed refers to the time complexity. If one technique has O(n) and
another has O(n log n) time complexity, then which is better? Yes, O(n) is better. This is the
computational time, but the user or business decision maker is interested in the absolute clock time.
He has nothing to do with complexities. What he is interested in is knowing how fast he gets
answers. So just comparing on the basis of complexities is not sufficient. We must look at the
overall process and interdependencies among tasks which ultimately result in the answer or
Robustness: It is the ability of the technique to work accurately even with noisy or
dirty data. Missing data is a reality, and so is the presence of noise. So a technique is better if it
can run smoothly even under stress conditions, i.e. with noisy and missing data.
Scalability: As we mentioned in our initial lectures, the main motivation for data
warehousing is to deal with huge amounts of data. So scaling is very important, which is the ability
of the method to work efficiently even when the data size is huge.
Interpretability: It refers to the level of understanding and insight that is provided by the
method. As we discussed in clustering one of the complex and difficult tasks is the cluster
analysis. The techniques can be compared on the basis of their interpretational ability e.g. there
might be some methods which give additional functionalities to provide meaning to the
discovered information, like color coding, plots and curve fittings, etc.

Classification and Prediction

Classification is the process of finding a model (or function) that describes and distinguishes
data classes or concepts, for the purpose of being able to use the model to predict the class of
objects whose class label is unknown. The derived model is based on the analysis of a set of
training data (i.e., data objects whose class label is known).

“How is the derived model presented?” The derived model may be represented in various
forms, such as classification (IF-THEN) rules, decision trees, mathematical formulae, or
neural networks (Figure 1.10).

A decision tree is a flow-chart-like tree structure, where each node denotes a test on an
attribute value, each branch represents an outcome of the test, and tree leaves represent classes
or class distributions. Decision trees can easily be converted to classification rules. A neural
network, when used for classification, is typically a collection of neuron-like processing units
with weighted connections between the units. There are many other methods for constructing
classification models, such as naïve Bayesian classification, support vector machines, and k-
nearest neighbor classification.
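
To illustrate how a decision tree can be converted to classification (IF-THEN) rules, here is a minimal Python sketch. The tree itself (a test on price, then a test on brand) and its class labels are hypothetical; only the structure (nodes test an attribute, branches are outcomes, leaves are classes) follows the description above.

```python
# A hypothetical decision tree: each internal node tests one attribute,
# each branch is an outcome of the test, and each leaf is a class label.
tree = {
    "attribute": "price",
    "branches": {
        "low": "good response",
        "high": {
            "attribute": "brand",
            "branches": {"A": "mild response", "B": "no response"},
        },
    },
}

def classify(node, item):
    """Follow the branches matching the item's attribute values until a leaf is reached."""
    while isinstance(node, dict):
        node = node["branches"][item[node["attribute"]]]
    return node

def to_rules(node, conditions=()):
    """Emit one IF-THEN rule per root-to-leaf path of the tree."""
    if not isinstance(node, dict):
        antecedent = " AND ".join(f"{a} = {v}" for a, v in conditions) or "TRUE"
        return [f"IF {antecedent} THEN class = {node}"]
    rules = []
    for value, child in node["branches"].items():
        rules += to_rules(child, conditions + ((node["attribute"], value),))
    return rules

rules = to_rules(tree)
label = classify(tree, {"price": "high", "brand": "B"})
```

Each rule is simply the conjunction of the tests along one path, which is why the conversion is mechanical.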

Whereas classification predicts categorical (discrete, unordered) labels, prediction models
continuous-valued functions. That is, it is used to predict missing or unavailable numerical data
values rather than class labels. Although the term prediction may refer to both numeric
prediction and class label prediction,

Example: Classification and prediction. Suppose, as sales manager of AllElectronics, you
would like to classify a large set of items in the store, based on three kinds of responses to a
sales campaign: good response, mild response, and no response. You would like to derive a
model for each of these three classes based on the descriptive features of the items, such as
price, brand, place_made, type, and category. The resulting classification should maximally
distinguish each class from the others, presenting an organized picture of the data set. Suppose
that the resulting classification is expressed in the form of a decision tree. The decision tree, for
instance, may identify price as being the single factor that best distinguishes the three classes.
The tree may reveal that, after price, other features that help further distinguish objects of each
class from another include brand and place_made. Such a decision tree may help you
understand the impact of the given sales campaign and design a more effective campaign for
the future.

Figure 1.10 A classification model can be represented in various forms,
such as (a) IF-THEN rules, (b) a decision tree, or (c) a neural network.

Associative Classification
 Associative classification
 Association rules are generated and analyzed for use in classification
 Search for strong associations between frequent patterns (conjunctions of
attribute-value pairs) and class labels
 Classification: Based on evaluating a set of rules in the form of
p1 ∧ p2 ∧ … ∧ pl → “Aclass = C” (conf, sup)
 Why effective?
 It explores highly confident associations among multiple attributes and may
overcome some constraints introduced by decision-tree induction, which
considers only one attribute at a time
 In many studies, associative classification has been found to be more accurate
than some traditional classification methods, such as C4.5

Typical Associative Classification Methods

 CBA (Classification By Association: Liu, Hsu & Ma, KDD’98)
 Mine possible association rules in the form of
 Cond-set (a set of attribute-value pairs) → class label
 Build classifier: Organize rules according to decreasing precedence based on
confidence and then support
 CMAR (Classification based on Multiple Association Rules: Li, Han, Pei, ICDM’01)
 Classification: Statistical analysis on multiple rules
 CPAR (Classification based on Predictive Association Rules: Yin & Han, SDM’03)
 Generation of predictive rules (FOIL-like analysis)

 High efficiency, accuracy similar to CMAR
 RCBT (Mining top-k covering rule groups for gene expression data, Cong et al.
 Explore high-dimensional classification, using top-k rule groups
 Achieve high classification accuracy and high run-time efficiency
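
The CBA idea of ordering rules by decreasing precedence (confidence, then support) and classifying with the first matching rule can be sketched as follows. This is a simplified illustration, not the published CBA algorithm: rule mining is skipped, and the rules, their confidence/support values and the default class are hypothetical.

```python
from typing import NamedTuple

class Rule(NamedTuple):
    cond_set: frozenset  # set of (attribute, value) pairs
    label: str           # class label predicted by the rule
    conf: float          # confidence of the rule
    sup: float           # support of the rule

def build_classifier(rules):
    """Order rules by decreasing precedence: confidence first, then support."""
    return sorted(rules, key=lambda r: (-r.conf, -r.sup))

def classify(ordered_rules, record, default):
    """Return the label of the first rule whose cond-set is satisfied by the record."""
    items = set(record.items())
    for rule in ordered_rules:
        if rule.cond_set <= items:
            return rule.label
    return default

# Hypothetical rules, assumed to have been mined already.
rules = [
    Rule(frozenset({("age", "youth")}), "no", 0.60, 0.30),
    Rule(frozenset({("age", "youth"), ("student", "yes")}), "yes", 0.90, 0.10),
    Rule(frozenset({("income", "high")}), "yes", 0.90, 0.20),
]
ordered = build_classifier(rules)
prediction = classify(ordered, {"age": "youth", "student": "yes", "income": "low"}, "no")
```

Because the more confident rule for young students precedes the general rule for youths, the record above is labeled by the former.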

A Closer Look at CMAR

 CMAR (Classification based on Multiple Association Rules: Li, Han, Pei, ICDM’01)
 Efficiency: Uses an enhanced FP-tree that maintains the distribution of class labels
among tuples satisfying each frequent itemset
 Rule pruning whenever a rule is inserted into the tree
 Given two rules, R1 and R2, if the antecedent of R1 is more general than that of R2
and conf(R1) ≥ conf(R2), then R2 is pruned
 Prunes rules for which the rule antecedent and class are not positively correlated,
based on a χ2 test of statistical significance
 Classification based on generated/pruned rules
 If only one rule satisfies tuple X, assign the class label of the rule
 If a rule set S satisfies X, CMAR
 divides S into groups according to class labels
 uses a weighted χ2 measure to find the strongest group of rules, based on
the statistical correlation of rules within a group
 assigns X the class label of the strongest group
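
The first pruning condition above (a more general rule with at least the same confidence prunes a more specific one) can be sketched in Python. This is only an illustration of that single condition: the χ2 test and the tree-based rule storage are left out, and the rules and confidences are hypothetical.

```python
def prune(rules):
    """rules: list of (antecedent frozenset, class label, confidence).

    A rule R2 is dropped if some other rule R1 has a strictly more general
    antecedent (a proper subset of R2's) and conf(R1) >= conf(R2).
    """
    kept = []
    for i, (ant2, label2, conf2) in enumerate(rules):
        pruned = any(
            j != i and ant1 < ant2 and conf1 >= conf2
            for j, (ant1, _, conf1) in enumerate(rules)
        )
        if not pruned:
            kept.append((ant2, label2, conf2))
    return kept

# Hypothetical rules: the second is more specific than the first and less confident.
rules = [
    (frozenset({"a"}), "yes", 0.90),
    (frozenset({"a", "b"}), "yes", 0.80),
    (frozenset({"a", "c"}), "no", 0.95),
]
kept = prune(rules)
```

The rule on {a, c} survives because its confidence is higher than that of the general rule on {a}.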


Bayesian Classification
 A statistical classifier: performs probabilistic prediction, i.e., predicts class membership probabilities

 Foundation: Based on Bayes’ Theorem.

 Performance: A simple Bayesian classifier, naïve Bayesian classifier, has comparable

performance with decision tree and selected neural network classifiers

 Incremental: Each training example can incrementally increase/decrease the probability

that a hypothesis is correct — prior knowledge can be combined with observed data

 Standard: Even when Bayesian methods are computationally intractable, they can
provide a standard of optimal decision making against which other methods can be measured

Bayesian Theorem: Basics

 Let X be a data sample (“evidence”): class label is unknown

 Let H be a hypothesis that X belongs to class C

 Classification is to determine P(H|X), the probability that the hypothesis holds given the
observed data sample X

 P(H) (prior probability), the initial probability

 E.g., X will buy computer, regardless of age, income, …

 P(X): probability that sample data is observed

 P(X|H) (likelihood), the probability of observing the sample X, given that the
hypothesis holds

 E.g., Given that X will buy computer, the prob. that X is 31..40, medium income

Bayesian Theorem

 Given training data X, the posteriori probability of a hypothesis H, P(H|X), follows
Bayes' theorem

$$P(H \mid X) = \frac{P(X \mid H)\,P(H)}{P(X)}$$


 Informally, this can be written as

posteriori = likelihood x prior/evidence

 Predicts X belongs to Ci iff the probability P(Ci|X) is the highest among all the P(Ck|X)
for all the k classes

 Practical difficulty: require initial knowledge of many probabilities, significant
computational cost
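
As a worked illustration of Bayes' theorem, the sketch below computes P(Ci|X) for two classes and predicts the class with the highest posterior. The class names, priors and likelihoods are made-up numbers chosen only for the example.

```python
def posteriors(priors, likelihoods):
    """Return P(Ci|X) for every class Ci using Bayes' theorem."""
    joint = {c: likelihoods[c] * priors[c] for c in priors}  # P(X|Ci) * P(Ci)
    evidence = sum(joint.values())                            # P(X)
    return {c: joint[c] / evidence for c in joint}

# Hypothetical values: P(Ci) and P(X|Ci) for one observed sample X.
priors = {"buys": 0.6, "does_not_buy": 0.4}
likelihoods = {"buys": 0.1, "does_not_buy": 0.3}

post = posteriors(priors, likelihoods)
predicted = max(post, key=post.get)  # class with the highest P(Ci|X)
```

Here the evidence P(X) is 0.06 + 0.12 = 0.18, so the posteriors are 1/3 and 2/3 and the sample is assigned to the second class even though its prior is lower.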

Classification vs. Prediction

 Classification

 predicts categorical class labels (discrete or nominal)

 classifies data (constructs a model) based on the training set and the values (class
labels) in a classifying attribute and uses it in classifying new data

 Prediction

 models continuous-valued functions, i.e., predicts unknown or missing values

 Typical applications

 Credit approval

 Target marketing

 Medical diagnosis

 Fraud detection

Classification—A Two-Step Process

 Model construction: describing a set of predetermined classes

 Each tuple/sample is assumed to belong to a predefined class, as determined by

the class label attribute

 The set of tuples used for model construction is the training set

 The model is represented as classification rules, decision trees, or mathematical formulae


 Model usage: for classifying future or unknown objects

 Estimate accuracy of the model

 The known label of test sample is compared with the classified result from
the model

 Accuracy rate is the percentage of test set samples that are correctly
classified by the model

 Test set is independent of training set, otherwise over-fitting will occur

 If the accuracy is acceptable, use the model to classify data tuples whose class
labels are not known

Process (1): Model Construction

Process (2): Using the Model in Prediction

Supervised vs. Unsupervised Learning
 Supervised learning (classification)

 Supervision: The training data (observations, measurements, etc.) are

accompanied by labels indicating the class of the observations

 New data is classified based on the training set

 Unsupervised learning (clustering)

 The class labels of the training data are unknown

 Given a set of measurements, observations, etc. with the aim of establishing the
existence of classes or clusters in the data

Issues: Evaluating Classification Methods

 Accuracy

 classifier accuracy: predicting class label

 predictor accuracy: guessing value of predicted attributes

 Speed

 time to construct the model (training time)

 time to use the model (classification/prediction time)

 Robustness: handling noise and missing values

 Scalability: efficiency in disk-resident databases

 Interpretability

 understanding and insight provided by the model

 Other measures, e.g., goodness of rules, such as decision tree size or compactness of
classification rules

 Clustering is the process of partitioning a group of data points into a small
number of clusters. For instance, the items in a supermarket are clustered in
categories (butter, cheese and milk are grouped in dairy products). Of course
this is a qualitative kind of partitioning.

 A quantitative approach would be to measure certain features of the products,

say percentage of milk and others, and products with high percentage of milk
would be grouped together.
 In general, we have n data points xi, i=1...n, that have to be partitioned into k clusters.
 The goal is to assign a cluster to each data point.
 K-means is a clustering method that aims to find the positions μi,i=1...k of the
clusters that minimize the distance from the data points to the cluster.

 K-means clustering solves

$$\arg\min_{c} \sum_{i=1}^{k} \sum_{x \in c_i} \lVert x - \mu_i \rVert^2$$

where ci is the set of points that belong to cluster i.

 The K-means clustering uses the square of the Euclidean distance
 This problem is not trivial (in fact it is NP-hard), so the K-means algorithm
only hopes to find the global minimum, possibly getting stuck in a different solution.
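
A minimal sketch of the usual iterative K-means procedure (often called Lloyd's algorithm) is given here, assuming the squared Euclidean distance described above. The sample points and the choice of initial centers are hypothetical; a different initialization may end in a different solution.

```python
def squared_distance(p, q):
    """Square of the Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def mean(points):
    """Coordinate-wise mean of a non-empty list of points."""
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))

def kmeans(points, centers, max_iter=100):
    centers = list(centers)
    for _ in range(max_iter):
        # Assignment step: each point joins the cluster of its nearest center.
        clusters = [[] for _ in centers]
        for p in points:
            nearest = min(range(len(centers)),
                          key=lambda i: squared_distance(p, centers[i]))
            clusters[nearest].append(p)
        # Update step: each center moves to the mean of its cluster
        # (an empty cluster keeps its old center).
        new_centers = [mean(c) if c else centers[i] for i, c in enumerate(clusters)]
        if new_centers == centers:
            break  # assignments no longer change
        centers = new_centers
    return centers, clusters

points = [(1.0, 1.0), (1.5, 2.0), (1.0, 1.5), (8.0, 8.0), (9.0, 9.5), (8.5, 9.0)]
centers, clusters = kmeans(points, centers=[points[0], points[3]])
```

Each iteration can only lower the sum of squared distances, which is why the procedure converges, but only to a minimum that depends on the starting centers.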


The expectation maximization algorithm is a natural generalization of maximum
likelihood estimation to the incomplete data case. In particular, expectation
maximization attempts to find the parameters θ that maximize the log probability log P(x; θ)
of the observed data.

In statistics, an expectation–maximization (EM) algorithm is an iterative method to find
maximum likelihood or maximum a posteriori (MAP) estimates of parameters in
statistical models, where the model depends on unobserved latent variables. The EM
iteration alternates between performing an expectation (E) step, which creates a function
for the expectation of the log-likelihood evaluated using the current estimate for the
parameters, and a maximization (M) step, which computes parameters maximizing the
expected log-likelihood found on the E step. These parameter-estimates are then used to
determine the distribution of the latent variables in the next E step.

 Given the statistical model which generates a set X of observed data, a set of
unobserved latent data or missing values Z, and a vector of unknown parameters θ,
along with a likelihood function L(θ; X, Z) = p(X, Z | θ), the maximum likelihood estimate (MLE) of
the unknown parameters is determined by maximizing the marginal likelihood of the observed
data:

$$L(\theta; X) = p(X \mid \theta) = \int p(X, Z \mid \theta)\, dZ$$

However, this quantity is often intractable (e.g. if Z is a sequence of events, so
that the number of values grows exponentially with the sequence length, making
the exact calculation of the sum extremely difficult).

The EM algorithm seeks to find the MLE of the marginal likelihood by iteratively
applying these two steps:

a) Expectation step (E step): Calculate the expected value of the log likelihood
function, with respect to the conditional distribution of Z given X under the current
estimate of the parameters θ(t):

$$Q(\theta \mid \theta^{(t)}) = \mathrm{E}_{Z \mid X, \theta^{(t)}}\left[\log L(\theta; X, Z)\right]$$

b) Maximization step (M step): Find the parameter that maximizes this quantity:

$$\theta^{(t+1)} = \arg\max_{\theta} Q(\theta \mid \theta^{(t)})$$

The typical models to which EM is applied use Z as a latent variable indicating
membership in one of a set of groups:

The observed data points may be discrete (taking values in a finite or countably infinite
set) or continuous (taking values in an uncountably infinite set). Associated with each
data point may be a vector of observations.

The missing values (aka latent variables) are discrete, drawn from a fixed number of
values, and with one latent variable per observed unit.

The parameters are continuous, and are of two kinds: Parameters that are associated with
all data points, and those associated with a specific value of a latent variable (i.e.,
associated with all data points whose corresponding latent variable has that value).

However, it is possible to apply EM to other sorts of models. The motive is as follows. If
the value of the parameters θ is known, usually the value of the latent variables Z can be found
by maximizing the log-likelihood over all possible values of Z, either simply by iterating
over Z or through an algorithm such as the Viterbi algorithm for hidden Markov models.
Conversely, if we know the value of the latent variables Z, we can find an estimate
of the parameters θ fairly easily, typically by simply grouping the observed data points
according to the value of the associated latent variable and averaging the values, or some
function of the values, of the points in each group.

This suggests an iterative algorithm, in the case where both θ and Z are unknown:

 First, initialize the parameters θ to some random values.
 Compute the probability of each possible value of Z, given θ.
 Then, use the just-computed values of Z to compute a better estimate for the
parameters θ.
 Iterate steps 2 and 3 until convergence.

The algorithm as just described monotonically approaches a local minimum of the cost
function.
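
The iterative scheme above can be sketched for one common case, a mixture of one-dimensional Gaussians, where the latent variable is the group each point belongs to. The data, the initial means, the fixed iteration count and the small variance floor are all assumptions made for this illustration.

```python
import math

def gaussian_pdf(x, mu, var):
    return math.exp(-(x - mu) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

def em_gmm(data, initial_means, n_iter=50):
    k = len(initial_means)
    mus = list(initial_means)
    variances = [1.0] * k
    weights = [1.0 / k] * k
    for _ in range(n_iter):
        # E step: probability of each group for each point, given the current parameters.
        resp = []
        for x in data:
            p = [weights[j] * gaussian_pdf(x, mus[j], variances[j]) for j in range(k)]
            total = sum(p)
            resp.append([pj / total for pj in p])
        # M step: re-estimate the parameters from the weighted points.
        for j in range(k):
            nj = sum(r[j] for r in resp)
            weights[j] = nj / len(data)
            mus[j] = sum(r[j] * x for r, x in zip(resp, data)) / nj
            var = sum(r[j] * (x - mus[j]) ** 2 for r, x in zip(resp, data)) / nj
            variances[j] = max(var, 1e-6)  # floor to avoid a collapsing component
    return weights, mus, variances

# Hypothetical observations drawn around 1 and around 5.
data = [1.0, 1.2, 0.8, 1.1, 0.9, 5.0, 5.2, 4.8, 5.1, 4.9]
weights, mus, variances = em_gmm(data, initial_means=[0.0, 6.0])
```

With these well-separated groups the estimated means settle close to 1 and 5. With a poor initialization the same code may converge to a different local optimum.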



A support vector machine (SVM) is a supervised machine learning
model used mainly for classification, although it is
sometimes very useful for regression as well. Basically, SVM finds a hyperplane that creates
a boundary between the types of data. In 2-dimensional space, this hyperplane is nothing
but a line. In SVM, we plot each data item in the dataset in an N-dimensional space, where
N is the number of features/attributes in the data. Next, we find the optimal hyperplane to
separate the data. It follows that, inherently, SVM can only
perform binary classification (i.e., choose between two classes). However, there are various
techniques to use it for multi-class problems.

Support Vector Machine for Multi-class Problems
To perform SVM on multi-class problems, we can create a binary classifier for
each class of the data. The two results of each classifier will be:
 The data point belongs to that class OR
 The data point does not belong to that class.
For example, in a class of fruits, to perform multi-class classification, we can create a binary
classifier for each fruit. For, say, the ‘mango’ class, there will be a binary classifier to predict if
it IS a mango OR it is NOT a mango. The classifier with the highest score is chosen as the
output of the SVM.
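
The one-classifier-per-class scheme can be sketched as follows. The sketch assumes each binary classifier is a linear scorer w · x + b that has already been trained; the two features (yellowness, roundness) and all weights are hand-picked, hypothetical values, so no SVM training is shown.

```python
def decision_score(weights, bias, x):
    """Score of a linear binary classifier: w . x + b (higher means 'is this class')."""
    return sum(w * xi for w, xi in zip(weights, x)) + bias

def predict_one_vs_rest(classifiers, x):
    """classifiers maps a class name to the (weights, bias) of its 'class vs. rest' scorer."""
    return max(classifiers, key=lambda c: decision_score(*classifiers[c], x))

# Hypothetical scorers over the features (yellowness, roundness).
classifiers = {
    "mango": ([2.0, -1.0], -0.5),
    "apple": ([-1.0, 2.0], -0.5),
    "banana": ([1.5, -2.0], 0.0),
}

fruit = predict_one_vs_rest(classifiers, [0.9, 0.2])
```

For the point (0.9, 0.2) the scores are about 1.1 for mango, -1.0 for apple and 0.95 for banana, so the mango classifier wins.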

SVM for complex (Non Linearly Separable)
SVM works very well without any
modifications for linearly separable data. Linearly separable data is any data that
can be plotted in a graph and can be separated into classes using a straight line.

A: Linearly Separable Data B: Non-Linearly Separable Data

 a vector space method for binary classification problems
 documents represented in t-dimensional space
 find a decision surface (hyperplane) that best separates
documents of two classes
 new document classified by its position relative to
hyperplane

Simple 2D example: training documents linearly separable
Line s—The Decision Hyperplane

 maximizes distances to closest docs of each class
 it is the best separating hyperplane

Delimiting Hyperplanes
 parallel dashed lines that delimit region where to look for a solution

 Lines that cross the delimiting hyperplanes.
 candidates to be selected as the decision hyperplane
 lines that are parallel to delimiting hyperplanes: best candidates

Support vectors: documents that belong to, and define, the delimiting hyperplanes

Our example in a 2-dimensional system of coordinates
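
To make the notions of decision hyperplane and support vectors concrete, the sketch below measures, for a given line w · x + b = 0, the distance to the closest documents of each class. It does not search for the best hyperplane; it only compares two candidate lines on four hypothetical 2D documents.

```python
import math

def signed_distance(w, b, x):
    """Signed distance from point x to the hyperplane w . x + b = 0."""
    norm = math.sqrt(sum(wi * wi for wi in w))
    return (sum(wi * xi for wi, xi in zip(w, x)) + b) / norm

def margin_and_support(w, b, docs):
    """docs: list of (point, label) with label +1 or -1.

    Returns the distance to the closest document and the documents at that distance.
    """
    dists = [(label * signed_distance(w, b, x), x) for x, label in docs]
    margin = min(d for d, _ in dists)
    support = [x for d, x in dists if math.isclose(d, margin)]
    return margin, support

# Hypothetical training documents in a 2-dimensional space.
docs = [((3, 3), +1), ((4, 4), +1), ((1, 1), -1), ((0, 0), -1)]

margin_s, support_s = margin_and_support((1, 1), -4, docs)    # line x + y = 4
margin_v, support_v = margin_and_support((1, 0), -2.5, docs)  # line x = 2.5
```

The line x + y = 4 keeps both closest documents, (3, 3) and (1, 1), at distance √2, while the line x = 2.5 passes only 0.5 away from (3, 3); the first line is therefore the better candidate.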

Feature selection and dimensionality reduction allow us to minimize the number of
features in a dataset by only keeping features that are important. In other words, we want to
retain features that contain the most useful information that is needed by our model to make
accurate predictions while discarding redundant features that contain little to no
information. There are several benefits in performing feature selection and dimensionality
reduction, which include model
interpretability, minimizing overfitting, as well as reducing the size of the training set and
consequently training time.

Dimensionality Reduction
The number of input variables or features for a dataset is referred to as its
dimensionality. Dimensionality reduction refers to techniques that reduce the number of
input variables in a dataset. More input features often make a predictive modeling task more
challenging to model, more generally referred to as the curse of dimensionality. High-
dimensionality statistics and dimensionality reduction techniques are often used for data
visualization. Nevertheless these techniques can be used in applied machine learning to
simplify a classification or regression dataset in order to better fit a predictive model.

Problem With Many Input Variables

If your data is represented using rows and columns, such as in a spreadsheet,
then the input variables are the columns that are fed as input to a model to predict the
target variable. Input variables are also called features. We can consider the columns of data
representing dimensions on an n-dimensional feature space and the rows of data as points in
that space. This is a useful geometric interpretation of a dataset. Having a large number of
dimensions in the feature space can mean that the volume of that space is very large,
and in turn, the points that we have in that space (rows of data) often represent a small and
non-representative sample. This can dramatically impact the performance of machine
learning algorithms fit on data with many input features, generally referred to as the “curse
of dimensionality.”
Therefore, it is often desirable to reduce the number of input features. This reduces the
number of dimensions of the feature space, hence the name “dimensionality reduction.”

Dimensionality Reduction
Dimensionality reduction refers to techniques for reducing the number of input variables in
training data.
When dealing with high dimensional data, it is often useful to reduce the
dimensionality by projecting the data to a lower dimensional subspace which
captures the “essence” of the data. This is called dimensionality reduction.
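
One way to project data onto a lower dimensional subspace is principal component analysis. The sketch below handles only the smallest case, 2D points projected onto one direction, using the closed-form largest eigenvalue of the 2x2 covariance matrix. The sample points are hypothetical and the function is an illustration, not a general PCA implementation.

```python
import math

def pca_2d_to_1d(points):
    """Project 2D points onto the direction of largest variance."""
    n = len(points)
    mx = sum(x for x, _ in points) / n
    my = sum(y for _, y in points) / n
    centered = [(x - mx, y - my) for x, y in points]
    # Entries of the covariance matrix [[a, b], [b, c]].
    a = sum(x * x for x, _ in centered) / n
    b = sum(x * y for x, y in centered) / n
    c = sum(y * y for _, y in centered) / n
    # Largest eigenvalue of a symmetric 2x2 matrix, in closed form.
    lam = (a + c) / 2 + math.sqrt(((a - c) / 2) ** 2 + b * b)
    # A matching eigenvector; the axis-aligned case is handled separately.
    if b != 0:
        vx, vy = b, lam - a
    else:
        vx, vy = (1.0, 0.0) if a >= c else (0.0, 1.0)
    norm = math.hypot(vx, vy)
    direction = (vx / norm, vy / norm)
    projected = [x * direction[0] + y * direction[1] for x, y in centered]
    return direction, projected

# Hypothetical points lying on the line y = 2x.
points = [(0.0, 0.0), (1.0, 2.0), (2.0, 4.0), (3.0, 6.0)]
direction, projected = pca_2d_to_1d(points)
```

Because these points lie exactly on a line, the single projected coordinate keeps all of the variance of the original two columns.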

Fewer input dimensions often mean correspondingly fewer parameters or a simpler
structure in the machine learning model, referred to as degrees of freedom. A model with
too many degrees of freedom is likely to overfit the training dataset and therefore may not
perform well on new data.
It is desirable to have simple models that generalize well, and in turn, input data with few
input variables. This is particularly true for linear models where the number of inputs and
the degrees of freedom of the model are often closely related.

Techniques for Dimensionality Reduction

There are many techniques that can be used for dimensionality reduction.

 Feature Selection Methods

 Matrix Factorization
 Manifold Learning
 Autoencoder Methods

Feature Selection Methods

Feature selection is also called variable selection or attribute selection.

It is the automatic selection of attributes in your data (such as columns in tabular
data) that are most relevant to the predictive modeling problem you are working
on. In other words, feature selection is the process of selecting a subset of relevant features for
use in model construction.

Feature selection is different from dimensionality reduction. Both methods seek to reduce
the number of attributes in the dataset, but a dimensionality reduction
method does so by creating new combinations of attributes, whereas feature selection
methods include and exclude attributes present in the data without changing them.
Examples of dimensionality reduction methods include Principal Component Analysis,
Singular Value Decomposition and Sammon’s Mapping.
Feature selection is itself useful, but it mostly acts as a filter, muting out
features that aren’t useful in addition to your existing features.

Feature Selection Algorithms

Filter Methods
Filter feature selection methods apply a statistical measure to assign a score to each
feature. The features are ranked by the score and either selected to be kept or removed from
the dataset. The methods are often univariate and consider the feature independently, or
with regard to the dependent variable. Examples of filter methods include the Chi
squared test, information gain and correlation coefficient scores.
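
A filter method based on correlation coefficient scores can be sketched as below: each feature is scored on its own against the target, and the features are ranked by the absolute score. The column names and values are hypothetical, and keeping only the top feature is an arbitrary choice for the example.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient; 0.0 if either column is constant."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy) if sx and sy else 0.0

def rank_features(columns, target):
    """Score every feature independently and rank by absolute correlation."""
    scores = {name: abs(pearson(values, target)) for name, values in columns.items()}
    return sorted(scores, key=scores.get, reverse=True)

# Hypothetical tabular data: three input columns and one target column.
columns = {
    "hours_studied": [1, 2, 3, 4, 5],
    "shoe_size": [42, 38, 44, 39, 41],
    "constant": [7, 7, 7, 7, 7],
}
target = [52, 58, 65, 70, 78]

ranking = rank_features(columns, target)
selected = ranking[:1]  # keep the top-scoring feature only
```

Since each feature is scored in isolation, the method is cheap, but it cannot detect features that are only useful in combination.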

Wrapper Methods
Wrapper methods consider the selection of a set of features as a search problem, where
different combinations are prepared, evaluated and compared to other combinations. A
predictive model is used to evaluate a combination of features and assign a score based on
model accuracy. The search process may be methodical, such as a best-first search; it may be
stochastic, such as a random hill-climbing algorithm; or it may use heuristics, like forward
and backward passes to add and remove features. An example of a wrapper method is the
recursive feature elimination algorithm.

Embedded Methods
Embedded methods learn which features best contribute to the accuracy of the model while
the model is being created. The most common type of embedded feature selection methods
are regularization methods. Regularization methods, also called penalization methods,
introduce additional constraints into the optimization of a predictive algorithm (such as
a regression algorithm) that bias the model toward lower complexity (fewer coefficients).
Examples of regularization algorithms are the LASSO, Elastic Net and Ridge Regression.
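
For reference, the three regularization methods named above are usually written as
penalized least-squares problems. The symbols are assumptions introduced here: X is the
data matrix, y the target vector, \beta the coefficient vector, and \lambda, \lambda_1,
\lambda_2 are non-negative penalty weights.

```latex
\begin{aligned}
\text{Ridge:} \quad & \min_{\beta} \; \lVert y - X\beta \rVert_2^2 + \lambda \lVert \beta \rVert_2^2 \\
\text{LASSO:} \quad & \min_{\beta} \; \lVert y - X\beta \rVert_2^2 + \lambda \lVert \beta \rVert_1 \\
\text{Elastic Net:} \quad & \min_{\beta} \; \lVert y - X\beta \rVert_2^2
    + \lambda_1 \lVert \beta \rVert_1 + \lambda_2 \lVert \beta \rVert_2^2
\end{aligned}
```

The L1 penalty of the LASSO and Elastic Net can drive some coefficients exactly to zero,
which is what makes them act as feature selectors; the ridge penalty shrinks coefficients
toward zero but generally does not remove them.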











The Web – Search Engine Architectures – Cluster based Architecture – Distributed
– Search Engine Ranking – Link based Ranking – Simple Ranking Functions –
Learning to Rank – Evaluations -- Search Engine Ranking – Search Engine User
Interaction – Browsing – Applications of a Web Crawler – Taxonomy – Architecture
and Implementation – Scheduling Algorithms – Evaluation.

The Web
The World Wide Web, also known as the Web, is a collection of
websites or web pages stored on web servers and connected to local computers
through the internet. These websites contain text pages, digital images, audio,
video, etc. Users can access the content of these sites from any part of the world
over the internet using devices such as computers, laptops, cell phones, etc.
The WWW, along with the internet, enables the retrieval and display of text and
media on your device.

The building blocks of the Web are web pages, which are formatted in
HTML, connected by links called "hypertext" links or hyperlinks, and accessed by
HTTP. These links are electronic connections that link related pieces of
information so that users can access the desired information quickly. Hypertext
lets the user select a word or phrase in the text and thereby access other
pages that provide additional information related to that word or phrase.

A web page is given an online address called a Uniform Resource Locator
(URL). A particular collection of web pages that belong to a specific URL is
called a website, e.g., www.facebook.com, www.google.com, etc. So, the World
Wide Web is like a huge electronic book whose pages are stored on multiple
servers across the world.

Components Of A Search Engine

A search engine refers to a huge database of internet resources such as web
pages, newsgroups, programs, images etc. It helps to locate information on the
World Wide Web.
The user can search for any information by passing a query in the form of keywords or
a phrase. The search engine then searches for relevant information in its database and
returns it to the user.

Search Engine Components

Generally there are three basic components of a search engine as listed below:
1. Web Crawler
2. Database
3. Search Interfaces

Web crawler
It is also known as a spider or bot. It is a software component that traverses
the web to gather information.

Database
All the information gathered from the web is stored in a database. It consists of
huge web resources.

Search Interfaces
This component is an interface between the user and the database. It helps the
user to search through the database.

Search Engine Working

The web crawler, the database and the search interface are the major components
that make a search engine work. Search engines make
use of the Boolean operators AND, OR and NOT to restrict and widen the results of
a search. The following steps are performed by the search engine:
 The search engine looks for the keyword in the index of its predefined
database instead of going directly to the web to search for the keyword.
 It then uses software to search for the information in the database, which
was populated beforehand by the web crawler.
 Once the matching pages are found, the search engine shows the
relevant web pages as a result. These retrieved web pages generally
include the title of the page, the size of the text portion, the first several
sentences etc. These search criteria may vary from one search engine to the other.
The retrieved information is ranked according to various factors such as frequency
of keywords, relevancy of information, links etc.
 The user can click on any of the search results to open it.
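
The lookup step can be sketched with a tiny in-memory index, where each keyword maps to
the set of documents containing it and the Boolean operators become set operations. The
three documents and the helper name postings are made up for this illustration.

```python
# Hypothetical mini-collection standing in for the search engine database.
docs = {
    1: "web crawler gathers pages",
    2: "search engine ranks web pages",
    3: "database stores crawler output",
}

# Build the index once: keyword -> set of document ids containing it.
index = {}
for doc_id, text in docs.items():
    for term in text.split():
        index.setdefault(term, set()).add(doc_id)


def postings(term):
    """Documents containing the keyword (empty set if the keyword is unknown)."""
    return index.get(term, set())


web_and_pages = postings("web") & postings("pages")            # AND restricts
crawler_or_engine = postings("crawler") | postings("engine")   # OR widens
web_not_crawler = postings("web") - postings("crawler")        # NOT excludes
```

Queries are answered from the index alone; the Web itself is never contacted at query time.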

The search engine architecture comprises the three basic layers listed below:
 Content collection and refinement.
 Search core
 User and application interfaces

Search Engine Processing

Indexing Process
The indexing process comprises the following three tasks:
 Text acquisition
 Text transformation
 Index creation

Text Acquisition
It identifies and stores documents for indexing.

Text Transformation
It transforms documents into index terms or features.

Index Creation
It takes the index terms created by text transformation and creates data
structures to support fast searching.
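
A minimal sketch of the three indexing tasks follows. The document store, the stopword
list and the function names (acquire, transform, create_index) are assumptions made for
the example; a real engine would fetch documents with a crawler and apply richer
transformations such as stemming.

```python
import re
from collections import defaultdict

STOPWORDS = {"the", "a", "of", "and", "to", "in"}  # illustrative list only


def acquire():
    """Text acquisition: identify and store the documents to be indexed."""
    return {
        "doc1": "The crawler fetches pages and stores the pages.",
        "doc2": "Ranking of pages in the index.",
    }


def transform(text):
    """Text transformation: lowercase, tokenize, and drop stopwords."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return [t for t in tokens if t not in STOPWORDS]


def create_index(documents):
    """Index creation: term -> {document id: term frequency}."""
    index = defaultdict(dict)
    for doc_id, text in documents.items():
        for term in transform(text):
            index[term][doc_id] = index[term].get(doc_id, 0) + 1
    return dict(index)


index = create_index(acquire())
```

Storing the term frequency next to each document id lets the ranking step score
documents without re-reading their text.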

Query Process
The query process comprises the following three tasks:
 User interaction
 Ranking
 Evaluation

User Interaction
It supports the creation and refinement of the user query and displays the results.

Ranking
It uses the query and the indexes to create a ranked list of documents.

Evaluation
It monitors and measures effectiveness and efficiency. It is done offline.


 The task for the retrieval system is to match the query against clusters of
documents instead of individual documents, and to rank clusters based on
their similarity to the query.
 Any document from a cluster that is ranked higher is considered more
likely to be relevant than any document from a cluster ranked lower on the
list.
 This is in contrast to most other cluster search methods, which use clusters
primarily as a tool to identify a subset of documents that are likely to be
relevant, so that at the time of retrieval only those documents are
matched to the query.
 This approach has been the most common for cluster-based retrieval.

 The second approach to cluster-based retrieval is to use clusters as a form
of document smoothing.
 Previous studies have suggested that by grouping documents into clusters,
differences between representations of individual documents are, in effect,
smoothed out.
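
The first approach, ranking whole clusters against the query, can be sketched as follows.
Representing each cluster by the centroid of its term-frequency vectors and comparing
with cosine similarity are assumptions made for this example; the clusters and documents
are invented.

```python
import math
from collections import Counter


def vector(text):
    """Bag-of-words term-frequency vector."""
    return Counter(text.lower().split())


def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0


def centroid(docs):
    """Average term-frequency vector of the documents in one cluster."""
    total = Counter()
    for doc in docs:
        total.update(vector(doc))
    return Counter({term: count / len(docs) for term, count in total.items()})


# Hypothetical, already-clustered collection.
clusters = {
    "sports": ["football match score", "tennis match results"],
    "cooking": ["pasta recipe sauce", "bread recipe oven"],
}


def rank_clusters(query):
    """Match the query against clusters rather than individual documents."""
    q = vector(query)
    scored = [(cosine(q, centroid(docs)), name) for name, docs in clusters.items()]
    return [name for _, name in sorted(scored, reverse=True)]


def retrieve(query):
    """Documents of a higher-ranked cluster come before those of a lower one."""
    return [doc for name in rank_clusters(query) for doc in clusters[name]]
```

For the query "recipe for bread", every document in the cooking cluster is returned ahead
of every document in the sports cluster, including the pasta page that does not mention
bread at all.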

Current search engines use a massively parallel and cluster-based
architecture. Due to the large size of the document collection, the inverted
index does not fit in a single computer and must be distributed across the
computers of a cluster. The large volume of queries implies that the basic
architecture must be replicated in order to handle the overall query load, and
that each cluster must handle a subset of the query load. In addition, as queries
originate from all around the world and Internet latency is appreciable across
continents, cluster replicas are maintained in different geographical locations
to decrease answer time. This also allows search engines to be fault-tolerant in
most typical worst-case scenarios, such as power outages or natural disasters.

There are many crucial details to be carefully addressed in this type of
architecture:

1. It is particularly important to achieve a good balance between the internal
(answering queries and indexing) and external (crawling) activities of the
search engine. This is achieved by assigning dedicated clusters to
crawling, to document serving, to indexing, to user interaction, to query
processing, and even to the generation of the result pages.
2. In addition, good load balancing among the different clusters needs to be
maintained. This is achieved by specialized servers called (quite trivially)
load balancers.
3. Finally, since hardware breaks often, fault tolerance is handled at the
software level. Queries are routed to the most adequate available cluster,
and CPUs and disks are routinely replaced upon failure, using inexpensive
exchangeable hardware components.

Figure 11.7 shows a generic search cluster architecture with its key components.

The front-end servers receive queries and process them right away if
the answer is already in the “answer cache” servers. Otherwise they route the
query to the search clusters through a hierarchical broker network. The
exact topology of this network can vary but, basically, it should be designed to
balance traffic so as to reach the search clusters as fast as possible. Each
search cluster includes a load balancing server (LB in the figure) that routes
the query to all the servers in one replica of the search cluster. In this
figure, we show an index partitioned into n clusters with m replicas. Although
partitioning the index into a single cluster is conceivable, it is not recommended,
as the cluster would turn out to be very large and consequently suffer from
additional management and fault tolerance problems.

Each search cluster also includes an index cache, which is depicted
at the top, as a flat rectangle. The broker network merges the results
coming from the search clusters and sends the merged results to the appropriate
front-end server, which will use the right document servers to generate the full
results pages, including snippets and other search result page artifacts. This is an
example of a more general trend to consider a whole data center as a computer.
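
The query path just described (check the answer cache, fan the query out to every index
partition, merge the partial results) can be sketched as below. The partition contents,
scores and function names are hypothetical, and replicas, load balancers and document
servers are left out to keep the example small.

```python
import heapq

# Hypothetical index partitions: term -> list of (score, document id).
partitions = [
    {"web": [(0.9, "d1"), (0.4, "d7")]},
    {"web": [(0.8, "d3")]},
    {"web": [(0.95, "d5"), (0.1, "d9")]},
]

answer_cache = {}      # plays the role of the "answer cache" servers
partition_calls = 0    # counts how often a search cluster is actually hit


def search_partition(partition, term, k):
    """One search cluster returns its own top-k hits for the term."""
    global partition_calls
    partition_calls += 1
    return heapq.nlargest(k, partition.get(term, []))


def broker(term, k=3):
    """Front end plus broker: answer from cache or merge partial results."""
    if term in answer_cache:
        return answer_cache[term]
    partial = [search_partition(p, term, k) for p in partitions]
    merged = heapq.nlargest(k, (hit for hits in partial for hit in hits))
    answer_cache[term] = merged
    return merged


first = broker("web")    # goes to all three partitions
second = broker("web")   # served from the answer cache
```

Because each partition only returns its local top k, the broker merges at most n * k
candidates regardless of how large the partitions are.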

There exist several variants of the crawler-indexer architecture and
we describe here the most important ones. Among them, the most significant
early example is Harvest.

Harvest uses a distributed architecture to gather and distribute data,
which is more efficient than the standard Web crawler architecture. The main
drawback is that Harvest requires the coordination of several Web servers.
Interestingly, the Harvest distributed approach does not suffer from some of
the common problems of the crawler-indexer architecture, such as:
 increased server load caused by the reception of simultaneous requests
from different crawlers,
 increased Web traffic, due to crawlers retrieving entire objects, while
most content is not retained eventually, and
 lack of coordination between engines, as information is
gathered independently by each crawler.

Avoiding these issues was achieved by introducing two main
components in the architecture: gatherers and brokers. A gatherer collects and
extracts indexing information from one or more Web servers. Gathering times
are defined by the system and are periodic (i.e., there are harvesting times, as
the name of the system suggests). A broker provides the indexing mechanism
and the query interface to the data gathered. Brokers retrieve information from
one or more gatherers or other brokers, incrementally updating their indexes.
Depending on the configuration of gatherers and brokers, different
improvements on server load and network traffic
can be achieved. For example, a gatherer can run on a Web server, generating
no external traffic for that server. Also, a gatherer can send information to
several brokers, avoiding work repetition. Brokers can also filter information
and send it to other brokers. This design allows the sharing of work and
information in a very flexible and generic manner. An example of the Harvest
architecture is shown in Figure 11.9.
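
The division of labour between gatherers and brokers can be mimicked with two small
classes. This is only a toy model written for these notes, not Harvest's actual interface:
the class names, the sequence-number scheme used for incremental updates and the page
contents are all assumptions.

```python
class Gatherer:
    """Runs next to a Web server and extracts indexing information from it."""

    def __init__(self, pages):
        self.pages = pages      # url -> text on the local server
        self.records = []       # (sequence number, url, set of terms)

    def gather(self):
        """One periodic harvesting pass: summarise pages not seen before."""
        known = {url for _, url, _ in self.records}
        for url, text in self.pages.items():
            if url not in known:
                seq = len(self.records) + 1
                self.records.append((seq, url, set(text.lower().split())))

    def export(self, since=0):
        return [r for r in self.records if r[0] > since]


class Broker:
    """Indexes data pulled from gatherers or other brokers and answers queries."""

    def __init__(self, sources, keep=lambda terms: True):
        self.sources = sources
        self.keep = keep                    # optional topic filter
        self.cursor = [0] * len(sources)    # last sequence number seen per source
        self.records = []
        self.index = {}

    def update(self):
        """Incremental update: fetch only records newer than the cursor."""
        for i, source in enumerate(self.sources):
            for seq, url, terms in source.export(self.cursor[i]):
                self.cursor[i] = max(self.cursor[i], seq)
                if self.keep(terms):
                    self.records.append((len(self.records) + 1, url, terms))
                    for term in terms:
                        self.index.setdefault(term, set()).add(url)

    def export(self, since=0):
        return [r for r in self.records if r[0] > since]

    def query(self, term):
        return sorted(self.index.get(term, set()))


site = Gatherer({"a.html": "jazz concert tickets", "b.html": "python tutorial"})
site.gather()
general = Broker([site])
music = Broker([general], keep=lambda terms: "jazz" in terms)  # topic-specific broker
general.update()
music.update()

site.pages["c.html"] = "jazz festival"   # new page appears before the next harvest
site.gather()
general.update()
music.update()
```

The music broker never contacts the Web server: it receives filtered records from the
general broker, which in turn pulls only new records from the gatherer.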

One of the goals of Harvest is to build topic-specific brokers, focusing the
index contents and avoiding many of the vocabulary and scaling problems of
generic indexes. Harvest includes a dedicated broker that allows other brokers to
register information about gatherers and brokers. This is mostly useful for
identifying a broker or gatherer when building a system. The
Harvest architecture also provides replicators and object caches. A replicator
can be used to replicate servers, enhancing user-base scalability. For example,
the registration broker can be replicated in different geographic regions to
allow faster access. Replication can also be used to divide the gathering
process among many Web servers. Finally, the object cache reduces network
and server load, as well as response latency when accessing Web pages.

Ranking is the hardest and most important function search engines have
to execute. A first challenge is to devise an adequate evaluation process that
allows judging the efficacy of a ranking, in terms of its relevance to the
users. Without such an evaluation process it is close to impossible to fine-tune the
ranking function, which basically prevents achieving high quality results.
There are many possible evaluation techniques and measures. We cover this
topic in the context of the Web, paying particular attention to the exploitation
of users’ clicks.

A second critical challenge is the identification of quality content in
the Web. Evidence of quality can be indicated by several signals such as
domain names (e.g., .edu is a positive signal, as content originating from
academic institutions is more likely to be reviewed), text content and various
counts (such as the number of word occurrences), links (like PageRank), and
Web page access patterns as monitored by the search engine. Indeed, as
mentioned before, clicks are a key element of quality. The more traffic a
search engine has, the more signals it will have available. Additional useful
signals are provided by the layout of the Web page, its title, metadata, and font
sizes, as discussed later.

The economic incentives of the current advertising-based business model
adopted by search engines have created a third challenge: avoiding Web spam.
Spammers in the context of the Web are malicious users who try to trick
search engines by artificially inflating the signals mentioned in the
previous paragraph. This can be done, for instance, by repeating a term in
a page a great number of times, using link farms, hiding terms from the users
but keeping them visible to the search engines through coloring
tricks, or even, for the most sophisticated ones, using deceptive JavaScript code.

Finally, the fourth issue lies in defining the ranking function and computing
it (which is different from evaluating its quality as mentioned above). While it
is fairly difficult to compare different search engines as they evolve and
operate on different Web corpora, leading search engines have to
constantly measure and compare themselves, each one using its own
measure, so as to remain competitive.

Ranking Signals
We distinguish among different types of signals used for ranking
improvements according to their origin, namely content, structure, or usage, as
follows. Content signals are related to the text itself, to the distributions of
words in the documents as has been traditionally studied in IR. The signal
in this case can vary from simple word counts to a full IR score such as
BM25. They can also be provided by the layout, that is, the HTML source,
ranging from simple format indicators (more weight given to titles/headings)
to sophisticated ones such as the proximity of certain tags in the page.
Structural signals are intrinsic to the linked structure of the Web. Some of
them are textual in nature, such as anchor text, which describes in very brief
form the content of the target Web page. In fact, anchor text is usually used
as surrogate text of the linked Web page. That implies that Web pages can
be found by searching the anchor texts associated with links that point to them,
even if they have not been crawled. Other signals pertain to the links
themselves, such as the number of in-links to or outlinks from a page.

The next set of signals comes from Web usage. The main one is the
implicit feedback of the user through clicks. In our case the clicks of main
interest are the ones on the URLs of the result set.

Given that there might be thousands or even millions of pages available
for any given query, the problem of ranking those pages to generate a short list
is probably one of the key problems of Web IR; one that requires some kind of
relevance estimation. In this context, the number of hyperlinks that point to a
page provides a measure of its popularity and quality. Further, many links in
common among pages and pages referenced by the same page are often
indicative of page relations with potential value for ranking purposes. Next,
we present several examples of ranking techniques that exploit links, but differ
on whether they are query dependent or not.

Early Algorithms
 Boolean spread, vector spread, and most-cited
 WebQuery

A better idea is due to Kleinberg and used in HITS (Hypertext Induced Topic
Search). This ranking scheme is query-dependent and considers the set of
pages S that point to or are pointed to by pages in the answer. Pages that have
many links pointing to them in S are called authorities because they are
likely to contain authoritative and thus relevant content. Pages that have
many outgoing links are called hubs and are likely to point to relevant
similar content. A positive two-way feedback exists: better authority pages
come from incoming edges from good hubs, and better hub pages come from
outgoing edges to good authorities. Let H(p) and A(p) be the hub and
authority values of page p. These values are defined such that the following
equations are satisfied for all pages p:

$$H(p) = \sum_{u \in S \mid p \to u} A(u), \qquad A(p) = \sum_{v \in S \mid v \to p} H(v)$$

where H(p) and A(p) for all pages are normalized (in the original paper, the
sum of the squares of each measure is set to one). These values can be
determined through an iterative algorithm, and they converge to the
principal eigenvectors of matrices derived from the link matrix M of S
(M^T M for authorities and M M^T for hubs). In the case of the Web, to
avoid an explosion in the size of S, a maximal number of pages pointing
to the answer can be defined. This technique does not work well with
non-existent, repeated, or automatically generated links. One solution is to weigh
each link based on the surrounding content. A second problem is that of
topic diffusion, because as a consequence of link weights, the result set
might include pages that are not directly related to the query (even if they
have high hub and authority values). A typical case of this phenomenon is when a
particular query is expanded to a more general topic that properly contains the
original answer. One solution to this problem is to associate a score with the
content of each page, like in traditional IR ranking, and combine this score
with the link weight. The link weight and the page score can be included in the
previous formula by multiplying each term of the summation. Experiments show
that the recall and precision for the first ten results increase significantly. The
appearance order of the links on the Web page can also be used by dividing
the links into subgroups and using the HITS algorithm on those subgroups
instead of the original Web pages. In Table 11.2, we show the exponent of the
power law of the distribution for authority and hub values for different
countries of the globe adapted from.
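
The iterative computation of hub and authority values can be sketched as follows. It
uses the usual HITS update rules (the authority of a page is the sum of the hub values of
the pages linking to it, and the hub value of a page is the sum of the authority values of
the pages it links to) and normalizes so that the squares of each measure sum to one. The
small link graph is invented for the example.

```python
import math


def hits(links, iterations=50):
    """links: page -> set of pages it points to, restricted to the set S."""
    pages = set(links) | {q for targets in links.values() for q in targets}
    hub = {p: 1.0 for p in pages}
    auth = {p: 1.0 for p in pages}
    for _ in range(iterations):
        # A(p): sum of H(v) over all pages v that link to p.
        auth = {p: sum(hub[v] for v in pages if p in links.get(v, ()))
                for p in pages}
        # H(p): sum of A(u) over all pages u that p links to.
        hub = {p: sum(auth[u] for u in links.get(p, ())) for p in pages}
        # Normalize so that the sum of squares of each measure is one.
        for scores in (auth, hub):
            norm = math.sqrt(sum(v * v for v in scores.values())) or 1.0
            for p in scores:
                scores[p] /= norm
    return hub, auth


# Hypothetical subgraph S: two hub-like pages pointing to two content pages.
links = {"h1": {"a1", "a2"}, "h2": {"a1"}, "a1": set(), "a2": set()}
hub, auth = hits(links)
```

Page a1 ends up with the highest authority value because both hubs point to it, and h1
ends up as the better hub because it points to both authorities.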


The simplest ranking scheme consists of using a global ranking
function such as PageRank. In that case, the quality of a Web page is
independent of the query and the query only acts as a document filter. That is,
for all Web pages that satisfy a query, rank them using their PageRank order.
A more elaborate ranking scheme consists of using a linear
combination of different relevance signals, for example, combining textual
features, say BM25, and link features, such as PageRank. To illustrate,
consider the pages p that satisfy query Q. Then, the rank score R(p, Q) of page
p with regard to query Q can be computed as:

$$R(p, Q) = a \, \mathrm{BM25}(p, Q) + (1 - a) \, \mathrm{PR}(p)$$

Further, R(p, Q) = 0 if p does not satisfy Q. If we assume that all the
functions are normalized and a ∈ [0, 1], then R(p, Q) ∈ [0, 1]. Notice that this
function is a convex combination of the two scores, weighted by a. Also, while the
first term depends on the query, the second term does not. If a = 1, we have a pure
textual ranking, which was the typical case in the early search engines. If a = 0, we
have a pure link-based ranking that is also independent of the query. In that case, the
order of the pages is known in advance for pages that satisfy Q. We can tune the
value of a experimentally using labeled data as ground truth or click-through
data. In fact, a might even be query dependent. For example, for navigational
queries a could be made smaller than for informational queries.
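
The effect of the weight a can be seen in a few lines of code. The sketch assumes the
combination has the form a * (text score) + (1 - a) * (link score), which matches the
behaviour described above for a = 1 and a = 0; the two pages and their normalized
scores are invented.

```python
def rank_score(text_score, link_score, a):
    """Combine a normalized text score and a normalized link score with weight a."""
    if not 0.0 <= a <= 1.0:
        raise ValueError("a must lie in [0, 1]")
    return a * text_score + (1 - a) * link_score


# Hypothetical pages that satisfy the query: (normalized BM25, normalized PageRank).
candidates = {"p1": (0.9, 0.2), "p2": (0.4, 0.8)}


def rank(a):
    """Order the candidate pages by decreasing combined score."""
    return sorted(candidates,
                  key=lambda p: rank_score(*candidates[p], a),
                  reverse=True)


text_only = rank(1.0)   # pure textual ranking
link_only = rank(0.0)   # pure link-based ranking
balanced = rank(0.5)
```

With a = 1 the textually stronger page p1 comes first, with a = 0 the better-linked page
p2 does, and intermediate values trade the two signals off against each other.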

Early work on combining text-based and link-based rankings was
published by Silva. The authors used a Bayesian network to combine the
different signals and showed that the combination leads to far better results
than those produced by any of the combined ranking functions in isolation.
Subsequent research by Calado discussed the efficacy of a global link-based
ranking versus a local link-based ranking for computing Web results. The
local link-based ranking for a page p is computed considering only the pages
that link to and are linked by page p. The authors compare results produced by
a combination of a text-based ranking (Vector model) with global HITS, local
HITS, and global PageRank, and conclude that a global link-based ranking
produces better results at the top of the ranking, while a local link-based
ranking produces better results later in the ranking.

A rather distinct approach for computing a Web ranking is to apply
machine learning techniques for learning to rank. For this, one can use one's
favorite machine learning algorithm, fed with training data that contains
ranking information, to “learn” a ranking of the results, analogously to the
algorithms for text classification. The loss function to minimize in this case is
the number of mistakes made by the learned algorithm, which is similar to
counting the number of misclassified instances in traditional classification.
The evaluation of the learned ranking must be done with another data set
(which also includes ranking information) distinct from the one used for
training. There exist three types of ranking information for a query Q that
can be used for training:
 Pointwise: a set of relevant pages for Q.
 Pairwise: a set of pairs of relevant pages indicating the ranking
relation between the two pages.
That is, the pair [p1 > p2] implies that the page p1 is more relevant than p2.
 Listwise: a set of ordered relevant pages: p1 > p2 > ··· > pm.
In any case, we can consider that any page included in the ranking
information is more relevant than a page without information, or we can
leave those cases undefined. Also, the ranking information does not need
to be consistent (for example, in the pairwise case). The training data may
come from the so-called “editorial judgments” made by people or, better,
from click-through data. Given that users’ clicks reflect preferences that
agree in most cases with relevance judgments made by human assessors, one
can consider using click-through information to generate the training data.
Then, we can learn the ranking function from click-based preferences. That
is, if for query Q, p1 has more clicks than p2, then [p1 > p2].
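
Turning click counts into pairwise training data takes only a few lines. The click counts
below are invented, and pages with equal counts are simply left without a preference,
which is one possible reading of the rule above.

```python
from itertools import combinations


def click_preferences(click_counts):
    """Return pairs (better, worse) for pages with different click counts."""
    pairs = []
    for p, q in combinations(sorted(click_counts), 2):
        if click_counts[p] > click_counts[q]:
            pairs.append((p, q))
        elif click_counts[q] > click_counts[p]:
            pairs.append((q, p))
        # Equal counts express no preference, so no pair is emitted.
    return pairs


# Hypothetical clicks recorded for the results of one query Q.
clicks = {"p1": 40, "p2": 12, "p3": 12}
prefs = click_preferences(clicks)
```

Each resulting pair (p1, p2) stands for the preference [p1 > p2] and can be fed to a
pairwise learning-to-rank method.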

One approach for learning to rank from clicks using the pairwise
approach is to use support vector machines (SVMs) to learn the ranking
function. In this case, preference relations are transformed into inequalities
among weighted term vectors representing the ranked documents. These
inequalities are then translated into an SVM optimization problem, whose
solution computes optimal weights for the document terms. This approach
proposes the combination of different retrieval functions with different
weights into a single ranking function.
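
The idea of turning preferences into inequalities over weighted feature vectors can be
illustrated with a simplified stand-in for an SVM solver: a linear scoring function
trained by subgradient descent on a regularized hinge loss over preference pairs. This is
not a full SVM implementation, and the feature vectors (a text score and a link score
per document), the preferences and the hyperparameters are all assumptions.

```python
def train(features, preferences, epochs=200, lr=0.1, reg=0.01):
    """Learn weights w so that w . x_better exceeds w . x_worse by a margin."""
    dim = len(next(iter(features.values())))
    w = [0.0] * dim
    for _ in range(epochs):
        for better, worse in preferences:
            diff = [b - c for b, c in zip(features[better], features[worse])]
            margin = sum(wi * di for wi, di in zip(w, diff))
            for i in range(dim):
                # Subgradient of reg/2 * |w|^2 + max(0, 1 - w . diff).
                grad = reg * w[i] - (diff[i] if margin < 1 else 0.0)
                w[i] -= lr * grad
    return w


def score(w, x):
    return sum(wi * xi for wi, xi in zip(w, x))


# Hypothetical documents described by (text score, link score).
features = {"d1": (0.9, 0.1), "d2": (0.5, 0.5), "d3": (0.1, 0.2)}
preferences = [("d1", "d2"), ("d2", "d3"), ("d1", "d3")]

w = train(features, preferences)
ranking = sorted(features, key=lambda d: score(w, features[d]), reverse=True)
```

The learned weights play the role of the combination weights: they say how much each
retrieval signal should count in the single ranking function.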

The pointwise approach solves the problem of ranking by means of
regression or classification on single documents, while the pairwise approach
transforms ranking into a problem of classification on document pairs. The
advantage of these two approaches is that they can make use of existing results
in regression and classification. However, ranking has intrinsic characteristics
that cannot always be handled by the latter techniques. The listwise approach
tackles the ranking problem directly, by adopting listwise loss functions or by
directly optimizing IR evaluation measures such as average precision.
However, this case is in general more complex. Some authors have proposed
to use a multivariate function, also called a relational ranking function, to
perform listwise ranking, instead of using a single-document based ranking
function.

To be able to evaluate quality, Web search engines typically use
human judgments that indicate which results are relevant for a given query, or
some approximation of a “ground truth” inferred from users’ clicks, or finally
a combination of both, as follows.
Precision at 5, 10, 20
One simple approach to evaluate the quality of Web search results is
to adapt the standard precision-recall metrics to the Web. For this, the
following observations are important:

On the Web, it is almost impossible to measure recall, as the number of relevant
pages for most typical queries is prohibitive and ultimately unknown.
Thus, standard precision-recall figures cannot be applied directly.

Most Web users inspect only the top 10 results, and it is relatively
uncommon for a user to inspect answers beyond the top 20 results. Thus,
evaluating the quality of Web results beyond position 20 in the ranking is not
warranted, as it does not reflect common user behavior.

Since Web queries tend to be short and vague,
human evaluation of results should be based on distinct relevance assessments
for each query-result pair. For instance, if three separate assessments are made
for each query-result pair, we can consider that the result is indeed relevant to
the query if at least two of the assessments suggest so.

The compounding effect of these observations is that
(a) precision of Web results should be measured only at the top
positions in the ranking, say P@5, P@10, and P@20, and
(b) each query-result pair should be subjected to 3-5 independent
relevance assessments.
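
Both recommendations can be combined in a short computation: decide relevance by
majority vote over the independent assessments, then measure precision at the top k
positions. The assessments below are invented (1 means an assessor judged the result
relevant), and using a strict majority is an assumption that reduces to "at least two
out of three" for three assessors.

```python
def is_relevant(assessments):
    """A result counts as relevant if a strict majority of assessors say so."""
    return sum(assessments) * 2 > len(assessments)


def precision_at(k, ranked_assessments):
    """P@k: fraction of the top-k results judged relevant."""
    top = ranked_assessments[:k]
    return sum(is_relevant(a) for a in top) / k


# Hypothetical judgments for the top 10 results of one query, three assessors each.
judgments = [
    [1, 1, 1], [1, 1, 0], [0, 0, 1], [1, 0, 1], [0, 0, 0],
    [1, 1, 1], [0, 1, 0], [0, 0, 0], [1, 1, 0], [0, 0, 0],
]
p_at_5 = precision_at(5, judgments)
p_at_10 = precision_at(10, judgments)
```

Here three of the first five results and five of the first ten are judged relevant, so
P@5 is 0.6 and P@10 is 0.5.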

Click-through Data as an Evaluation Metric

One major advantage of using click-through data to evaluate the quality
of answers derives from its scalability. Its disadvantage is that it works less
well in smaller corpora, such as countries with little Web presence, Intranet
search, or simply in the long tail of queries. Note that users’ clicks are not
used as a binary signal but in significantly more complex ways, such as
considering whether the user remained a long time on the page they clicked (a
good signal) or jumped from one result to the other (a signal that nothing
satisfying was found). These measures and their usage are complex and kept
secret by leading search engines.

Evaluating the Quality of Snippets

A related problem is to measure the quality of the snippets in the results.
Search snippets are the small text excerpts associated with each result
generated by a search engine. They provide a summary of the search result and
indicate how it is related to the query (by presenting query terms in boldface,
for instance). This provides great value to the users, who can quickly inspect
the snippets to decide which results are of interest.

Web Spam
The Web contains numerous profit-seeking ventures, so there is an
economic incentive for Web site owners to rank high in the result lists of
search engines.

All deceptive actions that try to increase the ranking of a page in search
engines are generally referred to as Web spam or spamdexing (a
portmanteau of “spamming” and “index”). The area of research that relates to
spam fighting is called Adversarial Information Retrieval, which has been
the object of several publications and workshops.


Assigning Identifiers to Documents

Document identifiers are usually assigned randomly or according to the
order in which URLs are crawled. Numerical identifiers are used to
represent URLs in several data structures. In addition to inverted lists, they are
also used to number nodes in Web graphs and to identify documents in search
engine repositories.

It has been shown in the literature that a careful ordering of documents
leads to an assignment of identifiers from which both index and Web graph
storing methods can benefit. Also, an assignment based on a global ranking
scheme may simplify the ranking of answers (see section 11.5.3). Regarding
the compression of inverted lists, a very effective mapping can be obtained
by considering the sorted list of URLs referencing the Web documents of the
collection. Assigning identifiers in ascending order of lexicographically sorted
URLs improves the compression rate. The hypothesis that is empirically
validated by Silvestri is that documents sharing correlated and discriminant
terms are very likely to be hosted by the same site and will therefore also
share a large prefix of their URLs. Experiments validate the hypothesis since
compression rates can be improved up to 0.4 by using the URL sorting
technique. Furthermore, sorting a few million URLs takes only tens of
seconds and takes only a few hundred megabytes of main memory.
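
Why URL sorting helps can be seen on a toy collection. Inverted lists are commonly
stored as gaps between consecutive document identifiers, and small gaps need fewer bits.
The sketch below measures the size of the gaps under Elias gamma coding, which is an
assumption chosen for illustration (the notes do not name a specific code); the URLs and
terms are invented.

```python
def gamma_bits(n):
    """Length in bits of the Elias gamma code of a positive integer n."""
    return 2 * (n.bit_length() - 1) + 1


def index_bits(url_order, postings_by_term):
    """Total bits needed to gap-encode every inverted list under a given ordering."""
    doc_id = {url: i for i, url in enumerate(url_order, start=1)}
    total = 0
    for urls in postings_by_term.values():
        ids = sorted(doc_id[u] for u in urls)
        gaps = [ids[0]] + [b - a for a, b in zip(ids, ids[1:])]
        total += sum(gamma_bits(g) for g in gaps)
    return total


# Hypothetical crawl that alternates between two sites.
crawl_order = ["a.com/1", "b.org/1", "a.com/2", "b.org/2",
               "a.com/3", "b.org/3", "a.com/4", "b.org/4"]
# Each term occurs only on the pages of one site.
postings_by_term = {
    "jazz": {"a.com/1", "a.com/2", "a.com/3", "a.com/4"},
    "chess": {"b.org/1", "b.org/2", "b.org/3", "b.org/4"},
}

crawl_bits = index_bits(crawl_order, postings_by_term)
sorted_bits = index_bits(sorted(crawl_order), postings_by_term)
```

With identifiers assigned in crawl order the two lists need 22 bits; after sorting the
URLs, pages of the same site get consecutive identifiers and the same lists need 12 bits.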

Search Engine User Interaction
Web search engines target hundreds of millions of users, most of
whom have very little technical background. As a consequence, the design of
the interface has been heavily influenced by an extreme simplicity rule, as
follows.

Extreme Simplicity Rule. The design of the user search experience,
i.e., the patterns of user interaction with the search engine, must assume that
the users have only minimal prior knowledge of the search task and must
require as little learning on their part as possible. In fact, users are expected to
read the “user manual” of a new refrigerator or DVD player more often than
the help page of their favorite search engine. One immediate consequence of
this state of affairs is that a user who does not “get it” while interacting with a
search engine is very likely to attempt to solve the problem by simply
switching to another search engine. In this context, extreme simplicity has
become the rule for user interaction in Web search.

In this section, we describe typical user interaction models for the most
popular Web search engines of today, their recent innovations, and the
challenges they face to abide by this extreme simplicity rule. We revisit
these models in more depth, in the context of the Web search experience offered
by major players such as Ask.com, Bing, Google and Yahoo! Search. We do
not discuss here “vertical” search engines, i.e., search engines restricted to
specific domains of knowledge such as Yelp or Netflix, or major search
engine verticals, such as Google Image Search or Yahoo! Answers.

The Search Rectangle Paradigm

Users are now accustomed to specifying their information needs by
formulating queries in a search “rectangle”. This interaction mode has become
so popular that many Web homepages now feature a rectangular search box,
visible in a prominent area of the site, even if the supporting search
technology is provided by a third party. To illustrate, Figure 11.10 displays
the search rectangle of the Ask, Bing, Google, and Yahoo! search engines.
The rectangle design has remained
pretty much stable for some engines such as Google, whose main homepage
has basically not changed in the last ten years. Others like Ask and Bing allow
a more fanciful design with colorful skins and beautiful photos of
interesting places and objects (notice, for instance, the Golden Gate bridge
background of Ask in Figure 11.10).

Despite these trends, the search rectangle remains the centerpiece of
the action in all engines. While the display of a search rectangle at the
center of the page is the favored layout style, there are alternatives:
• Some Web portals embed the search rectangle in a privileged area of the
homepage. Examples of this approach are provided by yahoo.com or aol.com.
• Many sites include an Advanced Search page, which provides the users
with a form composed of multiple search “rectangles” and options (rarely
used).
• The search toolbars provided by most search engines as a browser plug-in,
or built into browsers like Firefox, can be seen as a leaner version of the
central search rectangle. By being accessible at all times, they represent a
more convenient alternative to the homepage rectangle, yet their requirement
of a download for installation prevents wider adoption. Notice that, to
compensate for this overhead, many search engines negotiate costly OEM deals
with PC distributors/manufacturers to preinstall their toolbars.
• The “ultimate” rectangle, introduced by Google’s Chrome “omnibox”,
merges the functionality of the address bar with that of the search box. It
then becomes the
responsibility of the browser to decide whether the text input by the user
aims at navigating to a given site or at conducting a search. Prior to the
introduction of the Omnibox, Firefox already provided functionality to
recognize that certain words cannot be part of a URL and thus should be
treated as part of a query. In these cases, it would trigger Google’s “I feel
lucky” function to return the top search result. Interestingly enough, this
Firefox feature is customizable, allowing users to trigger search engines other
than Google or to obtain a full page of results.

The Search Engine Result Page

Result Presentation - The Basic Layout
The classic presentation style of the Search Engine Result Page, often
referred to as SERP, consists of a list of “organic” or “algorithmic” results
that appear on the left-hand side of the results page, as well as paid/sponsored
results (ads) that appear on the right-hand side. Additionally, the most relevant
paid results might appear on top of the organic results in the North area, as
illustrated in Figure 11.12. By default, most search engines show ten results
on the first page, even though some engines, such as Bing, allow users to
customize the number of results to show on a page. Figure 11.12 illustrates the
layout generally adopted by most engines, with common but not uniformly adopted
locations being indicated by a dotted-line framed box.

These engines might differ on small details like the “assistance”
features, which might appear in the North, South or West region of the page,
the position of the navigational tools, which might or might not be
displayed on the West region, or the position of spelling correction
recommendations, which might appear before or after the sponsored results in
the North region. Search engines constantly experiment with small variations of
layout, and drastically different layouts might be adopted in the future, as
this is a space that calls for innovative features. To illustrate, Cuil
introduced a radically different layout that departs from the one-dimensional
ranking, but this is more the exception than the rule. In contrast, search
properties other than the main engines commonly adopt distinct layouts, such as
the image search in both Google and Yahoo!, or Google Ads search results, all of
which display results across several columns. In this section, we focus
exclusively on the organic part of search results. We will refer to them from
now on as “search results”; note the distinction from paid/sponsored search
results.

Major search engines use a very similar format to display individual
results, composed basically of
(a) a title shown in blue and underlined,
(b) a short snippet consisting of two or three sentences extracted
from the result page, and
(c) a URL that points to the page that contains the full text.
In most cases, titles can be extracted directly from the page.
When a page does not have a title, anchor texts pointing to it can be
used to generate a title.

More Structured Results

In addition to results fed by the main Web corpus, search engines include
additional types of results, as follows.
“Oneboxes” results: These are very specific results, produced in response to
very precise queries that are likely to have one unique answer. They
are displayed above regular Web results, due to their high relevance, and in a
distinct format. Oneboxes are triggered by specific terms in the user’s query
that indicate a clear intent. They aim at either exposing the answer directly or
exposing a direct link to the answer, which provides the ultimate search
experience but can be achieved only in very specific cases, i.e., when
relevance is guaranteed and the answer is short and unambiguous.
As an example, both Google and Yahoo! Search support a weather
onebox, which is triggered by entering “weather <a location>” (see example
in Figure 11.13).

Universal search results: Most Web search engines offer, in addition
to core Web search, other properties, such as Images, Videos, Products, Maps,
which come with their own vertical search. While users can go directly to
these properties to conduct corpus-specific searches, the “universal” vision
states that users should not have to specify the target corpus. The engine
should guess their intent and automatically return results from the most
relevant sources when appropriate. The key technical challenge here is to
select these sources and to decide how many results from each source to

In this section, we cover browsing as an additional discovery paradigm,
with special attention to Web directories. Browsing is mostly useful
when users have no idea of how to specify a query (which becomes rarer and
rarer in the context of the global Web), or when they want to explore a
specific collection and are not sure of its scope. Nowadays, browsing is no
longer the discovery paradigm of choice on the Web. Despite that, it can still
be useful in specific contexts such as that of an Intranet or in vertical
domains, as we now discuss. In the case of browsing, users are willing to
invest some time exploring the document space, looking for interesting or even
unexpected references. Both with browsing and searching, the user is pursuing
discovery goals. However, in search, the user’s goal is somewhat crisper. In
contrast, with browsing, the user’s needs are usually broader. While this
distinction is not valid in all cases, we will adopt it here for the sake of
simplicity. We first describe the three types of browsing, namely flat,
structure driven (with special attention to Web directories), and hypertext
driven. Following that, we discuss attempts at combining searching and
browsing in a hybrid manner.

Flat Browsing
In flat browsing, the user explores a document space that follows a
flat organization. For instance, the documents might be represented as dots in
a two-dimensional plane or as elements in a single-dimension list, which
might be sorted alphabetically or in any other order. The user then glances
here and there looking for information within the visited documents. Note that
exploring search results is a form of flat browsing. Each single document can
also be explored in a flat manner via the browser, using navigation arrows
and the scroll bar.

One disadvantage is that in a given page or screen there may not be any
clear indication of the context the user is in. For example, while browsing
large documents, users might lose track of which part of the document they
are looking at. Flat browsing is obviously not available in the global Web due
to its scale and distribution, but is still the mechanism of choice when
exploring smaller sets. Furthermore, it can be used in combination with search
for exploring search results or attributes. In fact, flat browsing conducted after
an initial search allows identifying new keywords of interest. Such keywords
can then be added to the original query in an attempt to provide better

A Web crawler is software for downloading pages from the Web.
It is also known as a Web spider, Web robot, or simply bot.
A Web crawler, sometimes called a spider or spiderbot and
often shortened to crawler, is an Internet bot that
systematically browses the World Wide Web, typically operated by search
engines for the purpose of Web indexing (web spidering).
Web search engines and some other websites use Web crawling or
spidering software to update their web content or indices of other sites' web
content. Web crawlers copy pages for processing by a search engine,
which indexes the downloaded pages so that users can search more efficiently.
Crawlers consume resources on visited systems and often visit sites without
approval. Issues of schedule, load, and "politeness" come into play when large
collections of pages are accessed. Mechanisms exist for public sites not wishing to
be crawled to make this known to the crawling agent. For example, including
a robots.txt file can request bots to index only parts of a website, or nothing at all.
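The robots.txt mechanism can be sketched with Python's standard
urllib.robotparser module. This is only an illustration: the robots.txt
content, the bot name ExampleBot and the example.com URLs are assumptions made
up for the example, and the rules are parsed from memory rather than fetched
from a site.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content (an assumption for this sketch).
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Allow: /
"""


def build_parser(robots_txt: str) -> RobotFileParser:
    """Parse robots.txt rules given as text, without any network access."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


parser = build_parser(ROBOTS_TXT)
# A polite crawler asks before every download.
print(parser.can_fetch("ExampleBot", "http://example.com/index.html"))
print(parser.can_fetch("ExampleBot", "http://example.com/private/data.html"))
```

A crawler would normally download the file from the site root first and skip
every URL for which can_fetch returns False.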

The number of Internet pages is extremely large; even the largest
crawlers fall short of making a complete index. For this reason, search engines
struggled to give relevant search results in the early years of the World Wide
Web, before 2000. Today, relevant results are given almost
instantly. Crawlers can validate hyperlinks and HTML code. They can
also be used for web scraping and data-driven programming.

A Web Crawler can be used to
 create an index covering broad topics (general Web search)
 create an index covering specific topics (vertical Web search)
 archive content (Web archival)
 analyze Web sites for extracting aggregate statistics
 keep copies or replicate Web sites (Web mirroring)
 Web site analysis

Types of Web search

 General Web search: done by large search engines
 Vertical Web search: the set of target pages is delimited by a topic, a
country or a language
Crawler for general Web search must balance coverage and quality.
 Coverage: It must scan pages that can be used to answer many
different queries
 Quality: The pages should have high quality

Vertical Crawler: focus on a particular subset of the Web

 This subset may be defined geographically, linguistically, topically, etc.

Examples of vertical crawlers
1. Shopbot: designed to download information from on-line shopping
catalogs and provide an interface for comparing prices in a centralized way
2. News crawler: gathers news items from a set of pre-defined sources
3. Spambot: crawler aimed at harvesting e-mail addresses inserted on Web pages

Vertical search also includes segmentation by data format. In this
case, the crawler is tuned to collect only objects of a specific type, such as
image, audio, or video objects. Example:

Feed crawler: checks for updates in RSS/RDF files in Web sites.

Focused crawlers: focus on a particular topic

A focused crawler provides a more efficient strategy by avoiding collecting
more pages than necessary. It receives as input the description of a topic,
usually described by
 a driving query
 a set of example documents
The crawler can operate in
 batch mode, collecting pages about the topic periodically
 on-demand mode, collecting pages driven by a user query

The crawlers assign different importance to issues such as freshness,
quality, and volume. The crawlers can be classified according to these three
issues.

A crawler would like to use all the available resources as much as possible
(crawling servers, bandwidth). However, crawlers should also fulfill
politeness. That is, a crawler cannot overload a Web site with HTTP requests.
This implies that a crawler should wait a small delay between two requests to
the same Web site. Later we will detail other aspects of politeness.

The crawler is composed of three main modules:
 downloader,
 storage, and
 scheduler
Scheduler: maintains a queue of URLs to visit
Downloader: downloads the pages

Storage: indexes the pages, and provides the scheduler with
metadata on the pages retrieved.

The scheduling can be further divided into two parts:

 long-term scheduling: decide which pages to visit next
 short-term scheduling: re-arrange pages to fulfill politeness
The storage can also be further subdivided into three parts: (rich) text,
metadata, and links.

In the short-term scheduler, enforcement of the politeness policy requires
maintaining several queues, one for each site, and a list of pages to download
in each queue.
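The per-site queues of the short-term scheduler can be sketched as follows.
This is a minimal illustration under stated assumptions, not a production
design: the class name PoliteScheduler, the fixed delay and the explicit
`now` argument (used instead of a real clock so the behaviour is easy to
follow) are all hypothetical.

```python
from collections import deque
from urllib.parse import urlsplit


class PoliteScheduler:
    """Short-term scheduler sketch: one FIFO queue per site plus a waiting time."""

    def __init__(self, delay_seconds: float):
        self.delay = delay_seconds
        self.queues = {}        # host -> deque of URLs waiting for that host
        self.next_allowed = {}  # host -> earliest time of the next request

    def add(self, url: str) -> None:
        host = urlsplit(url).netloc
        self.queues.setdefault(host, deque()).append(url)
        self.next_allowed.setdefault(host, 0.0)

    def next_url(self, now: float):
        """Return a URL whose site may be contacted at time `now`, else None."""
        for host, queue in self.queues.items():
            if queue and self.next_allowed[host] <= now:
                # Block this site until the politeness delay has elapsed.
                self.next_allowed[host] = now + self.delay
                return queue.popleft()
        return None


scheduler = PoliteScheduler(delay_seconds=10.0)
for url in ["http://a.example/1", "http://a.example/2", "http://b.example/1"]:
    scheduler.add(url)

order = [scheduler.next_url(0.0), scheduler.next_url(0.0),
         scheduler.next_url(0.0), scheduler.next_url(10.0)]
print(order)
```

The second page of a.example is held back until the delay has passed, while
the page from b.example can be fetched immediately.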

The implementation of a crawler involves many practical issues.
Most of them are due to the need to interact with many different systems.

Example: How to download pages while keeping the traffic produced as uniform
as possible? The pages come from multiple sources, DNS and Web server response
times are highly variable, and Web server up-time cannot be taken for granted.

Other practical issues are related to:
types of Web pages, URL canonization, parsing, wrong implementation of
HTML standards, and duplicates.
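URL canonization can be illustrated with a small normalizer that maps
different spellings of the same address to one form, so that the crawler does
not download a page twice. The rules chosen here (lower-case scheme and host,
drop default ports and fragments, use "/" for an empty path) are a common but
partial set and are an assumption of this sketch; user info and IPv6 hosts are
not handled.

```python
from urllib.parse import urlsplit, urlunsplit

DEFAULT_PORTS = {"http": 80, "https": 443}


def canonicalize(url: str) -> str:
    """Return a normalized form of `url` (illustrative rules only)."""
    parts = urlsplit(url)
    scheme = parts.scheme.lower()
    host = (parts.hostname or "").lower()
    # Keep the port only when it differs from the default for the scheme.
    if parts.port is not None and parts.port != DEFAULT_PORTS.get(scheme):
        host = f"{host}:{parts.port}"
    path = parts.path or "/"
    # The fragment is dropped: it is never sent to the Web server.
    return urlunsplit((scheme, host, path, parts.query, ""))


print(canonicalize("HTTP://Example.COM:80/index.html#top"))
print(canonicalize("http://example.com"))
```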

A Web crawler needs to balance various objectives that contradict
each other. It must download new pages and seek fresh copies of downloaded
pages. It must use network bandwidth efficiently, avoiding the download of bad
pages. However, the crawler cannot know which pages are good without first
downloading them. To further complicate matters, a huge number of pages are
added, changed, and removed every day on the Web.

Crawling the Web, in a certain way, resembles watching the sky on a clear
night: the star positions that we see reflect the state of the stars at
different times.

The simplest crawling scheduling is traversing Web sites in a breadth-first
fashion. This algorithm increases the Web site coverage and is good for the
politeness policy, since it does not request many pages from a site in a row.
However, it can be useful to consider the crawler’s behavior as a combination
of a series of policies. To illustrate, a crawling algorithm can be viewed as
composed of three distinct policies:
 selection policy: to visit the best quality pages, first
 re-visit policy: to update the index when pages change
 politeness policy: to avoid overloading Web sites
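Breadth-first traversal, the simplest scheduling mentioned above, can be
sketched on a toy link graph. The graph LINKS and the function name
breadth_first_crawl are hypothetical, and "downloading" a page is reduced to
looking up its outgoing links in a dictionary.

```python
from collections import deque

# Hypothetical link graph standing in for the Web: page -> pages it links to.
LINKS = {
    "A": ["B", "C"],
    "B": ["D"],
    "C": ["D", "E"],
    "D": [],
    "E": ["A"],
}


def breadth_first_crawl(seed: str, links: dict, limit: int) -> list:
    """Return pages in breadth-first download order, at most `limit` pages."""
    order = []
    seen = {seed}
    frontier = deque([seed])
    while frontier and len(order) < limit:
        page = frontier.popleft()
        order.append(page)                # "download" the page
        for target in links.get(page, []):
            if target not in seen:        # never schedule a page twice
                seen.add(target)
                frontier.append(target)
    return order


print(breadth_first_crawl("A", LINKS, limit=10))
```

A selection policy would replace the plain FIFO frontier with a priority queue
ordered by estimated page quality.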

The diagram below depicts an optimal Web crawling scenario for a
hypothetical batch of five pages. The x-axis is time and the y-axis is speed, so
the area of each page is its size (in bytes).

Let us consider now a more realistic setting in which:
The speed of download of every page is variable and bounded by the effective
bandwidth to a Web site (a fraction of the bandwidth that the crawler would
like to use can be lost).
Pages from the same site cannot be downloaded right away one after the other
(politeness policy).

By the end of the batch of pages, it is very likely that only a few hosts are active

Once a large fraction of the pages have been downloaded it is reasonable to
stop the crawl, if only a few hosts remain at the end. Particularly, if the number
of hosts remaining is very small then the bandwidth cannot be used

A short-term scheduler may improve its finishing time by saturating the
bandwidth usage through the use of more threads. However, if too many
threads are used, the cost of switching control among threads becomes
prohibitive. The ordering of the pages can also be optimized by avoiding
having only a few sites to choose from at any point of the batch.



1.1 Introduction to Recommender System

What is a Recommender System?
• A recommender system, or a recommendation system, is a subclass of
information filtering system that seeks to predict the "rating" or
"preference" a user would give to an item (Wikipedia).

• Recommender Systems (RSs) are software tools and techniques providing
suggestions for items to be of use to a user.

• The suggestions relate to various decision-making processes,
• What items to buy?
• What music to listen to?
• What online news to read?


• Recommender or Recommendation Systems (RS) aim to help users deal with
information overload: finding relevant items in a vast space of resources.

• The goal of a recommender system is to generate meaningful recommendations to a
collection of users for items or products that might interest them.

Examples :

• Suggestions for books on Amazon, or movies on Netflix,

are real-world examples of the operation of industry-strength recommender systems.

• A recommender system deals with the large volume of information present by
filtering the most important information, based on the data provided by a user
and other factors that take care of the user’s preferences and interests.

• It finds the match between user and item, and imputes the similarities
between users and items for recommendation.

Benefits

• Both the users and the services provided have benefited from these kinds of
systems.

• The quality and decision-making process have also improved through these
kinds of systems.

Why the Recommendation system?

1) Benefits users in finding items of their interest.
2) Help item providers in delivering their items to the right user.
3) Identify products that are most relevant to users.
4) Personalized content.
5) Help websites to improve user engagement.

Real-life user interaction with a recommendations system

• In the above image, a user has searched for a laptop with 1TB HDD, 8GB RAM,
and an i5 processor for 40,000₹.
• The system has recommended 3 most similar laptops to the user.

Perfect matching may not be recommended here.

What can be recommended by a Recommendation system?

• There are many different things that can be recommended by the system, like
movies, books, news, articles, jobs, advertisements, etc.
• Netflix uses a recommender system to recommend movies & web-series to its
users.
• Similarly, YouTube recommends different videos.

How is User and Item matching done?
• In order to understand how the item is recommended and how the matching is done,
let us take a look at the images below:

Showing user-item matching for social websites

1.2 Types of Recommendation System

There are three major types of RS and they are:

a) Content-based

b) Collaborative filtering

c) Hybrid

Other types of RS are:

1) Popularity-Based Recommendation System

• It works on the principle of popularity, or anything that is trending.

• These systems check which products or movies are trending or most
popular among the users and directly recommend those.

For example:

• If a product is often purchased by most people, the system learns that this
product is the most popular. For every new user who has just signed up, the
system recommends that product as well, and the chances are high that the new
user will also purchase it.

• Demerits of popularity based recommendation system

• Not personalized
• The system would recommend the same sort of products/movies, solely
based upon popularity, to every user.
• Example
• Google News: News filtered by trending and most popular news.

• YouTube: Trending videos.

• Merits of popularity based recommendation system

• It does not suffer from the cold start problem, which means that even on
day 1 of the business it can recommend products using various filters.
• There is no need for the user's historical data.
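A popularity-based recommender can be sketched in a few lines. The purchase
log PURCHASES and the function most_popular are assumptions made up for
illustration; note that the result is the same for every user, which is the
"not personalized" demerit listed above.

```python
from collections import Counter

# Hypothetical purchase log: (user, item) pairs.
PURCHASES = [
    ("u1", "phone"), ("u2", "phone"), ("u3", "phone"),
    ("u1", "laptop"), ("u2", "laptop"),
    ("u3", "headset"),
]


def most_popular(purchases, top_n: int) -> list:
    """Return the `top_n` most purchased items, most popular first."""
    counts = Counter(item for _user, item in purchases)
    return [item for item, _count in counts.most_common(top_n)]


# A brand-new user with no history still receives a recommendation.
print(most_popular(PURCHASES, top_n=2))
```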

How was the Recommendation System initiated?

Birth of Amazon site

A few years ago,

• readers of Krakauer would never even have learned about Simpson's
book, and if they had, they wouldn't have been able to find it.

Amazon changed that. How?

• It created the Touching the Void phenomenon by combining infinite shelf
space with real-time information about buying trends and public opinion.

The result: rising demand for an obscure (not well-known) book.

A virtue of online booksellers

• It is an example of an entirely new economic model for the media and
entertainment industries, one that is just beginning to show its power.

Unlimited selection is revealing truths about

• what consumers want and how they want to get it in service after service,
from DVDs at Netflix to music videos on Yahoo! Launch to songs in the iTunes
Music Store and Rhapsody.

• People are going deep into the catalog, down the long, long list of available
titles, far past what's available at Blockbuster Video.

Birth of Recommender System

The Major Rules / decision to ponder:

Rule 1: Make everything ( every information ) available.

Rule 2: Cut the price of needed / not needed information in half. Now lower it.

Rule 3: Help me find it myself.



a) Information retrieval: perform your planned information retrieval (information

retrieval techniques)

b) Evaluating the results: evaluate the results of your information retrieval (number and
relevance of search results)

c) Locating publications: find out where and how the required publication, e.g. article,
can be acquired.


What are the major components of the information retrieval process?

• An automated or manually-operated indexing system or Recommender

system used to index and search techniques and procedures.
• The two major components are :
1. A collection of documents in any one of the following formats: text,
image or multimedia.
2. A set of queries that serve as the input to a system, via a human or

Requirements for RS Functions

A Tourist Travels Books suitable Hotel OR Made attracted on Interesting

First of all,

• we must distinguish between the role played by the RS on behalf of the
service provider from that of the user of the RS.

For instance, a travel recommender system is

• typically introduced by a travel intermediary (e.g., Expedia.com) or a
destination management organization (e.g., Visitfinland.com) to increase its
turnover (Expedia), i.e., sell more hotel rooms, or to increase the number of
tourists to the destination.

The user’s primary motivations for accessing the two systems is to

find a suitable hotel and interesting events/attractions when visiting a

• There are various reasons as to why service providers may want to exploit this
• Increase the number of items sold.
• Sell more diverse items.
• Increase the user satisfaction.
• Increase user fidelity.

• Better understand what the user wants.



The three major sources of Data and knowledge are:

• In any case, as a general classification, data used by RSs refers to three kinds of
a) items,
b) users, and
c) transactions, i.e., relations between users and items.

Sources of Knowledge for Recommender Systems

a) Ratings

 Ratings have been the most popular source of knowledge for RS to represent
users’ preferences from the early 1990s to more recent years.

 The foundational RS algorithm collaborative filtering, tries to find like-minded
users by correlating the ratings that users have provided in a system.
 The goal of the algorithm is predicting users’ ratings, under the assumption that
this is a good way to estimate the interest that a user will show for a previously
unseen item.

b) Implicit Feedback

 This source of knowledge refers to actions that the user performs over items, but that
cannot be directly interpreted as explicit interest, i.e., the user explicitly stating her
preference or the relevance of an item.

c) Social Tags

 Social Tagging systems (STS) allow users to attach free keywords, also known as
tags, to items that users share or items that are already available in the system.

 Common examples of these systems are CiteULike, Bibsonomy, or Mendeley
(mainly for academic resources), Delicious (URLs), Flickr (photographs), and
last.fm (music).

 Social Recommender Systems (SRSs) are recommender systems that target the social
media domain.
 The main goals for these systems are to improve recommendation quality and solve
the social information overload problem.
 These recommender systems provide people, web pages, items, or groups as
recommendations to users.




• Recommender Systems are information processing systems that actively gather
various kinds of data in order to build their recommendations.
• Data is primarily about the items to suggest and the users who will receive these
recommendations.
• Since the data and knowledge sources available for recommender systems can be very
diverse, ultimately, whether they can be exploited or not depends on the
recommendation technique.

• In general, there are recommendation techniques that are knowledge poor,
i.e., they use very simple and basic data, such as user
ratings/evaluations for items.
• Other techniques are much more knowledge dependent,
e.g., using ontological descriptions of the users or the items, or
constraints, or social relations and activities of the users.
• In any case, as a general classification, data used by RSs refers to three kinds of
objects: items, users, and transactions, i.e., relations between users and
items.


There are three data and knowledge sources:

1) Items

• Items are the objects that are recommended.

• Items may be characterized by their complexity and their value or utility.
• The value of an item may be positive if the item is useful for the user, or
negative if the item is not appropriate and the user made a wrong decision when
selecting it.
• We note that when a user is acquiring an item she will always incur a cost,
which includes the cognitive cost of searching for the item and the real
monetary cost eventually paid for the item.

• For instance, the designer of a news RS must take into account the complexity
of a news item, i.e., its structure, the textual representation, and the
time-dependent importance of any news item.
• But, at the same time, the RS designer must understand that even if the user
is not paying for reading news, there is always a cognitive cost associated
with searching and reading news items.

• If a selected item is relevant for the user, this cost is dominated by the
benefit of having acquired useful information,
• whereas if the item is not relevant, the net value of that item for the user,
and its recommendation, is negative.

• In other domains, e.g., cars or financial investments, the true monetary cost
of the items becomes an important element to consider when selecting the
most appropriate recommendation approach.

• Items with low complexity and value are:

a) news,
b) Web pages,
c) books,
d) CDs,
e) movies.

• Items with larger complexity and value are:

a) digital cameras
b) mobile phones
c) PCs
• The most complex items that have been considered are:
a) insurance policies,
b) financial investments,
c) travels, jobs.

• RSs, according to their core technology, can use a range of properties and
features of the items.
• For example, in a movie recommender system, the genre (such as comedy,
thriller, etc.), as well as the director and actors, can be used to describe a
movie and to learn how the utility of an item depends on its features.

• Items can be represented using various information and representation
• e.g., in a minimalist way as a single id code, or in a richer form,
as a set of attributes,
• but even as a concept in an ontological representation of the

2) Users

• Users of a RS may have very diverse goals and characteristics.
• In order to personalize the recommendations and the human-computer
interaction, RSs exploit a range of information about the users.
• This information can be structured in various ways, and again the selection
of what information to model depends on the recommendation technique.

• Users can also be described by their behavior pattern data.

• For example:
a) site browsing patterns (in a Web-based recommender system),
b) travel search patterns (in a travel recommender system).
• User data may include relations between users, such as the trust level of
these relations between users.
• A RS might utilize this information to recommend items to users that were
preferred by similar or trusted users.

3) Transactions

• We generically refer to a transaction as a recorded interaction between a
user and the RS.
• Transactions are log-like data that store important information generated
during the human-computer interaction and which are useful for the
recommendation generation algorithm that the system is using.

• For instance, a transaction log may contain a reference to the item selected
by the user and a description of the context (e.g., the user goal/query) for
that particular recommendation.
• If available, that transaction may also include explicit feedback the user
has provided, such as the rating for the selected item.

• In fact, ratings are the most popular form of transaction data that a RS collects.
• These ratings may be collected explicitly or implicitly.
• In the explicit collection of ratings, the user is asked to provide her
opinion about an item on a rating scale.

• Numerical ratings, such as the 1-5 stars provided in the book recommender
associated with Amazon.com.

• Ordinal ratings, such as “strongly agree, agree, neutral, disagree, strongly
disagree”, where the user is asked to select the term that best indicates her
opinion regarding an item (usually via questionnaire).

• Binary ratings that model choices in which the user is simply asked to decide
if a certain item is good or bad.
• Unary ratings can indicate that a user has observed or purchased an item, or
otherwise rated the item positively.
• In such cases, the absence of a rating indicates that we have no information
relating the user to the item (perhaps she purchased the item somewhere else).
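The rating forms above can be illustrated by deriving them from one small
transaction log. The log TRANSACTIONS, the function names and the threshold of
4 stars for a "good" binary rating are assumptions made for this sketch only.

```python
# Hypothetical transaction log: (user, item, explicit 1-5 rating or None).
TRANSACTIONS = [
    ("ann", "book1", 5),
    ("ann", "book2", None),   # purchased, but no explicit rating given
    ("bob", "book1", 2),
]


def numerical_ratings(log) -> dict:
    """Keep the explicit 1-5 star values."""
    return {(u, i): r for u, i, r in log if r is not None}


def binary_ratings(log, threshold: int = 4) -> dict:
    """Good (True) or bad (False); the threshold is an assumption."""
    return {(u, i): r >= threshold for u, i, r in log if r is not None}


def unary_ratings(log) -> set:
    """Only record that an interaction was observed."""
    return {(u, i) for u, i, _r in log}


observed = unary_ratings(TRANSACTIONS)
print(("ann", "book2") in observed)   # observed purchase
print(("bob", "book2") in observed)   # absent: no information, not a dislike
```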


In order to implement its core function, identifying the useful items for the
user, a RS must predict that an item is worth recommending.

In order to do this,

• the system must be able to predict the utility of some of the items, or

• at least compare the utility of some items, and then
• decide what items to recommend based on this comparison.

• The prediction step may not be explicit in the recommendation algorithm,
but we can still apply this unifying model to describe the general
role of a RS.
• Goal: to provide the reader with a unifying perspective rather than
an account of all the different recommendation approaches.
• We quote a taxonomy that has become a classical way of distinguishing
between recommender systems and referring to them.


Different classes of recommendation approaches

Five different classes of recommendation approaches are:

a) Content-based

b) Collaborative filtering

c) Demographic

d) Knowledge-based

e) Community-based

a) Content-based:

• The system learns to recommend items that are similar to the ones that the user liked
in the past.
• The similarity of items is calculated based on the features associated with the
compared items.

For example, if a user has positively rated a movie that belongs to the comedy genre, then
the system can learn to recommend other movies from this genre.


b) Collaborative filtering

• The simplest and original implementation of this approach recommends to the
active user the items that other users with similar tastes liked in the past.
• The similarity in taste of two users is calculated based on the similarity in the
rating history of the users.
• This is the reason why collaborative filtering is also referred to as “people-to-people
correlation”.
• Collaborative filtering is considered to be the most popular and widely
implemented technique in RS.
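
A minimal sketch of the user-to-user idea described above, assuming ratings are stored as nested dictionaries. The similarity measure (cosine over co-rated items), the function names, and the toy ratings are assumptions chosen for illustration; real systems often use mean-centred measures such as Pearson correlation instead.

```python
import math


def user_similarity(a, b):
    """Cosine similarity between two users, computed over the items both have rated."""
    common = set(a) & set(b)
    if not common:
        return 0.0
    dot = sum(a[i] * b[i] for i in common)
    norm_a = math.sqrt(sum(a[i] ** 2 for i in common))
    norm_b = math.sqrt(sum(b[i] ** 2 for i in common))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def predict_rating(ratings, active, item):
    """Similarity-weighted average of the ratings other users gave to `item`."""
    numerator = denominator = 0.0
    for other, other_ratings in ratings.items():
        if other == active or item not in other_ratings:
            continue
        sim = user_similarity(ratings[active], other_ratings)
        numerator += sim * other_ratings[item]
        denominator += abs(sim)
    # None signals that no other user has rated the item.
    return numerator / denominator if denominator else None


# Hypothetical rating histories (illustrative values only).
ratings = {
    "ann": {"m1": 5, "m2": 1},
    "bob": {"m1": 5, "m2": 1, "m3": 4},
    "cat": {"m1": 1, "m2": 5, "m3": 2},
}
predicted = predict_rating(ratings, "ann", "m3")
```

Because "ann" agrees with "bob" far more than with "cat", the prediction for "m3" is pulled towards bob's rating.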

c) Demographic

• This type of system recommends items based on the demographic profile of the
user.
• The assumption is that different recommendations should be generated for
different demographic niches.
• Many Web sites adopt simple and effective personalization solutions based on
demographics.
• For example,
• users are dispatched to particular Web sites based on their
language or country.
• Or suggestions may be customized according to the age of the
user.
• While these approaches have been quite popular in the marketing
literature, there has been relatively little proper RS research into
demographic systems.

d) Knowledge-based

• Knowledge-based systems recommend items based on specific domain
knowledge about how certain item features meet users’ needs and preferences
and, ultimately, how the item is useful for the user.
• Notable knowledge-based recommender systems are case-based.
• In these systems a similarity function estimates
• how much the user needs (problem description) match the
recommendations (solutions of the problem).
• Here the similarity score can be directly interpreted as the utility of the
recommendation for the user.
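
As a rough sketch of such a similarity function, the user needs can be written as required attribute values and compared against each item's features. The function name `case_similarity`, the optional weights, and the camera catalogue are all hypothetical and only illustrate the idea that the score doubles as the utility.

```python
def case_similarity(needs, item_features, weights=None):
    """Weighted fraction of the user's requirements that the item satisfies (0.0 to 1.0)."""
    weights = weights or {}
    total = sum(weights.get(key, 1.0) for key in needs)
    if total == 0:
        return 0.0
    matched = sum(
        weights.get(key, 1.0)
        for key, wanted in needs.items()
        if item_features.get(key) == wanted
    )
    return matched / total


# Hypothetical problem description (user needs) and candidate solutions (items).
needs = {"type": "camera", "zoom": "optical", "waterproof": True}
catalog = {
    "cam_a": {"type": "camera", "zoom": "optical", "waterproof": True},
    "cam_b": {"type": "camera", "zoom": "digital", "waterproof": True},
}
# The similarity score is read directly as the utility of each recommendation.
utilities = {name: case_similarity(needs, features) for name, features in catalog.items()}
```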

e) Community-based

• This type of system recommends items based on the preferences of the user’s
friends.
• This technique follows the epigram
• “Tell me who your friends are, and I will tell you who you are”.
• Evidence suggests that people tend to rely more on recommendations from
their friends than on recommendations from similar but anonymous
individuals.
• This observation, combined with the growing popularity of open social
networks, is generating a rising interest in community-based systems or, as
they are usually referred to, social recommender systems.
• This type of RS models and acquires information about the social relations of
the users and the preferences of the user’s friends.
• The recommendation is based on ratings that were provided by the user’s
friends.
• In fact these RSs are following the rise of social networks and enable a simple
and comprehensive acquisition of data related to the social relations of the
users.

Hybrid recommender systems

• These RSs are based on the combination of the above mentioned techniques.
• A hybrid system combining techniques A and B tries to use the advantages of A
to fix the disadvantages of B.
• For instance, CF methods suffer from new-item problems, i.e., they cannot
recommend items that have no ratings.
• This does not limit content-based approaches, since the prediction for new
items is based on their description (features), which is typically easily available.
• Given two (or more) basic RS techniques, several ways have been proposed
for combining them to create a new hybrid.
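
One possible way of combining the two techniques is a switching strategy: use the CF score when the item already has ratings and fall back to the content-based score otherwise. The sketch below assumes this strategy; the function name, the `min_ratings` threshold, and the toy scores are hypothetical, and other combination schemes (for example weighted averaging) are equally valid.

```python
def hybrid_score(item, cf_score, content_score, rating_counts, min_ratings=1):
    """Switching hybrid: CF for items with enough ratings, content-based for new items."""
    if rating_counts.get(item, 0) >= min_ratings:
        return cf_score(item)
    # New-item case: CF has nothing to work with, so use the item description instead.
    return content_score(item)


# Hypothetical scores from two already-built recommenders (illustrative values only).
rating_counts = {"old_movie": 120}
cf_scores = {"old_movie": 4.2}
content_scores = {"old_movie": 3.9, "new_movie": 4.0}

scores = {
    item: hybrid_score(item, cf_scores.get, content_scores.get, rating_counts)
    for item in ("old_movie", "new_movie")
}
```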

• The aspects that apply to the design stage include factors that might affect the
choice of the algorithm.
• The first factor to consider, the application’s domain, has a major effect on the
algorithmic approach that should be taken.
• Based on the specific application domains,
• we define more general classes of domains for the most common
recommender systems applications:
• Entertainment - recommendations for movies, music, and IPTV.
• Content - personalized newspapers, recommendation for
documents, recommendations of Web pages, e-learning
applications, and e-mail filters.
• E-commerce - recommendations for consumers of products to
buy such as books, cameras, PCs etc.
• Services - recommendations of travel services, recommendation
of experts for consultation, recommendation of houses to rent, or
matchmaking services.



• What is a RS?
• Recommender systems have the effect of
• guiding users in a personalized way to interesting objects in a
large space of possible options.

• What is a C-BRS?
• Content-based recommendation systems try to
• recommend items similar to those a given user has liked in
the past.
• A content-based recommender works with the data that we take from the user,
• either explicitly (rating) or implicitly (clicking on a link).
• From this data we create a user profile,
• which is then used to make suggestions to the user. As the user provides
more input or takes more actions on the recommendations, the
engine becomes more accurate.

User Profile

• In the user profile,
• we create vectors that describe the user’s preferences.
• In the creation of a user profile,
• we use the utility matrix, which describes the relationship
between user and item.
• With this information,
• the best estimate we can make regarding which items the user likes is
some aggregation of the profiles of those items.
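
One simple aggregation, shown here only as a sketch, is a rating-weighted average of the item profiles. The choice of weighting, the function name `build_user_profile`, and the toy genre features are assumptions for illustration.

```python
def build_user_profile(item_profiles, user_ratings):
    """Aggregate item feature vectors into a user vector, weighting each item by its rating."""
    total = sum(user_ratings.values())
    if not total:
        return {}
    profile = {}
    for item, rating in user_ratings.items():
        for feature, value in item_profiles[item].items():
            profile[feature] = profile.get(feature, 0.0) + rating * value
    # Normalise by the sum of ratings so the result is a weighted average.
    return {feature: value / total for feature, value in profile.items()}


# Hypothetical item profiles (genre indicators) and one user's row of the utility matrix.
item_profiles = {
    "m1": {"comedy": 1.0},
    "m2": {"comedy": 1.0, "romance": 1.0},
    "m3": {"horror": 1.0},
}
user_ratings = {"m1": 5, "m2": 4, "m3": 1}
user_profile = build_user_profile(item_profiles, user_ratings)
```

In this toy case the profile leans heavily towards comedy because the two highest-rated movies share that feature.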

Item Profile
• We build a profile for each item,
• which represents the important characteristics of that item.
• Example: if the item is a movie, then
• its actors, director, release year and genre are the most
significant features of the movie.
• We can also add its rating from IMDb (Internet Movie Database) to the
item profile.
• Predefined datasets of item features are often readily available.

Utility Matrix
• The utility matrix signifies the user’s preference for certain items.
• In the data gathered from the user,
• we have to find some relation between the items which are liked
by the user and those which are disliked; for this purpose we use
the utility matrix.
• In it we assign a particular value to each user-item pair; this value is known as
the degree of preference.
• Then we draw a matrix of users against the respective items to identify their
preference relationship.

• The goal of a recommendation system is not to fill all the columns
• but to recommend a movie to the user which he/she will
prefer.
• Through this table, our recommender system won’t suggest Movie 3 to User 2,
• because for Movie 1 User 1 and User 2 have given approximately the same
ratings, and
• for Movie 3 User 1 has given a low rating,
• so it is highly possible that User 2 also won’t like it.
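
The reasoning above can be traced on a small utility matrix. The values below are hypothetical (the original table is not reproduced here), and the helper names, the agreement `tolerance`, and the `low` rating threshold are assumptions made for this sketch.

```python
# Hypothetical utility matrix: None marks a user-item pair with no known preference.
utility = {
    "User 1": {"Movie 1": 4, "Movie 2": None, "Movie 3": 1},
    "User 2": {"Movie 1": 4, "Movie 2": 5, "Movie 3": None},
}


def unrated_items(matrix, user):
    """Items for which we have no degree of preference from this user."""
    return [item for item, rating in matrix[user].items() if rating is None]


def probably_dislikes(matrix, user, neighbour, item, tolerance=1, low=2):
    """True when the neighbour agrees with the user on co-rated items and rated `item` low."""
    co_rated = [
        i for i, rating in matrix[user].items()
        if rating is not None and matrix[neighbour].get(i) is not None
    ]
    agree = bool(co_rated) and all(
        abs(matrix[user][i] - matrix[neighbour][i]) <= tolerance for i in co_rated
    )
    neighbour_rating = matrix[neighbour].get(item)
    return agree and neighbour_rating is not None and neighbour_rating <= low


candidates = unrated_items(utility, "User 2")
skip_movie_3 = probably_dislikes(utility, "User 2", "User 1", "Movie 3")
```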

C-BRS : Recommending Items to User Based on Content

• Method 1:
We can use the cosine distance between the vectors of the item and the user to
determine how strongly the user is likely to prefer the item: the smaller the
distance, the stronger the predicted preference.

• Method 2:
We can use a classification approach in the recommendation system:
• for example, we can use a decision tree to find out whether a user
wants to watch a movie or not,
• where at each level we apply a certain condition to refine our
recommendation.
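
Both methods can be sketched briefly. In the code below, `cosine_distance` works on sparse feature dictionaries, and `wants_to_watch` is a hand-written stand-in for a learned decision tree; the feature names, thresholds, and conditions are hypothetical and would normally be learned from the user's rating history.

```python
import math


def cosine_distance(u, v):
    """1 - cosine similarity between two sparse feature vectors stored as dicts."""
    dot = sum(u.get(f, 0.0) * v.get(f, 0.0) for f in set(u) | set(v))
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 1.0  # no information: treat as maximally distant
    return 1.0 - dot / (norm_u * norm_v)


def wants_to_watch(movie):
    """Toy decision tree: each level applies one condition to refine the decision."""
    if movie["genre"] == "comedy":
        return movie["year"] >= 2000
    return movie["imdb_rating"] >= 8.0


# Method 1: a smaller distance means a stronger predicted preference.
user_vector = {"comedy": 0.9, "horror": 0.1}
d_comedy = cosine_distance(user_vector, {"comedy": 1.0})
d_horror = cosine_distance(user_vector, {"horror": 1.0})

# Method 2: classify a single movie as "watch" or "skip".
decision = wants_to_watch({"genre": "comedy", "year": 2010, "imdb_rating": 6.5})
```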


Indeed, the basic process performed by a content-based recommender consists in
matching up the attributes of a user profile, in which preferences and interests
are stored, with the attributes of a content object (item), in order to recommend
to the user new interesting items.

How do Content Based Recommender Systems work?

• A content based recommender works with data that the user provides, either
explicitly (rating) or implicitly (clicking on a link).
• Based on that data, a user profile is generated, which is then used to make
suggestions to the user.

• Content-based recommendation systems try to
• recommend items similar to those a given user has liked in the
past.
• Systems designed according to the collaborative recommendation paradigm
• identify users whose preferences are similar to those of the given user and
recommend items they have liked.

• In a content-based recommender,
• we must build a profile for each item, which will represent the important
characteristics of that item.
• For example, if the item is a movie, then its actors, director, release
year and genre are the most significant features of the movie.

• Systems implementing a content-based recommendation approach analyze

• a set of documents and/or descriptions of items previously rated
by a user, and
• build
• a model or profile of user interests based on the features of the
objects rated by that user.
• The profile is a structured representation of user interests,
• adopted to recommend new interesting items.

• The recommendation process basically consists

• in matching up
• the attributes of the user profile
• against
• the attributes of a content object.
• The result is
• a relevance judgment that represents the user’s level of
interest in that object.
• If a profile accurately reflects user preferences,
• it is of tremendous advantage for the effectiveness of an
information access process.

• For instance,
• the profile could be used to filter search results by
• deciding whether a user is interested in a specific Web page or
not and,
• in the negative case, preventing it from being displayed.


• Content-based Information Filtering (IF) systems need proper techniques for
• representing the items and producing the user profile, and
• some strategies for comparing the user profile with the item
representation.
• The high level architecture of a content based recommender system is depicted
in Figure below:

Components of Content-based recommender system

• The recommendation process is performed in three steps, each of which is
handled by a separate component.

• The three principal components are:

a) A Content Analyzer, which gives us a classification of the items, using some sort of
representation.

b) A Profile Learner, which builds a profile that represents each user’s preferences.

c) A Filtering Component, which takes all the inputs and generates the list of
recommendations for each user.
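
The three components can be pictured as three small functions chained together. This is a deliberately simplified sketch: the keyword-count representation, the summing profile learner, the overlap score, and all names and item texts are assumptions for illustration rather than a prescribed design.

```python
from collections import Counter


def content_analyzer(description):
    """Turn unstructured text into a structured keyword-count representation."""
    return Counter(description.lower().split())


def profile_learner(liked_representations):
    """Generalise the items a user liked into one profile by summing keyword counts."""
    profile = Counter()
    for representation in liked_representations:
        profile.update(representation)
    return profile


def filtering_component(profile, represented_items):
    """Rank item names by how strongly their keywords overlap with the profile."""
    def score(name):
        return sum(profile[word] * count for word, count in represented_items[name].items())
    return sorted(represented_items, key=score, reverse=True)


# Hypothetical item descriptions and one liked item.
descriptions = {
    "a": "space opera adventure",
    "b": "romantic comedy",
    "c": "space station thriller",
}
represented_items = {name: content_analyzer(text) for name, text in descriptions.items()}
profile = profile_learner([content_analyzer("space adventure epic")])
ranking = filtering_component(profile, represented_items)
```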


When information has no structure (e.g. text), some kind of pre-processing step is needed
to extract structured relevant information.

• The main responsibility of the CONTENT ANALYZER is to
• represent the content of items
• (e.g. documents, Web pages, news, product descriptions, etc.)
coming from
• information sources in a form suitable for the next
processing steps.

• Data items are analyzed by feature extraction techniques in order to
• shift item representation from the original information space to
the target one.
• For example: Web pages represented as keyword vectors.
• This representation is the input to the PROFILE LEARNER and the FILTERING COMPONENT.
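
A common way to build such keyword vectors is TF-IDF weighting from Information Retrieval. The notes do not say which weighting is used, so treat the choice of TF-IDF, the whitespace tokenizer, and the sample pages below as assumptions of this sketch.

```python
import math
from collections import Counter


def tfidf_vectors(documents):
    """Map each document name to a {term: tf-idf weight} keyword vector."""
    tokenized = {name: text.lower().split() for name, text in documents.items()}
    n_docs = len(tokenized)
    # Document frequency: in how many documents does each term occur?
    doc_freq = Counter(term for tokens in tokenized.values() for term in set(tokens))
    vectors = {}
    for name, tokens in tokenized.items():
        term_counts = Counter(tokens)
        vectors[name] = {
            term: (count / len(tokens)) * math.log(n_docs / doc_freq[term])
            for term, count in term_counts.items()
        }
    return vectors


# Hypothetical Web page texts.
pages = {"p1": "python tutorial python", "p2": "cooking tutorial"}
page_vectors = tfidf_vectors(pages)
```

Terms that occur in every page (here "tutorial") receive weight zero, so the vectors emphasise what distinguishes one page from another.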


• The PROFILE LEARNER collects
• data representative of the user preferences and tries to generalize
this data, in order to construct the user profile.
• Usually, the generalization strategy is realized through machine learning
techniques,
• which are able to infer a model of user interests starting from
items liked or disliked in the past.

• For instance, the PROFILE LEARNER of a Web page recommender can implement
• a relevance feedback method in which the learning technique combines
• vectors of positive and negative examples into a
prototype vector representing the user profile.
• Training examples are:
• Web pages on which a positive or negative feedback has been
provided by the user.
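
A Rocchio-style combination is one well-known relevance feedback method of this kind: the prototype is the weighted mean of the positive examples minus the weighted mean of the negative ones. The sketch below assumes that method; the function name and the `beta` and `gamma` weights are illustrative defaults, not values given in these notes.

```python
def rocchio_profile(positive, negative, beta=0.75, gamma=0.15):
    """Prototype vector = beta * mean(positive vectors) - gamma * mean(negative vectors)."""
    profile = {}
    for examples, weight in ((positive, beta), (negative, -gamma)):
        if not examples:
            continue
        for vector in examples:
            for term, value in vector.items():
                profile[term] = profile.get(term, 0.0) + weight * value / len(examples)
    return profile


# Hypothetical keyword vectors of Web pages the user marked as liked or disliked.
liked_pages = [{"python": 1.0, "tutorial": 1.0}, {"python": 1.0}]
disliked_pages = [{"cooking": 1.0}]
prototype = rocchio_profile(liked_pages, disliked_pages)
```

Terms from liked pages end up with positive weights and terms from disliked pages with negative weights, so later matching rewards the former and penalises the latter.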

• The FILTERING COMPONENT exploits the user profile to
• suggest relevant items by matching the profile representation
against that of items to be recommended.
• The result is a binary or continuous relevance judgment (computed using some
similarity metrics [42]), the latter case resulting in a ranked list of potentially
interesting items.
• In the above mentioned example,
• the matching is realized by computing the cosine similarity
between the prototype vector and the item vectors.

• The first step of the recommendation process is the one performed by the
CONTENT ANALYZER, which usually borrows
• techniques from Information Retrieval systems.
• Item descriptions coming from the Information Source are processed by the
CONTENT ANALYZER, which extracts features (keywords, n-grams, concepts,
...) from
• unstructured text to produce a structured item representation,
stored in the repository Represented Items.

• In order to construct and update the profile of the active user (user for which
recommendations must be provided)
• her reactions to items are collected in some way and recorded in
the repository Feedback.
• These reactions, called annotations or feedback, together with the
related item descriptions, are exploited during the process of learning a
model useful to predict the actual relevance of newly presented items.

• Two different techniques can be adopted for recording the user’s feedback.

1) Explicit feedback: When a system requires the user to explicitly evaluate items, this
technique is usually referred to as “explicit feedback”.

2) Implicit feedback: It does not require any active user involvement, in the sense that
feedback is derived from monitoring and analyzing the user’s activities.
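
To make the implicit case concrete, the sketch below turns a log of observed user actions into per-item relevance scores. The event names, their weights, and the cap at 1.0 are all hypothetical; a real system would tune them or learn them from data.

```python
from collections import defaultdict

# Hypothetical weights: how strongly each observed action suggests interest.
EVENT_WEIGHTS = {"view": 0.1, "click": 0.3, "bookmark": 0.6, "purchase": 1.0}


def implicit_scores(events, weights=EVENT_WEIGHTS, cap=1.0):
    """Accumulate a relevance score per item from (item, event) pairs, capped at `cap`."""
    scores = defaultdict(float)
    for item, event in events:
        # Unknown events contribute nothing but still register the item.
        scores[item] = min(cap, scores[item] + weights.get(event, 0.0))
    return dict(scores)


# Hypothetical activity log collected by monitoring the user.
activity_log = [
    ("p1", "view"), ("p1", "click"),
    ("p2", "purchase"), ("p2", "click"),
    ("p3", "scroll"),
]
feedback = implicit_scores(activity_log)
```

Explicit feedback needs no such mapping, because the user supplies the rating directly.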

• Explicit evaluations indicate
• how relevant or interesting an item is to the user.

• There are three main approaches to get explicit relevance feedback:

• like/dislike – items are classified as “relevant” or “not relevant” by adopting a
simple binary rating scale;

• ratings – a discrete numeric scale is usually adopted to judge items.
Alternatively, symbolic ratings are mapped to a numeric scale, such as in
Syskill & Webert, where users have the possibility of rating a Web page as hot,
lukewarm, or cold;

• text comments – comments about a single item are collected and presented to
the users as a means of facilitating the decision-making process.
• For instance, customers’ feedback at Amazon.com or eBay.com might help
users in deciding whether an item has been appreciated by the community.
• Textual comments are helpful, but they can overload the active user because
she must read and interpret each comment to decide if it is positive or negative,
and to what degree.

• The literature proposes advanced techniques from the affective computing
research area to
• make content-based recommenders able to automatically perform
this kind of analysis.

• How to represent the content?

• The content of an item is a very abstract thing and gives us a lot of options. We
could use a lot of different variables. For example, for a book we could
consider the author, the genre, the text of the book itself… the list goes on.

• Content-based filtering is a type of recommender system that attempts to
guess what a user may like based on that user's activity.
• Content-based filtering makes recommendations by
• using keywords and attributes assigned to objects in a database
• e.g., items in an online marketplace and
• matching them to a user profile.

Advantages of Content-based Filtering

• The adoption of the content-based recommendation paradigm has several
advantages when compared to the collaborative one.

User independence

• Content-based recommenders exploit solely ratings provided by the active user
to build her own profile.
• Instead, collaborative filtering methods need ratings from other users in order
to find the “nearest neighbors” of the active user,
• i.e., users that have similar tastes since they rated the same items
similarly.
• Then, only the items that are most liked by the neighbors of the active user will
be recommended.


Transparency

• Explanations on how the recommender system works can be provided by
explicitly listing content features or descriptions that caused an item to occur in
the list of recommendations.
• Those features are indicators to consult in order to decide whether to trust a
recommendation.
• Conversely, collaborative systems are black boxes since the only explanation
for an item recommendation is that unknown users with similar tastes liked that
item.
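
Listing the features behind a recommendation is straightforward when profiles and items share one feature space. The sketch below assumes a simple dot-product scorer, so each feature's contribution is its profile weight times its item weight; the function name `explain` and the toy vectors are hypothetical.

```python
def explain(profile, item_features, top_k=3):
    """Return the features that contributed most positively to an item's score."""
    contributions = {
        feature: profile.get(feature, 0.0) * weight
        for feature, weight in item_features.items()
    }
    ranked = sorted(contributions.items(), key=lambda pair: pair[1], reverse=True)
    # Keep only features that actually pushed the item up the list.
    return [(feature, value) for feature, value in ranked[:top_k] if value > 0]


# Hypothetical user profile and recommended item.
profile = {"comedy": 0.9, "kubrick": 0.5, "horror": -0.2}
movie = {"comedy": 1.0, "kubrick": 1.0, "horror": 1.0, "musical": 1.0}
reasons = explain(profile, movie)
```

The returned pairs can be shown to the user as "recommended because of: comedy, kubrick", which is the kind of explanation a collaborative system cannot give.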


New item

• Content-based recommenders are capable of
• recommending items not yet rated by any user.
• As a consequence,
• they do not suffer from the first-rater problem, which affects
collaborative recommenders, which rely solely on users’
preferences to make recommendations.
• Therefore, until a new item is rated by a substantial number of users,
• a collaborative system would not be able to recommend it.

Drawbacks of Content-based Filtering


Limited content analysis

• Content-based techniques have a natural limit in the number and type of
features that are associated, whether automatically or manually, with the
objects they recommend.
• Domain knowledge is often needed, e.g., for movie recommendations the
system needs to know the actors and directors, and sometimes domain
ontologies are also needed.

• No content-based recommendation system can provide
• suitable suggestions if the analyzed content does not contain
enough information to discriminate items the user likes from
items the user does not like.
• Some representations capture only certain aspects of the content,
• but there are many others that would influence a user’s
experience.

• For instance,
• often there is not enough information in the word frequency to
• model the user interests in jokes or poems,
• while techniques for affective computing would be most
appropriate.
• Again, for Web pages, feature extraction techniques from text completely
ignore aesthetic qualities and additional multimedia information.
• To sum up,
• both automatic and manual assignment of features to items
may not be sufficient to
• define distinguishing aspects of items that turn out to be
necessary for the elicitation of user interests.


Over-specialization

• Content-based recommenders have no inherent method for finding something
unexpected.
• The system suggests items whose scores are high when matched against the
user profile, hence the user is going to be recommended items similar to those
already rated.
• This drawback is also called the serendipity problem, to highlight the tendency of
content-based systems to produce recommendations with a limited degree
of novelty.

• To give an example:
• when a user has only rated movies directed by Stanley Kubrick,
she will be recommended just that kind of movie.
• A “perfect” content-based technique would rarely find anything
novel, limiting the range of applications for which it would be
useful.


New user

• Enough ratings have to be collected before
• a content-based recommender system can really understand
• user preferences and provide accurate
recommendations.
• Therefore, when few ratings are available,
• as for a new user, the system will not be able to provide reliable
recommendations.


