CS8080 – Information Retrieval Techniques – Notes
INFORMATION TECHNOLOGY – PROFESSIONAL ELECTIVE V
UNIT I: INTRODUCTION
Syllabus:
UNIT I INTRODUCTION 9
Information Retrieval – Early Developments – The IR Problem – The User's Task –
Information versus Data Retrieval - The IR System – The Software Architecture of the IR
System – The Retrieval and Ranking Processes - The Web – The e-Publishing Era – How
the web changed Search – Practical Issues on the Web – How People Search – Search
Interfaces Today – Visualization in Search Interfaces.
1. INFORMATION RETRIEVAL
Information
Types of Information
Text
XML and structured documents
Images
Audio
Video
Source Code
Applications/Web services
Retrieval
“Fetch something” that’s been stored
Information Retrieval
The amount of available information is growing at an incredible rate, for example the Internet and
World Wide Web.
Information is stored in many forms, e.g., images, text, video, and audio.
Information Retrieval is a way to separate relevant data from irrelevant data.
The IR field has developed successful methods to deal effectively with huge amounts of information.
Common methods include the Boolean, Vector Space and Probabilistic models.
Main objective of IR
Provide the users with effective access to and interaction with information resources.
Goal of IR
The goal is to search large document collections to retrieve small subsets relevant to the user's information need.
Purpose/role of an IR system
An information retrieval system is designed to retrieve the documents or information required
by the user community.
It should make the right information available to the right user.
Thus, an information retrieval system aims at collecting and organizing information in one or
more subject areas in order to provide it to the user as soon as possible.
Thus it serves as a bridge between the world of creators or generators of information and the users
of that information.
Information retrieval (IR) is concerned with representing, searching, and manipulating large
collections of electronic text and other human-language data.
Web search engines — Google, Bing, and others — are by far the most popular and heavily used IR
services, providing access to up-to-date technical information, locating people and organizations,
summarizing news and events, and simplifying comparison shopping.
Web Search:
Regular users of Web search engines casually expect to receive accurate and near-
instantaneous answers to questions and requests merely by entering a short query — a few
words — into a text box and clicking on a search button. Underlying this simple and intuitive
interface are clusters of computers, comprising thousands of machines, working cooperatively
to generate a ranked list of those Web pages that are likely to satisfy the information need
embodied in the query.
These machines identify a set of Web pages containing the terms in the query, compute a score
for each page, eliminate duplicate and redundant pages, generate summaries of the remaining
pages, and finally return the summaries and links back to the user for browsing.
Consider a simple example.
If you have a computer connected to the Internet nearby, pause for a minute to launch a browser
and try the query “information retrieval” on one of the major commercial Web search engines.
It is likely that the search engine responded in well under a second. Take some time to review
the top ten results. Each result lists the URL for a Web page and usually provides a title and a
short snippet of text extracted from the body of the page.
Overall, the results are drawn from a variety of different Web sites and include sites associated
with leading textbooks, journals, conferences, and researchers. As is common for informational
queries such as this one, the Wikipedia article may be present.
Other IR Applications:
1) Document routing, filtering, and selective distribution reverse the typical IR process.
2) Summarization systems reduce documents to a few key paragraphs, sentences, or phrases
describing their content. The snippets of text displayed with Web search results represent one
example.
3) Information extraction systems identify named entities, such as places and dates, and combine
this information into structured records that describe relationships between these entities — for
example, creating lists of books and their authors from Web data.
Kinds of information retrieval systems
1. Database Management
• Focused on structured data stored in relational tables rather than free-form text.
• Focused on efficient processing of well-defined queries in a formal language (SQL).
• Clearer semantics for both data and queries.
• Recent move towards semi-structured data (XML) brings it closer to IR.
2. Library and Information Science
• Focused on the human user aspects of information retrieval (human-computer interaction, user interface, visualization).
• Concerned with effective categorization of human knowledge.
• Concerned with citation analysis and bibliometrics (structure of information).
• Recent work on digital libraries brings it closer to CS & IR.
3. Artificial Intelligence
• Focused on the representation of knowledge, reasoning, and intelligent action.
• Formalisms for representing knowledge and queries:
– First-order Predicate Logic
– Bayesian Networks
• Recent work on web ontologies and intelligent information agents brings it closer to IR.
4. Natural Language Processing
• Focused on the syntactic, semantic, and pragmatic analysis of natural language text and
discourse.
• Ability to analyze syntax (phrase structure) and semantics could allow retrieval based on meaning rather than keywords.
Natural Language Processing: IR Directions
• Methods for determining the sense of an ambiguous word based on context (word sense disambiguation).
• Methods for identifying specific pieces of information in a document (information extraction).
• Methods for answering specific NL questions from document corpora or structured data like Freebase or Google's Knowledge Graph.
5. Machine Learning
• Focused on the development of computational systems that improve their performance with
experience.
• Automated classification of examples based on learning concepts from labeled training
examples (supervised learning).
• Automated methods for clustering unlabeled examples into meaningful groups (unsupervised
learning).
Machine Learning: IR Directions
• Text Categorization
– Automatic hierarchical classification (Yahoo).
– Adaptive filtering/routing/recommending.
– Automated spam filtering.
• Text Clustering
– Clustering of IR query results.
– Automatic formation of hierarchies (Yahoo).
• Learning for Information Extraction
• Text Mining
• Learning to Rank
2. EARLY DEVELOPMENTS
1950s:
1950: The term "information retrieval" was coined by Calvin Mooers.
1951: Philip Bagley conducted the earliest experiment in computerized document retrieval in a master's thesis at MIT.
1955: Allen Kent and colleagues at Western Reserve University published a paper in American Documentation describing the precision and recall measures, as well as detailing a proposed "framework" for evaluating an IR system, which included statistical sampling methods for determining the number of relevant documents not retrieved.
1959: Hans Peter Luhn published "Auto-encoding of documents for information retrieval."
1960s:
1960-70’s:
Initial exploration of text retrieval systems for “small” corpora of scientific abstracts, and law
and business documents.
Development of the basic Boolean and vector-space models of retrieval.
Prof. Salton and his students at Cornell University were the leading researchers in the area.
early 1960s: Gerard Salton began work on IR at Harvard, later moved to Cornell.
1963: Joseph Becker and Robert M. Hayes published a text on information retrieval: Becker, Joseph; Hayes, Robert Mayo. Information Storage and Retrieval: Tools, Elements, Theories. New York, Wiley (1963).
1964:
Karen Spärck Jones finished her thesis at Cambridge, Synonymy and Semantic Classification,
and continued work on computational linguistics as it applies to IR.
mid-1960s:
The National Library of Medicine developed MEDLARS (Medical Literature Analysis and Retrieval System), the first major machine-readable database and batch-retrieval system.
Project Intrex at MIT.
1965: J. C. R. Licklider published Libraries of the Future.
late 1960s: F. Wilfrid Lancaster completed evaluation studies of the MEDLARS system and published
the first edition of his text on information retrieval.
1968: Gerard Salton published Automatic Information Organization and Retrieval. John W. Sammon,
Jr.'s RADC Tech report "Some Mathematics of Information Storage and Retrieval..." outlined the
vector model.
1969: Sammon's "A nonlinear mapping for data structure analysis" (IEEE Transactions on Computers)
was the first proposal for visualization interface to an IR system.
1970s:
early 1970s: First online systems: NLM's AIM-TWX, MEDLINE; Lockheed's Dialog; SDC's ORBIT.
1971: Nicholas Jardine and Cornelis J. van Rijsbergen published "The use of hierarchic clustering in
information retrieval", which articulated the "cluster hypothesis."
1975: Three highly influential publications by Salton fully articulated his vector processing framework and term discrimination model, including A Theory of Indexing (Society for Industrial and Applied Mathematics).
1979: C. J. van Rijsbergen published Information Retrieval (Butterworths). Heavy emphasis on
probabilistic models.
1979: Tamas Doszkocs implemented the CITE natural language user interface for MEDLINE at the
National Library of Medicine. The CITE system supported free form query input, ranked output and
relevance feedback.
1980s
Mid-1980s: Efforts to develop end-user versions of commercial IR systems.
1989: First World Wide Web proposal by Tim Berners-Lee at CERN.
1990s
2000s-present:
More applications, especially Web search, and interactions with other fields: learning to rank, scalability (e.g., MapReduce), real-time search
Link analysis for Web Search
o Google
Automated Information Extraction
o Whizbang
o Fetch
o Burning Glass
Question Answering
Multimedia IR
o Image
o Video
o Audio and music
Cross-Language IR
o DARPA Tides
Document Summarization
o Learning to Rank
3. THE IR PROBLEM
The IR Problem: the primary goal of an IR system is to retrieve all the documents that are relevant to a user query while retrieving as few non-relevant documents as possible.
The difficulty is knowing not only how to extract information from the documents
but also knowing how to use it to decide relevance.
That is, the notion of relevance is of central importance in IR.
1.Relevance
One main issue is that relevance is a personal assessment that depends on the task being solved
and its context.
For example:
Relevance can change
a) with time (e.g., new information becomes available),
b) with location (e.g., the most relevant answer is the closest one), or
c) even with the device (e.g., the best answer is a short document that is easier to download and
visualize).
It is the fundamental concept in IR.
A relevant document contains the information that a person was looking for when she submitted a query to the search engine.
There are many factors that go into a person’s decision as to whether a document is relevant.
These factors must be taken into account when designing algorithms for comparing text and ranking documents.
Simply comparing the text of a query with the text of a document and looking for an exact match, as might be done in a database system, produces very poor results in terms of relevance.
To address the issue of relevance, retrieval models are used.
2. Evaluation
Two of the evaluation measures are precision and recall.
Precision is the proportion of retrieved documents that are relevant.
Recall is the proportion of relevant documents that are retrieved.
Precision = |Relevant documents ∩ Retrieved documents| / |Retrieved documents|
Recall = |Relevant documents ∩ Retrieved documents| / |Relevant documents|
When the recall measure is used, there is an assumption that all the relevant documents for a
given query are known. Such an assumption is clearly problematic in a web search
environment, but with smaller test collection of documents, this measure can be useful. It is
not suitable for large volumes of log data.
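As a small worked example (the document IDs here are hypothetical), precision and recall can be computed directly from the retrieved and relevant sets:

# Hypothetical relevance judgments and system output for one query.
relevant = {"d1", "d2", "d3", "d7"}     # documents judged relevant
retrieved = {"d1", "d3", "d5", "d6"}    # documents returned by the system

hits = relevant & retrieved             # relevant AND retrieved
precision = len(hits) / len(retrieved)  # 2 / 4 = 0.5
recall = len(hits) / len(relevant)      # 2 / 4 = 0.5
print(f"precision = {precision:.2f}, recall = {recall:.2f}")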
Main problems
Document and query indexing
o How to represent their contents?
Query evaluation
o To what extent does a document correspond to a query?
System evaluation
o How good is a system?
o Are the retrieved documents relevant? (precision)
o Are all the relevant documents retrieved? (recall)
Why is IR difficult?
Vocabulary mismatch
o The same concepts can be expressed in many different ways, with different words. This is referred to as the vocabulary mismatch problem in information retrieval.
o E.g., synonymy: car vs. automobile
Queries are ambiguous
Content representation may be inadequate and incomplete
The user is the ultimate judge, but we don’t know how the judge judges.
Challenges in IR
Scale, distribution of documents
Controversy over the unit of indexing
High heterogeneity
Retrieval strategies
4. THE USER'S TASK
The user of a retrieval system has to translate his information need into a query in the language provided by the system.
With an information retrieval system, this normally implies specifying a set of words which convey the
semantics of the information need.
a) Consider a user who seeks information on a topic of their interest. This user first translates their information need into a query, which requires specifying the words that compose the query. In this case, we say that the user is searching or querying for information of their interest.
b) Consider now a user who has an interest that is either poorly defined or inherently broad. For instance, the user has an interest in car racing and wants to browse documents on Formula 1 and Formula Indy. In this case, we say that the user is browsing or navigating the documents of the collection.
The general objective of an Information Retrieval System is to minimize the time it takes for a
user to locate the information they need.
The goal is to provide the information needed to satisfy the user's question. Satisfaction does not
necessarily mean finding all information on a particular issue.
The user of a retrieval system has to translate his information need into a query in the language
provided by the system.
With an information retrieval system, this normally implies specifying a set of words which
convey the semantics of the information need.
With a data retrieval system, a query expression (such as, for instance, a regular expression) is
used to convey the constraints that must be satisfied by objects in the answer set.
In both cases, we say that the user searches for useful information executing a retrieval task.
Consider now a user who has an interest which is either poorly defined or which is inherently
broad.
For instance, the user might be interested in documents about car racing in general.
In this situation, the user might use an interactive interface to simply look around in the
collection for documents related to car racing.
For instance, he might find interesting documents about Formula 1 racing, about car
manufacturers, or about the `24 Hours of Le Mans.'
We say that the user is browsing or navigating the documents in the collection, not
searching.
It is still a process of retrieving information, but one whose main objectives are less
clearly defined in the beginning.
Furthermore, while reading about the `24 Hours of Le Mans', he might turn his attention to a
document which provides directions to Le Mans and, from there, to documents which cover
tourism in France.
In this situation, we say that the user is browsing the documents in the collection, not searching.
It is still a process of retrieving information, but one whose main objectives are not clearly
defined in the beginning and whose purpose might change during the interaction with the
system.
The task in this case is more related to exploratory search and resembles a process of quasi-
sequential search for information of interest.
Here we, make a clear distinction between the different tasks the user of the retrieval system
might be engaged in.
The task might be then of two distinct types: searching and browsing, as illustrated in Figure:
Both cases describe a process of retrieving information whose main objectives are not clearly defined at the beginning and whose purpose might change during the interaction with the system.
The way information reaches the user can be of two kinds:
a) Pull: both retrieval (searching) and browsing are, in the language of the World Wide Web, 'pulling' actions, i.e., the user actively requests the information in an interactive manner.
b) Push: alternatively, the system can take the initiative and push information towards the user. For instance, information useful to a user could be extracted periodically from a news service.
In this case, we say that the IR system is executing a particular retrieval task which consists
of filtering relevant information for later inspection by the user.
5. INFORMATION VERSUS DATA RETRIEVAL
In information retrieval the text being searched is largely unstructured. This does not mean that there is no structure in the data: there is document structure (headings, paragraphs, lists, ...), explicit markup formatting (e.g., in HTML, XML, ...), and linguistic structure (latent, hidden).
Data retrieval, by contrast, operates on data with a well-defined structure and semantics, queried with expressions such as:
SELECT * FROM business_catalogue WHERE category = 'florist' AND city_zip = 'cb1'
In an IR system the retrieved objects might be inaccurate, and small errors are likely to go unnoticed.
In a data retrieval system, on the contrary, a single erroneous object among a thousand retrieved objects means total failure.
6. THE IR SYSTEM
The architecture of an IR system comprises the following components:
a) Text operation:
Text operations form index words (tokens), e.g., stop word removal and stemming.
b) Indexing:
Indexing constructs an inverted index of word to document pointers.
c) Searching:
Searching retrieves documents that contain a given query token from the inverted index.
d) Ranking:
Ranking scores all retrieved documents according to a relevance metric.
e) User Interface:
User Interface manages interaction with the user:
Query input and document output.
Relevance feedback.
Visualization of results.
f) Query Operations:
Query Operations transform the query to improve retrieval:
Query expansion using a thesaurus.
Query transformation using relevance feedback.
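A minimal sketch of thesaurus-based query expansion; the thesaurus entries below are invented for illustration, not taken from any real resource:

# Toy thesaurus mapping a term to related terms (illustrative only).
THESAURUS = {
    "car": ["automobile", "vehicle"],
    "cheap": ["inexpensive", "affordable"],
}

def expand_query(terms):
    """Return the original query terms plus related terms from the thesaurus."""
    expanded = list(terms)
    for t in terms:
        expanded.extend(THESAURUS.get(t, []))
    return expanded

print(expand_query(["cheap", "car", "insurance"]))
# ['cheap', 'car', 'insurance', 'inexpensive', 'affordable', 'automobile', 'vehicle']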
First of all, before the retrieval process can even be initiated, it is necessary to define the text database.
This is usually done by the manager of the database, who specifies the following:
(a) the documents to be used,
(b) the operations to be performed on the text, and
(c) the text model (i.e., the text structure and what elements can be retrieved).
The text operations transform the original documents and generate a logical view of them.
Once the logical view of the documents is defined, the database manager builds an index of the
text.
An index is a critical data structure because it allows fast searching over large volumes of data.
Different index structures might be used, but the most popular one is the inverted file.
The resources (time and storage space) spent on defining the text database and building the index
are amortized by querying the retrieval system many times.
Given that the document database is indexed, the retrieval process can be initiated.
The user first specifies a user need which is then parsed and transformed by the same text
operations applied to the text.
Then, query operations might be applied before the actual query, which provides a system
representation for the user need, is generated.
The query is then processed to obtain the retrieved documents. Fast query processing is made
possible by the index structure previously built.
Before being sent to the user, the retrieved documents are ranked according to a likelihood of
relevance. The user then examines the set of ranked documents in the search for useful
information.
At this point, he might pinpoint a subset of the documents seen as definitely of interest and
initiate a user feedback cycle.
In such a cycle, the system uses the documents selected by the user to change the query formulation. Hopefully, this modified query is a better representation of the real user need.
7. THE SOFTWARE ARCHITECTURE OF THE SYSTEM
Before conducting a search, a user has an information need, which underlies and drives the
search process.
This information need is sometimes referred as a topic, particularly when it is presented in
written form as part of a text collection for IR evaluation.
As a result of the information need, the user constructs and issues a query to the IR system. This
query consists of a smaller number of terms, with two or three terms being typical for a Web
search.
The primary data structure of most IR systems is the inverted index.
We can define an inverted index as a data structure that lists, for every word, all documents that contain it and the frequency of its occurrences in each document.
It makes it easy to search for 'hits' of a query word.
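A rough sketch of such a structure, using a toy two-document collection invented for illustration:

from collections import defaultdict

docs = {                                 # toy collection: doc_id -> text
    1: "information retrieval is the retrieval of information",
    2: "web search engines rely on information retrieval",
}

# inverted index: term -> {doc_id: frequency of the term in that document}
index = defaultdict(dict)
for doc_id, text in docs.items():
    for word in text.lower().split():
        index[word][doc_id] = index[word].get(doc_id, 0) + 1

print(index["retrieval"])   # {1: 2, 2: 1}
print(index["web"])         # {2: 1}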
Depending on the information need, a query term may be a date, a number, a musical note, or a
phrase. Wildcard operators and other partial- match operators may also be permitted in query
terms.
For example, the term "inform*" might match any word starting with that prefix ("inform", "informs", "informal", "informant", "informative", etc.).
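One simple way to support such a trailing-wildcard term is to keep the vocabulary sorted and return the contiguous range of terms that share the prefix. This is only a sketch with a tiny in-memory vocabulary (the words are those from the example above plus two fillers):

import bisect

vocabulary = sorted(["inform", "informal", "informant", "informative",
                     "informs", "index", "retrieval"])

def prefix_match(prefix):
    """Return all vocabulary terms starting with the given prefix."""
    lo = bisect.bisect_left(vocabulary, prefix)
    hi = bisect.bisect_left(vocabulary, prefix + "\uffff")  # just past the prefix range
    return vocabulary[lo:hi]

print(prefix_match("inform"))
# ['inform', 'informal', 'informant', 'informative', 'informs']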
Although users typically issue simple keyword queries, IR systems often support a richer query
syntax, frequently with complex Boolean and pattern matching operators.
These facilities may be used to limit a search to a particular Web site, to specify constraints on
fields such as author and title, or to apply other filters, restricting the search to a subset of the
collection.
A user interface mediates between the user and the IR system, simplifying the query creation
process when these richer query facilities are required.
The user’s query is processed by a search engine, which may be running on the user’s local
machine, on a large cluster of machines in a remote geographic location, or anywhere in
between.
A major task of a search engine is to maintain and manipulate an inverted index for a
document collection.
This index forms the principal data structure used by the engine for searching and relevance
ranking. As its basic function, an inverted index provides a mapping between terms and the
locations in the collection in which they occur.
To support relevance ranking algorithms, the search engine maintains collection statistics
associated with the index, such as the number of documents containing each term and the length
of each document.
In addition the search engine usually has access to the original content of the documents in order
to report meaningful results back to the user.
Using the inverted index, collection statistics, and other data, the search engine accepts queries
from its users, processes these queries, and returns ranked lists of results.
To perform relevance ranking, the search engine computes a score, sometimes called a retrieval
status value (RSV), for each document.
After sorting documents according to their scores, the result list must be subjected to further
processing, such as the removal of duplicate or redundant results.
For example, a Web search engine might report only one or two results from a single host or domain, eliminating the others in favor of pages from different sources.
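A sketch of this kind of host-based collapsing of a ranked result list; the URLs are invented, and real engines use far more elaborate duplicate detection:

from urllib.parse import urlparse

def collapse_by_host(ranked_urls, per_host=2):
    """Keep at most `per_host` results from any single host, preserving rank order."""
    kept, counts = [], {}
    for url in ranked_urls:
        host = urlparse(url).netloc
        if counts.get(host, 0) < per_host:
            kept.append(url)
            counts[host] = counts.get(host, 0) + 1
    return kept

results = ["http://a.com/1", "http://a.com/2", "http://a.com/3", "http://b.com/1"]
print(collapse_by_host(results))   # ['http://a.com/1', 'http://a.com/2', 'http://b.com/1']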
These divisions are quite broad, and each one is designed to serve one or more functions, such as the actual searching or matching of users' queries with the database.
To describe the retrieval process, we use a simple and generic software architecture as shown in
the below Figure:
Problem (user subsystem)
Related to users’ task, situation
o vary in specificity, clarity
Produces information need
o ultimate criterion for effectiveness of retrieval
how well was the need met?
Information need for the same problem may change, evolve, or shift during the IR process, requiring adjustment in searching
o often more than one search for same problem over time
Representation (user subsystem)
Converting a concept to query.
What we search for.
These are stemmed and corrected using dictionary.
Focus toward a good result
Subject to feedback changes
Query - search statement (user & system)
Translation into systems requirements & limits
o start of human-computer interaction
query is the thing that goes into the computer
Selection of files, resources
Search strategy - selection of:
o search terms & logic
o possible fields, delimiters
o controlled & uncontrolled vocabulary
o variations in effectiveness tactics
Reiterations from feedback
o several feedback types: relevance feedback, magnitude feedback..
o query expansion & modification
Matching - searching (Searching subsystem)
Process of matching, comparing
o search: what documents in the file match the query as stated?
Various search algorithms:
o exact match - Boolean
still available in most, if not all systems
o best match - ranking by relevance
increasingly used e.g. on the web
o hybrids incorporating both
e.g. Target, Rank in DIALOG
Each has strengths, weaknesses
o No ‘perfect’ method exists and probably never will
Retrieved documents -from system to user (IR Subsystem)
Various order of output:
o Last In First Out (LIFO); sorted
o ranked by relevance
o ranked by other characteristics
Various forms of output
When citations only: possible links to document delivery
Base for relevance, utility evaluation by users
Relevance feedback
Document Retrieval
This is usually done by the manager of the database, who specifies the following:
the documents to be used,
the operations to be performed on the text, and
the text model (i.e., the text structure and what elements can be retrieved).
The text operations transform the original documents and generate a logical view of them.
Once the logical view of the documents is defined, the database manager (using the DB
Manager Module) builds an index of the text.
An index is a critical data structure because it allows fast searching over large volumes of data.
Different index structures might be used, but the most popular one is the inverted index as
indicated in Figure.
The resources (time and storage space) spent on defining the text database and building the
index are amortized by querying the retrieval system many times. Given that the document
database is indexed, the retrieval process can be initiated.
The user specifies a need, which is then parsed and transformed by the same text operations applied to the text. Then, query operations might be applied before the actual query, which provides a system representation for the user need, is generated.
The query is then processed to obtain the retrieved documents. Fast query processing is made
possible by the index structure previously built.
Before being sent to the user, the retrieved documents are ranked according to a likelihood of
relevance. The user then examines the set of ranked documents in the search for useful
information.
At this point, he might pinpoint a subset of the documents seen as definitely of interest and
initiate a user feedback cycle.
In such a cycle, the system uses the documents selected by the user to change the query
formulation. Hopefully, this modified query is a better representation of the real user need.
9. THE WEB
Tim Berners-Lee conceived the conceptual Web in 1989, tested it successfully in December
of 1990, and released the first Web server early in 1991.
It was called the World Wide Web and is now referred to simply as the Web. At that time, no one could have imagined the impact that the Web would have.
The Web boom, characterized by exponential growth in the volume of data and information, implies that various daily tasks such as e-commerce, banking, research, entertainment, and personal communication can no longer be done outside the Web if convenience and low cost are to be granted.
The amount of textual data available on the Web is estimated in the order of petabytes.
In addition, other media, such as images, audio, and video, are also available in even greater
volumes.
Thus, the Web can be seen as a very large, public and unstructured but ubiquitous data
repository, which triggers the need for efficient tools to manage, retrieve, and filter information
from the Web.
As a result, Web search engines have become one of the most used tools in the Internet.
Additionally, information finding is also becoming more important in large Intranets, in which one
might need to extract or infer new information to support a decision process, a task called data mining
(or Web mining for the particular case of the Web).
The very large volume of data available, combined with the fast pace of change, make the
retrieval of relevant information from the Web a really hard task.
To cope with the fast pace of change, efficient crawling of the Web has become essential.
In spite of recent progress in image and non-textual data search in general, the existing
techniques do not scale up well on the Web.
To search for information on the Web, there are basically two main approaches:
Issue a word-based query to a search engine that indexes a portion of the Web documents, or
Browse the Web, which can be seen as a sequential search process of following hyperlinks, as embodied, for example, in Web directories that classify selected Web documents by subject.
Additional methods exist, such as taking advantage of the hyperlink structure of the Web, yet
they are not fully available, and likely less well known and also much more complex.
A Challenging Problem
Let us now consider the main challenges posed by the Web with respect to search. We can
divide them in two classes: those that relate to the data itself, which we refer to as data-centric, and
those that relate to the users and their interaction with the data, which we refer as interaction-centric.
Data-centric challenges
a) Distributed data.
• Due to the intrinsic nature of the Web, data spans over a large number of computers and
platforms. These computers are interconnected with no predefined topology and the
available bandwidth and reliability on the network interconnections vary widely.
b) High percentage of volatile data.
• Due to Internet dynamics, new computers and data can be added or removed easily. To
illustrate, early estimations showed that 50% of the Web changes in a few months. Search
engines are also confronted with dangling (or broken) links and relocation problems when
domain or file names change or disappear.
c) Large volume of data.
• The fast growth of the Web poses scaling issues that are difficult to cope with, as well as
dynamic Web pages, which are in practice unbounded.
d) Unstructured and redundant data.
• The Web is not a huge distributed hypertext system, as some might think, because it does
not follow a strict underlying conceptual model that would guarantee consistency. Indeed,
the Web is not well structured either at the global or at the individual HTML page level.
HTML pages are considered by some as semi-structured data in the best case. Moreover, a
great deal of Web data are duplicated either loosely (such as the case of news originating
from the same news wire) or strictly, via mirrors or copies. Approximately 30% of Web
pages are (near) duplicates. Semantic redundancy is probably much larger.
e) Quality of data
The Web can be considered a new publishing medium. However, there is, in most cases, no editorial process. So, data can be inaccurate, plain wrong, obsolete, invalid, poorly written or, as is often the case, full of errors, either innocent (typos, grammatical mistakes, OCR errors, etc.) or malicious. Typos and mistakes, especially in foreign names, are pretty common.
f) Heterogeneous data
Data not only originates from various media types, each coming in different formats, but it is
also expressed in a variety of languages, with various alphabets and scripts (e.g. India), which
can be pretty large (e.g. Chinese or Japanese Kanji).
Many of these challenges, such as the variety of data types and poor data quality, cannot be
solved by devising better algorithms and software, and will remain a reality simply because they are
problems and issues (consider, for instance, language diversity) that are intrinsic to human nature.
Interaction-centric challenges
1) Expressing a query
Human beings have needs or tasks to accomplish, which are frequently not easy to express as
“queries”.
Queries, even when expressed in a more natural manner, are just a reflection of
information needs and are thus, by definition, imperfect. This phenomenon could be compared
to Plato’s cave metaphor, where shadows are mistaken for reality.
2) Interpreting results
Even if the user is able to perfectly express a query, the answer might be split over thousands
or millions of Web pages or not exist at all. In this context, numerous questions need to be
addressed.
In the current state of the Web, search engines need to deal with plain HTML and text, as well
as with other data types, such as multimedia objects, XML data and associated semantic
information, which can be dynamically generated and are inherently more complex.
In a hypothetical world where all Web data were well structured and semantically tagged, IR would become easier, and even multimedia search would be simplified.
Spam would be much easier to avoid as well, as it would be easier to recognize good content.
On the other hand, new retrieval problems would appear, such as XML processing and
retrieval, and Web mining on structured data, both at a very large scale.
IR Versus Web Search
Traditional IR systems normally index a closed collection of documents, which are mainly
text-based and usually offer little linkage between documents.
Traditional IR systems are often referred to as full-text retrieval systems. Libraries were among
the first to adopt IR to index their catalogs and later, to search through information which was
typically imprinted onto CD-ROMs.
The main aim of traditional IR was to return relevant documents that satisfy the user’s
information need.
Although the main goal of satisfying the user’s need is still the central issue in web IR (or
web search), there are some very specific challenges that web search poses that have required
new and innovative solutions.
The first important difference is the scale of web search: the current size of the web is approximately 600 billion pages.
This is well beyond the size of traditional document collections.
The Web is dynamic in a way that was unimaginable to traditional IR, in terms of its rate of change and the different types of web pages, ranging from static types (HTML, portable document format (PDF), DOC, Postscript, XLS) to a growing number of dynamic pages written in scripting languages such as JSP, PHP or Flash. A large number of images, videos, and a growing number of programs are also delivered through the Web to our browsers.
The Web also contains an enormous amount of duplication, estimated at about 30%. Such
redundancy is not present in traditional corpora and makes the search engine’s task even more
difficult.
The quality of web pages vary dramatically; for example, some web sites create web pages
with the sole intention of manipulating the search engine’s ranking, documents may contain
misleading information, the information on some pages is just out of date, and the overall
quality of a web page may be poor in terms of its use of language and the amount of useful
information it contains. The issue of quality is of prime importance to web search engines as
they would very quickly lose their audience if, in the top- ranked positions, they presented to
users poor quality pages.
The range of topics covered on the Web is completely open, as opposed to the closed
collections indexed by traditional IR systems, where the topics such as in library catalogues,
are much better defined and constrained.
Another aspect of the Web is that it is globally distributed. This poses serious logistic
problems to search engines in building their indexes, and moreover, in delivering a service that
is being used from all over the globe. The sheer size of the problem is daunting, considering
that users will not tolerate anything but
an immediate response to their query. Users also vary in their level of expertise, interests,
information- seeking tasks, the language(s) they understand, and in many other ways.
Users also tend to submit short queries (between two to three keywords), avoid the use of
anything but the basic search engine syntax, and when the results list is returned, most users do
not look at more than the top 10 results, and are unlikely to modify their query. This is all
contrary to typical usage of traditional IR.
The hypertextual nature of the Web is also different from traditional document collections, in
giving users the ability to surf by following links.
On the positive side (for the Web), there are many roads (or paths of links) that “lead to
Rome” and you need only find one of them, but often, users lose their way in the myriad of
choices they have to make.
Another positive aspect of the Web is that it has provided and is providing impetus for the
development of many new tools, whose aim is to improve the user’s experience.
Feature             Classical IR            Web IR
Volume              Large                   Huge
Data quality        Clean, no duplicates    Noisy, duplicates available
Data change rate    Infrequent              In flux
Data accessibility  Accessible              Partially accessible
Format diversity    Homogeneous             Widely diverse
Documents           Text                    HTML
No. of matches      Small                   Large
IR techniques       Content-based           Link-based
COMPARISON (Web Search vs. traditional IR)
2. File types. Web search: several file types, some hard to index because of a lack of textual information. IR: usually all indexed documents have the same format (e.g. PDF) or only bibliographic information is provided.
3. Document length. Web search: wide range from very short to very long; longer documents are often divided into parts. IR: document length varies, but not to such a high degree as with Web documents.
There are two major processes in a search engine:
1. Indexing process
2. Query process

1. Indexing process
The indexing process builds the structures that enable searching, and the query process uses those structures and a person's query to produce a ranked list of documents. Figure 2.1 shows the high-level "building blocks" of the indexing process.
These major components are
a) Text acquisition
b) Text transformation
c) Index creation
a) Text acquisition
The task of the text acquisition component is to identify and make available the documents that
will be searched.
Although in some cases this will involve simply using an existing collection, text acquisition
will more often require building a collection by crawling or scanning the Web, a corporate
intranet, a desktop, or other sources of information.
In addition to passing documents to the next component in the indexing process, the text
acquisition component creates a document data store, which contains the text and metadata for all
the documents.
Metadata is information about a document that is not part of the text content, such as the
document type (e.g., email or web page), document structure, and other features, such as
document length.
b) Text transformation
The text transformation component transforms documents into index terms or features.
Index terms, as the name implies, are the parts of a document that are stored in the index and
used in searching. The simplest index term is a word, but not every word may be used for searching.
A “feature” is more often used in the field of machine learning to refer to a part of a text
document that is used to represent its content, which also describes an index term. Examples of other
types of index terms or features are phrases, names of people, dates, and links in a web page. Index
terms are sometimes simply referred to as “terms.” The set of all the terms that are indexed for a
document collection is called the index vocabulary.
c) Index creation
The index creation component takes the output of the text transformation component and
creates the indexes or data structures that enable fast searching. Given the large number of
documents in many search applications, index creation must be efficient, both in terms of time and
space. Indexes must also be able to be efficiently updated when new documents are acquired.
Inverted indexes, or sometimes inverted files, are by far the most common form of index used
by search engines. An inverted index, very simply, contains a list for every index term of the
documents that contain that index term. It is inverted in the sense of being the opposite of a
document file that lists, for every document, the index terms they contain. There are many variations
of inverted indexes, and the particular form of index used is one of the most important aspects of a
search engine.
2. Query process
Figure 2.2 shows the building blocks of the query process.
a) User interaction
The user interaction component provides the interface between the person doing the
searching and the search engine. One task for this component is accepting the user’s query
and transforming it into index terms.
Another task is to take the ranked list of documents from the search engine and organize it
into the results shown to the user.
This includes, for example, generating the snippets used to summarize documents.
The document data store is one of the sources of information used in generating the results.
Finally, this component also provides a range of techniques for refining the query so that it
better represents the information need.
b) Ranking
The ranking component is the core of the search engine. It takes the transformed query
from the user interaction component and generates a ranked list of documents using scores based
on a retrieval model. Ranking must be both efficient, since many queries may need to be
processed in a short time, and effective, since the quality of the ranking determines whether the
search engine accomplishes the goal of finding relevant information. The efficiency of ranking
depends on the indexes, and the effectiveness depends on the retrieval model.
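A minimal sketch of a ranking component. The scoring function is a plain TF-IDF sum used here as a stand-in retrieval model; it is not the formula of any particular engine, and the three documents are invented:

import math
from collections import Counter

docs = {
    "d1": "information retrieval systems",
    "d2": "web search engines and information retrieval",
    "d3": "database systems",
}

N = len(docs)
tokenized = {d: Counter(text.split()) for d, text in docs.items()}
df = Counter()                              # document frequency of each term
for counts in tokenized.values():
    df.update(counts.keys())

def score(query, doc_id):
    """TF-IDF score of one document for a whitespace-tokenized query."""
    counts = tokenized[doc_id]
    return sum(counts[t] * math.log(N / df[t]) for t in query.split() if t in counts)

query = "information retrieval"
ranked = sorted(docs, key=lambda d: score(query, d), reverse=True)
print(ranked)   # ['d1', 'd2', 'd3'] -- documents containing the query terms rank first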
c) Evaluation
The task of the evaluation component is to measure and monitor effectiveness and
efficiency. An important part of that is to record and analyze user behavior using log data. The
results of evaluation are used to tune and improve the ranking component. Most of the evaluation
component is not part of the online search engine, apart from logging user and system data.
Evaluation is primarily an offline activity, but it is a critical part of any search application.
1.10 THE e-PUBLISHING ERA
Since its inception, the Web has become a huge success. Well over 20 billion pages are now available and accessible on the Web, and more than one fourth of humanity now accesses the Web on a regular basis.
Why is the Web such a success?
What is the single most important characteristic of the Web that makes it so revolutionary?
In search for an answer, let us dwell on the life of a writer who lived at the end of the 18th century: Jane Austen, who had completed the manuscript of what became Pride and Prejudice by the late 1790s.
The novel was only published some 15 years later, in 1813. She got a flat fee of £110, which meant that she was not paid anything for the many subsequent editions. Further, her authorship was anonymized under the reference "By a Lady".
• Pride and Prejudice is the second or third best-loved novel in the UK ever, after The Lord of the Rings and Harry Potter. It has been the subject of six TV series and five film versions. The last of these, starring Keira Knightley and Matthew Macfadyen, grossed over 100 million dollars.
• Jane Austen published anonymously her entire life. Throughout the 20th century, her novels have never been out of print. Jane Austen was discriminated against because there was no freedom to publish at the beginning of the 19th century.
• The Web, unleashed by the inventiveness of Tim Berners-Lee, changed this once and for all. It did so by universalizing the freedom to publish. The Web moved mankind into a new era, into a new time, into the e-Publishing Era.
The term "electronic publishing" is primarily used in the 2010s to refer to online and web- based
publishers, the term has a history of being used to describe the development of new forms of
production, distribution, and user interaction in regard to computer-based production of text and
other interactivemedia.
The first digitization projects were transferring physical content into digital content. Electronic
publishing is aiming to integrate the whole process of editing and publishing (production, layout,
publication) in the digital world.
Traditional publishing, and especially the creation part, was first revolutionized by new desktop publishing software appearing in the 1980s, and by the text databases created for encyclopedias and directories.
At the same time, multimedia was developing quickly, combining book, audiovisual and computer science characteristics. CDs and DVDs appeared, permitting the visualization of these dictionaries and encyclopedias on computers.
The arrival and democratization of the Internet is slowly giving small publishing houses the opportunity to publish their books directly online.
Some websites, like Amazon, let their users buy eBooks; Internet users can also find many
educative platforms (free or not), encyclopedic websites like Wikipedia, and even digital magazines
platforms.
The eBook then becomes more and more accessible through many different devices, such as e-readers and even smartphones.
The digital book had, and still has, an important impact on publishing houses and their economic models; it is still a moving domain, and they have yet to master the new ways of publishing in a digital era.
1.11 HOW THE WEB CHANGED SEARCH
Web search is today the most prominent application of IR and its techniques: the ranking and indexing components of any search engine are fundamentally IR pieces of technology.
The first major impact of the Web on search is related to the characteristics of the document
collection itself
• The Web is composed of pages distributed over millions of sites and connected
through hyperlinks
• This requires collecting all documents and storing copies of them in a central
repository, prior to indexing
• This new phase in the IR process, introduced by the Web, is called crawling
The fourth major impact derives from the fact that the Web is also a medium to do business
• Search problem has been extended beyond the seeking of text information to also
encompass other user needs
• Ex: the price of a book, the phone number of a hotel, the link for downloading a
software
The fifth major impact of the Web on search is Web spam
1.12 PRACTICAL ISSUES ON THE WEB
• Log-in issues
One of the most common problems faced by online businesses is the inability to log in to the control panel. You need easy access to the control panel for additions and deletions of content and for other purposes.
A few hosting companies follow the undesirable business practice of not disclosing their limit in
terms of space and bandwidth. They try to serve more customers with their limited resources which
can result in major performance issues in the long term.
1.13 HOW PEOPLE SEARCH
Exploratory search is divided into learning and investigating tasks. Learning searches:
1. require more than single query-response pairs
2. require the searcher to spend time
• scanning and reading multiple information items
• synthesizing content to form new understanding
Navigation vs. Search
Navigation: the searcher looks at an information structure and browses among the available
information
This browsing strategy is preferable when the information structure is well-matched to the user’s
information need
o it is mentally less taxing to recognize a piece of information than it is to recall it
o it works well only so long as appropriate links are available
If the links are not available, then the browsing experience might be frustrating
Search Process
Numerous studies have been made of people engaged in the search process
The results of these studies can help guide the design of search interfaces
One common observation is that users often reformulate their queries with slight
modifications
Another is that searchers often search for information that they have previously accessed. Users' search strategies differ when searching over previously seen materials, and researchers have developed search interfaces that support both query history and revisitation.
Studies also show that it is difficult for people to determine whether or not a document is relevant to a topic. Other studies found that searchers tend to look at only the top-ranked retrieved results and, further, that they are biased towards thinking the top one or two results are better than those beneath them.
Studies also show that people are poor at estimating how much of the relevant material they have found. Other studies have assessed the effects of knowledge of the search process itself and have observed that experts use different strategies than novice searchers.
1.14 SEARCH INTERFACES TODAY
Query Specification
Short queries reflect the standard usage scenario in which the user tests the waters:
• If the results do not look relevant, then the user reformulates their query
• If the results are promising, then the user navigates to the most relevant-looking web site
Query Specification Interface
The standard interface for a textual query is a search box entry form.
Studies suggest a relationship between query length and the width of the entry form: wider entry forms encourage longer queries.
In some interfaces, such as yelp.com, the user can refine the search by location using a second form.
Notice that the yelp.com form also shows the user's home location, if it has been specified previously.
For instance, in zvents.com search, the first box is labeled "what are you looking for?".
Some interfaces show a list of query suggestions as the user types the query - this is referred to as
auto-complete, auto-suggest, or dynamic query suggestions
Dynamic query suggestions, from Netflix.com
Dynamic query suggestions, grouped by type, from NextBio.com:
QUERY REFORMULATION
There are tools to help users reformulate their query
• One technique consists of showing terms related to the query or to the documents retrieved
in response to the query
A special case of this is spelling corrections or suggestions
• Usually only one suggested alternative is shown: clicking on that alternative re-executes the
query
• In earlier years, the search results were shown using the purportedly incorrect spelling
Relevance feedback is another method whose goal is to aid in query reformulation
The main idea is to have the user indicate which documents are relevant to their query
• In some variations, users also indicate which terms extracted from those documents are
relevant
The system then computes a new query from this information and shows a new
retrieval set.
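One classical way to compute the new query is Rocchio's formulation: move the query's term weights toward the documents marked relevant and away from those marked non-relevant. The sketch below uses plain term-weight dictionaries; the values of alpha, beta and gamma are conventional defaults, not values prescribed by this text:

def rocchio(query, relevant_docs, nonrelevant_docs, alpha=1.0, beta=0.75, gamma=0.15):
    """Return a modified query vector (term -> weight) from user feedback."""
    new_query = {t: alpha * w for t, w in query.items()}

    def add(doc_vectors, factor):
        if not doc_vectors:
            return
        for doc in doc_vectors:
            for term, w in doc.items():
                new_query[term] = new_query.get(term, 0.0) + factor * w / len(doc_vectors)

    add(relevant_docs, beta)        # pull the query toward relevant documents
    add(nonrelevant_docs, -gamma)   # push it away from non-relevant ones
    return {t: w for t, w in new_query.items() if w > 0}

q = {"jaguar": 1.0}
rel = [{"jaguar": 1.0, "car": 2.0}]
nonrel = [{"jaguar": 1.0, "animal": 2.0}]
print(rocchio(q, rel, nonrel))
# approximately {'jaguar': 1.6, 'car': 1.5} -- 'animal' is suppressed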
COMPONENTS OF SEARCH ENGINE
1) Crawler:
A web crawler is a software program that traverses web pages, downloads them for indexing, and follows the hyperlinks that are referenced on the downloaded pages; a web crawler is also known as a spider, a wanderer, or a software robot (a minimal crawler sketch is given after this list of components).
2) Indexer:
The second component is the indexer, which is responsible for creating the search index from the web pages it receives from the crawler.
3) Search Index:
The search index is a data repository containing all the information the search engine needs to
match and retrieve web pages. The type of data structure used to organize the index is known as an
inverted file.
4) Query Engine:
The query engine is the algorithmic heart of the search engine. The inner workings of a commercial query engine are a well-guarded secret, since search engines are rightly paranoid, fearing web sites that wish to increase their ranking by unfairly taking advantage of the algorithms the search engine uses to rank result pages.
5) Search Interface:
Once the query is processed, the query engine sends the results list to the search interface, which
displays the results on the user’s screen. The user interface provides the look and feel of the search
engine, allowing the user to submit queries, browse the results list, and click on chosen web pages
for further browsing.
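Returning to the crawler component (1), the sketch below shows the core loop of a breadth-first crawler: take a URL from the frontier, download the page, extract its links, and add unseen links back to the frontier. It uses only the Python standard library; the seed URL is a placeholder, and politeness rules (robots.txt, crawl delays) that any real crawler needs are omitted:

import urllib.request
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin

class LinkExtractor(HTMLParser):
    """Collect the href targets of all anchor tags on a page."""
    def __init__(self):
        super().__init__()
        self.links = []
    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

def crawl(seed, max_pages=10):
    frontier = deque([seed])          # URLs waiting to be fetched
    visited = set()
    pages = {}                        # url -> raw HTML, to be handed to the indexer
    while frontier and len(pages) < max_pages:
        url = frontier.popleft()
        if url in visited:
            continue
        visited.add(url)
        try:
            html = urllib.request.urlopen(url, timeout=5).read().decode("utf-8", "ignore")
        except Exception:
            continue                  # skip pages that cannot be fetched
        pages[url] = html
        parser = LinkExtractor()
        parser.feed(html)
        for link in parser.links:
            frontier.append(urljoin(url, link))   # resolve relative links
    return pages

# Example (placeholder seed URL):
# pages = crawl("https://example.com/", max_pages=5)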
1.15 VISUALIZATION IN SEARCH INTERFACES
Experimentation with visualization for search has been primarily applied in the following ways:
Visualizing Boolean syntax
Visualizing query terms within retrieval results
Visualizing relationships among words and documents
Visualization for text mining
Another idea is to map documents or words from a very high-dimensional term space down into a two-dimensional plane, so that the documents or words fall within that plane, using 2D or 3D visualizations.
Visualization for Text Mining
Visualization is also used for purposes of analysis and exploration of textual data
Visualizations such as the Word Tree show a piece of a text concordance
It allows the user to view which words and phrases commonly precede or follow a
given word
The Word Tree visualization of Martin Luther King's "I Have a Dream" speech, from Wattenberg et al.
***********************
UNIT II: MODELING AND RETRIEVAL
EVALUATION
Syllabus:
1. BASIC IR MODELS
1.1 Introduction to Modeling
What is modeling?
Modeling in IR is a complex process aimed at producing a ranking function.
Ranking function: a function that assigns scores to documents with regard to a given
query.
How are IR models distinguished from one another?
IR models can be distinguished by the way they represent the documents and query statements,
a) how the system matches the query with the documents in the corpus to find the related ones, and
b) how the system ranks these documents.
An IR model defines the following aspects of retrieval procedure of a search
engine:
a) How the documents in the collection and user’s queries are transformed?
b) How system identifies the relevancy of the documents based on the query
word/phrase given by the user?
c) How system ranks the retrieved documents based on the relevancy?
Why IR Models?
Mathematically, models are used in many scientific areas having objective to
understand some phenomenon in the real world.
A model of information retrieval predicts and explains what a user will find in relevance
to the given query.
An IR model is a quadruple [D, Q, F, R(qi, dj)]
where
a) D is a set of logical views (representations) of the documents in the collection,
b) Q is a set of logical views (representations) of the user queries,
c) F is a framework for modeling documents, queries, and their relationships, and
d) R(qi, dj) is a ranking function that associates a real-valued score with a query qi and a document dj.
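As a sketch of how the quadruple can be read as a programming interface (all names are illustrative): D and Q are whatever representations the model chooses for documents and queries, F is the framework embodied by a concrete subclass, and R is the ranking function it must supply.

from abc import ABC, abstractmethod

class IRModel(ABC):
    """Abstract view of the quadruple [D, Q, F, R]: a concrete model fixes how
    documents and queries are represented and supplies the ranking function R."""

    @abstractmethod
    def represent_document(self, text: str):
        ...                                   # logical view d_j of a document

    @abstractmethod
    def represent_query(self, text: str):
        ...                                   # logical view q_i of a query

    @abstractmethod
    def rank(self, query_repr, doc_repr) -> float:
        ...                                   # R(q_i, d_j): a real-valued score

class TermOverlapModel(IRModel):
    """A trivial concrete model: both views are term sets, R is the overlap size."""
    def represent_document(self, text):
        return set(text.lower().split())
    def represent_query(self, text):
        return set(text.lower().split())
    def rank(self, query_repr, doc_repr):
        return float(len(query_repr & doc_repr))

m = TermOverlapModel()
d = m.represent_document("Sachin scores hundred")
q = m.represent_query("sachin hundred")
print(m.rank(q, d))    # 2.0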
a) Classical IR Model
It is the simplest IR model and is easy to implement.
This model is based on mathematical knowledge that is easily recognized and understood.
A typical Classical model has the following functionalities when it is a part of an IRS system:
In a classical model:
Each document is described by a set of representative keywords called index terms.
Numerical weights are assigned to index terms to capture how relevant each term is to the document.
Thus each classical model is built on a basic mathematical concept: set theory (the Boolean model), algebra (the vector space model), or probability theory (the probabilistic model).
b) Non-Classical IR Model
Non-classical IR models are based on principles other than similarity, probability, or Boolean operations, on which the classical retrieval models are based.
The information logic model, the situation theory model, and interaction models are examples of non-classical IR models.
c) Alternative IR Model
Alternative models are enhancements of the classical IR models that make use of specific techniques from other fields.
Types of Alternative IR Models are:
a) Cluster model,
b) fuzzy model and
c) latent semantic indexing (LSI) models.
2. BOOLEAN MODEL
Boolean Model is the oldest information retrieval (IR) model.
This is the simplest retrieval model, which retrieves information on the basis of a query given as a Boolean expression.
Boolean queries are queries that use the AND, OR and NOT Boolean operations to join the query terms.
The Boolean retrieval model is a model for information retrieval in which we can pose
any query which is in the form of a Boolean expression of terms, that is, in which
terms are combined with the operators AND, OR, and NOT.
Example:
• Document collection:
• d1 = "Sachin scores hundred."
• d2 = "Dravid is the most technical batsman of the era."
• d3 = "Sachin, Dravid duo is the best to watch."
• d4 = "India wins courtesy to Dravid, Sachin partnership"
• Sachin → {d1, d3, d4}
• Dravid → {d2, d3, d4}
• batsman → {d2}
• watch → {d3}
• India → {d4}
• partnership → {d4}
• win → {d4}
Result set (for the query "Sachin AND Dravid"):
{d1, d3, d4} ∩ {d2, d3, d4} = {d3, d4}
In Boolean model, the IR system retrieves the documents based on the occurrence of
query key words in the document.
It doesn’t provide any ranking of documents based on the relevancy.
The model is based on set theory and the Boolean algebra, where documents are sets of
terms and queries are Boolean expressions on terms.
If we talk about relevance feedback, then in the Boolean IR model relevance
prediction can be defined as follows −
o R − A document is predicted as relevant to the query expression if and only if it
satisfies the query expression, for example:
((text ∨ information) ∧ retrieval ∧ ¬ theory)
We can explain this model by a query term as an unambiguous definition of a set of
documents.
For example, the query term “economic” defines the set of documents that are indexed
with the term “economic”.
Now,
What would be the result after combining terms with Boolean AND Operator?
It will define a document set that is smaller than or equal to the document sets of any of
the single terms.
For example,
o the query with terms “social” and “economic” will produce the set of documents
that are indexed with both terms.
In other words, the resulting document set is the intersection of both sets.
Now,
What would be the result after combining terms with Boolean OR operator?
It will define a document set that is bigger than or equal to the document sets of any of
the single terms.
For example,
o the query with terms “social” or “economic” will produce the set of documents
that are indexed with either the term “social” or the term “economic”.
In other words, the resulting document set is the union of both sets.
BIR Example
One way to avoid linearly scanning the texts for each query is to index the
documents in advance.
Let us stick with Shakespeare’s Collected Works, and use it to introduce the basics of
the Boolean retrieval model.
Suppose we record for each document – here a play of Shakespeare’s –
o whether it contains each word out of all the words Shakespeare used
(Shakespeare used about 32,000 different words).
The result is a binary term-document incidence matrix, as in the figure below.
Terms are the indexed units; they are usually words, and for the moment you can think
of them as words.
Figure : A term-document incidence matrix. Matrix element (t, d) is 1 if the play in
column d contains the word in row t, and is 0 otherwise.
To answer the query Brutus AND Caesar AND NOT Calpurnia, we take the vectors for Brutus,
Caesar and Calpurnia, complement the last, and then do a bitwise AND:
110100 AND 110111 AND 101111 = 100100
Answer:
The answers for this query are thus Antony and Cleopatra and Hamlet.
Let us now consider a more realistic scenario, simultaneously using the opportunity to
introduce some terminology and notation.
Suppose we have N = 1 million documents.
By documents we mean whatever units we have decided to build a retrieval system over.
They might be individual memos or chapters of a book.
We will refer to the group of documents over which we perform retrieval as the
COLLECTION.
It is sometimes also referred to as a Corpus.
If each document is about 1,000 words long and we assume an average of 6 bytes per word
including spaces and punctuation, then this is a document collection about 6 GB in size.
Typically, there might be about M = 500,000 distinct terms in these documents.
There is nothing special about the numbers we have chosen, and they might vary by an
order of magnitude or more, but they give us some idea of the dimensions of the kinds of
problems we need to handle.
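The bitwise computation above can be reproduced in a few lines of Python. This is only an illustrative sketch: the ordering of the six play columns is assumed to follow the standard figure (Antony and Cleopatra, Julius Caesar, The Tempest, Hamlet, Othello, Macbeth), and the incidence bits are taken from the worked example.

# Illustrative sketch: evaluating "Brutus AND Caesar AND NOT Calpurnia"
# on binary incidence vectors (column order of the plays is assumed here).
plays = ["Antony and Cleopatra", "Julius Caesar", "The Tempest",
         "Hamlet", "Othello", "Macbeth"]

# Incidence rows copied from the worked example (1 = term occurs in the play).
brutus    = [1, 1, 0, 1, 0, 0]   # 110100
caesar    = [1, 1, 0, 1, 1, 1]   # 110111
calpurnia = [0, 1, 0, 0, 0, 0]   # complement is 101111

# Bitwise AND with the complement of the Calpurnia vector.
answer = [b & c & (1 - p) for b, c, p in zip(brutus, caesar, calpurnia)]

print([play for play, bit in zip(plays, answer) if bit])
# ['Antony and Cleopatra', 'Hamlet']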
Advantages of the Boolean Model
The advantages of the Boolean model are as follows −
a) The simplest model, which is based on sets.
b) Easy to understand and implement.
c) It only retrieves exact matches
d) It gives the user a sense of control over the system.
**********************
3. TF-IDF (Term Frequency/Inverse Document Frequency)
3.1 Term Frequency (tfij)
It may be defined as the number of occurrences of wi in dj.
Term frequency captures how salient a word is within the given document; in other words,
the higher the term frequency, the better that word describes the content of the document.
Assign to each term in a document a weight for that term, that depends on the number of
occurrences of the term in the document.
We would like to compute a score between a query term t and a document d, based on
the weight of t in d.
The simplest approach is to assign the weight to be equal to the number of occurrences
of term t in document d.
This weighting scheme is referred to as term frequency and is denoted tft,d with the
subscripts denoting the term and the document in order.
For instance, a collection of documents on the auto industry is likely to have the term
auto in almost every document.
We therefore need a mechanism for attenuating the effect of terms that occur too often in the
collection to be meaningful for relevance determination.
An immediate idea is to scale down the term weights of terms with high collection
frequency, defined to be the total number of occurrences of a term in the collection.
The idea would be to reduce the tf weight of a term by a factor that grows with its
collection frequency.
Mathematically, the inverse document frequency of a term t is defined as
idf_t = log(N / n_t)
Here,
N = number of documents in the collection, n_t = number of documents containing term t
The tf-idf weighting scheme assigns to term t a weight in document d given by
tf-idf_{t,d} = tf_{t,d} × idf_t
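The weighting can be sketched in a few lines of Python. The toy documents below are invented for illustration; base-10 logarithms are one common (assumed) choice for the idf.

# Minimal tf-idf sketch (illustrative; the toy documents are invented here).
import math

docs = ["gold silver truck", "shipment of gold", "delivery of silver"]
tokenized = [d.split() for d in docs]
N = len(tokenized)                                   # documents in the collection

def tf(term, doc_tokens):
    return doc_tokens.count(term)                    # raw term frequency tf_{t,d}

def idf(term):
    n_t = sum(1 for d in tokenized if term in d)     # documents containing t
    return math.log10(N / n_t) if n_t else 0.0

def tf_idf(term, doc_tokens):
    return tf(term, doc_tokens) * idf(term)          # tf-idf_{t,d} = tf x idf

print(round(tf_idf("silver", tokenized[2]), 3))      # weight of "silver" in the third document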
********************
4. VECTOR MODEL
The vector space model represents the documents and queries as vectors in a
multidimensional space, whose dimensions are the terms used to build an index to
represent the documents [Salton 1983].
The creation of an index involves lexical scanning to identify the significant terms,
where morphological analysis reduces different word forms to common "stems", and the
occurrence of those stems is computed.
Query and document surrogates are compared by comparing their vectors, using, for
example, the cosine similarity measure.
In this model, the terms of a query surrogate can be weighted to take into account their
importance, and they are computed by using the statistical distributions of the terms in
the collection and in the documents [Salton 1983].
The vector space model can assign a high ranking score to a document that contains only
a few of the query terms if these terms occur infrequently in the collection but frequently
in the document.
1) The more similar a document vector is to a query vector, the more likely it is that the
document is relevant to that query.
2) The words used to define the dimensions of the space are assumed to be orthogonal or independent.
While this is a reasonable first approximation, the assumption that words are pairwise independent is
not realistic.
Vector Model example
In VSM, each document d is viewed as a vector of tf-idf values, one component for each
term.
So we have a vector space where
a. terms are axes; and
b. documents live in this space.
Representation of document and query features as vectors
Ranking algorithm compute similarity between document and query vectors to yield a
retrieval score to each document.
The Postulate is: Documents related to the same information are close together in the
vector space.
Assign non-binary weights to index terms in queries and in documents. Compute the
similarity between documents and query.
More precise than Boolean model.
Due to the disadvantages of the Boolean model, Gerard Salton and his colleagues
suggested a model, which is based on Luhn’s similarity criterion.
The similarity criterion formulated by Luhn states, “the more two representations agreed
in given elements and their distribution, the higher would be the probability of their
representing similar information.”
Consider the following important points to understand more about the Vector Space
Model −
The index representations (documents) and the queries are considered as vectors
embedded in a high dimensional Euclidean space.
The similarity measure of a document vector to a query vector is usually the cosine of
the angle between them.
Cosine Similarity Measure Formula
Cosine is a normalized dot product, which can be calculated with the help of the
following formula:
cos(q, d) = (q · d) / (|q| × |d|) = Σ_i (q_i × d_i) / ( sqrt(Σ_i q_i²) × sqrt(Σ_i d_i²) )
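The following sketch applies this formula to two-dimensional vectors for the "car"/"insurance" scenario discussed below; the actual weights are invented here purely for illustration.

# Cosine similarity between a query vector and document vectors
# (illustrative sketch; the weights below are made-up tf-idf values).
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

# Two dimensions: "car" and "insurance", as in the example discussed below.
q  = [0.7, 0.7]        # query mentions both concepts
d1 = [0.9, 0.1]        # car is salient, insurance is not
d2 = [0.6, 0.7]        # both concepts salient
d3 = [0.1, 0.8]        # insurance salient, car is not

for name, d in [("d1", d1), ("d2", d2), ("d3", d3)]:
    print(name, round(cosine(q, d), 3))
# d2 obtains the highest score, matching the ranking described below.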
Vector Space Representation with Query and Document
The query and documents are represented by a two-dimensional vector space.
The terms are car and insurance.
There is one query and three documents in the vector space.
The top ranked document in response to the terms car and insurance will be the
document d2 because the angle between q and d2 is the smallest.
The reason behind this is that both the concepts car and insurance are salient in d2 and
hence have the high weights.
On the other side, d1 and d3 also mention both the terms but in each case, one of them is
not a centrally important term in the document.
Advantages of the Vector Space Model
• Automatic selection of index terms
• Partial matching of queries and documents (dealing with the case where no document
contains all search terms)
• Ranking according to similarity score(dealing with large result sets)
• Term weighting schemes (improves retrieval performance)
• Various extensions
• Document clustering
• Relevance feedback (modifying query vector)
• Geometric foundation
5. PROBABILISTIC MODEL
• The probabilistic retrieval model is based on the Probability Ranking Principle, which
states that an information retrieval system is supposed to rank the documents based on their
probability of relevance to the query, given all the evidence available [Belkin and Croft 1992].
• The principle takes into account that there is uncertainty in the representation of the
information need and the documents.
• There can be a variety of sources of evidence that are used by the probabilistic retrieval
methods, and the most common one is the statistical distribution of the terms in both the
relevant and non-relevant documents.
• It is a formalism of information retrieval useful to derive ranking functions used by
search engines and web search engines in order to rank matching documents according to their
relevance to a given search query. It is a theoretical model estimating the probability that a
document dj is relevant to a query q.
• The probabilistic model tries to estimate the probability that the user will find the
document dj relevant with ratio
P(dj relevant to q) / P(dj non relevant to q)
• Given a user query q, and the ideal answer set R of the relevant documents, the problem
is to specify the properties for this set. Assumption (probabilistic principle):
the probability of relevance depends on the query and document representations only;
ideal answer set R should maximize the overall probability of relevance.
• Given a query q, there exists a subset R of the documents which are relevant to q, but
membership of R is uncertain.
• A probabilistic retrieval model ranks documents in decreasing order of their probability of
relevance to the information need, P(R | q, di).
• Users have information needs, which they translate into query representations.
Similarly, there are documents, which are converted into document representations.
Given only a query, an IR system has an uncertain understanding of the information need.
• Since documents can be either relevant or non-relevant, we can estimate the probability of
a term t appearing in a relevant document, P(t | R=1).
• Probabilistic methods are one of the oldest but also one of the currently hottest topics in
IR .
Partition rule: if B can be divided into an exhaustive set of disjoint sub-cases, then P(B) is the
sum of the probabilities of the sub-cases.
Odds of an event: the ratio of the probability of an event to the probability of its complement,
O(A) = P(A) / P(Ā) = P(A) / (1 − P(A))
Odds provide a kind of multiplier for how probabilities change.
• In fact, probabilistic modeling is extremely useful as an exploratory decision making
tool.
• It allows managers to capture and incorporate in a structured way their insights into the
businesses they run and the risks and uncertainties they face.
Advantages:
a) The claimed advantage to the probabilistic model is that it is entirely based on
probability theory.
b) The implication is that other models have a certain arbitrary characteristic.
c) Other models might perform well experimentally, but they lack a sound theoretical basis because
their parameters are not easy to estimate.
Disadvantages:
a) They need to guess the initial relevant and non-relevant sets.
b) Term frequency is not considered
c) Independence assumption for index terms
Using the same example we used previously with the vector space model, we now show how
the four different weights can be used for relevance ranking.
Again, the documents and the query are:
Q: "gold silver truck"
D1: "Shipment of gold damaged in a fire."
D2: "Delivery of silver arrived in a silver truck."
D3: "Shipment of gold arrived in a truck."
Since training data are needed for the probabilistic model, we assume that these three
documents are the training data and we deem documents D2 and D3 as relevant to the query.
To compute the similarity coefficient, we assign term weights to each term
in the query; we then sum the weights of matching terms.
For each query term, let N = the number of documents in the collection, n = the number of documents
indexed by the given term, R = the number of relevant documents for the query, and
r = the number of relevant documents indexed by the given term.
Note that with our collection, the weight for silver is infinite, since (n − r) = 0.
This is because "silver" appears only in relevant documents. Since we are using this
procedure in a predictive manner, Robertson and Sparck Jones recommended adding
constants to each quantity [Robertson and Sparck Jones, 1976].
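The sketch below shows how such term weights can be computed for this example. It uses the standard smoothed Robertson–Sparck Jones form with 0.5 constants, which is an assumption on my part rather than the exact table in these notes; the relevance judgements (D2 and D3 relevant) are the ones stated above.

# Robertson-Sparck Jones style term weights for the gold/silver/truck example.
# Standard smoothed form (the 0.5 constants avoid the infinite weight noted above);
# this is a sketch, not the exact table used in the notes.
import math

docs = {
    "D1": "shipment of gold damaged in a fire",
    "D2": "delivery of silver arrived in a silver truck",
    "D3": "shipment of gold arrived in a truck",
}
relevant = {"D2", "D3"}          # training judgement assumed above
query = ["gold", "silver", "truck"]

N = len(docs)                    # total documents
R = len(relevant)                # relevant documents

for t in query:
    containing = {d for d, text in docs.items() if t in text.split()}
    n = len(containing)                          # docs indexed by t
    r = len(containing & relevant)               # relevant docs indexed by t
    w = math.log10(((r + 0.5) * (N - n - R + r + 0.5)) /
                   ((n - r + 0.5) * (R - r + 0.5)))
    print(t, round(w, 3))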
Result: the term weights for each query term and the resulting document weights (worked tables omitted in these notes).
6. Latent Semantic Indexing (LSI) Model
6.1 Introduction to LSI
• Several statistical and AI techniques have been used in association with domain
semantics to extend the vector space model to help overcome some of the retrieval
problems.
• LSI is based on the principle that words that are used in the same contexts tend to have
similar meanings.
• LSI uses singular value decomposition of the term-document matrix to uncover the latent
semantic structure. An advantage of this approach is that queries can retrieve documents even if
they have no words in common.
• The LSI technique captures deeper associative structure than simple term-to-term
correlations and is completely automatic.
• The only difference between LSI and vector space methods is that LSI represents terms
and documents in a reduced dimensional space of the derived indexing dimensions. As
with the vector space method, differential term weighting and relevance feedback can
improve LSI performance substantially.
• The LSI match-document profile method combines the advantages of both LSI and the
document profile.
• The document profile provides a simple, but effective, representation of the user's
interests.
• Indicating just a few documents that are of interest is as effective as generating a long
list of words and phrases that describe one's interest.
• Document profiles have an added advantage over word profiles: users can just indicate
documents they find relevant without having to generate a description of their interests.
• The words that searchers use to describe their information needs are often not the
same words used by authors to describe the same information.
• I.e., index terms and user search terms often do NOT match
1. Synonymy
2. Polysemy
• The rectangular term-document matrix is decomposed into three other matrices of a special form
by singular value decomposition (SVD)
• The resulting matrices contain “singular vectors” and “singular values”
• The matrices show a breakdown of the original relationships into linearly independent
components or factors
• Many of these components are very small and can be ignored – leading to an
approximate model that contains many fewer dimensions
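A minimal sketch of this truncation is given below, assuming NumPy is available. The tiny term-document matrix and the choice k = 2 are invented for illustration only.

# Latent semantic indexing sketch: truncate the SVD of a term-document matrix.
# The tiny matrix below is invented (rows = terms, columns = documents).
import numpy as np

A = np.array([[1, 1, 0, 0],     # "car"
              [1, 0, 0, 0],     # "auto"
              [0, 1, 1, 0],     # "insurance"
              [0, 0, 1, 1]])    # "policy"

U, S, Vt = np.linalg.svd(A, full_matrices=False)

k = 2                            # keep only the k largest singular values
A_k = U[:, :k] @ np.diag(S[:k]) @ Vt[:k, :]   # rank-k approximation of A

# Documents are now compared in the reduced k-dimensional space,
# so documents can match a query even when they share no terms with it.
doc_vectors = (np.diag(S[:k]) @ Vt[:k, :]).T
print(np.round(doc_vectors, 2))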
The reduced representation supports three kinds of comparison:
a) comparing two rows (term-to-term similarity)
b) comparing two columns (document-to-document similarity)
c) examining a single cell in the table (the association between a term and a document)
Latent Semantic Analysis is an efficient way of analysing the text and finding the hidden
topics by understanding the context of the text.
Latent Semantic Analysis (LSA) is used to find the hidden topics represented by the
document or text. This hidden topic then is used for clustering the similar documents
together.
LSI has been tested and found to be “modestly effective” with traditional test
collections.
Permits compact storage/representation (vectors are typically 50-150 elements instead of
thousands)
• LSI overcomes two of the most problematic constraints of Boolean keyword queries:
a) multiple words that have similar meanings (synonymy)
b) words that have more than one meaning (polysemy).
• Text does not need to be in sentence form for LSI to be effective. It can work with lists,
free-form notes, email, web content, etc.
• LSI is also used to perform automated document categorization and clustering.
• In fact, several experiments have demonstrated that there are a number of correlations
between the way LSI and humans process and categorize text.
7. NEURAL NETWORK MODEL
A neural network typically consists of three layers:
a) an input layer,
b) a processing (hidden) layer, and
c) an output layer.
Types of Neural Network
• Neural networks are also ideally suited to help people solve complex problems in real-life
situations.
• They can learn and model the relationships between inputs and outputs that are nonlinear
and complex; make generalizations and inferences; reveal hidden relationships, patterns and
predictions; and model highly volatile data (such as financial time series data) and
variances needed to predict rare events (such as fraud detection).
7.3 The Neural Network Model
What is a neural network model?
• A neural network is a simplified model of the way the human brain processes information.
• It works by simulating a large number of interconnected processing units that resemble
abstract versions of neurons. The processing units are arranged in layers.
• A neural network is a method in artificial intelligence that teaches computers to process
data in a way that is inspired by the human brain.
• It is a type of machine learning process, called deep learning, that uses interconnected
nodes or neurons in a layered structure that resembles the human brain.
• Neural ranking models for information retrieval (IR) use shallow or deep neural networks
to rank search results in response to a query.
• Traditional learning to rank models employ supervised machine learning (ML)
techniques—including neural networks—over hand-crafted IR features.
• Neural networks are not themselves algorithms, but rather frameworks for many different
machine learning algorithms that work together.
• The algorithms process complex data.
• A neural network is an example of machine learning, where software can change as it
learns to solve a problem.
The first neural network (a multilayer perceptron of the back-propagation type) consists of three
layers, i.e., an input layer, a hidden layer and an output layer.
The input layer consists of N neurons x1, …, xN, where each neuron represents one
character of the query, i.e., the input layer represents one word.
The hidden layer consists of M neurons y1, …, yM, which express the inner query
representation.
The output layer consists of L neurons k1, …, kL, where each neuron represents one
keyword.
For learning this neural network a back propagation algorithm was used.
7.4 The Neural Network Model Example
• The problem that we are going to solve is pretty simple.
• Suppose we have some information about obesity, smoking habits, and exercise habits of
five people.
• We also know whether these people are diabetic or not.
• Our dataset looks like this:
Person Smoking Obesity Exercise Diabetic
Person 1 0 1 0 1
Person 2 0 0 1 0
Person 3 1 0 0 0
Person 4 1 1 0 1
Person 5 1 1 1 1
• In the above table, we have five columns: Person, Smoking, Obesity, Exercise, and
Diabetic.
• Here 1 refers to true and 0 refers to false.
• For instance, the first person has values of 0, 1, 0 which means that the person doesn't
smoke, is obese, and doesn't exercise.
• The person is also diabetic.
• It is clearly evident from the dataset that a person's obesity is indicative of him being
diabetic.
• Our task is to create a neural network that is able to predict whether an unknown person is
diabetic or not given data about his exercise habits, obesity, and smoking habits.
• This is a type of supervised learning problem where we are given inputs and corresponding
correct outputs and our task is to find the mapping between the inputs and the outputs.
• Note: This is just a fictional dataset, in real life, obese people are not necessarily always
diabetic.
• The Solution
We will create a very simple neural network with one input layer and one output layer.
• A neural network is a supervised learning algorithm which means that we provide it the
input data containing the independent variables and the output data that contains the
dependent variable.
• For instance, in our example our independent variables are smoking, obesity and exercise.
The dependent variable is whether a person is diabetic or not.
• In the beginning, the neural network makes some random predictions, these predictions are
matched with the correct output and the error or the difference between the predicted values
and the actual values is calculated.
• The function that finds the difference between the actual value and the propagated values is
called the cost function. The cost here refers to the error.
• Our objective is to minimize the cost function.
• Training a neural network basically refers to minimizing the cost function.
• We will see how we can perform this task.
• The neural network that we are going to create has the following visual representation (figure omitted). Training it consists of two phases: a) Feed Forward and b) Back Propagation.
Feed Forward
In the feed-forward part of a neural network, predictions are made based on the values in
the input nodes and the weights.
If you look at the neural network in the above figure, you will see that we have three
features in the dataset: smoking, obesity, and exercise, therefore we have three nodes in
the first layer, also known as the input layer. We have replaced our feature names with
the variable x, for generality in the figure above.
The weights of a neural network are basically the strings that we have to adjust in order
to be able to correctly predict our output.
For now, just remember that for each input feature, we have one weight.
The following are the steps that execute during the feed forward phase of a neural
network:
• The nodes in the input layer are connected with the output layer via three weight
parameters. In the output layer, the values in the input nodes are multiplied with their
corresponding weights and are added together. Finally, the bias term is added to the sum.
The b in the above figure refers to the bias term.
• The bias term is very important here. Suppose if we have a person who doesn't smoke, is
not obese, and doesn't exercise, the sum of the products of input nodes and weights will be
zero. In that case, the output will always be zero no matter how much we train the
algorithms. Therefore, in order to be able to make predictions, even if we do not have any
non-zero information about the person, we need a bias term. The bias term is necessary to
make a robust neural network.
• Mathematically, in step 1, we perform the following calculation:
X · W = x1·w1 + x2·w2 + x3·w3 + b
• The result from Step 1 can be a set of any values. However, in our output we have the
values in the form of 1 and 0.
• We want our output to be in the same format. To do so we need an activation function,
which squashes input values between 1 and 0.
• One such activation function is the sigmoid function.
• The sigmoid function returns 0.5 when the input is 0. It returns a value close to 1 if the
input is a large positive number. In case of negative input, the sigmoid function outputs a
value close to zero.
• Mathematically, the sigmoid function can be represented as:
θ(X · W) = 1 / (1 + e^(−X·W))
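The feed-forward pass and a few gradient-descent updates for the smoking/obesity/exercise dataset above can be sketched with NumPy as follows. The learning rate, number of iterations and random seed are arbitrary choices made here, not values from the notes, and the update rule is a simple squared-error sketch rather than the exact procedure of the source.

# Minimal feed-forward + gradient descent sketch for the dataset above
# (single layer of weights, sigmoid output; learning rate and epochs are arbitrary).
import numpy as np

X = np.array([[0, 1, 0],   # smoking, obesity, exercise
              [0, 0, 1],
              [1, 0, 0],
              [1, 1, 0],
              [1, 1, 1]])
y = np.array([[1, 0, 0, 1, 1]]).T          # diabetic labels

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))        # squashes X.W + b into (0, 1)

rng = np.random.default_rng(42)
W = rng.normal(size=(3, 1))                 # one weight per input feature
b = 0.0
lr = 0.5

for _ in range(5000):
    z = X @ W + b                           # step 1: weighted sum plus bias
    pred = sigmoid(z)                       # step 2: activation
    error = pred - y                        # prediction minus correct output
    W -= lr * X.T @ (error * pred * (1 - pred)) / len(X)
    b -= lr * float(np.mean(error * pred * (1 - pred)))

print(np.round(sigmoid(X @ W + b), 2).ravel())   # predictions approach [1 0 0 1 1]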
8. RETRIEVAL EVALUATION
Text retrieval in IR, where the user enters a text query and the system returns a ranked
list of search results.
The system’s goal is to rank the user’s preferred search results at the top.
This problem is a central one in the IR literature, with well-understood challenges and
solutions.
8.2 Evaluation in IR
Retrieval evaluation is concerned not only with whether answers are returned for a query, but with
the quality of those answers!
Let,
I : an example information request (topic)
R : the ideal answer set for the topic I
|R| : number of docs in the set R
A : the answer set generated by a ranking strategy we wish to evaluate
|A| : the number of docs in the set A
Ra : the set of relevant documents in the answer set (Ra = R ∩ A), with |Ra| its size; then Recall = |Ra| / |R| and Precision = |Ra| / |A|
The viewpoint using the sets R, A, and Ra, does not consider that documents presented
to the user are ordered (i.e., ranked).
User sees a ranked set of documents and examines them starting from the top.
Thus, precision and recall vary as the user proceeds with his examination of the set A.
Most appropriate then is to plot a curve of precision versus recall.
Consider a new retrieval algorithm that yields the following set of docs as answers to the query
q:
01. d123   02. d84   03. d56   04. d6   05. d8
06. d9   07. d511   08. d129   09. d187   10. d25
11. d38   12. d48   13. d250   14. d113   15. d3
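The precision at each recall level can be computed by walking down this ranking. The relevant set Rq used below is an assumption for illustration only, since these notes do not list it on this page.

# Precision/recall over the ranked answer list above.
# The relevant set Rq below is assumed for illustration; it is not given here in the notes.
ranking = ["d123", "d84", "d56", "d6", "d8", "d9", "d511", "d129",
           "d187", "d25", "d38", "d48", "d250", "d113", "d3"]
Rq = {"d3", "d5", "d9", "d25", "d39", "d44", "d56", "d71", "d89", "d123"}

found = 0
for rank, doc in enumerate(ranking, start=1):
    if doc in Rq:
        found += 1
        recall = found / len(Rq)          # fraction of all relevant docs seen so far
        precision = found / rank          # fraction of examined docs that are relevant
        print(f"{doc}: recall={recall:.0%}, precision={precision:.0%}")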
9. EVALUATION MEASURES
Evaluation measures for an information retrieval system are used to assess how well the
search results satisfied the user's query intent.
Such metrics are often split into two kinds: online metrics look at users' interactions with the
search system, while offline metrics measure relevance, in other words how likely each
result, or the search engine results page (SERP) as a whole, is to meet the information
needs of the user.
Several evaluation measures are used in order to assess the effectiveness of an IRS.
Results-based measures are computed per query. The objective is to measure how well an IRS
is capable of finding all and only the relevant documents.
Given the set of documents retrieved by a given system, precision P is the fraction of retrieved
documents that are relevant over the total number of returned documents, while
recall R corresponds to the fraction of relevant documents that have been retrieved among all
relevant documents.
• Acc. The Accuracy is one of the metrics used to evaluate binary classifiers.
While in ranking tasks, the objective is to evaluate the ordering of the relevant elements in a list
of results, in classification tasks, the evaluation objective is to assess the systems’ ability to
correctly categorize a set of instances.
For example, as a binary classification application in text matching, we would like the system
to predict the correct label (e.g., 0 or 1) that reflects an element's relevance. Hence, given a
dataset having S positive elements and N negative elements, the accuracy (Acc) of a model M
can be defined as
Acc(M) = (TS + TN) / (S + N)
where TS and TN are respectively the number of elements that are correctly classified as
positive ones, and the number of elements that are correctly classified as negative ones.
Hence, for a total population (evaluation dataset), the closer the model's accuracy is to 1, the
better it is.
The recall, precision and Acc assume that the relevance of each document
could be judged in isolation, independently from other documents [66]
R ∩ A : the intersection of the sets R and A.
10.2 Precision
Precision is the fraction of the retrieved documents that are relevant to the user's
information need:
Precision = |R ∩ A| / |A|
10.3 Recall
Recall is the fraction of the documents relevant to the query that are successfully
retrieved:
Recall = |R ∩ A| / |R|
In binary classification, recall is often called sensitivity. So it can be looked at as the
probability that a relevant document is retrieved by the query.
It is trivial to achieve recall of 100% by returning all documents in response to any
query. Therefore, recall alone is not enough but one needs to measure the number of
non-relevant documents also, for example by computing the precision.
• A collection of documents used for testing information retrieval models and algorithms.
• A reference collection usually includes a set of documents, a set of test queries, and a set
of documents known to be relevant to each query.
• Reference collections, which are based on the foundations established by the Cranfield
experiments, constitute the most common evaluation method in IR.
• With small collections one can apply the Cranfield evaluation paradigm to provide
relevance assessments.
• With large collections, however, not all documents can be evaluated relatively to a given
information need.
• The alternative is to consider only the top k documents produced by various ranking
algorithms for a given information need.
• This is called the pooling method.
• The method works for reference collections of a few million documents, such as the
TREC collections.
For instance, the users of search engines look first at the upper corner of the results page.
Thus, changing the layout is likely to affect the assessment made by the users and their
behavior.
Proper evaluation of the user interface requires going beyond the framework of the
Cranfield experiments.
12.2 User-Centered Evaluation
• User-based evaluation is the most common evaluation system advocated by many
information scientists.
13. RELEVANCE FEEDBACK AND QUERY EVALUATION
For example, an initial query "find information surrounding the various conspiracy theories
about the assassination of John F. Kennedy" has both useful keywords and noise. The most
useful keyword is probably assassination.
Like many queries (in terms of retrieval) there is some meaningless information. Terms
such as various and information are probably not stop words (i.e., frequently used words
that are typically ignored by an information retrieval system such as a, an, and, the), but
they are more than likely not going to help retrieve relevant documents.
The idea is to use all terms in the initial query and ask the user if the top ranked documents
are relevant.
The hope is that the terms in the top ranked documents that are said to be relevant will be
"good" terms to use in a subsequent query.
Assume a highly ranked document contains the term Oswald.
It is reasonable to expect that adding the term Oswald to the initial query would improve
both precision and recall. Similarly, if a top ranked document that is deemed relevant by the
user contains many occurrences of the term assassination, the weight used in the initial
query for this term should be increased.
With the vector space model, the addition of new terms to the original query, the deletion of
terms from the query, and the modification of existing term weights has been done.
With the probabilistic model, relevance feedback initially was only able to re-weight
existing terms, and there was no accepted means of adding terms to the original query.
The exact means by which relevance feedback is implemented is fairly dependent on the
retrieval strategy being employed.
Relevance Feedback in the Vector Space Model: /(The Rocchio algorithm for relevance
feedback):
Rocchio's approach used the vector space model to rank documents.
The query is represented by a vector Q, each document is represented by a vector Di, and
a measure of relevance between the query and the document vector is computed as SC(Q,
Di), where SC is the similarity coefficient.
The SC is computed as an inner product of the document and query vectors or the cosine
of the angle between the two vectors.
The basic assumption is that the user has issued a query Q and retrieved a set of
documents.
The user is then asked whether or not the documents are relevant.
After the user responds, the set R contains the n1 relevant document vectors, and the set
S contains the n2 non-relevant document vectors.
Rocchio builds the new query Q' from the old query Q using the equation given below:
Q' = α·Q + (β / n1) · Σ_{Di ∈ R} Di − (γ / n2) · Σ_{Di ∈ S} Di
The weights α, β and γ are referred to as Rocchio weights and are frequently mentioned in the
annual proceedings of TREC. The optimal values were experimentally obtained, but it is
considered common today to drop the use of non-relevant documents (assign zero to γ) and only
use the relevant documents. This basic theme was used by Ide in follow-up research to Rocchio,
who defined a closely related equation.
In Ide's formulation only the top ranked non-relevant document is used, instead of the sum of all
non-relevant documents. Ide refers to this as the Dec-Hi (decrease using highest ranking non-relevant
document) approach. Also, a more simplistic weighting is described in which the normalization
based on the number of document vectors is removed, and α, β and γ are set to one; this
simplified equation is given in [Salton, 1971a].
An interesting case occurs when the original query retrieves only non-relevant
documents.
Kelly addresses this case in [Salton, 1971b]. The approach suggests that an arbitrary
weight should be added to the most frequently occurring concept in the document
collection. This can be generalized to increase the component with the highest weight.
The hope is that the term was important, but it was drowned out by all of the surrounding
noise.
By increasing the weight, the term now rings true and yields some relevant documents.
Note that this approach is applied only in manual relevance feedback approaches.
It is not applicable to automatic feedback as the top n documents are assumed, by
definition, to be relevant.
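A small sketch of the Rocchio update above is given below, assuming NumPy. The vectors and the α, β, γ values are illustrative assumptions, not values from these notes.

# Rocchio relevance feedback sketch based on the equation above.
# Vectors and the alpha/beta/gamma values are illustrative assumptions.
import numpy as np

Q = np.array([1.0, 0.0, 1.0, 0.0])            # original query vector
relevant = [np.array([1.0, 1.0, 1.0, 0.0]),   # documents judged relevant
            np.array([0.0, 1.0, 1.0, 0.0])]
nonrelevant = [np.array([0.0, 0.0, 0.0, 1.0])]

alpha, beta, gamma = 1.0, 0.75, 0.15          # commonly quoted settings (assumed here)

Q_new = (alpha * Q
         + beta / len(relevant) * np.sum(relevant, axis=0)
         - gamma / len(nonrelevant) * np.sum(nonrelevant, axis=0))
Q_new = np.maximum(Q_new, 0.0)                # negative weights are usually clipped to zero

print(np.round(Q_new, 2))                     # expanded/re-weighted query vector Q'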
*****************
• In an implicit relevance feedback cycle, the feedback information is derived implicitly by the system.
• There are two basic approaches for compiling implicit feedback information:
• local analysis, which derives the feedback information from the top ranked documents in the result
set.
• global analysis, which derives the feedback information from external sources such as a
thesaurus.
Classic Relevance Feedback
• In a classic relevance feedback cycle, the user is presented with a list of the retrieved
documents.
• Then, the user examines them and marks those that are relevant. In practice, only the top
10 (or 20) ranked documents need to be examined.
• The main idea consists of selecting important terms from the documents that have been
identified as relevant, and enhancing the importance of these terms in a new query
formulation.
*******************
UNIT III: TEXT CLASSIFICATION AND CLUSTERING
Syllabus:
UNIT III TEXT CLASSIFICATION AND CLUSTERING 9
A Characterization of Text Classification – Unsupervised Algorithms: Clustering –
Naïve Text Classification – Supervised Algorithms – Decision Tree – k-NN
Classifier – SVM Classifier – Feature Selection or Dimensionality Reduction –
Evaluation metrics – Accuracy and Error – Organizing the classes – Indexing and
Searching – Inverted Indexes – Sequential Searching – Multi-dimensional Indexing.
1. A CHARACTERIZATION OF TEXTCLASSIFICATION
What is Classification?
• Classification is:
– the data mining process of
– finding a model (or function) that
– describes and distinguishes data classes or concepts,
– for the purpose of being able to use the model to predict
the class of objects whose class label is unknown.
• That is, predicts categorical class labels (discrete or
nominal).
• Classifies the data (constructs a model) based on the
training set.
• It predicts group membership for data instances.
Classification and prediction are two forms of data analysis that can be used to extract
models describing important data classes or to predict future data trends. Such analysis can help
provide us with a better understanding of the data at large. Whereas classification predicts
categorical (discrete, unordered) labels, prediction models continuous valued functions.
Predictor where the model constructed predicts a continuous-valued function, or
ordered value, as opposed to a categorical label. This model is a predictor.
Classification and numeric prediction are the two major types of prediction problems.
DATA CLASSIFICATION :
In the first step, a classifier is built describing a predetermined set of data classes or
concepts. This is the learning step (or training phase), where a classification algorithm builds
the classifier by analyzing or “learning from” a training set made up of database tuples and
their associated class labels.
Figure shows ,The data classification process: (a) Learning: Training data are analyzed
by a classification algorithm. Here, the class label attribute is loan decision, and the learned
model or classifier is represented in the form of classification rules. (b) Classification: Test
data are used to estimate the accuracy of the classification rules. If the accuracy is considered
acceptable, the rules can be applied to the classification of new data tuples.
The individual tuples making up the training set are referred to as training tuples and
are selected from the database under analysis.
supervised learning (i.e., the learning of the classifier is “supervised” in that it is told
to which class each training tuple belongs.)
It contrasts with unsupervised learning (or clustering), in which the class label of each
training tuple is not known, and the number or set of classes to be learned may not be known in
advance.
This first step of the classification process can also be viewed as the learning of a
mapping or function, y = f (X), that can predict the associated class label y of a given tuple X.
The model is used for classification. First, the predictive accuracy of the classifier is
estimated. If we were to use the training set to measure the accuracy of the classifier, this
estimate would likely be optimistic, because the classifier tends to overfit the data (i.e., during
learning it may incorporate some particular anomalies of the training data that are not present in
the general data set overall). Therefore, a test set is used, made up of test tuples and their
associated class labels. These tuples are randomly selected from the general data set.
The accuracy of a classifier on a given test set is the percentage of test set tuples that are
correctly classified by the classifier. The associated class label of each test tuple is compared
with the learned classifier’s class prediction for that tuple.
DATA PREDICTION :
However, for prediction, we lose the terminology of “class label attribute” because the
attribute for which values are being predicted is continuous-valued (ordered) rather than
categorical (discrete-valued and unordered). The attribute can be referred to simply as the
predicted attribute.
Note that prediction can also be viewed as a mapping or function, y= f (X), where X is the input
(e.g., a tuple describing a loan applicant), and the output y is a continuous or ordered value
(such as the predicted amount that the bank can safely loan the applicant); That is, we wish to
learn a mapping or function that models the relationship between X and y.
The following preprocessing steps may be applied to the data to help improve the
accuracy, efficiency, and scalability of the classification or prediction process.
Data cleaning: This refers to the preprocessing of data in order to remove or reduce
noise (by applying smoothing techniques, for example) and the treatment of missing values
(e.g., by replacing a missing value with the most commonly occurring value for that attribute,
or with the most probable value based on statistics).
Relevance analysis: Many of the attributes in the data may be redundant. Correlation
analysis can be used to identify whether any two given attributes are statistically related.
Attribute subset selection can be used in these cases to find a reduced set of attributes such
that the resulting probability distribution of the data classes is as close as possible to the
original distribution obtained using all attributes.
Data can also be reduced by applying many other methods, ranging from wavelet
transformation and principal components analysis to discretization techniques, such as binning,
histogram analysis, and clustering.
Classification and prediction methods can be compared and evaluated according to the
following criteria:
Accuracy: The accuracy of a classifier refers to the ability of a given classifier to correctly
predict the class label of new or previously unseen data (i.e., tuples without class label
information). Similarly, the accuracy of a predictor refers to how well a given predictor can
guess the value of the predicted attribute for new or previously unseen data.
Speed: This refers to the computational costs involved in generating and using the given
classifier or predictor.
Robustness: This is the ability of the classifier or predictor to make correct predictions
given noisy data or data with missing values.
Scalability: This refers to the ability to construct the classifier or predictor efficiently
given large amounts of data.
Interpretability: This refers to the level of understanding and insight that is provided by
the classifier or predictor.
Decision tree induction is the learning of decision trees from class-labeled training
tuples.A decision tree is a flowchart-like tree structure, where each internal node (nonleaf
node) denotes a test on an attribute, each branch represents an outcome of the test, and each
leaf
node (or terminal node) holds a class label. The top most node in a tree is the root node.
Fig. A decision tree for the concept buys computer
Given a tuple, X, for which the associated class label is unknown, the attribute values of
the tuple are tested against the decision tree. A path is traced from the root to a leaf node, which
holds the class prediction for that tuple. Decision trees can easily be converted to classification
rules.
The construction of decision tree classifiers does not require any domain knowledge or
parameter setting, and therefore is appropriate for exploratory knowledge discovery. Decision
trees can handle high dimensional data. Their representation of acquired knowledge in tree
form is intuitive and generally easy to assimilate by humans. The learning and classification
steps of decision tree induction are simple and fast. In general, decision tree classifiers have
good accuracy.
Quinlan later presented C4.5 (a successor of ID3), which became a benchmark to which newer
supervised learning algorithms are often compared.
Classification and Regression Trees (CART), which described the generation of binary
decision trees. ID3 and CART were invented independently of one another at around the same
time, yet follow a similar approach for learning decision trees from training tuples.
ID3, C4.5, and CART adopt a greedy (i.e., nonbacktracking) approach in which decision
trees are constructed in a top-down recursive divide-and-conquer manner. Most algorithms for
decision tree induction also follow such a top-down approach, which starts with a training set
of tuples and their associated class labels. The training set is recursively partitioned into smaller
subsets as the tree is being built. A basic decision tree algorithm is summarized here.
Fig. Basic algorithm for inducing a decision tree from training tuples.
The algorithm is called with three parameters: D, attribute list, and Attribute selection
method.We refer to D as a data partition. Initially, it is the complete set of training tuples and
their associated class labels. The parameter attribute list is a list of attributes describing the
tuples. Attribute selection method specifies a heuristic procedure for selecting the attribute that
“best” discriminates the given tuples according to class. This procedure employs an attribute
selection measure, such as information gain or the gini index. Whether the tree is strictly binary
is generally driven by the attribute selection measure. Some attribute selection measures, such
as the gini index, enforce the resulting tree to be binary. Others, like information gain, do not,
therein allowing multiway splits (i.e., two or more branches to be grown from a node).
The tree starts as a single node, N, representing the training tuples in D (step 1).
If the tuples in D are all of the same class, then node N becomes a leaf and is labeled
with that class (steps 2 and 3). Note that steps 4 and 5 are terminating conditions. All of the
terminating conditions are explained at the end of the algorithm.
Otherwise, the algorithm calls Attribute selection method to determine the splitting
criterion. The splitting criterion tells us which attribute to test at node N by determining the
“best” way to separate or partition the tuples in D into individual classes (step 6). The splitting
criterion also tells us which branches to grow from node N with respect to the outcomes of the
chosen test. More specifically, the splitting criterion indicates the splitting attribute and may
also indicate either a split-point or a splitting subset. The splitting criterion is determined so
that, ideally, the resulting partitions at each branch are as “pure” as possible.
A partition is pure if all of the tuples in it belong to the same class. In other words, if we
were to split up the tuples in D according to the mutually exclusive outcomes of the splitting
criterion, we hope for the resulting partitions to be as pure as possible.
The node N is labeled with the splitting criterion, which serves as a test at the node (step
7). A branch is grown from node N for each of the outcomes of the splitting criterion. The
tuples in D are partitioned accordingly (steps 10 to 11). There are three possible scenarios, as
illustrated in Figure 6.4. Let A be the splitting attribute. A has v distinct values, {a1, a2, …, av},
based on the training data.
1. A is discrete-valued: In this case, the outcomes of the test at node N correspond directly to
the known values of A; a branch is created for each known value aj of A.
2. A is continuous-valued: In this case, the test at node N has two possible outcomes,
corresponding to the conditions A ≤ split_point and A > split_point, respectively, where split_point
is the split-point returned by Attribute selection method as part of the splitting criterion.
(In practice, the split-point, a, is often taken as the midpoint of two known adjacent values of A
and therefore may not actually be a pre-existing value of A from the training data.) Two
branches are grown from N and labeled according to the above outcomes (Figure 6.4(b)).
The tuples are partitioned such that D1 holds the subset of class-labeled tuples in D for which
A ≤ split_point, while D2 holds the rest.
3. A is discrete-valued and a binary tree must be produced (as dictated by the attribute
selection measure or algorithm being used): The test at node N is of the form “A ∈ SA?”. SA is
the splitting subset for A, returned by Attribute selection method as part of the splitting
criterion. It is a subset of the known values of A. If a given tuple has value aj of A and if aj ∈
SA, then the test at node N is satisfied. Two branches are grown from N. By convention, the
left branch out of N is labeled yes so that D1 corresponds to the subset of class-labeled tuples in
D that satisfy the test. The right branch out of N is labeled no so that D2 corresponds to the
subset of class-labeled tuples from D that do not satisfy the test.
The algorithm uses the same process recursively to form a decision tree for the tuples
The recursive partitioning stops only when any one of the following terminating
conditions is true:
1. All of the tuples in partition D (represented at node N) belong to the same class
2. There are no remaining attributes on which the tuples may be further partitioned
(step 4). In this case, majority voting is employed (step 5). This involves converting node N
into a leaf and labeling it with the most common class in D. Alternatively, the class distribution
of the node tuples may be stored.
3. There are no tuples for a given branch, that is, a partition Dj is empty (step 12).
The above figure shows three possibilities for partitioning tuples based on the splitting
criterion, shown with examples. Let A be the splitting attribute. (a) If A is discrete-valued, then
one branch is grown for each known value of A. (b) If A is continuous-valued, then two
branches are grown, corresponding to A ≤ split_point and A > split_point. (c) If A is discrete-
valued and a binary tree must be produced, then the test is of the form A ∈ SA, where SA is the
splitting subset for A.
In this case, a leaf is created with the majority class in D (step 13).
Attribute selection measures are used to select the attribute that best partitions the tuples
into distinct classes.
An attribute selection measure is a heuristic for selecting the splitting criterion that
“best” separates a given data partition, D, of class-labeled training tuples into individual
classes. If we were to split D into smaller partitions according to the outcomes of the splitting
criterion, ideally each partition would be pure (i.e., all of the tuples that fall into a given
partition would belong to the same class).
Attribute selection measures are also known as splitting rules because they determine
how the tuples at a given node are to be split.
The attribute having the best score for the measure is chosen as the splitting attribute
for the given tuples. If the splitting attribute is continuous-valued or if we are restricted to
binary trees then, respectively, either a split point or a splitting subset must also be determined
as part of the splitting criterion. The tree node created for partition D is labeled with the
splitting criterion, branches are grown for each outcome of the criterion, and the tuples are
partitioned accordingly.
Three popular attribute selection measures are:
information gain,
gain ratio, and
gini index.
Information gain
ID3 uses information gain as its attribute selection measure. This measure is based on
pioneering work by Claude Shannon on information theory, which studied the value or
“information content” of messages.
Let node N represent or hold the tuples of partition D. The attribute with the highest
information gain is chosen as the splitting attribute for node N.
Now, suppose we were to partition the tuples in D on some attribute A having v distinct
values, {a1, a2, …, av}, as observed from the training data. If A is discrete-valued, these values
correspond directly to the v outcomes of a test on A. Attribute A can be used to split D into v
partitions or subsets, {D1, D2, …, Dv}, where Dj contains those tuples in D that have outcome
aj of A. These partitions would correspond to the branches grown from node N.
The expected information required to classify a tuple from D after partitioning on A is given by
Info_A(D) = Σ_{j=1}^{v} (|Dj| / |D|) × Info(Dj)
The term |Dj| / |D| acts as the weight of the jth partition. The smaller the
expected information (still) required, the greater the purity of the partitions.
Information gain is defined as the difference between the original information requirement,
Info(D) = −Σ_{i=1}^{m} pi log2(pi) (i.e., based on just the proportion of classes), and the new
requirement (i.e., obtained after partitioning on A). That is,
Gain(A) = Info(D) − Info_A(D)
Example : Induction of a decision tree using information gain. The following table presents a
training set,
The class label attribute, buys computer, has two distinct values (namely, {yes, no});
therefore, there are two distinct classes (that is, m = 2). Let class C1 correspond to yes and class
C2 correspond to no. There are nine tuples of class yes and five tuples of class no. A (root)
node N is created for the tuples in D. To find the splitting criterion for these tuples, we must
compute the information gain of each attribute.
Using the equation for Info(D), we compute the expected information needed to classify a tuple in
D, where the total number of tuples is 14, with 9 tuples of class "yes" and 5 tuples of class "no". Therefore,
Info(D) = −(9/14) log2(9/14) − (5/14) log2(5/14) = 0.940 bits
Next, we need to compute the expected information requirement for each attribute. Let’s start
with the attribute age.We need to look at the distribution of yes and no tuples for each category
of age. For the age category youth, there are two yes tuples and three no tuples. For the
category middle aged, there are four yes tuples and zero no tuples. For the category senior,
there are three yes tuples and two no tuples. Using InfoA(D) equation, the expected information
needed to classify a tuple in D if the tuples are partitioned according to age is
Info_age(D) = (5/14) × I(2,3) + (4/14) × I(4,0) + (5/14) × I(3,2) = 0.694 bits
That is, the youth partition contributes (5/14) times the information of its 2 "yes" and 3 "no" tuples,
and similarly for the other age categories. Hence Gain(age) = Info(D) − Info_age(D) = 0.940 − 0.694 = 0.246 bits.
Similarly, we can compute Gain(income) = 0.029 bits, Gain(student) = 0.151 bits, and
Gain(credit rating) = 0.048 bits. Because age has the highest information gain among the
attributes, it is selected as the splitting attribute. Node N is labeled with age, and branches are
grown for each of the attribute’s values.
The attribute age has the highest information gain and therefore becomes the splitting
attribute at the root node of the decision tree. Branches are grown for each outcome of age. The
tuples are shown partitioned accordingly.
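The numbers quoted above can be checked with a short sketch. The per-category yes/no counts for age (2/3, 4/0, 3/2) are the ones stated in the example; everything else is generic entropy arithmetic.

# Recomputing the information-gain numbers quoted above for the attribute "age".
import math

def info(counts):
    """Expected information (entropy) of a class distribution, in bits."""
    total = sum(counts)
    return -sum((c / total) * math.log2(c / total) for c in counts if c)

# Overall class distribution: 9 "yes" and 5 "no" tuples.
info_D = info([9, 5])                               # ~0.940 bits

# Per-category (yes, no) counts for age, as given in the example.
age_partitions = {"youth": (2, 3), "middle_aged": (4, 0), "senior": (3, 2)}
N = 14

info_age = sum((sum(c) / N) * info(c) for c in age_partitions.values())   # ~0.694 bits
gain_age = info_D - info_age                                              # ~0.246 bits

print(round(info_D, 3), round(info_age, 3), round(gain_age, 3))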
Suppose, instead, that we have an attribute A that is continuous-valued, rather than discrete-
valued.
For such a scenario, we must determine the “best” split-point for A, where the split-
point is a threshold on A. We first sort the values of A in increasing order. Typically, the
midpoint between each pair of adjacent values is considered as a possible split-point.
The point with the minimum expected information requirement for A is selected as the
split point for A. D1 is the set of tuples in D satisfying A ≤ split_point, and D2 is the set of
tuples in D satisfying A > split_point.
Gain ratio :
The information gain measure is biased toward tests with many outcomes. That is, it
prefers to select attributes having a large number of values.
For example, consider an attribute that acts as a unique identifier, such as product_ID. A split on
product_ID results in a large number of partitions, each containing just one tuple; because each
partition is pure, the information required to classify D based on this partitioning is zero.
Therefore, the information gained by partitioning on this attribute is maximal. Clearly,
such a partitioning is useless for classification.
C4.5, a successor of ID3, uses an extension to information gain known as gain ratio,
which attempts to overcome this bias. It applies a kind of normalization to information gain
using a "split information" value defined analogously with Info(D) as
SplitInfo_A(D) = − Σ_{j=1}^{v} (|Dj| / |D|) × log2(|Dj| / |D|)
This value represents the potential information generated by splitting the training data
set, D, into v partitions, corresponding to the v outcomes of a test on attribute A.
It differs from information gain, which measures the information with respect to
classification that is acquired based on the same partitioning. The gain ratio is defined as
GainRatio(A) = Gain(A) / SplitInfo_A(D)
The attribute with the maximum gain ratio is selected as the splitting attribute.
Example: computation of the gain ratio for the attribute income. A test on income splits the
data of Table 6.1 into three partitions, namely low, medium, and high, containing four, six, and
four tuples, respectively. To compute the gain ratio of income, we first compute
SplitInfo_income(D) = −(4/14) log2(4/14) − (6/14) log2(6/14) − (4/14) log2(4/14) = 1.557
We have Gain(income) = 0.029. Therefore, GainRatio(income) = 0.029 / 1.557 ≈ 0.019.
Gini index :
The Gini index is used in CART. Using the notation described above, the Gini index
measures the impurity of a data partition D as
Gini(D) = 1 − Σ_{i=1}^{m} pi²
where pi is the probability that a tuple in D belongs to class Ci and is estimated by |Ci,D| / |D|.
The sum is computed over m classes.
The Gini index considers a binary split for each attribute. Let’s first consider the case
where A is a discrete-valued attribute having v distinct values, {a1, a2, …, av}, occurring in
D. To determine the best binary split on A, we examine all of the possible subsets that can be
formed using known values of A.
If A has v possible values, then there are 2^v possible subsets. For example, if income has
three possible values, namely {low, medium, high}, then the possible subsets are {low,
medium, high}, {low, medium}, {low, high}, {medium, high}, {low}, {medium}, {high}, and {}.
We exclude the full set, {low, medium, high}, and the empty set from consideration since,
conceptually, they do not represent a split. Therefore, there are 2^v − 2 possible ways to form two
partitions of the data, D, based on a binary split on A.
For each attribute, each of the possible binary splits is considered. For a discrete-valued
attribute, the subset that gives the minimum gini index for that attribute is selected as its
splitting subset.
The point giving the minimum Gini index for a given (continuous-valued) attribute is
taken as the split-point of that attribute. Recall that for a possible split-point of A, D1 is the set
of tuples in D satisfying A <= split point, and D2 is the set of tuples in D satisfying A > split
point.
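A short sketch of the computation follows. Gini(D) uses the 9 "yes" / 5 "no" class distribution of the running example; the per-partition class counts for the binary split are hypothetical values chosen here purely for illustration.

# Gini index sketch. Gini(D) uses the 9 "yes" / 5 "no" distribution of the example;
# the per-partition counts for the binary split are hypothetical.
def gini(counts):
    total = sum(counts)
    return 1.0 - sum((c / total) ** 2 for c in counts)

gini_D = gini([9, 5])                         # ~0.459 for the buys_computer data

# Binary split of D into D1 and D2 on some attribute A (hypothetical class counts).
d1_counts, d2_counts = (6, 2), (3, 3)         # (yes, no) in each partition
n1, n2 = sum(d1_counts), sum(d2_counts)
N = n1 + n2

gini_A = (n1 / N) * gini(d1_counts) + (n2 / N) * gini(d2_counts)
reduction = gini_D - gini_A                   # the split with the largest reduction is chosen

print(round(gini_D, 3), round(gini_A, 3), round(reduction, 3))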
Similarly, the Gini index values for splits on the remaining subsets are: 0.315 (for the
subsets {low, high} and {medium}) and 0.300 (for the subsets {medium, high} and {low}).
Therefore, the best binary split for attribute income is on {medium, high} (or {low}) because it
minimizes the gini index.
Many other attribute selection measures have been proposed. CHAID, a decision tree
algorithm that is popular in marketing, uses an attribute selection measure that is based on the
statistical χ² test for independence. Other measures include C-SEP (which performs better than
information gain and the Gini index in certain cases) and the G-statistic (an information theoretic
measure that is a close approximation to the χ² distribution).
Other attribute selection measures consider multivariate splits (i.e., where the
partitioning of tuples is based on a combination of attributes, rather than on a single attribute).
The CART system, for example, can find multivariate splits based on a linear combination of
attributes. Multivariate splits are a form of attribute (or feature) construction, where new
attributes are created based on the existing ones.
Tree Pruning :
When a decision tree is built, many of the branches will reflect anomalies in the training
data due to noise or outliers. Tree pruning methods address this problem of overfitting the data.
Pruned trees tend to be smaller and less complex and, thus, easier to comprehend. They
are usually faster and better at correctly classifying independent test data (i.e., of previously
unseen tuples) than unpruned trees.
“How does tree pruning work?” There are two common approaches to tree pruning:
prepruning and postpruning.
In the prepruning approach, a tree is “pruned” by halting its construction early (e.g., by
deciding not to further split or partition the subset of training tuples at a given node).
The second and more common approach is postpruning, which removes subtrees from a
“fully grown” tree. A subtree at a given node is pruned by removing its branches and replacing
it with a leaf. The leaf is labeled with the most frequent class among the subtree being replaced.
C4.5 uses a method called pessimistic pruning, which is similar to the cost complexity
method in that it also uses error rate estimates to make decisions regarding subtree pruning.
Pessimistic pruning, however, does not require the use of a prune set. Instead, it uses the
training set to estimate error rates.
Scalability and Decision Tree Induction :
More recent decision tree algorithms that address the scalability issue have been
proposed. Algorithms for the induction of decision trees from very large training sets include
SLIQ and SPRINT, both of which can handle categorical and continuous valued attributes.
SLIQ employs disk-resident attribute lists and a single memory-resident class list. The
attribute lists and class list generated by SLIQ for the tuple data of Table
Figure Attribute list and class list data structures used in SLIQ for the tuple data of
above table
Table attribute list data structure used in SPRINT for the tuple data of above table.
The use of data structures to hold aggregate information regarding the training data is
one approach to improving the scalability of decision tree induction.
While both SLIQ and SPRINT handle disk-resident data sets that are too large to fit into
memory, the scalability of SLIQ is limited by the use of its memory-resident data structure.
Bayesian classifiers are statistical classifiers. They can predict class membership
probabilities, such as the probability that a given tuple belongs to a particular class.
Training means training the classifier on particular inputs so that later we may test it on
unknown inputs (which it has never seen before), which it may then classify or predict (in
the case of supervised learning) based on what it has learned.
This is what most Machine Learning techniques like Neural Networks, SVMs, Bayesian
classifiers, etc. are based upon.
So in a general Machine Learning project you basically have to divide your input set into
a Development Set (Training Set + Dev-Test Set) and a Test Set (or Evaluation Set).
Remember, your basic objective is that your system learns to classify new inputs which it
has never seen before in either the dev set or the test set.
The test set typically has the same format as the training set.
However, it is very important that the test set be distinct from the training corpus: if we
simply reused the training set as the test set, then a model that simply memorized its
input, without learning how to generalize to new examples, would receive misleadingly
high scores.
In general, as an example, around 70% of the cases can form the training set. Also remember to
partition the original set into the training and test sets randomly.
To demonstrate the concept of Naïve Bayes Classification, consider the example given below:
The Naive Bayes Classifier technique is based on the so-called Bayesian theorem and is
particularly suited when the dimensionality of the inputs is high.
Despite its simplicity, Naive Bayes can often outperform more sophisticated
classification methods.
Since there are twice as many GREEN objects as RED, it is reasonable to believe that a
new case (which hasn't been observed yet) is twice as likely to have membership
GREEN rather than RED.
In the Bayesian analysis, this belief is known as the prior probability. Prior probabilities
are based on previous experience, in this case the percentage of GREEN and RED
objects, and often used to predict outcomes before they actually happen.
Since there is a total of 60 objects, 40 of which are GREEN and 20 RED, our prior
probabilities for class membership are: Prior probability of GREEN = 40/60 and Prior
probability of RED = 20/60.
Having formulated our prior probability, we are now ready to classify a new object
(WHITE circle).
Since the objects are well clustered, it is reasonable to assume that the more GREEN (or
RED) objects in the vicinity of X, the more likely that the new cases belong to that
particular color.
To measure this likelihood, we draw a circle around X which encompasses a number (to
be chosen a priori) of points irrespective of their class labels.
Then we calculate the number of points in the circle belonging to each class label.
From this we calculate the likelihood: Likelihood of X given GREEN = 1/40 and Likelihood of X given RED = 3/20.
From the illustration above, it is clear that Likelihood of X given GREEN is smaller
than Likelihood of X given RED, since the circle encompasses 1 GREEN object and
3 RED ones. Thus:
Although the prior probabilities indicate that X may belong to GREEN (given that there
are twice as many GREEN compared to RED) the likelihood indicates otherwise; that
the class membership of X is RED (given that there are more RED objects in the
vicinity of X than GREEN).
In the Bayesian analysis, the final classification is produced by combining both sources
of information, i.e., the prior and the likelihood, to form a posterior probability using the
so-called Bayes' rule (named after Rev. Thomas Bayes 1702-1761).
Finally, we classify X as RED since its class membership achieves the largest posterior
probability.
Note. The above probabilities are not normalized. However, this does not affect the
classification outcome since their normalizing constants are the same.
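The combination of prior and likelihood described above can be written out explicitly; the snippet below is a small sketch using the counts stated in the example (40 GREEN and 20 RED objects, with 1 GREEN and 3 RED points inside the circle around X):

prior_green, prior_red = 40 / 60, 20 / 60        # class proportions
lik_green, lik_red = 1 / 40, 3 / 20              # neighbourhood counts / class sizes

post_green = prior_green * lik_green             # unnormalized posterior for GREEN
post_red = prior_red * lik_red                   # unnormalized posterior for RED
print(post_green, post_red)                      # 1/60 vs 3/60 -> classify X as RED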
BAYESIAN CLASSIFICATION :
Naïve Bayesian classifiers assume that the effect of an attribute value on a given class is
independent of the values of the other attributes. This assumption is called class conditional
independence. It is made to simplify the computations involved and, in this sense, is considered
“naïve.” Bayesian belief networks are graphical models, which unlike naïve Bayesian
classifiers, allow the representation of dependencies among subsets of attributes. Bayesian
belief networks can also be used for classification.
BAYES’ THEOREM :
P(H|X) is the posterior probability of H conditioned on X, P(H) is the prior probability of H, P(X|H) is the posterior probability of X conditioned on H, and P(X) is the prior probability of X.
P(H), P(X|H), and P(X) may be estimated from the given data, as we shall see below. Bayes’
theorem is useful in that it provides a way of calculating the posterior probability, P(H|X), from
P(H), P(X|H), and P(X). Bayes’ theorem is P(H|X) = P(X|H) P(H) / P(X).
1. Let D be a training set of tuples and their associated class labels. As usual, each tuple is represented by an n-dimensional attribute vector, X = (x1, x2, ..., xn).
2. Suppose that there are m classes, C1, C2, …. , Cm. Given a tuple, X, the classifier will
predict that X belongs to the class having the highest posterior probability, conditioned on X.
That is, the naïve Bayesian classifier predicts that tuple X belongs to the class Ci if and only if
P(Ci|X) > P(Cj|X) for 1 ≤ j ≤ m, j ≠ i. Thus we maximize P(Ci|X). The class Ci for which
P(Ci|X) is maximized is called the maximum posteriori hypothesis. By Bayes’ theorem,
P(Ci|X) = P(X|Ci) P(Ci) / P(X).
3. As P(X) is constant for all classes, only P(X|Ci) P(Ci) need be maximized. If the class
prior probabilities are not known, then it is commonly assumed that the classes are equally
likely, that is, P(C1) = P(C2) =…. = P(Cm), and we would therefore maximize P(X|Ci).
For each attribute, we look at whether the attribute is categorical or continuous-valued. For
instance, to compute P(X|Ci), we consider the following:
(a) If Ak is categorical, then P(xk|Ci) is the number of tuples of class Ci in D having the
value xk for Ak, divided by |Ci,D|, the number of tuples of class Ci in D.
(b) If Ak is continuous-valued, then we need to do a bit more work, but the calculation is
pretty straightforward. A continuous-valued attribute is typically assumed to have a Gaussian
distribution with a mean μ and standard deviation σ, defined by
g(x, μ, σ) = (1 / (√(2π) σ)) exp(−(x − μ)² / (2σ²)), so that P(xk|Ci) = g(xk, μCi, σCi).
5. In order to predict the class label of X, P(X|Ci)P(Ci) is evaluated for each class Ci.
The classifier predicts that the class label of tuple X is the class Ci if and only if
P(X|Ci)P(Ci) > P(X|Cj)P(Cj) for 1 ≤ j ≤ m, j ≠ i.
The naïve Bayesian classifier makes the assumption of class conditional independence,
that is, given the class label of a tuple, the values of the attributes are assumed to be
conditionally independent of one another.
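As an illustration of this decision rule, the sketch below implements a tiny categorical naïve Bayes scorer; the attribute values, class labels, and the absence of smoothing are all simplifying assumptions, not part of the original notes:

from collections import Counter, defaultdict

def train_naive_bayes(rows, labels):
    """Estimate P(Ci) and P(xk|Ci) from categorical training data."""
    priors = {c: n / len(labels) for c, n in Counter(labels).items()}
    cond = defaultdict(Counter)                      # (class, attr index) -> value counts
    for row, c in zip(rows, labels):
        for k, v in enumerate(row):
            cond[(c, k)][v] += 1
    return priors, cond

def predict(x, priors, cond):
    """Return the class maximizing P(X|Ci) * P(Ci) under the independence assumption."""
    scores = {}
    for c, p in priors.items():
        score = p
        for k, v in enumerate(x):
            counts = cond[(c, k)]
            score *= counts[v] / sum(counts.values())   # P(xk|Ci), no smoothing
        scores[c] = score
    return max(scores, key=scores.get)

# Hypothetical tuples: (age, student)
rows = [("youth", "yes"), ("youth", "no"), ("senior", "yes"), ("senior", "no")]
labels = ["buys", "no_buy", "buys", "no_buy"]
priors, cond = train_naive_bayes(rows, labels)
print(predict(("youth", "yes"), priors, cond))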
Bayesian belief networks specify joint conditional probability distributions. They allow
class conditional independencies to be defined between subsets of variables. They provide a
graphical model of causal relationships, on which learning can be performed. Trained Bayesian
belief networks can be used for classification. Bayesian belief networks are also known as
belief networks, Bayesian networks, and probabilistic networks. For brevity, we will refer to
them as belief networks.
A belief network is defined by two components: a directed acyclic graph and a set of
conditional probability tables (CPTs).
predecessor of Z, and Z is a descendant of Y. Each variable is conditionally independent of its
non descendants in the graph, given its parents.
Figure 6.11 A simple Bayesian belief network: (a) A proposed causal model, represented by a
directed acyclic graph. (b) The conditional probability table for the values of the variable Lung
Cancer (LC) showing each possible combination of the values of its parent nodes, Family
History (FH) and Smoker (S).
A belief network has one conditional probability table (CPT) for each variable. The
CPT for a variable Y specifies the conditional distribution P(Y |Parents (Y)), where Parents(Y)
are the parents of Y.
The network topology (or “layout” of nodes and arcs) may be given in advance or
inferred from the data. The network variables may be observable or hidden in all or some of the
training tuples. The case of hidden data is also referred to as missing values or incomplete data.
If the network topology is known and the variables are observable, then training the
network is straightforward. It consists of computing the CPT entries, as is similarly done when
computing the probabilities involved in naive Bayesian classification. When the network
topology is given and some of the variables are hidden, there are various methods to choose
from for training the belief network.
A gradient descent strategy is used to search for the wijk values that best model the
data, based on the assumption that each possible setting of wijk is equally likely.
The gradient descent method performs greedy hill-climbing in that, at each iteration or
step along the way, the algorithm moves toward what appears to be the best solution at the
moment, without backtracking. The weights are updated at each iteration. Eventually, they
converge to a local optimum solution.
2. Take a small step in the direction of the gradient: the weights are updated by
wijk ← wijk + l · (∂ ln P(D|S) / ∂wijk), where l is the learning rate representing the step size
and the gradient is computed as in step 1.
3. Renormalize the weights: because the weights wijk are probability values, they must be
between 0.0 and 1.0, and Σj wijk must equal 1 for all i, k. These criteria are achieved by
renormalizing the weights after they have been updated. Algorithms that follow this form of
learning are called adaptive probabilistic networks.
The “IF”-part (or left-hand side) of a rule is known as the rule antecedent or precondition.
The “THEN”-part (or right-hand side) is the rule consequent. In the rule antecedent, the
condition consists of one or more attribute tests (such as age = youth, and student = yes) that
are logically ANDed. The rule’s consequent contains a class prediction (in this case, we are
predicting whether a customer will buy a computer). R1 can also be written as
If the condition (that is, all of the attribute tests) in a rule antecedent holds true for a given
tuple, we say that the rule antecedent is satisfied (or simply, that the rule is satisfied) and that
the rule covers the tuple.
A rule R can be assessed by its coverage and accuracy. Given a tuple, X, from a class-labeled
data set, D, let ncovers be the number of tuples covered by R, ncorrect be the number of
tuples correctly classified by R, and |D| be the number of tuples in D. We can define the
coverage and accuracy of R as coverage(R) = ncovers / |D| and accuracy(R) = ncorrect / ncovers.
That is, a rule’s coverage is the percentage of tuples that are covered by the rule (i.e., whose
attribute values hold true for the rule’s antecedent). For a rule’s accuracy, we look at the tuples
that it covers and see what percentage of them the rule can correctly classify.
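The two measures can be computed with a few lines of code; the sketch below assumes the caller supplies a predicate for the rule antecedent and a labeled data set (both hypothetical):

def rule_coverage_accuracy(rule_matches, rule_predicts, data):
    """Coverage and accuracy of a rule R over a labeled data set D.

    rule_matches(x)  -> True if the antecedent of R holds for tuple x
    rule_predicts    -> the class label in the consequent of R
    data             -> list of (tuple, true_label) pairs
    """
    covered = [(x, y) for x, y in data if rule_matches(x)]
    n_covers = len(covered)
    n_correct = sum(1 for _, y in covered if y == rule_predicts)
    coverage = n_covers / len(data)
    accuracy = n_correct / n_covers if n_covers else 0.0
    return coverage, accuracy

data = [({"age": "youth", "student": "yes"}, "buys"),
        ({"age": "youth", "student": "no"}, "no_buy"),
        ({"age": "senior", "student": "yes"}, "buys")]
r1_matches = lambda x: x["age"] == "youth" and x["student"] == "yes"
print(rule_coverage_accuracy(r1_matches, "buys", data))   # coverage 1/3, accuracy 1.0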
Let’s see how we can use rule-based classification to predict the class label of a given
tuple, X. If a rule is satisfied by X, the rule is said to be triggered. For example, suppose we
have
If R1 is the only rule satisfied, then the rule fires by returning the class prediction for X.
If more than one rule is triggered, we need a conflict resolution strategy to figure out
which rule gets to fire and assign its class prediction to X. There are many possible strategies,
such as size ordering and rule ordering.
The size ordering scheme assigns the highest priority to the triggering rule that has the
“toughest” requirements, where toughness is measured by the rule antecedent size. That is, the
triggering rule with the most attribute tests is fired.
The rule ordering scheme prioritizes the rules beforehand. The ordering may be
class-based or rule-based.
With class-based ordering, the classes are sorted in order of decreasing “importance,”
such as by decreasing order of prevalence.
With rule-based ordering, the rules are organized into one long priority list, according
to some measure of rule quality such as accuracy, coverage, or size (number of attribute tests
in the rule antecedent), or based on advice from domain experts.
To extract rules from a decision tree, one rule is created for each path from the root to a
leaf node. Each splitting criterion along a given path is logically ANDed to form the rule
antecedent (“IF” part). The leaf node holds the class prediction, forming the rule consequent
(“THEN” part).
A disjunction (logical OR) is implied between each of the extracted rules. Because the
rules are extracted directly from the tree, they are mutually exclusive and exhaustive. By
mutually exclusive, this means that we cannot have rule conflicts here because no two rules will
be triggered for the same tuple. (We have one rule per leaf, and any tuple can map to only one
leaf.) By exhaustive, we mean that there is one rule for each possible attribute-value combination,
so that this set of rules does not require a default rule. Therefore, the order of the rules does not
matter: they are unordered.
The training tuples and their associated class labels are used to estimate rule accuracy.
Other problems arise during rule pruning, however, as the rules will no longer be
mutually exclusive and exhaustive. For conflict resolution, C4.5 adopts a class-based ordering
scheme. It groups all rules for a single class together, and then determines a ranking of these
class rule sets. Within a rule set, the rules are not ordered. C4.5 orders the class rule sets so as
to minimize the number of false-positive errors (i.e., where a rule predicts a class, C, but the
actual class is not C). The class rule set with the least number of false positives is examined
first. Once pruning is complete, a final check is done to remove any duplicates.
IF-THEN rules can be extracted directly from the training data (i.e., without having to
generate a decision tree first) using a sequential covering algorithm.
Sequential covering algorithms are the most widely used approach to mining disjunctive
sets of classification rules, and form the topic of this subsection.
Here, rules are learned for one class at a time. Ideally, when learning a rule for a class,
Ci, we would like the rule to cover all (or many) of the training tuples of class C and none (or
few) of the tuples from other classes. In this way, the rules learned should be of high accuracy.
The rules need not necessarily be of high coverage. This is because we can have more than one
rule for a class, so that different rules may cover different tuples within the same class. The
process continues until the terminating condition is met, such as when there are no more
training tuples or the quality of a rule returned is below a user-specified threshold. The Learn
One Rule procedure finds the “best” rule for the current class, given the current set of training
tuples.
Typically, rules are grown in a general-to-specific manner, The classifying attribute is loan
decision, which indicates whether a loan is accepted (considered safe) or rejected (considered
risky). To learn a rule for the class “accept,” we start off with the most general rule possible,
that is, the condition of the rule antecedent is empty. The rule is:
IF (empty condition) THEN loan_decision = accept.
Each time we add an attribute test to a rule, the resulting rule should cover more of the
“accept” tuples. During the next iteration, we again consider the possible attribute tests and end
up selecting credit rating = excellent. Our current rule grows to become
The process repeats, where at each step we continue to greedily grow rules until the resulting rule meets an acceptable quality level.
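A simplified version of this greedy general-to-specific loop is sketched below; it uses plain accuracy as the rule-quality measure and made-up attribute and value names, so it is only an illustration of the idea, not the exact Learn_One_Rule procedure:

def learn_one_rule(data, target_class, attributes, min_quality=0.9):
    """Greedily grow one rule (a dict of attribute -> value tests) for target_class."""
    rule = {}                                    # empty antecedent: most general rule
    while True:
        covered = [(x, y) for x, y in data
                   if all(x.get(a) == v for a, v in rule.items())]
        pos = sum(1 for _, y in covered if y == target_class)
        quality = pos / len(covered) if covered else 0.0
        if quality >= min_quality or not covered:
            return rule, covered
        # try every attribute test not yet in the rule, keep the best one
        best = None
        for a in attributes:
            if a in rule:
                continue
            for v in {x[a] for x, _ in covered}:
                cand = [(x, y) for x, y in covered if x[a] == v]
                q = sum(1 for _, y in cand if y == target_class) / len(cand)
                if best is None or q > best[0]:
                    best = (q, a, v)
        if best is None or best[0] <= quality:
            return rule, covered                 # no test improves the rule
        rule[best[1]] = best[2]

data = [({"income": "high", "credit": "excellent"}, "accept"),
        ({"income": "high", "credit": "fair"}, "accept"),
        ({"income": "low", "credit": "fair"}, "reject"),
        ({"income": "low", "credit": "excellent"}, "reject")]
rule, covered = learn_one_rule(data, "accept", ["income", "credit"])
print(rule)                                      # e.g. {'income': 'high'}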
4.3.4. Rule Quality Measures:
Learn One Rule needs a measure of rule quality. Every time it considers an attribute test,
it must check to see if appending such a test to the current rule’s condition will result in an
improved rule.
Choosing between two rules based on accuracy. Consider the two rules as illustrated in
Figure 6.14. Both are for the class loan decision = accept. We use “a” to represent the tuples of
class “accept” and “r” for the tuples of class “reject.” Rule R1 correctly classifies 38 of the 40
tuples it covers. Rule R2 covers only two tuples, which it correctly classifies. Their respective
accuracies are 95% and 100%. Thus, R2 has greater accuracy than R1, but it is not the better
rule because of its small coverage.
Figure shows Rules for the class loan decision = accept, showing accept (a) and reject (r)
tuples.
Another measure is based on information gain and was proposed in FOIL (First Order
Inductive Learner), a sequential covering algorithm that learns first-order logic rules.
Learning first-order rules is more complex because such rules contain variables, whereas
the rules we are concerned with in this section are propositional.
FOIL assesses the information gained by extending the condition as FOIL_Gain = pos′ × (log2(pos′ / (pos′ + neg′)) − log2(pos / (pos + neg))), where pos and neg are the numbers of positive and negative tuples covered by the current rule R, and pos′ and neg′ are those covered by the extended rule R′.
4.3.5. Rule Pruning:
The rules may perform well on the training data, but less well on subsequent data. To
compensate for this, we can prune the rules. A rule is pruned by removing a conjunct (attribute
test). We choose to prune a rule, R, if the pruned version of R has greater quality, as assessed on
an independent set of tuples. FOIL uses a simple yet effective method. Given a rule, R,
FOIL_Prune(R) = (pos − neg) / (pos + neg),
where pos and neg are the number of positive and negative tuples covered by R, respectively.
This value will increase with the accuracy of R on a pruning set. Therefore, if the FOIL Prune
value is higher for the pruned version of R, then we prune R.
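The pruning decision can be expressed directly in code; the counts below are hypothetical:

def foil_prune(pos, neg):
    """FOIL_Prune value of a rule covering pos positive and neg negative tuples."""
    return (pos - neg) / (pos + neg)

# Prune the rule if the pruned version scores at least as well on the pruning set.
original = foil_prune(pos=38, neg=2)      # hypothetical counts for R
pruned = foil_prune(pos=37, neg=1)        # hypothetical counts for the pruned R
print(pruned > original)                  # True -> keep the pruned (shorter) rule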
Backpropagation algorithm for classification: after applying the backpropagation algorithm, a
genetic algorithm is applied for weight adjustment. The developed model can then be applied to
classify the unknown tuples from the given database, and this information may be used by the
decision maker to make useful decisions. If one can write down a flow chart or a formula that
accurately describes the problem, then a traditional programming method should be used. However,
there are many data mining tasks that are not solved efficiently with simple mathematical formulas.
Large scale data mining applications involving complex decision making can access billions of
bytes of data; hence, the efficiency of such applications is paramount. Classification is a key
data mining technique.
Neural Network. An Artificial Neural Network is often simply called a Neural Network (NN). To build
an artificial neural network, artificial neurons, also called nodes, are interconnected. The
architecture of the NN is very important for performing a particular computation. Some neurons
are arranged to take inputs from the outside environment. These neurons are not connected with
each other, so they are arranged in a layer, called the input layer. All the neurons of the input
layer produce some output, which is the input to the next layer. The architecture of an NN can be
single layer or multilayer. In a single layer neural network there is only one input layer and one
output layer, while in a multilayer neural network there can be one or more hidden layers.
An artificial neuron is an abstraction of biological neurons and the basic unit in an ANN.
The artificial neuron receives one or more inputs and sums them to produce an output. Usually
the inputs of each node are weighted, and the weighted sum is passed through a function known as an
activation or transfer function. The objective here is to develop a data classification algorithm
that will be used as a general-purpose classifier. To classify any database, the model first has
to be trained; the training algorithm used here is a hybrid BP-GA. After successful training, the
user can give unlabeled data to classify. The basic elements of the neuron are:
The synapses or connecting links: provide weights, wj, to the input values, xj, for j = 1 ... m.
An adder: sums the weighted input values to compute the input to the activation function,
v = w0 + w1 x1 + ... + wm xm, where
w0, called the bias, is a numerical value associated with the neuron. It is convenient to think
of the bias as the weight for an input x0 whose value is always equal to one, so that
v = w0 x0 + w1 x1 + ... + wm xm.
An activation function g: maps v to g(v), the output value of the neuron. This function is a
monotone function; a common choice is the logistic function g(v) = 1 / (1 + e^(−v)). The practical
value of the logistic function arises from the fact that it is almost linear in the range where g
is between 0.1 and 0.9, but has a squashing effect on very small or very large values of v.
The back propagation algorithm cycles through two distinct passes, a forward pass
followed by a backward pass through the layers of the network. The algorithm alternates
between these passes several times as it scans the training data.
• The algorithm starts with the first hidden layer using as input values the independent
variables of a case from the training data set.
• The neuron outputs are computed for all neurons in the first hidden layer by performing the
relevant sum and activation function evaluations.
• These outputs are the inputs for neurons in the second hidden layer. Again the relevant sum
and activation function calculations are performed to compute the outputs of second layer
neurons.
• This phase begins with the computation of error at each neuron in the output layer. A popular
error function is the squared difference between ok the output of node k and yk the target value
for that node.
• The target value is just 1 for the output node corresponding to the class of the exemplar and
zero for other output nodes.
• The new value of the weight wjk of the connection from node j to node k is given by:
wjk(new) = wjk(old) + η · oj · δk, where oj is the output of node j and δk is the error term at
node k. Here η (the learning rate) is an important tuning parameter that is chosen by trial and
error by repeated runs on the training data. Typical values for η are in the range 0.1 to 0.9.
• The backward propagation of weight adjustments along these lines continues until we reach the input layer.
• At this time we have a new set of weights on which we can make a new forward pass when the next case from the training data is presented.
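The forward pass and the quoted weight-update rule can be sketched as follows for a single output node; the tiny network, the learning rate, and the restriction to output-layer updates are simplifications for illustration only:

import math, random

def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))

def forward(x, w_hidden, w_out):
    """One forward pass: inputs -> hidden-layer outputs -> single output node."""
    h = [sigmoid(sum(wi * xi for wi, xi in zip(w, x))) for w in w_hidden]
    o = sigmoid(sum(wi * hi for wi, hi in zip(w_out, h)))
    return h, o

def backprop_step(x, y, w_hidden, w_out, eta=0.5):
    """Update output-layer weights with w_new = w_old + eta * o_j * delta_k."""
    h, o = forward(x, w_hidden, w_out)
    delta_out = o * (1 - o) * (y - o)            # error term at the output node
    for j in range(len(w_out)):
        w_out[j] += eta * h[j] * delta_out       # the quoted update rule
    return o

random.seed(0)
w_hidden = [[random.uniform(-0.5, 0.5) for _ in range(2)] for _ in range(3)]
w_out = [random.uniform(-0.5, 0.5) for _ in range(3)]
print(backprop_step([1.0, 0.0], 1.0, w_hidden, w_out))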
Initial weight range (r): the range is usually [−r, r]; weights are initialized within this range.
Number of hidden layers: Up to four hidden layers can be specified; see the overview section
for more detail on layers in a neural network (input, hidden and output). Let us specify the
number to be 1.
Number of Nodes in Hidden Layer: Specify the number of nodes in each hidden layer.
Selecting the number of hidden layers and the number of nodes is largely a matter of trial and
error.
Number of Epochs: An epoch is one sweep through all the records in the training set.
Increasing this number will likely improve the accuracy of the model, but at the cost of time,
and decreasing this number will likely decrease the accuracy, but take less time.
Step size (Learning rate) for gradient descent: This is the multiplying factor for the error
correction during back propagation; it is roughly equivalent to the learning rate for the neural
network. A low value produces slow but steady learning; a high value produces rapid but
erratic learning. Values for the step size typically range from 0.1 to 0.9.
Error tolerance: The error in a particular iteration is back propagated only if it is greater than
the error tolerance. Typically error tolerance is a small value in the range 0 to 1.
Hidden layer sigmoid: The output of every hidden node passes through a sigmoid function.
Standard sigmoid function is logistic; the range is between 0 and 1.
• A study comparing a feed forward network, a recurrent neural network and a time-delay
neural network shows that the highest correct classification rate is achieved by the fully connected
feed forward neural network.
• From table 3.1, it can be seen that results obtained from BPNN are better than those obtained
from MLC.
• From the obtained results in table 3.3, it can be seen that MLP is having highest classification
rate, with reasonable error and time taken to classify is also reasonably less
• From table 3.3, it can be seen that considering some of the performance parameters, BPNN is
better than other methods, GA, KNN and MLC.
CLASSIFICATION
Classification consists of examining the properties of a newly presented observation and
assigning it to a predefined class. Examples:
Assigning customers to predefined customer segments (good vs. bad)
Assigning keywords to articles
Classifying credit applicants as low, medium, or high risk
Classifying instructor ratings as excellent, very good, good, fair, or poor
Classification means that, based on the properties of existing data, we have formed groups, i.e. we
have made a classification. The concept can be well understood by a very simple example of
student grouping: a student can be grouped as either good or bad depending on his previous
record. Similarly, an employee can be grouped as excellent, good, fair, etc. based on his track
record in the organization. So how were the students or employees classified? The answer is: using
historical data. Yes, history is the best predictor of the future. When an organization conducts
tests and interviews of candidate employees, their performance is compared with that of the
existing employees. This knowledge can be used to predict how well a candidate can perform if
employed. So we are doing classification, here absolute classification, i.e. either good or bad, or
in other words binary classification: either you are in this group or that one. Each entity is
assigned to one of the groups or classes. An example where classification can prove beneficial
is customer segmentation: businesses can classify their customers as either good or bad, and
the knowledge thus gained can be utilized for executing targeted marketing plans. Another example is
a news site, where there are a number of visitors and also many content developers. Where should a
specific news item be placed on the web site? What should be the hierarchical position of the news
item, and what should be its chapter or category? Should it be in the sports section, the weather
section, and so on? What is the problem in doing all this? The problem is that it is not a matter of
placing a single news item. The site, as already mentioned, has a number of content developers and
also many categories. If the sorting is performed manually, it is time consuming. That is why
classification techniques can scan and process a document to decide its category or class. How,
and what sort of processing is involved, will be discussed in the next lecture. It is not possible,
and there are flaws, in assigning a category to a news document based only on a keyword: frequent
occurrence of the keyword "cricket" in a document does not necessarily mean that the document
should be placed in the sports category; the document may actually be political in nature.
PREDICTION
Same as classification or estimation except records are classified according to some predicted
future behavior or estimated value.
Using classification or estimation on training examples with known predicted values and
historical data, a model is built. The model is then used to predict future values or behavior
for new records.
Example:
Predicting how much customers will spend during next 6 months.
Prediction here is not like a palmist's approach of "if this line, then this". Prediction means
finding the probability of an item/event/customer falling into a specific class, i.e. it tells us
in which class a specific item will lie in the future, or to which class a specific event can be
assigned at some time in the future, say after six years. How does prediction actually work?
First of all, a model is built using existing data. The existing data set is divided into two
subsets: one is called the training set and the other is called the test set. The training set is
used to form the model and the associated rules. Once the model is built and the rules are
defined, the test set is used for grouping. It must be noted that the test set groupings are
already known, but they are put into the model to test its accuracy. Accuracy, which we will
discuss in detail in the following slides, depends on many factors such as the model, the
selection and sizes of the training and test data, and more. The accuracy gives the confidence
level, i.e. that the rules are accurate to that level.
Prediction can be well understood by considering a simple example. Suppose a business wants
to know its customers' propensity to buy, spend, or purchase; in other words, how much will a
customer spend in the next 6 months? Similarly, a mobile phone company can install a new tower
based on knowledge of the spending habits of its customers in the surrounding area. It is not
the case that companies install facilities or invest money on gut feeling; if you think so, you
are absolutely wrong. Why should companies bother about their customers? Because if they know
their customers, their interests, their likes and dislikes, and their buying patterns, then it
is possible to run targeted marketing campaigns and thus increase profit.
CLUSTERING
Task of segmenting a heterogeneous population into a number of more homogenous sub-
groups or clusters.
Unlike classification, it does NOT depend on predefined classes.
It is up to you to determine what meaning, if any, to attach to the resulting clusters.
It could be the first step to the market segmentation effort.
What else can data mining do? We can do clustering with DM. Clustering is the technique of
reshuffling and relocating existing segments in given data, which is mostly heterogeneous, so
that the new segments contain more homogeneous data items. This can be easily understood with a
simple example. Suppose some items have been segmented on the basis of color in the given data.
If the items are fruits, then the green segment may contain all green fruits like apples,
grapes, etc., thus a heterogeneous mixture of items. Clustering segregates such items and brings
all apples into one segment or cluster, although it may contain apples of different colors (red,
green, yellow, etc.), thus a more homogeneous cluster than the previous one.
Why is clustering a difficult task? In the case of classification we already know the number of
classes, whether good/bad, yes/no, or any other number of classes. We also have knowledge of the
class properties, so it is easy to segment data into known classes. However, in the case of
clustering we do not know the number of clusters a priori. Once clusters are found in the data,
business intelligence and domain knowledge are needed to analyze the found clusters. Clustering
can be the first step towards market segmentation, i.e. we can use clustering to find the possible
clusters in the data. Once clusters are found and analyzed, classification can be applied, thus
gaining more accuracy than any standalone technique. Thus clustering is at a higher level than
classification
not only because of its complexity but also because it leads to classification.
Examples of Clustering Applications
Marketing: discovering distinct groups in customer databases, such as customers who make lots of
long-distance calls and don't have a job. Who are they? Students. Marketers use this knowledge to
develop targeted marketing programs.
Insurance: identifying groups of crop insurance policy holders with a high average claim rate.
Farmers crash crops when it is "profitable".
Land use: identification of areas of similar land use in a GIS database.
Seismic studies: identifying probable areas for oil/gas exploration based on seismic data.
We discussed what clustering is and how it works. Now, to know its real spirit, let us look at
some real world examples that show the blessings of clustering:
1. Knowing or discovering your market segment: suppose a telecom company whose data, when
clustered, revealed that there is a group or cluster of customers whose long distance calls are
greater in number. Is it a discovery that such a group exists? Not really. The real discovery,
the real fun part, is analyzing the cluster. Why are these people in a cluster? Analysis of the
cluster reveals that all the people in the group are unemployed! How is it possible that
unemployed people are making expensive long distance calls? Further analysis ultimately revealed
that the people in the cluster were mostly students, students living away from home in
universities, colleges and hostels. They are making calls back home. So this is a real example of
clustering. Now the same question: what is the benefit of knowing all this?
The answer is that the customer is an asset for any organization. Knowing the customer is crucial
for any organization/company so as to satisfy the customer, which is the key to any company's
success in terms of profit. The company can run targeted sales promotion and marketing efforts
aimed at the target customers, i.e. students.
2. Insurance: now let us look at how clustering plays a role in the insurance sector. Insurance
companies are interested in knowing the people having higher insurance claims. You may be
astonished that clustering has been used successfully in a developed country to detect farmer
insurance abuse. Some malicious farmers used to crash their crops intentionally to gain insurance
money, which presumably was higher than the profit from their crops. The farmer was happy, but
the loss had to be borne by the insurance company. The company successfully used clustering
techniques to identify such farmers, thus saving a lot of money.
Clustering thus has a wider scope in real life applications. Other areas where clustering is being
used are for city planning, GIS (Land use management), seismic data for mining (real mining)
and the list goes on.
Ambiguity in Clustering
How many clusters?
o Two clusters
o Four clusters
o Six clusters
Figure-30.2: Ambiguity in Clustering
As we mentioned, the spirit of clustering lies in its analysis. A common ambiguity in clustering
concerns the number of clusters, since the clusters are not known in advance. To understand the
problem, consider the example in Figure 30.2. The black dots represent individual data records or
tuples, and they are placed as a result of a clustering algorithm. Now, can you tell how many
clusters there are? Yes, two clusters; but look at the figure again and tell how many clusters
now? Yes, four clusters; you are absolutely right. Now look again and tell how many clusters?
Yes, six clusters, as shown in Figure 30.2. What does all this show? It shows that deciding upon
the number of clusters is a complex task depending on factors like the level of detail, the
application domain, etc. By level of detail I mean whether a black point represents a single
record or an aggregate. The important thing is to know how many clusters solve our problem;
understanding this solves the problem.
DESCRIPTION
Describe what is going on in a complicated database so as to increase our understanding.
A good description of a behavior will suggest an explanation as well.
Another application of DM is description. It is beneficial to know what is happening in our
databases. How? The OLAP cubes provide an ample amount of information, which is otherwise
distributed in the haystack. We can rotate the cube to different angles to get to the information
of interest. However, we might miss the angle which might have given us some useful information.
Description is used to describe such things.
warehousing is to deal with huge amounts of data. So scaling is very important, which is the ability
of the method to work efficiently even when the data size is huge.
Interpretability: It refers to the level of understanding and insight that is provided by the
method. As we discussed in clustering one of the complex and difficult tasks is the cluster
analysis. The techniques can be compared on the basis of their interpretational ability e.g. there
might be some methods which give additional functionalities to provide meaning to the
discovered information like color coding, plots and curve fittings etc
Classification is the process of finding a model (or function) that describes and distinguishes
data classes or concepts, for the purpose of being able to use the model to predict the class of
objects whose class label is unknown. The derived model is based on the analysis of a set of
training data (i.e., data objects whose class label is known).
“How is the derived model presented?” The derived model may be represented in various
forms, such as classification (IF-THEN) rules, decision trees, mathematical formulae, or
neural networks (Figure 1.10).
A decision tree is a flow-chart-like tree structure, where each node denotes a test on an
attribute value, each branch represents an outcome of the test, and tree leaves represent classes
or class distributions. Decision trees can easily be converted to classification rules. A neural
network, when used for classification, is typically a collection of neuron-like processing units
with weighted connections between the units. There are many other methods for constructing
classification models, such as naïve Bayesian classification, support vector machines, and k-
nearest neighbor classification.
distinguish each class from the others, presenting an organized picture of the data set. Suppose
that the resulting classification is expressed in the form of a decision tree. The decision tree, for
instance, may identify price as being the single factor that best distinguishes the three classes.
The tree may reveal that, after price, other features that help further distinguish objects of each
class from another include brand and place made. Such a decision tree may help you
understand the impact of the given sales campaign and design a more effective campaign for
the future.
Associative Classification
Associative classification
Association rules are generated and analyzed for use in classification
Search for strong associations between frequent patterns (conjunctions of
attribute-value pairs) and class labels
Classification: based on evaluating a set of rules of the form
p1 ∧ p2 ∧ ... ∧ pl → Aclass = C (confidence, support)
Why effective?
It explores highly confident associations among multiple attributes and may
overcome some constraints introduced by decision-tree induction, which
considers only one attribute at a time
In many studies, associative classification has been found to be more accurate
than some traditional classification methods, such as C4.5
High efficiency, accuracy similar to CMAR
RCBT (Mining top-k covering rule groups for gene expression data, Cong et al.
SIGMOD’05)
Explore high-dimensional classification, using top-k rule groups
Achieve high classification accuracy and high run-time efficiency
**************************
Bayesian Classification
A statistical classifier: performs probabilistic prediction, i.e., predicts class membership
probabilities
Standard: Even when Bayesian methods are computationally intractable, they can
provide a standard of optimal decision making against which other methods can be
measured
Classification is to determine P(H|X), the probability that the hypothesis holds given the
observed data sample X
P(X|H) (the likelihood), the probability of observing the sample X, given that the
hypothesis holds
E.g., Given that X will buy computer, the prob. that X is 31..40, medium income
Bayesian Theorem
Predicts that X belongs to Ci iff the probability P(Ci|X) is the highest among all the P(Ck|X)
for all the k classes
Practical difficulty: require initial knowledge of many probabilities, significant
computational cost
classifies data (constructs a model) based on the training set and the values (class
labels) in a classifying attribute and uses it in classifying new data
Prediction
Typical applications
Credit approval
Target marketing
Medical diagnosis
Fraud detection
Estimate accuracy of the model
The known label of test sample is compared with the classified result from
the model
Accuracy rate is the percentage of test set samples that are correctly
classified by the model
If the accuracy is acceptable, use the model to classify data tuples whose class
labels are not known
Supervised vs. Unsupervised Learning
Supervised learning (classification): the training observations are accompanied by labels
indicating their classes, and new data are classified based on the training set.
Unsupervised learning (clustering): the class labels of the training data are unknown; given a
set of measurements, observations, etc., the aim is to establish the existence of classes or
clusters in the data.
Accuracy
Speed
Interpretability
Other measures, e.g., goodness of rules, such as decision tree size or compactness of
classification rules
K-MEANS ALGORITHM
Clustering is the process of partitioning a group of data points into a small
number of clusters. For instance, the items in a supermarket are clustered in
categories (butter, cheese and milk are grouped in dairy products). Of course
this is a qualitative kind of partitioning.
Formally, k-means chooses cluster assignments C1, ..., Ck and centroids μi that minimize the within-cluster sum of squared distances:
argmin_C Σ_{i=1}^{k} Σ_{x ∈ Ci} d(x, μi) = argmin_C Σ_{i=1}^{k} Σ_{x ∈ Ci} ‖x − μi‖²
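The classical Lloyd iteration for this objective can be sketched in a few lines; the 1-D points and the choice of k below are made up for illustration:

import random

def kmeans(points, k, iters=20, seed=0):
    """Plain Lloyd's algorithm: alternate assignment and centroid update."""
    random.seed(seed)
    centroids = random.sample(points, k)
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for x in points:                              # assignment step
            i = min(range(k), key=lambda i: (x - centroids[i]) ** 2)
            clusters[i].append(x)
        centroids = [sum(c) / len(c) if c else centroids[i]   # update step
                     for i, c in enumerate(clusters)]
    return centroids, clusters

points = [1.0, 1.2, 0.8, 5.0, 5.3, 4.9, 9.1, 8.8]
print(kmeans(points, k=3))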
EXPECTATION MAXIMIZATION (EM) ALGORITHM
Given a statistical model which generates a set X of observed data, a set of unobserved latent
data or missing values Z, and a vector of unknown parameters θ, along with a likelihood function
L(θ; X, Z) = p(X, Z | θ), the maximum likelihood estimate (MLE) of the unknown parameters is
determined by the marginal likelihood of the observed data, L(θ; X) = p(X | θ) = Σ_Z p(X, Z | θ).
However, this quantity is often intractable (e.g. if Z is a sequence of events, so that the
number of values grows exponentially with the sequence length, making the exact calculation of
the sum extremely difficult).
The EM algorithm seeks to find the MLE of the marginal likelihood by iteratively applying these
two steps:
a) Expectation step (E step): calculate the expected value of the log-likelihood function with
respect to the conditional distribution of Z given X under the current estimate of the parameters
θ(t): Q(θ | θ(t)) = E_{Z|X,θ(t)} [log L(θ; X, Z)].
b) Maximization step (M step): find the parameters that maximize this quantity:
θ(t+1) = argmax_θ Q(θ | θ(t)).
The typical models to which EM is applied use Z as a latent variable indicating membership in one
of a set of groups:
The observed data points may be discrete (taking values in a finite or countably infinite
set) or continuous (taking values in an uncountably infinite set). Associated with each
data point may be a vector of observations.
The missing values (aka latent variables) are discrete, drawn from a fixed number of
values, and with one latent variable per observed unit.
The parameters are continuous, and are of two kinds: parameters that are associated with all
data points, and those associated with a specific value of a latent variable (i.e., associated
with all data points whose corresponding latent variable has that value).
The algorithm as just described monotonically approaches a local minimum of the cost
function.
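As a concrete illustration of the two steps, the sketch below runs EM for a two-component one-dimensional Gaussian mixture; the data, the fixed unit variances, and the initialization are simplifying assumptions:

import math

def em_gmm_1d(xs, iters=30):
    """EM for a 2-component 1-D Gaussian mixture with fixed unit variances."""
    mu = [min(xs), max(xs)]                 # crude initial means
    pi = [0.5, 0.5]                         # mixing weights
    for _ in range(iters):
        # E step: responsibility of each component for each point
        resp = []
        for x in xs:
            w = [pi[k] * math.exp(-0.5 * (x - mu[k]) ** 2) for k in range(2)]
            s = sum(w)
            resp.append([wk / s for wk in w])
        # M step: re-estimate mixing weights and means from the responsibilities
        for k in range(2):
            nk = sum(r[k] for r in resp)
            pi[k] = nk / len(xs)
            mu[k] = sum(r[k] * x for r, x in zip(resp, xs)) / nk
    return pi, mu

print(em_gmm_1d([0.1, 0.3, -0.2, 4.8, 5.1, 5.4]))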
*******************
SVM for complex (non-linearly separable) data: SVM works very well without any
modifications for linearly separable data. Linearly separable data is any data that
can be plotted in a graph and separated into classes using a straight line.
SVM CLASSIFIER
A vector space method for binary classification problems:
documents are represented in a t-dimensional space;
find a decision surface (hyperplane) that best separates the documents of the two classes;
a new document is classified by its position relative to the hyperplane.
Simple 2D example: training documents linearly separable.
Delimiting Hyperplanes
parallel dashed lines that delimit the region in which to look for a solution
Lines that cross the delimiting hyperplanes are candidates to be selected as the decision
hyperplane; lines that are parallel to the delimiting hyperplanes are the best candidates.
Support vectors: documents that belong to, and define, the delimiting hyperplanes.
Our example is in a 2-dimensional system of coordinates.
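For intuition, a soft-margin linear separator of this kind can be approximated with a few lines of hinge-loss sub-gradient descent; the sketch below uses made-up 2-D points and is a simplification, not the exact SVM formulation discussed in the notes:

def train_linear_svm(X, y, lr=0.01, lam=0.01, epochs=200):
    """Soft-margin linear SVM trained by sub-gradient descent on the hinge loss."""
    w = [0.0, 0.0]
    b = 0.0
    for _ in range(epochs):
        for (x1, x2), yi in zip(X, y):
            margin = yi * (w[0] * x1 + w[1] * x2 + b)
            if margin < 1:                           # point inside margin: hinge gradient
                w[0] += lr * (yi * x1 - lam * w[0])
                w[1] += lr * (yi * x2 - lam * w[1])
                b += lr * yi
            else:                                    # only the regularizer contributes
                w[0] -= lr * lam * w[0]
                w[1] -= lr * lam * w[1]
    return w, b

X = [(1, 1), (2, 1), (1, 2), (5, 5), (6, 5), (5, 6)]   # two linearly separable blobs
y = [-1, -1, -1, 1, 1, 1]
w, b = train_linear_svm(X, y)
print(w, b)                                            # sign(w.x + b) classifies new points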
FEATURE SELECTION OR DIMENSIONALITY REDUCTION
Feature selection and dimensionality reduction allow us to minimize the number of
features in a dataset by only keeping features that are important. In other words, we want to
retain features that contain the most useful information that is needed by our model to make
accurate predictions while discarding redundant features that contain little to no
information. There are several benefits in performing feature selection and dimensionality
reduction which include model
interpretability, minimizing overfitting as well as reducing the size of the training set and
consequently training time.
Dimensionality Reduction
The number of input variables or features for a dataset is referred to as its
dimensionality. Dimensionality reduction refers to techniques that reduce the number of
input variables in a dataset. More input features often make a predictive modeling task more
challenging to model, more generally referred to as the curse of dimensionality.
High-dimensionality statistics and dimensionality reduction techniques are often used for data
visualization. Nevertheless, these techniques can be used in applied machine learning to
simplify a classification or regression dataset in order to better fit a predictive model.
When dealing with high dimensional data, it is often useful to reduce the
dimensionality by projecting the data to a lower dimensional subspace which
captures the “essence” of the data. This is called dimensionality reduction.
Feature selection is different from dimensionality reduction. Both methods seek to reduce
the number of attributes in the dataset, but a dimensionality reduction method does so by
creating new combinations of attributes, whereas feature selection methods include and exclude
attributes present in the data without changing them. Examples of dimensionality reduction
methods include Principal Component Analysis, Singular Value Decomposition and Sammon’s Mapping.
Feature selection is itself useful, but it mostly acts as a filter, muting out features that
aren’t useful in addition to your existing features.
Feature Selection Algorithms
Filter Methods
Filter feature selection methods apply a statistical measure to assign a scoring to each
feature. The features are ranked by the score and either selected to be kept or removed from
the dataset. The methods are often univariate and consider the feature independently, or
with regard to the dependent variable. Some examples of filter methods include the Chi-squared
test, information gain and correlation coefficient scores.
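A minimal filter-style selector is sketched below: it ranks candidate features by the absolute value of their Pearson correlation with the target, one of the univariate scores mentioned above; the toy data are made up:

import math

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Toy data: three candidate features and a numeric target
features = {
    "f1": [1, 2, 3, 4, 5],
    "f2": [2, 1, 4, 3, 5],
    "f3": [5, 4, 3, 2, 1],
}
target = [1.1, 1.9, 3.2, 3.9, 5.1]

scores = {name: abs(pearson(vals, target)) for name, vals in features.items()}
print(sorted(scores, key=scores.get, reverse=True))   # features ranked by relevance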
Wrapper Methods
Wrapper methods consider the selection of a set of features as a search problem, where
different combinations are prepared, evaluated and compared to other combinations. A
predictive model is used to evaluate a combination of features and assign a score based on
model accuracy. The search process may be methodical, such as a best-first search; it may be
stochastic, such as a random hill-climbing algorithm; or it may use heuristics, like forward
and backward passes to add and remove features. An example of a wrapper method is the
recursive feature elimination algorithm.
Embedded Methods
Embedded methods learn which features best contribute to the accuracy of the model while
the model is being created. The most common type of embedded feature selection method is
regularization. Regularization methods are also called penalization methods,
as they introduce additional constraints into the
optimization of a predictive algorithm (such as a regression algorithm) that bias the model
toward lower complexity (fewer coefficients). Examples of regularization algorithms are the
LASSO, Elastic Net and Ridge Regression.
EVALUATION METRICS
ORGANIZING THE CLASSES TAXONOMIES
INDEXING AND SEARCHING
INVERTED INDEXES
SEARCHING
SEQUENTIAL SEARCHING
MULTI-DIMENSIONAL INDEXING
UNIT IV WEB RETRIEVAL AND WEB CRAWLING
Syllabus:
The Web
World Wide Web, which is also known as a Web, is a collection of
websites or web pages stored in web servers and connected to local computers
through the internet. These websites contain text pages, digital images, audios,
videos, etc. Users can access the content of these sites from any part of the world
over the internet using their devices such as computers, laptops, cell phones, etc.
The WWW, along with the internet, enables the retrieval and display of text and
media on your device.
The building blocks of the Web are web pages which are formatted in
HTML and connected by links called "hypertext" or hyperlinks and accessed by
HTTP. These links are electronic connections that link related pieces of
information so that users can access the desired information quickly. Hypertext
offers the advantage to select a word or phrase from text and thus to access other
pages that provide additional information related to that word or phrase.
SEARCH ENGINE ARCHITECTURES
Components Of A Search Engine
Web crawler
It is also known as a spider or bot. It is a software component that traverses
the web to gather information.
Database
All the information on the web is stored in a database. It consists of huge web
resources.
Search Interfaces
This component is an interface between the user and the database. It helps the
user to search through the database.
Architecture
The search engine architecture comprises the three basic layers listed below:
Content collection and refinement.
Search core
User and application interfaces
Text Transformation
It transforms documents into index terms or features.
Index Creation
It takes the index terms created by the text transformations and creates data
structures to support fast searching.
Query Process
The query process comprises the following three tasks:
User interaction
Ranking
Evaluation
User interaction
It supports creation and refinement of user query and displays the results.
Ranking
It uses query and indexes to create ranked list of documents.
Evaluation
It monitors and measures the effectiveness and efficiency. It is done offline.
primarily as a tool to identify a subset of documents that are likely to be
relevant, so that at the time of retrieval, only those documents will be
matched to the query.
This approach has been the most common for cluster-based retrieval.
The second approach to cluster-based retrieval is to use clusters as a form
of document smoothing.
Previous studies have suggested that by grouping documents into clusters,
differences between representations of individual documents are, in effect,
smoothed out.
exchangeable hardware components.
Figure 11.7 shows a generic search cluster architecture with its key components.
The front-end servers receive queries and process them right away if
the answer is already in the “answer cache” servers. Otherwise they route the
query to the search clusters through a hierarchical broker network. The
exact topology of this network can vary but basically, it should be designed to
balance traffic so as to reach the search clusters as fast as possible. Each
search cluster includes a load balancing server (LB in the figure) that routes
the query to all the servers in one replica of the search cluster. In this
figure, we show an index partitioned into n
clusters with m replicas. Although partitioning the index into a single cluster is
conceivable, it is not recommended as the cluster would turn out to be very
large and consequently suffer from additional management and fault tolerance
problems.
Each search cluster also includes an index cache, which is depicted
at the top as a flat rectangle. The broker network merges the results
coming from the
search clusters and sends the merged results to the appropriate front-end server
that will use the right document servers to generate the full results pages,
including snippet and other search result page artifacts. This is an example of a
more general trend to consider a whole data center as a computer.
DISTRIBUTED ARCHITECTURES
There exist several variants of the crawler-indexer architecture and
we describe here the most important ones. Among them, the most significant
early example is Harvest.
Harvest
Harvest uses a distributed architecture to gather and distribute data,
which is more efficient than the standard Web crawler architecture. The main
drawback is that Harvest requires the coordination of several Web servers.
Interestingly, the Harvest distributed approach does not suffer from some of
the common problems of the crawler-indexer architecture, such as:
increased server load caused by the reception of simultaneous requests from
different crawlers,
increased Web traffic, due to crawlers retrieving entire objects while most
content is not retained eventually, and
lack of coordination between engines, as information is gathered independently
by each crawler.
Brokers retrieve information from one or more gatherers or other brokers, updating their
indexes incrementally. Depending on the configuration of gatherers and brokers, different
improvements on server load and network traffic
can be achieved. For example, a gatherer can run on a Web server, generating
no external traffic for that server. Also, a gatherer can send information to
several brokers, avoiding work repetition. Brokers can also filter information
and send it to other brokers. This design allows the sharing of work and
information in a very flexible and generic manner. An example of the Harvest
architecture is shown in Figure 11.9.
SEARCH ENGINE RANKING
Ranking is the hardest and most important function search engines have
to execute. A first challenge is to devise an adequate evaluation process that
allows judging the efficacy of a ranking, in terms of its relevance to the
users. Without such an evaluation process it is close to impossible to fine-tune the
ranking function, which basically prevents achieving high quality results.
There are many possible evaluation techniques and measures. We cover this
topic in the context of the Web, paying particular attention to the exploitation
of user’s clicks.
Finally, the fourth issue lies in defining the ranking function and computing
it (which is different from evaluating its quality as mentioned above). While it
is fairly difficult to compare different search engines as they evolve and
operate on different Web corpora, leading search engines have to
constantly measure and compare themselves, each one using its own measure, so as to remain
competitive.
Ranking Signals
We distinguish among different types of signals used for ranking
improvements according to their origin, namely content, structure, or usage, as
follows. Content signals are related to the text itself, to the distributions of
words in the documents as has been traditionally studied in IR. The signal
in this case can vary from simple word counts to a full IR score such as
BM25. They can also be provided by the layout, that is, the HTML source,
ranging from simple format indicators (more weight given to titles/headings)
to sophisticated ones such as the proximity of certain tags in the page.
Structural signals are intrinsic to the linked structure of the Web. Some of
them are textual in nature, such as anchor text, which describe in very brief
form the content of the target Web page. In fact, anchor text is usually used
as surrogate text of the linked Web page. That implies that Web pages can
be found by searching the anchor texts associated with links that point to them,
even if they have not been crawled. Other signals pertain to the links
themselves, such as the number of in-links to or outlinks from a page.
The next set of signals comes from Web usage. The main one is the implicit feedback of the user through clicks; in our case, the clicks of main interest are the ones on the URLs of the result set.
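As an illustration of a content signal, the following is a minimal, self-contained sketch (in Python) of a BM25-style scorer; the function name and the parameter defaults k1 = 1.2 and b = 0.75 are illustrative choices, not prescribed by the text.

import math
from collections import Counter

def bm25_scores(query_terms, docs, k1=1.2, b=0.75):
    """Score each document in `docs` (lists of tokens) against the query terms."""
    N = len(docs)
    avgdl = sum(len(d) for d in docs) / N
    # document frequency of each query term
    df = {t: sum(1 for d in docs if t in d) for t in query_terms}
    scores = []
    for d in docs:
        tf = Counter(d)
        dl = len(d)
        score = 0.0
        for t in query_terms:
            if df[t] == 0 or tf[t] == 0:
                continue
            idf = math.log(1 + (N - df[t] + 0.5) / (df[t] + 0.5))
            score += idf * tf[t] * (k1 + 1) / (tf[t] + k1 * (1 - b + b * dl / avgdl))
        scores.append(score)
    return scores

# toy usage with three tokenized documents
docs = [["web", "search", "ranking"], ["ranking", "signals", "for", "web", "pages"], ["cooking", "recipes"]]
print(bm25_scores(["web", "ranking"], docs))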
LINK-BASED RANKING
Given that there might be thousands or even millions of pages available
for any given query, the problem of ranking those pages to generate a short list
is probably one of the key problems of Web IR; one that requires some kind of
relevance estimation. In this context, the number of hyperlinks that point to a
page provides a measure of its popularity and quality. Further, many links in
common among pages, and pages referenced by a same page, are often indicative of page relations with potential value for ranking purposes. Next, we present several examples of ranking techniques that exploit links, but differ on whether they are query dependent or not.
Early Algorithms
TF-IDF
Boolean spread, vector spread, and most-cited
WebQuery
HITS
A better idea is due to Kleinberg and used in HITS (Hypertext Induced Topic
Search). This ranking scheme is query-dependent and considers the set of
pages S that point to or are pointed to by pages in the answer. Pages that have many links pointing to them in S are called authorities, because they are likely to contain authoritative and, thus, relevant content. Pages that have many outgoing links are called hubs and are likely to point to relevant, similar content. A positive two-way feedback exists: better authority pages come from incoming edges from good hubs, and better hub pages come from outgoing edges to good authorities. Let H(p) and A(p) be the hub and authority values of page p. These values are defined such that the following equations are satisfied for all pages p:
H(p) = Σ_{u ∈ S : p→u} A(u)        A(p) = Σ_{v ∈ S : v→p} H(v)
where H(p) and A(p) for all pages are normalized (in the original paper, the
sum of the squares of each measure is set to one). These values can be
determined through an iterative algorithm, and they converge to the
principal eigenvector of the link matrix of S. In the case of the Web, to avoid an explosion in the size of S, a maximal number of pages pointing to the answer can be defined. This technique does not work with non-
existent, repeated, or automatically generated links. One solution is to weigh
each link based on the surrounding content. A second problem is that of
topic diffusion, because as a consequence of link weights, the result set
might include pages that are not directly related to the query (even if they
have got
high hub and authority values). A typical case of this phenomenon is when a
particular query is expanded to a more general topic that properly contains the
original answer. One solution to this problem is to associate a score with the
content of each page, like in traditional IR ranking, and combine this score
with the link weight. The link weight and the page score can be included in the
previous formula multiplying each term of the summation. Experiments show
that the recall and precision for the first ten results increase significantly. The
appearance order of the links on the Web page can also be used by dividing
the links into subgroups and using the HITS algorithm on those subgroups
instead of the original Web pages. In Table 11.2, we show the exponent of the power-law distribution of authority and hub values for different countries of the globe.
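To make the iterative computation concrete, here is a minimal, illustrative sketch (in Python) of the HITS update and normalization described above; the page identifiers and link set used at the end are hypothetical.

import math

def hits(pages, links, iterations=50):
    """Iteratively compute hub and authority scores for the pages in S.
    `links` is a collection of (source, target) pairs restricted to S."""
    H = {p: 1.0 for p in pages}
    A = {p: 1.0 for p in pages}
    for _ in range(iterations):
        # authority of p: sum of hub values of pages pointing to p
        A_new = {p: sum(H[s] for (s, t) in links if t == p) for p in pages}
        # hub of p: sum of authority values of the pages p points to
        H_new = {p: sum(A_new[t] for (s, t) in links if s == p) for p in pages}
        # normalize so that the sum of squares of each measure is one
        a_norm = math.sqrt(sum(v * v for v in A_new.values())) or 1.0
        h_norm = math.sqrt(sum(v * v for v in H_new.values())) or 1.0
        A = {p: v / a_norm for p, v in A_new.items()}
        H = {p: v / h_norm for p, v in H_new.items()}
    return H, A

pages = {"a", "b", "c"}
links = {("a", "b"), ("c", "b"), ("b", "c")}
H, A = hits(pages, links)
print(H, A)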
The total ranking score R(p, Q) of a page p with regard to a query Q can be computed as
R(p, Q) = a · TS(p, Q) + (1 − a) · LS(p),
where TS(p, Q) is a text-based (content) score, LS(p) is a link-based score, and a ∈ [0, 1]. Further, R(p, Q) = 0 if p does not satisfy Q. If we assume that all the
functions are normalized and a ∈ [0, 1], then R(p, Q) ∈ [0, 1]. Notice that this
linear function is convex in a. Also, while the first term depends on the query,
the second term does not. If a = 1, we have a pure textual ranking, which
was the typical case in the early search engines. If a = 0, we have a pure link-
based ranking that is also independent of the query. Thus, the order of the
pages is known in advance for pages that do contain Q. We can tune the value of a experimentally using labeled data as ground truth or click-through
data. In fact, a might even be query dependent. For example, for navigational
queries a could be made smaller than for informational queries.
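A small sketch of this convex combination and of a simple grid search for a over labeled data; the helper names and the "relevant page ranked first" tuning criterion are illustrative assumptions, not a prescribed method.

def combined_score(ts, ls, a):
    """R(p, Q) = a * TS(p, Q) + (1 - a) * LS(p), with a in [0, 1]."""
    return a * ts + (1 - a) * ls

def tune_a(examples, grid=(0.0, 0.25, 0.5, 0.75, 1.0)):
    """Pick the a that most often ranks a relevant page first on labeled data.
    Each example is a list of (text_score, link_score, is_relevant) tuples for one query."""
    def hits_at_1(a):
        wins = 0
        for results in examples:
            best = max(results, key=lambda r: combined_score(r[0], r[1], a))
            wins += 1 if best[2] else 0
        return wins
    return max(grid, key=hits_at_1)

examples = [[(0.9, 0.1, True), (0.2, 0.8, False)],
            [(0.4, 0.9, True), (0.8, 0.2, False)]]
print(tune_a(examples))   # value of a that works best on this toy labeled set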
LEARNING TO RANK
A rather distinct approach for computing a Web ranking is to apply
machine learning techniques for learning to rank. For this, one can use one's favorite machine learning algorithm, fed with training data that contains ranking information, to “learn” a ranking of the results, analogously to the supervised
algorithms for text classification. The loss function to minimize in this case is
the number of mistakes made by the learned algorithm, which is similar to
counting the number of misclassified instances in traditional classification.
The evaluation of the learned ranking must be done with another data set
(which also includes ranking information) distinct from the one used for
training. There exist three types of ranking information for a query Q, that
can be used for training:
Point-wise: a set of relevant pages for Q.
Pair-wise: a set of pairs of relevant pages indicating the ranking relation between the two pages. That is, the pair [p1 > p2] implies that page p1 is more relevant than p2.
List-wise: a set of ordered relevant pages: p1 > p2 > ··· > pm.
In any case, we can consider that any page included in the ranking
information is more relevant than a page without information, or we can
maintain those cases undefined. Also, the ranking information does not need
to be consistent (for example, in the pair-wise case). The training data may come from the so-called “editorial judgments” made by people or, better, from click-through data. Given that users’ clicks reflect preferences that agree in most cases with relevance judgments done by human assessors, one can consider using click-through information to generate the training data. Then, we can learn the ranking function from click-based preferences. That is, if for query Q, page p1 has more clicks than p2, then [p1 > p2].
One approach for learning to rank from clicks using the pair-wise
approach is to use support vector machines (SVMs), to learn the ranking
function. In this case, preference relations are transformed into inequalities
among weighted term vectors representing the ranked documents. These
inequalities are then translated into an SVM optimization problem, whose
solution computes optimal weights for the document terms. This approach
proposes the combination of different retrieval functions with different
weights into a single ranking function.
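A hedged sketch of this pair-wise idea using a linear SVM on difference vectors (a RankSVM-style transformation); the toy feature vectors and the preference pairs are hypothetical, and scikit-learn is assumed to be available.

import numpy as np
from sklearn.svm import LinearSVC

def pairwise_training_data(feature_vectors, preferences):
    """Turn pairwise preferences [(i, j), ...] meaning "doc i preferred over doc j"
    into difference vectors labeled +1/-1 (RankSVM-style transformation)."""
    X, y = [], []
    for i, j in preferences:
        X.append(feature_vectors[i] - feature_vectors[j]); y.append(+1)
        X.append(feature_vectors[j] - feature_vectors[i]); y.append(-1)
    return np.array(X), np.array(y)

# toy example: 3 documents described by 2 retrieval features each
docs = np.array([[0.9, 0.2], [0.5, 0.5], [0.1, 0.8]])
prefs = [(0, 1), (1, 2)]                 # e.g. preferences inferred from click-through data
X, y = pairwise_training_data(docs, prefs)
model = LinearSVC(C=1.0).fit(X, y)
weights = model.coef_[0]                 # weights combining the retrieval features
ranking = np.argsort(-docs @ weights)    # higher combined score = better rank
print(ranking)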
The point-wise approach solves the problem of ranking by means of regression or classification on single documents, while the pair-wise approach transforms ranking into a problem of classification on document pairs. The advantage of these two approaches is that they can make use of existing results in regression and classification. However, ranking has intrinsic characteristics that cannot always be addressed by the latter techniques. The list-wise approach tackles the ranking problem directly, by adopting list-wise loss functions, or directly optimizes IR evaluation measures such as average precision. However, this case is in general more complex. Some authors have proposed to use a multi-variate function, also called a relational ranking function, to perform list-wise ranking, instead of using a single-document based ranking function.
QUALITY EVALUATION
To be able to evaluate quality, Web search engines typically use
human judgments that indicate which results are relevant for a given query, or
some approximation of a “ground truth” inferred from users’ clicks, or, finally, a combination of both, as follows.
Precision at 5, 10, 20
One simple approach to evaluate the quality of Web search results is
to adapt the standard precision-recall metrics to the Web. For this, the
following observations are important:
On the Web, it is almost impossible to measure recall, as the number of relevant pages for most typical queries is prohibitive and ultimately unknown. Thus, standard precision-recall figures cannot be applied directly.
Most Web users inspect only the top 10 results, and it is relatively uncommon that a user inspects answers beyond the top 20 results. Thus, evaluating the quality of Web results beyond position 20 in the ranking is not indicated, as it does not reflect common user behavior.
Since Web queries tend to be short and vague, human evaluation of results should be based on distinct relevance assessments
for each query-result pair. For instance, if three separate assessments are made
for each query-result pair, we can consider that the result is indeed relevant to
the query if at least two of the assessments suggest so. The compounding
effect of these observations is that
(a) precision of Web results should be measured only at the top
positions in the ranking, say P@5, P@10, and P@20 and
(b) each query-result pair should be subjected to 3-5 independent relevance assessments.
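A minimal sketch of how P@k and the majority-vote relevance rule described above can be computed; the function names and the 2-out-of-3 threshold are illustrative.

def precision_at_k(ranked_results, relevant, k):
    """Fraction of the top-k results that were judged relevant (P@k)."""
    top_k = ranked_results[:k]
    return sum(1 for r in top_k if r in relevant) / k

def majority_relevant(assessments, threshold=2):
    """A result counts as relevant if at least `threshold` of the
    independent boolean assessments say so (e.g. 2 out of 3)."""
    return sum(assessments) >= threshold

ranked = ["url1", "url2", "url3", "url4", "url5"]
relevant = {"url1", "url3", "url7"}
print(precision_at_k(ranked, relevant, 5))       # P@5 = 0.4
print(majority_relevant([True, False, True]))    # True: 2 of 3 assessors agree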
Web Spam
The Web contains numerous profit-seeking ventures, so there is an
economic incentive from Web site owners to rank high in the result lists of
search engines.
All deceptive actions that try to increase the ranking of a page in search
engines are generally referred to as Web spam or spamdexing (a
portmanteau of “spamming” and “index”). The area of research that relates to
spam fighting is called Adversarial Information Retrieval, which has been
the object of several publications and workshops.
SEARCH ENGINE USER INTERACTION
Web search engines target hundreds of millions of users, most of whom have very little technical background. As a consequence, the design of the interface has been heavily influenced by an extreme simplicity rule, as follows.
In this section, we describe typical user interaction models for the most
popular Web Search engines of today, their recent innovations, and the
challenges they face to abide by this extreme simplicity rule. But we revisit
them here in more depth, in the context of the Web search experience offered
by major players such as Ask.com, Bing, Google and Yahoo! Search. We do
not discuss here “vertical” search engines, i.e., search engines restricted to
specific domains of knowledge, such as Yelp or Netflix, or major search
engines verticals, such as Google Image Search or Yahoo! Answers.
The search rectangle (the query box) has become so popular that many Web homepages now feature a rectangular search box, visible in a prominent area of the site, even if the supporting search technology is provided by a third party. To illustrate, Figure 11.10 displays
the search rectangle of the Ask, Bing, Google, and Yahoo! search engines.
The rectangle design has remained
pretty much stable for some engines, such as Google, whose main homepage has basically not changed in the last ten years. Others, like Ask and Bing, allow a more fantasy-oriented design with colorful skins and beautiful photos of
interesting places and objects (notice, for instance, the Golden Gate bridge
background of Ask in Figure 11.10).
Despite these trends, the search rectangle remains the centerpiece of the action in all engines. While the display of a search rectangle at the
center of the page is the favored layout style, there are alternatives:
Some Web portals embed the search rectangle in a privileged area of the
homepage. Examples of this approach are provided by yahoo.com or aol.com.
• Many sites include an Advanced Search page, which provides the users
with a form composed of multiple search “rectangles” and options (rarely
used).
• The search toolbars provided by most search engines as a browser plug-in,
or built-in in browsers like Firefox, can be seen as a leaner version of the
central search rectangle. By being accessible at all times, they represent a
more convenient alternative to the homepage rectangle, yet their requirement
of a download for
installation prevents wider adoption. Notice that, to compensate for this overhead, many search engines negotiate costly OEM deals with PC distributors/manufacturers to preinstall their toolbars.
• The “ultimate” rectangle, introduced by Google’s Chrome “omnibox”,
merges the functionality of the address bar with that of the search box. It
then becomes the responsibility of the browser to decide whether the text entered by the user
aims at navigating to a given site or at conducting a search. Prior to the
introduction of the Omnibox, Firefox already provided functionality to
recognize that certain words cannot be part of a URL and thus, should be
treated as part of a query. In these cases, it would trigger Google’s “I feel
lucky” function to return the top search result. Interestingly enough, this
Firefox feature is customizable, allowing users to trigger search engines other
than Google or to obtain a full page of results.
These engines might differ on small details, like the “query assistance” features, which might appear in the North, South, or West region of the page; the position of the navigational tools, which might or might not be displayed in the West region; or the position of spelling correction recommendations, which might appear before or after the sponsored results in the North region. Search engines
constantly experiment with small variations of layout, and it might be the case
that drastically different layouts be adopted in the future, as this is a space that
calls for innovative features. To illustrate, Cuil introduced a radically different
layout that departs from the one dimensional ranking, but this is more the
exception than the rule. In contrast, search properties other than the main engine commonly adopt distinct layouts, such as the image search in both Google and Yahoo!, or Google Ads search results, all of which display results across several columns. In this section, we focus exclusively on the organic part of search results. We will refer to them from now on as “search results”; note the distinction with paid/sponsored search results.
Major search engines use a very similar format to display individual
results composed basically of
(a) a title shown in blue and underlined,
(b) a short snippet consisting of two or three sentences extracted
from the result page, and
(c) a URL that points to the page containing the full text. In most cases, titles can be extracted directly from the page.
When a page does not have a title, anchor texts pointing to it can be used to generate a title.
Universal search results: Most Web search engines offer, in addition
to core Web search, other properties, such as Images, Videos, Products, Maps,
which come with their own vertical search. While users can go directly to
these properties to conduct corpus-specific searches, the “universal” vision
states that users should not have to specify the target corpus. The engine
should guess their intent and automatically return results from the most
relevant sources when appropriate. The key technical challenge here is to
select these sources and to decide how many results from each source to display.
BROWSING
In this section, we cover browsing as an additional discovery paradigm,
with special attention to Web directories. Browsing is mostly useful when users have no idea of how to specify a query (which becomes rarer and rarer in the context of the global Web), or when they want to explore a specific collection and are not sure of its scope. Nowadays, browsing is no longer the discovery paradigm of choice on the Web. Despite that, it can still be useful in specific contexts, such as that of an Intranet or in vertical domains, as we now discuss. In the case of browsing, users are willing to invest some time exploring the document space,
looking for interesting or even unexpected references. Both with browsing and
searching, the user is pursuing discovery goals. However, in search, the user’s goal
is somewhat crisper. In contrast, with browsing, the user’s needs are usually
broader. While this distinction is not valid in all cases, we will adopt it here
for the sake of simplicity. We first describe the three types of browsing
namely, flat, structure driven (with special attention to Web directories), and
hypertext driven. Following, we discuss attempts at combining searching and
browsing in a hybrid manner.
Flat Browsing
In flat browsing, the user explores a document space that follows a
flat organization. For instance, the documents might be represented as dots in
a two- dimensional plane or as elements in a single dimension list, which
might be ranked by alphabetical or by any other order. The user then glances
here and there looking for information within the visited documents. Note that
exploring search results is a form of flat browsing. Each single document can
also be explored in a flat manner via the browser, using navigation arrows
and the scroll bar.
One disadvantage is that in a given page or screen there may not be any
clear indication of the context the user is in. For example, while browsing
large documents, users might lose track of which part of the document they
are looking at. Flat browsing is obviously not available in the global Web due
to its scale and distribution, but is still the mechanism of choice when
exploring smaller sets. Furthermore, it can be used in combination with search
for exploring search results or attributes. In fact, flat browsing conducted after
an initial search allows identifying new keywords of interest. Such keywords
can then be added to the original query in an attempt to provide better
contextualization.
WEB CRAWLING
A Web crawler is a software program for downloading pages from the Web. It is also known as a Web spider, Web robot, or simply a bot.
A Web crawler, sometimes called a spider or spiderbot and often shortened to crawler, is an Internet bot that systematically browses the World
Wide Web, typically operated by search engines for the purpose of Web
indexing (web spidering).
Web search engines and some other websites use Web crawling or
spidering software to update their web content or indices of other sites' web
content. Web crawlers copy pages for processing by a search engine,
which indexes the downloaded pages so that users can search more efficiently.
Crawlers consume resources on visited systems and often visit sites without
approval. Issues of schedule, load, and "politeness" come into play when large
collections of pages are accessed. Mechanisms exist for public sites not wishing to
be crawled to make this known to the crawling agent. For example, including
a robots.txt file can request bots to index only parts of a website, or nothing at all.
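As a small illustration, a crawler can check robots.txt with Python's standard urllib.robotparser before fetching a page; the site URL and user-agent string below are hypothetical.

from urllib import robotparser

rp = robotparser.RobotFileParser()
rp.set_url("https://example.com/robots.txt")   # hypothetical site
rp.read()                                      # fetch and parse the robots.txt file

# ask whether this crawler is allowed to fetch a given page
if rp.can_fetch("MyCrawler/1.0", "https://example.com/some/page.html"):
    print("Allowed to crawl this page")
else:
    print("robots.txt asks crawlers to skip this page")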
APPLICATIONS OF A WEB CRAWLER
A Web Crawler can be used to
create an index covering broad topics (general Web search)
create an index covering specific topics (vertical Web search)
archive content (Web archival)
analyze Web sites for extracting aggregate statistics (Web characterization)
keep copies or replicate Web sites (Web mirroring)
Web site analysis
Feed crawler: checks for updates in RSS/RDF files on Web sites. This provides a more efficient strategy that avoids collecting more pages than necessary.
Focused crawler: receives as input the description of a topic, usually described by a driving query and/or a set of example documents. The crawler can operate in:
batch mode, collecting pages about the topic periodically, or
on-demand mode, collecting pages driven by a user query.
TAXONOMY
The crawlers assign different importance to issues such as freshness, quality, and volume. The crawlers can be classified according to these three axes.
A crawler would like to use all the available resources as much as possible (crawling servers, Internet bandwidth). However, crawlers should also fulfill politeness; that is, a crawler cannot overload a Web site with HTTP requests. That implies that a crawler should wait a small delay between two requests to the same Web site. Later we will detail other aspects of politeness.
ARCHITECTURE AND IMPLEMENTATION
The crawler is composed of three main modules:
downloader,
storage, and
scheduler
Scheduler: maintains a queue of URLs to visit
Downloader: downloads the pages
Storage: indexes the pages and provides the scheduler with metadata on the pages retrieved.
In the short-term scheduler, enforcement of the politeness policy requires
maintaining several queues, one for each site, and a list of pages to download
in each queue.
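A minimal, illustrative sketch (in Python) of the three modules and of the per-site politeness queues described above, assuming a fixed delay between requests to the same site; it omits parsing, URL extraction, and error handling.

import time
import urllib.request
from collections import deque, defaultdict

class Scheduler:
    """Maintains one queue of URLs per site to enforce politeness."""
    def __init__(self, delay=2.0):
        self.queues = defaultdict(deque)       # site -> queue of URLs
        self.last_access = defaultdict(float)  # site -> time of last request
        self.delay = delay
    def add(self, url):
        site = url.split("/")[2]               # host part of an http(s) URL
        self.queues[site].append(url)
    def next_url(self):
        for site, q in self.queues.items():
            if q and time.time() - self.last_access[site] >= self.delay:
                self.last_access[site] = time.time()
                return q.popleft()
        return None                            # no site is ready yet

def downloader(url):
    """Downloads one page (no error handling in this sketch)."""
    with urllib.request.urlopen(url, timeout=10) as resp:
        return resp.read()

storage = {}                                   # URL -> content; a real crawler would also index it
scheduler = Scheduler(delay=2.0)
scheduler.add("https://example.com/index.html")   # hypothetical seed URL
url = scheduler.next_url()
if url:
    storage[url] = downloader(url)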
SCHEDULING ALGORITHMS
A Web crawler needs to balance various objectives that contradict each other. It must download new pages and seek fresh copies of already downloaded pages. It must use network bandwidth efficiently, avoiding the download of bad pages. However, the crawler cannot know which pages are good without first downloading them. To further complicate matters, there is a huge number of pages being added, changed, and removed every day on the Web.
Crawling the Web, in a certain way, resembles watching the sky on a clear night: the star positions that we see reflect the state of the stars at different times.
CRAWLING EVALUATION
The diagram below depicts an optimal Web crawling scenario for a hypothetical batch of five pages. The x-axis is time and the y-axis is speed, so the area of each page is its size (in bytes).
Let us consider now a more realistic setting in which:
The speed of download of every page is variable and bounded by the effective bandwidth to a Web site (a fraction of the bandwidth that the crawler would like to use can be lost).
Pages from the same site cannot be downloaded right away one after the other (politeness policy).
By the end of the batch of pages, it is very likely that only a few hosts are active
Once a large fraction of the pages has been downloaded, it is reasonable to stop the crawl if only a few hosts remain at the end. In particular, if the number of hosts remaining is very small, then the bandwidth cannot be used completely.
UNIT – V : RECOMMENDER SYSTEM
• Recommender Systems (RSs) are software tools and techniques that provide suggestions for items likely to be of use to a user.
• The suggestions relate to various decision-making processes, such as:
• What items to buy?
• What music to listen to?
• What online news to read?
• Both the users and the services provided have benefited from these kinds of
systems.
• The quality of the decision-making process has also improved through these kinds of systems.
• In the above image, a user has searched for a laptop with 1TB HDD, 8GB RAM,
and an i5 processor for 40,000₹.
• The system has recommended the 3 most similar laptops to the user.
• Thus, a perfect match may not necessarily be recommended here, only the most similar items.
How is user and item matching done?
• In order to understand how an item is recommended and how the matching is done, let us take a look at the images below, which illustrate the matching approaches:
a) Content-based, b) Collaborative filtering, c) Hybrid
Other types of RS are:
For example:
• YouTube: Trending videos.
• It does not suffer from the cold-start problem, which means that even on day 1 of the business it can recommend products using various filters.
• There is no need for the user's historical data.
• The result: rising demand for an obscure book (a not well-known book).
• It is an example of an entirely new economic model for the media and entertainment industries:
• What do consumers want, and how do they want to get it, in service after service, from DVDs at Netflix to music videos on Yahoo!?
Solution:
Birth of Recommender System
Rule 2: Cut the price of needed / not needed information in half. Now lower it.
b) Evaluating the results: evaluate the results of your information retrieval (number and
relevance of search results)
c) Locating publications: find out where and how the required publication, e.g. article,
can be acquired.
COMPONENTS OF INFORMATION RETRIEVAL PROCESS
First of all,
• we must distinguish between the role played by the RS on behalf of the service provider and its role for the user of the RS.
• A service provider typically introduces a RS with a commercial goal, i.e., to sell more hotel rooms, or to increase the number of tourists to the destination.
• There are various reasons as to why service providers may want to exploit this
technology:
• Increase the number of items sold.
• Sell more diverse items.
• Increase the user satisfaction.
• Increase user fidelity.
• Better understand what the user wants.
• In any case, as a general classification, data used by RSs refers to three kinds of
objects:
a) items,
b) users, and
c) transactions, i.e., relations between users and items.
a) Ratings
Ratings have been the most popular source of knowledge for RSs to represent users' preferences, from the early 1990s to more recent years.
The foundational RS algorithm, collaborative filtering, tries to find like-minded users by correlating the ratings that users have provided in a system.
The goal of the algorithm is to predict users' ratings, under the assumption that this is a good way to estimate the interest that a user will show in a previously unseen item.
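A minimal sketch of this idea, user-based collaborative filtering with Pearson correlation and mean-centered rating prediction; the rating dictionary is a toy example.

import math

def pearson(u, v, ratings):
    """Pearson correlation between two users over their co-rated items."""
    common = set(ratings[u]) & set(ratings[v])
    if len(common) < 2:
        return 0.0
    mu_u = sum(ratings[u][i] for i in common) / len(common)
    mu_v = sum(ratings[v][i] for i in common) / len(common)
    num = sum((ratings[u][i] - mu_u) * (ratings[v][i] - mu_v) for i in common)
    den = math.sqrt(sum((ratings[u][i] - mu_u) ** 2 for i in common)) * \
          math.sqrt(sum((ratings[v][i] - mu_v) ** 2 for i in common))
    return num / den if den else 0.0

def predict(user, item, ratings):
    """Predict `user`'s rating for `item` from like-minded users' ratings."""
    mean_u = sum(ratings[user].values()) / len(ratings[user])
    num = den = 0.0
    for other in ratings:
        if other == user or item not in ratings[other]:
            continue
        w = pearson(user, other, ratings)
        mean_o = sum(ratings[other].values()) / len(ratings[other])
        num += w * (ratings[other][item] - mean_o)
        den += abs(w)
    return mean_u + num / den if den else mean_u

ratings = {"alice": {"m1": 5, "m2": 3, "m3": 4},
           "bob":   {"m1": 4, "m2": 2, "m3": 5, "m4": 4},
           "carol": {"m1": 1, "m2": 5, "m4": 2}}
print(predict("alice", "m4", ratings))   # estimated rating for an unseen item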
b) Implicit Feedback
This source of knowledge refers to actions that the user performs over items, but that
cannot be directly interpreted as explicit interest, i.e., the user explicitly stating her preference or the relevance of an item.
c) Social Tags
Social Tagging systems (STS) allow users to attach free keywords, also known as
tags, to items that users share or items that are already available in the system.
A well-known example of an STS is last.fm (music).
Social Recommender Systems (SRSs) are recommender systems that target the social
media domain.
The main goals for these systems are to improve recommendation quality and solve
the social information overload problem.
These recommender systems provide people, web pages, items, or groups as
recommendations to users.
• The value of an item may be positive, if the item is useful for the user, or negative, if the item is not appropriate and the user made a wrong decision when selecting it.
• We note that when a user is acquiring an item she will always incur a cost, which includes the cognitive cost of searching for the item and the real monetary cost eventually paid for the item.
• For instance,
• the designer of a news RS must take into account the complexity of a news item,
i.e., its structure, the textual representation, and the time-dependent importance of
any news item.
• But, at the same time,
• the RS designer must understand that even if the user is not paying for
reading news,
• there is always a cognitive cost associated to searching and reading news
items.
• Items with low complexity and value are: news, Web pages, books, CDs,
movies.
• Items with larger complexity and value are:
a) digital cameras
b) mobile phones
c) PCs
• The most complex items that have been considered are:
a) insurance policies,
b) financial investments,
c) travels, jobs.
1. ITEMS
2. USERS
3. TRANSACTIONS
1) ITEMS
• Items are the objects that are recommended; they may be characterized by their complexity and by their value or utility for the user.
• In other domains,
• e.g., cars, or financial investments,
• the true monetary cost of the items becomes an
important element to consider when selecting the
most appropriate recommendation approach.
• Items can be represented using various information and representation
approaches:
• e.g., in a minimalist way as a single id code, or in a richer form,
as a set of attributes,
• but even as a concept in an ontological representation of the
domain.
2) Users
• Users of a RS may have very diverse goals and characteristics.
• In order to personalize the recommendations and the human-computer interaction, RSs exploit a range of information about the users.
• This information can be structured in various ways, and again the selection of what information to model depends on the recommendation technique.
3) Transactions
• For instance,
• a transaction log may contain a reference to the item selected by
the user and a description of the context
• (e.g., the user goal/query) for that particular recommendation.
• If available,
• that transaction may also include
• an explicit feedback the user has provided,
• such as the rating for the selected item.
• In fact, ratings are the most popular form of transaction data that a RS collects.
• These ratings may be collected explicitly or implicitly.
• In the explicit collection of ratings,
• the user is asked to provide her opinion about an item on a rating
scale.
• Numerical ratings such as the 1-5 stars provided in the book recommender
associated with Amazon.com.
• Ordinal ratings, such as “strongly agree, agree, neutral, disagree, strongly
disagree”
• where the user is asked to select the term that best indicates her
opinion regarding an item (usually via questionnaire).
• Binary ratings that model choices in which the user is simply asked to decide
if a certain item is good or bad.
• Unary ratings can indicate that a user has observed or purchased an item, or
otherwise rated the item positively.
• In such cases, the absence of a rating indicates that we have no information
relating the user to the item (perhaps she purchased the item somewhere else).
RECOMMENDATION TECHNIQUES
In order to implement its core function, identifying useful items for the user, a RS must predict that an item is worth recommending. The main techniques for doing this are listed below.
DIFFERENT TYPES OF RECOMMENDATION TECHNIQUES
a) Content-based
b) Collaborative filtering
c) Demographic
d) Knowledge-based
e) Community-based
a) Content-based:
• The system learns to recommend items that are similar to the ones that the user liked
in the past.
• The similarity of items is calculated based on the features associated with the
compared items.
For example, if a user has positively rated a movie that belongs to the comedy genre, then
the system can learn to recommend other movies from this genre.
b) Collaborative filtering
• The system recommends to the active user the items that other users with similar tastes liked in the past.
• The similarity in taste between two users is calculated based on the similarity in their rating histories.
c) Demographic
• This type of system recommends items based on the demographic profile of the
user.
• The assumption is that different recommendations should be generated for
different demographic niches.
• Many Web sites adopt simple and effective personalization solutions based on
demographics.
• For example,
• users are dispatched to particular Web sites based on their
language or country.
• Or suggestions may be customized according to the age of the
user.
• While these approaches have been quite popular in the marketing
literature,
• there has been relatively little proper RS research into
demographic systems.
d) Knowledge-based
• These systems estimate how much the user's needs (the problem description) match the recommendations (the solutions of the problem).
• Here the similarity score can be directly interpreted as the utility of the
recommendation for the user.
e) Community-based
• This type of system recommends items based on the preferences of the user's friends.
• This technique follows the epigram
• “Tell me who your friends are, and I will tell you who you are”.
• Evidence suggests that people tend to rely more on recommendations from
their friends than on recommendation from similar but anonymous
individuals.
• This observation, combined with the growing popularity of open social networks, is generating a rising interest in community-based systems or, as they are usually referred to, social recommender systems.
• This type of RSs models and acquires information about the social relations of
the users and the preferences of the user’s friends.
• The recommendation is based on ratings that were provided by the user’s
friends.
• In fact these RSs are following the rise of social-networks and enable a simple
and comprehensive acquisition of data related to the social relations of the
users.
• Hybrid recommender systems:
• These RSs are based on the combination of the above mentioned techniques.
• A hybrid system combining techniques A and B tries to use the advantages of A
to fix the disadvantages of B.
• For instance, CF methods suffer from new-item problems, i.e., they cannot
recommend items that have no ratings.
• This does not limit content-based approaches since the prediction for new
items is based on their description (features) that are typically easily available.
• Given two (or more) basic RSs techniques, several ways have been proposed
for combining them to create a new hybrid.
• The aspects that apply to the design stage include factors that might affect the
choice of the algorithm.
• The first factor to consider, the application’s domain, has a major effect on the
algorithmic approach that should be taken.
• Based on the specific application domains,
• we define more general classes of domains for the most common
recommender systems applications:
• Entertainment - recommendations for movies, music, and IPTV.
• Content - personalized newspapers, recommendation for
documents, recommendations of Web pages, e-learning
applications, and e-mail filters.
• E-commerce - recommendations for consumers of products to
buy such as books, cameras, PCs etc.
• Services - recommendations of travel services, recommendation
of experts for consultation, recommendation of houses to rent, or
matchmaking services.
• What is RS?
• Recommender systems have the effect of
• guiding users in a personalized way to interesting objects in a
large space of possible options.
• What is a C-BRS (Content-Based Recommender System)?
• Content-based recommendation systems try to
• recommend items similar to those a given user has liked in
the past.
• A Content-Based Recommender works with the data that we take from the user, either explicitly (ratings) or implicitly (clicking on a link).
• From this data, we create a user profile, which is then used to make suggestions to the user; as the user provides more input or takes more actions on the recommendations, the engine becomes more accurate.
User Profile
Item Profile
• For each item we build a profile, which represents the important characteristics of that item.
• Example: if we make a movie as an item then
• its actors, director, release year and genre are the most
significant features of the movie.
• We can also add its rating from the IMDB (Internet Movie Database) in the
Item Profile.
• A predefined DATASET for an ITEM is readily available.
Utility Matrix
• The Utility Matrix signifies the user's preference for certain items.
• In the data gathered from the user,
• we have to find some relation between the items which are liked
by the user and those which are disliked, for this purpose we use
the utility matrix.
• In it, we assign a particular value to each user-item pair; this value is known as the degree of preference.
• Then we draw a matrix of a user with the respective items to identify their
preference relationship.
• Method 1:
We can use the cosine distance between the vector of the item and that of the user to determine its preference for the user, as sketched below.
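A small sketch of Method 1, assuming the user profile and the items are represented as vectors over the same hypothetical feature space.

import math

def cosine_similarity(u, v):
    """Cosine similarity between a user-profile vector and an item vector."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

# hypothetical feature space: [comedy, action, drama, sci-fi]
user_profile = [0.8, 0.1, 0.4, 0.0]
items = {"movie_a": [1, 0, 0, 0], "movie_b": [0, 1, 0, 1], "movie_c": [0.5, 0, 0.5, 0]}
ranked = sorted(items, key=lambda m: cosine_similarity(user_profile, items[m]), reverse=True)
print(ranked)   # items most aligned with the user's preferences come first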
• Method 2:
We can use a classification approach in the recommendation system:
• for example, we can use a Decision Tree to find out whether a user wants to watch a movie or not,
• where at each level we apply a certain condition to refine our recommendation (see the sketch below).
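A small sketch of Method 2 using a decision tree classifier from scikit-learn (assumed to be available); the item features and feedback labels are hypothetical.

from sklearn.tree import DecisionTreeClassifier

# hypothetical item features: [is_comedy, is_action, duration_minutes, imdb_rating]
watched_items = [[1, 0, 95, 7.1], [0, 1, 130, 6.0], [1, 0, 100, 8.2], [0, 1, 140, 5.5]]
liked         = [1, 0, 1, 0]      # the user's past feedback on those movies

clf = DecisionTreeClassifier(max_depth=3).fit(watched_items, liked)

candidate = [[1, 0, 110, 7.8]]    # a new movie the user has not seen
print(clf.predict(candidate))     # 1 -> recommend, 0 -> do not recommend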
• The recommendation process basically consists in matching up the attributes of a user profile, in which preferences and interests are stored, with the attributes of a content object (item), in order to recommend to the user new interesting items.
• A content based recommender works with data that the user provides, either
explicitly (rating) or implicitly (clicking on a link).
• Based on that data, a user profile is generated, which is then used to make
suggestions to the user.
IMPLEMENTATION OF CONTENT–BASED RECOMMENDER SYSTEMS
• Content-based recommendation systems try to
• recommend items similar to those a given user has liked in the
past.
• Systems designed according to the collaborative recommendation paradigm
identify:
• users whose preferences are similar to those of the given user and
recommend items they have liked.
• In Content-Based Recommender,
• we must build a profile for each item, which will represent the important
characteristics of that item.
• For example, if we make a movie as an item then its actors, director, release
year and genre are the most significant features of the movie.
• For instance, it could be used to filter search results by deciding whether a user is interested in a specific Web page or not and, in the negative case, preventing it from being displayed.
1) CONTENT ANALYZER
2) PROFILE LEARNER
3) FILTERING COMPONENT
a) A Content Analyzer, which gives us a classification of the items, using some sort of representation (more on this later in this section).
b) A Profile Learner, which builds a profile that represents each user's preferences.
c) A Filtering Component, which takes all the inputs and generates the list of recommendations for each user.
1) CONTENT ANALYZER
When information has no structure (e.g. text), some kind of pre-processing step is needed
to extract structured relevant information.
2) PROFILE LEARNER
3) FILTERING COMPONENT
• This module exploits the user profile to
• suggest relevant items by matching the profile representation
against that of items to be recommended.
• The result is a binary or continuous relevance judgment (computed using some
similarity metrics [42]), the latter case resulting in a ranked list of potentially
interesting items.
• In the above-mentioned example, the matching is realized by computing the cosine similarity between the prototype vector and the item vectors (see the sketch below).
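A minimal end-to-end sketch of these three components, assuming scikit-learn is available: TF-IDF item representations (CONTENT ANALYZER), a prototype vector averaged over the liked items (PROFILE LEARNER), and cosine-similarity matching (FILTERING COMPONENT). The item descriptions are hypothetical.

import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# hypothetical item descriptions produced by the content analysis step
items = {"d1": "space adventure with aliens and starships",
         "d2": "romantic comedy set in Paris",
         "d3": "galactic war epic with starships and robots"}

vectorizer = TfidfVectorizer()
item_matrix = vectorizer.fit_transform(items.values())

# profile learner: prototype vector = average of the items the user liked (here: d1)
liked_idx = [0]
profile = np.asarray(item_matrix[liked_idx].mean(axis=0))

# filtering component: rank items by cosine similarity to the profile
scores = cosine_similarity(profile, item_matrix).ravel()
for name, score in sorted(zip(items, scores), key=lambda x: -x[1]):
    print(name, round(score, 3))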
• The first step of the recommendation process is the one performed by the CONTENT ANALYZER, which usually borrows techniques from Information Retrieval systems.
• Item descriptions coming from Information Source are processed by the
CONTENT ANALYZER, that extracts features (keywords, n-grams, concepts,
. . . ) from
• unstructured text to produce a structured item representation,
stored in the repository Represented Items.
• In order to construct and update the profile of the active user (user for which
recommendations must be provided)
• her reactions to items are collected in some way and recorded in
the repository Feedback.
• These reactions, called annotations or feedback, together with the
related item descriptions, are exploited during the process of learning a
model useful to predict the actual relevance of newly presented items.
1) Explicit feedback: the system requires the user to explicitly evaluate items, for example through numeric ratings, like/dislike judgments, or text comments.
2) Implicit feedback: It does not require any active user involvement, in the sense that
feedback is derived from monitoring and analyzing user’s activities.
• Alternatively, symbolic ratings are mapped to a numeric scale, such as in
Syskill & Webert, where users have the possibility of rating a Web page as hot,
lukewarm, or cold;
• text comments – Comments about a single item are collected and presented to
the users as a means of facilitating the decision-making process.
• For instance, customer’s feedback at Amazon.com or eBay.com might help
users in deciding whether an item has been appreciated by the community.
• Textual comments are helpful, but they can overload the active user because
she must read and interpret each comment to decide if it is positive or negative,
and to what degree.
ADVANTAGES AND DRAWBACKS OF CONTENT-BASED FILTERING
AN INTRODUCTION TO CONTENT BASED FILTERING
• Content-based filtering is a type of recommender system that attempts to
guess what a user may like based on that user's activity.
• Content-based filtering makes recommendations by
• using keywords and attributes assigned to objects in a database
• e.g., items in an online marketplace and
• matching them to a user profile.
Advantages of Content-based Filtering
a) INDEPENDENCE
b) TRANSPARENCY
c) NEW ITEM
Drawbacks of Content-based Filtering
a) LIMITED CONTENT ANALYSIS
• No content-based RS can provide suitable suggestions if the analyzed content does not contain enough information to discriminate items the user likes from items the user does not like.
• Some representations capture only certain aspects of the content,
• but there are many others that would influence a user’s
experience.
• For instance, often there is not enough information in the word frequency to model the user's interest in jokes or poems, while techniques for affective computing would be most appropriate.
• Again, for Web pages, feature extraction techniques from text completely
ignore aesthetic qualities and additional multimedia information.
• To sum up, both automatic and manual assignment of features to items may not be sufficient to define the distinguishing aspects of items that turn out to be necessary for the elicitation of user interests.
b) OVER-SPECIALIZATION
• To give an example:
• when a user has only rated movies directed by Stanley Kubrick,
she will be recommended just that kind of movies.
• A “perfect” content-based technique would rarely find anything
novel, limiting the range of applications for which it would be
useful.
c) NEW USER
• Enough ratings have to be collected before a content-based recommender system can really understand user preferences and provide accurate recommendations.
• Therefore, when few ratings are available, as for a new user, the system will not be able to provide reliable recommendations.
*************************************