Chapter 2
Representation
Sub topics
Overview
Manual Vs Automatic indexing
Indexing Languages and Methods
Index Term selection process
Lexical analysis (tokenization)
Use of stopwords list/stopword remover
Conflation
Overview
What do we need to know to effectively
handle text representation?
Content Analysis
Automated Transformation of raw text into a
form that represents some aspect (s) of its
meaning
Including, but not limited to:
Term selection (index term)
Automated Thesaurus Generation
Categorization/Clustering
Summarization
Cont…
Before a computerized information retrieval system can
actually operate to retrieve some information,
that information must have already been stored
inside the computer.
Originally it will usually have been in the form of
documents.
The computer, however, is not likely to have stored the
complete text of each document in the natural
language in which it was written for retrieval purposes.
It will have, instead, a document representative, which
may have been produced from the documents either
manually or automatically
Knowledge organization tools
Index (document containing representatives of
information items)
Cont…
Subject heading lists (are knowledge organization
tools which are designed to provide a set of controlled
(managed) terms to represent the subject content of
items in a collection)
What is Indexing?
Indexing
Is the art of organizing information.
Is an association of descriptors (keywords, concepts,
metadata) to documents in view of future retrieval
Is a process of constructing document surrogates by
assigning identifiers to text items
Is the notion of storing data in a particular way in order
to locate and retrieve the data as efficiently as possible
Is the process of analyzing the information content and
expressing it in the language of the indexing system used
in the IR system (IRS)
Gives access points to a collection that are expected to
be most useful to the users of the information
Indexing: From text to index
[Figure: Text → Indexing (actions) → Index]
Indexes
Search systems rarely search document collections
directly.
Instead an index is built of the documents in the
collection and the user searches the index.
[Figure: an index is created from the document collection; the user searches the index]
Documents can be digital (e.g., web pages) or physical (e.g., books)
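The idea of searching an index rather than the documents themselves can be sketched as follows. This is a minimal illustration, not a description of any particular search system: the toy collection and the function names `build_index` and `search` are assumptions made for the example.

```python
from collections import defaultdict

def build_index(collection):
    """Map each word to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in collection.items():
        for word in text.lower().split():
            index[word].add(doc_id)
    return index

def search(index, word):
    """Look the word up in the index instead of scanning the documents."""
    return sorted(index.get(word.lower(), set()))

# Hypothetical toy collection, for illustration only.
collection = {
    1: "information retrieval systems",
    2: "database systems",
    3: "retrieval of text",
}
index = build_index(collection)
print(search(index, "retrieval"))
print(search(index, "systems"))
```

The index is built once; each query then costs a dictionary lookup rather than a pass over every document.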
Why indexing?
Cannot use the document for search as it is
Need some representation of content
Indexing can be used in
Finding documents by topic (how?)
Relate documents to each other (how ?)
Determine relevance between documents and
information needs (how?)
Enumerate some basic
characteristics of an Index
term (describe what an index
term should look like)
Important features of a good
document representative
Discriminating power: the representative keyword
(index term) should be able to identify a document
uniquely and reduce ambiguity
Ex. ISBN numbers for a book
Descriptiveness: it should describe the information in
a document as completely and correctly as possible
Similarity Identification: should be able to group
similar documents together.
Ex. Book classification number
Conciseness: should be simple and clear. This will
reduce process time and storage space
Ex. Author, title
Relationship of features of a good
document representative
Good balancing is required.
Higher discrimination power may lower the capability
of identifying similarities among documents.
Good descriptiveness may defeat the conciseness of
the representative
What is good for the computer may not always be
good for the user.
Remark: A good representation should seek a
balance of the four, and take into consideration
both the computer and the user.
Assumption While organizing
knowledge using index terms
Indexing can be
Manual
Automatic
Manual vs Automatic
Indexing
What are the basic features of these
indexing approaches?
Manual indexing
Indexers decide which keywords to assign to
documents based on controlled (managed)
vocabulary
A controlled (managed) vocabulary is a finite set
of index terms from which all index terms must
be selected
Human indexers assign index terms to documents
Indexers will be given
Input sheets
Manuals and instructions
Printed thesaurus
Is usually done in the library environment
Limitations of manual indexing
Slow and expensive (significant cost)
Is based on intellectual judgment and semantic
interpretation (concepts, themes)
Prior knowledge is needed of the subject, the terms
that will be used by users, indexing vocabularies,
and collection characteristics
High probability of inconsistency among indexers
(maintaining consistency is difficult)
Automatic indexing
The aim of automatic indexing is to build indexes and
retrieve information without human intervention
When the information that is being searched is text,
methods of automatic indexing can be very effective.
Much of the fundamental research in automatic
indexing was carried out by Gerard Salton (Professor of
CS at Cornell) and his graduate students.
Using computers for indexing, i.e. automatic indexing
The system extracts “typical significant” terms
The human may contribute by setting the parameters
or thresholds, or by choosing components or
algorithms
Cont…
Consists of two (major) processes, namely
selecting and assigning terms or concepts
capable of representing document content
Need for automatic indexing
Information overload
Enormous amount of information is being
generated from day to day activities
Explosion of machine-readable text
Massive information available in electronic
format and on Internet.
Cost effectiveness
Human indexing is expensive and labor
intensive
Advantages of Automatic Indexing
Reduced processing time (Fast)
At most a few seconds vs. at least a few minutes
Reduced cost (inexpensive)
Once initial hardware cost is amortized, operational cost is
cheaper than wages for human indexers
Indexing entries are generated at a lower cost than manual
indexing
Easy to maintain
Improved consistency: algorithms select index terms much more
consistently than humans.
Better retrieval (achieved)
Mechanical execution of algorithms, with no intelligent
interpretation (aboutness/relevance)
Indexing language
It refers to a system of naming subjects and the
complete set of terms used to describe subject matter
within a particular database
Title based indexing
Indexing words are selected from the title.
Example: considering the following titles representing their
documents
Manual system analysis
Introduction to system analysis
A structured approach to system analysis
System analysis
The significant words in each title are the same and might be
used as the basis for retrieval.
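The title example above can be sketched in code. This is only an illustration under stated assumptions: the tiny `STOPWORDS` set and the helper `title_terms` are hypothetical, and a real system would use a fuller stoplist.

```python
# Hypothetical mini stoplist; a real system would use a fuller one.
STOPWORDS = {"a", "to", "the", "of"}

titles = [
    "Manual system analysis",
    "Introduction to system analysis",
    "A structured approach to system analysis",
    "System analysis",
]

def title_terms(title):
    """Return the lowercased title words that are not stopwords."""
    return {w for w in title.lower().split() if w not in STOPWORDS}

# Significant words that occur in every one of the titles.
shared = set.intersection(*(title_terms(t) for t in titles))
print(sorted(shared))
```

The words shared by all four titles are the ones that could serve as the common basis for retrieval.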
Subject indexing (basically automatic)
Pre-coordinate systems:
Concepts or terms are combined to form compound classes
or descriptions by the indexer at the indexing stage
Each entry (term/code) represents full contents of the
document concerned
The indexer plays an important role in coordinating index
terms
An individual card in the catalogue represents individual
documents
Example:
Dewey Decimal Classification (read more on this)
Library of Congress Classification
Post-coordinate systems
Concepts or terms are combined to form compound classes or
descriptions by the searcher at the searching stage
Keywords are prepared separately for representing the
subject of a document
The keywords are coordinated at search time
Example:
Computer-based search systems
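Coordination at search time can be sketched as a Boolean AND over keywords that were assigned independently. The `postings` data and the function name `coordinate` are assumptions made for this illustration.

```python
# Hypothetical keyword assignments: each keyword is stored on its own,
# mapped to the ids of the documents it was assigned to.
postings = {
    "computer": {1, 2, 4},
    "network": {2, 3, 4},
    "security": {2, 5},
}

def coordinate(postings, keywords):
    """Combine independently assigned keywords with AND at search time."""
    sets = [postings.get(k, set()) for k in keywords]
    return set.intersection(*sets) if sets else set()

print(sorted(coordinate(postings, ["computer", "network"])))
print(sorted(coordinate(postings, ["computer", "network", "security"])))
```

The searcher, not the indexer, decides which keywords to combine, so the same postings can answer many different compound queries.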
Effectiveness of an indexing system
Measured in terms of Exhaustiveness and
Specificity
Exhaustiveness
The degree to which the subject matter of a given
document has been reflected through the index terms.
An exhaustive indexing system is supposed to represent the
content of the input document fully.
Thus it needs the selection of as many keywords as possible.
Specificity
The degree to which the subject matter of a given
document is represented by specific terms
Cont…
What is the relation of these concepts to the IRS
performance measures recall and precision?
An exhaustive system tends to retrieve more
documents, that is, high recall with lower precision
Term specificity favors high precision
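The trade-off can be made concrete with the usual definitions: precision is the fraction of retrieved documents that are relevant, and recall is the fraction of relevant documents that are retrieved. The document ids below are made up purely to illustrate the two tendencies; they are not measurements.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for a set of retrieved document ids."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall

# Hypothetical data: documents 1-4 are the relevant ones.
relevant = {1, 2, 3, 4}
broad = {1, 2, 3, 4, 5, 6, 7, 8}   # exhaustive indexing: many documents match
narrow = {1, 2}                    # specific terms: few documents match

print(precision_recall(broad, relevant))
print(precision_recall(narrow, relevant))
```

In this toy case the broad result set has full recall but half precision, while the narrow one has full precision but half recall.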
Indexing -Term Selection
Process
How do IR systems select index terms?
Content/subject analysis and
representation- using term selection
is the analysis of the contents embedded in a document or
identification of subject matter in document texts through
terms.
Origin-Term Selection (Hans Peter
Luhn (1896-1964))
Cont…
According to Luhn, the distribution pattern of a
word could give significant information about the
property of being content bearing.
He stated that high-frequency words tend to be common and
unimportant
because they don’t discriminate sufficiently between
documents
He also recognized that one or two occurrences of a word in a
relatively long text could not be taken as significant in defining
the subject matter.
And this is because they are unlikely to be specified in
search statements (queries).
Cont…
Motivation
In addition to the unequal capacity of terms in a
document,
using the set of all words in a collection to index its
documents generates too much noise for the
retrieval task
According to Luhn the most important words for
document representation (indexing) are those,
which occur with intermediate frequencies.
This is supported by Zipf’s law.
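Luhn's preference for intermediate frequencies can be sketched as a pair of cut-offs on word counts. The cut-off values `lower` and `upper` and the token counts below are assumptions chosen for the illustration; the slides do not give specific thresholds.

```python
from collections import Counter

def luhn_select(tokens, lower=2, upper=4):
    """Keep words whose frequency lies between the two cut-offs (inclusive)."""
    counts = Counter(tokens)
    return {word for word, freq in counts.items() if lower <= freq <= upper}

# Hypothetical token stream: very frequent, intermediate, and rare words.
tokens = (["the"] * 6 + ["of"] * 5 + ["index"] * 3
          + ["retrieval"] * 2 + ["luhn", "zipf"])

print(sorted(luhn_select(tokens)))
```

Words above the upper cut-off are treated as too common to discriminate, and words below the lower cut-off as too rare to be significant.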
Zipf’s law in IR
In relation to Luhn’s idea, Zipf,
another personality in IR,
provides us a law which states that the product of
the frequency of use of a word and its rank order
is approximately constant.
Accordingly, for a word with frequency f and rank r, f × r ≈ C for some constant C.
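A rank-frequency table makes the law easy to inspect. The sketch below ranks words by frequency and reports rank × frequency for each. The token counts are idealized and were constructed to follow the law exactly; real text follows it only approximately. The function name `rank_frequency` is an assumption for this example.

```python
from collections import Counter

def rank_frequency(tokens):
    """Return (rank, word, frequency, rank * frequency) rows, most frequent first."""
    counts = Counter(tokens)
    rows = []
    for rank, (word, freq) in enumerate(counts.most_common(), start=1):
        rows.append((rank, word, freq, rank * freq))
    return rows

# Idealized, hypothetical counts: 12, 6, 4, 3 for ranks 1 to 4.
tokens = ["the"] * 12 + ["of"] * 6 + ["and"] * 4 + ["index"] * 3

for row in rank_frequency(tokens):
    print(row)
```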
Problems with Luhn’s selection
mechanism
Methods that Build on Zipf's Law
Basic procedures
Lexical Analysis
(Tokenization)
1st step in term selection
Lexical Analysis of the text (developing
lexical analyzer)
The major objective of this phase is identification of the
words in the text.
It is the process of converting an input stream of
characters (the text of the document) into a stream of
words (the candidate words to be adopted as index
terms) or tokens.
It begins when we have the text in electronic format.
A word or token is defined as a string of characters
separated by white space and/or punctuation
Results of the process:
Candidate index terms that can be further processed
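The definition of a token as a string of characters separated by white space and/or punctuation can be sketched with a regular expression. This is a minimal illustration; the choice to treat only ASCII letters and digits as token characters is an assumption of the example.

```python
import re

def tokenize(text):
    """Split text into tokens on white space and/or punctuation."""
    # Each maximal run of letters or digits is one token; everything
    # else (spaces, commas, colons, ...) acts as a separator.
    return re.findall(r"[A-Za-z0-9]+", text)

print(tokenize("Indexing: from text, to index!"))
```

The output is the list of candidate index terms that later steps (normalization, stopword removal) can process further.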
Develop a flowchart
representing a
lexical analyzer?
Basic Techniques/issues/steps of
lexical analysis
1. Recognition of spaces as word separations
Multiple spaces should be reduced to one space
2. Making small (minor) transformations (consideration of
digits, hyphens, punctuation, and letter case):
normalization
Converting the case of letters to either lower or upper
(with some consideration for exceptions)
Converting abbreviations and acronyms to their original
form using a machine-readable dictionary
Excluding numbers from the index term list because,
without surrounding context, they are inherently vague
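The space, case, and abbreviation steps can be sketched together. The dictionary `ABBREVIATIONS` and the exception set `KEEP_CASE` are hypothetical stand-ins for a machine-readable dictionary and a list of case exceptions; their contents are made up for the example.

```python
import re

# Hypothetical stand-ins for a machine-readable dictionary and an exception list.
ABBREVIATIONS = {"ir": "information retrieval", "cs": "computer science"}
KEEP_CASE = {"US"}

def normalize(text):
    """Collapse spaces, lowercase (with exceptions), and expand abbreviations."""
    text = re.sub(r"\s+", " ", text).strip()   # multiple spaces -> one space
    words = []
    for word in text.split(" "):
        if word in KEEP_CASE:                  # case exception: leave untouched
            words.append(word)
            continue
        word = word.lower()
        words.append(ABBREVIATIONS.get(word, word))
    return " ".join(words)

print(normalize("IR   research in the US"))
```

Keeping "US" in upper case avoids confusing it with the pronoun "us", which is the kind of exception the case-folding rule has to allow for.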
Example
Consider a user interested in documents about the number of
deaths due to car accidents between the years 1910 and 1989
Such a request is specified as the set of index terms (death,
car, accidents, years, 1910, 1989)
The presence of the numbers 1910 or 1989 in the query
could lead to the retrieval of a variety of documents that
refer to either of the two numbers
Exception
Combination of digits/numbers with text. Example:
“510 B.C.”
In such cases the numbers are clearly important index
terms and should not be removed
Solution for treating digits in text
Remove all words containing sequences of digits unless
specified otherwise
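The rule "remove words containing digits unless specified otherwise" can be sketched with an explicit exception pattern. The pattern `EXCEPTIONS` (digits followed by an era marker) is a hypothetical example of what "specified otherwise" might look like; a real system would define its own exceptions.

```python
import re

# Hypothetical exception pattern: digits followed by an era marker such as "B.C.".
EXCEPTIONS = re.compile(r"^\d+\s?(B\.?C\.?|A\.?D\.?)$", re.IGNORECASE)

def drop_numeric_terms(terms):
    """Remove terms containing digits unless they match an exception."""
    kept = []
    for term in terms:
        has_digit = any(ch.isdigit() for ch in term)
        if has_digit and not EXCEPTIONS.match(term):
            continue
        kept.append(term)
    return kept

terms = ["death", "car", "accidents", "years", "1910", "1989", "510B.C."]
print(drop_numeric_terms(terms))
```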
Cont…
Breaking up hyphenated words might be useful due
to inconsistency of usage but there are exceptions,
where words may include hyphens as an integral part.
Ex. RJ-45, B-29
The most suitable solution / procedure here is to
adopt a general rule and specify the exception on
case by case basis
Removal of punctuation marks: the standard is to
remove them, but it is better if the program
incorporates code to handle exceptions
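The "general rule plus case-by-case exceptions" approach for hyphens can be sketched as follows. The exception set `HYPHEN_EXCEPTIONS` is hypothetical and holds only the two examples from the slide.

```python
# Hypothetical exception list of terms whose hyphen is an integral part.
HYPHEN_EXCEPTIONS = {"rj-45", "b-29"}

def split_hyphens(tokens):
    """General rule: break hyphenated words; exceptions are kept whole."""
    out = []
    for tok in tokens:
        if tok.lower() in HYPHEN_EXCEPTIONS:
            out.append(tok)
        else:
            out.extend(part for part in tok.split("-") if part)
    return out

print(split_hyphens(["state-of-the-art", "RJ-45", "B-29", "index"]))
```

The same pattern (a default rule with a lookup for exceptions) could be applied to punctuation removal.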
Stopword Removal
2nd step in term selection
Removal of Stop words (Creating
Stopword list)
Motivation:
Words of a text do not have equal value for indexing
purposes.
Words that are too frequent among the documents in
the collection are not good discriminators
A word that occurs in 80% of the documents in a given
collection and/or occurs frequently in a single document is
useless for the purpose of retrieval,
because such a word is not a good
discriminator.
Such words are frequently referred to as stopwords
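The 80% criterion can be sketched by counting, for each word, the fraction of documents in which it appears. The five toy documents and the function name `frequent_words` are assumptions for this illustration.

```python
def frequent_words(docs, threshold=0.8):
    """Return words that occur in at least `threshold` of the documents."""
    n = len(docs)
    df = {}  # document frequency: number of documents containing each word
    for text in docs:
        for word in set(text.lower().split()):
            df[word] = df.get(word, 0) + 1
    return {w for w, count in df.items() if count / n >= threshold}

# Hypothetical toy collection.
docs = [
    "the index of the text",
    "the query and the index",
    "the user of the system",
    "the stoplist of words",
    "the tokens in the text",
]

print(frequent_words(docs))
```

Here only "the" appears in at least 80% of the documents, so it is the only stopword candidate at that threshold.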
Stopword list
Also called a negative dictionary. Why?
Is a machine-readable list of words (stopwords) that cannot
be chosen as index terms.
They are words with little or no meaning (function words).
Possible candidates for a list of Stopwords are
Articles
Prepositions
Conjunctions
Some verbs, adverbs, and adjectives (can be treated as
stopwords): words or phrases indicating what somebody
or something does or what state somebody or something is in,
words that answer questions with how, when, or where, and words that
name a quality or define or limit a noun.
Cont…
Stoplists vary in size (e.g., most stoplists in English contain
from about 50 to 400 words)
A sample stoplist for English
About, becoming, can, did, eight
After, been, caption, do, else
Above, before, could, does, ever
All, below
Develop an algorithm (series of steps)
using a flowchart to represent the
stopword removal process.
Techniques for building a stoplist/removal of
function words
There are different techniques to build a stoplist or remove
function words
1. Select words that serve a grammatical purpose
and do not refer to objects or concepts
Ex. the, and, of (an inverse strategy selects words
as index terms only when they belong to a specific class,
such as nouns). That means listing all such words in the form of a negative
dictionary.
Often the creation of the stoplist is a process that occurs
before the actual indexing of individual texts.
Thus a potential index term is checked against the
stoplist and eliminated as a candidate index term if it is found
there
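Checking candidate terms against a prebuilt stoplist can be sketched in a few lines. The `STOPLIST` below is a small hypothetical list assembled from words mentioned in these slides, not a complete negative dictionary.

```python
# Hypothetical stoplist built ahead of indexing.
STOPLIST = {"about", "after", "above", "all", "the", "and", "of", "can", "do"}

def remove_stopwords(candidates, stoplist=STOPLIST):
    """Drop every candidate index term that is found in the stoplist."""
    return [t for t in candidates if t.lower() not in stoplist]

print(remove_stopwords(["The", "analysis", "of", "index", "terms"]))
```

Storing the stoplist as a set keeps each membership check at roughly constant cost, which matters when every token of every document is checked.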
Cont…
2. Include words that most frequently occur.
The assumption is that these most frequent words are non-content
bearing (the frequency of
occurrence of a function word is assumed to be much higher than that of a
content word).
A threshold value (limit) is set to determine the number of words
to be included in the stoplist, e.g.
400 most frequently occurring words
Include words having up to 5 occurrences in a reasonably
long document
3. Because function words tend to be small in size,
occasionally all short words that contain fewer than a
threshold number of characters are removed from
the text
Remark: this carries a high risk of losing important
short words,
and the solution is to use an anti-stopword list
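Techniques 2 and 3, together with the anti-stopword list, can be sketched as one function. The parameter values (`top_n=3`, `min_length=3`), the token counts, and the anti-stopword entries are all assumptions made for this small illustration; the slides mention a threshold such as the 400 most frequent words for a real collection.

```python
from collections import Counter

def build_stoplist(tokens, top_n=3, min_length=3, anti_stoplist=frozenset()):
    """Combine the most frequent words and the short words, minus protected ones."""
    counts = Counter(tokens)
    frequent = {word for word, _ in counts.most_common(top_n)}   # technique 2
    short = {word for word in counts if len(word) < min_length}  # technique 3
    return (frequent | short) - set(anti_stoplist)

# Hypothetical token stream.
tokens = (["the"] * 5 + ["of"] * 4 + ["and"] * 3 + ["index"] * 2
          + ["ir", "c", "text"])

print(sorted(build_stoplist(tokens)))
print(sorted(build_stoplist(tokens, anti_stoplist={"ir", "c"})))
```

Without the anti-stopword list, the short but meaningful terms "ir" and "c" would be lost; listing them as protected keeps them available as index terms.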
Advantages and disadvantages of
eliminating stopwords
Advantages
It reduces document description and focuses
attention on terms that convey more information.
Provides compression (manageable size of index)
of the indexing structure. Ex.
The size of indexing structure may be reduced
by 40% or more solely with the elimination of
stopwords.
Disadvantages
May reduce recall (the proportion of relevant
documents that are actually retrieved)
Cont…
Stoplists can be either generic or domain
specific
Generic stoplist: by considering the most frequent
words in a wide range of subjects
Domain specific: by observing the frequency of
words in the document collection that is to be
indexed
Review question
Relate the key issues, entities, and two major subsystems in IR.
Mention at least 4 challenges in IR.
Define indexing.
What is subject indexing and how is it different from others?
What are the key issues one needs to consider in developing a lexical
analyzer?
What is the contribution of H.P. Luhn in IR? How about Gerard
Salton?
What are the key issues one needs to consider in removing stop
words?
What are the 4 major features of an index term?
Class Exercise