Chapter 2

Content/subject Analysis and
Representation
Sub topics
 Overview
 Manual Vs Automatic indexing
 Indexing Languages and Methods
 Index Term selection process
 Lexical analysis (tokenization)
 Use of stopwords list/stopword remover
 Conflation
Overview
What do we need to know to effectively
handle text representation?
Content Analysis
 Automated transformation of raw text into a
form that represents some aspect(s) of its
meaning
 Including, but not limited to:
 Term selection (index term)
 Automated Thesaurus Generation
 Categorization/Clustering
 Summarization
Cont…
 Before a computerized information retrieval system can
actually retrieve any information,
 that information must already have been stored
inside the computer.
 Originally it will usually have been in the form of
documents.
 The computer, however, is not likely to have stored the
complete text of each document in the natural
language in which it was written for retrieval purposes.
 It will have, instead, a document representative, which
may have been produced from the documents either
manually or automatically.
Knowledge organization tools
 Index: a document containing representatives of
information items
 Classification schemes: tools designed to provide a
hierarchical arrangement of numeric or alphabetic
notation to represent broad (wide) topics.
 They divide a particular subject into successive classes
and sub-classes, with chosen characteristics as the
basis for each stage.
 Examples are DDC and LCC.
Cont…
 Subject heading lists: knowledge organization
tools designed to provide a set of controlled
(managed) terms to represent the subject content of
items in a collection
 Authority files: lists of terms used to
control (manage) variant forms of personal,
geographical, or organizational names and details
 Thesauri: documents that show the relationships
between terms (more on this later)
What is Indexing?
Indexing
 Is the art of organizing information.
 Is an association of descriptors (keywords, concepts,
metadata) to documents in view of future retrieval
 Is a process of constructing document surrogates by
assigning identifiers to text items
 Is the notion of storing data in a particular way in order
to locate and retrieve the data as efficiently as possible
 Is the process of analyzing the information content and expressing it in
the language of the indexing system used in the
information retrieval system (IRS)
 Gives access points to a collection that are expected to
be most useful to the users of the information
Indexing: From text to index
• Whether it is used by a human being or by a machine, an index is
in essence a list of index entries.
• An index term is a word or phrase in a document whose
semantics gives an indication of the document’s theme.
• Index terms may be single words or multi-word phrases, and are
mainly nouns (because nouns have meaning by themselves).
• Each index entry leads to an indexed item somewhere outside
the index; for instance, to a record in a database, to a folder in
a file drawer or to an item in a container.
[Figure: Text → Indexing (actions) → Index]
Indexes
Search systems rarely search document collections
directly.
Instead an index is built of the documents in the
collection and the user searches the index.
Documents can be digital (e.g., web pages) or physical (e.g., books).
[Figure: an index is created from the document collection, and the user searches the index]
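The idea of searching an index instead of the documents themselves can be sketched in a few lines of Python. This is only an illustrative sketch: the function name `build_index`, the sample documents and the choice of a word-to-document-ids mapping are assumptions, not something prescribed by the slides.

```python
from collections import defaultdict

def build_index(docs):
    """Map each lowercased word to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for word in text.lower().split():
            index[word].add(doc_id)
    return index

# Hypothetical three-document collection.
docs = {
    1: "information retrieval systems",
    2: "database systems",
    3: "retrieval of web pages",
}
index = build_index(docs)
print(sorted(index["retrieval"]))  # [1, 3]
print(sorted(index["systems"]))    # [1, 2]
```

The search then only touches the index: looking up a word returns the documents that contain it without scanning any document text.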
Why indexing?
 The documents cannot be used for search as they are
 Some representation of their content is needed
 Indexing can be used in
 Finding documents by topic (how?)
 Relating documents to each other (how?)
 Determining relevance between documents and
information needs (how?)
Enumerate some basic
characteristics of an Index
term (describe what an index
term should look like)
Important features of a good
document representative
 Discriminating power: the representative keyword
(index term) should be able to identify a document
uniquely and reduce ambiguity
Ex. ISBN number of a book
 Descriptiveness: it has to describe all the
information in a document as completely as possible
while remaining correct
 Similarity identification: should be able to group
similar documents together.
Ex. Book classification number
 Conciseness: should be simple and clear. This will
reduce processing time and storage space
Ex. Author, title
Relationship of features of a good
document representative
 Good balancing is required.
 Higher discriminating power may lower the capability
of identifying similarities among documents.
 Good descriptiveness may defeat the conciseness of
the representative
 What is good for the computer may not always be
good for the user.
 Remark: A good representation should seek a
balance of the four, and take consideration of
both the computer and the user.
Assumptions when organizing
knowledge using index terms
 The index terms selected are assumed to reflect the
content of the text
 Index terms can be extracted from the title, abstract or
full text of the document
 The majority of existing automatic indexing methods
select natural language (NL) index terms from the document text.
 Why?
 Indexing can be
 Manual
 Automatic
Manual vs Automatic
Indexing
What are the basic features of these
indexing approaches?
Manual indexing
 Indexers decide which keywords to assign to
documents based on controlled (managed)
vocabulary
 A controlled (managed) vocabulary is a finite set
of index terms from which all index terms must
be selected
 Human indexers assign index terms to documents
 Indexers will be given
 Input sheets
 Manuals and instructions
 Printed thesaurus
 Is usually done in the library environment
Limitations of manual indexing
 Slow and expensive (significant cost)
 Is based on intellectual judgment and semantic
interpretation (concepts, themes)
 Prior knowledge is needed of the subject, the terms
that will be used by the users, the indexing vocabularies,
and the collection characteristics
 Low consistency: there is a high probability of inconsistency
among indexers, and maintaining consistency is difficult
Automatic indexing
 The aim of automatic indexing is to build indexes and
retrieve information without human intervention
 When the information that is being searched is text,
methods of automatic indexing can be very effective.
 Much of the fundamental research in automatic
indexing was carried out by Gerard Salton (Professor of
CS at Cornell) and his graduate students.
 In automatic indexing, computers do the indexing
 The system extracts “typical” or “significant” terms
 The human may contribute by setting the parameters
or thresholds, or by choosing components or
algorithms
Cont…
 Consists of two (major) processes, namely
 Selecting and assigning terms or concepts
capable of representing document content
 Assigning a weight or value to each term
reflecting its presumed importance for the
purpose of content identification
 Important words are assigned higher weights and less
important words are assigned lower weights
Need for automatic indexing
 Information overload
 Enormous amount of information is being
generated from day to day activities
 Explosion of machine-readable text
 Massive information available in electronic
format and on Internet.
 Cost effectiveness
 Human indexing is expensive and labor
intensive
Advantages of Automatic Indexing
 Reduced processing time (fast)
 At most a few seconds vs. at least a few minutes
 Reduced cost (inexpensive)
 Once the initial hardware cost is amortized, the operational cost is
lower than wages for human indexers
 Indexing entries are generated at a lower cost than with manual
indexing
 Easy to maintain
 Improved consistency: algorithms select index terms much more
consistently than humans.
 Better retrieval (achieved)
 Mechanical execution of algorithms, with no intelligent
interpretation (aboutness/relevance)
Indexing language
 It refers to a system of naming subjects and the
complete set of terms used to describe subject matter
within a particular database
 Like any other language it consists of two parts
 Vocabulary: two types of vocabulary
 Controlled (managed): a type of indexing
language in which the terminology is
controlled (managed)
 Natural language: the vocabulary is said to be “natural
language” if terms appearing in the text are used
 Syntax: the way keywords are written
Three basic categories of indexing
approach
Based on the different approaches available:
 Title based indexing
 Citation indexing
 Subject indexing (basically automatic)
Title based indexing
 Indexing words are selected from the title.
 Example: considering the following titles representing their
documents
 Manual system analysis
 Introduction to system analysis
 A structured approach to system analysis
 System analysis
 The significant words in each title are the same and might be
used as the basis for retrieval.
 Various methods of using title keywords exist for indexing (read
more on this)
 Catchword title indexing
 Keyword in context (KWIC) indexing
 Keyword out of context (KWOC) indexing
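As a rough illustration of title keyword indexing, the sketch below produces a simplified, rotated-title form of a KWIC listing for two of the example titles. The function name `kwic`, the tiny stoplist and the use of "/" to mark the original end of the title are assumptions made for this example; real KWIC displays are usually laid out with the keyword in a centered column.

```python
STOPWORDS = {"a", "to", "the", "of"}  # tiny illustrative stoplist (assumption)

def kwic(titles, stopwords=STOPWORDS):
    """Return (keyword, rotated title) pairs sorted by keyword."""
    entries = []
    for title in titles:
        words = title.split()
        for i, word in enumerate(words):
            if word.lower() in stopwords:
                continue  # function words do not become entry points
            # Rotate the title so the keyword comes first; "/" marks the original end.
            rotated = " ".join(words[i:] + ["/"] + words[:i])
            entries.append((word.lower(), rotated))
    return sorted(entries)

titles = ["Manual system analysis", "Introduction to system analysis"]
entries = kwic(titles)
for keyword, rotated in entries:
    print(f"{keyword:14}{rotated}")
```

Each significant title word becomes an access point, so both titles appear under "system" and under "analysis".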
Citation indexing
 It is an indexing system that considers the link between a
document and each item in its bibliography, and vice versa.
 The idea is to bring together all the
documents that have included a given item in their list of
references.
 Is domain specific
 Example
 Science Citation Index
 Arts and Humanities Citation Index
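Bringing together all documents that cite a given item amounts to inverting the reference lists. A minimal sketch, assuming hypothetical paper ids and reference lists (`build_citation_index` is an illustrative name, not a standard API):

```python
from collections import defaultdict

def build_citation_index(references):
    """Invert 'paper -> items it cites' into 'item -> papers that cite it'."""
    cited_by = defaultdict(list)
    for paper, cited_items in references.items():
        for item in cited_items:
            cited_by[item].append(paper)
    return cited_by

# Hypothetical papers (P1..P3) and the items in their reference lists.
references = {
    "P1": ["A", "B"],
    "P2": ["B", "C"],
    "P3": ["B"],
}
cited_by = build_citation_index(references)
print(cited_by["B"])  # ['P1', 'P2', 'P3']
```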
Subject indexing (basically automatic)
 It indexes (organizes) a document by identifying its
subject matter, standardizing it
and selecting an indexing system.
 Thus it involves two principal steps:
 Conceptual analysis
 Translating into a particular set of index terms.
 Subject indexing systems can be classified as either
 Pre-coordinate systems
 Post-coordinate systems
Pre-coordinate systems:
 Concepts or terms are combined to form compound classes
or descriptions by the indexer at the indexing stage
 Each entry (term/code) represents full contents of the
document concerned
 The indexer plays an important role in coordinating index
terms
 An individual card in the catalogue represents individual
documents
 Example:
 Dewey Decimal Classification (read more on this)
 Library of Congress Classification
Post-coordinate systems
 Concepts or terms are combined to form compound classes or
descriptions by the searcher at the searching stage
 Keywords are prepared separately for representing the
subject of a document
 The keywords are coordinated at search time
 Example:
 Computer-based search system
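Coordinating keywords at search time can be pictured as a Boolean AND over separately stored keyword entries. The sketch below assumes a hypothetical `keyword_index` that maps each keyword to the ids of the documents indexed under it; the function name `search_all` is also illustrative.

```python
def search_all(index, terms):
    """Return ids of documents indexed under every one of the terms (Boolean AND)."""
    postings = [index.get(term, set()) for term in terms]
    return set.intersection(*postings) if postings else set()

# Hypothetical keyword entries, each prepared separately at indexing time.
keyword_index = {
    "computer": {1, 2, 4},
    "search": {2, 3, 4},
    "system": {1, 4},
}
print(sorted(search_all(keyword_index, ["computer", "search"])))            # [2, 4]
print(sorted(search_all(keyword_index, ["computer", "search", "system"])))  # [4]
```

The compound concept "computer search system" is never stored as such; it is formed by the searcher when the three keywords are combined.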
Effectiveness of an indexing system
 Measured in terms of Exhaustiveness and
Specificity
 Exhaustiveness
 The degree to which the subject matter of a given
document has been reflected through the index terms.
 An exhaustive indexing system is supposed to represent the
content of the input document fully.
 Thus it needs the selection of as many keywords as possible.
 Specificity
 The degree to which the subject matter of a given
document is represented by specific terms
Cont…
 What is the relation of these concepts to the IRS
performance measures recall and precision?
 An exhaustive system tends to retrieve more
documents, that is, it gives high recall with lower precision
 Term specificity tends to give high precision
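The trade-off can be made concrete with the usual set-based definitions: precision is the share of retrieved documents that are relevant, and recall is the share of relevant documents that are retrieved. The document ids and the two result sets below are invented purely for illustration.

```python
def precision_recall(retrieved, relevant):
    """Compute (precision, recall) for sets of document ids."""
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall

relevant = {1, 2, 3, 4}
broad = {1, 2, 3, 5, 6, 7}  # hypothetical result under exhaustive indexing
narrow = {1, 2}             # hypothetical result under highly specific indexing
print(precision_recall(broad, relevant))   # (0.5, 0.75)
print(precision_recall(narrow, relevant))  # (1.0, 0.5)
```

In this toy case the broad result finds more of the relevant documents (higher recall) but also more irrelevant ones (lower precision), while the narrow result shows the opposite pattern.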
Indexing: Term Selection
Process
How do IR systems select index terms?
Content/subject analysis and
representation using term selection
 Is the analysis of the contents embedded in a document, or the
identification of subject matter in document texts through
terms.
 It can also be seen as:
 Analyzing and representing the content of a
document through keywords (an automated translation
of raw text into a form that represents some aspects of
its meaning)
 Deciding the “aboutness” of a document
Origin-Term Selection (Hans Peter
Luhn (1896-1964))
 It was Luhn (1957) who first suggested that certain words
could be automatically extracted from texts to
represent their content.
 His concept is still used by search engines that operate
on the Internet.
 The aim of term selection is representing textual
documents by a set of keywords called index terms or
simply terms.
 However
 Not all words in a text are good index terms
 Words that are good index terms do not contribute
equally in defining the content of a text
Cont…
 According to Luhn, the distribution pattern of a
word could give significant information about whether
it is content bearing.
 He stated that high frequency words tend to be common and
unimportant
 because they don’t discriminate sufficiently between
documents
 He also recognized that one or two occurrences of a word in a
relatively long text could not be taken as significant in defining
the subject matter.
 This is because such words are unlikely to be specified in
search statements (queries).
Cont…
 Motivation
 Terms in a document have unequal capacity to
represent its content
 Using the set of all words in a collection to index its
documents generates too much noise for the
retrieval task
 According to Luhn the most important words for
document representation (indexing) are those
which occur with intermediate frequencies.
 This is supported by Zipf’s law.
Zipf’s law in IR
 In relation to Luhn’s idea, Zipf,
another personality in IR,
provides us with a law which states that the product of
the frequency of use of words and their rank order
is approximately constant.
 Zipf’s law is a curious observation about the
frequency of words:
 The frequency of a word gives an indication of term importance
 Word rank multiplied by word frequency is roughly
constant for many text files.
Cont…
 Word frequency is the number of times a given word
appears in the text file.
 When words are ranked according to frequency, the
most frequent word is given rank 1, the next-most-frequent
word is given rank 2, etc.
 rank × frequency ≈ constant
 Example:
 The product of frequency and rank is shown for some
very common and rare words from Tom Sawyer (The
Adventures of Tom Sawyer is a novel by Mark Twain).
 The product is roughly constant except in a few
places. Zipf’s law predicts that a plot of frequency against
rank on logarithmic axes should be a straight line with slope -1.
Remark: For a large body of text of “well written English”, the resulting curve is a straight line.
Accordingly
 if f is the frequency of occurrence of words in a text and
 r is their rank order, that is, the order of their frequency of occurrence,
 then a plot relating f and r yields a curve similar to a hyperbola. Zipf
verified his law on American newspapers.
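The rank and frequency bookkeeping behind Zipf's law can be sketched as follows. The function name `rank_frequency` and the sample sentence are assumptions; a text this short will not show a constant product, so the sketch only demonstrates how the table is computed, and a large text file would be needed to observe the law.

```python
from collections import Counter

def rank_frequency(text):
    """Return (rank, word, frequency, rank * frequency) rows, most frequent first."""
    counts = Counter(text.lower().split())
    rows = []
    for rank, (word, freq) in enumerate(counts.most_common(), start=1):
        rows.append((rank, word, freq, rank * freq))
    return rows

text = "the cat and the dog and the bird"  # toy input, far too small for Zipf's law
rows = rank_frequency(text)
for row in rows:
    print(row)
```

Applied to a whole novel or a newspaper archive, the last column is the quantity that Zipf's law expects to stay roughly constant.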
Cont…
 Luhn’s concept and Zipf’s law
 There is a relationship between Zipfian curve and
Luhn’s concept of where the significant words are
 Words with low significance are at both tails of the
distribution
 Therefore, Luhn suggested using the words in the
middle of the frequency range
 These findings are the basis of a number of term
importance indicators (weighting)
Cont…
 In line with this,
 Luhn used Zipf’s idea as a null hypothesis to enable
him to specify two cut-offs, an upper and a lower.
 He assumed that the resolving power of significant
words, by which he meant the ability of words to
discriminate content, reached a peak at a rank order
position halfway between the two cut-offs, and from
the peak fell off in either direction, reducing to almost
zero at the cut-off points.
 Thus it is stated that there is an inverse relation
between the frequency of a word f and its rank r.
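A minimal sketch of selecting words between two cut-offs is given below. It applies the cut-offs to raw frequencies rather than to rank positions, and the threshold values 2 and 3 are arbitrary assumptions for the toy input; it does not model the resolving-power curve itself.

```python
from collections import Counter

def significant_words(tokens, lower_cutoff, upper_cutoff):
    """Keep words whose frequency f satisfies lower_cutoff <= f <= upper_cutoff."""
    counts = Counter(tokens)
    return sorted(w for w, f in counts.items() if lower_cutoff <= f <= upper_cutoff)

tokens = ("the index the term the index the retrieval the term "
          "the index zipf").split()
# Frequencies: the=6, index=3, term=2, retrieval=1, zipf=1
selected = significant_words(tokens, lower_cutoff=2, upper_cutoff=3)
print(selected)  # ['index', 'term']
```

The very frequent word "the" falls above the upper cut-off and the single-occurrence words fall below the lower one, leaving the middle of the frequency range.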
Problems with Luhn’s selection
mechanism
 Finding a value/threshold for the elimination of high and
low frequency words
 The risk of loss of retrieval performance
 The removal of high frequency words may reduce
recall
 The removal of low frequency words may bring
losses in precision
Methods that Build on Zipf's Law
• Stop lists: Ignore the most frequent words (upper cut-off).
Used by almost all systems.
• Significant words: Take words in between the most
frequent (upper cut-off) and least frequent words
(lower cut-off).
• Term weighting: Give differing weights to terms based
on their frequency, with the most frequent words weighted
less. Used by almost all ranking methods.
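One common way to weight frequent words less is inverse document frequency (IDF), which the slides do not name; it is used here only as an example of such a scheme. The sketch assumes the weight log(N / df), where N is the number of documents and df is the number of documents containing the term, and the three token lists are invented.

```python
import math

def idf_weights(docs):
    """Weight each term by log(N / df): terms found in many documents get low weights."""
    n_docs = len(docs)
    doc_freq = {}
    for tokens in docs:
        for term in set(tokens):  # count each term once per document
            doc_freq[term] = doc_freq.get(term, 0) + 1
    return {term: math.log(n_docs / df) for term, df in doc_freq.items()}

# Hypothetical tokenized documents.
docs = [
    ["the", "index", "term"],
    ["the", "stop", "list"],
    ["the", "index", "weight"],
]
weights = idf_weights(docs)
for term in ("the", "index", "term"):
    print(term, round(weights[term], 3))
```

A word that occurs in every document, such as "the" here, receives weight 0, while a word found in a single document receives the highest weight.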
Basic procedures
 Identification of the individual words of the text
(lexical analysis/tokenization)
 Removal of function words and of highly frequent
terms in the subject domain that are insufficiently
specific to represent content (use of a
stoplist)
 Reduction of the remaining words to their stem
form (stemming, use of a conflation procedure)
 Optional formation of phrases as index terms
 Optional replacement of words, word stems or
phrases by their thesaurus class terms
 Computation of the weight of each remaining
word stem or word
Lexical Analysis
(Tokenization)
1st step in term selection
Lexical Analysis of the text (developing
lexical analyzer)
 The major objective of this phase is identification of the
words in the text.
 It is the process of converting an input stream of
characters (the text of the document) into a stream of
words (the candidate words to be adopted as index
terms) or tokens.
 It begins when we have the text in electronic format.
 A word or token is defined as a string of characters
separated by white space and/or punctuation
 Results of the process:
 Candidate index terms that can be further processed
Develop a flowchart
representing a
lexical analyzer.
Basic Techniques/issues/steps of
lexical analysis
1. Recognition of spaces as word separations
 Multiple spaces should be reduced to one space
2. Making small (minor) transformations (consideration of
digits, hyphens, punctuation, case of the letters):
normalization
 Converting the case of letters to either lower or upper
(with some consideration for exceptions)
 Converting abbreviations and acronyms to their original
form using a machine-readable dictionary
 Excluding numbers from the index term list, because
without surrounding context they are inherently vague
Example
 Consider a user interested in documents about the number of
deaths due to car accidents between the years 1910 and 1989
 Such a request is specified as the set of index terms (death,
car, accidents, years, 1910, 1989)
 The presence of the numbers 1910 or 1989 in the query
could lead to the retrieval of a variety of documents which
refer to either of the two numbers
 Exception
 Combinations of digits/numbers with text. Example:
“510 B.C.”
 In such cases the numbers are clearly important index
terms and should not be removed
 Solution for treating digits in text
 Remove all words containing sequences of digits unless
specified otherwise
Cont…
 Breaking up hyphenated words might be useful because of
inconsistency of usage, but there are exceptions
where words include hyphens as an integral part.
 Ex. RJ-45, B-29
 The most suitable solution/procedure here is to
adopt a general rule and specify the exceptions on a
case-by-case basis
 Removal of punctuation marks: the standard is to
remove them, but it is better if the program
incorporates code to handle exceptions
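The lexical analysis rules discussed so far (spaces as separators, case folding, dropping digit tokens, breaking hyphens with listed exceptions, stripping punctuation) could be combined roughly as follows. This is a sketch under assumptions: the exception list `KEEP_AS_IS`, the set of punctuation characters and the order of the steps are illustrative choices, and abbreviation expansion is left out.

```python
import re

KEEP_AS_IS = {"rj-45", "b-29"}  # hypothetical case-by-case exception list

def tokenize(text, exceptions=KEEP_AS_IS):
    """Lowercase, split on whitespace, strip punctuation, break hyphens, drop digit tokens."""
    tokens = []
    for raw in text.lower().split():        # split() also collapses repeated spaces
        word = raw.strip(".,;:!?\"'()[]")   # remove surrounding punctuation
        if not word:
            continue
        if word in exceptions:              # listed exceptions are kept whole
            tokens.append(word)
            continue
        for part in word.split("-"):        # general rule: break at hyphens
            if part and not re.search(r"\d", part):  # drop tokens containing digits
                tokens.append(part)
    return tokens

tokens = tokenize("The RJ-45 plug:  state-of-the-art in 1989.")
print(tokens)
# ['the', 'rj-45', 'plug', 'state', 'of', 'the', 'art', 'in']
```

The general rule handles the common case, and the exception list is the place where terms such as RJ-45 are protected from both the hyphen rule and the digit rule.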
Stopword Removal
2nd step in term selection
Removal of Stop words (Creating
Stopword list)
 Motivation:
 The words of a text do not have equal value for indexing
purposes.
 Words which are too frequent among the documents in
the collection are not good discriminators
 A word which occurs in 80% of the documents in a given
collection and/or occurs frequently in a single document is
useless for the purpose of retrieval.
 The reason is that such a word is not a good
discriminator.
 Such words are frequently referred to as stopwords
Stopword list
 Also called a negative dictionary. Why?
 Is a machine-readable list of words (stopwords) that cannot
be chosen as index terms.
 They are words with little or no meaning (function words).
 Possible candidates for a list of stopwords are
 Articles
 Prepositions
 Conjunctions
 Some verbs, adverbs and adjectives (can be treated as
stopwords): words that indicate what somebody
or something does or what state it is in, words that
answer questions with how, when or where, and words that
name a quality or define or limit a noun.
Cont…
 Stoplists vary in size (e.g. most stoplists for English contain
from about 50 to 400 words)
 A sample of a stoplist for English
 About, becoming, can, did, eight
 After, been, caption, do, else
 Above, before, could, does, ever
 All, below
Develop an algorithm (series of steps)
using a flowchart to represent the
process of a stopword remover.
Techniques for building a stoplist/removal of
function words
 There are different techniques to build a stoplist or remove
function words
1. Select words that serve a grammatical purpose
and do not refer to objects or concepts
 Ex. the, and, of. These are listed in the form of a negative
dictionary. (The inverse strategy selects words
as index terms when they belong to a specific class,
such as nouns.)
 Often the creation of a stoplist is a process that occurs
before the actual indexing of individual texts.
 Thus a potential index term is checked against the
stoplist and eliminated as a candidate index term if found
there
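Checking candidate terms against a stoplist is a simple membership test. In the sketch below the stoplist contents and the function name `remove_stopwords` are illustrative assumptions; a real stoplist would be larger and prepared before indexing.

```python
STOPLIST = {"the", "and", "of", "a", "to", "in", "is"}  # small illustrative stoplist

def remove_stopwords(tokens, stoplist=STOPLIST):
    """Drop every candidate term that appears in the stoplist."""
    return [t for t in tokens if t not in stoplist]

tokens = ["the", "removal", "of", "function", "words", "is", "simple"]
kept = remove_stopwords(tokens)
print(kept)  # ['removal', 'function', 'words', 'simple']
```

Storing the stoplist as a set keeps each lookup fast regardless of the number of stopwords.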
Cont…
2. Include words that occur most frequently.
 The assumption is that these most frequent words are non-content
bearing (the frequency of
occurrence of a function word is assumed to be much higher than that of a
content word).
 A threshold value (limit) is set to determine the number of words
to be included in the stoplist, e.g.
 400 most frequently occurring words
 Include words having up to 5 occurrences in a reasonably
long document
3. Because function words tend to be small in size,
occasionally all short words that contain fewer than a
threshold number of characters are removed from
the text
 Remark: this has a high risk of losing important
short words,
and the solution for this is to use an anti-stopword list
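Techniques 2 and 3 can be sketched together: take the most frequent words, add all words below a length threshold, then remove anything protected by an anti-stopword list. The parameter names (`top_n`, `min_length`, `anti_stoplist`) and the sample text are assumptions chosen for this illustration.

```python
from collections import Counter

def build_stoplist(tokens, top_n, min_length=0, anti_stoplist=frozenset()):
    """Stoplist = the top_n most frequent words plus words shorter than min_length,
    minus anything protected by the anti-stopword list."""
    counts = Counter(tokens)
    stoplist = {word for word, _ in counts.most_common(top_n)}
    stoplist |= {word for word in counts if len(word) < min_length}
    return stoplist - set(anti_stoplist)

tokens = "the ip of the host and the ip of the router".split()
stoplist = build_stoplist(tokens, top_n=1, min_length=3, anti_stoplist={"ip"})
print(sorted(stoplist))  # ['of', 'the']
```

Without the anti-stopword list, the short but meaningful word "ip" would have been removed together with "of".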
Advantages and disadvantages of
eliminating stopwords
 Advantages
 It reduces the document description and focuses
attention on terms that convey more information.
 Provides compression (manageable size of index)
of the indexing structure. Ex.
 The size of the indexing structure may be reduced
by 40% or more solely with the elimination of
stopwords.
 Disadvantages
 May reduce recall (the proportion of relevant
documents that are actually retrieved)
Cont…
 Stoplists can be either generic or domain
specific
 Generic stoplist: by considering the most frequent
words in a wide range of subjects
 Domain specific: by observing the frequency of
words in the document collection that is to be
indexed
Review question
 Relate key issues, entities and two major subsystems in IR.
 Mention at least 4 challenges in IR.
 Define indexing.
 What is subject indexing and how is it different from the others?
 What are the key issues one needs to consider in developing a lexical
analyzer?
 What is the contribution of H. P. Luhn to IR? How about Gerard
Salton?
 What are the key issues one needs to consider in removing stop
words?
 What are the 4 major features of an index term?
Class Exercise
 Merge the two flowcharts you created before to
represent the process of lexical analysis and
stopword removal.
 State all your assumptions wherever required.
Thank you