Chapter One - Information Storage & Retrieval
By.Molalign Tilahun 1
Chapter One
Introduction to ISR
1. What is Information Storage?
2. What is Information Retrieval?
IR Processes
Information retrieval is the process of matching the query against the indexed information objects.
An index is an optimized data structure built on top of the information objects; it allows faster access during the search process.
The indexer:
• tokenizes the text (tokenization)
• removes words with little semantic value (stop-words)
• unifies word families (stemming)
The same processing is applied to the query as well.
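The indexer steps above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the stop-word list is a tiny hypothetical one, `stem` is naive suffix stripping standing in for a real stemmer, and the index is a plain inverted index (term to document ids), which is one common choice rather than the only possible structure.

```python
import re
from collections import defaultdict

# Hypothetical stop-word list; real systems use much longer lists.
STOP_WORDS = {"a", "an", "and", "the", "is", "of", "for", "to", "in", "on"}

def tokenize(text):
    """Split text into lowercase alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())

def stem(token):
    """Naive suffix stripping; a stand-in for a real stemming algorithm."""
    for suffix in ("ing", "ed", "es", "s"):
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token

def analyze(text):
    """Tokenize, drop stop-words, then stem. Used for documents and queries alike."""
    return [stem(t) for t in tokenize(text) if t not in STOP_WORDS]

def build_index(docs):
    """Build an inverted index: term -> sorted ids of documents containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in analyze(text):
            index[term].add(doc_id)
    return {term: sorted(ids) for term, ids in index.items()}

docs = {
    1: "Indexing the documents",
    2: "The indexer indexes documents for searching",
}
index = build_index(docs)
query_terms = analyze("searched indexes")
```

Because the query goes through the same `analyze` function as the documents, the query word "searched" and the document word "searching" end up as the same index term.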
IR Processes (Cont.)
The IR system responds by returning the information objects that are relevant to a query.
Information retrieval focuses on finding relevant information rather than on simple pattern matching.
Relevance:
• is a subjective notion (concept)
• depends on the task being solved and its context
• can change with time (e.g. new information becomes available)
• can change with location (e.g. the most important answer is the closest one)
• can change with the device (e.g. the best answer is a short document that is easier to download and view)
IR Processes (Cont.)
A retrieval strategy (model) is an algorithm, together with its related structures, that takes a query and a set of documents and assigns a similarity measure between the query and each document.
• The similarity represents the relevance of the document to the user query.
• Documents are ranked on the basis of their similarity to the query.
• This process can be repeated, and the query can be modified.
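As an illustration of a retrieval strategy, the sketch below assigns each document a similarity score and ranks by it. It assumes cosine similarity over raw term-frequency vectors, which is one possible similarity measure and not necessarily the model intended in these slides; the document ids and texts are made up for the example.

```python
import math
from collections import Counter

def vectorize(text):
    """Represent text as a bag of lowercase words with term frequencies."""
    return Counter(text.lower().split())

def cosine(q, d):
    """Cosine similarity between two term-frequency vectors."""
    dot = sum(q[t] * d[t] for t in q)
    norm = math.sqrt(sum(v * v for v in q.values())) * math.sqrt(
        sum(v * v for v in d.values())
    )
    return dot / norm if norm else 0.0

def rank(query, docs):
    """Return (doc_id, score) pairs sorted by decreasing similarity to the query."""
    q = vectorize(query)
    scored = [(doc_id, cosine(q, vectorize(text))) for doc_id, text in docs.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)

# Hypothetical toy collection.
docs = {
    "d1": "information retrieval ranks documents",
    "d2": "databases store structured records",
    "d3": "retrieval of information",
}
ranking = rank("information retrieval", docs)
```

Calling `rank` again with a modified query, for example "information retrieval documents", can reorder the results, which mirrors the repeat-and-modify loop described above.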
IR Processes (Cont.)
In general, the IR process:
[Figure: diagram of the IR process; recoverable labels: doc, Representation (twice), Retrieved documents, Evaluation]
IR Processes (Cont.)
Text Collections and IR
Large collections of documents come from various sources: news articles, research papers, books, digital libraries, Web pages, etc.
Storage of text:
• Textual documents: searchable as text; words are represented as ASCII/Unicode characters.
• Image documents: scanned images of text documents, which are not searchable as text; the text (characters, words, etc.) is represented as patterns of pixels.
IR Processes (Cont.)
Retrieval from document images has two options:
• Recognition-based retrieval: Optical Character Recognition (OCR) is required to convert the document images to text such as ASCII (this step may be error-prone), and then a text IR system is applied to the recognized documents.
• Recognition-free retrieval: retrieval from document images without explicit recognition. Relevant documents are searched for directly in the image collection, for example by name, title, place, etc.
IR as a Discipline
IR deals with the representation, storage, and organization of, and access to, information items such as documents, web pages, online catalogs, structured and semi-structured records, and multimedia objects.
It can involve a range of content and media.
The early goals of IR were indexing text and searching for useful documents in a collection.
Much IR research focuses more specifically on text retrieval.
IR as a Discipline (Cont.)
The area has grown beyond its early goals.
Nowadays research in IR includes:
• modeling,
• web search,
• text classification,
• system architecture,
• user interface,
• data filtering,
• language,
• cross-language retrieval,
• audio (speech and music) retrieval,
• image retrieval,
• video retrieval,
• question answering, etc.
IR as a Discipline (Cont.)
IR can be studied from two distinct but complementary points of view:
A computer-centered view, which consists of:
• building efficient indexes
• processing user queries with high performance
• developing ranking algorithms to improve results
A human-centered view, which consists of:
• studying the behavior of the user
• understanding the user’s needs
• determining how that understanding affects the organization and operation of the retrieval system
IR as a Tool
IR is a tool that finds and selects, from a collection of items, a subset that serves the user’s purpose.
Examples of IR Systems
• Text-based (Lexis-Nexis, Google, FAST): search by keywords, with limited support for queries in natural language.
• Multimedia (QBIC, WebSeek, SaFe): search by visual appearance (shapes, colors, …).
• Question answering systems (AskJeeves, Answerbus): search in (restricted) natural language.
• Digital and virtual libraries.
• Other:
  cross-language vs. multilingual information retrieval,
  music retrieval,
  medical search engines.
IR Serves as a Bridge
An information retrieval system serves as a bridge between the world of authors and the world of readers/users.
That is, writers present a set of ideas in a document using a set of concepts, and users then query the IR system for relevant documents that satisfy their information need.
[Figure: diagram with a component labeled "Black box"]
IR System Architecture
Indexing, retrieval and ranking
IR System Architecture
• Document collection
• Document representation
• Text analysis/operations
• Indexing, which is executed offline
• Query parsing and expansion: spelling correction, normalization, stop-word removal, etc.
• Retrieval and ranking, using IR models
• Evaluation of the quality of the answer
• Relevance feedback, to improve ranking (e.g. the clicks on the documents)
• Formatting, which consists of retrieving the titles of the documents and generating snippets (brief extracts) for them
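The formatting step can be illustrated with a small snippet generator. This is a minimal sketch, assuming a snippet is simply a window of words around the first query-term match; the function name `make_snippet` and its `width` parameter are hypothetical, and production systems choose snippets far more carefully.

```python
def make_snippet(text, query, width=5):
    """Return a brief extract of `text` centered on the first query-term match."""
    words = text.split()
    terms = {t.lower() for t in query.split()}
    for i, word in enumerate(words):
        # Ignore trailing punctuation when comparing a word to the query terms.
        if word.lower().strip(".,;:!?") in terms:
            start = max(0, i - width)
            end = min(len(words), i + width + 1)
            prefix = "... " if start > 0 else ""
            suffix = " ..." if end < len(words) else ""
            return prefix + " ".join(words[start:end]) + suffix
    # No query term found: fall back to the opening words of the document.
    return " ".join(words[: 2 * width + 1])

text = ("An index is an optimized data structure that allows "
        "faster access for the search process")
snippet = make_snippet(text, "structure", width=2)
```

The `width` value trades context for brevity: a larger window shows more of the surrounding text but takes more room in the result list.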
Issues in IR
Document/text representation
• What makes a “good” representation?
• How is a representation generated from text?
• What are the retrievable objects and how are they organized?
Information need representation
• What is an appropriate query language?
• How can interactive query formulation and refinement be supported?
Comparing representations
• What is a “good” similarity measure and retrieval model?
• How is uncertainty represented?
Evaluating effectiveness of retrieval
• What are good metrics?
• What constitutes a good experimental test bed?
Information vs. Data Retrieval
Data retrieval: the task of determining which documents of a collection contain the keywords in the user query.
A data retrieval system, such as a relational database:
• deals with data that has a well-defined structure and semantics
• treats a single erroneous object among a thousand retrieved objects as total failure
Data retrieval does not solve the problem of retrieving information about a subject or topic.
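The limitation can be seen in a small sketch of data retrieval as exact keyword matching. The function name `data_retrieval` and the toy documents are hypothetical, and whitespace splitting stands in for whatever matching a real data retrieval system performs.

```python
def data_retrieval(query, docs):
    """Return ids of documents that contain every keyword in the query (exact match)."""
    keywords = set(query.lower().split())
    return [
        doc_id
        for doc_id, text in docs.items()
        if keywords <= set(text.lower().split())
    ]

# Hypothetical toy collection.
docs = {
    "d1": "car repair manual",
    "d2": "automobile maintenance guide",
    "d3": "used car prices",
}
matches = data_retrieval("car repair", docs)
```

Here `d2` is about the same topic as the query but shares no keyword with it, so exact matching never returns it; finding such documents is the problem information retrieval addresses.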
Data Vs. Information Retrieval
[Table: columns Features, Data Retrieval, Information Retrieval; rows not recovered]
Information Retrieval Research Areas
Much of IR research focuses more specifically on text retrieval, but there are many other interesting areas:
• Audio retrieval, which deals with searching for speech or music files.
• Cross-language retrieval, which uses a query in one language (say English) and finds documents in other languages (say Amharic and Russian).
• Question-answering IR systems, which retrieve answers from a body of text. For example, the question "Who won the 1997 World Series?" finds a 1997 headline "World Series: Marlins are champions."
• Image retrieval, which finds images on a given topic or images that contain a given shape or color.
• Video retrieval, which searches for the video files that the user is looking for.
End of Chapter - One
Thanks