Information Retrieval & Machine Learning: Supporting Technologies For Web Mining Research & Practice
Abstract
With the enormous increase in recent years in the volume of information available on-line, and the
consequent need for better techniques to access this information, there has been a strong resurgence of
interest in Web Mining research. This paper explains how research in Machine Learning and
Information Retrieval can help develop applications that utilize the Web of knowledge more effectively
and efficiently. We review Web mining systems from the perspectives of Machine
Learning and Information Retrieval, and show how these fields have been applied in developing standards and
improving effectiveness.
Keywords
Web Information Retrieval; Machine Learning Paradigms
Introduction
The World Wide Web is a huge, widely distributed, global source of information services, hyperlink information,
access and usage information, and web-site contents and organization. With the transformation of the Web into a
ubiquitous tool for e-activities such as e-commerce, e-learning, e-government and e-science, its use has pervaded
the realms of day-to-day work, information retrieval and business management, and it is imperative to provide users
with tools for efficient and effective resource and knowledge discovery. By all measures, the Web is enormous and
growing at a staggering rate, which has made it increasingly difficult, and increasingly crucial, for both people and
programs to have quick and accurate access to Web information and services. Search engines have assumed a central
role in the World Wide Web's infrastructure as its scale and impact have escalated. Although web search engines assist
resource discovery, their performance is far from satisfactory. Commonly cited problems include slow retrieval speed,
poor quality of retrieved results, and difficulty in handling a huge quantity of information, addressing subjective and
time-varying search needs, finding fresh information and dealing with poor-quality queries.
From IR to Web IR
Information discovery on the Web is a challenging task with a great potential that is yet to be realized.
Diversity and complexity of Web information, as well as its richness, call for approaches that reach beyond
conventional IR (Kuhlen, 1991; Kobayashi, Takeda, 2000; Baeza-Yates, 1999). Research problems range
from understanding the user's information need better to managing huge amounts of information to
providing superior ranking methods exploiting the structure and characteristics of the Web. The information
explosion of the WWW-era information age has made the field of Information Retrieval (IR) more critical
than ever and fostered the development of the field of Web Information Retrieval (Web IR), which is
concerned with addressing the technological challenges facing Information Retrieval (IR) in the setting of
the WWW (Nunes, 2006).
For many years, information retrieval research focused mainly on the problem of ad-hoc document
retrieval, a topical search task that assumes all queries are meant to express a broad request for information
on a topic identified by the query. This task is exemplified by the early TREC conferences, where the adhoc document retrieval track was prominent. In recent years, particularly since the popular advent of the
World Wide Web and e-commerce, IR researchers have begun to expand their efforts to understand the
nature of the information need that users express in their queries. The unprecedented growth of available
data coupled with the vast number of available online activities has introduced a new wrinkle to the
problem of search: it is now important to attempt to determine not only what the user is looking for, but
also the task they are trying to accomplish and the method by which they would prefer to accomplish it. In
addition, all users are not created equal; different users may use different terms to describe similar
information needs; the concept of what is relevant to a user has only become more and more unclear as
the web has matured and more diverse data have become available. Because of this, it is of key interest to
search services to discover sets of identifying features that an information retrieval system can use to
associate a specific user query with a broader information need. All of these concerns fall into the general
area of query understanding. The central idea is that there is more information present in a user query than
simply the topic of focus, and that harnessing this information can lead to the development of more
effective and efficient information retrieval systems. Thus, Information Retrieval (IR) is the automatic
retrieval of all relevant documents while at the same time retrieving as few non-relevant documents as
possible. Its primary goals are indexing text and searching for useful documents in a collection [1, 2,
5, 6, 7].
The ultimate challenge of Web IR research is to provide improved systems that retrieve the most relevant
information available on the web to better satisfy a user's information need. In an Information Retrieval
scenario, the most common evaluation criterion is retrieval effectiveness; the effect of indexing exhaustivity
and term specificity on retrieval effectiveness can be explained by two widely accepted measures, Precision
and Recall.
Precision: the proportion of the retrieved documents (A) that are relevant, where Ra denotes the documents
that are both retrieved and relevant
|Ra| / |A|
Recall: the fraction of the relevant documents (R) that is successfully retrieved
|Ra| / |R|
A perfect Precision score of 1.0 means that every result retrieved by a search was relevant (but says
nothing about whether all relevant documents were retrieved). A perfect Recall score of 1.0 means that all
relevant documents were retrieved by the search (but says nothing about how many irrelevant documents
were also retrieved) (Singhal, 2001).
Figure 1. Defining IR Metrics
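The two measures defined above can be computed directly from sets of document identifiers. The following is a minimal sketch, not part of any cited system; the function name precision_recall and the document ids are hypothetical and chosen only for illustration.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for a single query.

    retrieved: the answer set A returned by the system
    relevant:  the set R of documents judged relevant
    """
    A = set(retrieved)
    R = set(relevant)
    Ra = A & R  # documents that are both retrieved and relevant
    precision = len(Ra) / len(A) if A else 0.0
    recall = len(Ra) / len(R) if R else 0.0
    return precision, recall


# Hypothetical document ids, for illustration only.
retrieved = ["d1", "d2", "d3", "d4"]
relevant = ["d1", "d3", "d5", "d6", "d7"]
p, r = precision_recall(retrieved, relevant)
print(f"precision={p:.2f} recall={r:.2f}")  # precision=0.50 recall=0.40
```

Here two of the four retrieved documents are relevant (Precision 0.5), while only two of the five relevant documents were found (Recall 0.4), which shows why the two measures must be read together.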
Existing search engines focus mainly on basic term-based techniques for general search, and do not attempt
query understanding. Traditionally, a term in a given document is considered to be significant if it occurs
multiple times within that document. This observation is commonly referred to as Term Frequency (tf)
(Luhn, 1957). Luhn's study was based on the observation that authors of documents typically emphasize a topic or
concept by repeatedly using the same words. Since then, most information retrieval approaches
(Kobayashi, Takeda, 2000; Baeza-Yates, 1999) have adopted tf (or variations of it) as a benchmark for
indicating term significance or relevance within a given document. In particular, it is normally combined
with inverse document frequency (idf) to form the tf-idf measure (Salton, Yang, 1973). Even with the
emergence of Web Information Retrieval, tf continues to be a standard measure of term significance
within a document. There are several examples of content-based web information retrieval systems (Anh,
Moffat, 2003; Craswell, Hawking, Upstill, McLean, Wilkinson & Wu, 2003; Robertson, Walker, 1999; Yu,
Cai, Wen & Ma, 2003) that assess term significance using tf. However, tf is not always the best or most
useful indicator of term significance or relevance. Quite often,
relevant documents contain only one or a few occurrences of a particular term.
Consequently, under tf these terms will rarely be considered significant, and thus will contribute
little to the rank score of the potentially relevant document in which they appear. This is especially
the case when infrequently occurring terms appear in large documents containing hundreds or even
thousands of terms.
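The limitation described above can be seen in a small tf-idf computation. This is a minimal sketch assuming one common variant (tf as count divided by document length, idf as log(N/df)); the systems cited above use their own weighting schemes, and the toy corpus is hypothetical.

```python
import math
from collections import Counter


def tf_idf(docs):
    """Compute tf-idf weights for a list of tokenized documents.

    tf  = term count / document length (one common variant)
    idf = log(N / df), where df is the number of documents containing the term
    """
    n_docs = len(docs)
    df = Counter()
    for doc in docs:
        df.update(set(doc))  # count each term once per document
    weights = []
    for doc in docs:
        counts = Counter(doc)
        weights.append({
            term: (count / len(doc)) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return weights


# Toy corpus (hypothetical): "retrieval" appears once in a 100-term document.
docs = [
    ["web"] * 99 + ["retrieval"],
    ["web", "mining"],
    ["web", "search"],
]
w = tf_idf(docs)
print(w[0]["retrieval"])  # tf is only 1/100, so the weight stays small
print(w[1]["mining"])     # same idf, but tf is 1/2 in a short document
```

Both "retrieval" and "mining" occur in exactly one document and so share the same idf, yet the single occurrence inside the long document receives a far smaller weight, which is the effect the paragraph above describes.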
In the query-centric approaches to retrieval (Plachouris, Cacheda, Ounis & Rijsbergen, 2003; Plachouris,
Ounis, 2002), queries can be classified to aid in the choice of retrieval strategy. Kang et al. (Kang, Kim,
2003) classify queries as pertaining to topic relevance, homepage finding or service tasks, and use this
classification as a basis for dynamically combining multiple sources of evidence in different ways to improve
retrieval. Plachouris et al. (Plachouris, Ounis, 2002) use WordNet in a concept-based probabilistic approach
to information retrieval where queries are biased according to their calculated scope. In their work, scope is
an indication of the generality or specificity of a query and is used as a factor of uncertainty in the
Dempster-Shafer theory of evidence.
Context-based retrieval approaches aim to provide a more complete retrieval process by incorporating
contextual information into the retrieval process. The use of context in information retrieval is not a new
idea. Jing et al. (Jing, Tzoukermann, 1999) use context as a basis for measuring the semantic distances
between words. Billhardt et al. (Billhardt, Borrajo & Maojo, 2002) propose a context-based vector space
model for information retrieval. The WEBSOM system (Honkela, Kaski, Lagus & Kohonen, 1997) is an
example of another way in which context has been used for information retrieval. It uses a two-level
Kohonen self-organizing map approach to group words and documents of contextual similarity. Context
in WEBSOM is limited to the terms that occur directly on either side of the term in question. IntelliZap
(Finkelstein, Gabrilovich, Matias, Rivlin, Solan, Wolfman & Ruppin, 2001) is a context-based web search
engine that requires the user to select a keyword in the context of some text. The approach makes effective
use of the contextual information in the immediate vicinity of the selected keywords, so that retrieval
precision can be improved. Inquirus (Glover, Lawrence, Gordon, Birmingham & Giles, 2001; Lawrence,
Giles, 1998) is another web search engine that uses contextual information to improve search results. A user
must specify some contextual information, considered as preferences, pertaining to the query. This context
(preferences) provides a high-level description of the user's information need and ultimately controls the
search strategy used by the system. Kleinberg (1999) illustrates how hyperlink information in web pages
can be used for web search when using a set of retrieved documents. An approach that also uses the
characteristics of link information from a set of retrieved documents for topic distillation is presented by
Amitay et al. (Amitay, Carmel, Darlow, Lempel, & Soffer, 2002). PageRank is a hyperlink-based retrieval
algorithm that calculates document scores by considering the entire hyperlink-connected graph represented
by all the links in the entire document collection (Brin, Page, 1998). The model described in (Z John,
Verma, 2006) uses traditional query expansion to determine the context of a query. Another closely related work
(Jonathan,) implicitly deduces context using three different algorithms. Finally, (Pickens, Farlance, 2006)
offers the term context model as a new tool for assessing term presence in a document. The papers (Bhatia,
Khalid K, 2007; Bhatia, Khalid K, 2008) offer a Web retrieval system architecture with a novel re-ranking
algorithm to effectively refine the ranking of web search results for locating the most useful documents.
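To make the idea of scoring documents from the whole link graph concrete, the following is a minimal power-iteration sketch in the spirit of PageRank. It is an illustration under stated assumptions rather than the implementation of Brin and Page: the damping value of 0.85, the fixed iteration count, the even redistribution of rank from pages without out-links, and the four-page graph are all choices made here for the example.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Power-iteration PageRank over a dict {page: [pages it links to]}."""
    pages = set(links)
    for targets in links.values():
        pages.update(targets)
    pages = sorted(pages)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}  # start from a uniform distribution
    for _ in range(iterations):
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                # Split this page's rank equally among the pages it links to.
                share = damping * rank[page] / len(targets)
                for t in targets:
                    new_rank[t] += share
            else:
                # Dangling page (assumption): spread its rank over all pages.
                share = damping * rank[page] / n
                for t in pages:
                    new_rank[t] += share
        rank = new_rank
    return rank


# Hypothetical four-page web: C is linked to by A, B and D.
graph = {"A": ["B", "C"], "B": ["C"], "C": ["A"], "D": ["C"]}
ranks = pagerank(graph)
for page, score in sorted(ranks.items(), key=lambda kv: -kv[1]):
    print(page, round(score, 3))
```

Page C collects the most in-links and ends up with the highest score, while D, which nothing links to, keeps only the baseline share.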
networks (Rumelhart et al., 1986b).
The Self-Organizing Maps have been widely used in unsupervised learning, clustering, and pattern
recognition (Kohonen, 1995).
The Hopfield Networks have been used mostly in search and optimization applications (Hopfield, 1982).
Information extraction
Relevance feedback
Information filtering
Text classification and Text clustering
Information Extraction
Information Extraction (IE) refers to the techniques designed to identify useful information from
text documents automatically. It has the goal of transforming a collection of documents,
usually with the help of an IR system, into information that is more readily digested and
analyzed.
Table 2. Difference between IR and IE
Information Retrieval (IR)
Aims to select relevant documents, i.e., it finds documents.
Views text as a bag of unordered words.
Thus IE works at a finer level of granularity than IR does on documents. Applications of IE include
making Information Retrieval more precise, summarizing documents in well-defined
subject areas, and automatically generating databases from text. IE includes the following subtasks:
Named entity recognition (NER)
It refers to recognizing relevant entities in text. It is the automatic identification from text
documents of the names of entities of interest.
Relation extraction
Linking recognized entities having particular relevant relations.
Machine learning-based entity extraction systems rely on algorithms rather than human-created
rules to extract knowledge or identify patterns from texts.
Neural networks
Decision tree (Baluja et al., 1999),
Hidden Markov Model (Miller et al., 1998),
Entropy maximization (Borthwick et al., 1998).
Relevance Feedback
Relevance Feedback helps users conduct searches iteratively and reformulate search queries based
on evaluation of previously retrieved documents. Using relevance feedback, a model can learn the
common characteristics of a set of relevant documents in order to estimate the probability of
relevance for the remaining documents (Fuhr & Buckley, 1991). Various Machine Learning
algorithms, such as genetic algorithms, ID3, and simulated annealing, have been used in relevance
feedback applications (Kraft et al., 1995; 1997; Chen et al., 1998).
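As a simple illustration of query reformulation from user judgments, the sketch below uses the classic Rocchio update. Note that this is a different technique from the genetic algorithm, ID3 and simulated annealing approaches cited above; it is shown only because it is compact. The weights alpha, beta and gamma are conventional illustrative values, and the term vectors are hypothetical.

```python
from collections import defaultdict


def rocchio(query, relevant_docs, nonrelevant_docs,
            alpha=1.0, beta=0.75, gamma=0.15):
    """Return a reformulated query vector (term -> weight).

    query and every document are dicts mapping terms to weights.
    The new query moves toward the centroid of the relevant documents
    and away from the centroid of the non-relevant ones.
    """
    new_query = defaultdict(float)
    for term, weight in query.items():
        new_query[term] += alpha * weight
    for doc in relevant_docs:
        for term, weight in doc.items():
            new_query[term] += beta * weight / len(relevant_docs)
    for doc in nonrelevant_docs:
        for term, weight in doc.items():
            new_query[term] -= gamma * weight / len(nonrelevant_docs)
    # Terms whose weight is not positive are dropped from the query.
    return {t: w for t, w in new_query.items() if w > 0}


# Hypothetical feedback round for the ambiguous query "jaguar".
query = {"jaguar": 1.0}
relevant = [{"jaguar": 1.0, "cat": 1.0}, {"jaguar": 1.0, "habitat": 1.0}]
nonrelevant = [{"jaguar": 1.0, "car": 1.0}]
q = rocchio(query, relevant, nonrelevant)
print(sorted(q))  # ['cat', 'habitat', 'jaguar']
```

After one round of feedback the query gains the terms shared by the documents the user marked relevant, and the term that appears only in the rejected document is excluded.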
Information filtering
Information filtering techniques try to learn about users' interests based on their evaluations and
actions, and then to use this information to analyze new documents. Many personalization and
collaborative systems have been implemented as software agents to help users in information
systems (Maes, 1994).
Web Mining
Web mining refers to the use of data mining techniques to automatically retrieve, extract and
evaluate (generalize/analyze) information for knowledge discovery from web documents and services. It
is about making implicit or "hidden" knowledge explicit. The digital revolution and the phenomenal
growth of the Web have led to the generation and storage of huge amounts of data, prompting the
need for intelligent analysis methodologies to discover useful knowledge from it. Due to the
heterogeneous, semi-structured, distributed, time-varying and multi-dimensional nature of web data,
automated discovery of targeted knowledge is a challenging task. It calls for novel methods that draw
from a wide range of parent areas: Data Mining, Machine Learning, Information Retrieval, Natural
Language Processing, Multimedia, and Statistics. In this article, we provide a review of the field from
the perspectives of Machine Learning and Information Retrieval and how they have been applied in Web
mining systems. Machine Learning is the basis for most data mining and text mining techniques, and
Information Retrieval research has largely influenced the research directions of Web mining applications.
profiles on the Web server still belong to the category of traditional data mining. Secondly, the
Web is a directed graph consisting of document nodes and hyperlinks. Therefore, the patterns
identified may concern either the content of documents or the structure of the Web.
Moreover, Web documents are semi-structured or unstructured, with few machine-readable
semantics, whereas the source for data mining is confined to structured data in databases. As a
result, some traditional data mining methods are not applicable to Web mining.
EXAMPLE: Web server access logs, proxy server logs, browser logs, user
profiles, registration data, user sessions or transactions, cookies, user queries,
bookmark data, mouse clicks and scrolls, any other interaction data.
Web structure mining can be further divided into external structure (hyperlinks between
web pages) mining, internal structure (of a web page) mining and URL mining.
Web Content Mining
DEFINITION: Web Content Mining refers to extracting useful information and
knowledge from content in Web pages. It is the discovery of useful information
from Web contents, including text, images, audio, video, etc. It can be divided
into text mining (including text files, HTML documents, etc.) and multimedia mining.
Firstly, most Web documents are in HTML format and contain many markup tags,
mainly used for formatting.
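Because formatting tags carry little content, a common first step in Web content mining is to separate the visible text from the markup. The following is a minimal sketch using Python's standard html.parser module; the class name TextExtractor, the helper html_to_text and the sample page are hypothetical, and real pages usually need more careful handling (encodings, malformed markup, boilerplate removal).

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect visible text, skipping the contents of <script> and <style>."""

    SKIP = {"script", "style"}

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only non-empty text that is outside script/style elements.
        if self._skip_depth == 0 and data.strip():
            self.chunks.append(data.strip())


def html_to_text(html):
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)


page = (
    "<html><head><style>b{color:red}</style></head>"
    "<body><h1>Web Mining</h1><p>Content <b>matters</b></p></body></html>"
)
print(html_to_text(page))  # Web Mining Content matters
```

The extracted text can then be passed to the term-weighting, classification or clustering steps discussed elsewhere in this paper.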
Web Content Mining. Web content mining is mainly based on research in information
retrieval and text mining, such as information extraction, text classification and clustering, and
information visualization. However, it also includes some new applications, such as Web resource
discovery. Some important Web content mining techniques and applications are reviewed in
the following subsections:
E.g., the Itsy Bitsy Spider searches the Web using a best-first search and a
genetic algorithm approach (Chen et al., 1998).
Web Visualization
Web Visualization tools have been used to help users maintain a "big picture" of
the retrieval results from search engines, web sites, a subset of the Web, or even
the whole Web. The most well known example of using the tree-metaphor for
Web browsing is the hyperbolic tree developed by Xerox PARC (Lamping &
Rao, 1996). In these visualization systems, Machine Learning techniques are
often used to determine how Web pages should be placed in the 2-D or 3-D
space. One example is the SOM algorithm described earlier (Chen et al., 1996).
Web Structure Mining. Web link structure has been widely used to infer information about important web
pages. Web structure mining has been largely influenced by research in:
We believe that research in Machine Learning and Information Retrieval helps develop Web Mining
applications that can more effectively and efficiently utilize the Web of knowledge.
References
Amitay, E., Carmel, D., Darlow, A., Lempel, R., Soffer, A., Topic distillation with knowledge agents, Proceedings of
the 11th Text Retrieval Conference (TREC-11), Gaithersburg, Maryland, USA, 2002.
Anh, V., Moffat, A.: Robust and web retrieval document-centric integral impacts, Proceedings of the 12th Text
Retrieval Conference (TREC-12), Gaithersburg, USA, pp. 726-731, 2003.
Baeza-Yates R., Ribeiro-Neto B., Modern Information Retrieval. Addison Wesley, New York, 1999.
Baluja, S., Mittal, V., & Sukthankar, R., Applying machine learning for high performance named-entity
extraction. Proceedings of the Conference of the Pacific Association for Computational Linguistics, 365-378, 1999.
Belew, R. K., Adaptive information retrieval: Using a connectionist representation to retrieve and learn about
documents. Proceedings of the 12th Annual International ACM SIGIR Conference on Research and Development in
Information Retrieval, 11-20, 1989.
Berners-Lee, T., Hendler, J., & Lassila, O., The Semantic Web. Scientific American, 284(5), 35-43, 2001.
Bhatia MPS, Khalid K Akshi, Contextual Proximity Based Term-Weighting for improved Web Information
Retrieval, Proceedings of KSEM 2007, Lecture notes of AI-4798, Springer, Pages 267-278, 2007.
Bhatia MPS, Khalid K Akshi, The Context-driven Generation of Web Search, (To be published), Proceedings of
CISTM 2008, Journal of Information Science and Technology, 2008.
Billhardt, H., Borrajo, D., Maojo, V., A context vector model for information retrieval, Journal of American Society
on Information Science Technology 53(3):236-249, 2002.
Bin W. & Zhijing L., Web Mining Research, Proceedings of 5th International Conference on Computational
Intelligence and Multimedia Applications (ICCIMA03), 2003.
Borthwick, A., Sterling, J., Agichtein, E., & Grishman, R., NYU: Description of the MENE named entity system
as used in MUC-7. Proceedings of the Seventh Message Understanding Conference (MUC-7), 1998.
Brin, S., Page, L., The anatomy of a large-scale hyper-textual web search engine, Proceedings of the 7th WWW
Conference, pp. 107-117, Brisbane, Australia, 1998.
Broder, A., Kumar, R., Maghoul, F., Raghavan, P., Rajagopalan, S., Stata, R., Tomkins, A., & Wiener, J., Graph
structure in the Web. Proceedings of the 9th International World Wide Web Conference, 2000.
Carbonell, J. G., Michalski, R. S., & Mitchell, T. M. (1983). An overview of machine learning. In R. S.
Michalski, J. G. Carbonell, & T. M. Mitchell (Eds.), Machine learning: An artificial intelligence approach (pp.
3-23). Palo Alto, CA: Tioga.
Chakrabarti S., Mining the Web: Discovering knowledge from hypertext data, Morgan Kaufmann
Publisher, San Francisco, CA, 2003.
Chang, C. H., & Lui, S. C. IEPAD: Information extraction based on pattern discovery. Proceedings of the 10th
World Wide Web Conference, 2001.
Chau, M., Zeng, D., & Chen, H., Personalized spiders for Web search and analysis. Proceedings of the 1st
ACM-IEEE Joint Conference on Digital Libraries, 79-87, 2001.
Chen, H., Knowledge management systems: A text mining perspective. Tucson, AZ: University of Arizona.,
http://ai.bpa.arizona.edu, 2001.
Chen, H., Chau, M., & Zeng, D., CI spider: A tool for competitive intelligence on the Web. Decision Support
Systems, 34(1), 1-17, 2002.
Chen, H. M., & Cooper, M. D. Using clustering techniques to detect usage patterns in a Web-based information
system. Journal of the American Society for Information Science and Technology, 52, 888-904, 2001.
Chen, H., & Ng, T., An algorithmic approach to concept exploration in a large knowledge network (automatic
thesaurus consultation): Symbolic branch-and-bound search vs. connectionist Hopfield net activation. Journal of
the American Society for Information Science, 46, 348-369, 1995.
Chen, H., Schuffels, C., & Orwig, R, Internet categorization and search: A machine learning approach. Journal of
Visual Communication and Image Representation, 7(1), 88-102, 1996.
Chen, H., Shankaranarayanan, G., Iyer, A., & She, L., A machine learning approach to inductive query by
examples: An experiment using relevance feedback, ID3, genetic algorithms, and simulated annealing. Journal
of the American Society for Information Science, 49, 693-705, 1998.
Cheong, F. C, Internet agents: Spiders, wanderers, brokers, and bots. Indianapolis, IN: New Riders Publishing,
1996.
Cohen, P. R., & Feigenbaum, E. A. (1982). The handbook of artificial intelligence (Vol. 3). Reading, MA:
Addison-Wesley.
Craswell, N., Hawking, D., Upstill, T., McLean, A., Wilkinson, R., Wu, M., TREC 12 Web and interactive tracks at
CSIRO, Proceedings of the 12th Text Retrieval Conference (TREC-12), Gaithersburg, USA, pp. 193-203, 2003.
Duda, R., & Hart, P. Pattern classification and scene analysis. New York: Wiley, 1973.
Finkelstein, L., Gabrilovich, E., Matias, Y., Rivlin, E., Solan, Z., Wolfman G., Ruppin, E., Placing search in context:
the concept revisited, Proceedings of the 10th International World Wide Web Conference, pp. 406-414, 2001.
Fisher, D. H., Knowledge acquisition via incremental conceptual clustering. Machine Learning, 2, 139-172,
1987.
Fuhr, N., & Buckley, C. A Probabilistic Learning Approach for Document Indexing. ACM Transactions on
Information Systems, 9, 223-248, 1991.
Glover, E., Lawrence, S., Gordon, M., Birmingham, W., Lee Giles, C., Web search your way. Communications of the
ACM, 44(12):97-102, 2001.
Goldberg, D. E., Genetic algorithms in search, optimization, and machine learning. Reading, MA:
Addison-Wesley, 1989.
Henzinger Monika, The Past, Present, and Future of Web Search Engines, Proceedings of 31st International
Colloquium, ICALP 2004, Finland, July 12-16, 2004.
Honkela, T., Kaski, S., Lagus, K., Kohonen, T., WEBSOM: self-organizing maps of document collections,
Proceedings of WSOM'97 (Workshop on Self-Organizing Maps), Espoo, Finland, pp. 310-315, 1997.
Hopfield, J. J., Neural network and physical systems with collective computational abilities. Proceedings of the
National Academy of Science, 79(4), 2554-2558, 1982.
Jing, H., Tzoukermann, E., Information retrieval based on context distance and morphology, Proceedings of the
22nd Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pp.
90-96, 1999.
Jonathan Siddharth, Context Driven Ranking for the Web, http://infolab.stanford.edu/~jonsid/.
Kahle, B, Preserving the Internet. Scientific American, 276(6), 82-83, 1997.
Kang, I., Kim, G.: Query type classification for web document retrieval, Proceedings of the 26th Annual
International ACM SIGIR Conference on Research and Development in Information Retrieval, ACM Press, pp.
64-71, 2003.
Kleinberg, J., Authoritative sources in a hyperlinked environment, Journal of the ACM 46(5):604-632, 1999.
Kobayashi M. & Takeda K., Information Retrieval on the Web, ACM Computing Surveys, Vol. 32, No. 2, June
2000.
Kohonen, T, Self-organizing maps. Berlin, Germany: Springer-Verlag, 1995.
Kononenko, I. Inductive and Bayesian learning in medical diagnosis. Applied Artificial Intelligence, 7, 317-337, 1993.
Kosala R. & Blockeel H., Web Mining Research: A survey, SIGKDD Explorations, Vol. 2, Issue 1, July 2000, pp. 1-15.
Kraft, D. H., Petry, F. E., Buckles, B. P., & Sadasivan, T. Applying genetic algorithms to information retrieval
systems via relevance feedback. In P. Bosc & J. Kacprzyk (Eds.), Fuzziness in database management systems
(pp. 330-344). Heidelberg, Germany: Physica-Verlag, 1995.
Kraft, D. H., Petry, F. E., Buckles, B. P., & Sadasivan, T. Genetic algorithms for query optimization in
information retrieval: Relevance feedback. In E. Sanchez, T. Shibata, & L. A. Zadeh (Eds.), Genetic algorithms and
fuzzy logic systems (pp. 155-173). Singapore: World Scientific, 1997.
Kuhlen, R., Information and Pragmatic Value-adding: Language Games and Information Science, Computers and
the Humanities 25, pages 93-101, 1991.
Kwok, K. L., A neural network for probabilistic information retrieval. Proceedings of the 12th Annual
International ACM SIGIR Conference on Research and Development in Information Retrieval, 21-30, 1989.
Lamping, J., & Rao, R., Visualizing large trees using the hyperbolic browser. Proceedings of the ACM CHI '96
Conference on Human Factors in Computing Systems, 388-389, 1996.
Langley, P., Iba, W., & Thompson, K., An analysis of Bayesian classifiers. Proceedings of the 10th
National Conference on Artificial Intelligence, 223-228, 1992.
Lawrence, S., Giles, C., Context & page analysis for improved web search. IEEE Internet Computing 2(4), pp.
38-46, 1998.
Lin, X., Soergel, D., & Marchionini, G. A self-organizing semantic map for information retrieval. Proceedings of
the 14th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval,
262-269, 1991.
Lippmann, R. P., An introduction to computing with neural networks. IEEE Acoustics Speech and Signal
Processing Magazine, 4, 4-22, 1987.
Luhn, H., A statistical approach to mechanized encoding and searching of literary information, IBM Journal of
Research and Development, 1(4):309-317, 1957.
Maes, P., Agents that reduce work and information overload. Communications of the ACM, 37(7), 31-40, 1994.
Marchionini, G., Co-evolution of user and organizational interfaces: A longitudinal case study of WWW
dissemination of national statistics. Journal of the American Society for Information Science and Technology, 53,
1192-1209, 2002.
Michalewicz, Z., Genetic algorithms + data structures = evolution programs. Berlin, Germany: Springer-Verlag,
1992.
Miller, S., Crystal, M., Fox, H., Ramshaw, L., Schwartz, R., Stone, R., Weischedel, R., & the Annotation Group
BBN: Description of the SIFT system as used for MUC-7. Proceedings of the Seventh Message Understanding
Conference (MUC-7), 1998.
Nunes Sérgio, State of the Art in Web Information Retrieval, Technical Report, FEUP, 2006.
Pickens J. & Farlance A.M., Term Context Models for Information Retrieval, ACM CIKM06, 2006.
Pinkerton, B. Finding what people want: Experiences with the Web Crawler. Proceedings of the 2nd
International World Wide Web Conference, 1994.
Plachouris, V., Cacheda, F., Ounis, I., van Rijsbergen, C., University of Glasgow at the Web Track: Dynamic
Application of Hyperlink Analysis using the Query Scope, Proceedings of the 12th Text Retrieval Conference
(TREC-12), Gaithersburg, USA, pp. 636-642, 2003.
Plachouris, V., Ounis, I., Query-biased combination of evidence on the web, Workshop on Mathematical/Formal
Methods in Information Retrieval, ACM SIGIR Conference, pp. 105-121, 2002.
Quinlan, J. R., Learning efficient classification procedures and their application to chess end games. In R.
S. Michalski, J. G. Carbonell, & T. M. Mitchell (Eds.), Machine learning: An artificial intelligence approach
(pp. 463-482). Palo Alto, CA: Tioga, 1983.
Quinlan, J. R., C4.5: Programs for machine learning. Los Altos, CA: Morgan Kaufmann, 1993.
Robertson, S.: On term selection for query expansion, Journal of Documentation, Volume 46, Issue 4, Pages
359-364, 1991.
Robertson, S., Walker, S.: Okapi/Keenbow at TREC-8, Proceedings of the 8th Text Retrieval Conference (TREC-8),
Gaithersburg, USA, pp. 151-161, 1999.
Rumelhart, D. E., Hinton, G. E., & McClelland, J. L., A general framework for parallel distributed processing. In
D. E. Rumelhart, J. L. McClelland, & the PDP Research Group (Eds.), Parallel distributed processing (pp. 45-76).
Cambridge, MA: The MIT Press, 1986.
Rumelhart, D. E., Hinton, G. E., & Williams, R. J. Learning internal representations by error propagation. In D.
E. Rumelhart, J. L. McClelland, & the PDP Research Group (Eds.), Parallel distributed processing (pp. 318-362).
Cambridge, MA: MIT Press, 1986.
Salton, G., Yang, C.: On the specification of term values in automatic indexing. J. Doc. 29(4):351-372,
1973.
Simon, H. A. (1983). Why Should Machine Learn? In R. S. Michalski, J. Carbonell, & T. M. Mitchell (Eds.),
Machine learning: An artificial intelligence approach (pp. 25-38). Palo Alto, CA: Tioga Press.
Srivastava J., Desikan P. & Kumar V., Web Mining - Accomplishments and Future directions, Proceedings of
National Science Foundation Workshop on Next Generation Data Mining (NGDM02), Baltimore, Maryland,
2002.
Tian Chin, Web Search Improvement Based on Proximity and density of multiple keywords, IEEE Proceedings of
the 22nd International conference on Data engineering Workshops (ICDEW06)
Verma Brijesh, John Z , A Novel Context Matching Based Technique for Web Document Retrieval, World Wide
Web, Volume 9, Number 4, December 2006 , pp. 485-503(19), Springer, 2006.
Voorhees, E., Using WordNet for text retrieval, WordNet: An Electronic Lexical Database, MIT Press, pp. 285-303,
1998.
Wen J.R., Probabilistic Model for Contextual Retrieval, ACM SIGIR04, 2004.
Yu, S., Cai, D., Wen, J., Ma, W., Improving pseudo-relevance feedback in web information retrieval using web page
segmentation, Proceedings of the 12th International World Wide Web Conference, 2003.
Zadeh, L. A., Fuzzy sets. Information and Control, 8, 338-353, 1965.