Article

Length normalization in XML retrieval

Authors:

Börkur Sigurbjörnsson

SIGIR '04: Proceedings of the 27th annual international ACM SIGIR conference on Research and development in information retrieval

Pages 80 - 87

https://doi.org/10.1145/1008992.1009009

Published: 25 July 2004

Abstract

XML retrieval is a departure from standard document retrieval in which each individual XML element, ranging from italicized words or phrases to full blown articles, is a potentially retrievable unit. The distribution of XML element lengths is unlike what we usually observe in standard document collections, prompting us to revisit the issue of document length normalization. We perform a comparative analysis of arbitrary elements versus relevant elements, and show the importance of length as a parameter for XML retrieval. Within the language modeling framework, we investigate a range of techniques that deal with length either directly or indirectly. We observe a length bias introduced by the amount of smoothing, and show the importance of extreme length priors for XML retrieval. We also show that simply removing shorter elements from the index (by introducing a cut-off value) does not create an appropriate document length normalization. Even after increasing the minimal size of XML elements occurring in the index, the importance of an extreme length bias remains.
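
The interplay of smoothing, length priors, and index cut-offs described in the abstract can be illustrated with a minimal sketch. It assumes a query-likelihood language model with linear-interpolation smoothing and a prior proportional to element length raised to a power `beta`. The function names, the `lam`, `beta`, and `min_len` parameters, and the toy elements are hypothetical choices made for illustration; they are not taken from the paper's implementation.

```python
import math
from collections import Counter

def score(query, element, coll_tf, coll_len, lam=0.2, beta=1.0):
    """Log query likelihood with interpolation smoothing plus a log length prior.

    lam  -- weight on the element model (assumed parameterisation)
    beta -- exponent of the length prior; beta=0 disables the prior
    """
    tf = Counter(element)
    n = len(element)  # assumed >= 1
    s = beta * math.log(n)  # length prior: P(e) proportional to |e| ** beta
    for t in query:
        p = lam * tf[t] / n + (1 - lam) * coll_tf[t] / coll_len
        if p == 0.0:
            return float("-inf")  # term does not occur anywhere in the index
        s += math.log(p)
    return s

def rank(query, elements, min_len=1, **kw):
    """Rank element ids, dropping elements shorter than the cut-off min_len."""
    kept = {k: v for k, v in elements.items() if len(v) >= min_len}
    coll_tf = Counter(t for v in kept.values() for t in v)
    coll_len = sum(len(v) for v in kept.values())
    scored = [(score(query, v, coll_tf, coll_len, **kw), k) for k, v in kept.items()]
    return [k for _, k in sorted(scored, reverse=True)]

# Toy index (made-up data): an italicized word, a section, and an article.
elements = {
    "it": ["xml"],
    "sec": ["xml", "retrieval", "ranks", "elements"],
    "article": ["xml", "retrieval", "ranks", "elements", "of",
                "any", "length", "in", "xml", "documents"],
}

print(rank(["xml"], elements, beta=0.0))             # no length prior
print(rank(["xml"], elements, beta=1.0))             # with length prior
print(rank(["xml"], elements, min_len=2, beta=0.0))  # cut-off only
```

In this toy setting the one-word element wins without a prior, the longest element wins with the prior, and the cut-off alone only removes the shortest element while the remaining ranking still favors the shorter of the two survivors.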

Cited By

J. Kamps and M. Lalmas. Structure Weight. In Encyclopedia of Database Systems, pages 3821--3822, 2018. Online publication date: 7-Dec-2018.
https://doi.org/10.1007/978-1-4614-8265-9_377
J. Kamps and M. Lalmas. Structure Weight. In Encyclopedia of Database Systems, pages 1--2, 2016. Online publication date: 8-Dec-2016.
https://doi.org/10.1007/978-1-4899-7993-3_377-2
J. Kamps. Indexing Units of Structured Text Retrieval. In Encyclopedia of Database Systems, pages 1--5, 2016. Online publication date: 21-Dec-2016.
https://doi.org/10.1007/978-1-4899-7993-3_202-2
Index Terms

Length normalization in XML retrieval
1. Applied computing
  1. Computers in other domains
    1. Digital libraries and archives
  2. Document management and text processing
    1. Document preparation
2. Information systems

Recommendations

The Importance of Length Normalization for XML Retrieval
Abstract
XML retrieval is a departure from standard document retrieval in which each individual XML element, ranging from italicized words or phrases to full blown articles, is a retrievable unit. The distribution of XML element lengths is unlike what we ...
An analysis on document length retrieval trends in language modeling smoothing
Abstract
Document length is widely recognized as an important factor for adjusting retrieval systems. Many models tend to favor the retrieval of either short or long documents and, thus, a length-based correction needs to be applied for avoiding any length ...
Using small XML elements to support relevance
SIGIR '06: Proceedings of the 29th annual international ACM SIGIR conference on Research and development in information retrieval

Small XML elements are often estimated relevant by the retrieval model but they are not desirable retrieval units. This paper presents a generic model that exploits the information obtained from small elements. We identify relationships between small ...

Reviews

Reviewer: Xiaoya Tang

The retrievable units in Extensible Markup Language (XML) documents are individual elements, and the length distribution of such elements is different from that of standard documents. Therefore, document length normalization in XML retrieval needs to take a different approach than that in standard document retrieval. In this paper, the authors investigate the issue of document length normalization in the XML retrieval context, by analyzing the length distributions of XML elements, and carrying out an experiment investigating length normalization techniques in XML retrieval.

The element length analysis indicates that, although the distribution of arbitrary elements is skewed toward short elements, the distribution of relevant elements is fairly even, except in the case of the shortest elements. In addition, the length distribution of prior probability of relevant elements is heavily skewed toward long elements.

The experiment evaluates the effects of smoothing, length priors, and index cut-offs on retrieval performance. The results indicate that length priors improve retrieval performance significantly. While removing shorter elements from the index does improve performance, this improvement is far less than that obtained by the use of length priors. The results also indicate that the smoothing parameter is dependent on the length prior.

The primary contribution of this paper is the reconsideration of the concept of document length normalization in a new context, that of XML retrieval. This paper also provides possible techniques that could be used for XML element length normalization.

Online Computing Reviews Service
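
The element length analysis summarized in the review compares how lengths are distributed among all elements and among relevant ones. A minimal sketch of such a comparison follows, assuming lengths are counted in tokens and grouped into order-of-magnitude bins. The helper names and the toy lengths are hypothetical and are not results from the paper.

```python
from collections import Counter

def length_bin(n_tokens):
    """Order-of-magnitude bin of a positive length: 1-9 -> 0, 10-99 -> 1, ..."""
    return len(str(n_tokens)) - 1

def relevance_rate_by_length(all_lengths, relevant_lengths):
    """Fraction of elements in each length bin that are relevant."""
    all_bins = Counter(length_bin(n) for n in all_lengths)
    rel_bins = Counter(length_bin(n) for n in relevant_lengths)
    return {b: rel_bins[b] / all_bins[b] for b in sorted(all_bins)}

# Made-up lengths: most elements are short, the relevant ones are longer.
all_lengths = [1, 2, 3, 5, 8, 12, 40, 150, 900]
relevant_lengths = [40, 150, 900]

print(relevance_rate_by_length(all_lengths, relevant_lengths))
```

A rate that rises with the bin index is the kind of pattern that would motivate a prior favoring long elements.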

Information & Contributors

Information

Published In

SIGIR '04: Proceedings of the 27th annual international ACM SIGIR conference on Research and development in information retrieval

July 2004

624 pages

ISBN:1581138814

DOI:10.1145/1008992

General Chair: Mark Sanderson, University of Sheffield (UK)

Program Chairs: Kalervo Järvelin, University of Tampere (Finland); James Allan, University of Massachusetts (USA); Peter Bruza, Distributed Systems Technology Centre (Australia)

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States

Qualifiers

Article

Conference

SIGIR04

Sponsor:

SIGIR04: The 27th ACM/SIGIR International Symposium on Information Retrieval 2004

July 25 - 29, 2004

Sheffield, United Kingdom

Acceptance Rates

Overall Acceptance Rate 792 of 3,983 submissions, 20%

Bibliometrics & Citations

Bibliometrics

Article Metrics

Total Citations: 46
Total Downloads: 1,027
Downloads (last 12 months): 3
Downloads (last 6 weeks): 0

Reflects downloads up to 12 Jan 2025

Citations

S. Berchiche-Fellag and M. Mezghiche. Searching XML Element Using Terms Propagation Method. In Foundations of Intelligent Systems, pages 395--404, 2014. Online publication date: 2014.
https://doi.org/10.1007/978-3-319-08326-1_40
T. Roelleke. Information Retrieval Models: Foundations and Relationships. Synthesis Lectures on Information Concepts, Retrieval, and Services, 5(3):1--163, 2013. Online publication date: 26-Jul-2013.
https://doi.org/10.2200/S00494ED1V01Y201304ICR027
A. Keyaki, J. Miyazaki, K. Hatano, G. Yamamoto, T. Taketomi, and H. Kato. Path Expression-based Smoothing of Query Likelihood Model for XML Element Retrieval. In Proceedings of the 2013 Second IIAI International Conference on Advanced Applied Informatics, pages 296--300, 2013. Online publication date: 31-Aug-2013.
https://dl.acm.org/doi/10.1109/IIAI-AAI.2013.18
M. Lalmas. XML Information Retrieval. In Understanding Information Retrieval Systems, pages 345--362, 2012. Online publication date: 26-Jan-2012.
https://doi.org/10.1201/b11499-29
W. Weerkamp, K. Balog, and M. de Rijke. Exploiting External Collections for Query Expansion. ACM Transactions on the Web, 6(4):1--29, 2012. Online publication date: 1-Nov-2012.
https://dl.acm.org/doi/10.1145/2382616.2382621
M. Chaa, O. Nouali, and K. Bal. Learning to rank in XML information retrieval: Which feature improve the best? In Seventh International Conference on Digital Information Management (ICDIM 2012), pages 336--340, 2012. Online publication date: Aug-2012.
https://doi.org/10.1109/ICDIM.2012.6360123
Z. Szlávik, A. Tombros, and M. Lalmas. Summarisation of the logical structure of XML documents. Information Processing and Management: an International Journal, 48(5):956--968, 2012. Online publication date: 1-Sep-2012.
https://dl.acm.org/doi/10.1016/j.ipm.2011.11.002