DOI: 10.1145/1008992.1009009

Length normalization in XML retrieval

Published: 25 July 2004

Abstract

XML retrieval is a departure from standard document retrieval in that each individual XML element, ranging from italicized words or phrases to full-blown articles, is a potentially retrievable unit. The distribution of XML element lengths is unlike what we usually observe in standard document collections, prompting us to revisit the issue of document length normalization. We perform a comparative analysis of arbitrary elements versus relevant elements, and show the importance of length as a parameter for XML retrieval. Within the language modeling framework, we investigate a range of techniques that deal with length either directly or indirectly. We observe a length bias introduced by the amount of smoothing, and show the importance of extreme length priors for XML retrieval. We also show that simply removing shorter elements from the index (by introducing a cut-off value) does not create an appropriate document length normalization. Even after increasing the minimal size of XML elements occurring in the index, the importance of an extreme length bias remains.
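
The interplay of smoothing and length priors described in the abstract can be illustrated with a minimal sketch. It assumes a query-likelihood model with linear interpolation smoothing and a length prior proportional to element length raised to a power; the function name `score_element`, the parameters `lam` and `beta`, and the toy data are hypothetical and are not taken from the paper, which may define its model differently.

```python
import math
from collections import Counter


def score_element(query_terms, element_tokens, collection_tf, collection_len,
                  lam=0.5, beta=1.0):
    """Log query likelihood with linear smoothing plus a log length prior.

    Assumed model (not necessarily the paper's exact formulation):
      P(t | e) = lam * tf(t, e) / |e| + (1 - lam) * cf(t) / |C|
      P(e)     proportional to |e| ** beta   (beta = 0 disables the prior)
    """
    length = len(element_tokens)
    if length == 0:
        return float("-inf")
    tf = Counter(element_tokens)
    # Length prior, up to an additive constant shared by all elements.
    score = beta * math.log(length)
    for term in query_terms:
        p_elem = tf[term] / length
        p_coll = collection_tf.get(term, 0) / collection_len
        p = lam * p_elem + (1 - lam) * p_coll
        if p == 0.0:
            return float("-inf")  # term unseen in the whole collection
        score += math.log(p)
    return score


# Toy data: a one-word element and a ten-word element, both containing "xml".
short_elem = ["xml"]
long_elem = ["xml", "retrieval", "ranks", "elements", "of",
             "very", "different", "lengths", "in", "practice"]
collection_tf = Counter(short_elem + long_elem)
collection_len = sum(collection_tf.values())
query = ["xml"]

for beta in (0.0, 1.0):
    s = score_element(query, short_elem, collection_tf, collection_len, beta=beta)
    l = score_element(query, long_elem, collection_tf, collection_len, beta=beta)
    print(f"beta={beta}: short={s:.3f} long={l:.3f}")
```

In this toy setting the one-word element outranks the longer one when no prior is used, and the ranking flips once the length prior is switched on, which is the kind of length bias the abstract refers to.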

Reviews

Xiaoya Tang

The retrievable units in Extensible Markup Language (XML) documents are individual elements, and the length distribution of such elements is different from that of standard documents. Therefore, document length normalization in XML retrieval needs to take a different approach than that in standard document retrieval. In this paper, the authors investigate the issue of document length normalization in the XML retrieval context, by analyzing the length distributions of XML elements, and carrying out an experiment investigating length normalization techniques in XML retrieval.

The element length analysis indicates that, although the distribution of arbitrary elements is skewed toward short elements, the distribution of relevant elements is fairly even, except in the case of the shortest elements. In addition, the length distribution of prior probability of relevant elements is heavily skewed toward long elements. The experiment evaluates the effects of smoothing, length priors, and index cut-offs on retrieval performance. The results indicate that length priors improve retrieval performance significantly. While removing shorter elements from the index does improve performance, this improvement is far less than that obtained by the use of length priors. The results also indicate that the smoothing parameter is dependent on the length prior.

The primary contribution of this paper is the reconsideration of the concept of document length normalization in a new context, that of XML retrieval. This paper also provides possible techniques that could be used for XML element length normalization.

Online Computing Reviews Service

Published In

SIGIR '04: Proceedings of the 27th annual international ACM SIGIR conference on Research and development in information retrieval
July 2004
624 pages
ISBN:1581138814
DOI:10.1145/1008992

Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. XML retrieval
  2. language models
  3. length normalization
  4. smoothing

Qualifiers

  • Article

Conference

SIGIR04

Acceptance Rates

Overall Acceptance Rate 792 of 3,983 submissions, 20%
