DOI: 10.1145/3340531.3412773
Feature Extraction for Large-Scale Text Collections

Published: 19 October 2020

Abstract

Feature engineering is a fundamental but poorly documented component of Learning-to-Rank (LTR) search engines. Such features are commonly used to construct learning models for web and product search engines, recommender systems, and question-answering tasks. In each of these domains, there is growing interest in creating open-access test collections that promote reproducible research. However, few open-source software packages are capable of extracting high-quality machine learning features from large text collections. Instead, most feature-based LTR research relies on "canned" test collections, which often expose neither critical details about the underlying collection nor the implementation details of the extracted features, both of which are crucial for collection creation and for deploying a search engine into production. As a result, such experiments are rarely reproducible with new features or collections, and are of little help to companies wishing to deploy LTR systems.
In this paper, we introduce Fxt, an open-source framework for efficient and scalable feature extraction. Fxt can easily be integrated into complex, high-performance software applications to help solve a wide variety of text-based machine learning problems. To demonstrate the software's utility, we build and document a reproducible feature extraction pipeline and show how to recreate several common LTR experiments using the ClueWeb09B collection. Researchers and practitioners can use Fxt to extend their machine learning pipelines for various text-based retrieval tasks, and to learn how a number of static document features and query-specific features are implemented.
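To make the kind of output such a pipeline produces concrete: an LTR feature extractor computes one feature vector per (query, document) pair, mixing query-dependent features (e.g. BM25) with static document features (e.g. document length). The following is a minimal Python sketch of that idea, not Fxt's actual API; the toy collection, parameter values, and all names here are hypothetical.

```python
import math
from collections import Counter

# Toy two-document collection; in practice these statistics come from an
# inverted index built over a large collection such as ClueWeb09B.
docs = {
    "d1": "feature extraction for large scale text collections".split(),
    "d2": "learning to rank with gradient boosted trees".split(),
}

N = len(docs)                                    # collection size
avgdl = sum(len(d) for d in docs.values()) / N   # average document length
df = Counter(t for d in docs.values() for t in set(d))  # document frequencies

def bm25(query, doc, k1=0.9, b=0.4):
    """One query-dependent feature: the BM25 score of a (query, document) pair."""
    tf = Counter(doc)
    score = 0.0
    for term in query:
        if term not in tf:
            continue
        idf = math.log(1.0 + (N - df[term] + 0.5) / (df[term] + 0.5))
        score += idf * tf[term] * (k1 + 1) / (
            tf[term] + k1 * (1 - b + b * len(doc) / avgdl))
    return score

# Each (query, document) pair yields one row of the LTR feature matrix,
# combining a query-dependent feature (BM25) with a static one (length).
query = "feature extraction".split()
for docno, doc in docs.items():
    print(docno, [bm25(query, doc), len(doc)])
```

In a real deployment the term statistics (document frequencies, lengths) would be read from a prebuilt index rather than recomputed per query, which is the kind of efficiency and scalability concern a framework like Fxt is designed to address.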

Supplementary Material

MP4 File (3340531.3412773.mp4)
Presentation for the resource-track paper "Feature Extraction for Large-Scale Text Collections".


Published In

CIKM '20: Proceedings of the 29th ACM International Conference on Information & Knowledge Management
October 2020
3619 pages
ISBN:9781450368599
DOI:10.1145/3340531


Publisher

Association for Computing Machinery

New York, NY, United States


Author Tags

  1. clueweb
  2. feature extraction
  3. feature importance
  4. feature index
  5. feature repository
  6. lambdamart
  7. learning to rank
  8. ltr

Qualifiers

  • Research-article

Funding Sources

  • ARC Discovery Grant
  • Amazon Research Award

Conference

CIKM '20

Acceptance Rates

Overall Acceptance Rate 1,861 of 8,427 submissions, 22%
