DOI: 10.1145/3340531.3412773
Feature Extraction for Large-Scale Text Collections

Published: 19 October 2020

Abstract

Feature engineering is a fundamental but poorly documented component of Learning-to-Rank (LTR) search engines. Such features are commonly used to construct learning models for web and product search engines, recommender systems, and question-answering tasks. In each of these domains, there is growing interest in creating open-access test collections that promote reproducible research. However, few open-source software packages are capable of extracting high-quality machine learning features from large text collections. Instead, most feature-based LTR research relies on "canned" test collections, which often expose neither critical details about the underlying collection nor the implementation details of the extracted features, both of which are crucial for collection creation and for deploying a search engine into production. As a result, such experiments are rarely reproducible with new features or collections, and are of little help to companies wishing to deploy LTR systems.
In this paper, we introduce Fxt, an open-source framework for efficient and scalable feature extraction. Fxt can easily be integrated into complex, high-performance software applications to help solve a wide variety of text-based machine learning problems. To demonstrate the software's utility, we build and document a reproducible feature extraction pipeline and show how to recreate several common LTR experiments using the ClueWeb09B collection. Researchers and practitioners can use Fxt to extend their machine learning pipelines for various text-based retrieval tasks, and to learn how a number of static document features and query-specific features are implemented.
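To make the kind of output such a pipeline produces concrete: an LTR feature extractor computes one feature vector per (query, document) pair, mixing query-dependent features (e.g. BM25) with static document features (e.g. document length). The following is a minimal Python sketch of that idea, not Fxt's actual API; the toy collection, parameter values, and all names here are hypothetical.

```python
import math
from collections import Counter

# Toy two-document collection; in practice these statistics come from an
# inverted index built over a large collection such as ClueWeb09B.
docs = {
    "d1": "feature extraction for large scale text collections".split(),
    "d2": "learning to rank with gradient boosted trees".split(),
}

N = len(docs)                                    # collection size
avgdl = sum(len(d) for d in docs.values()) / N   # average document length
df = Counter(t for d in docs.values() for t in set(d))  # document frequencies

def bm25(query, doc, k1=0.9, b=0.4):
    """One query-dependent feature: the BM25 score of a (query, document) pair."""
    tf = Counter(doc)
    score = 0.0
    for term in query:
        if term not in tf:
            continue
        idf = math.log(1.0 + (N - df[term] + 0.5) / (df[term] + 0.5))
        score += idf * tf[term] * (k1 + 1) / (
            tf[term] + k1 * (1 - b + b * len(doc) / avgdl))
    return score

# Each (query, document) pair yields one row of the LTR feature matrix,
# combining a query-dependent feature (BM25) with a static one (length).
query = "feature extraction".split()
for docno, doc in docs.items():
    print(docno, [bm25(query, doc), len(doc)])
```

In a real deployment the term statistics (document frequencies, lengths) would be read from a prebuilt index rather than recomputed per query, which is the kind of efficiency and scalability concern a framework like Fxt is designed to address.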

Supplementary Material

MP4 File (3340531.3412773.mp4)
Presentation for the resource-track paper "Feature Extraction for Large-Scale Text Collections".


Published In

CIKM '20: Proceedings of the 29th ACM International Conference on Information & Knowledge Management
October 2020
3619 pages
ISBN:9781450368599
DOI:10.1145/3340531


Publisher

Association for Computing Machinery

New York, NY, United States


Author Tags

  1. clueweb
  2. feature extraction
  3. feature importance
  4. feature index
  5. feature repository
  6. lambdamart
  7. learning to rank
  8. ltr

Qualifiers

  • Research-article

Funding Sources

  • ARC Discovery Grant
  • Amazon Research Award

Conference

CIKM '20

Acceptance Rates

Overall Acceptance Rate 1,861 of 8,427 submissions, 22%
