DOI: 10.1145/1645953.1645979
Research article

Supervised semantic indexing

Published: 02 November 2009

Abstract

In this article we propose Supervised Semantic Indexing (SSI), an algorithm that is trained on (query, document) pairs of text documents to predict the quality of their match. Like Latent Semantic Indexing (LSI), our models take account of correlations between words (synonymy, polysemy). However, unlike LSI, our models are trained with a supervised signal directly on the ranking task of interest, which we argue is the reason for our superior results. As the query and target texts are modeled separately, our approach is easily generalized to different retrieval tasks, such as online advertising placement. Dealing with models whose features range over all pairs of words is computationally challenging. We propose several improvements to our basic model to address this issue, including low-rank (but diagonal-preserving) representations and correlated feature hashing (CFH). We provide an empirical study of all these methods on retrieval tasks based on Wikipedia documents as well as an Internet advertisement task. We obtain state-of-the-art performance while providing realistically scalable methods.
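
The abstract gives no formulas, so the following is only a minimal sketch of how a model over all word pairs and a low-rank, diagonal-preserving variant might look. It assumes bag-of-words vectors q and d, a bilinear score q^T W d, the parameterization W = U^T V + I, and a pairwise margin ranking loss; these choices and all function names are illustrative assumptions, not details confirmed by the text above.

```python
import numpy as np

def score_full(q, d, W):
    """Bilinear match score q^T W d; W[i, j] weights the word pair (i, j).
    Assumed form: W is a dense vocab x vocab matrix."""
    return float(q @ W @ d)

def score_low_rank(q, d, U, V):
    """Same score with the assumed parameterization W = U^T V + I.
    U and V have shape (k, vocab); the identity term keeps the
    exact word-match (diagonal) contribution q . d."""
    return float((U @ q) @ (V @ d) + q @ d)

def margin_ranking_step(q, d_pos, d_neg, U, V, lr=0.01, margin=1.0):
    """One SGD step on max(0, margin - f(q, d_pos) + f(q, d_neg)).
    Updates U and V in place and returns the loss before the update."""
    loss = margin - score_low_rank(q, d_pos, U, V) + score_low_rank(q, d_neg, U, V)
    if loss <= 0:
        return 0.0  # ranking already satisfied with the required margin
    diff = d_pos - d_neg
    Uq = U @ q         # computed before U changes
    Vdiff = V @ diff   # computed before V changes
    # d f / d U = outer(V d, q) and d f / d V = outer(U q, d),
    # so descending the hinge loss moves along (d_pos - d_neg).
    U += lr * np.outer(Vdiff, q)
    V += lr * np.outer(Uq, diff)
    return loss
```

Under these assumptions the low-rank score never materializes the vocab x vocab matrix: storage drops from vocab^2 weights to 2 x k x vocab, while the identity term still rewards exact word overlap between query and document.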

Published In

CIKM '09: Proceedings of the 18th ACM conference on Information and knowledge management
November 2009
2162 pages
ISBN: 9781605585123
DOI: 10.1145/1645953

Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. content matching
  2. learning to rank
  3. semantic indexing

Qualifiers

  • Research-article

Conference

CIKM '09

Acceptance Rates

Overall Acceptance Rate 1,861 of 8,427 submissions, 22%

Cited By

  • (2024) Topic mining for theses and job ads in ICT sector: can higher education institutes respond to job market demands? Frontiers in Education, 9. DOI: 10.3389/feduc.2024.1322774. Online publication date: 12-Mar-2024
  • (2024) Enhancing Customer Support in Banking: Leveraging AI for Efficient Ticket Classification. Procedia Computer Science, 246 (128-137). DOI: 10.1016/j.procs.2024.09.235. Online publication date: 2024
  • (2023) SemSup-XC. Proceedings of the 40th International Conference on Machine Learning (228-247). DOI: 10.5555/3618408.3618419. Online publication date: 23-Jul-2023
  • (2023) An Experimental Study on Pretraining Transformers from Scratch for IR. Advances in Information Retrieval (504-520). DOI: 10.1007/978-3-031-28244-7_32. Online publication date: 2-Apr-2023
  • (2023) Axiomatic Analysis of Pre-Processing Methodologies Using Machine Learning in Text Mining. Convergence of Cloud with AI for Big Data Analytics (229-256). DOI: 10.1002/9781119905233.ch11. Online publication date: 10-Feb-2023
  • (2021) Coffee With a Hint of Data: Towards Using Data-Driven Approaches in Personalised Long-Term Interactions. Frontiers in Robotics and AI, 8. DOI: 10.3389/frobt.2021.676814. Online publication date: 28-Sep-2021
  • (2020) Comprehensive Contemplation of Probabilistic Aspects in Intelligent Analytics. International Journal of Service Science, Management, Engineering, and Technology, 11:1 (116-141). DOI: 10.4018/IJSSMET.2020010108. Online publication date: 1-Jan-2020
  • (2019) Learning Click-Based Deep Structure-Preserving Embeddings with Visual Attention. ACM Transactions on Multimedia Computing, Communications, and Applications, 15:3 (1-19). DOI: 10.1145/3328994. Online publication date: 8-Aug-2019
  • (2019) Memory-Augmented Dialogue Management for Task-Oriented Dialogue Systems. ACM Transactions on Information Systems, 37:3 (1-30). DOI: 10.1145/3317612. Online publication date: 8-Jul-2019
  • (2019) A Review on Dimensionality Reduction for Multi-label Classification. IEEE Transactions on Knowledge and Data Engineering (1-1). DOI: 10.1109/TKDE.2019.2940014. Online publication date: 2019
