Semantic hashing using tags and topic modeling

Authors:

Luo Si

SIGIR '13: Proceedings of the 36th international ACM SIGIR conference on Research and development in information retrieval

Pages 213 - 222

https://doi.org/10.1145/2484028.2484037

Published: 28 July 2013

Abstract

Designing efficient and effective solutions for large-scale similarity search is an important research problem. One popular strategy is to represent data examples as compact binary codes through semantic hashing, which has produced promising results with fast search speed and low storage cost. Many existing semantic hashing methods generate binary codes for documents by modeling document relationships based on similarity in a keyword feature space. Existing methods have two major limitations: (1) tag information is often associated with documents in many real-world applications, but has not yet been fully exploited; (2) similarity in the keyword feature space does not fully reflect semantic relationships that go beyond keyword matching.
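To make the "fast search speed and low storage cost" claim concrete, the following minimal sketch shows how similarity search over binary codes typically reduces to XOR plus a bit count. It is a generic illustration, not code from the paper; the function names `pack_codes` and `hamming_search` and the toy 8-bit codes are assumptions made for this example.

```python
import numpy as np

def pack_codes(bits):
    """Pack an (n_docs, n_bits) array of 0/1 values into bytes, one row per document."""
    return np.packbits(np.asarray(bits, dtype=np.uint8), axis=1)

def hamming_search(query, database, top_k=3):
    """Return indices and Hamming distances of the top_k codes closest to the query."""
    xor = np.bitwise_xor(database, query)            # bits that differ, byte by byte
    dists = np.unpackbits(xor, axis=1).sum(axis=1)   # count differing bits per document
    order = np.argsort(dists, kind="stable")[:top_k]
    return order, dists[order]

# Hypothetical 8-bit codes for four documents (illustrative values only).
doc_bits = [
    [0, 1, 1, 0, 1, 0, 0, 1],
    [0, 1, 1, 0, 1, 0, 1, 1],
    [1, 0, 0, 1, 0, 1, 1, 0],
    [0, 1, 1, 0, 1, 0, 0, 1],
]
database = pack_codes(doc_bits)
query = pack_codes([[0, 1, 1, 0, 1, 0, 0, 1]])[0]

order, dists = hamming_search(query, database)
print(order.tolist(), dists.tolist())  # [0, 3, 1] [0, 0, 1]
```

Each document here costs one byte of storage, and a query touches only integer operations, which is the usual argument for hashing-based retrieval at scale.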

This paper proposes a novel hashing approach, Semantic Hashing using Tags and Topic Modeling (SHTTM), which incorporates both tag information and similarity information from probabilistic topic modeling. In particular, a unified framework is designed that keeps hashing codes consistent with tag information through a formal latent factor model while preserving the document topic/semantic similarity that goes beyond keyword matching. An iterative coordinate descent procedure is proposed for learning the optimal hashing codes. An extensive set of empirical studies on four different datasets demonstrates the advantages of the proposed SHTTM approach over several other state-of-the-art semantic hashing techniques. Furthermore, experimental results indicate that modeling tag information and using topic modeling each improve the effectiveness of hashing on their own, while combining the two techniques in the unified framework yields even better results.
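The abstract does not state the objective function, so the sketch below is only one plausible reading of "latent factor model plus topic similarity, learned by iterative coordinate descent". It assumes a relaxed quadratic objective, ||T - UᵀY||² + c||U||² + γ||Y - θ||², where T is a tag-by-document matrix, θ holds topic proportions, U holds tag factors, and Y holds real-valued codes that are thresholded at the per-bit median afterwards. The objective, the symbols, the hyperparameters `c` and `gamma`, and the function names are all assumptions for illustration, not the paper's definitions.

```python
import numpy as np

def objective(T, theta, U, Y, c, gamma):
    """Assumed relaxed objective: tag fit + ridge penalty on U + pull of Y toward theta."""
    fit = np.sum((T - U.T @ Y) ** 2)
    return fit + c * np.sum(U ** 2) + gamma * np.sum((Y - theta) ** 2)

def fit_relaxed_codes(T, theta, n_iters=20, c=1.0, gamma=1.0):
    """Alternate exact (closed-form) updates of U and Y, a block coordinate descent.

    T:     (n_tags, n_docs) tag matrix
    theta: (k, n_docs) topic proportions, also used to initialise Y
    """
    k = theta.shape[0]
    eye = np.eye(k)
    Y = theta.astype(float).copy()
    for _ in range(n_iters):
        # Fix Y, minimise over U: (Y Y^T + c I) U = Y T^T
        U = np.linalg.solve(Y @ Y.T + c * eye, Y @ T.T)
        # Fix U, minimise over Y: (U U^T + gamma I) Y = U T + gamma * theta
        Y = np.linalg.solve(U @ U.T + gamma * eye, U @ T + gamma * theta)
    return U, Y

def binarize(Y):
    """Threshold each code dimension at its median to obtain 0/1 bits."""
    return (Y > np.median(Y, axis=1, keepdims=True)).astype(np.uint8)

# Synthetic toy data (random, for illustration only).
rng = np.random.default_rng(0)
T = (rng.random((5, 8)) < 0.3).astype(float)   # 5 tags x 8 documents
theta = rng.dirichlet(np.ones(4), size=8).T    # 4 topics x 8 documents

U, Y = fit_relaxed_codes(T, theta)
codes = binarize(Y)                            # 4 bits per document
```

Because each block update solves its subproblem exactly, the assumed objective cannot increase from one iteration to the next, which is the property that makes this style of alternating scheme attractive.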

Index Terms

Semantic hashing using tags and topic modeling
1. Information systems
  1. Information retrieval
    1. Document representation
    2. Search engine architectures and scalability
      1. Search engine indexing

Recommendations

Extractive text summarization using clustering-based topic modeling
Abstract
Text summarization is the process of converting the input document into a short form, provided that it preserves the overall meaning associated with it. Primarily, text summarization is achieved in two ways, i.e., abstractive and extractive. ...
Text summarization using topic-based vector space model and semantic measure
Abstract
The primary shortcoming associated with extractive text summarization is redundancy, where more than one sentence representing a similar type of information are incorporated in summary. In the last two decades, a lot of extractive text ...
Weighted hashing for fast large scale similarity search
CIKM '13: Proceedings of the 22nd ACM international conference on Information & Knowledge Management

Similarity search, or finding approximate nearest neighbors, is an important technique for many applications. Many recent research demonstrate that hashing methods can achieve promising results for large scale similarity search due to its computational ...

Published In

SIGIR '13: Proceedings of the 36th international ACM SIGIR conference on Research and development in information retrieval

July 2013

1188 pages

ISBN:9781450320344

DOI:10.1145/2484028

General Chairs:
Gareth J.F. Jones, Dublin City University, Ireland
Páraic Sheridan, Dublin City University, Ireland

Program Chairs:
Diane Kelly, University of North Carolina, Chapel Hill, USA
Maarten de Rijke, University of Amsterdam, The Netherlands
Tetsuya Sakai, Microsoft Research Asia, China

Copyright © 2013 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Sponsors

SIGIR: ACM Special Interest Group on Information Retrieval

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 28 July 2013

Qualifiers

Research-article

Conference

SIGIR '13

Sponsor:

SIGIR

SIGIR '13: The 36th International ACM SIGIR conference on research and development in Information Retrieval

July 28 - August 1, 2013

Dublin, Ireland

Acceptance Rates

SIGIR '13 Paper Acceptance Rate 73 of 366 submissions, 20%;

Overall Acceptance Rate 792 of 3,983 submissions, 20%

Article Metrics

Total Citations: 29
Total Downloads: 1,224
Downloads (Last 12 months): 9
Downloads (Last 6 weeks): 0

Reflects downloads up to 18 Aug 2024

Citations

Cited By

Gharavi E, Veisi H (2023). Using RST-based deep neural networks to improve text representation. Signal and Data Processing 20:1 (181-197). Online publication date: 1-Jun-2023.
https://doi.org/10.61186/jsdp.20.1.181
He L, Huang Z, Chen E, Liu Q, Tong S, Wang H, Lian D, Wang S (2023). An Efficient and Robust Semantic Hashing Framework for Similar Text Search. ACM Transactions on Information Systems 41:4 (1-31). Online publication date: 22-Mar-2023.
https://dl.acm.org/doi/10.1145/3570725
Zhang M, Ma Y, Luo G, Li S, Qian Z, Zhang X (2023). Identifying the Style of Chatting. 2023 Asia Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA ASC) (1085-1092). Online publication date: 31-Oct-2023.
https://doi.org/10.1109/APSIPAASC58517.2023.10317307
Mukta M, Islam S, Shatabda S, Ali M, Zaman A (2022). Predicting Academic Performance: Analysis of Students’ Mental Health Condition from Social Media Interactions. Behavioral Sciences 12:4 (87). Online publication date: 23-Mar-2022.
https://doi.org/10.3390/bs12040087
Guo J, Mao X, Wei W, Huang H (2022). Intra-category Aware Hierarchical Supervised Document Hashing. IEEE Transactions on Knowledge and Data Engineering (1-1). Online publication date: 2022.
https://doi.org/10.1109/TKDE.2022.3161807
Xuan R, Shim J, Lee S (2021). Deep Semantic Hashing Using Pairwise Labels. IEEE Access 9 (91934-91949). Online publication date: 2021.
https://doi.org/10.1109/ACCESS.2021.3092150
Doan K, Reddy C (2020). Efficient Implicit Unsupervised Text Hashing using Adversarial Autoencoder. Proceedings of The Web Conference 2020 (684-694). Online publication date: 20-Apr-2020.
https://dl.acm.org/doi/10.1145/3366423.3380150
Zhang Y, Zhu H (2020). Discrete Wasserstein Autoencoders for Document Retrieval. ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) (8159-8163). Online publication date: May-2020.
https://doi.org/10.1109/ICASSP40776.2020.9053129
Jelodar H, Wang Y, Xiao G, Rabbani M, Zhao R, Ayobi S, Hu P, Masood I (2020). Recommendation system based on semantic scholar mining and topic modeling on conference publications. Soft Computing. Online publication date: 3-Nov-2020.
https://doi.org/10.1007/s00500-020-05397-3
Gharavi E, Silwal R, Gerber M, Veisi H (2019). Siamese Discourse Structure Recursive Neural Network for Semantic Representation. 2019 IEEE 13th International Conference on Semantic Computing (ICSC) (330-335). Online publication date: Jan-2019.
https://doi.org/10.1109/ICOSC.2019.8665662
