DOI: 10.1145/2484028.2484037
research-article

Semantic hashing using tags and topic modeling

Published: 28 July 2013

Abstract

Designing efficient and effective solutions for large-scale similarity search is an important research problem. One popular strategy is to represent data examples as compact binary codes through semantic hashing, which has produced promising results with fast search speed and low storage cost. Many existing semantic hashing methods generate binary codes for documents by modeling document relationships based on similarity in a keyword feature space. Existing methods have two major limitations: (1) tag information is associated with documents in many real-world applications but has not yet been fully exploited; (2) similarity in the keyword feature space does not fully reflect semantic relationships that go beyond keyword matching.
This paper proposes a novel hashing approach, Semantic Hashing using Tags and Topic Modeling (SHTTM), that incorporates both tag information and similarity information from probabilistic topic modeling. In particular, a unified framework is designed to ensure that hashing codes are consistent with tag information through a formal latent factor model, while preserving document topic/semantic similarity that goes beyond keyword matching. An iterative coordinate descent procedure is proposed for learning the optimal hashing codes. An extensive set of empirical studies on four different datasets demonstrates the advantages of the proposed SHTTM approach over several other state-of-the-art semantic hashing techniques. Furthermore, experimental results indicate that modeling tag information and utilizing topic modeling each improve the effectiveness of hashing on their own, while combining the two techniques in the unified framework obtains even better results.
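
To make the search side of semantic hashing concrete, the following is a minimal sketch of nearest-neighbor lookup over compact binary codes using Hamming distance. It does not reproduce SHTTM or its learning procedure: the codes are hand-written toy values, and the helper names (`pack_bits`, `hamming_distance`, `hamming_search`) are hypothetical, chosen only for illustration. It assumes each document has already been mapped to a fixed-length bit string by some hashing method.

```python
from typing import List, Tuple


def pack_bits(bits: List[int]) -> int:
    """Pack a list of 0/1 values into one integer (first bit is most significant)."""
    code = 0
    for bit in bits:
        code = (code << 1) | (bit & 1)
    return code


def hamming_distance(a: int, b: int) -> int:
    """Count the bit positions in which two codes differ."""
    return bin(a ^ b).count("1")


def hamming_search(query: int, codes: List[int], k: int) -> List[Tuple[int, int]]:
    """Return the k (document index, distance) pairs closest to the query."""
    scored = [(i, hamming_distance(query, c)) for i, c in enumerate(codes)]
    # Sort by distance, breaking ties by document index for determinism.
    scored.sort(key=lambda pair: (pair[1], pair[0]))
    return scored[:k]


# Toy 8-bit codes (illustrative values only, not produced by any real model).
corpus = [
    pack_bits([1, 0, 1, 1, 0, 0, 1, 0]),
    pack_bits([1, 0, 1, 1, 0, 1, 1, 0]),
    pack_bits([0, 1, 0, 0, 1, 1, 0, 1]),
    pack_bits([1, 0, 0, 1, 0, 0, 1, 0]),
]
query = pack_bits([1, 0, 1, 1, 0, 0, 1, 1])
top2 = hamming_search(query, corpus, k=2)
print(top2)
```

Because each code fits in a machine word and the distance reduces to an XOR followed by a bit count, a linear scan of this kind stays cheap in both time and storage, which is the property the abstract refers to as fast search speed and low storage cost.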


Published In

SIGIR '13: Proceedings of the 36th international ACM SIGIR conference on Research and development in information retrieval
July 2013
1188 pages
ISBN: 9781450320344
DOI: 10.1145/2484028
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States


Author Tags

  1. hashing
  2. tags
  3. topic modeling

Qualifiers

  • Research-article

Conference

SIGIR '13
Acceptance Rates

SIGIR '13 Paper Acceptance Rate 73 of 366 submissions, 20%;
Overall Acceptance Rate 792 of 3,983 submissions, 20%
