DOI: 10.1145/2034691.2034722

Document visual similarity measure for document search

Published: 19 September 2011

Abstract

Managing large document databases has become an important task. The ability to automatically compare document layouts, and to classify and search documents by their visual appearance, is desirable in many applications. We propose a new algorithm that approximates a metric function between documents based on their visual similarity. The comparison relies only on the visual appearance of a document and does not take its text content into consideration. We measure the similarity of single-page documents using distance functions between three document components: background, text, and saliency. Each document component is represented as a Gaussian mixture distribution, and distances between the components of different documents are calculated as an approximation of the Hellinger distance between the corresponding distributions. Because the Hellinger distance obeys the triangle inequality, it is well suited to nearest neighbor search in a document database: the triangle inequality allows many candidate comparisons to be skipped, so the computation required to find similar documents can be significantly reduced.
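
The two ideas in the abstract, a Hellinger distance between distributions and pruning a nearest neighbor search with the triangle inequality, can be illustrated with a minimal sketch. This is not the authors' implementation. The paper approximates the Hellinger distance between Gaussian mixtures, whereas the sketch below assumes each item is a single Gaussian with diagonal covariance, for which the distance has a closed form. The function names `hellinger_gaussian` and `nearest_with_pivot`, the single-pivot strategy, and the sample data are all hypothetical choices made for illustration.

```python
import math


def hellinger_gaussian(mu1, var1, mu2, var2):
    """Closed-form Hellinger distance between two diagonal-covariance Gaussians.

    Assumption: single Gaussians stand in for the Gaussian mixtures used in
    the paper, whose distance has to be approximated instead.
    """
    bc = 1.0  # Bhattacharyya coefficient; it factorizes over independent dimensions
    for m1, v1, m2, v2 in zip(mu1, var1, mu2, var2):
        s = v1 + v2
        bc *= math.sqrt(2.0 * math.sqrt(v1 * v2) / s) * math.exp(-((m1 - m2) ** 2) / (4.0 * s))
    # Clamp to guard against tiny negative values caused by rounding.
    return math.sqrt(max(0.0, 1.0 - bc))


def nearest_with_pivot(query, database, pivot, pivot_dists, dist):
    """Nearest neighbor search that skips candidates using one reference point.

    pivot_dists[i] must hold dist(pivot, database[i]), computed ahead of time.
    The triangle inequality gives |d(q, p) - d(p, x)| <= d(q, x), so a candidate
    whose lower bound is already no better than the best match can be skipped.
    Returns (index, distance, number of distance evaluations).
    """
    d_qp = dist(query, pivot)
    evaluations = 1
    best_idx, best_d = None, math.inf
    for i, item in enumerate(database):
        lower_bound = abs(d_qp - pivot_dists[i])
        if lower_bound >= best_d:
            continue  # cannot beat the current best; no distance computed
        d = dist(query, item)
        evaluations += 1
        if d < best_d:
            best_idx, best_d = i, d
    return best_idx, best_d, evaluations


if __name__ == "__main__":
    # Hypothetical items: (mean vector, variance vector) pairs.
    database = [
        ([0.0, 0.0], [1.0, 1.0]),
        ([5.0, 5.0], [1.0, 2.0]),
        ([0.5, 0.2], [1.5, 1.0]),
        ([9.0, -3.0], [0.5, 0.5]),
    ]

    def dist(a, b):
        return hellinger_gaussian(a[0], a[1], b[0], b[1])

    pivot = database[0]
    pivot_dists = [dist(pivot, x) for x in database]  # offline precomputation
    query = ([0.4, 0.1], [1.2, 1.0])
    print(nearest_with_pivot(query, database, pivot, pivot_dists, dist))
```

The pruning step is only valid when the distance is a true metric. With an approximated distance, as in the paper, the lower bound holds only to the extent that the approximation preserves the triangle inequality.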

Cited By

  • (2018) Automatic Rights Management for Photocopiers. Proceedings of the ACM Symposium on Document Engineering 2018, 1-10. DOI: 10.1145/3209280.3209531. Online publication date: 28-Aug-2018.
  • (2014) Semisupervised Wrapper Choice and Generation for Print-Oriented Documents. IEEE Transactions on Knowledge and Data Engineering 26:1, 208-220. DOI: 10.1109/TKDE.2012.254. Online publication date: 1-Jan-2014.

      Published In

      DocEng '11: Proceedings of the 11th ACM symposium on Document engineering
      September 2011
      296 pages
      ISBN:9781450308632
      DOI:10.1145/2034691
      Publisher

      Association for Computing Machinery

      New York, NY, United States

      Author Tag

      1. document retrieval

      Qualifiers

      • Research-article

      Conference

      DocEng '11: ACM Symposium on Document Engineering
      September 19 - 22, 2011
      Mountain View, California, USA

      Acceptance Rates

      Overall Acceptance Rate 194 of 564 submissions, 34%

