DOI: 10.1145/2063384.2063463
research-article

A distributed look-up architecture for text mining applications using MapReduce

Published: 12 November 2011

Abstract

We study text analysis algorithms that use global optimization methods to compute local characteristics that are consistent with properties of the entire corpus rather than computed locally based on exogenous parameters. In the iterative implementations that we consider, each step both reads and updates a database of parameter values. Motivated by a need for rapid analysis of large corpora, we have developed methods for efficient access to such databases on parallel computers. These methods combine Bloom filters, in-memory caches, and an HBase cluster to reduce communication costs greatly relative to simpler approaches that either fully distribute or fully replicate the database. We also describe how this method can be incorporated into the MapReduce programming model, and illustrate its use within phrase segmentation programs. Our design can achieve considerable run time, latency and storage space improvements relative to other methods. In one phrase segmentation application, we improve performance by a factor of six relative to an HBase-based implementation.
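The layered look-up described in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation: the class names `BloomFilter` and `TieredLookup`, the LRU eviction policy, the cache size, and the use of a plain dictionary in place of an HBase table are all assumptions made for illustration. The idea shown is only the ordering of the checks: a Bloom filter rejects keys that are certainly absent, an in-memory cache answers repeated keys, and only the remaining requests reach the remote store.

```python
import hashlib
from collections import OrderedDict


class BloomFilter:
    """Fixed-size Bloom filter; bit positions are derived from one SHA-256 digest."""

    def __init__(self, num_bits=1 << 16, num_hashes=4):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits // 8 + 1)

    def _positions(self, key):
        digest = hashlib.sha256(key.encode("utf-8")).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1  # odd step for double hashing
        return [(h1 + i * h2) % self.num_bits for i in range(self.num_hashes)]

    def add(self, key):
        for pos in self._positions(key):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, key):
        # False means "definitely absent"; True means "possibly present".
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(key))


class TieredLookup:
    """Hypothetical look-up path: Bloom filter, then LRU cache, then remote store."""

    def __init__(self, remote_store, cache_size=2):
        self.remote = remote_store      # a dict standing in for a remote table
        self.bloom = BloomFilter()
        for key in remote_store:
            self.bloom.add(key)
        self.cache = OrderedDict()      # insertion order doubles as LRU order
        self.cache_size = cache_size
        self.remote_reads = 0           # counts simulated network round trips

    def get(self, key, default=None):
        if key not in self.bloom:
            return default              # rejected locally, no communication
        if key in self.cache:
            self.cache.move_to_end(key)  # mark as most recently used
            return self.cache[key]
        self.remote_reads += 1          # only this path reaches the remote store
        value = self.remote.get(key, default)
        self.cache[key] = value
        if len(self.cache) > self.cache_size:
            self.cache.popitem(last=False)  # evict the least recently used entry
        return value


if __name__ == "__main__":
    table = {"new york": 0.8, "text mining": 0.6}
    lookup = TieredLookup(table)
    for phrase in ["new york", "new york", "no such phrase"]:
        print(phrase, lookup.get(phrase))
    print("remote reads:", lookup.remote_reads)
```

In this sketch the `remote_reads` counter makes the saving visible: repeated and absent keys are resolved without touching the stand-in remote store. A real deployment would also have to handle updates to the parameter database, which this read-only example leaves out.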



Published In

SC '11: Proceedings of 2011 International Conference for High Performance Computing, Networking, Storage and Analysis
November 2011
866 pages
ISBN: 9781450307710
DOI: 10.1145/2063384
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. MapReduce
  2. distributed storage
  3. text mining

Qualifiers

  • Research-article

Conference

SC '11
Acceptance Rates

SC '11 Paper Acceptance Rate 74 of 352 submissions, 21%;
Overall Acceptance Rate 1,516 of 6,373 submissions, 24%

Article Metrics

  • Downloads (Last 12 months): 2
  • Downloads (Last 6 weeks): 0
Reflects downloads up to 08 Feb 2025

Cited By

  • (2019) Knowledge management overview of feature selection problem in high-dimensional financial data: cooperative co-evolution and MapReduce perspectives. Problems and Perspectives in Management, 17:4 (340-359). DOI: 10.21511/ppm.17(4).2019.28. Online publication date: 26-Dec-2019
  • (2019) n-gram Cache Performance in Statistical Extraction of Relevant Terms in Large Corpora. Computational Science – ICCS 2019 (75-88). DOI: 10.1007/978-3-030-22741-8_6. Online publication date: 8-Jun-2019
  • (2017) Towards data analysis for weather cloud computing. Knowledge-Based Systems, 127:C (29-45). DOI: 10.1016/j.knosys.2017.03.003. Online publication date: 1-Jul-2017
  • (2013) A Scalable Distributed Framework for Efficient Analytics on Ordered Datasets. Proceedings of the 2013 IEEE/ACM 6th International Conference on Utility and Cloud Computing (131-138). DOI: 10.1109/UCC.2013.35. Online publication date: 9-Dec-2013
