research-article

A distributed look-up architecture for text mining applications using MapReduce

Authors:

Atilla Soner Balkir,

Andrey Rzhetsky

SC '11: Proceedings of 2011 International Conference for High Performance Computing, Networking, Storage and Analysis

Article No.: 59, Pages 1 - 11

https://doi.org/10.1145/2063384.2063463

Published: 12 November 2011

Abstract

We study text analysis algorithms that use global optimization methods to compute local characteristics that are consistent with properties of the entire corpus rather than computed locally based on exogenous parameters. In the iterative implementations that we consider, each step both reads and updates a database of parameter values. Motivated by a need for rapid analysis of large corpora, we have developed methods for efficient access to such databases on parallel computers. These methods combine Bloom filters, in-memory caches, and an HBase cluster to reduce communication costs greatly relative to simpler approaches that either fully distribute or fully replicate the database. We also describe how this method can be incorporated into the MapReduce programming model, and illustrate its use within phrase segmentation programs. Our design can achieve considerable run time, latency and storage space improvements relative to other methods. In one phrase segmentation application, we improve performance by a factor of six relative to an HBase-based implementation.
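The abstract names three components (Bloom filters, in-memory caches, and an HBase cluster) but does not spell out how a single look-up moves through them. The sketch below shows one plausible arrangement, not the authors' implementation: a Bloom filter screens out keys that are certainly absent, a small LRU cache serves recently used keys, and only the remaining requests reach the remote store. The names `BloomFilter`, `TieredLookup`, and `remote_reads` are hypothetical, and a plain Python dict stands in for the HBase table.

```python
import hashlib
from collections import OrderedDict


class BloomFilter:
    """Fixed-size Bloom filter: no false negatives, rare false positives."""

    def __init__(self, num_bits=1 << 16, num_hashes=4):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits // 8 + 1)

    def _positions(self, key):
        # Derive num_hashes bit positions by salting SHA-256 with a seed.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{key}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, key):
        for pos in self._positions(key):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, key):
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(key))


class TieredLookup:
    """Look-up path: Bloom filter, then LRU cache, then remote store."""

    def __init__(self, remote_store, cache_size=1024):
        # Assumption: remote_store is a dict standing in for an HBase table.
        self.remote = remote_store
        self.cache = OrderedDict()
        self.cache_size = cache_size
        self.remote_reads = 0  # counts simulated round trips to the store
        self.bloom = BloomFilter()
        # In a real deployment the filter would be built offline and shipped
        # to each worker; here it is built directly from the stand-in store.
        for key in remote_store:
            self.bloom.add(key)

    def get(self, key, default=0.0):
        # 1. Keys the filter rejects are definitely absent: no communication.
        if not self.bloom.might_contain(key):
            return default
        # 2. Recently used keys are served from local memory.
        if key in self.cache:
            self.cache.move_to_end(key)
            return self.cache[key]
        # 3. Everything else costs one remote read, then gets cached.
        self.remote_reads += 1
        value = self.remote.get(key, default)
        self.cache[key] = value
        if len(self.cache) > self.cache_size:
            self.cache.popitem(last=False)  # evict least recently used
        return value
```

In this arrangement a repeated look-up of the same phrase costs one remote read in total, and a phrase that was never stored usually costs none. How the paper itself divides work between the three components may differ.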

Index Terms

A distributed look-up architecture for text mining applications using MapReduce
1. Software and its engineering
  1. Software organization and properties
    1. Software system structures
      1. Distributed systems organizing principles
        Organizing principles for web applications

Information & Contributors

Information

Published In

SC '11: Proceedings of 2011 International Conference for High Performance Computing, Networking, Storage and Analysis

November 2011

866 pages

ISBN:9781450307710

DOI:10.1145/2063384

Conference Chair: Scott Lathrop (University of Chicago)

Program Chairs: Jim Costa (Sandia National Laboratories), William Kramer (National Center for Supercomputing Applications)

Copyright © 2011 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Sponsors

SIGARCH: ACM Special Interest Group on Computer Architecture
IEEE-CS: Computer Society

Publisher

Association for Computing Machinery

New York, NY, United States

Qualifiers

Research-article

Funding Sources

National Institutes of Health

Conference

SC '11: International Conference for High Performance Computing, Networking, Storage and Analysis

Sponsors: SIGARCH, IEEE-CS

November 12 - 18, 2011

Seattle, Washington

Acceptance Rates

SC '11 Paper Acceptance Rate 74 of 352 submissions, 21%;

Overall Acceptance Rate 1,516 of 6,373 submissions, 24%

Bibliometrics

Total Citations: 4
Total Downloads: 372
Downloads (last 12 months): 2
Downloads (last 6 weeks): 0

Reflects downloads up to 08 Feb 2025

Cited By

N M Bazlur Rashid A, Choudhury T (2019). Knowledge management overview of feature selection problem in high-dimensional financial data: cooperative co-evolution and MapReduce perspectives. Problems and Perspectives in Management, 17:4 (340-359). Online publication date: 26-Dec-2019.
https://doi.org/10.21511/ppm.17(4).2019.28

Goncalves C, Silva J, Cunha J (2019). n-gram Cache Performance in Statistical Extraction of Relevant Terms in Large Corpora. Computational Science – ICCS 2019 (75-88). Online publication date: 8-Jun-2019.
https://doi.org/10.1007/978-3-030-22741-8_6

Chang V (2017). Towards data analysis for weather cloud computing. Knowledge-Based Systems, 127:C (29-45). Online publication date: 1-Jul-2017.
https://dl.acm.org/doi/10.1016/j.knosys.2017.03.003

Yin J, Liao Y, Baldi M, Gao L, Nucci A (2013). A Scalable Distributed Framework for Efficient Analytics on Ordered Datasets. Proceedings of the 2013 IEEE/ACM 6th International Conference on Utility and Cloud Computing (131-138). Online publication date: 9-Dec-2013.
https://dl.acm.org/doi/10.1109/UCC.2013.35
