Data-Intensive Text Processing with MapReduce
Publisher: Morgan and Claypool Publishers
ISBN: 978-1-60845-342-9
Published: 30 April 2010
Pages: 178
Abstract

Our world is being revolutionized by data-driven methods: access to large amounts of data has generated new insights and opened exciting new opportunities in commerce, science, and computing applications. Processing the enormous quantities of data necessary for these advances requires large clusters, making distributed computing paradigms more crucial than ever. MapReduce is a programming model for expressing distributed computations on massive datasets and an execution framework for large-scale data processing on clusters of commodity servers. The programming model provides an easy-to-understand abstraction for designing scalable algorithms, while the execution framework transparently handles many system-level details, ranging from scheduling to synchronization to fault tolerance. This book focuses on MapReduce algorithm design, with an emphasis on text processing algorithms common in natural language processing, information retrieval, and machine learning. We introduce the notion of MapReduce design patterns, which represent general reusable solutions to commonly occurring problems across a variety of problem domains. This book aims not only to help the reader "think in MapReduce" but also to discuss the limitations of the programming model.

Table of Contents: Introduction / MapReduce Basics / MapReduce Algorithm Design / Inverted Indexing for Text Retrieval / Graph Algorithms / EM Algorithms for Text Processing / Closing Remarks
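As a rough illustration of the programming model the abstract describes, the following is a minimal single-process sketch of word counting in the map/shuffle/reduce style. It is not taken from the book: the names `map_word_count`, `reduce_word_count`, and `run_mapreduce` are hypothetical, and the in-memory grouping step only stands in for what a real execution framework would do across a cluster.

```python
from collections import defaultdict

def map_word_count(doc_id, text):
    """Mapper (hypothetical): emit a (word, 1) pair for every token in a document."""
    for word in text.lower().split():
        yield word, 1

def reduce_word_count(word, counts):
    """Reducer (hypothetical): sum all partial counts collected for one word."""
    yield word, sum(counts)

def run_mapreduce(records, mapper, reducer):
    """Toy in-memory driver, assumed for illustration only.

    A real framework distributes these phases over many machines and
    handles scheduling, synchronization, and fault tolerance.
    """
    # Map phase plus "shuffle": group every emitted value by its key.
    groups = defaultdict(list)
    for key, value in records:
        for out_key, out_value in mapper(key, value):
            groups[out_key].append(out_value)

    # Reduce phase: process each key together with all of its values.
    output = {}
    for key in sorted(groups):
        for out_key, out_value in reducer(key, groups[key]):
            output[out_key] = out_value
    return output

docs = [
    ("d1", "a rose is a rose"),
    ("d2", "a rose by any other name"),
]
counts = run_mapreduce(docs, map_word_count, reduce_word_count)
print(counts)
```

The point of the sketch is the division of labor: the programmer writes only the mapper and the reducer, while grouping by key is left to the driver.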

Cited By

  1. Gao W, Ye Z, Sun P, Wen Y and Zhang T Chronus Proceedings of the ACM Symposium on Cloud Computing, (609-623)
  2. Fukuda M, Gordon C, Mert U and Sell M (2020). An Agent-Based Computational Framework for Distributed Data Analysis, Computer, 53:3, (16-25), Online publication date: 1-Mar-2020.
  3. Wagholikar K, Fischer C, Goodson A, Herrick C, Rees M, Toscano E, Macrae C, Scirica B, Desai A and Murphy S (2018). Extraction of Ejection Fraction from Echocardiography Notes for Constructing a Cohort of Patients having Heart Failure with reduced Ejection Fraction (HFrEF), Journal of Medical Systems, 42:11, (1-12), Online publication date: 1-Nov-2018.
  4. Chang V (2018). Data analytics and visualization for inspecting cancers and genes, Multimedia Tools and Applications, 77:14, (17693-17707), Online publication date: 1-Jul-2018.
  5. Bendahmane A, Bennasar H and Essaaidi M An Efficient Approach to Improve Security for MapReduce Computation in Cloud System Proceedings of the International Conference on Learning and Optimization Algorithms: Theory and Applications, (1-6)
  6. Singh A, Singh S and Yousef M (2018). A conceptual framework for designing a big data course, Journal of Computing Sciences in Colleges, 33:5, (192-198), Online publication date: 1-May-2018.
  7. Mackey A and Cuevas I (2018). Automatic text summarization within big data frameworks, Journal of Computing Sciences in Colleges, 33:5, (26-32), Online publication date: 1-May-2018.
  8. Kang M and Lee J (2017). An experimental analysis of limitations of MapReduce for iterative algorithms on Spark, Cluster Computing, 20:4, (3593-3604), Online publication date: 1-Dec-2017.
  9. Zhao D (2017). Performance comparison between Hadoop and HAMR under laboratory environment, Procedia Computer Science, 111:C, (223-229), Online publication date: 1-Sep-2017.
  10. Khezr S and Navimipour N (2017). MapReduce and Its Applications, Challenges, and Architecture, Journal of Grid Computing, 15:3, (295-321), Online publication date: 1-Sep-2017.
  11. Singh H and Bawa S (2017). A MapReduce-based scalable discovery and indexing of structured big data, Future Generation Computer Systems, 73:C, (32-43), Online publication date: 1-Aug-2017.
  12. Wu H and Wang C (2017). Generalization of Large-Scale Data Processing in One MapReduce Job for Coarse-Grained Parallelism, International Journal of Parallel Programming, 45:4, (797-826), Online publication date: 1-Aug-2017.
  13. Sutrisnowati R, Yahya B, Bae H, Pulshashi I and Nur Adi T (2017). Scalable indexing algorithm for multi-dimensional time-gap analysis with distributed computing, Procedia Computer Science, 124:C, (224-231), Online publication date: 1-Jan-2017.
  14. Liu Z, Zhang Q, Ahmed R, Boutaba R, Liu Y and Gong Z (2016). Dynamic Resource Allocation for MapReduce with Partitioning Skew, IEEE Transactions on Computers, 65:11, (3304-3317), Online publication date: 1-Nov-2016.
  15. Yadav D, Yadav A and Prasad R (2016). Efficient Textual Web Retrieval using Wavelet Tree, International Journal of Information Retrieval Research, 6:4, (16-29), Online publication date: 1-Oct-2016.
  16. Liu Z, Zhang Q, Boutaba R, Liu Y and Wang B (2016). OPTIMA, Journal of Network and Systems Management, 24:4, (859-883), Online publication date: 1-Oct-2016.
  17. Fegaras L A Query Processing Framework for Array-Based Computations Proceedings, Part I, 27th International Conference on Database and Expert Systems Applications - Volume 9827, (240-254)
  18. Turcksin B, Kronbichler M and Bangerth W (2016). WorkStream -- A Design Pattern for Multicore-Enabled Finite Element Computations, ACM Transactions on Mathematical Software, 43:1, (1-29), Online publication date: 29-Aug-2016.
  19. Wang J (2016). Extracting significant pattern histories from timestamped texts using MapReduce, The Journal of Supercomputing, 72:8, (3236-3260), Online publication date: 1-Aug-2016.
  20. Wu Y, Zhu X, Li L, Fan W, Jin R and Zhang X (2016). Mining Dual Networks, ACM Transactions on Knowledge Discovery from Data, 10:4, (1-37), Online publication date: 27-Jul-2016.
  21. (2016). Security and privacy aspects in MapReduce on clouds, Computer Science Review, 20:C, (1-28), Online publication date: 1-May-2016.
  22. Marszałkowski J, Drozdowski M and Marszałkowski J (2016). Time and Energy Performance of Parallel Systems with Hierarchical Memory, Journal of Grid Computing, 14:1, (153-170), Online publication date: 1-Mar-2016.
  23. Ramamurthy B A Practical and Sustainable Model for Learning and Teaching Data Science Proceedings of the 47th ACM Technical Symposium on Computing Science Education, (169-174)
  24. Kosugi N, Onizuka M, Kazui H and Ikeda M Ninchisho Chienowa-net Proceedings of the 17th International Conference on Information Integration and Web-based Applications & Services, (1-5)
  25. Leung C Big Data Mining Applications and Services Proceedings of the 2015 International Conference on Big Data Applications and Services, (1-8)
  26. Shang H, Zhao X, Kiran U and Kitsuregawa M Towards Scale-out Capability on Social Graphs Proceedings of the 24th ACM International on Conference on Information and Knowledge Management, (253-262)
  27. Koch J, Staudt C, Vogel M and Meyerhenke H Complex Network Analysis on Distributed Systems Proceedings of the 2015 IEEE/ACM International Conference on Advances in Social Networks Analysis and Mining 2015, (1169-1176)
  28. Lucier B, Oren J and Singer Y Influence at Scale Proceedings of the 21th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, (735-744)
  29. Karim N, Latif K, Anwar Z, Khan S and Hayat A (2015). Storage schema and ontology-independent SPARQL to HiveQL translation, The Journal of Supercomputing, 71:7, (2694-2719), Online publication date: 1-Jul-2015.
  30. Bhuiyan M and Al Hasan M (2015). An Iterative MapReduce Based Frequent Subgraph Mining Algorithm, IEEE Transactions on Knowledge and Data Engineering, 27:3, (608-620), Online publication date: 1-Mar-2015.
  31. Klauck H, Nanongkai D, Pandurangan G and Robinson P Distributed computation of large-scale graph problems Proceedings of the twenty-sixth annual ACM-SIAM symposium on Discrete algorithms, (391-410)
  32. Kolias V, Anagnostopoulos I and Kayafas E A Covering Classification Rule Induction Approach for Big Datasets Proceedings of the 2014 IEEE/ACM International Symposium on Big Data Computing, (45-53)
  33. Liu X, Tian Y, He Q, Lee W and McPherson J Distributed Graph Summarization Proceedings of the 23rd ACM International Conference on Conference on Information and Knowledge Management, (799-808)
  34. Liu X, Thomsen C and Pedersen T CloudETL Proceedings of the 18th International Database Engineering & Applications Symposium, (195-206)
  35. Radenski A Big data, high-performance computing, and MapReduce Proceedings of the 15th International Conference on Computer Systems and Technologies, (13-24)
  36. Gurajada S, Seufert S, Miliaraki I and Theobald M TriAD Proceedings of the 2014 ACM SIGMOD International Conference on Management of Data, (289-300)
  37. Okcan A and Riedewald M Anti-combining for MapReduce Proceedings of the 2014 ACM SIGMOD International Conference on Management of Data, (839-850)
  38. Fustes D, Cantorna D, Dafonte C, Arcay B, Iglesias A and Manteiga M (2014). A cloud-integrated web platform for marine monitoring using GIS and remote sensing. Application to oil spill detection through SAR images, Future Generation Computer Systems, 34, (155-160), Online publication date: 1-May-2014.
  39. Lin J, Gholami M and Rao J Infrastructure for supporting exploration and discovery in web archives Proceedings of the 23rd International Conference on World Wide Web, (851-856)
  40. Onizuka M, Kato H, Hidaka S, Nakano K and Hu Z (2013). Optimization for iterative queries on MapReduce, Proceedings of the VLDB Endowment, 7:4, (241-252), Online publication date: 1-Dec-2013.
  41. Tauer G and Nagi R (2013). A map-reduce lagrangian heuristic for multidimensional assignment problems with decomposable costs, Parallel Computing, 39:11, (653-668), Online publication date: 1-Nov-2013.
  42. Woo J (2013). Market Basket Analysis algorithms with MapReduce, Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery, 3:6, (445-452), Online publication date: 1-Nov-2013.
  43. Cho B, Rahman M, Chajed T, Gupta I, Abad C, Roberts N and Lin P Natjam Proceedings of the 4th annual Symposium on Cloud Computing, (1-17)
  44. Wang Y and Yu H An ultralow-power memory-based big-data computing platform by nonvolatile domain-wall nanowire devices Proceedings of the 2013 International Symposium on Low Power Electronics and Design, (329-334)
  45. Galitsky B, Ilvovsky D, Kuznetsov S and Strok F Finding Maximal Common Sub-parse Thickets for Multi-sentence Search Revised Selected Papers of the Third International Workshop on Graph Structures for Knowledge Representation and Reasoning - Volume 8323, (39-57)
  46. Luo Y, de Lange Y, Fletcher G, De Bra P, Hidders J and Wu Y Bisimulation reduction of big graphs on mapreduce Proceedings of the 29th British National conference on Big Data, (189-203)
  47. Gupta P, Goel A, Lin J, Sharma A, Wang D and Zadeh R WTF Proceedings of the 22nd international conference on World Wide Web, (505-514)
  48. Yu X and Hong B Bi-hadoop Proceedings of the 13th IEEE/ACM International Symposium on Cluster, Cloud, and Grid Computing, (245-252)
  49. Cosulschi M, Cuzzocrea A and De Virgilio R Implementing BFS-based traversals of RDF graphs over MapReduce efficiently Proceedings of the 13th IEEE/ACM International Symposium on Cluster, Cloud, and Grid Computing, (569-574)
  50. Lin J and Ryaboy D (2013). Scaling big data mining infrastructure, ACM SIGKDD Explorations Newsletter, 14:2, (6-19), Online publication date: 30-Apr-2013.
  51. Chen F and Hsu M A performance comparison of parallel DBMSs and MapReduce on large-scale text analytics Proceedings of the 16th International Conference on Extending Database Technology, (613-624)
  52. Berberich K and Bedathur S Computing n-gram statistics in MapReduce Proceedings of the 16th International Conference on Extending Database Technology, (101-112)
  53. Rubin P (2013). Review of Information Retrieval By Buettcher, Clarke, Cormack, ACM SIGACT News, 44:1, (29-33), Online publication date: 6-Mar-2013.
  54. Fustes D, Cantorna D, Dafonte C, Iglesias A, Manteiga M and Arcay B Cloud integrated web platform for marine monitoring using GIS and remote sensing Proceedings of the 6th international conference on Ubiquitous Computing and Ambient Intelligence, (446-453)
  55. Akritidis L and Bozanis P Computing scientometrics in large-scale academic search engines with mapreduce Proceedings of the 13th international conference on Web Information Systems Engineering, (609-623)
  56. Zhong M and Liu M A distributed index for efficient parallel top-k keyword search on massive graphs Proceedings of the twelfth international workshop on Web information and data management, (27-32)
  57. Zhao L and Ichise R Graph-based ontology analysis in the linked open data Proceedings of the 8th International Conference on Semantic Systems, (56-63)
  58. Reguieg H, Toumani F, Motahari-Nezhad H and Benatallah B Using mapreduce to scale events correlation discovery for business processes mining Proceedings of the 10th international conference on Business Process Management, (279-284)
  59. Yin Z, Cao L, Gu Q and Han J (2012). Latent Community Topic Analysis, ACM Transactions on Intelligent Systems and Technology, 3:4, (1-21), Online publication date: 1-Sep-2012.
  60. Quick L, Wilkinson P and Hardcastle D Using Pregel-like Large Scale Graph Processing Frameworks for Social Network Analysis Proceedings of the 2012 International Conference on Advances in Social Networks Analysis and Mining (ASONAM 2012), (457-463)
  61. Khopkar S, Nagi R and Nikolaev A An Efficient Map-Reduce Algorithm for the Incremental Computation of All-Pairs Shortest Paths in Social Networks Proceedings of the 2012 International Conference on Advances in Social Networks Analysis and Mining (ASONAM 2012), (1144-1148)
  62. Agarwal R, Caesar M, Godfrey P and Zhao B Shortest paths in less than a millisecond Proceedings of the 2012 ACM workshop on Workshop on online social networks, (37-42)
  63. Ryu H, Lease M and Woodward N Finding and exploring memes in social media Proceedings of the 23rd ACM conference on Hypertext and social media, (295-304)
  64. Pietracaprina A, Pucci G, Riondato M, Silvestri F and Upfal E Space-round tradeoffs for MapReduce computations Proceedings of the 26th ACM international conference on Supercomputing, (235-244)
  65. Marchal S and Engel T Large scale DNS analysis Proceedings of the 6th IFIP WG 6.6 international autonomous infrastructure, management, and security conference on Dependable Networks and Services, (151-154)
  66. Ture F and Lin J Why not grab a free lunch? Proceedings of the 2012 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, (626-630)
  67. Lin J and Kolcz A Large-scale machine learning at twitter Proceedings of the 2012 ACM SIGMOD International Conference on Management of Data, (793-804)
  68. Gesmundo A and Tomeh N HadoopPerceptron Proceedings of the Demonstrations at the 13th Conference of the European Chapter of the Association for Computational Linguistics, (97-101)
  69. Zhai K, Boyd-Graber J, Asadi N and Alkhouja M Mr. LDA Proceedings of the 21st international conference on World Wide Web, (879-888)
  70. Bahmani B and Goel A Partitioned multi-indexing Proceedings of the 21st international conference on World Wide Web, (399-408)
  71. Radenski A Distributed simulated annealing with mapreduce Proceedings of the 2012 European conference on Applications of Evolutionary Computation, (466-476)
  72. Metwally A and Faloutsos C (2012). V-SMART-join, Proceedings of the VLDB Endowment, 5:8, (704-715), Online publication date: 1-Apr-2012.
  73. Fegaras L, Li C and Gupta U An optimization framework for map-reduce queries Proceedings of the 15th International Conference on Extending Database Technology, (26-37)
  74. Emoto K, Fischer S and Hu Z Generate, test, and aggregate Proceedings of the 21st European conference on Programming Languages and Systems, (254-273)
  75. Kolb L, Thor A and Rahm E (2012). Multi-pass sorted neighborhood blocking with MapReduce, Computer Science - Research and Development, 27:1, (45-63), Online publication date: 1-Feb-2012.
  76. Škrabálek J, Kunc P and Pitner T Inner architecture of a social networking system Proceedings of the 38th international conference on Current Trends in Theory and Practice of Computer Science, (530-541)
  77. Lee K, Lee Y, Choi H, Chung Y and Moon B (2012). Parallel data processing with MapReduce, ACM SIGMOD Record, 40:4, (11-20), Online publication date: 11-Jan-2012.
  78. Ono K, Hirai Y, Tanabe Y, Noda N and Hagiya M Using Coq in specification and program extraction of hadoop mapreduce applications Proceedings of the 9th international conference on Software engineering and formal methods, (350-365)
  79. Sudhakaran V and Chue Hong N Evaluating the suitability of mapreduce for surface temperature analysis codes Proceedings of the second international workshop on Data intensive computing in the clouds, (3-12)
  80. Gau R, Hsieh T, Tsai S and Cheng C An implementation framework of mapreduce email social network analysis Proceedings of the 6th ACM workshop on Wireless multimedia networking and computing, (67-70)
  81. Herodotou H, Dong F and Babu S No one (cluster) size fits all Proceedings of the 2nd ACM Symposium on Cloud Computing, (1-14)
  82. Lu Q, Conrad J, Al-Kofahi K and Keenan W Legal document clustering with built-in topic segmentation Proceedings of the 20th ACM international conference on Information and knowledge management, (383-392)
  83. Li Y and Schuurmans D MapReduce for parallel reinforcement learning Proceedings of the 9th European conference on Recent Advances in Reinforcement Learning, (309-320)
  84. Ene A, Im S and Moseley B Fast clustering using MapReduce Proceedings of the 17th ACM SIGKDD international conference on Knowledge discovery and data mining, (681-689)
  85. Herodotou H, Dong F and Babu S (2011). MapReduce programming and cost-based optimization?, Proceedings of the VLDB Endowment, 4:12, (1446-1449), Online publication date: 1-Aug-2011.
  86. Herodotou H and Babu S (2011). Profiling, what-if analysis, and cost-based optimization of MapReduce programs, Proceedings of the VLDB Endowment, 4:11, (1111-1122), Online publication date: 1-Aug-2011.
  87. Chen K, Xu H, Tian F and Guo S CloudVista Proceedings of the 23rd international conference on Scientific and statistical database management, (332-350)
  88. Wittek P and Darányi S Introducing scalable quantum approaches in language representation Proceedings of the 5th international conference on Quantum interaction, (2-12)
  89. Tan M, Zhou W, Zheng L and Wang S A large scale distributed syntactic, semantic and lexical language model for machine translation Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies - Volume 1, (201-210)
  90. Schätzle A, Przyjaciel-Zablocki M and Lausen G PigSPARQL Proceedings of the International Workshop on Semantic Web Information Management, (1-8)
  91. Bahmani B, Chakrabarti K and Xin D Fast personalized PageRank on MapReduce Proceedings of the 2011 ACM SIGMOD International Conference on Management of data, (973-984)
  92. Moseley B, Dasgupta A, Kumar R and Sarlós T On scheduling in map-reduce and flow-shops Proceedings of the twenty-third annual ACM symposium on Parallelism in algorithms and architectures, (289-298)
  93. Lattanzi S, Moseley B, Suri S and Vassilvitskii S Filtering Proceedings of the twenty-third annual ACM symposium on Parallelism in algorithms and architectures, (85-94)
  94. Przyjaciel-Zablocki M, Schätzle A, Hornung T and Lausen G RDFPath Proceedings of the 8th international conference on The Semantic Web, (50-64)
  95. De Francisci Morales G, Gionis A and Sozio M (2011). Social content matching in MapReduce, Proceedings of the VLDB Endowment, 4:7, (460-469), Online publication date: 1-Apr-2011.
  96. Suri S and Vassilvitskii S Counting triangles and the curse of the last reducer Proceedings of the 20th international conference on World wide web, (607-614)
  97. White B, Yeh T, Lin J and Davis L Web-scale computer vision using MapReduce for multimedia data mining Proceedings of the Tenth International Workshop on Multimedia Data Mining, (1-10)
  98. Lin J and Schatz M Design patterns for efficient graph algorithms in MapReduce Proceedings of the Eighth Workshop on Mining and Learning with Graphs, (78-85)
Contributors
  • University of Waterloo
  • Carnegie Mellon University
