Location via proxy:   [ UP ]  
[Report a bug]   [Manage cookies]                
skip to main content
10.1145/1559845.1559893acmconferencesArticle/Chapter ViewAbstractPublication PagesmodConference Proceedingsconference-collections
research-article

E = MC3: managing uncertain enterprise data in a cluster-computing environment

Published: 29 June 2009 Publication History
  • Get Citation Alerts
  • Abstract

    Modern enterprises must manage uncertain data for purposes of risk assessment and decisionmaking under uncertainty. The Monte Carlo approach embodied in the MCDB system of Jampani et al. is well suited for such a task. MCDB can support industrial strength business-intelligence queries over uncertain warehouse data. Moreover, MCDB's extensible approach to specifying uncertainty can also capture complex stochastic prediction models, allowing sophisticated ``what-if'' analyses within the DBMS. The MCDB computations can be highly CPU intensive, but offer the potential for massive parallelization. To realize this potential, we provide a new system, called MC3 (Monte Carlo Computation on a Cluster), that extends the MCDB approach to the map-reduce processing framework. MC3 can exploit the robustness and scalability of map-reduce, and can handle data stored in non-relational formats. We show how MCDB query plans over ``tuple bundles'' can be translated to sequences of map-reduce operations over nested data, and describe different parallelization schemes. We also provide and analyze several novel distributed algorithms for adding pseudorandom number seeds to tuple bundles. These algorithms ensure statistical correctness of the Monte-Carlo computations while minimizing the seed length. Our experiments show that MC3 can scale well for a variety of workloads.

    References

    [1]
    P. Agrawal, O. Benjelloun, A. D. Sarma, C. Hayworth, S. U. Nabar, T. Sugihara, and J. Widom. Trio:A system for data, uncertainty, and lineage. In VLDB, 2006.
    [2]
    L. Antova, C. Koch, andD. Olteanu. MayBMS: Managing incomplete information with probabilistic world-set decompositions. In ICDE, pages 1479--1480, 2007.
    [3]
    J. Boulos, N. N. Dalvi, B. Mandhani, S. Mathur, C. Ré, and D. Suciu. MYSTIQ:a system for finding more answers by using probabilities. In ACM SIGMOD, pages 891--893, 2005.
    [4]
    F. Chang, J. Dean, S. Ghemawat, W. C. Hsieh, D. A. Wallach, M. Burrows, T. Chandra, A. Fikes, and R. E. Gruber. Bigtable:a distributed storage system for structured data. In OSDI, pages 15--15, 2006.
    [5]
    H. chih Yang, A. Dasdan, R. -L. Hsiao, and D. S. Parker. Map-reduce-merge:simpli fied relational data processing on large clusters. In SIGMOD, pages 1029--1040, 2007.
    [6]
    P. D. Coddington. Random number generators for parallel computers. The NHSE Review, 2, 1996.
    [7]
    B. F. Cooper, R. Ramakrishnan, U. Srivastava, A. Silberstein, P. Bohannon, H.-A. Jacobsen, N. Puz, D. Weaver, and R. Yerneni. Pnuts:Yahoo!'s hosted data serving platform. Proc. VLDB, pages 1277--1288, 2008.
    [8]
    J. Dean and S. Ghemawat. Mapreduce: Simplified data processing on large clusters. In OSDI, pages 137--150, 2004.
    [9]
    L. Devroye. Non-Uniform Random Variate Generation. Springer, 1986.
    [10]
    D. DeWitt and J. Gray. Parallel database systems: the future of high performance database systems. Commun. ACM, 35(6):85--98, 1992.
    [11]
    P. W. Glynn and S. Asmussen. Stochastic Simulation: Algorithms and Analysis. Springer, 2007.
    [12]
    Hadoop. http://hadoop.apache.org/core/.
    [13]
    H. Haramoto, M. Matsumoto, and P. L'Ecuyer. A fast jump ahead algorithm for linear recurrences in a polynomial space. In SETA, pages 290--298, 2008.
    [14]
    H. Haramoto, M. Matsumoto, T. Nishimura, F. Panneton, and P. L'Ecuyer. Efficient jump ahead for F 2 linear random number generators. INFORMS J. Computing, 20(3):385--390, 2008.
    [15]
    S. G. Henderson and B. L. Nelson, editors. Simulation. North-Holland, 2006.
    [16]
    R. Jampani, F. Xu, M. Wu, L. L. Perez, C. M. Jermaine, and P. J. Haas. MCDB:a Monte Carlo approach to managing uncertain data. In ACM SIGMOD, pages 687--700, 2008.
    [17]
    JAQL. http://code.google.com/p/jaql/.
    [18]
    JSON. http://www.json.org.
    [19]
    B. Kimelfeld and Y. Sagiv. Modeling and querying probabilistic XML data. SIGMOD Record, 37(4):69--77, 2008.
    [20]
    P. L'Ecuyer. Random numbers for simulation. Comm. ACM, 33(10):85--97, 1990.
    [21]
    P. L'Ecuyer. Good parameters and implementations for combined multiple recursive random number generators. Oper. Res., 47(1):159--164, 1999.
    [22]
    P. L'Ecuyer and T. H. Andres. A random number generator based on the combination of four LCGs. Math. Comput. Simul., 44(1):99--107, 1997.
    [23]
    M. Mascagni. Some methods of parallel pseudorandom number generation. In R. Schreiber, M. Heath, and A. Ranade, editors, Algorithms for Parallel Processing, pages 277--288. Springer, 1997.
    [24]
    F. Panneton, P. L'Ecuyer, and M. Matsumoto. Improved long-period generators based on linear recurrences modulo 2. ACM Trans. Math. Software, 32(1):1--16, 2006.
    [25]
    S. K. Park and K. W. Miller. Random number generators:Good ones are hard to find. Comm. ACM, 31(10):1192--1201, 1988.
    [26]
    C. Re and D. Suciu. Managing probabilistic data with MystiQ:The can-do, the could-do, and the can't-do. In SUM, pages 5--18, 2008.
    [27]
    SimpleDB. http://aws.amazon.com.
    [28]
    S. Singh, C. Mayfield, S. Mittal, S. Prabhakar, S. Hambrusch, and R. Shah. Orion 2. 0:native support for uncertain data. In ACM SIGMOD, pages 1239--1242, 2006.
    [29]
    SQLServer Data Services. http://www.microsoft.com/sql/dataservices/default.mspx.
    [30]
    A. Srinivasan, D. M. Ceperley, and M. Mascagni. Random number generators for parallel applications. In Monte Carlo Methods in Chemical Physics, pages 13--36. Wiley, 1997.
    [31]
    C. J. K. Tan. The PLFG parallel pseudo-random number generator. Future Generation Computer Systems, 18:693--698, 2002.
    [32]
    D. Z. Wang, E. Michelakis, M. N. Garofalakis, and J. M. Hellerstein. BayesStore:managing large, uncertain data repositories with probabilistic graphical models. Proc. VLDB, pages 340--351, 2008.

    Cited By

    View all

    Index Terms

    1. E = MC3: managing uncertain enterprise data in a cluster-computing environment

      Recommendations

      Comments

      Information & Contributors

      Information

      Published In

      cover image ACM Conferences
      SIGMOD '09: Proceedings of the 2009 ACM SIGMOD International Conference on Management of data
      June 2009
      1168 pages
      ISBN:9781605585512
      DOI:10.1145/1559845
      Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

      Sponsors

      Publisher

      Association for Computing Machinery

      New York, NY, United States

      Publication History

      Published: 29 June 2009

      Permissions

      Request permissions for this article.

      Check for updates

      Author Tags

      1. jaql
      2. json
      3. map-reduce
      4. monte carlo
      5. uncertain data

      Qualifiers

      • Research-article

      Conference

      SIGMOD/PODS '09
      Sponsor:
      SIGMOD/PODS '09: International Conference on Management of Data
      June 29 - July 2, 2009
      Rhode Island, Providence, USA

      Acceptance Rates

      Overall Acceptance Rate 785 of 4,003 submissions, 20%

      Contributors

      Other Metrics

      Bibliometrics & Citations

      Bibliometrics

      Article Metrics

      • Downloads (Last 12 months)7
      • Downloads (Last 6 weeks)0
      Reflects downloads up to 09 Aug 2024

      Other Metrics

      Citations

      Cited By

      View all
      • (2022)Probabilistic Databases10.1007/978-3-031-01879-4Online publication date: 2-Mar-2022
      • (2014)Towards automated personalized data storage2014 IEEE 30th International Conference on Data Engineering Workshops10.1109/ICDEW.2014.6818341(278-283)Online publication date: Mar-2014
      • (2012)SplashProceedings of the 2nd ACM SIGHIT International Health Informatics Symposium10.1145/2110363.2110424(543-552)Online publication date: 28-Jan-2012
      • (2012)Splash: Simulation optimization in complex systems of systems2012 50th Annual Allerton Conference on Communication, Control, and Computing (Allerton)10.1109/Allerton.2012.6483248(414-425)Online publication date: Oct-2012
      • (2011)Database foundations for scalable RDF processingProceedings of the 7th international conference on Reasoning web: semantic technologies for the web of data10.5555/2033313.2033317(202-249)Online publication date: 23-Aug-2011
      • (2011)Probabilistic DatabasesSynthesis Lectures on Data Management10.2200/S00362ED1V01Y201105DTM0163:2(1-180)Online publication date: 31-May-2011
      • (2011)Data is dead... without what-if modelsProceedings of the VLDB Endowment10.14778/3402755.34028024:12(1486-1489)Online publication date: 1-Aug-2011
      • (2011)MyriadProceedings of the 1st Workshop on Architectures and Systems for Big Data10.1145/2377978.2377983(30-33)Online publication date: 10-Oct-2011
      • (2011)The monte carlo database systemACM Transactions on Database Systems10.1145/2000824.200082836:3(1-41)Online publication date: 26-Aug-2011
      • (2011)JigsawProceedings of the 2011 ACM SIGMOD International Conference on Management of data10.1145/1989323.1989410(829-840)Online publication date: 12-Jun-2011
      • Show More Cited By

      View Options

      Get Access

      Login options

      View options

      PDF

      View or Download as a PDF file.

      PDF

      eReader

      View online with eReader.

      eReader

      Media

      Figures

      Other

      Tables

      Share

      Share

      Share this Publication link

      Share on social media