DOI: 10.1145/1376616.1376726
research-article

Pig Latin: a not-so-foreign language for data processing

Published: 09 June 2008

Abstract

    There is a growing need for ad-hoc analysis of extremely large data sets, especially at internet companies where innovation critically depends on being able to analyze terabytes of data collected every day. Parallel database products, e.g., Teradata, offer a solution, but are usually prohibitively expensive at this scale. Besides, many of the people who analyze this data are entrenched procedural programmers, who find the declarative SQL style unnatural. The success of the more procedural map-reduce programming model, and of its associated scalable implementations on commodity hardware, is evidence of both points. However, the map-reduce paradigm is too low-level and rigid, and leads to a great deal of custom user code that is hard to maintain and reuse.
    We describe a new language called Pig Latin that we have designed to fit in a sweet spot between the declarative style of SQL and the low-level, procedural style of map-reduce. The accompanying system, Pig, is fully implemented, and compiles Pig Latin into physical plans that are executed over Hadoop, an open-source map-reduce implementation. We give a few examples of how engineers at Yahoo! are using Pig to dramatically reduce the time required for the development and execution of their data analysis tasks, compared to using Hadoop directly. We also report on a novel debugging environment that comes integrated with Pig and that can lead to even higher productivity gains. Pig is an open-source Apache Incubator project, available for general use.
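
    The "sweet spot" described above can be illustrated with a minimal sketch in plain Python. This is not Pig Latin syntax and not an example taken from the paper; the records, the helper names (`filter_by`, `group_by`), and the thresholds are all hypothetical. It only shows the general shape of a dataflow program: a sequence of named steps, each a high-level transformation such as filter, group, or aggregate, rather than one declarative query or a hand-written map and reduce function pair.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical input records: (url, category, pagerank).
urls = [
    ("a.example", "news", 0.75),
    ("b.example", "news", 0.5),
    ("c.example", "blogs", 0.125),
    ("d.example", "blogs", 0.25),
    ("e.example", "shop", 0.375),
]


def filter_by(records, predicate):
    """Keep only the records for which predicate(record) is true."""
    return [r for r in records if predicate(r)]


def group_by(records, key):
    """Collect records into a dict keyed by key(record)."""
    groups = defaultdict(list)
    for r in records:
        groups[key(r)].append(r)
    return dict(groups)


# Each step is a named intermediate result, written in execution order.
good_urls = filter_by(urls, lambda r: r[2] > 0.2)            # drop low-pagerank urls
groups = group_by(good_urls, lambda r: r[1])                 # group by category
big_groups = {c: rs for c, rs in groups.items() if len(rs) >= 2}
avg_pagerank = {c: mean(r[2] for r in rs) for c, rs in big_groups.items()}

print(avg_pagerank)
```

    Every intermediate result has a name and can be inspected on its own, which is the procedural flavor the abstract contrasts with SQL. None of the steps spells out how records are partitioned or shuffled, which is the level of detail the abstract attributes to raw map-reduce code.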

      Published In

      SIGMOD '08: Proceedings of the 2008 ACM SIGMOD international conference on Management of data
      June 2008
      1396 pages
      ISBN:9781605581026
      DOI:10.1145/1376616
      Publisher

      Association for Computing Machinery

      New York, NY, United States

      Author Tags

      1. dataflow language
      2. pig latin

      Qualifiers

      • Research-article

      Conference

      SIGMOD/PODS '08
      Acceptance Rates

      Overall Acceptance Rate 785 of 4,003 submissions, 20%
