DOI: 10.1109/ICDE.2011.5767933
Article

RCFile: A fast and space-efficient data placement structure in MapReduce-based warehouse systems

Published: 11 April 2011
  • Abstract

    MapReduce-based data warehouse systems play an important role in supporting big data analytics, helping typical Web service providers and social network sites (e.g., Facebook) quickly understand the dynamics of user behavior trends and user needs. In such a system, the data placement structure is a critical factor that can fundamentally affect warehouse performance. Based on our observations and analysis of Facebook production systems, we have characterized four requirements for the data placement structure: (1) fast data loading, (2) fast query processing, (3) highly efficient storage space utilization, and (4) strong adaptivity to highly dynamic workload patterns. We have examined three commonly accepted data placement structures in conventional databases, namely row-stores, column-stores, and hybrid-stores, in the context of large data analysis using MapReduce. We show that they are not very suitable for big data processing in distributed systems. In this paper, we present a big data placement structure called RCFile (Record Columnar File) and its implementation in the Hadoop system. With intensive experiments, we show the effectiveness of RCFile in satisfying the four requirements. RCFile has been chosen as the default option in the Facebook data warehouse system. It has also been adopted by Hive and Pig, the two most widely used data analysis systems, developed at Facebook and Yahoo!, respectively.

    Published In

    ICDE '11: Proceedings of the 2011 IEEE 27th International Conference on Data Engineering
    April 2011
    1457 pages
    ISBN:9781424489596

    Publisher

    IEEE Computer Society

    United States

    Publication History

    Published: 11 April 2011

    Qualifiers

    • Article

    Cited By

    • (2023) An Empirical Evaluation of Columnar Storage Formats. Proceedings of the VLDB Endowment 17:2 (148-161). DOI: 10.14778/3626292.3626298. Online publication date: 1-Oct-2023
    • (2023) Big Data Analytics In Indonesia: Literature Study. Proceedings of the 2023 5th International Conference on Management Science and Industrial Engineering (23-28). DOI: 10.1145/3603955.3603960. Online publication date: 27-Apr-2023
    • (2023) Apache IoTDB: A Time Series Database for IoT Applications. Proceedings of the ACM on Management of Data 1:2 (1-27). DOI: 10.1145/3589775. Online publication date: 20-Jun-2023
    • (2021) The art of balance. Proceedings of the VLDB Endowment 14:12 (2999-3013). DOI: 10.14778/3476311.3476378. Online publication date: 28-Oct-2021
    • (2021) Adaptive Compression for Fast Scans on String Columns. Proceedings of the 2021 International Conference on Management of Data (554-562). DOI: 10.1145/3448016.3452798. Online publication date: 9-Jun-2021
    • (2018) Hengam a MapReduce-Based Distributed Data Warehouse for Big Data. International Journal of Artificial Life Research 8:1 (16-35). DOI: 10.4018/IJALR.2018010102. Online publication date: 1-Jan-2018
    • (2017) Understanding Human-Machine Networks. ACM Computing Surveys 50:1 (1-35). DOI: 10.1145/3039868. Online publication date: 4-Apr-2017
    • (2017) Wide Table Layout Optimization based on Column Ordering and Duplication. Proceedings of the 2017 ACM International Conference on Management of Data (299-314). DOI: 10.1145/3035918.3035930. Online publication date: 9-May-2017
    • (2016) Skipping-oriented partitioning for columnar layouts. Proceedings of the VLDB Endowment 10:4 (421-432). DOI: 10.14778/3025111.3025123. Online publication date: 1-Nov-2016
    • (2016) Operational analytics data management systems. Proceedings of the VLDB Endowment 9:13 (1601-1604). DOI: 10.14778/3007263.3007319. Online publication date: 1-Sep-2016
