DOI: 10.1145/1871437.1871606
poster

Efficiently querying archived data using Hadoop

Published: 26 October 2010

Abstract

The need to analyze structured data for business intelligence applications such as customer churn analysis, social network analysis, and telecom network monitoring is well known. However, the size to which such data is expected to grow will make solutions built around data warehouses hard to scale. As data sizes grow, data is moved from the warehouse to archives more frequently. Current file-based archive models make the archived data unusable for any type of insight extraction. In this paper, we present an active archival solution for data warehouses that uses the Hadoop Distributed File System (HDFS) to store the data in an always-available and cost-effective manner. We investigate various structured data storage schemes within HDFS, and empirical evaluations show that a combination of the Universal scheme model and a column store is best suited to the active archival solution.
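
The abstract names two ideas without showing them: flattening warehouse tables into one wide (universal) relation, and storing that relation column by column. The following is a minimal in-memory sketch of that combination, not the authors' implementation and not tied to HDFS. The table contents and the helper names `to_universal`, `to_columns`, and `scan` are hypothetical and chosen only for illustration.

```python
from collections import defaultdict


def to_universal(fact_rows, dim_rows, key):
    """Join each fact row with its dimension row into one wide record."""
    dim_by_key = {d[key]: d for d in dim_rows}
    return [{**dim_by_key[f[key]], **f} for f in fact_rows]


def to_columns(rows):
    """Pivot a list of records into one list of values per column."""
    columns = defaultdict(list)
    for row in rows:
        for name, value in row.items():
            columns[name].append(value)
    return dict(columns)


def scan(columns, wanted):
    """Read only the requested columns and re-assemble them into tuples."""
    return list(zip(*(columns[name] for name in wanted)))


# Hypothetical sample data: call records (fact) and customers (dimension).
calls = [
    {"cust_id": 1, "minutes": 12},
    {"cust_id": 2, "minutes": 3},
    {"cust_id": 1, "minutes": 7},
]
customers = [
    {"cust_id": 1, "region": "north"},
    {"cust_id": 2, "region": "south"},
]

wide = to_universal(calls, customers, "cust_id")  # no join needed at query time
store = to_columns(wide)                          # one value list per column
result = scan(store, ["region", "minutes"])       # touches two columns only
print(result)
```

In this sketch the join cost is paid once at archive time, and a query that needs only `region` and `minutes` never reads `cust_id`. Whether the same trade-off holds on HDFS at scale is what the paper's empirical evaluation addresses.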

Published In

CIKM '10: Proceedings of the 19th ACM international conference on Information and knowledge management
October 2010
2036 pages
ISBN:9781450300995
DOI:10.1145/1871437

Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. data archival
  2. data warehouse
  3. hadoop
  4. query
  5. response time

Qualifiers

  • Poster

Conference

CIKM '10

Acceptance Rates

Overall Acceptance Rate 1,861 of 8,427 submissions, 22%

Article Metrics

  • Downloads (last 12 months): 10
  • Downloads (last 6 weeks): 4
Reflects downloads up to 01 Nov 2024

Cited By

  • (2021) HFBT: An Efficient Hierarchical Fault-tolerant Method for Cloud Storage System. 2021 IEEE Intl Conf on Parallel & Distributed Processing with Applications, Big Data & Cloud Computing, Sustainable Computing & Communications, Social Computing & Networking (ISPA/BDCloud/SocialCom/SustainCom), pp. 27-36. DOI: 10.1109/ISPA-BDCloud-SocialCom-SustainCom52081.2021.00019. Online publication date: Sep-2021
  • (2017) aHDFS: An Erasure-Coded Data Archival System for Hadoop Clusters. IEEE Transactions on Parallel and Distributed Systems 28:11, pp. 3060-3073. DOI: 10.1109/TPDS.2017.2706686. Online publication date: 1-Nov-2017
  • (2017) Load rebalancing for Hadoop Distributed File System using distributed hash table. 2017 International Conference on Intelligent Sustainable Systems (ICISS), pp. 939-943. DOI: 10.1109/ISS1.2017.8389317. Online publication date: Dec-2017
  • (2014) A novel approach to implement a shop bot on distributed web crawler. 2014 IEEE International Advance Computing Conference (IACC), pp. 882-886. DOI: 10.1109/IAdCC.2014.6779439. Online publication date: Feb-2014
  • (2014) Join processing with threshold-based filtering in MapReduce. The Journal of Supercomputing 69:2, pp. 793-813. DOI: 10.1007/s11227-014-1179-9. Online publication date: 1-Aug-2014
  • (2012) HyDB. Proceedings of the 2012 IEEE 14th International Conference on High Performance Computing and Communication & 2012 IEEE 9th International Conference on Embedded Software and Systems, pp. 580-587. DOI: 10.1109/HPCC.2012.84. Online publication date: 25-Jun-2012
