DOI: 10.1145/2807591.2807622

AnalyzeThis: an analysis workflow-aware storage system

Published: 15 November 2015

    Abstract

    The need for novel data analysis is urgent in the face of a data deluge from modern applications. Traditional approaches to data analysis incur significant data movement costs, moving data back and forth between the storage system and the processor. Emerging Active Flash devices enable processing on the flash, where the data already resides. An array of such Active Flash devices allows us to revisit how analysis workflows interact with storage systems. By seamlessly blending together the flash storage and data analysis, we create an analysis workflow-aware storage system, AnalyzeThis. Our guiding principle is that analysis-awareness be deeply ingrained in each and every layer of the storage, elevating data analyses as first-class citizens, and transforming AnalyzeThis into a potent analytics-aware appliance. We implement the AnalyzeThis storage system atop an emulation platform of the Active Flash array. Our results indicate that AnalyzeThis is viable, expediting workflow execution and minimizing data movement.

    Published In

    SC '15: Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis
    November 2015
    985 pages
    ISBN:9781450337236
    DOI:10.1145/2807591
    General Chair: Jackie Kern
    Program Chair: Jeffrey S. Vetter
    © 2015 Association for Computing Machinery. ACM acknowledges that this contribution was authored or co-authored by an employee, contractor or affiliate of the United States government. As such, the United States Government retains a nonexclusive, royalty-free right to publish or reproduce this article, or to allow others to do so, for Government purposes only.

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Author Tags

    1. data analytics
    2. visualization & storage

    Qualifiers

    • Research-article

    Conference

    SC15
    Acceptance Rates

    SC '15 Paper Acceptance Rate 79 of 358 submissions, 22%;
    Overall Acceptance Rate 1,516 of 6,373 submissions, 24%

    Article Metrics

    • Downloads (last 12 months): 68
    • Downloads (last 6 weeks): 7
    Reflects downloads up to 26 Jul 2024.

    Cited By

    • (2023) A State-Aware Method for Flows With Fairness on NVMe SSDs With Load Balance. IEEE Transactions on Cloud Computing (1-16). DOI: 10.1109/TCC.2023.3253864. Online publication date: 2023
    • (2022) A State-aware Method for Flows with Fairness on NVMe SSDs with Load Balance. 2022 IEEE 15th International Conference on Cloud Computing (CLOUD) (11-18). DOI: 10.1109/CLOUD55607.2022.00017. Online publication date: Jul-2022
    • (2021) Programming Abstractions for Managing Workflows on Tiered Storage Systems. ACM Transactions on Storage, 17:4 (1-21). DOI: 10.1145/3457119. Online publication date: 25-Oct-2021
    • (2021) ProSPECT: Proactive Storage Using Provenance for Efficient Compute and Tiering. Transactions of the Indian National Academy of Engineering, 7:1 (219-234). DOI: 10.1007/s41403-021-00261-8. Online publication date: 5-Sep-2021
    • (2020) Streaming Data Reorganization at Scale with DeltaFS Indexed Massive Directories. ACM Transactions on Storage, 16:4 (1-31). DOI: 10.1145/3415581. Online publication date: 24-Sep-2020
    • (2020) CoREC. ACM Transactions on Parallel Computing, 7:2 (1-29). DOI: 10.1145/3391448. Online publication date: 18-May-2020
    • (2019) Checkpointing Strategies for Shared High-Performance Computing Platforms. International Journal of Networking and Computing, 9:1 (28-52). DOI: 10.15803/ijnc.9.1_28. Online publication date: 2019
    • (2019) Revisiting I/O behavior in large-scale storage systems. Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis (1-13). DOI: 10.1145/3295500.3356183. Online publication date: 17-Nov-2019
    • (2019) An Analysis Workflow-Aware Storage System for Multi-Core Active Flash Arrays. IEEE Transactions on Parallel and Distributed Systems, 30:2 (271-285). DOI: 10.1109/TPDS.2018.2865471. Online publication date: 1-Feb-2019
    • (2018) Optimal Cooperative Checkpointing for Shared High-Performance Computing Platforms. 2018 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW) (803-812). DOI: 10.1109/IPDPSW.2018.00127. Online publication date: May-2018
