
Vineyard: Optimizing Data Sharing in Data-Intensive Analytics

Authors:

Wenyuan Yu, Tao He, Lei Wang, Ke Meng, Ye Cao, Diwen Zhu, Sanhong Li, Jingren Zhou

Proceedings of the ACM on Management of Data, Volume 1, Issue 2

Article No.: 200, Pages 1 - 27

https://doi.org/10.1145/3589780

Published: 20 June 2023


Abstract

Modern data analytics and AI jobs have become increasingly complex and involve multiple tasks performed on specialized systems. Sharing intermediate data between different systems is often a significant bottleneck in such jobs. When the intermediate data is large, it is mostly exchanged through files in standard formats (e.g., CSV and ORC), causing high I/O and (de)serialization overheads. To solve these problems, we develop Vineyard, a high-performance, extensible, and cloud-native object store that aims to give users an intuitive way to share data across systems in complex real-life workflows. Different systems usually work on data structures (e.g., dataframes, graphs, hashmaps) with similar interfaces, and their computation logic is often loosely coupled with how such interfaces are implemented over specific memory layouts. This allows Vineyard to share data efficiently at a high level via memory mapping and method sharing. Vineyard provides an IDL named VCDL that helps users register their own intermediate data types, so that objects of the registered types can be efficiently shared across systems in a polyglot workflow. As a cloud-native system, Vineyard is designed to work closely with Kubernetes and to achieve fault tolerance and high performance in production environments. Evaluations on real-life datasets and data analytics jobs show that these optimizations can significantly improve the end-to-end performance of data analytics jobs, reducing their data-sharing time by up to 68.4x.

Supplemental Material

MP4 File

Vineyard is an in-memory object manager developed by Alibaba Group's DAMO Academy that aims to improve data sharing in data-intensive analytics. The work examines real-life scenarios and shows how big data has outgrown the capacity of single compute systems, creating a need for specialized systems and federated computing platforms. The intermediate data passed between these systems becomes a bottleneck when it is shared through external file systems. Vineyard proposes to reduce the time spent sharing intermediate data and to bridge different systems efficiently, so that intermediate data exchange is decoupled from any single system. By representing an object as metadata plus a set of blobs, Vineyard enables efficient, zero-copy sharing of complex objects such as graphs. Vineyard not only improves performance but also enables cross-engine and cross-language integration through its object composability. In practice, Vineyard has been integrated into various data-intensive systems and can speed up end-to-end execution by up to 9x.
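The "metadata plus a set of blobs" idea can be illustrated with a minimal sketch. This is not Vineyard's API: it uses Python's standard `multiprocessing.shared_memory` module as a stand-in for a shared-memory object store, and the function names (`put_int64_column`, `get_int64_column`) and the metadata fields (`typename`, `length`, `blob`) are hypothetical, chosen only for illustration. The producer copies the payload once into a shared blob and publishes a small JSON description; a consumer rebuilds a typed view over the same memory without copying or deserializing the payload.

```python
import json
import struct
from multiprocessing import shared_memory

ITEM = struct.calcsize("q")  # size in bytes of one native 64-bit integer


def put_int64_column(values):
    """Producer side: write values into a shared blob and describe it.

    Returns (metadata_json, blob_handle). The metadata is tiny; the payload
    lives only in the shared-memory blob. All field names are illustrative.
    """
    nbytes = ITEM * len(values)
    # Shared memory segments cannot be empty, so allocate at least one byte.
    blob = shared_memory.SharedMemory(create=True, size=max(nbytes, 1))
    struct.pack_into(f"{len(values)}q", blob.buf, 0, *values)
    meta = json.dumps(
        {"typename": "int64_column", "length": len(values), "blob": blob.name}
    )
    return meta, blob


def get_int64_column(meta_json):
    """Consumer side: attach to the blob named in the metadata.

    Returns (view, blob_handle). The view is a typed memoryview over the
    shared bytes, so no payload bytes are copied.
    """
    meta = json.loads(meta_json)
    blob = shared_memory.SharedMemory(name=meta["blob"])
    # The OS may round the segment up to a page size, so slice to the
    # exact payload length recorded in the metadata before casting.
    view = blob.buf[: ITEM * meta["length"]].cast("q")
    return view, blob


if __name__ == "__main__":
    meta, owner = put_int64_column([1, 2, 3])
    view, reader = get_int64_column(meta)
    print(meta)
    print(sum(view))
    # Release views before closing, then remove the segment.
    view.release()
    reader.close()
    owner.close()
    owner.unlink()
```

In this sketch only the short JSON string needs to travel between systems; the consumer could equally be a separate process, or a program in another language that can map the same segment and interpret the layout named in the metadata.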


