Qd-tree: Learning Data Layouts for Big Data Analytics

Yang, Zongheng; Chandramouli, Badrish; Wang, Chi; Gehrke, Johannes; Li, Yinan; Minhas, Umar Farooq; Larson, Per-Åke; Kossmann, Donald; Acharya, Rajeev

doi:10.1145/3318464.3389770

arXiv:2004.10898 (cs)

[Submitted on 22 Apr 2020]

Abstract: Corporations today collect data at an unprecedented and accelerating scale, making the need to run queries on large datasets increasingly important. Technologies such as columnar block-based data organization and compression have become standard practice in most commercial database systems. However, the problem of best assigning records to data blocks on storage is still open. For example, today's systems usually partition data by arrival time into row groups, or range/hash partition the data based on selected fields. For a given workload, however, such techniques are unable to optimize for the important metric of the number of blocks accessed by a query. This metric directly relates to the I/O cost, and therefore performance, of most analytical queries. Further, they are unable to exploit additional available storage to drive this metric down further.
In this paper, we propose a new framework called a query-data routing tree, or qd-tree, to address this problem, along with two algorithms for constructing it, one greedy and one based on deep reinforcement learning. Experiments over benchmark and real workloads show that a qd-tree can provide physical speedups of more than an order of magnitude compared to current blocking schemes, and can reach within 2X of the lower bound for data skipping based on selectivity, while providing complete semantic descriptions of created blocks.
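
The abstract describes a tree that routes both records and queries, with the goal of reducing the number of blocks a query accesses. The following is a minimal illustrative sketch of that idea, not the paper's implementation. It assumes a binary tree whose inner nodes hold a single-column cut of the form `column < threshold`, whose leaves stand for storage blocks, and queries that are conjunctions of half-open range filters. The names `Node`, `route_record`, `route_query`, and the example `tree` are all hypothetical, and the construction algorithms (greedy and deep reinforcement learning) are not shown.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple

Record = Dict[str, float]
# Assumption: a query is a conjunction of half-open range filters,
# column -> (low, high), meaning low <= value < high.
Query = Dict[str, Tuple[float, float]]


@dataclass
class Node:
    # Inner node: records with record[column] < threshold go left, others right.
    column: Optional[str] = None
    threshold: Optional[float] = None
    left: Optional["Node"] = None
    right: Optional["Node"] = None
    # Leaf node: identifier of the storage block it stands for.
    block_id: Optional[int] = None

    def is_leaf(self) -> bool:
        return self.block_id is not None


def route_record(node: Node, record: Record) -> int:
    """Return the block that a record is assigned to."""
    while not node.is_leaf():
        node = node.left if record[node.column] < node.threshold else node.right
    return node.block_id


def route_query(node: Node, query: Query) -> List[int]:
    """Return the blocks a query may need to read; all others can be skipped."""
    if node.is_leaf():
        return [node.block_id]
    # A column the query does not filter on matches every value.
    low, high = query.get(node.column, (float("-inf"), float("inf")))
    blocks: List[int] = []
    if low < node.threshold:   # range overlaps the left side (values < threshold)
        blocks += route_query(node.left, query)
    if high > node.threshold:  # range overlaps the right side (values >= threshold)
        blocks += route_query(node.right, query)
    return blocks


# Hypothetical three-block layout: cut on price first, then on qty.
tree = Node(
    column="price",
    threshold=100.0,
    left=Node(
        column="qty",
        threshold=10.0,
        left=Node(block_id=0),
        right=Node(block_id=1),
    ),
    right=Node(block_id=2),
)

print(route_record(tree, {"price": 50.0, "qty": 20.0}))
print(route_query(tree, {"price": (150.0, 200.0), "qty": (0.0, 5.0)}))
```

In this sketch the path from the root to a leaf is a conjunction of cuts that describes exactly which records the block can contain, which is one way to read the abstract's "complete semantic descriptions of created blocks". A query that filters only on `qty` still skips block 1 here, because the cut on `qty` below the `price` cut excludes it.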

Comments:	ACM SIGMOD 2020
Subjects:	Databases (cs.DB); Data Structures and Algorithms (cs.DS); Machine Learning (cs.LG)
Cite as:	arXiv:2004.10898 [cs.DB]
	(or arXiv:2004.10898v1 [cs.DB] for this version)
	https://doi.org/10.48550/arXiv.2004.10898
Related DOI:	https://doi.org/10.1145/3318464.3389770

Submission history

From: Badrish Chandramouli [view email]
[v1] Wed, 22 Apr 2020 23:42:59 UTC (2,375 KB)
