doi 10.1145/3620678.3624666

tf.data service: A Case for Disaggregating ML Input Data Processing

Authors: Andrew Audibert, Yang Chen, Dan Graur, Ana Klimovic, Jiri Simsa, Chandramohan A. Thekkath

Abstract: Machine learning (ML) computations commonly execute on expensive specialized hardware, such as GPUs and TPUs, which provide high FLOPs and performance-per-watt. For cost efficiency, it is essential to keep these accelerators highly utilized. This requires preprocessing input data at the rate at which the accelerators can ingest and perform ML computations on the data. To avoid data stalls, the host CPU and RAM required for input data processing per accelerator core used for ML computations varies across jobs. Hence, the traditional approach of processing input data on ML accelerator hosts with a fixed hardware ratio leads to either under-utilizing the accelerators or the host CPU and RAM. In this paper, we address these concerns by building a disaggregated ML data processing system. We present tf.data service, an open-source disaggregated input data processing service built on top of tf.data in TensorFlow. We show that disaggregating data preprocessing has three key advantages for large-scale ML training jobs. First, the service can horizontally scale-out to right-size CPU/RAM host resources for data processing in each job, saving 32x training time and 26x cost, on average. Second, the service can share ephemeral preprocessed data results across jobs, to optimize CPU usage and reduce redundant computations. Finally, the service supports coordinated reads, a technique that avoids stragglers due to different input sizes in distributed training, reducing training time by 2.2x, on average. Our design is inspired by lessons learned from deploying tf.data service in production, including relaxing data visitation guarantees without impacting model accuracy.

Submitted 2 January, 2024; v1 submitted 26 October, 2022; originally announced October 2022.

arXiv:2104.12615 [pdf, other]

doi 10.14778/3489496.3489498

Evaluating Query Languages and Systems for High-Energy Physics Data [Extended Version]

Authors: Dan Graur, Ingo Müller, Mason Proffitt, Ghislain Fourny, Gordon T. Watts, Gustavo Alonso

Abstract: In the domain of high-energy physics (HEP), query languages in general and SQL in particular have found limited acceptance. This is surprising since HEP data analysis matches the SQL model well: the data is fully structured and queried using mostly standard operators. To gain insights on why this is the case, we perform a comprehensive analysis of six diverse, general-purpose data processing platforms using an HEP benchmark. The result of the evaluation is an interesting and rather complex picture of existing solutions: Their query languages vary greatly in how natural and concise HEP query patterns can be expressed. Furthermore, most of them are also between one and two orders of magnitude slower than the domain-specific system used by particle physicists today. These observations suggest that, while database systems and their query languages are in principle viable tools for HEP, significant work remains to make them relevant to HEP researchers.

Submitted 30 October, 2021; v1 submitted 26 April, 2021; originally announced April 2021.

Comments: This is the extended version of a full paper to appear in PVLDB 15.2 (VLDB 2022)

arXiv:1906.10496 [pdf]

The Coming Age of Pervasive Data Processing

Authors: Jan S. Rellermeyer, Sobhan Omranian Khorasani, Dan Graur, Apourva Parthasarathy

Abstract: Emerging Big Data analytics and machine learning applications require a significant amount of computational power. While there exists a plethora of large-scale data processing frameworks which thrive in handling the various complexities of data-intensive workloads, the ever-increasing demand of applications has made us reconsider the traditional ways of scaling (e.g., scale-out) and seek new opportunities for improving the performance. In order to prepare for an era where data collection and processing occur on a wide range of devices, from powerful HPC machines to small embedded devices, it is crucial to investigate and eliminate the potential sources of inefficiency in the current state-of-the-art platforms. In this paper, we address the current and upcoming challenges of pervasive data processing and present directions for designing the next generation of large-scale data processing systems.

Submitted 21 June, 2019; originally announced June 2019.

Comments: ISPDC 2019

arXiv:1709.01429 [pdf]

Down with ncRNA! Long live fRNA and jRNA!

Authors: Dan Graur

Abstract: Noncoding RNA (ncRNA) and long noncoding RNA (lncRNA) are scientifically invalid terms because they define molecular entities according to properties they do not possess and functions they do not perform. Here, I suggest retiring these two terms. Instead, I suggest using an evolutionary classification of genomic function, in which every RNA molecule is classified as either "functional" or "junk" according to its selected effect function. Dealing with RNA molecules whose functional status is unknown requires us to phrase Popperian nomenclatures that spell out the conditions for their own refutation. Thus, in the absence of falsifying evidence, RNA molecules of unknown function must be considered junk RNA (jRNA).

Submitted 1 September, 2017; originally announced September 2017.

arXiv:1601.06047 [pdf]

Rubbish DNA: The functionless fraction of the human genome

Authors: Dan Graur

Abstract: Because genomes are products of natural processes rather than intelligent design, all genomes contain functional and nonfunctional parts. The fraction of the genome that has no biological function is called rubbish DNA. Rubbish DNA consists of junk DNA, i.e., the fraction of the genome on which selection does not operate, and garbage DNA, i.e., sequences that lower the fitness of the organism, but exist in the genome because purifying selection is neither omnipotent nor instantaneous. In this chapter, I (1) review the concepts of genomic function and functionlessness from an evolutionary perspective, (2) present a precise nomenclature of genomic function, (3) discuss the evidence for the existence of vast quantities of junk DNA within the human genome, (4) discuss the mutational mechanisms responsible for generating junk DNA, (5) spell out the necessary evolutionary conditions for maintaining junk DNA, (6) outline various methodologies for estimating the functional fraction within the genome, and (7) present a recent estimate for the functional fraction of our genome.

Submitted 22 January, 2016; originally announced January 2016.

Comments: 87 pages, 1 Figure, 1 Table

arXiv:1503.04120 [pdf]

A scale-free method for testing the proportionality of branch lengths between two phylogenetic trees

Authors: Yichen Zheng, William Ott, Chinmaya Gupta, Dan Graur

Abstract: We introduce a scale-free method for testing the proportionality of branch lengths between two phylogenetic trees that have the same topology and contain the same set of taxa. This method scales both trees to a total length of 1 and sums up the differences for each branch. Compared to previous methods, ours yields a fully symmetrical score that measures proportionality without being affected by scale. We call this score the normalized tree distance (NTD). Based on real data, we demonstrate that NTD scores are distributed unimodally, in a manner similar to a lognormal distribution. The NTD score can be used to, for example, detect co-evolutionary processes and measure the accuracy of branch length estimates.

Submitted 13 March, 2015; originally announced March 2015.

Comments: 13 pages of main text, 2 tables and 4 figures
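
Illustration (not part of the arXiv listing): the abstract above describes the normalized tree distance as scaling both trees to a total length of 1 and summing the per-branch differences. A minimal sketch of that idea follows. It assumes each tree is given as a mapping from a branch label to a branch length, that "differences" means absolute differences, and that no further normalization is applied; the function name and input format are hypothetical, and the paper itself may define the score differently.

```python
def normalized_tree_distance(branches_a, branches_b):
    """Sketch of an NTD-style score (hypothetical helper, not the authors' code).

    Each argument maps a branch label to a non-negative branch length.
    Both trees must contain exactly the same branches (same topology, same taxa).
    """
    if branches_a.keys() != branches_b.keys():
        raise ValueError("trees must have the same set of branches")

    total_a = sum(branches_a.values())
    total_b = sum(branches_b.values())
    if total_a <= 0 or total_b <= 0:
        raise ValueError("total tree length must be positive")

    # Scale each tree to total length 1, then sum the absolute
    # per-branch differences (assumption: absolute value is intended).
    return sum(
        abs(branches_a[label] / total_a - branches_b[label] / total_b)
        for label in branches_a
    )


# Two trees whose branch lengths are proportional (second is 2x the first).
tree_1 = {"x": 1.0, "y": 2.0, "z": 3.0}
tree_2 = {"x": 2.0, "y": 4.0, "z": 6.0}

# A pair of trees whose branch lengths are not proportional.
tree_3 = {"x": 1.0, "y": 1.0}
tree_4 = {"x": 1.0, "y": 3.0}

proportional_score = normalized_tree_distance(tree_1, tree_2)
nonproportional_score = normalized_tree_distance(tree_3, tree_4)
```

Under these assumptions, proportional trees score 0 regardless of their overall scale, and swapping the two arguments leaves the score unchanged, which matches the scale-free and symmetry properties the abstract claims.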

arXiv:1410.3972 [pdf]

An extended reply to Mendez et al.: The 'extremely ancient' chromosome that still isn't

Authors: Eran Elhaik, Tatiana V. Tatarinova, Anatole A. Klyosov, Dan Graur

Abstract: Earlier this year, we published a scathing critique of a paper by Mendez et al. (2013) in which the claim was made that a Y chromosome was 237,000-581,000 years old. Elhaik et al. (2014) also attacked a popular article in Scientific American by the senior author of Mendez et al. (2013), whose title was "Sex with other human species might have been the secret of Homo sapiens's [sic] success" (Hammer 2013). Five of the 11 authors of Mendez et al. (2013) have now written a "rebuttal," and we were allowed to reply. Unfortunately, our reply was censored for being "too sarcastic and inflamed." References were removed, meanings were castrated, and a dedication in the Acknowledgments was deleted. Now that the so-called rebuttal by 45% of the authors of Mendez et al. (2013) has been published together with our vasectomized reply, we decided to make public our entire reply to the so-called "rebuttal." In fact, we go one step further, and publish a version of the reply that has not even been self-censored.

Submitted 20 October, 2014; v1 submitted 15 October, 2014; originally announced October 2014.

Showing 1–7 of 7 results for author: Graur, D