-
Imaginary Machines: A Serverless Model for Cloud Applications
Authors:
Michael Wawrzoniak,
Rodrigo Bruno,
Ana Klimovic,
Gustavo Alonso
Abstract:
Serverless Function-as-a-Service (FaaS) platforms provide applications with resources that are highly elastic, quick to instantiate, accounted at fine granularity, and free of the need for explicit runtime resource orchestration. This combination of core properties underpins the success and popularity of the serverless FaaS paradigm. However, these benefits are not available to most cloud applications because they are designed for networked virtual machine/container environments. Since such cloud applications cannot take advantage of the highly elastic resources of serverless and require run-time orchestration systems to operate, they suffer from lower resource utilization, additional management complexity, and higher costs relative to their serverless FaaS counterparts.
We propose Imaginary Machines, a new serverless model for cloud applications. This model (1) exposes the highly elastic resources of serverless platforms as the traditional network-of-hosts model that cloud applications expect, and (2) eliminates the need for explicit run-time orchestration by transparently managing application resources based on signals generated during cloud application executions. With the Imaginary Machines model, unmodified cloud applications become serverless applications. While still based on the network-of-hosts model, they benefit from the highly elastic resources and do not require runtime orchestration, just like their specialized serverless FaaS counterparts, promising increased resource utilization while reducing management costs.
Submitted 30 June, 2024;
originally announced July 2024.
-
Boxer: FaaSt Ephemeral Elasticity for Off-the-Shelf Cloud Applications
Authors:
Michael Wawrzoniak,
Rodrigo Bruno,
Ana Klimovic,
Gustavo Alonso
Abstract:
Elasticity is a key property of cloud computing. However, elasticity is offered today at the granularity of virtual machines, which take tens of seconds to start. This is insufficient to react to load spikes and sudden failures in latency-sensitive applications, leading users to resort to expensive overprovisioning. Function-as-a-Service (FaaS) provides significantly higher elasticity than VMs, but comes coupled with an event-triggered programming model and a constrained execution environment that make it unsuitable for off-the-shelf applications. Previous work tries to overcome these obstacles but often requires re-architecting the applications. In this paper, we show how off-the-shelf applications can transparently benefit from ephemeral elasticity with FaaS. We built Boxer, an interposition layer spanning VMs and AWS Lambda, that intercepts application execution and emulates the network-of-hosts environment that applications expect when deployed in a conventional VM/container environment. The ephemeral elasticity of Boxer enables significant performance gains and cost savings for off-the-shelf applications with, e.g., recovery times over 5x faster than EC2 instances and load-spike absorption comparable to overprovisioned EC2 VM instances.
Submitted 30 June, 2024;
originally announced July 2024.
-
Accelerating Graph-based Vector Search via Delayed-Synchronization Traversal
Authors:
Wenqi Jiang,
Hang Hu,
Torsten Hoefler,
Gustavo Alonso
Abstract:
Vector search systems are indispensable in large language model (LLM) serving, search engines, and recommender systems, where minimizing online search latency is essential. Among various algorithms, graph-based vector search (GVS) is particularly popular due to its high search performance and quality. To efficiently serve low-latency GVS, we propose a hardware-algorithm co-design solution including Falcon, a GVS accelerator, and Delayed-Synchronization Traversal (DST), an accelerator-optimized graph traversal algorithm. Falcon implements high-performance GVS operators and reduces memory accesses with an on-chip Bloom filter to track search states. DST improves search performance and quality by relaxing the graph traversal order to maximize accelerator utilization. Evaluation across various graphs and datasets shows that our Falcon prototype on FPGAs, coupled with DST, achieves up to 4.3$\times$ and 19.5$\times$ speedups in latency and up to 8.0$\times$ and 26.9$\times$ improvements in energy efficiency over CPU and GPU-based GVS systems. The remarkable efficiency of Falcon and DST demonstrates their potential to become the standard solutions for future GVS acceleration.
Submitted 18 June, 2024;
originally announced June 2024.
-
TablePuppet: A Generic Framework for Relational Federated Learning
Authors:
Lijie Xu,
Chulin Xie,
Yiran Guo,
Gustavo Alonso,
Bo Li,
Guoliang Li,
Wei Wang,
Wentao Wu,
Ce Zhang
Abstract:
Current federated learning (FL) approaches view decentralized training data as a single table, divided among participants either horizontally (by rows) or vertically (by columns). However, these approaches are inadequate for handling distributed relational tables across databases. This scenario requires intricate SQL operations like joins and unions to obtain the training data, which is either costly or restricted by privacy concerns. This raises the question: can we directly run FL on distributed relational tables?
In this paper, we formalize this problem as relational federated learning (RFL). We propose TablePuppet, a generic framework for RFL that decomposes the learning process into two steps: (1) learning over join (LoJ) followed by (2) learning over union (LoU). In a nutshell, LoJ pushes learning down onto the vertical tables being joined, and LoU further pushes learning down onto the horizontal partitions of each vertical table. TablePuppet incorporates computation/communication optimizations to deal with the duplicate tuples introduced by joins, as well as differential privacy (DP) to protect against both feature and label leakages. We demonstrate the efficiency of TablePuppet in combination with two widely-used ML training algorithms, stochastic gradient descent (SGD) and alternating direction method of multipliers (ADMM), and compare their computation/communication complexity. We evaluate the SGD/ADMM algorithms developed atop TablePuppet by training diverse ML models. Our experimental results show that TablePuppet achieves model accuracy comparable to the centralized baselines running directly atop the SQL results. Moreover, ADMM takes less communication time than SGD to converge to similar model accuracy.
Submitted 23 March, 2024;
originally announced March 2024.
-
General infinitesimal variations of Hodge structure of ample curves in surfaces
Authors:
Víctor González Alonso,
Sara Torelli
Abstract:
Given a smooth projective complex curve inside a smooth projective surface, one can ask how its Hodge structure varies when the curve moves inside the surface. In this paper we develop a general theory to study the infinitesimal version of this question in the case of ample curves. We can then apply the machinery to show that the infinitesimal variation of Hodge structure of a general deformation of an ample curve in $\mathbb{P}^1\times\mathbb{P}^1$ is an isomorphism.
Submitted 23 February, 2024;
originally announced February 2024.
-
EUSO-SPB1 Mission and Science
Authors:
JEM-EUSO Collaboration,
G. Abdellaoui,
S. Abe,
J. H. Adams. Jr.,
D. Allard,
G. Alonso,
L. Anchordoqui,
A. Anzalone,
E. Arnone,
K. Asano,
R. Attallah,
H. Attoui,
M. Ave Pernas,
R. Bachmann,
S. Bacholle,
M. Bagheri,
M. Bakiri,
J. Baláz,
D. Barghini,
S. Bartocci,
M. Battisti,
J. Bayer,
B. Beldjilali,
T. Belenguer
, et al. (271 additional authors not shown)
Abstract:
The Extreme Universe Space Observatory on a Super Pressure Balloon 1 (EUSO-SPB1) was launched in 2017 April from Wanaka, New Zealand. The plan of this mission of opportunity on a NASA super pressure balloon test flight was to circle the southern hemisphere. The primary scientific goal was to make the first observations of ultra-high-energy cosmic-ray extensive air showers (EASs) by looking down on the atmosphere with an ultraviolet (UV) fluorescence telescope from suborbital altitude (33 km). After 12 days and 4 hours aloft, the flight was terminated prematurely in the Pacific Ocean. Before the flight, the instrument was tested extensively in the West Desert of Utah, USA, with UV point sources and lasers. The test results indicated that the instrument had sensitivity to EASs of approximately 3 EeV. Simulations of the telescope system, telescope on time, and realized flight trajectory predicted an observation of about 1 event assuming clear sky conditions. The effects of high clouds were estimated to reduce this value by approximately a factor of 2. A manual search and a machine-learning-based search did not find any EAS signals in these data. Here we review the EUSO-SPB1 instrument and flight and the EAS search.
Submitted 12 January, 2024;
originally announced January 2024.
-
CXL and the Return of Scale-Up Database Engines
Authors:
Alberto Lerner,
Gustavo Alonso
Abstract:
The growing trend towards specialization has led to a proliferation of accelerators and alternative processing devices. When embedded in conventional computer architectures, the PCIe link connecting the CPU to these devices becomes a bottleneck. Several proposals for alternative designs have been put forward, with these efforts having now converged into the Compute Express Link (CXL) specification. CXL is an interconnect protocol on top of PCIe with a more modern and powerful interface. While still on version 1.0 in terms of commercial availability, the potential of CXL to radically change the underlying architecture has already attracted considerable attention. This attention has been focused mainly on the possibility of using CXL to build a shared memory system among the machines in a rack. We argue, however, that such benefits are just the beginning of more significant changes that will have a major impact on database engines and data processing systems. In a nutshell, while the cloud favored scale-out approaches, CXL brings back scale-up architectures. In the paper we describe how CXL enables such architectures, and the research challenges associated with the emerging scale-up, heterogeneous hardware platforms.
Submitted 2 January, 2024;
originally announced January 2024.
-
ACCL+: an FPGA-Based Collective Engine for Distributed Applications
Authors:
Zhenhao He,
Dario Korolija,
Yu Zhu,
Benjamin Ramhorst,
Tristan Laan,
Lucian Petrica,
Michaela Blott,
Gustavo Alonso
Abstract:
FPGAs are increasingly prevalent in cloud deployments, serving as Smart NICs or network-attached accelerators. Despite their potential, developing distributed FPGA-accelerated applications remains cumbersome due to the lack of appropriate infrastructure and communication abstractions. To facilitate the development of distributed applications with FPGAs, in this paper we propose ACCL+, an open-source versatile FPGA-based collective communication library. Portable across different platforms and supporting UDP, TCP, as well as RDMA, ACCL+ empowers FPGA applications to initiate direct FPGA-to-FPGA collective communication. Additionally, it can serve as a collective offload engine for CPU applications, freeing the CPU from networking tasks. It is user-extensible, allowing new collectives to be implemented and deployed without having to re-synthesize the FPGA circuit. We evaluated ACCL+ on an FPGA cluster with 100 Gb/s networking, comparing its performance against software MPI over RDMA. The results demonstrate ACCL+'s significant advantages for FPGA-based distributed applications and highly competitive performance for CPU applications. We showcase ACCL+'s dual role with two use cases: seamlessly integrating as a collective offload engine to distribute CPU-based vector-matrix multiplication, and serving as a crucial and efficient component in designing fully FPGA-based distributed deep-learning recommendation inference.
Submitted 18 December, 2023;
originally announced December 2023.
-
An inductive bias from quantum mechanics: learning order effects with non-commuting measurements
Authors:
Kaitlin Gili,
Guillermo Alonso,
Maria Schuld
Abstract:
There are two major approaches to building good machine learning algorithms: feeding lots of data into large models, or picking a model class with an "inductive bias" that suits the structure of the data. When taking the second approach as a starting point to design quantum algorithms for machine learning, it is important to understand how mathematical structures in quantum mechanics can lead to useful inductive biases in quantum models. In this work, we bring a collection of theoretical evidence from the Quantum Cognition literature to the field of Quantum Machine Learning to investigate how non-commutativity of quantum observables can help to learn data with "order effects", such as the changes in human answering patterns when swapping the order of questions in a survey. We design a multi-task learning setting in which a generative quantum model consisting of sequential learnable measurements can be adapted to a given task (or question order) by changing the order of observables, and we provide artificial datasets inspired by human psychology to carry out our investigation. Our first experimental simulations show that in some cases the quantum model learns more non-commutativity as the amount of order effect present in the data is increased, and that the quantum model can learn to generate better samples for unseen question orders when trained on others, both signs that the model architecture suits the task.
Submitted 6 December, 2023;
originally announced December 2023.
-
Efficiently Processing Large Relational Joins on GPUs
Authors:
Bowen Wu,
Dimitrios Koutsoukos,
Gustavo Alonso
Abstract:
With the growing interest in Machine Learning (ML), Graphics Processing Units (GPUs) have become key elements of any computing infrastructure. Their widespread deployment in data centers and the cloud raises the question of how to use them beyond ML use cases, with growing interest in employing them in a database context. In this paper, we explore and analyze the implementation of relational joins on GPUs from an end-to-end perspective, meaning that we take result materialization into account. We conduct a comprehensive performance study of state-of-the-art GPU-based join algorithms over diverse synthetic workloads and TPC-H/TPC-DS benchmarks. Without being restricted to the conventional setting where each input relation has only one key and one non-key with all attributes being 4 bytes long, we investigate the effect of various factors (e.g., input sizes, number of non-key columns, skewness, data types, match ratios, and number of joins) on the end-to-end throughput. Furthermore, we propose a technique called "Gather-from-Transformed-Relations" (GFTR) to reduce the long-ignored yet high materialization cost in GPU-based joins. The experimental evaluation shows significant performance improvements from GFTR, with throughput gains of up to 2.3 times over previous work. The insights gained from the performance study not only advance the understanding of GPU-based joins but also introduce a structured approach to selecting the most efficient GPU join algorithm based on the input relation characteristics.
Submitted 1 December, 2023;
originally announced December 2023.
-
Post-LS3 Experimental Options in ECN3
Authors:
C. Ahdida,
G. Arduini,
K. Balazs,
H. Bartosik,
J. Bernhard,
A. Boyarsky,
J. Brod,
M. Brugger,
M. Calviani,
A. Ceccucci,
A. Crivellin,
G. D'Ambrosio,
G. De Lellis,
B. Döbrich,
M. Fraser,
R. Franqueira Ximenes,
A. Golutvin,
M. Gonzalez Alonso,
E. Goudzovski,
J. -L. Grenard,
J. Heeck,
J. Jaeckel,
R. Jacobsson,
Y. Kadi,
F. Kahlhoefer
, et al. (25 additional authors not shown)
Abstract:
The Experimental Cavern North 3 (ECN3) is an underground experimental cavern on the CERN Prévessin site. ECN3 currently hosts the NA62 experiment, with a physics programme devoted to rare kaon decays and searches for hidden particles approved until Long Shutdown 3 (LS3). Several options are proposed for the longer term in order to make the best use of the worldwide unique potential of the high-intensity/high-energy proton beam extracted from the Super Proton Synchrotron (SPS) in ECN3. The current status of their study by the CERN Physics Beyond Colliders (PBC) Study Group is presented, including considerations on beam requirements and upgrades, detector R&D and construction, schedules and cost, as well as physics potential within the CERN and worldwide landscape.
Submitted 26 October, 2023;
originally announced October 2023.
-
Chameleon: a heterogeneous and disaggregated accelerator system for retrieval-augmented language models
Authors:
Wenqi Jiang,
Marco Zeller,
Roger Waleffe,
Torsten Hoefler,
Gustavo Alonso
Abstract:
A Retrieval-Augmented Language Model (RALM) augments a generative language model by retrieving context-specific knowledge from an external database. This strategy facilitates impressive text generation quality even with smaller models, thus reducing computational demands by orders of magnitude. However, RALMs introduce unique system design challenges due to (a) the diverse workload characteristics between LM inference and retrieval and (b) the various system requirements and bottlenecks for different RALM configurations such as model sizes, database sizes, and retrieval frequencies. We propose Chameleon, a heterogeneous accelerator system that integrates both LM and retrieval accelerators in a disaggregated architecture. The heterogeneity ensures efficient acceleration of both LM inference and retrieval, while the accelerator disaggregation enables the system to independently scale both types of accelerators to fulfill diverse RALM requirements. Our Chameleon prototype implements retrieval accelerators on FPGAs and assigns LM inference to GPUs, with a CPU server orchestrating these accelerators over the network. Compared to CPU-based and CPU-GPU vector search systems, Chameleon achieves up to 23.72x speedup and 26.2x energy efficiency. Evaluated on various RALMs, Chameleon exhibits up to 2.16x reduction in latency and 3.18x speedup in throughput compared to the hybrid CPU-GPU architecture. These promising results pave the way for bringing accelerator heterogeneity and disaggregation into future RALM systems.
Submitted 29 November, 2023; v1 submitted 15 October, 2023;
originally announced October 2023.
-
SwiftSpatial: Spatial Joins on Modern Hardware
Authors:
Wenqi Jiang,
Martin Parvanov,
Gustavo Alonso
Abstract:
Spatial joins are among the most time-consuming queries in spatial data management systems. In this paper, we propose SwiftSpatial, a specialized accelerator architecture tailored for spatial joins. SwiftSpatial contains multiple high-performance join units with innovative hybrid parallelism, several efficient memory management units, and an integrated on-chip join scheduler. We prototype SwiftSpatial on an FPGA and incorporate the R-tree synchronous traversal algorithm as the control flow. Benchmarked against various CPU and GPU-based spatial data processing systems, SwiftSpatial demonstrates a latency reduction of up to 5.36x relative to the best-performing baseline, while requiring 6.16x less power. The remarkable performance and energy efficiency of SwiftSpatial lay a solid foundation for its future integration into spatial data management systems, both in data centers and at the edge.
Submitted 28 September, 2023;
originally announced September 2023.
-
Co-design Hardware and Algorithm for Vector Search
Authors:
Wenqi Jiang,
Shigang Li,
Yu Zhu,
Johannes de Fine Licht,
Zhenhao He,
Runbin Shi,
Cedric Renggli,
Shuai Zhang,
Theodoros Rekatsinas,
Torsten Hoefler,
Gustavo Alonso
Abstract:
Vector search has emerged as the foundation for large-scale information retrieval and machine learning systems, with search engines like Google and Bing processing tens of thousands of queries per second on petabyte-scale document datasets by evaluating vector similarities between encoded query texts and web documents. As performance demands for vector search systems surge, accelerated hardware offers a promising solution in the post-Moore's Law era. We introduce FANNS, an end-to-end and scalable vector search framework on FPGAs. Given a user-provided recall requirement on a dataset and a hardware resource budget, FANNS automatically co-designs hardware and algorithm, subsequently generating the corresponding accelerator. The framework also supports scale-out by incorporating a hardware TCP/IP stack in the accelerator. FANNS attains up to 23.0$\times$ and 37.2$\times$ speedup compared to FPGA and CPU baselines, respectively, and demonstrates superior scalability to GPUs, achieving 5.5$\times$ and 7.6$\times$ speedup in median and 95th percentile (P95) latency within an eight-accelerator configuration. The remarkable performance of FANNS lays a robust groundwork for future FPGA integration in data centers and AI supercomputers.
Submitted 6 July, 2023; v1 submitted 19 June, 2023;
originally announced June 2023.
-
Data Processing with FPGAs on Modern Architectures
Authors:
Wenqi Jiang,
Dario Korolija,
Gustavo Alonso
Abstract:
Trends in hardware, the prevalence of the cloud, and the rise of highly demanding applications have ushered in an era of specialization that quickly changes how data is processed at scale. These changes are likely to continue and accelerate in the coming years as new technologies are adopted and deployed: smart NICs, smart storage, smart memory, disaggregated storage, disaggregated memory, specialized accelerators (GPUs, TPUs, FPGAs), and a wealth of ASICs specifically created to deal with computationally expensive tasks (e.g., cryptography or compression). In this tutorial, we focus on data processing on FPGAs, a technology that has received less attention than, e.g., TPUs or GPUs but that is nonetheless increasingly being deployed in the cloud for data processing tasks due to the architectural flexibility of FPGAs, along with their ability to process data at line rate, something not possible with other types of processors or accelerators.
In the tutorial, we will cover what FPGAs are, their characteristics, their advantages and disadvantages, as well as examples from deployments in the industry and how they are used in various data processing tasks. We will introduce FPGA programming with high-level languages and describe hardware and software resources available to researchers. The tutorial includes case studies borrowed from research done in collaboration with companies that illustrate the potential of FPGAs in data processing and how software and hardware are evolving to take advantage of the possibilities offered by FPGAs. The use cases include: (1) approximate nearest neighbor search, which is relevant to databases and machine learning, (2) remote disaggregated memory, showing how the cloud architecture is evolving and demonstrating the potential for operator offloading and line rate data processing, and (3) recommendation systems as an application with tight latency constraints.
Submitted 24 June, 2023; v1 submitted 6 April, 2023;
originally announced April 2023.
-
Resource Allocation in Serverless Query Processing
Authors:
Simon Kassing,
Ingo Müller,
Gustavo Alonso
Abstract:
Data lakes hold a growing amount of cold data that is infrequently accessed, yet requires interactive response times. Serverless functions are seen as a way to address this use case since they offer an appealing alternative to maintaining (and paying for) a fixed infrastructure. Recent research has analyzed the potential of serverless for data processing. In this paper, we expand on such work by looking into the question of serverless resource allocation to data processing tasks (number and size of the functions). We formulate a general model to roughly estimate completion time and financial cost, which we apply to augment an existing serverless data processing system with an advisory tool that automatically identifies configurations striking a good balance -- which we define as being close to the "knee" of their Pareto frontier. The model takes into account key aspects of serverless: start-up, computation, network transfers, and overhead as a function of the input sizes and intermediate result exchanges. Using (micro)benchmarks and parts of TPC-H, we show that this advisor is capable of pinpointing configurations desirable to the user. Moreover, we identify and discuss several aspects of data processing on serverless affecting efficiency. By using an automated tool to configure the resources, the barrier to using serverless for data processing is lowered and the narrow window where it is cost effective can be expanded by using a better allocation instead of having to over-provision the design.
Submitted 19 August, 2022;
originally announced August 2022.
-
ECI: a Customizable Cache Coherency Stack for Hybrid FPGA-CPU Architectures
Authors:
Abishek Ramdas,
Michael Giardino,
Runbin Shi,
Adam Turowski,
David Cock,
Gustavo Alonso,
Timothy Roscoe
Abstract:
Unlike other accelerators, FPGAs are capable of supporting cache coherency, thereby turning them into a more powerful architectural option than just a peripheral accelerator. However, most existing deployments of FPGAs are either non-cache coherent or support only an asymmetric design where cache coherency is controlled from the CPU. Taking advantage of a recently released two socket CPU-FPGA architecture, in this paper we describe ECI, a flexible implementation of cache coherency on the FPGA capable of supporting both symmetric and asymmetric protocols. ECI is open and customizable, given applications the opportunity to fully interact with the cache coherency protocol, thereby opening up many interesting system design and research opportunities not available in existing designs. Through extensive microbenchmarks we show that ECI exhibits highly competitive performance and discuss in detail one use-case illustrating the benefits of having an open cache coherency stack on the FPGA.
△ Less
Submitted 15 August, 2022;
originally announced August 2022.
-
Short-lived Datacenter
Authors:
Michael Wawrzoniak,
Ingo Müller,
Rodrigo Bruno,
Ana Klimovic,
Gustavo Alonso
Abstract:
Serverless platforms have attracted attention due to their promise of elasticity, low cost, and fast deployment. Instead of using a fixed virtual machine (VM) infrastructure, which can incur considerable costs to operate and run, serverless platforms support short computations, triggered on demand, with cost proportional to fine-grain function execution time. However, serverless platforms offer a restricted execution environment. For example, functions have limited execution times, limited resources, and no support for networking between functions. In this paper, we explore what it takes to treat serverless platforms as short-lived, general purpose data-centers which can execute unmodified existing applications. As a first step in this quest, we have developed Boxer, a system providing an execution environment on top of existing functions-as-a-service platforms that allows users to seamlessly migrate conventional VM-based cloud services to serverless platforms. Boxer allows generic applications to benefit from the fine-grain elasticity of serverless platforms without having to modify applications to adopt a restrictive event-triggered programming model or orchestrate auxiliary systems for data communication. We implement Boxer on top of AWS Lambda and extend it to transparently provide standard network interfaces. We describe its implementation and demonstrate how it can be used to run off-the-shelf cloud applications with a degree of fine-grained elasticity not available on traditional VM-based platforms.
Submitted 14 February, 2022;
originally announced February 2022.
-
JEM-EUSO Collaboration contributions to the 37th International Cosmic Ray Conference
Authors:
G. Abdellaoui,
S. Abe,
J. H. Adams Jr.,
D. Allard,
G. Alonso,
L. Anchordoqui,
A. Anzalone,
E. Arnone,
K. Asano,
R. Attallah,
H. Attoui,
M. Ave Pernas,
M. Bagheri,
J. Baláz,
M. Bakiri,
D. Barghini,
S. Bartocci,
M. Battisti,
J. Bayer,
B. Beldjilali,
T. Belenguer,
N. Belkhalfa,
R. Bellotti,
A. A. Belov,
K. Benmessai
, et al. (267 additional authors not shown)
Abstract:
Compilation of papers presented by the JEM-EUSO Collaboration at the 37th International Cosmic Ray Conference (ICRC), held on July 12-23, 2021 (online) in Berlin, Germany.
Submitted 28 January, 2022;
originally announced January 2022.
-
Modelling Active Non-Markovian Oscillations
Authors:
Gennaro Tucci,
Édgar Roldán,
Andrea Gambassi,
Roman Belousov,
Florian Berger,
Rodrigo Gogui Alonso,
A. James Hudspeth
Abstract:
Modelling noisy oscillations of active systems is one of the current challenges in physics and biology. Because the physical mechanisms of such processes are often difficult to identify, we propose a linear stochastic model driven by a non-Markovian bistable noise that is capable of generating self-sustained periodic oscillation. We derive analytical predictions for most relevant dynamical and thermodynamic properties of the model. This minimal model turns out to describe accurately bistable-like oscillatory motion of hair bundles in bullfrog sacculus, extracted from experimental data. Based on and in agreement with these data, we estimate the power required to sustain such active oscillations to be of the order of one hundred $k_B T$ per oscillation cycle.
Submitted 28 January, 2022;
originally announced January 2022.
-
Diagnosing quantum chaos with out-of-time-ordered-correlator quasiprobability in the kicked-top model
Authors:
José Raúl González Alonso,
Nathan Shammah,
Shahnawaz Ahmed,
Franco Nori,
Justin Dressel
Abstract:
While classical chaos has been successfully characterized with consistent theories and intuitive techniques, such as with the use of Lyapunov exponents, quantum chaos is still poorly understood, as well as its relation with multi-partite entanglement and information scrambling. We consider a benchmark system, the kicked top model, which displays chaotic behaviour in the classical version, and proceed to characterize the quantum case with a thorough diagnosis of the growth of chaos and entanglement in time. As a novel tool for the characterization of quantum chaos, we introduce for this scope the quasi-probability distribution behind the out-of-time-ordered correlator (OTOC). We calculate the cumulative nonclassicality of this distribution, which has already been shown to outperform the simple use of OTOC as a probe to distinguish between integrable and nonintegrable Hamiltonians. To provide a thorough comparative analysis, we contrast the behavior of the nonclassicality with entanglement measures, such as the tripartite mutual information of the Hamiltonian as well as the entanglement entropy. We find that systems whose initial states would lie in the "sea of chaos" in the classical kicked-top model, exhibit, as they evolve in time, characteristics associated with chaotic behavior and entanglement production in closed quantum systems. We corroborate this indication by capturing it with this novel OTOC-based measure.
Submitted 20 January, 2022;
originally announced January 2022.
-
RumbleML: program the lakehouse with JSONiq
Authors:
Ghislain Fourny,
David Dao,
Can Berker Cikis,
Ce Zhang,
Gustavo Alonso
Abstract:
Lakehouse systems have in the past few years reached unprecedented size and heterogeneity and have been embraced by many industry players. However, they are often difficult to use as they lack the declarative language and optimization possibilities of relational engines. This paper introduces RumbleML, a high-level, declarative library integrated into the RumbleDB engine and with the JSONiq language. RumbleML allows using a single platform for data cleaning, data preparation, training, and inference, as well as management of models and results. It does so using a purely declarative language (JSONiq) for all these tasks and without any performance loss over existing platforms (e.g. Spark). The key insights of the design of RumbleML are that training sets, evaluation sets, and test sets can be represented as homogeneous sequences of flat objects; that models can be seamlessly embodied in function items mapping input test sets into prediction-augmented result sets; and that estimators can be seamlessly embodied in function items mapping input training sets to models. We argue that this makes JSONiq a viable and seamless programming language for data lakehouses across all their features, whether database-related or machine-learning-related. While lakehouses bring Machine Learning and Data Wrangling onto the same platform, RumbleML also brings them to the same language, JSONiq. In the paper, we present the first prototype and compare its performance to Spark, showing the benefit of a huge functionality and productivity gain for cleaning up, normalizing, validating data, feeding it into Machine Learning pipelines, and analyzing the output, all within the same system and language and at scale.
Submitted 23 December, 2021;
originally announced December 2021.
-
How to use Persistent Memory in your Database
Authors:
Dimitrios Koutsoukos,
Raghav Bhartia,
Ana Klimovic,
Gustavo Alonso
Abstract:
Persistent or Non Volatile Memory (PMEM or NVM) has recently become commercially available under several configurations with different purposes and goals. Despite the attention to the topic, we are not aware of a comprehensive empirical analysis of existing relational database engines under different PMEM configurations. Such a study is important to understand the performance implications of the various hardware configurations and how different DB engines can benefit from them. To this end, we analyze three different engines (PostgreSQL, MySQL, and SQLServer) under common workloads (TPC-C and TPC-H) with all possible PMEM configurations supported by Intel's Optane NVM devices (PMEM as persistent memory in AppDirect mode and PMEM as volatile memory in Memory mode). Our results paint a complex picture and are not always intuitive due to the many factors involved. Based on our findings, we provide insights on how the different engines behave with PMEM and which configurations and queries perform best. Our results show that using PMEM as persistent storage usually speeds up query execution, but with some caveats as the I/O path is not fully optimized. Additionally, using PMEM in Memory mode does not offer any performance advantage despite the larger volatile memory capacity. Through the extensive coverage of engines and parameters, we provide an important starting point for exploiting PMEM in databases and tuning relational engines to take advantage of this new technology.
Submitted 1 December, 2021;
originally announced December 2021.
-
From Research to Proof-of-Concept: Analysis of a Deployment of FPGAs on a Commercial Search Engine
Authors:
Fabio Maschi,
Gustavo Alonso,
Anthony Hock-Koon,
Nicolas Bondoux,
Teddy Roy,
Mourad Boudia,
Matteo Casalino
Abstract:
FPGAs are quickly becoming available in the cloud as one more heterogeneous processing element complementing CPUs and GPUs. There are many reports in the literature showing the potential for FPGAs to accelerate a wide variety of algorithms, which, combined with their growing availability, would seem to also indicate a widespread use in many applications. Unfortunately, there is not much published research exploring what it takes to integrate an FPGA into an existing application in a cost-effective way while keeping the algorithmic performance advantages. Building on recent results exploring how to employ FPGAs to improve the search engines used in the travel industry, this paper analyses the end-to-end performance of the search engine when using FPGAs, as well as the necessary changes to the software and the cost of such deployments. The results provide important insights on current FPGA deployments and what needs to be done to make FPGAs more widely used. For instance, the large potential performance gains provided by an FPGA are greatly diminished in practice if the application cannot submit requests in an optimal way, something that is not always possible and might require significant changes to the application. Similarly, some existing cloud deployments turn out to use a very imbalanced architecture: a powerful FPGA connected to a not so powerful CPU. The result is that the CPU cannot generate enough load for the FPGA, which potentially eliminates all performance gains and might even result in a more expensive system. In this paper, we report on an extensive study and development effort to incorporate FPGAs into a search engine and analyse the issues encountered and their practical impact. We expect that these results will inform the development and deployment of FPGAs in the future by providing important insights on the end-to-end integration of FPGAs within existing systems.
Submitted 20 August, 2021;
originally announced August 2021.
-
Farview: Disaggregated Memory with Operator Off-loading for Database Engines
Authors:
Dario Korolija,
Dimitrios Koutsoukos,
Kimberly Keeton,
Konstantin Taranov,
Dejan Milojičić,
Gustavo Alonso
Abstract:
Cloud deployments disaggregate storage from compute, providing more flexibility to both the storage and compute layers. In this paper, we explore disaggregation by taking it one step further and applying it to memory (DRAM). Disaggregated memory uses network attached DRAM as a way to decouple memory from CPU. In the context of databases, such a design offers significant advantages in terms of making a larger memory capacity available as a central pool to a collection of smaller processing nodes. To explore these possibilities, we have implemented Farview, a disaggregated memory solution for databases, operating as a remote buffer cache with operator offloading capabilities. Farview is implemented as an FPGA-based smart NIC making DRAM available as a disaggregated, network attached memory module capable of performing data processing at line rate over data streams to/from disaggregated memory. Farview supports query offloading using operators such as selection, projection, aggregation, regular expression matching and encryption. In this paper we focus on analytical queries and demonstrate the viability of the idea through an extensive experimental evaluation of Farview under different workloads. Farview is competitive with a local buffer cache solution for all the workloads and outperforms it in a number of cases, proving that a smart disaggregated memory can be a viable alternative for databases deployed in cloud environments.
Submitted 13 June, 2021;
originally announced June 2021.
-
Towards Demystifying Serverless Machine Learning Training
Authors:
Jiawei Jiang,
Shaoduo Gan,
Yue Liu,
Fanlin Wang,
Gustavo Alonso,
Ana Klimovic,
Ankit Singla,
Wentao Wu,
Ce Zhang
Abstract:
The appeal of serverless (FaaS) has triggered a growing interest in how to use it in data-intensive applications such as ETL, query processing, or machine learning (ML). Several systems exist for training large-scale ML models on top of serverless infrastructures (e.g., AWS Lambda) but with inconclusive results in terms of their performance and relative advantage over "serverful" infrastructures (IaaS). In this paper we present a systematic, comparative study of distributed ML training over FaaS and IaaS. We present a design space covering design choices such as optimization algorithms and synchronization protocols, and implement a platform, LambdaML, that enables a fair comparison between FaaS and IaaS. We present experimental results using LambdaML, and further develop an analytic model to capture cost/performance tradeoffs that must be considered when opting for a serverless infrastructure. Our results indicate that ML training pays off in serverless only for models with efficient (i.e., reduced) communication and that quickly converge. In general, FaaS can be much faster but it is never significantly cheaper than IaaS.
Submitted 17 May, 2021;
originally announced May 2021.
-
Evaluating Query Languages and Systems for High-Energy Physics Data [Extended Version]
Authors:
Dan Graur,
Ingo Müller,
Mason Proffitt,
Ghislain Fourny,
Gordon T. Watts,
Gustavo Alonso
Abstract:
In the domain of high-energy physics (HEP), query languages in general and SQL in particular have found limited acceptance. This is surprising since HEP data analysis matches the SQL model well: the data is fully structured and queried using mostly standard operators. To gain insights on why this is the case, we perform a comprehensive analysis of six diverse, general-purpose data processing platforms using an HEP benchmark. The result of the evaluation is an interesting and rather complex picture of existing solutions: their query languages vary greatly in how naturally and concisely HEP query patterns can be expressed. Furthermore, most of them are also between one and two orders of magnitude slower than the domain-specific system used by particle physicists today. These observations suggest that, while database systems and their query languages are in principle viable tools for HEP, significant work remains to make them relevant to HEP researchers.
Submitted 30 October, 2021; v1 submitted 26 April, 2021;
originally announced April 2021.
-
Ariel: Enabling planetary science across light-years
Authors:
Giovanna Tinetti,
Paul Eccleston,
Carole Haswell,
Pierre-Olivier Lagage,
Jérémy Leconte,
Theresa Lüftinger,
Giusi Micela,
Michel Min,
Göran Pilbratt,
Ludovic Puig,
Mark Swain,
Leonardo Testi,
Diego Turrini,
Bart Vandenbussche,
Maria Rosa Zapatero Osorio,
Anna Aret,
Jean-Philippe Beaulieu,
Lars Buchhave,
Martin Ferus,
Matt Griffin,
Manuel Guedel,
Paul Hartogh,
Pedro Machado,
Giuseppe Malaguti,
Enric Pallé
, et al. (293 additional authors not shown)
Abstract:
Ariel, the Atmospheric Remote-sensing Infrared Exoplanet Large-survey, was adopted as the fourth medium-class mission in ESA's Cosmic Vision programme to be launched in 2029. During its 4-year mission, Ariel will study what exoplanets are made of, how they formed and how they evolve, by surveying a diverse sample of about 1000 extrasolar planets, simultaneously in visible and infrared wavelengths. It is the first mission dedicated to measuring the chemical composition and thermal structures of hundreds of transiting exoplanets, enabling planetary science far beyond the boundaries of the Solar System. The payload consists of an off-axis Cassegrain telescope (primary mirror 1100 mm x 730 mm ellipse) and two separate instruments (FGS and AIRS) covering simultaneously 0.5-7.8 micron spectral range. The satellite is best placed into an L2 orbit to maximise the thermal stability and the field of regard. The payload module is passively cooled via a series of V-Groove radiators; the detectors for the AIRS are the only items that require active cooling via an active Ne JT cooler. The Ariel payload is developed by a consortium of more than 50 institutes from 16 ESA countries, which include the UK, France, Italy, Belgium, Poland, Spain, Austria, Denmark, Ireland, Portugal, Czech Republic, Hungary, the Netherlands, Sweden, Norway, Estonia, and a NASA contribution.
Submitted 10 April, 2021;
originally announced April 2021.
-
MicroRec: Efficient Recommendation Inference by Hardware and Data Structure Solutions
Authors:
Wenqi Jiang,
Zhenhao He,
Shuai Zhang,
Thomas B. Preußer,
Kai Zeng,
Liang Feng,
Jiansong Zhang,
Tongxuan Liu,
Yong Li,
Jingren Zhou,
Ce Zhang,
Gustavo Alonso
Abstract:
Deep neural networks are widely used in personalized recommendation systems. Unlike regular DNN inference workloads, recommendation inference is memory-bound due to the many random memory accesses needed to look up the embedding tables. The inference is also heavily constrained in terms of latency because producing a recommendation for a user must be done in about tens of milliseconds. In this paper, we propose MicroRec, a high-performance inference engine for recommendation systems. MicroRec accelerates recommendation inference by (1) redesigning the data structures involved in the embeddings to reduce the number of lookups needed and (2) taking advantage of the availability of High-Bandwidth Memory (HBM) in FPGA accelerators to tackle the latency by enabling parallel lookups. We have implemented the resulting design on an FPGA board including the embedding lookup step as well as the complete inference process. Compared to the optimized CPU baseline (16 vCPU, AVX2-enabled), MicroRec achieves a 13.8-14.7x speedup on embedding lookup alone and a 2.5-5.4x speedup for the entire recommendation inference in terms of throughput. As for latency, CPU-based engines need milliseconds to infer a recommendation while MicroRec only takes microseconds, a significant advantage in real-time recommendation systems.
Submitted 19 February, 2021; v1 submitted 12 October, 2020;
originally announced October 2020.
-
Signed Distance Fields Dynamic Diffuse Global Illumination
Authors:
Jinkai Hu,
Milo Yip,
G. Elias Alonso,
Shihao Gu,
Xiangjun Tang,
Xiaogang Jin
Abstract:
Global Illumination (GI) is of utmost importance in the field of photo-realistic rendering. However, its computation has always been very complex, especially diffuse GI. State of the art real-time GI methods have limitations of different nature, such as light leaking, performance issues, special hardware requirements, noise corruption, bounce number limitations, among others. To overcome these limitations, we propose a novel approach of computing dynamic diffuse GI with a signed distance fields approximation of the scene and discretizing the space domain of the irradiance function. With this approach, we are able to estimate real-time diffuse GI for dynamic lighting and geometry, without any precomputations and supporting multi-bounce GI, providing good quality lighting and high performance at the same time. Our algorithm is also able to achieve better scalability, and manage both large open scenes and indoor high-detailed scenes without being corrupted by noise.
Submitted 28 July, 2020;
originally announced July 2020.
-
HyperLogLog Sketch Acceleration on FPGA
Authors:
Amit Kulkarni,
Monica Chiosa,
Thomas B. Preußer,
Kaan Kara,
David Sidler,
Gustavo Alonso
Abstract:
Data sketches are a set of widely used approximated data summarizing techniques. Their fundamental property is sub-linear memory complexity on the input cardinality, an important aspect when processing streams or data sets with a vast base domain (URLs, IP addresses, user IDs, etc.). Among the many data sketches available, HyperLogLog has become the reference for cardinality counting (how many distinct data items there are in a data set). Although it does not count every data item (to reduce memory consumption), it provides probabilistic guarantees on the result, and it is, thus, often used to analyze data streams. In this paper, we explore how to implement HyperLogLog on an FPGA to benefit from the parallelism available and the ability to process data streams coming from high-speed networks. Our multi-pipelined high-cardinality HyperLogLog implementation delivers 1.8x higher throughput than an optimized HyperLogLog running on a dual-socket Intel Xeon E5-2630 v3 system with a total of 16 cores and 32 hyper-threads.
Submitted 20 October, 2020; v1 submitted 24 May, 2020;
originally announced May 2020.
-
Benchmarking High Bandwidth Memory on FPGAs
Authors:
Zeke Wang,
Hongjing Huang,
Jie Zhang,
Gustavo Alonso
Abstract:
FPGAs are starting to be enhanced with High Bandwidth Memory (HBM) as a way to reduce the memory bandwidth bottleneck encountered in some applications and to give the FPGA more capacity to deal with application state. However, the performance characteristics of HBM are still not well specified, especially in the context of FPGAs. In this paper, we bridge the gap between nominal specifications and actual performance by benchmarking HBM on a state-of-the-art FPGA, i.e., a Xilinx Alveo U280 featuring a two-stack HBM subsystem. To this end, we propose Shuhai, a benchmarking tool that allows us to demystify all the underlying details of HBM on an FPGA. FPGA-based benchmarking should also provide a more accurate picture of HBM than doing so on CPUs/GPUs, since CPUs/GPUs are noisier systems due to their complex control logic and cache hierarchy. Since the memory itself is complex, leveraging custom hardware logic to benchmark inside an FPGA provides more details as well as accurate and deterministic measurements. We observe that 1) HBM is able to provide up to 425GB/s memory bandwidth, and 2) how HBM is used has a significant impact on performance, which in turn demonstrates the importance of unveiling the performance characteristics of HBM so as to select the best approach. As a yardstick, we also apply Shuhai to DDR4 to show the differences between HBM and DDR4. Shuhai can be easily generalized to other FPGA boards or other generations of memory, e.g., HBM3 and DDR3. We will make Shuhai open-source, benefiting the community.
Submitted 8 May, 2020;
originally announced May 2020.
-
Using DSP Slices as Content-Addressable Update Queues
Authors:
Thomas B. Preußer,
Monica Chiosa,
Alexander Weiss,
Gustavo Alonso
Abstract:
Content-Addressable Memory (CAM) is a powerful abstraction for building memory caches, routing tables and hazard detection logic. Without a native CAM structure available on FPGA devices, their functionality must be emulated using the structural primitives at hand. Such an emulation causes significant overhead in the consumption of the underlying resources, typically general-purpose fabric and on-chip block RAM (BRAM). This often motivates mitigating trade-offs, such as the reduction of the associativity of memory caches. This paper describes a technique to implement the hazard resolution in a memory update queue that hides the off-chip memory readout latency of read-modify-write cycles while guaranteeing the delivery of the full memory bandwidth. The innovative use of DSP slices allows them to assume and combine the functions of (a) the tag and data storage, (b) the tag matching, and (c) the data update in this key-value storage scenario. The proposed approach provides designers with extra flexibility by adding this resource type as another option to implement CAM.
Submitted 23 April, 2020;
originally announced April 2020.
-
Modularis: Modular Relational Analytics over Heterogeneous Distributed Platforms
Authors:
Dimitrios Koutsoukos,
Ingo Müller,
Renato Marroquín,
Ana Klimovic,
Gustavo Alonso
Abstract:
The enormous quantity of data produced every day together with advances in data analytics has led to a proliferation of data management and analysis systems. Typically, these systems are built around highly specialized monolithic operators optimized for the underlying hardware. While effective in the short term, such an approach makes the operators cumbersome to port and adapt, which is increasingly required due to the speed at which algorithms and hardware evolve. To address this limitation, we present Modularis, an execution layer for data analytics based on sub-operators, i.e., composable building blocks resembling traditional database operators but at a finer granularity. To demonstrate the advantages of our approach, we use Modularis to build a distributed query processing system supporting relational queries running on an RDMA cluster, a serverless cloud platform, and a smart storage engine. Modularis requires minimal code changes to execute queries across these three diverse hardware platforms, showing that the sub-operator approach reduces the amount and complexity of the code. In fact, changes in the platform affect only sub-operators that depend on the underlying hardware. We show the end-to-end performance of Modularis by comparing it with a framework for SQL processing (Presto), a commercial cluster database (SingleStore), as well as Query-as-a-Service systems (Athena, BigQuery). Modularis outperforms all these systems, proving that the design and architectural advantages of a modular design can be achieved without degrading performance. We also compare Modularis with a hand-optimized implementation of a join for RDMA clusters. We show that Modularis has the advantage of being easily extensible to a wider range of join variants and group by queries, all of which are not supported in the hand-tuned join.
Submitted 29 September, 2021; v1 submitted 7 April, 2020;
originally announced April 2020.
-
The Collection Virtual Machine: An Abstraction for Multi-Frontend Multi-Backend Data Analysis
Authors:
Ingo Müller,
Renato Marroquín,
Dimitrios Koutsoukos,
Mike Wawrzoniak,
Sabir Akhadov,
Gustavo Alonso
Abstract:
Getting the best performance from the ever-increasing number of hardware platforms has been a recurring challenge for data processing systems. In recent years, the advent of data science with its increasingly numerous and complex types of analytics has made this challenge even more difficult. In practice, system designers are overwhelmed by the number of combinations and typically implement only one analysis/platform combination, leading to repeated implementation effort -- and a plethora of semi-compatible tools for data scientists.
In this paper, we propose the "Collection Virtual Machine" (or CVM) -- an extensible compiler framework designed to keep the specialization process of data analytics systems tractable. It can capture at the same time the essence of a large span of low-level, hardware-specific implementation techniques as well as high-level operations of different types of analyses. At its core lies a language for defining nested, collection-oriented intermediate representations (IRs). Frontends produce programs in their IR flavors defined in that language, which get optimized through a series of rewritings (possibly changing the IR flavor multiple times) until the program is finally expressed in an IR of platform-specific operators. While reducing the overall implementation effort, this also improves the interoperability of both analyses and hardware platforms. We have used CVM successfully to build specialized backends for platforms as diverse as multi-core CPUs, RDMA clusters, and serverless computing infrastructure in the cloud and expect similar results for many more frontends and hardware platforms in the near future.
Submitted 8 April, 2020; v1 submitted 4 April, 2020;
originally announced April 2020.
-
High Bandwidth Memory on FPGAs: A Data Analytics Perspective
Authors:
Kaan Kara,
Christoph Hagleitner,
Dionysios Diamantopoulos,
Dimitris Syrivelis,
Gustavo Alonso
Abstract:
FPGA-based data processing in datacenters is increasing in popularity due to the demands of modern workloads and the ensuing necessity for specialization in hardware. Driven by this trend, vendors are rapidly adapting reconfigurable devices to suit data- and compute-intensive workloads. Inclusion of High Bandwidth Memory (HBM) in FPGA devices is a recent example. HBM promises to overcome the bandwidth bottleneck often faced by FPGA-based accelerators due to their throughput-oriented design. In this paper, we study the usage and benefits of HBM on FPGAs from a data analytics perspective. We consider three workloads that are often performed in analytics-oriented databases and implement them on FPGA, showing in which cases they benefit from HBM: range selection, hash join, and stochastic gradient descent for linear model training. We integrate our designs into a columnar database (MonetDB) and show the trade-offs arising from the integration related to data movement and partitioning. In certain cases, FPGA+HBM based solutions are able to surpass the highest performance provided by either a 2-socket POWER9 system or a 14-core Xeon E5 by up to 1.8x (selection), 12.9x (join), and 3.2x (SGD).
Submitted 2 April, 2020;
originally announced April 2020.
-
Report on the ECFA Early-Career Researchers Debate on the 2020 European Strategy Update for Particle Physics
Authors:
N. Andari,
L. Apolinário,
K. Augsten,
E. Bakos,
I. Bellafont,
L. Beresford,
A. Bethani,
J. Beyer,
L. Bianchini,
C. Bierlich,
B. Bilin,
K. L. Bjørke,
E. Bols,
P. A. Brás,
L. Brenner,
E. Brondolin,
P. Calvo,
B. Capdevila,
I. Cioara,
L. N. Cojocariu,
F. Collamati,
A. de Wit,
F. Dordei,
M. Dordevic,
T. A. du Pree
, et al. (96 additional authors not shown)
Abstract:
A group of Early-Career Researchers (ECRs) has been given a mandate from the European Committee for Future Accelerators (ECFA) to debate the topics of the current European Strategy Update (ESU) for Particle Physics and to summarise the outcome in a brief document [1]. A full-day debate with 180 delegates was held at CERN, followed by a survey collecting quantitative input. During the debate, the ECRs discussed future colliders in terms of the physics prospects, their implications for accelerator and detector technology as well as computing and software. The discussion was organised into several topic areas. From these areas two common themes were particularly highlighted by the ECRs: sociological and human aspects; and issues of the environmental impact and sustainability of our research.
Submitted 7 February, 2020;
originally announced February 2020.
-
Diseño de un controlador de ángulo en un balancín
Authors:
Alvarado Moreno,
Jose David,
Delgadillo Romero,
Kevin Andrey,
Galvis Reyna,
David Enrique,
Poblador Parra,
Gustavo Alonso,
Rodríguez Cortés,
César Alejandro
Abstract:
This document describes the design of a PID controller for a pitch-rotation plant with one degree of freedom. In the controller design, the tuning methods of Åström-Hägglund (AH), Kaiser Chaira (KC) and Kaiser Rajka (KR) will be used, verifying the performance in simulations and in the plant. Finally, the development for the implementation of an analog PID controller through circuits with operational amplifiers is described.
Submitted 27 January, 2020;
originally announced January 2020.
-
Contributions to the 36th International Cosmic Ray Conference (ICRC 2019) of the JEM-EUSO Collaboration
Authors:
G. Abdellaoui,
S. Abe,
J. H. Adams Jr.,
A. Ahriche,
D. Allard,
L. Allen,
G. Alonso,
L. Anchordoqui,
A. Anzalone,
Y. Arai,
K. Asano,
R. Attallah,
H. Attoui,
M. Ave Pernas,
S. Bacholle,
M. Bakiri,
P. Baragatti,
P. Barrillon,
S. Bartocci,
J. Bayer,
B. Beldjilali,
T. Belenguer,
N. Belkhalfa,
R. Bellotti,
A. Belov
, et al. (287 additional authors not shown)
Abstract:
Compilation of papers presented by the JEM-EUSO Collaboration at the 36th International Cosmic Ray Conference (ICRC), held July 24 through August 1, 2019 in Madison, Wisconsin.
Submitted 18 December, 2019;
originally announced December 2019.
-
Lambada: Interactive Data Analytics on Cold Data using Serverless Cloud Infrastructure
Authors:
Ingo Müller,
Renato Marroquín,
Gustavo Alonso
Abstract:
The promise of ultimate elasticity and operational simplicity of serverless computing has recently led to an explosion of research in this area. In the context of data analytics, the concept sounds appealing, but due to the limitations of current offerings, there is no consensus yet on whether or not this approach is technically and economically viable. In this paper, we identify interactive data analytics on cold data as a use case where serverless computing excels. We design and implement Lambada, a system following a purely serverless architecture, in order to illustrate when and how serverless computing should be employed for data analytics. We propose several system components that overcome the previously known limitations inherent in the serverless paradigm as well as additional ones we identify in this work. We show that, thanks to careful design, a serverless query processing system can be at the same time one order of magnitude faster and two orders of magnitude cheaper compared to commercial Query-as-a-Service systems, the only alternative with similar operational simplicity.
Submitted 2 December, 2019;
originally announced December 2019.
-
Rumble: Data Independence for Large Messy Data Sets
Authors:
Ingo Müller,
Ghislain Fourny,
Stefan Irimescu,
Can Berker Cikis,
Gustavo Alonso
Abstract:
This paper introduces Rumble, a query execution engine for large, heterogeneous, and nested collections of JSON objects built on top of Apache Spark. While data sets of this type are increasingly widespread, most existing tools are built around a tabular data model, creating an impedance mismatch for both the engine and the query interface. In contrast, Rumble uses JSONiq, a standardized language specifically designed for querying JSON documents. The key challenge in the design and implementation of Rumble is mapping the recursive structure of JSON documents and JSONiq queries onto Spark's execution primitives based on tabular data frames. Our solution is to translate a JSONiq expression into a tree of iterators that dynamically switch between local and distributed execution modes depending on the nesting level. By overcoming the impedance mismatch in the engine, Rumble frees the user from solving the same problem for every single query, thus increasing their productivity considerably. As we show in extensive experiments, Rumble is able to scale to large and complex data sets in the terabyte range with similar or better performance than other engines. The results also illustrate that Codd's concept of data independence makes as much sense for heterogeneous, nested data sets as it does on highly structured tables.
Submitted 19 October, 2020; v1 submitted 25 October, 2019;
originally announced October 2019.
-
Demystifying Graph Databases: Analysis and Taxonomy of Data Organization, System Designs, and Graph Queries
Authors:
Maciej Besta,
Robert Gerstenberger,
Emanuel Peter,
Marc Fischer,
Michał Podstawski,
Claude Barthels,
Gustavo Alonso,
Torsten Hoefler
Abstract:
Graph processing has become an important part of multiple areas of computer science, such as machine learning, computational sciences, medical applications, social network analysis, and many others. Numerous graphs such as web or social networks may contain up to trillions of edges. Often, these graphs are also dynamic (their structure changes over time) and have domain-specific rich data associated with vertices and edges. Graph database systems such as Neo4j enable storing, processing, and analyzing such large, evolving, and rich datasets. Due to the sheer size of such datasets, combined with the irregular nature of graph processing, these systems face unique design challenges. To facilitate the understanding of this emerging domain, we present the first survey and taxonomy of graph database systems. We focus on identifying and analyzing fundamental categories of these systems (e.g., triple stores, tuple stores, native graph database systems, or object-oriented systems), the associated graph models (e.g., RDF or Labeled Property Graph), data organization techniques (e.g., storing graph data in indexing structures or dividing data into records), and different aspects of data distribution and query execution (e.g., support for sharding and ACID). 51 graph database systems are presented and compared, including Neo4j, OrientDB, or Virtuoso. We outline graph database queries and relationships with associated domains (NoSQL stores, graph streaming, and dynamic graph algorithms). Finally, we describe research and engineering challenges to outline the future of graph databases.
Submitted 30 August, 2023; v1 submitted 20 October, 2019;
originally announced October 2019.
-
Strongly measuring qubit quasiprobabilities behind out-of-time-ordered correlators
Authors:
Razieh Mohseninia,
José Raúl González Alonso,
Justin Dressel
Abstract:
Out-of-time-ordered correlators (OTOCs) have been proposed as a tool to witness quantum information scrambling in many-body system dynamics. These correlators can be understood as averages over nonclassical multi-time quasi-probability distributions (QPDs). These QPDs have more information, and their nonclassical features witness quantum information scrambling in a more nuanced way. However, their high dimensionality and nonclassicality make QPDs challenging to measure experimentally. We focus on the topical case of a many-qubit system and show how to obtain such a QPD in the laboratory using circuits with three and four sequential measurements. Averaging distinct values over the same measured distribution reveals either the OTOC or parameters of its QPD. Stronger measurements minimize experimental resources despite increased dynamical disturbance.
Submitted 24 July, 2019;
originally announced July 2019.
-
MLSys: The New Frontier of Machine Learning Systems
Authors:
Alexander Ratner,
Dan Alistarh,
Gustavo Alonso,
David G. Andersen,
Peter Bailis,
Sarah Bird,
Nicholas Carlini,
Bryan Catanzaro,
Jennifer Chayes,
Eric Chung,
Bill Dally,
Jeff Dean,
Inderjit S. Dhillon,
Alexandros Dimakis,
Pradeep Dubey,
Charles Elkan,
Grigori Fursin,
Gregory R. Ganger,
Lise Getoor,
Phillip B. Gibbons,
Garth A. Gibson,
Joseph E. Gonzalez,
Justin Gottschlich,
Song Han,
Kim Hazelwood
, et al. (44 additional authors not shown)
Abstract:
Machine learning (ML) techniques are enjoying rapidly increasing adoption. However, designing and implementing the systems that support ML models in real-world deployments remains a significant obstacle, in large part due to the radically different development and deployment profile of modern ML methods, and the range of practical concerns that come with broader adoption. We propose to foster a new systems machine learning research community at the intersection of the traditional systems and ML communities, focused on topics such as hardware systems for ML, software systems for ML, and ML optimized for metrics beyond predictive accuracy. To do this, we describe a new conference, MLSys, that explicitly targets research at the intersection of systems and machine learning with a program committee split evenly between experts in systems and ML, and an explicit focus on topics at the intersection of the two.
Submitted 1 December, 2019; v1 submitted 29 March, 2019;
originally announced April 2019.
-
The Polarimetric and Helioseismic Imager on Solar Orbiter
Authors:
S. K. Solanki,
J. C. del Toro Iniesta,
J. Woch,
A. Gandorfer,
J. Hirzberger,
A. Alvarez-Herrero,
T. Appourchaux,
V. Martínez Pillet,
I. Pérez-Grande,
E. Sanchis Kilders,
W. Schmidt,
J. M. Gómez Cama,
H. Michalik,
W. Deutsch,
G. Fernandez-Rico,
B. Grauf,
L. Gizon,
K. Heerlein,
M. Kolleck,
A. Lagg,
R. Meller,
R. Müller,
U. Schühle,
J. Staub,
K. Albert
, et al. (99 additional authors not shown)
Abstract:
This paper describes the Polarimetric and Helioseismic Imager on the Solar Orbiter mission (SO/PHI), the first magnetograph and helioseismology instrument to observe the Sun from outside the Sun-Earth line. It is the key instrument meant to address the top-level science question: How does the solar dynamo work and drive connections between the Sun and the heliosphere? SO/PHI will also play an important role in answering the other top-level science questions of Solar Orbiter, as well as hosting the potential of a rich return in further science.
SO/PHI measures the Zeeman effect and the Doppler shift in the Fe I 617.3 nm spectral line. To this end, the instrument carries out narrow-band imaging spectro-polarimetry using a tunable LiNbO$_3$ Fabry-Perot etalon, while the polarisation modulation is done with liquid crystal variable retarders (LCVRs). The line and the nearby continuum are sampled at six wavelength points and the data are recorded by a 2k x 2k CMOS detector. To save valuable telemetry, the raw data are reduced on board, including being inverted under the assumption of a Milne-Eddington atmosphere, although simpler reduction methods are also available on board. SO/PHI is composed of two telescopes; one, the Full Disc Telescope (FDT), covers the full solar disc at all phases of the orbit, while the other, the High Resolution Telescope (HRT), can resolve structures as small as 200 km on the Sun at closest perihelion. The high heat load generated through proximity to the Sun is greatly reduced by the multilayer-coated entrance windows to the two telescopes that allow less than 4% of the total sunlight to enter the instrument, most of it in a narrow wavelength band around the chosen spectral line.
Submitted 26 March, 2019;
originally announced March 2019.
-
Accelerating Generalized Linear Models with MLWeaving: A One-Size-Fits-All System for Any-precision Learning (Technical Report)
Authors:
Zeke Wang,
Kaan Kara,
Hantian Zhang,
Gustavo Alonso,
Onur Mutlu,
Ce Zhang
Abstract:
Learning from the data stored in a database is an important function increasingly available in relational engines. Methods using lower precision input data are of special interest given their overall higher efficiency but, in databases, these methods have a hidden cost: the quantization of the real value into a smaller number is an expensive step. To address the issue, in this paper we present MLWeaving, a data structure and hardware acceleration technique intended to speed up learning of generalized linear models in databases. MLWeaving provides a compact, in-memory representation enabling the retrieval of data at any level of precision. MLWeaving also takes advantage of the increasing availability of FPGA-based accelerators to provide a highly efficient implementation of stochastic gradient descent. The solution adopted in MLWeaving is more efficient than existing designs in terms of space (since it can process any resolution on the same design) and resources (via the use of bit-serial multipliers). MLWeaving also enables the runtime tuning of precision, instead of a fixed precision level during the training. We illustrate this using a simple, dynamic precision schedule. Experimental results show MLWeaving achieves up to 16x performance improvement over low-precision CPU implementations of first-order methods.
Submitted 28 March, 2019; v1 submitted 8 March, 2019;
originally announced March 2019.
-
Pay One, Get Hundreds for Free: Reducing Cloud Costs through Shared Query Execution
Authors:
Renato Marroquín,
Ingo Müller,
Darko Makreshanski,
Gustavo Alonso
Abstract:
Cloud-based data analysis is nowadays common practice because of the lower system management overhead as well as the pay-as-you-go pricing model. The pricing model, however, is not always suitable for query processing as heavy use results in high costs. For example, in query-as-a-service systems, where users are charged per processed byte, collections of queries accessing the same data frequently can become expensive. The problem is compounded by the limited options for the user to optimize query execution when using declarative interfaces such as SQL. In this paper, we show how, without modifying existing systems and without the involvement of the cloud provider, it is possible to significantly reduce the overhead, and hence the cost, of query-as-a-service systems. Our approach is based on query rewriting so that multiple concurrent queries are combined into a single query. Our experiments show the aggregated amount of work done by the shared execution is smaller than in a query-at-a-time approach. Since queries are charged per byte processed, the cost of executing a group of queries is often the same as executing a single one of them. As an example, we demonstrate how the shared execution of the TPC-H benchmark is up to 100x and 16x cheaper in Amazon Athena and Google BigQuery than using a query-at-a-time approach while achieving a higher throughput.
Submitted 1 September, 2018;
originally announced September 2018.
-
First observations of speed of light tracks by a fluorescence detector looking down on the atmosphere
Authors:
G. Abdellaoui,
S. Abe,
J. H. Adams Jr.,
A. Ahriche,
D. Allard,
L. Allen,
G. Alonso,
L. Anchordoqui,
A. Anzalone,
Y. Arai,
K. Asano,
R. Attallah,
H. Attoui,
M. Ave Pernas,
S. Bacholle,
M. Bakiri,
P. Baragatti,
P. Barrillon,
S. Bartocci,
J. Bayer,
B. Beldjilali,
T. Belenguer,
N. Belkhalfa,
R. Bellotti,
A. Belov
, et al. (289 additional authors not shown)
Abstract:
EUSO-Balloon is a pathfinder mission for the Extreme Universe Space Observatory onboard the Japanese Experiment Module (JEM-EUSO). It was launched on the moonless night of the 25th of August 2014 from Timmins, Canada. The flight ended successfully after maintaining the target altitude of 38 km for five hours. One part of the mission was a 2.5-hour underflight using a helicopter equipped with three UV light sources (LED, xenon flasher and laser) to perform an inflight calibration and examine the detector's capability to measure tracks moving at the speed of light. We describe the helicopter laser system and details of the underflight as well as how the laser tracks were recorded and found in the data. These are the first recorded laser tracks measured from a fluorescence detector looking down on the atmosphere. Finally, we present a first reconstruction of the direction of the laser tracks relative to the detector.
Submitted 7 August, 2018;
originally announced August 2018.
-
Out-of-Time-Ordered-Correlator Quasiprobabilities Robustly Witness Scrambling
Authors:
José Raúl González Alonso,
Nicole Yunger Halpern,
Justin Dressel
Abstract:
Out-of-time-ordered correlators (OTOCs) have received considerable recent attention as qualitative witnesses of information scrambling in many-body quantum systems. Theoretical discussions of OTOCs typically focus on closed systems, raising the question of their suitability as scrambling witnesses in realistic open systems. We demonstrate empirically that the nonclassical negativity of the quasiprobability distribution (QPD) behind the OTOC is a more sensitive witness for scrambling than the OTOC itself. Nonclassical features of the QPD evolve with timescales that are robust with respect to decoherence and are immune to false positives caused by decoherence. To reach this conclusion, we numerically simulate spin-chain dynamics and three measurement protocols (the interferometric, quantum-clock, and weak-measurement schemes) for measuring OTOCs. We target experiments based on quantum-computing hardware such as superconducting qubits and trapped ions.
Submitted 2 February, 2019; v1 submitted 25 June, 2018;
originally announced June 2018.
-
Strengthening weak measurements of qubit out-of-time-order correlators
Authors:
Justin Dressel,
José Raúl González Alonso,
Mordecai Waegell,
Nicole Yunger Halpern
Abstract:
For systems of controllable qubits, we provide a method for experimentally obtaining a useful class of multitime correlators using sequential generalized measurements of arbitrary strength. Specifically, if a correlator can be expressed as an average of nested (anti)commutators of operators that square to the identity, then that correlator can be determined exactly from the average of a measurement sequence. As a relevant example, we provide quantum circuits for measuring multiqubit out-of-time-order correlators using optimized control-Z or ZX-90 two-qubit gates common in superconducting transmon implementations.
Submitted 4 October, 2018; v1 submitted 2 May, 2018;
originally announced May 2018.