2.1 Functional Model
Figure
1 depicts the functional model of a data-intensive system. The resulting classification criteria are shown in Table
1. A data-intensive system offers data management and processing functionalities to external
clients. Clients can register and start
driver programs, which are the parts of the application logic that interact with the data-intensive system and exploit its functionalities. Specifically, during its execution, a driver program can invoke one or more
jobs, which are the largest units of execution that can be offloaded onto the
distributed computing infrastructure made available by the data-intensive system. Depending on the specific system, a driver program may execute
client-side or
system-side. Some systems decouple activation of driver programs from their registration; in this case, we say that driver execution time is
on start, and otherwise we say that it is
on registration. To exemplify, in a DMS, a driver program may be a stored procedure that combines code expressed in some general-purpose programming language with one or more queries (the jobs) expressed in the language offered by the engine (e.g., SQL). Stored procedures typically run system-side every time a client activates them (on start). Similarly, in a DPS, the driver program may be a piece of Java code that spawns one or more distributed computations (the jobs) written using the API offered by the data processing engine. In this context, the driver program will typically run system-side on registration.
The data-intensive system runs on the distributed computing infrastructure as a set of worker processes, hosted on the same or different nodes (physical or virtual machines). We model the processing resources offered by workers as a set of slots. Jobs are compiled into elementary units of execution that we denote as tasks and run sequentially on slots. Jobs consume input data and produce output data. Some systems also store some state within the distributed computing infrastructure: in this case, jobs may access (read and modify) the state during their execution. When present, state can be split (partitioned and replicated) across workers such that each of them is responsible for a state portion.
In our model, data elements are immutable and are distributed through communication channels (dark gray arrows in Figures
1 and
2) that we collectively refer to as the
data bus. Notice that the data bus also distributes
job invocations. Indeed, our model emphasizes the dual nature of invocations and data, which can both carry information and trigger job execution: invocations may transport data in the form of input parameters and return values, whereas the availability of new data may trigger the activation of jobs. Our model exploits this duality to capture the heterogeneity, across the systems we surveyed, in how jobs are activated and data is exchanged.
Job invocations may be either
synchronous, if the driver program waits for job completion before making progress, or
asynchronous, if the driver program continues to execute after submitting the invocation. In both cases, invocations may return some result to the driver program, as indicated by the bidirectional arrows in Figure
1. In some systems, jobs also consume data from external
sources and produce data for external
sinks. We distinguish between
passive sources, which consist of static datasets that jobs can access during their execution (e.g., a distributed filesystem), and
active sources, which produce new data dynamically and may trigger job execution (e.g., a messaging system).
To exemplify, stored procedures (the driver programs) in a DMS invoke (synchronously or asynchronously, depending on the specific system) one or more queries (the jobs) during their execution. Invocations carry input data in the form of actual parameters. Queries can access (read-only queries) and modify (read-write queries) the state of the system, and return query results. In batch DPSs such as MapReduce, jobs read input data from passive sources (e.g., a distributed filesystem), apply functional transformations that do not involve any mutable state, and store the resulting data into sinks (e.g., the same distributed filesystem). In stream DPSs, jobs run indefinitely and make progress when active sources provide new input data. We say that input data activates a job, and in this case jobs may preserve some state across activations.
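As a concrete illustration of the two invocation styles, consider the following self-contained Java sketch, which simulates a driver program invoking one job synchronously and one asynchronously. The names (JobRunner, runJob) are ours and do not reflect the API of any surveyed system; jobs are reduced to local computations for the sake of the example.

// Illustrative simulation of synchronous vs. asynchronous job invocation;
// JobRunner and runJob are hypothetical names, not a real system's API.
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class JobRunner {
    private final ExecutorService workers = Executors.newFixedThreadPool(4);

    // A "job": here just a placeholder computation returning a result.
    private int runJob(String name) {
        System.out.println("running job " + name);
        return name.length(); // stand-in for the actual job result
    }

    public static void main(String[] args) throws Exception {
        JobRunner system = new JobRunner();

        // Synchronous invocation: the driver blocks until the job completes.
        int r1 = system.runJob("syncJob");

        // Asynchronous invocation: the driver continues and collects the
        // result later, when the job terminates.
        CompletableFuture<Integer> r2 =
            CompletableFuture.supplyAsync(() -> system.runJob("asyncJob"),
                                          system.workers);
        System.out.println("driver keeps making progress...");
        System.out.println("results: " + r1 + ", " + r2.get());
        system.workers.shutdown();
    }
}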
We characterize the distributed computing infrastructure based on its deployment. In a cluster deployment, all nodes belong to the same cluster or data center, which provides high bandwidth and low latency for communication. Conversely, in a wide-area deployment, nodes can be spread in different geographical areas, a choice that increases the latency of communication and may impact the possibility of synchronizing and coordinating tasks. For this reason, we also consider hybrid deployments, when the system adopts a hierarchical approach, exploiting multiple fully functional cluster deployments that are loosely synchronized with each other.
2.2 Jobs and Their Lifecycle: From Definition to Execution
This section concentrates on jobs, following their lifecycle from definition to execution (see Figure
2). Jobs are defined inside a driver program (Section
2.2.1) and compiled into an execution plan of elementary tasks (Section
2.2.2), which are deployed and executed on the distributed computing infrastructure (Section
2.2.3). The resulting classification criteria are presented in Table
2.
2.2.1 Jobs Definition.
Jobs are defined inside driver programs. Frequently, driver programs include multiple jobs and embed the logic that coordinates their execution. For instance, stored procedures (driver programs) in DMSs may embed multiple queries (jobs) within procedural code. Similarly, DPSs invoke analytic jobs from a driver program written in a standard programming language with a fork-join execution model. Notably, some systems implement iterative algorithms by spawning a new job for each iteration and by evaluating termination criteria within the driver program.
Jobs are expressed using programming primitives (jobs definition API) with heterogeneous forms. For instance, relational DMSs rely on SQL (a Domain-Specific Language (DSL)), whereas DPSs usually offer libraries for various programming languages. Some systems support both forms.
Jobs are compiled into an execution plan, which defines the computation as a set of elementary units of deployment and execution called tasks. Tasks
(i) run on slots, (ii) exchange data over the data bus (dark gray arrows in Figure
2) according to the communication schema defined in the execution plan, and (iii) may access the state portion of the worker they are deployed on.
We say that the execution plan definition is explicit if the programming primitives directly specify the individual tasks and their logical dependencies. The definition is instead implicit if the logical plan is compiled from a higher-level, declarative specification of the job. To exemplify, the dataflow formalism adopted in many DPSs provides an explicit definition of the logical plan, whereas SQL and most query languages provide an implicit definition. With an explicit definition of the logical plan, the communication between tasks can itself be explicit or implicit. In the first case, the system APIs include primitives to send and receive data across tasks, whereas in the second case, data exchange is derived automatically from the structure of the plan. The execution plan structure can be a generic workflow, where there are no restrictions on the pattern of communication between tasks, or a dataflow, where tasks need to be organized into an acyclic graph and data can only move from upstream tasks to downstream tasks. When present, we also highlight further structural constraints. For instance, the execution plan of the original MapReduce system forces data processing into exactly two phases: a map and a reduce.
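To make the explicit case concrete, the following Java sketch wires two tasks into a minimal dataflow plan: an upstream task sends data over a channel (standing in for the data bus) to a downstream task, with data moving only downstream. The channel representation and task names are ours, chosen purely for illustration.

// Minimal sketch of an explicitly defined dataflow plan: two tasks connected
// by a channel, with data flowing only from upstream to downstream.
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

public class DataflowPlan {
    public static void main(String[] args) throws InterruptedException {
        BlockingQueue<Integer> channel = new LinkedBlockingQueue<>(); // data bus link
        final int END = Integer.MIN_VALUE; // end-of-stream marker

        // Upstream task: produces squares of its input.
        Thread upstream = new Thread(() -> {
            for (int i = 1; i <= 5; i++) channel.add(i * i);
            channel.add(END);
        });

        // Downstream task: sums everything it receives.
        Thread downstream = new Thread(() -> {
            int sum = 0, v;
            try {
                while ((v = channel.take()) != END) sum += v;
            } catch (InterruptedException e) { return; }
            System.out.println("sum = " + sum);
        });

        upstream.start(); downstream.start();
        upstream.join(); downstream.join();
    }
}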
Iterations within the execution plan may or may not be allowed. We say that a system supports dynamic creation of the plan if it enables spawning new tasks during execution. Dynamic creation gives the flexibility of defining or activating part of the execution plan at runtime, which may be used to support control flow constructs.
Jobs can be either
one-shot or
continuous. One-shot jobs are executed once and then terminate. We use the term
invoke here: just as invoking a program twice leads to two distinct processes, invoking a one-shot job multiple times leads to separate executions of the same code. For instance, queries in DMSs are typically one-shot jobs, and indeed multiple invocations of the same query lead to independent executions. Instead, continuous jobs persist across invocations. In this case, we use the term
activate to highlight that the same job is repeatedly activated by the arrival of new data. This happens in stream DPSs, where continuous jobs are activated when new input data comes from active sources. As detailed in Section
2.4, the key distinguishing factor of continuous jobs is their ability to persist some private task state across activations. By definition, this option is not available for one-shot jobs, since each invocation is independent of the others.
State management in jobs may be absent, explicit, or implicit. For instance, state management is absent in batch DPSs, which define jobs in terms of functional transformations that solely depend on the input data. State management is explicit when the system provides constructs to directly access state elements to read and write them. For instance, queries in relational DMSs provide select clauses to retrieve state elements and insert and update clauses to store new state elements and update them. State management is implicit when state accesses are implicit in job definition. For instance, stream DPSs manage state implicitly through ad hoc operators such as windows that record previously received data and use it to compute new output elements.
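The following Java sketch illustrates implicit state management in the style of a windowed stream operator: the operator internally records the last N received elements and uses them to compute each new output, without the job definition ever accessing state explicitly. The operator name and interface are ours, for illustration only.

// Sketch of implicit state management: a count-based window operator keeps
// the last N elements as internal state and emits their average.
import java.util.ArrayDeque;
import java.util.Deque;

public class WindowOperator {
    private final int size;
    private final Deque<Double> window = new ArrayDeque<>(); // implicit state

    WindowOperator(int size) { this.size = size; }

    // Each call models one activation triggered by a new data element.
    double onElement(double value) {
        window.addLast(value);
        if (window.size() > size) window.removeFirst(); // evict oldest
        return window.stream().mapToDouble(Double::doubleValue).average().orElse(0);
    }

    public static void main(String[] args) {
        WindowOperator avg = new WindowOperator(3);
        for (double v : new double[]{1, 2, 3, 4, 5}) {
            System.out.printf("input %.0f -> moving average %.2f%n", v, avg.onElement(v));
        }
    }
}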
Another relevant characteristic of the programming model is the support for
data parallelism, which allows defining computations for a single element and automatically executing them on many elements in parallel. Data parallelism is central in many systems, and particularly in DPSs, as it simplifies the definition of jobs by letting developers focus on individual elements. It promotes parallel execution, as the tasks operating on different elements are independent. Systems supporting data parallelism apply partitioning strategies to both data and state (when available), as we discuss in Section
2.3 and Section
2.4. As inter-task communication and remote data access may easily become performance bottlenecks, some systems aim to reduce inter-task communication and to promote local access to data and state by offering
placement-aware APIs that enable developers to suggest suitable placement strategies based on the expected workload.
2.2.2 Jobs Compilation.
The process of compiling jobs into an execution plan may either start on driver registration or on driver execution. The first case models situations where the driver program is registered in the system and can be executed multiple times, as in the case of stored procedures. The second case happens in DPSs, which usually offer a single command to submit and execute a program. Job compilation may use resource information—that is, information about the resources of the distributed computing infrastructure. The information is static if it only considers the available resources. For instance, data-parallel operators are converted into multiple tasks that run the same logical step in parallel: the concrete number of tasks is typically selected depending on the available processing resources (overall number of CPU cores). The information is dynamic if it also considers the actual use of resources. For instance, a join operation may be compiled into a distributed hash join or into a sort-merge join depending on the current cardinality and distribution of the elements to join.
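The following Java sketch gives a flavor of compilation with dynamic resource information: a planner chooses a join strategy based on current cardinality statistics. The thresholds, names, and decision rule are ours, not those of any surveyed system.

// Illustrative sketch: picking a join strategy from dynamic statistics.
public class JoinPlanner {
    enum Strategy { HASH_JOIN, SORT_MERGE_JOIN }

    // A hash join builds a table on the smaller side; if both inputs are
    // large, a sort-merge join avoids holding one side in memory.
    static Strategy choose(long leftCardinality, long rightCardinality, long memoryBudget) {
        long smaller = Math.min(leftCardinality, rightCardinality);
        return smaller <= memoryBudget ? Strategy.HASH_JOIN : Strategy.SORT_MERGE_JOIN;
    }

    public static void main(String[] args) {
        System.out.println(choose(1_000, 5_000_000, 100_000));      // HASH_JOIN
        System.out.println(choose(2_000_000, 5_000_000, 100_000));  // SORT_MERGE_JOIN
    }
}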
2.2.3 Jobs Deployment and Execution.
Jobs deployment is the process of allocating the tasks of a job execution plan onto slots. For instance, the execution plan in Figure
2 consists of seven tasks, and each of them is deployed on a different slot. Tasks tagged A and B exemplify data-parallel operations, each executed by two tasks in parallel. Deployment can be performed with
job-level granularity or
task-level granularity. Job-level granularity is common when the deployment takes place
on job compilation, whereas task-level granularity is used when the deployment (of individual tasks) takes place
on task activation. It is important to note that the preceding classification is orthogonal to the nature of jobs (one-shot or continuous) as defined earlier. One-shot jobs may be deployed either
(i) entirely, on job compilation, or (ii) progressively, as their input data is made available by previous tasks in the execution plan.
The first choice is frequent in DMSs, whereas the latter characterizes several DPSs. Similarly, continuous jobs may be
(i) fully deployed on compilation, with their composing tasks remaining deployed on slots, ready to be activated by incoming data elements, or (ii) deployed task by task, when new input data becomes available and activates them. In this case, the same task is deployed multiple times, once for each activation: systems that follow this strategy minimize the overhead of deployment by accumulating input data into batches and deploying a task only once for the entire batch, as exemplified by the micro-batch approach of Spark Streaming [
98].
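The following Java sketch illustrates the micro-batch idea in the terms used above: input elements are accumulated into batches, and a task is "deployed" once per batch rather than once per element, amortizing deployment overhead. The class and method names are ours; this is not Spark Streaming's actual implementation.

// Sketch of micro-batch task activation: one deployment per batch.
import java.util.ArrayList;
import java.util.List;

public class MicroBatcher {
    private final int batchSize;
    private final List<String> buffer = new ArrayList<>();
    private int deployments = 0;

    MicroBatcher(int batchSize) { this.batchSize = batchSize; }

    void onInput(String element) {
        buffer.add(element);
        if (buffer.size() == batchSize) flush();
    }

    // "Deploying" the task once for the whole batch amortizes its cost.
    private void flush() {
        deployments++;
        System.out.println("deployment #" + deployments + " processes " + buffer);
        buffer.clear();
    }

    public static void main(String[] args) {
        MicroBatcher b = new MicroBatcher(3);
        for (int i = 1; i <= 9; i++) b.onInput("e" + i);
        // 9 elements, batch size 3 -> only 3 task deployments.
    }
}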
As we discuss in Section
2.3, task-level deployment requires a persistent data bus, to decouple task execution in time. If the data bus is not persistent, all tasks in the execution plan need to be simultaneously deployed to enable the exchange of data.
The deployment process always exploits static information about the resources available in the computing infrastructure, like the address of workers and their number of slots. Some systems also exploit dynamic information, such as the current load of workers and the location of data. This is typically associated with task-level scheduling on activation, where tasks are deployed when their input data is available and they are ready for execution. Finally, the deployment process may feature system-only or shared management of resources. System-only management only considers the resources occupied by the data-intensive system. Shared management takes global decisions in the case in which multiple software systems share the same distributed computing infrastructure. For instance, it is common to use external resource managers such as YARN for task deployment in cluster environments.
2.3 Data Management
This section studies the characteristics of data elements and the data bus used to distribute them. The resulting classification criteria are shown in Table
3. Recall that in our model, data elements are immutable, meaning that once they are delivered through the data bus, they cannot be later updated. In addition, they are used to represent both data and invocations, as they carry some payload and may trigger the activation of tasks. Data elements may be
structured, if they have an associated schema determining the number and type of fields they contain, or
unstructured, otherwise. Structured data is commonly found in DPSs, when input datasets or data streams are composed of tuples with a fixed structure. The structure of elements may reflect on the data bus, with assumptions of homogeneous data elements (same schema) in some communication channels. For instance, DPSs typically assume homogeneous input and homogeneous output data for each task. We further distinguish between systems that accept
general structured data, when the developers are free to define their custom data model, and systems that assume a
domain-specific structure, when developers are constrained to a specific data model, as in the case of relational data, time series, or graph-shaped data. Finally, data may or may not have a
temporal dimension: this is particularly relevant for stream DPSs, where it is used for time-based analysis. Section
2.6 will detail how the temporal dimension influences the order in which tasks analyze data elements.
The data bus can either consist of
direct connections between the communicating parties or can be
mediated by some middleware service. Accordingly, the actual
bus implementation may range from TCP links (direct connection) to various types of middleware systems (mediated connection), like message queuing or distributed storage services.
Whereas direct connections are always
ephemeral, various mediated connections are
persistent. In the first case, receivers need to be active when the data is transmitted over the bus and they cannot retrieve the same elements later in time. In the second case, elements are preserved inside the bus and receivers can access them multiple times and at different points in time. For instance, DPSs that implement job-level deployment usually adopt a direct (and ephemeral) data bus made of TCP connections among tasks. Conversely, DPSs that deploy tasks independently (task-level deployment) require a persistent and mediated data bus (e.g., a distributed filesystem or a persistent messaging middleware) where intermediate tasks can store their results for downstream tasks.
In many systems, the data bus provides communication channels where data elements are logically
partitioned based on some criterion—for instance, based on the value of a given field. The use of a partitioned data bus is common in DPSs, where it is associated with data-parallel operators: the programmer specifies the operator for the data elements in a single partition, but the operator is concretely implemented by multiple (identical) tasks that work independently on different partitions. A persistent data bus may also be
replicated, meaning that the same data elements are stored in multiple copies. Replication may serve two purposes: improving performance, such as by enabling different tasks to consume the same data simultaneously from different replicas, or tolerating faults, to avoid losing data in the case in which one replica becomes unavailable. We will discuss fault tolerance in greater detail in Section
2.7. Among the systems we analyze in this article, all those that replicate the data bus implement a single-leader replication schema, where one
leader replica is responsible for receiving all input data and for updating the other \(f\) (follower) replicas. The update is synchronous (or semi-synchronous), meaning that the addition of an input data element to the data bus completes when the data element has been successfully applied to all \(f\) follower replicas (or to \(r \lt f\) replicas, if the update is semi-synchronous). The data bus offers an
interaction model that is
push if the sender delivers data to recipients or
pull if receivers request data from senders.
Hybrid approaches are possible in the presence of a mediated bus, a common case being a push approach between the sender and the data bus and a pull approach between the data bus and the recipients.
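The following Java sketch illustrates a partitioned, mediated data bus with push on the sender side and pull on the receiver side: the sender hashes each element's key to a partition, and one receiver per partition pulls from its own queue. All names are ours, chosen for illustration.

// Sketch of a partitioned, mediated data bus with hash partitioning.
import java.util.List;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

public class PartitionedBus {
    private final List<BlockingQueue<String>> partitions;

    PartitionedBus(int n) {
        partitions = new java.util.ArrayList<>();
        for (int i = 0; i < n; i++) partitions.add(new LinkedBlockingQueue<>());
    }

    // Push side: the partitioning criterion is a hash of the key, so all
    // elements with the same key land in the same partition.
    void push(String key, String payload) {
        int p = Math.floorMod(key.hashCode(), partitions.size());
        partitions.get(p).add(key + "=" + payload);
    }

    // Pull side: each receiver polls only its own partition.
    String pull(int partition) { return partitions.get(partition).poll(); }

    public static void main(String[] args) {
        PartitionedBus bus = new PartitionedBus(2);
        bus.push("alice", "10"); bus.push("bob", "20"); bus.push("alice", "30");
        for (int p = 0; p < 2; p++) {
            String e;
            while ((e = bus.pull(p)) != null)
                System.out.println("partition " + p + " delivers " + e);
        }
    }
}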
2.4 State Management
After discussing data, we focus on state, deriving the classification criteria listed in Table
4. In the absence of state, task results (data written on the bus) only depend on their input (data read from the bus), but many systems support stateful tasks whose results also depend on some mutable state that they can read and modify during execution. This marks a difference between data and state elements: the former are immutable, whereas the latter may change over time. As for their structure, state elements resemble data elements: they may be
unstructured or
structured, and in the second case, they may rely on
domain-specific data models.
When present, state may be stored on different
storage media: (i) many systems store the entire state
in-memory and replicate it to disk only for fault tolerance; (ii) other systems use
disk storage, or (iii)
hybrid solutions, where state is partially stored in memory for improved performance and flushed to disk to scale in size; and (iv) some systems rely on a storage
service, as is common in cloud-based systems that split their core functionalities into independently deployed services. Some recent work investigates the use of persistent memory [
61], but these solutions are not employed in currently available systems.
The storage structure indicates the data structure used to represent state on the storage media. This structure is heavily influenced by the expected access pattern. For instance, relational state may be stored row-wise, to optimize access element by element (common in data management workloads), or column-wise, to optimize access attribute by attribute (common in data analysis workloads, e.g., to compute an aggregation function over an attribute). Many DMSs use indexed structures such as B-trees or Log-Structured Merge (LSM) trees to rapidly identify individual elements.
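As a toy illustration of the LSM idea (not the implementation of any surveyed system), the following Java sketch buffers writes in an in-memory table that, when full, is flushed as an immutable sorted run; reads check the buffer first and then the runs from newest to oldest, which is exactly the read penalty mentioned above.

// Toy sketch of an LSM-style storage structure.
import java.util.ArrayList;
import java.util.List;
import java.util.TreeMap;

public class TinyLsm {
    private TreeMap<String, String> memtable = new TreeMap<>();
    private final List<TreeMap<String, String>> runs = new ArrayList<>();
    private final int flushThreshold;

    TinyLsm(int flushThreshold) { this.flushThreshold = flushThreshold; }

    void put(String key, String value) {
        memtable.put(key, value);            // cheap, buffered write
        if (memtable.size() >= flushThreshold) {
            runs.add(0, memtable);           // newest run first
            memtable = new TreeMap<>();      // fresh write buffer
        }
    }

    // Reads may traverse several layers: this is the read penalty.
    String get(String key) {
        if (memtable.containsKey(key)) return memtable.get(key);
        for (TreeMap<String, String> run : runs)
            if (run.containsKey(key)) return run.get(key);
        return null;
    }

    public static void main(String[] args) {
        TinyLsm store = new TinyLsm(2);
        store.put("a", "1"); store.put("b", "2"); // triggers a flush
        store.put("a", "3");                      // newer version in memtable
        System.out.println(store.get("a"));       // 3 (memtable wins)
        System.out.println(store.get("b"));       // 2 (found in a run)
    }
}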
Data-intensive systems may support two types of state: task state, which is private to a single task, and shared state, which can be accessed by multiple tasks. The availability of these types of state deeply affects the design and implementation of the system. Shared state is central in DMSs, where two tasks (e.g., an insert and a select query) can write and read simultaneously from the same state (e.g., a relational table). Conversely, most DPSs avoid shared state to simplify parallel execution. Frequently, batch DPSs do not offer any type of state, whereas stream DPSs only offer task state, which does not require any concurrency control mechanism, as it is accessed only by one task (sequential unit of execution). Notice that task state is only relevant in continuous jobs, where it can survive across multiple activations of the same task. Indeed, it is used in DPSs to implement stateful operators like windows.
In our model, workers are responsible for storing separate portions of the shared state. Tasks have local access to elements stored (in memory or on disk) on the shared state portion of the worker they are deployed on. They can communicate with remote tasks over the data bus to access shared state portions deployed on other workers. For systems that rely on a storage service, we model the service as a set of workers that are only responsible for storing shared state portions and offer remote access through the data bus. The shared state may be split among workers according to some partitioning criterion. For instance, partitioning enables DMSs to scale beyond the memory capacity of a single node, but also to run tasks belonging to the same or different jobs (queries) in parallel on different partitions. Besides partitioning, many data-intensive systems adopt replication. As in the case of data bus replication, state replication may also serve two purposes:
(i) reduce read access latency, by allowing multiple workers to store a copy (replica) of the same state elements locally, and (ii) provide durability and fault tolerance, avoiding potential loss in the case of failures. We return to the specific use of replication for fault tolerance in Section
2.7.
Here, we consider whether the replication is backup-only, meaning that replicas are used exclusively for fault tolerance and cannot be accessed by tasks during execution.
If tasks can access state elements from multiple replicas, different replication
consistency models are possible, which define which state values may be observed by tasks when accessing multiple replicas. Replication models have been widely explored in database and distributed systems theory. For the goal of our analysis, we only distinguish between
strong and
weak consistency models, where the former require synchronous coordination among the replicas while the latter do not. This classification approach is also in line with the recent literature that denotes models that do not require synchronous coordination as being highly available [
15]. Intuitively, strong consistency models are more restrictive and use coordination to avoid anomalies that may arise when tasks access elements simultaneously from different replicas. In practice, most data-intensive systems that adopt a strong consistency model provide
sequential consistency, a model that ensures that accesses to replicated state are the same as if they were executed in some serial order. This simplifies reasoning on the state of the system, as it hides concurrency by mimicking the behavior of a sequential execution. In terms of implementation, we distinguish two main classes of mechanisms to achieve strong consistency: in
leader-based algorithms, all state updates are delivered to a single replica (leader) that decides their order, and in
consensus-based algorithms, replicas use quorum-based or distributed consensus protocols to agree on the order of state accesses. Systems that adopt a weak consistency model typically provide
eventual consistency, where updates to state elements are propagated asynchronously, which may lead to (temporary) inconsistencies between replicas. For this reason, weak consistency is typically coupled with automated
conflict resolution algorithms, which guarantee that all replicas solve conflicts in the same way and eventually converge to the same state. A popular approach to conflict resolution is conflict-free replicated data types, which expose only operations that guarantee deterministic conflict resolution in the presence of simultaneous updates [
81].
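To make this concrete, the following Java sketch implements a grow-only counter, one of the simplest conflict-free replicated data types. Each replica increments only its own slot, and merging takes the per-slot maximum; because the merge is commutative, associative, and idempotent, replicas converge deterministically regardless of the order in which they exchange state. The class and method names are ours. Note that the sketch uses state-based propagation, one of the two update-propagation approaches discussed next.

// Sketch of a CRDT: a grow-only counter with deterministic convergence.
public class GCounter {
    private final long[] slots;   // one slot per replica
    private final int myId;

    GCounter(int replicas, int myId) { this.slots = new long[replicas]; this.myId = myId; }

    void increment() { slots[myId]++; }
    long value()     { long s = 0; for (long v : slots) s += v; return s; }

    // Merge is commutative, associative, and idempotent: replicas can
    // exchange state in any order and still converge (eventual consistency).
    void merge(GCounter other) {
        for (int i = 0; i < slots.length; i++)
            slots[i] = Math.max(slots[i], other.slots[i]);
    }

    public static void main(String[] args) {
        GCounter r0 = new GCounter(2, 0), r1 = new GCounter(2, 1);
        r0.increment(); r0.increment();  // concurrent updates on replica 0
        r1.increment();                  // ...and on replica 1
        r0.merge(r1); r1.merge(r0);      // asynchronous state exchange
        System.out.println(r0.value() + " " + r1.value()); // both print 3
    }
}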
Finally, replication protocols may employ two approaches to propagate updates: state-based or operation-based (also known as active replication). In state-based replication, when a task updates a value in a replica, the new state is propagated to the other replicas. In operation-based replication, the operation causing the update is propagated and re-executed at each replica: this approach may save bandwidth, but it may consume more computational resources to re-execute operations at each replica.
2.5 Tasks Grouping
Several systems offer primitives to identify groups of tasks and provide additional guarantees for such groups, which we classify as
group atomicity (Section
2.5.1) and
group isolation (Section
2.5.2). The resulting classification criteria are presented in Table
5. Atomicity ensures no partial failures for a group of tasks: they either all fail or all complete successfully. Isolation limits the ways in which running tasks can interact and interleave with each other. In DMSs, these concerns are considered part of transactional management, together with consistency and durability properties. In our model, we discuss consistency constraints as part of group atomicity in the next section, whereas we integrate durability with fault tolerance and discuss it in Section
2.7.
2.5.1 Group Atomicity.
Atomicity ensures that a group of tasks either entirely succeeds or entirely fails. We use the established jargon of database transactions and say that a task (or group of tasks) either
commits or
aborts. If the tasks commit, all the effects of their execution, and particularly all their changes to the shared state, become visible to other tasks. If the tasks abort, none of the effects of their execution becomes visible to other tasks. We classify group atomicity along two dimensions. First, we consider the possible
causes for aborts and distinguish between
system-driven and
job-driven. System-driven aborts (e.g., a worker running out of memory) derive from non-deterministic hardware or software failures, whereas job-driven aborts (e.g., database integrity constraints) are part of a job definition and are triggered if job completion may lead to a logic error. Second, we consider how systems implement group atomicity. Atomicity is essentially a consensus problem [
65], where tasks need to agree on a common outcome: commit or abort. The established protocol to implement atomicity is
two-phase commit. In this protocol, one of the participants (tasks) takes the role of a coordinator, collects all votes (commit or abort) from participants (phase 1), and distributes the common decision to all of them (phase 2). Notice that this protocol is not robust against failures of the coordinator; however, data-intensive systems typically adopt orthogonal mechanisms to deal with failures, as discussed in Section
2.7. Most importantly, two-phase commit is a
blocking protocol as participants cannot make progress before receiving the global outcome from the coordinator. For these reasons, some systems adopt simplified,
coordination-free protocols, which reduce or avoid coordination under certain assumptions. Being specific to individual systems, we discuss such protocols in Section
4.
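The following Java simulation illustrates the two phases of the protocol described above: the coordinator collects every participant's vote (phase 1) and then distributes the common decision (phase 2), so the group either entirely commits or entirely aborts. The interface and class names are ours, for illustration only.

// Simulation of two-phase commit: vote collection, then a common decision.
import java.util.List;

public class TwoPhaseCommit {
    interface Participant {
        boolean prepare();          // phase 1: vote commit (true) or abort
        void commit();
        void abort();
    }

    static boolean run(List<Participant> group) {
        // Phase 1: collect every participant's vote; one abort vote dooms the group.
        boolean allYes = true;
        for (Participant p : group) allYes &= p.prepare();
        // Phase 2: distribute the common outcome (atomicity: all or nothing).
        for (Participant p : group) {
            if (allYes) p.commit(); else p.abort();
        }
        return allYes;
    }

    public static void main(String[] args) {
        Participant ok = new Participant() {
            public boolean prepare() { return true; }
            public void commit()     { System.out.println("task committed"); }
            public void abort()      { System.out.println("task aborted"); }
        };
        Participant failing = new Participant() {
            public boolean prepare() { return false; } // e.g., constraint violated
            public void commit()     { System.out.println("task committed"); }
            public void abort()      { System.out.println("task aborted"); }
        };
        System.out.println("outcome: " + (run(List.of(ok, failing)) ? "commit" : "abort"));
    }
}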
2.5.2 Group Isolation.
Group isolation constrains how tasks belonging to different groups can interleave with each other and is classically organized into
levels [
4]. The strongest level, serializable isolation, requires the effects of execution to be the same as if all groups were executed in some serial order, with no interleaving of tasks, whereas weaker levels enable some disciplined form of concurrency that may lead to anomalies in the results that clients observe. In line with the approach adopted for replication consistency and for atomicity, in this work we consider only two broad classes of isolation levels: those that require
blocking coordination between tasks and those that are
coordination free (referred to as being highly available in the literature [
15]). This is also motivated by the systems under analysis, which either provide strong isolation levels (typically, serializable) or do not provide isolation at all.
Implementation-wise, strong isolation is traditionally achieved with two classes of coordination protocols:
lock-based and
timestamp-based. With lock-based protocols, tasks acquire non-exclusive or exclusive locks to access shared resources (shared state in our model) in read-only or read-write mode. Lock-based protocols may incur distributed deadlocks: to avoid them, protocols implement detection or prevention schemes that abort and restart groups in the case of deadlock. Timestamp-based protocols generate a serialization order for groups before execution, which tasks must then enforce during execution. Pessimistic timestamp protocols abort and re-execute groups when they try to access shared resources out of order. Multi-version concurrency control protocols reduce the probability of aborts by storing multiple versions of shared state elements and allowing tasks to read old versions when executed out of order. Optimistic concurrency control protocols allow out-of-order execution of tasks but check for conflicts before making the effects of a group of tasks visible to other tasks. Finally, a few systems adopt special protocols that reduce or avoid coordination under certain assumptions: as in the case of group atomicity, we discuss these protocols in Section
4.
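The following single-process Java sketch conveys the essence of optimistic concurrency control: tasks read freely while recording the versions they observed, then validate at commit time that those versions are still current; a failed validation aborts the group, which can be retried. The store and its API are ours, for illustration only.

// Sketch of optimistic concurrency control: read, then validate-and-write.
import java.util.HashMap;
import java.util.Map;

public class OptimisticStore {
    private final Map<String, Integer> values = new HashMap<>();
    private final Map<String, Long> versions = new HashMap<>();

    int read(String key, Map<String, Long> readSet) {
        readSet.put(key, versions.getOrDefault(key, 0L)); // remember version
        return values.getOrDefault(key, 0);
    }

    // Validate-then-write; synchronized models the short critical section.
    synchronized boolean commit(Map<String, Long> readSet, Map<String, Integer> writes) {
        for (Map.Entry<String, Long> e : readSet.entrySet())
            if (!e.getValue().equals(versions.getOrDefault(e.getKey(), 0L)))
                return false; // conflict detected: abort
        for (Map.Entry<String, Integer> w : writes.entrySet()) {
            values.put(w.getKey(), w.getValue());
            versions.merge(w.getKey(), 1L, Long::sum);
        }
        return true;
    }

    public static void main(String[] args) {
        OptimisticStore store = new OptimisticStore();
        Map<String, Long> readSet = new HashMap<>();
        int x = store.read("x", readSet);
        // A concurrent group commits first, bumping x's version...
        store.commit(new HashMap<>(), Map.of("x", 42));
        // ...so our validation fails and the group must abort and retry.
        boolean ok = store.commit(readSet, Map.of("x", x + 1));
        System.out.println(ok ? "committed" : "aborted: conflicting update");
    }
}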
2.6 Delivery and Order Guarantees
Delivery and
order guarantees define how external actors (driver programs, sources, and sinks) observe the effects of their actions (submitting input data and invocations). Both topics are crucial for distributed systems and have been widely explored in the literature. Here, we focus on the key concepts that characterize the behavior of the systems we analyzed, and we offer a description that embraces different styles of interaction, from invocation-based (as in DMS queries) to data-driven (as in stream DPSs). The resulting classification criteria are presented in Table
6.
Delivery focuses on the effects of a single input
\(I\) (data element or invocation). Under
at most once delivery, the system behaves as if
\(I\) was either received and processed once or never. Under
at least once delivery, the system behaves as if
\(I\) was received and processed once or more than once. Under
exactly once delivery, the system behaves as if
\(I\) was received and processed once and only once. A well-known theoretical result in the area of distributed systems states that it is impossible to deliver an input exactly once in a distributed environment where components can fail. Nevertheless, a system can behave as if the input was processed exactly once under some assumptions: the most common are that driver programs and sources can resubmit the input upon request (to avoid loss of data), whereas sinks can detect duplicate output results and discard them (to avoid duplicate processing and output). To exemplify, DMSs offer exactly once delivery when they guarantee group atomicity through transactions: in this case, a job entirely succeeds or entirely fails, and in the case of a failure, the system either notifies the driver program (which may retry until success) or internally retries, allowing the job to be executed exactly once. DPSs offer exactly once delivery by replaying data from sources (or from intermediate results stored in a persistent data bus) in the case of a failure. In the presence of continuous jobs (stream processing), systems also need to avoid duplicating the effects of processing on task state when replaying data: to do so, they often discard the effects of previous executions by reloading an old task state from a checkpoint (see also the role of checkpoints on fault tolerance in Section
2.7).
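The following Java sketch captures the sink-side half of these assumptions: outputs carry an identifier, and the sink discards any identifier it has already applied, so replays after a failure never duplicate the observable output. The class name and id scheme are ours, chosen for illustration.

// Sketch of duplicate detection at the sink, enabling exactly once behavior
// under replays.
import java.util.HashSet;
import java.util.Set;

public class ExactlyOnceSink {
    private final Set<Long> seen = new HashSet<>(); // ids already applied

    void deliver(long id, String payload) {
        if (!seen.add(id)) {
            System.out.println("duplicate " + id + " discarded");
            return; // already processed: idempotent under retries
        }
        System.out.println("output " + id + ": " + payload);
    }

    public static void main(String[] args) {
        ExactlyOnceSink sink = new ExactlyOnceSink();
        sink.deliver(1, "result-a");
        // A failure occurs after processing but before the acknowledgment,
        // so the system replays the input: the sink filters the duplicate.
        sink.deliver(1, "result-a");
        sink.deliver(2, "result-b");
    }
}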
Order focuses on multiple data elements or invocations and defines in which order their effects become visible. Order ultimately depends on the
nature of timestamps physically or logically attached to data elements. In some systems, no timestamp is associated with data elements; in these cases, no ordering guarantees are provided. Conversely, when data elements represent occurrences of events in the application domain, they have an associated timestamp that can be set by the original source or by the system when it first receives the element. We rely on established terminology and refer to the former case as
event time and the latter case as
ingestion time [
8]. When a timestamp is provided, systems may ensure that the associated order is guaranteed
always or
eventually. Systems in the first class wait until all data elements before a given timestamp become available and then process them in order. To do so, they typically rely on a contract between the sender components and the data bus, where sender components use special elements (denoted as
watermarks) to indicate that all elements up to a given time
\(t\) have been produced, and the data bus delivers elements up to time
\(t\) in the correct order. Systems in the second class execute elements out of order, but they
retract previously issued results and correct them when they receive new data with an older timestamp. Thus, upon receiving all input data up to time
\(t\), the system eventually returns the correct results. Notice that this mechanism requires the components receiving output data (the sinks) to tolerate temporarily incorrect results. According to our preceding definitions, retraction is not compatible with exactly once delivery, as it changes the results provided to sinks after they have already been delivered, thus breaking the illusion that they have been produced once and only once.
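The following Java sketch illustrates the watermark-based approach: out-of-order elements are buffered, and when a watermark for time t arrives, all buffered elements with timestamp up to t are released in timestamp order. The class names are ours and the buffering policy is deliberately simplified.

// Sketch of watermark-based in-order processing.
import java.util.PriorityQueue;

public class WatermarkBuffer {
    record Element(long timestamp, String payload) {}

    private final PriorityQueue<Element> buffer =
        new PriorityQueue<>((a, b) -> Long.compare(a.timestamp(), b.timestamp()));

    void onElement(Element e) { buffer.add(e); } // hold until safe

    // The sender promises no element with timestamp <= t will arrive later.
    void onWatermark(long t) {
        while (!buffer.isEmpty() && buffer.peek().timestamp() <= t)
            System.out.println("process " + buffer.poll());
    }

    public static void main(String[] args) {
        WatermarkBuffer op = new WatermarkBuffer();
        op.onElement(new Element(3, "c"));   // arrives out of order
        op.onElement(new Element(1, "a"));
        op.onElement(new Element(2, "b"));
        op.onWatermark(2);                   // releases a, b in order
        op.onWatermark(3);                   // releases c
    }
}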
2.7 Fault Tolerance
Fault tolerance is the ability of a system to tolerate failures that may occur during its execution. We consider hardware failures, such as a disk failing or a node crashing or becoming unreachable, and non-deterministic software failures, such as a worker exhausting a node’s memory. We assume that the logic of jobs is correct, which guarantees that the re-execution of a failed job does not deterministically lead to the same failure. Our minimal unit of fault is the worker, and we assume the typical approach to tolerate failures that involves first detecting the failure and then recovering from it. The classification criteria for fault tolerance are presented in Table
7.
Fault detection is usually addressed as a problem of group membership: given a group of workers, determine the (sub)set of those that are active and available to execute jobs. Systems address this problem either using a leader-worker approach, which assumes one entity with a special role (leader) that cannot fail and can supervise normal workers, or using a distributed protocol, such as a gossip-based protocol.
After a failure is detected, fault recovery brings the system into a state from which it can resume with the intended semantics. We describe the recovery process by focusing on five aspects: scope, computation recovery, state recovery, guarantees, and assumptions. Depending on whether tasks are stateless or stateful and whether they can share state, the scope of recovery may involve recovering the computation of failing tasks, the task state of failing tasks, and/or the shared state portions held by failing workers.
Computation recovery may be
absent, in which case failing jobs are simply discarded and the system offers at most once delivery (see Section
2.6). Otherwise, the system recovers the computation by restarting it: we distinguish between systems that need to restart an entire
job and systems that can restart individual
tasks. DMSs typically restart entire jobs to satisfy transactional (atomicity) guarantees that require a job to either entirely succeed or entirely fail. Some DPSs (those using a persistent data bus to save the intermediate results of tasks) may restart only failed tasks. Restarting a computation requires that input data and invocations are persisted and replayable, and that duplicate output data can be detected and discarded by sinks (if the system wants to ensure exactly once delivery; see Section
2.6). To replay input data and invocations, systems either rely on replayable sources, such as persistent message services, or keep a log internally (see the discussion on logging in the following).
To recover state, systems may rely on
checkpointing,
logging, and
replication. Frequently, they combine these mechanisms. Checkpointing involves saving a copy of the entire state to durable storage. When a failure is detected, the last available checkpoint can be reloaded and the execution of jobs may restart from that state. Different workers may take
independent checkpoints, or they may
coordinate, such as by using the distributed snapshot protocol [
28] to periodically save a consistent view of the entire system state. A third alternative (
per-activation checkpoint) is sometimes used for continuous jobs to save task state: at each activation, a task stores its task state together with the output data. This approach essentially transforms a stateful task into a stateless task, where state is encoded as a special data element that the system receives in input at the next activation. In practice, per-activation checkpointing is used in the presence of a persistent data bus that stores checkpoints. Frequent checkpointing may consume significant resources and affect the response time of the system.
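The following Java sketch shows the per-activation idea in miniature: the task becomes a pure function from (old state, input) to (new state, output), and the state travels alongside the output, to be read back at the next activation. The names are ours and the "persistent data bus" is reduced to a local variable for illustration.

// Sketch of per-activation checkpointing: state is emitted with the output.
public class PerActivationCheckpoint {
    // Pure function: (old state, input) -> (new state, output).
    record StateAndOutput(long state, long output) {}

    static StateAndOutput activate(long state, long input) {
        long newState = state + input;          // running sum as task state
        return new StateAndOutput(newState, newState);
    }

    public static void main(String[] args) {
        long checkpointedState = 0;             // stored with the output
        for (long input : new long[]{5, 3, 7}) {
            StateAndOutput r = activate(checkpointedState, input);
            System.out.println("output " + r.output());
            checkpointedState = r.state();      // persisted on the data bus,
        }                                       // reloaded at next activation
    }
}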
Logging is an alternative approach that saves either individual operations or state changes rather than the entire content of the state. These two forms of logging are known in the database literature as follows:
(i) Command Logging (CL) persists input data and invocations, and in the case of failure re-processes the same input to restore state [67], and (ii) Write-Ahead Logging (WAL) persists state changes coming from tasks to durable storage before they are applied, and in the case of a failure reapplies the changes in the log to restore the state.
As logs may grow unbounded with new invocations entering the system, they are always complemented with (infrequent) checkpoints. In the case of failure, the state is restored from the last checkpoint and then the log is replayed from the checkpoint on. Finally, systems may
replicate state portions in multiple workers. In this case, a state change performed by a task succeeds only after the change has been applied to a given number of replicas
\(r\) . This means that the system can tolerate the failure of
\(r-1\) replicas without losing the state and without the need to restore it from a checkpoint. As already discussed in Section
2.4, the same replicas used for fault tolerance may also be used by tasks during normal processing to improve state access latency.
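Tying together logging and checkpointing, the following Java sketch shows recovery with a write-ahead log plus infrequent checkpoints: every change is appended to the log before being applied, and recovery reloads the last checkpoint and replays only the log suffix recorded after it. Class and field names are ours; the "durable" log and checkpoint are simulated with in-memory structures.

// Sketch of WAL recovery combined with an infrequent checkpoint.
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class WalRecovery {
    private Map<String, Integer> state = new HashMap<>();
    private final List<String[]> log = new ArrayList<>();   // durable WAL
    private Map<String, Integer> checkpoint = new HashMap<>();
    private int checkpointLogPos = 0;

    void put(String key, int value) {
        log.add(new String[]{key, String.valueOf(value)}); // log first...
        state.put(key, value);                             // ...then apply
    }

    void takeCheckpoint() {                  // infrequent full-state copy
        checkpoint = new HashMap<>(state);
        checkpointLogPos = log.size();       // log before this point can go
    }

    void recover() {                         // after a crash
        state = new HashMap<>(checkpoint);   // reload last checkpoint
        for (int i = checkpointLogPos; i < log.size(); i++)
            state.put(log.get(i)[0], Integer.parseInt(log.get(i)[1]));
    }

    public static void main(String[] args) {
        WalRecovery w = new WalRecovery();
        w.put("x", 1);
        w.takeCheckpoint();
        w.put("x", 2); w.put("y", 9);        // logged after the checkpoint
        w.recover();                         // same state: {x=2, y=9}
        System.out.println(w.state);
    }
}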
The preceding recovery mechanisms provide different guarantees on the state of the system after recovery. It can be any state, a valid system state, or the same state the system was in before failing. In any state recovery, the presence of a failure may bring the system to a state that violates some of the invariants for data and state management that hold for normal (non-failing) executions. For instance, the system may drop some input data. A valid state recovery mechanism brings the system to a state that satisfies all invariants, but it may differ from the states traversed before the failure. For instance, a system that provides serializability for groups of tasks may recover by re-executing groups of tasks in a different (but still serial) order. Depending on the system, clients may or may not be able to observe the differences between the two states (before and after the failure). For instance, in a DMS, two read queries before and after the failure may observe different states. A same state recovery mechanism brings the system to the same state it was in before the failure. Replication, write-ahead logging, and per-activation checkpointing bring the system to the same state it was in prior to the failure, whereas independent checkpointing only guarantees to bring the system back to a valid state, as the latest checkpoints of each task may not represent a consistent cut of the system. The same happens for CL, due to different interleavings in the original execution and in the recovery phase that may lead to different (albeit valid) states.
A final aspect related to recovery is the assumptions under which it operates, such as assuming that no more than \(k\) nodes can fail simultaneously, as in the case of replication.
2.8 Dynamic Reconfiguration
With dynamic reconfiguration, we refer to the ability of a system to modify the deployment and execution of jobs on the distributed computing infrastructure at runtime. The corresponding classification criteria are shown in Table
8. Reconfiguration may be driven by different
goals, which may involve providing some minimum quality of service, such as in terms of throughput or response time and/or minimizing the use of resources to cut down operation costs. It may be activated manually or be
automated, if the system can monitor the use of resources and make changes that improve the state of the system with respect to the intended goals. The reconfiguration process may involve different mechanisms:
state migration across workers, such as to rebalance shared state portions if they become unbalanced, and
task migration, to change the association of tasks to slots, including the addition or removal of slots, to add computational resources if the load increases or release them when they are not necessary. State migration is common in DMSs, where the distribution of shared state across workers may affect performance. Task migration is instead used in DPSs in the presence of continuous jobs, where tasks are migrated across activations. In both cases, the migration may adapt the system to the addition or removal of slots. Some systems can continue operating during a reconfiguration process, whereas other systems need to temporarily stop and
restart the jobs they are running: this approach appears in some systems that adopt job-level deployment; in this case, reconfiguration takes place by saving the current state, restarting the whole system, and restoring the last recorded state.
3 Survey of Data-Intensive Systems
To keep our survey compact and general, we decided not to survey data-intensive systems one by one (this is provided in the appendix), but to group them into the taxonomy of Figure
3 and to discuss systems class by class in Sections 4 through 6. The caption of Figure
3 lists all systems we grouped in each class.
Our taxonomy groups existing systems in a way that emphasizes their commonalities with respect to our classification criteria (see Tables
1–
8) while also capturing pre-existing classifications widely adopted by experts. The top-level distinction between DMSs (Section
4) and DPSs (Section
5) is well known to researchers and practitioners, and it is also well captured by our classification criteria, with all DMSs providing shared state but not task state and DPSs having the opposite characteristics: this distinction impacts the value of most fields of Table
4.
Within the class of DMSs, the ability to offer strong guarantees in terms of consistency (see Table
4), group atomicity, and group isolation (see Table
5) draws a sharp distinction between those systems usually known as “NoSQL” and those known as “NewSQL.” NoSQL and NewSQL systems can be further classified by looking at the data model they offer (the field “elements structure” in Table
3).
Within the class of DPSs, the criterion “execution plan structure” (see Table
2) differentiates dataflow and graph processing systems, whereas the criterion “granularity of deployment” (see Table
2) further separates dataflow systems into those offering task-level deployment and those offering job-level deployment.
Finally, other systems (Section
6) that do not clearly fall into the two main categories of DMSs and DPSs include those that implement data processing (as DPSs) but on top of shared state abstractions usually offered by DMSs only, those that aim to offer new programming models, and hybrid systems that explicitly try to integrate data processing and management capabilities within a unified solution.
Table
9 and Table
10 show how the classes of systems at the second level of our taxonomy map on the classification criteria of our model. The next sections describe each class in detail, explaining the values in these tables and making practical examples that refer to specific systems.
4 Data Management Systems
DMSs offer the abstraction of a mutable state store that many jobs can access simultaneously to query, retrieve, insert, and update elements. Unlike DPSs, they mostly target lightweight jobs, which do not involve computationally expensive data transformations and are short-lived. Since their advent in the 1970s, relational databases have represented the standard approach to data management, offering a uniform data model (relational), query language (SQL), and execution semantics (transactional). Over the past two decades, new requirements and operational conditions brought the idea of a unified approach to data management to an end [
87,
88]: new applications emerged with different needs in terms of data and processing models, for instance, to store and retrieve unstructured data; scalability concerns related to data volume, number of simultaneous users, and geographical distribution exposed the cost of transactional semantics. This state of affairs fostered the development of the DMSs presented in this section. Section
4.1 discusses the aspects in our model that are common to all such systems. Then, following established terminology, we organize them into two broad classes: NoSQL databases [
36] (Section
4.2) emerged since the early 2000s, providing simple and flexible data models such as key-value pairs, and trading consistency guarantees and strong (transactional) semantics for horizontal scalability, high availability, and low response time; NewSQL databases [
86] (Section
4.3) emerged in the late 2000s and take an opposite approach: they aim to preserve the traditional relational model and transactional semantics by introducing new design and implementation strategies that reduce the cost of coordination.
4.1 Overview
Functional Model. All DMSs provide a global shared state that applications can simultaneously access and modify. In the more traditional systems, there is a sharp distinction between the application logic (the driver, executed client-side on registration) and the queries (the jobs, executed by the DMS). Recent systems increasingly allow moving the part of the application logic that orchestrates job execution into the DMS, in the form of stored procedures that run system-side on start. Stored procedures may bring two advantages:
(i) reducing the interactions with external clients, thus improving latency, and (ii) moving part of the overhead for compiling jobs from driver execution time to driver registration time.
Being conceived for interactive use, all DMSs offer synchronous APIs to invoke jobs from the driver program. Many also offer asynchronous APIs that allow the driver to invoke multiple jobs and receive notifications of their results when they terminate. A common approach to reduce the cost of communication when starting jobs from a client-side driver is batching multiple invocations together, which is offered in some NoSQL systems such as MongoDB [
32] and Redis [
64].
Several DMSs can interact with active sources and sinks. Active sources push new data into the system, leading to insertion or modification of state elements. Sinks register to state elements of interest (e.g., by specifying a key or a range of keys) and are notified upon modification of such elements.
DMSs greatly differ in terms of deployment strategies, which are vastly influenced by the coordination protocols that govern replication, group atomicity, and isolation. NoSQL systems do not offer group atomicity and isolation. Those designed for cluster deployment typically use blocking (synchronous or semi-synchronous) replication protocols. Those that support wide-area deployments either use coordination-free (asynchronous) replication protocols that reduce durability and consistency guarantees or employ a hybrid strategy, with synchronous replication within a data center and asynchronous replication across data centers. Many NewSQL systems claim to support wide-area deployments while offering strong consistency, group atomicity, and isolation. We review their implementation strategies to achieve this result in Section
4.3.
Jobs. All DMSs implement one-shot jobs with explicit state management primitives to read and modify a global shared state. Job definition APIs greatly differ across systems, ranging from pure key-value stores that offer CRUD (create, read, update, and delete) primitives for individual state elements to expressive domain-specific libraries (e.g., for graph computation) or languages (e.g., SQL for relational data). In almost all DMSs, the execution plan and the communication between tasks are implicit and do not include iterations or dynamic creation of new tasks. A notable exception is graph databases, which support iterative algorithms where developers explicitly define how tasks update portions of the state (the graph) and exchange data. The structure of the execution plan also varies across systems: common structures include the use of a single task that implements CRUD primitives, workflows orchestrated by a central coordinator task, or hierarchical structures. Jobs are compiled on driver execution, except for those systems where part of the driver program is registered server-side (as stored procedures). Job compilation always uses static information about resources, such as the allocation of shared state portions onto nodes. Some structured NewSQL DMSs such as Spanner [
14] and CockroachDB [
90] also exploit dynamic information about resources—for instance, to configure a given task (e.g., select a sort-merge or a hash-based strategy for a join task) depending on some resource utilization or statistics about state (e.g., cardinality of the tables to join).
All DMSs perform deployment at the job level, when the job is compiled, with the only exception of AsterixDB [
9], which compiles jobs into a dataflow plan and deploys individual tasks when they are activated [
19]. Deployment is always guided by the location of the state elements to be accessed, which we consider static information. Indeed, under normal execution, shared state portions do not move, and our model captures dynamic relocation of shared state portions (e.g., for load balancing) as a distinct aspect (
dynamic reconfiguration). In addition, we still classify deployment as based on static information for those systems that exploit dynamic information (e.g., the load of workers) only to deploy the tasks that manage the communication between the system and external clients. Finally, DMSs are not typically designed to operate in scenarios where the compute infrastructure is shared with other software applications. This is probably due to the interactive and short-lived nature of jobs, which would make it difficult to predict and adapt resource demand in those scenarios. The only case in which we found explicit mention of a shared platform is in the description of the Google infrastructure, where DMS components (BigTable [
29], Percolator [
76], Spanner [
34]) share the same physical machines, and their scheduling, execution, and monitoring are governed by an external resource management service.
Data and State Management. The data model (i.e., the structure of data and state elements) is a key distinguishing characteristic of DMSs, which we use to organize our discussion in Sections
4.2 and
4.3. Some systems explicitly consider the temporal dimension of data elements. For instance, some wide columns stores associate timestamps to elements and let users store multiple versions of the same element, whereas time-series databases are designed to efficiently store and query sequences of measurements over time. DMSs differ in terms of storage medium and structure. We detail the choices of the different classes of systems in Section
4.2 and Section
4.3, but we can identify some common concerns and design strategies. First, the representation of state on storage is governed by the data model and the expected access pattern: relational tables are stored by row, whereas time series are stored by column to facilitate computations on individual measurements over time (e.g., aggregation, trend analysis). Second, there is a tension between read and write performance: reads can be facilitated by indexed data structures such as B-trees, but they incur higher insertion and update costs. Hierarchical structures such as LSM trees improve write performance by buffering updates in higher-level logs (frequently in-memory) that are asynchronously merged into lower-level indexed structures, at the expense of read performance, due to the need to navigate multiple layers. Third, most DMSs exploit main memory to reduce data access latency. For instance, systems based on LSM trees store the write buffer in memory. Similarly, most systems that use disk-based storage frequently adopt some in-memory caching layer. Finally, some systems adopt a modular architecture that supports different storage layers. This is common in DMSs offered as a service in public cloud environments (e.g., Amazon Aurora [
95]) or in private data centers (e.g., Google Spanner [
34]), where individual system components (including the storage layer) are themselves services.
Tasks always communicate using direct, ephemeral connections, which implement a partitioned, non-replicated data bus. Shared state is always partitioned across workers to enable concurrent execution of tasks. Most DMSs also adopt state replication to improve read access performance, with different guarantees in terms of consistency between replicas: NoSQL databases provide weak (or configurable) consistency guarantees to improve availability and reduce response time. NewSQL databases provide strong consistency by using leader-based or consensus protocols and by restricting the types of transactions (jobs) that can read from non-leader replicas—typically snapshot transactions, which are a subset of read-only transactions that read a consistent version of the state, without guarantees that it is the most recent one. Group atomicity and isolation are typically absent or optional in NoSQL databases, or restricted to very specific cases, such as jobs that operate on single data elements. Instead, NewSQL databases provide strong guarantees for atomicity and isolation, at the cost of blocking coordination. The transactional semantics of NewSQL systems ensures exactly once delivery. Indeed, transactions (jobs) either complete successfully (and their effects become durable) or abort, in which case they are either automatically retried or the client is notified and can decide to retry them until success. Conversely, NoSQL systems frequently offer at most once semantics, as they do not guarantee that the results of job execution are safely stored on persistent storage or replicated. In some cases, users can balance durability and availability by selecting the number of replicas that are updated synchronously. Finally, systems that support timestamps use event time semantics, where timestamps are associated with state elements by clients, whereas none of the systems provides order guarantees. Even in the presence of timestamps, DMSs do not implement mechanisms to account for elements produced or received out of timestamp order.
Fault Tolerance. Frequently, DMSs offer multiple mechanisms for fault tolerance and durability that administrators can enable and combine depending on their needs. Fault detection can be centralized (leader-worker) or distributed, depending on the specific system. Fault recovery mostly targets the durability of shared state. Since jobs are lightweight, DMSs simply abort them in the case of failure and do not attempt to recover (partial) computations. Transactional systems guarantee group atomicity: in the case of failure, none of the effects of a job become visible. To enable recovery of failed jobs, they either notify the clients about an abort, allowing them to restart the failed job, or restart the job automatically. Almost all DMSs adopt logging mechanisms to ensure that the effects of jobs execution on shared state are durable. Logging enables changes to be recorded on some durable append-only storage before being applied to shared state: most systems adopt a WAL that records changes to individual data elements, whereas a few others adopt a command log that stores the operations (commands) that perform the change. Individual systems make different assumptions on what they consider as durable storage: in some cases the logs are stored on a single disk, but more frequently they are replicated (or saved on third-party log services that are internally replicated). Logging is frequently used in combination with replication of shared state portions on multiple workers: in these cases, each worker stores its updates on a persistent log to recover from software crashes, whereas replication on other workers may avoid unavailability in the case of hardware crashes or network disconnections. In addition, most systems offer geo-replication for disaster recovery, where the entire shared state is replicated in a different data center and periodically synchronized with the working copy. Similarly, many systems provide periodic or manual checkpointing to store a copy of the entire database at a given point in time. Depending on the specific mechanisms adopted, DMSs provide either no guarantees for state, such as in the case of asynchronous replication, or same state guarantees, such as in the case of persistent log or consistent replication.
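The log-before-apply discipline can be illustrated with a minimal sketch, assuming a deterministic single-threaded setting; the log here is just a Python list standing in for durable append-only storage.

```python
# Sketch of write-ahead logging: the change is appended to a durable log
# before being applied to shared state, so replaying the log after a crash
# reconstructs the state. A command log would instead append the operation
# itself (e.g., ("debit", "balance:alice", 20)) and re-execute it, which
# assumes deterministic commands.
log = []      # stands in for durable, append-only storage
state = {}    # in-memory shared state, lost on a crash

def apply_with_wal(key, value):
    log.append(("set", key, value))   # 1. make the change durable
    state[key] = value                # 2. only then mutate the state

def recover():
    recovered = {}
    for op, key, value in log:
        if op == "set":
            recovered[key] = value    # last write wins, following log order
    return recovered

apply_with_wal("balance:alice", 100)
apply_with_wal("balance:alice", 80)
state.clear()                         # simulate a crash losing memory
assert recover() == {"balance:alice": 80}
```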
Dynamic Reconfiguration. Most DMSs support adding and removing workers at runtime and migrating shared state portions across workers, which enables dynamic load balancing and scaling without restarting the system. All systems support dynamic reconfiguration as a manual administrative procedure, and some can also automatically migrate state portions for load balancing. A special case of reconfiguration for NewSQL systems that rely on structured state (in particular, relational systems) involves changing the state schema: many systems support arbitrary schema changes as long-running procedures that validate state against the new schema and migrate it while still serving clients using the previous schema. In contrast, systems such as VoltDB [
89] rely on a given state partitioning scheme and prevent changes that violate that scheme.
4.2 NoSQL Systems
Using established terminology, we classify as NoSQL all those DMSs that aim to offer high availability and low response time by relinquishing features and guarantees that require blocking coordination between workers. In particular, they typically do the following. First, they avoid expressive job definition APIs that may lead to complex execution plans (as in the case of SQL, hence the name) [
85]. In fact, the majority of the systems we analyzed focus on jobs comprising a single task that operates on an individual element of the shared state. Second, they use asynchronous replication protocols: depending on the specific protocol, this may affect consistency, may affect durability (if replication is used for fault tolerance), and may generate conflicts (when clients are allowed to write to multiple replicas). Third, they abandon or restrict group guarantees (atomicity and isolation) when jobs with multiple tasks are supported. In our discussion, we classify systems by the data model they offer.
4.2.1 Key-Value Stores.
Key-value stores offer a very basic API for managing shared state:
(i) shared state elements are represented by a key and a value; (ii) elements are schema-less, meaning that different elements can have different formats, such as free text or JSON objects with heterogeneous attributes; (iii) the key-space is partitioned across workers; and (iv) jobs consist of a single task that retrieves the value of an element given a key (get) or inserts/updates an element given its key (put).
Individual systems differ in the way they physically store elements. For instance, Dynamo [
39] supports various types of physical storage, DynamoDB uses B-trees on disk but buffers incoming updates in main memory to improve write performance, and Redis [
64] stores elements in memory only. Keys are typically partitioned across workers based on their hash (hash partitioning), but some systems also support range partitioning, where each worker stores a sequential range of keys, as in the case of Redis. Given the focus on latency, some systems cache the association of keys to workers client-side, allowing clients to directly forward requests to workers responsible for the key they are interested in.
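A minimal sketch of the two partitioning schemes follows; worker names and key ranges are made up for illustration.

```python
# Sketch of the two key-partitioning schemes. Hash partitioning spreads
# keys uniformly across workers; range partitioning keeps adjacent keys on
# the same worker, which helps range scans.
import hashlib

WORKERS = ["worker-0", "worker-1", "worker-2"]

def hash_partition(key):
    digest = hashlib.md5(key.encode()).digest()
    return WORKERS[digest[0] % len(WORKERS)]

# One contiguous key range per worker; '{' is the character after 'z'.
RANGES = [("a", "i"), ("i", "q"), ("q", "{")]

def range_partition(key):
    for worker, (lo, hi) in zip(WORKERS, RANGES):
        if lo <= key < hi:
            return worker
    raise KeyError(key)

print(hash_partition("user:42"))   # neighboring keys scatter across workers
print(range_partition("melissa"))  # neighboring keys stay together -> fast scans
```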
Some systems provide richer APIs to simplify the interaction with the store. First, keys can be organized into tables, mimicking the concept of a table in a relational database. For instance, DynamoDB and PNUTS [
33] let developers split state elements into tables. Second, elements may have an associated structure. For instance, PNUTS lets developers optionally define a schema for each table, DynamoDB specifies a set of attributes but does not constrain their internal structure, and Redis provides built-in data types to define values and represent them efficiently in memory. Third, most systems provide functions to iterate (
scan) on the key-space or on individual tables (range-based partitioning may be used to speed up such range-based iterations), as is the case for DynamoDB, PNUTS, and Redis. Fourth, some systems provide query operations (
select) to retrieve elements by value, and in some cases these operations are supported by secondary indexes that are automatically updated when the value of an element changes, as in DynamoDB.
All key-value stores replicate shared state to increase availability but use different replication protocols. Dynamo, Voldemort, and Riak KV use a quorum approach, where read and write operations for a given key need to be processed by a given number of replica workers responsible for that key. After a write quorum is reached, updates are propagated to the remaining replicas asynchronously. A larger number of replicas and a larger write quorum better guarantee durability and consistency at the cost of latency and availability. However, being designed for availability, these systems adopt mechanisms to avoid blocking writes when some workers are not responsive: for instance, other workers can step in for unresponsive ones and store written values on their behalf. In some cases, these mechanisms can lead to conflicting simultaneous updates: Dynamo tracks causal dependencies between writes to solve conflicts automatically whenever possible, and stores conflicting versions otherwise, leaving manual reconciliation to the users. DynamoDB, Redis, and Aerospike [
84] use single-leader protocols where all writes are processed by one worker and propagated synchronously to some replicas and asynchronously to others, depending on the configuration. DynamoDB also supports consistent reads that are always processed by the leader at a per-operation granularity. Redis supports multi-leader replication in the case of wide-area scenarios, using conflict-free replicated data types for automated conflict resolution.
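The quorum approach can be summarized with the classic R + W > N rule: every read quorum overlaps the latest write quorum. The following sketch, with illustrative parameters, shows why the overlap returns the latest value; real protocols add asynchronous propagation to the remaining replicas, failure handling, and conflict tracking.

```python
# Sketch of quorum replication with N replicas: writes wait for W acks,
# reads contact R replicas; R + W > N makes read and write quorums overlap,
# so a read always sees the latest acknowledged write.
N, W, R = 3, 2, 2
values   = [{} for _ in range(N)]   # one dict per replica worker
versions = [{} for _ in range(N)]   # per-replica version numbers

def write(key, value, version):
    acks = 0
    for i in range(N):
        values[i][key] = value      # in a real protocol, replicas beyond
        versions[i][key] = version  # the quorum are updated asynchronously
        acks += 1
        if acks == W:
            return                  # acknowledge once the quorum is met

def read(key):
    # Ask R replicas and keep the freshest answer (highest version).
    answers = [(versions[i].get(key, 0), values[i].get(key)) for i in range(R)]
    return max(answers, key=lambda a: a[0])[1]

write("k", "v1", version=1)
assert read("k") == "v1"            # quorum overlap yields the latest value
```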
In summary, key-value stores represent the core building block of a DMS. They expose a low-level but flexible API to balance availability, consistency, and durability, and to adapt to different deployment scenarios. Systems that adopt a modular implementation can use key-value stores as a storage layer or a caching layer [
73] and build richer job definition APIs, job execution engines, and protocols for group atomicity, group isolation, and consistent replication on top.
4.2.2 Wide-Column Stores.
Wide-column stores organize shared state into tables (multi-dimensional maps), where each row associates a unique key with a fixed number of column families, and each column family contains a value, possibly organized into more columns (attributes). State is physically stored per column family, and keys need not have a value for each column family (the table is typically sparse). One could define the wide-column data model as a middle ground between the key-value and the relational model: it is similar to the key-value model but associates a key with multiple values (column families), and it organizes data into tables like the relational model, but tables are sparse and lack referential integrity. The main representative systems of this class are Google BigTable [
29], with its open source implementation HBase, and Apache Cassandra [
57]. As the official documentation of Cassandra explains, the typical use of wide-column systems is to compute and store answers to frequent queries (read-only jobs) for each key, at insertion/update time, within column families. In contrast, relational databases normalize tables to avoid duplicate columns and compute results at query time (rather than insertion/update time) by joining data from multiple tables. In fact, wide-column stores offer rich APIs to scan, select, and update values by key but do not offer any join primitive. To support the preceding scenario, wide-column systems
(i) aim to provide efficient write operations to modify several column families (i.e., both BigTable and Cassandra adopt LSM trees for storage and improve write latency by buffering writes in memory) and (ii) provide isolation for operations that involve the same key.
These two design choices allow users to update all entries for a given key (answers to queries) efficiently and in isolation.
BigTable and Cassandra have different approaches to replication. BigTable uses replication only for fault tolerance and executes all tasks that involve a single key on the leader worker responsible for that key. It also supports wide-area deployment by fully replicating the data store in additional data centers: replicas in these data centers can be used for fault tolerance but also to perform jobs, in which case they are synchronized with eventual consistency. Cassandra uses quorum replication as in Dynamo, and allows users to configure the quorum protocols to trade consistency and durability for availability.
4.2.3 Document Stores.
Document stores are a special type of key-value store where values are structured documents, such as XML or JSON objects. Document stores offer an API similar to that of key-value stores, but they can exploit the structure of documents to update only some of their fields. Physical storage solutions vary across systems, ranging from disk-based to in-memory, hybrid, and storage-agnostic solutions. In most cases, document stores support secondary indexes to improve retrieval of state elements using criteria different from the primary key. Most document stores offer group isolation guarantees for jobs that involve a single document. This is the case for MongoDB [
32] and AsterixDB [
9]. Recent versions of MongoDB also implement multi-document atomicity and isolation as an option, using blocking protocols.
MongoDB supports replication for fault tolerance and, optionally, to serve read-only jobs. It implements a single-leader protocol with semi-synchronous propagation of changes, where clients can configure the number of replicas that need to synchronously receive an update, thus trading durability and consistency for availability and response time. CouchDB [
10] offers a quorum-based replication protocol and allows for conflicts when a small write quorum is selected; in this case, conflict resolution is manual. AsterixDB does not currently support replication.
Several document stores support some form of data analytics jobs. MongoDB offers jobs in the form of a pipeline of data transformations that can be applied in parallel to a set of documents. CouchDB focuses on Web applications and can start registered jobs when documents are added or modified, to update views. AsterixDB provides a declarative language that integrates operators for individual documents and for multiple documents (e.g., joins, group by), and compiles jobs into a dataflow execution plan.
4.2.4 Time-Series Stores.
Time-series stores are a special form of wide-column stores dedicated to storing sequences of values over time, such as measurements of a numeric metric (e.g., the CPU utilization of a computer). Given the specific application scenario, this class of systems stores data by column, which brings several advantages:
(i) together with the use of an in-memory or hybrid storage layer, it improves the performance of write operations, which typically append new values (measurements) to individual columns; (ii) it offers faster sequential access to columns, which is common in read-only jobs that perform aggregations or look for temporal patterns over individual series; and (iii) it enables a higher degree of data compression, such as by storing only the difference between adjacent numerical values (delta compression), which is small if measurements change slowly.
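As an illustration of advantage (iii), the following sketch shows delta compression over a slowly changing metric; real systems additionally pack the resulting small deltas into a compact bit-level representation.

```python
# Sketch of delta compression for a time series: store the first value,
# then only successive differences. Slowly changing measurements yield
# small deltas that can be encoded in few bits.
def delta_encode(values):
    deltas = [values[0]]
    for prev, cur in zip(values, values[1:]):
        deltas.append(cur - prev)
    return deltas

def delta_decode(deltas):
    values = [deltas[0]]
    for d in deltas[1:]:
        values.append(values[-1] + d)
    return values

cpu = [62, 63, 63, 64, 62, 62, 61]      # slowly changing metric
encoded = delta_encode(cpu)
assert encoded == [62, 1, 0, 1, -2, 0, -1]
assert delta_decode(encoded) == cpu     # lossless round trip
```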
Among the time-series stores we analyzed, InfluxDB is the most general one. It provides a declarative job definition language that supports computations on individual columns (measurements). Gorilla [
75] is used as an in-memory cache to store monitoring metrics at Facebook. Given the volume and rate at which metrics are produced, Facebook keeps the most recent data at a very fine granularity within the Gorilla cache and stores historical data at a coarser granularity in HBase. Peregreen [
96] follows a similar approach and optimizes retrieval of data through indexing. It uses three-tier data indexing, where each tier pre-computes aggregate statistics (minimum, maximum, average, etc.) for the data it references. This makes it possible to quickly identify chunks of data that satisfy some conditions based on the pre-computed statistics and to minimize the number of interactions with the storage layer. Monarch [
3] is used to store monitoring data at Google. It has a hierarchical architecture: data is stored in the zone (data center) in which it is generated and sharded (by key ranges, lexicographically) across nodes called
leaves. Jobs are evaluated hierarchically: nodes are organized in three layers (global, zone level, leaves), and the job plan pushes tasks as close as possible to the data they need to consume. All time-series stores we analyzed replicate shared state to improve availability and performance of read operations. To avoid blocking write operations, they adopt asynchronous or semi-synchronous replication, thus reducing durability guarantees. This is motivated by the specific application scenarios, where losing individual measurements may be tolerated.
4.2.5 Graph Stores.
Graph stores are a special form of key-value stores specialized in graph-shaped data, meaning that shared state elements represent entities (vertices of a graph) and their relations (edges of the graph). Although researchers widely recognize the importance of large-scale graph data structures, several graph stores do not scale horizontally [
80]. A prominent example of distributed graph store is TAO [
20], used at Facebook to manage the social graph that interconnects users and other entities such as posts, locations, and actions. It builds on top of key-value stores with hybrid storage (persisted on disk and cached in memory), asynchronously replicated with no consistency or grouping guarantees.
A key distinguishing factor in graph stores is the type of queries (read-only jobs) they support. Indeed, a common use of graph stores is to retrieve sub-graphs that exhibit certain patterns of relations. For instance, in a social graph, one may want to retrieve people (vertices) that are direct friends or have friends in common (friendship relation edges) and like the same posts. This problem is denoted as
graph pattern matching, and its general form can only be solved by systems that can express iterative or recursive jobs, as it needs to traverse the graph following its edges. These types of vertex-centric computations were first introduced in the Pregel DPS [
66], also discussed in Section
5.
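As a toy illustration of pattern matching by traversal, the following sketch (over a made-up friendship graph) finds pairs of people who are not direct friends but share at least one friend; note the two hops along edges, which generalize to the iterative traversals mentioned above.

```python
# Sketch of a simple graph pattern query: find pairs of people who are not
# direct friends but have a friend in common. General pattern matching
# requires this kind of traversal along edges, possibly iterated.
friends = {
    "ann":  {"bob", "carl"},
    "bob":  {"ann", "dana"},
    "carl": {"ann", "dana"},
    "dana": {"bob", "carl"},
}

def friends_in_common(graph):
    pairs = set()
    for person, direct in graph.items():
        for friend in direct:                    # hop 1
            for fof in graph[friend]:            # hop 2
                if fof != person and fof not in direct:
                    pairs.add(tuple(sorted((person, fof))))
    return pairs

print(friends_in_common(friends))  # {('ann', 'dana'), ('bob', 'carl')}
```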
Efficient query of graph stores can also be supported by external systems. For instance, Facebook developed the Unicorn [
35] system to store indexes that make it possible to quickly navigate and retrieve data from a large graph. Indexes are updated periodically using an external compute engine. Unicorn adopts a hierarchical architecture, where indexes (the shared state of the system) are partitioned across servers and the results of index lookups (read jobs) are aggregated first at the level of individual racks and then globally to obtain the complete query results. This approach aggregates results as close as possible to the servers producing them to reduce network traffic. Unicorn supports graph pattern queries by providing an
apply function that can dynamically start new lookups based on the results of previous ones: our model captures this feature by saying that jobs can dynamically start new tasks.
4.3 NewSQL Systems
NewSQL systems aim to provide transactional semantics (group atomicity and isolation), durability (fault tolerance), and strong replication consistency while preserving horizontal scalability. Following the same approach we adopted in Section
4.2, we organize them according to their data model.
4.3.1 Key-Value Stores.
NewSQL key-value stores are conceived as part of a modular system: the store offers primitives to read and update a group of elements with atomicity and isolation guarantees, and it is used by a job manager that compiles and optimizes jobs written in some high-level declarative language. A common design principle of these systems is to separate the layer that manages the transactional semantics from the actual storage layer, thus enabling independent scaling based on the application requirements. Deuteronomy [
59] implements transactional semantics using a locking protocol and is storage agnostic. FoundationDB [
100] uses optimistic concurrency control with a storage layer based on B-trees. Solar [
101] also uses optimistic concurrency control with LSM trees.
4.3.2 Structured and Relational Stores.
Stores for structured and relational data provide the same data model, job model, and job execution semantics as classic non-distributed relational databases. As we clarify in the following classification, they differ in their protocols for implementing group atomicity, group isolation, and replication consistency, which is reflected in their architectures.
Time-Based Protocols. Some systems exploit physical (wall-clock) time to synchronize nodes. This approach was pioneered by Google’s Spanner [
34]. It adopts standard database techniques: two-phase commit for atomicity, two-phase locking and multi-version concurrency control for isolation, and single-leader synchronous replication of state portions. Paxos consensus is used to elect a leader for each state portion and to keep replicas consistent. The key distinguishing characteristic of Spanner is the use of TrueTime, a clock abstraction that uses atomic clocks and GPS to return physical time within a known precision bound. In Spanner, each job is managed by a transaction coordinator, which assigns each job a timestamp at the end of the TrueTime clock uncertainty range and waits until this timestamp has passed for all nodes in the system. This ensures that jobs are globally ordered by timestamp, thus offering the illusion of a centralized system with a single clock (external consistency). Spanner is highly optimized for workloads with many read-only jobs. Indeed, multi-version concurrency control combined with TrueTime allows read-only jobs to access a consistent snapshot of the shared state without locking and without conflicting with in-progress read-write jobs, as those are certain to be assigned a later timestamp. More recently, Spanner has been extended with support for distributed SQL query execution [
14]. CockroachDB [
90] is similar to Spanner but uses an optimistic concurrency control protocol that, in the case of conflicts, attempts to modify the timestamp of a job to a valid one rather than re-executing the entire job. Like Spanner, CockroachDB supports distributed execution plans. It supports wide-area deployment and allows users to define how data is partitioned across regions, to promote locality of data access or to enforce privacy regulations.
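The commit-wait idea behind time-based protocols can be sketched as follows, assuming an uncertain clock that returns an interval guaranteed to contain true time; the uncertainty bound and the busy-wait loop are illustrative simplifications of what TrueTime and Spanner actually implement.

```python
# Sketch of commit wait on top of an uncertain clock. The clock returns an
# interval [earliest, latest] guaranteed to contain true time; a transaction
# takes `latest` as its timestamp and waits until that timestamp is certainly
# in the past before making its writes visible, yielding a global order.
import time

EPSILON = 0.004              # assumed clock uncertainty (4 ms), illustrative

def now_interval():
    t = time.monotonic()
    return (t - EPSILON, t + EPSILON)

def commit(apply_writes):
    _, latest = now_interval()
    commit_ts = latest                        # timestamp at end of interval
    while now_interval()[0] <= commit_ts:     # commit wait: about 2 * EPSILON
        time.sleep(EPSILON / 4)
    apply_writes()                            # now safely ordered by timestamp
    return commit_ts

ts = commit(lambda: print("writes visible"))
print(f"committed at {ts:.6f}")
```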
Deterministic Execution. Calvin [
92] builds on the assumption that jobs are deterministic and achieves atomicity, isolation, and consistency by ensuring that jobs are executed in the same order in all replicas. Determinism ensures that jobs either succeed or fail in any replica (atomicity) and interleave in the same way (global order ensures isolation), leading to the same results (consistency). Workers are organized into three layers. The first is a sequencing layer that receives jobs invocations from clients, organizes them into batches, and orders them consistently across replicas. Ordering of jobs is the only operation that requires coordination and takes place before jobs execution. Calvin provides both synchronous (Paxos) and asynchronous protocols for ordering jobs, which bring different tradeoffs between latency and cost of recovery in the case of failures. The second is a scheduler layer that executes tasks on workers in the defined global order. In cases where it is not possible to statically determine which shared state portions will be involved in the execution of a job (i.e., in the case of state-dependent control flow), Calvin uses an optimistic protocol and aborts jobs if some of their tasks are received by workers out of order. The third is a storage layer that stores the actual data. In fact, Calvin supports any storage engine providing a key-value interface.
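A minimal sketch of the deterministic-execution idea follows: once a batch of invocations is given an agreed-upon global order, every replica can execute it independently and still converge. The invocation format and the ordering rule are made up for illustration.

```python
# Sketch of deterministic execution: a sequencing layer fixes a global order
# for a batch of (deterministic) jobs, and every replica applies the same
# batch in the same order, reaching the same state with no coordination
# during execution.
def sequence(invocations):
    # Any agreed-upon total order works; here we sort by (client, seqno).
    return sorted(invocations, key=lambda inv: (inv["client"], inv["seq"]))

def execute(batch, state):
    for inv in batch:                      # same order on every replica
        key, delta = inv["key"], inv["delta"]
        state[key] = state.get(key, 0) + delta
    return state

invocations = [
    {"client": "c2", "seq": 1, "key": "x", "delta": +5},
    {"client": "c1", "seq": 1, "key": "x", "delta": -2},
]
batch = sequence(invocations)
replica_a = execute(batch, {})
replica_b = execute(batch, {})
assert replica_a == replica_b == {"x": 3}  # replicas converge by construction
```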
Explicit Partitioning and Replication Strategies. VoltDB [
88,
89] lets users control partitioning and replication of shared state, so they can optimize most frequently executed jobs. For instance, users can specify that
Customer and
Payment tables are both partitioned by the attribute (column)
customerId. Jobs that are guaranteed to access only a shared state portion within a given worker are executed sequentially and atomically on that worker. For instance, a job that accesses tables
Customer and
Payment to retrieve information for a given
customerId can be fully executed on the worker with the state portion that includes that customer. Every table that is not partitioned is replicated in every worker, which optimizes read access from any worker at the cost of replicating state changes. When jobs need to access state portions at different workers, VoltDB resorts to standard two-phase commit and timestamp-based concurrency control protocols. Unlike Spanner and Calvin, VoltDB provides strong consistency only for cluster deployments: geographical replication is supported but only implemented with asynchronous and weakly consistent protocols.
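The single-partition fast path can be sketched as follows, assuming both tables are partitioned by customerId with the same (illustrative) hash rule: a job whose keys map to one partition runs sequentially on one worker, and only cross-partition jobs pay for two-phase commit.

```python
# Sketch of explicit co-partitioning: Customer and Payment rows are both
# partitioned by customerId with the same rule, so a job touching a single
# customer lands on one worker and needs no distributed commit.
N_WORKERS = 4

def partition_of(customer_id):
    # Stable toy hash; any deterministic rule shared by both tables works.
    return sum(map(ord, customer_id)) % N_WORKERS

def route(customer_ids):
    partitions = {partition_of(c) for c in customer_ids}
    if len(partitions) == 1:
        return f"single-partition job on worker {partitions.pop()}"
    return f"multi-partition job: two-phase commit across {sorted(partitions)}"

print(route(["cust-17"]))             # runs sequentially on one worker
print(route(["cust-17", "cust-42"]))  # pays the coordination cost
```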
Primary-Based Protocols. Primary-based protocols are a standard approach to replication used in traditional transactional databases. They elect one primary worker that handles all read-write jobs and acts as a coordinator to ensure transactional semantics. Other (secondary) workers only handle read-only jobs and can be used to fail over if the primary crashes. Recently, the approach has been revamped by DMSs offered as services on the cloud. These systems adopt a layered architecture that decouples jobs execution functionalities (e.g., scheduling, managing atomicity and isolation) from storage functionalities (durability): the two layers are implemented as services that can scale independently from each other. The execution layer still consists of one primary worker and an arbitrary number of secondary workers, which access shared state through the storage service (although they typically implement a local cache to improve performance). Amazon Aurora [
95] implements the storage layer as a sequential log (replicated for availability and durability), which offers better performance for write operations. Indexed data structures that improve read performance are materialized asynchronously without affecting write latency. The storage layer uses a quorum approach to guarantee replication consistency across workers. Microsoft Socrates [
11] adopts a similar approach but further separates storage into a log layer (that stores write requests with low latency), a durable storage layer (that stores a copy of the shared state), and a backup layer (that periodically copies the entire state).
4.3.3 Object Stores.
Object stores became popular in the early 1990s, inheriting the same data model as object-oriented programming languages. We found one recent example of a DMS that uses this data model, namely Tango [
16]. In Tango, clients store their view of objects locally, in-memory, and this view is kept up-to-date with respect to a distributed (partitioned) and durable (replicated) log of updates. The log represents the primary replica of the shared state that all clients refer to. All updates to objects are globally ordered on the log through sequence numbers that are obtained through a centralized sequencer. Total order guarantees isolation for operations on individual objects: Tango also offers group atomicity and isolation across objects using the log to store information for an optimistic concurrency control protocol.
4.3.4 Graph Stores.
We found one example of a NewSQL graph store, named
A1 [
22], which provides strong consistency, atomicity, and isolation using timestamp-based concurrency control. Its data model is similar to that of NoSQL distributed graph stores, and jobs can traverse the graph and read and modify its associated data during execution. The key distinguishing characteristic of A1 is that it builds on a distributed shared memory abstraction that uses RDMA (remote direct memory access) implemented within network interface cards [
41].
5 Data Processing Systems
DPSs aim to perform complex computations (long-lasting jobs) on large volumes of data. Most of today’s DPSs inherit from the seminal MapReduce system [
38]: to avoid the hassle of concurrent programming and to simplify scalability, they organize each job into a dataflow graph where vertices are functional operators that transform data and edges are the flows of data across operators. Each operator is applied in parallel to independent partitions of its input data, and the system automatically handles data partitioning and data transfer across workers. Following an established terminology, we denote as
batch processing systems those that take as input static (finite) datasets, and
stream processing systems those that take as input streaming (potentially unbounded) datasets. In practice, many systems support both types of input, and we do not use the distinction between batch and stream processing as the main factor to organize our discussion. Instead, after discussing the aspects in our model that are common to all DPSs (Section
5.1), we classify dataflow systems based on the key aspect that impacts their implementation: if they deploy individual tasks on activation (Section
5.2) or entire jobs on registration (Section
5.3). Finally, we present systems designed to support computations on large graph data structures. They evolved in parallel with dataflow systems, which originally were not suited for the iterative computations that are typical of graph algorithms (Section
5.4).
5.1 Overview
Functional Model. Most DPSs use a leader-workers architecture, where one of the processes that compose the system (denoted the leader) has the special role of coordinating other workers. Such systems always allow submitting the driver program to the leader for system-side execution. Some of them also allow client-side driver execution, such as Apache Spark [
99] and Apache Flink [
24]. Other systems, such as Kafka Streams [
18] and timely dataflow [
71], are implemented as libraries where client processes also act as workers. Developers start one or more client processes, and the library handles the distributed execution of jobs onto them. Stream processing systems support asynchronous invocation of (continuous) jobs, whereas batch processing systems may offer synchronous or asynchronous job invocation APIs, or both. All DPSs support sources and sinks, as they are typically used to read data from external systems (sources), perform some complex data analysis and transformation (jobs), and store the results into external systems (sinks). Sources are passive in the case of batch processing systems and active in the case of stream processing systems. Most batch processing systems are stateless: output data is the result of functional transformations of input data. Stream processing systems can persist a (task) state across multiple activations of a continuous job. We model iterative graph algorithms as continuous jobs where tasks (associated with vertices, edges, or sub-graphs) are activated at each iteration and store their partial results (values associated with vertices, edges, or sub-graphs) in task state. DPSs assume a cluster deployment, as job execution typically involves exchanging large volumes of data (input, intermediate results, and final results) across workers.
Jobs. All dataflow systems provide libraries to explicitly define the execution plan of jobs. Increasingly often, they also offer higher-level abstractions for specific domains, such as relational data processing [
2,
12,
23], graph computations [
46], or machine learning [
69]. Some of these APIs are declarative in nature and make the definition of the execution plan implicit. Task communication is always implicitly defined and controlled by the system’s runtime. Concerning jobs, dataflow systems differ with respect to the following aspects. The first is generality. MapReduce [
38] and some early systems derived from it only support two processing stages with fixed operators, whereas later systems like Spark support any number of processing stages and a vast library of operators. The second is support for iterations. Systems like HaLoop [
21] extended MapReduce to efficiently support iterations by caching data accessed across iterations in workers. Spark inherits the same approach and, together with Flink, supports some form of iterative computations for streaming data. Timely dataflow [
71] generalizes the approach to nested iterations. The third is dynamic creation of tasks. Among the systems we analyzed, only CIEL [
72] enables dynamic creation of tasks depending on the results of processing.
Data parallelism is key to dataflow systems, and all operators in their job definition API are data parallel. Jobs are one-shot in the case of batch processing and continuous in the case of stream processing. In the latter case, jobs may implicitly define some task state, such as by expressing computations that operate on a recent portion (window) of data rather than on individual data elements. Jobs cannot control task placement explicitly, but many systems provide configuration parameters to guide placement decisions—for instance, to force or inhibit the colocation of certain tasks. Graph processing systems are an exception to the preceding rules: they are based on a programming model where developers define the behavior of individual vertices [
68]: the model provides explicit primitives to access the state of a vertex and to send messages between vertices (explicit communication).
All DPSs compile jobs on driver execution. For other characteristics related to jobs compilation, deployment, and execution, we distinguish between systems that perform task-level deployment (discussed in Section
5.2) and systems that perform job-level deployment (discussed in Section
5.3).
Data and State Management. Deployment and execution strategies affect the implementation of the data bus. In the case of job-level deployment, the data bus is implemented using ephemeral, push-based communication channels between tasks (e.g., direct TCP connections). In the case of task-level deployment, the data bus is mediated and implemented by a persistent service (e.g., a distributed filesystem or a persistent message queuing system) where upstream tasks push the results of their computation and downstream tasks pull them when activated. A persistent data bus can be replicated for fault tolerance, as in the case of Kafka Streams, which builds on replicated Kafka topics. CIEL [
72] and Dryad [
51] support hybrid bus implementations, where some connections may be direct while others may be mediated. Data elements may range from arbitrary strings (unstructured data) to specific schemas (structured data). The latter offer opportunities for optimizations in the serialization process, such as allowing for better compression or for selective deserialization of only the fields that are accessed by a given task. In general, DPSs do not provide shared state. Stream processing and graph processing systems include a task state to persist information across multiple activations of a continuous job (i.e., windows). In the absence of shared state, DPSs do not provide group atomicity or isolation properties. Almost all systems provide exactly once delivery, under the assumption that sources can persist and replay data in the case of failure and sinks can distinguish duplicates. The concrete approaches to provide such guarantee depend on the type of deployment (task level or job level) and are discussed later in Sections
5.2 and
5.3. Order is relevant for stream processing systems: with the exception of Storm [
94], all stream processing systems support timestamped data (event or ingestion time semantics). Most systems deliver events in order, under the assumption that sources either produce data with a pre-defined maximum delay or inform the system about the progress of time using special metadata denoted as watermarks. Kafka Streams takes a different approach: it does not wait for out-of-order elements and immediately produces results. When new elements arrive out of order, it retracts and updates the previous results.
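A minimal sketch of watermark-driven ordering follows; the timestamps and the watermark protocol are simplified, and real systems must also decide what to do with elements that arrive after the watermark has passed them (drop them, divert them, or retract results).

```python
# Sketch of watermark-based in-order delivery: elements carry event-time
# timestamps and may arrive out of order; the operator buffers them and
# emits them in timestamp order once a watermark promises that no earlier
# element is still in flight.
import heapq

buffer, emitted = [], []

def on_element(event_time, payload):
    heapq.heappush(buffer, (event_time, payload))

def on_watermark(watermark):
    # The source promises not to emit elements older than `watermark`.
    while buffer and buffer[0][0] <= watermark:
        emitted.append(heapq.heappop(buffer))

on_element(5, "a")
on_element(2, "b")            # out of order, but still before the watermark
on_watermark(4)               # safe to emit everything up to time 4
on_element(7, "c")
on_watermark(10)
print(emitted)                # [(2, 'b'), (5, 'a'), (7, 'c')]
```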
Fault Tolerance. All DPSs detect faults using a leader-worker architecture, and in the absence of shared state, they recover from failures through the mechanisms that guarantee exactly once delivery.
Dynamic Reconfiguration. DPSs use dynamic reconfiguration to adapt to the workload by adding or removing slots. Systems that adopt task-level deployment can decide how to allocate resources to individual tasks when they are activated, whereas systems that adopt job-level deployment need to suspend and resume the entire job, which increases the overhead for performing a reconfiguration. The mechanisms that dynamically modify the resources (slots) available to a DPS can be activated either manually or by an automated service that monitors the utilization of resources and implements the allocation and deallocation policies. All commercial systems implement automated reconfiguration, frequently by relying on external platforms for containerization, such as Kubernetes, or for cluster resources management, such as YARN. The only exceptions for which we could not find official support for automated reconfiguration are Storm [
94] and Kafka Streams [
18].
5.2 Dataflow with Task-Level Deployment
Systems that belong to this class deploy tasks on activation, when their input data becomes available. Tasks store intermediate results on a persistent data bus, which makes it possible to selectively restart them in the case of failure. This approach is best suited for long-running batch jobs and was pioneered in the MapReduce batch processing system [
38]. It has been widely adopted in various extensions and generalizations. HaLoop optimizes iterative computations by caching loop-invariant data and by co-locating tasks that reuse the same data across iterations [
21]. Dryad [
51] generalizes the programming model to express arbitrary dataflow plans and enables developers to flexibly select the concrete channels (data bus in our model) that implement the communication between tasks. CIEL [
72] extends the dataflow model of Dryad by allowing tasks to create other tasks dynamically, based on the results of their computation. Spark is the most popular system of this class [
99]: it inherits the dataflow model of Dryad and supports iterative execution and data caching like HaLoop. Spark Streaming [
98] implements streaming computations on top of Spark by splitting the input stream into small batches and by running the same job for each batch. It implements task state using native Spark features: the state of a task after a given invocation is implicitly stored as a special data item that the task receives as input in the subsequent invocation.
In systems with task-level deployment, job compilation considers dynamic information to create tasks. For instance, the number of tasks instantiated to perform a data-parallel operation depends on how the input data is partitioned. Similarly, the deployment phase uses dynamic information to submit tasks to workers running as close as possible to their input data. Hadoop (the open source implementation of MapReduce) and Spark adopt delay scheduling [
97]. They put jobs (and their tasks) in a FIFO queue. When slots become available, the first task in the queue is selected: if the slot is located near the input data for the task, the task is immediately deployed; otherwise, the task can be postponed for some time, waiting for an available slot closer to its input data. Task-level deployment enables sharing of compute resources with other applications: in fact, most of the systems that use this approach can be integrated with cluster management systems.
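The following sketch captures the essence of delay scheduling under simplified assumptions (a single queued task and a fixed skip budget); the actual Hadoop/Spark schedulers use time-based thresholds and multiple locality levels.

```python
# Sketch of delay scheduling: the task at the head of the FIFO queue prefers
# a slot on the worker that holds its input data; it declines non-local slot
# offers up to MAX_SKIPS times before accepting one anyway.
from collections import deque

MAX_SKIPS = 3

class Task:
    def __init__(self, name, preferred_worker):
        self.name, self.preferred_worker, self.skips = name, preferred_worker, 0

queue = deque([Task("t1", preferred_worker="w2")])

def offer_slot(worker):
    """A slot became free on `worker`; returns the task deployed there."""
    if not queue:
        return None
    task = queue[0]
    if worker == task.preferred_worker or task.skips >= MAX_SKIPS:
        return queue.popleft()          # local slot, or waited long enough
    task.skips += 1                     # decline; hope for a closer slot
    return None

for w in ["w1", "w1", "w2"]:
    task = offer_slot(w)
    print(w, "->", task.name if task else "declined")
# w1 -> declined, w1 -> declined, w2 -> t1 (data-local deployment)
```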
Task-level deployment also influences how systems implement fault tolerance and ensure exactly once delivery of results. Batch processing systems simply re-execute the tasks involved in the failure. In the absence of state, the results of a task depend only on input data and can be recomputed when needed. Intermediate results may be persisted on durable storage and retrieved in the case of a failure or recomputed from the original input data. Spark Streaming [
98] adopts the same fault tolerance mechanism for streaming computations. It segments a stream into a sequence of so-called micro-batches and executes them in order. Task state is treated as a special form of data; it is periodically persisted to durable storage and is retrieved in the case of a failure. Failure recovery may require activating failed tasks more than once, to recompute the task state from the last state persisted before the failure.
Dynamic reconfiguration is available in all systems that adopt task-level deployment. Systems that do not provide any state abstraction can simply exploit new slots to schedule tasks when they become available and remove workers when idle. In the presence of task state, migrating a task involves migrating its state across activations: as in the case of fault tolerance, this is done by storing task state on persistent storage.
5.3 Dataflow with Job-Level Deployment
In the case of job-level deployment, all tasks of a job are deployed onto the slots of the computing infrastructure on job registration. As a result, this class of systems is better suited for streaming computations that require low latency: indeed, no scheduling decision is taken at runtime and tasks are always ready to receive and process new data. Storm [
94] and its successor Heron [
56] are stream processing systems developed at Twitter. They offer a lower-level programming API than dataflow systems discussed previously, asking developers to fully implement the logic of each processing step using a standard programming language. Flink [
24] is a unified execution engine for batch and stream processing. In terms of programming model, it strongly resembles Spark, with a core API to explicitly define job plans as a dataflow of functional operators, and domain-specific libraries for structured (relational) data, graph processing, and machine learning. One notable difference involves iterative computations: Flink supports them with native operators (within jobs) rather than controlling them from the driver program. Timely dataflow [
71] offers a lower-level and more general dataflow model than Flink, where jobs are expressed as a graph of (data-parallel) operators and data elements carry a logical timestamp that tracks global progress. Management of timestamps is explicit, and developers control how operators handle and propagate them, which enables various execution strategies. For instance, developers may choose to complete a given computation step before letting the subsequent one start (mimicking a batch processing strategy as implemented in MapReduce or Spark), or they may allow overlapping of steps (as happens in Storm or Flink). The flexibility of the model allows for complex workflows, including streaming computations with nested iterations, which are hard or even impossible to express in other systems. The preceding systems rely on direct and ephemeral channels (typically, TCP connections) to implement the data bus. Kafka Streams [
18] and Samza [
74], instead, build a dataflow processing layer on top of Kafka [
55] durable channels. In systems that adopt job-level deployment, job compilation and deployment only depend on static information about the computing infrastructure. For instance, the number of tasks for data-parallel operations only depends on the total number of slots made available in workers. As a result, this class of systems does not support sharing resources with other applications: all resources need to be acquired at job compilation, which prevents scheduling decisions across applications at runtime.
We observed three approaches to implement fault tolerance and delivery guarantees. First, systems such as Flink and MillWheel [
7] periodically take a consistent snapshot of the state. The command to initiate a snapshot starts from sources and completes when it reaches the sinks. In the case of failure, the last completed snapshot is restored and sources replay data that was produced after that snapshot, in the original order. If sinks can detect and discard duplicate results, this approach guarantees exactly once delivery. Second, Storm acknowledges each data element delivered between two tasks: developers decide whether to use acknowledgements (and retransmit data if an acknowledgement is lost), providing at least once delivery, or not, providing at most once delivery. Third, Kafka Streams relies on the persistency of the data bus (Kafka): it stores the task state in special Kafka topics and relies on two-phase commit to ensure that upon activation a task consumes its input, updates its state, and produces results for downstream tasks atomically. In the case of failure, a task can resume from the input elements that were not successfully processed, providing exactly once delivery (unless data elements are received out of order, in which case it retracts and updates previous results leading to at least once delivery). These three mechanisms are also used for dynamic reconfiguration, as they allow a system to resume processing after a new deployment.
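The first approach can be sketched as a barrier (snapshot marker) that flows through the same channels as the data, so that every task saves its state at the same input prefix; this is a drastic simplification that ignores multiple inputs and barrier alignment.

```python
# Sketch of barrier-based consistent snapshots: a snapshot marker travels
# through the same channels as data; each task saves its state when the
# marker arrives, so all saved states correspond to the same prefix of the
# input and can be restored together after a failure.
snapshots = {}

class CountingTask:
    def __init__(self, name, downstream=None):
        self.name, self.count, self.downstream = name, 0, downstream

    def process(self, element):
        if element == "BARRIER":
            snapshots[self.name] = self.count   # state at the same prefix
        else:
            self.count += 1                     # task state: elements seen
        if self.downstream:
            self.downstream.process(element)    # forward data and barriers

sink = CountingTask("sink")
head = CountingTask("map", downstream=sink)
for e in ["a", "b", "BARRIER", "c"]:
    head.process(e)

print(snapshots)   # {'map': 2, 'sink': 2} -- a consistent cut
```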
5.4 Graph Processing
Early dataflow systems were not well suited for iterative computations, which are common in graph processing algorithms. To overcome this limitation, an alternative computational model was developed for graph processing, known as vertex-centric [
68]. In this model, pioneered by the Google Pregel system [
66], jobs are iterative: developers provide a single function that encodes the behavior of each vertex
\(v\) at each iteration. The function takes in input the current (local) state of
\(v\) and the set of messages produced for
\(v\) during the previous iteration; it outputs the new state of
\(v\) and a set of messages to be delivered to connected vertices, which will be evaluated during the next iteration. The job terminates when vertices do not produce any message at a given iteration. Vertices are partitioned across workers and each task is responsible for a given partition. Jobs are continuous, as tasks are activated multiple times (once for each iteration) and store the vertex state across activations (in their task state). Tasks only communicate by exchanging data (messages between vertices) over the data bus, which is implemented as direct channels. One worker acts as a leader and is responsible for coordinating the iterations within the job and for detecting possible failures of other workers. Workers persist their state (task state and input messages) at each iteration: in the case of a failure, the computation restarts from the last completed iteration (we sketch this execution loop in code after the following list). Several systems inherit and improve the original Pregel model in various ways:
(i) by using a persistent data bus, where vertices can pull data when executed, to reduce the overhead for broadcasting state updates to many neighbor vertices [
62]; (ii) by decoupling communication and processing in each superstep, to combine messages and reduce the communication costs [
45]; (iii) by allowing asynchronous execution of supersteps, to reduce synchronization overhead and inactive time [
62]; (iv) by optimizing the allocation of vertices to tasks based on topological information, to reduce the communication overhead; (v) by dynamically migrating vertices between tasks (dynamic reconfiguration) across iterations, to keep the load balanced or to place frequently communicating vertices on the same worker [
30]; and (vi) by offering sub-graph centric abstractions, suitable to express graph mining problems that aim to find sub-graphs with given characteristics [
91].
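As anticipated above, here is a minimal single-process sketch of the vertex-centric execution loop, using a made-up three-vertex graph and maximum-value propagation as the per-vertex function.

```python
# Minimal sketch of the vertex-centric execution loop: one user function per
# vertex, executed every superstep with the messages sent to that vertex in
# the previous superstep; the job ends when no messages are produced.
edges = {"a": ["b"], "b": ["a", "c"], "c": ["b"]}
state = {"a": 3, "b": 1, "c": 5}              # per-vertex task state

def compute(vertex, value, incoming):
    """User function: returns new state plus messages for the next superstep."""
    new_value = max([value] + incoming)
    if new_value == value:                    # no change: stay silent
        return new_value, []
    return new_value, [(n, new_value) for n in edges[vertex]]

# Superstep 0: every vertex announces its value to its neighbors.
inboxes = {v: [] for v in state}
for v in state:
    for n in edges[v]:
        inboxes[n].append(state[v])

# The job terminates when no messages were produced in a superstep.
while any(inboxes.values()):
    outboxes = {v: [] for v in state}
    for v in state:
        state[v], messages = compute(v, state[v], inboxes[v])
        for target, payload in messages:
            outboxes[target].append(payload)
    inboxes = outboxes                        # delivered in the next superstep

print(state)                                  # {'a': 5, 'b': 5, 'c': 5}
```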
For the sake of space, we do not discuss all systems derived from Pregel here, but the interested reader can refer to the detailed survey by McCune et al. [
68].
6 Other Systems
This section includes all systems that do not clearly fall in either of the two classes identified previously. Due to their heterogeneity, we do not provide a common overview, but we organize and discuss them within three main classes:
(i) systems that support analytical jobs on top of shared state abstractions, (ii) systems that propose new programming models, and (iii) systems that integrate concepts from both DMSs and DPSs in an attempt to provide a unifying solution.
6.1 Computations on DMSs
DMSs are designed to execute lightweight jobs that read and modify a shared state. We identified a few systems that also support some form of heavyweight job.
6.1.1 Incremental Computations.
Percolator [
76] builds on top of the BigTable column store and incrementally updates its shared state. It adopts observer processes that periodically scan the shared state: when they detect changes, they start a computation that may update other tables with its results. In Percolator, computations are broken down into a set of small updates to the current shared state. This differentiates it from DPSs, which are not designed to be incremental. For instance, Percolator can incrementally update Web search indexes as new information about Web pages and links becomes available. Percolator jobs may involve multiple shared-state elements, and the system ensures group atomicity using two-phase commit and group isolation using a timestamp-based protocol.
6.1.2 Long-Running Jobs.
F1 [
83] implements a SQL query executor on top of Spanner. It supports long-running jobs, which are compiled to execution plans where the tasks (or at least part of them) are organized into a dataflow to enable distributed execution as in DPSs. F1 also introduces optimistic transactions (jobs), which consist of a read phase to retrieve all the data needed for the computation and a write phase to store the results. The read phase does not block other concurrent jobs, so they can run for a long time (as in the case of analytical jobs). The write phase completes only if no conflicting updates from other jobs occurred during the read phase.
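The optimistic read/write pattern can be sketched with per-element versions: the write phase validates that nothing read has changed in the meantime and aborts otherwise. Keys, versions, and values below are illustrative.

```python
# Sketch of an optimistic read/write job: the read phase records versions
# without blocking anyone; the write phase succeeds only if none of the
# read elements changed during the (possibly long) computation.
db = {"x": ("v1", 1)}        # key -> (value, version)

def read_phase(keys):
    return {k: db[k] for k in keys}

def write_phase(snapshot, updates):
    for k, (_, version) in snapshot.items():
        if db[k][1] != version:          # someone updated it meanwhile
            return False                 # abort; the client may retry
    for k, value in updates.items():
        db[k] = (value, db[k][1] + 1)
    return True

snap = read_phase(["x"])                 # long-running analysis happens here
assert write_phase(snap, {"x": "v2"}) is True
stale = read_phase(["x"])
db["x"] = ("v3", 3)                      # a concurrent writer intervenes
assert write_phase(stale, {"x": "v4"}) is False
```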
6.1.3 Graph Processing.
In graph data stores, long-running jobs appear as computations that traverse multiple hops of the graph (e.g., jobs that search for paths or patterns in the graph) or as iterative analytical jobs (e.g., vertex-centric computations). Trinity [
80] inherits the same model as NoSQL graph stores such as TAO but implements features designed specifically to support long-running jobs. It lets users define the communication protocols that govern the exchange of data over the data bus, to optimize them for each specific job. For instance, data may be buffered and aggregated at the sender or at the receiver. It checkpoints the intermediate state of long-running jobs to resume them in the case of failure.
6.2 New Programming Models
6.2.1 Stateful Dataflow.
The absence of shared mutable state in the dataflow model forces developers to encode all information as data that flows between tasks. However, some algorithms would benefit from the availability of state that can be modified in-place, such as machine learning algorithms that iteratively refine a set of parameters. Thus, several systems propose extensions to the dataflow programming model that accommodate shared mutable state. In
Stateful Dataflow Graphs (SDGs) [
43], developers write a driver program using imperative (Java) code that includes state and methods to access and modify it. Code annotations are used to specify state access patterns within methods. The resulting jobs are compiled into a dataflow graph where operators access the shared state. If possible, state elements are partitioned across workers; otherwise, they are replicated in each worker, and the programming model supports user-defined functions to merge changes applied to different replicas. Deployment and execution rely on a DPS with job-level deployment [
26].
Tangram [
50] implements task-based deployment and allows tasks to access and update an in-memory key-value store as part of their execution. By analyzing the execution plan, Tangram can determine which parts of the computation depend on mutable state and which do not, and it optimizes fault tolerance for the job at hand.
TensorFlow [
1] is a library to define machine learning models. Jobs represent models with transformations (tasks) and variables (shared state elements). As strong consistency is not required for the application scenario, tasks can execute and update variables asynchronously, with only barrier synchronization at each step of an iterative algorithm. TensorFlow was conceived for distributed execution, whereas other machine learning libraries such as PyTorch were initially designed for single-machine execution and later implemented distributed training using the same approach as TensorFlow.
6.2.2 Relational Actors.
ReactDB [
79] extends the actor-based programming model with data management concepts such as relational tables, declarative queries, and transactional semantics. It builds on logical actors that embed state in the form of relational tables. Actors can query their internal state using a declarative language and asynchronously send messages to other actors. ReactDB lets developers explicitly control how the shared state is partitioned across actors. Jobs are submitted to a coordinator actor that governs their execution. The system guarantees transactional semantics for the entire duration of the job, across all actors that are directly or indirectly invoked by the coordinator.
6.3 Hybrid Systems
Several works aim to integrate data management and processing within a unified solution. S-Store [
27] integrates stream processing capabilities within a transactional database. It uses an in-memory store to implement the shared state (visible to all tasks), the task state (visible only to individual tasks of stream processing jobs), and the data bus (where data flowing from task to task of stream processing jobs is temporarily stored). S-Store uses the same concepts as VoltDB [
89] to offer transactional guarantees with low overhead. Data management and stream processing tasks are scheduled on the same engine in an order that preserves transactional semantics and is consistent with the dataflow. S-Store unifies input data (for streaming jobs) and invocations (of data management jobs, in the form of stored procedures): this is in line with the conceptual view we provide in Section
2.
SnappyData [
70] has a similar goal to S-Store but a different programming and execution model. It builds on Spark and Spark Streaming, and augments them with the ability to access a key-value store (shared state) during their execution. To efficiently support heterogeneous types of jobs, SnappyData lets developers select how to organize the shared state in terms of format (row-oriented or column-oriented), partitioning, and replication. It supports group atomicity and isolation using two-phase commit and multi-version concurrency control, and integrates fault detection and recovery mechanisms for Spark tasks and their effects on the shared state.
StreamDB [
31] and TSpoon [
5] take the opposite approach with respect to S-Store by integrating data management capabilities within a stream processor. StreamDB models database queries as stream processing jobs that receive updates from external sources and output new results to sinks. Stream processing tasks can read and modify portions of a shared state: all database queries that need to access a given portion will include the task responsible for that portion. StreamDB ensures group atomicity and isolation without explicit locks: invocations of jobs are timestamped when first received by the system, and each worker executes tasks from different jobs in timestamp order. TSpoon does not provide a shared state but enriches the dataflow model with
(i) the ability to read (query) task state on demand and (ii) transactional guarantees in the access to task state.
Developers can identify portions of the dataflow graph (denoted as transactional sub-graphs) that need to be read and modified with group atomicity and isolation. TSpoon implements atomicity and isolation by decorating the dataflow graph with additional operators that act as transaction managers. It supports different levels of isolation (from read committed to serializable) with different tradeoffs between guarantees and overhead.
Hologres [
53] is used within Alibaba to execute both analytical jobs and interactive jobs. The system is designed to support high-volume data ingestion from external sources and continuous jobs that derive information to be stored in the shared state or to be presented to external sinks. The shared state is partitioned across workers. A worker stores an in-memory representation of the partition it is responsible for and delegates durability to an external storage service. The distinctive features of the system are
(i) a structured data model where state is represented as tables that can be physically stored row-wise or column-wise depending on the access pattern, and (ii) a scheduling mechanism where tasks are deployed and executed onto workers based on load balancing and prioritization of jobs that require low latency.
7 Discussion
In building and discussing our model and taxonomy, we derived several observations. We report them in this section, pointing out ideas for future research.
State and Data Management. The dichotomy between DMSs and DPSs is frequently adopted in the literature but not defined in precise terms. Our model makes the characteristics that contribute to this dichotomy clear and explicit, introducing the state component and a sharp distinction between shared and task state. DMSs offer primitives to read and modify a mutable shared state, whereas DPSs target computationally expensive data transformations and do not support state at all, or support it only within individual tasks. This distinction brings together other differences (which we made explicit with the classification criteria in Table
4) such that the two classes of systems complement each other and are often used in combination to support heterogeneous workloads.
This complementarity pushed researchers to extend their DMSs or DPSs to break the dichotomy, adding features typical of the other class. For instance, some DMSs, such as AsterixDB, support long-lasting queries using the dataflow processing model typical of DPSs, whereas recent versions of stream DPSs, such as Flink and Kafka, have started to offer primitives to access their task state with read-only queries (one-shot jobs). This triggered interesting research on declarative APIs that integrate streaming data and state changes into a unifying abstraction [78]. A few systems, such as S-Store and TSpoon, pushed this effort even further, integrating transactional semantics within stream DPSs.
Future research efforts could continue to explore approaches that extend the capabilities of individual systems, with the goal of better supporting hybrid workloads that demand both state management and data processing capabilities; this would reduce the need to deploy multiple systems and simplify the overall architecture of data-intensive applications.
Coordination Avoidance. In distributed scenarios, the coordination between workers may easily become a bottleneck. Avoiding or reducing coordination is a recurring principle we observed in all data-intensive systems. Most DPSs circumvent this problem by forcing developers to think in terms of functional and data-parallel transformations. As state is absent or local to tasks, tasks may freely proceed in parallel. Coordination, if present, is limited to barrier synchronization in systems that support iterative jobs (e.g., iterative dataflow systems and graph processing systems).
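The following sketch, with hypothetical names, illustrates why stateless data-parallel transformations need no coordination: each partition is processed independently, and a barrier is needed only between the supersteps of an iterative job.

```java
// Minimal sketch: stateless tasks over disjoint partitions run in parallel
// with no coordination; iterative jobs add only a barrier between supersteps.
import java.util.Arrays;

class DataParallelSketch {
    static int[][] mapPartitions(int[][] partitions) {
        // No shared state: each partition could run on a different slot/worker.
        return Arrays.stream(partitions)
                .parallel()
                .map(p -> Arrays.stream(p).map(x -> x * x).toArray())
                .toArray(int[][]::new);
    }

    static int[][] iterate(int[][] partitions, int supersteps) {
        for (int i = 0; i < supersteps; i++) {
            partitions = mapPartitions(partitions);
            // Implicit barrier here: the next superstep starts only after all
            // partitions of the current one complete (as in iterative dataflows).
        }
        return partitions;
    }
}
```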
Conversely, DMSs require coordination to control concurrent access to shared state from multiple jobs. Indeed, the approach to coordination is the main criterion we used to classify them in Section 4. NoSQL systems partition state by key: they either support only jobs that operate on individual keys or relinquish group guarantees for jobs that span multiple keys, effectively treating accesses to different keys as if they came from independent jobs that do not coordinate with each other. NewSQL systems do not entirely avoid coordination but try to limit the situations in which it is required or its cost. In our analysis, we identified four main approaches to reach this goal: (i) use of precise clocks [34], (ii) pre-ordering of jobs and deterministic execution [92] (sketched below), (iii) explicit partitioning strategies to maximize the number of jobs executed (sequentially) in a single slot [89], and (iv) primary-based protocols that delegate the scheduling of all read-write jobs to a single worker [95].
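To illustrate approach (ii), here is a minimal sketch of deterministic execution in the style of Calvin [92]. The names are hypothetical, and the sketch assumes jobs are deterministic, so that replicas applying them in the same order converge to the same state.

```java
// A sequencer fixes a global order for read-write jobs; every replica applies
// them in that order and converges without synchronizing during execution.
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.concurrent.atomic.AtomicLong;
import java.util.function.Consumer;

class DeterministicReplica {
    private final Map<String, String> state = new HashMap<>();
    private long nextExpected = 0;

    // Jobs must be applied in sequence-number order and be deterministic, so
    // that equal inputs yield equal states on all replicas.
    void apply(long seq, Consumer<Map<String, String>> job) {
        if (seq != nextExpected) throw new IllegalStateException("out of order");
        job.accept(state);
        nextExpected++;
    }
}

class Sequencer {
    private final AtomicLong counter = new AtomicLong();
    private final List<DeterministicReplica> replicas;

    Sequencer(List<DeterministicReplica> replicas) { this.replicas = replicas; }

    // Assign a global position once, up front (simplified here to a
    // synchronous loop; real systems ship the ordered log asynchronously).
    void submit(Consumer<Map<String, String>> job) {
        long seq = counter.getAndIncrement();
        for (DeterministicReplica r : replicas) r.apply(seq, job);
    }
}
```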
In addition, all DMSs adopt strategies that optimize the execution of read-only jobs and minimize their impact on read-write jobs. They include the use of replicas to serve read-only jobs and multi-version concurrency control to let read-only jobs access a consistent view of the state without conflicting with read-write jobs.
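A minimal sketch of the multi-version technique follows (hypothetical names; per-key synchronization and garbage collection of old versions are omitted): writers install new versions, and read-only jobs read as of their snapshot timestamp without ever blocking writers.

```java
// Minimal multi-version concurrency control sketch.
import java.util.Map;
import java.util.TreeMap;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicLong;

class MultiVersionStore {
    private final AtomicLong clock = new AtomicLong();
    // key -> (commit timestamp -> value); TreeMap supports floor lookups.
    private final ConcurrentHashMap<String, TreeMap<Long, String>> versions =
            new ConcurrentHashMap<>();

    // A read-only job takes a snapshot timestamp when it begins.
    long begin() { return clock.get(); }

    // Each write installs a new version instead of overwriting in place.
    void write(String key, String value) {
        long commitTs = clock.incrementAndGet();
        versions.computeIfAbsent(key, k -> new TreeMap<>()).put(commitTs, value);
    }

    // Returns the latest version no newer than the reader's snapshot, so
    // read-only jobs see a consistent view and never conflict with writers.
    String read(String key, long snapshotTs) {
        TreeMap<Long, String> v = versions.get(key);
        if (v == null) return null;
        Map.Entry<Long, String> e = v.floorEntry(snapshotTs);
        return e == null ? null : e.getValue();
    }
}
```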
An open area of investigation for future research is a detailed study of the assumptions and performance implications of coordination avoidance strategies under different workloads. This study could guide the selection of the best strategies for the scenario at hand and open the door to dynamic adaptation mechanisms.
Architectures for Data-Intensive Applications. Data-intensive applications typically rely on complex software architectures that integrate different data-intensive systems to harness their complementary capabilities [37]. For instance, many scenarios require integrating OLTP (online transaction processing) workloads, which consist of read-write jobs that mutate the state of an application (e.g., user requests in an e-commerce portal), and OLAP (online analytical processing) workloads, which consist of read-only analytic jobs (e.g., analysis of sales segmented by time, product, and region). To support these scenarios, software architectures typically delegate OLTP jobs to DMSs that efficiently support concurrent read-only and read-write queries (e.g., relational databases), and use DMSs optimized for read-only queries (e.g., wide-column stores) for OLAP jobs. The process of extracting data from OLTP systems and loading it into OLAP systems is denoted as ETL (extract, transform, load), and is handled by DPSs that pre-compute and materialize views to speed up read queries in OLAP systems (e.g., by executing expensive grouping, joins, and aggregates, as well as building secondary indexes). Traditionally, ETL was executed periodically by batch DPSs, with the downside that analytical jobs do not always access the latest available data, whereas recent architectural patterns (e.g., the lambda and kappa architectures [60]) advocate the use of stream DPSs for this task.
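The shift from batch to stream ETL can be illustrated with a small sketch (hypothetical names): instead of periodically recomputing a materialized view, a streaming job updates it incrementally as change events arrive from the OLTP side.

```java
// Incremental maintenance of a materialized aggregate (sales per region)
// consumed by the OLAP side; hypothetical names, not a specific system's API.
import java.util.HashMap;
import java.util.Map;

class SalesViewMaintainer {
    record Sale(String region, double amount) {}

    private final Map<String, Double> salesByRegion = new HashMap<>();

    // Called for every change event extracted from the OLTP system.
    void onSale(Sale s) {
        salesByRegion.merge(s.region(), s.amount(), Double::sum);
    }

    // The OLAP side serves read-only queries from the pre-computed view.
    Map<String, Double> view() { return Map.copyOf(salesByRegion); }
}
```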
In general, the architectural patterns of data-intensive applications are in continuous evolution [37], and our study highlights a vast choice of diverse data-intensive systems with partially overlapping features. Future research could build on our model and classification to simplify the design of applications. Indeed, although the primary goal of our model was to present the key characteristics of data-intensive systems to researchers and practitioners with diverse backgrounds, it may inspire high-level modeling frameworks that capture the requirements of data-intensive applications and guide the design of a suitable software architecture for the specific scenario at hand. Recent work has already explored similar model-driven development in the context of stream processing applications [47].
Modular Implementations. Several data-intensive systems have a modular design, where the functionalities of the system are implemented by distinct components that can be developed and deployed independently. This approach is well suited for cloud environments, where individual components are offered as services and can be scaled independently depending on the workload. In addition, the same service can be used in multiple products—for example, storage services, log services, lock services, and key-value stores may be used as stand-alone products or adopted as building blocks of a relational DMS. We observed this strategy in systems developed at Google [29, 34], Microsoft [11], and Amazon [95].
Future research could take this idea further, proposing more general component models that promote reusability and adaptation to heterogeneous scenarios. Our work may guide the identification of the abstract components that make up data-intensive systems, the interfaces they offer, the assumptions they rely on, and the functionalities they provide. These research efforts may complement the aforementioned study of architectural patterns, promoting the definition of complex architectures from pre-defined components.
Wide Area Deployment. The systems we analyzed are primarily designed for cluster deployment. In DPSs, tasks exchange large volumes of data over the data bus, and the limited bandwidth of wide-area networks may easily become a bottleneck. Some DMSs support wide-area deployment through replication, but in doing so they either drop consistency guarantees or implement mechanisms to reduce the cost of updating remote replicas. For instance, deterministic databases [92] define an order for jobs and force all replicas to follow this order, with no need to explicitly synchronize job execution.
However, increasingly many applications operate at a geographic scale, and the edge computing paradigm [82] is emerging to exploit processing and storage resources at the edge of the network, close to end users. Designing data-intensive systems that embrace this paradigm is an important topic of investigation.
New and Specialized Hardware. The use of specialized hardware to improve the performance of data-intensive systems is an active area of research. Recent work studies hardware acceleration for DPSs [49] and DMSs [42, 58] using GPUs or FPGAs. Offloading tasks to GPUs is also supported in recent versions of DPSs, such as Spark, and is a key feature for systems that target machine learning problems, such as TensorFlow [1].
Open research problems in the area include devising suitable programming abstractions to simplify the deployment of tasks onto hardware accelerators, building new libraries of tasks that can run on hardware accelerators, and exploring new types of accelerators. More generally, the availability of new hardware solutions stimulates the definition of design choices that better exploit the characteristics of those solutions. In the context of DMSs, non-volatile memory offers durability at nearly the same performance as main memory. The interested reader can refer to the work by Arulraj and Pavlo [13], which discusses the use of non-volatile memory to implement a database system. Another area of investigation is the use of remote memory access to better exploit data locality and reduce data access latency; its potential has been pointed out in recent studies in the domains of both DMSs [102] and DPSs [40].
Dynamic Adaptation. Our model captures the ability of some systems to adapt to changing workload conditions (see Table 8). Many works exploit this capability to implement automated control systems for DPSs that monitor resource usage and adapt the deployment to meet the quality of service specified by users while consuming the minimum amount of resources. The interested reader can refer to recent work on dynamic adaptation for batch [17] and stream [25] DPSs.
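As a minimal illustration of such a control system, the sketch below (all names and the scaling policy are hypothetical) monitors one quality-of-service metric and resizes the worker pool accordingly.

```java
// One control step of a simple autoscaler: scale out when the QoS target is
// violated, scale in when there is ample headroom; the asymmetric thresholds
// provide hysteresis that avoids oscillation.
interface ScalableDeployment {
    double observedLatencyMs();        // monitored quality-of-service metric
    int workerCount();
    void setWorkerCount(int n);        // triggers task redistribution
}

class Autoscaler {
    private final ScalableDeployment d;
    private final double targetLatencyMs;

    Autoscaler(ScalableDeployment d, double targetLatencyMs) {
        this.d = d;
        this.targetLatencyMs = targetLatencyMs;
    }

    void step() {
        double latency = d.observedLatencyMs();
        if (latency > targetLatencyMs) {
            d.setWorkerCount(d.workerCount() + 1);
        } else if (latency < 0.5 * targetLatencyMs && d.workerCount() > 1) {
            d.setWorkerCount(d.workerCount() - 1);
        }
    }
}
```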
Future studies in the area of dynamic adaptation could intersect with topics already presented in this section: in particular, they may consider the availability of geographically distributed processing, memory, and storage resources, as well as heterogeneous and specialized hardware platforms.
8 Conclusion
This article presented a unifying model for distributed data-intensive systems, which defines a system in terms of abstract components that cooperate to offer its functionalities. The model precisely captures the possible design and implementation strategies for each component, together with the assumptions they rely on and the guarantees they provide. From the model, we derived a list of classification criteria that we used to organize state-of-the-art systems into a taxonomy and survey them, highlighting their commonalities and distinctive features. Our work can be useful not only for engineers who need to understand the range of available options in depth to select the best systems for their applications, but also for researchers and practitioners who work on data-intensive systems and wish to acquire a broad yet precise view of the field.