1 Introduction

For many years now we have heard promises of the emergence of the Internet of Things (IoT) and of Edge Computing. Still, the world of interconnected things has remained more an idea than a concrete reality. Recent predictions from International Data Corporation (IDC) studies, however, point to significant developments in this area: it is expected that by 2020 there will be an extraordinary 32 billion things connected to the Internet [14]. Moreover, the amount of digital data is expected to grow from 4.4 ZB in 2013 to 44 ZB in 2020.

Naturally, an explosion in the number of connected devices and in the amount of data being produced and exchanged demands novel approaches to data management. Massive scale systems, composed of thousands to millions of devices, exhibit specific characteristics that are especially challenging and need to be addressed. Namely, the increase in scale is necessarily accompanied by an increase in system dynamism. Such dynamism arises both from failures, which in these environments become the rule rather than the exception, and from the constant arrival and departure of devices, which we will call nodes from now on.

Alongside, real-world applications are starting to struggle to find affordable systems to manage and store massive amounts of data. As an example, the Wikimedia Foundation is currently requesting help from users with spare storage and bandwidth to store and host Wikipedia snapshotsFootnote 1. These snapshots contain the entire history of Wikipedia across distinct periods of time and are valuable for a wide variety of users, including researchers. However, they are not easily accessible due to limited storage capabilities. Thus, offering a massive scale storage system able to accommodate the entire Wikipedia and its history while relying only on commodity hardware becomes of significant interest. Moreover, serving all these snapshots from a unified storage service, instead of scattering them across independent storage systems, is key for users to have an efficient way of accessing the full history of Wikipedia.

Recent research proposed DataFlasks, a data store entirely built with epidemic protocols and tailored precisely for large scale environments [18]. Its success in coping with high levels of system dynamism lies in its autonomous and unstructured approach to node organization and in its pro-active approach to fault tolerance. In DataFlasks, nodes autonomously organize themselves into groups that are responsible for a subset/partition of the data. The number of nodes in a group then determines the replication factor for the data being stored. The effectiveness of a pro-active approach to data replication comes, unfortunately, with an increase in storage and network resource usage. In fact, bandwidth is a bottleneck for scalability in this type of system and, even though DataFlasks' autonomous data partitioning alleviates the problem, this still weakens its applicability in real-world scenarios [2]. Moreover, as all nodes belonging to the same group are fully replicated, the storage space available to the group is limited to the capacity of the node with the least storage. This restriction is particularly important if we consider each node to be commodity hardware or even a smaller edge device with limited storage space.

Data deduplication has proven to be an efficient technique for finding and eliminating duplicate content in large volumes of data [21]. Moreover, it has been used in the past to reduce the network bandwidth consumption of distributed storage systems. However, leveraging deduplication in a massive-scale data store such as DataFlasks is not a trivial task. One approach is to apply local deduplication only to the data stored in each node. As this approach does not eliminate duplicates stored across distinct nodes, it requires an efficient content-aware policy for distributing data to nodes that maximizes the obtainable space savings. Another approach is to perform global deduplication across data stored in all nodes, thus finding redundancy across the entire storage system. However, finding duplicates across all nodes requires global metadata and coordination, which not only increases the complexity of the system but may also compromise the decentralization, fault tolerance, and performance of systems such as DataFlasks.

Contributions. We propose DDFlasks, a massive scale deduplicated data store. It shows that DataFlasks, a massive scale data store, can be integrated with deduplication without losing any of its design guarantees, such as decentralization and high churn tolerance. Additionally, we evaluate its effectiveness using a real workload, namely simultaneously storing and serving both the most recent versions of Wikipedia articles [10] and their older historical versions. In fact, using real data from Wikipedia, we show that our system is able to store and serve articles across several nodes with storage savings of up to 63% and network savings of up to 20%.

Roadmap. The rest of the paper is organized as follows. In Sect. 2 we describe the architecture and design of DataFlasks, the baseline system used to build our novel approach. Next, in Sect. 3 we describe the Wikipedia use case and present some preliminary results that motivate the usage of deduplication. In Sect. 4 we introduce DDFlasks. We then proceed to the evaluation of DDFlasks in Sect. 5 and present related work in Sect. 6. The paper is concluded in Sect. 7.

2 DataFlasks: Epidemic Store for Massive Scale Systems

The pivotal idea guiding the design of DataFlasks is decentralization: each node is autonomous and all nodes play the same role [18]. A node progresses relying solely on local decisions, without depending on any other node or on any kind of hierarchy. When a client issues a request, the request is disseminated throughout the system and each node decides how to handle it. Store requests are composed of a unique identifier of the object to be stored, the version of the object, and the object's data. Storing several versions of the same object is important for the many applications that resort to data versioning.

Briefly, the API is composed of get and put operations. When a get is received, if the node holds the corresponding triple (key, version, object) it replies to the client. Otherwise, it ignores the request. In the case of a put operation, the node locally decides whether to store the corresponding triple (key, version, object) or to discard it. This decision of whether or not to store the data is used to implement data distribution and replication. DataFlasks ensures that a sufficient number of nodes actually decides to store each data object in order to guarantee data replication and, thus, to tolerate node failures.
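For concreteness, the following is a minimal sketch of this client-facing API; the interface name and type choices are illustrative assumptions, not the actual DataFlasks code.

```java
// Minimal sketch of the client-facing API described above; the interface
// name and type choices are illustrative assumptions, not DataFlasks code.
public interface FlasksStore {

    // Disseminated through the system; each node decides locally whether
    // to keep the (key, version, data) triple according to its group.
    void put(long key, long version, byte[] data);

    // Answered only by nodes whose group holds the requested triple;
    // returns null if the request cannot be fulfilled by this node.
    byte[] get(long key, long version);
}
```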

The set of nodes that take the same decisions on whether or not to store data objects is viewed as a group. Accordingly, the decision of which data to store is reduced to the decision of which group a node belongs to. Once that decision is made, each node is responsible for a subset of the data according to a deterministic mapping between the pair (key, version) of an object and the group the node belongs to. Data is thus distributed across groups, providing load balancing, and replicated a number of times equal to the size of the group. Strikingly, each node is able to decide which group it belongs to without requiring any kind of coordination.

In order to achieve this, the system is entirely built with unstructured and pro-active epidemic protocols. These are characterized by their independence from any kind of structure or hierarchy among nodes and by the fact that they rely on pro-active mechanisms for fault tolerance that are able to anticipate system repair. The result is a completely decentralized and coordination-free data store. These characteristics make DataFlasks inherently scalable and able to cope with unprecedented levels of system dynamism, whether caused by membership instability or by failures.

In the system’s architecture, each node runs five components: Membership, Group Construction, Storage, Replica Maintenance and Interface. In order to provide some background and context to the design of the system proposed in this paper, we briefly describe how each component works in the original setting.

The Membership component is responsible for providing each node with a list of available nodes in the system. It does so while guaranteeing that this list represents a random sample of nodes from the entire system and that it is periodically refreshed. It is important to notice that each membership list is always a small subset of nodes with respect to the system size, which allows the system to scale.

The Group Construction component is responsible for determining which group the node belongs to. As described previously, the group determines which data to locally store or discard. Without going into much detail, this component works by leveraging information being propagated at the membership level to estimate the number of groups needed to satisfy a desired, user-defined replication factor. Then, the node places itself in one of those groups, guaranteeing that system nodes are uniformly distributed across the different groups. For a detailed description of the protocol please refer to [18]. Once in a group, each time a put operation is issued for a certain key, that key is mapped deterministically to a group by using a hash function. As described further on, this mapping allows different versions of the same key to be placed in the same replication group, which maximizes deduplication effectiveness.
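A minimal sketch of such a deterministic key-to-group mapping is shown below. It assumes SHA-1 hashing and a known group count, whereas the actual DataFlasks protocol derives the group count from membership information; names and details are illustrative only.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Illustrative sketch of a deterministic key-to-group mapping; not the
// actual DataFlasks group construction protocol.
public final class GroupMapper {
    public static int groupFor(String key, int numGroups) {
        try {
            byte[] h = MessageDigest.getInstance("SHA-1")
                    .digest(key.getBytes(StandardCharsets.UTF_8));
            // Use the first 4 bytes of the digest as an unsigned integer.
            long v = ((h[0] & 0xFFL) << 24) | ((h[1] & 0xFFL) << 16)
                    | ((h[2] & 0xFFL) << 8) | (h[3] & 0xFFL);
            // The version is deliberately ignored, so all versions of the
            // same key map to the same replication group.
            return (int) (v % numGroups);
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException(e);
        }
    }
}
```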

The Storage component abstracts the actual medium to which the data is persisted. Currently, this component can be configured as an in-memory store or a disk-based one. This paper introduces a new storage component that supports data deduplication.

In order to maintain the replication level in the presence of churn, the Replica Maintenance component periodically publishes to other nodes in the group the set of keys it currently holds locally. Within a group, all nodes store the same set of data objects. Upon receiving a maintenance message, each node checks whether it is storing all keys corresponding to the group. If not, it requests the missing data from the nodes in its group. In this paper we provide a new replica maintenance component that optimizes this process by avoiding the transmission of duplicate data through the network.
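The essence of this maintenance step can be sketched as a simple set difference between the keys announced by a peer and the keys held locally; the class below is an illustrative assumption, not code from the DataFlasks implementation.

```java
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Sketch of the anti-entropy check performed on a maintenance message:
// compute which announced keys are missing locally and request them.
public final class ReplicaMaintenance {
    // Objects currently stored locally, keyed by object identifier.
    private final Map<String, byte[]> localStore;

    public ReplicaMaintenance(Map<String, byte[]> localStore) {
        this.localStore = localStore;
    }

    // Returns the keys announced by a peer in the same group that are
    // missing locally and should therefore be requested.
    public Set<String> missingKeys(Set<String> announcedByPeer) {
        Set<String> missing = new HashSet<>(announcedByPeer);
        missing.removeAll(localStore.keySet());
        return missing;
    }
}
```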

Finally, the Interface component is responsible for handling incoming connections from other nodes and managing the request workflow in the system. In order to issue put or get requests, the client only needs to be able to contact a single node in the system. The request is then forwarded appropriately to the correct nodes that can fulfill it.

3 Duplicates in the Real World

Many large information systems tend to exhibit a significant amount of duplicate data [19]. This is particularly true for storage systems that evolve incrementally over time. A paradigmatic example is Wikipedia, also known as the Internet encyclopedia [10]. Wikipedia allows users to create and extend articles about virtually any subject. Articles evolve through time and periodic snapshots of the entire Wikipedia are stored for future reference. Because Wikipedia serves a very high volume of requests and stores a large and growing volume of data, it is a suitable use case for DataFlasks, whose highly scalable infrastructure can serve Wikipedia's high demand.

Naturally, different versions of the same article share significant portions of text, which is redundant when stored. This means that a storage system holding the full history of Wikipedia is expected to have a considerable amount of duplicate content [11]. A possible approach to eliminate this redundancy would be to use a traditional compression technique such as gzip. However, compression techniques are primarily designed to eliminate intra-file redundancy or redundancy over a small group of files, typically stored together in the same operation. In the Wikipedia use-case, new versions of the same article are created over time and must be retrieved efficiently when requested. This means compressing and decompressing data several times, which results in a significant penalty on the performance of storage requests. Another possible approach to eliminate such redundancy and to spare storage space is to use incremental backup techniques such as delta-encoding. With this technique, new versions of a previously stored article are stored as deltas or diffs that only contain the content that was actually modified. These deltas can then be applied to the original (base) article to rebuild a specific version of the article. Although this technique is efficient in terms of storage space savings, it requires additional computational power and is slower than deduplication, especially when articles have a large number of versions and several deltas must be applied to the base article to retrieve the latest versions. For this reason, this paper proposes the use of block-based deduplication, which allows users to query any past article version and get the response without the need to rebuild a set of deltas or decompress data [21].

To validate that deduplication is, in fact, suitable and effective for a deployment where DataFlasks is serving Wikipedia articles, we performed the following experiment. We used 15 monthly Wikipedia snapshots taken for the period between November of 2014 and January of 2016 (See footnote 1). Each snapshot has the latest full version of all articles belonging to the English version of Wikipedia. The snapshots were processed in the order they were taken and the corresponding articles were stored in a way that mimics the distributed storage approach taken by DataFlasks in a real deployment, i.e., articles were divided into groups and stored accordingly. Each group of articles represents the data partition that would be assigned to a specific set of DataFlasks nodes. We then focused our analysis on each one of these partitions. It is important to notice that deduplication is applied locally by each node. Consequently, nodes in the same group, which replicate the same data partition, store the same content, making it sufficient to analyze a single node per group. Additionally, some articles remained unmodified across consecutive snapshots; such repeated articles were not stored in our experiment.

On the other hand, new versions of previously stored articles were routed to the same data group where their ancestors were persisted, and were stored as new objects (files) with distinct version identifiers. This way, the experiment stored the full content of each article version, in conformity with the rationale explained previously: our very large data store is used to serve several articles and their distinct versions without requiring incremental backup techniques.

Table 1. Duplicate analysis results with 1024-, 2048- and 4096-byte Rabin fingerprints for a single group of the DataFlasks configuration with 40 groups.

After populating the distinct data groups with the Wikipedia dataset, the global storage space in use was ≈305 GB, corresponding to 55,745,648 articles. In order to check the percentage of redundancy in the stored dataset, we resort to DupsAnalyser, an open-source tool (https://github.com/jtpaulo/dupsanalyzer) that processes the content of files and extracts statistics about the duplicate content found. Duplicates can be found by searching for duplicate blocks of either fixed or variable size. The latter resorts to an implementation of the Rabin fingerprint scheme for efficiently calculating variable-sized blocks and their corresponding content hashes [20]. As Wikipedia articles are text, variable-sized blocks are a better choice for finding duplicates [11, 21]. Briefly, let us consider two versions of the same article, where version B differs from version A only by a single character added to its beginning. If the two articles are scanned with a fixed-size partitioning scheme, no blocks from version A will match blocks from version B. In contrast, the Rabin fingerprint scheme uses a sliding window that moves through the data until a fixed content pattern defining the block boundary is found. This approach generates variable-sized blocks and solves the issue of inserting a single byte at the beginning of version B. More precisely, only the first block of version B will differ from the first block of version A due to the byte addition, while the remaining blocks will still be duplicates. Finally, the Rabin scheme is configurable with target average, maximum and minimum block sizes, which allows avoiding the generation of very small or very large blocks while still keeping their sizes variable. In the results discussed next, we used DupsAnalyser to process the articles, and corresponding versions, stored at each data group. Individually, for each data group, our analysis tool processed all stored files to find intra- and inter-file duplicates.
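To make the boundary-detection idea concrete, the following is a simplified sketch of content-defined chunking in the spirit of the Rabin scheme; it is an illustration only, not the fingerprinting code used by DupsAnalyser, which relies on a proper Rabin polynomial [20], and all parameter values are illustrative.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Simplified content-defined chunker: a lightweight rolling-style hash marks
// a block boundary whenever its low bits are zero, while minimum and maximum
// sizes keep the variable-sized blocks within bounds.
public final class ContentDefinedChunker {
    private static final int MASK = 0x3FF;    // ~1 KiB expected block size
    private static final int MIN_SIZE = 256;  // avoid very small blocks
    private static final int MAX_SIZE = 4096; // avoid very large blocks

    public static List<byte[]> chunk(byte[] data) {
        List<byte[]> blocks = new ArrayList<>();
        int start = 0;
        long hash = 0;
        for (int i = 0; i < data.length; i++) {
            // Shifting makes older bytes fade out of the low bits, so the
            // boundary decision depends only on recent content.
            hash = (hash << 1) + (data[i] & 0xFF);
            int len = i - start + 1;
            if ((len >= MIN_SIZE && (hash & MASK) == 0) || len >= MAX_SIZE) {
                blocks.add(Arrays.copyOfRange(data, start, i + 1));
                start = i + 1;
                hash = 0;
            }
        }
        if (start < data.length) {
            blocks.add(Arrays.copyOfRange(data, start, data.length));
        }
        return blocks;
    }
}
```

Because boundaries are derived from content rather than offsets, inserting a byte at the beginning of a file only changes the first block, which is exactly the property discussed above.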

Distinct Group Sizes Results. Our first experiment was designed to check the amount of duplicates found per group node when dividing articles into 10, 20 and 40 groups, for different block sizes: 1024, 2048 and 4096 bytes. With 10 groups each group node holds ≈30 GB, with 20 groups ≈15 GB and with 40 groups ≈7.5 GB. We noticed that the percentage of duplicates found does not increase significantly when a group holds more data, because most redundancy originates from storing distinct versions of the same article in the same group, which happens identically for the three group sizes.

Single Group Analysis for the 40 Groups Scenario. Since the percentage of duplicates does not change significantly when considering different numbers of groups, we show in Table 1 a more detailed analysis of the stored content in a single group for the experiment with 40 groups. The analyzed group holds 7.63 GB of data corresponding to more than one million articles. For each Rabin fingerprint size, the total number of generated blocks differs and, as expected, with a smaller block size it is possible to find more duplicates and obtain significantly higher space savings. However, reducing the block size increases the size of the metadata used to index all stored blocks and to find duplicates.

To conclude, these results show that single-node deduplication with a variable-size 1024-byte fingerprinting scheme reduces by 45% the storage space occupied by 15 snapshots of the English version of Wikipedia.

4 DDFlasks

Recalling Sect. 2, data distribution and replication in DataFlasks are achieved by dividing nodes into groups. Each group is responsible for a set of data and, accordingly, each node belonging to that group stores that specific set of data in its local storage. The Wikipedia study discussed in the previous section shows that a significant percentage of duplicates exists in each node when all the versions of a specific article are grouped together. In DDFlasks, this insight is leveraged by ensuring that data objects identified by a key are always assigned to the same group, independently of their version. With this approach, all the versions of an article are stored in the same group while clients can still retrieve specific versions of an article by specifying the article's key and the desired version. This is achieved by taking advantage of the load balancing mechanism of the original DataFlasks, which deterministically routes a certain key to a group. DDFlasks inherits characteristics from DataFlasks, such as full decentralization. In particular, it resorts to node-local deduplication that does not require any global index or coordination mechanisms, which would impact high-churn tolerance and the performance of storage requests [21].

Compared with the baseline architecture discussed in Sect. 2, DDFlasks adds storage and network deduplication mechanisms. The resulting open-source system is available at http://github.com/fmaia/dataflasks.

Fig. 1. Deduplication in DDFlasks

First, a new storage component is provided with integrated in-line local storage deduplication, which works as follows. In each node, duplicates are identified and eliminated before data is actually stored persistently. In the literature this approach is known as in-line deduplication [21]. Duplicates are found by resorting to an index that maps blocks with unique content to their respective storage addresses. When a block is being written, a digest of the block's content is calculated and the index is searched for a possible duplicate. If a duplicate exists, then the new block does not need to be stored; otherwise, the block is stored and the index is updated with a new entry for that block. A Rabin fingerprint scheme identical to the one described in Sect. 3 is used to divide files into variable-sized blocks and to calculate small digests of their content [20]. This way, the index does not store the actual block but a smaller digest identifying the content of that block. Fingerprints are calculated deterministically per file; thus, at each node, storing files in different orders does not affect the correctness of the approach. In order to retrieve files from the storage system, an additional metadata structure, which we refer to as a file recipe, is used. Each file recipe identifies a single file stored in DDFlasks and tracks the digests of the blocks that belong to that specific file. The storage address of each of these blocks can be looked up in the index. Deduplication is thus achieved because file recipes with duplicate content share digests that are mapped to the same storage block. Figure 1 shows an example of the proposed single-node deduplication mechanism. As the first step, File A is routed to the correct group of nodes. Then, in each node storing the file, the file is divided into variable-sized blocks and a digest of the content of each block is calculated. In the example, blocks b1 and b3 have the same content. Each digest is checked against the index and, if not found, a new entry is added while the corresponding block is stored in an append-only store. Since b1 and b3 are duplicates, only blocks b1 and b2 are stored. Finally, the file recipe for File A is also kept at the node in order to fetch all the necessary blocks when a client asks for that file. The index keeps the digests and corresponding locations of all blocks in local storage, which enables both intra- and inter-file deduplication for all files stored in the same node. In Sect. 5 we show that our approach is still able to achieve significant storage space savings even when metadata space is accounted for.
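The write and read paths just described can be summarized by the following minimal sketch, assuming files arrive already split into variable-sized blocks; the class names, the use of SHA-1, and the in-memory block log are illustrative assumptions rather than the actual DDFlasks implementation.

```java
import java.security.MessageDigest;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Sketch of in-line single-node deduplication: an index maps block digests
// to addresses in an append-only block log, and file recipes record the
// ordered digests needed to rebuild each file.
public final class DedupStore {
    private final Map<String, Long> index = new HashMap<>();          // digest -> log address
    private final Map<String, List<String>> recipes = new HashMap<>(); // file id -> digests
    private final List<byte[]> blockLog = new ArrayList<>();           // stand-in for disk log

    public void put(String fileId, List<byte[]> blocks) throws Exception {
        List<String> recipe = new ArrayList<>();
        for (byte[] block : blocks) {
            String digest = hex(MessageDigest.getInstance("SHA-1").digest(block));
            // Only persist the block if its content was never seen before.
            if (!index.containsKey(digest)) {
                index.put(digest, (long) blockLog.size());
                blockLog.add(block);
            }
            recipe.add(digest);
        }
        recipes.put(fileId, recipe);
    }

    public byte[] get(String fileId) {
        List<String> recipe = recipes.get(fileId);
        if (recipe == null) return null;
        // Reassemble the file from its blocks using the recipe and the index.
        int size = 0;
        List<byte[]> parts = new ArrayList<>();
        for (String digest : recipe) {
            byte[] b = blockLog.get(index.get(digest).intValue());
            parts.add(b);
            size += b.length;
        }
        byte[] out = new byte[size];
        int pos = 0;
        for (byte[] b : parts) {
            System.arraycopy(b, 0, out, pos, b.length);
            pos += b.length;
        }
        return out;
    }

    private static String hex(byte[] d) {
        StringBuilder sb = new StringBuilder();
        for (byte b : d) sb.append(String.format("%02x", b));
        return sb.toString();
    }
}
```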

In this paper we do not address data deletion functionalities. This is motivated by the fact that DDFlasks is a large-scale system intended to store large amounts of archival data. For use-cases such as the Wikipedia one used in this paper, this is a practical assumption since the main goal is to keep all versions of Wikipedia articles without ever deleting them. As described in the previous section, for this use case single-node deduplication proves to be an efficient technique for sparing redundant storage space and avoids the scalability issues found in large-scale in-line deduplication systems that must maintain a global index for finding duplicates across remote storage nodes [7, 8].

The second deduplication mechanism proposed in this paper aims at optimizing the network bandwidth used by the DDFlasks data replication mechanism. In order to cope with high levels of node churn and to maintain the desired data replication levels, each system node proactively and periodically contacts other nodes in the same group to announce the set of files it is currently storing. If a node receives this set and verifies that its local storage is missing some files, it contacts other nodes in the same group to request those files. Naturally, when churn levels become significantly high, the volume of data traversing the network increases as more files are exchanged. We propose to mitigate this problem by applying deduplication to the data exchanged between nodes. In detail, nodes periodically announce to the group not only the set of files that they currently hold but also the digests that compose those files. When a node receives this list and verifies that a set of files is missing, it first checks which of those files' digests are already stored locally. This can be done by leveraging the index metadata used for local storage deduplication. Then, the node only requests the blocks that are actually missing from its local storage. After receiving these blocks, the node updates the index and creates the corresponding file recipes. A key advantage of this mechanism is that it relies on the metadata already used for performing in-line deduplication, an idea that has proven successful in previous proposals for backing up data across peer-to-peer networks [5, 20]. Although this strategy requires sending the list of digests when announcing the files that nodes currently hold, we show in Sect. 5 that it still spares significant network bandwidth. Note that although single-node deduplication is already provided by several storage appliances, it is not trivial to integrate these solutions with DDFlasks and take advantage of their deduplication metadata, which is in most cases protected within the appliance, to implement the previous network optimizations.
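The core of this step, selecting which blocks actually need to be transferred, can be sketched as follows; the class, method, and parameter names are hypothetical and not taken from the DDFlasks code base.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Network deduplication sketch: given a peer announcement listing, per
// missing file, the digests of its blocks, request only the blocks whose
// digests are absent from the local deduplication index.
public final class NetworkDedup {
    public static List<String> blocksToRequest(
            Map<String, List<String>> announcedFileRecipes, // file id -> block digests
            Set<String> localIndexDigests) {
        Set<String> needed = new HashSet<>();
        for (List<String> recipe : announcedFileRecipes.values()) {
            for (String digest : recipe) {
                if (!localIndexDigests.contains(digest)) {
                    needed.add(digest);
                }
            }
        }
        return new ArrayList<>(needed);
    }
}
```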

Implementation Details. The two deduplication mechanisms were implemented on top of the current implementation of the system described in Sect. 2. The deduplication index is an in-memory HashMap that maps block digests (8 bytes) to storage addresses (8 bytes)Footnote 2. Similarly, file recipes are stored in an in-memory HashMap that maps the identifier of a file (16 bytes: 8 bytes for the file key and 8 bytes for the version) to its file recipe, whose size depends on the number of block digests composing that file. DDFlasks is mainly intended to run on commodity hardware nodes and the amount of data held by each node is not expected to be very large (tens to hundreds of GBs). Therefore, the amount of metadata held by each node is also not expected to grow to large values. Additionally, in the context of this paper we assume that, even in the presence of high levels of churn, for each group there is always a set of live nodes. This way, metadata for freshly booted nodes can always be reconstructed from live nodes.
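As a rough back-of-the-envelope estimate (ours, under stated assumptions rather than a figure from the evaluation), with 16 bytes of raw index entry per unique block (8-byte digest plus 8-byte address) and an average block size of 1024 bytes, indexing B bytes of unique data requires approximately

\[
\text{index size} \approx \frac{B}{1024} \times 16\ \text{bytes} \approx 1.6\%\ \text{of}\ B,
\]

i.e., on the order of 160 MB of raw entries per 10 GB of unique blocks, ignoring per-entry HashMap overhead.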

5 Evaluation

DDFlasks was evaluated in a real deployment to validate two main claims: first, that deduplication spares significant storage space at each node; second, that the network bandwidth used by nodes when exchanging messages is also reduced.

To this end, we have performed a set of experiments that demonstrate the effectiveness of the deduplication mechanisms implemented. Each experiment was run both in the original, non-deduplicated DataFlasks system (used as the baseline) and in DDFlasks. The experimental setup consists of a cluster of commodity hardware nodes equipped either with a 3.1 GHz dual-core Intel i3 processor, 8 GB of RAM and a 7200 RPM SATA disk, or with a 3.7 GHz dual-core Intel i3 processor, 8 GB of RAM and an SSD. All nodes are connected through a Gigabit Ethernet switch. It is important to notice that hardware heterogeneity does not impact the results of our experiments. In fact, the evaluation of system performance metrics is out of the scope of the present paper. These metrics are mostly affected by the deduplication approach being used, i.e., fingerprinting scheme, index scheme, etc. As discussed in previous work, each scheme introduces different tradeoffs in terms of storage performance, deduplication performance and resource (RAM, CPU, disk) consumption [21].

Instead, we focus on analyzing the storage and network savings achievable by our system. Similarly, the validation of DDFlasks' scalability to thousands of nodes and resilience to high churn ratios is already addressed in previous work [18].

Leveraging the results obtained in Sect. 3 and aiming at a real-world assessment of DDFlasks, all the experiments presented next resort to actual Wikipedia data.

5.1 Storage Savings

In order to evaluate the storage behavior of DDFlasks we considered 15 monthly Wikipedia snapshots. Each one of these snapshots contains a set of articles from the English version of Wikipedia. From snapshot to snapshot each article may change, reflecting its evolution through time. In the real-world deployment of Wikipedia, users see only a single (latest) snapshot. However, in our scenario we go a step further: our goal is to simultaneously store and serve several Wikipedia snapshots.

The 15 snapshots used amount to ≈115 GB, corresponding to ≈6.3 million articles. Each article is stored as a single data object in the storage system and each new article snapshot corresponds to a new version of that object. Moreover, new article versions are stored as new objects, identified with the same key as the original article but with a different version number. This information is used by DDFlasks to collocate articles with their subsequent versions in the same node group.

We configured both DataFlasks and DDFlasks to arrange nodes into 16 groups. Each group is responsible for storing a subset of the articles written to the store. As described previously, all nodes belonging to a certain group store the same data and deduplication is applied locally to each node. Consequently, in order to observe the system’s behavior it is sufficient to analyze the behavior of a single node per group. Other nodes in the same group will exhibit exactly the same results as the ones presented next.

The experiment consisted of loading both DataFlasks and DDFlasks with the 15 data snapshots, writing each article and its subsequent versions in chronological order (from the oldest snapshot to the latest one). After the load was completed we analyzed the storage usage of one node per group.

Table 2. Storage and metadata space occupied by the DDFlasks and DataFlasks storage systems

In Table 2 we present the results of this experiment. It is observable that DDFlasks is significantly more frugal than DataFlasks with respect to storage space usage. The former requires 42.4 GB to store all the articles while the latter, without deduplication, requires 115.5 GB. In detail, 73.1 GB are saved by using deduplication, which corresponds to a space saving of 63% when compared to the baseline approach. Please note that, when compared with the motivation tests described in Sect. 3, there is an improvement in the storage savings results. This improvement is explained by the fact that, in this real deployment, we used a sample of the articles (and corresponding versions) used in the motivation experiments, which happens to exhibit slightly higher redundancy. Additionally, we can observe that the local storage space required by nodes in different groups is similar, and that the per-node deduplication savings are identical to those observed globally for the whole store, since the load balancing strategy routes articles uniformly across distinct groups.
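For clarity, the reported saving follows directly from these totals:

\[
115.5\ \text{GB} - 42.4\ \text{GB} = 73.1\ \text{GB}, \qquad \frac{73.1}{115.5} \approx 0.63\ (63\%).
\]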

Going into some detail, we also show in the table the space used by metadata structures. In both systems, more than 390,000 articles were stored in each node. As expected, deduplication requires additional metadata space for storing and indexing the articles' blocks, while the baseline system only requires a simpler file recipe that maps a specific file to its storage address. Nevertheless, the space savings achieved clearly compensate for the overhead introduced by the extra metadata structures used in DDFlasks. In fact, less than 17% of the space spared by deduplication is consumed by the extra metadata. Finally, Table 3 shows the exact space occupied by the index and file recipe metadata in our system. Again, the space occupied by each metadata structure does not change significantly across different nodes.

Table 3. Space occupied by the DDFlasks index and file recipes

5.2 Network Savings

Replication is achieved in our system by resorting to periodic message exchanges between nodes carrying information about the data objects they are storing. Whenever, following a message exchange, a node detects that it is missing some object, it requests it from other nodes in the same group. Naturally, if the system is stable, it is expected that nodes store all corresponding data objects and that these message exchanges do not yield missing data requests. However, when nodes fail or enter the system, data objects need to be requested in order to maintain the desired replication levels.

In this experiment, we show that deduplication can reduce the network consumption of the data exchange mechanism between nodes. We focus on two nodes belonging to the same group and observe their behavior when one of them keeps failing and re-entering the system while the system is continuously being loaded with new data. Naturally, it is expected that each time the node re-enters the system it will request missing data from its peer, which runs continuously. The test ran for 2 h and, after the first 30 min, one of the nodes was stopped in intervals of 20 min. In detail, after being stopped, the node remained offline for 20 min and was then rebooted and kept online for an additional 20 min. This cycle was repeated until the last 30 min of the test, when both nodes were kept online. The node being stopped saved its metadata to disk periodically to ensure that, when rebooted, the index and file recipe metadata still held previously stored information. Again, 15 monthly Wikipedia snapshots were used, and both systems (DDFlasks and baseline) stored more than 400,000 articles, which corresponds to ≈8.3 GB. Please recall that the two nodes were configured to be in the same group, so they were fully replicated, each holding the full set of articles mentioned previously. In terms of storage space savings, the DDFlasks nodes stored 4.3 GB while the baseline system nodes stored 8.3 GB. This corresponds to a space saving of ≈49%, which is in conformity with the results discussed previously and in Sect. 3. As in the previous results, the metadata space required by each node is compensated by the space savings.

The baseline approach, without network deduplication, sends more than 22 GB through the network, while the deduplication approach only sends 17.71 GB. Note that these bandwidth consumption results consider all network traffic. In fact, while most of this traffic is due to the data replication mechanism, system control traffic and client requests are also accounted for in the total value. Moreover, both systems rely on the UDP protocol, which requires resending messages lost by the unreliable transport and thus also increases network bandwidth usage. Nevertheless, these results show that, by applying deduplication to the data replication mechanism alone, it is possible to spare ≈20% of all the data exchanged between replicated peers.
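Taking the reported 22 GB as a conservative lower bound for the baseline traffic, the relative saving is at least

\[
\frac{22 - 17.71}{22} \approx 0.195 \approx 20\%.
\]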

The previous results show that significant storage space and network bandwidth can be spared with DDFlasks. We expect these savings to be similar for other backup workloads with periodic snapshots. In fact, as presented in [19], some of these backup workloads will have higher duplication ratios than Wikipedia, meaning that the network and storage savings achievable should also be higher.

6 Related Work

In the pursuit of large-scale data management, traditional relational database systems have been, for certain domains and applications, largely replaced by new approaches to data management. Commonly known as NoSQL data stores, these data management systems offer relaxed consistency guarantees when compared with traditional relational database management systems. Examples are Dynamo, PNUTS, Bigtable, Cassandra and Riak [3, 4, 6, 15, 16]. One of the key features of these data stores is how they implement data distribution and discovery. Leveraging the scalability properties of peer-to-peer protocols, most of these data stores rely on a distributed hash table (DHT) such as Chord, or variants thereof, to distribute and locate data objects [24]. The exceptions are Bigtable and PNUTS, which are centrally managed and typically use a specific DHT variation called a 'one-hop' DHT [13]. This variation allows faster lookups but requires complete membership knowledge, i.e., each node knows about all other nodes in the system. Moreover, DHTs are known to struggle in the presence of high levels of churn [23]. As a result, even if the distributed and peer-to-peer nature of these data stores is closely related to DataFlasks, this system presents a unique unstructured and pro-active approach to node organization and data replication.

To the best of our knowledge, applying deduplication to epidemic massive scale systems, in order to improve the usable storage space of peers and the network bandwidth usage of gossip protocols and pro-active replication mechanisms, is a novel contribution of this paper. To achieve these goals, we leverage ideas from previous work on deduplication for distributed storage systems [21]. In more detail, for achieving both storage and network savings, in-line local deduplication is applied so that duplicates are eliminated before being stored persistently [8, 22]. In fact, for sparing network bandwidth, duplicates are eliminated before even being sent through the network [20].

Peer-to-peer in-line deduplication, where backups are made cooperatively with remote nodes, was introduced in Pastiche [5]. In this system, nodes back up their data to other remote nodes that are chosen by their network proximity and data similarity. Only non-duplicate data is sent through the network and, since nodes with similar datasets are chosen, the amount of data that must be sent through the network and stored in each peer is reduced significantly. Other distributed deduplication systems propose novel load balancing designs that route similar data to the same node in order to optimize the amount of duplicates found and, consequently, maximize storage space savings. These proposals rely on centralized indexes that have global knowledge of the content stored in all nodes, on distributed indexes that scale better than centralized ones, on stateful and stateless routing algorithms, and on probabilistic routing algorithms that do not need global knowledge of the content of each node in the system [1, 7, 8, 9, 11, 12, 17, 25].

Although DDFlasks could benefit from some of the ideas and optimizations discussed in previous deduplication systems, our current design uses the original load balancing algorithm proposed by DataFlasks. Our approach collocates different versions of the same data objects, which are expected to have duplicated content. Deduplication is thus performed locally on each node, i.e., each node manages its own index and only eliminates duplicates that are stored in its local storage. Strikingly, as shown in the paper, for realistic use-cases such as the Wikipedia one, ensuring that all versions of an article are routed to the same DDFlasks group is enough to achieve significant storage space savings while keeping metadata overhead acceptable. Additionally, our deduplication design can be leveraged to spare not only storage space but also network bandwidth across nodes. For epidemic data stores such as DDFlasks this is a novel contribution that significantly reduces the number of messages exchanged across nodes, thus improving the efficiency of current gossip protocols, which is of particular importance since bandwidth consumption is critical in these systems [2]. Furthermore, our approach does not impact the decentralization and high-churn tolerance assumptions of the original DataFlasks system.

7 Conclusion

This paper describes a deduplicated massive scale data store that can handle high volumes of data while minimizing storage resource usage. DDFlasks is built from a stack of proactive and completely decentralized gossip-based protocols.

The core idea driving this store is effective data dissemination together with independent, local decisions at each node about what to do with the data. In-line deduplication is employed at each node and we show, resorting to a real-world scenario, that the system is able to save up to 63% of storage space in comparison with a non-deduplicated one.

Additionally, the DDFlasks design is completely decentralized and able to cope with unprecedented amounts of churn, while saving up to 20% in network bandwidth consumption when compared with the original, non-deduplicated DataFlasks system.