6.1 Overview of Web Storage
In terms of content storage, the current web ecosystem is dominated by silos of providers residing in centrally controlled, public Cloud infrastructures. While these public Clouds provide users with on-demand access to a large pool of shared resources, they operate with little or no transparency. As a result, concerns over the security of confidential or sensitive data can favour the deployment of private Cloud infrastructures, which require large upfront costs.
More importantly, the centralisation of these silos’ infrastructures means that they reside in only a few locations on the Internet. As a consequence, even simple network failures can make these silos unavailable, as users experienced during recent outages at Amazon Web Services (AWS), which resulted in the loss of access to a significant portion of the web, and at Facebook [48, 182]. While replicating content across silo boundaries would lead to better performance and availability for users, the lack of incentives prevents such cooperative action among the silos.
Content retrieval from centralised Cloud infrastructures deployed at remote datacenters can experience large communication latency. To reduce this latency, the emerging edge computing [168] paradigm promises to deploy small-scale datacenters at locations close to users. However, such edge infrastructures are mostly appropriate for small-scale, low-latency applications and are not typically designed for the workload of the web. Instead, a truly decentralised web can be realised by pooling the vast resources of users worldwide and incentivising their proper usage to achieve scalability and sufficient performance.
Other important actors in content retrieval in the current web are Content Delivery Networks (CDNs), which provide large-scale retrieval of Quality-of-Service (QoS)-sensitive content through on-demand content replication at distributed caches worldwide. While on-demand replication with simple reactive caching policies (such as LRU) is effective in providing sufficient content retrieval performance, the location-based nature of web references (i.e., addressing) makes replicating and moving content difficult, as such actions invalidate existing references to the content. To deal with this problem, CDNs use proprietary name resolution mechanisms that immediately update the invalidated web references upon movement or replication of content. Despite being distributed infrastructures, CDNs are centrally governed systems that charge content producers for distributing their content. This makes content delivery expensive, especially for small content producers. Finally, to serve content over HTTPS, CDNs need to hold content publishers’ private keys, further increasing centralisation and lowering the security of the entire web [86].
Key Challenges. Decentralised file systems face several crucial challenges. First, matching the reliability and performance of their centralised counterparts without giving up decentralisation is difficult. Second, the collaborative nature of these systems (i.e., pooling the resources of peers, some of which can be malicious) raises privacy and security challenges. Other challenges include efficient support for mutable content (e.g., dynamic webpages), ease of access (e.g., by current web users on browsers), and moderation of the content stored on these systems. These challenges are elaborated in the sections below.
6.2 Implementations
The ideas behind decentralised storage networks were first developed for P2P networks and initially produced unstructured networks like Gnutella [165]. While these performed well in fetching popular items, they were less successful in quickly retrieving less popular content. Shortly after, a number of projects instead began leveraging structured networks, particularly DHTs, to achieve more reliable performance guarantees; the most prominent among these was BitTorrent [154]. Over time, it became clear that many of these networks lacked robustness in terms of reliability and security, partially due to the lack of incentives. Furthermore, BitTorrent’s main use became the distribution of unlicensed content [127], leading to copyright and legal issues (see Section 6.7).
Recently, novel storage networks have emerged and gained popularity [50], most notably IPFS [26], Sia [193], and Swarm [180]. These can be built on structured, unstructured, or hybrid networks and use content addressing. While the principles of these projects are closely related to Information-Centric Networking (ICN) [5, 216]—a content-centric, network-layer paradigm that performs name-based routing using hierarchical content names—these novel projects operate at the application layer.
Content addressing (see Section 3.2) is a natural fit for decentralised file systems targeting a public decentralised web: content is distributed over the network with some degree of replication, and therefore any node (or set of nodes) may be able to serve a requested file. It would be counter-intuitive to restrict file retrieval to a single location, as is done in the current web. For the storage of private data, however, similar to personal Cloud storage, content addressing is not always necessary. Such is the case with Storj [210], which also introduces optimisations targeted toward decentralised Cloud storage and uses satellite nodes that manage parts of the network.
DStore [215] takes another approach to create a distributed outsourced data storage and retrieval scheme; it uses smart contracts to audit the integrity of the outsourced data, achieving both security and efficiency. Liang et al. [114] designed a fault-tolerant storage and repair scheme focused on blockchain-based networks, realising a regenerating code with high precision and repairability.
Another distinct project that proposes decentralising storage, led by Tim Berners-Lee, is Social Linked Data (SoLiD) [37]. SoLiD is designed to decouple users’ personal data from the applications that use it and allows users to set access control policies to maintain the privacy of their data stored in decentralised storage units. However, users must trust the decentralised storage units to properly authenticate applications and to follow their access control policies. More importantly, the current SoLiD protocols rely on centralised infrastructures such as PKIs and DNS.
Finally, we mention blockchains as an alternative method of storing data in a decentralised manner. While storing data on a blockchain can be made secure, it is extremely expensive, as the data is replicated over all peers and thus stored with extreme redundancy. In the rest of this section, we focus on recent application-layer decentralised file systems with live implementations and analyse their key aspects.
6.3 Storage
Content is initially stored only by its original publisher, who then serves the file, given that the publisher can (and is willing to) function as a provider of that content. In many decentralised file systems, any peer downloading content becomes a provider for that content by default, unless it configures its software to opt out of being a provider [26].
Performance and Reliability. Some decentralised file systems allow nodes to formally publish deals governed by a blockchain, in which one node pledges to store a particular content item [28, 193]. Secondary off-chain markets have also emerged where providers offer to pin specific files (i.e., keep the files permanently available). Some systems also introduce coding techniques (e.g., erasure coding) to improve the reliability of content storage in the presence of churn, since only a certain fraction of a content item’s coded segments is then needed to restore it. Combined with incentivised pinning of files at multiple locations, coding can further improve the permanence of content stored in these systems.
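As a minimal illustration of the erasure-coding idea, the sketch below stores one XOR parity segment alongside the data segments, so that any single lost segment can be restored from the rest; production systems use stronger codes (e.g., Reed-Solomon) that tolerate the loss of several segments. Segment contents and sizes are placeholders.

```go
package main

import "fmt"

// xorSegments XORs equally sized segments together; with one parity segment
// computed this way, any single missing segment can be recovered from the rest.
func xorSegments(segs [][]byte) []byte {
	out := make([]byte, len(segs[0]))
	for _, s := range segs {
		for i := range s {
			out[i] ^= s[i]
		}
	}
	return out
}

func main() {
	// Three equally sized data segments of a content item (placeholder values).
	data := [][]byte{[]byte("seg0"), []byte("seg1"), []byte("seg2")}
	parity := xorSegments(data) // stored on a fourth peer alongside the data segments

	// Suppose the peer holding segment 1 churns out of the network:
	recovered := xorSegments([][]byte{data[0], data[2], parity})
	fmt.Printf("recovered segment 1: %q\n", recovered)
}
```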
In addition to voluntarily storing and providing content, peers in some decentralised file systems [26] must participate in the (mandatory) storage of meta-data for content that they do not necessarily provide. In IPFS, the peers with public IP addresses collaboratively store (and provide) an index (i.e., meta-data) that maps the CIDs of the content available in the network to the providers of that content. In this system, to serve content, a content producer must prepare a “provider record” that maps the CID of the content to the producer’s network identifier (i.e., IP address and port number) and store this record in the DHT (i.e., using a DHT put(key, value) operation where key is the CID and value is the provider record). In a sense, provider records function as “pointers” to content that are used to resolve its providers.
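The following minimal sketch illustrates this publish/resolve flow over a toy in-memory key-value store standing in for the DHT; the record fields, CID string, and addresses are placeholders rather than IPFS’s actual record format.

```go
package main

import "fmt"

// ProviderRecord maps a CID to a provider's network identifier.
// Field names and values are illustrative, not the actual IPFS format.
type ProviderRecord struct {
	PeerID string
	Addr   string // IP address and port of the provider
}

// memDHT is a toy, single-process stand-in for the distributed index;
// a real DHT would place each record on the peers closest to the key.
type memDHT map[string][]ProviderRecord

func (d memDHT) Put(key string, rec ProviderRecord) { d[key] = append(d[key], rec) }
func (d memDHT) Get(key string) []ProviderRecord    { return d[key] }

func main() {
	dht := memDHT{}
	cid := "cid-example"

	// Publishing: put(CID, provider record).
	dht.Put(cid, ProviderRecord{PeerID: "peer-provider", Addr: "203.0.113.7:4001"})

	// Resolution: get(CID) returns the providers that can serve the content.
	for _, rec := range dht.Get(cid) {
		fmt.Printf("CID %s is provided by %s at %s\n", cid, rec.PeerID, rec.Addr)
	}
}
```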
The content and meta-data stored on decentralised file systems are generally publicly accessible: anyone in the network who knows a CID can fetch the corresponding data. This openness raises security and privacy concerns for storage nodes.
Privacy. Making content provider information publicly accessible in clear text in a DHT is a privacy concern for the nodes that store content and make it accessible. An obvious solution is to store provider records in encrypted form; however, managing decryption keys for content is an overhead for publishers. A possible workaround is to derive the decryption key for a CID’s provider records from the CID itself. This way, only the parties that know a CID can decrypt the provider records for that CID.
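As a rough illustration of this workaround, the sketch below derives a symmetric key from the CID with SHA-256 and encrypts a provider record with AES-GCM; the derivation label, record contents, and cipher choice are assumptions for illustration, not the scheme used by any particular system.

```go
package main

import (
	"crypto/aes"
	"crypto/cipher"
	"crypto/rand"
	"crypto/sha256"
	"fmt"
)

// keyFromCID derives a symmetric key from the CID itself, so any party that
// already knows the CID can decrypt the provider records for it. The
// derivation (SHA-256 over a fixed label plus the CID) is illustrative.
func keyFromCID(cid string) []byte {
	sum := sha256.Sum256([]byte("provider-record-key:" + cid))
	return sum[:]
}

// seal encrypts a provider record with AES-GCM and prepends the nonce, so a
// reader that re-derives the key from the CID can decrypt it.
func seal(key, record []byte) ([]byte, error) {
	block, err := aes.NewCipher(key)
	if err != nil {
		return nil, err
	}
	gcm, err := cipher.NewGCM(block)
	if err != nil {
		return nil, err
	}
	nonce := make([]byte, gcm.NonceSize())
	if _, err := rand.Read(nonce); err != nil {
		return nil, err
	}
	return append(nonce, gcm.Seal(nil, nonce, record, nil)...), nil
}

func main() {
	cid := "cid-example" // placeholder CID
	record := []byte("peer=peer-provider addr=203.0.113.7:4001")

	ciphertext, err := seal(keyFromCID(cid), record)
	if err != nil {
		panic(err)
	}
	fmt.Printf("encrypted provider record (%d bytes)\n", len(ciphertext))
}
```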
Moreover, the act of a provider putting encrypted provider records into the DHT (to be able to serve content for a given CID) should ideally not disclose to the DHT nodes the CID associated with the record. Otherwise, the DHT peers can passively observe the providers of CIDs even when the records are encrypted. Michel et al. [132, 133] propose using hash(CID) as the key under which a CID’s provider records are put into the DHT, instead of the CID itself. Because a CID is itself derived from the hash of the content, hashing it again is referred to as “double-hashing”. Using such double-hashed identifiers in DHT operations effectively hides the CIDs from the peers involved in those operations.
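A minimal sketch of the double-hashed lookup key, assuming SHA-256 as the hash and a placeholder CID:

```go
package main

import (
	"crypto/sha256"
	"fmt"
)

// dhtKey returns the "double-hashed" lookup key: the CID (itself already a
// hash of the content) is hashed once more, and that value is used as the
// key for DHT put/get operations, so DHT peers never see the CID itself.
func dhtKey(cid []byte) []byte {
	sum := sha256.Sum256(cid)
	return sum[:]
}

func main() {
	cid := []byte("cid-example") // placeholder CID
	fmt.Printf("put/get provider records under key %x instead of the CID\n", dhtKey(cid))
}
```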
One remaining problem is the possibility of malicious peers putting fake provider records into the DHT to launch Denial-of-Service attacks against victim peers whose peer IDs are listed as providers in those records. Michel et al. [132, 133] propose that a peer publishing a provider record also sign the record with its private key, whose public counterpart is used to derive the peer’s ID [133]. By including signatures in the records, clients who can decrypt the provider records can verify that the CID in a record is indeed provided by the peer that originally signed it (see References [132, 133] for details).
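The sketch below captures the signing and verification step with Ed25519 and a toy peer-ID derivation; the record layout and the ID derivation are placeholders and differ from the actual proposal.

```go
package main

import (
	"crypto/ed25519"
	"crypto/sha256"
	"fmt"
)

// peerIDFromKey derives an illustrative short peer ID from a public key;
// real systems derive peer IDs differently.
func peerIDFromKey(pub ed25519.PublicKey) string {
	sum := sha256.Sum256(pub)
	return fmt.Sprintf("%x", sum[:8])
}

func main() {
	// The provider's key pair; its public part underlies the provider's peer ID.
	pub, priv, err := ed25519.GenerateKey(nil)
	if err != nil {
		panic(err)
	}

	// The signed record binds the CID to the provider's peer ID (placeholder layout).
	cid := "cid-example"
	record := []byte(cid + "|" + peerIDFromKey(pub))
	sig := ed25519.Sign(priv, record)

	// A client that has decrypted the record verifies the signature before
	// trusting that this peer really provides the CID.
	fmt.Println("record signature valid:", ed25519.Verify(pub, record, sig))
}
```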
Security. In both MaidSafe [106] and Storj [210], content is stored in the network in encrypted form to provide confidentiality. In both systems, content is divided into a sequence of chunks, and the individual chunks are stored on the DHT. In MaidSafe [106], each chunk of content is encrypted with the hash of the previous chunk in the sequence, and each encrypted chunk is then XORed with the concatenated hashes of the original chunks for further obfuscation. Together with the encrypted chunks, a publisher must also publish a manifest file (i.e., containing meta-data) that maps the hashes of the obfuscated chunks to the hashes of the original chunks.
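A simplified sketch of this chunk obfuscation is shown below; the key derivation, cipher mode, chunk sizes, and padding are illustrative assumptions rather than MaidSafe’s exact self-encryption algorithm.

```go
package main

import (
	"crypto/aes"
	"crypto/cipher"
	"crypto/sha256"
	"fmt"
)

func hashChunk(b []byte) []byte { h := sha256.Sum256(b); return h[:] }

// encryptCTR encrypts data with AES-CTR under a 32-byte key.
func encryptCTR(key, data []byte) []byte {
	block, err := aes.NewCipher(key)
	if err != nil {
		panic(err)
	}
	iv := make([]byte, aes.BlockSize) // zero IV; keys are unique per chunk in this sketch
	out := make([]byte, len(data))
	cipher.NewCTR(block, iv).XORKeyStream(out, data)
	return out
}

// xorPad XORs data with a pad stretched from seed by repeated hashing.
func xorPad(seed, data []byte) []byte {
	out := make([]byte, len(data))
	pad := hashChunk(seed)
	for i := range data {
		if i > 0 && i%len(pad) == 0 {
			pad = hashChunk(pad)
		}
		out[i] = data[i] ^ pad[i%len(pad)]
	}
	return out
}

func main() {
	chunks := [][]byte{[]byte("chunk-0"), []byte("chunk-1"), []byte("chunk-2")}

	// Concatenated hashes of the original chunks, used as the obfuscation seed.
	var allHashes []byte
	for _, c := range chunks {
		allHashes = append(allHashes, hashChunk(c)...)
	}

	for i, c := range chunks {
		prev := chunks[(i+len(chunks)-1)%len(chunks)] // previous chunk (wrapping around)
		enc := encryptCTR(hashChunk(prev), c)         // encrypt with the hash of the previous chunk
		obf := xorPad(allHashes, enc)                 // obfuscate with the concatenated hashes
		fmt.Printf("chunk %d stored as %x\n", i, obf)
	}
}
```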
6.4 Retrieval
Data retrieval using content addressing requires resolving CIDs to the network identifiers or locations (i.e., IP addresses and port numbers) of the peers that can provide the content, i.e., the providers. In terms of the underlying P2P network structure, these systems can use unstructured networks, structured networks, or a hybrid of both, and the underlying structure shapes how content is resolved. In the unstructured case of Sia [193], nodes gather hints about possible locations through, for example, the blockchain deals, after which a select number of candidate nodes are queried rather than using a flooding-based search to resolve CIDs to their providers. The other projects use modified versions of the Kademlia [129] DHT, either for locating peers [28, 154] or for locating both peers and content [26, 180, 210].
Performance. The hybrid P2P approach in IPFS aims to optimise content retrieval latency using both unstructured and structured network connections, where the structured connections form a DHT (i.e., Kademlia). As part of the unstructured network, each node maintains a set of connections with peers discovered through DHT communications or incoming content requests. A peer uses these connections as part of the Bitswap [2] protocol to send requests for content directly to other peers. In the Bitswap protocol, nodes send want requests to each other, specifying lists of requested content CIDs; want requests do not propagate beyond the directly connected peers. Upon sending a Bitswap request for content to a set of peers, one or more of them may respond with an acknowledgement that the content is stored (i.e., cached) locally. The node can then attempt to retrieve the content from all acknowledging peers in parallel (e.g., requesting individual chunks of the content from different peers), similar to downloading content with BitTorrent [154].
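The following sketch shows, under simplified assumptions, how a want request over direct connections might be answered; real Bitswap messages carry additional fields (e.g., want-have versus want-block, priorities, cancellations), and the peer and CID names are placeholders.

```go
package main

import "fmt"

// WantMessage is a simplified stand-in for a Bitswap want request: a list of
// CIDs the sender is looking for.
type WantMessage struct {
	CIDs []string
}

// Peer models a directly connected neighbour with a local cache.
type Peer struct {
	ID    string
	Cache map[string]bool
}

// HandleWant returns the subset of requested CIDs this peer has cached;
// want requests are answered locally and never forwarded to other peers.
func (p Peer) HandleWant(msg WantMessage) []string {
	var have []string
	for _, cid := range msg.CIDs {
		if p.Cache[cid] {
			have = append(have, cid)
		}
	}
	return have
}

func main() {
	neighbours := []Peer{
		{ID: "peerA", Cache: map[string]bool{"cid-popular": true}},
		{ID: "peerB", Cache: map[string]bool{}},
	}
	want := WantMessage{CIDs: []string{"cid-popular", "cid-rare"}}

	// Ask every directly connected peer; peers that acknowledge can then be
	// asked for individual chunks in parallel, and anything still unresolved
	// falls back to the DHT lookup described next.
	for _, p := range neighbours {
		fmt.Printf("%s acknowledges: %v\n", p.ID, p.HandleWant(want))
	}
}
```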
In IPFS, a client looking for a content object first asks its Bitswap peers for that content’s CID. If none of the directly connected peers has the requested content cached locally, then the node queries the DHT, which stores a distributed index mapping CIDs to the providers of those CIDs (i.e., the provider records). In the Kademlia DHT used by IPFS [26], the provider records for content with CID \(c\) are stored at the twenty peers whose peer IDs are “closest” to \(c\), where the closeness of peer IDs and CIDs is determined according to the distance metric (i.e., XOR) used in the Kademlia DHT. A get() operation on a CID \(c\) returns the provider records for \(c\) from the twenty peers closest to \(c\) in the DHT.
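The sketch below illustrates how the peers responsible for a CID’s provider records can be selected using the XOR distance over identifiers; the peer IDs and the use of SHA-256 here are placeholders for illustration.

```go
package main

import (
	"bytes"
	"crypto/sha256"
	"fmt"
	"sort"
)

const k = 20 // number of closest peers that store a record, as in IPFS

// xorDistance returns the bytewise XOR of two identifiers; smaller values
// (compared lexicographically) mean "closer" in Kademlia.
func xorDistance(a, b []byte) []byte {
	d := make([]byte, len(a))
	for i := range a {
		d[i] = a[i] ^ b[i]
	}
	return d
}

// closestPeers sorts peer IDs by their XOR distance to the target key and
// returns the k closest ones.
func closestPeers(target []byte, peers [][]byte) [][]byte {
	sort.Slice(peers, func(i, j int) bool {
		return bytes.Compare(xorDistance(peers[i], target), xorDistance(peers[j], target)) < 0
	})
	if len(peers) > k {
		peers = peers[:k]
	}
	return peers
}

func main() {
	target := sha256.Sum256([]byte("cid-example")) // stand-in for the key of CID c

	// Illustrative peer IDs; real peer IDs are derived from peers' public keys.
	var peers [][]byte
	for i := 0; i < 100; i++ {
		id := sha256.Sum256([]byte(fmt.Sprintf("peer-%d", i)))
		peers = append(peers, id[:])
	}

	closest := closestPeers(target[:], peers)
	fmt.Printf("provider records for c would be stored on %d peers; closest ID prefix: %x\n",
		len(closest), closest[0][:4])
}
```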
In IPFS, Bitswap requests for popular content (i.e., content stored by many peers) are likely to succeed, and therefore the retrieval of such content may not require DHT resolution. Because content resolution through a DHT can be slow (i.e., it requires contacting O(log n) peers), Bitswap can significantly reduce content retrieval latency. At the same time, the Bitswap protocol also helps reduce the burden on the DHT, as the distribution of content requests tends to follow a power law, i.e., in most content networks the majority of requests target the most popular content [70, 124]. However, Bitswap requests for unpopular content are likely to fail, which delays the switch-over to DHT resolution and thus slightly delays the retrieval of such content. A hybrid system may therefore require optimisations, such as using both networks simultaneously, to improve content retrieval latency at the cost of additional overhead.
IPFS facilitates peer-to-peer connections between nodes situated behind Network Address Translation (NAT) devices. When two peers willing to communicate are both behind NAT, IPFS allows them to utilise a third (i.e., relay) peer with a public IP address to bootstrap their communication. When NAT’ed peers provide content, the addresses stored in their provider records reference both the relay peer and the NAT’ed peer. These relay nodes facilitate connections between NAT’ed peers by employing standard hole-punching techniques. Ensuring accessibility for peers behind NAT is a crucial aspect of decentralised file systems, particularly since contributors typically connect from home networks, where NAT’ed connections are to be expected.
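As an illustration, a relayed provider address might take roughly the following shape; the IP address and peer identifiers are placeholders, and the exact multiaddress layout may differ between versions.

```go
package main

import "fmt"

// relayedAddr assembles an illustrative libp2p-style circuit-relay address:
// the NAT'ed provider is reached through the relay peer's public address.
// The layout shown here is an assumption for illustration and may not match
// the exact multiaddress format used by a given IPFS version.
func relayedAddr(relayIP string, relayPort int, relayID, targetID string) string {
	return fmt.Sprintf("/ip4/%s/tcp/%d/p2p/%s/p2p-circuit/p2p/%s",
		relayIP, relayPort, relayID, targetID)
}

func main() {
	// Placeholder relay address and peer identifiers.
	fmt.Println(relayedAddr("198.51.100.9", 4001, "RelayPeerID", "NatedPeerID"))
}
```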
Privacy. In addition to the performance of content retrieval, privacy is another important consideration. Some systems, such as OneSwarm [88], distinguish between trusted (e.g., friends and family) and untrusted peers and introduce address-obscuring techniques to increase the privacy protection of their participants. Ideally, a system should not reveal which particular content is searched for by a given client, providing a form of “reader” privacy. While recent measurement studies on IPFS demonstrate the ease of monitoring content requests [19, 20, 21], using the hash of the CID (double-hashing) as the search key (Section 6.3) can be effective in hiding the target CID [132]. The double-hashing extension is also useful for hiding CIDs in the Bitswap protocol—when sending want requests for content, the hash of the CID can be used instead of the CID itself. This way, Bitswap peers cannot determine which CID a reader wants unless they already have the content stored.
Censorship. Although decentralisation should theoretically make censorship of content difficult, Sridhar et al. [174] have demonstrated a censorship attack on the DHT resolution of IPFS in which Sybil peers are strategically placed on the DHT to block requests for the provider records of a target CID. In particular, when twenty or more Sybils are placed as the closest peers (i.e., based on the XOR distance metric used by Kademlia) to the target CID, provider record lookups can be intercepted (and simply ignored) by these Sybils. The placement of Sybils can be achieved through brute-force generation of peer identifiers. The authors also propose detection and mitigation mechanisms against this attack. The detection method examines the distribution of peer IDs among the closest peers to a given CID and flags a potential attack if this distribution differs significantly from the expected one, assuming that the IDs of legitimate (non-Sybil) peers are uniformly distributed throughout the DHT key space. To mitigate a detected attack, a broader region of the DHT is used for storing and retrieving provider records.
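The sketch below is a simplified stand-in for this detection idea: with uniformly distributed peer IDs, the peers closest to a CID should not share implausibly long ID prefixes with it. The 64-bit IDs, the brute-forced prefix length, and the threshold are all assumptions for illustration; the actual proposal uses a formal statistical test over the observed ID distribution.

```go
package main

import (
	"fmt"
	"math"
	"math/bits"
	"math/rand"
)

// commonPrefixLen counts the leading bits shared by two 64-bit IDs
// (real Kademlia identifiers are longer; 64 bits keeps the sketch short).
func commonPrefixLen(a, b uint64) int {
	return bits.LeadingZeros64(a ^ b)
}

// meanPrefixLen averages the common prefix length between a target key and
// the IDs of the peers currently closest to it.
func meanPrefixLen(target uint64, closest []uint64) float64 {
	var sum float64
	for _, id := range closest {
		sum += float64(commonPrefixLen(target, id))
	}
	return sum / float64(len(closest))
}

func main() {
	const networkSize = 20000
	target := rand.Uint64() // stand-in for the key of the CID under attack

	// With uniformly random IDs, even the single closest of networkSize peers
	// shares only about log2(networkSize) leading bits with the target.
	expected := math.Log2(networkSize)

	// Attack case: 20 brute-forced Sybil IDs sharing ~30 leading bits with the target.
	var sybils []uint64
	for i := 0; i < 20; i++ {
		sybils = append(sybils, (target&^uint64(1<<34-1))|uint64(rand.Int63n(1<<34)))
	}

	observed := meanPrefixLen(target, sybils)
	fmt.Printf("expected ~%.1f shared prefix bits, observed %.1f -> suspicious: %v\n",
		expected, observed, observed > expected+5) // threshold is illustrative
}
```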
Decentralised file systems are also vulnerable to Sybil attacks that aim to undermine the integrity of the underlying P2P network. One such attack is the eclipse attack [80, 83, 126], in which Sybils isolate peers by gaining control over their connections and then manipulate or censor the information exchanged between the isolated peers and the rest of the network. Eclipse attacks can target the unstructured blockchain networks that some decentralised file systems use to publish storage deals [28], or the DHTs that store content meta-data such as provider records, in order to prevent content retrieval. Recent research proposes diversifying the connections of peers (e.g., in terms of the IP addresses they connect to) to make such attacks more difficult [100].
6.6 Incentives
Providing participants with incentives for continued and active participation is important for decentralised file systems to operate in a reliable manner. Early P2P storage networks generally leveraged non-financial incentives, such as BitTorrent’s tit-for-tat [43], which rewards resources contributed to the network with faster downloads in return. Another example is Samsara [46], which applies tit-for-tat behaviour to the contribution of storage resources, i.e., symmetric storage relationships between peers. In Samsara, a peer S stores a chunk of data for a peer R in exchange for R storing an equally sized storage claim issued by S. S can periodically verify the existence of the claim through a challenge-response protocol, which prevents R from removing or compressing the claim; eventually, S can request R to store a data chunk in place of the claim, in which case R replaces the claim with S’s data. However, malicious peers can refuse to store data when later requested, as the claim mechanism cannot force peers holding claims to replace them with data. Also, the verification of claims adds significant overhead on the peers.
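A minimal sketch of such a challenge-response check is shown below: the verifier issues a fresh nonce and expects a hash over the nonce and the claimed data, which the remote peer can only produce if it still holds the claim. The message format is an illustrative assumption, not Samsara’s exact protocol.

```go
package main

import (
	"crypto/rand"
	"crypto/sha256"
	"fmt"
)

// respond is R's side of the check: hash the fresh nonce together with the
// claimed data, which R can only do if it still holds the claim.
func respond(nonce, claim []byte) [32]byte {
	return sha256.Sum256(append(append([]byte{}, nonce...), claim...))
}

func main() {
	// The storage claim that R holds on behalf of S (placeholder contents).
	claim := make([]byte, 1024)
	if _, err := rand.Read(claim); err != nil {
		panic(err)
	}

	// S issues a random challenge; a fresh nonce prevents R from caching
	// old answers instead of keeping the claim.
	nonce := make([]byte, 16)
	if _, err := rand.Read(nonce); err != nil {
		panic(err)
	}

	got := respond(nonce, claim)                                        // R's answer from its stored copy
	want := sha256.Sum256(append(append([]byte{}, nonce...), claim...)) // S's expected answer

	fmt.Println("R still stores the claim:", got == want)
}
```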
A number of projects have also started incorporating blockchain-based rewards in their networks. Filecoin [28] creates an incentive layer on top of IPFS where nodes create on-chain storage deals. Storage nodes regularly submit proofs that they have been storing unique copies of the data, for which they receive off-chain micropayments. Similarly, BitTorrent issued a token to add robustness to its platform, while Skynet, a decentralised CDN, leverages the Sia blockchain; Swarm and Storj have issued blockchain tokens as well. Arweave [211] takes another approach towards realising decentralised storage and uses a blockchain-like linked structure with mining rewards based on pseudo-randomly chosen previous blocks linked to the latest state. Users therefore pay a one-time mining fee for storage, under the assumption that miners honestly keep and provide their data; this assumption may not hold in practice and can lead to poor scalability and performance.