6.1 Overview of Web Storage
In terms of content storage, the current web ecosystem is dominated by silos of providers residing in centrally controlled, public Cloud infrastructures. While these public Clouds provide users with on-demand access to a large pool of shared resources, they operate with little or no transparency. As a result, concerns over the security of confidential or sensitive data can favour the deployment of private Cloud infrastructures, which require large upfront costs.
More importantly, the centralisation of these silos’ infrastructures means that they reside in only a few locations on the Internet. As a consequence, even simple network failures can make these silos unavailable, as users experienced during recent outages at Amazon Web Services (AWS), which resulted in the loss of access to a significant portion of the web, and at Facebook [48, 182]. While replicating content across silo boundaries would lead to better performance and availability for users, the lack of incentives prevents such cooperative action among the silos.
Content retrieval from centralised Cloud infrastructures deployed at remote datacenters can experience large communication latency. To reduce this latency, the emerging edge computing [168] paradigm promises to deploy small-scale datacenters at locations close to users. However, such edge infrastructures are mostly appropriate for small-scale, low-latency applications and are not typically designed for the workload of the web. Instead, a truly decentralised web can be realised by pooling the vast resources of users worldwide and incentivising their proper usage to achieve scalability and sufficient performance.
Other important actors in content retrieval in the current web are Content Delivery Networks (CDNs), which provide large-scale retrieval of Quality-of-Service (QoS)-sensitive content through on-demand content replication at distributed caches worldwide. While on-demand replication with simple reactive caching policies (such as LRU) is effective in providing sufficient content retrieval performance, the location-based nature of web references (i.e., addressing) makes replicating and moving content difficult, as such actions invalidate existing references to the content. To deal with this problem, CDNs use proprietary name resolution mechanisms that immediately update the invalidated web references upon movement or replication of content. Despite being distributed infrastructures, CDNs are centrally governed systems that charge content producers for distributing their content. This makes content delivery expensive, especially for small content producers. Finally, to serve content over HTTPS, CDNs need to hold content publishers’ private keys, further increasing centralisation and lowering the security of the entire web [86].
Key Challenges. Decentralised file systems face several crucial challenges. First, matching the reliability and performance of their centralised counterparts without giving up decentralisation is difficult. Second, the collaborative nature of these systems (i.e., pooling the resources of peers, some of which can be malicious) raises privacy and security challenges. Other challenges include efficient support for mutable content (e.g., dynamic webpages), ease of access (e.g., by current web users on browsers), and moderation of the content stored on these systems. These challenges are elaborated in the sections below.
6.2 Implementations
The ideas behind decentralised storage networks were first developed for P2P networks and initially produced unstructured networks like Gnutella [165]. While these performed well in fetching popular items, they were less successful in quickly retrieving less popular content. Shortly after, a number of projects instead began leveraging structured networks, particularly DHTs, to achieve more reliable performance guarantees; the most prominent among these was BitTorrent [154]. Over time, it became clear that many of these networks lacked robustness in terms of reliability and security, partially due to the lack of incentives. Furthermore, BitTorrent’s main use became the distribution of unlicensed content [127], leading to copyright and legal issues (see Section 6.7).
Recently, novel storage networks have emerged and gained popularity [50], most notably IPFS [26], Sia [193], and Swarm [180]. These can be built on structured, unstructured, or hybrid networks and use content addressing. While the principles of these projects are closely related to Information-Centric Networking (ICN) [5, 216]—a content-centric, network-layer paradigm that performs name-based routing using hierarchical content names—these novel projects operate at the application layer.
Content addressing (see Section 3.2) is a natural fit for decentralised file systems targeting a public decentralised web: content is distributed over the network with some degree of replication, and therefore any node (or set of nodes) may be able to serve a requested file. It would be counter-intuitive to restrict file retrieval to a single location, as is done in the current web. For the storage of private data, however, similar to personal Cloud storage, content addressing is not always necessary. Such is the case with Storj [210], which also introduces optimisations targeted toward decentralised Cloud storage and uses satellite nodes that manage parts of the network.
DStore [215] takes another approach to create a distributed outsourced data storage and retrieval scheme; it uses smart contracts to audit the integrity of the outsourced data, achieving both security and efficiency. Liang et al. [114] designed a fault-tolerant storage and repair scheme focused on blockchain-based networks, realising a regenerating code with high precision and repairability.
Another distinct project that proposes decentralising storage, led by Tim Berners-Lee, is Social Linked Data (SoLiD) [37]. SoLiD is designed to decouple users’ personal data from the applications that use it and allows users to set access control policies to maintain the privacy of their data stored in decentralised storage units. However, users must trust the decentralised storage units to properly authenticate applications and to follow their access control policies. More importantly, the current SoLiD protocols rely on centralised infrastructures such as PKIs and DNS.
Finally, we mention blockchains as an alternative method of storing data in a decentralised manner. While storing data on a blockchain can be made secure, it is extremely expensive, as the data is replicated over all peers and thus stored with extreme redundancy. In the rest of this section, we focus on recent application-layer decentralised file systems with live implementations and analyse their key aspects.
6.3 Storage
Content is initially stored only by its original publisher, who then serves the file, given that the publisher can (and is willing to) function as a provider of that content. In many decentralised file systems, any peer downloading content becomes a provider for that content by default, unless it configures its software to opt out of being a provider [26].
Performance and Reliability. Some decentralised file systems allow nodes to formally publish deals governed by a blockchain, in which one node pledges to store a particular content item [28, 193]. Secondary off-chain markets have also emerged where providers offer to pin specific files (i.e., keep the files permanently available). Some systems also introduce coding techniques (e.g., erasure coding) to improve the reliability of content storage in the presence of churn, since only a certain fraction of a content item’s coded segments is then needed to restore it. Combined with incentivised pinning of files at multiple locations, coding can further improve the permanence of content stored in these systems.
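As a minimal illustration of the erasure-coding idea, the sketch below stores one XOR parity segment alongside the data segments, so that any single lost segment can be restored from the rest; production systems use stronger codes (e.g., Reed-Solomon) that tolerate the loss of several segments. Segment contents and sizes are placeholders.

```go
package main

import "fmt"

// xorSegments XORs equally sized segments together; with one parity segment
// computed this way, any single missing segment can be recovered from the rest.
func xorSegments(segs [][]byte) []byte {
	out := make([]byte, len(segs[0]))
	for _, s := range segs {
		for i := range s {
			out[i] ^= s[i]
		}
	}
	return out
}

func main() {
	// Three equally sized data segments of a content item (placeholder values).
	data := [][]byte{[]byte("seg0"), []byte("seg1"), []byte("seg2")}
	parity := xorSegments(data) // stored on a fourth peer alongside the data segments

	// Suppose the peer holding segment 1 churns out of the network:
	recovered := xorSegments([][]byte{data[0], data[2], parity})
	fmt.Printf("recovered segment 1: %q\n", recovered)
}
```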
In addition to voluntarily storing and providing content, peers in some decentralised file systems [26] must participate in the (mandatory) storage of meta-data for content that they do not necessarily provide. In IPFS, the peers with public IP addresses collaboratively store (and provide) an index (i.e., meta-data) that maps the CIDs of the content available in the network to the providers of that content. In this system, to serve content, a content producer must prepare a “provider record” that maps the CID of the content to the producer’s network identifier (i.e., IP address and port number) and store this record in the DHT (i.e., using a DHT put(key, value) operation where key is the CID and value is the provider record). In a sense, provider records function as “pointers” to content that are used to resolve its providers.
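The following minimal sketch illustrates this publish/resolve flow over a toy in-memory key-value store standing in for the DHT; the record fields, CID string, and addresses are placeholders rather than IPFS’s actual record format.

```go
package main

import "fmt"

// ProviderRecord maps a CID to a provider's network identifier.
// Field names and values are illustrative, not the actual IPFS format.
type ProviderRecord struct {
	PeerID string
	Addr   string // IP address and port of the provider
}

// memDHT is a toy, single-process stand-in for the distributed index;
// a real DHT would place each record on the peers closest to the key.
type memDHT map[string][]ProviderRecord

func (d memDHT) Put(key string, rec ProviderRecord) { d[key] = append(d[key], rec) }
func (d memDHT) Get(key string) []ProviderRecord    { return d[key] }

func main() {
	dht := memDHT{}
	cid := "cid-example"

	// Publishing: put(CID, provider record).
	dht.Put(cid, ProviderRecord{PeerID: "peer-provider", Addr: "203.0.113.7:4001"})

	// Resolution: get(CID) returns the providers that can serve the content.
	for _, rec := range dht.Get(cid) {
		fmt.Printf("CID %s is provided by %s at %s\n", cid, rec.PeerID, rec.Addr)
	}
}
```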
The content and meta-data stored on decentralised file systems are generally publicly accessible: anyone in the network who knows a CID can fetch the corresponding data. This openness raises security and privacy concerns for storage nodes.
Privacy. Making content provider information publicly accessible in clear text in a DHT is a privacy concern for the nodes that store content and make it accessible. An obvious solution is to store provider records in encrypted form; however, managing decryption keys for content is an overhead for publishers. A possible workaround is to derive the decryption key for a CID’s provider records from the CID itself. This way, only the parties that know a CID can decrypt the provider records for that CID.
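As a rough illustration of this workaround, the sketch below derives a symmetric key from the CID with SHA-256 and encrypts a provider record with AES-GCM; the derivation label, record contents, and cipher choice are assumptions for illustration, not the scheme used by any particular system.

```go
package main

import (
	"crypto/aes"
	"crypto/cipher"
	"crypto/rand"
	"crypto/sha256"
	"fmt"
)

// keyFromCID derives a symmetric key from the CID itself, so any party that
// already knows the CID can decrypt the provider records for it. The
// derivation (SHA-256 over a fixed label plus the CID) is illustrative.
func keyFromCID(cid string) []byte {
	sum := sha256.Sum256([]byte("provider-record-key:" + cid))
	return sum[:]
}

// seal encrypts a provider record with AES-GCM and prepends the nonce, so a
// reader that re-derives the key from the CID can decrypt it.
func seal(key, record []byte) ([]byte, error) {
	block, err := aes.NewCipher(key)
	if err != nil {
		return nil, err
	}
	gcm, err := cipher.NewGCM(block)
	if err != nil {
		return nil, err
	}
	nonce := make([]byte, gcm.NonceSize())
	if _, err := rand.Read(nonce); err != nil {
		return nil, err
	}
	return append(nonce, gcm.Seal(nil, nonce, record, nil)...), nil
}

func main() {
	cid := "cid-example" // placeholder CID
	record := []byte("peer=peer-provider addr=203.0.113.7:4001")

	ciphertext, err := seal(keyFromCID(cid), record)
	if err != nil {
		panic(err)
	}
	fmt.Printf("encrypted provider record (%d bytes)\n", len(ciphertext))
}
```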
Moreover, the act of a provider putting encrypted provider records into the DHT (to be able to serve content for a given CID) should ideally not disclose to the DHT nodes the CID associated with the record. Otherwise, the DHT peers can passively observe the providers of CIDs even when the records are encrypted. Michel et al. [132, 133] propose using hash(CID) as the key under which a CID’s provider records are put into the DHT, instead of the CID itself. Because a CID is itself derived from the hash of the content, hashing it again is referred to as “double-hashing”. Using such double-hashed identifiers in DHT operations effectively hides the CIDs from the peers involved in those operations.
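A minimal sketch of the double-hashed lookup key, assuming SHA-256 as the hash and a placeholder CID:

```go
package main

import (
	"crypto/sha256"
	"fmt"
)

// dhtKey returns the "double-hashed" lookup key: the CID (itself already a
// hash of the content) is hashed once more, and that value is used as the
// key for DHT put/get operations, so DHT peers never see the CID itself.
func dhtKey(cid []byte) []byte {
	sum := sha256.Sum256(cid)
	return sum[:]
}

func main() {
	cid := []byte("cid-example") // placeholder CID
	fmt.Printf("put/get provider records under key %x instead of the CID\n", dhtKey(cid))
}
```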
One remaining problem is the possibility of malicious peers putting fake provider records into the DHT to launch Denial-of-Service attacks against victim peers whose peer IDs are listed as providers in those records. Michel et al. [132, 133] propose that a peer publishing a provider record also sign the record with its private key, whose public counterpart is used to derive the peer’s ID [133]. By including signatures in the records, clients who can decrypt the provider records can verify that the CID in a record is indeed provided by the peer that originally signed it (see References [132, 133] for details).
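The sketch below captures the signing and verification step with Ed25519 and a toy peer-ID derivation; the record layout and the ID derivation are placeholders and differ from the actual proposal.

```go
package main

import (
	"crypto/ed25519"
	"crypto/sha256"
	"fmt"
)

// peerIDFromKey derives an illustrative short peer ID from a public key;
// real systems derive peer IDs differently.
func peerIDFromKey(pub ed25519.PublicKey) string {
	sum := sha256.Sum256(pub)
	return fmt.Sprintf("%x", sum[:8])
}

func main() {
	// The provider's key pair; its public part underlies the provider's peer ID.
	pub, priv, err := ed25519.GenerateKey(nil)
	if err != nil {
		panic(err)
	}

	// The signed record binds the CID to the provider's peer ID (placeholder layout).
	cid := "cid-example"
	record := []byte(cid + "|" + peerIDFromKey(pub))
	sig := ed25519.Sign(priv, record)

	// A client that has decrypted the record verifies the signature before
	// trusting that this peer really provides the CID.
	fmt.Println("record signature valid:", ed25519.Verify(pub, record, sig))
}
```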
Security. In both MaidSafe [106] and Storj [210], content is stored in the network in encrypted form to provide confidentiality. In both systems, content is divided into a sequence of chunks, and the individual chunks are stored on the DHT. In MaidSafe [106], each chunk of content is encrypted with the hash of the previous chunk in the sequence, and each encrypted chunk is then XORed with the concatenated hashes of the original chunks for further obfuscation. Together with the encrypted chunks, a publisher must also publish a manifest file (i.e., containing meta-data) that maps the hashes of the obfuscated chunks to the hashes of the original chunks.
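A simplified sketch of this chunk obfuscation is shown below; the key derivation, cipher mode, chunk sizes, and padding are illustrative assumptions rather than MaidSafe’s exact self-encryption algorithm.

```go
package main

import (
	"crypto/aes"
	"crypto/cipher"
	"crypto/sha256"
	"fmt"
)

func hashChunk(b []byte) []byte { h := sha256.Sum256(b); return h[:] }

// encryptCTR encrypts data with AES-CTR under a 32-byte key.
func encryptCTR(key, data []byte) []byte {
	block, err := aes.NewCipher(key)
	if err != nil {
		panic(err)
	}
	iv := make([]byte, aes.BlockSize) // zero IV; keys are unique per chunk in this sketch
	out := make([]byte, len(data))
	cipher.NewCTR(block, iv).XORKeyStream(out, data)
	return out
}

// xorPad XORs data with a pad stretched from seed by repeated hashing.
func xorPad(seed, data []byte) []byte {
	out := make([]byte, len(data))
	pad := hashChunk(seed)
	for i := range data {
		if i > 0 && i%len(pad) == 0 {
			pad = hashChunk(pad)
		}
		out[i] = data[i] ^ pad[i%len(pad)]
	}
	return out
}

func main() {
	chunks := [][]byte{[]byte("chunk-0"), []byte("chunk-1"), []byte("chunk-2")}

	// Concatenated hashes of the original chunks, used as the obfuscation seed.
	var allHashes []byte
	for _, c := range chunks {
		allHashes = append(allHashes, hashChunk(c)...)
	}

	for i, c := range chunks {
		prev := chunks[(i+len(chunks)-1)%len(chunks)] // previous chunk (wrapping around)
		enc := encryptCTR(hashChunk(prev), c)         // encrypt with the hash of the previous chunk
		obf := xorPad(allHashes, enc)                 // obfuscate with the concatenated hashes
		fmt.Printf("chunk %d stored as %x\n", i, obf)
	}
}
```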
6.4 Retrieval
Data retrieval using content addressing requires resolving CIDs to the network identifiers or locations (i.e., IP addresses and port numbers) of the peers that can provide the content, i.e., the providers. In terms of the underlying P2P network structure, these systems can use unstructured networks, structured networks, or a hybrid of both, and the underlying structure shapes how content is resolved. In the unstructured case of Sia [193], nodes gather hints about possible locations through, for example, the blockchain deals, after which a select number of candidate nodes are queried rather than using a flooding-based search to resolve CIDs to their providers. The other projects use modified versions of the Kademlia [129] DHT, either for locating peers [28, 154] or for locating both peers and content [26, 180, 210].
Performance. The hybrid P2P approach in IPFS aims to optimise content retrieval latency using both unstructured and structured network connections, where the structured connections form a DHT (i.e., Kademlia). As part of the unstructured network, each node maintains a set of connections with peers discovered through DHT communications or incoming content requests. A peer uses these connections as part of the Bitswap [2] protocol to send requests for content directly to other peers. In the Bitswap protocol, nodes send want requests to each other, specifying lists of requested content CIDs; want requests do not propagate beyond the directly connected peers. Upon sending a Bitswap request for content to a set of peers, one or more of them may respond with an acknowledgement that the content is stored (i.e., cached) locally. The node can then attempt to retrieve the content from all acknowledging peers in parallel (e.g., requesting individual chunks of the content from different peers), similar to downloading content with BitTorrent [154].
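The following sketch shows, under simplified assumptions, how a want request over direct connections might be answered; real Bitswap messages carry additional fields (e.g., want-have versus want-block, priorities, cancellations), and the peer and CID names are placeholders.

```go
package main

import "fmt"

// WantMessage is a simplified stand-in for a Bitswap want request: a list of
// CIDs the sender is looking for.
type WantMessage struct {
	CIDs []string
}

// Peer models a directly connected neighbour with a local cache.
type Peer struct {
	ID    string
	Cache map[string]bool
}

// HandleWant returns the subset of requested CIDs this peer has cached;
// want requests are answered locally and never forwarded to other peers.
func (p Peer) HandleWant(msg WantMessage) []string {
	var have []string
	for _, cid := range msg.CIDs {
		if p.Cache[cid] {
			have = append(have, cid)
		}
	}
	return have
}

func main() {
	neighbours := []Peer{
		{ID: "peerA", Cache: map[string]bool{"cid-popular": true}},
		{ID: "peerB", Cache: map[string]bool{}},
	}
	want := WantMessage{CIDs: []string{"cid-popular", "cid-rare"}}

	// Ask every directly connected peer; peers that acknowledge can then be
	// asked for individual chunks in parallel, and anything still unresolved
	// falls back to the DHT lookup described next.
	for _, p := range neighbours {
		fmt.Printf("%s acknowledges: %v\n", p.ID, p.HandleWant(want))
	}
}
```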
In IPFS, a client looking for a content object first asks its Bitswap peers for that content’s CID. If none of the directly connected peers has the requested content cached locally, then the node queries the DHT, which stores a distributed index mapping CIDs to the providers of those CIDs (i.e., the provider records). In the Kademlia DHT used by IPFS [26], the provider records for content with CID \(c\) are stored at the twenty peers whose peer IDs are “closest” to \(c\), where the closeness of peer IDs and CIDs is determined according to the distance metric (i.e., XOR) used in the Kademlia DHT. A get() operation on a CID \(c\) returns the provider records for \(c\) from the twenty peers closest to \(c\) in the DHT.
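The sketch below illustrates how the peers responsible for a CID’s provider records can be selected using the XOR distance over identifiers; the peer IDs and the use of SHA-256 here are placeholders for illustration.

```go
package main

import (
	"bytes"
	"crypto/sha256"
	"fmt"
	"sort"
)

const k = 20 // number of closest peers that store a record, as in IPFS

// xorDistance returns the bytewise XOR of two identifiers; smaller values
// (compared lexicographically) mean "closer" in Kademlia.
func xorDistance(a, b []byte) []byte {
	d := make([]byte, len(a))
	for i := range a {
		d[i] = a[i] ^ b[i]
	}
	return d
}

// closestPeers sorts peer IDs by their XOR distance to the target key and
// returns the k closest ones.
func closestPeers(target []byte, peers [][]byte) [][]byte {
	sort.Slice(peers, func(i, j int) bool {
		return bytes.Compare(xorDistance(peers[i], target), xorDistance(peers[j], target)) < 0
	})
	if len(peers) > k {
		peers = peers[:k]
	}
	return peers
}

func main() {
	target := sha256.Sum256([]byte("cid-example")) // stand-in for the key of CID c

	// Illustrative peer IDs; real peer IDs are derived from peers' public keys.
	var peers [][]byte
	for i := 0; i < 100; i++ {
		id := sha256.Sum256([]byte(fmt.Sprintf("peer-%d", i)))
		peers = append(peers, id[:])
	}

	closest := closestPeers(target[:], peers)
	fmt.Printf("provider records for c would be stored on %d peers; closest ID prefix: %x\n",
		len(closest), closest[0][:4])
}
```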
In IPFS, Bitswap requests for popular content (i.e., content stored by many peers) are likely to succeed, and therefore the retrieval of such content may not require DHT resolution. Because content resolution through a DHT can be slow (i.e., it requires contacting O(log n) peers), Bitswap can significantly reduce content retrieval latency. At the same time, the Bitswap protocol also helps reduce the burden on the DHT, as the distribution of content requests tends to follow a power law, i.e., in most content networks the majority of requests target the most popular content [70, 124]. However, Bitswap requests for unpopular content are likely to fail, which delays the switch-over to DHT resolution and thus slightly delays the retrieval of such content. A hybrid system may therefore require optimisations, such as using both networks simultaneously, to improve content retrieval latency at the cost of additional overhead.
IPFS facilitates peer-to-peer connections between nodes situated behind Network Address Translation (NAT) devices. When two peers willing to communicate are both behind NAT, IPFS allows them to utilise a third (i.e., relay) peer with a public IP address to bootstrap their communication. When NAT’ed peers provide content, the addresses stored in their provider records reference both the relay peer and the NAT’ed peer. These relay nodes facilitate connections between NAT’ed peers by employing standard hole-punching techniques. Ensuring accessibility for peers behind NAT is a crucial aspect of decentralised file systems, particularly since contributors typically connect from home networks, where NAT’ed connections are to be expected.
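As an illustration, a relayed provider address might take roughly the following shape; the IP address and peer identifiers are placeholders, and the exact multiaddress layout may differ between versions.

```go
package main

import "fmt"

// relayedAddr assembles an illustrative libp2p-style circuit-relay address:
// the NAT'ed provider is reached through the relay peer's public address.
// The layout shown here is an assumption for illustration and may not match
// the exact multiaddress format used by a given IPFS version.
func relayedAddr(relayIP string, relayPort int, relayID, targetID string) string {
	return fmt.Sprintf("/ip4/%s/tcp/%d/p2p/%s/p2p-circuit/p2p/%s",
		relayIP, relayPort, relayID, targetID)
}

func main() {
	// Placeholder relay address and peer identifiers.
	fmt.Println(relayedAddr("198.51.100.9", 4001, "RelayPeerID", "NatedPeerID"))
}
```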
Privacy. In addition to the performance of content retrieval, privacy is another important consideration. Some systems, such as OneSwarm [88], distinguish between trusted (e.g., friends and family) and untrusted peers and introduce address-obscuring techniques to increase the privacy protection of their participants. Ideally, a system should not reveal which particular content is searched for by a given client, providing a form of “reader” privacy. While recent measurement studies on IPFS demonstrate the ease of monitoring content requests [19, 20, 21], using the hash of the CID (double-hashing) as the search key (Section 6.3) can be effective in hiding the target CID [132]. The double-hashing extension is also useful for hiding CIDs in the Bitswap protocol—when sending want requests for content, the hash of the CID can be used instead of the CID itself. This way, Bitswap peers cannot determine which CID a reader wants unless they already have the content stored.
Censorship. Although decentralisation should theoretically make censorship of content difficult, Sridhar et al. [174] have demonstrated a censorship attack on the DHT resolution of IPFS in which Sybil peers are strategically placed on the DHT to block requests for the provider records of a target CID. In particular, when twenty or more Sybils are placed as the closest peers (i.e., based on the XOR distance metric used by Kademlia) to the target CID, provider record lookups can be intercepted (and simply ignored) by these Sybils. The placement of Sybils can be achieved through brute-force generation of peer identifiers. The authors also propose detection and mitigation mechanisms against this attack. The detection method examines the distribution of peer IDs among the closest peers to a given CID and flags a potential attack if this distribution differs significantly from the expected one, assuming that the IDs of legitimate (non-Sybil) peers are uniformly distributed throughout the DHT key space. To mitigate a detected attack, a broader region of the DHT is used for storing and retrieving provider records.
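The sketch below is a simplified stand-in for this detection idea: with uniformly distributed peer IDs, the peers closest to a CID should not share implausibly long ID prefixes with it. The 64-bit IDs, the brute-forced prefix length, and the threshold are all assumptions for illustration; the actual proposal uses a formal statistical test over the observed ID distribution.

```go
package main

import (
	"fmt"
	"math"
	"math/bits"
	"math/rand"
)

// commonPrefixLen counts the leading bits shared by two 64-bit IDs
// (real Kademlia identifiers are longer; 64 bits keeps the sketch short).
func commonPrefixLen(a, b uint64) int {
	return bits.LeadingZeros64(a ^ b)
}

// meanPrefixLen averages the common prefix length between a target key and
// the IDs of the peers currently closest to it.
func meanPrefixLen(target uint64, closest []uint64) float64 {
	var sum float64
	for _, id := range closest {
		sum += float64(commonPrefixLen(target, id))
	}
	return sum / float64(len(closest))
}

func main() {
	const networkSize = 20000
	target := rand.Uint64() // stand-in for the key of the CID under attack

	// With uniformly random IDs, even the single closest of networkSize peers
	// shares only about log2(networkSize) leading bits with the target.
	expected := math.Log2(networkSize)

	// Attack case: 20 brute-forced Sybil IDs sharing ~30 leading bits with the target.
	var sybils []uint64
	for i := 0; i < 20; i++ {
		sybils = append(sybils, (target&^uint64(1<<34-1))|uint64(rand.Int63n(1<<34)))
	}

	observed := meanPrefixLen(target, sybils)
	fmt.Printf("expected ~%.1f shared prefix bits, observed %.1f -> suspicious: %v\n",
		expected, observed, observed > expected+5) // threshold is illustrative
}
```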
Decentralised file systems are also vulnerable to Sybil attacks that aim to undermine the integrity of the underlying P2P network. One such attack is the eclipse attack [80, 83, 126], in which Sybils isolate peers by gaining control over their connections and then manipulate or censor the information exchanged between the isolated peers and the rest of the network. Eclipse attacks can target the unstructured blockchain networks that some decentralised file systems use to publish storage deals [28], or the DHTs that store content meta-data such as provider records, in order to prevent content retrieval. Recent research proposes diversifying the connections of peers (e.g., in terms of the IP addresses they connect to) to make such attacks more difficult [100].
6.6 Incentives
Providing participants with incentives for continued and active participation is important for decentralised file systems to operate in a reliable manner. Early P2P storage networks generally leveraged non-financial incentives, such as BitTorrent’s tit-for-tat [43], which rewards resources contributed to the network with faster downloads in return. Another example is Samsara [46], which applies tit-for-tat behaviour to the contribution of storage resources, i.e., symmetric storage relationships between peers. In Samsara, a peer S stores a chunk of data for a peer R in exchange for R storing an equally sized storage claim issued by S. S can periodically verify the existence of the claim through a challenge-response protocol, which prevents R from removing or compressing the claim; eventually, S can request R to store a data chunk in place of the claim, in which case R replaces the claim with S’s data. However, malicious peers can refuse to store data when later requested, as the claim mechanism cannot force peers holding claims to replace them with data. Also, the verification of claims adds significant overhead on the peers.
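A minimal sketch of such a challenge-response check is shown below: the verifier issues a fresh nonce and expects a hash over the nonce and the claimed data, which the remote peer can only produce if it still holds the claim. The message format is an illustrative assumption, not Samsara’s exact protocol.

```go
package main

import (
	"crypto/rand"
	"crypto/sha256"
	"fmt"
)

// respond is R's side of the check: hash the fresh nonce together with the
// claimed data, which R can only do if it still holds the claim.
func respond(nonce, claim []byte) [32]byte {
	return sha256.Sum256(append(append([]byte{}, nonce...), claim...))
}

func main() {
	// The storage claim that R holds on behalf of S (placeholder contents).
	claim := make([]byte, 1024)
	if _, err := rand.Read(claim); err != nil {
		panic(err)
	}

	// S issues a random challenge; a fresh nonce prevents R from caching
	// old answers instead of keeping the claim.
	nonce := make([]byte, 16)
	if _, err := rand.Read(nonce); err != nil {
		panic(err)
	}

	got := respond(nonce, claim)                                        // R's answer from its stored copy
	want := sha256.Sum256(append(append([]byte{}, nonce...), claim...)) // S's expected answer

	fmt.Println("R still stores the claim:", got == want)
}
```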
A number of projects have also started incorporating blockchain-based rewards in their networks. Filecoin [28] creates an incentive layer on top of IPFS where nodes create on-chain storage deals. Storage nodes regularly submit proofs that they have been storing unique copies of the data, for which they receive off-chain micropayments. Similarly, BitTorrent issued a token to add robustness to its platform, while Skynet, a decentralised CDN, leverages the Sia blockchain; Swarm and Storj have issued blockchain tokens as well. Arweave [211] takes another approach towards realising decentralised storage and uses a blockchain-like linked structure with mining rewards based on pseudo-randomly chosen previous blocks linked to the latest state. Users therefore pay a one-time mining fee for storage, under the assumption that miners honestly keep and provide their data; this assumption may not hold in practice and can lead to poor scalability and performance.