Article

DistOD: A Hybrid Privacy-Preserving and Distributed Framework for Origin–Destination Matrix Computation

Department of Computer Science, Sangmyung University, Seoul 03016, Republic of Korea
Electronics 2024, 13(22), 4545; https://doi.org/10.3390/electronics13224545
Submission received: 13 October 2024 / Revised: 7 November 2024 / Accepted: 18 November 2024 / Published: 19 November 2024
(This article belongs to the Special Issue Emerging Distributed/Parallel Computing Systems)

Abstract

The origin–destination (OD) matrix is a critical tool in understanding human mobility, with diverse applications. However, constructing OD matrices can pose significant privacy challenges, as sensitive information about individual mobility patterns may be exposed. In this paper, we propose DistOD, a hybrid privacy-preserving and distributed framework for the aggregation and computation of OD matrices without relying on a trusted central server. The proposed framework makes several key contributions. First, we propose a distributed method that enables multiple participating parties to collaboratively identify hotspot areas, which are regions frequently traveled between by individuals across these parties. To optimize the data utility and minimize the computational overhead, we introduce a hybrid privacy-preserving mechanism. This mechanism applies distributed differential privacy in hotspot areas to ensure high data utility, while using localized differential privacy in non-hotspot regions to reduce the computational costs. By combining these approaches, our method achieves an effective balance between computational efficiency and the accuracy of the OD matrix. Extensive experiments on real-world datasets show that DistOD consistently provides higher data utility than methods based solely on localized differential privacy, as well as greater efficiency than approaches based solely on distributed differential privacy.

1. Introduction

The origin–destination (OD) matrix captures the flow of individuals between any two regions. This matrix has been extensively applied across a wide range of fields due to its ability to provide valuable insights into mobility patterns [1]. In transportation management, OD matrices serve as the basis for traffic simulation models that enable efficient resource allocation, congestion reduction, and the optimization of public transportation systems [2,3,4,5]. In urban planning, OD matrices guide the development of sustainable cities by supporting evidence-based decisions about infrastructure investment and land use. They provide urban planners with critical insights into commuting patterns, enabling more effective and efficient urban design strategies [6,7].
Beyond transportation and urban planning, OD matrices play a critical role in public health and epidemiology, where they are used to model the spread of infectious diseases and assess the impact of human mobility on disease transmission [8,9]. In economic development, OD matrices are critical in modeling spatial interactions, providing insights into the flow of economic activity between locations and enabling the more accurate analysis of regional economic patterns and infrastructure impacts [10]. In addition, OD matrices are used in geographic analysis to represent and explore spatial interactions between locations, enabling the identification of complex flow patterns [11]. These diverse applications underscore the importance of OD matrices in comprehensively understanding and managing human mobility, making them indispensable for both theoretical research and practical implementation.
Despite the widespread utility of the OD matrix, its use can pose significant privacy risks to individuals. Because the OD matrix captures information about movement between regions, it has the potential to reveal sensitive personal information, including users’ travel routines and behavioral patterns. This risk arises particularly when the data are used without sufficient protective safeguards, making it possible to re-identify individuals based on their movement data. Researchers have demonstrated that combining mobility data with auxiliary information, such as publicly available demographics or social media check-ins, can make it possible to identify individuals with a high degree of accuracy [12,13,14]. Therefore, similar privacy concerns arise when collecting and sharing OD matrices, as the aggregated movement data can still reveal patterns that are vulnerable to re-identification attacks [15,16]. This is particularly problematic when OD matrices are shared with external organizations or released for public use, as the risk of unauthorized access, data breaches, or misuse increases [17,18]. In such scenarios, malicious attackers could exploit the published OD matrix data to identify individuals or gain insights into sensitive movement patterns.
Therefore, privacy concerns have become a major obstacle in constructing OD matrices, as many individuals are reluctant to share their location data due to potential misuse. This reluctance stems from the sensitive nature of location data, which can reveal work and home locations, travel patterns, and even personal habits [19,20]. Without sufficient trust in how their data are handled, individuals are less likely to participate in data collection efforts, resulting in incomplete or unreliable datasets [21,22]. As a result, the absence of comprehensive and secure privacy measures can significantly limit the collection of essential movement data, undermining the accuracy and utility of the resulting OD matrices. To overcome these challenges, it is critical to develop privacy-preserving techniques that protect individuals’ data while still enabling the generation of OD matrices with high utility.
Recently, the distributed approach has become increasingly popular, allowing multiple parties to work together to achieve a common goal without directly sharing their data. This method ensures that each party’s data are kept private, not only from other participants but also from the central server managing the process. As a result, it is particularly useful in situations where a trusted server is unavailable or where privacy concerns are paramount. For example, federated learning is widely used in deep learning to train models across distributed data sources [23,24,25]. Federated clustering allows data from multiple sources to be grouped together without compromising privacy [26,27]. In addition, secure multiparty computation allows multiple distributed parties to jointly compute a function over their inputs while ensuring that each party’s input remains private [28,29].
To address privacy concerns when constructing OD matrices, a distributed solution can be utilized to compute the matrix without sharing sensitive movement data with external, untrusted entities. Figure 1 presents a motivating example for this work, where each party, such as a service provider, maintains a trusted relationship with its users. As a result, each party is able to construct a local OD matrix using the trajectory information provided by its users. To compute an aggregated and global OD matrix, these parties collaborate with each other. Since the central server is not trusted in this scenario, each party shares only a privacy-preserved version of its local OD matrix with external entities, thus preserving their privacy while enabling the computation of the global OD matrix. Here, the privacy-preserved local OD matrix is generated using privacy-preserving mechanisms, such as differential privacy (DP) [30]. In this approach, even if the central server is not fully trusted, as is often the case in real-world applications, it can compute the combined OD matrix from the sanitized data provided by all parties, without having access to the raw OD matrices of individual participants.
Therefore, in this paper, we propose a distributed privacy-preserving framework for the computation of OD matrices using DP, which is the de facto standard for privacy-preserving data collection and publication. The contributions of this paper can be summarized as follows.
  • We present DistOD, a distributed privacy-preserving framework for the aggregation and computation of OD matrices in the absence of a trusted central server, a common scenario in real-world applications.
  • We propose a distributed method that allows participating parties to collaboratively identify hotspot areas, which represent regions frequently traveled between by individuals across multiple parties.
  • To enhance the utility of the resulting OD matrix, our approach employs a hybrid privacy-preserving mechanism. Specifically, we apply distributed DP (DDP) to collect OD data for hotspot areas, while using localized DP for non-hotspot regions. This hybrid approach strikes a balance between the high computational overhead of using only DDP and the reduced data utility often caused by relying solely on localized DP, enabling more efficient OD matrix generation while maintaining higher accuracy.
  • Finally, we validate the effectiveness of our proposed framework through experiments on real-world datasets, demonstrating that it can accurately compute OD matrices without relying on a trusted central server. This highlights the practical applicability of our approach in real-world scenarios.
The remainder of this paper is organized as follows. Section 2 reviews the related work. Section 3 outlines the necessary background and formally defines the problem addressed in this paper. Section 4 presents the proposed framework for the computation of OD matrices. Section 5 evaluates the proposed approach through experiments conducted on real-world datasets, and Section 6 presents the conclusions of the paper.

2. Related Work

2.1. OD Matrix Estimation and Computation

There has been extensive research focused on the estimation and computation of OD matrices. In this section, we briefly summarize some key studies. Mamei et al. [31] explored the estimation of OD matrices using call detail records (CDRs). They introduced two approaches: the time-based approach, which estimates user movements directly from CDRs generated within specific time windows, and the routine-based approach, which uses the routine movements of individuals to compute the OD matrix. Castiglione et al. [32] proposed a method for the estimation of OD matrices by combining principal component analysis (PCA) with a Kalman filter (KF). Their approach uses PCA to capture spatial correlations between variables, while the KF framework effectively handles the nonlinear relationship between traffic data and time-dependent OD flows. To address key challenges in OD matrix estimation, such as underdetermination and lags caused by varying traffic conditions, Xiong et al. [33] proposed an integrated approach that uses deep learning to infer the structures of dynamic OD sequences and applies structural constraints to guide traditional numerical optimization. Sun et al. [34] proposed two bi-level models for the reconstruction of the origin–destination demand in congested networks using both link and route travel times. Tsanakas et al. [35] proposed a data-driven approach to estimating time-dependent OD matrices using floating car data through a data-driven network assignment mechanism, which provides a linear mapping of the OD matrix to link flow observations. A two-step process for the estimation of bicycle OD matrices was proposed by [36]. The proposed approach first generates a primary OD matrix using a gravity model and then refines it with a path flow estimator to produce the final OD demand. Ros-Roca et al. [37] proposed a new constrained nonlinear optimization model that reduces the number of variables linearly based on the network size, rather than quadratically. By using traffic-related data, the proposed approach avoids the need for traditional traffic assignment and bi-level iterative processes.
With the recent success of deep learning techniques across diverse fields, there has been growing interest in applying these methods to predict OD flow matrices. Li et al. [38] present a generative network-based approach for the accurate prediction of network-wide ride-sourcing passenger demand OD matrices. Their model effectively captures spatiotemporal features and external dependencies, demonstrating superior performance and convergence when evaluated on real-world datasets. OD-GPT [39] is a generative pre-training model inspired by natural language processing that predicts OD flows by framing grid sequence prediction as a next-token prediction task. This approach effectively captures complex spatiotemporal dependencies in urban environments. GODDAG [40] is a framework designed to generate OD flow data in cities lacking historical records, utilizing a graph neural network-based mobility model paired with domain adversarial training. It learns mobility patterns from data-rich source cities and seamlessly transfers this knowledge to estimate OD flows in new urban areas. Chen et al. [41] proposed a framework using an autoencoder network with feature transfer to estimate urban dynamic OD matrices by integrating connected vehicle trajectories and limited automatic vehicle identification data.

2.2. Privacy-Preserving Techniques for OD Matrix Computation

There have been several proposals to address privacy concerns in the publication and estimation of OD matrices. Shaham et al. [18] introduced a technique for the privacy-preserving publication of multi-dimensional OD matrices that capture intermediate points along individual trips. By exploiting DP and taking into account important data properties, such as the density and homogeneity, their method provides robust privacy protection while ensuring high query accuracy. Matet et al. [17] presented an approach for the k-anonymization of OD matrices, addressing privacy concerns in mobility data. By exploiting the low dimensionality of OD data, the proposed method explores a larger solution space than traditional generalization algorithms while maintaining scalability for high-flow matrices. Yin et al. [42] introduced a spatial generalization approach based on k-anonymization to protect the privacy of OD matrices derived from mobile phone location data. Their method addresses the challenge of preserving data utility for mobility analysis while mitigating the risk of re-identification. Kohli et al. [43] developed an algorithm for the construction of differentially private mobility matrices, such as OD matrices, with formal privacy guarantees. They also explored practical strategies to balance privacy and data accuracy, providing valuable guidance for the responsible use of private mobility data.
The method proposed in this paper differs fundamentally from the existing approaches in the way in which it addresses privacy concerns in the handling of OD matrices. Existing methods assume the presence of a trusted central server that manages raw OD matrices and processes them to enhance the privacy. In contrast, the approach proposed in this paper assumes the absence of a trusted central server, necessitating a decentralized solution. As a result, the proposed method provides enhanced privacy protection by ensuring that no party has centralized access to the raw OD data.

2.3. DP in Distributed Systems

DP is widely used in various distributed environments to ensure data privacy, particularly in scenarios where sensitive information is shared across multiple devices, networks, or organizations. Its application spans a variety of areas, including federated learning, federated clustering, and Internet of Things (IoT) networks. In federated learning, DP is commonly used to protect individual user data by adding noise to local model updates before they are aggregated, ensuring that sensitive information remains protected throughout the learning process, without being shared directly with a central server [44,45,46]. In federated clustering, DP is used to collaboratively perform clustering tasks across distributed data sources, while ensuring that each party’s data remain private [47,48]. By adding noise to intermediate results or cluster centers, federated clustering with DP ensures that individual data points are not revealed, even when the collective task is complete. In addition, DP is being used in distributed environments such as cloud and fog layers to securely collect diverse IoT data while maintaining privacy and enabling data analysis and aggregation across multiple sources [49,50,51]. By integrating DP into these environments, sensitive IoT data, such as health information from wearable devices, smart home data, and location data from mobile sensors, can be collected and processed without compromising individual privacy.
The approach proposed in this paper shares structural similarities with federated learning and federated clustering, as each participating party collaborates to construct a global OD matrix. In addition, it is similar to distributed methods that collect data from diverse sources, as the global OD matrix is built by aggregating the local OD matrices of each participating party. We note that this is the first work to apply DP in the collaborative computation of a global OD matrix in a distributed environment, where the central server is assumed to be untrusted.

2.4. Security Patterns in Distributed Systems

A security pattern is a reusable solution to a common security problem encountered in software design and system architecture [52,53]. These patterns provide a structured framework for the implementation of security practices, enhancing systems’ robustness and resilience against potential threats. In this subsection, we provide a brief overview of the existing work on security patterns in distributed systems.
Uzunov et al. [54] presented a pattern-oriented approach to designing authorization infrastructure for distributed systems. They introduced a security solution framework that guides developers in building custom, application-specific authorization models through the incremental application of microprocess and product patterns. Security and Dependability (S&D) Artefacts [55] extended the concept of security patterns by incorporating dependability mechanisms that dynamically adapt to changing contextual conditions. S&D Artefacts are structured to cover the entire system lifecycle, providing a comprehensive library of solutions that enhance both the security and dependability in distributed systems. With the widespread adoption of cloud computing, several studies have focused on developing security patterns for cloud environments [56,57,58,59]. Moral-Garcia et al. [57] introduced enterprise security patterns to address recurring security challenges in protecting information systems within cloud-based platforms. Anand et al. [58] proposed a pattern-based cloud security framework that offers practical security solutions, which enable developers to implement them without needing to be security experts. Rath et al. [59] explored security patterns for cloud software as a service (SaaS) that address various security concerns, such as system security, data security, and privacy, and provided guidelines and case studies for the implementation of these patterns on platforms such as AWS and Azure.

3. Preliminaries

In this section, we first present the background information for this paper and then formally define the problem that we address. In addition, we describe the threat model assumed in this work.

3.1. Differential Privacy

DP is based on a mathematical framework that provides a probabilistic privacy guarantee even in the presence of attackers with arbitrary background knowledge [30]. DP guarantees that an attacker will not be able to determine with certainty whether a particular individual is included in the published dataset. It is defined as follows.
 Definition 1 ((ε, δ)-DP).
A randomized algorithm A satisfies (ε, δ)-DP if, for (1) any two neighboring datasets, D_1 and D_2, and (2) any possible output O of A, the following condition holds:
Pr[A(D_1) = O] ≤ e^ε · Pr[A(D_2) = O] + δ.
Two datasets, D_1 and D_2, are defined as neighbors if they differ by only one record. The privacy parameter ε, referred to as the privacy budget, controls the strength of the privacy guarantee. This definition ensures that, for any output of the algorithm A, an adversary cannot confidently determine whether the input was D_1 or D_2.
The Gaussian and Laplace mechanisms are two widely used mechanisms for achieving DP. The Gaussian mechanism achieves (ε, δ)-DP by adding random noise drawn from the Gaussian distribution.
 Definition 2 (Gaussian Mechanism).
Given a function f, the Gaussian mechanism A releases the following differentially private output:
A(D) = f(D) + N(0, σ²)
where N(0, σ²) represents noise drawn from a Gaussian distribution with mean 0 and variance σ². The standard deviation σ is defined as
σ = Δf · √(2 ln(1.25/δ)) / ε
Here, Δf is the global sensitivity, representing the maximum amount by which the output of f may change when a single record in the dataset is modified. Similarly, the Laplace mechanism achieves (ε, 0)-DP by adding noise drawn from a Laplace distribution with mean 0 and scale Δf/ε.
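As an illustration, the following minimal Python sketch shows how both mechanisms can be instantiated for a scalar query; the function names and the use of NumPy's random generator are our own choices and are not part of the formal definitions above.

```python
import numpy as np

rng = np.random.default_rng()

def gaussian_mechanism(value, sensitivity, epsilon, delta):
    # sigma = Delta_f * sqrt(2 ln(1.25 / delta)) / epsilon, as in Definition 2
    sigma = sensitivity * np.sqrt(2 * np.log(1.25 / delta)) / epsilon
    return value + rng.normal(0.0, sigma)

def laplace_mechanism(value, sensitivity, epsilon):
    # Laplace noise with mean 0 and scale Delta_f / epsilon gives (epsilon, 0)-DP
    return value + rng.laplace(0.0, sensitivity / epsilon)
```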
One of the key characteristics of DP is the composition property, which ensures that the overall privacy guarantee is maintained when multiple differentially private algorithms are applied to the same dataset. However, the total privacy budget accumulates with each application. Specifically, if k mechanisms, each providing (ε_i, δ_i)-DP, are applied to the same dataset, the combined privacy guarantee becomes (Σ_{i=1}^{k} ε_i, Σ_{i=1}^{k} δ_i)-DP. This means that, while the privacy is maintained, the total privacy budget increases with the number of algorithms used.

Variants of Differential Privacy

DP has several variants designed for different privacy needs and scenarios. Two commonly used variants are local differential privacy (LDP) and DDP.
LDP [60,61] provides DP guarantees at the individual data point level. In this approach, noise is added directly to each user’s data before they are shared or aggregated. This ensures that privacy protection is applied at the source, without relying on a trusted central server. Because each data point is obfuscated independently, LDP results in a larger loss of data utility due to the significant amount of noise required.
Similar to LDP, DDP [62,63] operates without relying on a trusted central server. However, it differs from LDP in one fundamental way. Noise is added to the combined output, ensuring that the overall result satisfies DP while allowing for more accurate data analysis compared to LDP. Unlike LDP, where each data point is individually protected, DDP guarantees DP only for the aggregated data, making it suitable for scenarios where collaboration among parties is required to compute joint statistics, while protecting the privacy of the entire dataset.

3.2. Problem Definition and Threat Model

Let us assume that the entire space is partitioned into non-overlapping regions, denoted as R = {r_i | i = 1, 2, …, m}. The OD matrix contains the directional mobility data, specifying the movement between origins and destinations. Let F be an m × m matrix, where each element F[i, j] represents the volume of flow from region r_i to region r_j. Typically, an OD matrix captures the mobility patterns between all regions within a given city.
Let us assume that P = {p_i | i = 1, 2, …, n} represents a set of parties, where each party p_i could be a service provider, such as an application provider, that maintains a trust relationship with its users. Let F_i denote the OD matrix corresponding to each party p_i ∈ P.
The objective of this paper is to compute the combined OD matrix, F, by aggregating the individual OD matrices from each participating party (i.e., F = Σ_{i=1}^{n} F_i) under the assumption that the central server is untrusted. In this scenario, the original OD matrices from each party cannot be directly shared with the central server or with other parties due to privacy concerns. Therefore, this paper leverages DP to protect each party’s sensitive OD matrix during the aggregation process. By using DP, each party can share a privacy-preserving version of its OD matrix, allowing the central server to compute the combined matrix without accessing the original data.

3.3. Threat Model

Mobility data are highly sensitive and vulnerable to privacy attacks, especially re-identification attacks. Adversaries can leverage auxiliary information to link anonymized mobility patterns to specific individuals [64]. This type of attack poses significant privacy risks and requires the use of privacy mechanisms.
In this paper, we assume two primary adversary models: the honest-but-curious adversary and re-identification attack models. In the honest-but-curious adversary model, the central server and the participating parties are assumed to follow the protocol as specified, but remain curious and may attempt to infer additional information from the data that they receive. In the re-identification attack model, we assume that adversaries may have access to auxiliary data, such as publicly available home and work locations. These adversaries can attempt to re-identify the individuals by correlating the entries in the OD matrix with this external information.

4. Distributed Privacy-Preserving Computation of OD Matrix Without Trusted Central Server

In this section, we introduce a distributed framework for the privacy-preserving computation of OD matrices without the need for a trusted central server. We begin by presenting the baseline approaches and highlighting their limitations. Next, we detail the proposed DistOD framework, which is designed to compute OD matrices efficiently and effectively in distributed environments, balancing the computational overhead with data utility.

4.1. Baseline Approaches

Algorithm 1 presents the pseudocode executed by the t-th participating party (the same procedure is applied by every other party) to distributively compute an OD matrix while ensuring privacy through localized DP. We note that, in our work, localized DP refers to a setting where each party (e.g., a service provider) applies DP to a local dataset collected from its users. This approach differs from LDP [60,61], where noise is added directly to individual data points at the user level before any data aggregation occurs.
The algorithm iterates over each element (i, j) of the OD matrix, adding noise to F_t[i, j] to locally satisfy ε-DP (line 5). After perturbing all non-zero entries, the party uploads the perturbed data to the central server for aggregation. Since OD matrices are typically sparse, this approach perturbs and reports only non-zero elements of the OD matrix. We assume that reporting only the non-zero elements poses a minimal privacy risk, as these elements alone are insufficient to reconstruct sensitive individual movement patterns or compromise user privacy in a meaningful way.
Algorithm 1 Aggregation of OD matrix based on localized DP.
1: Each Participating Party Processing:
2: Initialize O_t as an empty list
3: for each (i, j) where i, j ∈ {1, 2, …, m} do
4:     if F_t[i, j] > 0 then
5:         Append (i, j, F_t[i, j] + N(0, σ²)) to O_t
6:     end if
7: end for
8: Report O_t to the central server
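A minimal Python sketch of this per-party step is shown below; it assumes the local OD matrix is held as a NumPy array and that σ has already been calibrated via the Gaussian mechanism, and the function name is illustrative rather than part of the framework.

```python
import numpy as np

def localized_dp_report(F_t, sigma, rng=np.random.default_rng()):
    """Perturb and report only the non-zero entries of a party's local OD matrix (Algorithm 1 sketch)."""
    O_t = []
    for i, j in zip(*np.nonzero(F_t)):          # iterate over the non-zero (i, j) pairs
        O_t.append((int(i), int(j), F_t[i, j] + rng.normal(0.0, sigma)))
    return O_t                                  # list of (i, j, noisy count) uploaded to the central server
```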
In Algorithm 1, each party perturbs the elements of its local OD matrix independently to locally satisfy ε-DP, which can lead to an excessive amount of noise being added when all perturbed data from multiple parties are aggregated. For example, consider the case where the (i, j) element is non-zero for all parties, such as in the case of popular regions. When the central server aggregates the data from all users, the combined value of the (i, j) element will be Σ_{t=1}^{n} F_t[i, j] + n · N(0, σ²), which results in an overly noisy aggregated value. Since the goal is to compute the sum of the data across parties, rather than each individual value, it is sufficient to add noise to the final aggregated sum. By satisfying ε-DP globally instead of locally, we would only need to add noise N(0, σ²), which is significantly less than n · N(0, σ²).
In Algorithm 2, a DDP framework is used with secure multiparty aggregation (SecAgg) to compute the OD matrix in a privacy-preserving manner. Unlike Algorithm 1, where each party independently adds noise to its OD matrix to satisfy ε-DP locally, this approach distributes the noise across parties, significantly reducing the total amount of noise introduced during the aggregation process. Each participating party iterates over each element (i, j) of the OD matrix, adding noise drawn from a Gaussian distribution, N(0, σ²/n), where n is the total number of participating parties (line 4). Although the smaller noise added at each party individually does not provide ε-DP protection for its data, the aggregated noise across all parties ensures that the final combined result satisfies the ε-DP guarantee [62,63]. For example, for the (i, j) element, the combined value across all parties will be Σ_{t=1}^{n} (F_t[i, j] + N(0, σ²/n)) = Σ_{t=1}^{n} F_t[i, j] + N(0, σ²), which satisfies the ε-DP requirement for the aggregated data. Once the noise has been added, each party encrypts its perturbed matrix using the SecAgg protocol (line 6), ensuring that the central server can compute only the aggregated sum of the OD matrices from all participating parties, while preventing access to individual data that may not independently satisfy ε-DP.
Algorithm 2 Aggregating OD matrix based on DDP with secure aggregation.
1: Each Participating Party Processing:
2: Initialize F_t as an m × m matrix
3: for each (i, j) where i, j ∈ {1, 2, …, m} do
4:     F_t[i, j] ← F_t[i, j] + N(0, σ²/n)
5: end for
6: Encrypt F_t using SecAgg protocol
7: Report encrypted F_t to the central server
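The per-party step of Algorithm 2 can be sketched as follows; the SecAgg encryption itself is outside the scope of this sketch, and the function name is illustrative.

```python
import numpy as np

def ddp_perturb(F_t, sigma, n_parties, rng=np.random.default_rng()):
    """Add N(0, sigma^2 / n) noise to every entry of the local OD matrix (Algorithm 2 sketch).

    The noise contributed by all n parties sums to variance sigma^2 on the aggregate,
    so the combined result satisfies the DP guarantee even though each share does not.
    """
    noise = rng.normal(0.0, sigma / np.sqrt(n_parties), size=F_t.shape)
    return F_t + noise  # in practice, wrapped in a SecAgg protocol before upload
```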
Even though Algorithm 2 reduces the noise added to the data, it incurs a significant computational overhead. First, unlike Algorithm 1, each party cannot simply upload its non-zero elements, since the non-zero entries in each party’s OD matrix differ from one another. If parties only reported non-zero elements, when combined by the central server, there would not be enough noise added to satisfy ε-DP. Secondly, while there have been significant advances in secure multiparty aggregation techniques, such as those in [65,66], the computational overhead when using these techniques is still considerably high for both the central server and the participating parties. Moreover, the size of the OD matrix can be prohibitively large, making it impractical to report all elements. For instance, if the entire space is divided into a 100-by-100 grid of regions, the corresponding matrix would be 10⁴-by-10⁴, resulting in a total of 10⁸ elements. Reporting such a large number of elements is not feasible due to computational and communication constraints. As a result, while Algorithm 2 effectively reduces the amount of added noise, it remains computationally expensive because of these limitations.

4.2. Proposed DistOD Framework

In this subsection, we introduce the proposed DistOD, a distributed privacy-preserving framework for the computation of OD matrices without relying on a trusted central server. Figure 2 provides an overview of the DistOD process, which consists of two main phases.
  • In the hotspot identification phase, each party performs local clustering and sends the clustering results to the central server. The server then identifies hotspot areas based on the clustering results from all parties and distributes the identified hotspot information back to all parties.
  • In the OD matrix aggregation phase, a hybrid privacy-preserving mechanism is employed: DDP is applied to the OD data for hotspot areas, while localized DP is used for non-hotspot areas. After applying the DP mechanisms, each party uploads its OD matrix to the central server, which then aggregates these matrices to compute the global OD matrix.
The rationale behind applying DDP to hotspot areas is as follows. As discussed in the previous subsection, DDP introduces less noise to the raw data compared to localized DP. However, the difference in the amount of noise added by each approach can vary significantly depending on the number of common occurrences shared among parties. For instance, in an extreme case where only one party (e.g., the i-th party) has recorded visits for a particular region pair (r_s, r_e) in the OD matrix, applying localized DP results in noisy data F_i[s, e] + N(0, σ²). On the other hand, if DDP is applied, the noisy data are expressed as Σ_{t=1}^{n} F_t[s, e] + n · N(0, σ²/n), which is simplified to F_i[s, e] + N(0, σ²). Thus, in this extreme scenario, there is no advantage of using DDP over localized DP in terms of data utility.
With this in mind, the proposed DistOD approach first identifies hotspot regions that many parties are likely to visit and then applies DDP to these areas to maximize the benefits of using DDP. The detailed steps of each phase are presented below.

4.2.1. Hotspot Identification Phase

In our scenario, the central server is untrusted, and the participating parties do not trust each other. This necessitates a method to identify hotspot areas without directly sharing sensitive raw data. To address this, we employ location clustering techniques. Each party independently clusters its location data and shares only these aggregated results with the untrusted central server, enabling hotspot identification while preserving data privacy. Furthermore, in Section 4.2.3, we extend this clustering-based approach by incorporating DP. This additional layer of protection mitigates the potential for privacy breaches when sharing even the aggregated clustering results with the server. The process of collaboratively identifying hotspot areas consists of local clustering performed by each participating party and hotspot detection performed by the central server.
Local Clustering: Given a region r_i ∈ R, let (x_i, y_i) represent the center coordinates of r_i. Furthermore, let f_i denote the frequency of r_i, which represents the number of times that r_i appears as either an origin or destination in the OD matrix. This frequency can be computed by summing the values in both the row and column corresponding to r_i in the OD matrix as follows:
f_i = Σ_{h=1}^{m} F_t[i, h] + Σ_{h=1}^{m} F_t[h, i]
Given the dataset RD = {(x_i, y_i, f_i) | i = 1, 2, …, m}, where (x_i, y_i) represents the center coordinates of region r_i and f_i represents the frequency (weight) of visits for region r_i, we apply the k-means algorithm to partition the m regions into k clusters. The k-means algorithm minimizes the following objective function:
minimize Σ_{j=1}^{k} Σ_{r_i ∈ C_j} f_i · ‖(x_i, y_i) − (μ_{j,x}, μ_{j,y})‖²
where C_j is the set of points (regions) in cluster j and (μ_{j,x}, μ_{j,y}) is the centroid of cluster j.
After performing k-means clustering, the clustering result is represented as CR = {(μ_{j,x}, μ_{j,y}, w_j) | j = 1, 2, …, k}, where w_j is the weight representing the total visit frequency of all regions assigned to cluster C_j, calculated as w_j = Σ_{r_i ∈ C_j} f_i. This set of cluster centers and weights is then uploaded to the central server for further processing. The clustered regions represent a privacy-preserving version of the users’ location data, as only the aggregated cluster centers and their associated weights are shared, rather than the individual raw data.
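The local clustering step could, for example, be implemented with scikit-learn's weighted k-means, as in the following sketch; the helper name and the array layout are our assumptions.

```python
import numpy as np
from sklearn.cluster import KMeans

def local_clustering(centers_xy, freqs, k=50, seed=0):
    """Weighted k-means over region centers; returns CR_t = [(mu_x, mu_y, w_j), ...].

    centers_xy : (m, 2) array of region center coordinates (x_i, y_i).
    freqs      : length-m array of visit frequencies f_i used as sample weights.
    """
    km = KMeans(n_clusters=k, n_init=10, random_state=seed)
    labels = km.fit_predict(centers_xy, sample_weight=freqs)
    CR_t = []
    for j in range(k):
        w_j = float(freqs[labels == j].sum())   # total visit frequency of cluster j
        mu_x, mu_y = km.cluster_centers_[j]
        CR_t.append((float(mu_x), float(mu_y), w_j))
    return CR_t
```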
Hotspot Detection: Upon receiving the k-means clustering results from all participating parties, the central server identifies the hotspots that correspond to the top-l most frequently visited regions. Let CR = {CR_1, CR_2, …, CR_n} represent the set of k-means clustering results received from the n participating parties. For each clustering result CR_t from the t-th party, the server assigns a weight to each region based on its proximity to all cluster centers, using a Gaussian kernel so that regions closer to a cluster center receive higher weights, effectively prioritizing areas with higher visit densities. The weight assigned to a region r_i from the clustering result CR_t is calculated as
W_t(r_i) = Σ_{(μ_{j,x}, μ_{j,y}, w_j) ∈ CR_t} w_j · exp( −((x_i − μ_{j,x})² + (y_i − μ_{j,y})²) / (2γ²) )
Here, γ controls the spread of the Gaussian kernel, determining the extent to which each cluster center influences the surrounding regions. As a result, the weight assigned to each region is the cumulative sum of the contributions from all cluster centers in the clustering result CR_t, where each contribution is weighted by the total visit frequency of the regions within its corresponding cluster.
After calculating the weights for each region based on all clustering results in CR, the server aggregates these values to compute the global weight for each region. The global weight for a region r_i is computed by summing the weights assigned to it by all participating parties, as follows:
W_global(r_i) = Σ_{t=1}^{n} W_t(r_i).
Finally, the server identifies hotspots by selecting the top l regions with the highest global weights. However, selecting regions based on their global weights alone may lead to the inclusion of irrelevant areas, such as regions that are not part of the road network. To prevent this, we first apply a road map filter to exclude any regions that are not on roads, ensuring that only meaningful areas are selected as hotspots. We then sort the remaining regions by their global weights and select the top l regions with the highest weights. This refined set of regions is then returned to each participating party for further analysis.
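A possible server-side sketch of this weighting and selection procedure is shown below; the helper name, the boolean road-network mask, and the array layout are illustrative assumptions rather than details specified above.

```python
import numpy as np

def detect_hotspots(all_CR, centers_xy, on_road, l, gamma=10.0):
    """Aggregate the parties' clustering results and pick the top-l hotspot regions.

    all_CR     : list over parties; each item is a list of (mu_x, mu_y, w_j) tuples.
    centers_xy : (m, 2) array of region center coordinates.
    on_road    : boolean array of length m marking regions on the road network.
    """
    W_global = np.zeros(len(centers_xy))
    for CR_t in all_CR:                          # sum the per-party Gaussian-kernel weights
        for mu_x, mu_y, w_j in CR_t:
            d2 = (centers_xy[:, 0] - mu_x) ** 2 + (centers_xy[:, 1] - mu_y) ** 2
            W_global += w_j * np.exp(-d2 / (2 * gamma ** 2))
    W_global[~on_road] = -np.inf                 # road map filter: exclude off-road regions
    return np.argsort(W_global)[-l:][::-1]       # indices of the top-l hotspot regions
```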

4.2.2. OD Matrix Aggregation Phase

In the OD matrix aggregation phase, each party first applies a hybrid DP mechanism to its OD matrix. The privacy-preserved OD matrices are then uploaded to the central server, which aggregates them to compute the global OD matrix.
Hybrid DP Mechanism: Algorithm 3 presents the proposed hybrid DP mechanism, which combines DDP for hotspot areas and localized DP for non-hotspot areas, ensuring a balance between computational efficiency and data utility in the aggregation of the OD matrix. Each participating party processes its local OD matrix by iterating over each pair of regions (i, j). For hotspot areas, the party applies DDP by adding Gaussian noise N(0, σ²/n) to the OD matrix entry F_t[i, j] (line 5). For non-hotspot areas, if the value F_t[i, j] is greater than 0, Gaussian noise N(0, σ²) is added to the OD matrix entry (line 8). After processing, the values in O_t^DDP are encrypted using the SecAgg protocol, which ensures the secure aggregation of the DDP results without compromising individual privacy. Finally, both O_t^DDP and O_t are reported to the central server for further aggregation.
To improve the utility of the resulting OD matrix, our approach uses a hybrid privacy-preserving mechanism. By applying DDP to high-density regions (hotspots), we ensure that the OD matrix remains highly accurate in these critical areas by reducing the amount of noise introduced. On the other hand, localized DP is applied to non-hotspot areas to protect individual privacy using a simpler and more computationally efficient mechanism, although this comes at the cost of reduced accuracy.
Algorithm 3 Hybrid privacy-preserving mechanism for OD matrix aggregation.
1: Each Participating Party Processing:
2: Initialize O_t^DDP and O_t as empty lists
3: for each (i, j) where i, j ∈ {1, 2, …, m} do
4:     if r_i and r_j belong to a hotspot area then
5:         Append (i, j, F_t[i, j] + N(0, σ²/n)) to O_t^DDP
6:     else
7:         if F_t[i, j] > 0 then
8:             Append (i, j, F_t[i, j] + N(0, σ²)) to O_t
9:         end if
10:     end if
11: end for
12: Encrypt O_t^DDP using SecAgg protocol
13: Report encrypted O_t^DDP and O_t to the central server
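The following Python sketch illustrates the per-party step of Algorithm 3 under the assumption that hotspot membership is given as a boolean array; as with Algorithm 2, the SecAgg encryption of the DDP list is omitted, and the function name is illustrative.

```python
import numpy as np

def hybrid_perturb(F_t, is_hotspot, sigma, n_parties, rng=np.random.default_rng()):
    """DDP noise for hotspot region pairs, localized DP noise elsewhere (Algorithm 3 sketch).

    is_hotspot : boolean array of length m; True if the region was identified as a hotspot.
    """
    O_ddp, O_local = [], []
    m = F_t.shape[0]
    for i in range(m):
        for j in range(m):
            if is_hotspot[i] and is_hotspot[j]:
                noisy = F_t[i, j] + rng.normal(0.0, sigma / np.sqrt(n_parties))
                O_ddp.append((i, j, noisy))      # encrypted with SecAgg before upload
            elif F_t[i, j] > 0:
                O_local.append((i, j, F_t[i, j] + rng.normal(0.0, sigma)))
    return O_ddp, O_local
```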
Aggregation of the Global OD Matrix: Upon receiving all DP-applied OD matrices from all parties, the central server proceeds to aggregate them to compute the global OD matrix. This process is performed in two distinct steps. First, for the elements of the OD matrix corresponding to hotspot areas, the data received from the parties are encrypted using the SecAgg protocol. Thus, the server collaborates with all participating parties to securely decrypt these data, ensuring that it only accesses the aggregated results, without being able to access the individual data of each party. Secondly, for the elements of the OD matrix corresponding to non-hotspot areas, the server directly aggregates the noisy data provided by each party.
We also note that, in practice, a small number of parties may be adversarial or drop out during the secure aggregation process. In such cases, the server may either receive an inaccurate sum of the noise-added data or be unable to access the aggregated data due to the absence or malfunctioning of certain parties. The issue of dealing with adversarial parties or dropouts has been extensively studied in the literature [62]; therefore, we do not address it in this paper. These techniques can be seamlessly integrated into the proposed method. Instead, this paper focuses on the hybrid approach of using DDP and localized DP in the context of computing the OD matrix, with the goal of balancing the computational overhead and data utility.

4.2.3. Enhancing Privacy in the Hotspot Identification Process

As mentioned earlier, although the proposed DistOD approach shares the cluster results with the untrusted central server rather than the raw data, sharing the cluster centers can still lead to privacy breaches. To enhance the privacy during the hotspot identification process, we can leverage DP to perturb the cluster centers before uploading them to the server. Specifically, after performing k-means clustering, we generate the perturbed clustering result that satisfies (ε_1, δ_1)-DP as follows:
CR = {(μ_{j,x} + N(0, σ_1²), μ_{j,y} + N(0, σ_1²), w_j) | j = 1, 2, …, k}
Here, σ_1 is defined as Δf · √(2 ln(1.25/δ_1)) / ε_1. This ensures that the shared cluster centers are differentially private, minimizing the risk of privacy breaches while still allowing for effective hotspot identification.
Through the composition property of DP, given the total privacy budget ε, the remaining privacy budget, ε_2 (= ε − ε_1), is used to perturb the OD matrix, as presented in Algorithm 3. This enhanced privacy inevitably reduces the utility of the aggregated OD matrix, as the accuracy of hotspot identification is affected by the noise added to the true cluster centers, and the reduced privacy budget ε_2 is used to perturb the OD matrix during aggregation. In the experimental section, we will evaluate the impact of this enhanced privacy mechanism on the utility of the aggregated OD matrix.
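A minimal sketch of this perturbation step, assuming the clustering result is a list of (μ_{j,x}, μ_{j,y}, w_j) tuples as above, is given below; the function name is illustrative.

```python
import numpy as np

def perturb_cluster_centers(CR_t, sensitivity, eps1, delta1, rng=np.random.default_rng()):
    """Add Gaussian noise to the cluster centers so the upload satisfies (eps1, delta1)-DP."""
    sigma1 = sensitivity * np.sqrt(2 * np.log(1.25 / delta1)) / eps1
    return [(mu_x + rng.normal(0.0, sigma1),
             mu_y + rng.normal(0.0, sigma1),
             w_j)
            for mu_x, mu_y, w_j in CR_t]
```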

4.2.4. Integration of Security Patterns in DistOD Framework

A security pattern is a reusable solution to a common security problem in software design and system architecture [52,53,56]. These patterns outline recurring security issues and provide best-practice solutions that can be consistently applied in specific contexts to improve the security of system development and operation. In this subsection, we describe how security patterns are integrated into the DistOD framework to provide robust protection against potential security threats. To secure the DistOD framework, we propose to implement security patterns that address the challenges of distributed and privacy-preserving OD matrix computation, ensuring secure interactions, data flow management, and the protection of sensitive information.
Interaction Security Pattern: The decentralized structure of the DistOD framework necessitates that interactions between participating parties and the central server occur in an environment where trust cannot be inherently established. Given this lack of mutual trust, it becomes essential to adopt security measures that ensure that only legitimate entities are involved in data exchange and computation. To address this, we propose the use of an authentication and authorization pattern as a fundamental component for secure interactions. In this model, each party undergoes a thorough authentication process, using secure protocols to verify its identity before participating in any data-related activities. Once authenticated, an authorization mechanism must be in place to confirm that the entity has the necessary permissions to access or process specific data. This layered approach to identity verification and access control ensures that only verified and authorized parties are allowed to participate in the distributed computation of OD matrices.
Data Flow Security Pattern: To secure the transmission of data within the DistOD framework, we propose a secure data transmission pattern. This pattern assumes the use of encryption mechanisms to protect the confidentiality of data as they flow between the parties and the untrusted central server. By encrypting all shared data, such as the clustering results and noise-added OD matrices, this pattern prevents eavesdropping and guarantees secure data flows during the computation of OD matrices.
Cryptographic Protection Pattern: In the DistOD framework, while all data transmissions are protected using the data flow security pattern, data associated with hotspot areas require additional protection due to the use of DDP. To address this need, we assume the use of a cryptographic protection pattern that enhances the privacy by ensuring that the central server can only access aggregated results, rather than individual data. This pattern is applied specifically to the OD matrix elements corresponding to hotspot areas, where DDP is used to add noise to each party’s data before sharing. By using cryptographic techniques such as the secure aggregation protocol, individual values from the hotspot data remain inaccessible to the central server, which can only access the aggregated sum of these elements.
The security patterns described in this subsection significantly increase the robustness and reliability of the DistOD framework, making it well suited for real-world applications that require strong data security and privacy protections.

5. Experiment

In this section, we present an experimental evaluation of the proposed DistOD method using real-world datasets.

5.1. Experimental Setup

5.1.1. Datasets

We evaluated the effectiveness of the proposed DistOD framework using two different datasets.
  • The T-Drive dataset [67] contains one week of trajectory data from 10,357 taxis in Beijing. The T-Drive dataset provides detailed information, including taxi IDs, timestamps, and latitude–longitude coordinates. To generate meaningful origin–destination pairs, we divided the location data into two-hour intervals and used each interval to determine the origin and destination points. This process resulted in 660,000 origin–destination pairs.
  • The Porto dataset [68] consists of GPS coordinates collected from 442 taxis operating in Porto, Portugal. For the experiment, we processed these data to extract 1,323,078 origin–destination pairs.
In the experiment, we simulated a scenario with 50 participating parties (i.e., n = 50 ) and distributed the origin–destination pairs evenly among them to create their respective local OD matrices.
For the experiment, we divided the geographic area into four grid sizes, 25 × 25, 50 × 50, 75 × 75, and 100 × 100, resulting in 625, 2500, 5625, and 10,000 regions, respectively. Each origin or destination location was mapped to the corresponding region to which it belonged. Consequently, the size of the OD matrix used in the experiments was 625 × 625, 2500 × 2500, 5625 × 5625, and 10,000 × 10,000, depending on the grid size.
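As an illustration of this preprocessing step, the following sketch maps a coordinate to a region index for a given grid size; the bounding box of the study area and the helper name are assumptions for the example, not details reported above.

```python
def to_region_id(lon, lat, bbox, grid=100):
    """Map a GPS coordinate (assumed to lie inside bbox) to one of grid x grid regions.

    bbox : (min_lon, min_lat, max_lon, max_lat) of the study area.
    """
    min_lon, min_lat, max_lon, max_lat = bbox
    col = min(int((lon - min_lon) / (max_lon - min_lon) * grid), grid - 1)
    row = min(int((lat - min_lat) / (max_lat - min_lat) * grid), grid - 1)
    return row * grid + col  # index into the m = grid * grid regions
```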

5.1.2. Baseline and Evaluation Metrics

In the experiments, we evaluated the proposed DistOD against the localized DP-based method (DP_GM) in Algorithm 1 and the DDP-based approach (DDP_ALL) in Algorithm 2. We note that the DDP_ALL used in the experiments is based on methods commonly utilized in federated learning contexts, such as [23]. In addition, we compared DistOD with a localized DP-based method using the Staircase mechanism [69] (DP_SM), which is known to introduce less noise into the original data compared to the Gaussian and Laplace mechanisms.
The mean absolute error (MAE), which quantifies the difference between the actual value and the estimated one, was used for evaluation:
MAE = ( Σ_{g_i ∈ G} Σ_{g_j ∈ G} R(g_i) · R(g_j) · |F[i, j] − F̂[i, j]| ) / ( Σ_{g_i ∈ G} Σ_{g_j ∈ G} R(g_i) · R(g_j) )
Here, F represents the true OD matrix, while F̂ denotes the estimated OD matrix generated by the algorithm used in the experiments. In addition, R(g_i) is an indicator function that returns 1 if the region g_i is part of the road network and 0 otherwise. By including the road network in the MAE calculation, we ensured that the MAE was computed meaningfully, considering only valid origin–destination pairs located within the road network.
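A minimal sketch of this metric, assuming the true and estimated OD matrices are NumPy arrays and road-network membership is given as a boolean mask, is shown below; the function name is illustrative.

```python
import numpy as np

def road_masked_mae(F_true, F_est, on_road):
    """MAE restricted to origin-destination pairs whose regions lie on the road network.

    on_road : boolean array of length m, playing the role of R(g_i) in the formula above.
    """
    mask = np.outer(on_road, on_road)            # R(g_i) * R(g_j)
    return float(np.abs(F_true - F_est)[mask].mean())
```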

5.1.3. Experimental Settings

In the experiment, the privacy budget ε was varied between 0.25 and 1.0, with δ fixed at 10⁻⁵. These values are consistent with standards observed in the DP literature, where ε values below 2.0 are typically chosen to balance privacy and utility in practical applications. Lower ε values (e.g., 0.25) provide stronger privacy, while higher values allow for moderate privacy, allowing for a range of scenarios to assess the privacy–utility tradeoff. The choice of δ = 10⁻⁵ reflects the common practice of setting δ to a small constant, as lower values (e.g., 10⁻⁵ or less) help to control the probability of significant privacy loss. This ensures that the privacy guarantee is consistently maintained, minimizing the probability of re-identification even under adversarial conditions.
For the identification of hotspot regions in the proposed DistOD method, we set the number of clusters (k) to 50 and the Gaussian kernel parameter (γ) to 10. All algorithms were implemented in Python 3.8, and the experiments were conducted on a system equipped with Intel Xeon 5220R CPUs and 64 GB of RAM.

5.2. Evaluation Results

Figure 3 shows the effect of the varying privacy budget (ε) on the MAE. In these experiments, ε ranged from 0.25 to 1.0, while the size of the OD matrix was fixed at 10,000 × 10,000. For the proposed DistOD framework, the size of the hotspot areas was varied between 5% and 20% of the total number of regions (i.e., 10,000). As the value of ε decreased (i.e., as the level of privacy increased), the MAE increased for all approaches. This is because stronger privacy guarantees (lower ε) introduce more noise into the true data, which reduces the accuracy of the estimated OD matrix. Specifically, as ε decreased, the σ of the Gaussian mechanism increased, resulting in more noise being added. On the other hand, as the privacy budget ε increased (i.e., as the level of privacy decreased), the σ of the Gaussian mechanism decreased, resulting in less noise being added and thus improving the accuracy.
This tradeoff between data utility and privacy is a well-known characteristic of DP-based methods. Therefore, choosing an appropriate privacy budget requires balancing the desired level of data utility with the required level of privacy, both of which depend on the specific requirements of the application. For example, applications that deal with highly sensitive data may require stronger privacy guarantees, even if this means compromising the accuracy of the results. On the other hand, applications that require accurate data insights for effective decision-making might tolerate a slightly reduced level of privacy in exchange for higher data accuracy.
As expected, among the four different methods, the DDP-based approach (DDP_ALL) shows the best performance, while the localized DP-based methods, DP_GM and DP_SM, show the worst performance. The proposed method, DistOD, performs at an intermediate level between these two categories. However, it is important to note that although the DDP-based approach provides the highest accuracy, this comes at a significant computational cost, making it less practical for large-scale applications. This limitation will be explored in more detail in Section 5.3.
Furthermore, as the size of the hotspot areas increases, a decrease in the MAE is observed. This is because, in the proposed DistOD, the DDP-based approach is applied to the hotspot areas, which collaboratively adds noise to the true data among the participating parties to satisfy ε-DP. As a result, the total error decreases as more regions are covered by the DDP-based approach. However, it is important to note that this improved performance in terms of data utility comes with higher computational costs due to the complexity of the DDP-based approach. As the hotspot size increases, the computational overhead increases because more regions are processed using the DDP mechanism. Therefore, there is a tradeoff between improving the data utility and managing the computational efficiency.
Figure 4 shows the effect of the varying OD matrix sizes on the MAE. In this experiment, the OD matrix size was varied among 625 × 625, 2500 × 2500, 5625 × 5625, and 10,000 × 10,000, while the privacy budget ε was fixed at 0.5. As seen in the figure, the DDP-based approach (DDP_ALL) consistently achieved the lowest MAE, indicating the best performance, while the localized DP-based methods, DP_GM and DP_SM, exhibited the highest MAEs, reflecting the worst performance. The proposed method, DistOD, showed intermediate performance between these two categories, following the same trend as observed in Figure 3.
For the Porto dataset, the MAE steadily decreases as the OD matrix size increases. However, for the T-Drive dataset, the relationship between the OD matrix size and MAE does not follow a consistent pattern. Specifically, the MAE decreases as the matrix size increases from 625 × 625 to 5625 × 5625 , but then rebounds when the matrix size reaches 10,000 × 10,000. This behavior is explained as follows: when calculating the MAE, only valid origin–destination pairs within the road network are considered, which means that the MAE is influenced more by the number of meaningful origin–destination pairs than by the matrix size itself. In the Porto dataset, the number of valid origin–destination pairs increases proportionally with the matrix size, resulting in a steady decrease in the average MAE. However, in the T-Drive dataset, the number of valid origin–destination pairs reaches a stable point at a matrix size of 5625 × 5625 . Beyond this size, the MAE is less dependent on the matrix size and more influenced by the actual data distribution. This observation highlights that the relationship between the matrix size and error is less pronounced when limited to practical, applicable origin–destination pairs within the road network.
Figure 5 shows the experimental results when DP is applied to enhance the privacy during hotspot identification, as discussed in Section 4.2.3. This experiment used the T-Drive dataset with ε values ranging from 0.25 to 1.0 and evaluated two OD matrix sizes, 2500 × 2500 and 10,000 × 10,000. The size of the hotspot areas was fixed at 20% of the total number of regions. In addition, the experiments explored different splits of the total privacy budget between the hotspot identification and OD matrix collection phases, using ratios of 1:9, 2:8, 3:7, 4:6, and 5:5. For example, in Figure 5, DistOD (1:9) indicates that 10% of the privacy budget is allocated to the hotspot identification phase, while the remaining 90% is allocated to the OD matrix collection phase. For comparison, we also plot the results of the DDP-based approach (DDP_ALL) and the localized DP-based methods, DP_GM and DP_SM. Furthermore, the figure also shows the DistOD method, which does not use the DP mechanism during the hotspot identification phase.
As shown in the figure, the MAE increases as a larger portion of the privacy budget is allocated to the hotspot identification phase, leaving less for the OD matrix collection phase. This suggests that, to maintain the data utility in the collected OD matrix, it is more important to minimize the noise during the OD matrix collection phase than to enhance the accuracy of hotspot area identification. This is because the noise introduced during the collection of the OD matrix has a direct impact on the accuracy of the origin–destination estimates. While accurately identifying hotspot areas is valuable, the overall quality of the OD matrix is primarily influenced by how well the collected data capture the actual movement patterns.
Furthermore, as expected, DistOD without the DP mechanism during hotspot identification consistently shows a lower MAE compared to DistOD using the DP mechanism in this phase. More importantly, as shown in the figure, even with a 5:5 privacy budget split between hotspot identification and OD matrix collection, the proposed DistOD still outperforms the localized DP-based approaches. This demonstrates that, despite allocating a significant privacy budget to hotspot identification, DistOD maintains a superior balance between privacy and utility compared to localized methods.
Figure 6 presents the experimental results when applying DP in hotspot identification, where the size of the hotspot regions was varied from 5% to 20% of the total regions. In these experiments, the total privacy budget was divided between the hotspot identification and OD matrix collection phases at a fixed ratio of 1:9. For comparison, the localized DP-based methods, DP_GM and DP_SM, are also included in the graph.
As shown in the figure, there is a slight but consistent increase in the MAE when DP is applied during hotspot identification, compared to when it is not. This increase in error can be attributed to two main factors. First, the introduction of the DP mechanism during hotspot identification adds noise, resulting in the less accurate identification of hotspot areas. Second, a smaller portion of the privacy budget is left for the OD matrix collection phase, further reducing the utility of the collected data. However, despite this slight reduction in utility, the use of DP during hotspot identification can add an important layer of privacy protection, as the true clustering results are not directly shared with the untrusted central server.

5.3. Performance Analysis of Communication and Computational Costs

This subsection analyzes the communication and computation overheads incurred by the use of secure aggregation in DDP. Secure aggregation has been extensively studied, and the recent work in [66] introduces a scalable protocol in which the communication and computational costs of each participating party grow only polylogarithmically with the number of parties. Specifically, for n parties, the per-party communication and computational costs are $O(\log^2 n + L)$ and $O(\log^2 n + L \log n)$, respectively, where L represents the length of the input vectors. Furthermore, the server's communication and computational costs are $O(n(\log^2 n + L))$ and $O(n(\log^2 n + L \log n))$, respectively. Therefore, when the number of participating parties is fixed, the dominant factor affecting the communication and computational overhead of DDP is L, which in this work corresponds to the number of OD matrix elements collected using DDP.
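As a rough illustration of why L dominates, the following sketch evaluates the asymptotic expressions above for a fixed number of parties. All constant factors are omitted, and the party count and vector lengths are hypothetical.

```python
import math

def secure_agg_costs(n_parties: int, vector_len: int) -> dict:
    """Asymptotic communication/computation costs of the protocol in [66],
    with all constant factors omitted."""
    log_n = math.log2(n_parties)
    return {
        "party_comm":  log_n ** 2 + vector_len,
        "party_comp":  log_n ** 2 + vector_len * log_n,
        "server_comm": n_parties * (log_n ** 2 + vector_len),
        "server_comp": n_parties * (log_n ** 2 + vector_len * log_n),
    }

# With the number of parties fixed, every cost term is dominated by vector_len,
# i.e., by the number of OD matrix cells that are aggregated securely.
costs_full    = secure_agg_costs(20, 1_000_000)
costs_reduced = secure_agg_costs(20, 165_000)   # e.g., ~16.5% of the cells aggregated with DDP
```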
Table 1 reports the relative ratio of the number of OD matrix elements to which the DDP mechanism is applied under the proposed DistOD method to the number of elements to which it is applied under the pure DDP-based method (DDP_ALL), as described in Algorithm 2. The relative ratio is defined as
$$\text{Relative Ratio} = \frac{\text{Number of OD matrix elements using DDP with DistOD}}{\text{Number of OD matrix elements using DDP with } \mathrm{DDP_{ALL}}}$$
This relative ratio quantifies the reduction in the number of elements requiring expensive secure aggregation when using the proposed DistOD method compared to the pure DDP-based approach, DDP_ALL.
In Table 1, the size of the hotspot areas for the proposed DistOD method is varied between 5% and 20% of the total number of regions. As expected, the relative ratio grows with the size of the hotspot areas, indicating a corresponding increase in the communication and computational overhead caused by the use of secure aggregation. More importantly, the relative ratio remains well below 1, demonstrating that the proposed DistOD method substantially reduces the communication and computational overhead associated with secure aggregation compared to the pure DDP-based approach (DDP_ALL). For example, for the 10,000 × 10,000 matrix with 20% hotspot areas, the ratio of 0.1647 means that the input vector length L for secure aggregation is roughly one-sixth of that under DDP_ALL; since L dominates the cost expressions above, the per-party overhead can be expected to shrink roughly in the same proportion.

6. Conclusions

In this paper, we have introduced DistOD, a distributed privacy-preserving framework for the computation of OD matrices, which addresses the challenge of constructing these matrices without relying on a trusted central server. The proposed DistOD framework employs a hybrid privacy-preserving approach that combines DDP for hotspot regions with localized DP for non-hotspot regions. This design improves both the computational efficiency and data utility, striking a balance between accuracy and privacy. Extensive experiments on real-world datasets demonstrate that DistOD consistently outperforms localized DP-based methods in terms of data utility and surpasses the pure DDP-based approach in terms of efficiency. As a result, DistOD achieves a better balance between utility and efficiency in collecting and computing OD matrices than these approaches. The results also highlight the effectiveness of our approach in scenarios where no trusted server is available, a common requirement in real-world applications.
For future work, we plan to extend this framework to other domains, such as healthcare and the IoT, where privacy concerns are equally important. In the healthcare domain, the framework can be adapted to protect sensitive patient mobility data, which is crucial for epidemiological studies and public health research. In IoT settings, it can protect the privacy of location data generated by connected devices, ensuring the safe and responsible use of data in smart environments. In addition, we recognize the significant value of developing formalized security patterns for privacy-preserving computation in distributed systems. Future research will focus on these patterns to further enhance the robustness and reliability of the DistOD framework, supporting broader applications in privacy-sensitive domains.

Funding

This research was funded by a 2023 Research Grant from Sangmyung University (2023-A000-0118).

Data Availability Statement

The original data presented in the study are openly available at https://www.microsoft.com/en-us/research/publication/t-drive-trajectory-data-sample/ (accessed on 1 July 2024).

Conflicts of Interest

The author declares no conflicts of interest.

References

  1. Rong, C.; Ding, J.; Li, Y. An interdisciplinary survey on origin-destination flows modeling: Theory and techniques. ACM Comput. Surv. 2024, 57, 1–49. [Google Scholar] [CrossRef]
  2. Behara, K.N.S.; Bhaskar, A.; Chung, E. A DBSCAN-based framework to mine travel patterns from origin-destination matrices: Proof-of-concept on proxy static OD from Brisbane. Transp. Res. Part C Emerg. Technol. 2021, 131, 103370. [Google Scholar] [CrossRef]
  3. Alshehri, A.; Owais, M.; Gyani, J.; Aljarbou, M.H.; Alsulamy, S. Residual neural networks for origin–destination trip matrix estimation from traffic sensor information. Sustainability 2023, 15, 9881. [Google Scholar] [CrossRef]
  4. Lattman, K.; Olsson, L.E.; Friman, M. Development and test of the perceived accessibility scale (PAC) in public transport. J. Transp. Geogr. 2016, 54, 257–263. [Google Scholar] [CrossRef]
  5. Pereira, F.C.; Rodrigues, F.; Ben-Akiva, M. Using data from the web to predict public transport arrivals under special events scenarios. J. Intell. Transp. Syst. 2015, 19, 273–288. [Google Scholar] [CrossRef]
  6. Credit, K.; Arnao, Z. A method to derive small area estimates of linked commuting trips by mode from open source LODES and ACS data. Environ. Plan. B Urban Anal. City Sci. 2022, 50, 709–722. [Google Scholar] [CrossRef]
  7. Yang, T. Understanding commuting patterns and changes: Counterfactual analysis in a planning support framework. Environ. Plan. B Urban Anal. City Sci. 2020, 47, 1440–1455. [Google Scholar] [CrossRef]
  8. Jia, J.S.; Lu, X.; Yuan, Y.; Xu, G.; Jia, J.; Christakis, N.A. Population flow drives spatio-temporal distribution of COVID-19 in China. Nature 2020, 582, 389–394. [Google Scholar] [CrossRef]
  9. Li, Z.; Huang, X.; Hu, T.; Ning, H.; Ye, X.; Huang, B.; Li, X. ODT FLOW: Extracting, analyzing, and sharing multi-source multi-scale human mobility. PLoS ONE 2021, 16, e0255259. [Google Scholar] [CrossRef]
  10. LeSage, J.P.; Fischer, M.M. Spatial econometric methods for modeling origin-destination flows. In Handbook of Applied Spatial Analysis: Software Tools, Methods and Application; Springer: Berlin/Heidelberg, Germany, 2009. [Google Scholar]
  11. Vrotsou, K.; Fuchs, G.; Andrienko, N.; Andrienko, G. An interactive approach for exploration of flows through direction-based filtering. J. Geovisualization Spat. Anal. 2017, 1, 1. [Google Scholar] [CrossRef]
  12. Sapiezynski, P.; Stopczynski, A.; Gatej, R.; Lehmann, S. Tracking human mobility using WiFi signals. PLoS ONE 2015, 10, e0130824. [Google Scholar] [CrossRef] [PubMed]
  13. Khazbak, Y.; Cao, G. Deanonymizing mobility traces with co-location information. In Proceedings of the IEEE Conference on Communications and Network Security, Las Vegas, NV, USA, 9–11 October 2017. [Google Scholar]
  14. Mattos, E.P.; Domingues, A.C.S.A.; Loureiro, A.A.F. Give me two points and I’ll tell you who you are. In Proceedings of the IEEE Intelligent Vehicles Symposium, Paris, France, 9–12 June 2019. [Google Scholar]
  15. Liu, Q.; Yu, J.; Han, J.; Yao, X. Differentially private and utility-aware publication of trajectory data. Expert Syst. Appl. 2021, 180, 115120. [Google Scholar] [CrossRef]
  16. Qiu, S.; Pi, D.; Wang, Y.; Xu, T. SGTP: A spatiotemporal generalized trajectory publishing method with differential privacy. J. Ambient Intell. Humaniz. Comput. 2023, 14, 2233–2247. [Google Scholar] [CrossRef]
  17. Matet, B.; Furno, A.; Fiore, M.; Come, E.; Oukhellou, L. Adaptative generalisation over a value hierarchy for the k-anonymisation of origin–destination matrices. Transp. Res. Part C Emerg. Technol. 2023, 154, 104236. [Google Scholar] [CrossRef]
  18. Shaham, S.; Ghinita, G.; Shahabi, C. Differentially-private publication of origin-destination matrices with intermediate stops. In Proceedings of the International Conference on Extending Database Technology, Virtual Event, 29 March–1 April 2022; pp. 131–142.
  19. Primault, V.; Boutet, A.; Mokhtar, S.B.; Brunie, L. The long road to computational location privacy: A survey. IEEE Commun. Surv. Tutor. 2018, 21, 2772–2793. [Google Scholar] [CrossRef]
  20. Kim, J.W.; Edemacu, K.; Jang, B. Privacy-preserving mechanisms for location privacy in mobile crowdsensing: A survey. J. Netw. Comput. Appl. 2022, 200, 103315. [Google Scholar] [CrossRef]
  21. Kim, J.; Jang, B. Workload-aware indoor positioning data collection via local differential privacy. IEEE Commun. Lett. 2019, 23, 1352–1356. [Google Scholar] [CrossRef]
  22. Jin, W.; Xiao, M.; Guo, L.; Yang, L.; Li, M. ULPT: A user-centric location privacy trading framework for mobile crowd sensing. IEEE Trans. Mob. Comput. 2022, 21, 3789–3806. [Google Scholar] [CrossRef]
  23. Truex, S.; Baracaldo, N.; Anwar, A.; Steinke, T.; Ludwig, H.; Zhang, R.; Zhou, Y. A hybrid approach to privacy-preserving federated learning. In Proceedings of the the ACM Workshop on Artificial Intelligence and Security, London, UK, 15 November 2019; pp. 1–11. [Google Scholar]
  24. Banabilah, S.; Aloqaily, M.; Alsayed, E.; Malik, N.; Jararweh, Y. Federated learning review: Fundamentals, enabling technologies, and future applications. Inf. Process. Manag. 2022, 59, 103061. [Google Scholar] [CrossRef]
  25. Antunes, R.S.; Costa, C.A.; Kuderle, A.; Yari, I.A.; Eskofier, B. Federated learning for healthcare: Systematic review and architecture proposal. ACM Trans. Intell. Syst. Technol. 2022, 13, 1–23. [Google Scholar] [CrossRef]
  26. Dennis, D.K.; Li, T.; Smith, V. Heterogeneity for the win: One-shot federated clustering. In Proceedings of the International Conference on Machine Learning, Virtual, 18–24 July 2021; pp. 2611–2620. [Google Scholar]
  27. Qiao, D.; Ding, C.; Fan, J. Federated spectral clustering via secure similarity reconstruction. Adv. Neural Inf. Process. Syst. 2023, 36, 58520–58555. [Google Scholar]
  28. Gao, C.; Yu, J. SecureRC: A system for privacy-preserving relation classification using secure multi-party computation. Comput. Secur. 2023, 128, 103142. [Google Scholar] [CrossRef]
  29. Sucasas, V.; Aly, A.; Mantas, G.; Rodriguez, J.; Aaraj, N. Secure multi-party computation-based privacy-preserving authentication for smart cities. IEEE Trans. Cloud Comput. 2023, 11, 3555–3572. [Google Scholar] [CrossRef]
  30. Dwork, C. Differential privacy. In Proceedings of the International Colloquium on Automata, Languages, and Programming, Venice, Italy, 10–14 July 2006; pp. 1–12. [Google Scholar]
  31. Mamei, M.; Bicocchi, N.; Lippi, M.; Mariani, S.; Zambonelli, F. Evaluating origin–destination matrices obtained from CDR data. Sensors 2019, 19, 4470. [Google Scholar] [CrossRef] [PubMed]
  32. Castiglione, M.; Cantelmo, G.; Qurashi, M.; Nigro, M.; Antoniou, C. Assignment matrix free algorithms for on-line estimation of dynamic origin-destination matrices. Front. Future Transp. 2021, 2, 640570. [Google Scholar] [CrossRef]
  33. Xiong, Z.; Lian, D.; Chen, E.; Chen, G.; Cheng, X. A DeepLearning framework for dynamic estimation of origin-destination sequence. arXiv 2023, arXiv:2307.05623. [Google Scholar]
  34. Sun, C.; Chang, Y.; Luan, X.; Tu, Q.; Tang, W. Origin-destination demand reconstruction using observed travel time under congested network. Netw. Spat. Econ. 2020, 20, 733–755. [Google Scholar] [CrossRef]
  35. Tsanakas, N.; Gundlegard, D.; Rydergren, C. O–D matrix estimation based on data-driven network assignment. Transp. B Transp. Dyn. 2023, 11, 376–407. [Google Scholar] [CrossRef]
  36. Ryu, S. A bicycle origin–destination matrix estimation based on a two-stage procedure. Sustainability 2020, 12, 2951. [Google Scholar] [CrossRef]
  37. Ros-Roca, X.; Montero, L.; Barcelo, J.; Nokel, K.; Gentile, G. A practical approach to assignment-free dynamic origin–destination matrix estimation problem. Transp. Res. Part C Emerg. Technol. 2022, 134, 103477. [Google Scholar] [CrossRef]
  38. Li, C.; Zheng, L.; Jia, N. Network-wide ride-sourcing passenger demand origin-destination matrix prediction with a generative adversarial network. Transp. A Transp. Sci. 2024, 20. [Google Scholar] [CrossRef]
  39. Zhang, M.; Gao, L.; Wang, Q.; Gao, W. Predicting city origin-destination flow with generative pre-training. In Proceedings of the International Conference on Artificial Neural Networks, Lugano, Switzerland, 17–20 September 2024. [Google Scholar]
  40. Rong, C.; Feng, J.; Ding, J. GODDAG: Generating origin-destination flow for new cities via domain adversarial training. IEEE Trans. Knowl. Data Eng. 2023, 35, 10048–10057. [Google Scholar] [CrossRef]
  41. Chen, P.; Wang, Z.; Zhou, B.; Yu, G. Dynamic origin-destination flow imputation using feature-based transfer learning. IEEE Trans. Intell. Transp. Syst. 2024, 25, 17147–17159. [Google Scholar] [CrossRef]
  42. Yin, L.; Wang, Q.; Shaw, S.-L.; Fang, Z.; Hu, J.; Tao, Y.; Wang, W. Re-identification risk versus data utility for aggregated mobility research using mobile phone location data. PLoS ONE 2015, 10, e0140589. [Google Scholar] [CrossRef] [PubMed]
  43. Kohli, N.; Aiken, E.; Blumenstock, J. Privacy guarantees for personal mobility data in humanitarian response. arXiv 2023, arXiv:2306.09471. [Google Scholar] [CrossRef] [PubMed]
  44. Ouadrhiri, A.E.; Abdelhad, A. Differential privacy for deep and federated learning: A survey. IEEE Access 2022, 10, 22359–22380. [Google Scholar] [CrossRef]
  45. Wei, K.; Li, J.; Ding, M.; Ma, C.; Yang, H.H.; Farokhi, F. Federated learning with differential privacy: Algorithms and performance analysis. IEEE Trans. Inf. Forensics Secur. 2020, 15, 3454–3469. [Google Scholar] [CrossRef]
  46. Truex, S.; Liu, L.; Chow, K.-H.; Gursoy, M.E.; Wei, W. LDP-Fed: Federated learning with local differential privacy. In Proceedings of the ACM International Workshop on Edge Systems, Analytics and Networking, Heraklion, Greece, 27 April 2020; pp. 61–66. [Google Scholar]
  47. Li, Y.; Wang, S.; Chi, C.-Y.; Quek, T.Q.S. Differentially private federated clustering over non-IID data. IEEE Internet Things J. 2024, 11, 6705–6721. [Google Scholar] [CrossRef]
  48. Li, Z.; Wang, T.; Li, N. Differentially private vertical federated clustering. Proc. VLDB Endow. 2023, 16, 1277–1290. [Google Scholar] [CrossRef]
  49. Lyu, L.; Nandakumar, K.; Rubinstein, B.; Jin, J.; Bedo, J.; Palaniswami, M. PPFA: Privacy preserving fog-enabled aggregation in smart grid. IEEE Trans. Ind. Inform. 2018, 14, 3733–3744. [Google Scholar] [CrossRef]
  50. Yang, M.; Tjuawinata, I.; Lam, K.Y.; Zhao, J.; Sun, L. Secure hot path crowdsourcing with local differential privacy under fog computing architecture. IEEE Trans. Serv. Comput. 2022, 15, 2188–2201. [Google Scholar] [CrossRef]
  51. Wang, T.; Mei, Y.; Jia, W.; Zheng, X.; Wang, G.; Xie, M. Edge-based differential privacy computing for sensor–cloud systems. J. Parallel Distrib. Comput. 2020, 136, 75–85. [Google Scholar] [CrossRef]
  52. Gallego-Nicasio, B.; Munoz, A.; Mana, A.; Serrano, D. Security patterns, towards a further level. In Proceedings of the International Conference on Security and Cryptography, Milan, Italy, 7–10 July 2009; pp. 349–356. [Google Scholar]
  53. Papoutsakis, M.; Fysarakis, K.; Spanoudakis, G.; Ioannidis, S.; Koloutsou, K. Towards a collection of security and privacy patterns. Appl. Sci. 2021, 11, 1396. [Google Scholar] [CrossRef]
  54. Uzunov, A.V.; Fernandez, E.B.; Falkner, K. Security solution frames and security patterns for authorization in distributed, collaborative systems. Comput. Secur. 2015, 55, 193–234. [Google Scholar] [CrossRef]
  55. Sanchez-Cid, F.; Mana, A.; Spanoudakis, G.; Kloukinas, C.; Serrano, D.; Munoz, A. Representation of security and dependability solutions. Secur. Dependability Ambient. Intell. 2009, 45, 69–95. [Google Scholar]
  56. Jafari, A.J.; Rasoolzadegan, A. Security patterns: A systematic mapping study. J. Comput. Lang. 2020, 56, 100938. [Google Scholar] [CrossRef]
  57. Moral-Garcia, S.; Moral-Rubio, S.; Fernandez, E.B.; Fernandez-Medina, E. Enterprise security pattern: A model-driven architecture instance. Comput. Stand. Interfaces 2014, 36, 748–758. [Google Scholar] [CrossRef]
  58. Anand, P.; Ryoo, J.; Kim, H. Addressing security challenges in cloud computing–A pattern-based approach. In Proceedings of the International Conference on Software Security and Assurance, Suwon, Republic of Korea, 27 July 2015. [Google Scholar]
  59. Rath, A.; Spasic, B.; Boucart, N.; Thiran, P. Security pattern for cloud SaaS: From system and data security to privacy case study in AWS and Azure. Computers 2019, 8, 34. [Google Scholar] [CrossRef]
  60. Erlingsson, U.; Pihur, V.; Korolova, A. RAPPOR: Randomized aggregatable privacy-preserving ordinal response. In Proceedings of the ACM SIGSAC Conference on Computer and Communications Security, Scottsdale, AZ, USA, 3–7 November 2014; pp. 1054–1067. [Google Scholar]
  61. Wang, T.; Blocki, J.; Li, N.; Jha, S. Locally differentially private protocols for frequency estimation. In Proceedings of the USENIX Conference on Security Symposium, Berkeley, CA, USA, 16–18 August 2017. [Google Scholar]
  62. Goryczka, S.; Xiong, L. A comprehensive comparison of multiparty secure additions with differential privacy. IEEE Trans. Dependable Secur. Comput. 2015, 14, 463–477. [Google Scholar] [CrossRef]
  63. Wei, Y.; Jia, J.; Wu, Y.; Hu, C.; Dong, C.; Liu, Z.; Chen, X.; Peng, Y.; Wang, S. Distributed differential privacy via shuffling versus aggregation: A curious study. IEEE Trans. Inf. Forensics Secur. 2024, 19, 2501–2516. [Google Scholar] [CrossRef]
  64. Kim, J.; Jang, B. Privacy-preserving generation and publication of synthetic trajectory microdata: A comprehensive survey. J. Netw. Comput. Appl. 2024, 230. [Google Scholar] [CrossRef]
  65. Kadhe, S.; Rajaraman, N.; Koyluoglu, O.O.; Ramchandran, K. FastSecAgg: Scalable secure aggregation for privacy-preserving federated learning. arXiv 2020, arXiv:2009.11248. [Google Scholar]
  66. Bell, J.H.; Bonawitz, K.A.; Gascon, A.; Lepoint, T.; Raykova, M. Secure single-server aggregation with (poly)logarithmic overhead. In Proceedings of the ACM SIGSAC Conference on Computer and Communications Security, Virtual Event, USA, 9–13 November 2020; pp. 1253–1269. [Google Scholar]
  67. T-Drive Trajectory Data Sample. 2018. Available online: https://www.microsoft.com/en-us/research/publication/t-drive-trajectory-data-sample (accessed on 1 July 2024).
  68. Moreira-Matias, L.; Gama, J.; Ferreira, M.; Mendes-Moreira, J.; Damas, L. Predicting taxi–passenger demand using streaming data. IEEE Trans. Intell. Transp. Syst. 2013, 14, 1393–1402. [Google Scholar] [CrossRef]
  69. Geng, Q.; Kairouz, P.; Oh, S.; Viswanath, P. The staircase mechanism in differential privacy. IEEE J. Sel. Top. Signal Process. 2015, 9, 1176–1184. [Google Scholar] [CrossRef]
Figure 1. Overview of distributed privacy-preserving OD matrix computation framework.
Figure 2. The proposed DistOD framework consists of two main phases. In the hotspot identification phase, hotspot areas are collaboratively identified using location clustering. In the aggregation phase, a hybrid privacy-preserving mechanism is employed by applying DDP to collect OD data for hotspot areas and using localized DP for non-hotspot areas to ensure privacy.
Figure 3. Effect of varying privacy budget (ϵ) on MAE. (a) T-Drive dataset. (b) Porto dataset.
Figure 4. Effect of varying OD matrix size on MAE. (a) T-Drive dataset. (b) Porto dataset.
Figure 5. Experimental results showing the impact of DP in hotspot identification with the T-Drive dataset: effect of varying the allocation of the total privacy budget between the hotspot identification and OD matrix collection phases. (a) OD matrix size = 2500 × 2500. (b) OD matrix size = 10,000 × 10,000.
Figure 6. Experimental results showing the impact of DP in hotspot identification with the T-Drive dataset: effects of varying hotspot size on MAE. (a) OD matrix size = 2500 × 2500. (b) OD matrix size = 10,000 × 10,000.
Table 1. The relative ratio of the number of OD matrix elements where the DDP mechanism is applied using the proposed DistOD method compared to the pure DDP-based method (DDP_ALL).

                       OD Matrix Size
                 625 × 625   2500 × 2500   5625 × 5625   10,000 × 10,000
DistOD (5%)        0.0059       0.0159        0.0189          0.0203
DistOD (10%)       0.0229       0.0520        0.0570          0.0593
DistOD (15%)       0.0503       0.0944        0.1011          0.1089
DistOD (20%)       0.0857       0.1402        0.1546          0.1647