Mobility Data Science: Perspectives and Challenges

ACM Transactions on Spatial Algorithms and Systems, Volume 10, Issue 2

Article No.: 10, Pages 1 - 35

Published: 01 July 2024

Editorial Notes

The authors have requested minor, non-substantive changes to the VoR and, in accordance with ACM policies, a Corrected Version of Record was published on July 18, 2024. For reference purposes, the VoR may still be accessed via the Supplemental Material section on this citation page.

Abstract

Mobility data captures the locations of moving objects such as humans, animals, and cars. With the availability of Global Positioning System (GPS)–equipped mobile devices and other inexpensive location-tracking technologies, mobility data is collected ubiquitously. In recent years, the use of mobility data has demonstrated a significant impact in various domains, including traffic management, urban planning, and health sciences. In this article, we present the domain of mobility data science. Towards a unified approach to mobility data science, we present a pipeline having the following components: mobility data collection, cleaning, analysis, management, and privacy. For each of these components, we explain how mobility data science differs from general data science, we survey the current state-of-the-art, and describe open challenges for the research community in the coming years.

1 Introduction

The volume of mobility data being collected has been steadily increasing since the advent of affordable personal location-enabled mobile devices. Examples of mobility data continuously generated and collected in huge volumes include (a) individual sporadic locations obtained from mobile app data and location-based social networks; (b) individual pedestrian, biking, or driving trajectories constrained by underlying sidewalks, biking trails, and road networks, respectively; (c) indoor individual or asset tracking data obtained from RFID and Bluetooth devices; (d) athletes’ movement data in various sports obtained from wearable devices; (e) public transportation, taxi, ride-sharing, and delivery logistics trajectories obtained by location-tracking devices and specially designed app services; (f) aircraft and vessel trajectories moving in an unconstrained environment (i.e., no underlying road network) obtained by air and sea traffic monitoring services; and (g) tracking data of animals moving freely in space, obtained from physically tagged and remotely sensed animals. Generally speaking, for each moving object, mobility data is typically available in the form of a sequence of (location, timestamp) pairs. The location attribute could be as simple as a point, represented either by latitude and longitude coordinates or by relative coordinates with respect to the underlying space. The location attribute could also be an area, which can represent the mobility of objects with spatial extents, e.g., flocks or group movement.
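The (location, timestamp) representation described above can be sketched as a small data structure. This is only an illustration under stated assumptions: the names `Fix` and `Trajectory`, the use of decimal-degree latitude and longitude, and the sample coordinates are hypothetical choices, not a format prescribed by this article.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import List


@dataclass(frozen=True)
class Fix:
    """One observation: a point location plus the time it was recorded."""
    lat: float   # latitude in decimal degrees (assumed encoding)
    lon: float   # longitude in decimal degrees (assumed encoding)
    t: datetime  # observation timestamp


@dataclass
class Trajectory:
    """Hypothetical container: all fixes of one moving object, ordered by time."""
    object_id: str
    fixes: List[Fix]

    def __post_init__(self) -> None:
        # Keep the sequence in temporal order regardless of arrival order.
        self.fixes = sorted(self.fixes, key=lambda f: f.t)

    def duration(self) -> timedelta:
        """Time elapsed between the first and last observation."""
        if not self.fixes:
            return timedelta(0)
        return self.fixes[-1].t - self.fixes[0].t


# Made-up sample data; the fixes are deliberately given out of order.
start = datetime(2024, 7, 1, 8, 0, 0)
taxi = Trajectory(
    object_id="taxi-1",
    fixes=[
        Fix(37.7793, -122.4193, start + timedelta(seconds=60)),
        Fix(37.7749, -122.4194, start),
        Fix(37.7840, -122.4090, start + timedelta(seconds=120)),
    ],
)
print(taxi.duration())
```

An area-valued location, as mentioned for flocks or groups, would need a geometry type in place of the two coordinate fields; that extension is not shown here.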

The ability to understand and analyze mobility data is crucial for many important sectors and applications. In transportation and traffic management, analyzing traffic data through vehicle mobility helps in predicting accidents [158], traffic congestion [258], and better route planning [51]. In ride-sharing and delivery logistics applications, analyzing trip mobility data helps in data-driven eco route planning, which results in huge cost and energy savings [96]. In location-based services, analyzing people's movements around the city significantly helps in trip planning activities [217], finding popular tourist sites and restaurants [118], and data-driven routing and querying [218]. In indoor navigation, understanding how people move indoors helps in understanding the traffic for various stores inside a mall, which is needed in various market research studies [114]. In urban planning, driving data can significantly help in building highly accurate, reliable, and annotated maps [159] as well as deciding on good locations for various facilities, e.g., restaurants, retail stores, and clinics [206]. In social computing, analyzing how people move in cities and regions helps in understanding the demand for infrastructure and energy as a means of reducing inequalities [200]. In disaster response, analyzing crowd movement helps in preparing for natural disasters through rescue and evacuation efforts [105]. In health informatics, connected wearables can monitor and analyze the movement of elderly people, allowing for timely, and potentially life-saving, interventions [134]. In pandemic prevention, privacy-preserving individual tracking allows for contact tracing, which was deemed to be a cornerstone in limiting pandemic spread [155, 277].

Fig. 1.

Despite the common goal of acquiring, managing, and generating insights from mobility data, the mobility data science community is largely fragmented, developing solutions in silos. It stems from a range of disciplines with expertise in moving object data storage and management [99], geographic information science [88], spatiotemporal data mining [210], human mobility modelling [27], ubiquitous computing, computational geometry, and more. The sheer volumes of mobility data along with the immense need for mobility data analysis in various applications call for employing a complete Data Science pipeline [190] over mobility data (Figure 1). This includes the whole pipeline of Data Science applications, starting from the data storage and management infrastructure and going through data collection, data cleaning and preprocessing, and data analysis. Unfortunately, this is not straightforward as current Data Science systems, tools, and algorithms are not directly applicable to mobility data. This is mainly due to the fact that these systems, tools, and algorithms are designed in a generic way to support any data type and, hence, they do not lend themselves to the distinguishing characteristics of mobility data. Examples of such characteristics include the spatial and temporal dimensions of the data, the rate of updates, and the privacy requirements. In particular, mobility data is always spatial, in which nearby objects are more related to each other. This is unlike traditional data, in which the concepts of nearby and locality are not taken into account. Also, similar to time series data, mobility data is temporal, in which one object may have hundreds of updates to its location and all updates are related to each other (e.g., one trajectory). This is again unlike traditional data, in which temporal updates of a single object are not frequent and older updates would be of less importance. Similar to streaming data, mobility data has a high frequency of updates, which is not supported in typical data science applications. Finally, mobility data is more sensitive to privacy. While privacy preservation in traditional data can be achieved by removing (quasi-)identifier attributes, in mobility data, locations by themselves are considered private information that can reveal not only the users’ identities, but also their behavior, lifestyle, medical conditions, and workplaces.

Motivated by the ubiquity and sheer volume of mobility data, the importance of mobility applications, and the lack of support from current data science pipelines, this article presents a pipeline for Mobility Data Science. We define Mobility Data Science as an interdisciplinary field that uses scientific methods, processes, algorithms, and systems to extract or extrapolate knowledge and insights from potentially noisy structured and unstructured mobility data, and to apply knowledge from mobility data across a broad range of application domains. Although the community of developers, practitioners, and researchers dealing with mobility data currently uses off-the-shelf data science techniques and systems to collect, clean, manage, and analyze its mobility data, we firmly believe that this leads to sub-par performance. We urge this community to build its own mobility data science pipeline to better serve its own purpose. This article makes the case for a mobility data science pipeline and presents the challenges that need to be addressed to realize it.

2 Mobility Data Collection

The abundant availability of real data is a cornerstone of any data science application, and mobility data science applications are no exception. However, it is much easier to collect large volumes of data for general data science applications than for mobility data science. In particular, for general data science applications, well-established research in anonymizing personal data allows wide data sharing, to the extent that governments have released various datasets for public use (e.g., Data.gov). In addition, companies already collect their own inventory data that does not include any personal identifiers; hence, it is suitable to be fed to data science applications. On the other hand, data-driven mobility data science research has been in a constant struggle with the lack of available mobility data. A main reason is that non-aggregated individual human location data is considered personally identifiable information, as it may lead to tracing an individual’s identity. For example, it has been shown that only a few spatial locations are sufficient to uniquely identify individuals even among a large population of people [202]. As a result, most datasets are collected in aggregated form, which hinders the deployment of various mobility data science applications. This section discusses current efforts and challenges of mobility data collection.
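The observation that a few locations can single out an individual can be illustrated with a toy check. The sketch below is not the analysis of [202]; the function name `matching_users`, the (cell, hour) encoding, and the three traces are made-up assumptions, chosen only to show how each extra known point shrinks the set of candidate users.

```python
from typing import Dict, FrozenSet, Set, Tuple

# Hypothetical encoding: (coarse location cell, hour of day).
Point = Tuple[str, int]


def matching_users(traces: Dict[str, Set[Point]], known: FrozenSet[Point]) -> Set[str]:
    """Return the users whose trace contains every known point."""
    return {user for user, points in traces.items() if known <= points}


# Made-up traces for three users who share some, but not all, points.
traces = {
    "u1": {("A", 8), ("B", 9), ("C", 18)},
    "u2": {("A", 8), ("D", 9), ("C", 18)},
    "u3": {("A", 8), ("B", 9), ("E", 18)},
}

# One known point leaves all three users as candidates.
one_point = matching_users(traces, frozenset({("A", 8)}))

# A second known point narrows the candidates down to a single user.
two_points = matching_users(traces, frozenset({("A", 8), ("D", 9)}))

print(sorted(one_point), sorted(two_points))
```

In this toy data, removing the user identifiers would not help: anyone who knows the two points can still pick out the matching trace.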

2.1 Efforts in Mobility Data Collection

Before the wide availability of personal digital devices, human mobility data collection was expensive; therefore, datasets were very sparse. With the advent of personal location-enabled devices, many people’s movements have started leaving digital traces that are being collected either by industry as a means of providing location-based services [196] or by governmental entities as a means of data analysis, e.g., traffic-related studies [232]. However, this did not result in a similar explosion of publicly available mobility data, mainly due to privacy and data-sharing concerns.

Current efforts in releasing public non-aggregated mobility data are mainly limited to small datasets and small regions, while removing locations that can lead to one’s whereabouts. They mostly include trips obtained from taxis, ride-sharing services, or public transportation. Some of these datasets include detailed trajectory data for the following cities (ordered alphabetically): (1) Athens [28]: 500K trajectories collected over 5 days in downtown Athens, Greece; (2) Beijing 1 [271]: 17+K trajectories with 26 million GPS points over 3 years in Beijing, China; (3) Beijing 2 [259]: 10+K trajectories with 15 million GPS points over 1 week in Beijing, China; (4) Rio [69]: 12+K buses with detailed trajectories of 118+ million GPS points over 30 days in Rio de Janeiro, Brazil; (5) Rome [41]: 320 taxis with detailed trajectories of 21+ million GPS points over 30 days in Rome, Italy; (6) San Francisco 1 [179]: 536 taxis with detailed trajectories of 11+ million GPS points over 30 days in San Francisco, California; (7) San Francisco 2 [1]: 20+K detailed trajectories with 5+ million GPS points in San Francisco, California; (8) Shenzhen [238]: 664 taxis with detailed trajectories of 1.1+ million GPS points over 1 day in Shenzhen, China; (9) Singapore [107]: 84K trajectories with 80+ million GPS points over 1 month in Singapore. Other datasets only include the origin and destination of each trajectory. Examples include the following cities: (1) Austin [192]: 1.5 million trips for a period of 10 months in Austin, Texas; (2) Guangdong [256]: 2.5 million trips over 1 day in Guangdong Province, China; (3) New York City [168]: 1.5 million taxi trips over a period of 6 months in New York City, New York; (4) Porto [180]: 426K taxi trips over 3 months in Porto, Portugal.

Other than trip and trajectory road network data, there are tons of available biking data across the world, including tens of millions of trips in the Bay Area [139], Boston [138], Chicago [70], Columbus [62], London [135], Los Angeles [38], Madrid [141], Minneapolis [164], New York City [60], Philadelphia [111], Portland [39], and Washington D.C. [44]. There are also available public marine traffic datasets that include detailed vessel trajectories (e.g., [176]), sport datasets for basketball and soccer that include a variety of events that took place in major leagues within one season [174], and indoor data about the behaviors of nearly 30 students in Grade-10 and their teachers collected over four weeks in Australia, with spatial reference (associations to rooms) and highly granular wearable data [83].

There are, however, some large-scale aggregated datasets with a coarse granularity that can help in high-level analysis, but not in gaining detailed insights into mobility data. Examples of such aggregate data include origin–destination employment statistics in the United States that contain home-to-work commuting flows aggregated to the census tract level [90], cell phone trace datasets capturing the locations of individuals aggregated to their nearest cell tower [235], foot traffic data of check-ins of 35 million anonymized mobile devices in the United States aggregated to census block groups [197], and a global database about aggregate indoor occupant behavior, composed of 34 datasets from 15 countries and 39 institutions, collected by occupancy sensors that measure the occupancy count of each space being monitored [72]. An additional source of human mobility data is location-based social network (LBSN) data. LBSN data captures both (1) discrete check-ins between users and locations and (2) a social network between users. This dimension of location bridges the gap between the physical world and online social networking services [269]. However, it has been shown that existing LBSN datasets are too small to broadly understand, analyze, and predict human behavior [126].

The lack of available mobility data, combined with the need to stress test various research ideas, has motivated various research groups to either develop their own data simulators or develop publicly available simulators that can also be used by other researchers for benchmark datasets. However, such simulators were mainly designed to test specific aspects of research and were not meant to be representative of real mobility data. For example, various simulators were mainly designed to test new index structures for mobility data, query processing algorithms, and system infrastructure scalability for managing spatiotemporal data (e.g., [153]). Within the transportation community, finer-granularity simulators (e.g., [34]) were proposed to study traffic infrastructure, but none of them is meant to provide a comprehensive mobility study.

2.2 Challenges in Mobility Data Collection

This section presents some of the challenges in mobility data collection that the community needs to address towards realizing the pipeline of mobility data science.

Challenge 1. Mobility Data Privacy. In most cases, (human) mobility data is sensitive and considered to be personally identifiable information. This raises major privacy concerns regarding data sharing. Hence, any attempt to collect fine-granularity detailed trajectory or human mobility data must first address the privacy challenge. Though the general topic of data privacy has been well studied in the literature with practical solutions, such solutions are not directly applicable to the case of mobility data. In particular, mobility data gives rise to the Trajectory-User Linking (TUL) problem [85]. To protect users’ actual locations while preserving meaningful mobility information for various learning tasks, one may wish to generate realistic motions based on real-world mobility datasets [272]. Since privacy is a core problem in mobility data that impacts not only data collection but also all other components of the mobility data science pipeline, we dedicate Section 6 to discussing mobility data privacy in detail.

Challenge 2. Mobility Data Bias. Mobility data collection procedures suffer from all kinds of bias. For example, mobile application data and mobile phone network data are biased against people who do not use smartphones or who use prepaid plans. Most traffic counting sensors are installed to count cars but do not count pedestrians, cyclists, wheelchairs, or similar modes of transport. Cells in mobile phone networks vary widely in size, and the data traces that are usually collected in cellular networks are themselves recorded at the granularity of cells. This affects rural areas with larger cells more than urban areas. Volunteered tracking data is biased towards technically savvy people. Sports tracking data is biased towards health-conscious members of the middle and upper classes. It is important to understand, measure, and mitigate data bias in mobility datasets to ensure that actions and policies that are based on mobility data science results are equitable, fair, and include vulnerable populations [205].

Challenge 3. Incentives for Data Sharing. Users need to have good incentives to share their locations. To some degree, users agree to share their locations with commercial entities to get location-based services, ride sharing, cell phone coverage, delivery, and other services. However, it is understood that users would be reluctant to publicly share their mobility traces. In contrast, the biking community has shown a great affinity for sharing its biking trails. A main reason is that, in many places of the world, most of these trails are used not for home-to-work commuting but for outdoor recreation. Hence, sharing biking trails helps fellow bikers learn the conditions of biking trails, which is a great incentive for sharing. More incentives need to be offered for drivers to share their mobility traces, even for sporadic trips that do not lead to identifiable locations. Sharing could be for part of the trajectory, where rewards are given back based on the sharing length and resolution. A gamification concept may be exploited to encourage more participants to share.

Challenge 4. Simulated Mobility Data. The dire need for mobility data along with the difficulty of obtaining it made it apparent that simulated synthetic data is immensely needed to enrich and train mobility data science applications. However, the challenge is to go beyond earlier attempts of simulating data for testing very specific techniques to simulating data for the general purpose of having realistic life scenarios. Empowered by modern computational capabilities that make it possible to simulate large populations, the mobility community should work with social scientists to create realistic individual-level human mobility data. Lessons have been learned from the experience of the deep learning community by applying generative adversarial networks (GANs) for trajectory generation [262]. However, it is unclear as of yet how to measure the extent to which mobility data is realistic. If synthetic mobility data is too realistic, for example, due to training on real human trajectories, it may invade someone’s privacy if, for instance, it shows where members of a given household actually visit. On the flip side, benchmark data that is too disconnected from the real world and does not represent realistic human behavior would not allow generalization to the real world.

3 Mobility Data Cleaning

Until the early 21st century, location data and mobility data available for geographic information science (GIS) was mainly collected, curated, standardized [78, 79], and published by authoritative sources such as the United States Geological Survey (USGS) [231]. Now, data used for mobility data science is often obtained from sources of volunteered geographic information (VGI) [216]. Such data is contributed by millions of individual users (more than 10 million contributors in the case of OpenStreetMap [170]) and is rarely curated. Mobility data collected from such sources is highly uncertain due to physical limitations of sensing devices and obsolete observations, and in many cases is simply incorrect due to deliberate misinformation [157]. Consequently, our ability to unearth valuable knowledge from large sets of mobility data is often impaired by the uncertainty of the data, to the point that uncertainty has been named the “Achilles heel of GIS” [89].

Data cleaning and preprocessing is a cornerstone of all data science. In fact, it has been reported that data scientists spend more than 80% of their time in data cleaning and preparation [162]. As a result, there are huge efforts in the data science community dedicated to developing various data cleaning algorithms [57] and full-fledged systems [67]. Mobility data is no exception in terms of its need for data cleaning and preparation procedures. However, for numerous reasons, cleaning and preparing mobility data poses unique challenges. This section discusses current efforts and challenges of mobility data cleaning.

3.1 Efforts in Mobility Data Cleaning

A recent survey [125] and data quality assessment tool [91] have discussed various sorts of errors that negatively impact data quality in spatial and mobile environments. Motivated by the inaccuracy of location tracking devices, several efforts were dedicated to address (a) the spatial inherent inaccuracy of GPS devices and (b) the uncertainty of moving object whereabouts between two known locations, which is a result of low sampling rates due to bandwidth and battery limitations.

As the spatial inaccuracy indicates erroneous GPS coordinates, the efforts to identify and correct such coordinates have focused on either finding and eliminating outliers or map matching all coordinates to an underlying fixed and trusted infrastructure (e.g., road network map). For the case of map matching, existing efforts aim to match/snap all GPS traces to an underlying road network [42, 46]. Proposed techniques range from simply snapping each point to its nearest road to applying a Markov chain to identify the most probable road segment that each point should be snapped to. In the case in which there is no underlying road infrastructure (e.g., marine transportation or animal movement), outlier detection techniques are used to identify and remove erroneous points [224].
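The simplest end of the map-matching spectrum, snapping each point to its nearest road, can be sketched as follows. This is a minimal illustration and not the method of [42, 46]: it assumes planar (projected) coordinates, represents roads as straight segments, scans all segments linearly, and uses the hypothetical names `project_onto_segment` and `snap_to_nearest_road`. Probabilistic matching over sequences of points is not shown.

```python
from typing import List, Tuple

Point = Tuple[float, float]    # planar (x, y) coordinates, assumed projected
Segment = Tuple[Point, Point]  # a straight road segment given by its endpoints


def project_onto_segment(p: Point, seg: Segment) -> Point:
    """Return the point on the segment that is closest to p."""
    (ax, ay), (bx, by) = seg
    dx, dy = bx - ax, by - ay
    length_sq = dx * dx + dy * dy
    if length_sq == 0.0:
        return (ax, ay)  # degenerate segment: both endpoints coincide
    # Position of the orthogonal projection along the segment, clamped to [0, 1].
    t = ((p[0] - ax) * dx + (p[1] - ay) * dy) / length_sq
    t = max(0.0, min(1.0, t))
    return (ax + t * dx, ay + t * dy)


def snap_to_nearest_road(p: Point, roads: List[Segment]) -> Tuple[int, Point]:
    """Return (index of the nearest segment, snapped point); index is -1 if roads is empty."""
    best_idx, best_pt, best_d = -1, p, float("inf")
    for i, seg in enumerate(roads):
        q = project_onto_segment(p, seg)
        d = (p[0] - q[0]) ** 2 + (p[1] - q[1]) ** 2  # squared Euclidean distance
        if d < best_d:
            best_idx, best_pt, best_d = i, q, d
    return best_idx, best_pt


# Two parallel made-up roads; the noisy GPS point lies closer to the first one.
roads = [((0.0, 0.0), (10.0, 0.0)), ((0.0, 5.0), (10.0, 5.0))]
idx, snapped = snap_to_nearest_road((5.0, 1.0), roads)
print(idx, snapped)
```

Because each point is snapped independently, a noisy trace near two parallel roads can jump between them from one point to the next, which is one motivation for the sequence-aware techniques mentioned above.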

Irrespective of the collection method and device settings, there is also unavoidable uncertainty in movement data caused by its discreteness. Since time is continuous, the data cannot refer to every possible instant. For any two successive instants, there is a temporal gap in which the whereabouts of the moving objects are unknown. To overcome such location uncertainty, several efforts were dedicated to modeling the uncertainty of mobility data, as surveyed in [278].

3.2 Challenges in Mobility Data Cleaning

This section delves into some challenges linked to cleaning mobility data that the community needs to tackle.

Challenge 5. Inaccuracy in the Movement Space Infrastructure. A unique challenge in mobility data is that, in many cases, its reference points are the ones that are inaccurate. In particular, mobility data that represents movement on a road network may be more accurate than the road network itself. Road networks, like any other type of data, suffer from all sorts of inaccuracy and may not even be available in many places [160]. In fact, Microsoft has recently announced that it has found more than 1 million kilometers of roads missing from current maps [148]. This is why there is a whole area of industrial and academic research about map inference, which aims to infer (all or missing parts of) the road network from either satellite images [29] or trajectory data [37]. However, almost all of these techniques focus on making accurate maps in terms of topology. More effort is needed to develop map inference algorithms that go beyond inferring the map topology to inferring map metadata (e.g., road speed, traffic lights, number of lanes, and turns), without which mobility data would not be accurate, as its road network reference itself is missing important data. A major step towards cleaning mobility data would be to first clean its reference map.

Challenge 6. Filling in Temporal Mobility Gaps. As mentioned earlier, there are lots of efforts dedicated to modeling the uncertainty of moving objects’ whereabouts between two consecutive time instances. However, uncertainty poses different challenges to downstream functions and applications, including the need to develop new techniques for indexing, query processing, and data analysis for various uncertainty models. One way to overcome this is to try to infer the actual whereabouts of a moving object between any two time instances with known locations. There are already several efforts to insert artificial points between two consecutive trajectory points, with the promise that these points act as if the trajectory was collected at a very high sampling rate. This process has various names, e.g., trajectory interpolation [136, 268], trajectory completion [130], trajectory data cleaning [261], trajectory restoration [124], trajectory map matching [42], trajectory recovery [243], and trajectory imputation [76]. However, the large majority of such work relies on matching the trajectory points to the underlying road network, where the imputation becomes finding the road network’s shortest path between two consecutive trajectory points. Unfortunately, this is not applicable to the case in which the road network is unknown, untrusted, or inaccurate. Hence, more recent attempts try to do data-driven trajectory imputation without relying on the underlying road network [76, 80]. However, these techniques are either not scalable to city-scale trajectory datasets or require dense historical data that drives their imputation process. There is an immense need to develop scalable, accurate, and fine-grained imputation that almost mimics a continuous data stream of trajectory locations.
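The idea of inserting artificial points between two consecutive trajectory points can be sketched with the most naive baseline, straight-line interpolation at a fixed time step. This is not one of the cited techniques: it assumes planar coordinates and constant speed along a straight line between samples, ignores any road network, and uses the hypothetical name `densify`.

```python
from typing import List, Tuple

Sample = Tuple[float, float, float]  # (x, y, t): planar position and time in seconds


def densify(traj: List[Sample], step: float) -> List[Sample]:
    """Insert linearly interpolated samples so that no gap exceeds `step` seconds."""
    if step <= 0:
        raise ValueError("step must be positive")
    if not traj:
        return []
    out = [traj[0]]
    for (x0, y0, t0), (x1, y1, t1) in zip(traj, traj[1:]):
        gap = t1 - t0
        n = int(gap // step)  # whole steps that fit into the gap
        for k in range(1, n + 1):
            t = t0 + k * step
            if t >= t1:
                break  # never duplicate or overshoot the real sample
            w = (t - t0) / gap  # fraction of the gap covered so far
            out.append((x0 + w * (x1 - x0), y0 + w * (y1 - y0), t))
        out.append((x1, y1, t1))  # always keep the original observation
    return out


# Made-up sparse trajectory: three samples, 40 s and 30 s apart.
sparse = [(0.0, 0.0, 0.0), (40.0, 0.0, 40.0), (40.0, 30.0, 70.0)]
dense = densify(sparse, step=10.0)
print(len(dense))
```

On a road network this baseline would cut corners through buildings, which is why the network-based and data-driven approaches discussed above replace the straight line with a shortest path or a learned route.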

4 Mobility Data Analytics

Spatial data is special. Unlike non-spatial features, location attributes (e.g., longitude and latitude) rarely exhibit linear or other simple functional relationships to variables of interest. It rarely makes sense to model a variable of interest directly in relation to spatial attributes. Instead, it is distances that matter. According to Tobler’s first law of geography, “everything is related to everything else, but near things are more related than distant things” [221]. For mobility data, proximity is further extended with time, i.e., objects that are close in space and time. In addition to this concept of spatiotemporal autocorrelation, what makes mobility data even more challenging to handle is that it is often observed from humans whose behavior can often be irrational and difficult to explain. As Nobel Prize laureate Murray Gell-Mann famously said, “Think how hard physics would be if particles could think” [172]. In mobility data, unlike in physics, the “particles” of interest are often humans who can think. Data collection sensors have the capability to capture the spatiotemporal locations of moving objects, but not their behavioral aspects. These difficulties require new paradigms, techniques, and algorithms that can analyze and learn from the spatiotemporal data and that can explain and predict the associated behavior. This section discusses current efforts and challenges of mobility data analysis.
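The point that distances, rather than raw coordinates, carry the signal can be made concrete with a small sketch. It computes the great-circle (haversine) distance between two latitude/longitude points and combines it with a time difference to test closeness in space and time. The function names, the thresholds, and the mean Earth radius of 6,371 km are illustrative assumptions, not definitions from this article.

```python
import math
from typing import Tuple

EARTH_RADIUS_M = 6_371_000.0  # mean Earth radius in meters; a common approximation

Obs = Tuple[float, float, float]  # (latitude deg, longitude deg, time in seconds)


def haversine_m(lat1: float, lon1: float, lat2: float, lon2: float) -> float:
    """Great-circle distance in meters between two points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    # Clamp against rounding so asin never receives a value above 1.
    return 2 * EARTH_RADIUS_M * math.asin(min(1.0, math.sqrt(a)))


def close_in_space_and_time(a: Obs, b: Obs, max_m: float, max_s: float) -> bool:
    """True if the two observations are within max_m meters and max_s seconds."""
    near = haversine_m(a[0], a[1], b[0], b[1]) <= max_m
    recent = abs(b[2] - a[2]) <= max_s
    return near and recent


# Two made-up observations on the equator, about 111 m and 30 s apart.
p = (0.0, 0.0, 0.0)
q = (0.0, 0.001, 30.0)
print(round(haversine_m(p[0], p[1], q[0], q[1]), 1))
print(close_in_space_and_time(p, q, max_m=200.0, max_s=60.0))
```

A fixed threshold pair is the crudest possible notion of spatiotemporal proximity; weighting schemes that decay with distance and elapsed time are a natural refinement not shown here.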

4.1 Efforts in Mobility Data Analytics

Mobility data analytics has already gained momentum in research in recent years. Dedicated workshops have been held at major conferences, including the ACM SIGSPATIAL International Workshop on Analytics for Big Geospatial Data (BigSpatial) since 2011 [209], the Big Mobility Data Analytics (BMDA) workshop at EDBT since 2018 [177], and the ACM SIGSPATIAL International Workshop on Animal Movement Ecology and Human Mobility (HANIMOB) since 2021 [171]. Surveys on the status of research exist [20, 198].

Mobility data analytics encompasses various application domains and involves analyzing data from different sources such as urban [265], maritime [61], aviation [59], animal movement [171], and indoor movement [114]. Among these different themes, urban mobility stands out with a fairly large body of research, including green routing [10], traffic anomaly detection [173], hot spot and hot path analysis [166], road traffic prediction [161], and travel time estimation [240]. Trajectories of moving objects have been used as means to create and continuously update the road network [159]. Public transport systems also collect ticketing data in the form of passenger check-ins, sometimes also associated with check-outs. This data has been shown to be very useful to transit planners in understanding passenger demand and movement patterns in daily operations as well as in the strategic long-term planning of the network [227]. Personal mobility of individuals is also a subject of analysis that includes, e.g., activity recognition [50, 175], personalized routing [66], matching with ride-sharing services [19], and crowd-sourcing [178].

While a significant portion of research focuses on understanding and analyzing data through analytics, there are also important efforts dedicated to developing generic analysis tools for spatiotemporal data that are agnostic to the application domain. Efforts regarding generic methods for mobility data analysis include, among many others, trajectory clustering [244], trajectory similarity measures [224], outlier detection [101], transportation mode classification [40], spatiotemporal pattern detection [199], and trajectory completion [121]. However, and despite these many research efforts towards analyzing mobility data, there is a lack of common data analysis tools and systems. The scientific software environment for mobility data analysis is rather fragmented. For example, [117] lists 58 packages in their review of R packages for movement and [92] reviews Python libraries for movement data analysis and visualization.

Recent years have seen a notable increase in research on deep learning for mobility data analysis [137, 250]. This brought an increased adoption of various paradigms and (adapted versions of) architectures used in other areas in which deep learning has brought improvements in tasks such as clustering/classification [149], prediction [122], recommendation [30], and information propagation [274]. For example, Generative Adversarial Network (GAN)-based architectures have been used recently to learn representations of trajectories and generate synthetic trajectories [84]. Following the introduction of Transformers [233], Transformer-based approaches have also been used for mobility modelling and trajectory prediction [254], given the sequential properties of mobility data. Other deep learning approaches, such as contrastive learning [273], have also been exploited in mobility data settings, along with investigation of the impact and benefits of representation learning [86].

4.2 Challenges in Mobility Data Analysis

This section highlights open problems related to mobility data analysis that need consideration from the community.

Challenge 7. Machine Learning (ML) for Mobility Data. State-of-the-art deep learning (DL) models, such as Transformers [233], were not initially developed with mobility data science in mind. They were derived from the natural language processing (NLP) and computer vision domains. The community needs to provide best practices for doing ML (and DL) for mobility data.

A major hurdle, and a research opportunity as well, is that existing ML and analytics tools, e.g., TensorFlow and PyTorch, do not support location and mobility as base data types to reason about. Thus, even basic analyses, such as clustering, classification, and similarity, need to be extended when mobility data is involved. These tasks, as well as higher-level analyses, cannot be totally independent. Instead, common basic building blocks could have an impact on all or some of them. For example, exploring the effectiveness of embeddings for mobility data analysis is a basic block that could impact different ML-based analysis tasks. This raises the challenge of building analysis primitives and common building blocks for applications that could shape a framework of ML-based mobility data analysis.

Another major hurdle is the robustness of data-driven mobility models. It is widely known that data-driven models (as in the case of ML and DL) are only as good as the data used to train them. However, given the changes in mobility behaviors, such as during the COVID-19 pandemic and the associated lockdowns, and during environmental events and disasters, traditional ML-based, and even recent DL-based, methods are no longer robust. The models’ performance deteriorates in unseen events, especially as new behaviors emerge and then persist. Recent efforts include the incorporation of ‘contextual awareness’ and ‘memory’ in an enhanced event-aware spatiotemporal network [245] for predicting mobility in multiple modes of transportation, including taxis, cycling, and subways, during unprecedented events such as COVID lockdowns or snowstorms, both as the events emerged and up to 30 days afterwards. However, more work needs to be done on modelling and understanding mobility behavior in a way that is robust to changes due to societal events.

Challenge 8. Progressing from Next Location Prediction to Movement Behavior Understanding. Due to the wide availability of aggregated check-in and foot-traffic data, many researchers focus on the problem of location prediction, e.g., [253]. Leveraging predictions such as “User X will visit Coffee Shop A next” or “ \(32\pm 4\) users will visit Coffee Shop A in the next hour” has some direct applications. It could be useful for providing information about parking (“parking at location X appears to be a problem today, so consider...”), for battery-charging opportunities, or for providing information about collective transportation status (“Metro station X that you are expected to visit is closed for repairs, so instead...”). One could provide a new transportation schedule and departure time in response to problems at an anticipated future location of a user, just like airlines at times update itineraries in the case of issues. Earlier work has been based on data mining techniques to detect periodic behavior, e.g., [36, 75, 116]. Beyond predicting locations, if we understand the underlying behavior at the individual-, group-, or population-scales that leads to these predictions, we could understand why one coffee shop chain has increasing visitor rates (e.g., due to a movement towards organically grown coffee sold by the coffee shop). Through inferring from the data about such behaviors, only then can we take corresponding actions not only to predict locations but also to prescribe actions (e.g., offering more organic coffee) to improve visitor rates. This understanding of (human) behavior will broadly affect applications using mobility data. Traditional spatiotemporal data science allows for predictive analytics to predict the future. In contrast, mobility data science enables prescriptive analytics by understanding the underlying human behavior to devise actions and policies that aim to achieve desirable targets.
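To make the contrast between prediction and understanding concrete, the following is a minimal sketch of a next-location predictor, assuming a first-order Markov model over visit sequences. The function names and place labels are hypothetical and are not taken from the cited work. The model can answer "where next?" from transition counts alone, but it encodes nothing about why a visit happens, which is the gap this challenge describes.

```python
from collections import Counter, defaultdict


def fit_transitions(visit_sequences):
    """Count place-to-place transitions over all visit sequences."""
    transitions = defaultdict(Counter)
    for seq in visit_sequences:
        for current_place, next_place in zip(seq, seq[1:]):
            transitions[current_place][next_place] += 1
    return transitions


def predict_next(transitions, current_place):
    """Return (most frequent next place, empirical probability), or None if unseen."""
    counts = transitions.get(current_place)
    if not counts:
        return None
    place, count = counts.most_common(1)[0]
    return place, count / sum(counts.values())


# Hypothetical check-in sequences; the labels are invented for illustration.
sequences = [
    ["home", "coffee_shop_a", "office"],
    ["home", "coffee_shop_a", "office", "gym"],
    ["home", "office"],
]
model = fit_transitions(sequences)
prediction = predict_next(model, "home")
print(prediction)
```

Moving from such a predictor to prescriptive analytics would require additional context (for example, semantic or behavioral features) that transition counts do not capture.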

An open problem in understanding mobility behavior is the lack of labels or human annotations that provide insights into the actual observations. Several workarounds have been proposed, including cross-domain data fusion and interpretability mechanisms for ML or DL models. When geographical information is fused with contextual features and social behaviors, not only can location prediction be improved, but insights can also be provided about the underlying visitor behavior [253], even if no human-labelled data are provided about the mobility behaviors.

Explainability of the AI and ML models that underpin many such predictive behavior models therefore remains an open challenge, especially since DL models are black boxes. One approach for DL-based models is disentangled representation learning; a recent work [266] shows that disentangling latent spatiotemporal factors can help explain how the underlying latent factors learned by DL models are correlated. It can also be used for dimensionality reduction and can assist in few-shot learning cases.

Challenge 9. Visual Analytics. Visualization and exploratory analysis of mobility data have long been a hot topic in visual analytics [15]. More recently, the trend has turned to combining visualization with modeling and simulation to support decision-making [123]. This kind of research is by necessity application oriented, while much less has been done on developing more general ideas and approaches.

One general research problem that has only been slightly touched on in visual analytics but not systematically addressed is human involvement in real-time analysis of big mobility data. Is it possible to define realistic scenarios for involving human intelligence in big data analytics taking into account the cognitive limitations of human analysts with regard to the amount of information that can be perceived, speed of processing, and time required for analytical reasoning and contributing to the analysis process? Also how does one combine computational methods of analysis, such as ML, with human expert knowledge and reasoning? The involvement of human intelligence is limited to thoughtful data preparation, feature selection, parameter setting, and so on. It would be great to find ways to make more direct and effective use of human-possessed concepts and, particularly, knowledge of causal relationships. Hence, a grand research challenge for visual mobility analytics is to develop approaches to understanding and modeling mobility behaviors from low-level movement data, such as trajectories of moving entities.

The following research problem is how to analyze behaviors after they have been extracted from elementary movement data and represented by appropriate data structures. A conceptual framework should be developed to enable defining the types of conceivable patterns of movement behavior. This will provide orientation for developing visualization techniques facilitating visual discovery of behavioral patterns as well as algorithmic methods for detection of specified types of patterns. These techniques and methods should be incorporated into systems and workflows for analyzing the contexts in which various patterns take place and developing models for describing and predicting mobility behaviors depending on the context.

5 Mobility Data Management Infrastructure

Classical data management systems have been designed for generic data types, where spatial and temporal data can be supported as new additional types. Yet, the core functionality of the data management engine does not acknowledge the spatial and temporal properties of mobility data. For example, mobility data calls for storing and querying locations of objects that evolve over time. The evolution can be in the location, the extent, and/or the properties of the object. The evolution can happen in discrete steps, e.g., check-ins, or in a continuous form. Thus, it is desired that the data management platform is able to represent the history, the current location, and possibly the near future of the moving object. Another example is classical index structures that are built with the assumption that the read workload is significantly higher than the write workload and, hence, the index structure does not change often. Mobility data exhibits a different workload, in which the write workload (e.g., object location updates) is significantly higher than the read workload, which makes all classical index structures simply not applicable to mobility data. A third example is that simple queries of mobility data, e.g., nearest neighbor search, can be supported by classical data management systems by finding the distance between the user location and all other objects, sorting all objects based on that distance, and getting the closest one. This cumbersome approach is mainly due to the lack of a specialized nearest-neighbor operator. Should we have one, that operator could seamlessly integrate with the query executor and optimize a data management engine to efficiently support an important query in most mobility data applications. A last example is that classical methods for scaling up data management in distributed environments rely on data distribution, mostly based on the data keys. This does not work well in scaling up mobility data, as it is always desired to distribute mobility data in a way that, spatially and temporally, nearby objects are grouped together in the same cluster or computing node. This section discusses current efforts and challenges of mobility data management.
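The nearest-neighbor example above can be sketched in a few lines. This is an illustration only, assuming planar (projected) coordinates and Euclidean distance; `nearest_by_sorting` and `GridIndex` are hypothetical names and do not describe the operator of any particular system. The first function mirrors the generic compute-all-distances-then-sort approach, while the toy grid index shows how a location-aware operator can stop early without scanning every object.

```python
import math
import random
from collections import defaultdict


def nearest_by_sorting(query, objects):
    """Generic approach: compute all distances, sort, take the first."""
    return sorted(objects, key=lambda p: math.dist(query, p))[0]


class GridIndex:
    """Toy uniform grid over planar coordinates (illustrative only)."""

    def __init__(self, cell_size):
        self.cell_size = cell_size
        self.cells = defaultdict(list)

    def _cell(self, point):
        return (math.floor(point[0] / self.cell_size),
                math.floor(point[1] / self.cell_size))

    def insert(self, point):
        self.cells[self._cell(point)].append(point)

    def nearest(self, query):
        """Visit rings of cells around the query until no closer point can exist."""
        if not self.cells:
            return None
        cx, cy = self._cell(query)
        best, best_dist = None, math.inf
        ring = 0
        # Every point in ring r lies at least (r - 1) * cell_size from the query.
        while best is None or (ring - 1) * self.cell_size < best_dist:
            for dx in range(-ring, ring + 1):
                for dy in range(-ring, ring + 1):
                    if max(abs(dx), abs(dy)) != ring:
                        continue  # only the perimeter of the current ring
                    for point in self.cells.get((cx + dx, cy + dy), ()):
                        d = math.dist(query, point)
                        if d < best_dist:
                            best, best_dist = point, d
            ring += 1
        return best


# Synthetic objects in a 1000 x 1000 area (invented data).
rng = random.Random(7)
objects = [(rng.uniform(0, 1000), rng.uniform(0, 1000)) for _ in range(500)]
index = GridIndex(cell_size=50.0)
for obj in objects:
    index.insert(obj)
query = (420.0, 137.0)
print(index.nearest(query), nearest_by_sorting(query, objects))
```

The grid is only one possible design; tree-based indexes are equally valid choices, and a query far from all data would make this simple ring search slow.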

5.1 Efforts in Mobility Data Management

There has already been extensive research in all layers of mobility data management infrastructure. In terms of data modeling, early models based on the constraint database model aim to support simple moving objects (i.e., points), e.g., [93]. More complex data types (e.g., moving regions) have been supported by later models based on abstract data types, e.g., [100], that are still being used in recent systems, e.g., [276]. More recent efforts have been introduced to capture the semantics of trajectories of moving objects. Other models were also proposed to capture specialized modes of movement, including indoor environments, e.g., [113], network constrained, e.g., [98], fuzzy trajectories, e.g., [225], and detecting periodic moving patterns, e.g., [33, 36, 75, 116]. In terms of indexing, tens of index structures have been proposed to support efficient indexing, storage, and retrieval for spatiotemporal data as either historical data, current locations, or continuously updated locations, e.g., [143, 146, 154, 163]. This forms the infrastructure support for various spatiotemporal query processing techniques for various query operators over moving objects, including spatiotemporal range queries [156], spatiotemporal nearest-neighbor queries, e.g., [11, 12, 13, 214, 252], reverse nearest neighbor queries [35], skyline queries [108], and scalable spatial and spatiotemporal joins, e.g., [247, 251].

In terms of academic full-fledged systems, the SECONDO system was introduced in the early 2000s as a comprehensive testbed for distributed moving object databases covering all aspects of data modeling, indexing, and querying [97]. More recently, MobilityDB, implemented on PostGIS, has been introduced as a scalable system with a wider functionality on moving object databases [228, 276]. In terms of Big Data systems, the ST-Hadoop [8], SUMMIT [7], and HadoopTrajectory [22] systems extend the Hadoop system to support spatiotemporal data and trajectories, while other systems, e.g., [65, 144, 145], extend the Twitter Storm distributed data streaming system to support streamed location data. TrajSpark [264], Dita [207], and TrajMesa [127] extend the Spark system to support various index structures and query operations over trajectory data. SharkDB [242] extends in-memory column-oriented storage engines to support trajectories. In the open-source community and in industry, PostGIS [181] supports very basic trajectory functions and Oracle Spatial supports streaming point data to capture real-time mobility [169], whereas Microsoft Azure [25] supports storing trajectory data in Azure table and utilizing Azure Redis for indexing. Distributed-MobilityDB [23] integrates the trajectory data management of MobilityDB with a distributed PostgreSQL database to provide a distributed moving object database.

5.2 Challenges in Mobility Data Management Infrastructure

Though there is already a lot of work in various components of mobility data management infrastructure, there is an apparent lack of integrated systems that offer comprehensive functionality to end users, encapsulated in full-fledged systems that support mobility data science. Hence, the challenges in this section mainly focus on system building.

Challenge 10. Building Systems with Mobility Data in Mind. Location data has almost always been supported in data systems as an afterthought problem. Many systems, e.g., Postgres, Storm, Spark, and Hadoop, have not been originally designed with location data support in mind. What typically happens is that spatial data types get augmented into tuple-oriented systems to support the location data type. For example, a restaurant tuple that describes various attributes of a restaurant is augmented with the latitude and longitude of the location attribute of the restaurant to support location services. Spatial indexes are provided to speed up the access to these attributes, and some accompanying spatial operators are provided to operate on the location attributes to provide location services, e.g., range or k-nearest-neighbor searches. While this approach works to some extent, systems coming out of this approach end up with sub-par performance for spatial data and, hence, for mobility data. Given the myriad applications that rely on mobility data, it is important that systems are extended with native support for locations and mobility data. Thus, mobility data types and operations should be integrated in the core of these systems and should not be considered as an afterthought problem. This can go through all kinds of systems, starting from database management systems that need to be spatially and temporally aware to support mobility data to scalable big data and NoSQL systems, where injecting spatial and temporal awareness into their core functionality will inherit their scalability to support scalable mobility data science.

Challenge 11. Location Data as First-Class Citizens. Having locations as the core of mobility data calls for treating location data as a first-class citizen in a location data system that at the same time can be extended to support other data types [16]. These location data systems can be presented as Location+X systems, e.g., as in [16], where the data types “X” can be keywords (e.g., to support spatial keywords and tweets), graphs (e.g., to support road-network data), relational data (e.g., to support descriptions of spatial data objects), click streams (e.g., to support check-in data), document data (e.g., to support points of interest and documents that describe them), or annotated trajectories (e.g., location + time + textual annotations), among others. In many location services, more than one data type X may need to be supported, e.g., a graph data type combined with a document or keyword data types, which calls for a multi-model-like data system. This gives rise to an ecosystem where location is at the core with some form of an extensible multi-model data system that supports the multitude of data types “X”. However, current multi-model data system technology is lacking in several aspects. First, they do not support data streaming, which is a cornerstone in mobility data due to the online streamed locations of moving objects. Second, we do not want to fall into the trap of adopting existing multi-model technologies that may affect location being a first-class citizen. However, the need for supporting multi-models in one seamlessly integrated location+X system remains a necessity. In addition to supporting location data via a native location+X engine, an ecosystem for mobility data would also include many important utilities to facilitate a broad spectrum of location service applications. 
From the input data side, to help navigate the vast amounts of available location datasets and discover the right datasets for a given task, a location dataset lake infrastructure and location dataset discovery, cleaning, and integration facilities are needed. From the presentation side, a comprehensive visualization suite is envisioned to support visualizations for combinations of spatial and temporal data analytics on top of location data.

Challenge 12. Streaming, Batch, and Hybrid Workloads. Motivated by the application needs, mobility data management needs to support both batch and real-time data through all system layers, from digesting the data to analyzing and visualizing it. For example, a common requirement is to visualize the positions of a fleet of vehicles in real time, which only requires access to the most recent positions of the vehicles. Yet, at the same time, there is a need to perform batch analytics on the full trajectory of these vehicles (e.g., to assess whether the trajectories exhibit some unexpected behavior). Generally speaking, the need to have both real-time and historical data has led to the development of the data warehouse domain, where operational databases cover the real-time Online Transaction Processing (OLTP) whereas data warehouses cover the historical Online Analytical Processing (OLAP). Since having two different systems for the two kinds of workloads is very costly, a new approach referred to as Hybrid Transactional and Analytical Processing (HTAP) has recently been proposed. However, mobility data exhibits significantly different workloads from other data, where streaming data is dominant in terms of objects continuously streaming their new locations. Historical data is not of less importance and is continuously appended. While some efforts have been spent in the direction of write-optimized indexing for location data, e.g., as in [211], more research efforts need to be spent to adopt the concepts behind HTAP systems to support the nature of mobility data.
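The fleet example in this challenge can be illustrated with a small in-memory sketch, assuming planar coordinates and a single process. `FleetStore` and its methods are hypothetical and are not the design of any HTAP system; the point is only that one ingestion path feeds both a latest-position view (for real-time display) and an append-only history (for batch analytics).

```python
import math
from collections import defaultdict


class FleetStore:
    """Toy store serving both real-time and historical queries (illustrative only)."""

    def __init__(self):
        self.latest = {}                   # object id -> (timestamp, x, y)
        self.history = defaultdict(list)   # object id -> appended (timestamp, x, y)

    def ingest(self, object_id, timestamp, x, y):
        """Append to history and refresh the latest position if the update is newer."""
        record = (timestamp, x, y)
        self.history[object_id].append(record)
        current = self.latest.get(object_id)
        if current is None or timestamp >= current[0]:
            self.latest[object_id] = record

    def current_positions(self):
        """Real-time view: only the most recent position per object."""
        return {oid: (x, y) for oid, (_, x, y) in self.latest.items()}

    def path_length(self, object_id):
        """Batch-style analytic: length of the full trajectory in time order."""
        points = sorted(self.history.get(object_id, []))
        return sum(math.dist(a[1:], b[1:]) for a, b in zip(points, points[1:]))


store = FleetStore()
store.ingest("bus_1", 0, 0.0, 0.0)
store.ingest("bus_1", 20, 6.0, 8.0)
store.ingest("bus_1", 10, 3.0, 4.0)   # late-arriving update
print(store.current_positions(), store.path_length("bus_1"))
```

A real system would additionally need persistence, write-optimized indexing, and concurrency control, which is where the research effort described above is required.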

6 Mobility Data Privacy

As we discussed in Challenge 1, mobility data privacy is a core problem in the mobility data science pipeline. Studies have shown that location data could reveal sensitive personal information, such as home and workplace, and religious and sexual inclinations [183]. As localization technology advances and extremely fine-grained location tracking is being enabled, it may even reveal products of interest in the stores we have visited, doctors we saw at a hospital, bookshelves of interest in a library we have visited, artifacts we observed in a museum, and generally anything that might publicize our preferences, beliefs, and habits. A recent survey has shown that 78% of smartphone users among 180 participants believe that apps accessing their location pose privacy threats [47].

While there are many privacy-preserving data collection and data analysis techniques developed for personal data, mobility data introduces unique challenges due to (1) spatiotemporal correlations in the mobility data, which often results in increased privacy cost due to privacy composition for correlated data or downgraded utility for downstream applications; (2) complex location semantics (e.g., corresponding points of interest of locations) and mobility behaviors (e.g., regular vs. one-time visit of a location) that existing privacy definitions may not be able to capture; and (3) diverse and emerging application scenarios, such as contact tracing using mobility data for which existing privacy algorithms designed for aggregate data analytics are not suitable. In this section, we briefly review existing privacy notions and techniques developed for location and mobility data and discuss several open challenges.

6.1 Efforts in Mobility Data Privacy

We categorize existing techniques in mobility data privacy into two main settings corresponding to our data pipeline: (1) local setting (data collection stage) and (2) central setting (data analysis stage). In the local setting, the mobility service provider that collects mobility data is assumed to be untrusted. Hence, each mobile user or entity can apply privacy-preserving mechanisms before the data is collected by the service provider. In the central or global setting, the mobility service provider is assumed to be trusted and collects the raw mobility data. The provider can apply privacy-preserving mechanisms for statistical analysis and share aggregated data, ML models trained from the data, or synthetic data mimicking the original data with untrusted third parties.

Local Setting. In recent years, local differential privacy (LDP), the local variant of differential privacy [63, 94], has become the de facto standard for preserving privacy at the data collection stage. Users can perturb their raw data using an LDP mechanism before uploading it to an untrusted server. Most existing mechanisms are designed to ensure utility for aggregate queries or analytics (e.g., frequency or density estimation), which requires the aggregation of the perturbed values from a large group of users, whereas the individual perturbed value may not provide much utility. Several works applied existing LDP schemes to location data but the utility is poor [119, 267]. Other works relaxed LDP to personalized LDP [52]. Recent works developed improved LDP mechanisms for location data with better utility [239].
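As an illustration of why individual LDP reports carry little utility while aggregates remain useful, the sketch below uses k-ary randomized response, a standard LDP mechanism, over a small set of grid cells. The choice of mechanism, the cell labels, and the synthetic population are assumptions made for this example and are not taken from the cited works.

```python
import math
import random
from collections import Counter


def krr_perturb(true_cell, cells, epsilon, rng):
    """Report the true cell with probability e^eps / (e^eps + k - 1), else another cell."""
    k = len(cells)
    p_truth = math.exp(epsilon) / (math.exp(epsilon) + k - 1)
    if rng.random() < p_truth:
        return true_cell
    return rng.choice([c for c in cells if c != true_cell])


def krr_estimate(reports, cells, epsilon):
    """Unbiased frequency estimates recovered from the perturbed reports."""
    k, n = len(cells), len(reports)
    p = math.exp(epsilon) / (math.exp(epsilon) + k - 1)
    q = 1.0 / (math.exp(epsilon) + k - 1)
    counts = Counter(reports)
    return {c: (counts[c] / n - q) / (p - q) for c in cells}


# Synthetic population: 60% in cell A, 30% in B, 10% in C, none in D.
rng = random.Random(0)
cells = ["A", "B", "C", "D"]
true_cells = ["A"] * 12000 + ["B"] * 6000 + ["C"] * 2000
reports = [krr_perturb(c, cells, 1.0, rng) for c in true_cells]
estimates = krr_estimate(reports, cells, 1.0)
print(estimates)
```

With four cells and epsilon set to 1, fewer than half of the individual reports are truthful, yet the aggregate estimates are close to the true frequencies because the perturbation probabilities are known and can be inverted.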

In addition to supporting aggregate data analytics, location-based services (LBSs), including range queries, spatial crowdsourcing, and the emerging contact tracing for pandemic control, require the precision of the perturbed locations themselves. Geo-indistinguishability (GeoInd) [14] relaxes LDP for location data: it requires locations to be indistinguishable only within a radius, with the level of indistinguishability scaled by their distance, which provides a better privacy-utility trade-off for LBSs. Later works extended GeoInd to account for temporal correlations between consecutive locations of mobile users [249] and to protect customizable spatiotemporal activities instead of raw locations or trajectories [43]. Other works applied the GeoInd mechanisms and variants for privacy-enhanced spatial crowdsourcing and contact tracing [64, 220]. Besides statistical privacy techniques, Private Information Retrieval (PIR) and secure multiparty computation (MPC) techniques have also been developed to allow LBS queries such as range queries and contact tracing without revealing individual locations [6, 56, 87, 186], but they are generally more computationally expensive and need to be designed for each different query.
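A minimal sketch of location perturbation in the spirit of GeoInd is given below, assuming the planar Laplace mechanism, in which the noise direction is uniform and the noise radius follows a Gamma distribution with shape 2 and scale 1/epsilon. The function names, the per-meter privacy parameter, and the flat-earth conversion to degrees are assumptions of this example; truncation and discretization steps that a deployed mechanism would need are omitted.

```python
import math
import random

EARTH_RADIUS_M = 6_371_000.0


def planar_laplace_offset(epsilon_per_meter, rng):
    """Sample an (east, north) offset in meters from a planar Laplace distribution."""
    theta = rng.uniform(0.0, 2.0 * math.pi)
    # The radial density eps^2 * r * exp(-eps * r) is Gamma(shape=2, scale=1/eps).
    radius = rng.gammavariate(2.0, 1.0 / epsilon_per_meter)
    return radius * math.cos(theta), radius * math.sin(theta)


def perturb_location(lat, lon, epsilon_per_meter, rng):
    """Shift a latitude/longitude pair by a planar Laplace offset (local approximation)."""
    east, north = planar_laplace_offset(epsilon_per_meter, rng)
    dlat = math.degrees(north / EARTH_RADIUS_M)
    dlon = math.degrees(east / (EARTH_RADIUS_M * math.cos(math.radians(lat))))
    return lat + dlat, lon + dlon


rng = random.Random(1)
epsilon = 0.01  # per meter; the expected displacement is 2 / epsilon = 200 m
noisy_lat, noisy_lon = perturb_location(44.97, -93.23, epsilon, rng)
print(noisy_lat, noisy_lon)
```

A smaller epsilon yields larger displacements, which is precisely the trade-off that matters for LBS utility: a range query around the reported location becomes less accurate as the privacy guarantee becomes stronger.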

Global Setting. Many works have applied differential privacy (DP) for computing and publishing aggregate mobility data. Compared with DP algorithms for tabular data, they typically exploit the hierarchical structure of locations and sequential patterns of trajectories to improve utility [2, 49, 150, 184, 204]. Some works also utilized the DP aggregates for task assignment in spatial crowdsourcing [219]. In practice, mobility data providers have started sharing aggregated mobility datasets with DP, especially in response to the pandemic, such as Meta’s population density maps and Movement Range maps, Google’s COVID-19 Community Mobility Reports, and SafeGraph’s Patterns [24]. Other works have applied DP for training ML models using mobility data, for example, for location prediction [5]. Another line of work attempts to generate synthetic trajectories or mobility data based on raw trajectories with formal DP guarantees [103, 241]. From the privacy attack side, recent works demonstrated the possibility of membership inference attacks on aggregate location data and linking attacks, and the defense power of DP against some of these attacks, reinforcing the need for ensuring rigorous privacy even for seemingly anonymous aggregate mobility data and ML models trained from mobility data [115, 182].
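The simplest form of DP aggregate publishing in the central setting can be sketched as follows, assuming the standard Laplace mechanism on per-cell counts with each user contributing to exactly one cell (so the sensitivity is 1). The function names and data are hypothetical, and the sketch does not implement the hierarchical or sequential structures that the cited works exploit to improve utility.

```python
import random
from collections import Counter


def laplace_noise(scale, rng):
    """Laplace(0, scale) noise as the difference of two exponential samples."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def dp_cell_counts(user_cells, all_cells, epsilon, rng):
    """Noisy count per cell; assumes one cell per user, hence sensitivity 1."""
    counts = Counter(user_cells)
    # Noise is added to every cell in the public domain, including empty ones,
    # so that the presence of a cell in the output reveals nothing.
    return {c: counts[c] + laplace_noise(1.0 / epsilon, rng) for c in all_cells}


rng = random.Random(3)
all_cells = ["downtown", "campus", "airport"]        # invented cell labels
user_cells = ["downtown"] * 1000 + ["campus"] * 200  # nobody at the airport
noisy_counts = dp_cell_counts(user_cells, all_cells, 1.0, rng)
print(noisy_counts)
```

This is an illustration rather than a production mechanism: practical deployments must also bound each user's contribution across cells and time, and floating-point noise sampling has known pitfalls.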

6.2 Challenges in Mobility Data Privacy

This section highlights open problems related to mobility data privacy that need consideration from the community.

Challenge 12. Threat Models and Privacy Definitions. The first challenge for mobility data privacy is the need to understand the threat models and adopt or define proper criteria by which to enforce privacy. We first need to define what needs to be protected (i.e., the sensitive information). This may vary for different mobile users and applications. It may be the exact location coordinates of a user at a given time (most existing efforts focus on this). It may also be the association of a user with a sensitive place, the co-location of two users (even when the users are willing to reveal their exact location coordinates), or the spatiotemporal activities of a user (e.g., a stay at a place, or a trajectory). When defining privacy models and designing subsequent privacy mechanisms, there will almost always be attacks that exploit side-channel information. While privacy notions such as DP typically assume the worst case, which also means sacrificed utility, relaxed versions may be needed given specific threat models to improve the privacy-utility trade-off.

Besides developing rigorous privacy-enhancing mechanisms, it is equally important to understand the privacy risks and the empirical defense power of privacy-enhancing technology (PET). While there has been some work on privacy attacks on aggregate mobility data [182], more work is needed to understand what sensitive information may be revealed and reconstructed from mobility data-based models, e.g., whether membership inference attacks or feature reconstruction attacks [81, 212] can be carried out and potentially build benchmark attacks that can be used to audit the privacy risk of mobility data science systems and privacy mechanisms.

Challenge 13. Privacy and Utility Trade-off and Other Factors. When designing privacy mechanisms for mobility data collection and analysis, it is important to consider the utility of the privacy-protected data for the downstream applications. For LBSs (as is typical in the local setting), the utility needs to be measured by the precision or accuracy of range queries for POI search, or of contact detection for contact tracing (instead of by how close the perturbed location is to the original location, on which most algorithms following GeoInd focus). Hybrid methods that combine DP and cryptographic techniques may be needed, especially for critical applications such as contact tracing and public health [56]. For aggregate data analytics and ML applications using mobility data (in both local and global settings), the utility needs to be measured by the accuracy of the statistics (e.g., frequency or density estimation, on which most existing work focuses), the trained model, or the fidelity of the synthetic data. As a result, the algorithms need to be designed to optimize the corresponding utility, and many of these designs remain open challenges. For example, existing methods for DP trajectory synthesis are mainly based on statistical models or low-order Markov models and perform well on some utility metrics [103, 241]. While there are more powerful generative adversarial network (GAN)–based models or diffusion models for generating more realistic synthetic trajectories [137, 275], ensuring formal DP for these models would result in deteriorated utility due to the complexity of the models. Designing methods for an optimal privacy-utility trade-off remains an open challenge.

In addition to the privacy and utility trade-off, privacy-enhancing technology may exacerbate bias in the data or learning algorithms. Mobility data may have inherent bias, as we discussed in Challenge 2. Data analysis algorithms may also have unfair performance for groups that are underrepresented in training data. It has been demonstrated that learning with DP could exacerbate such unfairness, i.e., underrepresented groups suffer from worse privacy/utility trade-offs [21]. Research is needed to understand this impact on mobility data and design privacy algorithms to optimize the privacy utility trade-off while ensuring fairness.

Challenge 14. Explainability and Societal Education. Another important challenge of mobility data privacy is to improve the explainability of privacy definitions and mechanisms and communicate them to the stakeholders, including mobile users (data contributors), mobility service providers, and data analysts. This is a general challenge for privacy-enhancing technology, but more so for mobility data given the complex semantics of location information and diverse applications, as we mentioned. DP-compliant algorithms and location privacy models (such as GeoInd) as described earlier use privacy parameters to control the trade-off between privacy guarantee and the utility of the private outputs. However, there is a significant gap between the theory and practice of DP: we lack principles and guidelines for choosing privacy parameters when collecting or processing mobility data using DP techniques in the real world. While the technology companies have employed DP in releasing the mobility datasets, as we discussed earlier, the choice of the privacy parameter and the associated noise and uncertainty are often not precisely specified or uniform across companies. This makes it difficult for the downstream applications to quantify the uncertainty of the analysis result.

The parameter \(\epsilon\) of DP is mathematically defined but not well aligned with the stakeholders’ interests. Even for the same \(\epsilon\), the privacy guarantees could differ depending on the variant of DP and the algorithm at hand. In addition, \(\epsilon\) is not always linked to a specific privacy risk for the users (such as “the probability that an attacker can correctly infer my data”) or a precise utility level for data analysts (such as “the accuracy of the DP-ML model”). To promote the adoption of mobility data privacy technology such as that based on DP, we should establish principles and design guidelines, and provide tools for explaining DP’s protection and limitations in terms of stakeholders’ practical interests. For example, we can help data contributors understand the privacy risk (such as membership inference attacks or reconstruction attacks) under different privacy parameters given a concrete DP algorithm. We can also design efficient methods to visualize how data analysts’ utility metrics (such as MSE or model accuracy) may change along with different privacy parameters for specific mobility applications.
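One way to connect the privacy parameter to a risk statement that a data contributor can read is sketched below. It assumes pure epsilon-DP and an attacker deciding between two neighboring datasets (for instance, whether a given user's record is present), in which case the attacker's posterior odds are at most e^epsilon times the prior odds. The function name is hypothetical, and the bound does not carry over unchanged to other DP variants or to GeoInd.

```python
import math


def max_posterior(prior, epsilon):
    """Upper bound on the attacker's posterior belief under pure epsilon-DP."""
    posterior_odds = prior / (1.0 - prior) * math.exp(epsilon)
    return posterior_odds / (1.0 + posterior_odds)


# Starting from an undecided attacker (prior 0.5), tabulate the bound.
for eps in (0.1, 1.0, 5.0):
    print(f"epsilon={eps}: posterior belief at most {max_posterior(0.5, eps):.3f}")
```

Reading the bound this way shows why small values of the parameter correspond to a nearly unchanged belief while large values offer little protection against this particular inference, although it says nothing about the utility side of the trade-off.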

7 Mobility Data Science Applications

Mobility data science used to be limited to the domain of transportation. However, recent technological inventions have created an abundance of mobility data, resulting in applications in many other domains of interest for society. Such applications leverage mobility data to understand, explain, and predict where moving entities such as humans, animals, or infectious diseases go, why they go where they go, and where they will go next. This section outlines broad applications of mobility data science to illustrate the recent landscape of mobility data science.

7.1 Traffic

Traffic is a problem of global scale, as recognized by transportation science over a decade ago. Drivers in the United States spend 6.9 billion hours stuck in traffic and waste more than 11 billion liters of fuel per year, according to INRIX [112]. Measured per capita, people in Russia and Thailand spend even more time in traffic, whereas Brazil, South Africa, the United Kingdom, and Germany are only slightly behind the United States. Leveraging mobility data science and understanding the underlying behavior of human participants concomitantly with different transportation modes can enable more effective solutions to multiple problems at the heart of improving traffic management. Two main lines of research focus on (1) traffic monitoring at an aggregate level, e.g., to help city administration; and (2) provision of services to road users. Existing work regarding traffic monitoring includes monitoring congestion [128], assessing the safety of roads and intersections [142], traffic prediction [131], evacuation routing [263], and optimizing public transportation schedules [191]. Efforts regarding the services provided to road users include routing queries that balance the traffic across roads [68], helping drivers to find the nearest facilities [120], personalized routing [129], eco-routing for minimizing greenhouse gas emissions [133], and enabling multi-modal trip planning [223]. However, many open opportunities and challenges remain in using mobility data to improve traffic conditions. One example is devising accurate models for the dynamic scheduling of public transportation. Another example is the context-aware optimization of traffic signals, e.g., incorporating the impact of additional flux of pedestrians in bus/train stations, to minimize the stop-and-go impacts for vehicles. A further challenge of using mobility data science in the transportation domain is the monitoring and reduction of emissions.
Being able to quantify emissions (e.g., from transportation) is essential to accountability and to the reduction of emissions. Using emissions data collected from in-situ sensors as well as data sensed remotely through Earth observation (satellite) will allow us to better understand the effects of e-mobility, better collective transportation, and infrastructure improvements.

7.2 Urban Areas

In 2018, 55% of the world’s population (4.2 billion people) resided in urban areas. This proportion is projected to increase to 68% by 2050 [230]. Urban areas are a focal point for mobility applications as they introduce a variety of mobility modalities such as electric vehicles [234] and bicycles and scooters with respective sharing programs [132]. By understanding how, where, and why people move in cities, outer suburban areas, and regional areas, the demand for infrastructure and energy can be better understood [270]. Improving this understanding helps reduce urban inequalities [165] such as unequal access to high-quality food [236] and healthcare [95]. Mobility data also helps improve urban safety by improving crime prediction [82] and helping to recommend safe routes [203].

A specific application of urban mobility data science is data-driven map construction [3] and the updating of existing maps to account for blocked or new road segments [48], which is paramount in autonomous driving applications [140].

The real-time monitoring of urban mobility could result in situational awareness, a term originally coined in defense applications. Situational awareness involves perception of the environmental states using the surrounding data, comprehension of the ingested data to understand the emerging situations, and projection of future states and/or events, which requires predictive analytics. Mobility data provides critical components and insights into situational awareness in cities. When achieved, this applies not only to enabling robust critical infrastructures in cities but also to protecting them from harm, e.g., forest fires, earthquakes, and terrorist attacks. Many researchers use mobility data as input to enable situational awareness in cities as well as in airports [208].

7.3 Health Informatics

The spread of infectious diseases is a highly complex spatiotemporal process that is strongly tied to human mobility [106] and human behavior [74]. Many recent works have used human mobility data for data-driven epidemic forecasting, as surveyed in [195]. A specific example of leveraging mobility data for public health is contact tracing, which refers to the process of tracking persons who may have come into spatial contact with an infected person, and subsequently collecting further information about these contacts [151]. The feature-rich interaction, processing, and localization/communication modalities of smartphones have placed these devices at the technological forefront of the battle against pandemics such as COVID-19 and have helped curb their fast spread. To date, the community has proposed a wide range of contact tracing approaches, including opportunistic [185] and participatory approaches [64] as well as privacy-sensitive [260], decentralized [226], proximity-based (e.g., Bluetooth Low Energy (BLE), sound) [187], and location-based approaches (e.g., Wi-Fi, GPS) [64]. However, a wide range of challenges remains open, including improving penetration and adoption rates, alleviating privacy or expectation skepticism [32], ensuring ubiquitous availability on low-end terminals, overcoming technological/psychological adoption barriers [31], achieving cross-country interoperability with standard formations beyond recommendations, verifying the scalability, reliability, and accuracy of the engaged spatial technologies, and drawing lessons about effectiveness from real large-scale deployments.
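
To make the location-based variant of contact tracing concrete, the following is a minimal sketch, not any of the cited systems. It assumes each observation is a (location, timestamp) pair with GPS coordinates, and it flags a person as a contact when one of their fixes lies within a distance and time window of a fix of the infected person. The names `Fix` and `contacts` and the default thresholds (10 m, 300 s) are hypothetical choices for illustration only.

```python
from dataclasses import dataclass
from math import asin, cos, radians, sin, sqrt


@dataclass(frozen=True)
class Fix:
    """One (location, timestamp) observation of a person (hypothetical schema)."""
    person: str
    lat: float
    lon: float
    t: float  # seconds since an arbitrary epoch


def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in metres between two latitude/longitude points."""
    earth_radius_m = 6_371_000.0
    dlat = radians(lat2 - lat1)
    dlon = radians(lon2 - lon1)
    a = sin(dlat / 2) ** 2 + cos(radians(lat1)) * cos(radians(lat2)) * sin(dlon / 2) ** 2
    return 2 * earth_radius_m * asin(sqrt(a))


def contacts(fixes, infected, max_dist_m=10.0, max_dt_s=300.0):
    """Return the people with a fix close in both space and time to a fix of `infected`.

    The thresholds are illustrative assumptions, not epidemiological guidance.
    """
    source = [f for f in fixes if f.person == infected]
    exposed = set()
    for f in fixes:
        if f.person == infected or f.person in exposed:
            continue
        for s in source:
            close_in_time = abs(f.t - s.t) <= max_dt_s
            if close_in_time and haversine_m(f.lat, f.lon, s.lat, s.lon) <= max_dist_m:
                exposed.add(f.person)
                break
    return exposed
```

The pairwise comparison is quadratic in the number of fixes, and raw GPS error can exceed the distance threshold, which hints at the scalability and accuracy-verification challenges mentioned in the text.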

Another specific health application for mobility data is health monitoring of older adults. GPS-enabled smartwatch technology can be used to monitor the movement of older-adult users [215]. In particular, if the monitored user is showing early signs of dementia, the user’s trajectories could show an abrupt change from the individual’s movement history [222]. For instance, a user who normally walks in a park and then goes to a restaurant is found to only stay in the park for a substantial amount of time. Indoor sensors installed in the room can also be used to track whether an older adult or a patient falls from the bed. Trajectory outlier analysis methods, together with gerontology knowledge, can be very useful for this kind of application.
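
The park example can be illustrated with a deliberately simple outlier test. This is a sketch under stated assumptions, not the method of the cited works: it assumes the stay duration at a place has already been extracted from the trajectory for each day, and it substitutes a plain z-score against the individual's own history for a full trajectory outlier analysis. The function name and the threshold of 3 are hypothetical.

```python
from statistics import mean, stdev


def is_unusual_stay(history_min, today_min, z_threshold=3.0):
    """Flag today's stay duration (minutes) if it deviates strongly from the user's history.

    history_min: past daily stay durations at one place, in minutes.
    Returns False when there is too little history to judge.
    """
    if len(history_min) < 2:
        return False
    mu = mean(history_min)
    sigma = stdev(history_min)
    if sigma == 0:
        # No variation in the history: any different value is a deviation.
        return today_min != mu
    return abs(today_min - mu) / sigma > z_threshold
```

A flag raised by such a test is only a candidate for review; as the text notes, interpreting it requires gerontology knowledge.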

7.4 Indoor Environments

Indoor mobility data management has been described as a new frontier in data management [114]. However, in addition to data management, large-scale indoor localization data also raises challenges in data collection, data analysis, and data privacy. Indoor data collection is an open research problem due to the non-existence of the indoor equivalent of GPS: a system that can provide the user location in any building worldwide. This is particularly important in applications related to emergency management and infectious disease contact tracing. Systems have been developed over the years to address this problem based on different data sources, including WiFi signal strength and time of arrival [255], cellular signal [194], ultra-wideband [9], ultrasonic [110], magnetic tracking [213], and inertial sensors [102], among others. These novel data sources enable new applications in indoor navigation, contact tracing, indoor analytics, and evacuation management.

Indoor data analytics improves the understanding of indoor behavior, which has multiple benefits and applications, including crowd management [4], retail and POI recommendation systems [189], and optimizing energy use and improving sustainability in the long term [200]. For example, by utilizing WiFi logs, Ren et al. [188] find strong correlations between behaviors and user demography (e.g., age, gender, and visitor types), indicating that indoor mobility behavior, in conjunction with online behavior, can be used to predict the underlying demography of the visitors.

Occupancy behaviors are also highly linked with building management systems and controls [45]. By having a more accurate energy use estimation using indoor spatial and mobility data, in addition to historical energy consumption data, the performance of the buildings can be better optimized, towards achieving more sustainable operations [71]. The responsible use of mobility behavior analytics, including indoor and outdoor mobility behaviors, strongly points to the increased capacity for improving sustainable operations of buildings [200], enabling net zero goals to be achieved.

7.5 Marine Transportation

According to UNCTAD, over 80% of the volume of international trade in goods is carried by sea, and the percentage is even higher for most developing countries [229]. Estimates indicate that global shipping activity accounted for 3% of global emissions in 2022 [109]. These significant numbers, as well as the availability of large-scale ship trajectory data obtained from the automatic identification system (AIS) [18], have motivated many research efforts on mobility data analysis for maritime transportation. The stakeholders who seek the benefit of such analyses include the maritime authorities, environment officers, ship owners, port and canal managers, and the transport and logistic sectors.

One major challenge is to ensure safety at sea, which breaks down into the technical challenges of identifying positional anomalies [193], locating dark vessels (vessels that switch off their AIS devices) [147], and cleaning location and identity spoofing [73]. Additionally, an essential aspect is the detection of fishing activities to ensure sustainable fishing practices [58]. Since vessels do not have fixed routes in the sea, research has also investigated the density of ship routes [248].
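
As a first step towards locating dark vessels, one can look for unusually long silences in a vessel's AIS reports. The sketch below is an assumption-laden illustration rather than the approach of [147]: it takes the report timestamps of a single vessel in seconds and returns the intervals between consecutive reports that exceed a threshold. The function name and the one-hour default are hypothetical.

```python
def reporting_gaps(timestamps, max_gap_s=3600.0):
    """Return (last_seen, next_seen) pairs where consecutive AIS reports
    of one vessel are more than max_gap_s seconds apart.

    The one-hour default is an illustrative assumption.
    """
    ordered = sorted(timestamps)
    return [(a, b) for a, b in zip(ordered, ordered[1:]) if b - a > max_gap_s]
```

A gap only marks a candidate: silence may also have benign causes, such as the vessel being out of receiver coverage, so further evidence is needed before concluding that a device was switched off.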

Multi-criteria routing using multiple optimization criteria, including estimated time of arrival, fuel consumption, safety, and comfort, has been increasingly recognized as an important path planning problem [104]. An optimization of ship routes could effectively lead to significant reductions of greenhouse gas emissions and contribute to the actions against anthropogenic global warming. The influence of ocean currents, waves, and wind on the course and speed of ships has been known for centuries. Used optimally, ocean currents lead to more efficient paths between two given ports. Ship route computation approaches that exploit the potentials of wind, wave, and weather models aiming at minimizing fuel consumption have been addressed by the marine science, maritime engineering, and transportation communities [77].
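
One simple way to combine several routing criteria is to collapse them into a single weighted cost and run a standard shortest-path search. The sketch below uses this weighted-sum scalarization with Dijkstra's algorithm; it is an illustration under assumptions, not the method of [104], and it does not enumerate Pareto-optimal routes. The graph layout, the criterion names, and the function name are hypothetical.

```python
import heapq


def weighted_sum_route(graph, source, target, weights):
    """Shortest path under a weighted sum of per-edge criteria.

    graph:   {node: [(neighbor, {criterion: cost}), ...]} with non-negative costs
    weights: {criterion: non-negative weight}
    Returns (path, total_cost), or (None, inf) if target is unreachable.
    """
    dist = {source: 0.0}
    prev = {}
    heap = [(0.0, source)]
    while heap:
        d, u = heapq.heappop(heap)
        if u == target:
            break
        if d > dist.get(u, float("inf")):
            continue  # stale heap entry
        for v, costs in graph.get(u, []):
            nd = d + sum(w * costs[c] for c, w in weights.items())
            if nd < dist.get(v, float("inf")):
                dist[v] = nd
                prev[v] = u
                heapq.heappush(heap, (nd, v))
    if target not in dist:
        return None, float("inf")
    path = [target]
    while path[-1] != source:
        path.append(prev[path[-1]])
    return path[::-1], dist[target]
```

Changing the weights shifts the route between, for example, the fastest and the most fuel-efficient option. Edge costs here are static; accounting for currents, waves, and wind would make them depend on departure time and weather forecasts.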

Since green mobility is currently gaining a huge amount of attention, carbon dioxide emission–aware ship routing is expected to make an enormous impact on the economy, politics, and society and provides very promising opportunities for the spatial and spatiotemporal database and mobility communities. Marine transportation becomes particularly important in the scope of climate change (e.g., the advent of hydrogen/battery/fossil/atom hybrid vessels) as well as digitization for new infrastructure-free localization technologies on-board.

7.6 Social Connections

Location-based social networks (LBSNs) bridge the gap between the physical world and online social networking services [269]. LBSN data capture both human mobility (in the form of check-ins to discrete points of interest) and a social network between individual humans. Combining mobility data and social networks, LBSN data finds many applications. A first application found in the literature was on modeling and describing human mobility patterns (e.g., [55, 167]), analyzing these patterns (e.g., [54]), and explaining why individual users choose locations and how social ties affect this choice (e.g., [237]). Another application is that of location recommendation, which leverages check-ins of users and their ratings in the user-location network to recommend new locations to users [26]. A closely related application area is location prediction (e.g., [53]), which predicts the future check-ins of users. Another active research field in LBSN analysis is friend recommendation or social link prediction (e.g., [201]), which suggests new friends to users based on similar interests at similar locations while also having similar social connections. Other research topics concerning LBSNs include efficient query processing (e.g., [17]), finding user communities (e.g., [257]), and estimating the social influence of users (e.g., [246]).
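
The idea behind friend recommendation can be sketched with check-in overlap alone. The example below is a simplified illustration, not the approach of [201]: it assumes each user is represented by the set of points of interest they have checked in to, scores other users by Jaccard similarity of those sets, and skips existing friends. It ignores the social-connection signal that the text also mentions. All names are hypothetical.

```python
def jaccard(a, b):
    """Jaccard similarity of two collections, treated as sets."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def suggest_friends(checkins, user, friends, top_k=3):
    """Suggest up to top_k non-friends whose visited places overlap most with `user`.

    checkins: {user: set of visited points of interest}
    friends:  {user: set of existing friends}
    """
    already = friends.get(user, set())
    scored = [
        (jaccard(checkins[user], pois), other)
        for other, pois in checkins.items()
        if other != user and other not in already
    ]
    # Keep only users with some overlap; break ties by name for determinism.
    scored = [(score, other) for score, other in scored if score > 0]
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [other for _, other in scored[:top_k]]
```

A fuller model would also weight candidates by shared social connections, for example the number of common friends.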

This plethora of applications and research shows how mobility data in connection with social network data can be used to understand the social fabric that ties us together. A potential future application is using human mobility data to reinforce this social fabric by recommending social events and meetings to groups of people to help them find new friends, collaborators, sports mates, teachers, mentors, and family members.

8 Conclusions

This article presented the current state of the mobility data science pipeline in addressing the specific challenges of mobility data. A main question that this article answered is how mobility data science is different from data science. The space and time dimensions in mobility data call for different methods of data acquisition, management, analysis, and privacy preservation that are not addressed by the common data science tools. Accordingly, we surveyed the main problems that are currently being researched, identified major research questions for the coming years, and described applications that lead to broader impacts of mobility data science. Co-authored by a diverse group of academics and industry professionals, this article also represents a community effort to sketch the boundary of mobility data science as an interdisciplinary field and to bring together a dedicated research community around the identified research challenges.

Acknowledgements

Mohamed F. Mokbel acknowledges the support of the National Science Foundation under grant nos. IIS-1907855 and IIS-2203553. Mahmoud Sakr acknowledges the support of the EU’s Horizon Europe research and innovation program under grant agreement nos. 101070279 (MobiSpaces) and 101093051 (EMERALDS). Li Xiong acknowledges the support of the National Science Foundation under grant nos. CNS-2125530 and CNS-2041952. Andreas Züfle and Taylor Anderson acknowledge the support of the National Science Foundation under grant no. DEB-2109647. Walid G. Aref acknowledges the support of the National Science Foundation under grant no. IIS-1910216. Gennady and Natalia Andrienko acknowledge the support of the Federal Ministry of Education and Research of Germany and the state of North Rhine-Westphalia as part of the Lamarr Institute for Machine Learning and Artificial Intelligence (Lamarr22B), and of the EU in projects SoBigData++ and CrexData (grant agreement no. 101092749). Reynold Cheng acknowledges the support of the Hong Kong Jockey Club Charities Trust (Project No. 260920140), the University of Hong Kong (Project No. 109000579), and the HKU Outstanding Research Student Supervisor Award 2022-23. Panos K. Chrysanthis acknowledges the support of the National Science Foundation under grant no. SES-2017614 and of the National Institutes of Health under grant no. R01HL159805. Anita Graser acknowledges the support of the EU’s Horizon Europe research and innovation program under grant agreement nos. 101070279 (MobiSpaces) and 101093051 (EMERALDS). Matthias Renz acknowledges the support of the German Research Foundation under grant nos. 290391021 and 491008639, the Helmholtz School for Marine Data Science (MarDATA) partially funded by the Helmholtz Association (grant no. HIDSS-0005) and the Federal Ministry for Economic Affairs and Climate Action (BMWi) under grant no. 68GX21002E.
Flora Salim acknowledges the support of the Australian Research Council (ARC) Centre of Excellence for Automated Decision-Making and Society (ADM+S) (grant no. CE200100005). Maxime Schoemans acknowledges the support of the Fund for Scientific Research (FNRS) under grant no. 40018132. Yannis Theodoridis acknowledges the support of the EU’s Horizon Europe research and innovation program under grant agreement nos. 101070279 (MobiSpaces) and 101093051 (EMERALDS). Song Wu acknowledges the support of the EU’s Horizon 2020 research and innovation program under the Marie Skłodowska-Curie grant agreement no. 955895 (DEDS). Jianqiu Xu acknowledges the support of the National Science Foundation under grant no. U23A20296.

Supplementary Material

3652158-vor (3652158-vor.pdf)

Version of Record for "Mobility Data Science: Perspectives and Challenges" by Mokbel et al., ACM Transactions on Spatial Algorithms and Systems, Volume 10, No. 2 (TSAS 10:2).


References

[1] ACM SIGSPATIAL CUP 2017. n.d. ACM SIGSPATIAL CUP 2017. Retrieved from http://sigspatial2017.sigspatial.org/giscup2017/download.
