Learning Cellular Network Connection Quality with Conformal
Abstract.
In this paper, we address the problem of uncertainty quantification for cellular network speed. It is a well-known fact that the actual internet speed experienced by a mobile phone can fluctuate significantly, even when remaining in a single location. This high degree of variability underscores that mere point estimation of network speed is insufficient. Rather, it is advantageous to establish a prediction interval that can encompass the expected range of speed variations. In order to build an accurate network estimation map, numerous mobile data need to be collected at different locations. Currently, public datasets rely on users to upload data through apps. Although massive data has been collected, the datasets suffer from significant noise due to the nature of cellular networks and various other factors. Additionally, the uneven distribution of population density affects the spatial consistency of data collection, leading to substantial uncertainty in the network quality maps derived from this data. We focus our analysis on large-scale internet-quality datasets provided by Ookla to construct an estimated map of connection quality. To improve the reliability of this map, we introduce a novel conformal prediction technique to build an uncertainty map. We identify regions with heightened uncertainty to prioritize targeted, manual data collection. In addition, the uncertainty map quantifies how reliable the prediction is in different areas. Our method also leads to a sampling strategy that guides researchers to selectively gather high-quality data that best complement the current dataset to improve the overall accuracy of the prediction model.
1. Introduction
Since the advent of cellular-network-based wireless internet over three decades ago, mobile internet has become a cornerstone of daily life, facilitating a wide range of activities in sectors such as communication, entertainment, commerce, and education. While urban centers typically enjoy strong mobile internet connectivity due to well-developed infrastructure, rural areas frequently suffer from inadequate connectivity (Vogels, 2021). This discrepancy primarily stems from an insufficient number of cell towers and infrastructural challenges, which are influenced by factors like geographic spread, lower population density, and diminished financial incentives for investing in these areas. Accurately pinpointing these locations can significantly improve network planning (Niu et al., 2021).
Identifying regions that are poorly served or entirely unserved is crucial for telecommunication operators and policymakers (Mangla et al., 2021). By focusing on these areas, service providers can deliberately improve infrastructure to enhance both coverage and the quality of service. On a larger scale, detailed connectivity maps can guide policy decisions, promoting digital inclusion and helping to close the gap between urban and rural communities. Enhancing mobile internet access in rural areas also offers considerable socio-economic benefits, fostering regional development and inclusivity. Therefore, addressing the challenges of mobile internet access in under-connected regions is essential for fostering digital equity and propelling societal advancement. In the United States, the Broadband Equity, Access, and Deployment (BEAD) program is set to direct significant funding towards expansion in the coming years, influenced partly by data on spatial quality (National Telecommunications and Information Administration, 2022).
However, current mobile signal estimation work (Zhang et al., 2020) often utilizes propagation models to assess connection quality. The cell tower locations are not shared by Internet service providers, and the public datasets are contributed mostly by volunteers. This makes the dataset noisy, inaccurate, and not up-to-date. Furthermore, the propagation models are limited in accuracy. In reality, the connection quality can be quite different either because of incomplete data or other contributing factors.
Another popular approach to estimating quality is to use data-motivated methods (Jiang et al., 2023) like kernel regression or Kriging. These methods are able to reach higher accuracy but still suffer from noisy and incomplete datasets. The uneven distribution of the population results in insufficient data in rural and undeveloped areas, which are targets for investments. If we can identify those areas that require more data, we could manually collect high-quality data in those places and improve predictions.
In this work, we propose a novel method called Ensemble Spatial Conformal Prediction (ESCP) to identify those areas where the prediction models have higher uncertainty. The reason for a higher uncertainty in an area can be an inherent higher variation in a cellular network, lack of data, failure of the prediction model, and so on. The uncertainty quantification can identify these areas for further improvement. Our work is based on an in-depth analysis using one of the largest open datasets, namely the Ookla dataset (Ookla, 2022), which captures mobile internet connectivity measurements worldwide. Users conduct speed tests on their internet connections using Ookla’s tools. When a user initiates a speed test, data such as the user’s IP address, the Internet Service Provider’s (ISP) identity, and the connection’s speed (both download and upload measured connection bandwidth) are logged. However, like many real-world datasets, it presents certain challenges. It is characterized by a relatively high noise and a conspicuous absence of measurements in specific areas.
Conformal prediction (CP) is a widely recognized distribution-free method used to quantify uncertainty in contemporary machine learning contexts (Volkhonskiy et al., 2017). This technique enhances conventional predictive algorithms by not only generating point estimates but also providing uncertainty intervals. These intervals are designed to encapsulate the unobserved ground truth with a high probability specified by the user. Consequently, CP has been effectively employed across various domains, including anomaly detection (Xu and Xie, 2021a), classification (Angelopoulos et al., 2021; Xu and Xie, 2022), and regression (Barber et al., 2021a), among others.
Fundamentally, CP operates as a flexible wrapper around any predictive model, typically described as a black-box function . It integrates three key elements: the model , an input feature , and a potential outcome . The core of CP involves computing a ”non-conformity” score, which assesses the degree to which the potential outcome deviates from typical model predictions. This score is crucial for determining how well the potential outcome aligns with established patterns recognized by the model. Through this approach, CP provides a robust framework for uncertainty quantification in predictive modeling, enhancing the reliability and interpretability of machine learning applications.
For the prediction model , we design a self-tuning bandwidth kernel regression method. The method is designed to deal with the spatial imbalance of data. In contrast to conventional models, the approach is designed to adaptively determine the kernel regression bandwidth for a location.
In this study, we aim to create a prediction region map of mobile internet quality that covers an entire state, not just urban areas seen in prior studies. Some works, including (Adarsh et al., 2021), realize the importance of this field, but they focus more on the data collection and analysis part instead of developing methods. The lack of data from rural areas makes this a tough challenge, setting our work apart from existing approaches.
The rest of the paper is organized as follows. We first give a detailed description of the datasets used in the paper. After that, we introduce the methods used in this paper. Finally, we present the numerical experiments on the two different datasets to show the performance of our proposed method.
2. Data Description
The dataset utilized in our study is sourced from the open datasets provided by Ookla. The data gathered by Ookla from 2019 to 2022 encapsulates the performance metrics of mobile internet connections for a multitude of users worldwide. Key variables in this dataset include geographical coordinates (longitude and latitude), mean download speed (MB/s), mean upload speed (MB/s), count of tests conducted in each area (aggregated for user privacy into grid blocks), the number of distinct devices utilized for testing, and a comprehensive score assessing the connection speed. The public Ookla datasets do not provide information on the cellular provider, hence our estimation is of coverage quality aggregated over all providers present in the measurements. The availability of data labeled by provider would be of great interest to the research community.
The following provides a brief description of some of these variables:
-
•
Location: Denotes the location of the center of a grid block where the users measured the connection speed.
-
•
Download speed: Represents the measured rate of data transfer from a server to the user’s device. This is a critical metric as it affects the speed of web page loading and file downloading, and hence is quite noticeable to the user. Download speeds are reported as an average in the grid location.
-
•
Upload speed: Corresponds to the measured data transfer rate from the user’s device to the server. This metric is crucial for tasks like video calls, cloud file uploads, and user-generated live streaming. Upload speeds are reported as an average in the grid location.
-
•
Number of tests at each location: This refers to the number of times tests were conducted in a given grid block. More tests tend to indicate a more reliable measurement score. However, in around of locations, only a single test is conducted.
-
•
Number of devices used for testing: This refers to the number of distinct devices used for testing in a certain grid block, as the performance can vary across different mobile phones and mobile providers.
-
•
Score: This variable combines both the download and upload speeds to provide a holistic view of the connection bandwidth performance.
We have mobile connection data from the states of Georgia in the United States Southeast and New Mexico in the United States Southwest. The geographic scatter plot of the connection scores for each dataset is shown in Figure 1. There are 28,587 data points in the Georgia dataset. The frequency of scores is shown in Figure 2. There are 7,579 data points in the New Mexico dataset and the frequency of scores is shown in Figure 2. The dataset does not include the providers of the devices, which means that this user-collected dataset is a mixture of different internet providers.
Meanwhile, we have removed approximately of outliers from the datasets. These outliers are identified as points that deviate significantly from the mean of their -nearest neighbors. Specifically, a point is considered an outlier if the difference between its score and the neighborhood mean exceeds three times the standard deviation of the scores within that neighborhood. This method of identifying outliers is a common practice in data analysis.
3. Methodology
Our approach consists of two distinct phases. In the initial phase, we construct a map estimating the score using the kernel regression method. In the subsequent phase, we introduce a novel ensembled spatial conformal prediction method that generates an uncertainty map throughout the region.
3.1. Self-tuning Bandwidth Kernel Regression
Kernel regression is a non-parametric technique to estimate the conditional expectation of a random variable. The objective is to find a non-linear relationship between the input variable and the corresponding output. Kernel regression employs kernel functions, which can capture more complex patterns than linear regression.
For a given dataset with inputs and corresponding outputs , the kernel regression estimate at a new input point is given by:
(1) |
Here, is a kernel function, and is a bandwidth parameter. The kernel function is often chosen to be a Gaussian kernel, though other choices are also possible. The bandwidth parameter controls the width of the kernel, and hence the smoothness of the estimated function.
In the Gaussian case, the kernel function is defined as:
(2) |
Kernel regression estimates the conditional mean function without imposing a parametric form for the functional relationship between predictors and the outcome variable. It allows for flexible, data-driven model specification.
The bandwidth is a crucial parameter in kernel regression. If is too small, the estimate will be very rough, capturing too much noise in the data (overfitting). If is too large, the estimate will be too smooth, not capturing important patterns in the data (underfitting). The choice of an appropriate often involves cross-validation or some other form of bandwidth selection strategy.
Kernel regression, by its standard definition, utilizes a constant bandwidth that is applied uniformly across all data points. However, this one-size-fits-all approach might not be the most suitable in scenarios where the data exhibits spatial imbalances, as is the case in our study.
In regions where data points are sparse, the lack of neighboring points could potentially lead to unrepresentative averages and, consequently, inaccurate predictions. To overcome this, we propose the use of a larger bandwidth in such areas, thereby encompassing more points for computation and increasing the representativeness of the estimates.
To this end, we propose the self-tuning bandwidth in the kernel regression framework (STBKR). The self-tuning bandwidth mechanism adapts the bandwidth for each point based on its surrounding density of points. The formula for our self-tuning bandwidth kernel regression can be expressed as follows:
(3) |
In this formula, we let . Here, is a parameter that is determined by cross-validation, and denotes the Euclidean distance from a given data point to its -th nearest neighbor. Notably, in areas where data is sparsely distributed, the distance will be larger, thereby leading to an increased bandwidth .
This strategy of self-tuning bandwidth effectively addresses the issue of spatial imbalance in the dataset. By allowing the bandwidth to adapt based on the local data density, it ensures a more representative sampling of neighbors, leading to more accurate and reliable predictions. Furthermore, the choice of through cross-validation aids in avoiding overfitting or underfitting, further strengthening the robustness of our regression model.
3.2. Conformal Prediction
To begin with, we introduce the basic idea of conformal prediction. Denote the kernel regression model introduced in the previous section as our prediction model . The split conformal prediction (Vovk et al., 2005) uses a hold-out set to learn the distribution of the non-conformity score. Assume the hold-out set consists of data , the residuals of the prediction model would be .
The prediction interval with confidence at the new point would be .
In other word, it is saying that the residual at the new point should fall in the interval .
The method guarantees a distribution-free predictive coverage at the target level , which means that when the data are independently and identically distributed (i.i.d.), the marginal coverage of prediction region in split conformal prediction satisfies . In other words, the prediction interval contains the target point with probability at least . The result holds for any sample size , which is called the finite-sample guarantee.
3.3. Ensemble Spatial Conformal Prediction
The classic setting of conformal prediction is the i.i.d. or exchangeable setting, which fails to work well when the data has correlations. Moreover, few works consider spatial data that is common in practice. Here we propose a novel ensembled conformal prediction method that is designed for a spatial setting.
In Algorithm 1, the prediction model we use is the self-tuning bandwidth kernel regression method in Section 3.1. We enhance this method by using an ensemble approach, which significantly boosts both accuracy and robustness. The enhancement is also effective in handling spatial data. To this end, for each data point, we identify its -nearest neighbors to define a local area, which reflects the spatial dependencies relevant to that point.
For training the model, we employ the bootstrap technique, randomly sampling several points with replacement from this localized dataset. These sampled points serve as the training data for fitting the predictive model. This process is not just a one-off but is iterated multiple times, yielding a diverse collection of predictors. Each iteration contributes unique perspectives to the ensemble, enhancing the model’s overall predictive power and stability. Based on these predictors, we use an “leave-one-out” estimator that is related to the Jackknife+ procedure (Barber et al., 2021b) to predict the signal quality. The procedure avoids splitting the dataset to fit the prediction model, maintaining the statistical efficiency while still being computationally efficient.
Once we compute the non-conformity scores (essentially, the residuals) for each point within the local neighborhood, we proceed to fit a quantile regression model . Recent studies (Romano et al., 2019; Xu and Xie, 2023) have found that replacing empirical quantile with quantile regression can capture the dependency between residuals and often leads to tighter confidence regions. There is no restriction on the quantile regression method used in Algorithm 1. In general, quantile regression algorithms rely on minimizing the pinball loss for specific regression algorithms. However, we typically want the method to be computationally efficient and require few tunings of hyperparameters. In the experiment, we chose the Quantile Random Forest (QRF) (Meinshausen and Ridgeway, 2006) that is fast and easy to use.
In quantile regression, the model learns the conditional distribution of , where is the response and is the feature. In our setting, we train the QRF on the residuals in the neighborhood to leverage the dependency. To be precise, for a data point located at , we find its -nearest neighbor set , where the set is sorted from geographically closest to furthest. Based on the prediction model , we are able to compute the non-conformity scores , and we denote the score at point as . Then we let
(4) |
For each point in the training set, we are able to construct such a neighborhood and compute corresponding non-conformity scores. Then we can train QRF a large neighborhood of in the training set. When re-fitting QRF at each prediction point, we are capturing the local spatial dependency between residuals.
4. Numerical Experiment
We initiate our numerical experiment by partitioning the data into training and test sets with a split of and respectively. To optimize the parameters and in the self-tuning bandwith kernel regression, we implement a 5-fold cross-validation on the training set. We select the combination that minimizes the prediction error. The choice is and .
For the ensemble spatial conformal prediction, we select the bootstrap number , and the batch size . A larger choice of and will lead to high accuracy and robustness. In our experiment, the current choice is able to reach promising results while reducing computational costs to a large extent. We set to be , which means that the intervals should have at least probability of coverage. This significance level produces a tighter region, whereas a higher significance level ensures greater coverage, though it may lead to a conservatively larger prediction interval.
Although neither Georgia nor New Mexico is rectangular in shape, we have included additional data to extend their areas to a rectangular format for the ease of displaying results.
In the experiment, we use our ensemble spatial conformal prediction method on the Georgia dataset and New Mexico dataset. We compare the performance of our method with two baseline methods. The first one is the standard split conformal prediction. The second one is EnbPI (Xu and Xie, 2021b), which is also a conformal prediction method that utilizes ensemble predictors. We split the dataset into
We compare the coverage rate on the test set and average width of prediction intervals on the test data. The results are presented in Table 1 and 2. From the table we can see that all the three methods achieve valid coverage () on the test data. Our proposed ESCP method generates a tighter prediction region compared to the other two methods.
From Figure 3 and 4, we observe that our method identifies several small areas exhibiting high uncertainty. High uncertainty indicates that the prediction model lacks confidence in its estimates, meaning the actual values could significantly deviate from the predictions. It is apparent that uncertainty peaks in urban regions where extensive measurements are recorded. This heightened uncertainty in densely populated areas arises from factors such as network congestion, signal obstructions from buildings, and interference from other electronic signals. Additional data collection may not significantly enhance estimation accuracy in these areas due to the inherently high variability of internet speeds.
On the other hand, certain rural areas also display elevated uncertainty due to limited data availability. In contrast, rural regions typically experience less interference and network congestion, contributing to more stable internet speeds. For instance, in the New Mexico dataset shown in Figure 1, numerous data points are collected along highways in rural areas. Interestingly, our uncertainty quantification map indicates low uncertainty along these highways, suggesting more stable internet conditions there. However, other rural regions with scant data collection do show higher uncertainty. This observation underscores the need for targeted data collection in these rural areas, as acquiring more data could significantly refine our internet speed estimations, given the inherently lower variability in these regions.
split CP | EnbPI | ESCP | |
Coverage | 83.4 | 82.4 | 80.8 |
Width | 1501 | 1677 | 1180 |
split CP | EnbPI | ESCP | |
Coverage | 84.6 | 82.9 | 82.6 |
Width | 780 | 839 | 701 |
5. Discussion
This paper explores the challenge of quantifying uncertainty in cellular network speeds across diverse geographic regions. We have demonstrated that the variability of internet speeds significantly complicates accurate speed prediction due to factors such as high user density, structural barriers, and interference from other signals. We propose a novel Ensemble Spatial Conformal Prediction (ESCP) method that not only complements the prediction but also enables more informed decisions about where to focus data collection efforts.
We must emphasize that the conformal prediction method we utilize is based on the self-tuning bandwidth kernel regression model. The precision of this model greatly influences the prediction interval’s width, or its tightness. Accurately estimating mobile internet speeds is particularly challenging due to their high variability. There is limited research focused on creating comprehensive coverage maps across entire states, and traditional models often struggle with noisy and incomplete datasets.
Using the Ookla dataset, we have generated uncertainty maps for Georgia and New Mexico. These maps reveal greater uncertainty in densely populated areas and in rural regions lacking sufficient data. This outcome confirms our method’s capability to identify areas of uncertainty. Furthermore, our tests on both datasets demonstrate that our approach achieves valid coverage and produces tighter prediction intervals than those generated by baseline methods.
By identifying areas where the prediction model is uncertain, we can guide researchers in manually collecting high-quality data. Gathering more data in underrepresented areas will enhance both the accuracy of our estimations and the tightness of our prediction intervals. Given that our method has empirically shown to provide valid coverage, actively collecting data in these areas will undoubtedly refine the prediction intervals and guarantee accuracy. This active sampling strategy significantly boosts the quality of the data collected.
Acknowledgements.
This work was partially supported by National Science Foundation (NSF) CNS-2220387.Appendix A Ethics
Cellular measurement data has the potential to reveal an individual’s location or movement over time. The data used in our work helps protect against this with spatial aggregation into regions and time aggregation over multiple years of data collection.
References
- (1)
- Adarsh et al. (2021) Vivek Adarsh, Michael Nekrasov, Udit Paul, Tarun Mangla, Arpit Gupta, Morgan Vigil-Hayes, Ellen Zegura, and Elizabeth Belding. 2021. Coverage is not binary: Quantifying mobile broadband quality in urban, rural, and tribal contexts. In 2021 International Conference on Computer Communications and Networks (ICCCN). IEEE, 1–9.
- Angelopoulos et al. (2021) Anastasios Nikolas Angelopoulos, Stephen Bates, Michael Jordan, and Jitendra Malik. 2021. Uncertainty Sets for Image Classifiers using Conformal Prediction. In International Conference on Learning Representations. https://openreview.net/forum?id=eNdiU_DbM9
- Barber et al. (2021a) Rina Foygel Barber, Emmanuel J. Candès, Aaditya Ramdas, and Ryan J. Tibshirani. 2021a. Predictive inference with the jackknife+. The Annals of Statistics 49, 1 (2021), 486 – 507. https://doi.org/10.1214/20-AOS1965
- Barber et al. (2021b) Rina Foygel Barber, Emmanuel J Candes, Aaditya Ramdas, and Ryan J Tibshirani. 2021b. Predictive inference with the jackknife+. (2021).
- Jiang et al. (2023) Hanyang Jiang, Henry Shaowu Yuchi, Elizabeth Belding, Ellen Zegura, and Yao Xie. 2023. Mobile Internet Quality Estimation using Self-Tuning Kernel Regression. arXiv preprint arXiv:2311.05641 (2023).
- Mangla et al. (2021) Tarun Mangla, Esther Showalter, Vivek Adarsh, Kipp Jones, Morgan Vigil-Hayes, Elizabeth Belding, and Ellen Zegura. 2021. A Tale of Three Datasets: Towards Characterizing Mobile Broadband Access in the United States. arXiv preprint arXiv:2102.07288 (2021).
- Meinshausen and Ridgeway (2006) Nicolai Meinshausen and Greg Ridgeway. 2006. Quantile regression forests. Journal of machine learning research 7, 6 (2006).
- National Telecommunications and Information Administration (2022) National Telecommunications and Information Administration. 2022. Broadband Equity, Access and Deployment (BEAD) Program. https://broadbandusa.ntia.doc.gov/funding-programs/broadband-equity-access-and-deployment-bead-program-0 Accessed: 2023-06-02.
- Niu et al. (2021) Zhisheng Niu, Sheng Zhou, and Noel Crespi. 2021. Greening 6G: New Horizons. Shaping Future 6G Networks: Needs, Impacts, and Technologies (2021), 39–53.
- Ookla (2022) Ookla. 2022. Global Broadband Performance Metrics: Q1. http://www.speedtest.net/global-index. Accessed:2022-09-20.
- Romano et al. (2019) Yaniv Romano, Evan Patterson, and Emmanuel Candes. 2019. Conformalized quantile regression. Advances in neural information processing systems 32 (2019).
- Vogels (2021) Emily A Vogels. 2021. Some digital divides persist between rural, urban and suburban America. (2021).
- Volkhonskiy et al. (2017) Denis Volkhonskiy, Evgeny Burnaev, Ilia Nouretdinov, Alexander Gammerman, and Vladimir Vovk. 2017. Inductive conformal martingales for change-point detection. In Conformal and Probabilistic Prediction and Applications. PMLR, 132–153.
- Vovk et al. (2005) Vladimir Vovk, Alexander Gammerman, and Glenn Shafer. 2005. Algorithmic learning in a random world. Vol. 29. Springer.
- Xu and Xie (2021a) Chen Xu and Yao Xie. 2021a. Conformal anomaly detection on spatio-temporal observations with missing data. arXiv preprint arXiv:2105.11886 (2021). ICML 2021 Workshop on Distribution-Free Uncertainty Quantification.
- Xu and Xie (2021b) Chen Xu and Yao Xie. 2021b. Conformal prediction interval for dynamic time-series. In International Conference on Machine Learning. PMLR, 11559–11569.
- Xu and Xie (2022) Chen Xu and Yao Xie. 2022. Conformal prediction set for time-series. arXiv preprint arXiv:2206.07851 (2022). ICML 2022 Workshop on Distribution-Free Uncertainty Quantification.
- Xu and Xie (2023) Chen Xu and Yao Xie. 2023. Sequential predictive conformal inference for time series. In International Conference on Machine Learning. PMLR, 38707–38727.
- Zhang et al. (2020) Xin Zhang, Xiujun Shu, Bingwen Zhang, Jie Ren, Lizhou Zhou, and Xin Chen. 2020. Cellular network radio propagation modeling with deep convolutional neural networks. In Proceedings of the 26th ACM SIGKDD International Conference on knowledge discovery & data mining. 2378–2386.