Location via proxy:   [ UP ]  
[Report a bug]   [Manage cookies]                

Learning Cellular Network Connection Quality with Conformal

Hanyang Jiang, Elizabeth Belding, Ellen Zegure, Yao Xie
Abstract.

In this paper, we address the problem of uncertainty quantification for cellular network speed. It is a well-known fact that the actual internet speed experienced by a mobile phone can fluctuate significantly, even when remaining in a single location. This high degree of variability underscores that mere point estimation of network speed is insufficient. Rather, it is advantageous to establish a prediction interval that can encompass the expected range of speed variations. In order to build an accurate network estimation map, numerous mobile data need to be collected at different locations. Currently, public datasets rely on users to upload data through apps. Although massive data has been collected, the datasets suffer from significant noise due to the nature of cellular networks and various other factors. Additionally, the uneven distribution of population density affects the spatial consistency of data collection, leading to substantial uncertainty in the network quality maps derived from this data. We focus our analysis on large-scale internet-quality datasets provided by Ookla to construct an estimated map of connection quality. To improve the reliability of this map, we introduce a novel conformal prediction technique to build an uncertainty map. We identify regions with heightened uncertainty to prioritize targeted, manual data collection. In addition, the uncertainty map quantifies how reliable the prediction is in different areas. Our method also leads to a sampling strategy that guides researchers to selectively gather high-quality data that best complement the current dataset to improve the overall accuracy of the prediction model.

Network measurement, Conformal prediction, Kernel regression

1. Introduction

Since the advent of cellular-network-based wireless internet over three decades ago, mobile internet has become a cornerstone of daily life, facilitating a wide range of activities in sectors such as communication, entertainment, commerce, and education. While urban centers typically enjoy strong mobile internet connectivity due to well-developed infrastructure, rural areas frequently suffer from inadequate connectivity (Vogels, 2021). This discrepancy primarily stems from an insufficient number of cell towers and infrastructural challenges, which are influenced by factors like geographic spread, lower population density, and diminished financial incentives for investing in these areas. Accurately pinpointing these locations can significantly improve network planning (Niu et al., 2021).

Identifying regions that are poorly served or entirely unserved is crucial for telecommunication operators and policymakers (Mangla et al., 2021). By focusing on these areas, service providers can deliberately improve infrastructure to enhance both coverage and the quality of service. On a larger scale, detailed connectivity maps can guide policy decisions, promoting digital inclusion and helping to close the gap between urban and rural communities. Enhancing mobile internet access in rural areas also offers considerable socio-economic benefits, fostering regional development and inclusivity. Therefore, addressing the challenges of mobile internet access in under-connected regions is essential for fostering digital equity and propelling societal advancement. In the United States, the Broadband Equity, Access, and Deployment (BEAD) program is set to direct significant funding towards expansion in the coming years, influenced partly by data on spatial quality (National Telecommunications and Information Administration, 2022).

However, current mobile signal estimation work (Zhang et al., 2020) often utilizes propagation models to assess connection quality. The cell tower locations are not shared by Internet service providers, and the public datasets are contributed mostly by volunteers. This makes the dataset noisy, inaccurate, and not up-to-date. Furthermore, the propagation models are limited in accuracy. In reality, the connection quality can be quite different either because of incomplete data or other contributing factors.

Another popular approach to estimating quality is to use data-motivated methods (Jiang et al., 2023) like kernel regression or Kriging. These methods are able to reach higher accuracy but still suffer from noisy and incomplete datasets. The uneven distribution of the population results in insufficient data in rural and undeveloped areas, which are targets for investments. If we can identify those areas that require more data, we could manually collect high-quality data in those places and improve predictions.

In this work, we propose a novel method called Ensemble Spatial Conformal Prediction (ESCP) to identify those areas where the prediction models have higher uncertainty. The reason for a higher uncertainty in an area can be an inherent higher variation in a cellular network, lack of data, failure of the prediction model, and so on. The uncertainty quantification can identify these areas for further improvement. Our work is based on an in-depth analysis using one of the largest open datasets, namely the Ookla dataset (Ookla, 2022), which captures mobile internet connectivity measurements worldwide. Users conduct speed tests on their internet connections using Ookla’s tools. When a user initiates a speed test, data such as the user’s IP address, the Internet Service Provider’s (ISP) identity, and the connection’s speed (both download and upload measured connection bandwidth) are logged. However, like many real-world datasets, it presents certain challenges. It is characterized by a relatively high noise and a conspicuous absence of measurements in specific areas.

Conformal prediction (CP) is a widely recognized distribution-free method used to quantify uncertainty in contemporary machine learning contexts (Volkhonskiy et al., 2017). This technique enhances conventional predictive algorithms by not only generating point estimates but also providing uncertainty intervals. These intervals are designed to encapsulate the unobserved ground truth with a high probability specified by the user. Consequently, CP has been effectively employed across various domains, including anomaly detection (Xu and Xie, 2021a), classification (Angelopoulos et al., 2021; Xu and Xie, 2022), and regression (Barber et al., 2021a), among others.

Fundamentally, CP operates as a flexible wrapper around any predictive model, typically described as a black-box function f𝑓fitalic_f. It integrates three key elements: the model f𝑓fitalic_f, an input feature X𝑋Xitalic_X, and a potential outcome Y𝑌Yitalic_Y. The core of CP involves computing a ”non-conformity” score, which assesses the degree to which the potential outcome deviates from typical model predictions. This score is crucial for determining how well the potential outcome aligns with established patterns recognized by the model. Through this approach, CP provides a robust framework for uncertainty quantification in predictive modeling, enhancing the reliability and interpretability of machine learning applications.

For the prediction model f𝑓fitalic_f, we design a self-tuning bandwidth kernel regression method. The method is designed to deal with the spatial imbalance of data. In contrast to conventional models, the approach is designed to adaptively determine the kernel regression bandwidth for a location.

In this study, we aim to create a prediction region map of mobile internet quality that covers an entire state, not just urban areas seen in prior studies. Some works, including (Adarsh et al., 2021), realize the importance of this field, but they focus more on the data collection and analysis part instead of developing methods. The lack of data from rural areas makes this a tough challenge, setting our work apart from existing approaches.

The rest of the paper is organized as follows. We first give a detailed description of the datasets used in the paper. After that, we introduce the methods used in this paper. Finally, we present the numerical experiments on the two different datasets to show the performance of our proposed method.

2. Data Description

The dataset utilized in our study is sourced from the open datasets provided by Ookla. The data gathered by Ookla from 2019 to 2022 encapsulates the performance metrics of mobile internet connections for a multitude of users worldwide. Key variables in this dataset include geographical coordinates (longitude and latitude), mean download speed (MB/s), mean upload speed (MB/s), count of tests conducted in each area (aggregated for user privacy into 600m2600superscript𝑚2600m^{2}600 italic_m start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT grid blocks), the number of distinct devices utilized for testing, and a comprehensive score assessing the connection speed. The public Ookla datasets do not provide information on the cellular provider, hence our estimation is of coverage quality aggregated over all providers present in the measurements. The availability of data labeled by provider would be of great interest to the research community.

The following provides a brief description of some of these variables:

  • Location: Denotes the location of the center of a 600m2600superscript𝑚2600m^{2}600 italic_m start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT grid block where the users measured the connection speed.

  • Download speed: Represents the measured rate of data transfer from a server to the user’s device. This is a critical metric as it affects the speed of web page loading and file downloading, and hence is quite noticeable to the user. Download speeds are reported as an average in the grid location.

  • Upload speed: Corresponds to the measured data transfer rate from the user’s device to the server. This metric is crucial for tasks like video calls, cloud file uploads, and user-generated live streaming. Upload speeds are reported as an average in the grid location.

  • Number of tests at each location: This refers to the number of times tests were conducted in a given grid block. More tests tend to indicate a more reliable measurement score. However, in around 50%percent5050\%50 % of locations, only a single test is conducted.

  • Number of devices used for testing: This refers to the number of distinct devices used for testing in a certain grid block, as the performance can vary across different mobile phones and mobile providers.

  • Score: This variable combines both the download and upload speeds to provide a holistic view of the connection bandwidth performance.

We have mobile connection data from the states of Georgia in the United States Southeast and New Mexico in the United States Southwest. The geographic scatter plot of the connection scores for each dataset is shown in Figure 1. There are 28,587 data points in the Georgia dataset. The frequency of scores is shown in Figure 2. There are 7,579 data points in the New Mexico dataset and the frequency of scores is shown in Figure 2. The dataset does not include the providers of the devices, which means that this user-collected dataset is a mixture of different internet providers.

Meanwhile, we have removed approximately 2.5%percent2.52.5\%2.5 % of outliers from the datasets. These outliers are identified as points that deviate significantly from the mean of their 50505050-nearest neighbors. Specifically, a point is considered an outlier if the difference between its score and the neighborhood mean exceeds three times the standard deviation of the scores within that neighborhood. This method of identifying outliers is a common practice in data analysis.

Refer to caption
Refer to caption
Figure 1. Mobile connection score of Georgia and New Mexico, where the data in Georgia is very dense, and in New Mexico is sparse.
Refer to caption
Refer to caption
Figure 2. The left plot is the frequency of scores in Georgia dataset. The right plot is the frequency of scores in New Mexico dataset.

3. Methodology

Our approach consists of two distinct phases. In the initial phase, we construct a map estimating the score using the kernel regression method. In the subsequent phase, we introduce a novel ensembled spatial conformal prediction method that generates an uncertainty map throughout the region.

3.1. Self-tuning Bandwidth Kernel Regression

Kernel regression is a non-parametric technique to estimate the conditional expectation of a random variable. The objective is to find a non-linear relationship between the input variable and the corresponding output. Kernel regression employs kernel functions, which can capture more complex patterns than linear regression.

For a given dataset with inputs 𝐱𝐱\mathbf{x}bold_x and corresponding outputs 𝐲𝐲\mathbf{y}bold_y, the kernel regression estimate y^^𝑦\hat{y}over^ start_ARG italic_y end_ARG at a new input point x𝑥xitalic_x is given by:

(1) y^(x)=i=1nKh(xxi)yii=1nKh(xxi).^𝑦𝑥superscriptsubscript𝑖1𝑛subscript𝐾norm𝑥subscript𝑥𝑖subscript𝑦𝑖superscriptsubscript𝑖1𝑛subscript𝐾norm𝑥subscript𝑥𝑖\hat{y}(x)=\frac{\sum_{i=1}^{n}K_{h}(\|x-x_{i}\|)y_{i}}{\sum_{i=1}^{n}K_{h}(\|% x-x_{i}\|)}.over^ start_ARG italic_y end_ARG ( italic_x ) = divide start_ARG ∑ start_POSTSUBSCRIPT italic_i = 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_n end_POSTSUPERSCRIPT italic_K start_POSTSUBSCRIPT italic_h end_POSTSUBSCRIPT ( ∥ italic_x - italic_x start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ∥ ) italic_y start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT end_ARG start_ARG ∑ start_POSTSUBSCRIPT italic_i = 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_n end_POSTSUPERSCRIPT italic_K start_POSTSUBSCRIPT italic_h end_POSTSUBSCRIPT ( ∥ italic_x - italic_x start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ∥ ) end_ARG .

Here, Kh(u)=1hK(uh)subscript𝐾𝑢1𝐾𝑢K_{h}(u)=\frac{1}{h}K(\frac{u}{h})italic_K start_POSTSUBSCRIPT italic_h end_POSTSUBSCRIPT ( italic_u ) = divide start_ARG 1 end_ARG start_ARG italic_h end_ARG italic_K ( divide start_ARG italic_u end_ARG start_ARG italic_h end_ARG ) is a kernel function, and hhitalic_h is a bandwidth parameter. The kernel function K()𝐾K(\cdot)italic_K ( ⋅ ) is often chosen to be a Gaussian kernel, though other choices are also possible. The bandwidth parameter hhitalic_h controls the width of the kernel, and hence the smoothness of the estimated function.

In the Gaussian case, the kernel function K()𝐾K(\cdot)italic_K ( ⋅ ) is defined as:

(2) K(u)=12πexp(12u2).𝐾𝑢12𝜋12superscript𝑢2K(u)=\frac{1}{\sqrt{2\pi}}\exp\left(-\frac{1}{2}u^{2}\right).italic_K ( italic_u ) = divide start_ARG 1 end_ARG start_ARG square-root start_ARG 2 italic_π end_ARG end_ARG roman_exp ( - divide start_ARG 1 end_ARG start_ARG 2 end_ARG italic_u start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT ) .

Kernel regression estimates the conditional mean function without imposing a parametric form for the functional relationship between predictors and the outcome variable. It allows for flexible, data-driven model specification.

The bandwidth hhitalic_h is a crucial parameter in kernel regression. If hhitalic_h is too small, the estimate will be very rough, capturing too much noise in the data (overfitting). If hhitalic_h is too large, the estimate will be too smooth, not capturing important patterns in the data (underfitting). The choice of an appropriate hhitalic_h often involves cross-validation or some other form of bandwidth selection strategy.

Kernel regression, by its standard definition, utilizes a constant bandwidth hhitalic_h that is applied uniformly across all data points. However, this one-size-fits-all approach might not be the most suitable in scenarios where the data exhibits spatial imbalances, as is the case in our study.

In regions where data points are sparse, the lack of neighboring points could potentially lead to unrepresentative averages and, consequently, inaccurate predictions. To overcome this, we propose the use of a larger bandwidth in such areas, thereby encompassing more points for computation and increasing the representativeness of the estimates.

To this end, we propose the self-tuning bandwidth in the kernel regression framework (STBKR). The self-tuning bandwidth mechanism adapts the bandwidth for each point based on its surrounding density of points. The formula for our self-tuning bandwidth kernel regression can be expressed as follows:

(3) y^(x)=i=1nKh(x)(xxi)yii=1nKh(x)(xxi).^𝑦𝑥superscriptsubscript𝑖1𝑛subscript𝐾𝑥norm𝑥subscript𝑥𝑖subscript𝑦𝑖superscriptsubscript𝑖1𝑛subscript𝐾𝑥norm𝑥subscript𝑥𝑖\hat{y}(x)=\frac{\sum_{i=1}^{n}K_{h(x)}(\|x-x_{i}\|)y_{i}}{\sum_{i=1}^{n}K_{h(% x)}(\|x-x_{i}\|)}.over^ start_ARG italic_y end_ARG ( italic_x ) = divide start_ARG ∑ start_POSTSUBSCRIPT italic_i = 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_n end_POSTSUPERSCRIPT italic_K start_POSTSUBSCRIPT italic_h ( italic_x ) end_POSTSUBSCRIPT ( ∥ italic_x - italic_x start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ∥ ) italic_y start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT end_ARG start_ARG ∑ start_POSTSUBSCRIPT italic_i = 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_n end_POSTSUPERSCRIPT italic_K start_POSTSUBSCRIPT italic_h ( italic_x ) end_POSTSUBSCRIPT ( ∥ italic_x - italic_x start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ∥ ) end_ARG .

In this formula, we let h(x)=cRk(x)2𝑥𝑐subscript𝑅𝑘superscript𝑥2h(x)=cR_{k}(x)^{2}italic_h ( italic_x ) = italic_c italic_R start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT ( italic_x ) start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT. Here, c𝑐citalic_c is a parameter that is determined by cross-validation, and Rk(x)subscript𝑅𝑘𝑥R_{k}(x)italic_R start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT ( italic_x ) denotes the Euclidean distance from a given data point x𝑥xitalic_x to its k𝑘kitalic_k-th nearest neighbor. Notably, in areas where data is sparsely distributed, the distance Rk(x)subscript𝑅𝑘𝑥R_{k}(x)italic_R start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT ( italic_x ) will be larger, thereby leading to an increased bandwidth h(x)𝑥h(x)italic_h ( italic_x ).

This strategy of self-tuning bandwidth effectively addresses the issue of spatial imbalance in the dataset. By allowing the bandwidth to adapt based on the local data density, it ensures a more representative sampling of neighbors, leading to more accurate and reliable predictions. Furthermore, the choice of c𝑐citalic_c through cross-validation aids in avoiding overfitting or underfitting, further strengthening the robustness of our regression model.

3.2. Conformal Prediction

To begin with, we introduce the basic idea of conformal prediction. Denote the kernel regression model introduced in the previous section as our prediction model f^^𝑓\hat{f}over^ start_ARG italic_f end_ARG. The split conformal prediction (Vovk et al., 2005) uses a hold-out set to learn the distribution of the non-conformity score. Assume the hold-out set consists of data (X1,Y1),,(Xn,Yn)subscript𝑋1subscript𝑌1subscript𝑋𝑛subscript𝑌𝑛(X_{1},Y_{1}),\cdots,(X_{n},Y_{n})( italic_X start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT , italic_Y start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT ) , ⋯ , ( italic_X start_POSTSUBSCRIPT italic_n end_POSTSUBSCRIPT , italic_Y start_POSTSUBSCRIPT italic_n end_POSTSUBSCRIPT ), the residuals of the prediction model f^^𝑓\hat{f}over^ start_ARG italic_f end_ARG would be εi=|Yif^(Xi)|subscript𝜀𝑖subscript𝑌𝑖^𝑓subscript𝑋𝑖\varepsilon_{i}=|Y_{i}-\hat{f}(X_{i})|italic_ε start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT = | italic_Y start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT - over^ start_ARG italic_f end_ARG ( italic_X start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ) | (i=1,,n)𝑖1𝑛(i=1,\cdots,n)( italic_i = 1 , ⋯ , italic_n ).

The prediction interval with confidence 1α1𝛼1-\alpha1 - italic_α at the new point Xtestsubscript𝑋testX_{\text{test}}italic_X start_POSTSUBSCRIPT test end_POSTSUBSCRIPT would be 𝒞^nα(Xtest)=f^(Xtest)±( the (1α)(n+1)-th smallest of ε1,,εn,)superscriptsubscript^𝒞𝑛𝛼subscript𝑋testplus-or-minus^𝑓subscript𝑋test the 1𝛼𝑛1-th smallest of subscript𝜀1subscript𝜀𝑛\hat{\mathcal{C}}_{n}^{\alpha}(X_{\text{test}})=\hat{f}(X_{\text{test}})\pm(% \text{ the }\lceil(1-\alpha)(n+1)\rceil\text{-th smallest of }\varepsilon_{1},% \ldots,\varepsilon_{n},\infty)over^ start_ARG caligraphic_C end_ARG start_POSTSUBSCRIPT italic_n end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_α end_POSTSUPERSCRIPT ( italic_X start_POSTSUBSCRIPT test end_POSTSUBSCRIPT ) = over^ start_ARG italic_f end_ARG ( italic_X start_POSTSUBSCRIPT test end_POSTSUBSCRIPT ) ± ( the ⌈ ( 1 - italic_α ) ( italic_n + 1 ) ⌉ -th smallest of italic_ε start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT , … , italic_ε start_POSTSUBSCRIPT italic_n end_POSTSUBSCRIPT , ∞ ).

In other word, it is saying that the residual εtestsubscript𝜀test\varepsilon_{\text{test}}italic_ε start_POSTSUBSCRIPT test end_POSTSUBSCRIPT at the new point Xtestsubscript𝑋testX_{\text{test}}italic_X start_POSTSUBSCRIPT test end_POSTSUBSCRIPT should fall in the interval [0, the (1α)(n+1)-th smallest of R1,,Rn,]0 the 1𝛼𝑛1-th smallest of subscript𝑅1subscript𝑅𝑛[0,\text{ the }\lceil(1-\alpha)(n+1)\rceil\text{-th smallest of }R_{1},\ldots,% R_{n},\infty][ 0 , the ⌈ ( 1 - italic_α ) ( italic_n + 1 ) ⌉ -th smallest of italic_R start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT , … , italic_R start_POSTSUBSCRIPT italic_n end_POSTSUBSCRIPT , ∞ ].

The method guarantees a distribution-free predictive coverage at the target level 1α1𝛼1-\alpha1 - italic_α, which means that when the data are independently and identically distributed (i.i.d.), the marginal coverage of prediction region in split conformal prediction satisfies (Ytest𝒞^nα(Xtest))1αsubscript𝑌testsuperscriptsubscript^𝒞𝑛𝛼subscript𝑋test1𝛼\mathbb{P}(Y_{\text{test}}\in\hat{\mathcal{C}}_{n}^{\alpha}(X_{\text{test}}))% \geq 1-\alphablackboard_P ( italic_Y start_POSTSUBSCRIPT test end_POSTSUBSCRIPT ∈ over^ start_ARG caligraphic_C end_ARG start_POSTSUBSCRIPT italic_n end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_α end_POSTSUPERSCRIPT ( italic_X start_POSTSUBSCRIPT test end_POSTSUBSCRIPT ) ) ≥ 1 - italic_α. In other words, the prediction interval contains the target point with probability at least 1α1𝛼1-\alpha1 - italic_α. The result holds for any sample size n𝑛nitalic_n, which is called the finite-sample guarantee.

3.3. Ensemble Spatial Conformal Prediction

The classic setting of conformal prediction is the i.i.d. or exchangeable setting, which fails to work well when the data has correlations. Moreover, few works consider spatial data that is common in practice. Here we propose a novel ensembled conformal prediction method that is designed for a spatial setting.

Input: Training data {(xi,yi)}i=1Nsuperscriptsubscriptsubscript𝑥𝑖subscript𝑦𝑖𝑖1𝑁\left\{\left(x_{i},y_{i}\right)\right\}_{i=1}^{N}{ ( italic_x start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT , italic_y start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ) } start_POSTSUBSCRIPT italic_i = 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_N end_POSTSUPERSCRIPT, prediction algorithm f^^𝑓\hat{f}over^ start_ARG italic_f end_ARG, significance level α𝛼\alphaitalic_α, number of bootstrap models B𝐵Bitalic_B, neighborhood size K𝐾Kitalic_K, batch size s𝑠sitalic_s, and test data {(xj,yj)}j=1Ntsuperscriptsubscriptsubscript𝑥𝑗subscript𝑦𝑗𝑗1subscript𝑁𝑡\left\{\left(x_{j},y_{j}\right)\right\}_{j=1}^{N_{t}}{ ( italic_x start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT , italic_y start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT ) } start_POSTSUBSCRIPT italic_j = 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_N start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT end_POSTSUPERSCRIPT
Output: Ensemble prediction intervals {Cα(xj)}superscript𝐶𝛼subscript𝑥𝑗\{C^{\alpha}\left(x_{j}\right)\}{ italic_C start_POSTSUPERSCRIPT italic_α end_POSTSUPERSCRIPT ( italic_x start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT ) } for xjsubscript𝑥𝑗x_{j}italic_x start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT in test set
foreach xjsubscript𝑥𝑗x_{j}italic_x start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT in test set do
       Initialize 𝜺^={}^𝜺\hat{\bm{\varepsilon}}=\{\}over^ start_ARG bold_italic_ε end_ARG = { } as an ordered set
       Select K𝐾Kitalic_K-nearest neighbors of xjsubscript𝑥𝑗x_{j}italic_x start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT in training data as a set N(xj)={xn1,,xnK}𝑁subscript𝑥𝑗subscript𝑥subscript𝑛1subscript𝑥subscript𝑛𝐾N(x_{j})=\{x_{n_{1}},\cdots,x_{n_{K}}\}italic_N ( italic_x start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT ) = { italic_x start_POSTSUBSCRIPT italic_n start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT end_POSTSUBSCRIPT , ⋯ , italic_x start_POSTSUBSCRIPT italic_n start_POSTSUBSCRIPT italic_K end_POSTSUBSCRIPT end_POSTSUBSCRIPT }
       for b=1𝑏1b=1italic_b = 1 to B𝐵Bitalic_B do
             Sample with replacement an index set Db(xj)=(ni1,,nis)subscript𝐷𝑏subscript𝑥𝑗subscript𝑛subscript𝑖1subscript𝑛subscript𝑖𝑠D_{b}(x_{j})=(n_{i_{1}},\cdots,n_{i_{s}})italic_D start_POSTSUBSCRIPT italic_b end_POSTSUBSCRIPT ( italic_x start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT ) = ( italic_n start_POSTSUBSCRIPT italic_i start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT end_POSTSUBSCRIPT , ⋯ , italic_n start_POSTSUBSCRIPT italic_i start_POSTSUBSCRIPT italic_s end_POSTSUBSCRIPT end_POSTSUBSCRIPT ) from N(xj)𝑁subscript𝑥𝑗N(x_{j})italic_N ( italic_x start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT )
             Compute f^b=f^Db(xj)superscript^𝑓𝑏subscript^𝑓subscript𝐷𝑏subscript𝑥𝑗\hat{f}^{b}=\hat{f}_{D_{b}(x_{j})}over^ start_ARG italic_f end_ARG start_POSTSUPERSCRIPT italic_b end_POSTSUPERSCRIPT = over^ start_ARG italic_f end_ARG start_POSTSUBSCRIPT italic_D start_POSTSUBSCRIPT italic_b end_POSTSUBSCRIPT ( italic_x start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT ) end_POSTSUBSCRIPT
            
      for i=1𝑖1i=1italic_i = 1 to K𝐾Kitalic_K do
             Compute f^i(xni)=mean(f^b(xni),niDb)subscript^𝑓𝑖subscript𝑥subscript𝑛𝑖meansuperscript^𝑓𝑏subscript𝑥subscript𝑛𝑖subscript𝑛𝑖subscript𝐷𝑏\hat{f}_{-i}(x_{n_{i}})=\text{mean}(\hat{f}^{b}(x_{n_{i}}),n_{i}\notin D_{b})over^ start_ARG italic_f end_ARG start_POSTSUBSCRIPT - italic_i end_POSTSUBSCRIPT ( italic_x start_POSTSUBSCRIPT italic_n start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT end_POSTSUBSCRIPT ) = mean ( over^ start_ARG italic_f end_ARG start_POSTSUPERSCRIPT italic_b end_POSTSUPERSCRIPT ( italic_x start_POSTSUBSCRIPT italic_n start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT end_POSTSUBSCRIPT ) , italic_n start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ∉ italic_D start_POSTSUBSCRIPT italic_b end_POSTSUBSCRIPT )
             Compute ε^i=|f^i(xni)yni|subscript^𝜀𝑖subscript^𝑓𝑖subscript𝑥subscript𝑛𝑖subscript𝑦subscript𝑛𝑖\hat{\varepsilon}_{-i}=|\hat{f}_{-i}(x_{n_{i}})-y_{n_{i}}|over^ start_ARG italic_ε end_ARG start_POSTSUBSCRIPT - italic_i end_POSTSUBSCRIPT = | over^ start_ARG italic_f end_ARG start_POSTSUBSCRIPT - italic_i end_POSTSUBSCRIPT ( italic_x start_POSTSUBSCRIPT italic_n start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT end_POSTSUBSCRIPT ) - italic_y start_POSTSUBSCRIPT italic_n start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT end_POSTSUBSCRIPT |
             𝜺^=𝜺^{ε^i}^𝜺^𝜺subscript^𝜀𝑖\hat{\bm{\varepsilon}}=\hat{\bm{\varepsilon}}\cup\{\hat{\varepsilon}_{-i}\}over^ start_ARG bold_italic_ε end_ARG = over^ start_ARG bold_italic_ε end_ARG ∪ { over^ start_ARG italic_ε end_ARG start_POSTSUBSCRIPT - italic_i end_POSTSUBSCRIPT }
            
      Fit quantile regression Q^^𝑄\hat{Q}over^ start_ARG italic_Q end_ARG on 𝜺^^𝜺\hat{\bm{\varepsilon}}over^ start_ARG bold_italic_ε end_ARG
       Compute ωα(xj)=(1α)superscript𝜔𝛼subscript𝑥𝑗1𝛼\omega^{\alpha}(x_{j})=(1-\alpha)italic_ω start_POSTSUPERSCRIPT italic_α end_POSTSUPERSCRIPT ( italic_x start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT ) = ( 1 - italic_α ) quantile of 𝜺^^𝜺\hat{\bm{\varepsilon}}over^ start_ARG bold_italic_ε end_ARG
       Return Cα(xj)=[f^D0(xj)(xj)ωα(xj),f^D0(xj)(xj)+ωα(xj)]superscript𝐶𝛼subscript𝑥𝑗subscript^𝑓subscript𝐷0subscript𝑥𝑗subscript𝑥𝑗superscript𝜔𝛼subscript𝑥𝑗subscript^𝑓subscript𝐷0subscript𝑥𝑗subscript𝑥𝑗superscript𝜔𝛼subscript𝑥𝑗C^{\alpha}\left(x_{j}\right)=[\hat{f}_{D_{0}(x_{j})}(x_{j})-\omega^{\alpha}(x_% {j}),\hat{f}_{D_{0}(x_{j})}(x_{j})+\omega^{\alpha}(x_{j})]italic_C start_POSTSUPERSCRIPT italic_α end_POSTSUPERSCRIPT ( italic_x start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT ) = [ over^ start_ARG italic_f end_ARG start_POSTSUBSCRIPT italic_D start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT ( italic_x start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT ) end_POSTSUBSCRIPT ( italic_x start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT ) - italic_ω start_POSTSUPERSCRIPT italic_α end_POSTSUPERSCRIPT ( italic_x start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT ) , over^ start_ARG italic_f end_ARG start_POSTSUBSCRIPT italic_D start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT ( italic_x start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT ) end_POSTSUBSCRIPT ( italic_x start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT ) + italic_ω start_POSTSUPERSCRIPT italic_α end_POSTSUPERSCRIPT ( italic_x start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT ) ]
      
Algorithm 1 Ensemble Spatial Conformal Prediction (ESCP)

In Algorithm 1, the prediction model f^^𝑓\hat{f}over^ start_ARG italic_f end_ARG we use is the self-tuning bandwidth kernel regression method in Section 3.1. We enhance this method by using an ensemble approach, which significantly boosts both accuracy and robustness. The enhancement is also effective in handling spatial data. To this end, for each data point, we identify its K𝐾Kitalic_K-nearest neighbors to define a local area, which reflects the spatial dependencies relevant to that point.

For training the model, we employ the bootstrap technique, randomly sampling several points with replacement from this localized dataset. These sampled points serve as the training data for fitting the predictive model. This process is not just a one-off but is iterated multiple times, yielding a diverse collection of predictors. Each iteration contributes unique perspectives to the ensemble, enhancing the model’s overall predictive power and stability. Based on these predictors, we use an “leave-one-out” estimator f^isubscript^𝑓𝑖\hat{f}_{-i}over^ start_ARG italic_f end_ARG start_POSTSUBSCRIPT - italic_i end_POSTSUBSCRIPT that is related to the Jackknife+ procedure (Barber et al., 2021b) to predict the signal quality. The procedure avoids splitting the dataset to fit the prediction model, maintaining the statistical efficiency while still being computationally efficient.

Once we compute the non-conformity scores (essentially, the residuals) for each point within the local neighborhood, we proceed to fit a quantile regression model Q^^𝑄\hat{Q}over^ start_ARG italic_Q end_ARG. Recent studies (Romano et al., 2019; Xu and Xie, 2023) have found that replacing empirical quantile with quantile regression can capture the dependency between residuals and often leads to tighter confidence regions. There is no restriction on the quantile regression method used in Algorithm 1. In general, quantile regression algorithms rely on minimizing the pinball loss for specific regression algorithms. However, we typically want the method to be computationally efficient and require few tunings of hyperparameters. In the experiment, we chose the Quantile Random Forest (QRF) (Meinshausen and Ridgeway, 2006) that is fast and easy to use.

In quantile regression, the model learns the conditional distribution of Y|Xconditional𝑌𝑋Y|Xitalic_Y | italic_X, where Y𝑌Yitalic_Y is the response and X𝑋Xitalic_X is the feature. In our setting, we train the QRF on the residuals in the neighborhood to leverage the dependency. To be precise, for a data point located at x𝑥xitalic_x, we find its K𝐾Kitalic_K-nearest neighbor set N(x)={xn1,,xnK}𝑁𝑥subscript𝑥subscript𝑛1subscript𝑥subscript𝑛𝐾N(x)=\{x_{n_{1}},\cdots,x_{n_{K}}\}italic_N ( italic_x ) = { italic_x start_POSTSUBSCRIPT italic_n start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT end_POSTSUBSCRIPT , ⋯ , italic_x start_POSTSUBSCRIPT italic_n start_POSTSUBSCRIPT italic_K end_POSTSUBSCRIPT end_POSTSUBSCRIPT }, where the set is sorted from geographically closest to furthest. Based on the prediction model f^^𝑓\hat{f}over^ start_ARG italic_f end_ARG, we are able to compute the non-conformity scores εnisubscript𝜀subscript𝑛𝑖\varepsilon_{n_{i}}italic_ε start_POSTSUBSCRIPT italic_n start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT end_POSTSUBSCRIPT, and we denote the score at point x𝑥xitalic_x as εn0subscript𝜀subscript𝑛0\varepsilon_{n_{0}}italic_ε start_POSTSUBSCRIPT italic_n start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT end_POSTSUBSCRIPT. Then we let

(4) X=[εn1,,εnK],Y=εn0.formulae-sequence𝑋subscript𝜀subscript𝑛1subscript𝜀subscript𝑛𝐾𝑌subscript𝜀subscript𝑛0X=[\varepsilon_{n_{1}},\cdots,\varepsilon_{n_{K}}],Y=\varepsilon_{n_{0}}.italic_X = [ italic_ε start_POSTSUBSCRIPT italic_n start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT end_POSTSUBSCRIPT , ⋯ , italic_ε start_POSTSUBSCRIPT italic_n start_POSTSUBSCRIPT italic_K end_POSTSUBSCRIPT end_POSTSUBSCRIPT ] , italic_Y = italic_ε start_POSTSUBSCRIPT italic_n start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT end_POSTSUBSCRIPT .

For each point in the training set, we are able to construct such a neighborhood and compute corresponding non-conformity scores. Then we can train QRF a large neighborhood of x𝑥xitalic_x in the training set. When re-fitting QRF at each prediction point, we are capturing the local spatial dependency between residuals.

4. Numerical Experiment

We initiate our numerical experiment by partitioning the data into training and test sets with a split of 80%percent8080\%80 % and 20%percent2020\%20 % respectively. To optimize the parameters k𝑘kitalic_k and c𝑐citalic_c in the self-tuning bandwith kernel regression, we implement a 5-fold cross-validation on the training set. We select the combination that minimizes the 2subscript2\ell_{2}roman_ℓ start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT prediction error. The choice is k=10𝑘10k=10italic_k = 10 and c=0.01𝑐0.01c=0.01italic_c = 0.01.

For the ensemble spatial conformal prediction, we select the bootstrap number B=50𝐵50B=50italic_B = 50, and the batch size s=100𝑠100s=100italic_s = 100. A larger choice of B𝐵Bitalic_B and s𝑠sitalic_s will lead to high accuracy and robustness. In our experiment, the current choice is able to reach promising results while reducing computational costs to a large extent. We set α𝛼\alphaitalic_α to be 0.20.20.20.2, which means that the intervals should have at least 1α=80%1𝛼percent801-\alpha=80\%1 - italic_α = 80 % probability of coverage. This significance level produces a tighter region, whereas a higher significance level ensures greater coverage, though it may lead to a conservatively larger prediction interval.

Although neither Georgia nor New Mexico is rectangular in shape, we have included additional data to extend their areas to a rectangular format for the ease of displaying results.

Refer to caption
Figure 3. The left plot is the uncertainty map of Georgia, and the magnitude shows the width of the prediction interval of score in the dataset. The significance level 1α=0.81𝛼0.81-\alpha=0.81 - italic_α = 0.8. The plot on the right shows the locations of training data.
Refer to caption
Figure 4. Uncertainty map of New Mexico, and the magnitude shows the width of the prediction interval of score in the dataset. The significance level 1α=0.81𝛼0.81-\alpha=0.81 - italic_α = 0.8. The plot on the right shows the locations of training data.

In the experiment, we use our ensemble spatial conformal prediction method on the Georgia dataset and New Mexico dataset. We compare the performance of our method with two baseline methods. The first one is the standard split conformal prediction. The second one is EnbPI (Xu and Xie, 2021b), which is also a conformal prediction method that utilizes ensemble predictors. We split the dataset into 80%20%percent80percent2080\%-20\%80 % - 20 %

We compare the coverage rate on the test set and average width of prediction intervals on the test data. The results are presented in Table 1 and 2. From the table we can see that all the three methods achieve valid coverage (0.80.80.80.8) on the test data. Our proposed ESCP method generates a tighter prediction region compared to the other two methods.

From Figure 3 and 4, we observe that our method identifies several small areas exhibiting high uncertainty. High uncertainty indicates that the prediction model lacks confidence in its estimates, meaning the actual values could significantly deviate from the predictions. It is apparent that uncertainty peaks in urban regions where extensive measurements are recorded. This heightened uncertainty in densely populated areas arises from factors such as network congestion, signal obstructions from buildings, and interference from other electronic signals. Additional data collection may not significantly enhance estimation accuracy in these areas due to the inherently high variability of internet speeds.

On the other hand, certain rural areas also display elevated uncertainty due to limited data availability. In contrast, rural regions typically experience less interference and network congestion, contributing to more stable internet speeds. For instance, in the New Mexico dataset shown in Figure 1, numerous data points are collected along highways in rural areas. Interestingly, our uncertainty quantification map indicates low uncertainty along these highways, suggesting more stable internet conditions there. However, other rural regions with scant data collection do show higher uncertainty. This observation underscores the need for targeted data collection in these rural areas, as acquiring more data could significantly refine our internet speed estimations, given the inherently lower variability in these regions.

Table 1. Coverage and width of prediction interval on Georgia dataset.

split CP EnbPI ESCP
Coverage 83.4 82.4 80.8
Width 1501 1677 1180
Table 2. Coverage and width of prediction interval on New Mexico dataset.

split CP EnbPI ESCP
Coverage 84.6 82.9 82.6
Width 780 839 701

5. Discussion

This paper explores the challenge of quantifying uncertainty in cellular network speeds across diverse geographic regions. We have demonstrated that the variability of internet speeds significantly complicates accurate speed prediction due to factors such as high user density, structural barriers, and interference from other signals. We propose a novel Ensemble Spatial Conformal Prediction (ESCP) method that not only complements the prediction but also enables more informed decisions about where to focus data collection efforts.

We must emphasize that the conformal prediction method we utilize is based on the self-tuning bandwidth kernel regression model. The precision of this model greatly influences the prediction interval’s width, or its tightness. Accurately estimating mobile internet speeds is particularly challenging due to their high variability. There is limited research focused on creating comprehensive coverage maps across entire states, and traditional models often struggle with noisy and incomplete datasets.

Using the Ookla dataset, we have generated uncertainty maps for Georgia and New Mexico. These maps reveal greater uncertainty in densely populated areas and in rural regions lacking sufficient data. This outcome confirms our method’s capability to identify areas of uncertainty. Furthermore, our tests on both datasets demonstrate that our approach achieves valid coverage and produces tighter prediction intervals than those generated by baseline methods.

By identifying areas where the prediction model is uncertain, we can guide researchers in manually collecting high-quality data. Gathering more data in underrepresented areas will enhance both the accuracy of our estimations and the tightness of our prediction intervals. Given that our method has empirically shown to provide valid coverage, actively collecting data in these areas will undoubtedly refine the prediction intervals and guarantee accuracy. This active sampling strategy significantly boosts the quality of the data collected.

Acknowledgements.
This work was partially supported by National Science Foundation (NSF) CNS-2220387.

Appendix A Ethics

Cellular measurement data has the potential to reveal an individual’s location or movement over time. The data used in our work helps protect against this with spatial aggregation into 600m2600superscript𝑚2600m^{2}600 italic_m start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT regions and time aggregation over multiple years of data collection.

References

  • (1)
  • Adarsh et al. (2021) Vivek Adarsh, Michael Nekrasov, Udit Paul, Tarun Mangla, Arpit Gupta, Morgan Vigil-Hayes, Ellen Zegura, and Elizabeth Belding. 2021. Coverage is not binary: Quantifying mobile broadband quality in urban, rural, and tribal contexts. In 2021 International Conference on Computer Communications and Networks (ICCCN). IEEE, 1–9.
  • Angelopoulos et al. (2021) Anastasios Nikolas Angelopoulos, Stephen Bates, Michael Jordan, and Jitendra Malik. 2021. Uncertainty Sets for Image Classifiers using Conformal Prediction. In International Conference on Learning Representations. https://openreview.net/forum?id=eNdiU_DbM9
  • Barber et al. (2021a) Rina Foygel Barber, Emmanuel J. Candès, Aaditya Ramdas, and Ryan J. Tibshirani. 2021a. Predictive inference with the jackknife+. The Annals of Statistics 49, 1 (2021), 486 – 507. https://doi.org/10.1214/20-AOS1965
  • Barber et al. (2021b) Rina Foygel Barber, Emmanuel J Candes, Aaditya Ramdas, and Ryan J Tibshirani. 2021b. Predictive inference with the jackknife+. (2021).
  • Jiang et al. (2023) Hanyang Jiang, Henry Shaowu Yuchi, Elizabeth Belding, Ellen Zegura, and Yao Xie. 2023. Mobile Internet Quality Estimation using Self-Tuning Kernel Regression. arXiv preprint arXiv:2311.05641 (2023).
  • Mangla et al. (2021) Tarun Mangla, Esther Showalter, Vivek Adarsh, Kipp Jones, Morgan Vigil-Hayes, Elizabeth Belding, and Ellen Zegura. 2021. A Tale of Three Datasets: Towards Characterizing Mobile Broadband Access in the United States. arXiv preprint arXiv:2102.07288 (2021).
  • Meinshausen and Ridgeway (2006) Nicolai Meinshausen and Greg Ridgeway. 2006. Quantile regression forests. Journal of machine learning research 7, 6 (2006).
  • National Telecommunications and Information Administration (2022) National Telecommunications and Information Administration. 2022. Broadband Equity, Access and Deployment (BEAD) Program. https://broadbandusa.ntia.doc.gov/funding-programs/broadband-equity-access-and-deployment-bead-program-0 Accessed: 2023-06-02.
  • Niu et al. (2021) Zhisheng Niu, Sheng Zhou, and Noel Crespi. 2021. Greening 6G: New Horizons. Shaping Future 6G Networks: Needs, Impacts, and Technologies (2021), 39–53.
  • Ookla (2022) Ookla. 2022. Global Broadband Performance Metrics: Q1. http://www.speedtest.net/global-index. Accessed:2022-09-20.
  • Romano et al. (2019) Yaniv Romano, Evan Patterson, and Emmanuel Candes. 2019. Conformalized quantile regression. Advances in neural information processing systems 32 (2019).
  • Vogels (2021) Emily A Vogels. 2021. Some digital divides persist between rural, urban and suburban America. (2021).
  • Volkhonskiy et al. (2017) Denis Volkhonskiy, Evgeny Burnaev, Ilia Nouretdinov, Alexander Gammerman, and Vladimir Vovk. 2017. Inductive conformal martingales for change-point detection. In Conformal and Probabilistic Prediction and Applications. PMLR, 132–153.
  • Vovk et al. (2005) Vladimir Vovk, Alexander Gammerman, and Glenn Shafer. 2005. Algorithmic learning in a random world. Vol. 29. Springer.
  • Xu and Xie (2021a) Chen Xu and Yao Xie. 2021a. Conformal anomaly detection on spatio-temporal observations with missing data. arXiv preprint arXiv:2105.11886 (2021). ICML 2021 Workshop on Distribution-Free Uncertainty Quantification.
  • Xu and Xie (2021b) Chen Xu and Yao Xie. 2021b. Conformal prediction interval for dynamic time-series. In International Conference on Machine Learning. PMLR, 11559–11569.
  • Xu and Xie (2022) Chen Xu and Yao Xie. 2022. Conformal prediction set for time-series. arXiv preprint arXiv:2206.07851 (2022). ICML 2022 Workshop on Distribution-Free Uncertainty Quantification.
  • Xu and Xie (2023) Chen Xu and Yao Xie. 2023. Sequential predictive conformal inference for time series. In International Conference on Machine Learning. PMLR, 38707–38727.
  • Zhang et al. (2020) Xin Zhang, Xiujun Shu, Bingwen Zhang, Jie Ren, Lizhou Zhou, and Xin Chen. 2020. Cellular network radio propagation modeling with deep convolutional neural networks. In Proceedings of the 26th ACM SIGKDD International Conference on knowledge discovery & data mining. 2378–2386.