
Department of Computer Science & Engineering, University of Minnesota, Minneapolis, MN, USA, ghosh117@umn.edu
Department of Computer Science & Engineering, University of Minnesota, Minneapolis, MN, USA, gupta423@umn.edu
Department of Computer Science & Engineering, University of Minnesota, Minneapolis, MN, USA, sharm485@umn.edu
Department of Economics, University of Minnesota, Minneapolis, MN, USA, an000033@umn.edu
Department of Computer Science & Engineering, University of Minnesota, Minneapolis, MN, USA, shekhar@umn.edu

Copyright: Subhankar Ghosh, Jayant Gupta, Arun Sharma, Shuai An and Shashi Shekhar

CCS concepts: Information systems → Data mining; Computing methodologies → Spatial and physical reasoning

Funding: This material is based upon work supported by the National Science Foundation under Grants No. 2118285, 2040459, 1901099, and 1916518.

Acknowledgements.
We also thank Kim Koffolt, Yash Travadi, and the Spatial Computing Research Group for valuable comments and refinements.

Reducing False Discoveries in Statistically-Significant Regional-Colocation Mining: A Summary of Results

Subhankar Ghosh    Jayant Gupta    Arun Sharma    Shuai An    Shashi Shekhar
Abstract

Given a set S of spatial feature types, its feature instances, a study area, and a neighbor relationship, the goal is to find pairs <a region ($r_g$), a subset C of S> such that C is a statistically significant regional-colocation pattern in $r_g$. This problem is important for applications in various domains including ecology, economics, and sociology. The problem is computationally challenging due to the exponential number of regional colocation patterns and candidate regions. Previously, we proposed a miner [9] that finds statistically significant regional colocation patterns. However, the numerous simultaneous statistical inferences raise the risk of false discoveries (also known as the multiple comparisons problem) and carry a high computational cost. We propose a novel algorithm, namely, multiple comparisons regional colocation miner (MultComp-RCM) which uses a Bonferroni correction. Theoretical analysis, experimental evaluation, and case study results show that the proposed method reduces both the false discovery rate and computational cost.

keywords:
Colocation pattern, Participation index, Multiple comparisons problem, Spatial heterogeneity, Statistical significance.

1 Introduction

Regional-colocation patterns are (study sub-area $R$, feature-type subset $C$) pairs such that instances of the feature types in $C$ are often found in close proximity within $R$. Given a set S of spatial features (e.g., coffee shops, restaurants), their feature instances, a study area, and a neighbor relationship (e.g., geographic proximity), the goal is to identify pairs <region $r_g$, subset C of S> such that C is a statistically significant regional-colocation pattern in region $r_g$. Figure 1(a) shows a set of instances input into a regional-colocation miner, consisting of three different spatial feature types, a neighborhood relation between feature instances, and a space partitioning. Figure 1(b) shows the set of statistically significant regional colocations identified after significance testing (described in Section 2.2). The output is a pair of regional colocations: $r_1$, showing a strong regional colocation among all three features (i.e., $f_A$, $f_B$, and $f_C$), and $r_2$, showing a strong regional colocation between two features (i.e., $f_A$ and $f_B$). The rest of the area within the map shows less spatial interaction (low participation index) between these features.

Figure 1: Regions where all or subsets of $f_A$, $f_B$, and $f_C$ significantly co-locate in the study area

The problem of mining statistically significant regional-colocation patterns is societally important, with applications in retail, public health, ecology, public security, transportation, etc. For example, retail establishments (e.g., fast food chains and coffee shops) often colocate to reach each other’s customers. Thus, finding statistically significant regional colocation patterns among competing retail stores has tremendous value for retail analysis. When identifying colocation patterns in societal domains, it is important to minimize the chance of false discoveries. A famous historical example occurred between 1900 and 1904, when urban districts of San Francisco experienced an outbreak of bubonic plague, resulting in 119 deaths. The federal and state authorities falsely identified the victims’ ethnicity as a feature highly correlated with the plague. This false discovery had an immense adverse impact on San Francisco’s management of the plague. Even when they do not unfairly stigmatize groups or regions, false discoveries waste money and resources. Compared with the city’s response to the same plague between 1907 and 1908, when rats were correctly identified as a highly correlated feature and the plague was swiftly contained, the negative impact of the earlier false discovery is all the more evident [13]. Table 1 provides application domains and use cases.

Table 1: Regional-colocation applications.
Application Domain | Example
Retail | <China, {McDonald’s and KFC}>, <USA, {McDonald’s and Jimmy John’s}>
Public Health | <Ports, {Plague and rats}>, <Middle East, {Middle East Respiratory Syndrome (MERS) in 2012 and MERS-CoV}>
Ecology | <Indian/Pacific Ocean, {Anemone and Clownfish}>, <Nile River delta, {Nile Crocodile and Egyptian Plover}>
Public Safety | <Region around bars, {Assault crimes and drunk driving}>
Transportation | <Near bus depots, {High $NO_x$ concentrations and buses}>

The problem of statistically significant regional-colocation pattern detection ($SSRCPD$) is computationally challenging for the following reasons: (1) Significance testing in this problem requires considering multiple statistical inferences simultaneously, which leads to an increase in Type-I error (i.e., false discoveries). (2) There is an exponential number of candidate regional patterns, e.g., the dataset used in the case study (Section 6) consists of 1473 different retail brands and their locations in Minnesota, resulting in $2^{1473}$ different candidate patterns. (3) A spatial partitioning approach would lead to an infinite number of candidate region subsets.

Figure 2: Comparison with Related Work

Figure 2 shows a decision tree that distinguishes our manuscript from previous works, where $SSRCPD$ refers to statistically significant regional colocation pattern detection. Earlier work on regional-colocation pattern detection either uses data-unaware space partitioning (e.g., Quadtree [4, 12]) or clustering of colocation instances [5, 7]. However, these techniques lack statistical significance testing and depend on input parameters (e.g., participation index threshold) which may vary geographically. Statistically significant global colocation mining was introduced in [1], while statistically significant regional colocation mining was first explored in [9]. In [9] we proposed $SSRCM$, which uses a subgraph enumeration approach to detect statistically significant regional colocation patterns whose regions are composed of one or more contiguous atomic partitions (the smallest regions within which a candidate pattern is statistically significant). This algorithm was expensive because expanding the region within which the pattern was statistically significant required recalculating the $p$-value.
Since detecting statistically significant regional colocation patterns requires performing multiple simultaneous statistical inferences, it gives rise to the multiple comparisons problem [14], which risks false discoveries (a.k.a. Type-I errors). The probability of a Type-I error increases rapidly as the number of partitions increases. To address the multiple comparisons problem, we propose a robust statistically significant regional colocation miner (MultComp-RCM) using a Bonferroni correction [3]. The proposed approach applies stricter $p$-value thresholds to reduce false discoveries (Type-I errors), thus setting an upper bound on the overall significance level ($\alpha$, which is 0.05 for a 95% statistical confidence).

Contributions:

  • We propose a new approach, the multiple comparisons regional colocation miner (MultComp-RCM), to reduce false positives using a well-established statistical technique for multiple comparisons, the Bonferroni correction.

  • The paper provides a comparative analysis showing that the proposed MultComp-RCM is computationally more efficient than SSRCM.

  • The paper describes a sensitivity analysis using synthetic data which shows that MultComp-RCM requires an increasingly smaller number of significance tests and participation index computations for an increasing number of regions.

  • We present a case study on retail establishments in Minnesota using the Safegraph POI dataset [9]. The proposed method discovers new regional-colocation patterns involving fast food and coffee retailer feature-type subsets in a study area composed of Minnesota counties. We also confirm that the Bonferroni correction in our method reduces false discoveries.

Scope: For simplicity, this paper focuses on regional-colocation patterns consisting of two or three different features. In our case study, we enumerated regions based on a contiguous collection of counties. Nevertheless, this work can be extended to different types of regions (e.g., ports). We also do not consider segregation patterns (negative spatial interaction) or the temporal aspects of the patterns.

Organization: The paper is organized as follows. Section 2 reviews basic concepts and formally defines the problem. In Section 3 we briefly review SSRCM and describe the proposed approach (MultComp-RCM). Section 4 gives a theoretical analysis of MultComp-RCM. We present the experimental evaluation in Section 5 and a case study in Section 6. Section 7 briefly surveys related work and provides a discussion. Section 8 concludes the paper with future work.

2 Basic Concepts and Problem Definition.

First, we review basic concepts related to colocation detection, statistical significance testing, and the multiple comparisons problem. Then, we formally define statistically significant regional colocation pattern detection.

2.1 Colocation detection:

In this section, we briefly introduce the terminology and basic concepts used to define colocation pattern detection, with examples. The basic concepts are as follows:

A feature instance is a geo-located spatial entity of a Boolean feature type $f$ with a geo-referenced point location $p$ (e.g., latitude, longitude), represented as $\langle f, p \rangle$. Multiple instances of a feature are represented as $f_i$ and can be related to other feature instances $f_j$ via a neighbor relation $\mathcal{R}$. For example, geographic proximity is represented as $\mathcal{R}_{f_i, f_j} \leq \theta$, where $\theta$ is the neighbor relation threshold. In a neighbor graph, we represent feature instances that satisfy such relations as nodes and their relationships as edges.
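As an illustration of the neighbor graph described above, the following minimal sketch links every pair of feature instances whose distance is at most $\theta$. It assumes planar coordinates and Euclidean distance; the function name `neighbor_graph` and the sample instances are hypothetical and not taken from the paper.

```python
from itertools import combinations
from math import dist


def neighbor_graph(instances, theta):
    """Return edges (i, j) for every pair of instances within distance theta.

    instances: list of (feature_type, (x, y)) tuples; Euclidean distance is an
    assumption made for this sketch.
    """
    edges = []
    for (i, (_, p_i)), (j, (_, p_j)) in combinations(enumerate(instances), 2):
        if dist(p_i, p_j) <= theta:
            edges.append((i, j))
    return edges


# Hypothetical instances of two feature types, A and B.
instances = [
    ("A", (0.0, 0.0)),
    ("B", (0.5, 0.0)),
    ("A", (3.0, 3.0)),
    ("B", (3.0, 3.4)),
    ("A", (9.0, 9.0)),
]
edges = neighbor_graph(instances, theta=1.0)
```

Here the isolated instance at (9.0, 9.0) receives no edge, so it would not participate in any colocation instance.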

A colocation candidate $C$ is a set of features defined in the given study area ($SA$) or a sub-region ($r_g$), where $r_g \in SA$. For example, Figure 1(a) shows 17 spatial objects of type $f_A$ (circle), 12 spatial objects of type $f_B$ (triangle), and 9 instances of colocation pattern $\{f_A, f_B\}$. An instance of a colocation is a set of feature instances that satisfy the neighborhood relation $\mathcal{R}$ and form a clique in the neighbor graph.

A participation ratio ($pr$) is the ratio of the number of instances of a feature participating in a relation $\mathcal{R}$ to the total number of instances of that feature inside the study area ($SA$). For a given colocation candidate $C$ and feature $f$, it is written $pr(f, C)$, as shown in Equation 1:

$$pr(f, C) = \frac{participating\_instances(f, C)}{instance(f)} \tag{1}$$

For the feature instances shown in Figure 1(a), the participation ratio values for the relation $\{f_A, f_B\}$ are $pr(f_A, \{f_A, f_B\}) = \frac{9}{17}$ and $pr(f_B, \{f_A, f_B\}) = \frac{8}{12}$. Further, the participation ratio within a region ($r_g$) for a feature $f$ is defined as $pr(f, [r_g, C])$.
For example, in Figure 1, $pr(f_A, [r_2, \{f_A, f_B\}])$ and $pr(f_B, [r_2, \{f_A, f_B\}])$ in region $r_2$ have the values $\frac{4}{4}$ and $\frac{3}{4}$, respectively.

A participation index ($pi$) is the minimal participation ratio of all feature types in a colocation candidate, as described in Equation 2:

$$pi(C) = \min_{f \in C} \big( pr(f, C) \big) \tag{2}$$

The participation index quantifies the spatial interaction among features. For the feature instances in Figure 1(a), the participation index of features $f_A$ and $f_B$ is $pi(\{f_A, f_B\}) = \min(\frac{9}{17}, \frac{8}{12}) = \frac{9}{17}$. A regional participation index is the minimal participation ratio of all feature types in the colocation candidate $C$ within region $r_g$, as shown in Equation 3:

$$pi([r_g, C]) = \min_{f \in C} \big( pr(f, [r_g, C]) \big) \tag{3}$$

For instance, in Figure 1, $pi([r_2, \{f_A, f_B\}]) = \min(\frac{4}{4}, \frac{3}{4}) = \frac{3}{4}$.
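The arithmetic in Equations 1 to 3 can be reproduced with a short sketch. It uses only the instance counts quoted above for Figure 1; the helper names `participation_ratio` and `participation_index` are illustrative assumptions, and exact fractions are used so that the minimum is not affected by floating-point rounding.

```python
from fractions import Fraction


def participation_ratio(participating, total):
    """Equation 1: participating instances of a feature over all its instances."""
    return Fraction(participating, total)


def participation_index(ratios):
    """Equations 2 and 3: the minimum participation ratio over the pattern's features."""
    return min(ratios)


# Global pattern {f_A, f_B}, with the counts quoted for Figure 1(a).
pr_a = participation_ratio(9, 17)
pr_b = participation_ratio(8, 12)
pi_global = participation_index([pr_a, pr_b])

# Regional pattern [r_2, {f_A, f_B}], with the counts quoted for region r_2.
pi_r2 = participation_index([participation_ratio(4, 4), participation_ratio(3, 4)])
```

The same two functions serve both the global and the regional case; only the counts passed in change, which mirrors how Equation 3 restricts Equation 2 to a region.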

Colocation patterns [16] are the set of prevalent colocation candidates (based on a prevalence measure, e.g., $pi$), i.e., candidates composed of features having a high positive spatial interaction. A regional-colocation pattern [12] is a paired region ($r_g$) and colocation pattern ($C$), i.e., $\langle r_g, C \rangle$, where the features in pattern $C$ have a high positive spatial interaction in $r_g$.

2.2 Statistical Significance in Colocation Detection:

Statistical significance testing for colocation determines whether an observed positive spatial interaction between features is statistically significant or could have been observed if the features were in complete spatial randomness (CSR). CSR has the following properties:

  • Every feature instance has an equal probability of existing at any point in the study area.

  • The locations of any feature instances in the study area are independent of each other.

A null hypothesis ($H_0$) is a statement of ‘no effect’ or ‘no difference’. In our problem, the null hypothesis represents the scenario under which there is no spatial interaction between the features in the dataset, i.e., their existence is completely independent of each other.

An alternative hypothesis ($H_a$) is a statement that is tested against a null hypothesis. In our problem, an alternative hypothesis represents the scenario under which there is a positive spatial interaction between the features in the dataset in a region of interest.

A Type-I error refers to the erroneous rejection of a null hypothesis that is actually true (a false positive). In our problem, this would refer to incorrectly labeling a candidate regional-colocation pattern as statistically significant, even though there is a high probability of this pattern being found in CSR, i.e., under $H_0$.

A Type-II error refers to the failure to reject a null hypothesis ($H_0$) that is actually false (a false negative). This would translate into incorrectly labeling a candidate regional-colocation pattern as not statistically significant.

A point distribution is a collection of geo-distributed points referring to an event (e.g., road accident) in a spatial domain. A point process ($PP$) is a statistical process that defines the probability distribution of a point over a region. Point processes are essential for defining the null or alternative hypothesis for our statistical significance test.

A Poisson point process is defined in a generalized space $S_P$ with intensity $\Lambda$ and has the following properties:

  1. The number of points in a bounded Borel set (bounded sets that can be constructed from open or closed sets by repeatedly taking countable unions and intersections) $B \subset S_P$ is a Poisson random variable with mean $\Lambda(B)$.

  2. The number of points in $n$ disjoint Borel sets forms $n$ independent random variables. This property results in independent scattering or complete independence.

Null hypothesis generation:

  • For an identical distribution, we generate an equal number of instances of each feature in every partition using summary statistics of the constituent features of the pattern. This ensures that the null hypothesis datasets (although in CSR) closely model the observed dataset in each atomic partition.

  • For independence, we sample instances from a Poisson point process [11]. To check for acceptable auto-correlation, we use a pair correlation function (PCF), $g(d)$, up to a distance $d$, where $d$ is data-driven. A value $g(d) > 1$ suggests clustering at distance $d$ among the feature instances, while $g(d) = 1$ represents CSR.
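One way to picture a single null-hypothesis dataset is sketched below. It relies on the standard fact that a homogeneous Poisson point process, conditioned on its number of points, places them independently and uniformly. The rectangular partition, the per-feature counts, and the name `csr_sample` are assumptions made for this sketch; the paper's own generator may differ, for example for irregular county boundaries.

```python
import random


def csr_sample(counts, bounds, rng):
    """Draw one CSR dataset: counts[f] independent uniform points per feature type.

    bounds is (x_min, y_min, x_max, y_max) for an assumed rectangular partition.
    """
    x_min, y_min, x_max, y_max = bounds
    return {
        feature: [
            (rng.uniform(x_min, x_max), rng.uniform(y_min, y_max))
            for _ in range(n)
        ]
        for feature, n in counts.items()
    }


rng = random.Random(0)  # fixed seed so the sketch is reproducible
# Hypothetical counts: the same number of instances as observed in the partition.
null_sample = csr_sample({"A": 4, "B": 4}, bounds=(0.0, 0.0, 10.0, 10.0), rng=rng)
```

Keeping the per-feature counts equal to the observed counts is what lets each simulated dataset be compared against the observed participation index on equal footing.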

Statistical significance test: Since the participation index ($pi$) is used to quantify the strength of spatial interaction, the objective is to determine the probability, under the null hypothesis, of a participation index at least as large as the pattern’s $pi$ in the observed data. Let $pi_{\emptyset}(C)$ denote the participation index for pattern $C$ under the null hypothesis and $pi_{obs}(C)$ represent the participation index for candidate colocation $C$ in the observed data. Then, we compute the following probability [1]:

$$p = pr\big( pi_{\emptyset}(C) \geq pi_{obs}(C) \big) = \frac{R^{\geq pi_{obs}} + 1}{R + 1} \tag{4}$$

where $R^{\geq pi_{obs}}$ represents the number of Monte Carlo simulations in which the participation index under the null hypothesis ($pi_{\emptyset}(C)$) for pattern $C$ is greater than or equal to that in the observed data ($pi_{obs}(C)$), and $R$ refers to the total number of Monte Carlo simulations. If $p \leq \alpha$, we consider $pi_{obs}(C)$ statistically significant at level $\alpha$.
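Equation 4 can be sketched as a small function. The null participation indices below are made-up values chosen only to exercise the formula (they are not results from the paper), and the name `monte_carlo_p_value` is an assumption.

```python
def monte_carlo_p_value(pi_obs, null_pis):
    """Equation 4: (number of null pi values >= pi_obs, plus 1) / (R + 1)."""
    r_ge = sum(1 for pi_null in null_pis if pi_null >= pi_obs)
    return (r_ge + 1) / (len(null_pis) + 1)


# Illustrative only: R = 99 simulated participation indices, 4 of which
# reach or exceed the observed value of 3/4.
null_pis = [1 / 3] * 95 + [0.8] * 4
p = monte_carlo_p_value(0.75, null_pis)
is_significant = p <= 0.05
```

The added 1 in the numerator and denominator counts the observed dataset itself as one draw, so the estimated $p$-value can never be exactly zero.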

Regional statistical significance test: To test for regional significance, we use simulated (i.e., computer-generated) candidate regions. For example, to determine whether the colocation of $f_A$ and $f_B$ is statistically significant in locality $r_2$ (Figure 1(b)), we generate null hypothesis samples within the locality’s boundaries and use the participation index from each sample to perform the significance test for $r_g$. Figures 3(a) and 3(b) display two of the $R$ different null hypothesis samples, which are compared against the participation index of the regional-colocation pattern $[r_2, \{f_A, f_B\}]$ in the observed data. As shown, the participation index for the pattern $[r_2, \{f_A, f_B\}]$ is $1/3$ in both null hypothesis samples. For a statistical confidence level of 95%, the following inequality should hold:

$$\Big[ \sum_{r=1}^{R=99} \mathds{1}\big( pi_{obs}([r_g, \{f_A, f_B\}]) \leq pi_{\emptyset_r}([r_g, \{f_A, f_B\}]) \big) \Big] < 5 \tag{5}$$

where $r_g$ is the region of interest and $\mathds{1}$ is an indicator function. We can compute $R$ from $\alpha = 0.05$ using $\alpha(R+1) = 5$ [2].
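As a quick check of how Equations 4 and 5 fit together, the following worked step uses only the values stated above ($\alpha = 0.05$, $R = 99$) and the fact that $R^{\geq pi_{obs}}$ is an integer count:

```latex
\begin{align*}
\alpha (R + 1) = 5 \;&\Rightarrow\; R = \frac{5}{0.05} - 1 = 99, \\
p = \frac{R^{\geq pi_{obs}} + 1}{R + 1} \leq 0.05
  \;&\Leftrightarrow\; R^{\geq pi_{obs}} + 1 \leq 5
  \;\Leftrightarrow\; R^{\geq pi_{obs}} < 5 .
\end{align*}
```

In other words, at most 4 of the 99 null samples may reach or exceed the observed participation index for the pattern to be significant at the 95% level.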

Figure 3: Two example null hypotheses for significance testing of the observed data in Figure 1

The multiple comparisons problem [14] occurs when every inference in a set of statistical inferences simultaneously has the potential to produce a discovery. The more inferences are made on a particular data set, the more likely it is to incorrectly reject the null hypothesis. Most techniques to address this problem require a stricter significance threshold for individual comparisons to compensate for the number of inferences being made. A stated confidence level generally applies to individual tests. It is often desirable to have a confidence level for a whole family of simultaneous tests.

The Bonferroni correction [3] is a method to address the multiple comparisons problem and the simplest method for reducing Type-I errors. It is a conservative method with a greater risk of failure to reject a false null hypothesis, thus resulting in Type-II errors.
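To make the effect of the correction concrete, the sketch below compares the family-wise error rate with and without a Bonferroni-adjusted threshold of $\alpha/m$ for $m$ simultaneous tests. The closed form $1 - (1 - \alpha)^m$ assumes the $m$ tests are independent, which is a simplifying assumption for illustration only; the Bonferroni bound itself does not need independence. The function names and the choice $m = 10$ are hypothetical.

```python
def bonferroni_threshold(alpha, m):
    """Per-test significance threshold under a Bonferroni correction."""
    return alpha / m


def fwer_independent(alpha_per_test, m):
    """Probability of at least one Type-I error across m independent tests."""
    return 1.0 - (1.0 - alpha_per_test) ** m


alpha, m = 0.05, 10  # m is a hypothetical number of simultaneous tests
uncorrected = fwer_independent(alpha, m)  # roughly 0.40
corrected = fwer_independent(bonferroni_threshold(alpha, m), m)  # stays below alpha
```

With ten tests, the uncorrected chance of at least one false discovery is already about 40%, while the corrected per-test threshold of 0.005 keeps the family-wise rate under 0.05, at the cost of rejecting fewer null hypotheses.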

2.3 Formal problem formulation

The problem of statistically significant regional-colocation pattern detection is as follows:
Input:

  1. A set ($F$) of spatial features.

  2. $N$ geo-located spatial feature instances.

  3. A study area $S_A$ composed of space partitions (e.g., counties).

  4. A statistical significance level $\alpha$.

  5. A neighbor relationship ($\mathcal{R}$).

Output: Statistically significant regional-colocation patterns $\langle r_g, C \rangle$, where $C \subset F$.
Objective: Reducing Type-I error (false positives).
Constraints: Higher statistical confidence of output patterns.

Reasoning behind problem output: Testing for statistical significance on regional-colocation outputs ensures that spurious patterns are not detected from the dataset. Otherwise, regions may be enumerated merely due to a high density of feature instances or spatial auto-correlation. In addition, significance testing for the union of many partitions leads to multiple statistical inferences. Because of the union of partitions, the probability of finding chance patterns within the bigger region (i.e., the union of partitions) is higher; this phenomenon is not accounted for by the $p$-value threshold for a single partition. This leads to the multiple comparisons problem, resulting in a higher false discovery rate. In application domains related to regional-colocation pattern detection, reducing Type-I errors (false positives) takes priority over reducing Type-II errors (false negatives). These Type-II errors might result in missing the detection of certain patterns which might have a lower $p$-value.

In this situation, checking for a particular $\alpha$ level in the individual statistical inferences is insufficient. We also need to control the family-wise error rate, which is the probability of making one or more false discoveries (Type-I errors) [14]. We use the Bonferroni correction in MultComp-RCM to tackle this problem arising from multiple hypothesis tests. This conservative approach ensures that the output patterns have high statistical confidence while ignoring patterns that might have comparatively lower confidence, which is our primary objective. Another benefit of this method is its computational efficiency, due to the smaller number of significance tests and participation index computations required compared to the baseline [9]. However, the Bonferroni correction imposes stricter $p$-value thresholds, which might be a bottleneck for large-scale applications, such as when dealing with hundreds of atomic partitions. This may also lead to a higher possibility of false negatives (Type-II errors).

3 Methodology

To keep the paper self-contained, we first briefly review the SSRCM, our previous statistically significant regional-colocation miner [9], and a sub-routine on significance testing. We then describe the proposed approach in Section 3.2 and provide an example highlighting the computational cost savings of the new approach.

3.1 Statistically Significant Regional-Colocation Miner

Key idea: In [9], we started by considering partitions with at least 3 instances of each feature comprising the regional-colocation pattern. This ensures that the features constituting the pattern all have a considerable presence in the enumerated partitions. We then use the regional statistical significance test described in Algorithm 1 to determine the atomic footprints of the pattern, i.e., the individual partitions within which the pattern is statistically significant. While computing the participation index, we limit our neighborhood to an empirically determined distance ($d$) to mine meaningful colocated features.

Algorithm 1 Significance testing

Input:
      - A spatial dataset $S$ consisting of features $\{f_A, f_B, \ldots\}$
      - A study area $S_A$ and an atomic partition $r_g \subset S_A$
      - Statistical significance level $\alpha$
      - A candidate colocation pattern $C$
      - A set of $R$ null hypothesis ($NH_\emptyset$) data, each modelled as colocation $C$ in atomic partition $r_g$
      - Distance $d$ for participation index ($pi$) calculation

Output:
      1. Whether $<r_g, C>$ is significant or not
      2. $p$-$value_C^{r_g}$

1: procedure Significance Testing
2:     Statistically significant result $SSR_C^{r_g} \leftarrow$ False
3:     Counter $R^{\geq pi_{obs}} \leftarrow 0$
4:     Calculate $pi_{obs}$ for $C$ at $d$ in $r_g$
5:     for $i \in [1, R]$ do
6:         Calculate the $pi_{\emptyset,i}$ of $C$ at $d$ in the $i^{th}$ $NH_\emptyset$
7:         if $pi_{\emptyset,i} \geq pi_{obs}$ then
8:             $R^{\geq pi_{obs}} \leftarrow R^{\geq pi_{obs}} + 1$
9:     $p$-$value_C^{r_g} = \frac{R^{\geq pi_{obs}} + 1}{R + 1}$
10:    if $p$-$value_C^{r_g} \leq \alpha$ then
11:        $SSR_C^{r_g} \leftarrow$ True ▷ (i.e., $<r_g, C>$ is statistically significant)
12:    else
13:        $SSR_C^{r_g} \leftarrow$ False ▷ (i.e., $<r_g, C>$ is not statistically significant)
14:    return $SSR_C^{r_g}$, $p$-$value_C^{r_g}$
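As an illustration, the steps of Algorithm 1 can be sketched in Python. This is a minimal sketch under stated assumptions, not the authors' implementation: the names `significance_test`, `participation_index`, and `null_datasets` are hypothetical, and the participation index computation itself is passed in as a function because it is not specified here.

```python
from typing import Callable, Sequence, Tuple, TypeVar

D = TypeVar("D")  # whatever type represents the data inside one partition


def significance_test(
    observed: D,
    null_datasets: Sequence[D],
    participation_index: Callable[[D], float],
    alpha: float = 0.05,
) -> Tuple[bool, float]:
    """Monte Carlo significance test mirroring the steps of Algorithm 1."""
    pi_obs = participation_index(observed)
    # Count the null datasets whose participation index reaches pi_obs.
    at_least_obs = sum(
        1 for null in null_datasets if participation_index(null) >= pi_obs
    )
    # p-value = (R_{>= pi_obs} + 1) / (R + 1), where R = len(null_datasets).
    p_value = (at_least_obs + 1) / (len(null_datasets) + 1)
    return p_value <= alpha, p_value


# Toy usage with made-up numbers: each "dataset" is already reduced to its
# participation index, so the hypothetical participation_index is the identity.
null_pis = [0.05 + 0.01 * i for i in range(19)]  # 19 simulated null values
significant, p_value = significance_test(0.9, null_pis, lambda x: x)
weak_significant, weak_p_value = significance_test(0.105, null_pis, lambda x: x)
```

Note that with $R$ null datasets the smallest attainable $p$-value is $1/(R+1)$, so $R$ bounds how strict a threshold the test can ever satisfy.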

To find the union of partitions, we first form an undirected unweighted graph $G = (V, E)$, where each vertex in $V$ refers to a partition within which the pattern is statistically significant and an edge in $E$ between two vertices represents a shared boundary between them. The graph representation allows the use of graph traversal algorithms (e.g., DFS) to find statistically significant regions that are the union of partitions in $V$.

We note that the union of two atomic footprints within which a candidate regional colocation pattern is statistically significant does not imply that the resultant footprint is a significant regional colocation pattern. Thus, we must recompute the $pi$ for the candidate pattern in the new region and perform the significance test again. As we progress along the edges of $G$, the final output is a larger region composed of contiguous atomic partitions such that the candidate pattern is statistically significant, both within the atomic partitions and in the region formed by the union of the output atomic footprints. This is represented by the largest connected component. Algorithm 2 provides the pseudo-code of SSRCM to find statistically significant regional colocations.

Algorithm 2 Statistically Significant Regional-Colocation Miner (SSRCM)

Input:
      - A spatial dataset $S$ consisting of features $\{f_A, f_B, \ldots\}$
      - A study area $S_A$ and a space partitioning $R_g$
      - Statistical significance level $\alpha$
      - Maximum pattern size $N$
      - Lower bound $LB$ (in meters)
      - Upper bound $UB$ (in meters)

Output:
      1. List of statistically significant regional colocation patterns $[<r_g, C>]$

Variables:
      Distance between feature instances $d$

1: procedure Statistically Significant Regional-Colocation Miner
2:     for each $f_k$ in $\{f_A, f_B, \ldots\}$ do
3:         Generate $R$ null hypotheses ($NH_\emptyset$) using summary statistics in each $r_g \in R_g$.
4:     for each candidate pattern $C_m \in \{C_1, C_2, \ldots, C_M\}$ do
5:         for distance $d \in [LB, LB + 10, \ldots, UB]$ do
6:             for each $r_g \in R_g$ do
7:                 $SSR_{C_m}^{r_g}$, $p$-value $\leftarrow$ Significance Testing($S$, $r_g$, $\alpha$, $C_m$, $NH_\emptyset$, $d$)
8:                 if $SSR_{C_m}^{r_g}$ is True then
9:                     Insert $r_g$ in significant atomic partitions list
10:            Compose neighborhood graph ($G$) from significant atomic partitions list
11:            $r_g^{final} \leftarrow r_g^{maxPI}$ ▷ atomic partition in $G$ with highest $pi$
12:            for each $r_g \in$ Depth First Graph Traversal of $G$
13:            from vertices adjacent to $r_g^{maxPI}$ do ▷ $r_g \neq r_g^{maxPI}$
14:                $r_g^{temp} \leftarrow r_g^{final} \cup r_g$
15:                $SSR_{C_m}^{r_g}$, $p$-value $\leftarrow$ Significance Testing($S$, $r_g^{temp}$, $\alpha$, $C_m$, $NH_\emptyset$, $d$)
16:                if $SSR_{C_m}^{r_g}$ is True then
17:                    $r_g^{final} \leftarrow r_g^{temp}$
18:            Add $<r_g^{final}, C_m>$ to $[<r_g, C>]$
19:    return $[<r_g, C>]$
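The region-growing loop of Algorithm 2 (lines 11-17) can be sketched as follows. This is an illustrative sketch only: `grow_region_ssrcm` and the `is_significant` callback are hypothetical names, and the callback stands in for recomputing $pi$ on the union and rerunning Algorithm 1. The sketch follows the pseudo-code literally, so the traversal continues over $G$ whether or not a given union succeeds.

```python
from typing import Callable, Dict, FrozenSet, List, Tuple


def grow_region_ssrcm(
    adjacency: Dict[str, List[str]],
    pi: Dict[str, float],
    is_significant: Callable[[FrozenSet[str]], bool],
) -> Tuple[FrozenSet[str], int]:
    """Greedy region growth along a depth-first traversal of G."""
    start = max(pi, key=pi.get)  # atomic partition with the highest pi
    region = frozenset([start])
    visited = {start}
    stack = list(reversed(adjacency.get(start, [])))
    extra_tests = 0
    while stack:
        node = stack.pop()
        if node in visited:
            continue
        visited.add(node)
        candidate = region | {node}
        extra_tests += 1  # one pi recomputation and one test per attempted union
        if is_significant(candidate):
            region = candidate
        # Keep traversing regardless of the test outcome, as in the pseudo-code.
        stack.extend(
            v for v in reversed(adjacency.get(node, [])) if v not in visited
        )
    return region, extra_tests


# Toy usage: a chain A - B - C where any union containing C fails the test.
adjacency = {"A": ["B"], "B": ["A", "C"], "C": ["B"]}
pi = {"A": 0.9, "B": 0.5, "C": 0.4}
region, extra_tests = grow_region_ssrcm(adjacency, pi, lambda r: "C" not in r)
```

The `extra_tests` counter makes the cost explicit: every attempted union triggers another participation index computation and another significance test.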

3.2 Multiple Comparisons Regional Colocation Miner (MultComp-RCM)

Key Idea: The baseline SSRCM computes a significance test for every union of statistically significant partitions, resulting in many participation index ($pi$) computations and significance tests. We address this by using a Bonferroni correction, which selects atomic partitions conservatively, increasing the chances that their union is also statistically significant. The Bonferroni correction reduces the need for a regional statistical significance test for each union operation.

A Bonferroni correction is used when several statistical inferences are being performed simultaneously. Although a given significance threshold ($\alpha$) on the $p$-value may be appropriate for an individual test, it is not sufficient for the set of all comparisons. To avoid many false positives, $\alpha$ needs to be lowered to account for the number of comparisons performed. The Bonferroni correction either sets the statistical significance threshold for each of the $n$ comparisons to $\alpha/n$ or, equivalently, multiplies each $p$-value by $n$ and then applies the standard threshold $\alpha$. This conservative correction holds regardless of how the $n$ tests depend on one another (including the case where all $n$ tests are independent).
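The two equivalent forms of the correction can be checked with a short sketch. The function names and the $p$-values below are made up for illustration.

```python
from typing import List, Sequence


def bonferroni_reject(p_values: Sequence[float], alpha: float = 0.05) -> List[bool]:
    """Form 1: compare each raw p-value with the lowered threshold alpha / n."""
    threshold = alpha / len(p_values)
    return [p <= threshold for p in p_values]


def bonferroni_adjust(p_values: Sequence[float]) -> List[float]:
    """Form 2: multiply each p-value by n (capped at 1), then compare with alpha."""
    n = len(p_values)
    return [min(1.0, p * n) for p in p_values]


p_values = [0.001, 0.02, 0.04, 0.3]  # made-up values for illustration
by_threshold = bonferroni_reject(p_values, alpha=0.05)
by_adjustment = [p <= 0.05 for p in bonferroni_adjust(p_values)]
```

With four comparisons the threshold drops to 0.0125, so only the first made-up $p$-value remains significant under either form, even though three of the four would pass an uncorrected 0.05 threshold.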

In the proposed approach, we first check for statistical significance in each input partition. Then, we perform a graph traversal starting from the atomic partition with the highest $pi$ value for the candidate regional-colocation pattern. Instead of recomputing the $pi$ and testing the candidate pattern in the new bigger region (composed of atomic partitions) for statistical significance, we perform a Bonferroni correction. Thus, if we were initially checking for a threshold level of $0.05$, we would be checking for a threshold level of $0.05/2$ in each atomic partition for the union of two partitions. This conservative threshold reduces Type-I error by returning regions with much higher statistical confidence. The union of the atomic partitions is sequential, and every atomic partition must satisfy the adjusted $p$-value threshold to be considered for the union.

Algorithm 3 provides a snippet of MultComp-RCM showing the use of the Bonferroni correction. Lines 13-18 show the new steps in the refined approach.

Algorithm 3 MultComp-RCM snippet
1: procedure MultComp-RCM
2:     $\vdots$
12:    $n \leftarrow 1$ ▷ Number of atomic partitions in the region.
13:    for each $r_g \in$ Depth First Graph Traversal of $G$
14:    from vertices adjacent to $r_g^{maxPI}$ do ▷ $r_g \neq r_g^{maxPI}$
15:        Subgraph ($SG$) $\leftarrow r_g^{final} \cup r_g$
16:        flag $\leftarrow 1$
17:        flag $\leftarrow$ BONF_CHECK(flag, $SG$, $n$)
18:        if flag == 1 then
19:            Update $r_g^{final} \leftarrow SG$
20:            $n \leftarrow n + 1$
21:    Add $<r_g^{final}, C_m>$ to $<r_g^C, C>$
22:    $\vdots$
23: procedure BONF_CHECK(flag, $SG$, $n$)
24:    $p$-$value_{threshold} \leftarrow \alpha/(n+1)$
25:    for each $node \in SG$ do
26:        if $p$-value of pattern $C_m$ in $node$ $\not\leq$ $p$-$value_{threshold}$ then
27:            flag $\leftarrow 0$
28:    return flag
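The snippet in Algorithm 3 can be sketched in Python as follows. This is an illustrative sketch with hypothetical names (`bonf_check`, `grow_region_multcomp`) and made-up numbers. It assumes the per-partition $p$-values were already produced by Algorithm 1, so merging only compares stored values against the tightened threshold.

```python
from typing import Dict, List, Sequence


def bonf_check(p_values: Sequence[float], alpha: float, n: int) -> bool:
    """True if every partition in the candidate union passes alpha / (n + 1)."""
    threshold = alpha / (n + 1)
    return all(p <= threshold for p in p_values)


def grow_region_multcomp(
    adjacency: Dict[str, List[str]],
    pi: Dict[str, float],
    p_value: Dict[str, float],
    alpha: float = 0.05,
) -> List[str]:
    """Region growth that reuses atomic p-values instead of retesting unions."""
    start = max(pi, key=pi.get)  # atomic partition with the highest pi
    region = [start]
    n = 1  # number of atomic partitions currently in the region
    visited = {start}
    stack = list(reversed(adjacency.get(start, [])))
    while stack:
        node = stack.pop()
        if node in visited:
            continue
        visited.add(node)
        subgraph = region + [node]
        if bonf_check([p_value[v] for v in subgraph], alpha, n):
            region = subgraph
            n += 1
        stack.extend(
            v for v in reversed(adjacency.get(node, [])) if v not in visited
        )
    return region


# Toy usage: partition A borders B, C, and D; B also borders C.
adjacency = {"A": ["B", "C", "D"], "B": ["A", "C"], "C": ["A", "B"], "D": ["A"]}
pi = {"A": 0.9, "B": 0.6, "C": 0.5, "D": 0.4}
p_value = {"A": 0.001, "B": 0.01, "C": 0.03, "D": 0.012}
region = grow_region_multcomp(adjacency, pi, p_value)
```

In this toy run, C is significant at the original 0.05 level but fails the tightened threshold of 0.05/3, so it is left out, while D still passes and is merged.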

Figure 4 shows an execution trace of merging 4 neighboring partitions. Each region has a participation index and a $p$-value computed individually. Then, Steps 1-4 show the process of combining these partitions based on either additional statistical significance tests and participation index computations (for SSRCM) or a tighter $p$-value threshold (for MultComp-RCM). Table 2 compares the number of computations for the two approaches and shows the lower computational requirements of MultComp-RCM. When performing the union of two regions using MultComp-RCM, the new threshold given by the Bonferroni correction is applied to each of the two regions (as in procedure BONF_CHECK in Algorithm 3) for a successful union.

Figure 4: Execution trace of SSRCM and MultComp-RCM.
Table 2: Comparing the cumulative number of statistical significance tests (C#), participation index computations (pi cal.), and $p$-value thresholds (p-val. th.) between SSRCM (denoted as $S$) and MultComp-RCM (denoted as $R$).

| Steps | C# $S$ | C# $R$ | pi cal. $S$ | pi cal. $R$ | p-val. th. $S$ | p-val. th. $R$ |
|---|---|---|---|---|---|---|
| 0 | 4 | 4 | 4 | 4 | 0.05 | 0.05 |
| 1 | 5 | 4 | 5 | 4 | 0.05 | 0.05 |
| 2 | 6 | 4 | 6 | 4 | 0.05 | 0.025 |
| 3 | 7 | 4 | 7 | 4 | 0.05 | 0.0167 |
| 4 | 8 | 4 | 8 | 4 | 0.05 | 0.0125 |
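The arithmetic behind Table 2 can be reproduced with a few lines. The counting rule used here is an assumption read off the table rather than stated in the text: SSRCM adds one significance test (and one $pi$ computation) per step, while MultComp-RCM keeps its 4 initial tests and uses the threshold $\alpha/\max(1, step)$. The function name `table2_rows` is hypothetical.

```python
from typing import List, Tuple


def table2_rows(
    n_partitions: int = 4, alpha: float = 0.05
) -> List[Tuple[int, int, int, float, float]]:
    """Rows of (step, tests SSRCM, tests MultComp-RCM, threshold S, threshold R)."""
    rows = []
    for step in range(n_partitions + 1):
        tests_ssrcm = n_partitions + step  # one extra test per step (assumed)
        tests_multcomp = n_partitions      # only the initial atomic tests
        threshold_multcomp = round(alpha / max(1, step), 4)
        rows.append((step, tests_ssrcm, tests_multcomp, alpha, threshold_multcomp))
    return rows


rows = table2_rows()
```

The participation index columns of the table follow the same counts as the significance test columns, so they are not repeated in the tuples.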

4 Theoretical Analysis

Lemma 4.1.

MultComp-RCM has lower or equal Type-I error than SSRCM.

Proof 4.2.

Algorithm 1, called in line 7 of Algorithms 2 and 3, extracts atomic partitions within which a regional-colocation pattern is statistically significant.

The Bonferroni correction in procedure BONF_CHECK in Algorithm 3 controls the experiment-wide false positive rate ($\pi$) by specifying the significance level ($\alpha$) for each test, where a test is significant if $p$-value $\leq \alpha$. The probability of no Type-I error (false positive) in $n$ independent tests is $(1-\alpha)^n$ if each test is at level $\alpha$. Therefore, the probability of at least one false positive is $\pi = 1-(1-\alpha)^n$. For an experiment-wide false positive rate of $\pi$, the $\alpha$ for each test should be $\alpha = 1-(1-\pi)^{1/n}$. Using the binomial approximation $(1-\alpha)^n \simeq 1-n\alpha$, this gives $\alpha = \pi/n$. Since $n \geq 1$, for an experiment-wide false positive rate of $\pi = 0.05$ the per-test $\alpha$ is no larger than $\pi$, i.e., $\alpha \leq \pi$. Therefore, each region and sub-region output by MultComp-RCM has a lower Type-I error and a precision close to 1.
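The approximation step in the proof can be made exact with Bernoulli's inequality. The following worked derivation is a supplementary sketch, not part of the original proof: $\pi_0$ (the target experiment-wide rate) is a symbol introduced here for illustration, and the tests are assumed independent as in the proof.

```latex
\begin{align*}
\mathrm{FWER}(\alpha, n) &= 1 - (1 - \alpha)^n
  && \text{($n$ independent tests, each at level $\alpha$)} \\
(1 - x)^n &\geq 1 - nx
  && \text{(Bernoulli's inequality, $0 \leq x \leq 1$)} \\
\mathrm{FWER}(\pi_0 / n,\, n) &= 1 - (1 - \pi_0 / n)^n \leq \pi_0 \\
\pi_0 = 0.05,\ n = 4: \quad \alpha &= 0.0125, \qquad
  \mathrm{FWER} = 1 - 0.9875^4 \approx 0.0491 \leq 0.05
\end{align*}
```

By the union bound, the inequality $\mathrm{FWER} \leq n\alpha$ also holds without the independence assumption, which is why the corrected threshold remains valid when the tests on neighboring partitions are correlated.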

Lemma 4.3.

MultComp-RCM has lower or equal computational cost than SSRCM for all observed data where Bonferroni-revised $p$-values eliminate lower-confidence candidates considered by the original $p$-value, i.e., $Cost_{MultComp\text{-}RCM} \leq Cost_{SSRCM}$.

Proof 4.4.

Let $C_{pi}(d)$ be the complexity of the participation index ($pi$) computation for a specific region (dependent on the data $d$). Let $C_{st}(pi_{obs}, pi_{null}, d)$ be the complexity of significance testing for a specific region (dependent on the $pi$ in observed data $d$ and the null hypothesis). Assume $N_1$ is the number of space partitions/regions in the dataset, $N_2$ is the number of space partitions extracted from Algorithm 1, and $N_2 \leq N_1$. Further, assume $d_1$ is the initial dataset and $d_2$ is the dataset in each iteration in the SSRCM. Then, the cost of SSRCM is

$N_1 \left( C_{pi}(d_1) + C_{st}(pi_{obs}, pi_{null}, d_1) \right) + N_2 \left( C_{pi}(d_2) + C_{st}(pi_{obs}, pi_{null}, d_2) \right)$ (6)

By contrast, the cost of the proposed MultComp-RCM approach is only $N_1 \left( C_{pi}(d_1) + C_{st}(pi_{obs}, pi_{null}, d_1) \right) + N_2$. Here, $N_2$ represents the number of significant partitions for which the $p$-value needs a comparison against the threshold obtained from the Bonferroni correction.
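Subtracting the two costs makes the comparison in Lemma 4.3 explicit. The step below is a sketch under one added assumption that is not stated in the proof: a single threshold comparison has unit cost, and recomputing the participation index plus rerunning the significance test on a union costs at least that much.

```latex
\begin{align*}
Cost_{SSRCM} - Cost_{MultComp\text{-}RCM}
  &= N_2 \left( C_{pi}(d_2) + C_{st}(pi_{obs}, pi_{null}, d_2) \right) - N_2 \\
  &= N_2 \left( C_{pi}(d_2) + C_{st}(pi_{obs}, pi_{null}, d_2) - 1 \right) \\
  &\geq 0 \quad \text{whenever } C_{pi}(d_2) + C_{st}(pi_{obs}, pi_{null}, d_2) \geq 1 .
\end{align*}
```

The first-stage term $N_1 \left( C_{pi}(d_1) + C_{st}(pi_{obs}, pi_{null}, d_1) \right)$ is common to both algorithms and cancels, so the saving comes entirely from the union stage.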

5 Experimental Evaluation

We had three goals for the experiments: (1) To compare the time taken by SSRCM and MultComp-RCM with a varying number of regional-colocation instances, a varying number of atomic partitions, and a varying number of feature instances. (2) To compare the number of significance tests and $pi$ calculations for a varying number of regions. (3) To compare solution quality between SSRCM and MultComp-RCM.

Experiment design: Figure 5 shows the overall validation framework. The metric for comparing the solution quality of SSRCM with MultComp-RCM was the false positive rate (FPR), while the runtime comparisons were based on the execution time (in seconds) of the individual algorithms. The experiments were done on both real (Safegraph POI) and synthetic data to perform both comparative and sensitivity analysis.

Figure 5: Overall validation framework

Synthetic data generation: We began with a space partitioning ($R_g$), a maximum union (or traversal) of regions ($L_{max}$), and a number of regional-colocation patterns, i.e., pairs of $<r_g, C>$. We then generated reference points within the partitions using the Poisson point process. At each reference point, we generated circles of diameter $d_g$, which was determined empirically for each region in $R_g$ in the observed dataset. The diameter signifies the smallest distance between features in a colocation $C$ at which they become statistically significant regional colocations. We populated each circle with instances of $C$. We note that the circles were only used to place colocated instances in a region and were not a separate partitioning. Figure 6 shows the process of synthetic data generation.
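The generation procedure described above can be sketched for a single rectangular partition. This is a simplified illustration under several assumptions that the text does not specify: the partition is an axis-aligned rectangle, the Poisson point process is homogeneous with a made-up intensity, exactly one instance of each feature is placed per circle, and points near the border are not clipped to the partition. All names are hypothetical.

```python
import math
import random
from typing import List, Sequence, Tuple

Point = Tuple[float, float]
Instance = Tuple[str, float, float]


def poisson_count(lam: float, rng: random.Random) -> int:
    """Draw from Poisson(lam) with Knuth's multiplication method (small lam only)."""
    limit = math.exp(-lam)
    k, prod = 0, rng.random()
    while prod > limit:
        k += 1
        prod *= rng.random()
    return k


def generate_partition(
    bounds: Tuple[float, float, float, float],
    intensity: float,
    features: Sequence[str],
    diameter: float,
    rng: random.Random,
) -> Tuple[List[Point], List[Instance]]:
    """Place colocated feature instances in circles around Poisson reference points."""
    xmin, ymin, xmax, ymax = bounds
    n_refs = poisson_count(intensity * (xmax - xmin) * (ymax - ymin), rng)
    refs = [(rng.uniform(xmin, xmax), rng.uniform(ymin, ymax)) for _ in range(n_refs)]
    instances: List[Instance] = []
    for cx, cy in refs:
        for feature in features:
            # Uniform point inside the circle of the given diameter around the reference.
            radius = (diameter / 2) * math.sqrt(rng.random())
            angle = rng.uniform(0.0, 2.0 * math.pi)
            instances.append(
                (feature, cx + radius * math.cos(angle), cy + radius * math.sin(angle))
            )
    return refs, instances


features = ["A", "B"]
refs, instances = generate_partition(
    (0.0, 0.0, 10.0, 10.0), 0.05, features, 1.0, random.Random(42)
)
```

Because every instance lies within half a diameter of its reference point, any two instances generated at the same reference point are at most one diameter apart, which is what makes them colocated at distance $d_g$.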

Figure 6: Synthetic data generation process.

Comparative Analysis: Figure 7 (panel "Number of reg. col. instances") shows the time taken (in log scale) for different numbers of regional-colocation instances. For this experiment, we varied the number of regional-colocation instances in each atomic partition from 4 to 84 while keeping other parameters (like the number of regions) constant and recorded the execution time of both algorithms. Figure 7 (panel "Number of regions") compares the execution time with a varying number of atomic partitions (or regions) while keeping the number of regional-colocation instances in each partition constant. Figure 7 (panel "Number of feature instances") shows the time taken with a varying number of feature instances (which constitute the regional-colocation pattern) in each region while keeping the number of regions constant. In all experiments, MultComp-RCM is much faster than the baseline SSRCM. These results are consistent with Lemma 4.3, which says that $Cost_{MultComp\text{-}RCM} \leq Cost_{SSRCM}$.

[Figure 7 panels: Number of reg. col. instances; Number of regions; Number of feature instances; Number of significance tests; Number of PI calculations]

Figure 7: MultComp-RCM outperforms SSRCM [9]

Sensitivity Analysis: The fourth panel of Figure 7 (number of significance tests) shows the number of significance tests performed by both algorithms with a varying number of regions, while keeping the number of regional-colocation instances in each partition constant. The fifth panel (number of PI calculations) shows the number of participation index computations performed with a varying number of regions under the same constant parameters. In both cases, the proposed MultComp-RCM requires fewer significance tests and participation index computations as the number of regions increases.

Solution Quality: We performed controlled experiments on synthetic datasets to compare the solution quality of MultComp-RCM with that of SSRCM. The comparison metric was the false positive rate (FPR).

$FPR = \frac{FP}{FP + TN}$, where $FP$ is the number of false positives and $TN$ is the number of true negatives. Table 3 shows the experimental results. As shown, MultComp-RCM exhibits a lower false positive rate than SSRCM. This is mainly because MultComp-RCM eliminates regions that barely pass the atomic significance test (borderline statistical confidence) in Algorithm 1, which SSRCM fails to reject in the final output.
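As a small illustration of the metric, the sketch below computes the FPR from a set of reported regions and a known ground truth, as would be available for synthetic data. The function name, argument names, and toy region identifiers are hypothetical and are not taken from the experiments.

```python
def false_positive_rate(reported, truly_significant, candidates):
    """FPR = FP / (FP + TN) over a collection of candidate regions.

    reported          : regions an algorithm outputs as significant
    truly_significant : regions where the pattern was actually planted
    candidates        : all regions that were evaluated
    """
    reported = set(reported)
    truth = set(truly_significant)
    # False positive: reported as significant but not planted.
    fp = sum(1 for c in candidates if c in reported and c not in truth)
    # True negative: neither reported nor planted.
    tn = sum(1 for c in candidates if c not in reported and c not in truth)
    if fp + tn == 0:
        raise ValueError("FPR is undefined when there are no negative regions")
    return fp / (fp + tn)


# Toy example: 10 regions, 2 planted, 3 reported (one of them spurious).
example_fpr = false_positive_rate(
    reported={0, 1, 2}, truly_significant={0, 1}, candidates=range(10)
)
```

Here 8 regions are true negatives or false positives, one of which was wrongly reported, so the toy FPR is 1/8.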

Table 3: MultComp-RCM generates fewer false positives
| Pattern | SSRCM False Positive Rate | MultComp-RCM False Positive Rate |
| --- | --- | --- |
| A, B, C | 0.15 | 0.03 |
| A, B | 0.17 | 0.01 |
| B, C | 0.14 | 0.01 |
| A, C | 0.19 | 0.04 |

6 Case Study

We extended our previous case study [9] to show the effectiveness of the proposed approach.

Dataset: We used data from SafeGraph, a mobility data vendor that provides anonymized, aggregated location data to researchers studying the effects of COVID-19 on citizen mobility patterns toward numerous points of interest (POIs). The dataset consists of 1473 retail brands in Minnesota. Experiments were performed on colocation patterns consisting of two (e.g., Jimmy John’s, McDonald’s) or three (e.g., Jimmy John’s, McDonald’s, Subway) features. Our null hypothesis generation followed the procedure described in Section 2.2.

Case Study Results: The pattern $C := \{\text{Jimmy John's}, \text{McDonald's}, \text{Subway}\}$ was found to be statistically significant when the distance between feature instances was about $1400$ meters. The regional footprint was the union of ‘Dakota’ and ‘Hennepin’ Counties. The participation index ($pi$) values in the two counties were $0.34$ and $0.45$, respectively. The $p$-values for the pattern within the counties were $0.02$ and $0.01$, satisfying the $p$-value threshold of $\frac{0.05}{2}$ as per the Bonferroni correction for the two partitions. A few additional significant patterns are shown in Table 4 (values rounded to two decimal places).
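The Bonferroni check used in this example can be sketched in a few lines. The function name is hypothetical, and treating a $p$-value equal to the threshold as significant (`<=`) is an assumption of this sketch; the text does not state whether the comparison is strict.

```python
def bonferroni_significant(p_values, alpha=0.05):
    """Compare each partition's p-value against alpha / m, where m is the
    number of partitions tested simultaneously.

    p_values : dict mapping partition name -> p-value
    Returns (threshold, dict mapping partition name -> bool).
    """
    m = len(p_values)
    if m == 0:
        raise ValueError("at least one p-value is required")
    threshold = alpha / m
    # Assumption: a p-value equal to the threshold counts as significant.
    decisions = {name: p <= threshold for name, p in p_values.items()}
    return threshold, decisions


# Values reported above for {Jimmy John's, McDonald's, Subway} at about 1400 m.
threshold, decisions = bonferroni_significant({"Dakota": 0.02, "Hennepin": 0.01})
```

With two partitions the threshold is 0.05 / 2 = 0.025, so both reported $p$-values pass. A $p$-value of 0.04 would pass an uncorrected 0.05 test but fail here, which is the mechanism by which the correction reduces false discoveries.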

Table 4: Regional-colocation patterns found to be statistically-significant at distance $d$.
| Colocated features | Counties (participation index, $p$-value) | $d$ |
| --- | --- | --- |
| {Caribou Coffee, Starbucks} | Hennepin (0.34, 0.01) | 200 m |
| {Caribou Coffee, Starbucks} | Carver (0.5, 0.02), Hennepin (0.51, 0.01), Washington (0.41, 0.01) | 400 m |
| {Caribou Coffee, Starbucks, Dunn Bros} | Hennepin (0.52, 0.01) | 1900 m |
| {Caribou Coffee, Starbucks, Dunn Bros} | Hennepin (0.72, 0.01), Washington (0.36, 0.02) | 3000 m |
| {Jimmy John’s, McDonald’s} | Hennepin (0.39, 0.01) | 500 m |
| {Jimmy John’s, McDonald’s} | Dakota (0.36, 0.02), Hennepin (0.51, 0.01) | 700 m |
| {Jimmy John’s, McDonald’s, Subway} | Dakota (0.34, 0.02), Hennepin (0.45, 0.01) | 1400 m |
| {Jimmy John’s, McDonald’s, Subway} | Dakota (0.47, 0.02), Hennepin (0.57, 0.01), Washington (0.43, 0.02) | 1500 m |

In our previous paper [9], we compared SSRCM with the Quad and QGFR algorithms [12], whose data-aware space partitioning approach is based on the minimum orthogonal bounding rectangle (MOBR). We found that the MOBR-based approach with a participation index threshold of $0.6$ produced $3368$ potential localities for the pattern {$r_g$, [Caribou Coffee, Starbucks]}. At a confidence level of $95\%$, the MOBR-based approach resulted in $2917$ significant and $451$ non-significant patterns. Hence, a regional-colocation miner without statistical significance testing may enumerate output regions where colocations occurred by chance.

7 Related Work and Discussion

Related Work: The concept of colocation was introduced by Shekhar et al. [16]. Huang et al. [10] provided extensive experiments and rigorous discussions regarding the topic and the participation index as a prevalence measure between constituent features. Later, Barua et al. [1] introduced statistical significance testing in global colocation and segregation pattern detection to avoid enumerating chance patterns in the dataset, for both aggregation and segregation patterns, but did not consider regional (or local) patterns. Li et al. [12] studied regional colocation using a minimum orthogonal bounding rectangle (MOBR) based approach, while [17] and [4] focused on shapes and zonal patterns, respectively. These methods applied a threshold on the participation index ($pi$) without statistical significance testing, leading to the detection of spurious patterns (as discussed in [9]). We recently proposed a subgraph-based approach [9] that incorporates statistical significance in detecting regional colocation patterns. This approach reduced the number of spurious patterns detected by previous methods. However, due to the large number of simultaneous statistical inferences, an increase in false discoveries was also observed. In addition, other patterns [15] and several statistical significance and false discovery reduction techniques have been studied in association rule mining [18], [6]. However, these approaches do not address the inherent variability in spatial data (i.e., different summary statistics of features in each atomic partition). To find subgroups of items that are generally observed to be statistically significant associations, they compare a quality measure (which assigns a numeric value to each itemset) on the subgroup against that in a statistical model (which corresponds to the null hypothesis). These null hypotheses for significance testing are uniform and do not address spatial variability. Thus, these approaches are not directly applicable to regional colocation patterns (more details in Appendix A).

8 Conclusion and Future Work

In this paper, we refined the statistically significant regional-colocation pattern ($SSCRP$) problem. We proposed a robust MultComp-RCM approach that reduces the number of false positives using a Bonferroni correction. We showed theoretically, and supported with experimental results, that MultComp-RCM has a Type-I error and computational cost lower than or equal to those of SSRCM. We extended the previous case study on retail establishments in Minnesota using the proposed approach, showing a contrast between significant and non-significant patterns.
Future Work: We plan to explore other methods to reduce Type-I errors (false positives) while also addressing Type-II errors (false negatives) arising from the conservative Bonferroni correction, and to add a temporal dimension to these patterns.

References

  • [1] Sajib Barua and Jörg Sander. Mining statistically significant co-location and segregation patterns. IEEE TKDE, 26(5):1185–1199, 2013.
  • [2] Julian Besag and Peter J Diggle. Simple Monte Carlo tests for spatial pattern. Journal of the Royal Statistical Society: Series C (Applied Statistics), 26(3):327–333, 1977.
  • [3] Carlo Bonferroni. Teoria statistica delle classi e calcolo delle probabilita. Pubblicazioni del R Istituto Superiore di Scienze Economiche e Commerciali di Firenze, 8:3–62, 1936.
  • [4] M Celik et al. Zonal co-location pattern discovery with dynamic parameters. ICDM, 2007.
  • [5] Min Deng et al. Multi-level method for discovery of regional co-location patterns. IJGIS, 2017.
  • [6] Wouter Duivesteijn and Arno Knobbe. Exploiting false discoveries–statistical validation of patterns and quality measures in subgroup discovery. In 2011 IEEE 11th International Conference on Data Mining, pages 151–160. IEEE, 2011.
  • [7] Christoph F. Eick, Rachana Parmar, et al. Finding regional co-location patterns for sets of continuous variables in spatial datasets. In SIGSPATIAL, 2008.
  • [8] Yan Li et al. Cscd: Towards spatially resolving the heterogeneous landscape of mxif oncology data. In Proceedings of the 10th ACM SIGSPATIAL International Workshop on Analytics for Big Geospatial Data, BigSpatial ’22, pages 36–46, New York, NY, USA, 2022. ACM.
  • [9] Subhankar Ghosh, Jayant Gupta, Arun Sharma, Shuai An, and Shashi Shekhar. Towards geographically robust statistically significant regional colocation pattern detection. In Proceedings of the 5th ACM SIGSPATIAL International Workshop on GeoSpatial Simulation, GeoSim ’22, pages 11–20, New York, NY, USA, 2022. Association for Computing Machinery. doi:10.1145/3557989.3566158.
  • [10] Yan Huang et al. Discovering colocation patterns from spatial data sets: a general approach. IEEE TKDE, 16(12):1472–1485, 2004.
  • [11] Janine Illian, Antti Penttinen, et al. Statistical analysis and modelling of spatial point patterns, volume 70. John Wiley & Sons, 2008.
  • [12] Yan Li and Shashi Shekhar. Local co-location pattern detection: a summary of results. In GIScience. Schloss Dagstuhl-Leibniz-Zentrum fuer Informatik, 2018.
  • [13] Guenter B Risse. “A long pull, a strong pull, and all together”: San Francisco and bubonic plague, 1907-1908. Bulletin of the History of Medicine, 66(2):260–286, 1992.
  • [14] G Rupert Jr et al. Simultaneous statistical inference. Springer Science & Business Media, 2012.
  • [15] Arun Sharma, Jayant Gupta, and Subhankar Ghosh. Towards a tighter bound on possible-rendezvous areas: preliminary results. In Proceedings of the 30th International Conference on Advances in Geographic Information Systems, pages 1–11, 2022.
  • [16] Shashi Shekhar and Yan Huang. Discovering spatial co-location patterns: A summary of results. In Intl. symposium on spatial and temporal databases, pages 236–256. Springer, 2001.
  • [17] Song Wang et al. Regional co-locations of arbitrary shapes. In SSTD, 2013.
  • [18] Geoffrey I Webb. Discovering significant patterns. Machine learning, 68(1):1–33, 2007.
  • [19] David WS Wong. The modifiable areal unit problem (maup). In WorldMinds: geographical perspectives on 100 problems: commemorating the 100th anniversary of the association of American geographers 1904–2004, pages 571–575. Springer, 2004.

In this appendix, we address the following questions:

Appendix A Why can’t we use existing false discovery reduction techniques from local pattern mining?

Existing techniques for reducing false discoveries in local pattern mining cannot be applied to this problem because of spatial variability (i.e., constituent features of a regional colocation pattern might have different summary statistics in different atomic partitions). Webb [18] proposed a holdout approach in which the data are divided into exploratory and holdout sets. Patterns are generated using the exploratory data, while statistical tests are performed on the generated patterns using the holdout data. This technique may apply to atomic partitions with a large presence of constituent features (e.g., partition T41 in Figure 8). However, it would be counterproductive in partitions where the number of feature instances is very low (e.g., partition T42 in Figure 8). In such partitions, splitting the data points into exploratory and holdout sets would leave very few instances for the pattern detection process.

Figure 8: Feature instances exhibit spatial variability within atomic partitions.
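To make the holdout argument concrete, the sketch below splits the feature instances of a dense partition and a sparse partition into exploratory and holdout sets. The instance counts (200 and 6), the 50/50 split, and the function name are hypothetical values chosen for illustration; they are not the counts of partitions T41 and T42.

```python
import random


def holdout_split(instances, holdout_fraction=0.5, seed=0):
    """Randomly divide instances into (exploratory, holdout) lists."""
    shuffled = list(instances)
    random.Random(seed).shuffle(shuffled)
    n_holdout = int(len(shuffled) * holdout_fraction)
    return shuffled[n_holdout:], shuffled[:n_holdout]


# Hypothetical partitions: one with many instances, one with very few.
dense_partition = [("A", i) for i in range(200)]
sparse_partition = [("A", i) for i in range(6)]

dense_explore, dense_holdout = holdout_split(dense_partition)
sparse_explore, sparse_holdout = holdout_split(sparse_partition)
```

The dense partition still has 100 instances on each side of the split, while the sparse partition is left with only 3 instances for pattern generation and 3 for testing.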

Appendix B Why can’t this problem be cast as a modified version of frequent itemset mining?

In frequent itemset mining, the task is to find subgroups of items that often occur together in a transaction, e.g., laptop and antivirus software. Previous work has addressed false discoveries in this problem [6]. Such approaches treat the association in the mined subgroup as the alternative hypothesis, while the null hypothesis is formulated using a randomized baseline subset. Thus, these approaches do not address the independence between hypotheses in different spatial partitions in our problem. As noted earlier, in regional colocation pattern detection, different features might have different summary statistics in different atomic partitions. To model the complete spatial randomness of these features, we generate the null hypothesis in each atomic partition according to the summary statistics of those features in that specific partition. Thus, the null hypothesis generated for the features in one atomic partition is independent of the null hypotheses in other atomic partitions. Therefore, the problem of regional colocation pattern detection cannot be considered a modified version of subgroup discovery in frequent itemset mining.
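One way to picture a per-partition null hypothesis is a simple Monte Carlo test in the spirit of [2]. The sketch below is an illustration under stated assumptions, not the procedure of Section 2.2: it takes the only summary statistic to be the instance count of each feature, simulates complete spatial randomness in a rectangular partition, and uses the participation index of a two-feature pattern as the test statistic. All function and parameter names are hypothetical.

```python
import math
import random


def participation_index(points_a, points_b, d):
    """Participation index of the pair {A, B}: the minimum, over both features,
    of the fraction of instances with a neighbor of the other feature within d."""
    def ratio(src, dst):
        if not src:
            return 0.0
        hits = sum(
            1 for (x, y) in src
            if any(math.hypot(x - u, y - v) <= d for (u, v) in dst)
        )
        return hits / len(src)
    return min(ratio(points_a, points_b), ratio(points_b, points_a))


def csr_points(n, bounds, rng):
    """n points placed uniformly at random in a rectangle."""
    xmin, ymin, xmax, ymax = bounds
    return [(rng.uniform(xmin, xmax), rng.uniform(ymin, ymax)) for _ in range(n)]


def partition_p_value(points_a, points_b, bounds, d, n_sim=99, seed=0):
    """Monte Carlo p-value for one partition. The null datasets keep this
    partition's own feature counts, so each partition gets its own null."""
    rng = random.Random(seed)
    observed = participation_index(points_a, points_b, d)
    at_least = 0
    for _ in range(n_sim):
        null_a = csr_points(len(points_a), bounds, rng)
        null_b = csr_points(len(points_b), bounds, rng)
        if participation_index(null_a, null_b, d) >= observed:
            at_least += 1
    return (1 + at_least) / (1 + n_sim)


# Toy partition in which every A instance has a B instance 0.1 units away.
bounds = (0.0, 0.0, 100.0, 100.0)
a_pts = [(10.0 * i + 5.0, 50.0) for i in range(10)]
b_pts = [(x + 0.1, y) for (x, y) in a_pts]
p_val = partition_p_value(a_pts, b_pts, bounds, d=1.0)
```

Because the simulation for one partition depends only on that partition's counts and extent, running it for another partition draws an unrelated set of null datasets, which is the independence the paragraph above refers to.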

Appendix C How does spatial colocation mining differ from association rule mining?

Data mining techniques have been widely developed to solve challenging problems in various domains. Yet, the underlying assumptions of these algorithms do not address the problem of spatial variability. This leads to the detection of spurious patterns in spatial data, a phenomenon known as the modifiable areal unit problem (MAUP) [19]. Colocation pattern detection resembles association rule mining, but the absence of transactions in colocation mining means that association rule mining techniques cannot be used directly to mine colocation patterns.

[Figure 9 panels: Map of 3 spatial feature types; Spatial Partition 1; Spatial Partition 2; Spatial Partition 3; Transactions]

Figure 9: Association rule mining [8] returning different results depending on the spatial partition

Transactions in association rule mining refer to groups of items purchased together. An itemset’s support is the fraction of transactions that contain the itemset. Itemsets whose support exceeds a user-specified threshold yield association rules. In spatial data mining, the choice of partition affects the transactions. For example, Figure 9 shows a dataset with 3 feature types, i.e., <squares>, <triangles>, and <circles>. In partition P1 (Figure 9, Spatial Partition 1), <squares, triangles, circles> is a transaction, while in partitions P2 (Spatial Partition 2) and P3 (Spatial Partition 3), <triangles, circles> and <squares, triangles> are the transactions, respectively. This is known as the MAUP. In colocation pattern detection, this is addressed using a neighborhood graph, as shown in Figure 10. A user-defined neighbor relationship $R$ is used to find subsets of features in close geographic proximity. Thus, the colocation miner provides a transaction-free approach to mining prevalent patterns.
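The dependence of support on the chosen partition can be reproduced with a toy example. The coordinates, the two grid widths, and the function names below are hypothetical and do not correspond to the layout in Figure 9; the sketch only shows that the same points yield different transactions, and hence different supports, under different partitions.

```python
def transactions_from_partition(points, cell_of):
    """Group feature types by the cell that contains each point.
    Each non-empty cell becomes one transaction (a set of feature types)."""
    cells = {}
    for feature, x, y in points:
        cells.setdefault(cell_of(x, y), set()).add(feature)
    return list(cells.values())


def support(itemset, transactions):
    """Fraction of transactions that contain every item of `itemset`."""
    itemset = set(itemset)
    if not transactions:
        return 0.0
    return sum(1 for t in transactions if itemset <= t) / len(transactions)


# Six points on a line, with three feature types (toy data).
points = [
    ("square", 1.0, 1.0), ("triangle", 2.0, 1.0), ("circle", 3.0, 1.0),
    ("square", 6.0, 1.0), ("triangle", 7.0, 1.0), ("circle", 8.0, 1.0),
]

# Two partitions of the same space: vertical strips of width 5 and of width 2.5.
wide = transactions_from_partition(points, lambda x, y: int(x // 5))
narrow = transactions_from_partition(points, lambda x, y: int(x // 2.5))

support_wide = support({"triangle", "circle"}, wide)
support_narrow = support({"triangle", "circle"}, narrow)
```

With the wide strips, every transaction contains all three feature types, so {triangle, circle} has support 1.0. With the narrow strips, each circle falls in its own cell and the same itemset has support 0.0, even though no point has moved.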

[Figure 10 panels: Neighbor graph based on relation $R$; PI of candidate patterns]

Figure 10: Colocation pattern detection [9]