Álvaro J. Riascos Villegas, Francisco Gómez, Jose Sebastian Ñungo, Lucas Gómez Tobón and Mateo Dulce Rubio
This repository contains all the instructions to replicate the paper "Modelling Underreported Spatio-temporal Crime Events", including its results, tables, and figures. To do so, first download all the data from our Zenodo repository and place it in the "Data" folder so that the code can find it.
This repository has three main folders. The content of each one is outlined below:
- Data: This folder contains "files_dictionary.txt", a file that describes the files in our Zenodo repository. The purpose of this folder is to store all the data from Zenodo, so you must download the files and place them there.
- Scripts: This folder contains all the Python code used in our research. It is divided into three subfolders:
- 1_preprocess_data: This folder contains three scripts that transform the raw data of citizens' crime reports and the official crime dataset from the Colombian National Police into our clean matrices of events. Each script is a Jupyter Notebook with a detailed description of each processing step. The raw data used in our analysis is not provided because it contains sensitive information about crimes, victims and complainants. For this reason, some dummy observations are provided to run the code.
- 2_modeling: It contains two Python files that define the functions implementing the algorithms described in Section 2 of our paper.
- 3_create_outputs: It contains Jupyter Notebooks that produce all the results of our research. The data needed by these notebooks is located in our Zenodo repository, so it is important to download all the files and place them in the "Data" folder. The following is the list of outputs produced by each script in this folder:
- figure_8_Bogota_jurisdiction_grid.ipynb: produces Figure 8.
- figure_9_crimes_by_source_of_information.ipynb: produces Figure 9.
- figure_15_Bogota_heatmap.ipynb: produces Figure 15.
- results_graphs.ipynb: produces Figure 10, Figure 11, Figure 12, Figure 13 and Figure 14.
- section_model_validation_graphs.ipynb: produces Figure 1, Figure 2, Figure 3, Figure 4, Figure 5, Figure 6, Figure 7.
- Outputs: This folder contains the results of our research in the form of Data, Figures, and Tables. Therefore, this folder is divided into those three categories:
- Data: Contains the results of our models in CSV files. Those files are the inputs to produce our main Figures.
- Figures: Contains the Figures of the paper.
- Tables: Contains the Tables of the paper.
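Before running the notebooks in 3_create_outputs, it can help to confirm that the "Data" folder holds the Zenodo files described in this README. The following is a minimal sketch of such a check, not part of the repository's scripts: the function name `missing_data_files` and the constant `EXPECTED_FILES` are hypothetical, and the file names are simply copied from this README.

```python
from pathlib import Path
import unicodedata

# File names as listed in this README for the Zenodo repository.
EXPECTED_FILES = [
    "distance_1000.csv",
    "distance_10000.csv",
    "distance_50000.csv",
    "grilla_bogota.csv",
    "grilla_bogota2.csv",
    "localidades.zip",
    "matriz_eventos_real.csv",
    "matriz_eventos_subreporte.csv",
    "subreporte_ccb.csv",
    "upla.zip",
    "victimización.xlsx",
]


def _normalize(name):
    # Some file systems store accented names in decomposed form (NFD),
    # so names are compared after NFC normalization.
    return unicodedata.normalize("NFC", name)


def missing_data_files(data_dir="Data"):
    """Return the expected file names that are not present in data_dir.

    Hypothetical helper: it only checks that files exist, not their content.
    """
    folder = Path(data_dir)
    present = set()
    if folder.is_dir():
        present = {_normalize(p.name) for p in folder.iterdir() if p.is_file()}
    return [name for name in EXPECTED_FILES if _normalize(name) not in present]


if __name__ == "__main__":
    missing = missing_data_files("Data")
    if missing:
        print("Missing files in the Data folder:")
        for name in missing:
            print(" -", name)
    else:
        print("All expected files are present.")
```

The check assumes the files are placed directly in "Data" and not in a subfolder; adjust the path if your local layout differs.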
Due to GitHub's storage limits, we decided to upload all the data needed to replicate our paper to Zenodo. The files uploaded to Zenodo are listed below:
- distance_1000.csv: is a data frame with 5000 rows and 3 columns. Each row is a time step of the algorithms and reports the Euclidean distance between the vector with the real crime rate in each cell and the estimation made by the algorithm. The exercise was performed in the case of 1,000 arms and at most 100 super arms. This file is created in the times.py script of our repository.
- distance_10000.csv: is a data frame with 5000 rows and 3 columns. Each row is a time step of the algorithms and reports the Euclidean distance between the vector with the real crime rate in each cell and the estimation made by the algorithm. The exercise was performed in the case of 10,000 arms and at most 1,000 super arms. This file is created in the times.py script of our repository.
- distance_50000.csv: is a data frame with 5000 rows and 3 columns. Each row is a time step of the algorithms and reports the Euclidean distance between the vector with the real crime rate in each cell and the estimation made by the algorithm. The exercise was performed in the case of 50,000 arms and at most 5,000 super arms. This file is created in the times.py script of our repository.
- grilla_bogota.csv: is a data frame with 1638 rows and 5 columns in which each row describes one cell of the Bogotá grid. The difference between this file and grilla_bogota2.csv is that this file is used to plot Figure 9, which includes the rural area of the city. That area is excluded from our analysis because of its low crime density. This file is created in the 3_create_grid.ipynb script of our repository.
- grilla_bogota2.csv: is a data frame with 1008 rows and 10 columns in which each row describes one cell of the Bogotá grid. This file is more complete than grilla_bogota.csv because it includes the name of the Localidad to which the centroid of the cell belongs and its Rep. Rate. However, this file does not contain the rural area of the city. This file is created in the 3_create_grid.ipynb script of our repository.
- localidades.zip: this zipped folder contains the shapefiles to draw the map of Bogotá with its respective administrative limits. The information contained herein is of a public nature and can also be found on the government's open data page.
- matriz_eventos_real.csv: is a matrix of 498 rows and 368 columns in which each row represents one cell of Bogotá's grid and each column represents the number of real crimes for each date. Recall that we assume that the total number of crimes is the combination of NUSE and SIEDCO crimes after the removal of duplicates. This file is created in the 3_create_grid.ipynb script of our repository.
- matriz_eventos_subreporte.csv: is a matrix of 498 rows and 368 columns in which each row represents one cell of Bogotá's grid and each column represents the number of underreported crimes for each date. Recall that we assume that the number of underreported crimes is the number of crimes reported in NUSE. This file is created in the 3_create_grid.ipynb script of our repository.
- subreporte_ccb.csv: is a data frame of 498 rows and 4 columns that describes the Rep. Rate and lambda for each cell of Bogotá's grid. This file is created in the 3_create_grid.ipynb script of our repository.
- upla.zip: this zipped folder contains other extra shapefiles to draw the map of Bogotá with its respective administrative limits. The information contained herein is of a public nature and can also be found on the government's open data page.
- victimización.xlsx: is an Excel file with 20 rows and 4 columns that contains the Vict. Rate and the Rep. Rate for each Localidad of Bogotá. This information comes from the survey-based victimization and victim crime reporting rates presented by Bogotá's Chamber of Commerce (2014).
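The distance_*.csv files described above report, for each time step, the Euclidean distance between the vector of real crime rates per cell and an algorithm's estimate. As an illustration of that quantity only, here is a minimal sketch; it is not the code in times.py, the function names are hypothetical, and the numbers in the usage example are made up.

```python
import math


def euclidean_distance(true_rates, estimated_rates):
    """Euclidean distance between two equally long vectors of per-cell rates."""
    if len(true_rates) != len(estimated_rates):
        raise ValueError("Both vectors must have one entry per cell.")
    return math.sqrt(
        sum((t - e) ** 2 for t, e in zip(true_rates, estimated_rates))
    )


def distance_per_step(true_rates, estimates_by_step):
    """Return one distance per time step, given one estimate vector per step."""
    return [euclidean_distance(true_rates, est) for est in estimates_by_step]


if __name__ == "__main__":
    # Toy example with two cells and three time steps (illustrative values).
    true_rates = [3.0, 4.0]
    estimates_by_step = [[0.0, 0.0], [3.0, 0.0], [3.0, 4.0]]
    print(distance_per_step(true_rates, estimates_by_step))  # [5.0, 4.0, 0.0]
```

In this toy run the distance shrinks as the estimate approaches the real rates, which is how a series of per-step distances like the one in those files can be read.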
Crime observations are one of the principal inputs used by governments for designing citizens' security strategies. However, crime measurements are obscured by underreporting biases, resulting in the so-called "dark figure of crime". Current approaches for estimating the "true" crime rate do not account for underreporting temporal crime dynamics. This work studies the possibility of recovering "true" crime incident rates over time using data from underreported crime observations and complementary crime-related measurements acquired online. For this, a novel underreporting model of spatiotemporal events based on the combinatorial multi-armed bandit framework was proposed. Through extensive simulations, the proposed methodology was validated for identifying the fundamental parameters of the proposed model: the "true" rates of incidence and underreporting of events. Once the proposed model was validated, crime data from a large city, Bogotá (Colombia), was used to estimate the "true" crime and underreporting rates. Our results suggest that this methodology could be used to rapidly estimate the underreporting rates of spatiotemporal events, which is a critical problem in public policy design.