The goal of this repository is to make available the code used for the paper Goepp and van de Kassteele (2021). It provides all the code necessary to reproduce the paper's figures, simulation results and real data application results.
This research paper introduces a method for defining clusters on graph-based signals. It is applied in the domain of spatial statistics for detecting clusters in areal data.
The method developed in the paper is available in the R package graphseg.
Below is a visual illustration of the method, producing a clustering of a spatial signal. The areas used are the neighborhoods around the city of Utrecht, NL:
Keywords: Graph signal processing, Areal lattice data, Spatial clustering, Hot spot detection, Graph-fused lasso, Adaptive Ridge
- `simu/` contains all simulations done in the paper:
  - `graphical_abstract/` is used to generate the illustrative example displayed in the graphical abstract.
  - `figure/` contains the figures used in the paper.
  - `synthetic/` contains the R objects defining the 6 datasets. It gathers the `.rds` files used for simulations: the adjacency graph and the `sf` object for each simulation setting.
  - `table/` contains the table summarizing the simulation results in LaTeX format.
  - Runtime experiments: `computing_time.R` runs simulations comparing the computing time of the `graphseg::agraph` method with `flsa` (see the paper). `extract_subgraph.R` creates the subgraphs of different sizes used in the runtime experiments.
  - Additional runtime experiments (not shown in the paper): `computation_time_wrt_q.R` runs runtime simulations showing that the number of zones does not impact the runtime (see paper). `extract_subgraph_wrt_q.R` extracts the subgraphs needed for this simulation.
  - Downloading and formatting the geographical areal data: `fetch_save_data.R` downloads the areal data (into `cbs*/`) and `format_dataset.R` converts them to `sf` format and saves them under `simu/synthetic/`.
  - `df_rms_dim_clust_score_table.R` formats the simulation results into LaTeX tables.
  - Running simulations: `infer_<region>_<area>_pc_<zone>.R` (for instance `infer_utrecht_neigh_pc_municip.R`) are the scripts running the simulations on the 6 simulation settings. `infer_any.R` runs simulations on all 6 settings.
  - Running simulations on a cluster: `script_infer_x.sh` (where x=1..6) are bash scripts to run simulations on a cluster that uses the Slurm scheduler. `script_infer_any.sh` factors out the code to run any of the 6 simulations. `script_computation_time.sh` runs the runtime simulations in `computation_time.R`.
  - Running simulations on a local machine: we can use `parallel::mclapply` to run simulations in parallel. `parallel_utrecht_neigh_pc_municip.R` shows how. The equivalent files for the five other simulation settings are not provided.
  - Plotting the outcome of the method: `plot_all_input_signals.R` produces the figures of the noisy signal and `plot_any.R` produces the figures of the estimated clustering obtained by our method in the 6 settings.
- `real_data/` contains the real data application done in the paper:
  - `raw_data/map_netherland.geojson` is the spatial data defining the geographical areas.
  - `raw_data/mrf_overweight_utrecht.txt` is the signal to be segmented by our method: the odds ratio of being overweight for each neighborhood in the region of Utrecht. More details are given in Goepp and van de Kassteele (2021) and in van de Kassteele et al. (2017).
  - Pre-processing: the spatial signal is itself the output of a previous estimation method. It comes with an estimate of its covariance matrix, which is stored in `raw_data/V_mrf.txt`. `precision_matrix_sparse.R` computes its inverse (the precision matrix) under the assumption that it is sparse. The result is stored in `utrecht_prec.RData`.
  - `utrecht_mrf.R`: main file, performing spatial segmentation (i.e. clustering) of the odds ratio of overweight in the Utrecht region. The estimates are stored in `results/`.
  - Creating figures: `plot_mrf_agraph_flsa.R` produces the figures of the segmented spatial signal.
- `utils/` contains utility R functions:
  - `infer_functions.R` contains the implementation of the graph-fused adaptive ridge method used in the paper. It is a snapshot of the R package graphseg, plus a few wrapper functions.
  - `div_pal.R` contains functions for setting color scales in the figures.
  - `sf2nb.R` contains utility functions for defining the adjacency graph from the geographical areal data.
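The local parallel runs mentioned under `simu/` rely on `parallel::mclapply`. Here is a minimal sketch of that pattern. It is not the code of `parallel_utrecht_neigh_pc_municip.R`: the function `run_one_simulation()` and its arguments are hypothetical stand-ins that only draw a noisy piecewise-constant signal and return an error measure.

```r
library(parallel)

# Hypothetical stand-in for one simulation replicate (NOT the repository's code):
# draws a noisy two-level signal and returns the root mean squared error
# between the noisy signal and the true signal.
run_one_simulation <- function(seed, n = 100, sigma = 0.5) {
  set.seed(seed)                          # one seed per replicate
  truth <- rep(c(0, 1), each = n / 2)     # piecewise-constant true signal
  noisy <- truth + rnorm(n, sd = sigma)   # add Gaussian noise
  sqrt(mean((noisy - truth)^2))
}

# mclapply relies on forking, so more than one core is not supported on Windows.
n_cores <- if (.Platform$OS.type == "windows") 1L else 2L

seeds <- 1:20
results <- mclapply(seeds, run_one_simulation, mc.cores = n_cores)

# mclapply returns a list with one element per seed.
rmse <- unlist(results)
summary(rmse)
```

Setting the seed inside each replicate keeps the results independent of how the replicates are distributed over the cores.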
The R packages used in this repository are managed with renv, which lets you run the R code in this repo with the same package versions. Here are the steps for running the code of this repo:
- clone it: `git clone https://github.com/goepp/graphseg-paper.git`
- install renv: `install.packages("renv")`
- activate the renv: `renv::activate()`. At this point, R is using a different library path for this project. You can check it by running `.libPaths()`.
- set up the packages stored in renv: `renv::restore()`
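The R-side steps above can be gathered in one helper, sketched below. `setup_project_library()` is a hypothetical name that is not part of this repo, and the sketch assumes that an `renv.lock` file sits at the root of the cloned repository.

```r
# Hypothetical helper (not part of this repo): restore the project library.
setup_project_library <- function(project = ".") {
  # Assumption: the repository ships an renv.lock file at its root.
  lockfile <- file.path(project, "renv.lock")
  if (!file.exists(lockfile)) {
    stop("No renv.lock found in ", normalizePath(project, mustWork = FALSE))
  }
  # Install renv only if it is missing.
  if (!requireNamespace("renv", quietly = TRUE)) {
    install.packages("renv")
  }
  renv::activate(project = project)                  # switch to the project library
  renv::restore(project = project, prompt = FALSE)   # install the locked versions
  invisible(.libPaths())                             # return the library paths in use
}
```

Calling `setup_project_library()` from the root of the clone should then leave `.libPaths()` pointing at the project library.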
renv does not guarantee complete reproducibility. Some remarks:
- I used R version 4.2.1. If possible, use a version >= 4.2.1 that is not too far from it.
- This repo was written on Ubuntu 22.04 LTS. On Linux, you may need to install some system packages before installing the R packages:
sudo apt install libgeos-dev
sudo apt install libharfbuzz-dev libfribidi-dev
sudo apt install libfontconfig1-dev
sudo apt install libfreetype6-dev libpng-dev libtiff5-dev libjpeg-dev
sudo apt install libudunits2-dev
sudo apt install libgdal-dev
sudo apt install cmake
sudo apt install r-cran-rjava
sudo apt install default-jdk && sudo R CMD javareconf
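To check the R version remark above before restoring the packages, a small sketch such as the following can help. `check_r_version()` is a hypothetical helper, not part of this repo; the only value taken from this README is the version number 4.2.1.

```r
# Hypothetical helper: compare the running R version with the one used here.
check_r_version <- function(current = getRversion(), required = "4.2.1") {
  required <- package_version(required)
  ok <- current >= required
  if (!ok) {
    warning("R ", format(current), " is older than ", format(required),
            "; package restoration may fail.")
  }
  invisible(ok)  # TRUE if the running version is recent enough
}

check_r_version()
```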
