Title: | Spatial Resampling Infrastructure |
---|---|
Description: | Functions and classes for spatial resampling to use with the 'rsample' package, such as spatial cross-validation (Brenning, 2012) <doi:10.1109/IGARSS.2012.6352393>. The scope of 'rsample' and 'spatialsample' is to provide the basic building blocks for creating and analyzing resamples of a spatial data set, but neither package includes functions for modeling or computing statistics. The resampled spatial data sets created by 'spatialsample' do not contain much overhead in memory. |
Authors: | Michael Mahoney [aut, cre], Julia Silge [aut], Posit Software, PBC [cph, fnd] |
Maintainer: | Michael Mahoney |
License: | MIT + file LICENSE |
Version: | 0.6.0.9000 |
Built: | 2024-11-01 06:12:37 UTC |
Source: | https://github.com/tidymodels/spatialsample |
This method provides a quick way to visualize spatial resamples.
## S3 method for class 'spatial_rset'
autoplot(object, ..., alpha = 0.6)

## S3 method for class 'spatial_block_cv'
autoplot(object, show_grid = TRUE, ..., alpha = 0.6)
object |
A |
... |
Options passed to |
alpha |
Opacity, passed to |
show_grid |
When plotting spatial_block_cv objects, should the grid itself be drawn on top of the data? Set to FALSE to remove the grid. |
The plot method for spatial_rset displays which fold each observation is assigned to. Note that if data is assigned to multiple folds (which is common if resamples were created with a non-zero radius) only the "last" fold for each observation will appear on the plot. Consider adding ggplot2::facet_wrap(~ fold) to visualize all members of each fold separately. Alternatively, consider plotting each split using the spatial_rsplit method (for example, via lapply(object$splits, autoplot)).
A ggplot object with each fold assigned a color, made using ggplot2::geom_sf().
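The faceting and per-split suggestions above can be sketched as follows. This is a minimal illustration, assuming spatialsample and ggplot2 are installed; the object names p_facets and split_plots are chosen here for the example and are not part of the package.

```r
library(spatialsample)
library(ggplot2)

set.seed(123)
boston_block <- spatial_block_cv(boston_canopy, v = 2)

# One panel per fold, so that fold membership is shown separately
# instead of only the "last" fold for each observation
p_facets <- autoplot(boston_block) + facet_wrap(~ fold)

# Alternatively, one plot per split, using the spatial_rsplit method
split_plots <- lapply(boston_block$splits, autoplot)
```

Printing p_facets or any element of split_plots draws the corresponding plot.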
boston_block <- spatial_block_cv(boston_canopy, v = 2)
autoplot(boston_block)
autoplot(boston_block$splits[[1]])
A dataset containing data on tree canopy coverage and change for the city of Boston, Massachusetts from 2014-2019, as well as temperature and heat index data for July 2019. Data is aggregated to a grid of regular 25 hectare hexagons, clipped to city boundaries. This data is made available under the Public Domain Dedication and License v1.0 whose full text can be found at: https://opendatacommons.org/licenses/pddl/1-0/.
boston_canopy
A data frame (of class sf, tbl_df, tbl, and data.frame) containing 682 records of 22 variables:
Unique identifier for each hexagon. Letters represent the hexagon's X position in the grid (ordered West to East), while numbers represent the Y position (ordered North to South).
Area excluding water bodies
Area of canopy gain between the two years
Area of canopy loss between the two years
Area of no canopy change between the two years
2014 total canopy area (baseline)
2019 total canopy area
The change in area of tree canopy between the two years
Relative change, as calculated in economics: the gain or loss of tree canopy relative to the earlier time period, (2019 Canopy - 2014 Canopy) / (2014 Canopy)
2014 canopy percentage
2019 canopy percentage
Absolute change. Magnitude of change in percent tree canopy from 2014 to 2019 (% 2019 Canopy - % 2014 Canopy)
Mean temperature for July 2019 from 6am - 7am
Mean temperature for July 2019 from 7pm - 8pm
Mean temperature for July 2019 from 6am - 7am, 3pm - 4pm, and 7pm - 8pm (combined)
Mean heat index for July 2019 from 6am - 7am
Mean heat index for July 2019 from 7pm - 8pm
Mean heat index for July 2019 from 6am - 7am, 3pm - 4pm, and 7pm - 8pm (combined)
Geometry of each hexagon, encoded using EPSG:2249 as a coordinate reference system (NAD83 / Massachusetts Mainland (ftUS)). Note that the linear units of this CRS are in US feet.
Note that this dataset is in the EPSG:2249 (NAD83 / Massachusetts Mainland (ftUS)) coordinate reference system (CRS), which may not be installed by default on your computer. Before working with boston_canopy, run:
sf::sf_proj_network(TRUE) to install the CRS itself
sf::sf_add_proj_units() to add US customary units to your units database
These steps only need to be taken once per computer (or per PROJ installation).
Canopy data is from https://data.boston.gov/dataset/hex-tree-canopy-change-metrics. Heat data is from https://data.boston.gov/dataset/hex-mean-heat-index. Most field definitions are from https://data.boston.gov/dataset/canopy-change-assessment-data-dictionary.
Block cross-validation splits the area of your data into a number of grid cells, or "blocks", and then assigns each observation to a fold based on the block its centroid falls into.
spatial_block_cv(
  data,
  method = c("random", "snake", "continuous"),
  v = 10,
  relevant_only = TRUE,
  radius = NULL,
  buffer = NULL,
  ...,
  repeats = 1,
  expand_bbox = 1e-05
)
data |
An object of class |
method |
The method used to sample blocks for cross validation folds.
Currently supports |
v |
The number of partitions for the resampling. Set to |
relevant_only |
For systematic sampling, should only blocks containing data be included in fold labeling? |
radius |
Numeric: points within this distance of the initially-selected
test points will be assigned to the assessment set. If |
buffer |
Numeric: points within this distance of any point in the
test set (after |
... |
Arguments passed to |
repeats |
The number of times to repeat the V-fold partitioning. |
expand_bbox |
A numeric of length 1, representing a proportion to expand
the bounding box of |
The grid blocks can be controlled by passing arguments to sf::st_make_grid() through the ... argument. Some particularly useful arguments include:
cellsize: Target cellsize, expressed as the "diameter" (shortest straight-line distance between opposing sides; two times the apothem) of each block, in map units.
n: The number of grid blocks in the x and y direction (columns, rows).
square: A logical value indicating whether to create square (TRUE) or hexagonal (FALSE) cells.

If both cellsize and n are provided, then the number of blocks requested by n of sizes specified by cellsize will be returned, likely not lining up with the bounding box of data. If only cellsize is provided, this function will return as many blocks of size cellsize as fit inside the bounding box of data. If only n is provided, then cellsize will be automatically adjusted to create the requested number of cells.
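The grid arguments described above can be tried out as follows. This is a minimal sketch using the bundled boston_canopy data; the specific values (a 5 by 5 grid, a cellsize of 10000) are arbitrary choices for illustration, and the object names are not part of the package.

```r
library(spatialsample)

set.seed(123)

# Only `n`: a grid of 5 columns by 5 rows; cellsize is adjusted automatically
blocks_n <- spatial_block_cv(boston_canopy, v = 3, n = c(5, 5))

# Only `cellsize`: as many blocks of this size as fit inside the bounding box.
# The value is in map units, which are US feet for boston_canopy.
blocks_cellsize <- spatial_block_cv(boston_canopy, v = 3, cellsize = 10000)

# Hexagonal rather than square blocks
blocks_hex <- spatial_block_cv(boston_canopy, v = 3, square = FALSE)
```

Calling autoplot() on each result is a quick way to compare how the different grids partition the data.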
A tibble with classes spatial_block_cv, spatial_rset, rset, tbl_df, tbl, and data.frame. The results include a column for the data split objects and an identification variable id.
D. R. Roberts, V. Bahn, S. Ciuti, M. S. Boyce, J. Elith, G. Guillera-Arroita, S. Hauenstein, J. J. Lahoz-Monfort, B. Schröder, W. Thuiller, D. I. Warton, B. A. Wintle, F. Hartig, and C. F. Dormann. "Cross-validation strategies for data with temporal, spatial, hierarchical, or phylogenetic structure," 2016, Ecography 40(8), pp. 913-929, doi: 10.1111/ecog.02881.
spatial_block_cv(boston_canopy, v = 3)
V-fold cross-validation (also known as k-fold cross-validation) randomly
splits the data into V groups of roughly equal size (called "folds").
A resample of the analysis data consists of V-1 of the folds while the
assessment set contains the final fold.
These functions extend rsample::vfold_cv() and rsample::group_vfold_cv()
to also apply an inclusion radius and exclusion buffer to the assessment set,
ensuring that your analysis data is spatially separated from the assessment
set.
In basic V-fold cross-validation (i.e. no repeats), the number of resamples
is equal to V.
spatial_buffer_vfold_cv(
  data,
  radius,
  buffer,
  v = 10,
  repeats = 1,
  strata = NULL,
  breaks = 4,
  pool = 0.1,
  ...
)

spatial_leave_location_out_cv(
  data,
  group,
  v = NULL,
  radius = NULL,
  buffer = NULL,
  ...,
  repeats = 1
)
data |
A data frame. |
radius |
Numeric: points within this distance of the initially-selected
test points will be assigned to the assessment set. If |
buffer |
Numeric: points within this distance of any point in the
test set (after |
v |
The number of partitions for the resampling. Set to |
repeats |
The number of times to repeat the V-fold partitioning. |
strata |
A variable in |
breaks |
A single number giving the number of bins desired to stratify a numeric stratification variable. |
pool |
A proportion of data used to determine if a particular group is too small and should be pooled into another group. We do not recommend decreasing this argument below its default of 0.1 because of the dangers of stratifying groups that are too small. |
... |
These dots are for future extensions and must be empty. |
group |
A variable in data (single character or name) used to create folds. For leave-location-out CV, this should be a variable containing the locations to group observations by, for leave-time-out CV the time blocks to group by, and for leave-location-and-time-out the spatiotemporal blocks to group by. |
When radius and buffer are both NULL, spatial_buffer_vfold_cv is equivalent to rsample::vfold_cv() and spatial_leave_location_out_cv is equivalent to rsample::group_vfold_cv().
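The effect of the buffer, and the grouped variant, can be seen on a small synthetic data set. This is a sketch under stated assumptions: the points, the site column, and all object names are made up for illustration, and a bare numeric buffer is assumed to be interpreted in the units of the CRS (metres for the UTM zone used here).

```r
library(sf)
library(spatialsample)

set.seed(123)
# Hypothetical example data: 200 random points in a projected CRS
# (UTM zone 19N, units of metres), split evenly across four made-up sites
pts <- st_as_sf(
  data.frame(
    x = runif(200, 320000, 330000),
    y = runif(200, 4680000, 4690000),
    site = rep(c("a", "b", "c", "d"), each = 50)
  ),
  coords = c("x", "y"),
  crs = 32619
)

# No radius and no buffer: ordinary V-fold partitioning
set.seed(123)
plain <- spatial_buffer_vfold_cv(pts, radius = NULL, buffer = NULL, v = 5)

# An exclusion buffer of 500 drops analysis points close to the assessment set
set.seed(123)
buffered <- spatial_buffer_vfold_cv(pts, radius = NULL, buffer = 500, v = 5)

# Compare the analysis set sizes of the first resample
nrow(rsample::analysis(plain$splits[[1]]))
nrow(rsample::analysis(buffered$splits[[1]]))

# Leave-location-out: with the default v = NULL, each site is held out in turn
by_site <- spatial_leave_location_out_cv(pts, site)
```

The buffered analysis set should be no larger than the unbuffered one, since buffered points are assigned to neither the analysis nor the assessment set.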
K. Le Rest, D. Pinaud, P. Monestiez, J. Chadoeuf, and C. Bretagnolle. 2014. "Spatial leave-one-out cross-validation for variable selection in the presence of spatial autocorrelation," Global Ecology and Biogeography 23, pp. 811-820, doi: 10.1111/geb.12161.
H. Meyer, C. Reudenbach, T. Hengl, M. Katurji, and T. Nauss. 2018. "Improving performance of spatio-temporal machine learning models using forward feature selection and target-oriented validation," Environmental Modelling & Software 101, pp. 1-9, doi: 10.1016/j.envsoft.2017.12.001.
data(Smithsonian, package = "modeldata")
Smithsonian_sf <- sf::st_as_sf(
  Smithsonian,
  coords = c("longitude", "latitude"),
  crs = 4326
)

spatial_buffer_vfold_cv(
  Smithsonian_sf,
  buffer = 500,
  radius = NULL
)

data(ames, package = "modeldata")
ames_sf <- sf::st_as_sf(ames, coords = c("Longitude", "Latitude"), crs = 4326)
ames_neighborhoods <- spatial_leave_location_out_cv(ames_sf, Neighborhood)
Spatial clustering cross-validation splits the data into V disjoint groups by clustering points based on their spatial coordinates. A resample of the analysis data consists of V-1 of the folds/clusters while the assessment set contains the final fold/cluster.
spatial_clustering_cv(
  data,
  v = 10,
  cluster_function = c("kmeans", "hclust"),
  radius = NULL,
  buffer = NULL,
  ...,
  repeats = 1,
  distance_function = function(x) as.dist(sf::st_distance(x))
)
data |
An |
v |
The number of partitions of the data set. |
cluster_function |
Which function should be used for clustering?
Options are either |
radius |
Numeric: points within this distance of the initially-selected
test points will be assigned to the assessment set. If |
buffer |
Numeric: points within this distance of any point in the
test set (after |
... |
Extra arguments passed on to |
repeats |
The number of times to repeat the clustered partitioning. |
distance_function |
Which function should be used for distance
calculations? Defaults to |
Clusters are created based on the distances between observations if data is an sf object. Each cluster is used as a fold for cross-validation. Depending on how the data are distributed spatially, there may not be an equal number of observations in each fold.

You can optionally provide a custom function to distance_function. The function should take an sf object and return a stats::dist() object with distances between data points.

You can optionally provide a custom function to cluster_function. The function must take three arguments:
dists, a stats::dist() object with distances between data points
v, a length-1 numeric for the number of folds to create
..., to pass any additional named arguments to your function

The function should return a vector of cluster assignments of length nrow(data), with each element of the vector corresponding to the matching row of the data frame.
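The two custom-function contracts above can be illustrated with a short sketch. Everything named here (pts, euclidean_dist, average_linkage) is hypothetical and defined only for this example; the distance function uses plain Euclidean distance on coordinates, which is only reasonable for point data in a projected CRS.

```r
library(sf)
library(spatialsample)

set.seed(123)
# Hypothetical example data: 100 random points in a projected CRS (UTM zone 19N)
pts <- st_as_sf(
  data.frame(
    x = runif(100, 320000, 330000),
    y = runif(100, 4680000, 4690000)
  ),
  coords = c("x", "y"),
  crs = 32619
)

# Custom distance function: takes an sf object, returns a stats::dist() object
euclidean_dist <- function(x) {
  stats::dist(sf::st_coordinates(x))
}

# Custom clustering function: accepts dists, v, and ..., and returns one
# cluster assignment per row of the data
average_linkage <- function(dists, v, ...) {
  tree <- stats::hclust(dists, method = "average")
  stats::cutree(tree, k = v)
}

folds <- spatial_clustering_cv(
  pts,
  v = 4,
  cluster_function = average_linkage,
  distance_function = euclidean_dist
)
```

Average-linkage hierarchical clustering is used here only as an example of a clustering rule that differs from the built-in "kmeans" and "hclust" options.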
A tibble with classes spatial_clustering_cv, spatial_rset, rset, tbl_df, tbl, and data.frame. The results include a column for the data split objects and an identification variable id.

Resamples created from non-sf objects will not have the spatial_rset class.

As of spatialsample version 0.3.0, this function no longer accepts non-sf objects as arguments to data. In order to perform clustering with non-spatial data, consider using rsample::clustering_cv().
Also as of version 0.3.0, this function now calculates edge-to-edge distance for non-point geometries, in line with the rest of the package. Earlier versions relied upon between-centroid distances.
A. Brenning, "Spatial cross-validation and bootstrap for the assessment of prediction rules in remote sensing: The R package sperrorest," 2012 IEEE International Geoscience and Remote Sensing Symposium, Munich, 2012, pp. 5372-5375, doi: 10.1109/IGARSS.2012.6352393.
data(Smithsonian, package = "modeldata")

smithsonian_sf <- sf::st_as_sf(
  Smithsonian,
  coords = c("longitude", "latitude"),
  # Set CRS to WGS84
  crs = 4326
)

# When providing sf objects, coords are inferred automatically
spatial_clustering_cv(smithsonian_sf, v = 5)

# Can use hclust instead:
spatial_clustering_cv(smithsonian_sf, v = 5, cluster_function = "hclust")
NNDM is a variant of leave-one-out cross-validation which assigns each observation to a single assessment fold, and then attempts to remove data from each analysis fold until the nearest neighbor distance distribution between assessment and analysis folds matches the nearest neighbor distance distribution between training data and the locations a model will be used to predict. Proposed by Milà et al. (2022), this method aims to provide accurate estimates of how well models will perform in the locations they will actually be predicting. This method was originally implemented in the CAST package.
spatial_nndm_cv(
  data,
  prediction_sites,
  ...,
  autocorrelation_range = NULL,
  prediction_sample_size = 1000,
  min_analysis_proportion = 0.5
)
data |
An object of class |
prediction_sites |
An |
... |
Additional arguments passed to |
autocorrelation_range |
A numeric of length 1 representing the landscape
autocorrelation range ("phi" in the terminology of Milà et al. (2022)). If
|
prediction_sample_size |
A numeric of length 1: the number of points to
sample when |
min_analysis_proportion |
The minimum proportion of |
Note that, as a form of leave-one-out cross-validation, this method can be rather slow for larger data (and fitting models to these resamples will be even slower).
A tibble with classes spatial_nndm_cv, spatial_rset, rset, tbl_df, tbl, and data.frame. The results include a column for the data split objects and an identification variable id.
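Because NNDM is a form of leave-one-out cross-validation, the returned tibble should contain one resample per training observation. The sketch below illustrates this on synthetic data; the make_points() helper and all object names are hypothetical, and the scenario (training points clustered in one corner of the prediction area) is only an assumed example of a mismatch between training and prediction locations.

```r
library(sf)
library(spatialsample)

set.seed(123)

# Hypothetical helper: n random points inside a rectangle, in UTM zone 19N
make_points <- function(n, xmin, xmax, ymin, ymax) {
  st_as_sf(
    data.frame(x = runif(n, xmin, xmax), y = runif(n, ymin, ymax)),
    coords = c("x", "y"),
    crs = 32619
  )
}

# Training data clustered in the south-west corner of the study area
train_pts <- make_points(30, 320000, 323000, 4680000, 4683000)

# Prediction sites spread over the whole study area
predict_pts <- make_points(100, 320000, 330000, 4680000, 4690000)

nndm <- spatial_nndm_cv(train_pts, predict_pts)

# Number of resamples, expected to equal the number of training observations
nrow(nndm)
```

Keeping the training set small, as here, also keeps the example fast, in line with the note above about run time on larger data.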
C. Milà, J. Mateu, E. Pebesma, and H. Meyer. 2022. "Nearest Neighbour Distance Matching Leave-One-Out Cross-Validation for map validation." Methods in Ecology and Evolution 2022:13, pp. 1304-1316. doi: 10.1111/2041-210X.13851.
H. Meyer and E. Pebesma. 2022. "Machine learning-based global maps of ecological variables and the challenge of assessing them." Nature Communications 13, pp 2208. doi: 10.1038/s41467-022-29838-9.
data(ames, package = "modeldata")
ames_sf <- sf::st_as_sf(ames, coords = c("Longitude", "Latitude"), crs = 4326)

# Using a small subset of the data, to make the example run faster:
spatial_nndm_cv(ames_sf[1:100, ], ames_sf[2001:2100, ])