Title: | A Compilation of Applicability Domain Methods |
---|---|
Description: | A modeling package compiling applicability domain methods in R. It combines different methods to measure the amount of extrapolation new samples can have from the training set. See Netzeva et al (2005) <doi:10.1177/026119290503300209> for an overview of applicability domains. |
Authors: | Marly Gotti [aut, cre], Max Kuhn [aut], Posit Software, PBC [cph, fnd] |
Maintainer: | Marly Gotti <[email protected]> |
License: | MIT + file LICENSE |
Version: | 0.0.1.1 |
Built: | 2024-11-16 05:06:47 UTC |
Source: | https://github.com/tidymodels/applicable |
More data related to the set described by De Cock (2011), where data were recorded for 2,930 properties in Ames, IA.
This data set includes three more properties added since the original reference. There are fewer fields in this data set; only those that could be transcribed from the assessor's office were included.
ames_new |
a tibble |
De Cock, D. (2011). "Ames, Iowa: Alternative to the Boston Housing Data as an End of Semester Regression Project," Journal of Statistics Education, Volume 19, Number 3.
https://www.cityofames.org/government/departments-divisions-a-h/city-assessor
http://jse.amstat.org/v19n3/decock/DataDocumentation.txt
http://jse.amstat.org/v19n3/decock.pdf
apd_hat_values
apd_hat_values() fits a model.
apd_hat_values(x, ...)

## Default S3 method:
apd_hat_values(x, ...)

## S3 method for class 'data.frame'
apd_hat_values(x, ...)

## S3 method for class 'matrix'
apd_hat_values(x, ...)

## S3 method for class 'formula'
apd_hat_values(formula, data, ...)

## S3 method for class 'recipe'
apd_hat_values(x, data, ...)
x |
Depending on the context:
|
... |
Not currently used, but required for extensibility. |
formula |
A formula specifying the predictor terms on the right-hand side. No outcome should be specified. |
data |
When a recipe or formula is used,
|
An apd_hat_values object.
predictors <- mtcars[, -1]

# Data frame interface
mod <- apd_hat_values(predictors)

# Formula interface
mod2 <- apd_hat_values(mpg ~ ., mtcars)

# Recipes interface
library(recipes)
rec <- recipe(mpg ~ ., mtcars)
rec <- step_log(rec, disp)
mod3 <- apd_hat_values(rec, mtcars)
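To make the quantity behind apd_hat_values() more concrete, the following base R sketch computes hat values (leverages) directly from a predictor matrix. It is an illustration only: whether the package uses an intercept, centering, or a QR decomposition internally is not stated here, so the no-intercept formula below is an assumption, and the names XtX_inv, hat_train, and hat_new are hypothetical.

```r
# Training predictors only (no outcome column)
X <- as.matrix(mtcars[1:20, c("disp", "hp", "wt")])

# Inverse of X'X, computed once from the training set
XtX_inv <- solve(crossprod(X))

# Hat value of each training sample: the diagonal of X (X'X)^-1 X'
hat_train <- rowSums((X %*% XtX_inv) * X)

# New samples are scored against the same training-set inverse
X_new <- as.matrix(mtcars[21:32, c("disp", "hp", "wt")])
hat_new <- rowSums((X_new %*% XtX_inv) * X_new)

hat_new
```

Larger hat values for a new sample suggest that it lies further from the bulk of the training set in predictor space.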
apd_pca
apd_pca() fits a model.
apd_pca(x, ...)

## Default S3 method:
apd_pca(x, ...)

## S3 method for class 'data.frame'
apd_pca(x, threshold = 0.95, ...)

## S3 method for class 'matrix'
apd_pca(x, threshold = 0.95, ...)

## S3 method for class 'formula'
apd_pca(formula, data, threshold = 0.95, ...)

## S3 method for class 'recipe'
apd_pca(x, data, threshold = 0.95, ...)
x |
Depending on the context:
|
... |
Not currently used, but required for extensibility. |
threshold |
A number indicating the proportion of variance desired from the principal components. It must be a number greater than 0 and less than or equal to 1. |
formula |
A formula specifying the predictor terms on the right-hand side. No outcome should be specified. |
data |
When a recipe or formula is used,
|
The function computes the principal components that account for up to either 95% or the provided threshold of variability. It also computes the percentiles of the absolute value of the principal components. Additionally, it calculates the mean of each principal component.
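The steps described above can be sketched in base R with prcomp(). This is not the package's implementation: the centering and scaling choices, the rule for picking the number of components (the smallest number whose cumulative variance reaches the threshold), and all object names are assumptions made for illustration.

```r
predictors <- mtcars[, -1]

# Principal component analysis on centered and scaled predictors
pca <- prcomp(predictors, center = TRUE, scale. = TRUE)

# Cumulative proportion of variance explained by the components
cum_var <- cumsum(pca$sdev^2) / sum(pca$sdev^2)

# Smallest number of components whose cumulative variance reaches the threshold
threshold <- 0.95
num_comp <- which(cum_var >= threshold)[1]

# Training set scores for the retained components
scores <- pca$x[, seq_len(num_comp), drop = FALSE]

# Percentiles (0 to 100) of the absolute value of each component
abs_pctls <- apply(abs(scores), 2, quantile, probs = seq(0, 1, by = 0.01))

# Mean of each retained component
pc_means <- colMeans(scores)
```

Lowering threshold retains fewer components, which makes the stored reference distributions smaller.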
An apd_pca object.
predictors <- mtcars[, -1]

# Data frame interface
mod <- apd_pca(predictors)

# Formula interface
mod2 <- apd_pca(mpg ~ ., mtcars)

# Recipes interface
library(recipes)
rec <- recipe(mpg ~ ., mtcars)
rec <- step_log(rec, disp)
mod3 <- apd_pca(rec, mtcars)
apd_similarity() is used to analyze samples in terms of similarity scores for binary data. All features in the data should be binary (i.e. zero or one).
apd_similarity(x, ...)

## Default S3 method:
apd_similarity(x, quantile = NA_real_, ...)

## S3 method for class 'data.frame'
apd_similarity(x, quantile = NA_real_, ...)

## S3 method for class 'matrix'
apd_similarity(x, quantile = NA_real_, ...)

## S3 method for class 'formula'
apd_similarity(formula, data, quantile = NA_real_, ...)

## S3 method for class 'recipe'
apd_similarity(x, data, quantile = NA_real_, ...)
x |
Depending on the context:
|
... |
Options to pass to |
quantile |
A real number between 0 and 1 or NA for how the similarity
values for each sample versus the training set should be summarized. A value
of |
formula |
A formula specifying the predictor terms on the right-hand side. No outcome should be specified. |
data |
When a recipe or formula is used,
|
The function computes measures of similarity for different sample points. For example, suppose samples A and B both contain p binary variables. First, a 2x2 table is constructed between A and B across their elements. The table will contain p entries across the four cells (see the example below). From this, different measures of likeness are computed.
For a training set of n samples, a new sample is compared to each, resulting in n similarity scores. These can be summarized into a single value; the median similarity is used by default by the scoring function.
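The comparison of one new sample against every training sample can be sketched in base R as follows. The data are simulated, the jaccard() helper is hypothetical, and the code only illustrates the idea of producing n scores and then summarizing them; it does not reproduce the package's sparse, parallel implementation.

```r
set.seed(1)

# Simulated binary training set (8 samples, 10 features) and one new sample
train <- matrix(rbinom(8 * 10, size = 1, prob = 0.5), nrow = 8)
new_sample <- rbinom(10, size = 1, prob = 0.5)

# Jaccard similarity from the cells of the 2x2 table of two binary vectors
jaccard <- function(a, b) {
  both   <- sum(a == 1 & b == 1)  # present in both samples
  only_a <- sum(a == 1 & b == 0)  # present in a only
  only_b <- sum(a == 0 & b == 1)  # present in b only
  denom  <- both + only_a + only_b
  if (denom == 0) {
    return(NA_real_)              # undefined when neither sample has a 1
  }
  both / denom
}

# One similarity score per training sample
sims <- apply(train, 1, jaccard, b = new_sample)

# Possible single-value summaries of the n scores
mean(sims, na.rm = TRUE)
median(sims, na.rm = TRUE)
quantile(sims, probs = 0.1, na.rm = TRUE)
```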
For this method, the computations are fairly taxing for large data sets. The training set must be stored (albeit in a sparse matrix format), so object sizes may become large.
By default, the computations are run in parallel using all possible cores. To change this, call the setThreadOptions function in the RcppParallel package.
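As a minimal sketch of limiting the thread count, assuming the RcppParallel package is installed, the call below restricts subsequent parallel computations to two threads. The value 2 is arbitrary and chosen only for illustration.

```r
# Number of threads to allow (arbitrary illustrative value)
n_threads <- 2

# Only attempt the call when RcppParallel is available
if (requireNamespace("RcppParallel", quietly = TRUE)) {
  RcppParallel::setThreadOptions(numThreads = n_threads)
}
```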
An apd_similarity object.
Leach, A. and Gillet V. (2007). An Introduction to Chemoinformatics. Springer, New York
data(qsar_binary)

jacc_sim <- apd_similarity(binary_tr)
jacc_sim

# Plot the empirical cumulative distribution function (ECDF) for the training set:
library(ggplot2)
autoplot(jacc_sim)

# Example calculations for two samples:
A <- as.matrix(binary_tr[1, ])
B <- as.matrix(binary_tr[2, ])
xtab <- table(A, B)
xtab

# Jaccard statistic
xtab[2, 2] / (xtab[1, 2] + xtab[2, 1] + xtab[2, 2])

# Hamman statistic
((xtab[1, 1] + xtab[2, 2]) - (xtab[1, 2] + xtab[2, 1])) / sum(xtab)

# Faith statistic (joint presences count fully, joint absences count half)
(xtab[2, 2] + xtab[1, 1] / 2) / sum(xtab)

# Summarize across all training set similarities
mean_sim <- score(jacc_sim, new_data = binary_unk)
mean_sim
Plot the distribution function for principal components
## S3 method for class 'apd_pca'
autoplot(object, ...)
object |
An object produced by |
... |
An optional set of |
A ggplot object that shows the distribution function for each principal component.
library(ggplot2)
library(dplyr)
library(modeldata)
data(biomass)

biomass_ad <- apd_pca(biomass[, 3:8])

autoplot(biomass_ad)

# Using selectors in `...`
autoplot(biomass_ad, distance) + scale_x_log10()
autoplot(biomass_ad, matches("PC[1-2]"))
Plot the cumulative distribution function for similarity metrics
## S3 method for class 'apd_similarity'
autoplot(object, ...)
object |
An object produced by |
... |
Not currently used. |
A ggplot object that shows the cumulative probability versus the unique similarity values in the training set. Note that for large samples, this is an approximation based on a random sample of 5,000 training set points.
set.seed(535)
tr_x <- matrix(
  sample(0:1, size = 20 * 50, prob = rep(.5, 2), replace = TRUE),
  ncol = 20
)
model <- apd_similarity(tr_x)
Binary QSAR Data
These data are from two different sources on quantitative structure-activity relationship (QSAR) modeling and contain 67 predictors that are either 0 or 1. The training set contains 4,330 samples and there are five unknown samples (both from the Mutagen data in the QSARdata package).
binary_tr , binary_unk
|
data frames with 67 columns |
data(qsar_binary)
str(binary_tr)
OkCupid Binary Predictors
Data originally from Kim (2015) includes a training and test set consistent with Kuhn and Johnson (2020). Predictors include ethnicity indicators and a set of keywords derived from text essay data.
okc_binary_train , okc_binary_test
|
data frames with 61 columns |
Kim (2015), "OkCupid Data for Introductory Statistics and Data Science Courses", Journal of Statistics Education, Volume 23, Number 2. https://www.tandfonline.com/doi/abs/10.1080/10691898.2015.11889737
Kuhn and Johnson (2020), Feature Engineering and Selection, Chapman and Hall/CRC. https://bookdown.org/max/FES/ and https://github.com/topepo/FES
data(okc_binary)
str(okc_binary_train)
Print number of predictors and principal components used.
## S3 method for class 'apd_hat_values'
print(x, ...)
x |
A |
... |
Not currently used, but required for extensibility. |
None
model <- apd_hat_values(~ Sepal.Length + Sepal.Width, iris)
print(model)
Print number of predictors and principal components used.
## S3 method for class 'apd_pca'
print(x, ...)
x |
A |
... |
Not currently used, but required for extensibility. |
None
model <- apd_pca(~ Sepal.Length + Sepal.Width, iris)
print(model)
Print number of predictors and principal components used.
## S3 method for class 'apd_similarity'
print(x, ...)
x |
A |
... |
Not currently used, but required for extensibility. |
None
set.seed(535)
tr_x <- matrix(
  sample(0:1, size = 20 * 50, prob = rep(.5, 2), replace = TRUE),
  ncol = 20
)
model <- apd_similarity(tr_x)
print(model)
A scoring function
score(object, ...)

## Default S3 method:
score(object, ...)
object |
Depending on the context:
|
... |
Not currently used, but required for extensibility. |
A tibble of predictions.
Score new samples using hat values
## S3 method for class 'apd_hat_values'
score(object, new_data, type = "numeric", ...)
object |
A |
new_data |
A data frame or matrix of new predictors. |
type |
A single character. The type of predictions to generate. Valid options are:
|
... |
Not used, but required for extensibility. |
A tibble of predictions. The number of rows in the tibble is guaranteed to be the same as the number of rows in new_data. For type = "numeric", the tibble contains two columns hat_values and hat_values_pctls. The column hat_values_pctls is in percent units so that a value of 11.5 indicates that, in the training set, 11.5 percent of the training set samples had smaller values than the sample being scored.
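The percent-unit interpretation can be illustrated with a few lines of base R. The training values below are made up, and the strict "smaller than" comparison is an assumption about how ties and interpolation are handled; the package may compute its percentiles differently.

```r
# Hypothetical hat values observed in a training set of 10 samples
train_stat <- c(0.05, 0.10, 0.12, 0.20, 0.25, 0.30, 0.41, 0.55, 0.60, 0.90)

# Hypothetical hat values for two new samples
new_stat <- c(0.11, 0.58)

# Percent of training samples with a smaller value than each new sample
pctl <- vapply(
  new_stat,
  function(v) 100 * mean(train_stat < v),
  numeric(1)
)

pctl
```

Here two of the ten training values fall below 0.11 and eight fall below 0.58, so the two new samples sit at the 20th and 80th percentiles of the training distribution.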
train_data <- mtcars[1:20, ]
test_data <- mtcars[21:32, ]

hat_values_model <- apd_hat_values(train_data)

hat_values_scoring <- score(hat_values_model, new_data = test_data)
hat_values_scoring
Predict from an apd_pca
## S3 method for class 'apd_pca'
score(object, new_data, type = "numeric", ...)
object |
A |
new_data |
A data frame or matrix of new samples. |
type |
A single character. The type of predictions to generate. Valid options are:
|
... |
Not used, but required for extensibility. |
The function computes the principal components of the new data and their percentiles as compared to the training data. The number of principal components computed depends on the threshold given at fit time. It also computes the multivariate distance between each principal component and its mean.
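A rough base R sketch of this scoring step is shown below. It assumes a prcomp() fit on centered and scaled training predictors, a 0.95 variance threshold, and a Euclidean distance from the training-set component means; the exact distance used by the package is not specified here, and all object names are hypothetical.

```r
train <- mtcars[1:20, -1]
test <- mtcars[21:32, -1]

# Fit the PCA on the training predictors only
pca <- prcomp(train, center = TRUE, scale. = TRUE)

# Keep the components needed to reach 95% of the variance
cum_var <- cumsum(pca$sdev^2) / sum(pca$sdev^2)
keep <- seq_len(which(cum_var >= 0.95)[1])

train_scores <- pca$x[, keep, drop = FALSE]

# Project the new samples with the training-set rotation, centering, and scaling
new_scores <- predict(pca, newdata = test)[, keep, drop = FALSE]

# Euclidean distance of each new sample from the training component means
pc_means <- colMeans(train_scores)
distance <- sqrt(rowSums(sweep(new_scores, 2, pc_means)^2))

distance
```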
A tibble of predictions. The number of rows in the tibble is guaranteed to be the same as the number of rows in new_data.
train <- mtcars[1:20, ]
test <- mtcars[21:32, -1]

# Fit
mod <- apd_pca(mpg ~ cyl + log(drat), train)

# Predict, with preprocessing
score(mod, test)
Score new samples using similarity methods
## S3 method for class 'apd_similarity'
score(object, new_data, type = "numeric", add_percentile = TRUE, ...)
object |
A |
new_data |
A data frame or matrix of new predictors. |
type |
A single character. The type of predictions to generate. Valid options are:
|
add_percentile |
A single logical; should the percentile of the similarity score relative to the training set values be computed? |
... |
Not used, but required for extensibility. |
A tibble of predictions. The number of rows in the tibble is guaranteed to be the same as the number of rows in new_data. For type = "numeric", the tibble contains a column called "similarity". If add_percentile = TRUE, an additional column called similarity_pctl will be added. These values are in percent units so that a value of 11.5 indicates that, in the training set, 11.5 percent of the training set samples had smaller values than the sample being scored.
data(qsar_binary)

jacc_sim <- apd_similarity(binary_tr)

mean_sim <- score(jacc_sim, new_data = binary_unk)
mean_sim