Title: | Iterative Steps for Postprocessing Model Predictions |
---|---|
Description: | Postprocessors refine predictions outputted from machine learning models to improve predictive performance or better satisfy distributional limitations. This package introduces 'tailor' objects, which compose iterative adjustments to model predictions. A number of pre-written adjustments are provided with the package, like calibration and equivocal zones, as well as utilities to compose new ones. Tailors are tightly integrated with the 'tidymodels' framework. |
Authors: | Simon Couch [aut], Hannah Frick [aut], Emil Hvitfeldt [aut], Max Kuhn [aut, cre], Posit Software, PBC [cph, fnd] |
Maintainer: | Max Kuhn <[email protected]> |
License: | MIT + file LICENSE |
Version: | 0.0.0.9001 |
Built: | 2024-11-22 22:23:34 UTC |
Source: | https://github.com/tidymodels/tailor |
Equivocal zones describe intervals of predicted probabilities that are deemed too uncertain or ambiguous to be assigned a hard class. Rather than predicting a hard class when the probability is very close to a threshold, tailors using this adjustment predict "[EQ]".
adjust_equivocal_zone(x, value = 0.1, threshold = 1/2)
x |
A |
value |
A numeric value (between zero and 1/2) or |
threshold |
A numeric value (between zero and one) or |
This adjustment doesn't require estimation and, as such, the same data that's used to train it with fit() can be predicted on with predict(); fitting this adjustment just collects metadata on the supplied column names and does not risk data leakage.
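To make the equivocal-zone rule concrete, the following base R sketch labels any probability within `value` of `threshold` as "[EQ]". The helper `apply_equivocal_zone()` and its `levels` argument are hypothetical names used only for illustration; they are not part of the package, and the package may treat values exactly on the zone boundary differently.

```r
# Hypothetical helper for illustration; not a function from the package.
# `prob_event` is the predicted probability of the first class level.
apply_equivocal_zone <- function(prob_event, levels, value = 0.1, threshold = 0.5) {
  # Ordinary hard class prediction based on the threshold
  hard <- ifelse(prob_event >= threshold, levels[1], levels[2])
  # Probabilities closer than `value` to the threshold are too ambiguous
  equivocal <- abs(prob_event - threshold) < value
  hard[equivocal] <- "[EQ]"
  hard
}

probs <- c(0.05, 0.30, 0.50, 0.70, 0.95)
zone_result <- apply_equivocal_zone(
  probs,
  levels = c("Class1", "Class2"),
  value = 0.25
)
zone_result
```

With `value = 0.25`, only the probabilities well outside (.25, .75) keep a hard class; the three middle values become "[EQ]".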
library(tailor)
library(dplyr)
library(modeldata)

head(two_class_example)

# `predicted` gives hard class predictions based on probabilities
two_class_example %>% count(predicted)

# when probabilities are within (.25, .75), consider them equivocal
tlr <-
  tailor() %>%
  adjust_equivocal_zone(value = 1 / 4)

tlr

# fit by supplying column names. situate in a modeling workflow
# with `workflows::add_tailor()` to avoid having to do so manually
tlr_fit <- fit(
  tlr,
  two_class_example,
  outcome = c(truth),
  estimate = c(predicted),
  probabilities = c(Class1, Class2)
)

tlr_fit

# adjust hard class predictions
predict(tlr_fit, two_class_example) %>% count(predicted)
Calibration for regression models involves adjusting the model's predictions to correct for correlated errors, so that predicted values align closely with observed values across the entire range of outputs.
adjust_numeric_calibration(x, method = NULL)
x |
A |
method |
Character. One of |
This adjustment requires estimation and, as such, different subsets of data should be used to train it and evaluate its predictions. See the section by the same name in ?workflows::add_tailor() for more information on preventing data leakage with postprocessors that require estimation. When situated in a workflow, tailors will automatically be estimated with appropriate subsets of data.
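The idea behind a linear calibration can be sketched in base R: regress the observed outcome on the model's predictions using one subset of data, then use that fit to adjust predictions on another subset. This is only an illustration under the assumption that a plain `lm()` captures the idea; the data frames `calib` and `test` are made up here, and the estimation the package performs for `method = "linear"` may differ.

```r
# Simulated predictions that are systematically off from the outcome
set.seed(1)
calib <- data.frame(y = rnorm(100))
calib$y_pred <- calib$y / 2 + rnorm(100)
test <- data.frame(y = rnorm(100))
test$y_pred <- test$y / 2 + rnorm(100)

# Estimate the calibration on the calibration subset only
cal_model <- lm(y ~ y_pred, data = calib)

# Apply the estimated adjustment to predictions from a different subset
test$y_pred_calibrated <- predict(cal_model, newdata = test)

# Compare mean squared error on the calibration subset before and after
mse_before <- mean((calib$y - calib$y_pred)^2)
mse_after <- mean((calib$y - fitted(cal_model))^2)
c(before = mse_before, after = mse_after)
```

Keeping the calibration and test subsets separate mirrors the data-usage advice above: the adjustment is estimated on one subset and evaluated on another.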
library(tailor)
library(tibble)

# create example data
set.seed(1)
d_calibration <- tibble(y = rnorm(100), y_pred = y/2 + rnorm(100))
d_test <- tibble(y = rnorm(100), y_pred = y/2 + rnorm(100))

d_calibration

# specify calibration
tlr <-
  tailor() %>%
  adjust_numeric_calibration(method = "linear")

# train tailor on a subset of data. situate in a modeling workflow with
# `workflows::add_tailor()` to avoid having to specify column names manually
tlr_fit <- fit(tlr, d_calibration, outcome = y, estimate = y_pred)

# apply to predictions on another subset of data
d_test

predict(tlr_fit, d_test)
Truncating ranges involves limiting the output of a model to a specific range of values, typically to avoid extreme or unrealistic predictions. This technique can help improve the practical applicability of a model's outputs by constraining them within reasonable bounds based on domain knowledge or physical limitations.
adjust_numeric_range(x, lower_limit = -Inf, upper_limit = Inf)
x |
A |
upper_limit, lower_limit |
A numeric value, NA (for no truncation) or |
This adjustment doesn't require estimation and, as such, the same data that's used to train it with fit() can be predicted on with predict(); fitting this adjustment just collects metadata on the supplied column names and does not risk data leakage.
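Truncation itself amounts to clamping each prediction to the chosen bounds. A minimal base R sketch follows; `truncate_range()` is a hypothetical helper written for this illustration, not a package function, and it only handles the infinite defaults rather than the NA convention mentioned in the arguments.

```r
# Hypothetical helper for illustration; not a function from the package.
truncate_range <- function(pred, lower_limit = -Inf, upper_limit = Inf) {
  # Raise values below the lower limit, then cap values above the upper limit
  pmin(pmax(pred, lower_limit), upper_limit)
}

preds <- c(-2, 0.5, 3)
truncated <- truncate_range(preds, lower_limit = 0, upper_limit = 1)
truncated
```

Values already inside the bounds pass through unchanged, and with the default infinite limits the function returns its input as-is.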
library(tailor)
library(tibble)

# create example data
set.seed(1)
d <- tibble(y = rnorm(100), y_pred = y/2 + rnorm(100))
d

# specify range truncation
tlr <-
  tailor() %>%
  adjust_numeric_range(lower_limit = 1)

# train tailor by passing column names. situate in a modeling workflow with
# `workflows::add_tailor()` to avoid having to specify column names manually
tlr_fit <- fit(tlr, d, outcome = y, estimate = y_pred)

predict(tlr_fit, d)
This adjustment function allows for arbitrary transformations of model predictions using dplyr::mutate() statements.
adjust_predictions_custom(x, ..., .pkgs = character(0))
x |
A |
... |
Name-value pairs of expressions. See |
.pkgs |
A character string of extra packages that are needed to execute the commands. |
This adjustment doesn't require estimation and, as such, the same data that's used to train it with fit() can be predicted on with predict(); fitting this adjustment just collects metadata on the supplied column names and does not risk data leakage.
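The package's example for this adjustment passes the expression `binomial()$linkfun(Class2)` to derive a linear predictor from a class probability. As a small aside, the base R snippet below shows what that expression computes: the logit, or log-odds, of each probability. The vector `p` is made up for illustration.

```r
# Made-up probabilities for illustration
p <- c(0.1, 0.5, 0.9)

# The link function of the binomial family is the logit by default
logit_via_family <- binomial()$linkfun(p)

# The same quantity written out by hand and via stats::qlogis()
logit_by_hand <- log(p / (1 - p))
logit_via_qlogis <- qlogis(p)

all.equal(logit_via_family, logit_by_hand)
all.equal(logit_via_family, logit_via_qlogis)
```

Any expression of this kind that works inside dplyr::mutate() can be supplied as a name-value pair to the adjustment.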
library(tailor)
library(modeldata)

head(two_class_example)

tlr <-
  tailor() %>%
  adjust_equivocal_zone() %>%
  adjust_predictions_custom(linear_predictor = binomial()$linkfun(Class2))

tlr_fit <- fit(
  tlr,
  two_class_example,
  outcome = c(truth),
  estimate = c(predicted),
  probabilities = c(Class1, Class2)
)

predict(tlr_fit, two_class_example) %>% head()
Calibration is the process of adjusting a model's predicted probabilities to match the observed frequencies of events. This technique aims to ensure that when a model predicts a certain probability for an outcome, that probability accurately reflects the true likelihood of that outcome occurring.
adjust_probability_calibration(x, method = NULL)
x |
A |
method |
Character. One of |
This adjustment requires estimation and, as such, different subsets of data should be used to train it and evaluate its predictions. See the section by the same name in ?workflows::add_tailor() for more information on preventing data leakage with postprocessors that require estimation. When situated in a workflow, tailors will automatically be estimated with appropriate subsets of data.
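One simple way to picture a logistic calibration is to fit a logistic regression of the observed outcome on the raw predicted probability, using a calibration subset, and then use that fit to produce adjusted probabilities on a separate subset. The sketch below does this in base R with simulated data; the variable names are made up, and the estimation the package performs for `method = "logistic"` may differ from this plain `glm()` fit.

```r
# Simulate raw probabilities that overestimate the true event rate:
# the event actually occurs with probability p_raw^2, which is below p_raw
set.seed(1)
n <- 500
p_raw <- runif(n)
event <- rbinom(n, size = 1, prob = p_raw^2)

# Alternate rows between a calibration subset and a test subset
in_calib <- rep(c(TRUE, FALSE), length.out = n)
calib <- data.frame(event = event[in_calib], p_raw = p_raw[in_calib])
test <- data.frame(event = event[!in_calib], p_raw = p_raw[!in_calib])

# Estimate the calibration on the calibration subset only
cal_model <- glm(event ~ p_raw, family = binomial(), data = calib)

# Apply it to raw probabilities from the held-out subset
test$p_cal <- predict(cal_model, newdata = test, type = "response")

head(test)
```

Because the calibration model is estimated from data, evaluating it on the same rows used to fit it would give an overly optimistic picture, which is why the two subsets are kept apart.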
library(tailor)
library(modeldata)

# split example data
set.seed(1)
in_rows <- sample(c(TRUE, FALSE), nrow(two_class_example), replace = TRUE)
d_calibration <- two_class_example[in_rows, ]
d_test <- two_class_example[!in_rows, ]

head(d_calibration)

# specify calibration
tlr <-
  tailor() %>%
  adjust_probability_calibration(method = "logistic")

# train tailor on a subset of data. situate in a modeling workflow with
# `workflows::add_tailor()` to avoid having to specify column names manually
tlr_fit <- fit(
  tlr,
  d_calibration,
  outcome = c(truth),
  estimate = c(predicted),
  probabilities = c(Class1, Class2)
)

# apply to predictions on another subset of data
head(d_test)

predict(tlr_fit, d_test)
Many machine learning systems determine hard class predictions by first predicting the probability of an event and then predicting that an event will occur if its respective probability is above 0.5. This adjustment allows practitioners to determine hard class predictions using a threshold other than 0.5. By setting appropriate thresholds, one can balance the trade-off between different types of errors (such as false positives and false negatives) to optimize the model's performance for specific use cases.
adjust_probability_threshold(x, threshold = 0.5)
x |
A |
threshold |
A numeric value (between zero and one) or |
This adjustment doesn't require estimation and, as such, the same data that's used to train it with fit() can be predicted on with predict(); fitting this adjustment just collects metadata on the supplied column names and does not risk data leakage.
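The effect of moving the threshold can be seen in a few lines of base R. In this sketch, `prob_class1` is a made-up vector of predicted probabilities for the first class, and the labels are assigned with a plain `ifelse()`; how the package treats a probability exactly equal to the threshold is not covered here.

```r
# Made-up predicted probabilities for the first class
prob_class1 <- c(0.05, 0.20, 0.60)

# Hard classes with the conventional 0.5 threshold
default_class <- ifelse(prob_class1 >= 0.5, "Class1", "Class2")

# Hard classes with a much lower threshold of 0.1
lowered_class <- ifelse(prob_class1 >= 0.1, "Class1", "Class2")

data.frame(prob_class1, default_class, lowered_class)
```

Lowering the threshold assigns more observations to the first class, trading fewer false negatives for more false positives on that class.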
library(tailor)
library(modeldata)

# `predicted` gives hard class predictions based on probability threshold .5
head(two_class_example)

# use a threshold of .1 instead:
tlr <-
  tailor() %>%
  adjust_probability_threshold(.1)

# fit by supplying column names. situate in a modeling workflow
# with `workflows::add_tailor()` to avoid having to do so manually
tlr_fit <- fit(
  tlr,
  two_class_example,
  outcome = c(truth),
  estimate = c(predicted),
  probabilities = c(Class1, Class2)
)

# adjust hard class predictions
predict(tlr_fit, two_class_example) %>% head()
Tailors compose iterative adjustments to model predictions. After initializing a tailor with this function, add adjustment specifications with adjust_*() functions:
For probability distributions: adjust_probability_calibration()
For transformation of probabilities to hard class predictions: adjust_probability_threshold(), adjust_equivocal_zone()
For numeric distributions: adjust_numeric_calibration(), adjust_numeric_range()
For ad-hoc adjustments, see adjust_predictions_custom().
Tailors must be trained with fit() before being applied to new data with predict(). Tailors are tightly integrated with the tidymodels framework; for greatest ease of use, situate tailors in model workflows with ?workflows::add_tailor().
tailor(outcome = NULL, estimate = NULL, probabilities = NULL)
outcome |
< |
estimate |
< |
probabilities |
< |
library(tailor)
library(dplyr)
library(modeldata)

# `predicted` gives hard class predictions based on probabilities
two_class_example %>% count(predicted)

# change the probability threshold to allot one class vs the other
tlr <-
  tailor() %>%
  adjust_probability_threshold(threshold = .1)

tlr

# fit by supplying column names. situate in a modeling workflow
# with `workflows::add_tailor()` to avoid having to do so manually
tlr_fit <- fit(
  tlr,
  two_class_example,
  outcome = c(truth),
  estimate = c(predicted),
  probabilities = c(Class1, Class2)
)

tlr_fit

# adjust hard class predictions
predict(tlr_fit, two_class_example) %>% count(predicted)
Describe a tailor's adjustments in a tibble with one row per adjustment.
## S3 method for class 'tailor'
tidy(x, number = NA, ...)
x |
A |
number |
Optional. A single integer between 1 and the number of adjustments. |
... |
Currently unused; must be empty. |
A tibble containing information about the tailor's adjustments including their ordering, whether they've been trained, and whether they require training with a separate calibration set.