Title: | Construct Modeling Packages |
---|---|
Description: | Building modeling packages is hard. A large amount of effort generally goes into providing an implementation for a new method that is efficient, fast, and correct, but often less emphasis is put on the user interface. A good interface requires specialized knowledge about S3 methods and formulas, which the average package developer might not have. The goal of 'hardhat' is to reduce the burden around building new modeling packages by providing functionality for preprocessing, predicting, and validating input. |
Authors: | Hannah Frick [aut, cre], Davis Vaughan [aut], Max Kuhn [aut], Posit Software, PBC [cph, fnd] |
Maintainer: | Hannah Frick <[email protected]> |
License: | MIT + file LICENSE |
Version: | 1.4.0.9002 |
Built: | 2024-11-12 14:20:58 UTC |
Source: | https://github.com/tidymodels/hardhat |
Add an intercept column to data
This function adds an integer column of 1's to data.
add_intercept_column(data, name = "(Intercept)")
data |
A data frame or matrix. |
name |
The name for the intercept column. Defaults to "(Intercept)". |
If a column named name already exists in data, then data is returned unchanged and a warning is issued.
data with an intercept column.
add_intercept_column(mtcars)
add_intercept_column(mtcars, "intercept")
add_intercept_column(as.matrix(mtcars))
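The behavior described above can be approximated in base R. This is a minimal sketch, not hardhat's implementation: `add_intercept_sketch()` is a hypothetical helper, and placing the new column first is an assumption made for illustration.

```r
# Hypothetical helper that mimics the documented behavior of
# add_intercept_column(); it is not part of hardhat.
add_intercept_sketch <- function(data, name = "(Intercept)") {
  # If the column already exists, warn and return `data` unchanged
  if (name %in% colnames(data)) {
    warning("`data` already has a column named '", name, "'.")
    return(data)
  }

  ones <- rep(1L, nrow(data))

  if (is.matrix(data)) {
    out <- cbind(ones, data)
    colnames(out)[1] <- name
    return(out)
  }

  # Assumption: the intercept column is placed first
  intercept <- data.frame(ones)
  names(intercept) <- name
  cbind(intercept, data)
}

with_intercept <- add_intercept_sketch(mtcars)

# A second call finds the existing column and changes nothing
again <- suppressWarnings(add_intercept_sketch(with_intercept))
```

The early return is the important part: calling the helper twice never produces two intercept columns.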
This page holds the details for the formula preprocessing blueprint. This is the blueprint used by default from mold() if x is a formula.
default_formula_blueprint(
  intercept = FALSE,
  allow_novel_levels = FALSE,
  indicators = "traditional",
  composition = "tibble"
)

## S3 method for class 'formula'
mold(formula, data, ..., blueprint = NULL)
intercept |
A logical. Should an intercept be included in the
processed data? This information is used by the |
allow_novel_levels |
A logical. Should novel factor levels be allowed at
prediction time? This information is used by the |
indicators |
A single character string. Control how factors are expanded into dummy variable indicator columns. One of "traditional", "none", or "one_hot". |
composition |
Either "tibble", "matrix", or "dgCMatrix" for the format of the processed predictors. If "matrix" or "dgCMatrix" are chosen, all of the predictors must be numeric after the preprocessing method has been applied; otherwise an error is thrown. |
formula |
A formula specifying the predictors and the outcomes. |
data |
A data frame or matrix containing the outcomes and predictors. |
... |
Not used. |
blueprint |
A preprocessing |
While not different from base R, the behavior of expanding factors into dummy variables when indicators = "traditional" and an intercept is not present is not always intuitive and should be documented.
When an intercept is present, factors are expanded into K-1 new columns, where K is the number of levels in the factor.
When an intercept is not present, the first factor is expanded into all K columns (one-hot encoding), and the remaining factors are expanded into K-1 columns. This behavior ensures that meaningful predictions can be made for the reference level of the first factor, but is not the exact "no intercept" model that was requested. Without this behavior, predictions for the reference level of the first factor would always be forced to 0 when there is no intercept.
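The expansion rule can be seen directly with stats::model.matrix(), which is where the behavior comes from. The following is a small illustration; the data frame and its column names `f1` and `f2` are made up for this example.

```r
# Toy data with two factors: f1 has K = 3 levels, f2 has K = 2 levels
df <- data.frame(
  f1 = factor(c("a", "b", "c", "a")),
  f2 = factor(c("x", "y", "x", "y"))
)

# With an intercept, every factor contributes K - 1 columns
with_int <- model.matrix(~ f1 + f2, df)
colnames(with_int)
# "(Intercept)" "f1b" "f1c" "f2y"

# Without an intercept, the first factor is fully expanded into K columns,
# while the remaining factors still contribute K - 1 columns
no_int <- model.matrix(~ f1 + f2 + 0, df)
colnames(no_int)
# "f1a" "f1b" "f1c" "f2y"
```

In the no-intercept case, `f1a` takes over the role the intercept would have played, which is why the reference level of the first factor still gets a meaningful prediction.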
Offsets can be included in the formula method through the use of the inline function stats::offset(). These are returned as a tibble with 1 column named ".offset" in the $extras$offset slot of the return value.
For default_formula_blueprint(), a formula blueprint.
When mold() is used with the default formula blueprint:
Predictors
The RHS of the formula is isolated, and converted to its own 1 sided formula: ~ RHS.
Runs stats::model.frame() on the RHS formula and uses data.
If indicators = "traditional", it then runs stats::model.matrix() on the result.
If indicators = "none", factors are removed before model.matrix() is run, and then added back afterwards. No interactions or inline functions involving factors are allowed.
If indicators = "one_hot", it then runs stats::model.matrix() on the result using a contrast function that creates indicator columns for all levels of all factors.
If any offsets are present from using offset(), then they are extracted with model_offset().
If intercept = TRUE, adds an intercept column.
Coerces the result of the above steps to a tibble.
Outcomes
The LHS of the formula is isolated, and converted to its own 1 sided formula: ~ LHS.
Runs stats::model.frame() on the LHS formula and uses data.
Coerces the result of the above steps to a tibble.
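The steps above can be sketched with base R alone. This is an approximation of the indicators = "traditional", intercept = TRUE path, not hardhat's internal code: the toy data is invented, offsets are not handled, and plain data frames stand in for tibbles.

```r
# Invented training data for illustration
train <- data.frame(
  y = c(1.2, 3.4, 2.2, 5.1),
  num = c(10, 20, 30, 40),
  fac = factor(c("a", "b", "a", "c"))
)
full_formula <- log(y) ~ num + fac

# Predictors: isolate the RHS as a one-sided formula, ~ RHS
rhs_formula <- full_formula
rhs_formula[[2]] <- NULL

pred_frame <- model.frame(rhs_formula, data = train)
pred_matrix <- model.matrix(rhs_formula, pred_frame)
predictors <- as.data.frame(pred_matrix)

# Outcomes: isolate the LHS as a one-sided formula, ~ LHS
lhs_formula <- full_formula
lhs_formula[[3]] <- NULL

outcome_frame <- model.frame(lhs_formula, data = train)
outcomes <- as.data.frame(outcome_frame)

colnames(pred_matrix)
names(outcomes)
```

Processing the two sides separately is what lets inline functions such as log() on the LHS be evaluated by model.frame() without ever touching the predictors.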
When forge() is used with the default formula blueprint:
It calls shrink() to trim new_data to only the required columns and coerce new_data to a tibble.
It calls scream() to perform validation on the structure of the columns of new_data.
Predictors
It runs stats::model.frame() on new_data using the stored terms object corresponding to the predictors.
If, in the original mold() call, indicators = "traditional" was set, it then runs stats::model.matrix() on the result.
If, in the original mold() call, indicators = "none" was set, it runs stats::model.matrix() on the result without the factor columns, and then adds them on afterwards.
If, in the original mold() call, indicators = "one_hot" was set, it runs stats::model.matrix() on the result with a contrast function that includes indicators for all levels of all factor columns.
If any offsets are present from using offset() in the original call to mold(), then they are extracted with model_offset().
If intercept = TRUE in the original call to mold(), then an intercept column is added.
It coerces the result of the above steps to a tibble.
Outcomes
It runs stats::model.frame() on new_data using the stored terms object corresponding to the outcomes.
Coerces the result to a tibble.
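The core idea of the predictor steps, reusing a terms object stored at training time, can be sketched in base R. This is an illustration under assumptions rather than what forge() literally runs: the data is invented, and stats::delete.response() and stats::.getXlevels() stand in for the information a blueprint would store.

```r
# Invented training and prediction data
train <- data.frame(
  y = c(1, 2, 3, 4),
  num = c(10, 20, 30, 40),
  fac = factor(c("a", "b", "a", "c"))
)
new_data <- data.frame(
  num = c(15, 25),
  fac = factor(c("c", "a"))  # only two of the three training levels appear
)

# "Training": keep the predictor terms and the factor levels seen in training
train_frame <- model.frame(y ~ num + fac, data = train)
predictor_terms <- delete.response(attr(train_frame, "terms"))
factor_levels <- .getXlevels(predictor_terms, train_frame)

# "Prediction": rebuild the model frame from the stored terms, so that
# new_data is expanded into the same columns as the training data
new_frame <- model.frame(predictor_terms, data = new_data, xlev = factor_levels)
new_predictors <- model.matrix(predictor_terms, new_frame)

colnames(new_predictors)
# "(Intercept)" "num" "facb" "facc"
```

Even though level "b" never appears in `new_data`, the `facb` column is still produced, which keeps the prediction-time columns aligned with the training-time columns.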
There are a number of differences from base R regarding how formulas are processed by mold() that require some explanation.
Multivariate outcomes can be specified on the LHS using syntax that is similar to the RHS (i.e. outcome_1 + outcome_2 ~ predictors).
If any complex calculations are done on the LHS and they return matrices (like stats::poly()), then those matrices are flattened into multiple columns of the tibble after the call to model.frame(). While this is possible, it is not recommended, and if a large amount of preprocessing is required on the outcomes, then you are better off using a recipes::recipe().
Global variables are not allowed in the formula. An error will be thrown if they are included. All terms in the formula should come from data. If you need to use inline functions in the formula, the safest way to do so is to prefix them with their package name, like pkg::fn(). This ensures that the function will always be available at mold() (fit) and forge() (prediction) time. That said, if the package is attached (i.e. with library()), then you should be able to use the inline function without the prefix.
By default, intercepts are not included in the predictor output from the formula. To include an intercept, set blueprint = default_formula_blueprint(intercept = TRUE). The rationale for this is that many packages either always require or never allow an intercept (for example, the earth package), and they do a large amount of extra work to keep the user from supplying one or removing it. This interface standardizes all of that flexibility in one place.
# ---------------------------------------------------------------------------

data("hardhat-example-data")

# ---------------------------------------------------------------------------
# Formula Example

# Call mold() with the training data
processed <- mold(
  log(num_1) ~ num_2 + fac_1,
  example_train,
  blueprint = default_formula_blueprint(intercept = TRUE)
)

# Then, call forge() with the blueprint and the test data
# to have it preprocess the test data in the same way
forge(example_test, processed$blueprint)

# Use `outcomes = TRUE` to also extract the preprocessed outcome
forge(example_test, processed$blueprint, outcomes = TRUE)

# ---------------------------------------------------------------------------
# Factors without an intercept

# No intercept is added by default
processed <- mold(num_1 ~ fac_1 + fac_2, example_train)

# So, for factor columns, the first factor is completely expanded into all
# `K` columns (the number of levels), and the subsequent factors are expanded
# into `K - 1` columns.
processed$predictors

# In the above example, `fac_1` is expanded into all three columns,
# `fac_2` is not. This behavior comes from `model.matrix()`, and is somewhat
# known in the R community, but can lead to a model that is difficult to
# interpret since the corresponding p-values are testing wildly different
# hypotheses.

# To get all indicators for all columns (irrespective of the intercept),
# use the `indicators = "one_hot"` option
processed <- mold(
  num_1 ~ fac_1 + fac_2,
  example_train,
  blueprint = default_formula_blueprint(indicators = "one_hot")
)

processed$predictors

# It is not possible to construct a no-intercept model that expands all
# factors into `K - 1` columns using the formula method. If required, a
# recipe could be used to construct this model.

# ---------------------------------------------------------------------------
# Global variables

y <- rep(1, times = nrow(example_train))

# In base R, global variables are allowed in a model formula
frame <- model.frame(fac_1 ~ y + num_2, example_train)
head(frame)

# mold() does not allow them, and throws an error
try(mold(fac_1 ~ y + num_2, example_train))

# ---------------------------------------------------------------------------
# Dummy variables and interactions

# By default, factor columns are expanded
# and interactions are created, both by
# calling `model.matrix()`. Some models (like
# tree based models) can take factors directly
# but still might want to use the formula method.
# In those cases, set `indicators = "none"` to not
# run `model.matrix()` on factor columns. Interactions
# are still allowed and are run on numeric columns.

bp_no_indicators <- default_formula_blueprint(indicators = "none")

processed <- mold(
  ~ fac_1 + num_1:num_2,
  example_train,
  blueprint = bp_no_indicators
)

processed$predictors

# An informative error is thrown when `indicators = "none"` and
# factors are present in interaction terms or in inline functions
try(mold(num_1 ~ num_2:fac_1, example_train, blueprint = bp_no_indicators))
try(mold(num_1 ~ paste0(fac_1), example_train, blueprint = bp_no_indicators))

# ---------------------------------------------------------------------------
# Multivariate outcomes

# Multivariate formulas can be specified easily
processed <- mold(num_1 + log(num_2) ~ fac_1, example_train)
processed$outcomes

# Inline functions on the LHS are run, but any matrix
# output is flattened (like what happens in `model.matrix()`)
# (essentially this means you don't wind up with columns
# in the tibble that are matrices)
processed <- mold(poly(num_2, degree = 2) ~ fac_1, example_train)
processed$outcomes

# TRUE
ncol(processed$outcomes) == 2

# Multivariate formulas specified in mold()
# carry over into forge()
forge(example_test, processed$blueprint, outcomes = TRUE)

# ---------------------------------------------------------------------------
# Offsets

# Offsets are handled specially in base R, so they deserve special
# treatment here as well. You can add offsets using the inline function
# `offset()`
processed <- mold(num_1 ~ offset(num_2) + fac_1, example_train)

processed$extras$offset

# Multiple offsets can be included, and they get added together
processed <- mold(
  num_1 ~ offset(num_2) + offset(num_3),
  example_train
)

identical(
  processed$extras$offset$.offset,
  example_train$num_2 + example_train$num_3
)

# Forging test data will also require
# and include the offset
forge(example_test, processed$blueprint)

# ---------------------------------------------------------------------------
# Intercept only

# Because `1` and `0` are intercept modifying terms, they are
# not allowed in the formula and are instead controlled by the
# `intercept` argument of the blueprint. To use an intercept
# only formula, you should supply `NULL` on the RHS of the formula.
mold(
  ~NULL,
  example_train,
  blueprint = default_formula_blueprint(intercept = TRUE)
)

# ---------------------------------------------------------------------------
# Matrix output for predictors

# You can change the `composition` of the predictor data set
bp <- default_formula_blueprint(composition = "dgCMatrix")
processed <- mold(log(num_1) ~ num_2 + fac_1, example_train, blueprint = bp)
class(processed$predictors)
This page holds the details for the recipe preprocessing blueprint. This is the blueprint used by default from mold() if x is a recipe.
default_recipe_blueprint(
  intercept = FALSE,
  allow_novel_levels = FALSE,
  fresh = TRUE,
  strings_as_factors = TRUE,
  composition = "tibble"
)

## S3 method for class 'recipe'
mold(x, data, ..., blueprint = NULL)
intercept |
A logical. Should an intercept be included in the
processed data? This information is used by the |
allow_novel_levels |
A logical. Should novel factor levels be allowed at
prediction time? This information is used by the |
fresh |
Should already trained operations be re-trained when |
strings_as_factors |
Should character columns be converted to factors
when |
composition |
Either "tibble", "matrix", or "dgCMatrix" for the format of the processed predictors. If "matrix" or "dgCMatrix" are chosen, all of the predictors must be numeric after the preprocessing method has been applied; otherwise an error is thrown. |
x |
An unprepped recipe created from |
data |
A data frame or matrix containing the outcomes and predictors. |
... |
Not used. |
blueprint |
A preprocessing |
For default_recipe_blueprint(), a recipe blueprint.
When mold() is used with the default recipe blueprint:
It calls recipes::prep() to prep the recipe.
It calls recipes::juice() to extract the outcomes and predictors. These are returned as tibbles.
If intercept = TRUE, adds an intercept column to the predictors.
When forge() is used with the default recipe blueprint:
It calls shrink() to trim new_data to only the required columns and coerce new_data to a tibble.
It calls scream() to perform validation on the structure of the columns of new_data.
It calls recipes::bake() on the new_data using the prepped recipe used during training.
It adds an intercept column onto new_data if intercept = TRUE.
# example code

library(recipes)

# ---------------------------------------------------------------------------
# Setup

train <- iris[1:100, ]
test <- iris[101:150, ]

# ---------------------------------------------------------------------------
# Recipes example

# Create a recipe that logs a predictor
rec <- recipe(Species ~ Sepal.Length + Sepal.Width, train) %>%
  step_log(Sepal.Length)

processed <- mold(rec, train)

# Sepal.Length has been logged
processed$predictors

processed$outcomes

# The underlying blueprint is a prepped recipe
processed$blueprint$recipe

# Call forge() with the blueprint and the test data
# to have it preprocess the test data in the same way
forge(test, processed$blueprint)

# Use `outcomes = TRUE` to also extract the preprocessed outcome!
# This logged the Sepal.Length column of `new_data`
forge(test, processed$blueprint, outcomes = TRUE)

# ---------------------------------------------------------------------------
# With an intercept

# You can add an intercept with `intercept = TRUE`
processed <- mold(rec, train, blueprint = default_recipe_blueprint(intercept = TRUE))

processed$predictors

# But you also could have used a recipe step
rec2 <- step_intercept(rec)

mold(rec2, iris)$predictors

# ---------------------------------------------------------------------------
# Matrix output for predictors

# You can change the `composition` of the predictor data set
bp <- default_recipe_blueprint(composition = "dgCMatrix")
processed <- mold(rec, train, blueprint = bp)
class(processed$predictors)

# ---------------------------------------------------------------------------
# Non standard roles

# If you have custom recipes roles, they are assumed to be required at
# `bake()` time when passing in `new_data`. This is an assumption that both
# recipes and hardhat make, meaning that those roles are required at
# `forge()` time as well.
rec_roles <- recipe(train) %>%
  update_role(Sepal.Width, new_role = "predictor") %>%
  update_role(Species, new_role = "outcome") %>%
  update_role(Sepal.Length, new_role = "id") %>%
  update_role(Petal.Length, new_role = "important")

processed_roles <- mold(rec_roles, train)

# The custom roles will be in the `mold()` result in case you need
# them for modeling.
processed_roles$extras

# And they are in the `forge()` result
forge(test, processed_roles$blueprint)$extras

# If you remove a column with a custom role from the test data, then you
# won't be able to `forge()` even though this recipe technically didn't
# use that column in any steps
test2 <- test
test2$Petal.Length <- NULL
try(forge(test2, processed_roles$blueprint))

# Most of the time, if you find yourself in the above scenario, then we
# suggest that you remove `Petal.Length` from the data that is supplied to
# the recipe. If that isn't an option, you can declare that that column
# isn't required at `bake()` time by using `update_role_requirements()`
rec_roles <- update_role_requirements(rec_roles, "important", bake = FALSE)
processed_roles <- mold(rec_roles, train)
forge(test2, processed_roles$blueprint)
This page holds the details for the XY preprocessing blueprint. This is the blueprint used by default from mold() if x and y are provided separately (i.e. the XY interface is used).
default_xy_blueprint(
  intercept = FALSE,
  allow_novel_levels = FALSE,
  composition = "tibble"
)

## S3 method for class 'data.frame'
mold(x, y, ..., blueprint = NULL)

## S3 method for class 'matrix'
mold(x, y, ..., blueprint = NULL)
intercept |
A logical. Should an intercept be included in the
processed data? This information is used by the |
allow_novel_levels |
A logical. Should novel factor levels be allowed at
prediction time? This information is used by the |
composition |
Either "tibble", "matrix", or "dgCMatrix" for the format of the processed predictors. If "matrix" or "dgCMatrix" are chosen, all of the predictors must be numeric after the preprocessing method has been applied; otherwise an error is thrown. |
x |
A data frame or matrix containing the predictors. |
y |
A data frame, matrix, or vector containing the outcomes. |
... |
Not used. |
blueprint |
A preprocessing |
As documented in standardize(), if y is a vector, then the returned outcomes tibble has 1 column with a standardized name of ".outcome".
The one special thing about the XY method's forge function is the behavior of outcomes = TRUE when a vector y value was provided to the original call to mold(). In that case, mold() converts y into a tibble, with a default name of .outcome. This is the column that forge() will look for in new_data to preprocess. See the examples section for a demonstration of this.
For default_xy_blueprint(), an XY blueprint.
When mold() is used with the default xy blueprint:
It converts x to a tibble.
It adds an intercept column to x if intercept = TRUE.
It runs standardize() on y.
When forge() is used with the default xy blueprint:
It calls shrink() to trim new_data to only the required columns and coerce new_data to a tibble.
It calls scream() to perform validation on the structure of the columns of new_data.
It adds an intercept column onto new_data if intercept = TRUE.
# ---------------------------------------------------------------------------
# Setup

train <- iris[1:100, ]
test <- iris[101:150, ]

train_x <- train["Sepal.Length"]
train_y <- train["Species"]

test_x <- test["Sepal.Length"]
test_y <- test["Species"]

# ---------------------------------------------------------------------------
# XY Example

# First, call mold() with the training data
processed <- mold(train_x, train_y)

# Then, call forge() with the blueprint and the test data
# to have it preprocess the test data in the same way
forge(test_x, processed$blueprint)

# ---------------------------------------------------------------------------
# Intercept

processed <- mold(train_x, train_y, blueprint = default_xy_blueprint(intercept = TRUE))

forge(test_x, processed$blueprint)

# ---------------------------------------------------------------------------
# XY Method and forge(outcomes = TRUE)

# You can request that the new outcome columns are preprocessed as well, but
# they have to be present in `new_data`!

processed <- mold(train_x, train_y)

# Can't do this!
try(forge(test_x, processed$blueprint, outcomes = TRUE))

# Need to use the full test set, including `y`
forge(test, processed$blueprint, outcomes = TRUE)

# With the XY method, if the Y value used in `mold()` is a vector,
# then a column name of `.outcome` is automatically generated.
# This name is what forge() looks for in `new_data`.

# Y is a vector!
y_vec <- train_y$Species

processed_vec <- mold(train_x, y_vec)

# This throws an informative error that tells you
# to include an `".outcome"` column in `new_data`.
try(forge(iris, processed_vec$blueprint, outcomes = TRUE))

test2 <- test
test2$.outcome <- test2$Species
test2$Species <- NULL

# This works, and returns a tibble in the $outcomes slot
forge(test2, processed_vec$blueprint, outcomes = TRUE)

# ---------------------------------------------------------------------------
# Matrix output for predictors

# You can change the `composition` of the predictor data set
bp <- default_xy_blueprint(composition = "dgCMatrix")
processed <- mold(train_x, train_y, blueprint = bp)
class(processed$predictors)
delete_response()
is exactly the same as delete.response()
, except
that it fixes a long-standing bug by also removing the part of the
"dataClasses"
attribute corresponding to the response, if it exists.
delete_response(terms)
terms |
A terms object. |
The bug is described here:
https://stat.ethz.ch/pipermail/r-devel/2012-January/062942.html
terms
with the response sections removed.
framed <- model_frame(Species ~ Sepal.Width, iris)

attr(delete.response(framed$terms), "dataClasses")

attr(delete_response(framed$terms), "dataClasses")
fct_encode_one_hot()
encodes a factor as a one-hot indicator matrix.
This matrix consists of length(x)
rows and length(levels(x))
columns.
Every value in row i
of the matrix is filled with 0L
except for the
column that has the same name as x[[i]]
, which is instead filled with 1L
.
fct_encode_one_hot(x)
x |
A factor.
|
The columns are returned in the same order as levels(x)
.
If x
has names, the names are propagated onto the result as the row names.
An integer matrix with length(x)
rows and length(levels(x))
columns.
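The properties described above can be checked directly. The following is a minimal sketch; the named factor `x` is a hypothetical input made up for illustration, and it includes an unused level to show that a column is still created for it.

```r
library(hardhat)

# Hypothetical named factor; level "c" is declared but never used
x <- factor(
  c(first = "b", second = "a", third = "b"),
  levels = c("a", "b", "c")
)

encoded <- fct_encode_one_hot(x)
encoded

# length(x) rows and length(levels(x)) columns
dim(encoded)

# Every row holds exactly one 1L
rowSums(encoded)

# Columns follow levels(x); row names come from names(x)
colnames(encoded)
rownames(encoded)
```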
fct_encode_one_hot(factor(letters))

fct_encode_one_hot(factor(letters[1:2], levels = letters))

set.seed(1234)
fct_encode_one_hot(factor(sample(letters[1:4], 10, TRUE)))
forge()
applies the transformations requested by the specific blueprint
on a set of new_data
. This new_data
contains new predictors
(and potentially outcomes) that will be used to generate predictions.
All blueprints have consistent return values with the others, but each is
unique enough to have its own help page. Click through below to learn
how to use each one in conjunction with forge()
.
XY Method - default_xy_blueprint()
Formula Method - default_formula_blueprint()
Recipes Method - default_recipe_blueprint()
forge(new_data, blueprint, ..., outcomes = FALSE)
new_data |
A data frame or matrix of predictors to process. If
|
blueprint |
A preprocessing blueprint. |
... |
Not used. |
outcomes |
A logical. Should the outcomes be processed and returned as well? |
If the outcomes are present in new_data
, they can optionally be processed
and returned in the outcomes
slot of the returned list by setting
outcomes = TRUE
. This is very useful when doing cross validation where
you need to preprocess the outcomes of a test set before computing
performance.
A named list with 3 elements:
predictors
: A tibble containing the preprocessed
new_data
predictors.
outcomes
: If outcomes = TRUE
, a tibble containing the preprocessed
outcomes found in new_data
. Otherwise, NULL
.
extras
: Either NULL
if the blueprint returns no extra information,
or a named list containing the extra information.
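As a rough illustration of why `outcomes = TRUE` is useful for held-out data, the sketch below molds a training split, fits a model, and then forges a test split with its outcomes so that an error metric can be computed. The use of `lm()` and of RMSE is an assumption made for the example and is not something hardhat prescribes.

```r
library(hardhat)

train <- iris[1:100, ]
test <- iris[101:150, ]

processed <- mold(Sepal.Width ~ Sepal.Length, train)

# Hypothetical model: ordinary least squares on the molded training data
fit_data <- cbind(processed$predictors, processed$outcomes)
fit <- lm(Sepal.Width ~ Sepal.Length, data = fit_data)

# Preprocess the held-out rows and keep their outcomes
holdout <- forge(test, processed$blueprint, outcomes = TRUE)

# Compare predictions against the preprocessed held-out outcomes
pred <- predict(fit, newdata = holdout$predictors)
rmse <- sqrt(mean((holdout$outcomes$Sepal.Width - pred)^2))
rmse
```

With `outcomes = FALSE`, the `outcomes` slot would be `NULL` and the last step would have to go back to the raw `test` data instead.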
# See the blueprint specific documentation linked above
# for various ways to call forge with different
# blueprints.

train <- iris[1:100, ]
test <- iris[101:150, ]

# Formula
processed <- mold(
  log(Sepal.Width) ~ Species,
  train,
  blueprint = default_formula_blueprint(indicators = "none")
)

forge(test, processed$blueprint, outcomes = TRUE)
frequency_weights()
creates a vector of frequency weights which allow you
to compactly repeat an observation a set number of times. Frequency weights
are supplied as a non-negative integer vector, where only whole numbers are
allowed.
frequency_weights(x)
x |
An integer vector. |
Frequency weights are integers that denote how many times a particular row of the data has been observed. They help compress redundant rows into a single entry.
In tidymodels, frequency weights are used for all parts of the preprocessing, model fitting, and performance estimation operations.
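To make the "compressed rows" idea concrete, the sketch below compares a statistic computed on expanded rows with the same statistic computed on the compressed rows and their counts. The data values are hypothetical, and `vctrs::vec_data()` is used here as one assumed way to recover the plain integer counts from the weights vector.

```r
library(hardhat)

# Hypothetical compressed data: each value was observed several times
values <- c(2.5, 4.0, 7.5)
w <- frequency_weights(c(3L, 1L, 2L))

# Recover the underlying integer counts
counts <- vctrs::vec_data(w)

# Expanding the rows...
expanded <- rep(values, times = counts)
mean(expanded)

# ...gives the same mean as weighting the compressed rows
weighted.mean(values, counts)
```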
A new frequency weights vector.
# Record that the first observation has 10 replicates, the second has 12
# replicates, and so on
frequency_weights(c(10, 12, 2, 1))

# Fractional values are not allowed
try(frequency_weights(c(1.5, 2.3, 10)))
When predicting from a model, it is often important for the new_data
to
have the same classes as the original data used to fit the model.
get_data_classes()
extracts the classes from the original training data.
get_data_classes(data)
data |
A data frame or matrix. |
A named list. The names are the column names of data
and the values are
character vectors containing the class of that column.
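One possible use of the returned list is to compare new data against the training data by hand. The sketch below is only an illustration; `new_data` is a hypothetical data set in which a factor column has arrived as character.

```r
library(hardhat)

train <- iris[1:100, ]

# Hypothetical new data where Species is no longer a factor
new_data <- iris[101:150, ]
new_data$Species <- as.character(new_data$Species)

train_classes <- get_data_classes(train)
new_classes <- get_data_classes(new_data)

# Names of the columns whose class vector differs from training
changed <- names(train_classes)[
  !mapply(identical, train_classes, new_classes[names(train_classes)])
]
changed
```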
get_data_classes(iris)

get_data_classes(as.matrix(mtcars))

# Unlike .MFclass(), the full class
# vector is returned
data <- data.frame(col = ordered(c("a", "b")))

.MFclass(data$col)

get_data_classes(data)
get_levels()
extracts the levels from any factor columns in data
. It is
mainly useful for extracting the original factor levels from the predictors
in the training set. get_outcome_levels()
is a small wrapper around
get_levels()
for extracting levels from a factor outcome
that first calls standardize()
on y
.
get_levels(data)

get_outcome_levels(y)
data |
A data.frame to extract levels from. |
y |
The outcome. This can be:
|
A named list with as many elements as there are factor columns in data
or y
. The names are the names of the factor columns, and the values
are character vectors of the levels.
If there are no factor columns, NULL
is returned.
# Factor columns are returned with their levels
get_levels(iris)

# No factor columns
get_levels(mtcars)

# standardize() is first run on `y`
# which converts the input to a data frame
# with an automatically named column, `".outcome"`
get_outcome_levels(y = factor(letters[1:5]))
Example data for hardhat
Data objects for a training and test set with the same variables: three numeric and two factor columns.
example_train , example_test
|
tibbles |
data("hardhat-example-data")
These generics are used to extract elements from various model objects. Methods are defined in other packages, such as tune, workflows, and workflowsets, but the returned object is always the same.
extract_fit_engine()
returns the engine specific fit embedded within
a parsnip model fit. For example, when using parsnip::linear_reg()
with the "lm"
engine, this returns the underlying lm
object.
extract_fit_parsnip()
returns a parsnip model fit.
extract_mold()
returns the preprocessed "mold" object returned
from mold()
. It contains information about the preprocessing,
including either the prepped recipe, the formula terms object, or
variable selectors.
extract_spec_parsnip()
returns a parsnip model specification.
extract_preprocessor()
returns the formula, recipe, or variable
expressions used for preprocessing.
extract_recipe()
returns a recipe, possibly estimated.
extract_workflow()
returns a workflow, possibly fit.
extract_parameter_dials()
returns a single dials parameter object.
extract_parameter_set_dials()
returns a set of dials parameter objects.
extract_fit_time()
returns a tibble with fit times.
extract_workflow(x, ...)

extract_recipe(x, ...)

extract_spec_parsnip(x, ...)

extract_fit_parsnip(x, ...)

extract_fit_engine(x, ...)

extract_mold(x, ...)

extract_preprocessor(x, ...)

extract_postprocessor(x, ...)

extract_parameter_dials(x, ...)

extract_parameter_set_dials(x, ...)

extract_fit_time(x, ...)
x |
An object. |
... |
Extra arguments passed on to methods. |
# See packages where methods are defined for examples, such as `parsnip` or
# `workflows`.
importance_weights()
creates a vector of importance weights which allow you
to apply a context dependent weight to your observations. Importance weights
are supplied as a non-negative double vector, where fractional values are
allowed.
importance_weights(x)
x |
A double vector. |
Importance weights focus on how much each row of the data set should influence model estimation. These can be based on data or arbitrarily set to achieve some goal.
In tidymodels, importance weights only affect the model estimation and supervised recipes steps. They are not used with yardstick functions for calculating measures of model performance.
A new importance weights vector.
importance_weights(c(1.5, 2.3, 10))
Is x a preprocessing blueprint?
is_blueprint()
checks if x
inherits from "hardhat_blueprint"
.
is_blueprint(x)
x |
An object. |
is_blueprint(default_xy_blueprint())
Is x a case weights vector?
is_case_weights()
checks if x
inherits from "hardhat_case_weights"
.
is_case_weights(x)
x |
An object. |
A single TRUE
or FALSE
.
is_case_weights(1)

is_case_weights(frequency_weights(1))
Is x a frequency weights vector?
is_frequency_weights()
checks if x
inherits from
"hardhat_frequency_weights"
.
is_frequency_weights(x)
x |
An object. |
A single TRUE
or FALSE
.
is_frequency_weights(1)

is_frequency_weights(frequency_weights(1))

is_frequency_weights(importance_weights(1))
Is x an importance weights vector?
is_importance_weights()
checks if x
inherits from
"hardhat_importance_weights"
.
is_importance_weights(x)
x |
An object. |
A single TRUE
or FALSE
.
is_importance_weights(1)

is_importance_weights(frequency_weights(1))

is_importance_weights(importance_weights(1))
model_frame()
is a stricter version of stats::model.frame()
. There are
a number of differences, the main ones being that rows are never dropped
and the return value is a list with the frame and terms separated into
two distinct objects.
model_frame(formula, data)
formula |
A formula or terms object representing the terms of the model frame. |
data |
A data frame or matrix containing the terms of |
The following explains the rationale for some of the differences in arguments
compared to stats::model.frame()
:
subset
: Not allowed because the number of rows before and after
model_frame()
has been run should always be the same.
na.action
: Not allowed and is forced to "na.pass"
because the
number of rows before and after model_frame()
has been run should always
be the same.
drop.unused.levels
: Not allowed because it seems inconsistent for
data
and the result of model_frame()
to ever have the same factor column
but with different levels, unless specified through original_levels
. If
this is required, it should be done through a recipe step explicitly.
xlev
: Not allowed because this check should have been done ahead of
time. Use scream()
to check the integrity of data
against a training
set if that is required.
...
: Not exposed because offsets are handled separately, and
it is not necessary to pass weights here any more because rows are never
dropped (so weights don't have to be subset alongside the rest of the
design matrix). If other non-predictor columns are required, use the
"roles" features of recipes.
It is important to always use the results of model_frame()
with
model_matrix()
rather than stats::model.matrix()
because the tibble
in the result of model_frame()
does not have a terms object attached.
If model.matrix(<terms>, <tibble>)
is called directly, then a call to
model.frame()
will be made automatically, which can give faulty results.
A named list with two elements:
"data"
: A tibble containing the model frame.
"terms"
: A terms object containing the terms for the model frame.
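The row-preserving behavior can be contrasted with `stats::model.frame()` in a short sketch. It assumes the session still uses R's default `na.action` option (`"na.omit"`); `iris2` is a hypothetical copy of `iris` with a few missing values introduced.

```r
library(hardhat)

iris2 <- iris
iris2$Sepal.Width[1:3] <- NA

# stats::model.frame() applies the default na.action and drops incomplete rows
base_frame <- stats::model.frame(Species ~ Sepal.Width, iris2)
nrow(base_frame)

# model_frame() keeps every row and returns the frame and terms separately
framed <- model_frame(Species ~ Sepal.Width, iris2)
nrow(framed$data)
names(framed)
```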
# ---------------------------------------------------------------------------
# Example usage

framed <- model_frame(Species ~ Sepal.Width, iris)

framed$data

framed$terms

# ---------------------------------------------------------------------------
# Missing values never result in dropped rows

iris2 <- iris
iris2$Sepal.Width[1] <- NA

framed2 <- model_frame(Species ~ Sepal.Width, iris2)

head(framed2$data)

nrow(framed2$data) == nrow(iris2)
model_matrix()
is a stricter version of stats::model.matrix()
. Notably,
model_matrix()
will never drop rows, and the result will be a tibble.
model_matrix(terms, data)
terms |
A terms object to construct a model matrix with. This is
typically the terms object returned from the corresponding call to
|
data |
A tibble to construct the design matrix with. This is
typically the tibble returned from the corresponding call to
|
The following explains the rationale for some of the differences in arguments
compared to stats::model.matrix()
:
contrasts.arg
: Set the contrasts argument, options("contrasts")
globally, or assign a contrast to the factor of interest directly using
stats::contrasts()
. See the examples section.
xlev
: Not allowed because model.frame()
is never called, so it is
unnecessary.
...
: Not allowed because the default method of model.matrix()
does
not use it, and the lm
method uses it to pass potential offsets and
weights through, which are handled differently in hardhat.
A tibble containing the design matrix.
# ---------------------------------------------------------------------------
# Example usage

framed <- model_frame(Sepal.Width ~ Species, iris)

model_matrix(framed$terms, framed$data)

# ---------------------------------------------------------------------------
# Missing values never result in dropped rows

iris2 <- iris
iris2$Species[1] <- NA

framed2 <- model_frame(Sepal.Width ~ Species, iris2)

model_matrix(framed2$terms, framed2$data)

# ---------------------------------------------------------------------------
# Contrasts

# Default contrasts
y <- factor(c("a", "b"))
x <- data.frame(y = y)
framed <- model_frame(~y, x)

# Setting contrasts directly
y_with_contrast <- y
contrasts(y_with_contrast) <- contr.sum(2)
x2 <- data.frame(y = y_with_contrast)
framed2 <- model_frame(~y, x2)

# Compare!
model_matrix(framed$terms, framed$data)
model_matrix(framed2$terms, framed2$data)

# Also, can set the contrasts globally
global_override <- c(unordered = "contr.sum", ordered = "contr.poly")

rlang::with_options(
  .expr = {
    model_matrix(framed$terms, framed$data)
  },
  contrasts = global_override
)
model_offset()
extracts a numeric offset from a model frame. It is
inspired by stats::model.offset()
, but has nicer error messages and
is slightly stricter.
model_offset(terms, data)
terms |
A |
data |
A data frame returned from a call to |
If a column that has been tagged as an offset is not numeric, a nice error message is thrown telling you exactly which column was problematic.
stats::model.offset()
also allows for a column named "(offset)"
to be
considered an offset along with any others that have been tagged by
stats::offset()
. However, stats::model.matrix()
does not recognize
these columns as offsets (so it doesn't remove them as it should). Because
of this inconsistency, columns named "(offset)"
are not treated specially
by model_offset()
.
A numeric vector representing the offset.
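When more than one column is tagged with `offset()`, a single numeric vector is still returned. The sketch below assumes the tagged columns are summed element-wise, as `stats::model.offset()` does, and checks that assumption against a manual sum rather than asserting it as documented behavior.

```r
library(hardhat)

frame <- model.frame(
  Species ~ offset(Sepal.Width) + offset(Sepal.Length),
  iris
)

combined <- model_offset(terms(frame), frame)

# Manual element-wise sum of the two tagged columns
manual <- iris$Sepal.Width + iris$Sepal.Length

# TRUE if the offsets were combined by summation
isTRUE(all.equal(as.numeric(combined), manual))
```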
x <- model.frame(Species ~ offset(Sepal.Width), iris)

model_offset(terms(x), x)

xx <- model.frame(Species ~ offset(Sepal.Width) + offset(Sepal.Length), iris)

model_offset(terms(xx), xx)

# Problematic columns are caught with intuitive errors
tryCatch(
  expr = {
    x <- model.frame(~ offset(Species), iris)
    model_offset(terms(x), x)
  },
  error = function(e) {
    print(e$message)
  }
)
create_modeling_package()
will:
Call usethis::create_package()
to set up a new R package.
Call use_modeling_deps()
.
Call use_modeling_files()
.
use_modeling_deps()
will:
Add hardhat, rlang, and stats to Imports
Add recipes to Suggests
If roxygen2 is available, use roxygen markdown
use_modeling_files()
will:
Add a package documentation file
Generate and populate 3 files in R/
:
{{model}}-constructor.R
{{model}}-fit.R
{{model}}-predict.R
create_modeling_package(path, model, fields = NULL, open = interactive())

use_modeling_deps()

use_modeling_files(model)
path |
A path. If it exists, it is used. If it does not exist, it is created, provided that the parent path exists. |
model |
A string. The name of the high level modeling function that
users will call. For example, |
fields |
A named list of fields to add to DESCRIPTION,
potentially overriding default values. See |
open |
If TRUE, activates the new project:
|
create_modeling_package()
returns the project path invisibly.
use_modeling_deps()
returns invisibly.
use_modeling_files()
returns model
invisibly.
mold()
applies the appropriate processing steps required to get training
data ready to be fed into a model. It does this through the use of various
blueprints that understand how to preprocess data that come in various
forms, such as a formula or a recipe.
All blueprints have consistent return values with the others, but each is
unique enough to have its own help page. Click through below to learn
how to use each one in conjunction with mold()
.
XY Method - default_xy_blueprint()
Formula Method - default_formula_blueprint()
Recipes Method - default_recipe_blueprint()
mold(x, ...)
x |
An object. See the method specific implementations linked in the Description for more information. |
... |
Not used. |
A named list containing 4 elements:
predictors
: A tibble containing the molded predictors to be used in the
model.
outcomes
: A tibble containing the molded outcomes to be used in the
model.
blueprint
: A method specific "hardhat_blueprint"
object for use when
making predictions.
extras
: Either NULL
if the blueprint returns no extra information,
or a named list containing the extra information.
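The four elements listed above can be inspected directly on the result of a `mold()` call. This is a small sketch using the formula method on `iris`; the final `forge()` call only illustrates that the stored blueprint is what carries the preprocessing over to new data.

```r
library(hardhat)

processed <- mold(Species ~ Sepal.Width, iris)

# The names of the returned list
names(processed)

# Predictors and outcomes are tibbles with one row per training row
processed$predictors
processed$outcomes

# The blueprint is a "hardhat_blueprint" object
is_blueprint(processed$blueprint)

# It can be handed to forge() to preprocess new data the same way
forge(head(iris), processed$blueprint)$predictors
```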
# See the method specific documentation linked in Description
# for the details of each blueprint, and more examples.

# XY
mold(iris["Sepal.Width"], iris$Species)

# Formula
mold(Species ~ Sepal.Width, iris)

# Recipe
library(recipes)
mold(recipe(Species ~ Sepal.Width, iris), iris)
new_case_weights()
is a developer oriented function for constructing a new
case weights type. The <case_weights>
type itself is an abstract type
with very little functionality. Because of this, class
is a required
argument.
new_case_weights(x, ..., class)
x |
An integer or double vector. |
... |
Name-value pairs defining attributes |
class |
Name of subclass. |
A new subclassed case weights vector.
new_case_weights(1:5, class = "my_weights")
This page contains the constructors for the default blueprints. They can be
extended if you want to add extra behavior on top of what the default
blueprints already do, but generally you will extend the non-default versions
of the constructors found in the documentation for new_blueprint()
.
new_default_formula_blueprint(
  intercept = FALSE,
  allow_novel_levels = FALSE,
  ptypes = NULL,
  formula = NULL,
  indicators = "traditional",
  composition = "tibble",
  terms = list(predictors = NULL, outcomes = NULL),
  levels = NULL,
  ...,
  subclass = character()
)

new_default_recipe_blueprint(
  intercept = FALSE,
  allow_novel_levels = FALSE,
  fresh = TRUE,
  strings_as_factors = TRUE,
  composition = "tibble",
  ptypes = NULL,
  recipe = NULL,
  extra_role_ptypes = NULL,
  ...,
  subclass = character()
)

new_default_xy_blueprint(
  intercept = FALSE,
  allow_novel_levels = FALSE,
  composition = "tibble",
  ptypes = NULL,
  ...,
  subclass = character()
)
intercept |
A logical. Should an intercept be included in the
processed data? This information is used by the |
allow_novel_levels |
A logical. Should novel factor levels be allowed at
prediction time? This information is used by the |
ptypes |
Either |
formula |
Either |
indicators |
A single character string. Control how factors are expanded into dummy variable indicator columns. One of:
|
composition |
Either "tibble", "matrix", or "dgCMatrix" for the format of the processed predictors. If "matrix" or "dgCMatrix" are chosen, all of the predictors must be numeric after the preprocessing method has been applied; otherwise an error is thrown. |
terms |
A named list of two elements, |
levels |
Either |
... |
Name-value pairs for additional elements of blueprints that subclass this blueprint. |
subclass |
A character vector. The subclasses of this blueprint. |
fresh |
Should already trained operations be re-trained when |
strings_as_factors |
Should character columns be converted to factors
when |
recipe |
Either |
extra_role_ptypes |
A named list. The names are the unique non-standard
recipe roles (i.e. everything except |
These are the base classes for creating new preprocessing blueprints. All
blueprints inherit from the one created by new_blueprint()
, and the default
method specific blueprints inherit from the other three here.
If you want to create your own processing blueprint for a specific method,
generally you will subclass one of the method specific blueprints here. If
you want to create a completely new preprocessing blueprint for a totally new
preprocessing method (i.e. not the formula, xy, or recipe method) then
you should subclass new_blueprint()
.
In addition to creating a blueprint subclass, you will likely also need to
provide S3 methods for run_mold()
and run_forge()
for your subclass.
new_formula_blueprint(
  intercept = FALSE,
  allow_novel_levels = FALSE,
  ptypes = NULL,
  formula = NULL,
  indicators = "traditional",
  composition = "tibble",
  ...,
  subclass = character()
)

new_recipe_blueprint(
  intercept = FALSE,
  allow_novel_levels = FALSE,
  fresh = TRUE,
  strings_as_factors = TRUE,
  composition = "tibble",
  ptypes = NULL,
  recipe = NULL,
  ...,
  subclass = character()
)

new_xy_blueprint(
  intercept = FALSE,
  allow_novel_levels = FALSE,
  composition = "tibble",
  ptypes = NULL,
  ...,
  subclass = character()
)

new_blueprint(
  intercept = FALSE,
  allow_novel_levels = FALSE,
  composition = "tibble",
  ptypes = NULL,
  ...,
  subclass = character()
)
intercept |
A logical. Should an intercept be included in the
processed data? This information is used by the |
allow_novel_levels |
A logical. Should novel factor levels be allowed at
prediction time? This information is used by the |
ptypes |
Either |
formula |
Either |
indicators |
A single character string. Control how factors are expanded into dummy variable indicator columns. One of:
|
composition |
Either "tibble", "matrix", or "dgCMatrix" for the format of the processed predictors. If "matrix" or "dgCMatrix" are chosen, all of the predictors must be numeric after the preprocessing method has been applied; otherwise an error is thrown. |
... |
Name-value pairs for additional elements of blueprints that subclass this blueprint. |
subclass |
A character vector. The subclasses of this blueprint. |
fresh |
Should already trained operations be re-trained when |
strings_as_factors |
Should character columns be converted to factors
when |
recipe |
Either |
A preprocessing blueprint, which is a list containing the inputs used as arguments to the function, along with a class specific to the type of blueprint being created.
new_frequency_weights()
is a developer oriented function for constructing
a new frequency weights vector. Generally, you should use
frequency_weights()
instead.
new_frequency_weights(x = integer(), ..., class = character())
x |
An integer vector. |
... |
Name-value pairs defining attributes |
class |
Name of subclass. |
A new frequency weights vector.
new_frequency_weights()

new_frequency_weights(1:5)
new_importance_weights()
is a developer oriented function for constructing
a new importance weights vector. Generally, you should use
importance_weights()
instead.
new_importance_weights(x = double(), ..., class = character())
x |
A double vector. |
... |
Name-value pairs defining attributes |
class |
Name of subclass. |
A new importance weights vector.
new_importance_weights()

new_importance_weights(c(1.5, 2.3, 10))
A model is a scalar object, as classified in
Advanced R. As such, it
takes uniquely named elements in ...
and combines them into a list with
a class of class
. This entire object represents a single model.
new_model(..., blueprint = default_xy_blueprint(), class = character())
... |
Name-value pairs for elements specific to the model defined by
|
blueprint |
A preprocessing blueprint. |
class |
A character vector representing the class of the model. |
Because every model should have multiple interfaces, including formula
and recipes
interfaces, all models should have a blueprint
that
can process new data when predict()
is called. The easiest way to generate
a blueprint with all of the information required at prediction time is to
use the one that is returned from a call to mold()
.
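The pattern described above, where the blueprint returned from `mold()` is stored in the model and reused at prediction time, might look like the following sketch. The function names `fit_mean_model()` and `predict_mean_model()`, the element name `center`, and the class `"mean_model"` are all hypothetical and exist only for this illustration; the "model" simply predicts the training mean of the outcome.

```r
library(hardhat)

# Hypothetical fitting function: mold the data, then store the blueprint
fit_mean_model <- function(x, y) {
  processed <- mold(x, y)

  new_model(
    center = mean(processed$outcomes[[1]]),
    blueprint = processed$blueprint,
    class = "mean_model"
  )
}

# Hypothetical prediction function: forge new data with the stored blueprint
predict_mean_model <- function(object, new_data) {
  processed <- forge(new_data, object$blueprint)
  rep(object$center, nrow(processed$predictors))
}

fit <- fit_mean_model(mtcars["wt"], mtcars$mpg)
class(fit)

preds <- predict_mean_model(fit, mtcars[1:3, "wt", drop = FALSE])
preds
```

Because `forge()` runs before any prediction is made, new data with missing or mistyped predictor columns is rejected by the blueprint rather than by the model code.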
A new scalar model object, represented as a classed list with named elements
specified in ...
.
new_model(
  custom_element = "my-elem",
  blueprint = default_xy_blueprint(),
  class = "custom_model"
)
quantile_pred()
is a special vector class used to efficiently store
predictions from a quantile regression model. It requires the same quantile
levels for each row being predicted.
quantile_pred(values, quantile_levels = double())

extract_quantile_levels(x)

## S3 method for class 'quantile_pred'
as_tibble(x, ..., .rows = NULL, .name_repair = "minimal", rownames = NULL)

## S3 method for class 'quantile_pred'
as.matrix(x, ...)
values |
A matrix of values. Each column should correspond to one of the quantile levels. |
quantile_levels |
A vector of probabilities corresponding to |
x |
An object produced by |
... |
Not currently used. |
.rows , .name_repair , rownames
|
Arguments not used but required by the original S3 method. |
quantile_pred() returns a vector of values associated with the quantile levels.
extract_quantile_levels() returns a numeric vector of levels.
as_tibble() returns a tibble with columns ".pred_quantile", ".quantile_levels", and ".row".
as.matrix() returns an unnamed matrix with rows as samples, columns as quantile levels, and predictions as entries.
.pred_quantile <- quantile_pred(matrix(rnorm(20), 5), c(.2, .4, .6, .8))

unclass(.pred_quantile)

# Access the underlying information
extract_quantile_levels(.pred_quantile)

# Matrix format
as.matrix(.pred_quantile)

# Tidy format
library(tibble)
as_tibble(.pred_quantile)
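To make the shapes of the two output formats concrete, here is a small sketch with fixed values instead of random ones. The variable names are illustrative only, and the example assumes a hardhat version that includes quantile_pred().

```r
library(hardhat)

# Three predicted rows, each with a lower and an upper quantile
values <- matrix(c(1, 2, 3, 4, 5, 6), nrow = 3, ncol = 2)
preds <- quantile_pred(values, quantile_levels = c(0.25, 0.75))

# Tidy format: one row per (prediction, quantile level) pair
tidy <- tibble::as_tibble(preds)
dim(tidy)

# Matrix format: one row per prediction, one column per quantile level
dim(as.matrix(preds))

# The quantile levels are shared by every row
extract_quantile_levels(preds)
```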
refresh_blueprint() is a developer facing generic function that is called at the end of update_blueprint(). It is simply a wrapper around the method-specific new_*_blueprint() function that runs the updated blueprint through the constructor again to ensure that all of the elements of the blueprint are still valid after the update.
refresh_blueprint(blueprint)
blueprint |
A preprocessing blueprint. |
If you implement your own custom blueprint, you should export a refresh_blueprint() method that just calls the constructor for your blueprint and passes through all of the elements of the blueprint to the constructor.
blueprint is returned after a call to the corresponding constructor.
blueprint <- default_xy_blueprint()

# This should never be done manually, but is essentially
# what `update_blueprint(blueprint, intercept = TRUE)` does for you
blueprint$intercept <- TRUE

# Then update_blueprint() will call refresh_blueprint()
# to ensure that the structure is correct
refresh_blueprint(blueprint)

# So you can't do something like...
blueprint_bad <- blueprint
blueprint_bad$intercept <- 1

# ...because the constructor will catch it
try(refresh_blueprint(blueprint_bad))

# And update_blueprint() catches this automatically
try(update_blueprint(blueprint, intercept = 1))
forge() according to a blueprint
This is a developer facing function that is only used if you are creating your own blueprint subclass. It is called from forge() and dispatches off the S3 class of the blueprint. This gives you an opportunity to forge the new data in a way that is specific to your blueprint.
run_forge() is always called from forge() with the same arguments, unlike run_mold(), because there aren't different interfaces for calling forge(). run_forge() is always called as:
run_forge(blueprint, new_data = new_data, outcomes = outcomes)
If you write a blueprint subclass for new_xy_blueprint(), new_recipe_blueprint(), new_formula_blueprint(), or new_blueprint(), then your run_forge() method signature must match this.
run_forge(blueprint, new_data, ..., outcomes = FALSE)

## S3 method for class 'default_formula_blueprint'
run_forge(blueprint, new_data, ..., outcomes = FALSE)

## S3 method for class 'default_recipe_blueprint'
run_forge(blueprint, new_data, ..., outcomes = FALSE)

## S3 method for class 'default_xy_blueprint'
run_forge(blueprint, new_data, ..., outcomes = FALSE)
blueprint |
A preprocessing blueprint. |
new_data |
A data frame or matrix of predictors to process. If
|
... |
Not used. |
outcomes |
A logical. Should the outcomes be processed and returned as well? |
run_forge() methods return the object that is then immediately returned from forge(). See the return value section of forge() to understand what the structure of the return value should look like.
bp <- default_xy_blueprint()

outcomes <- mtcars["mpg"]
predictors <- mtcars
predictors$mpg <- NULL

mold <- run_mold(bp, x = predictors, y = outcomes)

run_forge(mold$blueprint, new_data = predictors)
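In everyday use, run_forge() is reached through forge() rather than called directly. The sketch below shows that round trip with outcomes requested as well. The train/test split is arbitrary and chosen only for illustration.

```r
library(hardhat)

train <- mtcars[1:20, ]
test <- mtcars[21:30, ]

# Fit-time preprocessing; the outcome keeps its column name, "mpg"
processed <- mold(train[c("cyl", "disp")], train["mpg"])

# Prediction-time preprocessing; forge() dispatches to run_forge()
# with `new_data` and `outcomes` passed through
forged <- forge(test, processed$blueprint, outcomes = TRUE)

names(forged)
names(forged$predictors)
names(forged$outcomes)
```

Because the outcome was supplied as a one-column data frame, forge() looks for a column named "mpg" in the new data when outcomes = TRUE.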
mold() according to a blueprint
This is a developer facing function that is only used if you are creating your own blueprint subclass. It is called from mold() and dispatches off the S3 class of the blueprint. This gives you an opportunity to mold the data in a way that is specific to your blueprint.
run_mold() will be called with different arguments depending on the interface to mold() that is used:
XY interface:
run_mold(blueprint, x = x, y = y)
Formula interface:
run_mold(blueprint, data = data)
Additionally, the blueprint will have been updated to contain the formula.
Recipe interface:
run_mold(blueprint, data = data)
Additionally, the blueprint will have been updated to contain the recipe.
If you write a blueprint subclass for new_xy_blueprint(), new_recipe_blueprint(), or new_formula_blueprint(), then your run_mold() method signature must match whichever interface listed above will be used.
If you write a completely new blueprint inheriting only from new_blueprint() and write a new mold() method (because you aren't using an xy, formula, or recipe interface), then you will have full control over how run_mold() will be called.
run_mold(blueprint, ...)

## S3 method for class 'default_formula_blueprint'
run_mold(blueprint, ..., data)

## S3 method for class 'default_recipe_blueprint'
run_mold(blueprint, ..., data)

## S3 method for class 'default_xy_blueprint'
run_mold(blueprint, ..., x, y)
blueprint |
A preprocessing blueprint. |
... |
Not used. Required for extensibility. |
data |
A data frame or matrix containing the outcomes and predictors. |
x |
A data frame or matrix containing the predictors. |
y |
A data frame, matrix, or vector containing the outcomes. |
run_mold() methods return the object that is then immediately returned from mold(). See the return value section of mold() to understand what the structure of the return value should look like.
bp <- default_xy_blueprint()

outcomes <- mtcars["mpg"]
predictors <- mtcars
predictors$mpg <- NULL

run_mold(bp, x = predictors, y = outcomes)
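The interface-dependent calls described for run_mold() can be observed from the user side by calling mold() through two of the interfaces. This is a sketch only; the recipe interface is omitted to avoid a dependency on the recipes package.

```r
library(hardhat)

# XY interface: mold() ends up calling run_mold(blueprint, x = x, y = y)
molded_xy <- mold(mtcars[c("cyl", "disp")], mtcars["mpg"])

# Formula interface: mold() first stores the formula in the blueprint,
# then calls run_mold(blueprint, data = data)
molded_formula <- mold(mpg ~ cyl + disp, mtcars)

class(molded_xy$blueprint)
class(molded_formula$blueprint)

# The formula travels with the blueprint
molded_formula$blueprint$formula
```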
scream() ensures that the structure of data is the same as prototype, ptype. Under the hood, vctrs::vec_cast() is used, which casts each column of data to the same type as the corresponding column in ptype.
This casting enforces a number of important structural checks, including but not limited to:
Data Classes - Checks that the class of each column in data is the same as the corresponding column in ptype.
Novel Levels - Checks that the factor columns in data don't have any new levels when compared with the ptype columns. If there are new levels, a warning is issued and they are coerced to NA. This check is optional, and can be turned off with allow_novel_levels = TRUE.
Level Recovery - Checks that the factor columns in data aren't missing any factor levels when compared with the ptype columns. If there are missing levels, then they are restored.
scream(data, ptype, allow_novel_levels = FALSE)
data |
A data frame containing the new data to check the structure of. |
ptype |
A data frame prototype to cast |
allow_novel_levels |
Should novel factor levels in |
scream() is called by forge() after shrink() but before the actual processing is done. Generally, you don't need to call scream() directly, as forge() will do it for you.
If scream() is used as a standalone function, it is good practice to call shrink() right before it as there are no checks in scream() that ensure that all of the required column names actually exist in data. Those checks exist in shrink().
A tibble containing the required columns after any required structural modifications have been made.
scream() tries to be helpful by recovering missing factor levels and warning about novel levels. The following graphic outlines how scream() handles factor levels when coercing from a column in data to a column in ptype.
Note that ordered factor handling is much stricter than factor handling. Ordered factors in data must have exactly the same levels as ordered factors in ptype.
# ---------------------------------------------------------------------------
# Setup

train <- iris[1:100, ]
test <- iris[101:150, ]

# mold() is run at model fit time
# and a formula preprocessing blueprint is recorded
x <- mold(log(Sepal.Width) ~ Species, train)

# Inside the result of mold() are the prototype tibbles
# for the predictors and the outcomes
ptype_pred <- x$blueprint$ptypes$predictors
ptype_out <- x$blueprint$ptypes$outcomes

# ---------------------------------------------------------------------------
# shrink() / scream()

# Pass the test data, along with a prototype, to
# shrink() to extract the prototype columns
test_shrunk <- shrink(test, ptype_pred)

# Now pass that to scream() to perform validation checks
# If no warnings / errors are thrown, the checks were
# successful!
scream(test_shrunk, ptype_pred)

# ---------------------------------------------------------------------------
# Outcomes

# To also extract the outcomes, use the outcome prototype
test_outcome <- shrink(test, ptype_out)
scream(test_outcome, ptype_out)

# ---------------------------------------------------------------------------
# Casting

# scream() uses vctrs::vec_cast() to intelligently convert
# new data to the prototype automatically. This means
# it can automatically perform certain conversions, like
# coercing character columns to factors.
test2 <- test
test2$Species <- as.character(test2$Species)

test2_shrunk <- shrink(test2, ptype_pred)
scream(test2_shrunk, ptype_pred)

# It can also recover missing factor levels.
# For example, it is plausible that the test data only had the
# "virginica" level
test3 <- test
test3$Species <- factor(test3$Species, levels = "virginica")

test3_shrunk <- shrink(test3, ptype_pred)
test3_fixed <- scream(test3_shrunk, ptype_pred)

# scream() recovered the missing levels
levels(test3_fixed$Species)

# ---------------------------------------------------------------------------
# Novel levels

# When novel levels with any data are present in `data`, the default
# is to coerce them to `NA` values with a warning.
test4 <- test
test4$Species <- as.character(test4$Species)
test4$Species[1] <- "new_level"

test4$Species <- factor(
  test4$Species,
  levels = c(levels(test$Species), "new_level")
)

test4 <- shrink(test4, ptype_pred)

# Warning is thrown
test4_removed <- scream(test4, ptype_pred)

# Novel level is removed
levels(test4_removed$Species)

# No warning is thrown
test4_kept <- scream(test4, ptype_pred, allow_novel_levels = TRUE)

# Novel level is kept
levels(test4_kept$Species)
shrink() subsets data to only contain the required columns specified by the prototype, ptype.
shrink(data, ptype)
data |
A data frame containing the data to subset. |
ptype |
A data frame prototype containing the required columns. |
shrink() is called by forge() before scream() and before the actual processing is done.
A tibble containing the required columns.
# ---------------------------------------------------------------------------
# Setup

train <- iris[1:100, ]
test <- iris[101:150, ]

# ---------------------------------------------------------------------------
# shrink()

# mold() is run at model fit time
# and a formula preprocessing blueprint is recorded
x <- mold(log(Sepal.Width) ~ Species, train)

# Inside the result of mold() are the prototype tibbles
# for the predictors and the outcomes
ptype_pred <- x$blueprint$ptypes$predictors
ptype_out <- x$blueprint$ptypes$outcomes

# Pass the test data, along with a prototype, to
# shrink() to extract the prototype columns
shrink(test, ptype_pred)

# To extract the outcomes, just use the
# outcome prototype
shrink(test, ptype_out)

# shrink() makes sure that the columns
# required by `ptype` actually exist in the data
# and errors nicely when they don't
test2 <- subset(test, select = -Species)
try(shrink(test2, ptype_pred))
The family of spruce_*() functions convert predictions into a standardized format. They are generally called from a prediction implementation function for the specific type of prediction to return.
spruce_numeric(pred)

spruce_class(pred_class)

spruce_prob(pred_levels, prob_matrix)
pred |
( |
pred_class |
( |
pred_levels , prob_matrix
|
(
|
After running a spruce_*() function, you should always use the validation function validate_prediction_size() to ensure that the number of rows being returned is the same as the number of rows in the input (new_data).
A tibble, ideally with the same number of rows as the new_data passed to predict(). The column names and number of columns vary based on the function used, but are standardized.
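As a rough sketch of how the three spruce_*() functions and validate_prediction_size() fit together, the snippet below uses made-up prediction values. The numbers, class labels, and the choice of mtcars rows as new_data are purely illustrative.

```r
library(hardhat)

new_data <- mtcars[1:3, ]

# Numeric predictions become a one-column tibble
pred_num <- spruce_numeric(c(21.5, 20.1, 23.3))

# Class predictions must be supplied as a factor
pred_cls <- spruce_class(factor(c("low", "high", "low")))

# Class probabilities: one matrix column per level, one row per prediction
prob_matrix <- matrix(c(0.8, 0.2, 0.3, 0.7, 0.6, 0.4), ncol = 2, byrow = TRUE)
pred_prob <- spruce_prob(c("low", "high"), prob_matrix)

# Confirm one prediction row per row of new_data
validate_prediction_size(pred_num, new_data)
validate_prediction_size(pred_cls, new_data)
validate_prediction_size(pred_prob, new_data)

names(pred_num)
names(pred_cls)
names(pred_prob)
```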
This family of spruce_*_multiple() functions converts multi-outcome predictions into a standardized format. They are generally called from a prediction implementation function for the specific type of prediction to return.
spruce_numeric_multiple(...)

spruce_class_multiple(...)

spruce_prob_multiple(...)
... |
Multiple vectors of predictions:
If the |
For spruce_numeric_multiple(), a tibble of numeric columns named with the pattern .pred_*.
For spruce_class_multiple(), a tibble of factor columns named with the pattern .pred_class_*.
For spruce_prob_multiple(), a tibble of data frame columns named with the pattern .pred_*.
spruce_numeric_multiple(1:3, foo = 2:4)

spruce_class_multiple(
  one_step = factor(c("a", "b", "c")),
  two_step = factor(c("a", "c", "c"))
)

one_step <- matrix(c(.3, .7, .0, .1, .3, .6), nrow = 2, byrow = TRUE)
two_step <- matrix(c(.2, .7, .1, .2, .4, .4), nrow = 2, byrow = TRUE)
binary <- matrix(c(.5, .5, .4, .6), nrow = 2, byrow = TRUE)

spruce_prob_multiple(
  one_step = spruce_prob(c("a", "b", "c"), one_step),
  two_step = spruce_prob(c("a", "b", "c"), two_step),
  binary = spruce_prob(c("yes", "no"), binary)
)
Most of the time, the input to a model should be flexible enough to capture a number of different input types from the user. standardize() focuses on capturing the flexibility in the outcome.
standardize(y)
y |
The outcome. This can be:
|
standardize() is called from mold() when using an XY interface (i.e. a y argument was supplied).
All possible values of y are transformed into a tibble for standardization. Vectors are transformed into a tibble with a single column named ".outcome".
standardize(1:5)

standardize(factor(letters[1:5]))

mat <- matrix(1:10, ncol = 2)
colnames(mat) <- c("a", "b")
standardize(mat)

df <- data.frame(x = 1:5, y = 6:10)
standardize(df)
tune() is an argument placeholder to be used with the recipes, parsnip, and tune packages. It marks recipes step and parsnip model arguments for tuning.
tune(id = "")
id |
A single character value that can be used to differentiate parameters that are used in multiple places but have the same name, or if the user wants to add a note to the specified parameter. |
A call object that echoes the user's input.
tune::tune_grid(), tune::tune_bayes()
tune()
tune("your name here")

# In practice, `tune()` is used alongside recipes or parsnip to mark
# specific arguments for tuning
library(recipes)

recipe(mpg ~ ., data = mtcars) %>%
  step_normalize(all_numeric_predictors()) %>%
  step_pca(all_numeric_predictors(), num_comp = tune())
update_blueprint() is the correct way to alter elements of an existing blueprint object. It has two benefits over just doing blueprint$elem <- new_elem.
The name you are updating must already exist in the blueprint. This prevents you from accidentally updating non-existent elements.
The constructor for the blueprint is automatically run after the update by refresh_blueprint() to ensure that the blueprint is still valid.
update_blueprint(blueprint, ...)
blueprint |
A preprocessing blueprint. |
... |
Name-value pairs of existing elements in |
blueprint <- default_xy_blueprint()

# `intercept` defaults to FALSE
blueprint

update_blueprint(blueprint, intercept = TRUE)

# Can't update non-existent elements
try(update_blueprint(blueprint, intercpt = TRUE))

# Can't add non-valid elements
try(update_blueprint(blueprint, intercept = 1))
data contains required column names
validate - asserts the following:
The column names of data must contain all original_names.
check - returns the following:
ok: A logical. Does the check pass?
missing_names: A character vector. The missing column names.
validate_column_names(data, original_names)

check_column_names(data, original_names)
data |
A data frame to check. |
original_names |
A character vector. The original column names. |
A special error is thrown if the missing column is named ".outcome". This only happens in the case where mold() is called using the xy-method, and a vector y value is supplied rather than a data frame or matrix. In that case, y is coerced to a data frame, and the automatic name ".outcome" is added, and this is what is looked for in forge(). If this happens, and the user tries to request outcomes using forge(..., outcomes = TRUE) but the supplied new_data does not contain the required ".outcome" column, a special error is thrown telling them what to do. See the examples!
validate_column_names() returns data invisibly.
check_column_names() returns a named list of two components, ok and missing_names.
hardhat provides validation functions at two levels.
check_*(): check a condition, and return a list. The list always contains at least one element, ok, a logical that specifies if the check passed. Each check also has check specific elements in the returned list that can be used to construct meaningful error messages.
validate_*(): check a condition, and error if it does not pass. These functions call their corresponding check function, and then provide a default error message. If you, as a developer, want a different error message, then call the check_*() function yourself, and provide your own validation function.
Other validation functions: validate_no_formula_duplication(), validate_outcomes_are_binary(), validate_outcomes_are_factors(), validate_outcomes_are_numeric(), validate_outcomes_are_univariate(), validate_prediction_size(), validate_predictors_are_numeric()
# ---------------------------------------------------------------------------

original_names <- colnames(mtcars)

test <- mtcars
bad_test <- test[, -c(3, 4)]

# All good
check_column_names(test, original_names)

# Missing 2 columns
check_column_names(bad_test, original_names)

# Will error
try(validate_column_names(bad_test, original_names))

# ---------------------------------------------------------------------------
# Special error when `.outcome` is missing

train <- iris[1:100, ]
test <- iris[101:150, ]

train_x <- subset(train, select = -Species)
train_y <- train$Species

# Here, y is a vector
processed <- mold(train_x, train_y)

# So the default column name is `".outcome"`
processed$outcomes

# It doesn't affect forge() normally
forge(test, processed$blueprint)

# But if the outcome is requested, and `".outcome"`
# is not present in `new_data`, an error is thrown
# with very specific instructions
try(forge(test, processed$blueprint, outcomes = TRUE))

# To get this to work, just create an .outcome column in new_data
test$.outcome <- test$Species

forge(test, processed$blueprint, outcomes = TRUE)
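The two-level design makes it straightforward to write a validator with your own error message by building on the check_*() function. The sketch below is one possible shape for such a wrapper. The function name my_validate_column_names() and the message wording are hypothetical, not part of hardhat.

```r
library(hardhat)

# Hypothetical validator with a package-specific error message
my_validate_column_names <- function(data, original_names) {
  check <- check_column_names(data, original_names)

  if (!check$ok) {
    missing <- paste(check$missing_names, collapse = ", ")
    stop("new data is missing required columns: ", missing, call. = FALSE)
  }

  # Mirror validate_*() by returning the input invisibly
  invisible(data)
}

bad_test <- mtcars[, -c(3, 4)]

result <- try(
  my_validate_column_names(bad_test, colnames(mtcars)),
  silent = TRUE
)
cat(result)
```

Returning the input invisibly keeps the wrapper interchangeable with the built-in validate_column_names().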
formula
validate - asserts the following:
formula must not have duplicate terms on the left and right hand side of the formula.
check - returns the following:
ok: A logical. Does the check pass?
duplicates: A character vector. The duplicate terms.
validate_no_formula_duplication(formula, original = FALSE)

check_no_formula_duplication(formula, original = FALSE)
formula |
A formula to check. |
original |
A logical. Should the original names be checked, or should
the names after processing be used? If |
validate_no_formula_duplication() returns formula invisibly.
check_no_formula_duplication() returns a named list of two components, ok and duplicates.
hardhat provides validation functions at two levels.
check_*(): check a condition, and return a list. The list always contains at least one element, ok, a logical that specifies if the check passed. Each check also has check specific elements in the returned list that can be used to construct meaningful error messages.
validate_*(): check a condition, and error if it does not pass. These functions call their corresponding check function, and then provide a default error message. If you, as a developer, want a different error message, then call the check_*() function yourself, and provide your own validation function.
Other validation functions: validate_column_names(), validate_outcomes_are_binary(), validate_outcomes_are_factors(), validate_outcomes_are_numeric(), validate_outcomes_are_univariate(), validate_prediction_size(), validate_predictors_are_numeric()
# All good
check_no_formula_duplication(y ~ x)

# Not good!
check_no_formula_duplication(y ~ y)

# This is generally okay
check_no_formula_duplication(y ~ log(y))

# But you can be more strict
check_no_formula_duplication(y ~ log(y), original = TRUE)

# This would throw an error
try(validate_no_formula_duplication(log(y) ~ log(y)))
validate - asserts the following:
outcomes must have binary factor columns.
check - returns the following:
ok: A logical. Does the check pass?
bad_cols: A character vector. The names of the columns with problems.
num_levels: An integer vector. The actual number of levels of the columns with problems.
validate_outcomes_are_binary(outcomes)

check_outcomes_are_binary(outcomes)
outcomes |
An object to check. |
The expected way to use this validation function is to supply it the $outcomes element of the result of a call to mold().
validate_outcomes_are_binary() returns outcomes invisibly.
check_outcomes_are_binary() returns a named list of three components, ok, bad_cols, and num_levels.
hardhat provides validation functions at two levels.
check_*(): check a condition, and return a list. The list always contains at least one element, ok, a logical that specifies if the check passed. Each check also has check specific elements in the returned list that can be used to construct meaningful error messages.
validate_*(): check a condition, and error if it does not pass. These functions call their corresponding check function, and then provide a default error message. If you, as a developer, want a different error message, then call the check_*() function yourself, and provide your own validation function.
Other validation functions: validate_column_names(), validate_no_formula_duplication(), validate_outcomes_are_factors(), validate_outcomes_are_numeric(), validate_outcomes_are_univariate(), validate_prediction_size(), validate_predictors_are_numeric()
# Not a binary factor. 0 levels
check_outcomes_are_binary(data.frame(x = 1))

# Not a binary factor. 1 level
check_outcomes_are_binary(data.frame(x = factor("A")))

# All good
check_outcomes_are_binary(data.frame(x = factor(c("A", "B"))))
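The intended usage, passing the $outcomes element of a mold() result, might look like the following sketch. The iris subsets are chosen only for illustration.

```r
library(hardhat)

# Three-class outcome: the check fails and reports the number of levels
processed <- mold(Species ~ Sepal.Length, iris)
three_class <- check_outcomes_are_binary(processed$outcomes)

three_class$ok
three_class$bad_cols
three_class$num_levels

# Drop one class (and its unused level) to get a binary outcome
two_class_data <- droplevels(iris[iris$Species != "setosa", ])
processed2 <- mold(Species ~ Sepal.Length, two_class_data)
two_class <- check_outcomes_are_binary(processed2$outcomes)

two_class$ok
```

Note the droplevels() call: filtering rows alone would leave the unused "setosa" level in place, and the outcome would still have three levels.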
validate - asserts the following:
outcomes must have factor columns.
check - returns the following:
ok: A logical. Does the check pass?
bad_classes: A named list. The names are the names of problematic columns, and the values are the classes of the matching column.
validate_outcomes_are_factors(outcomes)

check_outcomes_are_factors(outcomes)
outcomes |
An object to check. |
The expected way to use this validation function is to supply it the
$outcomes
element of the result of a call to mold()
.
validate_outcomes_are_factors() returns outcomes invisibly.
check_outcomes_are_factors() returns a named list of two components, ok and bad_classes.
hardhat provides validation functions at two levels.
check_*(): check a condition, and return a list. The list always contains at least one element, ok, a logical that specifies if the check passed. Each check also has check-specific elements in the returned list that can be used to construct meaningful error messages.
validate_*(): check a condition, and error if it does not pass. These functions call their corresponding check function, and then provide a default error message. If you, as a developer, want a different error message, then call the check_*() function yourself, and provide your own validation function.
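A minimal sketch of such a developer-supplied validation function follows. The name my_validate_outcomes_are_factors() and the wording of the error message are hypothetical and not part of hardhat; only check_outcomes_are_factors() and its ok and bad_classes elements come from the package.

```r
library(hardhat)

# Hypothetical custom validator built on top of the check function
my_validate_outcomes_are_factors <- function(outcomes) {
  check <- check_outcomes_are_factors(outcomes)

  if (!check$ok) {
    # Use the check-specific element to name the problematic columns
    bad_cols <- paste(names(check$bad_classes), collapse = ", ")
    stop("These outcome columns must be factors: ", bad_cols, call. = FALSE)
  }

  # Mirror the validate_*() convention of returning the input invisibly
  invisible(outcomes)
}

# Passes silently
my_validate_outcomes_are_factors(data.frame(x = factor(c("A", "B"))))

# Fails with the custom message
failure <- try(my_validate_outcomes_are_factors(data.frame(x = 1)), silent = TRUE)
```

Calling the check function directly keeps the pass/fail logic in hardhat while leaving the message entirely under the developer's control.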
Other validation functions: validate_column_names(), validate_no_formula_duplication(), validate_outcomes_are_binary(), validate_outcomes_are_numeric(), validate_outcomes_are_univariate(), validate_prediction_size(), validate_predictors_are_numeric()
# Not a factor column.
check_outcomes_are_factors(data.frame(x = 1))

# All good
check_outcomes_are_factors(data.frame(x = factor(c("A", "B"))))
validate - asserts the following:
outcomes must have numeric columns.
check - returns the following:
ok: A logical. Does the check pass?
bad_classes: A named list. The names are the names of problematic columns, and the values are the classes of the matching column.
validate_outcomes_are_numeric(outcomes)
check_outcomes_are_numeric(outcomes)
outcomes | An object to check. |
The expected way to use this validation function is to supply it the $outcomes element of the result of a call to mold().
validate_outcomes_are_numeric() returns outcomes invisibly.
check_outcomes_are_numeric() returns a named list of two components, ok and bad_classes.
hardhat provides validation functions at two levels.
check_*(): check a condition, and return a list. The list always contains at least one element, ok, a logical that specifies if the check passed. Each check also has check-specific elements in the returned list that can be used to construct meaningful error messages.
validate_*(): check a condition, and error if it does not pass. These functions call their corresponding check function, and then provide a default error message. If you, as a developer, want a different error message, then call the check_*() function yourself, and provide your own validation function.
Other validation functions: validate_column_names(), validate_no_formula_duplication(), validate_outcomes_are_binary(), validate_outcomes_are_factors(), validate_outcomes_are_univariate(), validate_prediction_size(), validate_predictors_are_numeric()
# All good
check_outcomes_are_numeric(mtcars)

# Species is not numeric
check_outcomes_are_numeric(iris)

# This gives an intelligent error message
try(validate_outcomes_are_numeric(iris))
validate - asserts the following:
outcomes must have 1 column. Atomic vectors are treated as 1 column matrices.
check - returns the following:
ok: A logical. Does the check pass?
n_cols: A single numeric. The actual number of columns.
validate_outcomes_are_univariate(outcomes)
check_outcomes_are_univariate(outcomes)
outcomes | An object to check. |
The expected way to use this validation function is to supply it the $outcomes element of the result of a call to mold().
validate_outcomes_are_univariate() returns outcomes invisibly.
check_outcomes_are_univariate() returns a named list of two components, ok and n_cols.
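To make the column-counting rule concrete, the following sketch compares an atomic vector with a two-column data frame. The inputs are arbitrary illustrations; the expectation that the vector counts as one column follows from the description above.

```r
library(hardhat)

# An atomic vector is treated as a 1 column matrix
vec_result <- check_outcomes_are_univariate(c(1.5, 2, 3))
vec_result$ok
vec_result$n_cols

# A two-column data frame is not univariate
df_result <- check_outcomes_are_univariate(mtcars[, c("mpg", "wt")])
df_result$ok
df_result$n_cols
```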
hardhat provides validation functions at two levels.
check_*(): check a condition, and return a list. The list always contains at least one element, ok, a logical that specifies if the check passed. Each check also has check-specific elements in the returned list that can be used to construct meaningful error messages.
validate_*(): check a condition, and error if it does not pass. These functions call their corresponding check function, and then provide a default error message. If you, as a developer, want a different error message, then call the check_*() function yourself, and provide your own validation function.
Other validation functions: validate_column_names(), validate_no_formula_duplication(), validate_outcomes_are_binary(), validate_outcomes_are_factors(), validate_outcomes_are_numeric(), validate_prediction_size(), validate_predictors_are_numeric()
validate_outcomes_are_univariate(data.frame(x = 1))

try(validate_outcomes_are_univariate(mtcars))
validate - asserts the following:
The size of pred must be the same as the size of new_data.
check - returns the following:
ok: A logical. Does the check pass?
size_new_data: A single numeric. The size of new_data.
size_pred: A single numeric. The size of pred.
validate_prediction_size(pred, new_data)
check_prediction_size(pred, new_data)
pred | A tibble. The predictions to return from any prediction |
new_data | A data frame of new predictors and possibly outcomes. |
This validation function is more developer-focused than user-focused. It is a final check to be used right before a value is returned from your specific predict() method, and is mainly a "good practice" sanity check to ensure that your prediction blueprint always returns the same number of rows as new_data, which is one of the modeling conventions this package tries to promote.
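One way this final check might sit at the end of a prediction routine is sketched below. The helper predict_sketch() is hypothetical and uses a plain lm() fit as a stand-in for a package's own model; it is not how any particular package implements predict().

```r
library(hardhat)

# Hypothetical helper standing in for a package's own predict() method
predict_sketch <- function(fit, new_data) {
  # Generate raw numeric predictions from the underlying model
  pred_vec <- unname(stats::predict(fit, new_data))

  # Standardize the output into a tibble with a .pred column
  pred <- spruce_numeric(pred_vec)

  # Final sanity check right before returning
  validate_prediction_size(pred, new_data)

  pred
}

fit <- stats::lm(mpg ~ wt, data = mtcars)
pred <- predict_sketch(fit, mtcars[1:5, ])
pred
```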
validate_prediction_size() returns pred invisibly.
check_prediction_size() returns a named list of three components, ok, size_new_data, and size_pred.
hardhat provides validation functions at two levels.
check_*(): check a condition, and return a list. The list always contains at least one element, ok, a logical that specifies if the check passed. Each check also has check-specific elements in the returned list that can be used to construct meaningful error messages.
validate_*(): check a condition, and error if it does not pass. These functions call their corresponding check function, and then provide a default error message. If you, as a developer, want a different error message, then call the check_*() function yourself, and provide your own validation function.
Other validation functions: validate_column_names(), validate_no_formula_duplication(), validate_outcomes_are_binary(), validate_outcomes_are_factors(), validate_outcomes_are_numeric(), validate_outcomes_are_univariate(), validate_predictors_are_numeric()
# Say new_data has 5 rows
new_data <- mtcars[1:5, ]

# And somehow you generate predictions
# for those 5 rows
pred_vec <- 1:5

# Then you use `spruce_numeric()` to clean
# up these numeric predictions
pred <- spruce_numeric(pred_vec)
pred

# Use this check to ensure that
# the number of rows of pred matches new_data
check_prediction_size(pred, new_data)

# An informative error message is thrown
# if the rows are different
try(validate_prediction_size(spruce_numeric(1:4), new_data))
validate - asserts the following:
predictors must have numeric columns.
check - returns the following:
ok: A logical. Does the check pass?
bad_classes: A named list. The names are the names of problematic columns, and the values are the classes of the matching column.
validate_predictors_are_numeric(predictors)
check_predictors_are_numeric(predictors)
predictors | An object to check. |
The expected way to use this validation function is to supply it the $predictors element of the result of a call to mold().
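The sketch below shows how the result can depend on the blueprint used when molding. It assumes that the default formula blueprint (indicators = "traditional") expands factor predictors into numeric indicator columns, while indicators = "none" leaves them as factors; the formula and dataset are illustrative only.

```r
library(hardhat)

# Default formula blueprint: Species is expanded into numeric columns
processed <- mold(Sepal.Length ~ Species, iris)
default_result <- check_predictors_are_numeric(processed$predictors)
default_result$ok

# With indicators = "none", Species is kept as a factor predictor
bp <- default_formula_blueprint(indicators = "none")
processed_none <- mold(Sepal.Length ~ Species, iris, blueprint = bp)
none_result <- check_predictors_are_numeric(processed_none$predictors)
none_result$ok
names(none_result$bad_classes)
```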
validate_predictors_are_numeric() returns predictors invisibly.
check_predictors_are_numeric() returns a named list of two components, ok and bad_classes.
hardhat provides validation functions at two levels.
check_*(): check a condition, and return a list. The list always contains at least one element, ok, a logical that specifies if the check passed. Each check also has check-specific elements in the returned list that can be used to construct meaningful error messages.
validate_*(): check a condition, and error if it does not pass. These functions call their corresponding check function, and then provide a default error message. If you, as a developer, want a different error message, then call the check_*() function yourself, and provide your own validation function.
Other validation functions: validate_column_names(), validate_no_formula_duplication(), validate_outcomes_are_binary(), validate_outcomes_are_factors(), validate_outcomes_are_numeric(), validate_outcomes_are_univariate(), validate_prediction_size()
# All good
check_predictors_are_numeric(mtcars)

# Species is not numeric
check_predictors_are_numeric(iris)

# This gives an intelligent error message
try(validate_predictors_are_numeric(iris))
weighted_table() computes a weighted contingency table based on factors provided in ... and a double vector of weights provided in weights. It can be seen as a weighted extension to base::table() and an alternative to stats::xtabs().
weighted_table() always uses the exact set of levels returned by levels() when constructing the table. This results in the following properties:
Missing values found in the factors are never included in the table unless there is an explicit NA factor level. If needed, this can be added to a factor with base::addNA() or forcats::fct_expand(x, NA).
Levels found in the factors that aren't actually used in the underlying data are included in the table with a value of 0. If needed, you can drop unused factor levels by re-running your factor through factor(), or by calling forcats::fct_drop().
See the examples section for more information about these properties.
weighted_table(..., weights, na_remove = FALSE)
... | Factors of equal length to use in the weighted table. If the |
weights | A double vector of weights used to fill the cells of the weighted table. This must be the same length as the factors provided in |
na_remove | A single |
The result of weighted_table() does not have a "table" class attached to it. It is only a double array. This is because "table" objects are defined as containing integer counts, but weighted tables can utilize fractional weights.
The weighted table as an array of double values.
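A brief sketch of what this means in practice, using made-up factors and fractional weights: the result can be inspected like any other double array, but it does not inherit from "table".

```r
library(hardhat)

x <- factor(c("x", "y", "x"))
y <- factor(c("a", "b", "b"))
w <- c(0.5, 1.25, 2)

tab <- weighted_table(x = x, y = y, weights = w)

# A plain double array, not a "table" object
inherits(tab, "table")
is.double(tab)

# One dimension per factor, one cell per combination of levels
dim(tab)

# The cells sum to the total weight
sum(tab)
```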
x <- factor(c("x", "y", "z", "x", "x", "y"))
y <- factor(c("a", "b", "a", "a", "b", "b"))
w <- c(1.5, 2, 1.1, .5, 3, 2)

weighted_table(x = x, y = y, weights = w)

# ---------------------------------------------------------------------------
# If `weights` contains missing values, then missing values will be
# propagated into the weighted table
x <- factor(c("x", "y", "y"))
y <- factor(c("a", "b", "b"))
w <- c(1, NA, 3)

weighted_table(x = x, y = y, weights = w)

# You can remove the missing values while summing up the weights with
# `na_remove = TRUE`
weighted_table(x = x, y = y, weights = w, na_remove = TRUE)

# ---------------------------------------------------------------------------
# If there are missing values in the factors, those typically don't show
# up in the weighted table
x <- factor(c("x", NA, "y", "x"))
y <- factor(c("a", "b", "a", NA))
w <- 1:4

weighted_table(x = x, y = y, weights = w)

# This is because the missing values aren't considered explicit levels
levels(x)

# You can force them to show up in the table by using `addNA()` ahead of time
# (or `forcats::fct_expand(x, NA)`)
x <- addNA(x, ifany = TRUE)
y <- addNA(y, ifany = TRUE)
levels(x)

weighted_table(x = x, y = y, weights = w)

# ---------------------------------------------------------------------------
# If there are levels in your factors that aren't actually used in the
# underlying data, then they will still show up in the table with a `0` value
x <- factor(c("x", "y", "x"), levels = c("x", "y", "z"))
y <- factor(c("a", "b", "a"), levels = c("a", "b", "c"))
w <- 1:3

weighted_table(x = x, y = y, weights = w)

# If you want to drop these empty factor levels from the result, you can
# rerun `factor()` ahead of time to drop them (or `forcats::fct_drop()`)
x <- factor(x)
y <- factor(y)
levels(x)

weighted_table(x = x, y = y, weights = w)