Title: | Preprocessing and Feature Engineering Steps for Modeling |
---|---|
Description: | A recipe prepares your data for modeling. We provide an extensible framework for pipeable sequences of feature engineering steps that provide preprocessing tools to be applied to data. Statistical parameters for the steps can be estimated from an initial data set and then applied to other data sets. The resulting processed output can then be used as inputs for statistical or machine learning models. |
Authors: | Max Kuhn [aut, cre], Hadley Wickham [aut], Emil Hvitfeldt [aut], Posit Software, PBC [cph, fnd] |
Maintainer: | Max Kuhn <[email protected]> |
License: | MIT + file LICENSE |
Version: | 1.1.0.9001 |
Built: | 2024-11-14 23:24:48 UTC |
Source: | https://github.com/tidymodels/recipes |
The .get_data_types() generic is used internally to supply types to columns used in recipes. These functions underlie the work that the user sees in selections.
.get_data_types(x)

## Default S3 method:
.get_data_types(x)

## S3 method for class 'character'
.get_data_types(x)

## S3 method for class 'ordered'
.get_data_types(x)

## S3 method for class 'factor'
.get_data_types(x)

## S3 method for class 'integer'
.get_data_types(x)

## S3 method for class 'numeric'
.get_data_types(x)

## S3 method for class 'double'
.get_data_types(x)

## S3 method for class 'Surv'
.get_data_types(x)

## S3 method for class 'logical'
.get_data_types(x)

## S3 method for class 'Date'
.get_data_types(x)

## S3 method for class 'POSIXct'
.get_data_types(x)

## S3 method for class 'list'
.get_data_types(x)

## S3 method for class 'textrecipes_tokenlist'
.get_data_types(x)

## S3 method for class 'hardhat_case_weights'
.get_data_types(x)
x |
An object |
This function acts as an extended recipes-specific version of class(). By ignoring differences in similar types ("double" and "numeric") and allowing each element to have multiple types ("factor" returns "factor", "unordered", and "nominal", and "character" returns "string", "unordered", and "nominal") we are able to create more natural selectors such as all_nominal(), all_string() and all_integer().
The following list shows the data types for different classes, as defined by recipes. If an object has a class not supported by .get_data_types(), it will get data type "other".
character: string, unordered, and nominal
ordered: ordered, and nominal
factor: factor, unordered, and nominal
integer: integer, and numeric
numeric: double, and numeric
double: double, and numeric
Surv: surv
logical: logical
Date: date
POSIXct: datetime
list: list
textrecipes_tokenlist: tokenlist
hardhat_case_weights: case_weights
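The mapping listed above can be checked on a few small vectors. This is a minimal sketch that assumes the recipes package is installed; the input vectors are made up for illustration, and a complex vector stands in for a class that has no dedicated method.

```r
library(recipes)

# Each call returns the recipes data types for one vector.
factor_types  <- .get_data_types(factor(c("a", "b")))
integer_types <- .get_data_types(1:3)
string_types  <- .get_data_types(c("a", "b"))

# Complex vectors have no dedicated method, so the default method applies.
other_types <- .get_data_types(1i)

factor_types
integer_types
string_types
other_types
```

Because one column can carry several types, a selector such as all_nominal() can match both factor and character columns.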
data(Sacramento, package = "modeldata")
lapply(Sacramento, .get_data_types)
add_step adds a step to the last location in the recipe. add_check does the same for checks.
add_step(rec, object)

add_check(rec, object)
rec |
A |
object |
A step or check object. |
An updated recipe() with the new operation in the last slot.
For a recipe with at least one preprocessing operation that has been trained by prep(), apply the computations to new data.
bake(object, ...)

## S3 method for class 'recipe'
bake(object, new_data, ..., composition = "tibble")
object |
A trained object such as a |
... |
One or more selector functions to choose which variables will be
returned by the function. See |
new_data |
A data frame, tibble, or sparse matrix from the |
composition |
Either |
bake() takes a trained recipe and applies its operations to a data set to create a design matrix. If you are using a recipe as a preprocessor for modeling, we highly recommend that you use a workflow() instead of manually applying a recipe (see the example in recipe()).
If the data set is not too large, time can be saved by using the retain = TRUE option of prep(). This stores the processed version of the training set. With this option set, bake(object, new_data = NULL) will return it without any additional computation.
Also, any steps with skip = TRUE will not be applied to the data when bake() is invoked with a data set in new_data. bake(object, new_data = NULL) will always have all of the steps applied.
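The difference between the two calls can be seen in a small sketch. It assumes the recipes package is installed and uses the built-in mtcars data as a stand-in; the choice of steps is only for illustration.

```r
library(recipes)

# step_log() is marked skip = TRUE: it runs during prep() on the training
# set but is skipped when bake() receives a data set in new_data.
rec <- recipe(mpg ~ disp + hp, data = mtcars) %>%
  step_center(disp, hp) %>%
  step_log(mpg, skip = TRUE) %>%
  prep(retain = TRUE)

# Retained training set: every step, including the skipped one, was applied.
train_baked <- bake(rec, new_data = NULL)

# Explicit data set: the skip = TRUE step is not applied.
new_baked <- bake(rec, new_data = mtcars)

head(train_baked$mpg)
head(new_baked$mpg)
```

Here train_baked$mpg is on the log scale, while new_baked$mpg keeps the original values.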
A tibble, matrix, or sparse matrix that may have different columns than the original columns in new_data.
data(ames, package = "modeldata")

ames <- mutate(ames, Sale_Price = log10(Sale_Price))

ames_rec <- recipe(Sale_Price ~ ., data = ames[-(1:6), ]) %>%
  step_other(Neighborhood, threshold = 0.05) %>%
  step_dummy(all_nominal()) %>%
  step_interact(~ starts_with("Central_Air"):Year_Built) %>%
  step_ns(Longitude, Latitude, deg_free = 2) %>%
  step_zv(all_predictors()) %>%
  prep()

# return the training set (already embedded in ames_rec)
bake(ames_rec, new_data = NULL)

# apply processing to other data:
bake(ames_rec, new_data = head(ames))

# only return selected variables:
bake(ames_rec, new_data = head(ames), all_numeric_predictors())
bake(ames_rec, new_data = head(ames), starts_with(c("Longitude", "Latitude")))
Case weights are positive numeric values that may influence how much weight each data point has during preprocessing. There are a variety of situations where case weights can be used.
tidymodels packages differentiate how different types of case weights should be used during the entire data analysis process, including preprocessing data, model fitting, performance calculations, etc.
The tidymodels packages require users to convert their numeric vectors to a vector class that reflects how these should be used. For example, there are some situations where the weights should not affect operations such as centering and scaling or other preprocessing operations.
The types of weights allowed in tidymodels are:
Frequency weights via hardhat::frequency_weights()
Importance weights via hardhat::importance_weights()
More types can be added by request.
For recipes, we distinguish between supervised and unsupervised steps. Supervised steps use the outcome in the calculations; these steps will use frequency and importance weights. Unsupervised steps don't use the outcome and will only use frequency weights.
There are 3 main principles about how case weights are used within recipes. First, the data set that is passed to the recipe() function should already have a case weights column in it. This column can be created beforehand using hardhat::frequency_weights() or hardhat::importance_weights(). Second, there can only be 1 case weights column in a recipe at any given time. Third, you cannot modify the case weights column with most of the steps or using the update_role() and add_role() functions.
These principles ensure that you experience minimal surprises when using case weights, as the steps automatically apply case weighted operations when supported. The printing method will additionally show which steps were weighted and which steps ignored the weights because they were of an incompatible type.
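A minimal sketch of these principles, assuming the recipes, hardhat, and tibble packages are installed. The data values and the weights are made up for illustration; step_center() is used here as an example of an unsupervised step that can use frequency weights.

```r
library(recipes)

dat <- tibble::tibble(
  y = c(0.5, 1.5, 2.5),
  x = c(1, 2, 10),
  # Hypothetical counts: the third row stands for eight observations.
  w = hardhat::frequency_weights(c(1L, 1L, 8L))
)

# The weights column already exists in the data passed to recipe().
rec <- recipe(y ~ ., data = dat) %>%
  step_center(x) %>%
  prep()

# The weights column is identified by its class and gets its own role.
rec_summary <- summary(rec)
rec_summary

# With frequency weights, the centering value is the weighted mean of x.
center_value <- tidy(rec, number = 1)$value
center_value
```

The weighted mean is (1 + 2 + 8 * 10) / 10, which differs from the unweighted mean of x.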
frequency_weights(), importance_weights()
These functions can be used to do basic calculations with or without case weights.
get_case_weights(info, .data, call = rlang::caller_env())

averages(x, wts = NULL, na_rm = TRUE)

medians(x, wts = NULL)

variances(x, wts = NULL, na_rm = TRUE)

correlations(x, wts = NULL, use = "everything", method = "pearson")

covariances(x, wts = NULL, use = "everything", method = "pearson")

pca_wts(x, wts = NULL)

are_weights_used(wts, unsupervised = FALSE)
info |
A data frame from the |
.data |
The training data |
call |
The execution environment of a currently
running function, e.g. |
x |
A numeric vector or a data frame |
wts |
A vector of case weights |
na_rm |
A logical value indicating whether |
use |
Used by |
method |
Used by |
unsupervised |
Can the step handle unsupervised weights |
get_case_weights() is designed for developers of recipe steps, to return a column with the role of "case weight" as a vector.
For the other functions, rows with missing case weights are removed from calculations.
For averages() and variances(), missing values in the data (not the case weights) only affect the calculations for those rows. For correlations(), the correlation matrix computation first removes rows with any missing values (equal to the "complete.obs" strategy in stats::cor()).
are_weights_used() is designed for developers of recipe steps and is used inside the print method to determine how printing should be done.
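As a small illustration of the helpers, the sketch below compares averages() with and without case weights. It assumes recipes and hardhat are installed; the data frame and weights are invented for the example.

```r
library(recipes)

dat <- data.frame(x = c(1, 2, 10))
wts <- hardhat::frequency_weights(c(1L, 1L, 8L))

# Without weights this is the ordinary column mean.
unweighted <- averages(dat)

# With frequency weights the third row counts eight times.
weighted <- averages(dat, wts = wts)

unweighted
weighted
```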
check_class creates a specification of a recipe check that will check if a variable is of a designated class.
check_class(
  recipe,
  ...,
  role = NA,
  trained = FALSE,
  class_nm = NULL,
  allow_additional = FALSE,
  skip = FALSE,
  class_list = NULL,
  id = rand_id("class")
)
recipe |
A recipe object. The check will be added to the sequence of operations for this recipe. |
... |
One or more selector functions to choose variables
for this check. See |
role |
Not used by this check since no new variables are created. |
trained |
A logical for whether the selectors in |
class_nm |
A character vector that will be used in |
allow_additional |
If |
skip |
A logical. Should the check be skipped when the
recipe is baked by |
class_list |
A named list of column classes. This is
|
id |
A character string that is unique to this check to identify it. |
This function can check the classes of the variables in two ways. When the class_nm argument is provided, it will check if all the variables specified are of the given class. If this argument is NULL, the check will learn the classes of each of the specified variables in prep. Both ways will break bake if the variables are not of the requested class. If a variable has multiple classes in prep, all the classes are checked. Please note that in prep the argument strings_as_factors defaults to TRUE. If the train set contains character variables, the check will break bake when strings_as_factors is TRUE.
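The learned-class behaviour can be sketched with two tiny data frames. This assumes the recipes package is installed; the data frames train and bad are hypothetical.

```r
library(recipes)

train <- data.frame(x = c(1.5, 2.5, 3.5))
bad <- data.frame(x = c("a", "b", "c"))

# class_nm is left as NULL, so the class of x is learned in prep().
rec <- recipe(~ x, data = train) %>%
  check_class(x) %>%
  prep()

# Same class as in training: the data passes through.
ok <- bake(rec, new_data = train)

# A character column where a numeric one was learned: bake() stops.
result <- tryCatch(
  bake(rec, new_data = bad),
  error = function(e) "error"
)
result
```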
An updated version of recipe with the new check added to the sequence of any existing operations.
When you tidy() this check, a tibble with columns terms (the selectors or variables selected) and value (the type) is returned.
The underlying operation does not allow for case weights.
Other checks: check_cols(), check_missing(), check_new_values(), check_range()
library(dplyr)
data(Sacramento, package = "modeldata")

# Learn the classes on the train set
train <- Sacramento[1:500, ]
test <- Sacramento[501:nrow(Sacramento), ]
recipe(train, sqft ~ .) %>%
  check_class(everything()) %>%
  prep(train, strings_as_factors = FALSE) %>%
  bake(test)

# Manual specification
recipe(train, sqft ~ .) %>%
  check_class(sqft, class_nm = "integer") %>%
  check_class(city, zip, type, class_nm = "factor") %>%
  check_class(latitude, longitude, class_nm = "numeric") %>%
  prep(train, strings_as_factors = FALSE) %>%
  bake(test)

# By default only the classes that are specified
# are allowed.
x_df <- tibble(time = c(Sys.time() - 60, Sys.time()))
x_df$time %>% class()
## Not run:
recipe(x_df) %>%
  check_class(time, class_nm = "POSIXt") %>%
  prep(x_df) %>%
  bake(x_df)
## End(Not run)

# Use allow_additional = TRUE if you are fine with it
recipe(x_df) %>%
  check_class(time, class_nm = "POSIXt", allow_additional = TRUE) %>%
  prep(x_df) %>%
  bake(x_df)
check_cols creates a specification of a recipe step that will check if all the columns of the training frame are present in the new data.
check_cols(
  recipe,
  ...,
  role = NA,
  trained = FALSE,
  skip = FALSE,
  id = rand_id("cols")
)
recipe |
A recipe object. The check will be added to the sequence of operations for this recipe. |
... |
One or more selector functions to choose variables
for this check. See |
role |
Not used by this check since no new variables are created. |
trained |
A logical for whether the selectors in |
skip |
A logical. Should the check be skipped when the
recipe is baked by |
id |
A character string that is unique to this check to identify it. |
This check will break the bake function if any of the specified columns is not present in the data. If the check passes, the data is left unchanged.
An updated version of recipe with the new check added to the sequence of any existing operations.
When you tidy() this check, a tibble with columns terms (the selectors or variables selected) and value (the type) is returned.
Other checks: check_class(), check_missing(), check_new_values(), check_range()
data(biomass, package = "modeldata")

biomass_rec <- recipe(HHV ~ ., data = biomass) %>%
  step_rm(sample, dataset) %>%
  check_cols(contains("gen")) %>%
  step_center(all_numeric_predictors())

## Not run:
bake(biomass_rec, biomass[, c("carbon", "HHV")])
## End(Not run)
check_missing creates a specification of a recipe operation that will check if variables contain missing values.
check_missing(
  recipe,
  ...,
  role = NA,
  trained = FALSE,
  columns = NULL,
  skip = FALSE,
  id = rand_id("missing")
)
recipe |
A recipe object. The check will be added to the sequence of operations for this recipe. |
... |
One or more selector functions to choose variables
for this check. See |
role |
Not used by this check since no new variables are created. |
trained |
A logical for whether the selectors in |
columns |
A character string of the selected variable names. This field
is a placeholder and will be populated once |
skip |
A logical. Should the check be skipped when the
recipe is baked by |
id |
A character string that is unique to this check to identify it. |
This check will break the bake function if any of the checked columns contains NA values. If the check passes, the data is left unchanged.
An updated version of recipe with the new check added to the sequence of any existing operations.
When you tidy() this check, a tibble with column terms (the selectors or variables selected) is returned.
Other checks: check_class(), check_cols(), check_new_values(), check_range()
data(credit_data, package = "modeldata")
is.na(credit_data) %>% colSums()

# If the test passes, `new_data` is returned unaltered
recipe(credit_data) %>%
  check_missing(Age, Expenses) %>%
  prep() %>%
  bake(credit_data)

# If your training set doesn't pass, prep() will stop with an error
## Not run:
recipe(credit_data) %>%
  check_missing(Income) %>%
  prep()
## End(Not run)

# If `new_data` contains missing values, the check will stop `bake()`
train_data <- credit_data %>% dplyr::filter(Income > 150)
test_data <- credit_data %>% dplyr::filter(Income <= 150 | is.na(Income))

rp <- recipe(train_data) %>%
  check_missing(Income) %>%
  prep()

bake(rp, train_data)
## Not run:
bake(rp, test_data)
## End(Not run)
check_new_values creates a specification of a recipe operation that will check if variables contain new values.
check_new_values(
  recipe,
  ...,
  role = NA,
  trained = FALSE,
  columns = NULL,
  ignore_NA = TRUE,
  values = NULL,
  skip = FALSE,
  id = rand_id("new_values")
)
recipe |
A recipe object. The check will be added to the sequence of operations for this recipe. |
... |
One or more selector functions to choose variables
for this check. See |
role |
Not used by this check since no new variables are created. |
trained |
A logical for whether the selectors in |
columns |
A character string of the selected variable names. This field
is a placeholder and will be populated once |
ignore_NA |
A logical that indicates if we should consider missing
values as value or not. Defaults to |
values |
A named list with the allowed values.
This is |
skip |
A logical. Should the check be skipped when the
recipe is baked by |
id |
A character string that is unique to this check to identify it. |
This check will break the bake function if any of the checked columns contains values it did not contain when prep was called on the recipe. If the check passes, the data is left unchanged.
An updated version of recipe with the new check added to the sequence of any existing operations.
When you tidy() this check, a tibble with columns terms (the selectors or variables selected) is returned.
The underlying operation does not allow for case weights.
Other checks: check_class(), check_cols(), check_missing(), check_range()
data(credit_data, package = "modeldata")

# If the test passes, `new_data` is returned unaltered
recipe(credit_data) %>%
  check_new_values(Home) %>%
  prep() %>%
  bake(new_data = credit_data)

# If `new_data` contains values not in `x` at the prep() function,
# the bake() function will break.
## Not run:
recipe(credit_data %>% dplyr::filter(Home != "rent")) %>%
  check_new_values(Home) %>%
  prep() %>%
  bake(new_data = credit_data)
## End(Not run)

# By default missing values are ignored, so this passes.
recipe(credit_data %>% dplyr::filter(!is.na(Home))) %>%
  check_new_values(Home) %>%
  prep() %>%
  bake(credit_data)

# Use `ignore_NA = FALSE` if you consider missing values as a value,
# that should not occur when not observed in the train set.
## Not run:
recipe(credit_data %>% dplyr::filter(!is.na(Home))) %>%
  check_new_values(Home, ignore_NA = FALSE) %>%
  prep() %>%
  bake(credit_data)
## End(Not run)
check_range creates a specification of a recipe check that will check if the range of a numeric variable changed in the new data.
check_range(
  recipe,
  ...,
  role = NA,
  skip = FALSE,
  trained = FALSE,
  slack_prop = 0.05,
  warn = FALSE,
  lower = NULL,
  upper = NULL,
  id = rand_id("range_check_")
)
recipe |
A recipe object. The check will be added to the sequence of operations for this recipe. |
... |
One or more selector functions to choose variables
for this check. See |
role |
Not used by this check since no new variables are created. |
skip |
A logical. Should the check be skipped when the
recipe is baked by |
trained |
A logical for whether the selectors in |
slack_prop |
The allowed slack as a proportion of the range of the variable in the train set. |
warn |
If |
lower |
A named numeric vector of minimum values in the train set.
This is |
upper |
A named numeric vector of maximum values in the train set.
This is |
id |
A character string that is unique to this check to identify it. |
The amount of slack that is allowed is determined by the slack_prop. This is a numeric of length one or two. If of length one, the same proportion will be used at both ends of the train set range. If of length two, its first value is used to compute the allowed slack at the lower end, the second to compute the allowed slack at the upper end.
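The arithmetic described above can be sketched in base R. The helper allowed_bounds() is hypothetical and not part of recipes; it only illustrates one reading of the description, where the slack is a proportion of the training range added at each end.

```r
# Hypothetical helper, not part of recipes: allowed bounds for new data
# given the training values and a slack proportion of length one or two.
allowed_bounds <- function(train, slack_prop = 0.05) {
  # A single proportion is used at both ends of the range.
  if (length(slack_prop) == 1) {
    slack_prop <- rep(slack_prop, 2)
  }
  rng <- range(train)
  width <- diff(rng)
  c(
    lower = rng[1] - slack_prop[1] * width,
    upper = rng[2] + slack_prop[2] * width
  )
}

# Training values 0 to 100 with the default 5 percent slack at both ends.
same_slack <- allowed_bounds(0:100)

# 10 percent slack at the lower end, 5 percent at the upper end.
mixed_slack <- allowed_bounds(0:100, slack_prop = c(0.1, 0.05))

same_slack
mixed_slack
```

Under this reading, new data ranging from -10 to 110 falls outside both bounds with the default slack, and outside only the upper bound with slack_prop = c(0.1, 0.05).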
An updated version of recipe with the new check added to the sequence of any existing operations.
When you tidy() this check, a tibble with columns terms (the selectors or variables selected) and value (the means) is returned.
Other checks: check_class(), check_cols(), check_missing(), check_new_values()
slack_df <- tibble::tibble(x = 0:100)
slack_new_data <- tibble::tibble(x = -10:110)

# this will fail the check at both ends
## Not run:
recipe(slack_df) %>%
  check_range(x) %>%
  prep() %>%
  bake(slack_new_data)
## End(Not run)

# this will fail the check only at the upper end
## Not run:
recipe(slack_df) %>%
  check_range(x, slack_prop = c(0.1, 0.05)) %>%
  prep() %>%
  bake(slack_new_data)
## End(Not run)

# give a warning instead of an error
## Not run:
recipe(slack_df) %>%
  check_range(x, warn = TRUE) %>%
  prep() %>%
  bake(slack_new_data)
## End(Not run)
Detect if a particular step or check is used in a recipe.
detect_step(recipe, name)
recipe |
A recipe to check. |
name |
Character name of a step or check, omitting the prefix. That is,
to check if |
A logical indicating whether the recipe contains the given step.
rec <- recipe(Species ~ ., data = iris) %>%
  step_intercept()

detect_step(rec, "intercept")
This page provides a comprehensive list of the exported functions for creating recipe steps and guidance on how to use them.
add_step() and add_check() are required when creating a new step. The output of add_step() should be the return value of all steps and should have the following format:
step_example <- function(recipe,
                         ...,
                         role = NA,
                         trained = FALSE,
                         skip = FALSE,
                         id = rand_id("example")) {
  add_step(
    recipe,
    step_example_new(
      terms = enquos(...),
      role = role,
      trained = trained,
      skip = skip,
      id = id
    )
  )
}
rand_id() should be used in the arguments of step_example() to specify the id argument, as we see in the above example.
recipes_pkg_check() should be used in step_example() functions together with required_pkgs() to alert users that certain other packages are required. The standard way of using this function is the following format:
recipes_pkg_check(required_pkgs.step_example())
step() and check() are used within the step_*_new() function that you use in your new step. They will be used in the following way:
step_example_new <- function(terms, role, trained, skip, id) {
  step(
    subclass = "example",
    terms = terms,
    role = role,
    trained = trained,
    skip = skip,
    id = id
  )
}
recipes_eval_select() is used within prep.step_*() functions, and is used to turn the terms object into a character vector of the selected variables.
It will most likely be used like so:
col_names <- recipes_eval_select(x$terms, training, info)
check_type() can be used within prep.step_*() functions to check that the variables passed in are the right types. We recommend that you use the types argument as it offers higher flexibility and it matches the types defined by .get_data_types(). When using types we find it better to be explicit, e.g. writing types = c("double", "integer") instead of types = "numeric", as it produces cleaner error messages.
It should be used like so:
check_type(training[, col_names], types = c("double", "integer"))
check_new_data() should be used within bake.step_*(). This function is used to check that the required columns are present in the data. It should be one of the first lines inside the function.
It should be used like so:
check_new_data(names(object$columns), object, new_data)
check_name() should be used in bake.step_*() functions for steps that add new columns to the data set. The function throws an error if the column names already exist in the data set. It should be called before adding the new columns to the data set.
get_keep_original_cols() and remove_original_cols() are used within steps with the keep_original_cols argument. get_keep_original_cols() is used in prep.step_*() functions for steps that were created before the keep_original_cols argument was added, and acts as a way to throw a warning that the user should regenerate the recipe. remove_original_cols() should be used in bake.step_*() functions to remove the original columns. It is worth noting that remove_original_cols() can remove multiple columns at once and when possible should be put outside for loops.
new_data <- remove_original_cols(new_data, object, names_of_original_cols)
recipes_remove_cols() should be used in prep.step_*() functions, and is used to remove columns from the data set, either by using the object$removals field or by using the col_names argument.
recipes_names_predictors() and recipes_names_outcomes() should be used in prep.step_*() functions, and are used to get names of predictors and outcomes.
get_case_weights() and are_weights_used() are functions that help you extract case weights and help determine if they are used or not within the step. They will typically be used within the prep.step_*() functions if the step in question supports case weights.
print_step() is used inside print.step_*() functions. This function is replacing the internally deprecated printer() function.
sel2char() is mostly used within tidy.step_*() functions to turn selections into character vectors.
names0() creates a series of num names with a common prefix. The names are numbered with leading zeros (e.g. prefix01-prefix10 instead of prefix1-prefix10). This is useful for many types of steps that produce new columns.
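A short sketch of names0(), assuming the recipes package is installed. The prefix "prefix" mirrors the example in the text.

```r
library(recipes)

# Ten names with a common prefix, padded with leading zeros.
new_names <- names0(10, prefix = "prefix")
new_names
```

The zero padding keeps the generated column names in the intended order when they are sorted alphabetically.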
recipes_ptype() returns the ptype, expected variables and types, that a recipe object expects at prep() and bake() time. This is controlled using the stage argument. It can be used by functions that interact with recipes to verify data is correct before passing it to prep() and bake().
recipes_ptype_validate() checks that a recipe and its data are compatible using information extracted using recipes_ptype(). It can be used by functions that interact with recipes to verify data is correct before passing it to prep() and bake().
detect_step() returns a logical indicator to determine if a given step or check is included in a recipe.
fully_trained() returns a logical indicator if the recipe is fully trained. The function is_trained() can be used to check whether individual steps are trained or not.
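These indicator functions can be tried on a small recipe. The sketch assumes the recipes package is installed and uses mtcars with a single centering step purely for illustration.

```r
library(recipes)

rec <- recipe(mpg ~ ., data = mtcars) %>%
  step_center(disp)

# The step name is given without the "step_" prefix.
has_center <- detect_step(rec, "center")

# Nothing has been estimated yet.
before <- fully_trained(rec)

# After prep() every step has been trained.
rec_trained <- prep(rec)
after <- fully_trained(rec_trained)

# is_trained() works on individual steps.
per_step <- vapply(rec_trained$steps, is_trained, logical(1))

c(has_center = has_center, before = before, after = after)
per_step
```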
.get_data_types() is an S3 method that is used for selections. This method can be extended to work with column types not supported by recipes.
recipes_extension_check() is recommended to be used by package authors to make sure that all steps have prep.step_*(), bake.step_*(), print.step_*(), tidy.step_*(), and required_pkgs.step_*() methods. It should be used as a test, preferably like this:
test_that("recipes_extension_check", {
  expect_snapshot(
    recipes::recipes_extension_check(
      pkg = "pkgname"
    )
  )
})
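The helpers described on this page can be combined into a small working step. The following is a minimal sketch, not a template from the package: step_twice() is a hypothetical step that doubles the selected numeric columns, the columns field is an assumption used to store the selected names, and the S3 methods are defined in the global environment for illustration only. A step shipped in a package would also need print, tidy, and required_pkgs methods and proper S3 registration.

```r
library(recipes)

# User-facing function: adds the (hypothetical) step to the recipe.
step_twice <- function(recipe, ..., role = NA, trained = FALSE,
                       columns = NULL, skip = FALSE,
                       id = rand_id("twice")) {
  add_step(
    recipe,
    step_twice_new(
      terms = rlang::enquos(...),
      role = role, trained = trained, columns = columns,
      skip = skip, id = id
    )
  )
}

# Constructor: builds an object of class "step_twice".
step_twice_new <- function(terms, role, trained, columns, skip, id) {
  step(
    subclass = "twice", terms = terms, role = role, trained = trained,
    columns = columns, skip = skip, id = id
  )
}

# prep method: resolve the selectors and check the column types.
prep.step_twice <- function(x, training, info = NULL, ...) {
  col_names <- recipes_eval_select(x$terms, training, info)
  check_type(training[, col_names], types = c("double", "integer"))
  step_twice_new(
    terms = x$terms, role = x$role, trained = TRUE,
    columns = col_names, skip = x$skip, id = x$id
  )
}

# bake method: verify the columns exist, then double them.
bake.step_twice <- function(object, new_data, ...) {
  check_new_data(names(object$columns), object, new_data)
  for (col in object$columns) {
    new_data[[col]] <- new_data[[col]] * 2
  }
  new_data
}

example_data <- data.frame(a = c(1, 2, 3), b = c(10, 20, 30))

rec <- recipe(~ ., data = example_data) %>%
  step_twice(a) %>%
  prep()

baked <- bake(rec, new_data = NULL)
baked
```

Only column a is selected, so b is returned unchanged, and detect_step(rec, "twice") reports that the step is present.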
discretize() converts a numeric vector into a factor with bins having approximately the same number of data points (based on a training set).
discretize(x, ...)

## Default S3 method:
discretize(x, ...)

## S3 method for class 'numeric'
discretize(
  x,
  cuts = 4,
  labels = NULL,
  prefix = "bin",
  keep_na = TRUE,
  infs = TRUE,
  min_unique = 10,
  ...
)

## S3 method for class 'discretize'
predict(object, new_data, ...)
x |
A numeric vector |
... |
Options to pass to
|
cuts |
An integer defining how many cuts to make of the data. |
labels |
A character vector defining the factor levels
that will be in the new factor (from smallest to largest). This
should have length |
prefix |
A single parameter value to be used as a prefix
for the factor levels (e.g. |
keep_na |
A logical for whether a factor level should be
created to identify missing values in |
infs |
A logical indicating whether the smallest and largest cut point should be infinite. |
min_unique |
An integer defining a sample size line of
dignity for the binning. If (the number of unique
values) |
object |
An object of class |
new_data |
A new numeric object to be binned. |
discretize estimates the cut points from x using percentiles. For example, if cuts = 4, the function estimates the quartiles of x and uses these as the cut points. If cuts = 2, the bins are defined as being above or below the median of x.
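The link between cuts and the percentiles can be sketched in base R. This is only an illustration, assuming that cuts counts the number of bins and that the break points sit at equally spaced probabilities; percentile_breaks() is a hypothetical helper and not part of recipes.

```r
# Hypothetical helper (not part of recipes): equally spaced percentile
# break points that split `x` into `cuts` bins.
percentile_breaks <- function(x, cuts) {
  probs <- seq(0, 1, length.out = cuts + 1)
  unname(quantile(x, probs = probs, na.rm = TRUE))
}

x <- c(2, 4, 6, 8, 10, 12, 14, 16, 18)

# Two bins: the only interior break is the median.
two_bins <- percentile_breaks(x, cuts = 2)
two_bins[2] == median(x)

# Four bins: the three interior breaks are the quartiles.
four_bins <- percentile_breaks(x, cuts = 4)
four_bins[2:4]
```

The first and last break points are the minimum and maximum of x; the infs argument described below controls whether those outer breaks are replaced by infinite values.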
The predict method can then be used to turn numeric vectors into factor vectors.
If keep_na = TRUE, a suffix of "_missing" is used as a factor level (see the examples below).
If infs = FALSE and a new value is greater than the largest value of x, a missing value will result.
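Why finite outer breaks produce missing values can be seen with base::cut(). The sketch below mimics the two settings of infs by hand, assuming the finite outer breaks are the minimum and maximum of the training vector; it does not call discretize() itself.

```r
x <- 1:100

# Outer breaks fixed at the observed range (analogous to infs = FALSE).
finite_breaks <- c(min(x), median(x), max(x))

# Outer breaks opened up to infinity (analogous to infs = TRUE).
open_breaks <- c(-Inf, median(x), Inf)

new_values <- c(50, 500)

# 500 lies beyond the largest training value, so it cannot be binned.
binned_finite <- cut(new_values, breaks = finite_breaks, include.lowest = TRUE)
is.na(binned_finite)

# With infinite outer breaks every new value falls into some bin.
binned_open <- cut(new_values, breaks = open_breaks, include.lowest = TRUE)
is.na(binned_open)
```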
discretize returns an object of class discretize and predict.discretize returns a factor vector.
data(biomass, package = "modeldata")

biomass_tr <- biomass[biomass$dataset == "Training", ]
biomass_te <- biomass[biomass$dataset == "Testing", ]

median(biomass_tr$carbon)
discretize(biomass_tr$carbon, cuts = 2)
discretize(biomass_tr$carbon, cuts = 2, infs = FALSE)
discretize(biomass_tr$carbon, cuts = 2, infs = FALSE, keep_na = FALSE)
discretize(biomass_tr$carbon, cuts = 2, prefix = "maybe a bad idea to bin")

carbon_binned <- discretize(biomass_tr$carbon)
table(predict(carbon_binned, biomass_tr$carbon))

carbon_no_infs <- discretize(biomass_tr$carbon, infs = FALSE)
predict(carbon_no_infs, c(50, 100))
In case a model formula is required, the formula method can be used on a recipe to show what predictors and outcome(s) could be used.
## S3 method for class 'recipe'
formula(x, ...)
x |
A recipe object that has been prepared. |
... |
Not currently used. |
A formula.
formula(recipe(Species + Sepal.Length ~ ., data = iris) %>% prep())

iris_rec <- recipe(Species ~ ., data = iris) %>%
  step_center(all_numeric()) %>%
  prep()
formula(iris_rec)
Check to see if a recipe is trained/prepared
fully_trained(x)
x |
A recipe |
A logical which is true if all of the recipe steps have been run through prep. If no steps have been added to the recipe, TRUE is returned only if the recipe has been prepped.
rec <- recipe(Species ~ ., data = iris) %>%
  step_center(all_numeric())

rec %>% fully_trained()

rec %>%
  prep(training = iris) %>%
  fully_trained()
has_role(), all_predictors(), and all_outcomes() can be used to select variables in a formula that have certain roles.
In most cases, the right approach for users will be to use the predictor-specific selectors such as all_numeric_predictors() and all_nominal_predictors(). In general you should be careful about using -all_outcomes() if a *_predictors() selector would do what you want.
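The two idioms can be compared side by side. The following is a minimal sketch that assumes the recipes package is installed and uses the built-in mtcars data; the object names are chosen for illustration only.

```r
library(recipes)

# Predictor-specific selector: only numeric predictors are normalized.
rec_pred <- recipe(mpg ~ ., data = mtcars) %>%
  step_normalize(all_numeric_predictors())

# Older idiom: select every numeric column, then drop the outcome.
rec_minus <- recipe(mpg ~ ., data = mtcars) %>%
  step_normalize(all_numeric(), -all_outcomes())

baked_pred <- bake(prep(rec_pred), new_data = NULL)
baked_minus <- bake(prep(rec_minus), new_data = NULL)

# Both versions leave the outcome `mpg` untouched.
all.equal(baked_pred, baked_minus)
```

The first form states the intent directly and does not rely on remembering to exclude the outcome.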
Similarly, has_type(), all_numeric(), all_integer(), all_double(), all_nominal(), all_ordered(), all_unordered(), all_factor(), all_string(), all_date() and all_datetime() are used to select columns based on their data type.
all_factor() captures ordered and unordered factors, all_string() captures characters, all_unordered() captures unordered factors and characters, all_ordered() captures ordered factors, all_nominal() captures characters, unordered and ordered factors.
all_integer() captures integers, all_double() captures doubles, all_numeric() captures all kinds of numeric.
all_date() captures Date() variables, all_datetime() captures POSIXct() variables.
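As a small illustration of the type selectors, the sketch below builds a toy data frame (the column names are made up for this example) and removes columns by type with step_rm(). It assumes the recipes package is installed.

```r
library(recipes)

# Toy data with one column per type of interest (names are illustrative).
df <- data.frame(
  count = 1:3,
  ratio = c(0.5, 1.5, 2.5),
  group = factor(c("a", "b", "a")),
  grade = ordered(c("lo", "hi", "lo"), levels = c("lo", "hi")),
  day = as.Date("2024-01-01") + 0:2
)

rec <- recipe(~ ., data = df)

# all_nominal() matches both the unordered and the ordered factor.
no_nominal <- rec %>%
  step_rm(all_nominal()) %>%
  prep() %>%
  bake(new_data = NULL)
names(no_nominal)

# all_date() matches only the Date column.
no_date <- rec %>%
  step_rm(all_date()) %>%
  prep() %>%
  bake(new_data = NULL)
names(no_date)
```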
See selections for more details.
current_info() is an internal function.
All of these functions have limited utility outside of column selection in step functions.
has_role(match = "predictor")
has_type(match = "numeric")
all_outcomes()
all_predictors()
all_date()
all_date_predictors()
all_datetime()
all_datetime_predictors()
all_double()
all_double_predictors()
all_factor()
all_factor_predictors()
all_integer()
all_integer_predictors()
all_logical()
all_logical_predictors()
all_nominal()
all_nominal_predictors()
all_numeric()
all_numeric_predictors()
all_ordered()
all_ordered_predictors()
all_string()
all_string_predictors()
all_unordered()
all_unordered_predictors()
current_info()
match |
A single character string for the query. Exact matching is used (i.e. regular expressions won't work). |
Selector functions return an integer vector.
current_info() returns an environment with objects vars and data.
data(biomass, package = "modeldata")

rec <- recipe(biomass) %>%
  update_role(
    carbon, hydrogen, oxygen, nitrogen, sulfur,
    new_role = "predictor"
  ) %>%
  update_role(HHV, new_role = "outcome") %>%
  update_role(sample, new_role = "id variable") %>%
  update_role(dataset, new_role = "splitting indicator")

recipe_info <- summary(rec)
recipe_info

# Centering on all predictors except carbon
rec %>%
  step_center(all_predictors(), -carbon) %>%
  prep(training = biomass) %>%
  bake(new_data = NULL)
As of recipes version 0.1.14, juice() is superseded in favor of bake(object, new_data = NULL).
As steps are estimated by prep, these operations are applied to the training set. Rather than running bake() to duplicate this processing, this function will return variables from the processed training set.
juice(object, ..., composition = "tibble")
object |
A |
... |
One or more selector functions to choose which variables will be
returned by the function. See |
composition |
Either |
juice() will return the results of a recipe where all steps have been applied to the data, irrespective of the value of the step's skip argument.
juice() can only be used if a recipe was prepped with retain = TRUE. This is equivalent to bake(object, new_data = NULL) which is the preferred way to extract the transformation of the training data set.
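The equivalence can be checked directly. This is a minimal sketch that assumes the recipes package is installed and uses the built-in iris data.

```r
library(recipes)

prepped <- recipe(Species ~ ., data = iris) %>%
  step_center(all_numeric_predictors()) %>%
  prep(retain = TRUE)

# Preferred: ask bake() for the retained, processed training set.
from_bake <- bake(prepped, new_data = NULL)

# Superseded: juice() returns the same processed training set.
from_juice <- juice(prepped)

all.equal(from_bake, from_juice)
```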
names0 creates a series of num names with a common prefix. The names are numbered with leading zeros (e.g. prefix01-prefix10 instead of prefix1-prefix10).
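The zero-padding idea can be sketched in base R. padded_names() below is a hypothetical stand-in written for illustration; it is not the recipes implementation of names0 and only aims to reproduce the naming pattern described above.

```r
# Hypothetical helper (not the recipes implementation): pad the index with
# leading zeros so that every name has the same number of digits.
padded_names <- function(num, prefix = "x") {
  width <- nchar(as.character(num))
  paste0(prefix, formatC(seq_len(num), width = width, flag = "0"))
}

padded_names(9, "a")
padded_names(10, "a")

# With padding, alphabetical order matches numeric order.
sorted <- sort(padded_names(10, "a"))
identical(sorted, padded_names(10, "a"))
```

Padding matters mostly when generated column names are later sorted as text, since "a10" would otherwise sort before "a2".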
dummy_names can be used for renaming unordered and ordered dummy variables (in step_dummy()).
names0(num, prefix = "x", call = rlang::caller_env())

dummy_names(var, lvl, ordinal = FALSE, sep = "_")

dummy_extract_names(var, lvl, ordinal = FALSE, sep = "_")
num |
A single integer for how many elements are created. |
prefix |
A character string that will start each name. |
call |
The execution environment of a currently running function, e.g.
|
var |
A single string for the original factor name. |
lvl |
A character vector of the factor levels (in order).
When used with |
ordinal |
A logical; was the original factor ordered? |
sep |
A single character value for the separator between the names and levels. |
When using dummy_names(), factor levels that are not valid variable names (e.g. "some text with spaces") will be changed to valid names by base::make.names(); see example below. This function will also change the names of ordinal dummy variables. Instead of values such as ".L", ".Q", or "^4", ordinal dummy variables are given simple integer suffixes such as "_1", "_2", etc.
names0 returns a character vector of length num and dummy_names generates a character vector the same length as lvl.
names0(9, "a")
names0(10, "a")

example <- data.frame(
  x = ordered(letters[1:5]),
  y = factor(LETTERS[1:5]),
  z = factor(paste(LETTERS[1:5], 1:5))
)

dummy_names("y", levels(example$y)[-1])
dummy_names("z", levels(example$z)[-1])

after_mm <- colnames(model.matrix(~x, data = example))[-1]
after_mm
levels(example$x)

dummy_names("x", substring(after_mm, 2), ordinal = TRUE)
For a recipe with at least one preprocessing operation, estimate the required parameters from a training set that can be later applied to other data sets.
prep(x, ...)

## S3 method for class 'recipe'
prep(
  x,
  training = NULL,
  fresh = FALSE,
  verbose = FALSE,
  retain = TRUE,
  log_changes = FALSE,
  strings_as_factors = TRUE,
  ...
)
x |
an |
... |
further arguments passed to or from other methods (not currently used). |
training |
A data frame, tibble, or sparse matrix from the |
fresh |
A logical indicating whether already trained operations should be
re-trained. If |
verbose |
A logical that controls whether progress is reported as operations are executed. |
retain |
A logical: should the preprocessed training set be saved into
the |
log_changes |
A logical for printing a summary for each step regarding which (if any) columns were added or removed during training. |
strings_as_factors |
A logical: should character columns that have role
|
Given a data set, this function estimates the required quantities and statistics needed by any operations. prep() returns an updated recipe with the estimates. If you are using a recipe as a preprocessor for modeling, we highly recommend that you use a workflow() instead of manually estimating a recipe (see the example in recipe()).
Note that missing data is handled in the steps; there is no global na.rm option at the recipe level or in prep().
Also, if a recipe has been trained using prep() and then steps are added, prep() will only update the new operations. If fresh = TRUE, all of the operations will be (re)estimated.
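The incremental behaviour can be sketched as follows. This assumes the recipes package is installed, uses the built-in mtcars data, and relies on the first prep() having been run with retain = TRUE so that the processed training set is available for the added step.

```r
library(recipes)

# Train a recipe with a single step and keep the processed training set.
rec <- recipe(mpg ~ ., data = mtcars) %>%
  step_center(all_numeric_predictors()) %>%
  prep(training = mtcars, retain = TRUE)

# Adding a step leaves the recipe only partially trained.
rec_more <- rec %>% step_scale(all_numeric_predictors())
before <- fully_trained(rec_more)

# Without `fresh`, only the new step is estimated, from the retained data.
rec_more <- prep(rec_more)
after <- fully_trained(rec_more)

# With `fresh = TRUE`, every step is re-estimated from the supplied data.
rec_fresh <- prep(rec_more, training = mtcars, fresh = TRUE)

c(before = before, after = after, fresh = fully_trained(rec_fresh))
```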
As the steps are executed, the training set is updated. For example, if the first step is to center the data and the second is to scale the data, the step for scaling is given the centered data.
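The sequential updating is easiest to see when the first step changes the values that the second step summarizes. The sketch below swaps in a log step followed by a centering step (rather than the center/scale pair mentioned above) and assumes the recipes package is installed.

```r
library(recipes)

rec <- recipe(mpg ~ ., data = mtcars) %>%
  step_log(disp) %>%
  step_center(disp) %>%
  prep(training = mtcars)

# Statistics estimated by the second step (the centering step).
center_stats <- tidy(rec, number = 2)

# The stored mean is the mean of the logged values, not of the raw values.
c(
  stored = as.numeric(center_stats$value),
  logged = mean(log(mtcars$disp)),
  raw = mean(mtcars$disp)
)
```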
A recipe whose step objects have been updated with the required quantities (e.g. parameter estimates, model objects, etc). Also, the term_info object is likely to be modified as the operations are executed.
data(ames, package = "modeldata")

library(dplyr)

ames <- mutate(ames, Sale_Price = log10(Sale_Price))

ames_rec <-
  recipe(
    Sale_Price ~ Longitude + Latitude + Neighborhood + Year_Built + Central_Air,
    data = ames
  ) %>%
  step_other(Neighborhood, threshold = 0.05) %>%
  step_dummy(all_nominal()) %>%
  step_interact(~ starts_with("Central_Air"):Year_Built) %>%
  step_ns(Longitude, Latitude, deg_free = 5)

prep(ames_rec, verbose = TRUE)

prep(ames_rec, log_changes = TRUE)
When working with the rsample package, a simple recipe must be prepared using the prep function first. When using recipes with rsample, it is helpful to have a function that can prepare a recipe across a series of split objects that are produced by that package. prepper is a wrapper function around prep that can be used to do this. See the vignette on "Recipes and rsample" for an example.
prepper(split_obj, recipe, ...)
split_obj |
An |
recipe |
An untrained |
... |
Arguments to pass to |
prepper() sets the underlying prep() argument fresh to TRUE.
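A typical use is to map prepper() over the splits of a resampling object. The sketch below assumes that both the recipes and rsample packages are installed; the choice of three folds, the mtcars data, and the name fold_recipes are illustrative only.

```r
library(recipes)
library(rsample)

set.seed(1)
folds <- vfold_cv(mtcars, v = 3)

rec <- recipe(mpg ~ ., data = mtcars) %>%
  step_normalize(all_numeric_predictors())

# Prepare one copy of the recipe per split; each copy is estimated on the
# analysis portion of its own split.
fold_recipes <- lapply(folds$splits, prepper, recipe = rec)

vapply(fold_recipes, fully_trained, logical(1))
```

Because split_obj is the first argument, prepper() can be passed straight to lapply() or purrr::map() with the recipe supplied by name.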
Print a Recipe
## S3 method for class 'recipe'
print(x, form_width = 30, ...)
x |
A |
form_width |
The number of characters used to print the variables or terms in a formula |
... |
further arguments passed to or from other methods (not currently used). |
The original object (invisibly)
A recipe is a description of the steps to be applied to a data set in order to prepare it for data analysis.
recipe(x, ...)

## Default S3 method:
recipe(x, ...)

## S3 method for class 'data.frame'
recipe(x, formula = NULL, ..., vars = NULL, roles = NULL)

## S3 method for class 'formula'
recipe(formula, data, ...)

## S3 method for class 'matrix'
recipe(x, ...)
x , data
|
A data frame, tibble, or sparse matrix from the |
... |
Further arguments passed to or from other methods (not currently used). |
formula |
A model formula. No in-line functions should be used here
(e.g. |
vars |
A character string of column names corresponding to variables that will be used in any context (see below) |
roles |
A character string (the same length of |
Variables in recipes can have any type of role, including outcome, predictor, observation ID, case weights, stratification variables, etc.
recipe objects can be created in several ways. If an analysis only contains outcomes and predictors, the simplest way to create one is to use a formula (e.g. y ~ x1 + x2) that does not contain inline functions such as log(x3) (see the first example below).
Alternatively, a recipe object can be created by first specifying which variables in a data set should be used and then sequentially defining their roles (see the last example). This alternative is an excellent choice when the number of variables is very high, as the formula method is memory-inefficient with many variables.
There are two different types of operations that can be sequentially added to a recipe.
Steps can include operations like scaling a variable, creating dummy variables or interactions, and so on. More computationally complex actions such as dimension reduction or imputation can also be specified.
Checks are operations that conduct specific tests of the data. When the test is satisfied, the data are returned without issue or modification. Otherwise, an error is thrown.
If you have defined a recipe and want to see which steps are included, use the tidy() method on the recipe object.
Note that the data passed to recipe() need not be the complete data that will be used to train the steps (by prep()). The recipe only needs to know the names and types of data that will be used. For large data sets, head() could be used to pass a smaller data set to save time and memory.
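The head() idea can be sketched as follows, assuming the recipes package is installed. The recipe is declared on the first six rows of mtcars, and the statistics are then estimated on the full data passed to prep().

```r
library(recipes)

# Declare the recipe on a small slice: only names and types are needed.
rec <- recipe(mpg ~ ., data = head(mtcars)) %>%
  step_normalize(all_numeric_predictors())

# Estimate the step on the complete training data.
prepped <- prep(rec, training = mtcars)

stats <- tidy(prepped, number = 1)
cyl_mean <- stats$value[stats$terms == "cyl" & stats$statistic == "mean"]

# The stored mean comes from all rows, not from the six rows used above.
c(
  stored = as.numeric(cyl_mean),
  full_data = mean(mtcars$cyl),
  head_only = mean(head(mtcars)$cyl)
)
```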
Once a recipe is defined, it needs to be estimated before being applied to data. Most recipe steps have specific quantities that must be calculated or estimated. For example, step_normalize() needs to compute the training set’s mean for the selected columns, while step_dummy() needs to determine the factor levels of selected columns in order to make the appropriate indicator columns.
The two most common applications of recipes are modeling and stand-alone preprocessing. How the recipe is estimated depends on how it is being used.
The best way to use a recipe for modeling is via the workflows package. This bundles a model and preprocessor (e.g. a recipe) together and gives the user a fluent way to train the model/recipe and make predictions.
library(dplyr)
library(workflows)
library(recipes)
library(parsnip)

data(biomass, package = "modeldata")

# split data
biomass_tr <- biomass %>% filter(dataset == "Training")
biomass_te <- biomass %>% filter(dataset == "Testing")

# With only predictors and outcomes, use a formula:
rec <- recipe(HHV ~ carbon + hydrogen + oxygen + nitrogen + sulfur,
              data = biomass_tr)

# Now add preprocessing steps to the recipe:
sp_signed <- rec %>%
  step_normalize(all_numeric_predictors()) %>%
  step_spatialsign(all_numeric_predictors())
sp_signed
##
## -- Recipe ------------------------------------------------------------
##
## -- Inputs
## Number of variables by role
## outcome:   1
## predictor: 5
##
## -- Operations
## * Centering and scaling for: all_numeric_predictors()
## * Spatial sign on: all_numeric_predictors()
We can create a parsnip model, and then build a workflow with the model and recipe:
linear_mod <- linear_reg()

linear_sp_sign_wflow <- workflow() %>%
  add_model(linear_mod) %>%
  add_recipe(sp_signed)

linear_sp_sign_wflow
## == Workflow ==========================================================
## Preprocessor: Recipe
## Model: linear_reg()
##
## -- Preprocessor ------------------------------------------------------
## 2 Recipe Steps
##
## * step_normalize()
## * step_spatialsign()
##
## -- Model -------------------------------------------------------------
## Linear Regression Model Specification (regression)
##
## Computational engine: lm
To estimate the preprocessing steps and then fit the linear model, a single call to fit() is used:
linear_sp_sign_fit <- fit(linear_sp_sign_wflow, data = biomass_tr)
When predicting, there is no need to do anything other than call predict(). This preprocesses the new data in the same manner as the training set, then gives the data to the linear model prediction code:
predict(linear_sp_sign_fit, new_data = head(biomass_te))
## # A tibble: 6 x 1
##   .pred
##   <dbl>
## 1  18.1
## 2  17.9
## 3  17.2
## 4  18.8
## 5  19.6
## 6  14.6
When using a recipe to generate data for a visualization or to troubleshoot any problems with the recipe, there are functions that can be used to estimate the recipe and apply it to new data manually.
Once a recipe has been defined, the prep() function can be used to estimate quantities required for the operations using a data set (a.k.a. the training data). prep() returns a recipe.
As an example of using PCA (perhaps to produce a plot):
# Define the recipe
pca_rec <- rec %>%
  step_normalize(all_numeric_predictors()) %>%
  step_pca(all_numeric_predictors())
Now to estimate the normalization statistics and the PCA loadings:
pca_rec <- prep(pca_rec, training = biomass_tr)
pca_rec
##
## -- Recipe ------------------------------------------------------------
##
## -- Inputs
## Number of variables by role
## outcome:   1
## predictor: 5
##
## -- Training information
## Training data contained 456 data points and no incomplete rows.
##
## -- Operations
## * Centering and scaling for: carbon and hydrogen, ... | Trained
## * PCA extraction with: carbon, hydrogen, oxygen, ... | Trained
Note that the estimated recipe shows the actual column names captured by the selectors.
You can tidy() a recipe (see tidy.recipe()), whether it is prepped or unprepped, to learn more about its components.
tidy(pca_rec)
## # A tibble: 2 x 6
##   number operation type      trained skip  id
##    <int> <chr>     <chr>     <lgl>   <lgl> <chr>
## 1      1 step      normalize TRUE    FALSE normalize_AeYA4
## 2      2 step      pca       TRUE    FALSE pca_Zn1yz
You can also tidy() recipe steps with a number or id argument.
To apply the prepped recipe to a data set, the bake() function is used in the same manner that predict() would be for models. This applies the estimated steps to any data set.
bake(pca_rec, head(biomass_te))
## # A tibble: 6 x 6
##     HHV    PC1    PC2     PC3     PC4     PC5
##   <dbl>  <dbl>  <dbl>   <dbl>   <dbl>   <dbl>
## 1  18.3 0.730  -0.412 -0.495   0.333   0.253
## 2  17.6 0.617   1.41   0.118  -0.466   0.815
## 3  17.2 0.761   1.10  -0.0550 -0.397   0.747
## 4  18.9 0.0400  0.950  0.158   0.405  -0.143
## 5  20.5 0.792  -0.732  0.204   0.465  -0.148
## 6  18.5 0.433  -0.127 -0.354  -0.0168 -0.0888
In general, the workflow interface to recipes is recommended for most applications.
An object of class recipe with sub-objects:
var_info |
A tibble containing information about the original data set columns. |
term_info |
A tibble that contains the current set of terms in the
data set. This initially defaults to the same data contained in
|
steps |
A list of |
template |
A tibble of the data. This is initialized to be the same as
the data given in the |
# formula example with single outcome:
data(biomass, package = "modeldata")

# split data
biomass_tr <- biomass[biomass$dataset == "Training", ]
biomass_te <- biomass[biomass$dataset == "Testing", ]

# With only predictors and outcomes, use a formula
rec <- recipe(
  HHV ~ carbon + hydrogen + oxygen + nitrogen + sulfur,
  data = biomass_tr
)

# Now add preprocessing steps to the recipe
sp_signed <- rec %>%
  step_normalize(all_numeric_predictors()) %>%
  step_spatialsign(all_numeric_predictors())
sp_signed

# ---------------------------------------------------------------------------
# formula multivariate example:
# no need for `cbind(carbon, hydrogen)` for left-hand side

multi_y <- recipe(carbon + hydrogen ~ oxygen + nitrogen + sulfur,
  data = biomass_tr
)
multi_y <- multi_y %>%
  step_center(all_numeric_predictors()) %>%
  step_scale(all_numeric_predictors())

# ---------------------------------------------------------------------------
# example using `update_role` instead of formula:
# best choice for high-dimensional data

rec <- recipe(biomass_tr) %>%
  update_role(carbon, hydrogen, oxygen, nitrogen, sulfur,
    new_role = "predictor"
  ) %>%
  update_role(HHV, new_role = "outcome") %>%
  update_role(sample, new_role = "id variable") %>%
  update_role(dataset, new_role = "splitting indicator")
rec
recipes_eval_select() is a recipes specific variant of tidyselect::eval_select() enhanced with the ability to recognize recipes selectors, such as all_numeric_predictors(). See selections for more information about the unique recipes selectors.
This is a developer tool that is only useful for creating new recipes steps.
recipes_eval_select(
  quos,
  data,
  info,
  ...,
  allow_rename = FALSE,
  check_case_weights = TRUE,
  call = caller_env()
)
quos |
A list of quosures describing the selection. This is generally
the |
data |
A data frame to use as the context to evaluate the selection in.
This is generally the |
info |
A data frame of term information describing each column's type
and role for use with the recipes selectors. This is generally the |
... |
These dots are for future extensions and must be empty. |
allow_rename |
Should the renaming syntax |
check_case_weights |
Should selecting case weights throw an error?
Defaults to |
call |
The execution environment of a currently running function, e.g.
|
A named character vector containing the evaluated selection. The names are always the same as the values, except when allow_rename = TRUE, in which case the names reflect the new names chosen by the user.
library(rlang)
data(scat, package = "modeldata")

rec <- recipe(Species ~ ., data = scat)

info <- summary(rec)
info

quos <- quos(all_numeric_predictors(), where(is.factor))

recipes_eval_select(quos, scat, info)
This is a developer tool intended to help make sure that all methods for each step have been created.
recipes_extension_check(
  pkg,
  exclude_steps = character(),
  exclude_methods = character()
)
pkg |
Character, name of package containing steps to check |
exclude_steps |
Character, name of steps to exclude. This is mostly used to remove false positives. |
exclude_methods |
Character, which methods to exclude testing for. Can take the values "prep", "bake", "print", "tidy", and "required_pkgs". |
It is recommended that the following test is placed in packages that add recipes steps to help keep everything up to date.
test_that("recipes_extension_check", {
  expect_snapshot(
    recipes::recipes_extension_check(
      pkg = "pkgname"
    )
  )
})
cli output
recipes_extension_check(
  pkg = "recipes"
)

recipes_extension_check(
  pkg = "recipes",
  exclude_steps = "step_testthat_helper",
  exclude_methods = c("required_pkgs")
)
update_role() alters an existing role in the recipe or assigns an initial role to variables that do not yet have a declared role.
add_role() adds an additional role to variables that already have a role in the recipe. It does not overwrite old roles, as a single variable can have multiple roles.
remove_role() eliminates a single existing role in the recipe.
add_role(recipe, ..., new_role = "predictor", new_type = NULL)

update_role(recipe, ..., new_role = "predictor", old_role = NULL)

remove_role(recipe, ..., old_role)
recipe |
An existing |
... |
One or more selector functions to choose which variables are being
assigned a role. See |
new_role |
A character string for a single role. |
new_type |
A character string for specific type that the variable should
be identified as. If left as |
old_role |
A character string for the specific role to update for the
variables selected by |
update_role(), add_role() and remove_role() will be applied on a recipe before any of the steps or checks, regardless of where they are located in position. This means that roles can only be changed with these three functions for columns that are already present in the original data supplied to recipe(). See the role argument in some step functions to update roles for columns created by steps.
Variables can have any arbitrary role (see the examples) but there are three special standard roles, "predictor", "outcome", and "case_weights". The first two roles are typically required when fitting a model.
update_role() should be used when a variable doesn't currently have a role in the recipe, or to replace an old_role with a new_role. add_role() only adds additional roles to variables that already have roles and will throw an error when the current role is missing (i.e. NA).
When using add_role(), if a variable is selected that already has the new_role, a warning is emitted and that variable is skipped so no duplicate roles are added.
Adding or updating roles is a useful way to group certain variables that don't fall in the standard "predictor" bucket. You can perform a step on all of the variables that have a custom role with the selector has_role().
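Selecting by a custom role can be sketched as follows. The role name "engine" and the choice of columns are made up for this illustration; the code assumes the recipes package is installed.

```r
library(recipes)

rec <- recipe(mpg ~ ., data = mtcars) %>%
  # Group two columns under a custom (hypothetical) role.
  update_role(disp, hp, new_role = "engine") %>%
  # Apply a step to every column that carries that role.
  step_log(has_role("engine")) %>%
  prep(training = mtcars)

baked <- bake(rec, new_data = NULL)

# Only the columns with the custom role were transformed.
all.equal(baked$disp, log(mtcars$disp))
all.equal(baked$wt, mtcars$wt)
```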
Recipes can label and retain column(s) of your data set that should not be treated as outcomes or predictors. A unique identifier column or some other ancillary data could be used to troubleshoot issues during model development but may not be either an outcome or predictor.
For example, the modeldata::biomass dataset has a column named sample with information about the specific sample type. We can change that role:
library(recipes)

data(biomass, package = "modeldata")
biomass_train <- biomass[1:100,]
biomass_test <- biomass[101:200,]

rec <- recipe(HHV ~ ., data = biomass_train) %>%
  update_role(sample, new_role = "id variable") %>%
  step_center(carbon)

rec <- prep(rec, biomass_train)
This means that sample is no longer treated as a "predictor" (the default role for columns on the right-hand side of the formula supplied to recipe()) and won't be used in model fitting or analysis, but will still be retained in the data set.
If you really aren't using sample in your recipe, we recommend that you instead remove sample from your dataset before passing it to recipe(). This is because recipes assumes that all non-standard roles are required at bake() time (or predict() time, if you are using a workflow). Since you didn't use sample in any steps of the recipe, you might think that you don't need to pass it to bake(), but this isn't true because recipes doesn't know that you didn't use it:
biomass_test$sample <- NULL

try(bake(rec, biomass_test))
#> Error in bake(rec, biomass_test) :
#> x The following required columns are missing from `new_data`: `sample`.
#> i These columns have one of the following roles, which are required at `bake()`
#> time: `id variable`.
#> i If these roles are not required at `bake()` time, use
#> `update_role_requirements(role = "your_role", bake = FALSE)`.
As we mentioned before, the best way to avoid this issue is to not use a role at all and simply remove the sample column from biomass before calling recipe(). In general, predictors and non-standard roles that are supplied to recipe() should be present at both prep() and bake() time.
If you can't remove sample for some reason, then the second best way to get around this issue is to tell recipes that the "id variable" role isn't required at bake() time. You can do that by using update_role_requirements():
    rec <- recipe(HHV ~ ., data = biomass_train) %>%
      update_role(sample, new_role = "id variable") %>%
      update_role_requirements("id variable", bake = FALSE) %>%
      step_center(carbon)

    rec <- prep(rec, biomass_train)

    # No errors!
    biomass_test_baked <- bake(rec, biomass_test)
It should be very rare that you need this feature.
An updated recipe object.
    library(recipes)
    data(biomass, package = "modeldata")

    # Using the formula method, roles are created for any outcomes and predictors:
    recipe(HHV ~ ., data = biomass) %>%
      summary()

    # However `sample` and `dataset` aren't predictors. Since they already have
    # roles, `update_role()` can be used to make changes, to any arbitrary role:
    recipe(HHV ~ ., data = biomass) %>%
      update_role(sample, new_role = "id variable") %>%
      update_role(dataset, new_role = "splitting variable") %>%
      summary()

    # `update_role()` cannot set a role to NA, use `remove_role()` for that
    ## Not run:
    recipe(HHV ~ ., data = biomass) %>%
      update_role(sample, new_role = NA_character_)
    ## End(Not run)

    # ------------------------------------------------------------------------------

    # Variables can have more than one role. `add_role()` can be used
    # if the column already has at least one role:
    recipe(HHV ~ ., data = biomass) %>%
      add_role(carbon, sulfur, new_role = "something") %>%
      summary()

    # `update_role()` has an argument called `old_role` that is required to
    # unambiguously update a role when the column currently has multiple roles.
    recipe(HHV ~ ., data = biomass) %>%
      add_role(carbon, new_role = "something") %>%
      update_role(carbon, new_role = "something else", old_role = "something") %>%
      summary()

    # `carbon` has two roles at the end, so the last `update_role()` fails since
    # `old_role` was not given.
    ## Not run:
    recipe(HHV ~ ., data = biomass) %>%
      add_role(carbon, sulfur, new_role = "something") %>%
      update_role(carbon, new_role = "something else")
    ## End(Not run)

    # ------------------------------------------------------------------------------

    # To remove a role, `remove_role()` can be used to remove a single role.
    recipe(HHV ~ ., data = biomass) %>%
      add_role(carbon, new_role = "something") %>%
      remove_role(carbon, old_role = "something") %>%
      summary()

    # To remove all roles, call `remove_role()` multiple times to reset to `NA`
    recipe(HHV ~ ., data = biomass) %>%
      add_role(carbon, new_role = "something") %>%
      remove_role(carbon, old_role = "something") %>%
      remove_role(carbon, old_role = "predictor") %>%
      summary()

    # ------------------------------------------------------------------------------

    # If the formula method is not used, all columns have a missing role:
    recipe(biomass) %>%
      summary()
Tips for selecting columns in step functions.
When selecting variables or model terms in `step` functions, `dplyr`-like tools are used. The selector functions can choose variables based on their name, current role, data type, or any combination of these. The selectors are passed as any other argument to the step. If the variables are explicitly named in the step function, this might look like:
    recipe( ~ ., data = USArrests) %>%
      step_pca(Murder, Assault, UrbanPop, Rape, num_comp = 3)
The first four arguments indicate which variables should be used in the PCA while the last argument is a specific argument to `step_pca()` about the number of components.
Note that:
- These arguments are not evaluated until the `prep` function for the step is executed.
- The `dplyr`-like syntax allows for negative signs to exclude variables (e.g. `-Murder`) and the set of selectors will be processed in order.
- A leading exclusion in these arguments (e.g. `-Murder`) has the effect of adding all variables to the list except the excluded variable(s), ignoring role information.
Select helpers from the `tidyselect` package can also be used: `tidyselect::starts_with()`, `tidyselect::ends_with()`, `tidyselect::contains()`, `tidyselect::matches()`, `tidyselect::num_range()`, `tidyselect::everything()`, `tidyselect::one_of()`, `tidyselect::all_of()`, and `tidyselect::any_of()`.
Note that `tidyselect::everything()` and the other `tidyselect` functions aren't restricted to predictors. They will thus select outcomes, ID, and predictor columns alike. This is why these functions should be used with care, and why `tidyselect::everything()` likely isn't what you need.
For example:

    recipe(Species ~ ., data = iris) %>%
      step_center(starts_with("Sepal"), -contains("Width"))

would only select `Sepal.Length`.
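One way to check what a set of selectors resolves to is to prep the recipe and inspect the step with `tidy()`. The sketch below does this for the same selection; the variable name `selected` is only illustrative.

```r
library(recipes)

rec <- recipe(Species ~ ., data = iris) %>%
  step_center(starts_with("Sepal"), -contains("Width"))

# Selectors are resolved when the recipe is prepped
rec <- prep(rec, training = iris)

# After prepping, `terms` holds the resolved column names
selected <- tidy(rec, number = 1)$terms
selected
```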
Columns of the design matrix that may not exist when the step is coded can also be selected. For example, when using `step_pca()`, the number of columns created by feature extraction may not be known when subsequent steps are defined. In this case, using `matches("^PC")` will select all of the columns whose names start with "PC" once those columns are created.
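As a small sketch of selecting columns that do not exist yet, the recipe below extracts two components and then centers whatever columns match `^PC`. The choice of `USArrests`, `num_comp = 2`, and `step_center()` as the downstream step is an assumption made for illustration.

```r
library(recipes)

rec <- recipe(~ ., data = USArrests) %>%
  step_normalize(all_numeric_predictors()) %>%
  step_pca(all_numeric_predictors(), num_comp = 2) %>%
  # The PC columns do not exist yet; they are matched when the recipe is prepped
  step_center(matches("^PC"))

rec <- prep(rec, training = USArrests)

# Columns that the third step resolved to
pc_terms <- tidy(rec, number = 3)$terms
pc_terms
```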
There are sets of recipes-specific functions that can be used to select variables based on their role or type: `has_role()` and `has_type()`. For convenience, there are also functions that are more specific. The functions `all_numeric()` and `all_nominal()` select based on type, with nominal variables including both character and factor; the functions `all_predictors()` and `all_outcomes()` select based on role. The functions `all_numeric_predictors()` and `all_nominal_predictors()` select intersections of role and type. Any of these can be used in conjunction with the previous functions described for selecting variables using their names.
A selection like this:

    data(biomass, package = "modeldata")
    recipe(HHV ~ ., data = biomass) %>%
      step_center(all_numeric(), -all_outcomes())

is equivalent to:

    data(biomass, package = "modeldata")
    recipe(HHV ~ ., data = biomass) %>%
      step_center(all_numeric_predictors())
Both result in all the numeric predictors: carbon, hydrogen, oxygen, nitrogen, and sulfur.
If a role for a variable has not been defined, it will never be selected using role-specific selectors.
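The equivalence of the two selections can be checked by prepping both recipes and comparing the resolved terms. This is a sketch; the object names `by_exclusion`, `by_intersection`, `terms_a`, and `terms_b` are illustrative.

```r
library(recipes)
data(biomass, package = "modeldata")

by_exclusion <- recipe(HHV ~ ., data = biomass) %>%
  step_center(all_numeric(), -all_outcomes()) %>%
  prep()

by_intersection <- recipe(HHV ~ ., data = biomass) %>%
  step_center(all_numeric_predictors()) %>%
  prep()

# Resolved column names for the centering step in each recipe
terms_a <- tidy(by_exclusion, number = 1)$terms
terms_b <- tidy(by_intersection, number = 1)$terms

setequal(terms_a, terms_b)
```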
Selectors can be used in `step_interact()` in similar ways but must be embedded in a model formula (as opposed to a sequence of selectors). For example, the interaction specification could be `~ starts_with("Species"):Sepal.Width`. This can be useful if `Species` was converted to dummy variables previously using `step_dummy()`. The implementation of `step_interact()` is special, and is more restricted than the other step functions. Only the selector functions from recipes and tidyselect are allowed. User defined selector functions will not be recognized. Additionally, the tidyselect domain specific language is not recognized here, meaning that `&`, `|`, `!`, and `-` will not work.
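A short sketch of the interaction specification mentioned here, using `iris` with `Sepal.Length` as an arbitrarily chosen outcome. `Species` is first converted to dummy variables, and the selector inside the formula then picks up those new columns.

```r
library(recipes)

rec <- recipe(Sepal.Length ~ ., data = iris) %>%
  step_dummy(Species) %>%
  # The selector sits inside a formula rather than in a list of selectors
  step_interact(~ starts_with("Species"):Sepal.Width)

rec <- prep(rec, training = iris)
baked <- bake(rec, new_data = NULL)

# Interaction columns are named with the default separator `_x_`
interaction_cols <- grep("_x_", names(baked), value = TRUE, fixed = TRUE)
interaction_cols
```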
When creating variable selections:
- If you are using column filtering steps, such as `step_corr()`, try to avoid hardcoding specific variable names in downstream steps in case those columns are removed by the filter. Instead, use `dplyr::any_of()` and `dplyr::all_of()`.
  - `dplyr::any_of()` will be tolerant if a column has been removed.
  - `dplyr::all_of()` will fail unless all of the columns are present in the data.
- For both of these functions, if you are going to save the recipe as a binary object to use in another R session, try to avoid referring to a vector in your workspace.
  - Preferred: `any_of(!!var_names)`
  - Avoid: `any_of(var_names)`
Some examples:
    some_vars <- names(mtcars)[4:6]

    # No filter steps, OK for not saving the recipe
    rec_1 <-
      recipe(mpg ~ ., data = mtcars) %>%
      step_log(all_of(some_vars)) %>%
      prep()

    # No filter steps, saving the recipe
    rec_2 <-
      recipe(mpg ~ ., data = mtcars) %>%
      step_log(!!!some_vars) %>%
      prep()

    # This fails since `wt` is not in the data
    try(
      recipe(mpg ~ ., data = mtcars) %>%
        step_rm(wt) %>%
        step_log(!!!some_vars) %>%
        prep(),
      silent = TRUE
    )

    # Best for filters (using any_of()) and when
    # saving the recipe
    rec_4 <-
      recipe(mpg ~ ., data = mtcars) %>%
      step_rm(wt) %>%
      step_log(any_of(!!some_vars)) %>%
      # equal to step_log(any_of(c("hp", "drat", "wt")))
      prep()
`recipe()`, `prep()`, and `bake()` all accept sparse tibbles from the `sparsevctrs` package and sparse matrices from the `Matrix` package. Sparse matrices are converted to sparse tibbles internally as each step expects a tibble as its input, and is expected to return a tibble as well.
Several steps work with sparse data. A step can either work with sparse data, ruin sparsity, or create sparsity. The documentation for each step will indicate whether it will work with sparse data or create sparse columns. If nothing is listed it is assumed to ruin sparsity.
Sparse tibbles or `data.frame`s will be returned from `bake()` if sparse columns are present in data, either from being generated in steps or because sparse data was passed into `recipe()`, `prep()`, or `bake()`.
`step_arrange()` creates a specification of a recipe step that will sort rows using `dplyr::arrange()`.
    step_arrange(
      recipe,
      ...,
      role = NA,
      trained = FALSE,
      inputs = NULL,
      skip = FALSE,
      id = rand_id("arrange")
    )
recipe |
A recipe object. The step will be added to the sequence of operations for this recipe. |
... |
Comma separated list of unquoted variable names.
Use `desc()` to sort a variable in descending order. See
|
role |
Not used by this step since no new variables are created. |
trained |
A logical to indicate if the quantities for preprocessing have been estimated. |
inputs |
Quosure of values given by |
skip |
A logical. Should the step be skipped when the
recipe is baked by |
id |
A character string that is unique to this step to identify it. |
When an object in the user's global environment is referenced in the expression defining the new variable(s), it is a good idea to use quasiquotation (e.g. `!!!`) to embed the value of the object in the expression (to be portable between sessions). See the examples.
An updated version of `recipe` with the new step added to the sequence of any existing operations.
When you `tidy()` this step, a tibble is returned with columns `terms` and `id`:
- `terms`: character, the selectors or variables selected
- `id`: character, id of this step
The underlying operation does not allow for case weights.
Other row operation steps: `step_filter()`, `step_impute_roll()`, `step_lag()`, `step_naomit()`, `step_sample()`, `step_shuffle()`, `step_slice()`
Other dplyr steps: `step_filter()`, `step_mutate()`, `step_mutate_at()`, `step_rename()`, `step_rename_at()`, `step_sample()`, `step_select()`, `step_slice()`
    library(dplyr)

    rec <- recipe(~., data = iris) %>%
      step_arrange(desc(Sepal.Length), 1 / Petal.Length)

    prepped <- prep(rec, training = iris %>% slice(1:75))
    tidy(prepped, number = 1)

    dplyr_train <-
      iris %>%
      as_tibble() %>%
      slice(1:75) %>%
      dplyr::arrange(desc(Sepal.Length), 1 / Petal.Length)

    rec_train <- bake(prepped, new_data = NULL)
    all.equal(dplyr_train, rec_train)

    dplyr_test <-
      iris %>%
      as_tibble() %>%
      slice(76:150) %>%
      dplyr::arrange(desc(Sepal.Length), 1 / Petal.Length)
    rec_test <- bake(prepped, iris %>% slice(76:150))
    all.equal(dplyr_test, rec_test)

    # When you have variables/expressions, you can create a
    # list of symbols with `rlang::syms()` and splice them in
    # the call with `!!!`. See https://tidyeval.tidyverse.org

    sort_vars <- c("Sepal.Length", "Petal.Length")

    qq_rec <-
      recipe(~., data = iris) %>%
      # Embed the `sort_vars` object in the call using !!!
      step_arrange(!!!syms(sort_vars)) %>%
      prep(training = iris)

    tidy(qq_rec, number = 1)
`step_bin2factor()` creates a specification of a recipe step that will create a two-level factor from a single dummy variable.
    step_bin2factor(
      recipe,
      ...,
      role = NA,
      trained = FALSE,
      levels = c("yes", "no"),
      ref_first = TRUE,
      columns = NULL,
      skip = FALSE,
      id = rand_id("bin2factor")
    )
recipe |
A recipe object. The step will be added to the sequence of operations for this recipe. |
... |
One or more selector functions to choose variables
for this step. See |
role |
Not used by this step since no new variables are created. |
trained |
A logical to indicate if the quantities for preprocessing have been estimated. |
levels |
A length 2 character string that indicates the factor levels for the 1's (in the first position) and the zeros (second) |
ref_first |
Logical. Should the first level, which replaces 1's, be the factor reference level? |
columns |
A character string of the selected variable names. This field
is a placeholder and will be populated once |
skip |
A logical. Should the step be skipped when the
recipe is baked by |
id |
A character string that is unique to this step to identify it. |
This operation may be useful for situations where a binary piece of information may need to be represented as categorical instead of numeric. For example, naive Bayes models would do better to have factor predictors so that the binomial distribution is modeled instead of a Gaussian probability density of numeric binary data. Note that the numeric data is only verified to be numeric (and does not count levels).
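A minimal sketch of the conversion on a hypothetical toy data frame (the data, the column name `flag`, and the level labels are made up for illustration). The 1's are mapped to the first element of `levels` and the 0's to the second.

```r
library(recipes)

# Hypothetical toy data: `flag` is a 0/1 dummy variable
toy <- data.frame(
  flag = c(1, 0, 1, 1, 0),
  x = c(2.1, 3.4, 1.9, 5.0, 4.2)
)

rec <- recipe(~ ., data = toy) %>%
  step_bin2factor(flag, levels = c("present", "absent"))

rec <- prep(rec, training = toy)
out <- bake(rec, new_data = NULL)

# 1's become "present" (first level), 0's become "absent" (second level)
out$flag
```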
An updated version of `recipe` with the new step added to the sequence of any existing operations.
When you `tidy()` this step, a tibble is returned with columns `terms` and `id`:
- `terms`: character, the selectors or variables selected
- `id`: character, id of this step
The underlying operation does not allow for case weights.
Other dummy variable and encoding steps: `step_count()`, `step_date()`, `step_dummy()`, `step_dummy_extract()`, `step_dummy_multi_choice()`, `step_factor2string()`, `step_holiday()`, `step_indicate_na()`, `step_integer()`, `step_novel()`, `step_num2factor()`, `step_ordinalscore()`, `step_other()`, `step_regex()`, `step_relevel()`, `step_string2factor()`, `step_time()`, `step_unknown()`, `step_unorder()`
    data(covers, package = "modeldata")

    rec <- recipe(~description, covers) %>%
      step_regex(description, pattern = "(rock|stony)", result = "rocks") %>%
      step_regex(description, pattern = "(rock|stony)", result = "more_rocks") %>%
      step_bin2factor(rocks)

    tidy(rec, number = 3)

    rec <- prep(rec, training = covers)
    results <- bake(rec, new_data = covers)

    table(results$rocks, results$more_rocks)

    tidy(rec, number = 3)
`step_BoxCox()` creates a specification of a recipe step that will transform data using a Box-Cox transformation.
    step_BoxCox(
      recipe,
      ...,
      role = NA,
      trained = FALSE,
      lambdas = NULL,
      limits = c(-5, 5),
      num_unique = 5,
      skip = FALSE,
      id = rand_id("BoxCox")
    )
recipe |
A recipe object. The step will be added to the sequence of operations for this recipe. |
... |
One or more selector functions to choose variables
for this step. See |
role |
Not used by this step since no new variables are created. |
trained |
A logical to indicate if the quantities for preprocessing have been estimated. |
lambdas |
A numeric vector of transformation values. This
is |
limits |
A length 2 numeric vector defining the range to compute the transformation parameter lambda. |
num_unique |
An integer to specify minimum required unique values to evaluate for a transformation. |
skip |
A logical. Should the step be skipped when the
recipe is baked by |
id |
A character string that is unique to this step to identify it. |
The Box-Cox transformation, which requires a strictly positive variable, can be used to rescale a variable to be more similar to a normal distribution. In this package, the partial log-likelihood function is directly optimized within a reasonable set of transformation values (which can be changed by the user).
This transformation is typically done on the outcome variable using the residuals for a statistical model (such as ordinary least squares). Here, a simple null model (intercept only) is used to apply the transformation to the predictor variables individually. This can have the effect of making the variable distributions more symmetric.
If the transformation parameters are estimated to be very close to the bounds, or if the optimization fails, a value of `NA` is used and no transformation is applied.
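For reference, the textbook Box-Cox transform for a given lambda can be sketched in base R as follows. The helper name `box_cox()` and the `eps` tolerance for treating lambda as zero are illustrative assumptions; this is not the package's internal code, and it does not estimate lambda.

```r
# Textbook Box-Cox transform of a strictly positive vector `x`:
#   (x^lambda - 1) / lambda   when lambda is not (near) zero
#   log(x)                    when lambda is (near) zero
box_cox <- function(x, lambda, eps = 0.001) {
  stopifnot(all(x > 0))
  if (abs(lambda) < eps) {
    log(x)
  } else {
    (x^lambda - 1) / lambda
  }
}

x <- c(1, 2, 4)
box_cox(x, lambda = 1)    # lambda = 1 only shifts the data: x - 1
box_cox(x, lambda = 0)    # lambda = 0 is the natural log
box_cox(x, lambda = 0.5)  # values in between compress large values less than log
```

In the step itself, lambda is estimated per variable during `prep()` and can be inspected in the `value` column returned by `tidy()`.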
An updated version of `recipe` with the new step added to the sequence of any existing operations.
When you `tidy()` this step, a tibble is returned with columns `terms`, `value`, and `id`:
- `terms`: character, the selectors or variables selected
- `value`: numeric, the lambda estimate
- `id`: character, id of this step
The underlying operation does not allow for case weights.
Sakia, R. M. (1992). The Box-Cox transformation technique: A review. The Statistician, 169-178.
Other individual transformation steps: `step_YeoJohnson()`, `step_bs()`, `step_harmonic()`, `step_hyperbolic()`, `step_inverse()`, `step_invlogit()`, `step_log()`, `step_logit()`, `step_mutate()`, `step_ns()`, `step_percentile()`, `step_poly()`, `step_relu()`, `step_sqrt()`
    rec <- recipe(~., data = as.data.frame(state.x77))

    bc_trans <- step_BoxCox(rec, all_numeric())

    bc_estimates <- prep(bc_trans, training = as.data.frame(state.x77))

    bc_data <- bake(bc_estimates, as.data.frame(state.x77))

    plot(density(state.x77[, "Illiteracy"]), main = "before")
    plot(density(bc_data$Illiteracy), main = "after")

    tidy(bc_trans, number = 1)
    tidy(bc_estimates, number = 1)
`step_bs()` creates a specification of a recipe step that will create new columns that are basis expansions of variables using B-splines.
    step_bs(
      recipe,
      ...,
      role = "predictor",
      trained = FALSE,
      deg_free = NULL,
      degree = 3,
      objects = NULL,
      options = list(),
      keep_original_cols = FALSE,
      skip = FALSE,
      id = rand_id("bs")
    )
recipe |
A recipe object. The step will be added to the sequence of operations for this recipe. |
... |
One or more selector functions to choose variables
for this step. See |
role |
For model terms created by this step, what analysis role should they be assigned? By default, the new columns created by this step from the original variables will be used as predictors in a model. |
trained |
A logical to indicate if the quantities for preprocessing have been estimated. |
deg_free |
The degrees of freedom for the spline. As the degrees of freedom for a spline increase, more flexible and complex curves can be generated. When a single degree of freedom is used, the result is a rescaled version of the original data. |
degree |
Degree of polynomial spline (integer). |
objects |
A list of |
options |
A list of options for |
keep_original_cols |
A logical to keep the original variables in the
output. Defaults to |
skip |
A logical. Should the step be skipped when the
recipe is baked by |
id |
A character string that is unique to this step to identify it. |
`step_bs` can create new features from a single variable that enable fitting routines to model this variable in a nonlinear manner. The extent of the possible nonlinearity is determined by the `df`, `degree`, or `knots` arguments of `splines::bs()`. By default (`keep_original_cols = FALSE`), the original variables are removed from the data and new columns are added. The naming convention for the new variables is `varname_bs_1` and so on.
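The naming convention can be seen in a small sketch. The choice of `deg_free = 4` and of the two-predictor formula is an assumption made for illustration.

```r
library(recipes)
data(biomass, package = "modeldata")

rec <- recipe(HHV ~ carbon + hydrogen, data = biomass) %>%
  step_bs(carbon, deg_free = 4)

rec <- prep(rec, training = biomass)
expanded <- bake(rec, new_data = NULL)

# `carbon` is replaced by its basis columns; `hydrogen` and `HHV` are untouched
names(expanded)
```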
An updated version of `recipe` with the new step added to the sequence of any existing operations.
When you `tidy()` this step, a tibble is returned with columns `terms` and `id`:
- `terms`: character, the selectors or variables selected
- `id`: character, id of this step
This step has 2 tuning parameters:
- `deg_free`: Spline Degrees of Freedom (type: integer, default: NULL)
- `degree`: Polynomial Degree (type: integer, default: 3)
The underlying operation does not allow for case weights.
Other individual transformation steps: `step_BoxCox()`, `step_YeoJohnson()`, `step_harmonic()`, `step_hyperbolic()`, `step_inverse()`, `step_invlogit()`, `step_log()`, `step_logit()`, `step_mutate()`, `step_ns()`, `step_percentile()`, `step_poly()`, `step_relu()`, `step_sqrt()`
    data(biomass, package = "modeldata")

    biomass_tr <- biomass[biomass$dataset == "Training", ]
    biomass_te <- biomass[biomass$dataset == "Testing", ]

    rec <- recipe(
      HHV ~ carbon + hydrogen + oxygen + nitrogen + sulfur,
      data = biomass_tr
    )

    with_splines <- rec %>%
      step_bs(carbon, hydrogen)
    with_splines <- prep(with_splines, training = biomass_tr)

    expanded <- bake(with_splines, biomass_te)
    expanded
`step_center()` creates a specification of a recipe step that will normalize numeric data to have a mean of zero.
    step_center(
      recipe,
      ...,
      role = NA,
      trained = FALSE,
      means = NULL,
      na_rm = TRUE,
      skip = FALSE,
      id = rand_id("center")
    )
recipe |
A recipe object. The step will be added to the sequence of operations for this recipe. |
... |
One or more selector functions to choose variables
for this step. See |
role |
Not used by this step since no new variables are created. |
trained |
A logical to indicate if the quantities for preprocessing have been estimated. |
means |
A named numeric vector of means. This is
|
na_rm |
A logical value indicating whether |
skip |
A logical. Should the step be skipped when the
recipe is baked by |
id |
A character string that is unique to this step to identify it. |
Centering data means that the average of a variable is subtracted from the data. `step_center` estimates the variable means from the data used in the `training` argument of `prep.recipe`. `bake.recipe` then applies the centering to new data sets using these means.
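A tiny sketch on hypothetical toy data (the data frames `train` and `test` are made up for illustration) shows that the mean is estimated once from the training set and then reused for new data:

```r
library(recipes)

# Hypothetical toy data
train <- data.frame(x = c(1, 2, 3, 4))
test <- data.frame(x = c(10, 20))

rec <- recipe(~ x, data = train) %>%
  step_center(x)

rec <- prep(rec, training = train)

# The stored mean comes from `train` only
train_mean <- tidy(rec, number = 1)$value

# New data are shifted by the training mean, not by their own mean
centered_test <- bake(rec, new_data = test)$x
```

Here the training mean is 2.5, so the test values 10 and 20 become 7.5 and 17.5 rather than being centered around zero.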
An updated version of `recipe` with the new step added to the sequence of any existing operations.
When you `tidy()` this step, a tibble is returned with columns `terms`, `value`, and `id`:
- `terms`: character, the selectors or variables selected
- `value`: numeric, the means
- `id`: character, id of this step
This step performs an unsupervised operation that can utilize case weights. As a result, case weights are only used with frequency weights. For more information, see the documentation in case_weights and the examples on `tidymodels.org`.
Other normalization steps: `step_normalize()`, `step_range()`, `step_scale()`
    data(biomass, package = "modeldata")

    biomass_tr <- biomass[biomass$dataset == "Training", ]
    biomass_te <- biomass[biomass$dataset == "Testing", ]

    rec <- recipe(
      HHV ~ carbon + hydrogen + oxygen + nitrogen + sulfur,
      data = biomass_tr
    )

    center_trans <- rec %>%
      step_center(carbon, contains("gen"), -hydrogen)

    center_obj <- prep(center_trans, training = biomass_tr)

    transformed_te <- bake(center_obj, biomass_te)

    biomass_te[1:10, names(transformed_te)]
    transformed_te

    tidy(center_trans, number = 1)
    tidy(center_obj, number = 1)
`step_classdist()` creates a specification of a recipe step that will convert numeric data into Mahalanobis distance measurements to the data centroid. This is done for each value of a categorical class variable.
    step_classdist(
      recipe,
      ...,
      class,
      role = "predictor",
      trained = FALSE,
      mean_func = mean,
      cov_func = cov,
      pool = FALSE,
      log = TRUE,
      objects = NULL,
      prefix = "classdist_",
      keep_original_cols = TRUE,
      skip = FALSE,
      id = rand_id("classdist")
    )
recipe |
A recipe object. The step will be added to the sequence of operations for this recipe. |
... |
One or more selector functions to choose variables
for this step. See |
class |
A single character string that specifies a single categorical variable to be used as the class. |
role |
For model terms created by this step, what analysis role should they be assigned? By default, the new columns created by this step from the original variables will be used as predictors in a model. |
trained |
A logical to indicate if the quantities for preprocessing have been estimated. |
mean_func |
A function to compute the center of the distribution. |
cov_func |
A function that computes the covariance matrix |
pool |
A logical: should the covariance matrix be computed by pooling the data for all of the classes? |
log |
A logical: should the distances be transformed by the natural log function? |
objects |
Statistics are stored here once this step has
been trained by |
prefix |
A character string for the prefix of the resulting new variables. See notes below. |
keep_original_cols |
A logical to keep the original variables in the
output. Defaults to |
skip |
A logical. Should the step be skipped when the
recipe is baked by |
id |
A character string that is unique to this step to identify it. |
`step_classdist` will create a new column for every unique value of the `class` variable. The resulting variables will not replace the original values and, by default, have the prefix `classdist_`. The naming format can be changed using the `prefix` argument.
Class-specific centroids are the multivariate averages of each predictor using the data from each class in the training set. When pre-processing a new data point, this step computes the distance from the new point to each of the class centroids. These distance features can be very effective at capturing linear class boundaries. For this reason, they can be useful to add to an existing predictor set used within a nonlinear model. If the true boundary is actually linear, the model will have an easier time learning the training data patterns.
Note that the default covariance function requires that each class have at least as many rows as variables listed in the `terms` argument. If `pool = TRUE`, there must be at least as many data points as variables overall.
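The idea of class-specific centroid distances can be sketched in base R with `stats::mahalanobis()`. This illustrates the concept only and is not the step's implementation; the use of `iris` and the object names are assumptions, and details such as the log transformation and case weights are left out.

```r
# Numeric columns from a built-in data set; `Species` is the class variable
num_cols <- c("Sepal.Length", "Sepal.Width", "Petal.Length", "Petal.Width")
x <- as.matrix(iris[, num_cols])

# One centroid and one covariance matrix per class (no pooling)
by_class <- split(as.data.frame(x), iris$Species)

# stats::mahalanobis() returns *squared* distances to the given center
class_dists <- sapply(by_class, function(d) {
  mahalanobis(x, center = colMeans(d), cov = cov(d))
})

# One row per observation, one column per class
dim(class_dists)
head(class_dists)
```

Each class here has 50 rows and 4 variables, so the per-class covariance matrices can be inverted, which is the row-count requirement described above.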
An updated version of `recipe` with the new step added to the sequence of any existing operations.
When you `tidy()` this step, a tibble is returned with columns `terms`, `value`, `class`, and `id`:
- `terms`: character, the selectors or variables selected
- `value`: numeric, location of centroid
- `class`: character, name of the class
- `id`: character, id of this step
This step performs a supervised operation that can utilize case weights. As a result, case weights are used with frequency weights as well as importance weights. For more information, see the documentation in case_weights and the examples on `tidymodels.org`.
Other multivariate transformation steps: `step_classdist_shrunken()`, `step_depth()`, `step_geodist()`, `step_ica()`, `step_isomap()`, `step_kpca()`, `step_kpca_poly()`, `step_kpca_rbf()`, `step_mutate_at()`, `step_nnmf()`, `step_nnmf_sparse()`, `step_pca()`, `step_pls()`, `step_ratio()`, `step_spatialsign()`
    data(penguins, package = "modeldata")
    penguins <- penguins[vctrs::vec_detect_complete(penguins), ]
    penguins$island <- NULL
    penguins$sex <- NULL

    # in case of missing data...
    mean2 <- function(x) mean(x, na.rm = TRUE)

    # define naming convention
    rec <- recipe(species ~ ., data = penguins) %>%
      step_classdist(all_numeric_predictors(),
        class = "species",
        pool = FALSE, mean_func = mean2, prefix = "centroid_"
      )

    # default naming
    rec <- recipe(species ~ ., data = penguins) %>%
      step_classdist(all_numeric_predictors(),
        class = "species",
        pool = FALSE, mean_func = mean2
      )

    rec_dists <- prep(rec, training = penguins)

    dists_to_species <- bake(rec_dists, new_data = penguins)
    ## on log scale:
    dist_cols <- grep("classdist", names(dists_to_species), value = TRUE)
    dists_to_species[, c("species", dist_cols)]

    tidy(rec, number = 1)
    tidy(rec_dists, number = 1)
data(penguins, package = "modeldata") penguins <- penguins[vctrs::vec_detect_complete(penguins), ] penguins$island <- NULL penguins$sex <- NULL # in case of missing data... mean2 <- function(x) mean(x, na.rm = TRUE) # define naming convention rec <- recipe(species ~ ., data = penguins) %>% step_classdist(all_numeric_predictors(), class = "species", pool = FALSE, mean_func = mean2, prefix = "centroid_" ) # default naming rec <- recipe(species ~ ., data = penguins) %>% step_classdist(all_numeric_predictors(), class = "species", pool = FALSE, mean_func = mean2 ) rec_dists <- prep(rec, training = penguins) dists_to_species <- bake(rec_dists, new_data = penguins) ## on log scale: dist_cols <- grep("classdist", names(dists_to_species), value = TRUE) dists_to_species[, c("species", dist_cols)] tidy(rec, number = 1) tidy(rec_dists, number = 1)
step_classdist_shrunken creates a specification of a recipe step that will convert numeric data into Euclidean distance to the regularized class centroid. This is done for each value of a categorical class variable.
step_classdist_shrunken(
  recipe,
  ...,
  class = NULL,
  role = NA,
  trained = FALSE,
  threshold = 1/2,
  sd_offset = 1/2,
  log = TRUE,
  prefix = "classdist_",
  keep_original_cols = TRUE,
  objects = NULL,
  skip = FALSE,
  id = rand_id("classdist_shrunken")
)
recipe | A recipe object. The step will be added to the sequence of operations for this recipe. |
... | One or more selector functions to choose variables for this step. See selections for more details. |
class | A single character string that specifies a single categorical variable to be used as the class. |
role | Not used by this step since no new variables are created. |
trained | A logical to indicate if the quantities for preprocessing have been estimated. |
threshold | A regularization parameter between zero and one. Zero means that no regularization is used and one means that centroids should be shrunk to the global centroid. |
sd_offset | A value between zero and one for the quantile that should be used to stabilize the pooled standard deviation. |
log | A logical: should the distances be transformed by the natural log function? |
prefix | A character string for the prefix of the resulting new variables. See notes below. |
keep_original_cols | A logical to keep the original variables in the output. Defaults to TRUE. |
objects | Statistics are stored here once this step has been trained by prep(). |
skip | A logical. Should the step be skipped when the recipe is baked by bake()? |
id | A character string that is unique to this step to identify it. |
Class-specific centroids are the multivariate averages of each predictor using the data from each class in the training set. When pre-processing a new data point, this step computes the distance from the new point to each of the class centroids. These distance features can be very effective at capturing linear class boundaries. For this reason, they can be useful to add to an existing predictor set used within a nonlinear model. If the true boundary is actually linear, the model will have an easier time learning the training data patterns.
Shrunken centroids use a form of regularization where the class-specific centroids are contracted to the overall class-independent centroid. If a predictor is uninformative, shrinking it may move it entirely to the overall centroid. This has the effect of removing that predictor's effect on the new distance features. However, it may not move all of the class-specific features to the center in many cases. This means that some features will only affect the classification of specific classes.
The threshold parameter can be used to optimize how much regularization should be used.
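The shrinkage idea can be sketched in base R with the soft-thresholding rule from the Tibshirani et al. shrunken centroids paper. This is a deliberately simplified illustration, not the implementation used by step_classdist_shrunken(): it omits sd_offset and class-size corrections, and it assumes that threshold is a fraction of the largest standardized difference so that 0 and 1 match the endpoints described for the threshold argument.

```r
# Simplified shrunken-centroid sketch (an assumption-laden illustration,
# not the recipes internals).
soft_threshold <- function(d, delta) sign(d) * pmax(abs(d) - delta, 0)

x <- as.matrix(iris[, 1:4])
y <- iris$Species

global <- colMeans(x)                                        # overall centroid
by_class <- apply(x, 2, function(col) tapply(col, y, mean))  # classes x predictors

# Pooled within-class standard deviation for each predictor
resid <- x - by_class[as.character(y), ]
pooled_sd <- sqrt(colSums(resid^2) / (nrow(x) - nlevels(y)))

# Standardized difference of each class centroid from the overall centroid
d <- sweep(sweep(by_class, 2, global, "-"), 2, pooled_sd, "/")

shrink <- function(threshold) {
  # Assumed scaling: threshold is a fraction of the largest difference
  delta <- threshold * max(abs(d))
  d_shrunk <- soft_threshold(d, delta)
  # Map the shrunken differences back to the original units
  sweep(sweep(d_shrunk, 2, pooled_sd, "*"), 2, global, "+")
}

shrink(0)    # no regularization: the class-specific centroids
shrink(0.5)  # partial shrinkage toward the overall centroid
shrink(1)    # full shrinkage: every class sits on the overall centroid
```

Predictors whose standardized differences fall below the shrinkage amount for every class collapse onto the overall centroid and stop contributing to the distance features.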
step_classdist_shrunken will create a new column for every unique value of the class variable. The resulting variables will not replace the original values and, by default, have the prefix classdist_. The naming format can be changed using the prefix argument.
When you tidy() this step, a tibble is returned with columns terms, value, class, type, threshold, and id:
terms | character, the selectors or variables selected |
value | numeric, the centroid |
class | character, name of class variable |
type | character, has values "global", "by_class", and "shrunken" |
threshold | numeric, value of threshold |
id | character, id of this step |
The first two types of centroids are in the original units while the last has been standardized.
This step performs a supervised operation that can utilize case weights. As a result, case weights are used with frequency weights as well as importance weights. For more information, see the documentation in case_weights and the examples on tidymodels.org.
Tibshirani, R., Hastie, T., Narasimhan, B., & Chu, G. (2002). Diagnosis of multiple cancer types by shrunken centroids of gene expression. Proceedings of the National Academy of Sciences, 99(10), 6567-6572.
Other multivariate transformation steps: step_classdist(), step_depth(), step_geodist(), step_ica(), step_isomap(), step_kpca(), step_kpca_poly(), step_kpca_rbf(), step_mutate_at(), step_nnmf(), step_nnmf_sparse(), step_pca(), step_pls(), step_ratio(), step_spatialsign()
data(penguins, package = "modeldata")
penguins <- penguins[vctrs::vec_detect_complete(penguins), ]
penguins$island <- NULL
penguins$sex <- NULL

# define naming convention
rec <- recipe(species ~ ., data = penguins) %>%
  step_classdist_shrunken(all_numeric_predictors(),
    class = "species",
    threshold = 1 / 4, prefix = "centroid_"
  )

# default naming
rec <- recipe(species ~ ., data = penguins) %>%
  step_classdist_shrunken(all_numeric_predictors(),
    class = "species",
    threshold = 3 / 4
  )

rec_dists <- prep(rec, training = penguins)

dists_to_species <- bake(rec_dists, new_data = penguins)
## on log scale:
dist_cols <- grep("classdist", names(dists_to_species), value = TRUE)
dists_to_species[, c("species", dist_cols)]

tidy(rec, number = 1)
tidy(rec_dists, number = 1)
step_corr() creates a specification of a recipe step that will potentially remove variables that have large absolute correlations with other variables.
step_corr(
  recipe,
  ...,
  role = NA,
  trained = FALSE,
  threshold = 0.9,
  use = "pairwise.complete.obs",
  method = "pearson",
  removals = NULL,
  skip = FALSE,
  id = rand_id("corr")
)
recipe | A recipe object. The step will be added to the sequence of operations for this recipe. |
... | One or more selector functions to choose variables for this step. See selections for more details. |
role | Not used by this step since no new variables are created. |
trained | A logical to indicate if the quantities for preprocessing have been estimated. |
threshold | A value for the threshold of absolute correlation values. The step will try to remove the minimum number of columns so that all the resulting absolute correlations are less than this value. |
use | A character string for the use argument to the stats::cor() function. |
method | A character string for the method argument to the stats::cor() function. |
removals | A character string that contains the names of columns that should be removed. These values are not determined until prep() is called. |
skip | A logical. Should the step be skipped when the recipe is baked by bake()? |
id | A character string that is unique to this step to identify it. |
This step can potentially remove columns from the data set. This may cause issues for subsequent steps in your recipe if the missing columns are specifically referenced by name. To avoid this, see the advice in the Tips for saving recipes and filtering columns section of selections.
This step attempts to remove variables to keep the largest absolute correlation between the variables less than threshold.
When a column has a single unique value, that column will be excluded from the correlation analysis. Also, if the data set has sporadic missing values (and an inappropriate value of use is chosen), some columns will also be excluded from the filter.
The arguments use and method don't take effect if case weights are used in the recipe.
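To make the filtering goal concrete, here is a small base R sketch of a greedy correlation filter. It is a simplification written for illustration and is not the algorithm used by step_corr(): at each pass it finds the most correlated pair and drops whichever member has the larger average absolute correlation. The helper name drop_correlated is hypothetical.

```r
# Hypothetical helper: greedy high-correlation filter (a simplification).
drop_correlated <- function(x, threshold = 0.9) {
  removed <- character(0)
  repeat {
    keep <- setdiff(colnames(x), removed)
    if (length(keep) < 2) break
    cmat <- abs(cor(x[, keep, drop = FALSE]))
    diag(cmat) <- 0
    if (max(cmat) < threshold) break
    # Of the most correlated pair, drop the column with the larger
    # average absolute correlation with all other columns
    pair <- which(cmat == max(cmat), arr.ind = TRUE)[1, ]
    avg <- colMeans(cmat[, pair])
    removed <- c(removed, keep[pair[which.max(avg)]])
  }
  removed
}

set.seed(1)
a <- rnorm(100)
dat <- cbind(a = a, b = a + rnorm(100, sd = 0.01), c = rnorm(100))

# `a` and `b` are nearly identical, so exactly one of them is removed
removed <- drop_correlated(dat, threshold = 0.9)
removed
```

After the loop ends, every remaining pair of columns has an absolute correlation below the threshold.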
An updated version of recipe with the new step added to the sequence of any existing operations.
When you tidy() this step, a tibble is returned with columns terms and id:
terms | character, the selectors or variables selected to be removed |
id | character, id of this step |
This step has 1 tuning parameter:
threshold: Threshold (type: double, default: 0.9)
This step performs an unsupervised operation that can utilize case weights. As a result, case weights are only used with frequency weights. For more information, see the documentation in case_weights and the examples on tidymodels.org.
Original R code for filtering algorithm by Dong Li, modified by Max Kuhn. Contributions by Reynald Lescarbeau (for original in caret package). Max Kuhn for the step function.
Other variable filter steps: step_filter_missing(), step_lincomb(), step_nzv(), step_rm(), step_select(), step_zv()
data(biomass, package = "modeldata")

set.seed(3535)
biomass$duplicate <- biomass$carbon + rnorm(nrow(biomass))

biomass_tr <- biomass[biomass$dataset == "Training", ]
biomass_te <- biomass[biomass$dataset == "Testing", ]

rec <- recipe(
  HHV ~ carbon + hydrogen + oxygen + nitrogen + sulfur + duplicate,
  data = biomass_tr
)

corr_filter <- rec %>%
  step_corr(all_numeric_predictors(), threshold = .5)

filter_obj <- prep(corr_filter, training = biomass_tr)

filtered_te <- bake(filter_obj, biomass_te)
round(abs(cor(biomass_tr[, c(3:7, 9)])), 2)
round(abs(cor(filtered_te)), 2)

tidy(corr_filter, number = 1)
tidy(filter_obj, number = 1)
step_count() creates a specification of a recipe step that will create a variable that counts instances of a regular expression pattern in text.
step_count(
  recipe,
  ...,
  role = "predictor",
  trained = FALSE,
  pattern = ".",
  normalize = FALSE,
  options = list(),
  result = make.names(pattern),
  input = NULL,
  keep_original_cols = TRUE,
  skip = FALSE,
  id = rand_id("count")
)
recipe | A recipe object. The step will be added to the sequence of operations for this recipe. |
... | A single selector function to choose which variable will be searched for the regex pattern. The selector should resolve to a single variable. See selections for more details. |
role | For model terms created by this step, what analysis role should they be assigned? By default, the new columns created by this step from the original variables will be used as predictors in a model. |
trained | A logical to indicate if the quantities for preprocessing have been estimated. |
pattern | A character string containing a regular expression (or character string for fixed = TRUE) to be matched. |
normalize | A logical; should the integer counts be divided by the total number of characters in the string? |
options | A list of options to gregexpr(). |
result | A single character value for the name of the new variable. It should be a valid column name. |
input | A single character value for the name of the variable being searched. This is NULL until computed by prep(). |
keep_original_cols | A logical to keep the original variables in the output. Defaults to TRUE. |
skip | A logical. Should the step be skipped when the recipe is baked by bake()? |
id | A character string that is unique to this step to identify it. |
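The effect of pattern and normalize can be sketched with base R alone. The helper count_pattern below is hypothetical and only mimics the idea of the step: count the regular expression matches in each string and, optionally, divide by the number of characters.

```r
# Hypothetical helper: count regex matches per string.
count_pattern <- function(x, pattern, normalize = FALSE) {
  # gregexpr() finds all matches; regmatches() extracts them per string
  hits <- lengths(regmatches(x, gregexpr(pattern, x)))
  if (normalize) hits / nchar(x) else hits
}

descr <- c("rocky and stony soil", "family of stony outcrops", "loam")

count_pattern(descr, "(rock|stony)")
count_pattern(descr, "famil", normalize = TRUE)
```

Strings without a match get a count of zero, and normalized counts are proportions of the string length rather than integers.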
An updated version of recipe with the new step added to the sequence of any existing operations.
When you tidy() this step, a tibble is returned with columns terms, result, and id:
terms | character, the selectors or variables selected |
result | character, the new column names |
id | character, id of this step |
The underlying operation does not allow for case weights.
Other dummy variable and encoding steps: step_bin2factor(), step_date(), step_dummy(), step_dummy_extract(), step_dummy_multi_choice(), step_factor2string(), step_holiday(), step_indicate_na(), step_integer(), step_novel(), step_num2factor(), step_ordinalscore(), step_other(), step_regex(), step_relevel(), step_string2factor(), step_time(), step_unknown(), step_unorder()
data(covers, package = "modeldata")

rec <- recipe(~description, covers) %>%
  step_count(description, pattern = "(rock|stony)", result = "rocks") %>%
  step_count(description, pattern = "famil", normalize = TRUE)

rec2 <- prep(rec, training = covers)
rec2

count_values <- bake(rec2, new_data = covers)
count_values

tidy(rec, number = 1)
tidy(rec2, number = 1)
step_cut() creates a specification of a recipe step that cuts a numeric variable into a factor based on provided boundary values.
step_cut(
  recipe,
  ...,
  role = NA,
  trained = FALSE,
  breaks,
  include_outside_range = FALSE,
  skip = FALSE,
  id = rand_id("cut")
)
recipe | A recipe object. The step will be added to the sequence of operations for this recipe. |
... | One or more selector functions to choose variables for this step. See selections for more details. |
role | Not used by this step since no new variables are created. |
trained | A logical to indicate if the quantities for preprocessing have been estimated. |
breaks | A numeric vector with at least one cut point. |
include_outside_range | Logical, indicating if values outside the range in the train set should be included in the lowest or highest bucket. Defaults to FALSE. |
skip | A logical. Should the step be skipped when the recipe is baked by bake()? |
id | A character string that is unique to this step to identify it. |
Unlike the base::cut() function, there is no need to specify the min and the max values in the breaks. All values before the lowest break point will end up in the first bucket, and all values after the last break point will end up in the last.
step_cut() will call base::cut() in the baking step with include.lowest set to TRUE.
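The behavior described above can be approximated directly with base::cut(). The sketch below adds the observed minimum and maximum to a single user-supplied break and uses include.lowest = TRUE so the smallest value is not dropped. It illustrates the idea only; the labels that step_cut() itself produces may differ.

```r
x <- 1:10
breaks <- 5

# Extend the user-supplied breaks with the observed minimum and maximum
full_breaks <- sort(unique(c(min(x), breaks, max(x))))

# include.lowest = TRUE keeps the minimum value inside the first bucket
binned <- cut(x, breaks = full_breaks, include.lowest = TRUE)

levels(binned)
table(binned)
```

Without include.lowest = TRUE, the value equal to the minimum would fall outside the first interval and become NA.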
An updated version of recipe with the new step added to the sequence of any existing operations.
When you tidy() this step, a tibble is returned with columns terms, value, and id:
terms | character, the selectors or variables selected |
value | numeric, the location of the cuts |
id | character, id of this step |
The underlying operation does not allow for case weights.
Other discretization steps: step_discretize()
df <- data.frame(x = 1:10, y = 5:14)
rec <- recipe(df)

# The min and max of the variable are used as boundaries
# if they exceed the breaks
rec %>%
  step_cut(x, breaks = 5) %>%
  prep() %>%
  bake(df)

# You can use the same breaks on multiple variables
# then for each variable the boundaries are set separately
rec %>%
  step_cut(x, y, breaks = c(6, 9)) %>%
  prep() %>%
  bake(df)

# You can keep the original variables using `step_mutate` or
# `step_mutate_at`, for transforming multiple variables at once
rec %>%
  step_mutate(x_orig = x) %>%
  step_cut(x, breaks = 5) %>%
  prep() %>%
  bake(df)

# It is up to you if you want values outside the
# range learned at prep to be included
new_df <- data.frame(x = 1:11, y = 5:15)
rec %>%
  step_cut(x, breaks = 5, include_outside_range = TRUE) %>%
  prep() %>%
  bake(new_df)

rec %>%
  step_cut(x, breaks = 5, include_outside_range = FALSE) %>%
  prep() %>%
  bake(new_df)
step_date() creates a specification of a recipe step that will convert date data into one or more factor or numeric variables.
step_date(
  recipe,
  ...,
  role = "predictor",
  trained = FALSE,
  features = c("dow", "month", "year"),
  abbr = TRUE,
  label = TRUE,
  ordinal = FALSE,
  locale = clock::clock_locale()$labels,
  columns = NULL,
  keep_original_cols = TRUE,
  skip = FALSE,
  id = rand_id("date")
)
recipe | A recipe object. The step will be added to the sequence of operations for this recipe. |
... | One or more selector functions to choose variables for this step. The selected variables should have class Date or POSIXct. See selections for more details. |
role | For model terms created by this step, what analysis role should they be assigned? By default, the new columns created by this step from the original variables will be used as predictors in a model. |
trained | A logical to indicate if the quantities for preprocessing have been estimated. |
features | A character vector of feature names to extract, for example "dow", "month", and "year" (the defaults). |
abbr | A logical. Only available for features month or dow. |
label | A logical. Only available for features month or dow. |
ordinal | A logical: should factors be ordered? Only available for features month or dow. |
locale | Locale to be used for month and dow. Defaults to clock::clock_locale()$labels. |
columns | A character string of the selected variable names. This field is a placeholder and will be populated once prep() is used. |
keep_original_cols | A logical to keep the original variables in the output. Defaults to TRUE. |
skip | A logical. Should the step be skipped when the recipe is baked by bake()? |
id | A character string that is unique to this step to identify it. |
Unlike some other steps, step_date does not remove the original date variables by default. Set keep_original_cols to FALSE to remove them.
See step_time() if you want to calculate features that are smaller than days.
An updated version of recipe with the new step added to the sequence of any existing operations.
When you tidy() this step, a tibble is returned with columns terms, value, ordinal, and id:
terms | character, the selectors or variables selected |
value | character, the feature names |
ordinal | logical, are factors ordered |
id | character, id of this step |
The underlying operation does not allow for case weights.
Other dummy variable and encoding steps: step_bin2factor(), step_count(), step_dummy(), step_dummy_extract(), step_dummy_multi_choice(), step_factor2string(), step_holiday(), step_indicate_na(), step_integer(), step_novel(), step_num2factor(), step_ordinalscore(), step_other(), step_regex(), step_relevel(), step_string2factor(), step_time(), step_unknown(), step_unorder()
library(lubridate)

examples <- data.frame(
  Dan = ymd("2002-03-04") + days(1:10),
  Stefan = ymd("2006-01-13") + days(1:10)
)
date_rec <- recipe(~ Dan + Stefan, examples) %>%
  step_date(all_predictors())

tidy(date_rec, number = 1)

date_rec <- prep(date_rec, training = examples)

date_values <- bake(date_rec, new_data = examples)
date_values

tidy(date_rec, number = 1)
step_depth() creates a specification of a recipe step that will convert numeric data into a measurement of data depth. This is done for each value of a categorical class variable.
step_depth(
  recipe,
  ...,
  class,
  role = "predictor",
  trained = FALSE,
  metric = "halfspace",
  options = list(),
  data = NULL,
  prefix = "depth_",
  keep_original_cols = TRUE,
  skip = FALSE,
  id = rand_id("depth")
)
recipe | A recipe object. The step will be added to the sequence of operations for this recipe. |
... | One or more selector functions to choose variables for this step. See selections for more details. |
class | A single character string that specifies a single categorical variable to be used as the class. |
role | For model terms created by this step, what analysis role should they be assigned? By default, the new columns created by this step from the original variables will be used as predictors in a model. |
trained | A logical to indicate if the quantities for preprocessing have been estimated. |
metric | A character string specifying the depth metric. Possible values are "potential", "halfspace", "Mahalanobis", "simplicialVolume", "spatial", and "zonoid". |
options | A list of options to pass to the underlying depth functions. See the depth functions in the ddalpha package. |
data | The training data are stored here once prep() is executed. |
prefix | A character string for the prefix of the resulting new variables. See notes below. |
keep_original_cols | A logical to keep the original variables in the output. Defaults to TRUE. |
skip | A logical. Should the step be skipped when the recipe is baked by bake()? |
id | A character string that is unique to this step to identify it. |
Data depth metrics attempt to measure how close a data point is to the center of its distribution. There are a number of methods for calculating depth but a simple example is the inverse of the distance of a data point to the centroid of the distribution. Generally, small values indicate that a data point is not close to the centroid. step_depth can compute a class-specific depth for a new data point based on the proximity of the new value to the training set distribution.
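The "inverse of the distance" idea can be sketched in base R with one common inverse-distance depth, 1 / (1 + squared Mahalanobis distance). This is an illustration under that assumption; whether it matches the exact formula behind metric = "Mahalanobis" in ddalpha is not confirmed here.

```r
# Depth of each setosa flower relative to the setosa distribution
x <- as.matrix(iris[iris$Species == "setosa", 1:4])
center <- colMeans(x)
covmat <- cov(x)

# stats::mahalanobis() returns squared distances
d2 <- mahalanobis(x, center = center, cov = covmat)
depth <- 1 / (1 + d2)

# The centroid itself has the largest possible depth under this definition
center_depth <- 1 / (1 + mahalanobis(center, center = center, cov = covmat))
center_depth

# Points far from the centroid have depth values near zero
range(depth)
```

Deeper points (values closer to 1) are more central, which is why small values flag observations that sit far from the class centroid.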
This step requires the ddalpha package. If not installed, the step will stop with a note about installing the package.
Note that the entire training set is saved to compute future depth values. The saved data have been trained (i.e. prepared) and baked (i.e. processed) up to the point before the location that step_depth occupies in the recipe. Also, the data requirements for the different step methods may vary. For example, using metric = "Mahalanobis" requires that each class should have at least as many rows as variables listed in the terms argument.
The function will create a new column for every unique value of the class variable. The resulting variables will not replace the original values and by default have the prefix depth_. The naming format can be changed using the prefix argument.
An updated version of recipe with the new step added to the sequence of any existing operations.
When you tidy() this step, a tibble is returned with columns terms, class, and id:
terms | character, the selectors or variables selected |
class | character, name of class variable |
id | character, id of this step |
The underlying operation does not allow for case weights.
Other multivariate transformation steps: step_classdist(), step_classdist_shrunken(), step_geodist(), step_ica(), step_isomap(), step_kpca(), step_kpca_poly(), step_kpca_rbf(), step_mutate_at(), step_nnmf(), step_nnmf_sparse(), step_pca(), step_pls(), step_ratio(), step_spatialsign()
# halfspace depth is the default
rec <- recipe(Species ~ ., data = iris) %>%
  step_depth(all_numeric_predictors(), class = "Species")

# use zonoid metric instead
# also, define naming convention for new columns
rec <- recipe(Species ~ ., data = iris) %>%
  step_depth(all_numeric_predictors(),
    class = "Species",
    metric = "zonoid", prefix = "zonoid_"
  )

rec_dists <- prep(rec, training = iris)

dists_to_species <- bake(rec_dists, new_data = iris)
dists_to_species

tidy(rec, number = 1)
tidy(rec_dists, number = 1)
step_discretize() creates a specification of a recipe step that will convert numeric data into a factor with bins having approximately the same number of data points (based on a training set).
step_discretize(
  recipe,
  ...,
  role = NA,
  trained = FALSE,
  num_breaks = 4,
  min_unique = 10,
  objects = NULL,
  options = list(prefix = "bin"),
  skip = FALSE,
  id = rand_id("discretize")
)
recipe | A recipe object. The step will be added to the sequence of operations for this recipe. |
... | One or more selector functions to choose variables for this step. See selections for more details. |
role | Not used by this step since no new variables are created. |
trained | A logical to indicate if the quantities for preprocessing have been estimated. |
num_breaks | An integer defining how many cuts to make of the data. |
min_unique | An integer defining a sample size line of dignity for the binning. If (the number of unique values)/(cuts+1) is less than min_unique, no discretization takes place. |
objects | The discretize() objects are stored here once the recipe has been trained by prep(). |
options | A list of options to discretize(). |
skip | A logical. Should the step be skipped when the recipe is baked by bake()? |
id | A character string that is unique to this step to identify it. |
An updated version of recipe with the new step added to the sequence of any existing operations.
When you tidy() this step, a tibble is returned with columns terms, value, and id:
terms | character, the selectors or variables selected |
value | numeric, the breaks |
id | character, id of this step |
This step has 2 tuning parameters:
min_unique: Unique Value Threshold (type: integer, default: 10)
num_breaks: Number of Cut Points (type: integer, default: 4)
The underlying operation does not allow for case weights.
Other discretization steps: step_cut()
data(biomass, package = "modeldata")

biomass_tr <- biomass[biomass$dataset == "Training", ]
biomass_te <- biomass[biomass$dataset == "Testing", ]

rec <- recipe(
  HHV ~ carbon + hydrogen + oxygen + nitrogen + sulfur,
  data = biomass_tr
) %>%
  step_discretize(carbon, hydrogen)

rec <- prep(rec, biomass_tr)
binned_te <- bake(rec, biomass_te)
table(binned_te$carbon)

tidy(rec, 1)
step_dummy() creates a specification of a recipe step that will convert nominal data (e.g. factors) into one or more numeric binary model terms corresponding to the levels of the original data.
step_dummy(
  recipe,
  ...,
  role = "predictor",
  trained = FALSE,
  one_hot = FALSE,
  preserve = deprecated(),
  naming = dummy_names,
  levels = NULL,
  sparse = FALSE,
  keep_original_cols = FALSE,
  skip = FALSE,
  id = rand_id("dummy")
)
recipe | A recipe object. The step will be added to the sequence of operations for this recipe. |
... | One or more selector functions to choose variables for this step. See selections for more details. |
role | For model terms created by this step, what analysis role should they be assigned? By default, the new columns created by this step from the original variables will be used as predictors in a model. |
trained | A logical to indicate if the quantities for preprocessing have been estimated. |
one_hot | A logical. For C levels, should C dummy variables be created rather than C-1? |
preserve | This argument has been deprecated. Please use keep_original_cols instead. |
naming | A function that defines the naming convention for new dummy columns. See Details below. |
levels | A list that contains the information needed to create dummy variables for each variable contained in terms. This is NULL until the step is trained by prep(). |
sparse | A logical. Should the columns produced be sparse vectors. Sparsity is only supported for "contr.treatment" contrasts. Defaults to FALSE. |
keep_original_cols | A logical to keep the original variables in the output. Defaults to FALSE. |
skip | A logical. Should the step be skipped when the recipe is baked by bake()? |
id | A character string that is unique to this step to identify it. |
step_dummy()
will create a set of binary dummy variables from a factor
variable. For example, if an unordered factor column in the data set has
levels of "red", "green", "blue", the dummy variable bake will create two
additional columns of 0/1 data for two of those three values (and remove the
original column). For ordered factors, polynomial contrasts are used to
encode the numeric values.
By default, the excluded dummy variable (i.e. the reference cell) will
correspond to the first level of the unordered factor being converted.
step_relevel()
can be used to create a new reference level by setting the
ref_level
argument.
This recipe step allows for flexible naming of the resulting
variables. For an unordered factor named x
, with levels "a"
and "b"
, the default naming convention would be to create a
new variable called x_b
. The naming format can be changed using
the naming
argument; the function dummy_names()
is the
default.
To change the type of contrast being used, change the global contrast option
via options
.
When the factor being converted has a missing value, all of the corresponding
dummy variables are also missing. See step_unknown()
for a solution.
When data to be processed contains novel levels (i.e., not contained in the training set), a missing value is assigned to the results. See step_other() for an alternative.
If no columns are selected (perhaps due to an earlier step_zv()), bake() will return the data as-is (e.g. with no dummy variables).
Note that, by default, the new dummy variable column names obey the naming rules for columns. If there are levels such as "0", dummy_names() will put a leading "X" in front of the level (since it uses make.names()). This can be changed by passing in a different function to the naming argument for this step.
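The effect of make.names() on level names can be checked directly in base R. The vector `raw_levels` and the variable name "x" are hypothetical, and the paste() call only mimics the default <variable>_<level> pattern described above; it is not the package's own code.

```r
# Hypothetical factor levels, some of which are not syntactic R names
raw_levels <- c("0", "1", "not sure")

# make.names() prefixes leading digits with "X" and replaces spaces with "."
clean_levels <- make.names(raw_levels)

# Mimic the default naming pattern for a variable called "x"
new_columns <- paste("x", clean_levels, sep = "_")

clean_levels
new_columns
```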
Also, there are a number of contrast methods that return fractional values. The columns returned by this step are doubles (not integers) when sparse = FALSE. The columns returned when sparse = TRUE are integers.
The package vignette for dummy variables and interactions has more information.
An updated version of recipe with the new step added to the sequence of any existing operations.
When you tidy() this step, a tibble is returned with columns terms, columns, and id:
terms: character, the selectors or variables selected
columns: character, names of resulting columns
id: character, id of this step
The underlying operation does not allow for case weights.
Other dummy variable and encoding steps: step_bin2factor(), step_count(), step_date(), step_dummy_extract(), step_dummy_multi_choice(), step_factor2string(), step_holiday(), step_indicate_na(), step_integer(), step_novel(), step_num2factor(), step_ordinalscore(), step_other(), step_regex(), step_relevel(), step_string2factor(), step_time(), step_unknown(), step_unorder()
data(Sacramento, package = "modeldata")

# Original data: city has 37 levels
length(unique(Sacramento$city))
unique(Sacramento$city) %>% sort()

rec <- recipe(~ city + sqft + price, data = Sacramento)

# Default dummy coding: 36 dummy variables
dummies <- rec %>%
  step_dummy(city) %>%
  prep(training = Sacramento)

dummy_data <- bake(dummies, new_data = NULL)

dummy_data %>%
  select(starts_with("city")) %>%
  names() # the first level of `city` is the reference level

# Obtain the full set of 37 dummy variables using `one_hot` option
dummies_one_hot <- rec %>%
  step_dummy(city, one_hot = TRUE) %>%
  prep(training = Sacramento)

dummy_data_one_hot <- bake(dummies_one_hot, new_data = NULL)

dummy_data_one_hot %>%
  select(starts_with("city")) %>%
  names() # no reference level

tidy(dummies, number = 1)
tidy(dummies_one_hot, number = 1)
step_dummy_extract() creates a specification of a recipe step that will convert nominal data (e.g. characters or factors) into one or more integer model terms for the extracted levels.
step_dummy_extract( recipe, ..., role = "predictor", trained = FALSE, sep = NULL, pattern = NULL, threshold = 0, other = "other", naming = dummy_extract_names, levels = NULL, keep_original_cols = FALSE, skip = FALSE, id = rand_id("dummy_extract") )
recipe |
A recipe object. The step will be added to the sequence of operations for this recipe. |
... |
One or more selector functions to choose variables
for this step. See |
role |
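For model terms created by this step, what analysis role should they be assigned? By default, the new columns created by this step from the original variables will be used as predictors in a model. |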
trained |
A logical to indicate if the quantities for preprocessing have been estimated. |
sep |
Character string containing a regular expression to use
for splitting. |
pattern |
Character string containing a regular expression used
for extraction. |
threshold |
A numeric value between 0 and 1, or an integer greater or
equal to one. If less than one, then factor levels with a rate of
occurrence in the training set below |
other |
A single character value for the "other" category. |
naming |
A function that defines the naming convention for new dummy columns. See Details below. |
levels |
A list that contains the information needed to create dummy
variables for each variable contained in |
keep_original_cols |
A logical to keep the original variables in the
output. Defaults to |
skip |
A logical. Should the step be skipped when the
recipe is baked by |
id |
A character string that is unique to this step to identify it. |
step_dummy_extract() will create a set of integer dummy variables from a character variable. Individual strings are obtained either by splitting (sep) or by extraction (pattern), and are then counted to create count variables.
Note that threshold works in a very specific way for this step. While it is possible for one label to be present multiple times in the same row, it will only be counted once when calculating the occurrences and frequencies.
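The difference between raw counts and once-per-row counts can be sketched in base R. This is an illustration of the counting rule described above, not the package's internal code; the `colors` vector and the ", " separator are hypothetical.

```r
# Hypothetical character column where one label repeats within a row
colors <- c("red, blue", "red, blue, white", "blue, blue, blue")

# Split each row into its individual labels
tokens <- strsplit(colors, ", ")

# Raw counts: every appearance of a label is counted
raw_counts <- table(unlist(tokens))

# Once-per-row counts: duplicates within a row are dropped first
row_presence <- table(unlist(lapply(tokens, unique)))

# Rate of occurrence per row, the quantity a proportion threshold would use
row_rates <- row_presence / length(colors)

raw_counts
row_presence
row_rates
```

Here "blue" appears five times overall but is present in only three rows, so its occurrence count for thresholding purposes would be three.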
This recipe step allows for flexible naming of the resulting variables. For an unordered factor named x, with levels "a" and "b", the default naming convention would be to create a new variable called x_b. The naming format can be changed using the naming argument; the function dummy_extract_names() is the default.
An updated version of recipe with the new step added to the sequence of any existing operations.
When you tidy() this step, a tibble is returned with columns terms, columns, and id:
terms: character, the selectors or variables selected
columns: character, names of resulting columns
id: character, id of this step
The return value is ordered according to the frequency of column entries in the training data set.
This step performs an unsupervised operation that can utilize case weights. As a result, case weights are only used with frequency weights. For more information, see the documentation in case_weights and the examples on tidymodels.org.
Other dummy variable and encoding steps: step_bin2factor(), step_count(), step_date(), step_dummy(), step_dummy_multi_choice(), step_factor2string(), step_holiday(), step_indicate_na(), step_integer(), step_novel(), step_num2factor(), step_ordinalscore(), step_other(), step_regex(), step_relevel(), step_string2factor(), step_time(), step_unknown(), step_unorder()
data(tate_text, package = "modeldata")

dummies <- recipe(~ artist + medium, data = tate_text) %>%
  step_dummy_extract(artist, medium, sep = ", ") %>%
  prep()

dummy_data <- bake(dummies, new_data = NULL)

dummy_data %>%
  select(starts_with("medium")) %>%
  names() %>%
  head()

# More detailed splitting
dummies_specific <- recipe(~medium, data = tate_text) %>%
  step_dummy_extract(medium, sep = "(, )|( and )|( on )") %>%
  prep()

dummy_data_specific <- bake(dummies_specific, new_data = NULL)

dummy_data_specific %>%
  select(starts_with("medium")) %>%
  names() %>%
  head()

tidy(dummies, number = 1)
tidy(dummies_specific, number = 1)

# pattern argument can be useful to extract harder patterns
color_examples <- tibble(
  colors = c(
    "['red', 'blue']",
    "['red', 'blue', 'white']",
    "['blue', 'blue', 'blue']"
  )
)

dummies_color <- recipe(~colors, data = color_examples) %>%
  step_dummy_extract(colors, pattern = "(?<=')[^',]+(?=')") %>%
  prep()

dummies_data_color <- dummies_color %>%
  bake(new_data = NULL)

dummies_data_color
step_dummy_multi_choice() creates a specification of a recipe step that will convert multiple nominal data (e.g. characters or factors) into one or more numeric binary model terms for the levels of the original data.
step_dummy_multi_choice( recipe, ..., role = "predictor", trained = FALSE, threshold = 0, levels = NULL, input = NULL, other = "other", naming = dummy_names, prefix = NULL, keep_original_cols = FALSE, skip = FALSE, id = rand_id("dummy_multi_choice") )
recipe |
A recipe object. The step will be added to the sequence of operations for this recipe. |
... |
One or more selector functions to choose variables for this step.
See |
role |
For model terms created by this step, what analysis role should they be assigned? By default, the new columns created by this step from the original variables will be used as predictors in a model. |
trained |
A logical to indicate if the quantities for preprocessing have been estimated. |
threshold |
A numeric value between 0 and 1, or an integer greater or
equal to one. If less than one, then factor levels with a rate of
occurrence in the training set below |
levels |
A list that contains the information needed to create dummy
variables for each variable contained in |
input |
A character vector containing the names of the columns used.
This is |
other |
A single character value for the "other" category. |
naming |
A function that defines the naming convention for new dummy columns. See Details below. |
prefix |
A character string for the prefix of the resulting new variables. See notes below. |
keep_original_cols |
A logical to keep the original variables in the
output. Defaults to |
skip |
A logical. Should the step be skipped when the
recipe is baked by |
id |
A character string that is unique to this step to identify it. |
The overall proportion (or total counts) of the categories are computed. The "other" category is used in place of any categorical levels whose individual proportion (or frequency) in the training set is less than threshold.
This step produces a number of columns, based on the number of categories it finds. The naming of the columns is determined by the function passed to the naming argument. The default is to return <prefix>_<category name>. By default prefix is NULL, which means the name of the first selected column will be used instead.
This recipe step allows for flexible naming of the resulting variables. For an unordered factor named x, with levels "a" and "b", the default naming convention would be to create a new variable called x_b. The naming format can be changed using the naming argument; the function dummy_names() is the default.
An updated version of recipe with the new step added to the sequence of any existing operations.
This step has 1 tuning parameter:
threshold: Threshold (type: double, default: 0)
When you tidy() this step, a tibble is returned with columns terms, columns, and id:
terms: character, the selectors or variables selected
columns: character, names of resulting columns
id: character, id of this step
The underlying operation does not allow for case weights.
Other dummy variable and encoding steps: step_bin2factor(), step_count(), step_date(), step_dummy(), step_dummy_extract(), step_factor2string(), step_holiday(), step_indicate_na(), step_integer(), step_novel(), step_num2factor(), step_ordinalscore(), step_other(), step_regex(), step_relevel(), step_string2factor(), step_time(), step_unknown(), step_unorder()
library(tibble)
languages <- tribble(
  ~lang_1,    ~lang_2,   ~lang_3,
  "English",  "Italian", NA,
  "Spanish",  NA,        "French",
  "Armenian", "English", "French",
  NA,         NA,        NA
)

dummy_multi_choice_rec <- recipe(~., data = languages) %>%
  step_dummy_multi_choice(starts_with("lang")) %>%
  prep()

bake(dummy_multi_choice_rec, new_data = NULL)
tidy(dummy_multi_choice_rec, number = 1)

dummy_multi_choice_rec2 <- recipe(~., data = languages) %>%
  step_dummy_multi_choice(starts_with("lang"),
    prefix = "lang",
    threshold = 0.2
  ) %>%
  prep()

bake(dummy_multi_choice_rec2, new_data = NULL)
tidy(dummy_multi_choice_rec2, number = 1)
step_factor2string() creates a specification of a recipe step that will convert one or more factor vectors to strings.
step_factor2string( recipe, ..., role = NA, trained = FALSE, columns = FALSE, skip = FALSE, id = rand_id("factor2string") )
recipe |
A recipe object. The step will be added to the sequence of operations for this recipe. |
... |
One or more selector functions to choose variables
for this step. See |
role |
Not used by this step since no new variables are created. |
trained |
A logical to indicate if the quantities for preprocessing have been estimated. |
columns |
A character string of the selected variable names. This field
is a placeholder and will be populated once |
skip |
A logical. Should the step be skipped when the
recipe is baked by |
id |
A character string that is unique to this step to identify it. |
prep() has an option strings_as_factors that defaults to TRUE. If this step is used with the default option, the strings produced by this step will not be converted to factors.
Remember that categorical data that will be directly passed to a model should be encoded as factors. This step is helpful for ancillary columns (such as identifiers) that the model will not use in its computations.
An updated version of recipe with the new step added to the sequence of any existing operations.
When you tidy() this step, a tibble is returned with columns terms and id:
terms: character, the selectors or variables selected
id: character, id of this step
The underlying operation does not allow for case weights.
Other dummy variable and encoding steps: step_bin2factor(), step_count(), step_date(), step_dummy(), step_dummy_extract(), step_dummy_multi_choice(), step_holiday(), step_indicate_na(), step_integer(), step_novel(), step_num2factor(), step_ordinalscore(), step_other(), step_regex(), step_relevel(), step_string2factor(), step_time(), step_unknown(), step_unorder()
data(Sacramento, package = "modeldata")

rec <- recipe(~ city + zip, data = Sacramento)

make_string <- rec %>%
  step_factor2string(city)

make_string <- prep(make_string,
  training = Sacramento,
  strings_as_factors = FALSE
)

make_string

# note that `city` is a string in recipe output
bake(make_string, new_data = NULL) %>% head()

# ...but remains a factor in the original data
Sacramento %>% head()
step_filter() creates a specification of a recipe step that will remove rows using dplyr::filter().
step_filter( recipe, ..., role = NA, trained = FALSE, inputs = NULL, skip = TRUE, id = rand_id("filter") )
recipe |
A recipe object. The step will be added to the sequence of operations for this recipe. |
... |
Logical predicates defined in terms of the variables
in the data. Multiple conditions are combined with |
role |
Not used by this step since no new variables are created. |
trained |
A logical to indicate if the quantities for preprocessing have been estimated. |
inputs |
Quosure of values given by |
skip |
A logical. Should the step be skipped when the
recipe is baked by |
id |
A character string that is unique to this step to identify it. |
When an object in the user's global environment is referenced in a filtering expression, it is a good idea to use quasiquotation (e.g. !!) to embed the value of the object in the expression (to be portable between sessions). See the examples.
An updated version of recipe with the new step added to the sequence of any existing operations.
This step can entirely remove observations (rows of data), which can have unintended and/or problematic consequences when applying the step to new data later via bake(). Consider whether skip = TRUE or skip = FALSE is more appropriate in any given use case. In most instances that affect the rows of the data being predicted, this step probably should not be applied at all; instead, execute operations like this outside and before starting a preprocessing recipe().
When you tidy() this step, a tibble is returned with columns terms and id:
terms: character, the selectors or variables selected
id: character, id of this step
The expressions in terms are text representations and are not parsable.
The underlying operation does not allow for case weights.
Other row operation steps: step_arrange(), step_impute_roll(), step_lag(), step_naomit(), step_sample(), step_shuffle(), step_slice()
Other dplyr steps: step_arrange(), step_mutate(), step_mutate_at(), step_rename(), step_rename_at(), step_sample(), step_select(), step_slice()
rec <- recipe(~., data = iris) %>%
  step_filter(Sepal.Length > 4.5, Species == "setosa")

prepped <- prep(rec, training = iris %>% slice(1:75))

library(dplyr)

dplyr_train <-
  iris %>%
  as_tibble() %>%
  slice(1:75) %>%
  dplyr::filter(Sepal.Length > 4.5, Species == "setosa")

rec_train <- bake(prepped, new_data = NULL)
all.equal(dplyr_train, rec_train)

dplyr_test <-
  iris %>%
  as_tibble() %>%
  slice(76:150) %>%
  dplyr::filter(Sepal.Length > 4.5, Species != "setosa")
rec_test <- bake(prepped, iris %>% slice(76:150))
all.equal(dplyr_test, rec_test)

values <- c("versicolor", "virginica")

qq_rec <-
  recipe(~., data = iris) %>%
  # Embed the `values` object in the call using !!
  step_filter(Sepal.Length > 4.5, Species %in% !!values)

tidy(qq_rec, number = 1)
step_filter_missing() creates a specification of a recipe step that will potentially remove variables that have too many missing values.
step_filter_missing( recipe, ..., role = NA, trained = FALSE, threshold = 0.1, removals = NULL, skip = FALSE, id = rand_id("filter_missing") )
recipe |
A recipe object. The step will be added to the sequence of operations for this recipe. |
... |
One or more selector functions to choose variables
for this step. See |
role |
Not used by this step since no new variables are created. |
trained |
A logical to indicate if the quantities for preprocessing have been estimated. |
threshold |
A value for the threshold of missing values in a column. The step will remove the columns where the proportion of missing values exceeds the threshold. |
removals |
A character string that contains the names of
columns that should be removed. These values are not determined
until |
skip |
A logical. Should the step be skipped when the
recipe is baked by |
id |
A character string that is unique to this step to identify it. |
This step can potentially remove columns from the data set. This may cause issues for subsequent steps in your recipe if the missing columns are specifically referenced by name. To avoid this, see the advice in the Tips for saving recipes and filtering columns section of selections.
This step will remove variables if the proportion of missing values exceeds the threshold.
All variables with missing values will be removed for threshold = 0.
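The rule above can be sketched in a few lines of base R. This is an illustration under stated assumptions rather than the package's implementation; the data frame `dat` and the object names are hypothetical.

```r
# Hypothetical data with different amounts of missingness per column
dat <- data.frame(
  a = c(1, NA, 3, 4),    # 25% missing
  b = c(NA, NA, NA, 4),  # 75% missing
  c = 1:4                # nothing missing
)

# Proportion of missing values in each column
prop_missing <- colMeans(is.na(dat))

# Columns whose proportion of missing values exceeds a threshold of 0.5
removals_half <- names(prop_missing)[prop_missing > 0.5]

# With a threshold of 0, every column with any missing value is removed
removals_zero <- names(prop_missing)[prop_missing > 0]

removals_half
removals_zero
```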
An updated version of recipe with the new step added to the sequence of any existing operations.
When you tidy() this step, a tibble is returned with columns terms and id:
terms: character, the selectors or variables selected
id: character, id of this step
This step has 1 tuning parameter:
threshold: Threshold (type: double, default: 0.1)
This step performs an unsupervised operation that can utilize case weights. As a result, case weights are only used with frequency weights. For more information, see the documentation in case_weights and the examples on tidymodels.org.
Other variable filter steps: step_corr(), step_lincomb(), step_nzv(), step_rm(), step_select(), step_zv()
data(credit_data, package = "modeldata")

rec <- recipe(Status ~ ., data = credit_data) %>%
  step_filter_missing(all_predictors(), threshold = 0)

filter_obj <- prep(rec)

filtered_te <- bake(filter_obj, new_data = NULL)

tidy(rec, number = 1)
tidy(filter_obj, number = 1)
step_geodist() creates a specification of a recipe step that will calculate the distance from points on a map to a reference location.
step_geodist( recipe, lat = NULL, lon = NULL, role = "predictor", trained = FALSE, ref_lat = NULL, ref_lon = NULL, is_lat_lon = TRUE, log = FALSE, name = "geo_dist", columns = NULL, keep_original_cols = TRUE, skip = FALSE, id = rand_id("geodist") )
recipe |
A recipe object. The step will be added to the sequence of operations for this recipe. |
lon , lat
|
Selector functions to choose which variables are
used by the step. See |
role |
For model terms created by this step, what analysis role should they be assigned? By default, the new columns created by this step from the original variables will be used as predictors in a model. |
trained |
A logical to indicate if the quantities for preprocessing have been estimated. |
ref_lon , ref_lat
|
Single numeric values for the location of the reference point. |
is_lat_lon |
A logical: Are coordinates in latitude and longitude? If
|
log |
A logical: should the distance be transformed by the natural log function? |
name |
A single character value to use for the new predictor column. If a column exists with this name, an error is issued. |
columns |
A character string of the selected variable names. This field
is a placeholder and will be populated once |
keep_original_cols |
A logical to keep the original variables in the
output. Defaults to |
skip |
A logical. Should the step be skipped when the
recipe is baked by |
id |
A character string that is unique to this step to identify it. |
step_geodist uses the Pythagorean theorem to calculate Euclidean distances if is_lat_lon is FALSE. If is_lat_lon is TRUE, the Haversine formula is used to calculate the great-circle distance in meters.
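Both distance calculations can be written out in base R. The functions `haversine_m()` and `euclid()` below are hypothetical helpers for illustration, not the package's code, and the mean Earth radius of 6,371,000 meters is an assumption.

```r
# Great-circle distance in meters via the Haversine formula
# (radius is an assumed mean Earth radius in meters)
haversine_m <- function(lat, lon, ref_lat, ref_lon, radius = 6371000) {
  to_rad <- pi / 180
  d_lat <- (lat - ref_lat) * to_rad
  d_lon <- (lon - ref_lon) * to_rad
  a <- sin(d_lat / 2)^2 +
    cos(lat * to_rad) * cos(ref_lat * to_rad) * sin(d_lon / 2)^2
  # pmin() guards against rounding slightly above 1 before asin()
  2 * radius * asin(pmin(1, sqrt(a)))
}

# Euclidean distance for coordinates that are not latitude and longitude
euclid <- function(x, y, ref_x, ref_y) {
  sqrt((x - ref_x)^2 + (y - ref_y)^2)
}

# One degree of longitude along the equator
one_degree <- haversine_m(lat = 0, lon = 1, ref_lat = 0, ref_lon = 0)

# A 3-4-5 right triangle on a plane
planar <- euclid(3, 4, 0, 0)

one_degree
planar
```

One degree of longitude on the equator corresponds to the radius multiplied by pi / 180, which gives a quick sanity check on the Haversine helper.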
An updated version of recipe with the new step added to the sequence of any existing operations.
When you tidy() this step, a tibble is returned with columns latitude, longitude, ref_latitude, ref_longitude, is_lat_lon, name, and id:
latitude: character, name of latitude variable
longitude: character, name of longitude variable
ref_latitude: numeric, location of latitude reference point
ref_longitude: numeric, location of longitude reference point
is_lat_lon: logical, are the coordinates in latitude and longitude?
name: character, name of resulting variable
id: character, id of this step
The underlying operation does not allow for case weights.
https://en.wikipedia.org/wiki/Haversine_formula
Other multivariate transformation steps: step_classdist(), step_classdist_shrunken(), step_depth(), step_ica(), step_isomap(), step_kpca(), step_kpca_poly(), step_kpca_rbf(), step_mutate_at(), step_nnmf(), step_nnmf_sparse(), step_pca(), step_pls(), step_ratio(), step_spatialsign()
data(Smithsonian, package = "modeldata")

# How close are the museums to Union Station?
near_station <- recipe(~., data = Smithsonian) %>%
  update_role(name, new_role = "location") %>%
  step_geodist(
    lat = latitude, lon = longitude, log = FALSE,
    ref_lat = 38.8986312, ref_lon = -77.0062457,
    is_lat_lon = TRUE
  ) %>%
  prep(training = Smithsonian)

bake(near_station, new_data = NULL) %>%
  arrange(geo_dist)

tidy(near_station, number = 1)
step_harmonic() creates a specification of a recipe step that will add sin() and cos() terms for harmonic analysis.
step_harmonic( recipe, ..., role = "predictor", trained = FALSE, frequency = NA_real_, cycle_size = NA_real_, starting_val = NA_real_, keep_original_cols = FALSE, columns = NULL, skip = FALSE, id = rand_id("harmonic") )
recipe |
A recipe object. The step will be added to the sequence of operations for this recipe. |
... |
One or more selector functions to choose variables
for this step. See |
role |
For model terms created by this step, what analysis role should they be assigned? By default, the new columns created by this step from the original variables will be used as predictors in a model. |
trained |
A logical to indicate if the quantities for preprocessing have been estimated. |
frequency |
A numeric vector with at least one value. The value(s) must be greater than zero and finite. |
cycle_size |
A numeric vector with at least one value that indicates
the size of a single cycle. |
starting_val |
either |
keep_original_cols |
A logical to keep the original variables in the
output. Defaults to |
columns |
A character string of the selected variable names. This field
is a placeholder and will be populated once |
skip |
A logical. Should the step be skipped when the
recipe is baked by |
id |
A character string that is unique to this step to identify it. |
This step seeks to describe periodic components of observational data using a combination of sin and cos waves. To do this, each wave of a specified frequency is modeled using one sin and one cos term. The two terms for each frequency can then be used to estimate the amplitude and phase shift of a periodic signal in observational data. The equation relating cos waves of known frequency but unknown phase and amplitude to a sum of sin and cos terms is below:
$$A_j \cos(\sigma_j t_i - \Phi_j) = C_j \cos(\sigma_j t_i) + S_j \sin(\sigma_j t_i)$$
Solving the equation yields $C_j$ and $S_j$. The amplitude can then be obtained with:
$$A_j = \sqrt{C_j^2 + S_j^2}$$
And the phase can be obtained with:
$$\Phi_j = \arctan(S_j / C_j)$$
where:
$\sigma_j = 2 \pi (frequency_j / cycle\_size)$
$A_j$ is the amplitude of the $j^{th}$ frequency
$\Phi_j$ is the phase of the $j^{th}$ frequency
$C_j$ is the coefficient of the cos term for the $j^{th}$ frequency
$S_j$ is the coefficient of the sin term for the $j^{th}$ frequency
The periodic component is specified by frequency and cycle_size parameters. The cycle size relates the specified frequency to the input column(s) units. There are multiple ways to specify a wave of given frequency, for example, a POSIXct input column given a frequency of 24 and a cycle_size equal to 86400 is equivalent to a frequency of 1.0 with cycle_size equal to 3600.
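A short base R sketch can confirm the equivalence of the two specifications and show how the amplitude and phase are recovered from the sin and cos coefficients. It does not call step_harmonic(); the simulated signal, the object names, and the assumption that the angular frequency is 2 * pi * frequency / cycle_size are for illustration only.

```r
# Hypothetical input: time in seconds, two days sampled every 10 minutes
t_sec <- seq(0, 2 * 86400, by = 600)

# Angular frequency, assumed to be 2 * pi * frequency / cycle_size
sigma_day <- 2 * pi * 24 / 86400 # frequency = 24, cycle_size = 86400
sigma_hour <- 2 * pi * 1 / 3600  # frequency = 1, cycle_size = 3600

# Both specifications describe the same hourly wave
same_wave <- isTRUE(
  all.equal(sin(sigma_day * t_sec), sin(sigma_hour * t_sec))
)

# Simulate a wave with known amplitude and phase
amp_true <- 3
phase_true <- 0.7
y <- amp_true * cos(sigma_hour * t_sec - phase_true)

# Regress the signal on one cos and one sin term of the known frequency
cos_term <- cos(sigma_hour * t_sec)
sin_term <- sin(sigma_hour * t_sec)
fit <- lm(y ~ cos_term + sin_term)

# Recover amplitude and phase from the two coefficients
c_hat <- coef(fit)[["cos_term"]]
s_hat <- coef(fit)[["sin_term"]]
amp_hat <- sqrt(c_hat^2 + s_hat^2)
phase_hat <- atan2(s_hat, c_hat) # atan2() keeps the correct quadrant

same_wave
amp_hat
phase_hat
```

Using atan2() instead of a plain arctangent of the ratio is a design choice of this sketch: it returns the phase in the correct quadrant when the cos coefficient is negative or zero.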
An updated version of recipe with the new step added to the sequence of any existing operations.
This step has 1 tuning parameter:
frequency: Harmonic Frequency (type: double, default: NA)
When you tidy() this step, a tibble is returned with columns terms, starting_val, cycle_size, frequency, key, and id:
terms: character, the selectors or variables selected
starting_val: numeric, the starting value
cycle_size: numeric, the cycle size
frequency: numeric, the frequency
key: character, key describing the calculation
id: character, id of this step
The underlying operation does not allow for case weights.
Doran, H. E., & Quilkey, J. J. (1972). Harmonic analysis of seasonal data: some important properties. American Journal of Agricultural Economics, 54, volume 4, part 1, 646-651.
Foreman, M. G. G., & Henry, R. F. (1989). The harmonic analysis of tidal model time series. Advances in water resources, 12(3), 109-120.
Other individual transformation steps: step_BoxCox(), step_YeoJohnson(), step_bs(), step_hyperbolic(), step_inverse(), step_invlogit(), step_log(), step_logit(), step_mutate(), step_ns(), step_percentile(), step_poly(), step_relu(), step_sqrt()
library(ggplot2, quietly = TRUE)
library(dplyr)

data(sunspot.year)
sunspots <-
  tibble(
    year = 1700:1988,
    n_sunspot = sunspot.year,
    type = "measured"
  ) %>%
  slice(1:75)

# sunspots period is around 11 years, sample spacing is one year
dat <- recipe(n_sunspot ~ year, data = sunspots) %>%
  step_harmonic(year, frequency = 1 / 11, cycle_size = 1) %>%
  prep() %>%
  bake(new_data = NULL)

fit <- lm(n_sunspot ~ year_sin_1 + year_cos_1, data = dat)

preds <- tibble(
  year = sunspots$year,
  n_sunspot = fit$fitted.values,
  type = "predicted"
)

bind_rows(sunspots, preds) %>%
  ggplot(aes(x = year, y = n_sunspot, color = type)) +
  geom_line()

# ------------------------------------------------------------------------------
# POSIXct example

date_time <-
  as.POSIXct(
    paste0(rep(1959:1997, each = 12), "-", rep(1:12, length(1959:1997)), "-01"),
    tz = "UTC"
  )

carbon_dioxide <- tibble(
  date_time = date_time,
  co2 = as.numeric(co2),
  type = "measured"
)

# yearly co2 fluctuations
dat <-
  recipe(co2 ~ date_time,
    data = carbon_dioxide
  ) %>%
  step_mutate(date_time_num = as.numeric(date_time)) %>%
  step_ns(date_time_num, deg_free = 3) %>%
  step_harmonic(date_time, frequency = 1, cycle_size = 86400 * 365.24) %>%
  prep() %>%
  bake(new_data = NULL)

fit <- lm(co2 ~ date_time_num_ns_1 + date_time_num_ns_2 +
  date_time_num_ns_3 + date_time_sin_1 +
  date_time_cos_1, data = dat)

preds <- tibble(
  date_time = date_time,
  co2 = fit$fitted.values,
  type = "predicted"
)

bind_rows(carbon_dioxide, preds) %>%
  ggplot(aes(x = date_time, y = co2, color = type)) +
  geom_line()
step_holiday() creates a specification of a recipe step that will convert date data into one or more binary indicator variables for common holidays.
step_holiday( recipe, ..., role = "predictor", trained = FALSE, holidays = c("LaborDay", "NewYearsDay", "ChristmasDay"), columns = NULL, keep_original_cols = TRUE, skip = FALSE, id = rand_id("holiday") )
recipe |
A recipe object. The step will be added to the sequence of operations for this recipe. |
... |
One or more selector functions to choose variables
for this step. The selected variables should have class |
role |
For model terms created by this step, what analysis role should they be assigned? By default, the new columns created by this step from the original variables will be used as predictors in a model. |
trained |
A logical to indicate if the quantities for preprocessing have been estimated. |
holidays |
A character string that includes at least one
holiday supported by the |
columns |
A character string of the selected variable names. This field
is a placeholder and will be populated once |
keep_original_cols |
A logical to keep the original variables in the
output. Defaults to |
skip |
A logical. Should the step be skipped when the
recipe is baked by |
id |
A character string that is unique to this step to identify it. |
Unlike some other steps, step_holiday does not remove the original date variables by default. Set keep_original_cols to FALSE to remove them.
An updated version of recipe with the new step added to the sequence of any existing operations.
When you tidy() this step, a tibble is returned with columns terms, holiday, and id:
terms: character, the selectors or variables selected
holiday: character, name of holidays
id: character, id of this step
The underlying operation does not allow for case weights.
Other dummy variable and encoding steps: step_bin2factor(), step_count(), step_date(), step_dummy(), step_dummy_extract(), step_dummy_multi_choice(), step_factor2string(), step_indicate_na(), step_integer(), step_novel(), step_num2factor(), step_ordinalscore(), step_other(), step_regex(), step_relevel(), step_string2factor(), step_time(), step_unknown(), step_unorder()
library(lubridate)

examples <- data.frame(someday = ymd("2000-12-20") + days(0:40))
holiday_rec <- recipe(~someday, examples) %>%
  step_holiday(all_predictors())

holiday_rec <- prep(holiday_rec, training = examples)
holiday_values <- bake(holiday_rec, new_data = examples)
holiday_values
step_hyperbolic() creates a specification of a recipe step that will transform data using a hyperbolic function.
step_hyperbolic( recipe, ..., role = NA, trained = FALSE, func = c("sinh", "cosh", "tanh"), inverse = TRUE, columns = NULL, skip = FALSE, id = rand_id("hyperbolic") )
recipe |
A recipe object. The step will be added to the sequence of operations for this recipe. |
... |
One or more selector functions to choose variables
for this step. See |
role |
Not used by this step since no new variables are created. |
trained |
A logical to indicate if the quantities for preprocessing have been estimated. |
func |
A character value for the function. Valid values are "sinh", "cosh", or "tanh". |
inverse |
A logical: should the inverse function be used? |
columns |
A character string of the selected variable names. This field
is a placeholder and will be populated once |
skip |
A logical. Should the step be skipped when the
recipe is baked by |
id |
A character string that is unique to this step to identify it. |
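As a rough illustration of what the func and inverse arguments select, the base-R sketch below applies a hyperbolic function and its inverse directly to a numeric vector. It assumes that inverse = TRUE corresponds to the base-R inverse functions asinh(), acosh(), and atanh(); that mapping is an assumption for the example rather than something stated above.

```r
x <- c(-2, -0.5, 0, 0.5, 2)

# inverse = FALSE: the hyperbolic function itself
forward <- cosh(x)

# inverse = TRUE (assumed mapping): the corresponding inverse function
inverse_sinh <- asinh(x)
inverse_tanh <- atanh(c(-0.9, 0, 0.9)) # atanh is only defined on (-1, 1)

# The inverse undoes the forward transform
all.equal(sinh(inverse_sinh), x)
```

Note that the inverse functions have restricted domains (for example, acosh() needs values of at least 1), so the choice of func and inverse should match the range of the data.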
An updated version of recipe with the new step added to the sequence of any existing operations.
When you tidy() this step, a tibble is returned with columns terms, inverse, func, and id:
terms: character, the selectors or variables selected
inverse: logical, whether the inverse function is used
func: character, name of function: "sinh", "cosh", or "tanh"
id: character, id of this step
The underlying operation does not allow for case weights.
Other individual transformation steps: step_BoxCox(), step_YeoJohnson(), step_bs(), step_harmonic(), step_inverse(), step_invlogit(), step_log(), step_logit(), step_mutate(), step_ns(), step_percentile(), step_poly(), step_relu(), step_sqrt()
set.seed(313)
examples <- matrix(rnorm(40), ncol = 2)
examples <- as.data.frame(examples)

rec <- recipe(~ V1 + V2, data = examples)

cos_trans <- rec %>%
  step_hyperbolic(
    all_numeric_predictors(),
    func = "cosh", inverse = FALSE
  )

cos_obj <- prep(cos_trans, training = examples)

transformed_te <- bake(cos_obj, examples)
plot(examples$V1, transformed_te$V1)

tidy(cos_trans, number = 1)
tidy(cos_obj, number = 1)
step_ica() creates a specification of a recipe step that will convert numeric data into one or more independent components.
step_ica( recipe, ..., role = "predictor", trained = FALSE, num_comp = 5, options = list(method = "C"), seed = sample.int(10000, 5), res = NULL, columns = NULL, prefix = "IC", keep_original_cols = FALSE, skip = FALSE, id = rand_id("ica") )
recipe |
A recipe object. The step will be added to the sequence of operations for this recipe. |
... |
One or more selector functions to choose variables
for this step. See |
role |
For model terms created by this step, what analysis role should they be assigned? By default, the new columns created by this step from the original variables will be used as predictors in a model. |
trained |
A logical to indicate if the quantities for preprocessing have been estimated. |
num_comp |
The number of components to retain as new predictors.
If |
options |
A list of options to
|
seed |
A single integer to set the random number stream prior to running ICA. |
res |
The |
columns |
A character string of the selected variable names. This field
is a placeholder and will be populated once |
prefix |
A character string for the prefix of the resulting new variables. See notes below. |
keep_original_cols |
A logical to keep the original variables in the
output. Defaults to |
skip |
A logical. Should the step be skipped when the
recipe is baked by |
id |
A character string that is unique to this step to identify it. |
Independent component analysis (ICA) is a transformation of a group of variables that produces a new set of artificial features or components. ICA assumes that the variables are mixtures of a set of distinct, non-Gaussian signals and attempts to transform the data to isolate these signals. Like PCA, the components are statistically independent from one another. This means that they can be used to combat large inter-variable correlations in a data set. Also like PCA, it is advisable to center and scale the variables prior to running ICA.
This package produces components using the "FastICA" methodology (see reference below). This step requires the dimRed and fastICA packages. If not installed, the step will stop with a note about installing these packages.
The argument num_comp controls the number of components that will be retained (the original variables that are used to derive the components are removed from the data). The new components will have names that begin with prefix and a sequence of numbers. The variable names are padded with zeros. For example, if num_comp < 10, their names will be IC1 - IC9. If num_comp = 101, the names would be IC001 - IC101.
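The zero-padding rule can be sketched in a few lines of base R. The helper padded_names() below is hypothetical and written only for this illustration; it pads every index to the width of the largest one, which is one way to get the naming behavior described above.

```r
# Hypothetical helper: build component names padded to a common width.
padded_names <- function(num_comp, prefix = "IC") {
  width <- nchar(as.character(num_comp))
  paste0(prefix, formatC(seq_len(num_comp), width = width, flag = "0"))
}

padded_names(9)                    # single-digit indices need no padding
padded_names(101)[c(1, 10, 101)]   # three-digit width pads the smaller indices
```

Padding keeps the column names in the intended order when they are sorted alphabetically.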
An updated version of recipe with the new step added to the sequence of any existing operations.
When you tidy() this step, a tibble is returned with columns terms, component, value, and id:
terms: character, the selectors or variables selected
component: character, name of component
value: numeric, the loading
id: character, id of this step
This step has 1 tuning parameter:
num_comp: # Components (type: integer, default: 5)
The underlying operation does not allow for case weights.
Hyvarinen, A., and Oja, E. (2000). Independent component analysis: algorithms and applications. Neural Networks, 13(4-5), 411-430.
Other multivariate transformation steps: step_classdist(), step_classdist_shrunken(), step_depth(), step_geodist(), step_isomap(), step_kpca(), step_kpca_poly(), step_kpca_rbf(), step_mutate_at(), step_nnmf(), step_nnmf_sparse(), step_pca(), step_pls(), step_ratio(), step_spatialsign()
# from fastICA::fastICA
set.seed(131)
S <- matrix(runif(400), 200, 2)
A <- matrix(c(1, 1, -1, 3), 2, 2, byrow = TRUE)
X <- as.data.frame(S %*% A)

tr <- X[1:100, ]
te <- X[101:200, ]

rec <- recipe(~., data = tr)

ica_trans <- step_center(rec, V1, V2)
ica_trans <- step_scale(ica_trans, V1, V2)
ica_trans <- step_ica(ica_trans, V1, V2, num_comp = 2)

ica_estimates <- prep(ica_trans, training = tr)
ica_data <- bake(ica_estimates, te)

plot(te$V1, te$V2)
plot(ica_data$IC1, ica_data$IC2)

tidy(ica_trans, number = 3)
tidy(ica_estimates, number = 3)
step_impute_bag() creates a specification of a recipe step that will create bagged tree models to impute missing data.
step_impute_bag( recipe, ..., role = NA, trained = FALSE, impute_with = imp_vars(all_predictors()), trees = 25, models = NULL, options = list(keepX = FALSE), seed_val = sample.int(10^4, 1), skip = FALSE, id = rand_id("impute_bag") ) imp_vars(...)
recipe |
A recipe object. The step will be added to the sequence of operations for this recipe. |
... |
One or more selector functions to choose variables to be imputed.
When used with |
role |
Not used by this step since no new variables are created. |
trained |
A logical to indicate if the quantities for preprocessing have been estimated. |
impute_with |
A call to |
trees |
An integer for the number of bagged trees to use in each model. |
models |
The |
options |
A list of options to |
seed_val |
An integer used to create reproducible models. The same seed is used across all imputation models. |
skip |
A logical. Should the step be skipped when the
recipe is baked by |
id |
A character string that is unique to this step to identify it. |
For each variable requiring imputation, a bagged tree is created where the outcome is the variable of interest and the predictors are any other variables listed in the impute_with formula. One advantage to the bagged tree is that it can accept predictors that have missing values themselves. This imputation method can be used when the variable of interest (and predictors) are numeric or categorical. Imputed categorical variables will remain categorical. Also, integers will be imputed to integer too.
Note that if a variable that is to be imputed is also in impute_with, this variable will be ignored.
It is possible that missing values will still occur after imputation if a large majority (or all) of the imputing variables are also missing.
As of recipes 0.1.16, this function name changed from step_bagimpute() to step_impute_bag().
An updated version of recipe with the new step added to the sequence of any existing operations.
When you tidy() this step, a tibble is returned with columns terms, model, and id:
terms: character, the selectors or variables selected
model: list, the bagged tree object
id: character, id of this step
This step has 1 tuning parameter:
trees: # Trees (type: integer, default: 25)
The underlying operation does not allow for case weights.
Kuhn, M. and Johnson, K. (2013). Applied Predictive Modeling. Springer Verlag.
Other imputation steps: step_impute_knn(), step_impute_linear(), step_impute_lower(), step_impute_mean(), step_impute_median(), step_impute_mode(), step_impute_roll()
data("credit_data", package = "modeldata")

## missing data per column
vapply(credit_data, function(x) mean(is.na(x)), c(num = 0))

set.seed(342)
in_training <- sample(1:nrow(credit_data), 2000)

credit_tr <- credit_data[in_training, ]
credit_te <- credit_data[-in_training, ]
missing_examples <- c(14, 394, 565)

rec <- recipe(Price ~ ., data = credit_tr)

## Not run:
impute_rec <- rec %>%
  step_impute_bag(Status, Home, Marital, Job, Income, Assets, Debt)

imp_models <- prep(impute_rec, training = credit_tr)

imputed_te <- bake(imp_models, new_data = credit_te)

credit_te[missing_examples, ]
imputed_te[missing_examples, names(credit_te)]

tidy(impute_rec, number = 1)
tidy(imp_models, number = 1)

## Specifying which variables to impute with
impute_rec <- rec %>%
  step_impute_bag(Status, Home, Marital, Job, Income, Assets, Debt,
    impute_with = imp_vars(Time, Age, Expenses),
    # for quick execution, nbagg lowered
    options = list(nbagg = 5, keepX = FALSE)
  )

imp_models <- prep(impute_rec, training = credit_tr)

imputed_te <- bake(imp_models, new_data = credit_te)

credit_te[missing_examples, ]
imputed_te[missing_examples, names(credit_te)]

tidy(impute_rec, number = 1)
tidy(imp_models, number = 1)

## End(Not run)
step_impute_knn() creates a specification of a recipe step that will impute missing data using nearest neighbors.
step_impute_knn( recipe, ..., role = NA, trained = FALSE, neighbors = 5, impute_with = imp_vars(all_predictors()), options = list(nthread = 1, eps = 1e-08), ref_data = NULL, columns = NULL, skip = FALSE, id = rand_id("impute_knn") )
recipe |
A recipe object. The step will be added to the sequence of operations for this recipe. |
... |
One or more selector functions to choose variables to be imputed.
When used with |
role |
Not used by this step since no new variables are created. |
trained |
A logical to indicate if the quantities for preprocessing have been estimated. |
neighbors |
The number of neighbors. |
impute_with |
A call to |
options |
A named list of options to pass to |
ref_data |
A tibble of data that will reflect the data preprocessing
done up to the point of this imputation step. This is |
columns |
A character string of the selected variable names. This field
is a placeholder and will be populated once |
skip |
A logical. Should the step be skipped when the
recipe is baked by |
id |
A character string that is unique to this step to identify it. |
The step uses the training set to impute any other data sets. The only distance function available is Gower's distance which can be used for mixtures of nominal and numeric data.
Once the nearest neighbors are determined, the mode is used to predict nominal variables and the mean is used for numeric data. Note that, if the underlying data are integer, the mean will be converted to an integer too.
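The procedure can be sketched in base R on a toy data set. This is a simplified illustration under stated assumptions, not the code the step runs: gower_dist() is a hypothetical helper, the data are invented, and the use of round() for the integer conversion is a choice made for this example.

```r
# Hypothetical helper: Gower distance between one row and each reference row.
gower_dist <- function(row, ref) {
  per_column <- vapply(names(ref), function(nm) {
    if (is.numeric(ref[[nm]])) {
      # numeric: absolute difference scaled by the reference range
      abs(ref[[nm]] - row[[nm]]) / diff(range(ref[[nm]]))
    } else {
      # nominal: 0 if the categories match, 1 otherwise
      as.numeric(as.character(ref[[nm]]) != as.character(row[[nm]]))
    }
  }, numeric(nrow(ref)))
  rowMeans(per_column)
}

# Invented training data: two predictors and two variables to impute
train <- data.frame(
  size = c(1.0, 1.2, 3.0, 3.1, 5.0),
  color = c("red", "red", "blue", "blue", "red"),
  weight = c(10L, 12L, 30L, 31L, 50L),
  grade = c("a", "a", "b", "b", "c")
)
new_row <- data.frame(size = 1.1, color = "red")

predictors <- c("size", "color")
d <- gower_dist(new_row[predictors], train[predictors])
nn <- order(d)[1:2] # indices of the 2 nearest training rows

# numeric (integer) variable: mean of the neighbors, converted back to integer
imputed_weight <- as.integer(round(mean(train$weight[nn])))

# nominal variable: most frequent category among the neighbors
imputed_grade <- names(which.max(table(train$grade[nn])))
```

Because the distance mixes scaled numeric differences with 0/1 category mismatches, numeric and nominal predictors contribute on a comparable scale.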
Note that if a variable that is to be imputed is also in impute_with, this variable will be ignored.
It is possible that missing values will still occur after imputation if a large majority (or all) of the imputing variables are also missing.
As of recipes 0.1.16, this function name changed from step_knnimpute() to step_impute_knn().
An updated version of recipe with the new step added to the sequence of any existing operations.
When you tidy() this step, a tibble is returned with columns terms, predictors, neighbors, and id:
terms: character, the selectors or variables selected
predictors: character, selected predictors used to impute
neighbors: integer, number of neighbors
id: character, id of this step
This step has 1 tuning parameter:
neighbors: # Nearest Neighbors (type: integer, default: 5)
The underlying operation does not allow for case weights.
Gower, C. (1971) "A general coefficient of similarity and some of its properties," Biometrics, 857-871.
Other imputation steps: step_impute_bag(), step_impute_linear(), step_impute_lower(), step_impute_mean(), step_impute_median(), step_impute_mode(), step_impute_roll()
library(recipes)
data(biomass, package = "modeldata")

biomass_tr <- biomass[biomass$dataset == "Training", ]
biomass_te <- biomass[biomass$dataset == "Testing", ]
biomass_te_whole <- biomass_te

# induce some missing data at random
set.seed(9039)
carb_missing <- sample(1:nrow(biomass_te), 3)
nitro_missing <- sample(1:nrow(biomass_te), 3)

biomass_te$carbon[carb_missing] <- NA
biomass_te$nitrogen[nitro_missing] <- NA

rec <- recipe(
  HHV ~ carbon + hydrogen + oxygen + nitrogen + sulfur,
  data = biomass_tr
)

ratio_recipe <- rec %>%
  step_impute_knn(all_predictors(), neighbors = 3)
ratio_recipe2 <- prep(ratio_recipe, training = biomass_tr)
imputed <- bake(ratio_recipe2, biomass_te)

# how well did it work?
summary(biomass_te_whole$carbon)
cbind(
  before = biomass_te_whole$carbon[carb_missing],
  after = imputed$carbon[carb_missing]
)

summary(biomass_te_whole$nitrogen)
cbind(
  before = biomass_te_whole$nitrogen[nitro_missing],
  after = imputed$nitrogen[nitro_missing]
)

tidy(ratio_recipe, number = 1)
tidy(ratio_recipe2, number = 1)
step_impute_linear() creates a specification of a recipe step that will create linear regression models to impute missing data.
step_impute_linear( recipe, ..., role = NA, trained = FALSE, impute_with = imp_vars(all_predictors()), models = NULL, skip = FALSE, id = rand_id("impute_linear") )
recipe |
A recipe object. The step will be added to the sequence of operations for this recipe. |
... |
One or more selector functions to choose variables to be imputed;
these variables must be of type |
role |
Not used by this step since no new variables are created. |
trained |
A logical to indicate if the quantities for preprocessing have been estimated. |
impute_with |
A call to |
models |
The |
skip |
A logical. Should the step be skipped when the
recipe is baked by |
id |
A character string that is unique to this step to identify it. |
For each variable requiring imputation, a linear model is fit where the outcome is the variable of interest and the predictors are any other variables listed in the impute_with formula. Note that if a variable that is to be imputed is also in impute_with, this variable will be ignored.
The variable(s) to be imputed must be of type numeric. The imputed values will keep the same type as their original data (i.e., model predictions are coerced to integer as needed).
Since this is a linear regression, the imputation model only uses complete cases for the training set predictors.
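The same idea can be sketched with lm() directly. The data below are invented for the illustration, and using round() before the integer coercion is an assumption of this sketch rather than a statement about how the step converts predictions.

```r
# Invented data: integer outcome y with one missing value, complete predictor x
train <- data.frame(
  x = c(1, 2, 3, 4, 5, 6),
  y = c(2L, 4L, NA, 8L, 10L, 12L)
)

# lm() drops incomplete rows by default, so only complete cases are used
fit <- lm(y ~ x, data = train)

# Predict the outcome for the rows where it is missing
missing <- is.na(train$y)
pred <- predict(fit, newdata = train[missing, , drop = FALSE])

# Coerce predictions back to the original (integer) type before filling in
train$y[missing] <- as.integer(round(pred))
train
```

Rows with missing predictor values would still yield NA predictions here, which is why a linear imputation model works best with predictors that are complete.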
An updated version of recipe with the new step added to the sequence of any existing operations.
When you tidy() this step, a tibble is returned with columns terms, model, and id:
terms: character, the selectors or variables selected
model: list, list of fitted lm() models
id: character, id of this step
This step performs an unsupervised operation that can utilize case weights. As a result, case weights are only used with frequency weights. For more information, see the documentation in case_weights and the examples on tidymodels.org.
Kuhn, M. and Johnson, K. (2013). Feature Engineering and Selection https://bookdown.org/max/FES/handling-missing-data.html
Other imputation steps: step_impute_bag(), step_impute_knn(), step_impute_lower(), step_impute_mean(), step_impute_median(), step_impute_mode(), step_impute_roll()
data(ames, package = "modeldata")
set.seed(393)
ames_missing <- ames
ames_missing$Longitude[sample(1:nrow(ames), 200)] <- NA

imputed_ames <-
  recipe(Sale_Price ~ ., data = ames_missing) %>%
  step_impute_linear(
    Longitude,
    impute_with = imp_vars(Latitude, Neighborhood, MS_Zoning, Alley)
  ) %>%
  prep(ames_missing)

imputed <-
  bake(imputed_ames, new_data = ames_missing) %>%
  dplyr::rename(imputed = Longitude) %>%
  bind_cols(ames %>% dplyr::select(original = Longitude)) %>%
  bind_cols(ames_missing %>% dplyr::select(Longitude)) %>%
  dplyr::filter(is.na(Longitude))

library(ggplot2)
ggplot(imputed, aes(x = original, y = imputed)) +
  geom_abline(col = "green") +
  geom_point(alpha = .3) +
  coord_equal() +
  labs(title = "Imputed Values")
step_impute_lower() creates a specification of a recipe step designed for cases where the non-negative numeric data cannot be measured below a known value. In these cases, one method for imputing the data is to substitute the truncated value by a random uniform number between zero and the truncation point.
step_impute_lower( recipe, ..., role = NA, trained = FALSE, threshold = NULL, skip = FALSE, id = rand_id("impute_lower") )
recipe |
A recipe object. The step will be added to the sequence of operations for this recipe. |
... |
One or more selector functions to choose variables
for this step. See |
role |
Not used by this step since no new variables are created. |
trained |
A logical to indicate if the quantities for preprocessing have been estimated. |
threshold |
A named numeric vector of lower bounds. This is
|
skip |
A logical. Should the step be skipped when the
recipe is baked by |
id |
A character string that is unique to this step to identify it. |
step_impute_lower estimates the variable minimums from the data used in the training argument of prep.recipe. bake.recipe then simulates a value for any data at the minimum with a random uniform value between zero and the minimum.
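A base-R sketch of the two phases may help. The vectors are invented, and treating values at or below the training minimum as truncated (the <= comparison) is an assumption made for this illustration.

```r
set.seed(1)

# "prep" phase: estimate the lower bound from the training data
train <- c(40, 40, 42.5, 47, 51)
threshold <- min(train)

# "bake" phase: replace values at the bound with uniform draws on [0, threshold]
new_x <- c(40, 45, 40, 60)
at_limit <- new_x <= threshold
new_x[at_limit] <- runif(sum(at_limit), min = 0, max = threshold)

new_x
```

Since the replacements are random, set a seed beforehand whenever the baked values need to be reproducible.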
As of recipes 0.1.16, this function name changed from step_lowerimpute() to step_impute_lower().
An updated version of recipe with the new step added to the sequence of any existing operations.
When you tidy() this step, a tibble is returned with columns terms, value, and id:
terms: character, the selectors or variables selected
value: numeric, the estimated value
id: character, id of this step
The underlying operation does not allow for case weights.
Other imputation steps: step_impute_bag(), step_impute_knn(), step_impute_linear(), step_impute_mean(), step_impute_median(), step_impute_mode(), step_impute_roll()
library(recipes)
data(biomass, package = "modeldata")

## Truncate some values to emulate what a lower limit of
## the measurement system might look like

biomass$carbon <- ifelse(biomass$carbon > 40, biomass$carbon, 40)
biomass$hydrogen <- ifelse(biomass$hydrogen > 5, biomass$hydrogen, 5)

biomass_tr <- biomass[biomass$dataset == "Training", ]
biomass_te <- biomass[biomass$dataset == "Testing", ]

rec <- recipe(
  HHV ~ carbon + hydrogen + oxygen + nitrogen + sulfur,
  data = biomass_tr
)

impute_rec <- rec %>%
  step_impute_lower(carbon, hydrogen)

tidy(impute_rec, number = 1)

impute_rec <- prep(impute_rec, training = biomass_tr)

tidy(impute_rec, number = 1)

transformed_te <- bake(impute_rec, biomass_te)

plot(transformed_te$carbon, biomass_te$carbon,
  ylab = "pre-imputation", xlab = "imputed"
)
step_impute_mean() creates a specification of a recipe step that will substitute missing values of numeric variables by the training set mean of those variables.
step_impute_mean( recipe, ..., role = NA, trained = FALSE, means = NULL, trim = 0, skip = FALSE, id = rand_id("impute_mean") )
recipe |
A recipe object. The step will be added to the sequence of operations for this recipe. |
... |
One or more selector functions to choose variables
for this step. See |
role |
Not used by this step since no new variables are created. |
trained |
A logical to indicate if the quantities for preprocessing have been estimated. |
means |
A named numeric vector of means. This is |
trim |
The fraction (0 to 0.5) of observations to be trimmed from each end of the variables before the mean is computed. Values of trim outside that range are taken as the nearest endpoint. |
skip |
A logical. Should the step be skipped when the
recipe is baked by |
id |
A character string that is unique to this step to identify it. |
step_impute_mean estimates the variable means from the data used in the training argument of prep.recipe. bake.recipe then applies the new values to new data sets using these averages.
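The effect of the trim argument is easiest to see on a small vector with an outlier. The base-R sketch below uses invented values and mean() directly; it illustrates the estimate-then-apply pattern rather than the step's own code.

```r
# Invented training values with one outlier and one missing value
train_income <- c(10, 20, 30, 40, 1000, NA)

# Estimate on the training set only
mu <- mean(train_income, na.rm = TRUE)                  # ordinary mean
mu_trim <- mean(train_income, trim = 0.2, na.rm = TRUE) # drops 1 value per end

# Apply the stored training estimate to new data
new_income <- c(15, NA, 25)
new_income[is.na(new_income)] <- mu_trim
new_income
```

With five non-missing values and trim = 0.2, one observation is removed from each end, so the outlier no longer influences the imputed value.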
As of recipes 0.1.16, this function name changed from step_meanimpute() to step_impute_mean().
An updated version of recipe with the new step added to the sequence of any existing operations.
When you tidy() this step, a tibble is returned with columns terms, value, and id:
terms: character, the selectors or variables selected
value: numeric, the mean value
id: character, id of this step
This step has 1 tuning parameter:
trim: Amount of Trimming (type: double, default: 0)
This step performs an unsupervised operation that can utilize case weights. As a result, case weights are only used with frequency weights. For more information, see the documentation in case_weights and the examples on tidymodels.org.
Other imputation steps: step_impute_bag(), step_impute_knn(), step_impute_linear(), step_impute_lower(), step_impute_median(), step_impute_mode(), step_impute_roll()
library(recipes)

data("credit_data", package = "modeldata")

## missing data per column
vapply(credit_data, function(x) mean(is.na(x)), c(num = 0))

set.seed(342)
in_training <- sample(1:nrow(credit_data), 2000)

credit_tr <- credit_data[in_training, ]
credit_te <- credit_data[-in_training, ]
missing_examples <- c(14, 394, 565)

rec <- recipe(Price ~ ., data = credit_tr)

impute_rec <- rec %>%
  step_impute_mean(Income, Assets, Debt)

imp_models <- prep(impute_rec, training = credit_tr)

imputed_te <- bake(imp_models, new_data = credit_te)

credit_te[missing_examples, ]
imputed_te[missing_examples, names(credit_te)]

tidy(impute_rec, number = 1)
tidy(imp_models, number = 1)
step_impute_median() creates a specification of a recipe step that will substitute missing values of numeric variables by the training set median of those variables.
step_impute_median(
  recipe,
  ...,
  role = NA,
  trained = FALSE,
  medians = NULL,
  skip = FALSE,
  id = rand_id("impute_median")
)
recipe |
A recipe object. The step will be added to the sequence of operations for this recipe. |
... |
One or more selector functions to choose variables
for this step. See |
role |
Not used by this step since no new variables are created. |
trained |
A logical to indicate if the quantities for preprocessing have been estimated. |
medians |
A named numeric vector of medians. This is |
skip |
A logical. Should the step be skipped when the
recipe is baked by |
id |
A character string that is unique to this step to identify it. |
step_impute_median estimates the variable medians from the data used in the training argument of prep.recipe. bake.recipe then applies the new values to new data sets using these medians.
As of recipes 0.1.16, this function name changed from step_medianimpute() to step_impute_median().
An updated version of recipe with the new step added to the sequence of any existing operations.
When you tidy() this step, a tibble is returned with columns terms, value, and id:
terms: character, the selectors or variables selected
value: numeric, the median value
id: character, id of this step
This step performs an unsupervised operation that can utilize case weights. As a result, case weights are used only when they are frequency weights. For more information, see the documentation in case_weights and the examples on tidymodels.org.
Other imputation steps: step_impute_bag(), step_impute_knn(), step_impute_linear(), step_impute_lower(), step_impute_mean(), step_impute_mode(), step_impute_roll()
library(recipes)

data("credit_data", package = "modeldata")

## missing data per column
vapply(credit_data, function(x) mean(is.na(x)), c(num = 0))

set.seed(342)
in_training <- sample(1:nrow(credit_data), 2000)

credit_tr <- credit_data[in_training, ]
credit_te <- credit_data[-in_training, ]
missing_examples <- c(14, 394, 565)

rec <- recipe(Price ~ ., data = credit_tr)

impute_rec <- rec %>%
  step_impute_median(Income, Assets, Debt)

imp_models <- prep(impute_rec, training = credit_tr)

imputed_te <- bake(imp_models, new_data = credit_te)

credit_te[missing_examples, ]
imputed_te[missing_examples, names(credit_te)]

tidy(impute_rec, number = 1)
tidy(imp_models, number = 1)
step_impute_mode() creates a specification of a recipe step that will substitute missing values of nominal variables by the training set mode of those variables.
step_impute_mode(
  recipe,
  ...,
  role = NA,
  trained = FALSE,
  modes = NULL,
  ptype = NULL,
  skip = FALSE,
  id = rand_id("impute_mode")
)
recipe |
A recipe object. The step will be added to the sequence of operations for this recipe. |
... |
One or more selector functions to choose variables
for this step. See |
role |
Not used by this step since no new variables are created. |
trained |
A logical to indicate if the quantities for preprocessing have been estimated. |
modes |
A named character vector of modes. This is
|
ptype |
A data frame prototype to cast new data sets to. This is commonly a 0-row slice of the training set. |
skip |
A logical. Should the step be skipped when the
recipe is baked by |
id |
A character string that is unique to this step to identify it. |
step_impute_mode estimates the variable modes from the data used in the training argument of prep.recipe. bake.recipe then applies the new values to new data sets using these values. If the training set data has more than one mode, one is selected at random.
As of recipes 0.1.16, this function name changed from step_modeimpute() to step_impute_mode().
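The tie-breaking behavior described above can be sketched in base R. This is an illustration under assumptions, not the package's implementation; sample_mode() and the example vectors are hypothetical names.

```r
# Return the most frequent non-missing value; break ties at random.
sample_mode <- function(x) {
  counts <- table(x, useNA = "no")
  candidates <- names(counts)[counts == max(counts)]
  # Index with sample.int() so a single candidate is returned unchanged.
  candidates[sample.int(length(candidates), 1)]
}

set.seed(1)
home <- c("rent", "owner", "rent", "owner", "parents", NA)
mode_value <- sample_mode(home)  # either "rent" or "owner"

# Apply the stored mode to new data.
new_home <- c(NA, "parents", NA)
new_home[is.na(new_home)] <- mode_value
new_home
```

Because ties are resolved randomly, setting a seed before estimating the step is one way to keep the imputed value reproducible.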
An updated version of recipe with the new step added to the sequence of any existing operations.
When you tidy() this step, a tibble is returned with columns terms, value, and id:
terms: character, the selectors or variables selected
value: character, the mode value
id: character, id of this step
This step performs an unsupervised operation that can utilize case weights. As a result, case weights are used only when they are frequency weights. For more information, see the documentation in case_weights and the examples on tidymodels.org.
Other imputation steps: step_impute_bag(), step_impute_knn(), step_impute_linear(), step_impute_lower(), step_impute_mean(), step_impute_median(), step_impute_roll()
library(recipes)

data("credit_data", package = "modeldata")

## missing data per column
vapply(credit_data, function(x) mean(is.na(x)), c(num = 0))

set.seed(342)
in_training <- sample(1:nrow(credit_data), 2000)

credit_tr <- credit_data[in_training, ]
credit_te <- credit_data[-in_training, ]
missing_examples <- c(14, 394, 565)

rec <- recipe(Price ~ ., data = credit_tr)

impute_rec <- rec %>%
  step_impute_mode(Status, Home, Marital)

imp_models <- prep(impute_rec, training = credit_tr)

imputed_te <- bake(imp_models, new_data = credit_te)

table(credit_te$Home, imputed_te$Home, useNA = "always")

tidy(impute_rec, number = 1)
tidy(imp_models, number = 1)
step_impute_roll() creates a specification of a recipe step that will substitute missing values of numeric variables by a measure of location (e.g. median) within a moving window.
step_impute_roll(
  recipe,
  ...,
  role = NA,
  trained = FALSE,
  columns = NULL,
  statistic = median,
  window = 5L,
  skip = FALSE,
  id = rand_id("impute_roll")
)
recipe |
A recipe object. The step will be added to the sequence of operations for this recipe. |
... |
One or more selector functions to choose variables to be imputed;
these columns must be non-integer numerics (i.e., double precision).
See |
role |
Not used by this step since no new variables are created. |
trained |
A logical to indicate if the quantities for preprocessing have been estimated. |
columns |
A character string of the selected variable names. This field
is a placeholder and will be populated once |
statistic |
A function with a single argument for the data to compute the imputed value. Only complete values will be passed to the function and it should return a double precision value. |
window |
The size of the window around a point to be imputed. Should be an odd integer greater than one. See Details below for a discussion of points at the ends of the series. |
skip |
A logical. Should the step be skipped when the
recipe is baked by |
id |
A character string that is unique to this step to identify it. |
On the tails, the window is shifted towards the ends. For example, for a 5-point window, the windows for the first four points are 1:5, 1:5, 1:5, and then 2:6.
When missing data are in the window, they are not passed to the function. If all of the data in the window are missing, a missing value is returned.
The statistics are calculated on the training set values before imputation. This means that if previous data within the window are missing, their imputed values are not included in the window data used for imputation. In other words, each imputation does not know anything about previous imputations in the series prior to the current point.
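The windowing rules above can be sketched in base R as follows. This is a minimal illustration, assuming an odd window no longer than the series; window_index() and roll_impute() are hypothetical helpers, not functions from the package.

```r
# Indices of the window used for point i in a series of length n.
# Assumes an odd window size that is no larger than n.
window_index <- function(i, n, window = 5L) {
  half <- (window - 1L) %/% 2L
  start <- i - half
  # Shift the window so that it stays inside 1..n.
  start <- max(1L, min(start, n - window + 1L))
  seq.int(start, length.out = window)
}

roll_impute <- function(x, window = 5L, statistic = median) {
  out <- x
  for (i in which(is.na(x))) {
    # Read from the original values, so earlier imputations are not reused.
    vals <- x[window_index(i, length(x), window)]
    vals <- vals[!is.na(vals)]
    # If every value in the window is missing, the point stays missing.
    if (length(vals) > 0) out[i] <- statistic(vals)
  }
  out
}

lapply(1:4, window_index, n = 12)  # 1:5, 1:5, 1:5, 2:6
roll_impute(c(1, NA, 3, 4, 5, NA, NA, 8))
```

Reading from x rather than out inside the loop is what keeps each imputation independent of the ones before it.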
As of recipes 0.1.16, this function name changed from step_rollimpute() to step_impute_roll().
An updated version of recipe with the new step added to the sequence of any existing operations.
When you tidy() this step, a tibble is returned with columns terms, window, and id:
terms: character, the selectors or variables selected
window: integer, window size
id: character, id of this step
This step has 2 tuning parameters:
statistic: Rolling Summary Statistic (type: character, default: median)
window: Window Size (type: integer, default: 5)
The underlying operation does not allow for case weights.
Other imputation steps: step_impute_bag(), step_impute_knn(), step_impute_linear(), step_impute_lower(), step_impute_mean(), step_impute_median(), step_impute_mode()
Other row operation steps: step_arrange(), step_filter(), step_lag(), step_naomit(), step_sample(), step_shuffle(), step_slice()
library(lubridate)

set.seed(145)
example_data <- data.frame(
  day = ymd("2012-06-07") + days(1:12),
  x1 = round(runif(12), 2),
  x2 = round(runif(12), 2),
  x3 = round(runif(12), 2)
)
example_data$x1[c(1, 5, 6)] <- NA
example_data$x2[c(1:4, 10)] <- NA

library(recipes)
seven_pt <- recipe(~., data = example_data) %>%
  update_role(day, new_role = "time_index") %>%
  step_impute_roll(all_numeric_predictors(), window = 7) %>%
  prep(training = example_data)

# The training set:
bake(seven_pt, new_data = NULL)
step_indicate_na() creates a specification of a recipe step that will create and append additional binary columns to the data set to indicate which observations are missing.
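In base R terms, the operation amounts to something like the sketch below. The data frame is made up, and the naming pattern prefix_column (for example na_ind_Income) is an assumption about the output rather than something stated on this page.

```r
# Hypothetical data with missing values in two columns.
credit <- data.frame(
  Income = c(100, NA, 80),
  Assets = c(NA, 5000, NA),
  Debt = c(0, 0, 200)
)

prefix <- "na_ind"
cols <- c("Income", "Assets")

# One 0/1 indicator per selected column, named <prefix>_<column> (assumed).
indicators <- lapply(credit[cols], function(x) as.integer(is.na(x)))
names(indicators) <- paste(prefix, cols, sep = "_")

# Append the indicators; the original columns are kept.
with_indicators <- cbind(credit, as.data.frame(indicators))
with_indicators
```

The original columns still contain their missing values afterwards, so an imputation step is typically still needed if the model cannot handle NA.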
step_indicate_na(
  recipe,
  ...,
  role = "predictor",
  trained = FALSE,
  columns = NULL,
  prefix = "na_ind",
  keep_original_cols = TRUE,
  skip = FALSE,
  id = rand_id("indicate_na")
)
recipe |
A recipe object. The step will be added to the sequence of operations for this recipe. |
... |
One or more selector functions to choose variables
for this step. See |
role |
For model terms created by this step, what analysis role should they be assigned? By default, the new columns created by this step from the original variables will be used as predictors in a model. |
trained |
A logical to indicate if the quantities for preprocessing have been estimated. |
columns |
A character string of the selected variable names. This field
is a placeholder and will be populated once |
prefix |
A character string that will be the prefix to the resulting new variables. Defaults to "na_ind". |
keep_original_cols |
A logical to keep the original variables in the
output. Defaults to |
skip |
A logical. Should the step be skipped when the
recipe is baked by |
id |
A character string that is unique to this step to identify it. |
An updated version of recipe with the new step added to the sequence of any existing operations.
When you tidy() this step, a tibble is returned with columns terms and id:
terms: character, the selectors or variables selected
id: character, id of this step
The underlying operation does not allow for case weights.
Other dummy variable and encoding steps: step_bin2factor(), step_count(), step_date(), step_dummy(), step_dummy_extract(), step_dummy_multi_choice(), step_factor2string(), step_holiday(), step_integer(), step_novel(), step_num2factor(), step_ordinalscore(), step_other(), step_regex(), step_relevel(), step_string2factor(), step_time(), step_unknown(), step_unorder()
library(recipes)

data("credit_data", package = "modeldata")

## missing data per column
purrr::map_dbl(credit_data, function(x) mean(is.na(x)))

set.seed(342)
in_training <- sample(1:nrow(credit_data), 2000)

credit_tr <- credit_data[in_training, ]
credit_te <- credit_data[-in_training, ]

rec <- recipe(Price ~ ., data = credit_tr)

impute_rec <- rec %>%
  step_indicate_na(Income, Assets, Debt)

imp_models <- prep(impute_rec, training = credit_tr)

imputed_te <- bake(imp_models, new_data = credit_te)
step_integer() creates a specification of a recipe step that will convert new data into a set of integers based on the original data values.
step_integer(
  recipe,
  ...,
  role = "predictor",
  trained = FALSE,
  strict = TRUE,
  zero_based = FALSE,
  key = NULL,
  skip = FALSE,
  id = rand_id("integer")
)
recipe |
A recipe object. The step will be added to the sequence of operations for this recipe. |
... |
One or more selector functions to choose variables
for this step. See |
role |
For model terms created by this step, what analysis role should they be assigned? By default, the new columns created by this step from the original variables will be used as predictors in a model. |
trained |
A logical to indicate if the quantities for preprocessing have been estimated. |
strict |
A logical for whether the values should be returned as integers (as opposed to double). |
zero_based |
A logical for whether the integers should start at zero and new values be appended as the largest integer. |
key |
A list that contains the information needed to
create integer variables for each variable contained in
|
skip |
A logical. Should the step be skipped when the
recipe is baked by |
id |
A character string that is unique to this step to identify it. |
step_integer will determine the unique values of each variable from the training set (excluding missing values), order them, and then assign integers to each value. When baked, each data point is translated to its corresponding integer or a value of zero for yet unseen data (although see the zero_based argument above). Missing values propagate.
Factor inputs are ordered by their levels. All others are ordered by sort.
Despite the name, the new values are returned as numeric unless strict = TRUE, which will coerce the results to integers.
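The mapping described above can be sketched in base R for a character variable. This is an illustration only: to_integer() is a hypothetical helper, the city values are made up, and the sketch covers the default case rather than the zero_based variant or factor inputs.

```r
# Build the key from the training values: unique, non-missing, sorted.
train_city <- c("SACRAMENTO", "ELK_GROVE", "SACRAMENTO", NA, "ROSEVILLE")
key <- sort(unique(train_city))  # sort() drops NA by default

# Translate new values: known -> position in key, unseen -> 0, NA -> NA.
to_integer <- function(x, key) {
  out <- match(x, key, nomatch = 0L)
  out[is.na(x)] <- NA_integer_
  out
}

to_integer(c("ROSEVILLE", "whoville", NA, "ELK_GROVE"), key)
```

The key is fixed at training time, so a value such as "whoville" that appears only in new data has no position in it and falls back to zero.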
An updated version of recipe with the new step added to the sequence of any existing operations.
When you tidy() this step, a tibble is returned with columns terms, value, and id:
terms: character, the selectors or variables selected
value: list, a list column with the conversion key
id: character, id of this step
The underlying operation does not allow for case weights.
Other dummy variable and encoding steps: step_bin2factor(), step_count(), step_date(), step_dummy(), step_dummy_extract(), step_dummy_multi_choice(), step_factor2string(), step_holiday(), step_indicate_na(), step_novel(), step_num2factor(), step_ordinalscore(), step_other(), step_regex(), step_relevel(), step_string2factor(), step_time(), step_unknown(), step_unorder()
library(recipes)

data(Sacramento, package = "modeldata")

sacr_tr <- Sacramento[1:100, ]
sacr_tr$sqft[1] <- NA

sacr_te <- Sacramento[101:105, ]
sacr_te$sqft[1] <- NA
sacr_te$city[1] <- "whoville"
sacr_te$city[2] <- NA

rec <- recipe(type ~ ., data = sacr_tr) %>%
  step_integer(all_predictors()) %>%
  prep(training = sacr_tr)

bake(rec, sacr_te, all_predictors())
tidy(rec, number = 1)
step_interact() creates a specification of a recipe step that will create new columns that are interaction terms between two or more variables.
step_interact(
  recipe,
  terms,
  role = "predictor",
  trained = FALSE,
  objects = NULL,
  sep = "_x_",
  keep_original_cols = TRUE,
  skip = FALSE,
  id = rand_id("interact")
)
recipe |
A recipe object. The step will be added to the sequence of operations for this recipe. |
terms |
A traditional R formula that contains interaction
terms. This can include |
role |
For model terms created by this step, what analysis role should they be assigned? By default, the new columns created by this step from the original variables will be used as predictors in a model. |
trained |
A logical to indicate if the quantities for preprocessing have been estimated. |
objects |
A list of |
sep |
A character value used to delineate variables in an
interaction (e.g. |
keep_original_cols |
A logical to keep the original variables in the
output. Defaults to |
skip |
A logical. Should the step be skipped when the
recipe is baked by |
id |
A character string that is unique to this step to identify it. |
step_interact can create interactions between variables. It is primarily intended for numeric data; categorical variables should probably be converted to dummy variables using step_dummy() prior to being used for interactions.
Unlike other step functions, the terms argument should be a traditional R model formula but should contain no inline functions (e.g. log). For example, for predictors A, B, and C, a formula such as ~A:B:C can be used to make a three-way interaction between the variables. If the formula contains terms other than interactions (e.g. (A+B+C)^3), only the interaction terms are retained for the design matrix.
The separator between the variables defaults to "_x_" so that the three-way interaction shown previously would generate a column named A_x_B_x_C. This can be changed using the sep argument.
When dummy variables are created and are used in interactions, selectors can help specify the interactions succinctly. For example, suppose a factor column X gets converted to dummy variables x_2, x_3, ..., x_6 using step_dummy(). If you wanted an interaction with numeric column z, you could create a set of specific interaction effects (e.g. x_2:z + x_3:z and so on) or you could use starts_with("x_"):z. When prep() evaluates this step, starts_with("x_") resolves to (x_2 + x_3 + x_4 + x_5 + x_6) so that the formula is now (x_2 + x_3 + x_4 + x_5 + x_6):z and all two-way interactions are created.
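To see what the expanded formula produces, the base R sketch below builds the design matrix with model.matrix() and keeps only the interaction columns. The data frame is made up, only two dummy columns are used for brevity, and the renaming with gsub() merely imitates the default separator; it is not how the package itself is implemented.

```r
# Hypothetical data: dummy columns x_2, x_3 and a numeric column z.
dat <- data.frame(
  x_2 = c(0, 1, 0, 0),
  x_3 = c(0, 0, 1, 0),
  z = c(1.5, 2.0, 2.5, 3.0)
)

# The expanded formula; keep only the interaction columns.
mm <- model.matrix(~ (x_2 + x_3):z, data = dat)
inter <- mm[, grepl(":", colnames(mm), fixed = TRUE), drop = FALSE]

# Replace ":" with the separator to imitate the default column names.
colnames(inter) <- gsub(":", "_x_", colnames(inter), fixed = TRUE)
inter
```

Each resulting column is the elementwise product of one dummy variable and z, which is why a row's interaction is zero whenever its dummy is zero.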
An updated version of recipe with the new step added to the sequence of any existing operations.
When you tidy() this step, a tibble is returned with columns terms and id:
terms: character, the selectors or variables selected
id: character, id of this step
The underlying operation does not allow for case weights.
library(recipes)

data(penguins, package = "modeldata")
penguins <- penguins %>% na.omit()

rec <- recipe(flipper_length_mm ~ ., data = penguins)

int_mod_1 <- rec %>%
  step_interact(terms = ~ bill_depth_mm:bill_length_mm)

# specify all dummy variables succinctly with `starts_with()`
int_mod_2 <- rec %>%
  step_dummy(sex, species, island) %>%
  step_interact(terms = ~ body_mass_g:starts_with("species"))

int_mod_1 <- prep(int_mod_1, training = penguins)
int_mod_2 <- prep(int_mod_2, training = penguins)

dat_1 <- bake(int_mod_1, penguins)
dat_2 <- bake(int_mod_2, penguins)

names(dat_1)
names(dat_2)

tidy(int_mod_1, number = 1)
tidy(int_mod_2, number = 2)
step_intercept() creates a specification of a recipe step that will add an intercept or constant term in the first column of a data matrix. step_intercept() defaults to predictor role so that it is by default only called in the bake step. Be careful to avoid unintentional transformations when calling steps with all_predictors().
step_intercept(
  recipe,
  ...,
  role = "predictor",
  trained = FALSE,
  name = "intercept",
  value = 1L,
  skip = FALSE,
  id = rand_id("intercept")
)
recipe |
A recipe object. The step will be added to the sequence of operations for this recipe. |
... |
Argument ignored; included for consistency with other step specification functions. |
role |
For model terms created by this step, what analysis role should they be assigned? By default, the new columns created by this step from the original variables will be used as predictors in a model. |
trained |
A logical to indicate if the quantities for preprocessing have been estimated. Again included only for consistency. |
name |
Character name for newly added column |
value |
A numeric constant to fill the intercept column. Defaults to
|
skip |
A logical. Should the step be skipped when the
recipe is baked by |
id |
A character string that is unique to this step to identify it. |
An updated version of recipe with the new step added to the sequence of any existing operations.
When you tidy() this step, a tibble is returned with columns terms and id:
terms: character, the selectors or variables selected
id: character, id of this step
The underlying operation does not allow for case weights.
library(recipes)

data(biomass, package = "modeldata")

biomass_tr <- biomass[biomass$dataset == "Training", ]
biomass_te <- biomass[biomass$dataset == "Testing", ]

rec <- recipe(
  HHV ~ carbon + hydrogen + oxygen + nitrogen + sulfur,
  data = biomass_tr
)

rec_trans <- recipe(HHV ~ ., data = biomass_tr[, -(1:2)]) %>%
  step_intercept(value = 2) %>%
  step_scale(carbon)

rec_obj <- prep(rec_trans, training = biomass_tr)

with_intercept <- bake(rec_obj, biomass_te)
with_intercept
step_inverse() creates a specification of a recipe step that will inverse transform the data.
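As a small base R sketch of the transformation, assuming the offset is added to the data before the reciprocal is taken (inverse_transform() is a hypothetical helper, not a function from the package):

```r
# Reciprocal transform with an optional offset (0 by default).
inverse_transform <- function(x, offset = 0) 1 / (x + offset)

x <- c(0.5, 2, 4)
inverse_transform(x)

# A positive offset avoids dividing by zero when the data contain zeros.
inverse_transform(c(0, 1), offset = 1)
```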
step_inverse(
  recipe,
  ...,
  role = NA,
  offset = 0,
  trained = FALSE,
  columns = NULL,
  skip = FALSE,
  id = rand_id("inverse")
)
recipe |
A recipe object. The step will be added to the sequence of operations for this recipe. |
... |
One or more selector functions to choose variables
for this step. See |
role |
Not used by this step since no new variables are created. |
offset |
An optional value to add to the data prior to
inverting (to avoid |
trained |
A logical to indicate if the quantities for preprocessing have been estimated. |
columns |
A character string of the selected variable names. This field
is a placeholder and will be populated once |
skip |
A logical. Should the step be skipped when the
recipe is baked by |
id |
A character string that is unique to this step to identify it. |
An updated version of recipe with the new step added to the sequence of any existing operations.
When you tidy() this step, a tibble is returned with columns terms and id:
terms: character, the selectors or variables selected
id: character, id of this step
The underlying operation does not allow for case weights.
Other individual transformation steps: step_BoxCox(), step_YeoJohnson(), step_bs(), step_harmonic(), step_hyperbolic(), step_invlogit(), step_log(), step_logit(), step_mutate(), step_ns(), step_percentile(), step_poly(), step_relu(), step_sqrt()
library(recipes)

set.seed(313)
examples <- matrix(runif(40), ncol = 2)
examples <- data.frame(examples)

rec <- recipe(~ X1 + X2, data = examples)

inverse_trans <- rec %>%
  step_inverse(all_numeric_predictors())

inverse_obj <- prep(inverse_trans, training = examples)

transformed_te <- bake(inverse_obj, examples)
plot(examples$X1, transformed_te$X1)

tidy(inverse_trans, number = 1)
tidy(inverse_obj, number = 1)
step_invlogit() creates a specification of a recipe step that will transform the data from real values to be between zero and one.
step_invlogit(
  recipe,
  ...,
  role = NA,
  trained = FALSE,
  columns = NULL,
  skip = FALSE,
  id = rand_id("invlogit")
)
recipe |
A recipe object. The step will be added to the sequence of operations for this recipe. |
... |
One or more selector functions to choose variables
for this step. See |
role |
Not used by this step since no new variables are created. |
trained |
A logical to indicate if the quantities for preprocessing have been estimated. |
columns |
A character string of the selected variable names. This field
is a placeholder and will be populated once |
skip |
A logical. Should the step be skipped when the
recipe is baked by |
id |
A character string that is unique to this step to identify it. |
The inverse logit transformation takes values on the real line and translates them to be between zero and one using the function f(x) = 1/(1+exp(-x)).
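The function is easy to check in base R. The sketch below defines it directly from the formula (inv_logit() is a name chosen for the example) and compares it with stats::plogis(), which computes the same logistic function.

```r
# Inverse logit as given in the text.
inv_logit <- function(x) 1 / (1 + exp(-x))

x <- c(-5, -1, 0, 1, 5)
inv_logit(x)

# Base R's plogis() computes the same function.
all.equal(inv_logit(x), plogis(x))
```

Zero maps to 0.5, and large negative or positive inputs approach zero and one respectively.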
An updated version of recipe with the new step added to the sequence of any existing operations.
When you tidy() this step, a tibble is returned with columns terms and id:
terms: character, the selectors or variables selected
id: character, id of this step
The underlying operation does not allow for case weights.
Other individual transformation steps: step_BoxCox(), step_YeoJohnson(), step_bs(), step_harmonic(), step_hyperbolic(), step_inverse(), step_log(), step_logit(), step_mutate(), step_ns(), step_percentile(), step_poly(), step_relu(), step_sqrt()
library(recipes)

data(biomass, package = "modeldata")

biomass_tr <- biomass[biomass$dataset == "Training", ]
biomass_te <- biomass[biomass$dataset == "Testing", ]

rec <- recipe(
  HHV ~ carbon + hydrogen + oxygen + nitrogen + sulfur,
  data = biomass_tr
)

ilogit_trans <- rec %>%
  step_center(carbon, hydrogen) %>%
  step_scale(carbon, hydrogen) %>%
  step_invlogit(carbon, hydrogen)

ilogit_obj <- prep(ilogit_trans, training = biomass_tr)

transformed_te <- bake(ilogit_obj, biomass_te)
plot(biomass_te$carbon, transformed_te$carbon)
step_isomap() creates a specification of a recipe step that uses multidimensional scaling to convert numeric data into one or more new dimensions.
step_isomap(
  recipe,
  ...,
  role = "predictor",
  trained = FALSE,
  num_terms = 5,
  neighbors = 50,
  options = list(.mute = c("message", "output")),
  res = NULL,
  columns = NULL,
  prefix = "Isomap",
  keep_original_cols = FALSE,
  skip = FALSE,
  id = rand_id("isomap")
)
recipe |
A recipe object. The step will be added to the sequence of operations for this recipe. |
... |
One or more selector functions to choose variables
for this step. See |
role |
For model terms created by this step, what analysis role should they be assigned? By default, the new columns created by this step from the original variables will be used as predictors in a model. |
trained |
A logical to indicate if the quantities for preprocessing have been estimated. |
num_terms |
The number of isomap dimensions to retain as new
predictors. If |
neighbors |
The number of neighbors. |
options |
A list of options to |
res |
The |
columns |
A character string of the selected variable names. This field
is a placeholder and will be populated once |
prefix |
A character string for the prefix of the resulting new variables. See notes below. |
keep_original_cols |
A logical to keep the original variables in the
output. Defaults to |
skip |
A logical. Should the step be skipped when the
recipe is baked by |
id |
A character string that is unique to this step to identify it. |
Isomap is a form of multidimensional scaling (MDS). MDS methods try to find a reduced set of dimensions such that the geometric distances between the original data points are preserved. This version of MDS uses nearest neighbors in the data as a method for increasing the fidelity of the new dimensions to the original data values.
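The distance-preservation idea behind MDS can be sketched with base R. The block below uses classical MDS via stats::cmdscale(), which is an assumption made for illustration: it is not Isomap and not what this step runs (Isomap would first replace the distances with shortest-path distances on a nearest-neighbor graph). The simulated data and all object names are hypothetical.

```r
# Classical (metric) MDS with base R. This is NOT Isomap; it only shows
# what "preserving distances in fewer dimensions" means.
set.seed(1)

# 20 points that truly live in 2 dimensions, linearly lifted into 5 columns
latent <- matrix(rnorm(40), ncol = 2)
lift <- matrix(rnorm(10), nrow = 2)
x <- latent %*% lift

# Distances in the original 5-column space
d_original <- dist(x)

# Two new dimensions chosen so that pairwise distances are reproduced
embedding <- cmdscale(d_original, k = 2)
d_embedded <- dist(embedding)

# Largest discrepancy between original and embedded distances
max_abs_error <- max(abs(as.vector(d_original) - as.vector(d_embedded)))
max_abs_error
```

Because the simulated data have only two underlying dimensions, the two new dimensions reproduce the distances up to rounding error. Real data rarely behave this cleanly, which is one motivation for neighbor-based variants such as Isomap.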
This step requires the dimRed, RSpectra, igraph, and RANN packages. If not installed, the step will stop with a note about installing these packages.
It is advisable to center and scale the variables prior to running Isomap (step_center() and step_scale() can be used for this purpose).
The argument num_terms controls the number of components that will be retained (the original variables that are used to derive the components are removed from the data). The new components will have names that begin with prefix and a sequence of numbers. The variable names are padded with zeros. For example, if num_terms < 10, their names will be Isomap1 - Isomap9. If num_terms = 101, the names would be Isomap001 - Isomap101.
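The zero-padding rule can be illustrated with a few lines of base R. The helper pad_names() below is hypothetical and written only for this illustration; it is not the function the package uses internally.

```r
# Hypothetical helper: build "<prefix><number>" names, padding the number
# with zeros so that every name has the same width.
pad_names <- function(n, prefix) {
  width <- nchar(as.character(n))
  paste0(prefix, formatC(seq_len(n), width = width, flag = "0"))
}

# Fewer than 10 terms: no padding is needed
pad_names(9, "Isomap")

# 101 terms: numbers are padded to three digits
pad_names(101, "Isomap")[c(1, 10, 101)]
```

Padding keeps the generated columns in the intended order when they are sorted alphabetically.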
An updated version of recipe
with the new step added to the
sequence of any existing operations.
When you tidy() this step, a tibble is returned with columns terms and id:
terms: character, the selectors or variables selected
id: character, id of this step
This step has 2 tuning parameters:
num_terms: # Model Terms (type: integer, default: 5)
neighbors: # Nearest Neighbors (type: integer, default: 50)
The underlying operation does not allow for case weights.
De Silva, V., and Tenenbaum, J. B. (2003). Global versus local methods in nonlinear dimensionality reduction. Advances in Neural Information Processing Systems. 721-728.
dimRed, a framework for dimensionality reduction, https://github.com/gdkrmr
Other multivariate transformation steps:
step_classdist()
,
step_classdist_shrunken()
,
step_depth()
,
step_geodist()
,
step_ica()
,
step_kpca()
,
step_kpca_poly()
,
step_kpca_rbf()
,
step_mutate_at()
,
step_nnmf()
,
step_nnmf_sparse()
,
step_pca()
,
step_pls()
,
step_ratio()
,
step_spatialsign()
data(biomass, package = "modeldata")

biomass_tr <- biomass[biomass$dataset == "Training", ]
biomass_te <- biomass[biomass$dataset == "Testing", ]

rec <- recipe(
  HHV ~ carbon + hydrogen + oxygen + nitrogen + sulfur,
  data = biomass_tr
)

im_trans <- rec %>%
  step_YeoJohnson(all_numeric_predictors()) %>%
  step_normalize(all_numeric_predictors()) %>%
  step_isomap(all_numeric_predictors(), neighbors = 100, num_terms = 2)

im_estimates <- prep(im_trans, training = biomass_tr)

im_te <- bake(im_estimates, biomass_te)

rng <- extendrange(c(im_te$Isomap1, im_te$Isomap2))
plot(im_te$Isomap1, im_te$Isomap2,
  xlim = rng, ylim = rng
)

tidy(im_trans, number = 3)
tidy(im_estimates, number = 3)
step_kpca() creates a specification of a recipe step that will convert numeric data into one or more principal components using a kernel basis expansion.
step_kpca( recipe, ..., role = "predictor", trained = FALSE, num_comp = 5, res = NULL, columns = NULL, options = list(kernel = "rbfdot", kpar = list(sigma = 0.2)), prefix = "kPC", keep_original_cols = FALSE, skip = FALSE, id = rand_id("kpca") )
recipe |
A recipe object. The step will be added to the sequence of operations for this recipe. |
... |
One or more selector functions to choose variables
for this step. See |
role |
For model terms created by this step, what analysis role should they be assigned? By default, the new columns created by this step from the original variables will be used as predictors in a model. |
trained |
A logical to indicate if the quantities for preprocessing have been estimated. |
num_comp |
The number of components to retain as new predictors.
If |
res |
An S4 |
columns |
A character string of the selected variable names. This field
is a placeholder and will be populated once |
options |
A list of options to |
prefix |
A character string for the prefix of the resulting new variables. See notes below. |
keep_original_cols |
A logical to keep the original variables in the
output. Defaults to |
skip |
A logical. Should the step be skipped when the
recipe is baked by |
id |
A character string that is unique to this step to identify it. |
When performing kPCA with step_kpca(), you must choose the kernel function (and any important kernel parameters). This step uses the kernlab package; the reference below discusses the types of kernels available and their parameter(s). These specifications can be made in the kernel and kpar slots of the options argument to step_kpca().
Consider using step_kpca_rbf() for a radial basis function kernel or step_kpca_poly() for a polynomial kernel.
Kernel principal component analysis (kPCA) is an extension of a PCA analysis that conducts the calculations in a broader dimensionality defined by a kernel function. For example, if a quadratic kernel function were used, each variable would be represented by its original values as well as its square. This nonlinear mapping is used during the PCA analysis and can potentially help find better representations of the original data.
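The sentence about a quadratic kernel can be checked numerically. The sketch below assumes a two-variable input and the quadratic kernel (x . y + 1)^2, and compares it with an explicit feature expansion that contains the original values, their squares, and a cross term. The function names quad_kernel() and quad_features() are hypothetical and are not part of recipes or kernlab.

```r
# Quadratic kernel evaluated directly on the two original variables
quad_kernel <- function(x, y) {
  (sum(x * y) + 1)^2
}

# Explicit feature map for the same kernel: a constant, the original
# values, their squares, and the cross-product (scaled by sqrt(2))
quad_features <- function(v) {
  c(1, sqrt(2) * v[1], sqrt(2) * v[2], v[1]^2, v[2]^2, sqrt(2) * v[1] * v[2])
}

x <- c(0.5, -1.2)
y <- c(2.0, 0.3)

kernel_value <- quad_kernel(x, y)
explicit_value <- sum(quad_features(x) * quad_features(y))

# The kernel equals an inner product in the expanded space
all.equal(kernel_value, explicit_value)
```

The kernel gives the same inner product as the six-column expansion without ever constructing those columns, which is why kernel methods can work in a much broader feature space at modest cost.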
This step requires the kernlab package. If not installed, the step will stop with a prompt about installing the package.
As with ordinary PCA, it is important to center and scale the variables prior to computing PCA components (step_normalize() can be used for this purpose).
The argument num_comp controls the number of components that will be retained (the original variables that are used to derive the components are removed from the data). The new components will have names that begin with prefix and a sequence of numbers. The variable names are padded with zeros. For example, if num_comp < 10, their names will be kPC1 - kPC9. If num_comp = 101, the names would be kPC001 - kPC101.
An updated version of recipe
with the new step added to the
sequence of any existing operations.
When you tidy() this step, a tibble is returned with columns terms and id:
terms: character, the selectors or variables selected
id: character, id of this step
The underlying operation does not allow for case weights.
Scholkopf, B., Smola, A., and Muller, K. (1997). Kernel principal component analysis. Lecture Notes in Computer Science, 1327, 583-588.
Karatzoglou, A., Smola, A., Hornik, K., and Zeileis, A. (2004). kernlab - An S4 package for kernel methods in R. Journal of Statistical Software, 11(9), 1-20.
Other multivariate transformation steps:
step_classdist()
,
step_classdist_shrunken()
,
step_depth()
,
step_geodist()
,
step_ica()
,
step_isomap()
,
step_kpca_poly()
,
step_kpca_rbf()
,
step_mutate_at()
,
step_nnmf()
,
step_nnmf_sparse()
,
step_pca()
,
step_pls()
,
step_ratio()
,
step_spatialsign()
library(ggplot2)
data(biomass, package = "modeldata")

biomass_tr <- biomass[biomass$dataset == "Training", ]
biomass_te <- biomass[biomass$dataset == "Testing", ]

rec <- recipe(
  HHV ~ carbon + hydrogen + oxygen + nitrogen + sulfur,
  data = biomass_tr
)

kpca_trans <- rec %>%
  step_YeoJohnson(all_numeric_predictors()) %>%
  step_normalize(all_numeric_predictors()) %>%
  step_kpca(all_numeric_predictors())

kpca_estimates <- prep(kpca_trans, training = biomass_tr)

kpca_te <- bake(kpca_estimates, biomass_te)

ggplot(kpca_te, aes(x = kPC1, y = kPC2)) +
  geom_point() +
  coord_equal()

tidy(kpca_trans, number = 3)
tidy(kpca_estimates, number = 3)
step_kpca_poly() creates a specification of a recipe step that will convert numeric data into one or more principal components using a polynomial kernel basis expansion.
step_kpca_poly( recipe, ..., role = "predictor", trained = FALSE, num_comp = 5, res = NULL, columns = NULL, degree = 2, scale_factor = 1, offset = 1, prefix = "kPC", keep_original_cols = FALSE, skip = FALSE, id = rand_id("kpca_poly") )
recipe |
A recipe object. The step will be added to the sequence of operations for this recipe. |
... |
One or more selector functions to choose variables
for this step. See |
role |
For model terms created by this step, what analysis role should they be assigned? By default, the new columns created by this step from the original variables will be used as predictors in a model. |
trained |
A logical to indicate if the quantities for preprocessing have been estimated. |
num_comp |
The number of components to retain as new predictors.
If |
res |
An S4 |
columns |
A character string of the selected variable names. This field
is a placeholder and will be populated once |
degree , scale_factor , offset
|
Numeric values for the polynomial kernel
function. See the documentation at |
prefix |
A character string for the prefix of the resulting new variables. See notes below. |
keep_original_cols |
A logical to keep the original variables in the
output. Defaults to |
skip |
A logical. Should the step be skipped when the
recipe is baked by |
id |
A character string that is unique to this step to identify it. |
Kernel principal component analysis (kPCA) is an extension of a PCA analysis that conducts the calculations in a broader dimensionality defined by a kernel function. For example, if a quadratic kernel function were used, each variable would be represented by its original values as well as its square. This nonlinear mapping is used during the PCA analysis and can potentially help find better representations of the original data.
This step requires the kernlab package. If not installed, the step will stop with a prompt about installing the package.
As with ordinary PCA, it is important to center and scale the variables prior to computing PCA components (step_normalize() can be used for this purpose).
The argument num_comp controls the number of components that will be retained (the original variables that are used to derive the components are removed from the data). The new components will have names that begin with prefix and a sequence of numbers. The variable names are padded with zeros. For example, if num_comp < 10, their names will be kPC1 - kPC9. If num_comp = 101, the names would be kPC001 - kPC101.
An updated version of recipe
with the new step added to the
sequence of any existing operations.
This step has 4 tuning parameters:
num_comp: # Components (type: integer, default: 5)
degree: Polynomial Degree (type: double, default: 2)
scale_factor: Scale Factor (type: double, default: 1)
offset: Offset (type: double, default: 1)
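To make the roles of degree, scale_factor, and offset concrete, the sketch below writes out a polynomial kernel in base R. It assumes the form that kernlab documents for its polynomial kernel, (scale * <x, y> + offset)^degree, and assumes that scale_factor plays the role of scale. The function poly_kernel() is hypothetical and is not part of recipes or kernlab.

```r
# Hypothetical polynomial kernel using the step's argument names.
# Assumed form: (scale_factor * <x, y> + offset)^degree
poly_kernel <- function(x, y, degree = 2, scale_factor = 1, offset = 1) {
  (scale_factor * sum(x * y) + offset)^degree
}

x <- c(1, 2)
y <- c(3, 4)

# Defaults: (1 * 11 + 1)^2
default_value <- poly_kernel(x, y)
default_value

# A cubic kernel with a smaller scale and no offset: (0.5 * 11 + 0)^3
cubic_value <- poly_kernel(x, y, degree = 3, scale_factor = 0.5, offset = 0)
cubic_value
```

Larger values of degree produce more flexible expansions, while scale_factor and offset change how strongly the inner product and the constant term contribute.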
When you tidy() this step, a tibble is returned with columns terms and id:
terms: character, the selectors or variables selected
id: character, id of this step
The underlying operation does not allow for case weights.
Scholkopf, B., Smola, A., and Muller, K. (1997). Kernel principal component analysis. Lecture Notes in Computer Science, 1327, 583-588.
Karatzoglou, A., Smola, A., Hornik, K., and Zeileis, A. (2004). kernlab - An S4 package for kernel methods in R. Journal of Statistical Software, 11(9), 1-20.
Other multivariate transformation steps:
step_classdist()
,
step_classdist_shrunken()
,
step_depth()
,
step_geodist()
,
step_ica()
,
step_isomap()
,
step_kpca()
,
step_kpca_rbf()
,
step_mutate_at()
,
step_nnmf()
,
step_nnmf_sparse()
,
step_pca()
,
step_pls()
,
step_ratio()
,
step_spatialsign()
library(ggplot2)
data(biomass, package = "modeldata")

biomass_tr <- biomass[biomass$dataset == "Training", ]
biomass_te <- biomass[biomass$dataset == "Testing", ]

rec <- recipe(
  HHV ~ carbon + hydrogen + oxygen + nitrogen + sulfur,
  data = biomass_tr
)

kpca_trans <- rec %>%
  step_YeoJohnson(all_numeric_predictors()) %>%
  step_normalize(all_numeric_predictors()) %>%
  step_kpca_poly(all_numeric_predictors())

kpca_estimates <- prep(kpca_trans, training = biomass_tr)

kpca_te <- bake(kpca_estimates, biomass_te)

ggplot(kpca_te, aes(x = kPC1, y = kPC2)) +
  geom_point() +
  coord_equal()

tidy(kpca_trans, number = 3)
tidy(kpca_estimates, number = 3)
step_kpca_rbf() creates a specification of a recipe step that will convert numeric data into one or more principal components using a radial basis function kernel basis expansion.
step_kpca_rbf( recipe, ..., role = "predictor", trained = FALSE, num_comp = 5, res = NULL, columns = NULL, sigma = 0.2, prefix = "kPC", keep_original_cols = FALSE, skip = FALSE, id = rand_id("kpca_rbf") )
recipe |
A recipe object. The step will be added to the sequence of operations for this recipe. |
... |
One or more selector functions to choose variables
for this step. See |
role |
For model terms created by this step, what analysis role should they be assigned? By default, the new columns created by this step from the original variables will be used as predictors in a model. |
trained |
A logical to indicate if the quantities for preprocessing have been estimated. |
num_comp |
The number of components to retain as new predictors.
If |
res |
An S4 |
columns |
A character string of the selected variable names. This field
is a placeholder and will be populated once |
sigma |
A numeric value for the radial basis function parameter. See
the documentation at |
prefix |
A character string for the prefix of the resulting new variables. See notes below. |
keep_original_cols |
A logical to keep the original variables in the
output. Defaults to |
skip |
A logical. Should the step be skipped when the
recipe is baked by |
id |
A character string that is unique to this step to identify it. |
Kernel principal component analysis (kPCA) is an extension of a PCA analysis that conducts the calculations in a broader dimensionality defined by a kernel function. For example, if a quadratic kernel function were used, each variable would be represented by its original values as well as its square. This nonlinear mapping is used during the PCA analysis and can potentially help find better representations of the original data.
This step requires the kernlab package. If not installed, the step will stop with a prompt about installing the package.
As with ordinary PCA, it is important to center and scale the variables prior to computing PCA components (step_normalize() can be used for this purpose).
The argument num_comp controls the number of components that will be retained (the original variables that are used to derive the components are removed from the data). The new components will have names that begin with prefix and a sequence of numbers. The variable names are padded with zeros. For example, if num_comp < 10, their names will be kPC1 - kPC9. If num_comp = 101, the names would be kPC001 - kPC101.
An updated version of recipe
with the new step added to the
sequence of any existing operations.
This step has 2 tuning parameters:
num_comp: # Components (type: integer, default: 5)
sigma: Radial Basis Function sigma (type: double, default: 0.2)
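The effect of sigma can be seen by writing the radial basis function kernel out in base R. The sketch assumes the form that kernlab documents for its Gaussian RBF kernel, exp(-sigma * ||x - y||^2). The function rbf_kernel() is hypothetical and is not part of recipes or kernlab.

```r
# Hypothetical RBF kernel. Assumed form: exp(-sigma * squared distance)
rbf_kernel <- function(x, y, sigma = 0.2) {
  exp(-sigma * sum((x - y)^2))
}

x <- c(0, 0)
y <- c(1, 2)

# Squared distance is 5, so the default sigma gives exp(-1)
default_value <- rbf_kernel(x, y)
default_value

# A point compared with itself always has similarity 1
self_value <- rbf_kernel(x, x)

# Larger sigma makes similarity decay faster with distance
sharper_value <- rbf_kernel(x, y, sigma = 1)
```

Under this form, small values of sigma treat distant points as still similar, while large values make the kernel respond only to very close neighbors.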
When you tidy() this step, a tibble is returned with columns terms and id:
terms: character, the selectors or variables selected
id: character, id of this step
The underlying operation does not allow for case weights.
Scholkopf, B., Smola, A., and Muller, K. (1997). Kernel principal component analysis. Lecture Notes in Computer Science, 1327, 583-588.
Karatzoglou, A., Smola, A., Hornik, K., and Zeileis, A. (2004). kernlab - An S4 package for kernel methods in R. Journal of Statistical Software, 11(9), 1-20.
Other multivariate transformation steps:
step_classdist()
,
step_classdist_shrunken()
,
step_depth()
,
step_geodist()
,
step_ica()
,
step_isomap()
,
step_kpca()
,
step_kpca_poly()
,
step_mutate_at()
,
step_nnmf()
,
step_nnmf_sparse()
,
step_pca()
,
step_pls()
,
step_ratio()
,
step_spatialsign()
library(ggplot2)
data(biomass, package = "modeldata")

biomass_tr <- biomass[biomass$dataset == "Training", ]
biomass_te <- biomass[biomass$dataset == "Testing", ]

rec <- recipe(
  HHV ~ carbon + hydrogen + oxygen + nitrogen + sulfur,
  data = biomass_tr
)

kpca_trans <- rec %>%
  step_YeoJohnson(all_numeric_predictors()) %>%
  step_normalize(all_numeric_predictors()) %>%
  step_kpca_rbf(all_numeric_predictors())

kpca_estimates <- prep(kpca_trans, training = biomass_tr)

kpca_te <- bake(kpca_estimates, biomass_te)

ggplot(kpca_te, aes(x = kPC1, y = kPC2)) +
  geom_point() +
  coord_equal()

tidy(kpca_trans, number = 3)
tidy(kpca_estimates, number = 3)
step_lag() creates a specification of a recipe step that will add new columns of lagged data. Lagged data will by default include NA values where the lag was induced. These can be removed with step_naomit(), or you may specify an alternative filler value with the default argument.
step_lag( recipe, ..., role = "predictor", trained = FALSE, lag = 1, prefix = "lag_", default = NA, columns = NULL, keep_original_cols = TRUE, skip = FALSE, id = rand_id("lag") )
recipe |
A recipe object. The step will be added to the sequence of operations for this recipe. |
... |
One or more selector functions to choose variables
for this step. See |
role |
For model terms created by this step, what analysis role should they be assigned? By default, the new columns created by this step from the original variables will be used as predictors in a model. |
trained |
A logical to indicate if the quantities for preprocessing have been estimated. |
lag |
A vector of positive integers. Each specified column will be lagged for each value in the vector. |
prefix |
A prefix for generated column names; defaults to "lag_". |
default |
Passed to |
columns |
A character string of the selected variable names. This field
is a placeholder and will be populated once |
keep_original_cols |
A logical to keep the original variables in the
output. Defaults to |
skip |
A logical. Should the step be skipped when the
recipe is baked by |
id |
A character string that is unique to this step to identify it. |
The step assumes that the data are already in the proper sequential order for lagging.
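What lagging does to a column, and why row order matters, can be sketched in base R. The helper lag_vec() is hypothetical and written only for this illustration; the column names in the data frame are likewise illustrative.

```r
# Hypothetical helper: shift a vector down by k positions and fill the
# first k slots with a default value (NA unless told otherwise).
lag_vec <- function(x, k = 1, default = NA) {
  n <- length(x)
  if (k >= n) {
    return(rep(default, n))
  }
  c(rep(default, k), x[seq_len(n - k)])
}

index <- 1:6

# Lags of 2 and 3 leave NA values in the first rows
lagged <- data.frame(
  index = index,
  lag_2_index = lag_vec(index, 2),
  lag_3_index = lag_vec(index, 3)
)
lagged

# An alternative filler value replaces the induced NA values
lag_vec(index, 2, default = 0)
```

The shift is purely positional, so if the rows are not already in sequential order (for example, sorted by date), the lagged values will not correspond to earlier observations.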
An updated version of recipe
with the new step added to the
sequence of any existing operations.
When you tidy() this step, a tibble is returned with columns terms and id:
terms: character, the selectors or variables selected
id: character, id of this step
The underlying operation does not allow for case weights.
Other row operation steps:
step_arrange()
,
step_filter()
,
step_impute_roll()
,
step_naomit()
,
step_sample()
,
step_shuffle()
,
step_slice()
n <- 10
start <- as.Date("1999/01/01")
end <- as.Date("1999/01/10")

df <- data.frame(
  x = runif(n),
  index = 1:n,
  day = seq(start, end, by = "day")
)

recipe(~., data = df) %>%
  step_lag(index, day, lag = 2:3) %>%
  prep(df) %>%
  bake(df)
step_lincomb() creates a specification of a recipe step that will potentially remove numeric variables that have exact linear combinations between them.
step_lincomb( recipe, ..., role = NA, trained = FALSE, max_steps = 5, removals = NULL, skip = FALSE, id = rand_id("lincomb") )
recipe |
A recipe object. The step will be added to the sequence of operations for this recipe. |
... |
One or more selector functions to choose variables
for this step. See |
role |
Not used by this step since no new variables are created. |
trained |
A logical to indicate if the quantities for preprocessing have been estimated. |
max_steps |
The number of times to apply the algorithm. |
removals |
A character string that contains the names of
columns that should be removed. These values are not determined
until |
skip |
A logical. Should the step be skipped when the
recipe is baked by |
id |
A character string that is unique to this step to identify it. |
This step can potentially remove columns from the data set. This may cause issues for subsequent steps in your recipe if the missing columns are specifically referenced by name. To avoid this, see the advice in the Tips for saving recipes and filtering columns section of selections.
This step finds exact linear combinations between two or more variables and recommends which column(s) should be removed to resolve the issue. This algorithm may need to be applied multiple times (as defined by max_steps).
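One way to see what an exact linear combination looks like numerically is a QR decomposition in base R. The sketch below is an illustration of the idea under simulated data, not the algorithm this step runs; the variable names are borrowed from the biomass example purely for readability.

```r
set.seed(42)

# Three independent columns plus one exact linear combination of them
carbon <- rnorm(25)
hydrogen <- rnorm(25)
sulfur <- rnorm(25)
new_1 <- 0.1 * carbon - 0.2 * hydrogen + 0.6 * sulfur

x <- cbind(carbon, hydrogen, sulfur, new_1)

# The QR decomposition reports the numerical rank; with base R's default
# method, columns found to be dependent are pivoted to the end
decomp <- qr(x)
n_dependent <- ncol(x) - decomp$rank

dependent <- if (n_dependent > 0) {
  colnames(x)[decomp$pivot[seq(decomp$rank + 1, ncol(x))]]
} else {
  character(0)
}

dependent
```

Removing the flagged column leaves a full-rank matrix. When several overlapping combinations exist, one pass may not resolve all of them, which is consistent with the step allowing repeated application.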
An updated version of recipe
with the new step added to the
sequence of any existing operations.
When you tidy() this step, a tibble is returned with columns terms and id:
terms: character, the selectors or variables selected
id: character, id of this step
The underlying operation does not allow for case weights.
Max Kuhn, Kirk Mettler, and Jed Wing
Other variable filter steps:
step_corr()
,
step_filter_missing()
,
step_nzv()
,
step_rm()
,
step_select()
,
step_zv()
data(biomass, package = "modeldata")

biomass$new_1 <- with(
  biomass,
  .1 * carbon - .2 * hydrogen + .6 * sulfur
)
biomass$new_2 <- with(
  biomass,
  .5 * carbon - .2 * oxygen + .6 * nitrogen
)

biomass_tr <- biomass[biomass$dataset == "Training", ]
biomass_te <- biomass[biomass$dataset == "Testing", ]

rec <- recipe(
  HHV ~ carbon + hydrogen + oxygen + nitrogen + sulfur + new_1 + new_2,
  data = biomass_tr
)

lincomb_filter <- rec %>%
  step_lincomb(all_numeric_predictors())

lincomb_filter_trained <- prep(lincomb_filter, training = biomass_tr)
lincomb_filter_trained

tidy(lincomb_filter, number = 1)
tidy(lincomb_filter_trained, number = 1)
step_log() creates a specification of a recipe step that will log transform data.
step_log( recipe, ..., role = NA, trained = FALSE, base = exp(1), offset = 0, columns = NULL, skip = FALSE, signed = FALSE, id = rand_id("log") )
recipe |
A recipe object. The step will be added to the sequence of operations for this recipe. |
... |
One or more selector functions to choose variables
for this step. See |
role |
Not used by this step since no new variables are created. |
trained |
A logical to indicate if the quantities for preprocessing have been estimated. |
base |
A numeric value for the base. |
offset |
An optional value to add to the data prior to
logging (to avoid |
columns |
A character string of the selected variable names. This field
is a placeholder and will be populated once |
skip |
A logical. Should the step be skipped when the
recipe is baked by |
signed |
A logical indicating whether to take the signed log.
This is sign(x) * log(abs(x)) when abs(x) >= 1, and 0 when abs(x) < 1.
If |
id |
A character string that is unique to this step to identify it. |
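The signed log described for the signed argument can be written as a small base R function. The helper signed_log() below is hypothetical and follows only the formula stated above; it is not the package's internal code.

```r
# Hypothetical helper implementing the stated rule:
#   sign(x) * log(abs(x)) when abs(x) >= 1, and 0 when abs(x) < 1
signed_log <- function(x, base = exp(1)) {
  ifelse(abs(x) >= 1, sign(x) * log(abs(x), base = base), 0)
}

values <- c(-exp(2), -0.5, 0, 0.5, exp(1))
signed_values <- signed_log(values)
signed_values
```

Unlike a plain logarithm, this version is defined for negative inputs and zero, at the cost of mapping every value with magnitude below 1 to 0.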
An updated version of recipe
with the new step added to the
sequence of any existing operations.
When you tidy() this step, a tibble is returned with columns terms, base, and id:
terms: character, the selectors or variables selected
base: numeric, value for the base
id: character, id of this step
The underlying operation does not allow for case weights.
Other individual transformation steps:
step_BoxCox()
,
step_YeoJohnson()
,
step_bs()
,
step_harmonic()
,
step_hyperbolic()
,
step_inverse()
,
step_invlogit()
,
step_logit()
,
step_mutate()
,
step_ns()
,
step_percentile()
,
step_poly()
,
step_relu()
,
step_sqrt()
set.seed(313)
examples <- matrix(exp(rnorm(40)), ncol = 2)
examples <- as.data.frame(examples)

rec <- recipe(~ V1 + V2, data = examples)

log_trans <- rec %>%
  step_log(all_numeric_predictors())

log_obj <- prep(log_trans, training = examples)

transformed_te <- bake(log_obj, examples)
plot(examples$V1, transformed_te$V1)

tidy(log_trans, number = 1)
tidy(log_obj, number = 1)

# using the signed argument with negative values
examples2 <- matrix(rnorm(40, sd = 5), ncol = 2)
examples2 <- as.data.frame(examples2)

recipe(~ V1 + V2, data = examples2) %>%
  step_log(all_numeric_predictors()) %>%
  prep(training = examples2) %>%
  bake(examples2)

recipe(~ V1 + V2, data = examples2) %>%
  step_log(all_numeric_predictors(), signed = TRUE) %>%
  prep(training = examples2) %>%
  bake(examples2)
step_logit() creates a specification of a recipe step that will logit transform the data.
step_logit( recipe, ..., offset = 0, role = NA, trained = FALSE, columns = NULL, skip = FALSE, id = rand_id("logit") )
recipe |
A recipe object. The step will be added to the sequence of operations for this recipe. |
... |
One or more selector functions to choose variables
for this step. See |
offset |
A numeric value to modify values of the columns that are either
one or zero. They are modified to be |
role |
Not used by this step since no new variables are created. |
trained |
A logical to indicate if the quantities for preprocessing have been estimated. |
columns |
A character string of the selected variable names. This field
is a placeholder and will be populated once |
skip |
A logical. Should the step be skipped when the
recipe is baked by |
id |
A character string that is unique to this step to identify it. |
The logit transformation takes values between zero and one and translates them to be on the real line using the function f(p) = log(p/(1-p)).
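The formula can be checked directly in base R. The functions logit() and inv_logit() below are hypothetical helpers written for this illustration; stats::qlogis() is used only as an independent reference for the same transformation.

```r
# The logit function as stated: f(p) = log(p / (1 - p))
logit <- function(p) log(p / (1 - p))

# Its inverse maps the real line back to values between zero and one
inv_logit <- function(x) 1 / (1 + exp(-x))

p <- c(0.1, 0.5, 0.9)
logit_p <- logit(p)
logit_p

# Agrees with base R's quantile function for the logistic distribution
all.equal(logit_p, qlogis(p))

# Values of exactly zero or one map to -Inf and Inf
boundary <- logit(c(0, 1))
boundary
```

The infinite results at the boundaries show why columns that contain exact zeros or ones need to be adjusted before the transformation is useful.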
An updated version of recipe
with the new step added to the
sequence of any existing operations.
When you tidy() this step, a tibble is returned with columns terms and id:
terms: character, the selectors or variables selected
id: character, id of this step
The underlying operation does not allow for case weights.
Other individual transformation steps:
step_BoxCox()
,
step_YeoJohnson()
,
step_bs()
,
step_harmonic()
,
step_hyperbolic()
,
step_inverse()
,
step_invlogit()
,
step_log()
,
step_mutate()
,
step_ns()
,
step_percentile()
,
step_poly()
,
step_relu()
,
step_sqrt()
set.seed(313)
examples <- matrix(runif(40), ncol = 2)
examples <- data.frame(examples)

rec <- recipe(~ X1 + X2, data = examples)

logit_trans <- rec %>%
  step_logit(all_numeric_predictors())

logit_obj <- prep(logit_trans, training = examples)

transformed_te <- bake(logit_obj, examples)
plot(examples$X1, transformed_te$X1)

tidy(logit_trans, number = 1)
tidy(logit_obj, number = 1)
step_mutate() creates a specification of a recipe step that will add variables using dplyr::mutate().
step_mutate( recipe, ..., .pkgs = character(), role = "predictor", trained = FALSE, inputs = NULL, skip = FALSE, id = rand_id("mutate") )
recipe |
A recipe object. The step will be added to the sequence of operations for this recipe. |
... |
Name-value pairs of expressions. See |
.pkgs |
Character vector, package names of functions used in
expressions |
role |
For model terms created by this step, what analysis role should they be assigned? By default, the new columns created by this step from the original variables will be used as predictors in a model. |
trained |
A logical to indicate if the quantities for preprocessing have been estimated. |
inputs |
Quosure(s) of |
skip |
A logical. Should the step be skipped when the
recipe is baked by |
id |
A character string that is unique to this step to identify it. |
When using this flexible step, use extra care to avoid data leakage in your
preprocessing. Consider, for example, the transformation x = w > mean(w)
.
When applied to new data or testing data, this transformation would use the
mean of w
from the new data, not the mean of w
from the training data.
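The leakage concern can be sketched in base R, outside of any recipe. The data frames `train` and `new` are made up for illustration; the point is only that `mean(w)` is re-evaluated on whatever data the expression sees.

```r
# Hypothetical training and new data sharing a column `w`.
train <- data.frame(w = c(1, 2, 3, 4))
new   <- data.frame(w = c(10, 20, 30, 40))

# Leaky: the threshold is recomputed from the data being transformed.
leaky <- new$w > mean(new$w)

# Safer: estimate the threshold once on the training data and reuse it.
train_mean <- mean(train$w)
fixed <- new$w > train_mean

leaky
fixed
```

The two results disagree because the first comparison uses the mean of the new data, while the second uses a value fixed at training time.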
When an object in the user's global environment is
referenced in the expression defining the new variable(s),
it is a good idea to use quasiquotation (e.g. !!
) to embed
the value of the object in the expression (to be portable
between sessions). See the examples.
If a preceding step removes a column that is selected by name in
step_mutate()
, the recipe will error when being estimated with prep()
.
An updated version of recipe
with the new step added to the
sequence of any existing operations.
When you tidy()
this step, a tibble is returned with
columns terms
, value
, and id
:
character, the selectors or variables selected
character, expression passed to mutate()
character, id of this step
The underlying operation does not allow for case weights.
Other individual transformation steps:
step_BoxCox()
,
step_YeoJohnson()
,
step_bs()
,
step_harmonic()
,
step_hyperbolic()
,
step_inverse()
,
step_invlogit()
,
step_log()
,
step_logit()
,
step_ns()
,
step_percentile()
,
step_poly()
,
step_relu()
,
step_sqrt()
Other dplyr steps:
step_arrange()
,
step_filter()
,
step_mutate_at()
,
step_rename()
,
step_rename_at()
,
step_sample()
,
step_select()
,
step_slice()
rec <- recipe(~., data = iris) %>%
  step_mutate(
    dbl_width = Sepal.Width * 2,
    half_length = Sepal.Length / 2
  )

prepped <- prep(rec, training = iris %>% slice(1:75))

library(dplyr)

dplyr_train <- iris %>%
  as_tibble() %>%
  slice(1:75) %>%
  mutate(
    dbl_width = Sepal.Width * 2,
    half_length = Sepal.Length / 2
  )

rec_train <- bake(prepped, new_data = NULL)
all.equal(dplyr_train, rec_train)

dplyr_test <- iris %>%
  as_tibble() %>%
  slice(76:150) %>%
  mutate(
    dbl_width = Sepal.Width * 2,
    half_length = Sepal.Length / 2
  )

rec_test <- bake(prepped, iris %>% slice(76:150))
all.equal(dplyr_test, rec_test)

# Embedding objects:
const <- 1.414

qq_rec <- recipe(~., data = iris) %>%
  step_mutate(
    bad_approach = Sepal.Width * const,
    best_approach = Sepal.Width * !!const
  ) %>%
  prep(training = iris)

bake(qq_rec, new_data = NULL, contains("appro")) %>% slice(1:4)

# The difference:
tidy(qq_rec, number = 1)
step_mutate_at()
creates a specification of a recipe step that will
modify the selected variables using a common function via
dplyr::mutate_at()
.
step_mutate_at( recipe, ..., fn, role = "predictor", trained = FALSE, inputs = NULL, skip = FALSE, id = rand_id("mutate_at") )
recipe |
A recipe object. The step will be added to the sequence of operations for this recipe. |
... |
One or more selector functions to choose variables
for this step. See |
fn |
A function `fun`, a quosure-style lambda `~ fun(.)`, or a list of
either form. (see |
role |
For model terms created by this step, what analysis role should they be assigned? By default, the new columns created by this step from the original variables will be used as predictors in a model. |
trained |
A logical to indicate if the quantities for preprocessing have been estimated. |
inputs |
A vector of column names populated by |
skip |
A logical. Should the step be skipped when the
recipe is baked by |
id |
A character string that is unique to this step to identify it. |
When using this flexible step, use extra care to avoid data leakage in your
preprocessing. Consider, for example, the transformation x = w > mean(w)
.
When applied to new data or testing data, this transformation would use the
mean of w
from the new data, not the mean of w
from the training data.
An updated version of recipe
with the new step added to the
sequence of any existing operations.
When you tidy()
this step, a tibble is returned with
columns terms
and id
:
character, the selectors or variables selected
character, id of this step
The underlying operation does not allow for case weights.
Other multivariate transformation steps:
step_classdist()
,
step_classdist_shrunken()
,
step_depth()
,
step_geodist()
,
step_ica()
,
step_isomap()
,
step_kpca()
,
step_kpca_poly()
,
step_kpca_rbf()
,
step_nnmf()
,
step_nnmf_sparse()
,
step_pca()
,
step_pls()
,
step_ratio()
,
step_spatialsign()
Other dplyr steps:
step_arrange()
,
step_filter()
,
step_mutate()
,
step_rename()
,
step_rename_at()
,
step_sample()
,
step_select()
,
step_slice()
library(dplyr)

recipe(~., data = iris) %>%
  step_mutate_at(contains("Length"), fn = ~ 1 / .) %>%
  prep() %>%
  bake(new_data = NULL) %>%
  slice(1:10)

recipe(~., data = iris) %>%
  # leads to more columns being created.
  step_mutate_at(contains("Length"), fn = list(log = log, sqrt = sqrt)) %>%
  prep() %>%
  bake(new_data = NULL) %>%
  slice(1:10)
step_naomit()
creates a specification of a recipe step that will remove
observations (rows of data) if they contain NA
or NaN
values.
step_naomit( recipe, ..., role = NA, trained = FALSE, columns = NULL, skip = TRUE, id = rand_id("naomit") )
recipe |
A recipe object. The step will be added to the sequence of operations for this recipe. |
... |
One or more selector functions to choose variables
for this step. See |
role |
Unused; included for consistency with other steps. |
trained |
A logical to indicate if the quantities for preprocessing have been estimated. Again included for consistency. |
columns |
A character string of the selected variable names. This field
is a placeholder and will be populated once |
skip |
A logical. Should the step be skipped when the
recipe is baked by |
id |
A character string that is unique to this step to identify it. |
An updated version of recipe
with the new step added to the
sequence of any existing operations.
This step can entirely remove observations (rows of data), which can have
unintended and/or problematic consequences when applying the step to new
data later via bake()
. Consider whether skip = TRUE
or
skip = FALSE
is more appropriate in any given use case. In most instances
that affect the rows of the data being predicted, this step probably should
not be applied at all; instead, execute operations like this outside and
before starting a preprocessing recipe()
.
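One way to follow the advice above is to drop incomplete rows before the recipe is created. The sketch below uses only base R and the built-in `airquality` data; the object names `keep` and `aq_complete` are illustrative.

```r
# airquality ships with base R; its Solar.R column contains missing values.
keep <- !is.na(airquality$Solar.R)
aq_complete <- airquality[keep, ]

# Rows before and after removal.
nrow(airquality)
nrow(aq_complete)

# Only the selected column is guaranteed to be complete afterwards;
# other columns (for example Ozone) may still contain NA.
anyNA(aq_complete$Solar.R)
anyNA(aq_complete$Ozone)
```

Filtering up front makes the row removal explicit, so it does not depend on how `skip` is set when new data are baked.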
When you tidy()
this step, a tibble is returned with
columns terms
and id
:
character, the selectors or variables selected
character, id of this step
The underlying operation does not allow for case weights.
Other row operation steps:
step_arrange()
,
step_filter()
,
step_impute_roll()
,
step_lag()
,
step_sample()
,
step_shuffle()
,
step_slice()
recipe(Ozone ~ ., data = airquality) %>%
  step_naomit(Solar.R) %>%
  prep(airquality, verbose = FALSE) %>%
  bake(new_data = NULL)
step_nnmf()
creates a specification of a recipe step that will convert
numeric data into one or more non-negative components.
Please use step_nnmf_sparse()
instead of this step function.
step_nnmf( recipe, ..., role = "predictor", trained = FALSE, num_comp = 2, num_run = 30, options = list(), res = NULL, columns = NULL, prefix = "NNMF", seed = sample.int(10^5, 1), keep_original_cols = FALSE, skip = FALSE, id = rand_id("nnmf") )
recipe |
A recipe object. The step will be added to the sequence of operations for this recipe. |
... |
One or more selector functions to choose variables
for this step. See |
role |
For model terms created by this step, what analysis role should they be assigned? By default, the new columns created by this step from the original variables will be used as predictors in a model. |
trained |
A logical to indicate if the quantities for preprocessing have been estimated. |
num_comp |
The number of components to retain as new predictors.
If |
num_run |
A positive integer for the number of computation runs used to obtain a consensus projection. |
options |
A list of options to |
res |
The |
columns |
A character string of the selected variable names. This field
is a placeholder and will be populated once |
prefix |
A character string that will be the prefix to the resulting new variables. See notes below. |
seed |
An integer that will be used to set the seed in isolation when computing the factorization. |
keep_original_cols |
A logical to keep the original variables in the
output. Defaults to |
skip |
A logical. Should the step be skipped when the
recipe is baked by |
id |
A character string that is unique to this step to identify it. |
Non-negative matrix factorization computes latent components that have non-negative values and take into account that the original data have non-negative values.
The argument num_comp
controls the number of components that will be retained
(the original variables that are used to derive the components are removed from
the data). The new components will have names that begin with prefix
and a
sequence of numbers. The variable names are padded with zeros. For example, if
num_comp < 10
, their names will be NNMF1
- NNMF9
. If num_comp = 101
,
the names would be NNMF001
- NNMF101
.
An updated version of recipe
with the new step added to the
sequence of any existing operations.
When you tidy()
this step, a tibble is returned with
columns terms
, value
, component
, and id
:
character, the selectors or variables selected
numeric, value of loading
character, name of component
character, id of this step
This step has 2 tuning parameters:
num_comp
: # Components (type: integer, default: 2)
num_run
: Number of Computation Runs (type: integer, default: 30)
The underlying operation does not allow for case weights.
Other multivariate transformation steps:
step_classdist()
,
step_classdist_shrunken()
,
step_depth()
,
step_geodist()
,
step_ica()
,
step_isomap()
,
step_kpca()
,
step_kpca_poly()
,
step_kpca_rbf()
,
step_mutate_at()
,
step_nnmf_sparse()
,
step_pca()
,
step_pls()
,
step_ratio()
,
step_spatialsign()
data(biomass, package = "modeldata")

# rec <- recipe(HHV ~ ., data = biomass) %>%
#   update_role(sample, new_role = "id var") %>%
#   update_role(dataset, new_role = "split variable") %>%
#   step_nnmf(all_numeric_predictors(), num_comp = 2, seed = 473, num_run = 2) %>%
#   prep(training = biomass)
#
# bake(rec, new_data = NULL)
#
# library(ggplot2)
# bake(rec, new_data = NULL) %>%
#   ggplot(aes(x = NNMF2, y = NNMF1, col = HHV)) + geom_point()
step_nnmf_sparse()
creates a specification of a recipe step that will
convert numeric data into one or more non-negative components.
step_nnmf_sparse( recipe, ..., role = "predictor", trained = FALSE, num_comp = 2, penalty = 0.001, options = list(), res = NULL, prefix = "NNMF", seed = sample.int(10^5, 1), keep_original_cols = FALSE, skip = FALSE, id = rand_id("nnmf_sparse") )
recipe |
A recipe object. The step will be added to the sequence of operations for this recipe. |
... |
One or more selector functions to choose variables
for this step. See |
role |
For model terms created by this step, what analysis role should they be assigned? By default, the new columns created by this step from the original variables will be used as predictors in a model. |
trained |
A logical to indicate if the quantities for preprocessing have been estimated. |
num_comp |
The number of components to retain as new predictors.
If |
penalty |
A non-negative number used as a penalization factor for the loadings. Values are usually between zero and one. |
options |
A list of options to |
res |
A matrix of loadings is stored here, along with the names of the
original predictors, once this preprocessing step has been trained by
|
prefix |
A character string for the prefix of the resulting new variables. See notes below. |
seed |
An integer that will be used to set the seed in isolation when computing the factorization. |
keep_original_cols |
A logical to keep the original variables in the
output. Defaults to |
skip |
A logical. Should the step be skipped when the
recipe is baked by |
id |
A character string that is unique to this step to identify it. |
Non-negative matrix factorization computes latent components that have non-negative values and take into account that the original data have non-negative values.
The argument num_comp
controls the number of components that will be retained
(the original variables that are used to derive the components are removed from
the data). The new components will have names that begin with prefix
and a
sequence of numbers. The variable names are padded with zeros. For example, if
num_comp < 10
, their names will be NNMF1
- NNMF9
. If num_comp = 101
,
the names would be NNMF001
- NNMF101
.
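The zero-padding rule can be sketched in base R. The helper `padded_names` below is hypothetical and is only meant to mimic the naming pattern described above; it is not the package's own implementation.

```r
# Build component names whose numeric suffix is left-padded with zeros so
# that every name has the same width. `padded_names` is an illustrative helper.
padded_names <- function(num, prefix = "NNMF") {
  width <- nchar(as.character(num))
  paste0(prefix, formatC(seq_len(num), width = width, flag = "0"))
}

# Fewer than 10 components: no padding is needed.
padded_names(3)

# 101 components: suffixes are padded to three digits.
padded_names(101)[c(1, 101)]
```

Padding keeps the names in the intended order when they are sorted alphabetically.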
An updated version of recipe
with the new step added to the
sequence of any existing operations.
When you tidy()
this step, a tibble is returned with
columns terms
, value
, component
, and id
:
character, the selectors or variables selected
numeric, value of loading
character, name of component
character, id of this step
This step has 2 tuning parameters:
num_comp
: # Components (type: integer, default: 2)
penalty
: Amount of Regularization (type: double, default: 0.001)
The underlying operation does not allow for case weights.
Other multivariate transformation steps:
step_classdist()
,
step_classdist_shrunken()
,
step_depth()
,
step_geodist()
,
step_ica()
,
step_isomap()
,
step_kpca()
,
step_kpca_poly()
,
step_kpca_rbf()
,
step_mutate_at()
,
step_nnmf()
,
step_pca()
,
step_pls()
,
step_ratio()
,
step_spatialsign()
if (rlang::is_installed(c("modeldata", "RcppML", "ggplot2"))) {
  library(Matrix)
  data(biomass, package = "modeldata")

  rec <- recipe(HHV ~ ., data = biomass) %>%
    update_role(sample, new_role = "id var") %>%
    update_role(dataset, new_role = "split variable") %>%
    step_nnmf_sparse(
      all_numeric_predictors(),
      num_comp = 2,
      seed = 473,
      penalty = 0.01
    ) %>%
    prep(training = biomass)

  bake(rec, new_data = NULL)

  library(ggplot2)
  bake(rec, new_data = NULL) %>%
    ggplot(aes(x = NNMF2, y = NNMF1, col = HHV)) +
    geom_point()
}
step_normalize()
creates a specification of a recipe step that will
normalize numeric data to have a standard deviation of one and a mean of
zero.
step_normalize( recipe, ..., role = NA, trained = FALSE, means = NULL, sds = NULL, na_rm = TRUE, skip = FALSE, id = rand_id("normalize") )
recipe |
A recipe object. The step will be added to the sequence of operations for this recipe. |
... |
One or more selector functions to choose variables
for this step. See |
role |
Not used by this step since no new variables are created. |
trained |
A logical to indicate if the quantities for preprocessing have been estimated. |
means |
A named numeric vector of means. This is |
sds |
A named numeric vector of standard deviations. This is |
na_rm |
A logical value indicating whether |
skip |
A logical. Should the step be skipped when the
recipe is baked by |
id |
A character string that is unique to this step to identify it. |
Centering data means that the average of a variable is subtracted
from the data. Scaling data means that the standard deviation of a variable
is divided out of the data. step_normalize
estimates the variable standard
deviations and means from the data used in the training
argument of
prep.recipe
. bake.recipe
then applies the centering and scaling to new data sets using
these estimates.
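The estimate-then-apply pattern described above can be sketched in base R. The small `train` and `test` data frames are invented for illustration, and `scale()` stands in for what the step does conceptually; this is not the package's internal code.

```r
# Hypothetical training and test sets with the same numeric columns.
train <- data.frame(carbon = c(40, 45, 50, 55), hydrogen = c(5, 6, 7, 8))
test  <- data.frame(carbon = c(47.5, 60), hydrogen = c(6.5, 9))

# Estimate the statistics from the training data only.
means <- vapply(train, mean, numeric(1), na.rm = TRUE)
sds   <- vapply(train, sd, numeric(1), na.rm = TRUE)

# Apply the training estimates to the new data: subtract the training
# mean, then divide by the training standard deviation.
test_norm <- as.data.frame(scale(test, center = means, scale = sds))
test_norm
```

A test value equal to the training mean maps to zero, regardless of what the test set's own mean happens to be.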
An updated version of recipe
with the new step added to the
sequence of any existing operations.
When you tidy()
this step, a tibble is returned with
columns terms
, statistic
, value
, and id
:
character, the selectors or variables selected
character, name of statistic ("mean"
or "sd"
)
numeric, value of the statistic
character, id of this step
This step performs an unsupervised operation that can utilize case weights.
As a result, case weights are used only when they are frequency weights. For more
information, see the documentation in case_weights and the examples on
tidymodels.org
.
Other normalization steps:
step_center()
,
step_range()
,
step_scale()
data(biomass, package = "modeldata")

biomass_tr <- biomass[biomass$dataset == "Training", ]
biomass_te <- biomass[biomass$dataset == "Testing", ]

rec <- recipe(
  HHV ~ carbon + hydrogen + oxygen + nitrogen + sulfur,
  data = biomass_tr
)

norm_trans <- rec %>%
  step_normalize(carbon, hydrogen)

norm_obj <- prep(norm_trans, training = biomass_tr)

transformed_te <- bake(norm_obj, biomass_te)

biomass_te[1:10, names(transformed_te)]
transformed_te
tidy(norm_trans, number = 1)
tidy(norm_obj, number = 1)

# To keep the original variables in the output, use `step_mutate_at`:
norm_keep_orig <- rec %>%
  step_mutate_at(all_numeric_predictors(), fn = list(orig = ~.)) %>%
  step_normalize(-contains("orig"), -all_outcomes())

keep_orig_obj <- prep(norm_keep_orig, training = biomass_tr)
keep_orig_te <- bake(keep_orig_obj, biomass_te)
keep_orig_te
step_novel()
creates a specification of a recipe step that will assign a
previously unseen factor level to "new"
.
step_novel( recipe, ..., role = NA, trained = FALSE, new_level = "new", objects = NULL, skip = FALSE, id = rand_id("novel") )
recipe |
A recipe object. The step will be added to the sequence of operations for this recipe. |
... |
One or more selector functions to choose variables
for this step. See |
role |
Not used by this step since no new variables are created. |
trained |
A logical to indicate if the quantities for preprocessing have been estimated. |
new_level |
A single character value that will be assigned to new factor levels. |
objects |
A list of objects that contain the information
on factor levels that will be determined by |
skip |
A logical. Should the step be skipped when the
recipe is baked by |
id |
A character string that is unique to this step to identify it. |
The selected variables are adjusted to have a new
level (given by new_level
) that is placed in the last
position. During preparation there will be no data points
associated with this new level since all of the data have been
seen.
Note that if the original columns are character, they will be converted to factors by this step.
Missing values will remain missing.
If new_level
is already in the data given to prep
, an error
is thrown.
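The behavior described above (an extra last level, unseen values mapped to it, missing values left alone) can be sketched in base R. The vectors `train_city` and `new_city` are made up for illustration, and the recoding shown is an approximation of the idea rather than the package's implementation.

```r
# Hypothetical training factor and new character data with an unseen value.
train_city <- factor(c("davis", "elk_grove", "davis"))
new_city   <- c("davis", "beeptown", NA, "elk_grove")

# Levels seen during training, plus the extra level in the last position.
lvls <- c(levels(train_city), "new")

# Keep known values and missing values as they are; map everything else to "new".
known_or_na <- is.na(new_city) | new_city %in% levels(train_city)
recoded <- factor(ifelse(known_or_na, new_city, "new"), levels = lvls)

recoded
levels(recoded)
```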
When fitting a model that can deal with new factor levels, consider using
workflows::add_recipe()
with allow_novel_levels = TRUE
set in
hardhat::default_recipe_blueprint()
. This will allow your model to handle
new levels at prediction time, instead of throwing warnings or errors.
An updated version of recipe
with the new step added to the
sequence of any existing operations.
When you tidy()
this step, a tibble is returned with
columns terms
, value
, and id
:
character, the selectors or variables selected
character, the factor levels that are used for the new value
character, id of this step
The underlying operation does not allow for case weights.
Other dummy variable and encoding steps:
step_bin2factor()
,
step_count()
,
step_date()
,
step_dummy()
,
step_dummy_extract()
,
step_dummy_multi_choice()
,
step_factor2string()
,
step_holiday()
,
step_indicate_na()
,
step_integer()
,
step_num2factor()
,
step_ordinalscore()
,
step_other()
,
step_regex()
,
step_relevel()
,
step_string2factor()
,
step_time()
,
step_unknown()
,
step_unorder()
data(Sacramento, package = "modeldata")

sacr_tr <- Sacramento[1:800, ]
sacr_te <- Sacramento[801:806, ]

# Without converting the predictor to a character, the new level would be converted
# to `NA`.
sacr_te$city <- as.character(sacr_te$city)
sacr_te$city[3] <- "beeptown"
sacr_te$city[4] <- "boopville"
sacr_te$city <- as.factor(sacr_te$city)

rec <- recipe(~ city + zip, data = sacr_tr)

rec <- rec %>%
  step_novel(city, zip)
rec <- prep(rec, training = sacr_tr)

processed <- bake(rec, sacr_te)
tibble(old = sacr_te$city, new = processed$city)

tidy(rec, number = 1)
step_ns()
creates a specification of a recipe step that will create new
columns that are basis expansions of variables using natural splines.
step_ns( recipe, ..., role = "predictor", trained = FALSE, objects = NULL, deg_free = 2, options = list(), keep_original_cols = FALSE, skip = FALSE, id = rand_id("ns") )
recipe |
A recipe object. The step will be added to the sequence of operations for this recipe. |
... |
One or more selector functions to choose variables
for this step. See |
role |
For model terms created by this step, what analysis role should they be assigned? By default, the new columns created by this step from the original variables will be used as predictors in a model. |
trained |
A logical to indicate if the quantities for preprocessing have been estimated. |
objects |
A list of |
deg_free |
The degrees of freedom for the natural spline. As the degrees of freedom for a natural spline increase, more flexible and complex curves can be generated. When a single degree of freedom is used, the result is a rescaled version of the original data. |
options |
A list of options for |
keep_original_cols |
A logical to keep the original variables in the
output. Defaults to |
skip |
A logical. Should the step be skipped when the
recipe is baked by |
id |
A character string that is unique to this step to identify it. |
step_ns
can create new features from a single variable
that enable fitting routines to model this variable in a
nonlinear manner. The extent of the possible nonlinearity is
determined by the df
or knots
arguments of
splines::ns()
. The original variables are removed
from the data and new columns are added. The naming convention
for the new variables is varname_ns_1
and so on.
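A rough base R sketch of the underlying idea follows, using `splines::ns()` directly. The vectors `train_x` and `new_x` are invented, and the column names are built by hand to mirror the `varname_ns_1` convention; the step itself may handle details differently.

```r
library(splines)

# Hypothetical training values and new values for a variable called `x`.
train_x <- seq(0, 10, length.out = 50)
new_x   <- c(2.5, 7.5)

# Build the natural spline basis on the training data; the knots and
# boundary knots are stored as attributes of the result.
basis <- ns(train_x, df = 2)
dim(basis)

# Evaluate the same basis (same knots) at the new values.
new_basis <- predict(basis, new_x)
colnames(new_basis) <- paste0("x_ns_", seq_len(ncol(new_basis)))
new_basis
```

Reusing the stored knots through `predict()` is what keeps the expansion of new data consistent with the training data.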
An updated version of recipe
with the new step added to the
sequence of any existing operations.
When you tidy()
this step, a tibble is returned with
columns terms
and id
:
character, the selectors or variables selected
character, id of this step
This step has 1 tuning parameter:
deg_free
: Spline Degrees of Freedom (type: integer, default: 2)
The underlying operation does not allow for case weights.
Other individual transformation steps:
step_BoxCox()
,
step_YeoJohnson()
,
step_bs()
,
step_harmonic()
,
step_hyperbolic()
,
step_inverse()
,
step_invlogit()
,
step_log()
,
step_logit()
,
step_mutate()
,
step_percentile()
,
step_poly()
,
step_relu()
,
step_sqrt()
data(biomass, package = "modeldata")

biomass_tr <- biomass[biomass$dataset == "Training", ]
biomass_te <- biomass[biomass$dataset == "Testing", ]

rec <- recipe(
  HHV ~ carbon + hydrogen + oxygen + nitrogen + sulfur,
  data = biomass_tr
)

with_splines <- rec %>%
  step_ns(carbon, hydrogen)
with_splines <- prep(with_splines, training = biomass_tr)

expanded <- bake(with_splines, biomass_te)
expanded
step_num2factor()
will convert one or more numeric vectors to factors
(ordered or unordered). This can be useful when categories are encoded as
integers.
step_num2factor( recipe, ..., role = NA, transform = function(x) x, trained = FALSE, levels, ordered = FALSE, skip = FALSE, id = rand_id("num2factor") )
recipe |
A recipe object. The step will be added to the sequence of operations for this recipe. |
... |
One or more selector functions to choose variables
for this step. See |
role |
Not used by this step since no new variables are created. |
transform |
A function taking a single argument |
trained |
A logical to indicate if the quantities for preprocessing have been estimated. |
levels |
A character vector of values that will be used as the levels.
These are the numeric data converted to character and ordered. This is
modified once |
ordered |
A single logical value; should the factor(s) be ordered? |
skip |
A logical. Should the step be skipped when the
recipe is baked by |
id |
A character string that is unique to this step to identify it. |
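How `transform` and `levels` interact can be sketched in base R, under the assumption that the transformed numeric values are used as positions in `levels`. The vector `stock_level` and the helper `shift` are hypothetical; the level labels are borrowed from the package example.

```r
# Hypothetical integer-coded data starting at zero.
stock_level <- c(0, 1, 3, 2, 0)
amnt <- c("nothin", "meh", "some", "copious")

# Assumed role of `transform`: shift the codes so they start at 1 and can
# be used as positions in `levels`.
shift <- function(x) x + 1

as_fct <- factor(amnt[shift(stock_level)], levels = amnt)
as_fct

# A value that falls outside the available levels becomes NA in this sketch.
out_of_range <- factor(amnt[shift(4)], levels = amnt)
out_of_range
```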
An updated version of recipe
with the new step added to the
sequence of any existing operations.
When you tidy()
this step, a tibble is returned with
columns terms
, ordered
, and id
:
character, the selectors or variables selected
logical, were the factor(s) ordered
character, id of this step
The underlying operation does not allow for case weights.
Other dummy variable and encoding steps:
step_bin2factor()
,
step_count()
,
step_date()
,
step_dummy()
,
step_dummy_extract()
,
step_dummy_multi_choice()
,
step_factor2string()
,
step_holiday()
,
step_indicate_na()
,
step_integer()
,
step_novel()
,
step_ordinalscore()
,
step_other()
,
step_regex()
,
step_relevel()
,
step_string2factor()
,
step_time()
,
step_unknown()
,
step_unorder()
library(dplyr)
data(attrition, package = "modeldata")

attrition %>%
  group_by(StockOptionLevel) %>%
  count()

amnt <- c("nothin", "meh", "some", "copious")

rec <- recipe(Attrition ~ StockOptionLevel, data = attrition) %>%
  step_num2factor(
    StockOptionLevel,
    transform = function(x) x + 1,
    levels = amnt
  )

encoded <- rec %>%
  prep() %>%
  bake(new_data = NULL)

table(encoded$StockOptionLevel, attrition$StockOptionLevel)

# an example for binning
binner <- function(x) {
  x <- cut(x, breaks = 1000 * c(0, 5, 10, 20), include.lowest = TRUE)
  # now return the group number
  as.numeric(x)
}

inc <- c("low", "med", "high")

rec <- recipe(Attrition ~ MonthlyIncome, data = attrition) %>%
  step_num2factor(
    MonthlyIncome,
    transform = binner,
    levels = inc,
    ordered = TRUE
  ) %>%
  prep()

encoded <- bake(rec, new_data = NULL)

table(encoded$MonthlyIncome, binner(attrition$MonthlyIncome))

# What happens when a value is out of range?
ceo <- attrition %>%
  slice(1) %>%
  mutate(MonthlyIncome = 10^10)

bake(rec, ceo)
step_nzv()
creates a specification of a recipe step that will potentially
remove variables that are highly sparse and unbalanced.
step_nzv( recipe, ..., role = NA, trained = FALSE, freq_cut = 95/5, unique_cut = 10, options = list(freq_cut = 95/5, unique_cut = 10), removals = NULL, skip = FALSE, id = rand_id("nzv") )
recipe |
A recipe object. The step will be added to the sequence of operations for this recipe. |
... |
One or more selector functions to choose variables
for this step. See |
role |
Not used by this step since no new variables are created. |
trained |
A logical to indicate if the quantities for preprocessing have been estimated. |
freq_cut , unique_cut
|
Numeric parameters for the filtering process. See the Details section below. |
options |
A list of options for the filter (see Details below). |
removals |
A character string that contains the names of
columns that should be removed. These values are not determined
until |
skip |
A logical. Should the step be skipped when the
recipe is baked by |
id |
A character string that is unique to this step to identify it. |
This step can potentially remove columns from the data set. This may cause issues for subsequent steps in your recipe if the missing columns are specifically referenced by name. To avoid this, see the advice in the Tips for saving recipes and filtering columns section of selections.
This step diagnoses predictors that have one unique value (i.e. are zero variance predictors) or predictors that have both of the following characteristics:
they have very few unique values relative to the number of samples and
the ratio of the frequency of the most common value to the frequency of the second most common value is large.
For example, a near-zero variance predictor is one that, for 1000 samples, has two distinct values, with 999 of the samples sharing a single value.
To be flagged, first, the frequency of the most prevalent value
over the second most frequent value (called the "frequency
ratio") must be above freq_cut
. Secondly, the "percent of
unique values," the number of unique values divided by the total
number of samples (times 100), must also be below
unique_cut
.
In the above example, the frequency ratio is 999 and the unique value percent is 0.2%.
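The two diagnostics can be reproduced by hand for the example above. The sketch below uses only base R; `nzv_stats()` is a hypothetical helper written for illustration and is not part of recipes, so it should be read as a restatement of the definitions rather than as the package's implementation.

```r
# Hypothetical helper (not part of recipes): compute the two quantities
# described in the text for a single predictor.
nzv_stats <- function(x) {
  counts <- sort(table(x), decreasing = TRUE)
  # Frequency ratio: most common value over the second most common value.
  # A predictor with one unique value has no second value, so use Inf.
  freq_ratio <- if (length(counts) < 2) Inf else counts[[1]] / counts[[2]]
  # Percent of unique values relative to the number of samples.
  pct_unique <- 100 * length(counts) / length(x)
  c(freq_ratio = freq_ratio, pct_unique = pct_unique)
}

# 1000 samples, two distinct values, 999 of them identical
x <- c(rep(0, 999), 1)
stats <- nzv_stats(x)
stats

# Compare against the default cutoffs from the usage section
flagged <- stats[["freq_ratio"]] > 95 / 5 && stats[["pct_unique"]] < 10
flagged
```

Both conditions must hold for a predictor to be flagged, which is why a predictor with few unique values but balanced frequencies is kept.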
An updated version of recipe
with the new step added to the
sequence of any existing operations.
When you tidy()
this step, a tibble is returned with
columns terms
and id
:
terms: character, the selectors or variables selected
id: character, id of this step
This step has 2 tuning parameters:
freq_cut
: Frequency Distribution Ratio (type: double, default: 95/5)
unique_cut
: % Unique Values (type: double, default: 10)
This step performs an unsupervised operation that can utilize case weights.
As a result, case weights are only used with frequency weights. For more
information, see the documentation in case_weights and the examples on
tidymodels.org
.
Other variable filter steps:
step_corr()
,
step_filter_missing()
,
step_lincomb()
,
step_rm()
,
step_select()
,
step_zv()
data(biomass, package = "modeldata")

biomass$sparse <- c(1, rep(0, nrow(biomass) - 1))

biomass_tr <- biomass[biomass$dataset == "Training", ]
biomass_te <- biomass[biomass$dataset == "Testing", ]

rec <- recipe(HHV ~ carbon + hydrogen + oxygen + nitrogen + sulfur + sparse,
  data = biomass_tr
)

nzv_filter <- rec %>%
  step_nzv(all_predictors())

filter_obj <- prep(nzv_filter, training = biomass_tr)

filtered_te <- bake(filter_obj, biomass_te)
any(names(filtered_te) == "sparse")

tidy(nzv_filter, number = 1)
tidy(filter_obj, number = 1)
step_ordinalscore()
creates a specification of a recipe step that will
convert ordinal factor variables into numeric scores.
step_ordinalscore( recipe, ..., role = NA, trained = FALSE, columns = NULL, convert = as.numeric, skip = FALSE, id = rand_id("ordinalscore") )
recipe |
A recipe object. The step will be added to the sequence of operations for this recipe. |
... |
One or more selector functions to choose variables
for this step. See |
role |
Not used by this step since no new variables are created. |
trained |
A logical to indicate if the quantities for preprocessing have been estimated. |
columns |
A character string of the selected variable names. This field
is a placeholder and will be populated once |
convert |
A function that takes an ordinal factor vector as an input and outputs a single numeric variable. |
skip |
A logical. Should the step be skipped when the
recipe is baked by |
id |
A character string that is unique to this step to identify it. |
Dummy variables from ordered factors with C
levels will create polynomial basis functions with C-1
terms. As an alternative, this step can be used to translate the
ordered levels into a single numeric vector of values that
represent (subjective) scores. By default, the translation uses
a linear scale (1, 2, 3, ... C
) but custom score
functions can also be used (see the example below).
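The contrast between the two encodings can be seen without building a recipe. The following base R sketch reuses the level names from the example on this page; the custom scores 1, 3, 7 are arbitrary values chosen only to show unequal spacing.

```r
fail_lvls <- c("meh", "annoying", "really_bad")
severity <- factor(fail_lvls, levels = fail_lvls, ordered = TRUE)

# Default contrasts for an ordered factor with C = 3 levels:
# C - 1 = 2 polynomial columns (linear and quadratic).
poly_terms <- contr.poly(nlevels(severity))
ncol(poly_terms)

# Linear scores 1, 2, ..., C, which is what as.numeric() returns
linear_scores <- as.numeric(severity)
linear_scores

# A custom score function with unequal spacing between levels
custom <- function(x) c(1, 3, 7)[as.numeric(x)]
custom_scores <- custom(severity)
custom_scores
```

A single score column keeps the ordering information in one predictor, at the cost of committing to a particular spacing between levels.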
An updated version of recipe
with the new step added to the
sequence of any existing operations.
When you tidy()
this step, a tibble is returned with
columns terms
and id
:
terms: character, the selectors or variables selected
id: character, id of this step
The underlying operation does not allow for case weights.
Other dummy variable and encoding steps:
step_bin2factor()
,
step_count()
,
step_date()
,
step_dummy()
,
step_dummy_extract()
,
step_dummy_multi_choice()
,
step_factor2string()
,
step_holiday()
,
step_indicate_na()
,
step_integer()
,
step_novel()
,
step_num2factor()
,
step_other()
,
step_regex()
,
step_relevel()
,
step_string2factor()
,
step_time()
,
step_unknown()
,
step_unorder()
fail_lvls <- c("meh", "annoying", "really_bad")

ord_data <- data.frame(
  item = c("paperclip", "twitter", "airbag"),
  fail_severity = factor(fail_lvls,
    levels = fail_lvls,
    ordered = TRUE
  )
)

model.matrix(~fail_severity, data = ord_data)

linear_values <- recipe(~ item + fail_severity, data = ord_data) %>%
  step_dummy(item) %>%
  step_ordinalscore(fail_severity)

linear_values <- prep(linear_values, training = ord_data)

bake(linear_values, new_data = NULL)

custom <- function(x) {
  new_values <- c(1, 3, 7)
  new_values[as.numeric(x)]
}

nonlin_scores <- recipe(~ item + fail_severity, data = ord_data) %>%
  step_dummy(item) %>%
  step_ordinalscore(fail_severity, convert = custom)

tidy(nonlin_scores, number = 2)

nonlin_scores <- prep(nonlin_scores, training = ord_data)

bake(nonlin_scores, new_data = NULL)

tidy(nonlin_scores, number = 2)
step_other()
creates a specification of a recipe step that will
potentially pool infrequently occurring values into an "other"
category.
step_other( recipe, ..., role = NA, trained = FALSE, threshold = 0.05, other = "other", objects = NULL, skip = FALSE, id = rand_id("other") )
recipe |
A recipe object. The step will be added to the sequence of operations for this recipe. |
... |
One or more selector functions to choose variables
for this step. See |
role |
Not used by this step since no new variables are created. |
trained |
A logical to indicate if the quantities for preprocessing have been estimated. |
threshold |
A numeric value between 0 and 1, or an integer greater or
equal to one. If less than one, then factor levels with a rate of
occurrence in the training set below |
other |
A single character value for the "other" category. |
objects |
A list of objects that contain the information
to pool infrequent levels that is determined by
|
skip |
A logical. Should the step be skipped when the
recipe is baked by |
id |
A character string that is unique to this step to identify it. |
The overall proportions (or total counts) of the categories are computed. The "other" category is used in place of any categorical levels whose individual proportion (or frequency) in the training set is less than threshold.
If no pooling is done the data are unmodified (although character data may
be changed to factors based on the value of strings_as_factors
in
prep()
). Otherwise, a factor is always returned with
different factor levels.
If threshold is greater than the largest category proportion, so that no level reaches it, all levels except for the most frequent are collapsed to the other level.
If the retained categories include the value of other
, an error is
thrown. If other
is in the list of discarded levels, no error
occurs.
If no pooling is done, novel factor levels are converted to missing. If pooling is needed, they will be placed into the other category.
When data to be processed contains novel levels (i.e., not contained in the training set), the other category is assigned.
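The pooling rule for a proportion threshold can be sketched in base R. `pool_rare()` is a hypothetical helper, not a recipes function; it covers only thresholds below one, ignores missing values and case weights, and assumes that the most frequent level is kept when no level reaches the threshold.

```r
# Hypothetical helper (not part of recipes): pool levels whose training
# proportion falls below `threshold` into a single `other` level.
pool_rare <- function(x, threshold = 0.05, other = "other") {
  x <- as.character(x)
  props <- sort(table(x), decreasing = TRUE) / length(x)
  keep <- names(props)[as.vector(props) >= threshold]
  # Assumption: if no level reaches the threshold, keep the most frequent one.
  if (length(keep) == 0) keep <- names(props)[1]
  pooled <- ifelse(x %in% keep, x, other)
  factor(pooled, levels = unique(c(keep, other)))
}

city <- c(rep("a", 90), rep("b", 8), rep("c", 2))

# "c" has a proportion of 0.02 and is pooled
pooled <- pool_rare(city, threshold = 0.05)
table(pooled)

# No level reaches 0.95, so only the most frequent level survives
pooled_strict <- pool_rare(city, threshold = 0.95)
table(pooled_strict)
```

In a recipe the retained levels are fixed at prep() time and reused on new data; this sketch recomputes them on whatever vector it is given.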
An updated version of recipe
with the new step added to the
sequence of any existing operations.
When you tidy()
this step, a tibble is returned with
columns terms
, retained
, and id
:
terms: character, the selectors or variables selected
retained: character, factor levels not pulled into "other"
id: character, id of this step
This step has 1 tuning parameter:
threshold
: Threshold (type: double, default: 0.05)
This step performs an unsupervised operation that can utilize case weights.
As a result, case weights are only used with frequency weights. For more
information, see the documentation in case_weights and the examples on
tidymodels.org
.
Other dummy variable and encoding steps:
step_bin2factor()
,
step_count()
,
step_date()
,
step_dummy()
,
step_dummy_extract()
,
step_dummy_multi_choice()
,
step_factor2string()
,
step_holiday()
,
step_indicate_na()
,
step_integer()
,
step_novel()
,
step_num2factor()
,
step_ordinalscore()
,
step_regex()
,
step_relevel()
,
step_string2factor()
,
step_time()
,
step_unknown()
,
step_unorder()
data(Sacramento, package = "modeldata")

set.seed(19)
in_train <- sample(1:nrow(Sacramento), size = 800)

sacr_tr <- Sacramento[in_train, ]
sacr_te <- Sacramento[-in_train, ]

rec <- recipe(~ city + zip, data = sacr_tr)

rec <- rec %>%
  step_other(city, zip, threshold = .1, other = "other values")
rec <- prep(rec, training = sacr_tr)

collapsed <- bake(rec, sacr_te)
table(sacr_te$city, collapsed$city, useNA = "always")

tidy(rec, number = 1)

# novel levels are also "othered"
tahiti <- Sacramento[1, ]
tahiti$zip <- "a magical place"
bake(rec, tahiti)

# threshold as a frequency
rec <- recipe(~ city + zip, data = sacr_tr)

rec <- rec %>%
  step_other(city, zip, threshold = 2000, other = "other values")
rec <- prep(rec, training = sacr_tr)

tidy(rec, number = 1)
# compare it to
# sacr_tr %>% count(city, sort = TRUE) %>% top_n(4)
# sacr_tr %>% count(zip, sort = TRUE) %>% top_n(3)
step_pca()
creates a specification of a recipe step that will convert
numeric variables into one or more principal components.
step_pca( recipe, ..., role = "predictor", trained = FALSE, num_comp = 5, threshold = NA, options = list(), res = NULL, columns = NULL, prefix = "PC", keep_original_cols = FALSE, skip = FALSE, id = rand_id("pca") )
recipe |
A recipe object. The step will be added to the sequence of operations for this recipe. |
... |
One or more selector functions to choose variables
for this step. See |
role |
For model terms created by this step, what analysis role should they be assigned? By default, the new columns created by this step from the original variables will be used as predictors in a model. |
trained |
A logical to indicate if the quantities for preprocessing have been estimated. |
num_comp |
The number of components to retain as new predictors.
If |
threshold |
A fraction of the total variance that should be covered by
the components. For example, |
options |
A list of options to the default method for
|
res |
The |
columns |
A character string of the selected variable names. This field
is a placeholder and will be populated once |
prefix |
A character string for the prefix of the resulting new variables. See notes below. |
keep_original_cols |
A logical to keep the original variables in the
output. Defaults to |
skip |
A logical. Should the step be skipped when the
recipe is baked by |
id |
A character string that is unique to this step to identify it. |
Principal component analysis (PCA) is a transformation of a group of variables that produces a new set of artificial features or components. These components are designed to capture the maximum amount of information (i.e. variance) in the original variables. Also, the components are uncorrelated with one another. This means that they can be used to combat large between-variable correlations in a data set.
It is advisable to standardize the variables prior to running
PCA. Here, each variable will be centered and scaled prior to
the PCA calculation. This can be changed using the
options
argument or by using step_center()
and step_scale()
.
The argument num_comp controls the number of components that will be retained (the original variables that are used to derive the components are removed from the data). The new components will have names that begin with prefix and a sequence of numbers. The variable names are padded with zeros. For example, if num_comp < 10, their names will be PC1 - PC9. If num_comp = 101, the names would be PC001 - PC101.
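The zero-padding rule can be illustrated with a few lines of base R. `padded_names()` is a hypothetical helper, not the function recipes uses internally, and it assumes the padding width equals the number of digits in num_comp.

```r
# Hypothetical helper (not part of recipes): build component names whose
# numeric suffix is left-padded with zeros to a common width.
padded_names <- function(num, prefix = "PC") {
  width <- nchar(as.character(num))
  paste0(prefix, formatC(seq_len(num), width = width, flag = "0"))
}

# Fewer than 10 components: no padding is needed
padded_names(3)

# 101 components: suffixes are padded to three digits
head(padded_names(101), 2)
tail(padded_names(101), 1)
```

Padding keeps the new columns in component order when they are sorted alphabetically.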
Alternatively, threshold
can be used to determine the
number of components that are required to capture a specified
fraction of the total variance in the variables.
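How a variance threshold maps to a number of components can be sketched with stats::prcomp() on the USArrests data used in the examples. This is an illustration of the idea, not necessarily the exact computation inside step_pca(); the 0.75 threshold is an arbitrary value chosen for the demonstration.

```r
# Center and scale the variables, then run PCA
pca <- prcomp(USArrests, center = TRUE, scale. = TRUE)

# Proportion of total variance captured by each component
var_explained <- pca$sdev^2 / sum(pca$sdev^2)
cum_var <- cumsum(var_explained)
round(cum_var, 3)

# Smallest number of components whose cumulative share reaches the threshold
threshold <- 0.75
n_comp <- which(cum_var >= threshold)[1]
n_comp

# The component scores are uncorrelated with one another
score_cor <- cor(pca$x)
max(abs(score_cor[upper.tri(score_cor)]))
```

Raising the threshold towards 1 retains more components, up to the number of original variables.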
An updated version of recipe
with the new step added to the
sequence of any existing operations.
When you tidy() this step, two things can happen depending on the type argument. If type = "coef", a tibble is returned with 4 columns terms, value, component, and id:
terms: character, the selectors or variables selected
value: numeric, variable loading
component: character, principal component
id: character, id of this step
If type = "variance", a tibble is returned with 4 columns terms, value, component, and id:
terms: character, type of variance
value: numeric, value of the variance
component: integer, principal component
id: character, id of this step
This step has 2 tuning parameters:
num_comp
: # Components (type: integer, default: 5)
threshold
: Threshold (type: double, default: NA)
This step performs an unsupervised operation that can utilize case weights.
As a result, case weights are only used with frequency weights. For more
information, see the documentation in case_weights and the examples on
tidymodels.org
.
Jolliffe, I. T. (2010). Principal Component Analysis. Springer.
Other multivariate transformation steps:
step_classdist()
,
step_classdist_shrunken()
,
step_depth()
,
step_geodist()
,
step_ica()
,
step_isomap()
,
step_kpca()
,
step_kpca_poly()
,
step_kpca_rbf()
,
step_mutate_at()
,
step_nnmf()
,
step_nnmf_sparse()
,
step_pls()
,
step_ratio()
,
step_spatialsign()
rec <- recipe(~., data = USArrests)
pca_trans <- rec %>%
  step_normalize(all_numeric()) %>%
  step_pca(all_numeric(), num_comp = 3)
pca_estimates <- prep(pca_trans, training = USArrests)
pca_data <- bake(pca_estimates, USArrests)

rng <- extendrange(c(pca_data$PC1, pca_data$PC2))
plot(pca_data$PC1, pca_data$PC2,
  xlim = rng, ylim = rng
)

with_thresh <- rec %>%
  step_normalize(all_numeric()) %>%
  step_pca(all_numeric(), threshold = .99)
with_thresh <- prep(with_thresh, training = USArrests)
bake(with_thresh, USArrests)

tidy(pca_trans, number = 2)
tidy(pca_estimates, number = 2)
tidy(pca_estimates, number = 2, type = "variance")
step_percentile()
creates a specification of a recipe step that replaces
the value of a variable with its percentile from the training set.
step_percentile( recipe, ..., role = NA, trained = FALSE, ref_dist = NULL, options = list(probs = (0:100)/100), outside = "none", skip = FALSE, id = rand_id("percentile") )
recipe |
A recipe object. The step will be added to the sequence of operations for this recipe. |
... |
One or more selector functions to choose variables
for this step. See |
role |
For model terms created by this step, what analysis role should they be assigned? By default, the new columns created by this step from the original variables will be used as predictors in a model. |
trained |
A logical to indicate if the quantities for preprocessing have been estimated. |
ref_dist |
The computed percentiles are stored here once this
preprocessing step has been trained by |
options |
A named list of options to pass to |
outside |
A character, describing how interpolation is to take place
outside the interval |
skip |
A logical. Should the step be skipped when the
recipe is baked by |
id |
A character string that is unique to this step to identify it. |
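The idea behind the step can be sketched in base R: estimate reference quantiles on the training data, then map new values onto that reference by interpolation. The use of stats::approx(), the clamping of out-of-range values via rule = 2, and the return scale of 0 to 1 are assumptions made for this sketch; they are not claims about how step_percentile() is implemented. `to_percentile()` is a hypothetical helper.

```r
# Training values and the default probability grid from the usage section
train <- seq(0, 200, by = 2)
probs <- (0:100) / 100

# Reference distribution: the training quantile at each probability
ref <- quantile(train, probs = probs, names = FALSE)

# Hypothetical helper: interpolate a new value's position in the reference.
# rule = 2 clamps values outside the training range to 0 or 1 (an assumption).
to_percentile <- function(x) {
  approx(x = ref, y = probs, xout = x, rule = 2, ties = mean)$y
}

# Values inside the training range
inside <- to_percentile(c(100, 101))
inside

# Values outside the training range
outside <- to_percentile(c(-5, 1000))
outside
```

Because the reference quantiles come from the training set only, the same new value always maps to the same percentile regardless of the data it arrives with.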
An updated version of recipe
with the new step added to the
sequence of any existing operations.
When you tidy()
this step, a tibble is returned with
columns terms
, value
, percentile
, and id
:
terms: character, the selectors or variables selected
value: numeric, the value at the percentile
percentile: numeric, the percentile as a percentage
id: character, id of this step
This step performs an unsupervised operation that can utilize case weights.
As a result, case weights are only used with frequency weights. For more
information, see the documentation in case_weights and the examples on
tidymodels.org
.
Other individual transformation steps:
step_BoxCox()
,
step_YeoJohnson()
,
step_bs()
,
step_harmonic()
,
step_hyperbolic()
,
step_inverse()
,
step_invlogit()
,
step_log()
,
step_logit()
,
step_mutate()
,
step_ns()
,
step_poly()
,
step_relu()
,
step_sqrt()
data(biomass, package = "modeldata")

biomass_tr <- biomass[biomass$dataset == "Training", ]
biomass_te <- biomass[biomass$dataset == "Testing", ]

rec <- recipe(
  HHV ~ carbon + hydrogen + oxygen + nitrogen + sulfur,
  data = biomass_tr
) %>%
  step_percentile(carbon)

prepped_rec <- prep(rec)

prepped_rec %>%
  bake(biomass_te)

tidy(rec, 1)
tidy(prepped_rec, 1)
step_pls()
creates a specification of a recipe step that will convert
numeric data into one or more new dimensions.
step_pls( recipe, ..., role = "predictor", trained = FALSE, num_comp = 2, predictor_prop = 1, outcome = NULL, options = list(scale = TRUE), preserve = deprecated(), res = NULL, columns = NULL, prefix = "PLS", keep_original_cols = FALSE, skip = FALSE, id = rand_id("pls") )
recipe |
A recipe object. The step will be added to the sequence of operations for this recipe. |
... |
One or more selector functions to choose variables
for this step. See |
role |
For model terms created by this step, what analysis role should they be assigned? By default, the new columns created by this step from the original variables will be used as predictors in a model. |
trained |
A logical to indicate if the quantities for preprocessing have been estimated. |
num_comp |
The number of components to retain as new predictors.
If |
predictor_prop |
The maximum proportion of original predictors that can have non-zero coefficients for each PLS component (via regularization). |
outcome |
When a single outcome is available, character string or call
to |
options |
A list of options to |
preserve |
Use |
res |
A list of results is stored here once this preprocessing step has
been trained by |
columns |
A character string of the selected variable names. This field
is a placeholder and will be populated once |
prefix |
A character string for the prefix of the resulting new variables. See notes below. |
keep_original_cols |
A logical to keep the original variables in the
output. Defaults to |
skip |
A logical. Should the step be skipped when the
recipe is baked by |
id |
A character string that is unique to this step to identify it. |
PLS is a supervised version of principal component analysis that requires the outcome data to compute the new features.
This step requires the Bioconductor mixOmics package. If not installed, the step will stop with a note about installing the package. Install mixOmics using the pak package:
# install.packages("pak")
pak::pak("mixOmics")
The argument num_comp controls the number of components that will be retained (the original variables that are used to derive the components are removed from the data). The new components will have names that begin with prefix and a sequence of numbers. The variable names are padded with zeros. For example, if num_comp < 10, their names will be PLS1 - PLS9. If num_comp = 101, the names would be PLS001 - PLS101.
Sparsity can be encouraged using the predictor_prop
parameter. This affects
each PLS component, and indicates the maximum proportion of predictors with
non-zero coefficients in each component. step_pls()
converts this
proportion to determine the keepX
parameter in mixOmics::spls()
and
mixOmics::splsda()
. See the references in mixOmics::spls()
for details.
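To make the link between a proportion and a per-component predictor count concrete, here is one plausible conversion in base R. The rounding rule and the floor of one predictor are assumptions for illustration; `prop_to_keep()` is a hypothetical helper, and the conversion that step_pls() actually applies may differ.

```r
# Hypothetical helper (not part of recipes): turn a proportion of predictors
# into an integer count, keeping at least one predictor per component.
prop_to_keep <- function(predictor_prop, num_predictors) {
  max(1L, as.integer(round(predictor_prop * num_predictors)))
}

# With 5 predictors, a proportion of 4/5 allows 4 non-zero coefficients
keep_dense <- prop_to_keep(4 / 5, 5)
keep_dense

# A very small proportion still keeps one predictor
keep_min <- prop_to_keep(0.01, 5)
keep_min

# The same count would be repeated for each component, e.g. for num_comp = 3
rep(keep_dense, 3)
```

Lower proportions produce sparser components, since fewer predictors contribute to each one.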
An updated version of recipe
with the new step added to the
sequence of any existing operations.
When you tidy()
this step, a tibble is returned with
columns terms
, value
, component
, and id
:
terms: character, the selectors or variables selected
value: numeric, coefficients defined as
component: character, name of component
id: character, id of this step
This step has 2 tuning parameters:
num_comp
: # Components (type: integer, default: 2)
predictor_prop
: Proportion of Predictors (type: double, default: 1)
The underlying operation does not allow for case weights.
https://en.wikipedia.org/wiki/Partial_least_squares_regression
Rohart F, Gautier B, Singh A, Lê Cao K-A (2017) mixOmics: An R package for 'omics feature selection and multiple data integration. PLoS Comput Biol 13(11): e1005752. doi:10.1371/journal.pcbi.1005752
Other multivariate transformation steps:
step_classdist()
,
step_classdist_shrunken()
,
step_depth()
,
step_geodist()
,
step_ica()
,
step_isomap()
,
step_kpca()
,
step_kpca_poly()
,
step_kpca_rbf()
,
step_mutate_at()
,
step_nnmf()
,
step_nnmf_sparse()
,
step_pca()
,
step_ratio()
,
step_spatialsign()
# requires the Bioconductor mixOmics package
data(biomass, package = "modeldata")

biom_tr <- biomass %>%
  dplyr::filter(dataset == "Training") %>%
  dplyr::select(-dataset, -sample)
biom_te <- biomass %>%
  dplyr::filter(dataset == "Testing") %>%
  dplyr::select(-dataset, -sample, -HHV)

dense_pls <- recipe(HHV ~ ., data = biom_tr) %>%
  step_pls(all_numeric_predictors(), outcome = "HHV", num_comp = 3)

sparse_pls <- recipe(HHV ~ ., data = biom_tr) %>%
  step_pls(all_numeric_predictors(),
    outcome = "HHV", num_comp = 3,
    predictor_prop = 4 / 5
  )

## -----------------------------------------------------------------------------
## PLS discriminant analysis

data(cells, package = "modeldata")

cell_tr <- cells %>%
  dplyr::filter(case == "Train") %>%
  dplyr::select(-case)
cell_te <- cells %>%
  dplyr::filter(case == "Test") %>%
  dplyr::select(-case, -class)

dense_plsda <- recipe(class ~ ., data = cell_tr) %>%
  step_pls(all_numeric_predictors(), outcome = "class", num_comp = 5)

sparse_plsda <- recipe(class ~ ., data = cell_tr) %>%
  step_pls(all_numeric_predictors(),
    outcome = "class", num_comp = 5,
    predictor_prop = 1 / 4
  )
step_poly()
creates a specification of a recipe step that will create new
columns that are basis expansions of variables using orthogonal polynomials.
step_poly( recipe, ..., role = "predictor", trained = FALSE, objects = NULL, degree = 2L, options = list(), keep_original_cols = FALSE, skip = FALSE, id = rand_id("poly") )
recipe |
A recipe object. The step will be added to the sequence of operations for this recipe. |
... |
One or more selector functions to choose variables
for this step. See |
role |
For model terms created by this step, what analysis role should they be assigned? By default, the new columns created by this step from the original variables will be used as predictors in a model. |
trained |
A logical to indicate if the quantities for preprocessing have been estimated. |
objects |
A list of |
degree |
The polynomial degree (an integer). |
options |
A list of options for |
keep_original_cols |
A logical to keep the original variables in the
output. Defaults to |
skip |
A logical. Should the step be skipped when the
recipe is baked by |
id |
A character string that is unique to this step to identify it. |
step_poly() can create new features from a single variable that enable fitting routines to model this variable in a nonlinear manner. The extent of the possible nonlinearity is determined by the degree argument of stats::poly(). New columns are added, and the original variables are removed from the data by default; they can be retained by setting keep_original_cols = TRUE. The naming convention for the new variables is varname_poly_1 and so on.
The orthogonal polynomial expansion is used by default because it yields variables that are uncorrelated and does not produce the large values that raw powers would for large values of degree. Orthogonal polynomial expansions pick up the same signal as their raw (correlated) counterparts.
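The difference between the two expansions can be checked directly with stats::poly(). The data below are synthetic and deterministic, made up for the demonstration.

```r
# Synthetic predictor and outcome
x <- seq(1, 10, length.out = 50)
y <- 1 + 2 * x - 0.3 * x^2 + sin(x)

# Orthogonal (default) and raw quadratic expansions of the same variable
ortho <- poly(x, degree = 2)
raw <- poly(x, degree = 2, raw = TRUE)

# Orthogonal columns are uncorrelated; raw powers are strongly correlated
cor_ortho <- cor(ortho[, 1], ortho[, 2])
cor_raw <- cor(raw[, 1], raw[, 2])
c(orthogonal = cor_ortho, raw = cor_raw)

# Raw powers grow quickly, orthogonal columns stay small
c(orthogonal = max(abs(ortho)), raw = max(abs(raw)))

# Both expansions span the same space, so a linear model fits identically
fit_ortho <- lm(y ~ ortho)
fit_raw <- lm(y ~ raw)
same_fit <- all.equal(unname(fitted(fit_ortho)), unname(fitted(fit_raw)))
same_fit
```

The coefficients differ between the two fits, but the predictions do not, which is the sense in which both expansions carry the same signal.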
An updated version of recipe
with the new step added to the
sequence of any existing operations.
When you tidy()
this step, a tibble is returned with
columns terms
, degree
, and id
:
terms: character, the selectors or variables selected
degree: integer, the polynomial degree
id: character, id of this step
This step has 1 tuning parameter:
degree
: Polynomial Degree (type: integer, default: 2)
The underlying operation does not allow for case weights.
Other individual transformation steps:
step_BoxCox()
,
step_YeoJohnson()
,
step_bs()
,
step_harmonic()
,
step_hyperbolic()
,
step_inverse()
,
step_invlogit()
,
step_log()
,
step_logit()
,
step_mutate()
,
step_ns()
,
step_percentile()
,
step_relu()
,
step_sqrt()
data(biomass, package = "modeldata")

biomass_tr <- biomass[biomass$dataset == "Training", ]
biomass_te <- biomass[biomass$dataset == "Testing", ]

rec <- recipe(
  HHV ~ carbon + hydrogen + oxygen + nitrogen + sulfur,
  data = biomass_tr
)

quadratic <- rec %>%
  step_poly(carbon, hydrogen)
quadratic <- prep(quadratic, training = biomass_tr)

expanded <- bake(quadratic, biomass_te)
expanded

tidy(quadratic, number = 1)
step_poly_bernstein()
creates a specification of a recipe step that
creates Bernstein polynomial features.
step_poly_bernstein( recipe, ..., role = NA, trained = FALSE, degree = 10, complete_set = FALSE, options = NULL, keep_original_cols = FALSE, results = NULL, skip = FALSE, id = rand_id("poly_bernstein") )
recipe |
A recipe object. The step will be added to the sequence of operations for this recipe. |
... |
One or more selector functions to choose variables
for this step. See |
role |
For model terms created by this step, what analysis role should they be assigned? By default, the new columns created by this step from the original variables will be used as predictors in a model. |
trained |
A logical to indicate if the quantities for preprocessing have been estimated. |
degree |
The degrees of the polynomial. As the degrees for a polynomial increase, more flexible and complex curves can be generated. |
complete_set |
If |
options |
A list of options for |
keep_original_cols |
A logical to keep the original variables in the output. Defaults to FALSE. |
results |
A list of objects created once the step has been trained. |
skip |
A logical. Should the step be skipped when the
recipe is baked by |
id |
A character string that is unique to this step to identify it. |
Polynomial transformations take a numeric column and create multiple features that, when used in a model, can estimate nonlinear trends between the column and some outcome. The degrees of freedom determine how many new features are added to the data.
If the spline expansion fails for a selected column, the step will remove that column's results (but will retain the original data). Use the tidy() method to determine which columns were used.
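To make the idea of a Bernstein polynomial expansion concrete, here is a minimal base R sketch of the textbook Bernstein basis. The helper `bernstein_basis()` is hypothetical and is not part of recipes; rescaling to the observed range is an assumption, and the step's own computation may differ in details such as boundary handling and which columns are kept.

```r
# Illustrative only: evaluate the Bernstein basis of a given degree in base R.
# `bernstein_basis` is a hypothetical helper, not a recipes function.
bernstein_basis <- function(x, degree) {
  # Rescale x to [0, 1] using its observed range (an assumed boundary choice).
  t <- (x - min(x)) / (max(x) - min(x))
  k <- 0:degree
  # Column k + 1 holds choose(degree, k) * t^k * (1 - t)^(degree - k).
  basis <- sapply(k, function(j) choose(degree, j) * t^j * (1 - t)^(degree - j))
  colnames(basis) <- paste0("b", k)
  basis
}

x <- seq(-93.7, -93.6, length.out = 5)
features <- bernstein_basis(x, degree = 3)
dim(features)      # one row per value of x, degree + 1 columns
rowSums(features)  # the full basis sums to 1 at every point
```

Each column is non-negative and peaks at a different location, which is why a model can combine the columns to trace a smooth nonlinear trend.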
An object with classes "step_poly_bernstein" and "step".
When you tidy() this step, a tibble is returned with columns terms and id:
terms: character, the selectors or variables selected
id: character, id of this step
This step has 1 tuning parameter:
degree: Polynomial Degree (type: integer, default: 10)
The underlying operation does not allow for case weights.
library(tidyr)
library(dplyr)
library(ggplot2)
data(ames, package = "modeldata")
spline_rec <- recipe(Sale_Price ~ Longitude, data = ames) %>%
  step_poly_bernstein(Longitude, degree = 6, keep_original_cols = TRUE) %>%
  prep()
tidy(spline_rec, number = 1)
# Show where each feature is active
spline_rec %>%
  bake(new_data = NULL, -Sale_Price) %>%
  pivot_longer(c(starts_with("Longitude_")), names_to = "feature", values_to = "value") %>%
  mutate(feature = gsub("Longitude_", "feature ", feature)) %>%
  filter(value > 0) %>%
  ggplot(aes(x = Longitude, y = value)) +
  geom_line() +
  facet_wrap(~ feature)
step_profile() creates a specification of a recipe step that will fix the levels of all variables but one and will create a sequence of values for the remaining variable. This step can be helpful when creating partial regression plots for additive models.
step_profile( recipe, ..., profile = NULL, pct = 0.5, index = 1, grid = list(pctl = TRUE, len = 100), columns = NULL, role = NA, trained = FALSE, skip = FALSE, id = rand_id("profile") )
recipe |
A recipe object. The step will be added to the sequence of operations for this recipe. |
... |
One or more selector functions to choose variables
for this step. See |
profile |
A call to |
pct |
A value between 0 and 1 that is the percentile to fix continuous variables. This is applied to all continuous variables captured by the selectors. For date variables, either the minimum, median, or maximum is used based on their distance to
|
index |
The level that qualitative variables will be fixed. If the variables are character (not factors), this will be the index of the sorted unique values. This is applied to all qualitative variables captured by the selectors. |
grid |
A named list with elements |
columns |
A character string of the selected variable names. This field
is a placeholder and will be populated once |
role |
Not used by this step since no new variables are created. |
trained |
A logical to indicate if the quantities for preprocessing have been estimated. |
skip |
A logical. Should the step be skipped when the
recipe is baked by |
id |
A character string that is unique to this step to identify it. |
This step is atypical in that, when baked, the new_data argument is ignored; the resulting data set is based on the fixed and profiled variable's information.
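The following base R sketch illustrates the general idea of profiling on mtcars: one variable is held at a percentile while another is varied over a grid. It is not the step's implementation. The reading of `grid` (percentiles of the data when pctl = TRUE, evenly spaced values between the minimum and maximum when pctl = FALSE) is an assumption based on the two "grid creation methods" contrasted in the examples.

```r
# Sketch only: profile `disp` over a grid while fixing `hp` at a percentile.
pct <- 0.5   # percentile used to fix continuous variables
len <- 100   # number of grid points for the profiled variable

# Percentile-based grid (assumed meaning of pctl = TRUE).
disp_pctl <- quantile(mtcars$disp, probs = seq(0, 1, length.out = len), names = FALSE)

# Evenly spaced grid between the minimum and maximum (assumed meaning of pctl = FALSE).
disp_even <- seq(min(mtcars$disp), max(mtcars$disp), length.out = len)

# The fixed value is recycled across every row of the profile.
profile_data <- data.frame(
  disp = disp_even,
  hp = quantile(mtcars$hp, probs = pct, names = FALSE)
)
head(profile_data)
```

Because every row is generated from the training data's summaries, no new data set is needed to produce the profile, which matches the note above that new_data is ignored.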
An updated version of recipe with the new step added to the sequence of any existing operations.
When you tidy() this step, a tibble is returned with columns terms, type, and id:
terms: character, the selectors or variables selected
type: character, "fixed" or "profiled"
id: character, id of this step
The underlying operation does not allow for case weights.
data(Sacramento, package = "modeldata")
# Set up a grid across beds but keep the other values fixed
recipe(~ city + price + beds, data = Sacramento) %>%
  step_profile(-beds, profile = vars(beds)) %>%
  prep(training = Sacramento) %>%
  bake(new_data = NULL)
##########
# An *additive* model; not for use when there are interactions or
# other functional relationships between predictors
lin_mod <- lm(mpg ~ poly(disp, 2) + cyl + hp, data = mtcars)
# Show the difference in the two grid creation methods
disp_pctl <- recipe(~ disp + cyl + hp, data = mtcars) %>%
  step_profile(-disp, profile = vars(disp)) %>%
  prep(training = mtcars)
disp_grid <- recipe(~ disp + cyl + hp, data = mtcars) %>%
  step_profile(
    -disp,
    profile = vars(disp),
    grid = list(pctl = FALSE, len = 100)
  ) %>%
  prep(training = mtcars)
grid_data <- bake(disp_grid, new_data = NULL)
grid_data <- grid_data %>%
  mutate(
    pred = predict(lin_mod, grid_data),
    method = "grid"
  )
pctl_data <- bake(disp_pctl, new_data = NULL)
pctl_data <- pctl_data %>%
  mutate(
    pred = predict(lin_mod, pctl_data),
    method = "percentile"
  )
plot_data <- bind_rows(grid_data, pctl_data)
library(ggplot2)
ggplot(plot_data, aes(x = disp, y = pred)) +
  geom_point(alpha = .5, cex = 1) +
  facet_wrap(~method)
step_range() creates a specification of a recipe step that will normalize numeric data to be within a pre-defined range of values.
step_range( recipe, ..., role = NA, trained = FALSE, min = 0, max = 1, clipping = TRUE, ranges = NULL, skip = FALSE, id = rand_id("range") )
recipe |
A recipe object. The step will be added to the sequence of operations for this recipe. |
... |
One or more selector functions to choose variables
for this step. See |
role |
Not used by this step since no new variables are created. |
trained |
A logical to indicate if the quantities for preprocessing have been estimated. |
min, max |
Single numeric values for the smallest (or largest) value in the transformed data. |
clipping |
A single logical value for determining whether
application of transformation onto new data should be forced
to be inside |
ranges |
A character vector of variables that will be
normalized. Note that this is ignored until the values are
determined by |
skip |
A logical. Should the step be skipped when the
recipe is baked by |
id |
A character string that is unique to this step to identify it. |
When a new data point is outside of the ranges seen in the training set and clipping = TRUE, the new values are truncated at min or max.
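As a rough illustration of range normalization with truncation, the sketch below rescales new values using the minimum and maximum observed in a training vector. The helper `rescale_range()` and its argument names `new_min` and `new_max` are hypothetical; the usual min-max formula is assumed here rather than taken from the package source.

```r
# Hypothetical helper: min-max scaling based on training data, with optional clipping.
rescale_range <- function(x_new, x_train, new_min = 0, new_max = 1, clipping = TRUE) {
  lo <- min(x_train, na.rm = TRUE)
  hi <- max(x_train, na.rm = TRUE)
  # Map [lo, hi] linearly onto [new_min, new_max].
  out <- (x_new - lo) / (hi - lo) * (new_max - new_min) + new_min
  if (clipping) {
    # Truncate values that fall outside the target range.
    out <- pmin(pmax(out, new_min), new_max)
  }
  out
}

train <- c(0, 10)
new_values <- c(0, 5, 10, 15)
rescale_range(new_values, train)                    # 15 is truncated to 1
rescale_range(new_values, train, clipping = FALSE)  # 15 maps to 1.5
```

Truncation keeps downstream models from seeing values outside the range they were trained on, at the cost of losing information about how far outside the new point was.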
An updated version of recipe with the new step added to the sequence of any existing operations.
When you tidy() this step, a tibble is returned with columns terms, min, max, and id:
terms: character, the selectors or variables selected
min: numeric, lower range
max: numeric, upper range
id: character, id of this step
The underlying operation does not allow for case weights.
Other normalization steps: step_center(), step_normalize(), step_scale()
data(biomass, package = "modeldata")
biomass_tr <- biomass[biomass$dataset == "Training", ]
biomass_te <- biomass[biomass$dataset == "Testing", ]
rec <- recipe(
  HHV ~ carbon + hydrogen + oxygen + nitrogen + sulfur,
  data = biomass_tr
)
ranged_trans <- rec %>%
  step_range(carbon, hydrogen)
ranged_obj <- prep(ranged_trans, training = biomass_tr)
transformed_te <- bake(ranged_obj, biomass_te)
biomass_te[1:10, names(transformed_te)]
transformed_te
tidy(ranged_trans, number = 1)
tidy(ranged_obj, number = 1)
step_ratio() creates a specification of a recipe step that will create one or more ratios from selected numeric variables.
step_ratio( recipe, ..., role = "predictor", trained = FALSE, denom = denom_vars(), naming = function(numer, denom) { make.names(paste(numer, denom, sep = "_o_")) }, columns = NULL, keep_original_cols = TRUE, skip = FALSE, id = rand_id("ratio") )
denom_vars(...)
recipe |
A recipe object. The step will be added to the sequence of operations for this recipe. |
... |
One or more selector functions to choose which
variables will be used in the numerator of the ratio.
When used with |
role |
For model terms created by this step, what analysis role should they be assigned? By default, the new columns created by this step from the original variables will be used as predictors in a model. |
trained |
A logical to indicate if the quantities for preprocessing have been estimated. |
denom |
A call to |
naming |
A function that defines the naming convention for new ratio columns. |
columns |
A character string of the selected variable names. This field
is a placeholder and will be populated once |
keep_original_cols |
A logical to keep the original variables in the output. Defaults to TRUE. |
skip |
A logical. Should the step be skipped when the
recipe is baked by |
id |
A character string that is unique to this step to identify it. |
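The default naming function shown in the usage can be tried on its own. The sketch below applies it to two numerator names and then computes the corresponding ratios by hand on a toy data frame; the data frame `toy` and its values are made up for illustration.

```r
# The default naming function from the usage above.
naming <- function(numer, denom) {
  make.names(paste(numer, denom, sep = "_o_"))
}

numer <- c("carbon", "hydrogen")
new_names <- naming(numer, "total")
new_names

# Toy data (made up) to show what the new columns would contain.
toy <- data.frame(carbon = c(50, 40), hydrogen = c(5, 6), total = c(100, 80))
ratios <- as.data.frame(lapply(toy[numer], function(col) col / toy$total))
names(ratios) <- new_names
ratios
```

A custom function with the same two arguments can be supplied to `naming` if a different separator or prefix is preferred.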
An updated version of recipe with the new step added to the sequence of any existing operations.
When you tidy() this step, a tibble is returned with columns terms, denom, and id:
terms: character, the selectors or variables selected
denom: character, name of denominator selected
id: character, id of this step
The underlying operation does not allow for case weights.
Other multivariate transformation steps: step_classdist(), step_classdist_shrunken(), step_depth(), step_geodist(), step_ica(), step_isomap(), step_kpca(), step_kpca_poly(), step_kpca_rbf(), step_mutate_at(), step_nnmf(), step_nnmf_sparse(), step_pca(), step_pls(), step_spatialsign()
library(recipes)
data(biomass, package = "modeldata")
biomass$total <- apply(biomass[, 3:7], 1, sum)
biomass_tr <- biomass[biomass$dataset == "Training", ]
biomass_te <- biomass[biomass$dataset == "Testing", ]
rec <- recipe(
  HHV ~ carbon + hydrogen + oxygen + nitrogen + sulfur + total,
  data = biomass_tr
)
ratio_recipe <- rec %>%
  # all predictors over total
  step_ratio(
    all_numeric_predictors(),
    denom = denom_vars(total),
    keep_original_cols = FALSE
  )
ratio_recipe <- prep(ratio_recipe, training = biomass_tr)
ratio_data <- bake(ratio_recipe, biomass_te)
ratio_data
step_regex() creates a specification of a recipe step that will create a new dummy variable based on a regular expression.
step_regex( recipe, ..., role = "predictor", trained = FALSE, pattern = ".", options = list(), result = make.names(pattern), input = NULL, keep_original_cols = TRUE, skip = FALSE, id = rand_id("regex") )
recipe |
A recipe object. The step will be added to the sequence of operations for this recipe. |
... |
A single selector function to choose which variable
will be searched for the regex pattern. The selector should resolve
to a single variable. See |
role |
For model terms created by this step, what analysis role should they be assigned? By default, the new columns created by this step from the original variables will be used as predictors in a model. |
trained |
A logical to indicate if the quantities for preprocessing have been estimated. |
pattern |
A character string containing a regular
expression (or character string for |
options |
A list of options to |
result |
A single character value for the name of the new variable. It should be a valid column name. |
input |
A single character value for the name of the
variable being searched. This is |
keep_original_cols |
A logical to keep the original variables in the output. Defaults to TRUE. |
skip |
A logical. Should the step be skipped when the
recipe is baked by |
id |
A character string that is unique to this step to identify it. |
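The kind of indicator this step produces can be sketched in base R with grepl(). That the step relies on grepl() and encodes matches as 1 and non-matches as 0 is an assumption made for this illustration; the strings in `description` are made up.

```r
# Made-up strings standing in for a text column.
description <- c("rocky outcrop", "stony soil", "loam", "Rock land")

# 1 when the pattern matches, 0 otherwise (assumed encoding).
rocks <- as.integer(grepl("(rock|stony)", description))
rocks

# Extra matching options, such as ignoring case, change which rows match.
rocks_ic <- as.integer(grepl("(rock|stony)", description, ignore.case = TRUE))
rocks_ic
```

Matching is case-sensitive by default, so "Rock land" is only flagged once case is ignored.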
An updated version of recipe with the new step added to the sequence of any existing operations.
When you tidy() this step, a tibble is returned with columns terms, result, and id:
terms: character, the selectors or variables selected
result: character, new column name
id: character, id of this step
The underlying operation does not allow for case weights.
Other dummy variable and encoding steps: step_bin2factor(), step_count(), step_date(), step_dummy(), step_dummy_extract(), step_dummy_multi_choice(), step_factor2string(), step_holiday(), step_indicate_na(), step_integer(), step_novel(), step_num2factor(), step_ordinalscore(), step_other(), step_relevel(), step_string2factor(), step_time(), step_unknown(), step_unorder()
data(covers, package = "modeldata")
rec <- recipe(~description, covers) %>%
  step_regex(description, pattern = "(rock|stony)", result = "rocks") %>%
  step_regex(description, pattern = "ratake families")
rec2 <- prep(rec, training = covers)
rec2
with_dummies <- bake(rec2, new_data = covers)
with_dummies
tidy(rec, number = 1)
tidy(rec2, number = 1)
step_relevel() creates a specification of a recipe step that will reorder the provided factor columns so that the level specified by ref_level is first. This is useful for contr.treatment() contrasts which take the first level as the reference.
step_relevel( recipe, ..., role = NA, trained = FALSE, ref_level, objects = NULL, skip = FALSE, id = rand_id("relevel") )
recipe |
A recipe object. The step will be added to the sequence of operations for this recipe. |
... |
One or more selector functions to choose variables
for this step. See |
role |
Not used by this step since no new variables are created. |
trained |
A logical to indicate if the quantities for preprocessing have been estimated. |
ref_level |
A single character value that will be used to relevel the factor column(s) (if the level is present). |
objects |
A list of objects that contain the information
on factor levels that will be determined by |
skip |
A logical. Should the step be skipped when the
recipe is baked by |
id |
A character string that is unique to this step to identify it. |
The selected variables are releveled to a level (given by ref_level), placing the ref_level in the first position.
Note that if the original columns are character, they will be converted to factors by this step.
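The effect on the level ordering can be seen with base R's stats::relevel(), used here only as an analogy for what the step does to each selected column. The vector `city` is made up.

```r
# Made-up character data; converting to a factor sorts the levels alphabetically.
city <- c("SACRAMENTO", "ELK_GROVE", "UNKNOWN", "SACRAMENTO")
city_f <- factor(city)
levels(city_f)

# Move the reference level to the first position; the data values are unchanged.
releveled <- stats::relevel(city_f, ref = "UNKNOWN")
levels(releveled)
```

Only the order of the levels changes, so treatment contrasts built from `releveled` would use "UNKNOWN" as the baseline.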
An updated version of recipe with the new step added to the sequence of any existing operations.
When you tidy() this step, a tibble is returned with columns terms, value, and id:
terms: character, the selectors or variables selected
value: character, the value of ref_level
id: character, id of this step
The underlying operation does not allow for case weights.
Other dummy variable and encoding steps: step_bin2factor(), step_count(), step_date(), step_dummy(), step_dummy_extract(), step_dummy_multi_choice(), step_factor2string(), step_holiday(), step_indicate_na(), step_integer(), step_novel(), step_num2factor(), step_ordinalscore(), step_other(), step_regex(), step_string2factor(), step_time(), step_unknown(), step_unorder()
data(Sacramento, package = "modeldata")
rec <- recipe(~ city + zip, data = Sacramento) %>%
  step_unknown(city, new_level = "UNKNOWN") %>%
  step_relevel(city, ref_level = "UNKNOWN") %>%
  prep()
data <- bake(rec, Sacramento)
levels(data$city)
step_relu() creates a specification of a recipe step that will add the rectified linear or softplus transformations of a variable to the data set.
step_relu( recipe, ..., role = "predictor", trained = FALSE, shift = 0, reverse = FALSE, smooth = FALSE, prefix = "right_relu_", columns = NULL, skip = FALSE, id = rand_id("relu") )
recipe |
A recipe object. The step will be added to the sequence of operations for this recipe. |
... |
One or more selector functions to choose variables
for this step. See |
role |
For model terms created by this step, what analysis role should they be assigned? By default, the new columns created by this step from the original variables will be used as predictors in a model. |
trained |
A logical to indicate if the quantities for preprocessing have been estimated. |
shift |
A numeric value dictating a translation to apply to the data. |
reverse |
A logical to indicate if the left hinge should be used as opposed to the right hinge. |
smooth |
A logical indicating if the softplus function, a smooth approximation to the rectified linear transformation, should be used. |
prefix |
A prefix for generated column names, defaults to "right_relu_" for right hinge transformation and "left_relu_" for reversed/left hinge transformations. |
columns |
A character string of the selected variable names. This field
is a placeholder and will be populated once |
skip |
A logical. Should the step be skipped when the
recipe is baked by |
id |
A character string that is unique to this step to identify it. |
The rectified linear transformation is calculated as
$$\max(0, x - c)$$
where c is the value of shift, and is also known as the ReLu or right hinge function. If reverse is true, then the transformation is reflected about the y-axis, like so:
$$\max(0, c - x)$$
Setting the smooth option to true will instead calculate a smooth approximation to ReLu according to
$$\ln(1 + e^{x - c})$$
The reverse argument may also be applied to this transformation.
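A small base R sketch of these transformations follows. It assumes the right hinge is max(0, x - c), the left hinge is max(0, c - x), and the smooth version is the softplus ln(1 + exp(.)), with c given by shift. The function `relu()` is a hypothetical helper, not the step's internal code.

```r
# Hypothetical helper implementing the hinge and softplus transformations.
relu <- function(x, shift = 0, reverse = FALSE, smooth = FALSE) {
  # Reflect about the hinge location when the left hinge is requested.
  z <- if (reverse) shift - x else x - shift
  # Softplus is a smooth approximation to the hinge.
  if (smooth) log1p(exp(z)) else pmax(0, z)
}

x <- c(38, 40, 45)
relu(x, shift = 40)                  # right hinge: 0 0 5
relu(x, shift = 40, reverse = TRUE)  # left hinge: 2 0 0
relu(x, shift = 40, smooth = TRUE)   # strictly positive and above the hinge
```

At the hinge location the softplus equals ln(2) rather than 0, which is the price of having a function that is differentiable everywhere.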
An updated version of recipe with the new step added to the sequence of any existing operations.
The rectified linear transformation is used in Multivariate Adaptive Regression Splines as a basis function to fit piecewise linear functions to data in a strategy similar to that employed in tree based models. The transformation is a popular choice as an activation function in many neural networks, which could then be seen as a stacked generalization of MARS when making use of ReLu activations. The hinge function also appears in the loss function of Support Vector Machines, where it penalizes residuals only if they are within a certain margin of the decision boundary.
When you tidy() this step, a tibble is returned with columns terms, shift, reverse, and id:
terms: character, the selectors or variables selected
shift: numeric, location of hinge
reverse: logical, whether left hinge is used
id: character, id of this step
The underlying operation does not allow for case weights.
Other individual transformation steps: step_BoxCox(), step_YeoJohnson(), step_bs(), step_harmonic(), step_hyperbolic(), step_inverse(), step_invlogit(), step_log(), step_logit(), step_mutate(), step_ns(), step_percentile(), step_poly(), step_sqrt()
data(biomass, package = "modeldata")
biomass_tr <- biomass[biomass$dataset == "Training", ]
biomass_te <- biomass[biomass$dataset == "Testing", ]
rec <- recipe(
  HHV ~ carbon + hydrogen + oxygen + nitrogen + sulfur,
  data = biomass_tr
)
transformed_te <- rec %>%
  step_relu(carbon, shift = 40) %>%
  prep(biomass_tr) %>%
  bake(biomass_te)
transformed_te
step_rename() creates a specification of a recipe step that will rename variables using dplyr::rename().
step_rename( recipe, ..., role = "predictor", trained = FALSE, inputs = NULL, skip = FALSE, id = rand_id("rename") )
recipe |
A recipe object. The step will be added to the sequence of operations for this recipe. |
... |
One or more unquoted expressions separated by commas. See
|
role |
For model terms created by this step, what analysis role should they be assigned? By default, the new columns created by this step from the original variables will be used as predictors in a model. |
trained |
A logical to indicate if the quantities for preprocessing have been estimated. |
inputs |
Quosure(s) of |
skip |
A logical. Should the step be skipped when the
recipe is baked by |
id |
A character string that is unique to this step to identify it. |
When an object in the user's global environment is referenced in the expression defining the new variable(s), it is a good idea to use quasiquotation (e.g. !!) to embed the value of the object in the expression (to be portable between sessions).
An updated version of recipe with the new step added to the sequence of any existing operations.
When you tidy() this step, a tibble is returned with columns terms, value, and id:
terms: character, the selectors or variables selected
value: character, rename expression
id: character, id of this step
The underlying operation does not allow for case weights.
Other dplyr steps: step_arrange(), step_filter(), step_mutate(), step_mutate_at(), step_rename_at(), step_sample(), step_select(), step_slice()
recipe(~., data = iris) %>%
  step_rename(Sepal_Width = Sepal.Width) %>%
  prep() %>%
  bake(new_data = NULL) %>%
  slice(1:5)
vars <- c(var1 = "cyl", var2 = "am")
car_rec <- recipe(~., data = mtcars) %>%
  step_rename(!!!vars)
car_rec %>%
  prep() %>%
  bake(new_data = NULL)
car_rec %>%
  tidy(number = 1)
step_rename_at() creates a specification of a recipe step that will rename the selected variables using a common function via dplyr::rename_at().
step_rename_at( recipe, ..., fn, role = "predictor", trained = FALSE, inputs = NULL, skip = FALSE, id = rand_id("rename_at") )
recipe |
A recipe object. The step will be added to the sequence of operations for this recipe. |
... |
One or more selector functions to choose variables
for this step. See |
fn |
A function |
role |
For model terms created by this step, what analysis role should they be assigned? By default, the new columns created by this step from the original variables will be used as predictors in a model. |
trained |
A logical to indicate if the quantities for preprocessing have been estimated. |
inputs |
A vector of column names populated by |
skip |
A logical. Should the step be skipped when the
recipe is baked by |
id |
A character string that is unique to this step to identify it. |
An updated version of recipe with the new step added to the sequence of any existing operations.
When you tidy() this step, a tibble is returned with columns terms and id:
terms: character, the selectors or variables selected
id: character, id of this step
The underlying operation does not allow for case weights.
Other dplyr steps: step_arrange(), step_filter(), step_mutate(), step_mutate_at(), step_rename(), step_sample(), step_select(), step_slice()
library(dplyr)
recipe(~., data = iris) %>%
  step_rename_at(all_predictors(), fn = ~ gsub(".", "_", ., fixed = TRUE)) %>%
  prep() %>%
  bake(new_data = NULL) %>%
  slice(1:10)
step_rm() creates a specification of a recipe step that will remove selected variables.
step_rm( recipe, ..., role = NA, trained = FALSE, removals = NULL, skip = FALSE, id = rand_id("rm") )
recipe |
A recipe object. The step will be added to the sequence of operations for this recipe. |
... |
One or more selector functions to choose variables
for this step. See |
role |
Not used by this step since no new variables are created. |
trained |
A logical to indicate if the quantities for preprocessing have been estimated. |
removals |
A character string that contains the names of
columns that should be removed. These values are not determined
until |
skip |
A logical. Should the step be skipped when the
recipe is baked by |
id |
A character string that is unique to this step to identify it. |
This step can potentially remove columns from the data set. This may cause issues for subsequent steps in your recipe if the missing columns are specifically referenced by name. To avoid this, see the advice in the Tips for saving recipes and filtering columns section of selections.
An updated version of recipe with the new step added to the sequence of any existing operations.
When you tidy() this step, a tibble is returned with columns terms and id:
terms: character, the selectors or variables selected
id: character, id of this step
The underlying operation does not allow for case weights.
Other variable filter steps: step_corr(), step_filter_missing(), step_lincomb(), step_nzv(), step_select(), step_zv()
data(biomass, package = "modeldata")
biomass_tr <- biomass[biomass$dataset == "Training", ]
biomass_te <- biomass[biomass$dataset == "Testing", ]
rec <- recipe(
  HHV ~ carbon + hydrogen + oxygen + nitrogen + sulfur,
  data = biomass_tr
)
library(dplyr)
smaller_set <- rec %>%
  step_rm(contains("gen"))
smaller_set <- prep(smaller_set, training = biomass_tr)
filtered_te <- bake(smaller_set, biomass_te)
filtered_te
tidy(smaller_set, number = 1)
step_sample() creates a specification of a recipe step that will sample rows using dplyr::sample_n() or dplyr::sample_frac().
step_sample( recipe, ..., role = NA, trained = FALSE, size = NULL, replace = FALSE, skip = TRUE, id = rand_id("sample") )
recipe |
A recipe object. The step will be added to the sequence of operations for this recipe. |
... |
Argument ignored; included for consistency with other step specification functions. |
role |
Not used by this step since no new variables are created. |
trained |
A logical to indicate if the quantities for preprocessing have been estimated. |
size |
An integer or fraction. If the value is within (0, 1),
|
replace |
Sample with or without replacement? |
skip |
A logical. Should the step be skipped when the
recipe is baked by |
id |
A character string that is unique to this step to identify it. |
An updated version of recipe with the new step added to the sequence of any existing operations.
This step can entirely remove observations (rows of data), which can have unintended and/or problematic consequences when applying the step to new data later via bake(). Consider whether skip = TRUE or skip = FALSE is more appropriate in any given use case. In most instances that affect the rows of the data being predicted, this step probably should not be applied at all; instead, execute operations like this outside and before starting a preprocessing recipe().
When you tidy() this step, a tibble is returned with columns terms, size, replace, and id:
terms: character, the selectors or variables selected
size: numeric, amount of sampling
replace: logical, whether sampling is done with replacement
id: character, id of this step
This step performs an unsupervised operation that can utilize case weights.
As a result, case weights are only used with frequency weights. For more
information, see the documentation in case_weights and the examples on
tidymodels.org
.
Other row operation steps: step_arrange(), step_filter(), step_impute_roll(), step_lag(), step_naomit(), step_shuffle(), step_slice()

Other dplyr steps: step_arrange(), step_filter(), step_mutate(), step_mutate_at(), step_rename(), step_rename_at(), step_select(), step_slice()
# Uses `sample_n`
recipe(~., data = mtcars) %>%
  step_sample(size = 1) %>%
  prep(training = mtcars) %>%
  bake(new_data = NULL) %>%
  nrow()

# Uses `sample_frac`
recipe(~., data = mtcars) %>%
  step_sample(size = 0.9999) %>%
  prep(training = mtcars) %>%
  bake(new_data = NULL) %>%
  nrow()

# Uses `sample_n` and returns _at maximum_ 20 samples.
smaller_cars <-
  recipe(~., data = mtcars) %>%
  step_sample() %>%
  prep(training = mtcars %>% slice(1:20))

bake(smaller_cars, new_data = NULL) %>% nrow()
bake(smaller_cars, new_data = mtcars %>% slice(21:32)) %>% nrow()
step_scale() creates a specification of a recipe step that will normalize numeric data to have a standard deviation of one.
step_scale(
  recipe,
  ...,
  role = NA,
  trained = FALSE,
  sds = NULL,
  factor = 1,
  na_rm = TRUE,
  skip = FALSE,
  id = rand_id("scale")
)
recipe | A recipe object. The step will be added to the sequence of operations for this recipe. |
... | One or more selector functions to choose variables for this step. See |
role | Not used by this step since no new variables are created. |
trained | A logical to indicate if the quantities for preprocessing have been estimated. |
sds | A named numeric vector of standard deviations. This is |
factor | A numeric value of either 1 or 2 that scales the numeric inputs by one or two standard deviations. By dividing by two standard deviations, the coefficients attached to continuous predictors can be interpreted the same way as with binary inputs. Defaults to |
na_rm | A logical value indicating whether |
skip | A logical. Should the step be skipped when the recipe is baked by |
id | A character string that is unique to this step to identify it. |
Scaling data means that the standard deviation of a variable is divided out of the data. step_scale estimates the variable standard deviations from the data used in the training argument of prep.recipe. bake.recipe then applies the scaling to new data sets using these standard deviations.
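The estimate-then-apply pattern described above can be sketched in base R. This is only an illustration of the arithmetic, not the package's internal code: the toy data, the helper scale_with(), and the name sd_factor (standing in for the factor argument, assumed here to multiply the standard deviation before dividing) are all hypothetical.

```r
# Hypothetical toy data; `train` and `new` are illustrative names only.
train <- data.frame(carbon = c(46.2, 47.1, 49.8, 51.3, 44.0))
new   <- data.frame(carbon = c(45.0, 50.0))

sd_factor <- 1  # set to 2 to divide by two standard deviations

# "prep": estimate the standard deviations from the training data only.
sds <- vapply(train, sd, numeric(1), na.rm = TRUE) * sd_factor

# "bake": divide each column by the value stored from training.
scale_with <- function(data, sds) {
  data[names(sds)] <- Map(`/`, data[names(sds)], sds)
  data
}

train_scaled <- scale_with(train, sds)
new_scaled   <- scale_with(new, sds)

sd(train_scaled$carbon)
```

Note that the new data are divided by the training standard deviation, so their own standard deviation after scaling is generally not exactly one.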
An updated version of recipe with the new step added to the sequence of any existing operations.

When you tidy() this step, a tibble is returned with columns terms, value, and id:
terms: character, the selectors or variables selected
value: numeric, the standard deviations
id: character, id of this step

This step performs an unsupervised operation that can utilize case weights. As a result, case weights are only used with frequency weights. For more information, see the documentation in case_weights and the examples on tidymodels.org.
Gelman, A. (2007) "Scaling regression inputs by dividing by two standard deviations." Unpublished. Source: http://www.stat.columbia.edu/~gelman/research/unpublished/standardizing.pdf.
Other normalization steps: step_center(), step_normalize(), step_range()
data(biomass, package = "modeldata")

biomass_tr <- biomass[biomass$dataset == "Training", ]
biomass_te <- biomass[biomass$dataset == "Testing", ]

rec <- recipe(
  HHV ~ carbon + hydrogen + oxygen + nitrogen + sulfur,
  data = biomass_tr
)

scaled_trans <- rec %>%
  step_scale(carbon, hydrogen)

scaled_obj <- prep(scaled_trans, training = biomass_tr)

transformed_te <- bake(scaled_obj, biomass_te)

biomass_te[1:10, names(transformed_te)]
transformed_te

tidy(scaled_trans, number = 1)
tidy(scaled_obj, number = 1)
step_select() creates a specification of a recipe step that will select variables using dplyr::select().
step_select(
  recipe,
  ...,
  role = NA,
  trained = FALSE,
  skip = FALSE,
  id = rand_id("select")
)
recipe | A recipe object. The step will be added to the sequence of operations for this recipe. |
... | One or more selector functions to choose variables for this step. See |
role | For model terms selected by this step, what analysis role should they be assigned? |
trained | A logical to indicate if the quantities for preprocessing have been estimated. |
skip | A logical. Should the step be skipped when the recipe is baked by |
id | A character string that is unique to this step to identify it. |
When an object in the user's global environment is referenced in the expression defining the new variable(s), it is a good idea to use quasiquotation (e.g. !!) to embed the value of the object in the expression (to be portable between sessions). See the examples.
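To see why embedding the value matters, the following base R sketch contrasts an expression that stores only a variable name with one that stores the value itself. It uses bquote() as a base R analogue of the !! operator; it is an assumption-laden illustration of the idea, not how the step captures its arguments.

```r
sepal_vars <- c("Sepal.Width", "Sepal.Length")

# Stores only the symbol `sepal_vars`; its value is looked up later.
by_name <- quote(all_of(sepal_vars))

# Stores the current value inside the expression itself.
inlined <- bquote(all_of(.(sepal_vars)))

all.vars(by_name)  # names that must still exist when the call is evaluated
all.vars(inlined)  # nothing left to look up
```

An expression like by_name depends on sepal_vars still existing in the session where it is evaluated, whereas inlined carries the column names with it.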
This step can potentially remove columns from the data set. This may cause issues for subsequent steps in your recipe if the missing columns are specifically referenced by name. To avoid this, see the advice in the Tips for saving recipes and filtering columns section of selections.
An updated version of recipe with the new step added to the sequence of any existing operations.

When you tidy() this step, a tibble is returned with columns terms and id:
terms: character, the selectors or variables selected
id: character, id of this step
The underlying operation does not allow for case weights.
Other variable filter steps: step_corr(), step_filter_missing(), step_lincomb(), step_nzv(), step_rm(), step_zv()

Other dplyr steps: step_arrange(), step_filter(), step_mutate(), step_mutate_at(), step_rename(), step_rename_at(), step_sample(), step_slice()
library(dplyr)

iris_tbl <- as_tibble(iris)
iris_train <- slice(iris_tbl, 1:75)
iris_test <- slice(iris_tbl, 76:150)

dplyr_train <- select(iris_train, Species, starts_with("Sepal"))
dplyr_test <- select(iris_test, Species, starts_with("Sepal"))

rec <- recipe(~., data = iris_train) %>%
  step_select(Species, starts_with("Sepal")) %>%
  prep(training = iris_train)

rec_train <- bake(rec, new_data = NULL)
all.equal(dplyr_train, rec_train)

rec_test <- bake(rec, iris_test)
all.equal(dplyr_test, rec_test)

# Local variables
sepal_vars <- c("Sepal.Width", "Sepal.Length")

qq_rec <-
  recipe(~., data = iris_train) %>%
  # fine for interactive usage
  step_select(Species, all_of(sepal_vars)) %>%
  # best approach for saving a recipe to disk
  step_select(Species, all_of(!!sepal_vars))

# Note that `sepal_vars` is inlined in the second approach
qq_rec
step_shuffle() creates a specification of a recipe step that will randomly change the order of rows for selected variables.
step_shuffle(
  recipe,
  ...,
  role = NA,
  trained = FALSE,
  columns = NULL,
  skip = FALSE,
  id = rand_id("shuffle")
)
recipe | A recipe object. The step will be added to the sequence of operations for this recipe. |
... | One or more selector functions to choose variables for this step. See |
role | Not used by this step since no new variables are created. |
trained | A logical to indicate if the quantities for preprocessing have been estimated. |
columns | A character string of the selected variable names. This field is a placeholder and will be populated once |
skip | A logical. Should the step be skipped when the recipe is baked by |
id | A character string that is unique to this step to identify it. |
An updated version of recipe with the new step added to the sequence of any existing operations.

When you tidy() this step, a tibble is returned with columns terms and id:
terms: character, the selectors or variables selected
id: character, id of this step
The underlying operation does not allow for case weights.
Other row operation steps: step_arrange(), step_filter(), step_impute_roll(), step_lag(), step_naomit(), step_sample(), step_slice()
integers <- data.frame(A = 1:12, B = 13:24, C = 25:36)

library(dplyr)
rec <- recipe(~ A + B + C, data = integers) %>%
  step_shuffle(A, B)

rand_set <- prep(rec, training = integers)

set.seed(5377)
bake(rand_set, integers)

tidy(rec, number = 1)
tidy(rand_set, number = 1)
step_slice() creates a specification of a recipe step that will filter rows using dplyr::slice().
step_slice(
  recipe,
  ...,
  role = NA,
  trained = FALSE,
  inputs = NULL,
  skip = TRUE,
  id = rand_id("slice")
)
recipe | A recipe object. The step will be added to the sequence of operations for this recipe. |
... | Integer row values. See |
role | Not used by this step since no new variables are created. |
trained | A logical to indicate if the quantities for preprocessing have been estimated. |
inputs | Quosure of values given by |
skip | A logical. Should the step be skipped when the recipe is baked by |
id | A character string that is unique to this step to identify it. |
When an object in the user's global environment is referenced in the expression defining the new variable(s), it is a good idea to use quasiquotation (e.g. !!) to embed the value of the object in the expression (to be portable between sessions). See the examples.

An updated version of recipe with the new step added to the sequence of any existing operations.

This step can entirely remove observations (rows of data), which can have unintended and/or problematic consequences when applying the step to new data later via bake(). Consider whether skip = TRUE or skip = FALSE is more appropriate in any given use case. In most instances that affect the rows of the data being predicted, this step probably should not be applied at all; instead, execute operations like this outside and before starting a preprocessing recipe().
When you tidy() this step, a tibble is returned with columns terms and id:
terms: character, containing the filtering indices
id: character, id of this step
The underlying operation does not allow for case weights.
Other row operation steps: step_arrange(), step_filter(), step_impute_roll(), step_lag(), step_naomit(), step_sample(), step_shuffle()

Other dplyr steps: step_arrange(), step_filter(), step_mutate(), step_mutate_at(), step_rename(), step_rename_at(), step_sample(), step_select()
# Load dplyr first: slice() is used when prepping the recipe below
library(dplyr)

rec <- recipe(~., data = iris) %>%
  step_slice(1:3)

prepped <- prep(rec, training = iris %>% slice(1:75))
tidy(prepped, number = 1)

dplyr_train <-
  iris %>%
  as_tibble() %>%
  slice(1:75) %>%
  slice(1:3)

rec_train <- bake(prepped, new_data = NULL)
all.equal(dplyr_train, rec_train)

dplyr_test <-
  iris %>%
  as_tibble() %>%
  slice(76:150)
rec_test <- bake(prepped, iris %>% slice(76:150))
all.equal(dplyr_test, rec_test)

# Embedding the integer expression (or vector) into the
# recipe:

keep_rows <- 1:6

qq_rec <-
  recipe(~., data = iris) %>%
  # Embed `keep_rows` in the call using !!!
  step_slice(!!!keep_rows) %>%
  prep(training = iris)

tidy(qq_rec, number = 1)
step_spatialsign() creates a specification of a recipe step that will convert numeric data into a projection onto a unit sphere.
step_spatialsign(
  recipe,
  ...,
  role = "predictor",
  na_rm = TRUE,
  trained = FALSE,
  columns = NULL,
  skip = FALSE,
  id = rand_id("spatialsign")
)
recipe | A recipe object. The step will be added to the sequence of operations for this recipe. |
... | One or more selector functions to choose variables for this step. See |
role | For model terms created by this step, what analysis role should they be assigned? By default, the new columns created by this step from the original variables will be used as predictors in a model. |
na_rm | A logical: should missing data be removed from the norm computation? |
trained | A logical to indicate if the quantities for preprocessing have been estimated. |
columns | A character string of the selected variable names. This field is a placeholder and will be populated once |
skip | A logical. Should the step be skipped when the recipe is baked by |
id | A character string that is unique to this step to identify it. |
The spatial sign transformation projects the variables onto a unit sphere and is related to global contrast normalization. The spatial sign of a vector w is w/norm(w).

The variables should be centered and scaled prior to the computations.
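The formula w/norm(w) can be checked by hand with a short base R sketch. It assumes norm() means the Euclidean norm of each row, and the toy data frame x with its column names is hypothetical; it is meant to show the arithmetic rather than reproduce the step's internals.

```r
# Hypothetical toy predictors; the column names are illustrative only.
x <- data.frame(
  carbon   = c(46.2, 47.1, 49.8, 51.3, 44.0),
  hydrogen = c(5.7, 5.9, 6.1, 5.2, 6.4)
)

# Center and scale first, as recommended above.
z <- scale(as.matrix(x))

# Divide each row w by its Euclidean norm: w / norm(w).
row_norms <- sqrt(rowSums(z^2))
ss <- z / row_norms

# Every transformed row now has squared length one.
rowSums(ss^2)
```

Because every row ends up on the unit circle (or sphere, with more columns), observations far from the center are pulled in to the same distance as all others, which is where the robustness to outliers comes from.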
An updated version of recipe with the new step added to the sequence of any existing operations.

When you tidy() this step, a tibble is returned with columns terms and id:
terms: character, the selectors or variables selected
id: character, id of this step

This step performs an unsupervised operation that can utilize case weights. As a result, only frequency weights are allowed. For more information, see the documentation in case_weights and the examples on tidymodels.org.

Unlike most steps, this step requires the case weights to be available when new samples are processed (e.g., when bake() is used or predict() with a workflow). To tell recipes that the case weights are required at bake time, use recipe %>% update_role_requirements(role = "case_weights", bake = TRUE). See update_role_requirements() for more information.
Serneels, S., De Nolf, E., and Van Espen, P. (2006). Spatial sign preprocessing: a simple way to impart moderate robustness to multivariate estimators. Journal of Chemical Information and Modeling, 46(3), 1402-1409.
Other multivariate transformation steps: step_classdist(), step_classdist_shrunken(), step_depth(), step_geodist(), step_ica(), step_isomap(), step_kpca(), step_kpca_poly(), step_kpca_rbf(), step_mutate_at(), step_nnmf(), step_nnmf_sparse(), step_pca(), step_pls(), step_ratio()
data(biomass, package = "modeldata")

biomass_tr <- biomass[biomass$dataset == "Training", ]
biomass_te <- biomass[biomass$dataset == "Testing", ]

rec <- recipe(
  HHV ~ carbon + hydrogen + oxygen + nitrogen + sulfur,
  data = biomass_tr
)

ss_trans <- rec %>%
  step_center(carbon, hydrogen) %>%
  step_scale(carbon, hydrogen) %>%
  step_spatialsign(carbon, hydrogen)

ss_obj <- prep(ss_trans, training = biomass_tr)

transformed_te <- bake(ss_obj, biomass_te)

plot(biomass_te$carbon, biomass_te$hydrogen)

plot(transformed_te$carbon, transformed_te$hydrogen)

tidy(ss_trans, number = 3)
tidy(ss_obj, number = 3)
step_spline_b() creates a specification of a recipe step that creates b-spline features.
step_spline_b(
  recipe,
  ...,
  role = "predictor",
  trained = FALSE,
  deg_free = 10,
  degree = 3,
  complete_set = FALSE,
  options = NULL,
  keep_original_cols = FALSE,
  results = NULL,
  skip = FALSE,
  id = rand_id("spline_b")
)
recipe | A recipe object. The step will be added to the sequence of operations for this recipe. |
... | One or more selector functions to choose variables for this step. See |
role | For model terms created by this step, what analysis role should they be assigned? By default, the new columns created by this step from the original variables will be used as predictors in a model. |
trained | A logical to indicate if the quantities for preprocessing have been estimated. |
deg_free | The degrees of freedom for the b-spline. As the degrees of freedom for a b-spline increase, more flexible and complex curves can be generated. |
degree | A non-negative integer specifying the degree of the piece-wise polynomial. The default value is 3 for cubic splines. Zero degree is allowed for piece-wise constant basis functions. |
complete_set | If |
options | A list of options for |
keep_original_cols | A logical to keep the original variables in the output. Defaults to |
results | A list of objects created once the step has been trained. |
skip | A logical. Should the step be skipped when the recipe is baked by |
id | A character string that is unique to this step to identify it. |
Spline transformations take a numeric column and create multiple features that, when used in a model, can estimate nonlinear trends between the column and some outcome. The degrees of freedom determine how many new features are added to the data.
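The link between degrees of freedom and the number of new features can be seen in a small sketch. It uses splines::bs() from the splines package that ships with R purely as a stand-in; it is not claimed to be the function this step calls, and the longitude values are simulated for illustration.

```r
library(splines)  # ships with R; used here only as a stand-in

set.seed(1)
longitude <- runif(100, min = -93.7, max = -93.6)  # hypothetical values

# One basis column is produced per degree of freedom.
basis_4 <- bs(longitude, df = 4, degree = 3)
basis_6 <- bs(longitude, df = 6, degree = 3)

ncol(basis_4)
ncol(basis_6)
```

Each column is one feature that replaces the single original column, so a larger value gives a model more room to follow local curvature.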
If periodic = TRUE is set in the list passed to options, a periodic version of the spline is used.

If the spline expansion fails for a selected column, the step will remove that column's results (but will retain the original data). Use the tidy() method to determine which columns were used.

An object with classes "step_spline_b" and "step".

When you tidy() this step, a tibble is returned with columns terms and id:
terms: character, the selectors or variables selected
id: character, id of this step

This step has 2 tuning parameters:
deg_free: Spline Degrees of Freedom (type: integer, default: 10)
degree: Polynomial Degree (type: integer, default: 3)
The underlying operation does not allow for case weights.
library(tidyr)
library(dplyr)
library(ggplot2)
data(ames, package = "modeldata")

spline_rec <- recipe(Sale_Price ~ Longitude, data = ames) %>%
  step_spline_b(Longitude, deg_free = 6, keep_original_cols = TRUE) %>%
  prep()

tidy(spline_rec, number = 1)

# Show where each feature is active
spline_rec %>%
  bake(new_data = NULL, -Sale_Price) %>%
  pivot_longer(c(starts_with("Longitude_")), names_to = "feature", values_to = "value") %>%
  mutate(feature = gsub("Longitude_", "feature ", feature)) %>%
  filter(value > 0) %>%
  ggplot(aes(x = Longitude, y = value)) +
  geom_line() +
  facet_wrap(~ feature)
step_spline_convex() creates a specification of a recipe step that creates convex spline features.
step_spline_convex(
  recipe,
  ...,
  role = "predictor",
  trained = FALSE,
  deg_free = 10,
  degree = 3,
  complete_set = TRUE,
  options = NULL,
  keep_original_cols = FALSE,
  results = NULL,
  skip = FALSE,
  id = rand_id("spline_convex")
)
recipe | A recipe object. The step will be added to the sequence of operations for this recipe. |
... | One or more selector functions to choose variables for this step. See |
role | For model terms created by this step, what analysis role should they be assigned? By default, the new columns created by this step from the original variables will be used as predictors in a model. |
trained | A logical to indicate if the quantities for preprocessing have been estimated. |
deg_free | The degrees of freedom for the b-spline. As the degrees of freedom for a b-spline increase, more flexible and complex curves can be generated. |
degree | The degree of C-spline defined to be the degree of the associated M-spline instead of actual polynomial degree. For example, C-spline basis of degree 2 is defined as the scaled double integral of associated M-spline basis of degree 2. |
complete_set | If |
options | A list of options for |
keep_original_cols | A logical to keep the original variables in the output. Defaults to |
results | A list of objects created once the step has been trained. |
skip | A logical. Should the step be skipped when the recipe is baked by |
id | A character string that is unique to this step to identify it. |
Spline transformations take a numeric column and create multiple features that, when used in a model, can estimate nonlinear trends between the column and some outcome. The degrees of freedom determine how many new features are added to the data.
These particular spline functions have forms that are guaranteed to be convex.
If the spline expansion fails for a selected column, the step will remove that column's results (but will retain the original data). Use the tidy() method to determine which columns were used.

An object with classes "step_spline_convex" and "step".

When you tidy() this step, a tibble is returned with columns terms and id:
terms: character, the selectors or variables selected
id: character, id of this step

This step has 2 tuning parameters:
deg_free: Spline Degrees of Freedom (type: integer, default: 10)
degree: Polynomial Degree (type: integer, default: 3)
The underlying operation does not allow for case weights.
library(tidyr)
library(dplyr)
library(ggplot2)
data(ames, package = "modeldata")

spline_rec <- recipe(Sale_Price ~ Longitude, data = ames) %>%
  step_spline_convex(Longitude, deg_free = 6, keep_original_cols = TRUE) %>%
  prep()

tidy(spline_rec, number = 1)

# Show where each feature is active
spline_rec %>%
  bake(new_data = NULL, -Sale_Price) %>%
  pivot_longer(c(starts_with("Longitude_")), names_to = "feature", values_to = "value") %>%
  mutate(feature = gsub("Longitude_", "feature ", feature)) %>%
  filter(value > 0) %>%
  ggplot(aes(x = Longitude, y = value)) +
  geom_line() +
  facet_wrap(~ feature)
step_spline_monotone() creates a specification of a recipe step that creates monotone spline features.
step_spline_monotone(
  recipe,
  ...,
  role = "predictor",
  trained = FALSE,
  deg_free = 10,
  degree = 3,
  complete_set = TRUE,
  options = NULL,
  keep_original_cols = FALSE,
  results = NULL,
  skip = FALSE,
  id = rand_id("spline_monotone")
)
recipe | A recipe object. The step will be added to the sequence of operations for this recipe. |
... | One or more selector functions to choose variables for this step. See |
role | For model terms created by this step, what analysis role should they be assigned? By default, the new columns created by this step from the original variables will be used as predictors in a model. |
trained | A logical to indicate if the quantities for preprocessing have been estimated. |
deg_free | The degrees of freedom for the b-spline. As the degrees of freedom for a b-spline increase, more flexible and complex curves can be generated. |
degree | The degree of I-spline defined to be the degree of the associated M-spline instead of actual polynomial degree. For example, I-spline basis of degree 2 is defined as the integral of associated M-spline basis of degree 2. |
complete_set | If |
options | A list of options for |
keep_original_cols | A logical to keep the original variables in the output. Defaults to |
results | A list of objects created once the step has been trained. |
skip | A logical. Should the step be skipped when the recipe is baked by |
id | A character string that is unique to this step to identify it. |
Spline transformations take a numeric column and create multiple features that, when used in a model, can estimate nonlinear trends between the column and some outcome. The degrees of freedom determine how many new features are added to the data.
These splines are integrated forms of M-splines and are non-negative and monotonic. This means that, when used with a fitting function that produces non-negative regression coefficients, the resulting fit is monotonic.
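The monotonicity argument can be illustrated numerically. The sketch below is a crude approximation under stated assumptions: it takes the non-negative B-spline basis from splines::bs() (M-splines are rescaled versions of it), integrates each column with a running sum, and combines the results with hypothetical non-negative coefficients. It is not the routine the step uses.

```r
library(splines)  # ships with R; used here only as a stand-in

x <- seq(0, 1, length.out = 200)

# Non-negative B-spline basis functions evaluated on a grid.
b_basis <- bs(x, df = 6, degree = 3, intercept = TRUE)

# Crude numerical integral of each basis function (Riemann sum), so each
# column is non-negative and non-decreasing in x.
i_basis <- apply(b_basis, 2, cumsum) * (x[2] - x[1])

# Hypothetical non-negative regression coefficients.
coefs <- c(0.5, 0, 2, 1, 0, 3)
fit <- drop(i_basis %*% coefs)

# The fitted curve never decreases.
all(diff(fit) >= -1e-12)
```

A negative coefficient would break the guarantee, which is why the text pairs these features with a fitting function that constrains coefficients to be non-negative.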
If the spline expansion fails for a selected column, the step will remove that column's results (but will retain the original data). Use the tidy() method to determine which columns were used.

An object with classes "step_spline_monotone" and "step".

When you tidy() this step, a tibble is returned with columns terms and id:
terms: character, the selectors or variables selected
id: character, id of this step

This step has 2 tuning parameters:
deg_free: Spline Degrees of Freedom (type: integer, default: 10)
degree: Polynomial Degree (type: integer, default: 3)
The underlying operation does not allow for case weights.
library(tidyr)
library(dplyr)
library(ggplot2)
data(ames, package = "modeldata")

spline_rec <- recipe(Sale_Price ~ Longitude, data = ames) %>%
  step_spline_monotone(Longitude, deg_free = 6, keep_original_cols = TRUE) %>%
  prep()

tidy(spline_rec, number = 1)

# Show where each feature is active
spline_rec %>%
  bake(new_data = NULL, -Sale_Price) %>%
  pivot_longer(c(starts_with("Longitude_")), names_to = "feature", values_to = "value") %>%
  mutate(feature = gsub("Longitude_", "feature ", feature)) %>%
  filter(value > 0) %>%
  ggplot(aes(x = Longitude, y = value)) +
  geom_line() +
  facet_wrap(~ feature)
step_spline_natural() creates a specification of a recipe step that creates natural spline features.
step_spline_natural(
  recipe,
  ...,
  role = "predictor",
  trained = FALSE,
  deg_free = 10,
  complete_set = FALSE,
  options = NULL,
  keep_original_cols = FALSE,
  results = NULL,
  skip = FALSE,
  id = rand_id("spline_natural")
)
recipe | A recipe object. The step will be added to the sequence of operations for this recipe. |
... | One or more selector functions to choose variables for this step. See |
role | For model terms created by this step, what analysis role should they be assigned? By default, the new columns created by this step from the original variables will be used as predictors in a model. |
trained | A logical to indicate if the quantities for preprocessing have been estimated. |
deg_free | The degrees of freedom for the natural spline. As the degrees of freedom for a natural spline increase, more flexible and complex curves can be generated. This step requires at least two degrees of freedom. |
complete_set | If |
options | A list of options for |
keep_original_cols | A logical to keep the original variables in the output. Defaults to |
results | A list of objects created once the step has been trained. |
skip | A logical. Should the step be skipped when the recipe is baked by |
id | A character string that is unique to this step to identify it. |
Spline transformations take a numeric column and create multiple features that, when used in a model, can estimate nonlinear trends between the column and some outcome. The degrees of freedom determine how many new features are added to the data.
This spline is a piece-wise cubic polynomial function.
If the spline expansion fails for a selected column, the step will
remove that column's results (but will retain the original data). Use the
tidy()
method to determine which columns were used.
An object with classes "step_spline_natural"
and "step"
.
When you tidy()
this step, a tibble is returned with
columns terms
and id
:
terms: character, the selectors or variables selected
id: character, id of this step
This step has 1 tuning parameter:
deg_free
: Spline Degrees of Freedom (type: integer, default: 10)
The underlying operation does not allow for case weights.
library(recipes)
library(tidyr)
library(dplyr)
library(ggplot2)

data(ames, package = "modeldata")

spline_rec <- recipe(Sale_Price ~ Longitude, data = ames) %>%
  step_spline_natural(Longitude, deg_free = 6, keep_original_cols = TRUE) %>%
  prep()

tidy(spline_rec, number = 1)

# Show where each feature is active
spline_rec %>%
  bake(new_data = NULL, -Sale_Price) %>%
  pivot_longer(c(starts_with("Longitude_")), names_to = "feature", values_to = "value") %>%
  mutate(feature = gsub("Longitude_", "feature ", feature)) %>%
  filter(value > 0) %>%
  ggplot(aes(x = Longitude, y = value)) +
  geom_line() +
  facet_wrap(~ feature)
step_spline_nonnegative()
creates a specification of a recipe step that
creates non-negative spline features.
step_spline_nonnegative( recipe, ..., role = "predictor", trained = FALSE, deg_free = 10, degree = 3, complete_set = FALSE, options = NULL, keep_original_cols = FALSE, results = NULL, skip = FALSE, id = rand_id("spline_nonnegative") )
recipe |
A recipe object. The step will be added to the sequence of operations for this recipe. |
... |
One or more selector functions to choose variables
for this step. See |
role |
For model terms created by this step, what analysis role should they be assigned? By default, the new columns created by this step from the original variables will be used as predictors in a model. |
trained |
A logical to indicate if the quantities for preprocessing have been estimated. |
deg_free |
The degrees of freedom for the b-spline. As the degrees of freedom for a b-spline increase, more flexible and complex curves can be generated. |
degree |
A nonnegative integer specifying the degree of the piecewise polynomial. The default value is 3 for cubic splines. Zero degree is allowed for piecewise constant basis functions. |
complete_set |
If |
options |
A list of options for |
keep_original_cols |
A logical to keep the original variables in the
output. Defaults to |
results |
A list of objects created once the step has been trained. |
skip |
A logical. Should the step be skipped when the
recipe is baked by |
id |
A character string that is unique to this step to identify it. |
Spline transformations take a numeric column and create multiple features that, when used in a model, can estimate nonlinear trends between the column and some outcome. The degrees of freedom determine how many new features are added to the data.
This function generates M-splines (Curry and Schoenberg 1988), which are non-negative and have interesting statistical properties (such as integrating to one). A zero-degree M-spline generates box/step functions, while a first-degree basis function is triangular.
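The box and triangle shapes, and the integrate-to-one property, can be checked with a small hand-written sketch in base R. The helpers m0() and m1() below are hypothetical names written for this illustration; they are not functions from recipes or from the spline backend it uses.

```r
# Degree-0 M-spline on [a, b): a box of height 1 / (b - a)
m0 <- function(x, a, b) {
  ifelse(x >= a & x < b, 1 / (b - a), 0)
}

# Degree-1 M-spline on knots a < m < b: a triangle that peaks at m
# with height 2 / (b - a)
m1 <- function(x, a, m, b) {
  peak <- 2 / (b - a)
  ifelse(
    x >= a & x < m, peak * (x - a) / (m - a),
    ifelse(x >= m & x <= b, peak * (b - x) / (b - m), 0)
  )
}

# Both basis functions are non-negative and integrate to (approximately) one
area0 <- integrate(function(x) m0(x, 0, 2), lower = 0, upper = 2)$value
area1 <- integrate(function(x) m1(x, 0, 1, 3), lower = 0, upper = 3)$value
c(box = area0, triangle = area1)
```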
If periodic = TRUE
is set in the list passed to options
, a periodic version
of the spline is used.
If the spline expansion fails for a selected column, the step will
remove that column's results (but will retain the original data). Use the
tidy()
method to determine which columns were used.
An object with classes "step_spline_nonnegative"
and "step"
.
When you tidy()
this step, a tibble is returned with
columns terms
and id
:
terms: character, the selectors or variables selected
id: character, id of this step
This step has 2 tuning parameters:
deg_free
: Spline Degrees of Freedom (type: integer, default: 10)
degree
: Polynomial Degree (type: integer, default: 3)
The underlying operation does not allow for case weights.
Curry, H.B., Schoenberg, I.J. (1988). On Polya Frequency Functions IV: The Fundamental Spline Functions and their Limits. In: de Boor, C. (eds) I. J. Schoenberg Selected Papers. Contemporary Mathematicians. Birkhäuser, Boston, MA
Ramsay, J. O. "Monotone Regression Splines in Action." Statistical Science, vol. 3, no. 4, 1988, pp. 425–41
library(recipes)
library(tidyr)
library(dplyr)
library(ggplot2)

data(ames, package = "modeldata")

spline_rec <- recipe(Sale_Price ~ Longitude, data = ames) %>%
  step_spline_nonnegative(Longitude, deg_free = 6, keep_original_cols = TRUE) %>%
  prep()

tidy(spline_rec, number = 1)

# Show where each feature is active
spline_rec %>%
  bake(new_data = NULL, -Sale_Price) %>%
  pivot_longer(c(starts_with("Longitude_")), names_to = "feature", values_to = "value") %>%
  mutate(feature = gsub("Longitude_", "feature ", feature)) %>%
  filter(value > 0) %>%
  ggplot(aes(x = Longitude, y = value)) +
  geom_line() +
  facet_wrap(~ feature)
step_sqrt()
creates a specification of a recipe step that will apply
a square root transform to the variables.
step_sqrt( recipe, ..., role = NA, trained = FALSE, columns = NULL, skip = FALSE, id = rand_id("sqrt") )
recipe |
A recipe object. The step will be added to the sequence of operations for this recipe. |
... |
One or more selector functions to choose variables
for this step. See |
role |
Not used by this step since no new variables are created. |
trained |
A logical to indicate if the quantities for preprocessing have been estimated. |
columns |
A character string of the selected variable names. This field
is a placeholder and will be populated once |
skip |
A logical. Should the step be skipped when the
recipe is baked by |
id |
A character string that is unique to this step to identify it. |
An updated version of recipe
with the new step added to the
sequence of any existing operations.
When you tidy()
this step, a tibble is returned with
columns terms
and id
:
terms: character, the selectors or variables selected
id: character, id of this step
The underlying operation does not allow for case weights.
Other individual transformation steps:
step_BoxCox()
,
step_YeoJohnson()
,
step_bs()
,
step_harmonic()
,
step_hyperbolic()
,
step_inverse()
,
step_invlogit()
,
step_log()
,
step_logit()
,
step_mutate()
,
step_ns()
,
step_percentile()
,
step_poly()
,
step_relu()
library(recipes)

set.seed(313)
examples <- matrix(rnorm(40)^2, ncol = 2)
examples <- as.data.frame(examples)

rec <- recipe(~ V1 + V2, data = examples)

sqrt_trans <- rec %>%
  step_sqrt(all_numeric_predictors())

sqrt_obj <- prep(sqrt_trans, training = examples)

transformed_te <- bake(sqrt_obj, examples)
plot(examples$V1, transformed_te$V1)

tidy(sqrt_trans, number = 1)
tidy(sqrt_obj, number = 1)
step_string2factor()
will convert one or more character vectors to factors
(ordered or unordered).
Use this step only in special cases (see Details); otherwise, convert strings to factors before using any tidymodels functions.
step_string2factor( recipe, ..., role = NA, trained = FALSE, levels = NULL, ordered = FALSE, skip = FALSE, id = rand_id("string2factor") )
recipe |
A recipe object. The step will be added to the sequence of operations for this recipe. |
... |
One or more selector functions to choose variables
for this step. See |
role |
Not used by this step since no new variables are created. |
trained |
A logical to indicate if the quantities for preprocessing have been estimated. |
levels |
An optional specification of the levels to be used
for the new factor. If left |
ordered |
A single logical value; should the factor(s) be ordered? |
skip |
A logical. Should the step be skipped when the
recipe is baked by |
id |
A character string that is unique to this step to identify it. |
In most cases, if you are planning to use step_string2factor()
without setting levels
, you will be better off converting
those character variables to factor variables before using a recipe.
This can be done using dplyr with the following code:
df <- mutate(df, across(where(is.character), as.factor))
During resampling, the complete set of values might
not be in the character data. Converting them to factors with
step_string2factor()
then will misconfigure the levels.
If the levels
argument to step_string2factor()
is used, it will
convert all variables affected by this step to have the same
levels. Because of this, you will need to know the full set of levels
when you define the recipe.
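The resampling problem described above can be reproduced with base R alone. In this sketch the data and object names are made up for illustration: a resample that happens to miss one value yields a factor with too few levels unless the full set of levels is supplied up front.

```r
# Hypothetical character column and a resample that happens to miss "c"
all_values <- c("a", "b", "c", "a")
resample <- all_values[c(1, 2, 4)]

# Levels inferred from the resample alone are incomplete
inferred <- factor(resample)
levels(inferred)

# Supplying the full set of levels keeps the encoding consistent
full_levels <- c("a", "b", "c")
fixed <- factor(resample, levels = full_levels)
levels(fixed)
```

This is why converting strings to factors before resampling, or passing the complete set through levels, avoids misconfigured factors.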
Also, note that prep()
has an option strings_as_factors
that
defaults to TRUE
. Set this option to FALSE so that raw character
data are passed to step_string2factor()
. However, this step
can also take existing factors (but will leave them as-is).
An updated version of recipe
with the new step added to the
sequence of any existing operations.
When you tidy()
this step, a tibble is returned with
columns terms
, ordered
, and id
:
terms: character, the selectors or variables selected
ordered: logical, are factors ordered
id: character, id of this step
The underlying operation does not allow for case weights.
Other dummy variable and encoding steps:
step_bin2factor()
,
step_count()
,
step_date()
,
step_dummy()
,
step_dummy_extract()
,
step_dummy_multi_choice()
,
step_factor2string()
,
step_holiday()
,
step_indicate_na()
,
step_integer()
,
step_novel()
,
step_num2factor()
,
step_ordinalscore()
,
step_other()
,
step_regex()
,
step_relevel()
,
step_time()
,
step_unknown()
,
step_unorder()
library(recipes)

data(Sacramento, package = "modeldata")

# convert factor to string to demonstrate
Sacramento$city <- as.character(Sacramento$city)

rec <- recipe(~ city + zip, data = Sacramento)

make_factor <- rec %>%
  step_string2factor(city)

make_factor <- prep(make_factor, training = Sacramento)

make_factor

# note that `city` is a factor in recipe output
bake(make_factor, new_data = NULL) %>% head()

# ...but remains a string in the data
Sacramento %>% head()
step_time()
creates a specification of a recipe step that will convert
date-time data into one or more factor or numeric variables.
step_time( recipe, ..., role = "predictor", trained = FALSE, features = c("hour", "minute", "second"), columns = NULL, keep_original_cols = TRUE, skip = FALSE, id = rand_id("time") )
recipe |
A recipe object. The step will be added to the sequence of operations for this recipe. |
... |
One or more selector functions to choose variables
for this step. The selected variables should have class
|
role |
For model terms created by this step, what analysis role should they be assigned? By default, the new columns created by this step from the original variables will be used as predictors in a model. |
trained |
A logical to indicate if the quantities for preprocessing have been estimated. |
features |
A character string that includes at least one
of the following values: |
columns |
A character string of the selected variable names. This field
is a placeholder and will be populated once |
keep_original_cols |
A logical to keep the original variables in the
output. Defaults to |
skip |
A logical. Should the step be skipped when the
recipe is baked by |
id |
A character string that is unique to this step to identify it. |
Unlike some other steps, step_time()
does not
remove the original time variables by default. Set keep_original_cols
to FALSE
to remove them.
decimal_day
returns the time of day as a decimal number between 0 and 24. For
example "07:15:00"
would be transformed to 7.25
and "03:59:59"
would
be transformed to 3.999722
. The formula for this calculation is
hour(x) + (second(x) + minute(x) * 60) / 3600
.
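The two worked values above can be reproduced directly from the formula. The helper below is a hypothetical base R function that takes the hour, minute, and second as plain numbers; it is only a check of the arithmetic, not the step's own code.

```r
# Hypothetical helper applying the documented formula to numeric inputs
decimal_day <- function(hour, minute, second) {
  hour + (second + minute * 60) / 3600
}

quarter_past_seven <- decimal_day(7, 15, 0)
just_before_four <- decimal_day(3, 59, 59)

quarter_past_seven          # 7.25
round(just_before_four, 6)  # 3.999722
```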
See step_date()
if you want to calculate features that are larger than
hours.
An updated version of recipe
with the new step added to the
sequence of any existing operations.
When you tidy()
this step, a tibble is returned with
columns terms
, value
, and id
:
terms: character, the selectors or variables selected
value: character, the feature names
id: character, id of this step
Other dummy variable and encoding steps:
step_bin2factor()
,
step_count()
,
step_date()
,
step_dummy()
,
step_dummy_extract()
,
step_dummy_multi_choice()
,
step_factor2string()
,
step_holiday()
,
step_indicate_na()
,
step_integer()
,
step_novel()
,
step_num2factor()
,
step_ordinalscore()
,
step_other()
,
step_regex()
,
step_relevel()
,
step_string2factor()
,
step_unknown()
,
step_unorder()
library(recipes)
library(lubridate)

examples <- data.frame(
  times = ymd_hms("2022-05-06 23:51:07") +
    hours(1:5) + minutes(1:5) + seconds(1:5)
)

time_rec <- recipe(~ times, examples) %>%
  step_time(all_predictors())

tidy(time_rec, number = 1)

time_rec <- prep(time_rec, training = examples)

time_values <- bake(time_rec, new_data = examples)
time_values

tidy(time_rec, number = 1)
step_unknown()
creates a specification of a recipe step that will assign
missing values in a factor to the level "unknown"
.
step_unknown( recipe, ..., role = NA, trained = FALSE, new_level = "unknown", objects = NULL, skip = FALSE, id = rand_id("unknown") )
recipe |
A recipe object. The step will be added to the sequence of operations for this recipe. |
... |
One or more selector functions to choose variables
for this step. See |
role |
Not used by this step since no new variables are created. |
trained |
A logical to indicate if the quantities for preprocessing have been estimated. |
new_level |
A single character value that will be assigned to new factor levels. |
objects |
A list of objects that contain the information
on factor levels that will be determined by |
skip |
A logical. Should the step be skipped when the
recipe is baked by |
id |
A character string that is unique to this step to identify it. |
The selected variables are adjusted to have a new
level (given by new_level
) that is placed in the last
position.
Note that if the original columns are character, they will be converted to factors by this step.
If new_level
is already in the data given to prep
, an error
is thrown.
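The behavior described above (a new level placed last, missing values mapped to it, and an error if the level already exists) can be sketched in base R. This is an illustration under stated assumptions, with made-up data; it is not the code that step_unknown() runs.

```r
# Made-up factor with a missing value
x <- factor(c("a", NA, "b"))
new_level <- "unknown"

# Mirror the documented check: the new level must not already be present
stopifnot(!new_level %in% levels(x))

# Replace NA with the new level and append that level in the last position
x_filled <- factor(
  ifelse(is.na(x), new_level, as.character(x)),
  levels = c(levels(x), new_level)
)

levels(x_filled)
as.character(x_filled)
```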
An updated version of recipe
with the new step added to the
sequence of any existing operations.
When you tidy()
this step, a tibble is returned with
columns terms
, value
, and id
:
terms: character, the selectors or variables selected
value: character, the factor levels for the new values
id: character, id of this step
The underlying operation does not allow for case weights.
Other dummy variable and encoding steps:
step_bin2factor()
,
step_count()
,
step_date()
,
step_dummy()
,
step_dummy_extract()
,
step_dummy_multi_choice()
,
step_factor2string()
,
step_holiday()
,
step_indicate_na()
,
step_integer()
,
step_novel()
,
step_num2factor()
,
step_ordinalscore()
,
step_other()
,
step_regex()
,
step_relevel()
,
step_string2factor()
,
step_time()
,
step_unorder()
library(recipes)

data(Sacramento, package = "modeldata")

rec <- recipe(~ city + zip, data = Sacramento) %>%
  step_unknown(city, new_level = "unknown city") %>%
  step_unknown(zip, new_level = "unknown zip") %>%
  prep()

table(
  bake(rec, new_data = NULL) %>% pull(city),
  Sacramento %>% pull(city),
  useNA = "always"
) %>%
  as.data.frame() %>%
  dplyr::filter(Freq > 0)

tidy(rec, number = 1)
step_unorder()
creates a specification of a recipe step that will turn
ordered factor variables into unordered factor variables.
step_unorder( recipe, ..., role = NA, trained = FALSE, columns = NULL, skip = FALSE, id = rand_id("unorder") )
recipe |
A recipe object. The step will be added to the sequence of operations for this recipe. |
... |
One or more selector functions to choose variables
for this step. See |
role |
Not used by this step since no new variables are created. |
trained |
A logical to indicate if the quantities for preprocessing have been estimated. |
columns |
A character string of the selected variable names. This field
is a placeholder and will be populated once |
skip |
A logical. Should the step be skipped when the
recipe is baked by |
id |
A character string that is unique to this step to identify it. |
The order of the factor levels is preserved during the transformation.
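A base R sketch of the same idea, using made-up data: the ordered factor is rebuilt as a plain factor while its levels are passed through explicitly, so their order is unchanged. How the step performs the conversion internally is not shown in this document, so this is only an equivalent illustration.

```r
# Made-up ordered factor
lmh <- c("Low", "Med", "High")
x <- ordered(c("Low", "High", "Med", "Low"), levels = lmh)

# Rebuild as an unordered factor, passing the existing levels through
x_unordered <- factor(as.character(x), levels = levels(x))

is.ordered(x_unordered)
levels(x_unordered)
```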
An updated version of recipe
with the new step added to the
sequence of any existing operations.
When you tidy()
this step, a tibble is returned with
columns terms
and id
:
terms: character, the selectors or variables selected
id: character, id of this step
The underlying operation does not allow for case weights.
Other dummy variable and encoding steps:
step_bin2factor()
,
step_count()
,
step_date()
,
step_dummy()
,
step_dummy_extract()
,
step_dummy_multi_choice()
,
step_factor2string()
,
step_holiday()
,
step_indicate_na()
,
step_integer()
,
step_novel()
,
step_num2factor()
,
step_ordinalscore()
,
step_other()
,
step_regex()
,
step_relevel()
,
step_string2factor()
,
step_time()
,
step_unknown()
library(recipes)

lmh <- c("Low", "Med", "High")

examples <- data.frame(
  X1 = factor(rep(letters[1:4], each = 3)),
  X2 = ordered(rep(lmh, each = 4), levels = lmh)
)

rec <- recipe(~ X1 + X2, data = examples)

factor_trans <- rec %>%
  step_unorder(all_nominal_predictors())

factor_obj <- prep(factor_trans, training = examples)

transformed_te <- bake(factor_obj, examples)
table(transformed_te$X2, examples$X2)

tidy(factor_trans, number = 1)
tidy(factor_obj, number = 1)
step_window()
creates a specification of a recipe step that will create
new columns that are the results of functions that compute statistics across
moving windows.
step_window( recipe, ..., role = NA, trained = FALSE, size = 3, na_rm = TRUE, statistic = "mean", columns = NULL, names = NULL, keep_original_cols = TRUE, skip = FALSE, id = rand_id("window") )
recipe |
A recipe object. The step will be added to the sequence of operations for this recipe. |
... |
One or more selector functions to choose variables
for this step. See |
role |
For model terms created by this step, what analysis
role should they be assigned? If |
trained |
A logical to indicate if the quantities for preprocessing have been estimated. |
size |
An odd integer |
na_rm |
A logical for whether missing values should be removed from the calculations within each window. |
statistic |
A character string for the type of statistic
that should be calculated for each moving window. Possible
values are: |
columns |
A character string of the selected variable names. This field
is a placeholder and will be populated once |
names |
An optional character string that is the same
length of the number of terms selected by |
keep_original_cols |
A logical to keep the original variables in the
output. Defaults to |
skip |
A logical. Should the step be skipped when the
recipe is baked by |
id |
A character string that is unique to this step to identify it. |
The calculations use a somewhat atypical method for
handling the beginning and end parts of the rolling statistics.
The process starts with the center justified window calculations
and the beginning and ending parts of the rolling values are
determined using the first and last rolling values,
respectively. For example, if a column x
with 12 values is
smoothed with a 5-point moving median, the first three smoothed
values are estimated by median(x[1:5])
and the fourth
uses median(x[2:6])
.
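The edge handling can be sketched in base R as follows. roll_center() is a hypothetical helper written for this illustration, and the 12 values are made up; it computes the center-justified statistic and then copies the first and last computed values outward, which is the behavior the paragraph above describes. The package's own implementation may differ in detail.

```r
# Hypothetical helper: centered rolling statistic with edges filled from
# the first and last complete windows
roll_center <- function(x, size, fun = median) {
  n <- length(x)
  half <- (size - 1) / 2
  out <- rep(NA_real_, n)

  # Center-justified windows where a full window fits
  for (i in (half + 1):(n - half)) {
    out[i] <- fun(x[(i - half):(i + half)])
  }

  # Fill the beginning and end with the first and last rolling values
  out[seq_len(half)] <- out[half + 1]
  out[(n - half + 1):n] <- out[n - half]
  out
}

x <- c(5, 1, 9, 3, 7, 2, 8, 4, 6, 10, 12, 11)
smoothed <- roll_center(x, size = 5)

# The first three values equal median(x[1:5]); the fourth is median(x[2:6])
smoothed[1:4]
```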
keep_original_cols
also applies to this step if names
is specified.
If a package required by this step is not installed, the step will stop with a note about installing the package.
An updated version of recipe
with the new step added to the
sequence of any existing operations.
When you tidy()
this step, a tibble is returned with
columns terms
, statistic
, size
, and id
:
terms: character, the selectors or variables selected
statistic: character, the summary function name
size: integer, window size
id: character, id of this step
This step has 2 tuning parameters:
statistic
: Rolling Summary Statistic (type: character, default: mean)
size
: Window Size (type: integer, default: 3)
The underlying operation does not allow for case weights.
library(recipes)
library(dplyr)
library(rlang)
library(ggplot2, quietly = TRUE)

set.seed(5522)
sim_dat <- data.frame(x1 = (20:100) / 10)
n <- nrow(sim_dat)
sim_dat$y1 <- sin(sim_dat$x1) + rnorm(n, sd = 0.1)
sim_dat$y2 <- cos(sim_dat$x1) + rnorm(n, sd = 0.1)
sim_dat$x2 <- runif(n)
sim_dat$x3 <- rnorm(n)

rec <- recipe(y1 + y2 ~ x1 + x2 + x3, data = sim_dat) %>%
  step_window(starts_with("y"),
    size = 7, statistic = "median",
    names = paste0("med_7pt_", 1:2),
    role = "outcome"
  ) %>%
  step_window(starts_with("y"),
    names = paste0("mean_3pt_", 1:2),
    role = "outcome"
  )

rec <- prep(rec, training = sim_dat)

smoothed_dat <- bake(rec, sim_dat)

ggplot(data = sim_dat, aes(x = x1, y = y1)) +
  geom_point() +
  geom_line(data = smoothed_dat, aes(y = med_7pt_1)) +
  geom_line(data = smoothed_dat, aes(y = mean_3pt_1), col = "red") +
  theme_bw()

tidy(rec, number = 1)
tidy(rec, number = 2)

# If you want to replace the selected variables with the rolling statistic
# don't set `names`
sim_dat$original <- sim_dat$y1
rec <- recipe(y1 + y2 + original ~ x1 + x2 + x3, data = sim_dat) %>%
  step_window(starts_with("y"))

rec <- prep(rec, training = sim_dat)

smoothed_dat <- bake(rec, sim_dat)

ggplot(smoothed_dat, aes(x = original, y = y1)) +
  geom_point() +
  theme_bw()
step_YeoJohnson()
creates a specification of a recipe step that will
transform data using a Yeo-Johnson transformation.
step_YeoJohnson( recipe, ..., role = NA, trained = FALSE, lambdas = NULL, limits = c(-5, 5), num_unique = 5, na_rm = TRUE, skip = FALSE, id = rand_id("YeoJohnson") )
recipe |
A recipe object. The step will be added to the sequence of operations for this recipe. |
... |
One or more selector functions to choose variables
for this step. See |
role |
Not used by this step since no new variables are created. |
trained |
A logical to indicate if the quantities for preprocessing have been estimated. |
lambdas |
A numeric vector of transformation values. This
is |
limits |
A length 2 numeric vector defining the range to compute the transformation parameter lambda. |
num_unique |
An integer; data with fewer unique values than this will not be evaluated for a transformation. |
na_rm |
A logical value indicating whether |
skip |
A logical. Should the step be skipped when the
recipe is baked by |
id |
A character string that is unique to this step to identify it. |
The Yeo-Johnson transformation is very similar to the Box-Cox but does not require the input variables to be strictly positive. In the package, the partial log-likelihood function is directly optimized within a reasonable set of transformation values (which can be changed by the user).
This transformation is typically done on the outcome variable using the residuals for a statistical model (such as ordinary least squares). Here, a simple null model (intercept only) is used to apply the transformation to the predictor variables individually. This can have the effect of making the variable distributions more symmetric.
If the transformation parameters are estimated to be very
close to the bounds, or if the optimization fails, a value of
NA
is used and no transformation is applied.
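For a fixed lambda, the transformation itself can be written in a few lines of base R using the standard piecewise definition from Yeo and Johnson (2000). This is a sketch: yeo_johnson() is a hypothetical helper, not the package's internal function, and it does not estimate lambda or apply the limits described above.

```r
# Hypothetical helper: Yeo-Johnson transformation for a given lambda
yeo_johnson <- function(y, lambda, eps = 1e-8) {
  out <- numeric(length(y))
  pos <- y >= 0

  # Non-negative values: a Box-Cox-like transform of (y + 1)
  if (abs(lambda) < eps) {
    out[pos] <- log1p(y[pos])
  } else {
    out[pos] <- ((y[pos] + 1)^lambda - 1) / lambda
  }

  # Negative values: the mirrored transform with parameter (2 - lambda)
  if (abs(lambda - 2) < eps) {
    out[!pos] <- -log1p(-y[!pos])
  } else {
    out[!pos] <- -((-y[!pos] + 1)^(2 - lambda) - 1) / (2 - lambda)
  }

  out
}

y <- c(-2, -1, 0, 1, 2)
identity_case <- yeo_johnson(y, lambda = 1)  # lambda = 1 leaves the data unchanged
log_case <- yeo_johnson(y, lambda = 0)       # log(y + 1) for non-negative values
```

Unlike Box-Cox, the function accepts zero and negative inputs, which is the property the description above highlights.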
An updated version of recipe
with the new step added to the
sequence of any existing operations.
When you tidy()
this step, a tibble is returned with
columns terms
, value
, and id
:
terms: character, the selectors or variables selected
value: numeric, the lambda estimate
id: character, id of this step
The underlying operation does not allow for case weights.
Yeo, I. K., and Johnson, R. A. (2000). A new family of power transformations to improve normality or symmetry. Biometrika.
Other individual transformation steps:
step_BoxCox()
,
step_bs()
,
step_harmonic()
,
step_hyperbolic()
,
step_inverse()
,
step_invlogit()
,
step_log()
,
step_logit()
,
step_mutate()
,
step_ns()
,
step_percentile()
,
step_poly()
,
step_relu()
,
step_sqrt()
library(recipes)

data(biomass, package = "modeldata")

biomass_tr <- biomass[biomass$dataset == "Training", ]
biomass_te <- biomass[biomass$dataset == "Testing", ]

rec <- recipe(
  HHV ~ carbon + hydrogen + oxygen + nitrogen + sulfur,
  data = biomass_tr
)

yj_transform <- step_YeoJohnson(rec, all_numeric())

yj_estimates <- prep(yj_transform, training = biomass_tr)

yj_te <- bake(yj_estimates, biomass_te)

plot(density(biomass_te$sulfur), main = "before")
plot(density(yj_te$sulfur), main = "after")

tidy(yj_transform, number = 1)
tidy(yj_estimates, number = 1)
step_zv()
creates a specification of a recipe step that will remove
variables that contain only a single value.
step_zv( recipe, ..., role = NA, trained = FALSE, group = NULL, removals = NULL, skip = FALSE, id = rand_id("zv") )
recipe |
A recipe object. The step will be added to the sequence of operations for this recipe. |
... |
One or more selector functions to choose variables
for this step. See |
role |
Not used by this step since no new variables are created. |
trained |
A logical to indicate if the quantities for preprocessing have been estimated. |
group |
An optional character string or call to |
removals |
A character string that contains the names of
columns that should be removed. These values are not determined
until |
skip |
A logical. Should the step be skipped when the
recipe is baked by |
id |
A character string that is unique to this step to identify it. |
This step can potentially remove columns from the data set. This may cause issues for subsequent steps in your recipe if the missing columns are specifically referenced by name. To avoid this, see the advice in the Tips for saving recipes and filtering columns section of selections.
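The kind of filter this step applies can be sketched in base R: flag columns that contain only a single value, then drop them. The data frame and object names below are made up, and details such as how the step treats missing values or the group argument are not covered, so this is an approximation rather than the step's implementation.

```r
# Made-up data: column `b` holds a single value
dat <- data.frame(
  a = c(1, 2, 3),
  b = c(7, 7, 7),
  c = c("x", "x", "y")
)

# Flag columns with exactly one unique value
is_zv <- vapply(dat, function(col) length(unique(col)) == 1, logical(1))

removals <- names(dat)[is_zv]
filtered <- dat[, !is_zv, drop = FALSE]

removals
names(filtered)
```

A later step that referred to `b` by name would fail after this filter, which is the situation the paragraph above warns about.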
An updated version of recipe
with the new step added to the
sequence of any existing operations.
When you tidy()
this step, a tibble is returned with
columns terms
and id
:
terms: character, names of the columns that will be removed
id: character, id of this step
The underlying operation does not allow for case weights.
Other variable filter steps:
step_corr()
,
step_filter_missing()
,
step_lincomb()
,
step_nzv()
,
step_rm()
,
step_select()
data(biomass, package = "modeldata") biomass$one_value <- 1 biomass_tr <- biomass[biomass$dataset == "Training", ] biomass_te <- biomass[biomass$dataset == "Testing", ] rec <- recipe(HHV ~ carbon + hydrogen + oxygen + nitrogen + sulfur + one_value, data = biomass_tr ) zv_filter <- rec %>% step_zv(all_predictors()) filter_obj <- prep(zv_filter, training = biomass_tr) filtered_te <- bake(filter_obj, biomass_te) any(names(filtered_te) == "one_value") tidy(zv_filter, number = 1) tidy(filter_obj, number = 1)
This function prints the current set of variables/features and some of their characteristics.
## S3 method for class 'recipe'
summary(object, original = FALSE, ...)
object |
A |
original |
A logical: show the current set of variables or the original set when the recipe was defined. |
... |
further arguments passed to or from other methods (not currently used). |
Note that, until the recipe has been trained, the current and original variables are the same.
It is possible for variables to have multiple roles by adding them with add_role(). If a variable has multiple roles, it will have more than one row in the summary tibble.
A tibble with columns variable, type, role, and source. When original = TRUE, an additional column named required_to_bake is included (based on the results of update_role_requirements()).
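The effect of multiple roles on the summary can be sketched as follows. This is a minimal illustration using the built-in mtcars data; the role name "weight_info" is invented for the example.

```r
library(recipes)

rec <- recipe(mpg ~ disp + wt, data = mtcars)

# Give `wt` a second role in addition to "predictor".
# "weight_info" is a hypothetical custom role name.
rec <- add_role(rec, wt, new_role = "weight_info")

info <- summary(rec)
info

# `wt` appears once per role it holds.
wt_rows <- info[info$variable == "wt", ]
wt_rows$role
```

Because the summary has one row per variable and role pair, counting rows is not the same as counting columns in the data.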
rec <- recipe(~., data = USArrests)
summary(rec)

rec <- step_pca(rec, all_numeric(), num_comp = 3)
summary(rec) # still the same since not yet trained

rec <- prep(rec, training = USArrests)
summary(rec)
tidy will return a data frame that contains information regarding a recipe or operation within the recipe (when a tidy method for the operation exists).
## S3 method for class 'step_BoxCox' tidy(x, ...) ## S3 method for class 'step_YeoJohnson' tidy(x, ...) ## S3 method for class 'step_arrange' tidy(x, ...) ## S3 method for class 'step_bin2factor' tidy(x, ...) ## S3 method for class 'step_bs' tidy(x, ...) ## S3 method for class 'step_center' tidy(x, ...) ## S3 method for class 'check_class' tidy(x, ...) ## S3 method for class 'step_classdist' tidy(x, ...) ## S3 method for class 'step_classdist_shrunken' tidy(x, ...) ## S3 method for class 'check_cols' tidy(x, ...) ## S3 method for class 'step_corr' tidy(x, ...) ## S3 method for class 'step_count' tidy(x, ...) ## S3 method for class 'step_cut' tidy(x, ...) ## S3 method for class 'step_date' tidy(x, ...) ## S3 method for class 'step_depth' tidy(x, ...) ## S3 method for class 'step_discretize' tidy(x, ...) ## S3 method for class 'step_dummy' tidy(x, ...) ## S3 method for class 'step_dummy_extract' tidy(x, ...) ## S3 method for class 'step_dummy_multi_choice' tidy(x, ...) ## S3 method for class 'step_factor2string' tidy(x, ...) ## S3 method for class 'step_filter' tidy(x, ...) ## S3 method for class 'step_filter_missing' tidy(x, ...) ## S3 method for class 'step_geodist' tidy(x, ...) ## S3 method for class 'step_harmonic' tidy(x, ...) ## S3 method for class 'step_holiday' tidy(x, ...) ## S3 method for class 'step_hyperbolic' tidy(x, ...) ## S3 method for class 'step_ica' tidy(x, ...) ## S3 method for class 'step_impute_bag' tidy(x, ...) ## S3 method for class 'step_impute_knn' tidy(x, ...) ## S3 method for class 'step_impute_linear' tidy(x, ...) ## S3 method for class 'step_impute_lower' tidy(x, ...) ## S3 method for class 'step_impute_mean' tidy(x, ...) ## S3 method for class 'step_impute_median' tidy(x, ...) ## S3 method for class 'step_impute_mode' tidy(x, ...) ## S3 method for class 'step_impute_roll' tidy(x, ...) ## S3 method for class 'step_indicate_na' tidy(x, ...) ## S3 method for class 'step_integer' tidy(x, ...) 
## S3 method for class 'step_interact' tidy(x, ...) ## S3 method for class 'step_intercept' tidy(x, ...) ## S3 method for class 'step_inverse' tidy(x, ...) ## S3 method for class 'step_invlogit' tidy(x, ...) ## S3 method for class 'step_isomap' tidy(x, ...) ## S3 method for class 'step_kpca' tidy(x, ...) ## S3 method for class 'step_kpca_poly' tidy(x, ...) ## S3 method for class 'step_kpca_rbf' tidy(x, ...) ## S3 method for class 'step_lag' tidy(x, ...) ## S3 method for class 'step_lincomb' tidy(x, ...) ## S3 method for class 'step_log' tidy(x, ...) ## S3 method for class 'step_logit' tidy(x, ...) ## S3 method for class 'check_missing' tidy(x, ...) ## S3 method for class 'step_mutate' tidy(x, ...) ## S3 method for class 'step_mutate_at' tidy(x, ...) ## S3 method for class 'step_naomit' tidy(x, ...) ## S3 method for class 'check_new_values' tidy(x, ...) ## S3 method for class 'step_nnmf' tidy(x, ...) ## S3 method for class 'step_nnmf_sparse' tidy(x, ...) ## S3 method for class 'step_normalize' tidy(x, ...) ## S3 method for class 'step_novel' tidy(x, ...) ## S3 method for class 'step_ns' tidy(x, ...) ## S3 method for class 'step_num2factor' tidy(x, ...) ## S3 method for class 'step_nzv' tidy(x, ...) ## S3 method for class 'step_ordinalscore' tidy(x, ...) ## S3 method for class 'step_other' tidy(x, ...) ## S3 method for class 'step_pca' tidy(x, type = "coef", ...) ## S3 method for class 'step_percentile' tidy(x, ...) ## S3 method for class 'step_pls' tidy(x, ...) ## S3 method for class 'step_poly' tidy(x, ...) ## S3 method for class 'step_poly_bernstein' tidy(x, ...) ## S3 method for class 'step_profile' tidy(x, ...) ## S3 method for class 'step_range' tidy(x, ...) ## S3 method for class 'check_range' tidy(x, ...) ## S3 method for class 'step_ratio' tidy(x, ...) ## S3 method for class 'step_regex' tidy(x, ...) ## S3 method for class 'step_relevel' tidy(x, ...) ## S3 method for class 'step_relu' tidy(x, ...) ## S3 method for class 'step_rename' tidy(x, ...) 
## S3 method for class 'step_rename_at' tidy(x, ...) ## S3 method for class 'step_rm' tidy(x, ...) ## S3 method for class 'step_sample' tidy(x, ...) ## S3 method for class 'step_scale' tidy(x, ...) ## S3 method for class 'step_select' tidy(x, ...) ## S3 method for class 'step_shuffle' tidy(x, ...) ## S3 method for class 'step_slice' tidy(x, ...) ## S3 method for class 'step_spatialsign' tidy(x, ...) ## S3 method for class 'step_spline_b' tidy(x, ...) ## S3 method for class 'step_spline_convex' tidy(x, ...) ## S3 method for class 'step_spline_monotone' tidy(x, ...) ## S3 method for class 'step_spline_natural' tidy(x, ...) ## S3 method for class 'step_spline_nonnegative' tidy(x, ...) ## S3 method for class 'step_sqrt' tidy(x, ...) ## S3 method for class 'step_string2factor' tidy(x, ...) ## S3 method for class 'recipe' tidy(x, number = NA, id = NA, ...) ## S3 method for class 'step' tidy(x, ...) ## S3 method for class 'check' tidy(x, ...) ## S3 method for class 'step_time' tidy(x, ...) ## S3 method for class 'step_unknown' tidy(x, ...) ## S3 method for class 'step_unorder' tidy(x, ...) ## S3 method for class 'step_window' tidy(x, ...) ## S3 method for class 'step_zv' tidy(x, ...)
x |
A |
... |
Not currently used. |
type |
For |
number |
An integer or |
id |
A character string or |
A tibble with columns that vary depending on which tidy method is executed. When number and id are both NA, the result is a tibble with columns number (the operation iteration), operation (either "step" or "check"), type (the method, e.g. "nzv", "center"), a logical column called trained for whether the operation has been estimated using prep, a logical column skip, and a character column id.
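The two ways of calling tidy() on a recipe can be sketched as below: with no number or id it gives an overview of all operations, and with either one it dispatches to the tidy method of a single operation. The ids "center_demo" and "scale_demo" are made up for this illustration.

```r
library(recipes)

rec <- recipe(~., data = USArrests)
rec <- step_center(rec, all_numeric(), id = "center_demo")
rec <- step_scale(rec, all_numeric(), id = "scale_demo")

# Overview: one row per operation in the recipe.
overview <- tidy(rec)
overview

# A single operation can be picked by position or by id.
by_number <- tidy(rec, number = 2)
by_id <- tidy(rec, id = "scale_demo")

by_number
by_id
```

Selecting by id tends to be more robust than selecting by number when steps are later added or reordered, which is one reason to set id explicitly.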
data(Sacramento, package = "modeldata")

Sacramento_rec <- recipe(~., data = Sacramento) %>%
  step_other(all_nominal(), threshold = 0.05, other = "another") %>%
  step_center(all_numeric()) %>%
  step_dummy(all_nominal()) %>%
  check_cols(ends_with("ude"), sqft, price)

tidy(Sacramento_rec)

tidy(Sacramento_rec, number = 2)
tidy(Sacramento_rec, number = 3)

Sacramento_rec_trained <- prep(Sacramento_rec, training = Sacramento)

tidy(Sacramento_rec_trained)
tidy(Sacramento_rec_trained, number = 3)
tidy(Sacramento_rec_trained, number = 4)
update_role_requirements() allows you to fine-tune the requirements of the various roles you might come across in recipes (see update_role() for general information about roles). Role requirements can only be altered for roles that exist in the original data supplied to recipe(); they are not applied to columns computed by steps.
Like update_role(), update_role_requirements() is applied to the recipe immediately, unlike the step_*() functions, which do most of their work at prep() time.
update_role_requirements(recipe, role, ..., bake = NULL)
recipe |
A recipe. |
role |
A string representing the role that you'd like to modify the requirements of. This must be a role that already exists in the recipe. |
... |
These dots are for future extensions and must be empty. |
bake |
At Must be a single The following represents the default bake time requirements of specific types of roles:
|
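One way to check the effect of the bake argument is to look at the required_to_bake column that summary() adds when original = TRUE. The sketch below assumes a small made-up data frame with a column var that is given the custom role "id"; it is an illustration, not part of the package examples.

```r
library(recipes)

# Hypothetical data: `var` is carried along but not used by the model.
df <- data.frame(
  y = c(1, 2, 3),
  x = c(4, 5, 6),
  var = c("a", "b", "c")
)

rec <- recipe(y ~ ., data = df)
rec <- update_role(rec, var, new_role = "id")

# Requirements before any change.
before <- summary(rec, original = TRUE)
before[, c("variable", "role", "required_to_bake")]

# Relax the bake-time requirement for the "id" role.
rec <- update_role_requirements(rec, "id", bake = FALSE)

after <- summary(rec, original = TRUE)
after[, c("variable", "role", "required_to_bake")]
```

Comparing the two summaries shows which roles are affected; only the "id" role is changed here, and the other rows should be the same in both.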
df <- tibble(y = c(1, 2, 3), x = c(4, 5, 6), var = c("a", "b", "c"))

# Let's assume that you have a `var` column that isn't used in the recipe.
# We typically recommend that you remove this column before passing the
# `data` to `recipe()`, but for now let's pass it through and assign it an
# `"id"` role.
rec <- recipe(y ~ ., df) %>%
  update_role(var, new_role = "id") %>%
  step_center(x)

prepped <- prep(rec, df)

# Now assume you have some "new data" and you are ready to `bake()` it
# to prepare it for prediction purposes. Here, you might not have `var`
# available as a column because it isn't important to your model.
new_data <- df[c("y", "x")]

# By default `var` is required at `bake()` time because we don't know if
# you actually use it in the recipe or not
try(bake(prepped, new_data))

# You can turn off this check by using `update_role_requirements()` and
# setting `bake = FALSE` for the `"id"` role. We recommend doing this on
# the original unprepped recipe, but it will also work on a prepped recipe.
rec <- update_role_requirements(rec, "id", bake = FALSE)
prepped <- prep(rec, df)

# Now you can `bake()` on `new_data` even though `var` is missing
bake(prepped, new_data)
This step method for update() takes named arguments as ... whose values will replace the elements of the same name in the actual step.
## S3 method for class 'step'
update(object, ...)
object |
A recipe step. |
... |
Key-value pairs where the keys match up with names of elements in the step, and the values are the new values to update the step with. |
For a step to be updated, it must not already have been trained. Otherwise, conflicting information can arise between the data returned from bake(object, new_data = NULL) and the information in the step.
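The restriction on trained steps can be sketched with a short example. It uses step_log() on the built-in mtcars data purely for illustration and captures the error with tryCatch() so the script keeps running; the object names are made up.

```r
library(recipes)

rec <- recipe(mpg ~ disp + wt, data = mtcars)
rec <- step_log(rec, disp, base = 10)

# Updating an untrained step replaces the element of the same name.
rec$steps[[1]] <- update(rec$steps[[1]], base = 2)
new_base <- rec$steps[[1]]$base
new_base

# After prep(), the step is trained and can no longer be updated.
prepped <- prep(rec, training = mtcars)
attempt <- tryCatch(
  update(prepped$steps[[1]], base = 10),
  error = function(e) e
)
inherits(attempt, "error")
```

If a different value is needed after training, the usual route is to update the step in the untrained recipe and call prep() again.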
data(biomass, package = "modeldata")

biomass_tr <- biomass[biomass$dataset == "Training", ]
biomass_te <- biomass[biomass$dataset == "Testing", ]

# Create a recipe using step_bs() with degree = 3
rec <- recipe(
  HHV ~ carbon + hydrogen + oxygen + nitrogen + sulfur,
  data = biomass_tr
) %>%
  step_bs(carbon, hydrogen, degree = 3)

# Update the step to use degree = 4
rec2 <- rec
rec2$steps[[1]] <- update(rec2$steps[[1]], degree = 4)

# Prep both recipes
rec_prepped <- prep(rec, training = biomass_tr)
rec2_prepped <- prep(rec2, training = biomass_tr)

# To see what changed
bake(rec_prepped, new_data = NULL)
bake(rec2_prepped, new_data = NULL)

# Cannot update a recipe step that has been trained!
## Not run: 
update(rec_prepped$steps[[1]], degree = 4)
## End(Not run)