--- title: "Ordering of steps" output: rmarkdown::html_vignette description: | The order in which recipe steps are specified matters, and this vignette gives some general suggestions that you should consider. vignette: > %\VignetteEngine{knitr::rmarkdown} %\VignetteIndexEntry{Ordering of steps} %\VignetteEncoding{UTF-8} --- In the recipes package, there are no constraints on the order in which steps are added to the recipe; you as a user are free to apply steps in the order appropriate to your data preprocessing needs. However, the **order of steps matters** and there are some general suggestions that you should consider. ## Transforming a variable * If using a Box-Cox transformation, don't center the data first or do any operations that might make the data non-positive. * Alternatively, use the Yeo-Johnson transformation so you don't have to worry about this. ## Handling levels in categorical data The order of steps for handling categorical levels is important, because each step sets levels for the next step to use as input. These steps create _factor_ output, even if the input is of character type. * Typically use `step_novel()` before other steps for changing factor levels, so that the new factor level can be set as you desire rather than coerced to `NA` by other factor handling steps. * Use steps like `step_unknown()` and `step_other()` after other steps for changing factor levels. * If you are creating dummy variables from a categorical variable (see below), complete handling of the categorical variable's levels _before_ `step_dummy()`. ## Dummy variables Recipes do not automatically create dummy variables (unlike _most_ formula methods). * If you want to center, scale, or do any other operations on _all_ of the predictors, run `step_dummy()` first so that numeric columns are in the data set instead of factors. * As noted in the [help file for `step_interact()`](https://recipes.tidymodels.org/reference/step_interact.html), you should make dummy variables _before_ creating the interactions. ## Recommended preprocessing outline While every individual project's needs are different, here is a suggested order of _potential_ steps that should work for most problems: 1. Impute 1. Handle factor levels 1. Individual transformations for skewness and other issues 1. Discretize (if needed and if you have no other choice) 1. Create dummy variables 1. Create interactions 1. Normalization steps (center, scale, range, etc) 1. Multivariate transformation (e.g. PCA, spatial sign, etc) Again, your mileage may vary for your particular problem.