| Title: | Run Predictions Inside the Database |
|---|---|
| Description: | It parses a fitted 'R' model object, and returns a formula in 'Tidy Eval' code that calculates the predictions. It works with several database back-ends because it leverages 'dplyr' and 'dbplyr' for the final 'SQL' translation of the algorithm. It currently supports lm(), glm(), randomForest(), ranger(), rpart(), earth(), xgb.Booster.complete(), lgb.Booster(), catboost.Model(), cubist(), and ctree() models. |
| Authors: | Emil Hvitfeldt [aut, cre], Edgar Ruiz [aut], Max Kuhn [aut] |
| Maintainer: | Emil Hvitfeldt <[email protected]> |
| License: | MIT + file LICENSE |
| Version: | 1.1.0.9000 |
| Built: | 2026-05-21 10:35:14 UTC |
| Source: | https://github.com/tidymodels/tidypredict |
Uses an S3 method, dispatched on the model's class, to check that the model's formula can be parsed. It currently scans for unsupported contrasts and for in-line functions (e.g., lm(wt ~ as.factor(am))). Since this function is meant for function interaction, as opposed to human interaction, a successful check is silent.
acceptable_formula(model)
model |
An R model object |
model <- lm(mpg ~ wt, mtcars)
acceptable_formula(model)
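The silent-on-success behavior described above can be exercised with a small sketch. It assumes that an unsupported formula is reported as an R error, which the text does not state explicitly. The `tryCatch()` wrapper and the names `ok_model`, `inline_model`, and `outcome` are illustrative only.

```r
library(tidypredict)

# A plain formula: the check is expected to return silently
ok_model <- lm(mpg ~ wt, data = mtcars)
acceptable_formula(ok_model)

# A formula with an in-line function, as in the example from the description
inline_model <- lm(wt ~ as.factor(am), data = mtcars)

# Assumption: a failed check signals an error; capture its message if so
outcome <- tryCatch(
  {
    acceptable_formula(inline_model)
    "check passed"
  },
  error = function(e) conditionMessage(e)
)
print(outcome)
```

Wrapping the call this way lets calling code decide whether to fall back to predict() when a model cannot be translated.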
Prepares parsed model object
as_parsed_model(x)
x |
A parsed model object |
Construct a single node of a tree
generate_tree_node(node, calc_mode = "")
node |
a list with named elements |
calc_mode |
character, takes values The The This list can contain 0 or more elements. The elements must each be of the following format:
The It can either be a singular value or a list.
If it is a list it will have the following 4 named elements
@keywords internal |
Parses a fitted R model's structure and extracts the components needed to create a dplyr formula for prediction. The parsed model can be serialized (e.g., saved to YAML) and later used to generate predictions without the original model object.
parse_model(model)
model |
An R model object. |
A parsed model object with class parsed_model and a model-specific
subclass (e.g., pm_xgb, pm_tree, pm_regression). The object contains:
$general: List with model metadata including model (model type),
type (used for S3 dispatch), version (parsed model format version),
and model-specific parameters.
Model-specific fields containing coefficients, tree structures, etc.
The $general$version field indicates the parsed model format:
Version 1: Original format. Linear models store coefficients in a
data frame. Tree models use flat case_when() expressions where all leaf
conditions are at the same level.
Version 2: Improved coefficient storage for linear models (lm, earth).
Tree models still use flat case_when().
Version 3: Current format. Tree models (rpart, ranger, randomForest,
xgboost, lightgbm, catboost, partykit, cubist) use nested case_when()
expressions that mirror the tree structure. This produces more efficient
SQL and R code because conditions are evaluated hierarchically rather than
checking all leaf paths.
When loading a parsed model saved with an older version, tidypredict automatically uses the appropriate formula builder for backwards compatibility.
Each parsed model has a type that determines the S3 class used for dispatch:
pm_regression: Linear models (lm, glm, earth, glmnet)
pm_tree: Single trees and forests (rpart, partykit, ranger, randomForest,
cubist)
pm_xgb: XGBoost gradient boosting models
pm_lgb: LightGBM gradient boosting models
pm_catboost: CatBoost gradient boosting models
library(dplyr)
df <- mutate(mtcars, cyl = paste0("cyl", cyl))
model <- lm(mpg ~ wt + cyl * disp, offset = am, data = df)
parse_model(model)
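The description above mentions that a parsed model can be saved to YAML and used later without the original model object. The following is a minimal sketch of that round trip, assuming the 'yaml' package is installed. The temporary file path and the names `parsed`, `reloaded`, and `fit_expr` are illustrative.

```r
library(tidypredict)
library(dplyr)
library(yaml)

model <- lm(mpg ~ wt + cyl, data = mtcars)

# Parse the model and write the parsed structure to a YAML file
parsed <- parse_model(model)
path <- tempfile(fileext = ".yml")
write_yaml(parsed, path)

# Later: read the YAML back and restore the parsed model classes
reloaded <- as_parsed_model(read_yaml(path))

# Build the prediction formula from the reloaded object alone
fit_expr <- tidypredict_fit(reloaded)
preds <- mtcars %>% mutate(pred = !!fit_expr)
head(preds$pred)
```

Because YAML stores numbers as text with limited precision, predictions from the reloaded model may differ slightly from those of predict() on the original object.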
Turn a path object into an expression
path_formula(x)
x |
a list. The input of this function is a list with 4 values.
|
Turn a path object into a combined expression
path_formulas(path)
path |
a list of lists. This list can contain 0 or more elements. The elements must each be of the following format:
|
CatBoost stores categorical features as hash values internally. This function establishes the mapping between hash values and category names by examining a data frame with the same factor columns used during training.
set_catboost_categories(parsed_model, model, data)
parsed_model |
A parsed CatBoost model from |
model |
The original CatBoost model object |
data |
A data frame containing factor columns matching the categorical features used in the model. The factor levels must match those from training. |
This function is only needed when using raw CatBoost models (trained with
catboost.train()). When using parsnip/bonsai, categorical features are
handled automatically and this function is not required.
The parsed model with category mappings added
## Not run:
# For raw CatBoost models with categorical features:
pm <- parse_model(catboost_model)
pm <- set_catboost_categories(pm, catboost_model, training_data)
tidypredict_fit(pm)
# For parsnip/bonsai models, this is not needed:
# tidypredict_fit(parsnip_model_fit) # works automatically
## End(Not run)
Tidy the parsed model results
## S3 method for class 'pm_regression'
tidy(x, ...)
x |
A parsed_model object |
... |
Reserved for future use |
It parses a model or uses an already parsed model to return a Tidy Eval formula that can then be used inside a dplyr command.
tidypredict_fit(model)
model |
An R model or a list with a parsed model. |
model <- lm(mpg ~ wt + cyl * disp, offset = am, data = mtcars)
tidypredict_fit(model)
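As a sketch of using the returned formula "inside a dplyr command", the expression can be injected into mutate() with the `!!` operator. The simpler model and the names `fit_expr`, `scored`, and `max_diff` are illustrative. On a database table the same pattern would rely on 'dbplyr' for SQL translation, which is not shown here.

```r
library(dplyr)
library(tidypredict)

model <- lm(mpg ~ wt + cyl, data = mtcars)

# The returned object is an R expression built from the model coefficients
fit_expr <- tidypredict_fit(model)

# Inject the expression into mutate() to compute predictions row by row
scored <- mtcars %>% mutate(pred = !!fit_expr)

# Largest absolute difference from base R predictions
max_diff <- max(abs(scored$pred - unname(predict(model, mtcars))))
print(max_diff)
```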
It parses a model or uses an already parsed model to return a Tidy Eval formula for the prediction interval that can then be used inside a dplyr command.
tidypredict_interval(model, interval = 0.95)
model |
An R model or a list with a parsed model |
interval |
The prediction interval, defaults to 0.95 |
The result still has to be added to and subtracted from the fit to obtain the upper and lower bound respectively.
model <- lm(mpg ~ wt + cyl * disp, offset = am, data = mtcars)
tidypredict_interval(model)
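The note about adding the result to, and subtracting it from, the fit can be sketched as follows. The column names `half_width`, `lower`, and `upper` are illustrative. The comparison with predict() assumes that the interval corresponds to a prediction interval from predict.lm().

```r
library(dplyr)
library(tidypredict)

model <- lm(mpg ~ wt + cyl, data = mtcars)

fit_expr <- tidypredict_fit(model)
int_expr <- tidypredict_interval(model, interval = 0.95)

# The interval expression yields a half-width; combine it with the fit
bounds <- mtcars %>%
  mutate(
    fit = !!fit_expr,
    half_width = !!int_expr,
    lower = fit - half_width,
    upper = fit + half_width
  )

# Reference values from base R for comparison
reference <- predict(model, mtcars, interval = "prediction", level = 0.95)
head(cbind(bounds[, c("fit", "lower", "upper")], reference))
```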
Compares the results of predict() and tidypredict_to_column() functions.
tidypredict_test(
  model,
  df = model$model,
  threshold = 1e-12,
  include_intervals = FALSE,
  max_rows = NULL,
  xg_df = NULL
)
model |
An R model or a list with a parsed model. It currently supports lm(), glm() and randomForest() models. |
df |
A data frame that contains all of the needed fields to run the prediction. It defaults to the "model" data frame object inside the model object. |
threshold |
The number that a given result difference, between predict() and tidypredict_to_column() should not exceed. For continuous predictions, the default value is 0.000000000001 (1e-12), and for categorical predictions, the default value is 0. |
include_intervals |
Switch to indicate if the prediction intervals should be included in the test. It defaults to FALSE. |
max_rows |
The maximum number of rows of the object passed in the df argument to use in the test. Highly recommended for large data sets. |
xg_df |
An xgb.DMatrix object, required only for XGBoost models. It defaults to NULL. |
model <- lm(mpg ~ wt + cyl * disp, offset = am, data = mtcars)
tidypredict_test(model)
Adds a new column with the results from tidypredict_fit() to a piped command set. If add_interval is set to TRUE, it will add two additional columns: one for the lower and another for the upper prediction interval bound.
tidypredict_to_column(
  df,
  model,
  add_interval = FALSE,
  interval = 0.95,
  vars = c("fit", "upper", "lower")
)
df |
A data.frame or tibble |
model |
An R model or a parsed model inside a data frame |
add_interval |
Switch that indicates if the prediction interval columns should be added. Defaults to FALSE |
interval |
The prediction interval, defaults to 0.95. Ignored if add_interval is set to FALSE |
vars |
The names of the variables that this function will produce. Defaults to "fit", "upper", and "lower". |
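A minimal usage sketch for the arguments above, assuming an lm() model and the built-in mtcars data. The custom column names passed to vars are illustrative and follow the documented order of fit, upper, lower.

```r
library(dplyr)
library(tidypredict)

model <- lm(mpg ~ wt + cyl, data = mtcars)

# Add the fitted value plus upper and lower interval bounds in one step
result <- mtcars %>%
  tidypredict_to_column(
    model,
    add_interval = TRUE,
    interval = 0.95,
    vars = c("pred", "pred_upper", "pred_lower")
  )

head(result[, c("mpg", "pred", "pred_lower", "pred_upper")])
```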