Title: 'tidymodels' Integration with 'h2o'
Description: Create and evaluate models using 'tidymodels' and 'h2o' <https://h2o.ai/>. The package enables users to specify 'h2o' as an engine for several modeling methods.
Authors: Max Kuhn [aut], Qiushi Yan [aut, cre], Steven Pawley [aut], Posit Software, PBC [cph, fnd]
Maintainer: Qiushi Yan <[email protected]>
License: MIT + file LICENSE
Version: 0.1.4.9000
Built: 2024-11-10 03:46:13 UTC
Source: https://github.com/tidymodels/agua
Control model tuning via h2o::h2o.grid()
agua_backend_options(parallelism = 1)
parallelism: Level of parallelism during grid model building. 1 = sequential building (default). Use 0 for adaptive parallelism, decided by H2O. Any number > 1 sets the exact number of models built in parallel.
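A minimal sketch of supplying these options when tuning an h2o model. The boosted tree specification, resamples, and grid size are illustrative, and routing the options through control_grid(backend_options = ...) assumes a version of tune that supports backend options.

if (h2o_running()) {
  spec <- boost_tree(trees = tune()) %>%
    set_engine("h2o") %>%
    set_mode("regression")

  # build up to 5 grid models at a time on the h2o server
  grid_ctrl <- control_grid(
    backend_options = agua_backend_options(parallelism = 5)
  )

  res <- tune_grid(
    spec,
    mpg ~ .,
    resamples = vfold_cv(mtcars, v = 3),
    grid = 5,
    control = grid_ctrl
  )
}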
Data conversion tools
as_h2o(df, destination_frame_prefix = "object")

## S3 method for class 'H2OFrame'
as_tibble(
  x,
  ...,
  .rows = NULL,
  .name_repair = c("check_unique", "unique", "universal", "minimal"),
  rownames = pkgconfig::get_config("tibble::rownames", NULL)
)
df: An R data frame.
destination_frame_prefix: A character string to use as the base name.
x: An H2OFrame.
...: Unused, for extensibility.
.rows: The number of rows, useful to create a 0-column tibble or just as an additional check.
.name_repair: Treatment of problematic column names: "check_unique" (the default), "unique", "universal", or "minimal". This argument is passed on as repair to vctrs::vec_as_names().
rownames: How to treat existing row names of a data frame or matrix. Read more in rownames.
A tibble or, for as_h2o(), a list with data (an H2OFrame) and id (the id on the h2o server).
# start with h2o::h2o.init()
if (h2o_running()) {
  cars2 <- as_h2o(mtcars)
  cars2
  class(cars2$data)

  cars0 <- as_tibble(cars2$data)
  cars0
}
The autoplot() method plots the cross-validation performance of candidate models in the H2O AutoML output, with one facet per metric.
## S3 method for class 'workflow'
autoplot(object, ...)

## S3 method for class 'H2OAutoML'
autoplot(
  object,
  type = c("rank", "metric"),
  metric = NULL,
  std_errs = qnorm(0.95),
  ...
)
object: A fitted auto_ml() model.
...: Other options to pass to autoplot().
type: A character value for whether to plot average ranking ("rank") or metrics ("metric").
metric: A character vector or NULL for which metric to plot. By default, all metrics will be shown via facets.
std_errs: The number of standard errors to plot.
A ggplot object.
if (h2o_running()) {
  auto_fit <- auto_ml() %>%
    set_engine("h2o", max_runtime_secs = 5) %>%
    set_mode("regression") %>%
    fit(mpg ~ ., data = mtcars)

  autoplot(auto_fit)
}
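A short sketch of the per-metric view using the type, metric, and std_errs arguments described above; it assumes "rmse" is among the metrics reported for a regression AutoML run.

if (h2o_running()) {
  auto_fit <- auto_ml() %>%
    set_engine("h2o", max_runtime_secs = 5) %>%
    set_mode("regression") %>%
    fit(mpg ~ ., data = mtcars)

  # plot the metric values themselves rather than rankings
  autoplot(auto_fit, type = "metric", metric = "rmse", std_errs = 2)
}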
Prediction wrappers for fitted models with the h2o engine that include data conversion, h2o server cleanup, and so on.
h2o_predict(object, new_data, ...)

h2o_predict_classification(object, new_data, type = "class", ...)

h2o_predict_regression(object, new_data, type = "numeric", ...)

## S3 method for class '_H2OAutoML'
predict(object, new_data, id = NULL, ...)
object: A fitted model object (the underlying H2O model for the h2o_predict_* wrappers, or a parsnip fit containing AutoML results for the predict() method).
new_data: A rectangular data object, such as a data frame.
...: Other options passed to h2o::h2o.predict().
type: A single character value for the type of predictions, such as "numeric", "class", or "raw".
id: Model id in AutoML results.
For AutoML, prediction is based on the best performing model.

For type != "raw", a prediction data frame with the same number of rows as new_data. For type == "raw", the result of h2o::h2o.predict().
if (h2o_running()) {
  spec <- rand_forest(mtry = 3, trees = 100) %>%
    set_engine("h2o") %>%
    set_mode("regression")

  set.seed(1)
  mod <- fit(spec, mpg ~ ., data = mtcars)

  h2o_predict_regression(mod$fit, new_data = head(mtcars), type = "numeric")

  # using parsnip
  predict(mod, new_data = head(mtcars))
}
Utility functions for interacting with the h2o server
h2o_start()

h2o_end()

h2o_running(verbose = FALSE)

h2o_remove(id)

h2o_remove_all()

h2o_get_model(id)

h2o_get_frame(id)

h2o_xgboost_available()
verbose: A logical; print a message if no cluster is available.
id: Model or frame id.
## Not run:
if (!h2o_running()) {
  h2o_start()
}
## End(Not run)
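A small sketch of the remaining helpers, assuming a cluster started with h2o_start(); the frame id comes from the object returned by as_h2o() earlier in the session.

## Not run:
h2o_start()

# check the cluster and optional xgboost support
h2o_running(verbose = TRUE)
h2o_xgboost_available()

# look up and remove objects by id; as_h2o() returns the frame id
cars2 <- as_h2o(mtcars)
h2o_get_frame(cars2$id)
h2o_remove(cars2$id)

# remove everything and shut the server down
h2o_remove_all()
h2o_end()
## End(Not run)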
Basic model wrappers for h2o model functions that include data conversion, seed configuration, and so on.
h2o_train(
  x,
  y,
  model,
  weights = NULL,
  validation = NULL,
  save_data = FALSE,
  ...
)

h2o_train_rf(x, y, ntrees = 50, mtries = -1, min_rows = 1, ...)

h2o_train_xgboost(
  x,
  y,
  ntrees = 50,
  max_depth = 6,
  min_rows = 1,
  learn_rate = 0.3,
  sample_rate = 1,
  col_sample_rate = 1,
  min_split_improvement = 0,
  stopping_rounds = 0,
  validation = NULL,
  ...
)

h2o_train_gbm(
  x,
  y,
  ntrees = 50,
  max_depth = 6,
  min_rows = 1,
  learn_rate = 0.3,
  sample_rate = 1,
  col_sample_rate = 1,
  min_split_improvement = 0,
  stopping_rounds = 0,
  ...
)

h2o_train_glm(x, y, lambda = NULL, alpha = NULL, ...)

h2o_train_nb(x, y, laplace = 0, ...)

h2o_train_mlp(
  x,
  y,
  hidden = 200,
  l2 = 0,
  hidden_dropout_ratios = 0,
  epochs = 10,
  activation = "Rectifier",
  validation = NULL,
  ...
)

h2o_train_rule(
  x,
  y,
  rule_generation_ntrees = 50,
  max_rule_length = 5,
  lambda = NULL,
  ...
)

h2o_train_auto(x, y, verbosity = NULL, save_data = FALSE, ...)
x: A data frame of predictors.
y: A vector of outcomes.
model: A character string for the model. Current selections correspond to the wrapper functions below (random forest, xgboost, gbm, glm, naive Bayes, neural network, rule-based models, and AutoML).
weights: A numeric vector of case weights.
validation: A number between 0 and 1 specifying the proportion of the data reserved as a validation set. This is used by h2o for performance assessment and potential early stopping. Defaults to 0.
save_data: A logical for whether the training data should be saved on the h2o server; set this to TRUE for refit() to work.
...: Other options to pass to the h2o model functions (e.g., h2o::h2o.randomForest()).
ntrees: Number of trees. Defaults to 50.
mtries: Number of variables randomly sampled as candidates at each split. If set to -1, defaults to sqrt(p) for classification and p/3 for regression, where p is the number of predictors. Defaults to -1.
min_rows: Fewest allowed (weighted) observations in a leaf. Defaults to 1.
max_depth: Maximum tree depth (0 for unlimited). Defaults to 20.
learn_rate: Learning rate, from 0.0 to 1.0 (same as eta). Defaults to 0.3.
sample_rate: Row sample rate per tree, from 0.0 to 1.0. Defaults to 0.632.
col_sample_rate: Column sample rate, from 0.0 to 1.0 (same as colsample_bylevel). Defaults to 1.
min_split_improvement: Minimum relative improvement in squared error reduction for a split to happen. Defaults to 1e-05.
stopping_rounds: Early stopping based on convergence of stopping_metric. Stop if the simple moving average of length k of the stopping_metric does not improve for k := stopping_rounds scoring events (0 to disable). Defaults to 0.
lambda: Regularization strength.
alpha: Distribution of regularization between the L1 (Lasso) and L2 (Ridge) penalties. A value of 1 for alpha represents Lasso regression, a value of 0 produces Ridge regression, and anything in between specifies the amount of mixing between the two. The default value of alpha is 0 when SOLVER = 'L-BFGS'; 0.5 otherwise.
laplace: Laplace smoothing parameter. Defaults to 0.
hidden: Hidden layer sizes (e.g. [100, 100]). Defaults to c(200, 200).
l2: L2 regularization (can add stability and improve generalization, causes many weights to be small). Defaults to 0.
hidden_dropout_ratios: Hidden layer dropout ratios (can improve generalization); specify one value per hidden layer. Defaults to 0.5.
epochs: How many times the dataset should be iterated (streamed); can be fractional. Defaults to 10.
activation: Activation function. Must be one of: "Tanh", "TanhWithDropout", "Rectifier", "RectifierWithDropout", "Maxout", "MaxoutWithDropout". Defaults to "Rectifier".
rule_generation_ntrees: Specifies the number of trees to build in the tree model. Defaults to 50.
max_rule_length: Maximum length of rules. Defaults to 3.
verbosity: Verbosity of the backend messages printed during training. Must be one of NULL (live log disabled), "debug", "info", "warn", or "error". Defaults to NULL.
An h2o model object.
# start with h2o::h2o.init()
if (h2o_running()) {
  # ---------------------------------------------------------------------------
  # Using the model wrappers:
  h2o_train_glm(mtcars[, -1], mtcars$mpg)

  # ---------------------------------------------------------------------------
  # using parsnip:
  spec <- rand_forest(mtry = 3, trees = 500) %>%
    set_engine("h2o") %>%
    set_mode("regression")

  set.seed(1)
  mod <- fit(spec, mpg ~ ., data = mtcars)
  mod

  predict(mod, head(mtcars))
}
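A sketch combining the validation and stopping_rounds arguments described above for early stopping with the xgboost wrapper; the specific values are illustrative.

if (h2o_running() && h2o_xgboost_available()) {
  # Reserve 20% of the rows as a validation set and stop when the stopping
  # metric has not improved for 3 scoring events.
  h2o_train_xgboost(
    x = mtcars[, -1],
    y = mtcars$mpg,
    ntrees = 500,
    learn_rate = 0.05,
    validation = 0.2,
    stopping_rounds = 3
  )
}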
Functions that return a tibble describing model performance.
rank_results() ranks the average cross validation performance of candidate models on each metric.

collect_metrics() computes average statistics of performance metrics (summarized) for each model, or the raw values in each resample (unsummarized).

tidy() computes average performance for each model.

member_weights() computes member importance for stacked ensemble models, i.e., the relative importance of base models in the meta-learner. This is typically the coefficient magnitude in the second-level GLM model.

extract_fit_engine() extracts a single candidate model from auto_ml() results. When id is NULL, it returns the leader model.

refit() re-fits an existing AutoML model to add more candidates. The model to be re-fitted needs to have the engine argument save_data = TRUE, and keep_cross_validation_predictions = TRUE if stacked ensembles are needed for later models; a sketch follows below.
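A minimal sketch of the refit() workflow. The engine arguments follow the requirements above; the runtime setting is illustrative.

if (h2o_running()) {
  # Keep the training data and CV predictions so that later candidates,
  # including stacked ensembles, can be added.
  auto_fit <- auto_ml() %>%
    set_engine(
      "h2o",
      max_runtime_secs = 30,
      save_data = TRUE,
      keep_cross_validation_predictions = TRUE
    ) %>%
    set_mode("regression") %>%
    fit(mpg ~ ., data = mtcars)

  # Add more candidate models to the existing AutoML results.
  refit(auto_fit)
}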
## S3 method for class 'workflow'
rank_results(x, ...)

## S3 method for class '_H2OAutoML'
rank_results(x, ...)

## S3 method for class 'H2OAutoML'
rank_results(x, n = NULL, id = NULL, ...)

## S3 method for class 'workflow'
collect_metrics(x, ...)

## S3 method for class '_H2OAutoML'
collect_metrics(x, ...)

## S3 method for class 'H2OAutoML'
collect_metrics(x, summarize = TRUE, n = NULL, id = NULL, ...)

## S3 method for class '_H2OAutoML'
tidy(x, n = NULL, id = NULL, keep_model = TRUE, ...)

get_leaderboard(x, n = NULL, id = NULL)

member_weights(x, ...)

## S3 method for class '_H2OAutoML'
extract_fit_parsnip(x, id = NULL, ...)

## S3 method for class '_H2OAutoML'
extract_fit_engine(x, id = NULL, ...)

## S3 method for class 'workflow'
refit(object, ...)

## S3 method for class '_H2OAutoML'
refit(object, verbosity = NULL, ...)
...: Not used.
n: An integer for the number of top models to extract from AutoML results; defaults to all.
id: A character vector of model ids to retrieve.
summarize: A logical; should metrics be summarized over resamples (TRUE) or returned for each individual resample (FALSE)?
keep_model: A logical value for whether the actual model object should be retrieved from the server. Defaults to TRUE.
object, x: A fitted auto_ml() model or H2O AutoML results.
verbosity: Verbosity of the backend messages printed during training. Must be one of NULL (live log disabled), "debug", "info", "warn", or "error". Defaults to NULL.
H2O associates a unique id with each model in AutoML. This id can be used for model extraction and prediction, i.e., extract_fit_engine(x, id = id) returns the model and predict(x, id = id) will predict for that model. extract_fit_parsnip(x, id = id) wraps the h2o model in a parsnip model object, although refitting the wrapped parsnip model object is discouraged.

The algorithm column corresponds to the model family H2O uses for a particular model, including xgboost ("XGBOOST"), gradient boosting ("GBM"), random forest and variants ("DRF", "XRT"), generalized linear models ("GLM"), and neural networks ("deeplearning"). See the details section in h2o::h2o.automl() for more information.
if (h2o_running()) {
  auto_fit <- auto_ml() %>%
    set_engine("h2o", max_runtime_secs = 5) %>%
    set_mode("regression") %>%
    fit(mpg ~ ., data = mtcars)

  rank_results(auto_fit, n = 5)
  collect_metrics(auto_fit, summarize = FALSE)
  tidy(auto_fit)
  member_weights(auto_fit)
}
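A short sketch of the id-based extraction and prediction described in the details above. The id column name in the rank_results() output is an assumption; substitute whichever column holds the model ids.

if (h2o_running()) {
  auto_fit <- auto_ml() %>%
    set_engine("h2o", max_runtime_secs = 5) %>%
    set_mode("regression") %>%
    fit(mpg ~ ., data = mtcars)

  # With no id, the leader model is returned.
  leader <- extract_fit_engine(auto_fit)

  # Candidate ids appear in the ranking output; the column name `id` is an
  # assumption about that tibble.
  ids <- unique(rank_results(auto_fit, n = 5)$id)
  predict(auto_fit, new_data = head(mtcars), id = ids[1])
}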