--- title: "Introduction to agua" output: rmarkdown::html_vignette description: Getting started with h2o and tidymodels vignette: > %\VignetteIndexEntry{Introduction to agua} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} --- ```{r setup, include=FALSE} knitr::opts_chunk$set( collapse = TRUE, comment = "#>", fig.width = 8, fig.height = 5.75, out.width = "95%" ) options(digits = 3) ``` ## Introduction The `agua` package provides tidymodels interface to the [H2O](https://h2o.ai/) platform and the [h2o](https://docs.h2o.ai/h2o/latest-stable/h2o-r/docs/index.html) R package. It has two main components - new parsnip engine `'h2o'` for the following models: - `linear_reg()`, `logistic_reg()`, `poisson_reg()`, `multinom_reg()`: All fit penalized generalized linear models. If the model parameters `penalty` and `mixture` are not specified, h2o will internally search for the optimal regularization settings. - `boost_tree()`: . Fits boosted trees via xgboost. Use `h2o::h2o.xgboost.available()` to see if h2o's xgboost is supported on your machine. For classical gradient boosting, use the `'h2o_gbm'` engine. - `rand_forest()`: Random forest models. - `naive_Bayes()`: Naive Bayes models. - `rule_fit()`: RuleFit models. - `mlp()`: Multi-layer feedforward neural networks. - `auto_ml()`: Automatic machine learning. - Infrastructure for the tune package, see [Tuning with agua](https://agua.tidymodels.org/articles/tune.html) for more details. All supported models can accept an additional engine argument `validation`, which is a number between 0 and 1 specifying the _proportion_ of data reserved as validation set. This can used by h2o for performance assessment and potential early stopping. ## Fitting models with the `'h2o'` engine As an example, we will fit a random forest model to the `concrete` data. This will be a regression model with the outcome being the compressive strength of concrete mixtures. ```{r startup, eval = FALSE} library(tidymodels) library(agua) library(ggplot2) tidymodels_prefer() theme_set(theme_bw()) # start h2o server h2o_start() data(concrete, package = "modeldata") concrete <- concrete %>% group_by(across(-compressive_strength)) %>% summarize(compressive_strength = mean(compressive_strength), .groups = "drop") concrete #> # A tibble: 992 × 9 #> cement blast_furn…¹ fly_ash water super…² coars…³ fine_…⁴ age compr…⁵ #> #> 1 102 153 0 192 0 887 942 3 4.57 #> 2 102 153 0 192 0 887 942 7 7.68 #> 3 102 153 0 192 0 887 942 28 17.3 #> 4 102 153 0 192 0 887 942 90 25.5 #> 5 108. 162. 0 204. 0 938. 849 3 2.33 #> 6 108. 162. 0 204. 0 938. 849 7 7.72 #> 7 108. 162. 0 204. 0 938. 849 28 20.6 #> 8 108. 162. 0 204. 0 938. 849 90 29.2 #> 9 116 173 0 192 0 910. 892. 3 6.28 #> 10 116 173 0 192 0 910. 892. 7 10.1 #> # … with 982 more rows, and abbreviated variable names #> # ¹​blast_furnace_slag, ²​superplasticizer, ³​coarse_aggregate, #> # ⁴​fine_aggregate, ⁵​compressive_strength ``` Note that we need to call `h2o_start()` or `h2o::h2o.init()` to start the h2o instance. The h2o server handles computations related to estimation and prediction, and passes the results back to R. agua takes care of data conversion and error handling, it also tries to store as least objects on the server as possible. The h2o server will automatically terminate once R session is closed. You can use `h2o::h2o.removeAll()` to remove all server-side objects and `h2o::h2o.shutdown()` to manually stop the server. 
The rest of the syntax for model fitting and prediction is identical to the usage of any other engine in tidymodels.

```{r rf-fit, eval = FALSE}
set.seed(1501)
concrete_split <- initial_split(concrete, strata = compressive_strength)
concrete_train <- training(concrete_split)
concrete_test <- testing(concrete_split)

rf_spec <- rand_forest(mtry = 3, trees = 500) %>%
  set_engine("h2o", histogram_type = "Random") %>%
  set_mode("regression")

normalized_rec <- recipe(compressive_strength ~ ., data = concrete_train) %>%
  step_normalize(all_predictors())

rf_wflow <- workflow() %>%
  add_model(rf_spec) %>%
  add_recipe(normalized_rec)

rf_fit <- fit(rf_wflow, data = concrete_train)
rf_fit
#> ══ Workflow [trained] ════════════════════════════════════════════════════
#> Preprocessor: Recipe
#> Model: rand_forest()
#>
#> ── Preprocessor ──────────────────────────────────────────────────────────
#> 1 Recipe Step
#>
#> • step_normalize()
#>
#> ── Model ─────────────────────────────────────────────────────────────────
#> Model Details:
#> ==============
#>
#> H2ORegressionModel: drf
#> Model ID:  DRF_model_R_1665503649643_6
#> Model Summary:
#>   number_of_trees number_of_internal_trees model_size_in_bytes min_depth
#> 1             500                      500             2652880        15
#>   max_depth mean_depth min_leaves max_leaves mean_leaves
#> 1        20   17.97600        375        450   417.48000
#>
#>
#> H2ORegressionMetrics: drf
#> ** Reported on training data. **
#> ** Metrics reported on Out-Of-Bag training samples **
#>
#> MSE:  26.5
#> RMSE:  5.15
#> MAE:  3.7
#> RMSLE:  0.169
#> Mean Residual Deviance :  26.5
```

```{r rf-predict, eval = FALSE}
predict(rf_fit, new_data = concrete_test)
#> # A tibble: 249 × 1
#>    .pred
#>    <dbl>
#>  1  6.42
#>  2  9.54
#>  3  9.20
#>  4 25.5
#>  5  6.60
#>  6 28.6
#>  7 10.0
#>  8 31.9
#>  9 12.1
#> 10 11.4
#> # … with 239 more rows
```

Here, we specify the engine argument `histogram_type = "Random"` to use the extremely randomized trees (XRT) algorithm. For all available engine arguments, consult the engine-specific help page for "h2o" of that model. For instance, the h2o link in the help page of `rand_forest()` shows that it uses `h2o::h2o.randomForest()`, whose arguments can be passed in as engine arguments in `set_engine()`.

You can also use `fit_resamples()` with h2o models.

```{r rf-fitresample, eval = FALSE}
concrete_folds <- vfold_cv(concrete_train, strata = compressive_strength)
fit_resamples(rf_wflow, resamples = concrete_folds)
#> # Resampling results
#> # 10-fold cross-validation using stratification
#> # A tibble: 10 × 4
#>    splits id     .metrics .notes
#>    <list> <chr>  <list>   <list>
#>  1        Fold01
#>  2        Fold02
#>  3        Fold03
#>  4        Fold04
#>  5        Fold05
#>  6        Fold06
#>  7        Fold07
#>  8        Fold08
#>  9        Fold09
#> 10        Fold10
```

Variable importance scores can be visualized with the vip package.

```{r, eval = FALSE}
library(vip)
rf_fit %>%
  extract_fit_parsnip() %>%
  vip()
```

```{r, echo = FALSE}
knitr::include_graphics("../man/figures/vip.png")
```
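The vip plot above works from the parsnip fit. If you prefer h2o's native tooling instead, the server-side model can be pulled out of the fitted workflow. A minimal sketch, assuming the h2o server from the earlier fit is still running and that `extract_fit_engine()` hands back the underlying h2o model object:

```{r varimp-h2o, eval = FALSE}
# pull the raw h2o model out of the fitted workflow
rf_h2o <- rf_fit %>% extract_fit_engine()

# h2o's own variable importance table ...
h2o::h2o.varimp(rf_h2o)

# ... and its built-in importance plot
h2o::h2o.varimp_plot(rf_h2o)
```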