---
title: "butcher"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{butcher}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>",
  eval = requireNamespace("parsnip", quietly = TRUE)
)
```

```{r setup}
library(butcher)
library(parsnip)
```

One of the benefits of working in R is the ease with which you can implement complex models and implement challenging data analysis pipelines. Take, for example, the parsnip package; with the installation of a few associated libraries and a few lines of code, you can fit something as sophisticated as a boosted tree:  

```{r, eval = FALSE}
fitted_model <- boost_tree(mode = "regression") %>%
  fit(mpg ~ ., data = mtcars)
```

Yet, while this code is compact, the underlying fitted result may not be. Since parsnip works as a wrapper for many modeling packages, its fitted model objects inherit the same properties as those that arise from the original modeling package. A straightforward example is the `lm()` function from the base `stats` package. Whether you leverage parsnip or not, you get the same result:

```{r}
parsnip_lm <- linear_reg() %>% 
  fit(mpg ~ ., data = mtcars) 
parsnip_lm
```

Using just `lm()`:

```{r}
old_lm <- lm(mpg ~ ., data = mtcars) 
old_lm
```

Let's say we take this familiar `old_lm` approach in building a custom in-house modeling pipeline. Such a pipeline might entail wrapping `lm()` in other function, but in doing so, we may end up carrying around some unnecessary junk.

```{r}
in_house_model <- function() {
  some_junk_in_the_environment <- runif(1e6) # we didn't know about
  lm(mpg ~ ., data = mtcars) 
}
```

The linear model fit that exists in our custom modeling pipeline is then: 

```{r}
library(lobstr)
obj_size(in_house_model())
```

But it is functionally the same as our `old_lm`, which only takes up: 

```{r}
obj_size(old_lm)
```

Ideally, we want to avoid saving this new `in_house_model()` on disk, when we could have something like `old_lm` that takes up less memory. But what the heck is going on here? We can examine possible issues with a fitted model object using the butcher package: 

```{r}
big_lm <- in_house_model()
weigh(big_lm, threshold = 0, units = "MB")
```

The problem here is in the `terms` component of `big_lm`. Because of how `lm()` is implemented in the base `stats` package (relying on intermediate forms of the data from `model.frame` and `model.matrix`) the **environment** in which the linear fit was created is carried along in the model output. 

We can see this with the `env_print()` function from the rlang package:  

```{r}
library(rlang)
env_print(big_lm$terms)
```

To avoid carrying possible junk around in our production pipeline, whether it be associated with an `lm()` model (or something more complex), we can leverage `axe_env()` from the butcher package: 

```{r}
cleaned_lm <- axe_env(big_lm, verbose = TRUE)
```

Comparing it against our `old_lm`, we find:

```{r}
weigh(cleaned_lm, threshold = 0, units = "MB")
```

And now it takes the same memory on disk:

```{r}
weigh(old_lm, threshold = 0, units = "MB")
```

Axing the environment, however, is not the only functionality of butcher. This package provides five S3 generics that include: 

- `axe_call()`: Remove the call object. 
- `axe_ctrl()`: Remove the controls fixed for training.
- `axe_data()`: Remove the original data.
- `axe_env()`: Replace inherited environments with empty environments. 
- `axe_fitted()`: Remove fitted values.

In our case here with `lm()`, if we are only interested in prediction as the end product of our modeling pipeline, we could free up a lot of memory if we execute all the possible axe functions at once. To do so, we simply run `butcher()`: 

```{r}
butchered_lm <- butcher(big_lm)
predict(butchered_lm, mtcars[, 2:11])
```

Alternatively, we can pick and choose specific axe functions, removing only those parts of the model object that we are no longer interested in characterizing.

```{r}
butchered_lm <- big_lm %>%
  axe_env() %>% 
  axe_fitted()
predict(butchered_lm, mtcars[, 2:11])
```

The butcher package provides tooling to axe parts of the fitted output that are no longer needed, without sacrificing much functionality from the original model object.