--- title: "Database write-back" output: rmarkdown::html_vignette vignette: > %\VignetteIndexEntry{Database write-back} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} --- ```{r pre, include = FALSE} if (!rlang::is_installed("randomForest")) { knitr::opts_chunk$set( eval = FALSE ) } ``` ```{r setup, include=FALSE} library(dplyr) library(tidypredict) library(randomForest) library(dbplyr) knitr::opts_chunk$set( collapse = TRUE, comment = "#>" ) set.seed(100) ``` This article reviews two possible scenarios for "writing back" predictions to the database, and without importing the data into memory first. Both scenarios share a common model preparation method. ## Example setup To keep the example reproducible, a SQLite database will be used to simulate a larger scale deployment. First the data is prepared in memory. The article will use the `nycflights13::flights` data, with a couple of modifications ```{r} library(dplyr) library(tidypredict) library(randomForest) library(dbplyr) flights_table <- nycflights13::flights %>% mutate( current_score = 0, flight_id = row_number() ) ``` A new database is created using `RSQLite`. ```{r} library(DBI) con <- dbConnect(RSQLite::SQLite(), path = ":memory:") db_fligths <- copy_to(con, flights_table) ``` ## Model preparation A sample is downloaded from the database for modeling. This example already selects the needed variables. ```{r} df <- db_fligths %>% select(dep_delay, hour, distance) %>% head(1000) %>% collect() ``` A linear model is fitted using `lm()` ```{r} model <- lm(dep_delay ~ ., data = df) ``` It is highly recommendable to always run `tidypredict_test()` to make sure that the predictions are in line with what the `predict()` command returns. The `lm` and `glm` models contain the data that they were fitted with, so it's easy to run a test. `tidypredict_test()` uses the model's internal data set by default ```{r} tidypredict_test(model) ``` In cases where the model re-fitting is automated, the `tidypredict_test()` function returns an `alert` in case the threshold is exceeded, which can be used to fail the automated script. ```{r, eval = FALSE} if (tidypredict_test(model)$alert) stop("Threshold exceeded!") ``` ## Scenario 1 - Update scores In this scenario, The table that supplies the term values is also the recipient of the new score. This is done by updating a specific field in the table. For this example, the field is called `current_score`. The following SQL UPDATE statement should work in most databases: ```{r} library(dbplyr) update_statement <- build_sql("UPDATE flights_table SET current_score = ", tidypredict_sql(model, con = con), con = con) update_statement ``` This statement can be then passed on to the database team, via documentation or an automated process. In cases where the analyst has the responsibility to run the new SQL statement, or if R is being used to automate the scoring, the next line can be used: ```{r} dbSendQuery(con, update_statement) ``` Here is a sample of the newly populated field: ```{r} db_fligths %>% select(current_score) %>% head(10) ``` ## Scenario 2- Append new scores There may be the need to retain not only the new score, but when it was determined and its history. This is usually possible because the source of record possesses a unique key identifier per entity, such as transaction ID or customer ID. In the example, `flights_id` is the unique identifier. In this example, the new scores will be stored in a new table called `daily_scores`. The following code is part of the example preparation, it creates the table and seeds it with a single row. This is not the best way to create an empty table, but it'll do for the purposes of this example. ```{r} dbWriteTable( con, "daily_scores", tibble( flight_id = 0, score = 0, date = "" ) ) ``` The plan is to use a SQL statement that most vendors support, is called INSERT INTO SELECT. The idea is to use `dplyr` laziness to prepare the data transformation and predictions, but it's not going to be executed until is parsed into SQL and sent as part of a another statement. The INSERT INTO SELECT statement allows for the results of a query to be saved in a table, and without leaving the database. In this example, predictions are going to be executed for just the records in the month of December. The data is filtered, and then `tidypredict_to_column()` is used to create the new fit field. The results are then transformed to match to the structure of the new `daily_scores` table. ```{r} new_predictions <- db_fligths %>% filter(month == 12) %>% tidypredict_to_column(model, vars = "score") %>% select( flight_id, score ) %>% mutate(date = "01/01/2018") ``` ```{r} insert_scores <- build_sql("INSERT INTO daily_scores ", sql_render(new_predictions, con = con), con = con) insert_scores ``` As in the first scenario, this statement can be then passed on to the database team, via documentation or an automated process. In cases where the analyst has the responsibility to run the new SQL statement, or if R is being used to automate the scoring, the next line can be used: ```{r} dbSendQuery(con, insert_scores) ``` A simple table join can be used to confirm that the new update worked. For real life scenarios, a more sophisticated query should be performed in order to only get the latest score. For this example, we simple filter on the same date we inserted ```{r} tbl(con, "daily_scores") %>% inner_join(tbl(con, "flights_table"), by = "flight_id") %>% filter(date == "01/01/2018") %>% select(dep_delay, hour, distance, score, date) ``` ```{r, include = FALSE} dbDisconnect(con) ```