---
title: "Applicability domain methods for binary data"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{binary-data}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)
library(ggplot2)
theme_set(theme_bw())
```

```{r, echo = FALSE}
# TODO
#- Mention different input data types: data.frame, recipes, matrix, etc.
#- Maybe make a (better) conclusion?
#- Explain the reason why the training set is diverse.
```

## Introduction

```{r}
library(applicable)
```

Similarity statistics can be used to compare data sets where all of the
predictors are binary. One of the most common measures is the Jaccard index.
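
For two binary vectors, the Jaccard index is the number of features present in both samples divided by the number present in at least one of them. The chunk below is a minimal base R sketch of that definition. `jaccard_index()` is a hypothetical helper written for this illustration, not a function from `applicable`, and the package may compute the statistic differently internally.

```{r}
# Illustrative helper (not part of applicable): Jaccard index of two
# binary vectors of equal length.
jaccard_index <- function(a, b) {
  a <- as.logical(a)
  b <- as.logical(b)
  n_both <- sum(a & b)    # features present in both samples
  n_either <- sum(a | b)  # features present in at least one sample
  # The index is undefined when neither sample has any feature
  if (n_either == 0) {
    return(NA_real_)
  }
  n_both / n_either
}

# Made-up fingerprints for two samples
fp_a <- c(1, 0, 1, 1, 0)
fp_b <- c(1, 1, 0, 1, 0)
jaccard_index(fp_a, fp_b)
```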

For a training set of size `n`, each new sample has `n` similarity statistics,
one per training set sample. These can be summarized by their mean or by a
quantile. In general, we want similarity to be low within the training set
(i.e., a diverse training set) and high for new samples to be predicted.
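
To make the summarization step concrete, the sketch below compares one new sample to a toy training set of four samples and then reduces the four similarities to a single value. All object names and values here are made up for illustration, and the calculation is done in base R rather than with `applicable`.

```{r}
# Toy binary training set (4 samples x 5 features) and one new sample;
# the values are invented for this illustration.
toy_train <- rbind(
  c(1, 0, 1, 1, 0),
  c(1, 1, 0, 1, 0),
  c(0, 0, 1, 0, 1),
  c(1, 0, 1, 0, 0)
)
toy_new <- c(1, 0, 1, 1, 1)

# One Jaccard similarity per training set sample (n = 4 values)
toy_sims <- apply(toy_train, 1, function(train_row) {
  sum(train_row & toy_new) / sum(train_row | toy_new)
})
toy_sims

# Two possible summaries of those values
mean(toy_sims)
median(toy_sims)
```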

To analyze the Jaccard metric, `applicable` provides the following methods:

* `apd_similarity`: analyzes samples in terms of similarity scores. For a
training set of _n_ samples, a new sample is compared to each, resulting in _n_
similarity scores. These are summarized into a single value, such as the mean
or the median similarity.

* `autoplot`: shows the cumulative probability versus the unique similarity
values in the training set.

* `score`: scores new samples using similarity methods. It calculates the
similarity scores and, if `add_percentile = TRUE`, also estimates the
percentile of each score relative to the training set values.
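
One way to think about the percentile step is as a lookup in the empirical cumulative distribution of the training set similarities. The base R sketch below uses `ecdf()` for that lookup. The scores are hypothetical, and this is an assumption about the general idea rather than a description of how `score()` is implemented.

```{r}
# Hypothetical summarized similarities for six training set samples
toy_train_scores <- c(0.20, 0.25, 0.30, 0.35, 0.40, 0.45)
# Hypothetical summarized similarities for two new samples
toy_new_scores <- c(0.22, 0.41)

# Empirical cumulative distribution function of the training values
toy_ecdf <- ecdf(toy_train_scores)

# Percentile of each new score: the percentage of training values
# that are less than or equal to it
toy_pctl <- 100 * toy_ecdf(toy_new_scores)
toy_pctl
```

A low percentile means that the new sample is less similar to the training set than most training set samples are to each other.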

## Example

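The example data is from two QSAR data sets where binary fingerprints are used
as predictors. Loading `qsar_binary` makes two objects available: the training
set `binary_tr` and the new samples `binary_unk`.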

```{r}
data(qsar_binary)
```

Let us construct the model:

```{r}
jacc_sim <- apd_similarity(binary_tr)
jacc_sim
```

As we can see below, this is a fairly diverse training set:

```{r jac-plot}
#| fig-alt: "Empirical cumulative distribution chart. Mean similarity along the x-axis, cumulative probability along the y-axis. Reading from left to right, values stay close to 0 from x = 0 to x = 0.25; from x = 0.25 to x = 0.4 there is a near-linear upward trend to about y = 0.70. After that y = 1."
library(ggplot2)

# Plot the empirical cumulative distribution function for the training set
autoplot(jacc_sim)
```

We can compare the similarity between new samples and the training set:

```{r}
# Summarize across all training set similarities
mean_sim <- score(jacc_sim, new_data = binary_unk)
mean_sim
```

Samples 3 and 5 are clear extrapolations based on these predictors. In other
words, these two samples are not similar to the training set, so predictions
for them may not be reliable.
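
In practice, a cutoff on the percentile can be used to flag such samples automatically. The sketch below shows the idea on a made-up data frame. The column names are assumed to mirror the output of `score()`, and both the values and the 5% cutoff are hypothetical choices rather than recommendations from the package.

```{r}
# Hypothetical scores for five new samples; the values are invented and
# the column names are assumed to match what score() returns.
toy_scores <- data.frame(
  similarity      = c(0.38, 0.11, 0.35, 0.09, 0.33),
  similarity_pctl = c(62.0, 1.5, 48.0, 0.8, 40.0)
)

# Hypothetical cutoff: flag samples below the 5th training set percentile
toy_cutoff <- 5
toy_flagged <- which(toy_scores$similarity_pctl < toy_cutoff)
toy_flagged
```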