Title: Tidy Statistical Inference
Description: The objective of this package is to perform inference using an expressive statistical grammar that coheres with the tidy design framework.
Authors: Andrew Bray [aut], Chester Ismay [aut], Evgeni Chasnovski [aut], Simon Couch [aut, cre], Ben Baumer [aut], Mine Cetinkaya-Rundel [aut], Ted Laderas [ctb], Nick Solomon [ctb], Johanna Hardin [ctb], Albert Y. Kim [ctb], Neal Fultz [ctb], Doug Friedman [ctb], Richie Cotton [ctb], Brian Fannin [ctb]
Maintainer: Simon Couch <[email protected]>
License: MIT + file LICENSE
Version: 1.0.7.9000
Built: 2024-12-17 05:15:21 UTC
Source: https://github.com/tidymodels/infer
Like {dplyr}, {infer} also uses the pipe (%>%) from magrittr to turn function composition into a series of imperative statements.
lhs, rhs: Inference functions and the initial data frame.
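As a quick illustration, the two calls below compute the same observed statistic; the second writes the nested composition as a pipeline. (The gss data is bundled with infer.)

library(infer)

# classic nested composition
calculate(specify(gss, response = hours), stat = "mean")

# the same computation, written as a pipeline
gss %>%
  specify(response = hours) %>%
  calculate(stat = "mean")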
This function allows the user to define a null distribution based on theoretical methods. In many infer pipelines, assume() can be used in place of generate() and calculate() to create a null distribution. Rather than outputting a data frame containing a distribution of test statistics calculated from resamples of the observed data, assume() outputs a more abstract type of object just containing the distributional details supplied in the distribution and df arguments. However, assume() output can be passed to visualize(), get_p_value(), and get_confidence_interval() in the same way that simulation-based distributions can.
To define a theoretical null distribution (for use in hypothesis testing), be sure to provide a null hypothesis via hypothesize(). To define a theoretical sampling distribution (for use in confidence intervals), provide the output of specify(). Sampling distributions (only implemented for t and z) lie on the scale of the data, and will be recentered and rescaled to match the corresponding stat given in calculate() to calculate the observed statistic.
assume(x, distribution, df = NULL, ...)
x: The output of specify() or hypothesize().
distribution: The distribution in question, as a string. One of "F", "Chisq", "t", or "z".
df: Optional. The degrees of freedom parameter(s) for the distribution supplied, as a numeric vector.
...: Currently ignored.
Note that the assumption being expressed here, for use in theory-based inference, only extends to distributional assumptions: the null distribution in question and its parameters. Statistical inference with infer, whether carried out via simulation (i.e. based on pipelines using generate() and calculate()) or theory (i.e. with assume()), always involves the condition that observations are independent of each other.
infer only supports theoretical tests on one or two means via the t distribution and one or two proportions via the z distribution.
For tests comparing two means, if n1 is the group size for one level of the explanatory variable, and n2 is that for the other level, infer will recognize the following degrees of freedom (df) arguments:

- min(n1 - 1, n2 - 1)
- n1 + n2 - 2
- The "parameter" entry of the analogous stats::t.test() call
- The "parameter" entry of the analogous stats::t.test() call with var.equal = TRUE

By default, the package will use the "parameter" entry of the analogous stats::t.test() call with var.equal = FALSE (the default).
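As a sketch of supplying one of these recognized values explicitly, the pooled degrees of freedom n1 + n2 - 2 can be computed from the data and passed to assume(); omitting df falls back to the t.test() default.

library(infer)

# group sizes for the two levels of `college`
n1 <- sum(gss$college == "degree")
n2 <- sum(gss$college == "no degree")

# supply the pooled degrees of freedom explicitly
gss %>%
  specify(age ~ college) %>%
  assume(distribution = "t", df = n1 + n2 - 2)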
An infer theoretical distribution that can be passed to helpers like visualize(), get_p_value(), and get_confidence_interval().
# construct theoretical distributions ----------------------------------

# F distribution with the `partyid` explanatory variable
gss %>%
  specify(age ~ partyid) %>%
  assume(distribution = "F")

# Chi-squared goodness of fit distribution on the `finrela` variable
gss %>%
  specify(response = finrela) %>%
  hypothesize(null = "point",
              p = c("far below average" = 1/6,
                    "below average" = 1/6,
                    "average" = 1/6,
                    "above average" = 1/6,
                    "far above average" = 1/6,
                    "DK" = 1/6)) %>%
  assume("Chisq")

# Chi-squared test of independence on the `finrela` and `sex` variables
gss %>%
  specify(formula = finrela ~ sex) %>%
  assume(distribution = "Chisq")

# T distribution
gss %>%
  specify(age ~ college) %>%
  assume("t")

# Z distribution
gss %>%
  specify(response = sex, success = "female") %>%
  assume("z")

## Not run:
# each of these distributions can be passed to infer helper
# functions alongside observed statistics!

# for example, a 1-sample t-test --------------------------------------

# calculate the observed statistic
obs_stat <- gss %>%
  specify(response = hours) %>%
  hypothesize(null = "point", mu = 40) %>%
  calculate(stat = "t")

# construct a null distribution
null_dist <- gss %>%
  specify(response = hours) %>%
  assume("t")

# juxtapose them visually
visualize(null_dist) +
  shade_p_value(obs_stat, direction = "both")

# calculate a p-value
get_p_value(null_dist, obs_stat, direction = "both")

# or, an F test --------------------------------------------------------

# calculate the observed statistic
obs_stat <- gss %>%
  specify(age ~ partyid) %>%
  hypothesize(null = "independence") %>%
  calculate(stat = "F")

# construct a null distribution
null_dist <- gss %>%
  specify(age ~ partyid) %>%
  assume(distribution = "F")

# juxtapose them visually
visualize(null_dist) +
  shade_p_value(obs_stat, direction = "both")

# calculate a p-value
get_p_value(null_dist, obs_stat, direction = "both")
## End(Not run)
Given the output of specify() and/or hypothesize(), this function will return the observed statistic specified with the stat argument. Some test statistics, such as Chisq, t, and z, require a null hypothesis. If provided the output of generate(), the function will calculate the supplied stat for each replicate.

Learn more in vignette("infer").
calculate(
  x,
  stat = c("mean", "median", "sum", "sd", "prop", "count",
           "diff in means", "diff in medians", "diff in props",
           "Chisq", "F", "slope", "correlation", "t", "z",
           "ratio of props", "odds ratio", "ratio of means"),
  order = NULL,
  ...
)
x: The output from specify(), hypothesize(), or generate().
stat: A string giving the type of the statistic to calculate; see the options listed in the usage above.
order: A string vector specifying the order in which the levels of the explanatory variable should be ordered for subtraction (or division for ratio-based statistics), where order = c("first", "second") means "first" - "second".
...: To pass options like na.rm = TRUE into the function used to calculate the statistic.
A tibble containing a stat column of calculated statistics.
In some cases, when bootstrapping with small samples, some generated bootstrap samples will have only one level of the explanatory variable present. For some test statistics, the calculated statistic in these cases will be NaN. The package will omit non-finite values from visualizations (with a warning) and raise an error in p-value calculations.
When using the infer package for research, or in other cases when exact reproducibility is a priority, be sure to set the seed for R's random number generator. infer will respect the random seed specified in the set.seed() function, returning the same result when generate()ing data given an identical seed. For instance, we can calculate the difference in mean age by college degree status using five permuted resamples of the gss dataset with the following code.
set.seed(1)

gss %>%
  specify(age ~ college) %>%
  hypothesize(null = "independence") %>%
  generate(reps = 5, type = "permute") %>%
  calculate("diff in means", order = c("degree", "no degree"))
## Response: age (numeric)
## Explanatory: college (factor)
## Null Hypothesis: independence
## # A tibble: 5 x 2
##   replicate   stat
##       <int>  <dbl>
## 1         1 -0.531
## 2         2 -2.35
## 3         3  0.764
## 4         4  0.280
## 5         5  0.350
Setting the seed to the same value again and rerunning the same code will produce the same result.
# set the seed
set.seed(1)

gss %>%
  specify(age ~ college) %>%
  hypothesize(null = "independence") %>%
  generate(reps = 5, type = "permute") %>%
  calculate("diff in means", order = c("degree", "no degree"))
## Response: age (numeric)
## Explanatory: college (factor)
## Null Hypothesis: independence
## # A tibble: 5 x 2
##   replicate   stat
##       <int>  <dbl>
## 1         1 -0.531
## 2         2 -2.35
## 3         3  0.764
## 4         4  0.280
## 5         5  0.350
Please keep this in mind when writing infer code that utilizes resampling with generate().
See visualize(), get_p_value(), and get_confidence_interval() to extract value from this function's outputs.

Other core functions: generate(), hypothesize(), specify()
# calculate a null distribution of hours worked per week under
# the null hypothesis that the mean is 40
gss %>%
  specify(response = hours) %>%
  hypothesize(null = "point", mu = 40) %>%
  generate(reps = 200, type = "bootstrap") %>%
  calculate(stat = "mean")

# calculate the corresponding observed statistic
gss %>%
  specify(response = hours) %>%
  calculate(stat = "mean")

# calculate a null distribution assuming independence between age
# of respondent and whether they have a college degree
gss %>%
  specify(age ~ college) %>%
  hypothesize(null = "independence") %>%
  generate(reps = 200, type = "permute") %>%
  calculate("diff in means", order = c("degree", "no degree"))

# calculate the corresponding observed statistic
gss %>%
  specify(age ~ college) %>%
  calculate("diff in means", order = c("degree", "no degree"))

# some statistics require a null hypothesis
gss %>%
  specify(response = hours) %>%
  hypothesize(null = "point", mu = 40) %>%
  calculate(stat = "t")

# more in-depth explanation of how to use the infer package
## Not run: vignette("infer")
## End(Not run)
chisq_stat(x, formula, response = NULL, explanatory = NULL, ...)
x: A data frame that can be coerced into a tibble.
formula: A formula with the response variable on the left and the explanatory on the right. Alternatively, a response and explanatory argument can be supplied.
response: The variable name in x that will serve as the response.
explanatory: The variable name in x that will serve as the explanatory variable.
...: Additional arguments for chisq.test().
A shortcut wrapper function to get the observed test statistic for a chisq test. Uses chisq.test(), which applies a continuity correction. This function has been deprecated in favor of the more general observe().
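As a sketch of the suggested migration, the same observed statistic can be computed with observe() by naming the stat explicitly:

library(infer)

# deprecated shortcut
chisq_stat(gss, college ~ finrela)

# the equivalent, more general observe() call
observe(gss, college ~ finrela, stat = "Chisq")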
Other wrapper functions: chisq_test(), observe(), prop_test(), t_stat(), t_test()

Other functions for calculating observed statistics: observe(), t_stat()
# chi-squared test statistic for test of independence
# of college completion status depending on one's
# self-identified income class
chisq_stat(gss, college ~ finrela)

# chi-squared test statistic for a goodness of fit
# test on whether self-identified income class
# follows a uniform distribution
chisq_stat(gss,
           response = finrela,
           p = c("far below average" = 1/6,
                 "below average" = 1/6,
                 "average" = 1/6,
                 "above average" = 1/6,
                 "far above average" = 1/6,
                 "DK" = 1/6))
A tidier version of chisq.test() for goodness of fit tests and tests of independence.
chisq_test(x, formula, response = NULL, explanatory = NULL, ...)
x: A data frame that can be coerced into a tibble.
formula: A formula with the response variable on the left and the explanatory on the right. Alternatively, a response and explanatory argument can be supplied.
response: The variable name in x that will serve as the response.
explanatory: The variable name in x that will serve as the explanatory variable.
...: Additional arguments for chisq.test().
Other wrapper functions: chisq_stat(), observe(), prop_test(), t_stat(), t_test()
# chi-squared test of independence for college completion
# status depending on one's self-identified income class
chisq_test(gss, college ~ finrela)

# chi-squared goodness of fit test on whether self-identified
# income class follows a uniform distribution
chisq_test(gss,
           response = finrela,
           p = c("far below average" = 1/6,
                 "below average" = 1/6,
                 "average" = 1/6,
                 "above average" = 1/6,
                 "far above average" = 1/6,
                 "DK" = 1/6))
These functions and objects should no longer be used. They will be removed in a future release of infer.
conf_int(x, level = 0.95, type = "percentile", point_estimate = NULL)

p_value(x, obs_stat, direction)
x, level, type, point_estimate, obs_stat, direction: See the non-deprecated function.
See get_p_value(), get_confidence_interval(), and generate().
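A minimal migration sketch; the deprecated and current interval functions take the same core arguments:

library(infer)

# a bootstrap distribution of sample means
boot_dist <- gss %>%
  specify(response = hours) %>%
  generate(reps = 1000, type = "bootstrap") %>%
  calculate(stat = "mean")

# deprecated: conf_int(boot_dist, level = 0.95)
# current:
get_confidence_interval(boot_dist, level = 0.95)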
Given the output of an infer core function, this function will fit a linear model using stats::glm() according to the formula and data supplied earlier in the pipeline. If passed the output of specify() or hypothesize(), the function will fit one model. If passed the output of generate(), it will fit a model to each data resample, denoted in the replicate column. The family of the fitted model depends on the type of the response variable. If the response is numeric, fit() will use family = "gaussian" (linear regression). If the response is a 2-level factor or character, fit() will use family = "binomial" (logistic regression). To fit character or factor response variables with more than two levels, we recommend parsnip::multinom_reg().
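A minimal sketch of that recommendation, assuming the parsnip and nnet packages are installed; partyid has more than two levels, so it is modeled outside of infer's fit():

library(infer)
library(parsnip)

# a parsnip multinomial regression specification
multinom_spec <- multinom_reg() %>%
  set_engine("nnet")

# fit it directly to the data
multinom_fit <- multinom_spec %>%
  fit(partyid ~ age + hours, data = gss)

multinom_fit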
infer provides a fit "method" for infer objects, which is a way of carrying out model fitting as applied to infer output. The "generic," imported from the generics package and re-exported from this package, provides the general form of fit() that points to infer's method when called on an infer object. That generic is also documented here.

Learn more in vignette("infer").
## S3 method for class 'infer'
fit(object, ...)
object: Output from an infer function; likely specify(), hypothesize(), or generate().
...: Any optional arguments to pass along to the model fitting function. See stats::glm() for more information.
Randomization-based statistical inference with multiple explanatory variables requires careful consideration of the null hypothesis in question and its implications for permutation procedures. Inference for partial regression coefficients via the permutation method implemented in generate() for multiple explanatory variables, consistent with its meaning elsewhere in the package, is subject to additional distributional assumptions beyond those required for one explanatory variable. Namely, the distribution of the response variable must be similar to the distribution of the errors under the null hypothesis' specification of a fixed effect of the explanatory variables. (This null hypothesis is reflected in the variables argument to generate(). By default, all of the explanatory variables are treated as fixed.) A general rule of thumb: if there are large outliers in the distributions of any of the explanatory variables, this distributional assumption will not be satisfied; when the response variable is permuted, the (presumably outlying) value of the response will no longer be paired with the outlier in the explanatory variable, causing an outsize effect on the resulting slope coefficient for that explanatory variable.
More sophisticated methods requiring fewer (or less strict) distributional assumptions exist, but are outside of the scope of this package. For an overview, see "Permutation tests for univariate or multivariate analysis of variance and regression" (Marti J. Anderson, 2001), doi:10.1139/cjfas-58-3-626.
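To illustrate, a sketch of how the variables argument relaxes the default of treating all explanatory variables as fixed:

library(infer)

# default: only the response `hours` is permuted;
# `age` and `college` are treated as fixed
gss %>%
  specify(hours ~ age + college) %>%
  hypothesize(null = "independence") %>%
  generate(reps = 100, type = "permute") %>%
  fit()

# permute an explanatory variable as well via `variables`
gss %>%
  specify(hours ~ age + college) %>%
  hypothesize(null = "independence") %>%
  generate(reps = 100, type = "permute", variables = c(hours, age)) %>%
  fit()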
A tibble containing the following columns:

replicate: Only supplied if the input object had been previously passed to generate(). A number corresponding to which resample of the original data set the model was fitted to.
term: The explanatory variable (or intercept) in question.
estimate: The model coefficient for the given resample (replicate) and explanatory variable (term).
When using the infer package for research, or in other cases when exact reproducibility is a priority, be sure to set the seed for R's random number generator. infer will respect the random seed specified in the set.seed() function, returning the same result when generate()ing data given an identical seed. For instance, we can calculate the difference in mean age by college degree status using five permuted resamples of the gss dataset with the following code.
set.seed(1)

gss %>%
  specify(age ~ college) %>%
  hypothesize(null = "independence") %>%
  generate(reps = 5, type = "permute") %>%
  calculate("diff in means", order = c("degree", "no degree"))
## Response: age (numeric)
## Explanatory: college (factor)
## Null Hypothesis: independence
## # A tibble: 5 x 2
##   replicate   stat
##       <int>  <dbl>
## 1         1 -0.531
## 2         2 -2.35
## 3         3  0.764
## 4         4  0.280
## 5         5  0.350
Setting the seed to the same value again and rerunning the same code will produce the same result.
# set the seed
set.seed(1)

gss %>%
  specify(age ~ college) %>%
  hypothesize(null = "independence") %>%
  generate(reps = 5, type = "permute") %>%
  calculate("diff in means", order = c("degree", "no degree"))
## Response: age (numeric)
## Explanatory: college (factor)
## Null Hypothesis: independence
## # A tibble: 5 x 2
##   replicate   stat
##       <int>  <dbl>
## 1         1 -0.531
## 2         2 -2.35
## 3         3  0.764
## 4         4  0.280
## 5         5  0.350
Please keep this in mind when writing infer code that utilizes resampling with generate().
# fit a linear model predicting number of hours worked per
# week using respondent age and degree status.
observed_fit <- gss %>%
  specify(hours ~ age + college) %>%
  fit()

observed_fit

# fit 100 models to resamples of the gss dataset, where the response
# `hours` is permuted in each. note that this code is the same as
# the above except for the addition of the `generate` step.
null_fits <- gss %>%
  specify(hours ~ age + college) %>%
  hypothesize(null = "independence") %>%
  generate(reps = 100, type = "permute") %>%
  fit()

null_fits

# for logistic regression, just supply a binary response variable!
# (this can also be made explicit via the `family` argument in ...)
gss %>%
  specify(college ~ age + hours) %>%
  fit()

# more in-depth explanation of how to use the infer package
## Not run: vignette("infer")
## End(Not run)
Generation creates a simulated distribution from specify(). In the context of confidence intervals, this is a bootstrap distribution based on the result of specify(). In the context of hypothesis testing, this is a null distribution based on the result of specify() and hypothesize().

Learn more in vignette("infer").
generate(x, reps = 1, type = NULL, variables = !!response_expr(x), ...)
x: A data frame that can be coerced into a tibble.
reps: The number of resamples to generate.
type: The method used to generate resamples of the observed data reflecting the null hypothesis. Currently one of "bootstrap", "permute", or "draw"; see the details below.
variables: If type = "permute", a set of unquoted column names in the data to permute (independently of each other). Defaults to only the response variable.
...: Currently ignored.
A tibble containing reps generated datasets, indicated by the replicate column.
The type argument determines the method used to create the null distribution.

- bootstrap: A bootstrap sample will be drawn for each replicate, where a sample of size equal to the input sample size is drawn (with replacement) from the input sample data.
- permute: For each replicate, each input value will be randomly reassigned (without replacement) to a new output value in the sample.
- draw: A value will be sampled from a theoretical distribution with parameter p specified in hypothesize() for each replicate. This option is currently only applicable for testing on one proportion. This generation type was previously called "simulate", which has been superseded.
When using the infer package for research, or in other cases when exact reproducibility is a priority, be sure to set the seed for R's random number generator. infer will respect the random seed specified in the set.seed() function, returning the same result when generate()ing data given an identical seed. For instance, we can calculate the difference in mean age by college degree status using five permuted resamples of the gss dataset with the following code.
set.seed(1)

gss %>%
  specify(age ~ college) %>%
  hypothesize(null = "independence") %>%
  generate(reps = 5, type = "permute") %>%
  calculate("diff in means", order = c("degree", "no degree"))
## Response: age (numeric)
## Explanatory: college (factor)
## Null Hypothesis: independence
## # A tibble: 5 x 2
##   replicate   stat
##       <int>  <dbl>
## 1         1 -0.531
## 2         2 -2.35
## 3         3  0.764
## 4         4  0.280
## 5         5  0.350
Setting the seed to the same value again and rerunning the same code will produce the same result.
# set the seed
set.seed(1)

gss %>%
  specify(age ~ college) %>%
  hypothesize(null = "independence") %>%
  generate(reps = 5, type = "permute") %>%
  calculate("diff in means", order = c("degree", "no degree"))
## Response: age (numeric)
## Explanatory: college (factor)
## Null Hypothesis: independence
## # A tibble: 5 x 2
##   replicate   stat
##       <int>  <dbl>
## 1         1 -0.531
## 2         2 -2.35
## 3         3  0.764
## 4         4  0.280
## 5         5  0.350
Please keep this in mind when writing infer code that utilizes resampling with generate().
Other core functions: calculate(), hypothesize(), specify()
# generate a null distribution by taking 200 bootstrap samples
gss %>%
  specify(response = hours) %>%
  hypothesize(null = "point", mu = 40) %>%
  generate(reps = 200, type = "bootstrap")

# generate a null distribution for the independence of
# two variables by permuting their values 200 times
gss %>%
  specify(partyid ~ age) %>%
  hypothesize(null = "independence") %>%
  generate(reps = 200, type = "permute")

# generate a null distribution via sampling from a
# binomial distribution 200 times
gss %>%
  specify(response = sex, success = "female") %>%
  hypothesize(null = "point", p = .5) %>%
  generate(reps = 200, type = "draw") %>%
  calculate(stat = "z")

# more in-depth explanation of how to use the infer package
## Not run: vignette("infer")
## End(Not run)
Compute a confidence interval around a summary statistic. Both simulation-based and theoretical methods are supported, though only type = "se" is supported for theoretical methods.

Learn more in vignette("infer").
get_confidence_interval(x, level = 0.95, type = NULL, point_estimate = NULL)

get_ci(x, level = 0.95, type = NULL, point_estimate = NULL)
x: A distribution. For simulation-based inference, a data frame containing a distribution of calculate()d statistics or fit()ted coefficient estimates. For theory-based inference, the output of assume().
level: A numerical value between 0 and 1 giving the confidence level. Default value is 0.95.
type: A string giving which method should be used for creating the confidence interval. The default is "percentile"; other options are "se" (multiplier times standard error) and "bias-corrected". Only "se" is supported for theoretical distributions.
point_estimate: A data frame containing the observed statistic (in a calculate()-based workflow) or observed fit (in a fit()-based workflow). Required for the "se" and "bias-corrected" types.
A null hypothesis is not required to compute a confidence interval. However, including hypothesize() in a pipeline leading to get_confidence_interval() will not break anything. This can be useful when computing a confidence interval using the same distribution used to compute a p-value.

Theoretical confidence intervals (i.e. calculated by supplying the output of assume() to the x argument) require that the point estimate lies on the scale of the data. The distribution defined in assume() will be recentered and rescaled to align with the point estimate, as can be shown in the output of visualize() when paired with shade_confidence_interval().
Confidence intervals are implemented for the following distributions and point estimates:

- distribution = "t": point_estimate should be the output of calculate() with stat = "mean" or stat = "diff in means"
- distribution = "z": point_estimate should be the output of calculate() with stat = "prop" or stat = "diff in props"
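As a sketch of the z case using the bundled gss data:

library(infer)

# point estimate: the observed proportion of "female" responses
p_hat <- gss %>%
  specify(response = sex, success = "female") %>%
  calculate(stat = "prop")

# theoretical sampling distribution on the scale of the data
z_dist <- gss %>%
  specify(response = sex, success = "female") %>%
  assume("z")

# standard-error interval centered on the point estimate
get_confidence_interval(z_dist, level = 0.95, point_estimate = p_hat)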
A tibble containing the following columns:

term: The explanatory variable (or intercept) in question. Only supplied if the input had been previously passed to fit().
lower_ci, upper_ci: The lower and upper bounds of the confidence interval, respectively.
get_ci() is an alias of get_confidence_interval(). conf_int() is a deprecated alias of get_confidence_interval().
Other auxiliary functions: get_p_value()
boot_dist <- gss %>%
  # We're interested in the number of hours worked per week
  specify(response = hours) %>%
  # Generate bootstrap samples
  generate(reps = 1000, type = "bootstrap") %>%
  # Calculate mean of each bootstrap sample
  calculate(stat = "mean")

boot_dist %>%
  # Calculate the confidence interval around the point estimate
  get_confidence_interval(
    # At the 95% confidence level; percentile method
    level = 0.95
  )

# for type = "se" or type = "bias-corrected" we need a point estimate
sample_mean <- gss %>%
  specify(response = hours) %>%
  calculate(stat = "mean")

boot_dist %>%
  get_confidence_interval(
    point_estimate = sample_mean,
    # At the 95% confidence level
    level = 0.95,
    # Using the standard error method
    type = "se"
  )

# using a theoretical distribution -----------------------------------

# define a sampling distribution
sampling_dist <- gss %>%
  specify(response = hours) %>%
  assume("t")

# get the confidence interval---note that the
# point estimate is required here
get_confidence_interval(
  sampling_dist,
  level = .95,
  point_estimate = sample_mean
)

# using a model fitting workflow --------------------------------------

# fit a linear model predicting number of hours worked per
# week using respondent age and degree status.
observed_fit <- gss %>%
  specify(hours ~ age + college) %>%
  fit()

observed_fit

# fit 100 models to resamples of the gss dataset, where the response
# `hours` is permuted in each. note that this code is the same as
# the above except for the addition of the `generate` step.
null_fits <- gss %>%
  specify(hours ~ age + college) %>%
  hypothesize(null = "independence") %>%
  generate(reps = 100, type = "permute") %>%
  fit()

null_fits

get_confidence_interval(
  null_fits,
  point_estimate = observed_fit,
  level = .95
)

# more in-depth explanation of how to use the infer package
## Not run: vignette("infer")
## End(Not run)
Compute a p-value from a null distribution and observed statistic.

Learn more in vignette("infer").
get_p_value(x, obs_stat, direction)

## Default S3 method:
get_p_value(x, obs_stat, direction)

get_pvalue(x, obs_stat, direction)

## S3 method for class 'infer_dist'
get_p_value(x, obs_stat, direction)
x: A null distribution. For simulation-based inference, a data frame containing a distribution of calculate()d statistics or fit()ted coefficient estimates. For theory-based inference, the output of assume().
obs_stat: A data frame containing the observed statistic (in a calculate()-based workflow) or observed fit (in a fit()-based workflow).
direction: A character string. Options are "less", "greater", or "two-sided". Can also use "left", "right", "both", "two_sided", or "two sided", respectively.
A tibble containing the following columns:

term: The explanatory variable (or intercept) in question. Only supplied if the input had been previously passed to fit().
p_value: A value in [0, 1] giving the probability that a statistic/coefficient as or more extreme than the observed statistic/coefficient would occur if the null hypothesis were true.
get_pvalue() is an alias of get_p_value(). p_value() is a deprecated alias of get_p_value().
Though a true p-value of 0 is impossible, get_p_value() may return 0 in some cases. This is due to the simulation-based nature of the {infer} package; the output of this function is an approximation based on the number of reps chosen in the generate() step. When the observed statistic is very unlikely given the null hypothesis, and only a small number of reps have been generated to form a null distribution, it is possible that the observed statistic will be more extreme than every test statistic generated to form the null distribution, resulting in an approximate p-value of 0. In this case, the true p-value is a small value likely less than 3/reps (based on a Poisson approximation).
In the case that a p-value of zero is reported, a warning message will be raised to caution the user against reporting a p-value exactly equal to 0.
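One straightforward remedy when this warning appears is simply to generate more resamples, since the resolution of the approximate p-value is roughly 1/reps. A sketch:

library(infer)

set.seed(1)

obs_mean <- gss %>%
  specify(response = hours) %>%
  calculate(stat = "mean")

# 5000 replicates give a finer-grained approximation than, say, 100
null_dist <- gss %>%
  specify(response = hours) %>%
  hypothesize(null = "point", mu = 40) %>%
  generate(reps = 5000, type = "bootstrap") %>%
  calculate(stat = "mean")

get_p_value(null_dist, obs_stat = obs_mean, direction = "two-sided")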
Other auxiliary functions: get_confidence_interval()
# using a simulation-based null distribution ------------------------------

# find the point estimate---mean number of hours worked per week
point_estimate <- gss %>%
  specify(response = hours) %>%
  calculate(stat = "mean")

# starting with the gss dataset
gss %>%
  # ...we're interested in the number of hours worked per week
  specify(response = hours) %>%
  # hypothesizing that the mean is 40
  hypothesize(null = "point", mu = 40) %>%
  # generating data points for a null distribution
  generate(reps = 1000, type = "bootstrap") %>%
  # finding the null distribution
  calculate(stat = "mean") %>%
  get_p_value(obs_stat = point_estimate, direction = "two-sided")

# using a theoretical null distribution -----------------------------------

# calculate the observed statistic
obs_stat <- gss %>%
  specify(response = hours) %>%
  hypothesize(null = "point", mu = 40) %>%
  calculate(stat = "t")

# define a null distribution
null_dist <- gss %>%
  specify(response = hours) %>%
  assume("t")

# calculate a p-value
get_p_value(null_dist, obs_stat, direction = "both")

# using a model fitting workflow -------------------------------------------

# fit a linear model predicting number of hours worked per
# week using respondent age and degree status.
observed_fit <- gss %>%
  specify(hours ~ age + college) %>%
  fit()

observed_fit

# fit 100 models to resamples of the gss dataset, where the response
# `hours` is permuted in each. note that this code is the same as
# the above except for the addition of the `generate` step.
null_fits <- gss %>%
  specify(hours ~ age + college) %>%
  hypothesize(null = "independence") %>%
  generate(reps = 100, type = "permute") %>%
  fit()

null_fits

get_p_value(null_fits, obs_stat = observed_fit, direction = "two-sided")

# more in-depth explanation of how to use the infer package
## Not run: vignette("infer")
## End(Not run)
The General Social Survey is a high-quality survey which gathers data on American society and opinions, conducted since 1972. This data set is a sample of 500 entries from the GSS, spanning years 1973-2018, including demographic markers and some economic variables. Note that this data is included for demonstration only, and should not be assumed to provide accurate estimates relating to the GSS. However, due to the high quality of the GSS, the unweighted data will approximate the weighted data in some analyses.
gss
A tibble with 500 rows and 11 variables:

year: year respondent was surveyed
age: age at time of survey, truncated at 89
sex: respondent's sex (self-identified)
college: whether or not respondent has a college degree, including junior/community college
partyid: political party affiliation
hompop: number of persons in household
hours: number of hours worked in week before survey, truncated at 89
income: total family income
class: subjective socioeconomic class identification
finrela: opinion of family income
weight: survey weight
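To inspect the sample directly (assuming dplyr is installed):

library(infer)

# print column names, types, and the first few values
dplyr::glimpse(gss)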
Declare a null hypothesis about variables selected in specify().

Learn more in vignette("infer").
hypothesize(x, null, p = NULL, mu = NULL, med = NULL, sigma = NULL)

hypothesise(x, null, p = NULL, mu = NULL, med = NULL, sigma = NULL)
x: A data frame that can be coerced into a tibble.
null: The null hypothesis. Options include "independence", "point", and "paired independence".
p: The true proportion of successes (a number between 0 and 1). To be used with point null hypotheses when the specified response variable is categorical.
mu: The true mean (any numerical value). To be used with point null hypotheses when the specified response variable is continuous.
med: The true median (any numerical value). To be used with point null hypotheses when the specified response variable is continuous.
sigma: The true standard deviation (any numerical value). To be used with point null hypotheses.
A tibble containing the response (and explanatory, if specified) variable data with parameter information stored as well.
Other core functions: calculate(), generate(), specify()
# hypothesize independence of two variables
gss %>%
  specify(college ~ partyid, success = "degree") %>%
  hypothesize(null = "independence")

# hypothesize a mean number of hours worked per week of 40
gss %>%
  specify(response = hours) %>%
  hypothesize(null = "point", mu = 40)

# more in-depth explanation of how to use the infer package
## Not run: vignette("infer")
## End(Not run)
The objective of this package is to perform statistical inference using a grammar that illustrates the underlying concepts and a format that coheres with the tidyverse.
For an overview of how to use the core functionality, see vignette("infer").
Maintainer: Simon Couch [email protected] (ORCID)
Authors:
Andrew Bray [email protected]
Chester Ismay [email protected] (ORCID)
Evgeni Chasnovski [email protected] (ORCID)
Ben Baumer [email protected] (ORCID)
Mine Cetinkaya-Rundel [email protected] (ORCID)
Other contributors:
Ted Laderas [email protected] (ORCID) [contributor]
Nick Solomon [email protected] [contributor]
Johanna Hardin [email protected] [contributor]
Albert Y. Kim [email protected] (ORCID) [contributor]
Neal Fultz [email protected] [contributor]
Doug Friedman [email protected] [contributor]
Richie Cotton [email protected] (ORCID) [contributor]
Brian Fannin [email protected] [contributor]
Useful links:
Report bugs at https://github.com/tidymodels/infer/issues
This function is a wrapper that calls specify(), hypothesize(), and calculate() consecutively, and can be used to calculate observed statistics from data. hypothesize() will only be called if a point null hypothesis parameter is supplied.

Learn more in vignette("infer").
observe(
  x,
  formula,
  response = NULL,
  explanatory = NULL,
  success = NULL,
  null = NULL,
  p = NULL,
  mu = NULL,
  med = NULL,
  sigma = NULL,
  stat = c("mean", "median", "sum", "sd", "prop", "count",
           "diff in means", "diff in medians", "diff in props",
           "Chisq", "F", "slope", "correlation", "t", "z",
           "ratio of props", "odds ratio"),
  order = NULL,
  ...
)
x: A data frame that can be coerced into a tibble.
formula: A formula with the response variable on the left and the explanatory on the right. Alternatively, a response and explanatory argument can be supplied.
response: The variable name in x that will serve as the response.
explanatory: The variable name in x that will serve as the explanatory variable.
success: The level of response that will be considered a success, as a string.
null: The null hypothesis. Options include "independence" and "point".
p: The true proportion of successes (a number between 0 and 1). To be used with point null hypotheses when the specified response variable is categorical.
mu: The true mean (any numerical value). To be used with point null hypotheses when the specified response variable is continuous.
med: The true median (any numerical value). To be used with point null hypotheses when the specified response variable is continuous.
sigma: The true standard deviation (any numerical value). To be used with point null hypotheses.
stat: A string giving the type of the statistic to calculate; see the options listed in the usage above.
order: A string vector specifying the order in which the levels of the explanatory variable should be ordered for subtraction (or division for ratio-based statistics), where order = c("first", "second") means "first" - "second".
...: To pass options like na.rm = TRUE into the function used to calculate the statistic.
A 1-column tibble containing the calculated statistic stat.
Other wrapper functions: chisq_stat(), chisq_test(), prop_test(), t_stat(), t_test()

Other functions for calculating observed statistics: chisq_stat(), t_stat()
# calculating the observed mean number of hours worked per week
gss %>%
  observe(hours ~ NULL, stat = "mean")

# equivalently, calculating the same statistic with the core verbs
gss %>%
  specify(response = hours) %>%
  calculate(stat = "mean")

# calculating a t statistic for hypothesized mu = 40 hours worked/week
gss %>%
  observe(hours ~ NULL, stat = "t", null = "point", mu = 40)

# equivalently, calculating the same statistic with the core verbs
gss %>%
  specify(response = hours) %>%
  hypothesize(null = "point", mu = 40) %>%
  calculate(stat = "t")

# similarly for a difference in means in age based on whether
# the respondent has a college degree
observe(
  gss,
  age ~ college,
  stat = "diff in means",
  order = c("degree", "no degree")
)

# equivalently, calculating the same statistic with the core verbs
gss %>%
  specify(age ~ college) %>%
  calculate("diff in means", order = c("degree", "no degree"))

# for a more in-depth explanation of how to use the infer package
## Not run: vignette("infer")
## End(Not run)
Print methods
## S3 method for class 'infer'
print(x, ...)

## S3 method for class 'infer_layer'
print(x, ...)

## S3 method for class 'infer_dist'
print(x, ...)
x: An object of class infer, infer_layer, or infer_dist.
...: Arguments passed to methods.
A tidier version of prop.test() for equal or given proportions.
prop_test(
  x,
  formula,
  response = NULL,
  explanatory = NULL,
  p = NULL,
  order = NULL,
  alternative = "two-sided",
  conf_int = TRUE,
  conf_level = 0.95,
  success = NULL,
  correct = NULL,
  z = FALSE,
  ...
)
x: A data frame that can be coerced into a tibble.
formula: A formula with the response variable on the left and the explanatory on the right. Alternatively, a response and explanatory argument can be supplied.
response: The variable name in x that will serve as the response.
explanatory: The variable name in x that will serve as the explanatory variable.
p: A numeric vector giving the hypothesized null proportion of success for each group.
order: A string vector specifying the order in which the proportions should be subtracted, where order = c("first", "second") means "first" - "second".
alternative: Character string giving the direction of the alternative hypothesis. Options are "two-sided" (default), "greater", or "less".
conf_int: A logical value for whether to include the confidence interval or not. TRUE by default.
conf_level: A numeric value between 0 and 1. Default value is 0.95.
success: The level of response that will be considered a success, as a string.
correct: A logical indicating whether Yates' continuity correction should be applied where possible. If z = TRUE, this is overwritten as FALSE.
z: A logical value for whether to report the statistic as a standard normal deviate or a Pearson's chi-square statistic. Defaults to FALSE.
...: Additional arguments for prop.test().
When testing with an explanatory variable with more than two levels, the order argument as used in the package is no longer well-defined. The function will thus raise a warning and ignore the value if supplied a non-NULL order argument.
The columns present in the output depend on the output of both prop.test() and broom::glance.htest(). See the latter's documentation for column definitions; columns have been renamed with the following mapping:

chisq_df = parameter
p_value = p.value
lower_ci = conf.low
upper_ci = conf.high
Other wrapper functions: chisq_stat(), chisq_test(), observe(), t_stat(), t_test()
# two-sample proportion test for difference in proportions of
# college completion by respondent sex
prop_test(gss, college ~ sex, order = c("female", "male"))

# one-sample proportion test for hypothesized null
# proportion of college completion of .2
prop_test(gss, college ~ NULL, p = .2)

# report as a z-statistic rather than chi-square
# and specify the success level of the response
prop_test(gss, college ~ NULL, success = "degree", p = .2, z = TRUE)
These functions extend the functionality of dplyr::sample_n() and dplyr::slice_sample() by allowing for repeated sampling of data. This operation is especially helpful while creating sampling distributions—see the examples below!
rep_sample_n(tbl, size, replace = FALSE, reps = 1, prob = NULL)

rep_slice_sample(
  .data,
  n = NULL,
  prop = NULL,
  replace = FALSE,
  weight_by = NULL,
  reps = 1
)
tbl, .data: Data frame of population from which to sample.
size, n, prop: The sample size of each sample: size (in rep_sample_n()) and n (in rep_slice_sample()) give a number of rows, while prop gives a proportion of rows in .data.
replace: Should samples be taken with replacement?
reps: Number of samples to take.
prob, weight_by: A vector of sampling weights for each of the rows in the input data frame; in rep_slice_sample(), this may also be an unquoted column name in .data.
rep_sample_n() and rep_slice_sample() are designed to behave similarly to their dplyr counterparts. As such, they have at least the following differences: in the case that replace = FALSE, having size bigger than the number of data rows in rep_sample_n() will give an error. In rep_slice_sample(), having such an n or prop > 1 will give a warning, and the output sample size will be set to the number of rows in the data.
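A sketch of that difference using the 500-row gss sample:

library(infer)

# without replacement, asking for more rows than the data has:

# rep_sample_n() raises an error
# rep_sample_n(gss, size = 1000, reps = 2)

# rep_slice_sample() warns and caps each sample at nrow(gss) = 500
rep_slice_sample(gss, n = 1000, reps = 2)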
Note that the dplyr::sample_n() function has been superseded by dplyr::slice_sample().
A tibble of size reps * n rows corresponding to reps samples of size n from .data, grouped by replicate.
library(dplyr)
library(ggplot2)
library(tibble)

# take 1000 samples of size n = 50, without replacement
slices <- gss %>%
  rep_slice_sample(n = 50, reps = 1000)

slices

# compute the proportion of respondents with a college
# degree in each replicate
p_hats <- slices %>%
  group_by(replicate) %>%
  summarize(prop_college = mean(college == "degree"))

# plot sampling distribution
ggplot(p_hats, aes(x = prop_college)) +
  geom_density() +
  labs(
    x = "p_hat",
    y = "Number of samples",
    title = "Sampling distribution of p_hat"
  )

# sampling with probability weights. Note probabilities are automatically
# renormalized to sum to 1
df <- tibble(
  id = 1:5,
  letter = factor(c("a", "b", "c", "d", "e"))
)

rep_slice_sample(df, n = 2, reps = 5, weight_by = c(.5, .4, .3, .2, .1))

# alternatively, pass an unquoted column name in `.data` as `weight_by`
df <- df %>%
  mutate(wts = c(.5, .4, .3, .2, .1))

rep_slice_sample(df, n = 2, reps = 5, weight_by = wts)
shade_confidence_interval() plots a confidence interval region on top of visualize() output. The output is a ggplot2 layer that can be added with +. The function has a shorter alias, shade_ci().

Learn more in vignette("infer").
shade_confidence_interval(
  endpoints,
  color = "mediumaquamarine",
  fill = "turquoise",
  ...
)

shade_ci(endpoints, color = "mediumaquamarine", fill = "turquoise", ...)
endpoints: The lower and upper bounds of the interval to be plotted. Likely, this will be the output of get_confidence_interval().
color: A character or hex string specifying the color of the end points, drawn as vertical lines on the plot.
fill: A character or hex string specifying the color to shade the confidence interval. If NULL, the interval is not shaded.
...: Other arguments passed along to ggplot2 functions.
If added to an existing infer visualization, a ggplot2 object displaying the supplied intervals on top of its corresponding distribution. Otherwise, an infer_layer list.
Other visualization functions:
shade_p_value()
# find the point estimate---mean number of hours worked per week
point_estimate <- gss %>%
  specify(response = hours) %>%
  calculate(stat = "mean")

# ...and a bootstrap distribution
boot_dist <- gss %>%
  # ...we're interested in the number of hours worked per week
  specify(response = hours) %>%
  # generating data points
  generate(reps = 1000, type = "bootstrap") %>%
  # finding the distribution from the generated data
  calculate(stat = "mean")

# find a confidence interval around the point estimate
ci <- boot_dist %>%
  get_confidence_interval(point_estimate = point_estimate,
                          # at the 95% confidence level
                          level = .95,
                          # using the standard error method
                          type = "se")

# and plot it!
boot_dist %>%
  visualize() +
  shade_confidence_interval(ci)

# or just plot the bounds
boot_dist %>%
  visualize() +
  shade_confidence_interval(ci, fill = NULL)

# you can shade confidence intervals on top of
# theoretical distributions, too---the theoretical
# distribution will be recentered and rescaled to
# align with the confidence interval
sampling_dist <- gss %>%
  specify(response = hours) %>%
  assume(distribution = "t")

visualize(sampling_dist) +
  shade_confidence_interval(ci)

# to visualize distributions of coefficients for multiple
# explanatory variables, use a `fit()`-based workflow

# fit 1000 linear models with the `hours` variable permuted
null_fits <- gss %>%
  specify(hours ~ age + college) %>%
  hypothesize(null = "independence") %>%
  generate(reps = 1000, type = "permute") %>%
  fit()

null_fits

# fit a linear model to the observed data
obs_fit <- gss %>%
  specify(hours ~ age + college) %>%
  fit()

obs_fit

# get confidence intervals for each term
conf_ints <- get_confidence_interval(
  null_fits,
  point_estimate = obs_fit,
  level = .95
)

# visualize distributions of coefficients
# generated under the null
visualize(null_fits)

# add a confidence interval shading layer to juxtapose
# the null fits with the observed fit for each term
visualize(null_fits) +
  shade_confidence_interval(conf_ints)

# more in-depth explanation of how to use the infer package
## Not run: vignette("infer")
## End(Not run)
shade_p_value() plots a p-value region on top of visualize() output. The output is a ggplot2 layer that can be added with +. The function has a shorter alias, shade_pvalue().

Learn more in vignette("infer").
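A minimal sketch of the pattern, assuming null_dist is a simulated null distribution and point_estimate the corresponding observed statistic (hypothetical names, mirroring the examples below):

null_dist %>%
  visualize() +                              # returns a ggplot
  shade_p_value(obs_stat = point_estimate,   # mark the observed statistic
                direction = "two-sided")     # shade both tails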
shade_p_value(obs_stat, direction, color = "red2", fill = "pink", ...)

shade_pvalue(obs_stat, direction, color = "red2", fill = "pink", ...)
obs_stat | The observed statistic or estimate. For calculate()-based workflows, this will be a one-column tibble outputted by calculate(); for fit()-based workflows, the output of fit(). |
direction | A string specifying in which direction the shading should occur. Options are "less", "greater", or "two-sided"; "left", "right", "both", "two_sided", "two sided", and "two.sided" are also accepted. If NULL, no region is shaded. |
color | A character or hex string specifying the color of the observed statistic, drawn as a vertical line on the plot. |
fill | A character or hex string specifying the color to shade the p-value region. If NULL, no shading is done. |
... | Other arguments passed along to ggplot2 functions. For expert use only. |
If added to an existing infer visualization, a ggplot2 object displaying the supplied statistic on top of its corresponding distribution. Otherwise, an infer_layer list.
Other visualization functions: shade_confidence_interval()
# find the point estimate---mean number of hours worked per week
point_estimate <- gss %>%
  specify(response = hours) %>%
  hypothesize(null = "point", mu = 40) %>%
  calculate(stat = "t")

# ...and a null distribution
null_dist <- gss %>%
  # ...we're interested in the number of hours worked per week
  specify(response = hours) %>%
  # hypothesizing that the mean is 40
  hypothesize(null = "point", mu = 40) %>%
  # generating data points for a null distribution
  generate(reps = 1000, type = "bootstrap") %>%
  # estimating the null distribution
  calculate(stat = "t")

# shade the p-value of the point estimate
null_dist %>%
  visualize() +
  shade_p_value(obs_stat = point_estimate, direction = "two-sided")

# you can shade p-values on top of
# theoretical distributions, too!
null_dist_theory <- gss %>%
  specify(response = hours) %>%
  assume(distribution = "t")

null_dist_theory %>%
  visualize() +
  shade_p_value(obs_stat = point_estimate, direction = "two-sided")

# to visualize distributions of coefficients for multiple
# explanatory variables, use a `fit()`-based workflow

# fit 1000 linear models with the `hours` variable permuted
null_fits <- gss %>%
  specify(hours ~ age + college) %>%
  hypothesize(null = "independence") %>%
  generate(reps = 1000, type = "permute") %>%
  fit()

null_fits

# fit a linear model to the observed data
obs_fit <- gss %>%
  specify(hours ~ age + college) %>%
  fit()

obs_fit

# visualize distributions of coefficients
# generated under the null
visualize(null_fits)

# add a p-value shading layer to juxtapose the null
# fits with the observed fit for each term
visualize(null_fits) +
  shade_p_value(obs_fit, direction = "both")

# the direction argument will be applied
# to the plot for each term
visualize(null_fits) +
  shade_p_value(obs_fit, direction = "left")

# more in-depth explanation of how to use the infer package
## Not run:
vignette("infer")

## End(Not run)
specify() is used to specify which columns in the supplied data frame are the relevant response (and, if applicable, explanatory) variables. Note that character variables are converted to factors.

Learn more in vignette("infer").
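A small sketch of the character-to-factor conversion noted above, assuming specify() preserves the selected column's name (the mutate()d column here is hypothetical):

library(dplyr)

gss %>%
  mutate(college_chr = as.character(college)) %>%  # build a character column
  specify(response = college_chr) %>%              # select it as the response
  pull(college_chr) %>%
  class()                                          # "factor"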
specify(x, formula, response = NULL, explanatory = NULL, success = NULL)
x | A data frame that can be coerced into a tibble. |
formula | A formula with the response variable on the left and the explanatory on the right. Alternatively, a response and explanatory argument can be supplied. |
response | The variable name in x that will serve as the response. This is an alternative to using the formula argument. |
explanatory | The variable name in x that will serve as the explanatory variable. This is an alternative to using the formula argument. |
success | The level of response that will be considered a success, as a string. Needed for inference on one proportion, a difference in proportions, and corresponding z stats. |
A tibble containing the response (and explanatory, if specified) variable data.
Other core functions: calculate(), generate(), hypothesize()
# specifying for a point estimate on one variable
gss %>%
  specify(response = age)

# specify a relationship between variables as a formula...
gss %>%
  specify(age ~ partyid)

# ...or with named arguments!
gss %>%
  specify(response = age, explanatory = partyid)

# more in-depth explanation of how to use the infer package
## Not run:
vignette("infer")

## End(Not run)
A shortcut wrapper function to get the observed test statistic for a t test. This function has been deprecated in favor of the more general observe().
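For reference, a sketch of the observe() replacement for the first t_stat() example below; the exact arguments accepted are documented under observe():

# one-sample t statistic for a null mean of 40, via observe()
gss %>%
  observe(response = hours, null = "point", mu = 40, stat = "t")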
t_stat(
  x,
  formula,
  response = NULL,
  explanatory = NULL,
  order = NULL,
  alternative = "two-sided",
  mu = 0,
  conf_int = FALSE,
  conf_level = 0.95,
  ...
)
x | A data frame that can be coerced into a tibble. |
formula | A formula with the response variable on the left and the explanatory on the right. Alternatively, a response and explanatory argument can be supplied. |
response | The variable name in x that will serve as the response. This is an alternative to using the formula argument. |
explanatory | The variable name in x that will serve as the explanatory variable. This is an alternative to using the formula argument. |
order | A string vector specifying the order in which the levels of the explanatory variable should be ordered for subtraction, where order = c("first", "second") means ("first" - "second"). |
alternative | Character string giving the direction of the alternative hypothesis. Options are "two-sided" (default), "greater", or "less". |
mu | A numeric value giving the hypothesized null mean value for a one sample test and the hypothesized difference for a two sample test. |
conf_int | A logical value for whether to include the confidence interval or not. FALSE by default. |
conf_level | A numeric value between 0 and 1. Default value is 0.95. |
... | Pass in arguments to infer functions. |
Other wrapper functions: chisq_stat(), chisq_test(), observe(), prop_test(), t_test()

Other functions for calculating observed statistics: chisq_stat(), observe()
library(tidyr)

# t test statistic for true mean number of hours worked
# per week of 40
gss %>%
  t_stat(response = hours, mu = 40)

# t test statistic for number of hours worked per week
# by college degree status
gss %>%
  tidyr::drop_na(college) %>%
  t_stat(formula = hours ~ college,
         order = c("degree", "no degree"),
         alternative = "two-sided")
A tidier version of t.test() for two sample tests.
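Though framed for two sample tests, the same wrapper handles the one sample case; a sketch, mirroring the one-sample t_stat() example above:

# one-sample t test against a null mean of 40 hours
gss %>%
  t_test(response = hours, mu = 40)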
t_test(
  x,
  formula,
  response = NULL,
  explanatory = NULL,
  order = NULL,
  alternative = "two-sided",
  mu = 0,
  conf_int = TRUE,
  conf_level = 0.95,
  ...
)
x | A data frame that can be coerced into a tibble. |
formula | A formula with the response variable on the left and the explanatory on the right. Alternatively, a response and explanatory argument can be supplied. |
response | The variable name in x that will serve as the response. This is an alternative to using the formula argument. |
explanatory | The variable name in x that will serve as the explanatory variable. This is an alternative to using the formula argument. |
order | A string vector specifying the order in which the levels of the explanatory variable should be ordered for subtraction, where order = c("first", "second") means ("first" - "second"). |
alternative | Character string giving the direction of the alternative hypothesis. Options are "two-sided" (default), "greater", or "less". |
mu | A numeric value giving the hypothesized null mean value for a one sample test and the hypothesized difference for a two sample test. |
conf_int | A logical value for whether to include the confidence interval or not. TRUE by default. |
conf_level | A numeric value between 0 and 1. Default value is 0.95. |
... | For passing in other arguments to t.test(). |
Other wrapper functions: chisq_stat(), chisq_test(), observe(), prop_test(), t_stat()
library(tidyr)

# t test for number of hours worked per week
# by college degree status
gss %>%
  tidyr::drop_na(college) %>%
  t_test(formula = hours ~ college,
         order = c("degree", "no degree"),
         alternative = "two-sided")

# see vignette("infer") for more explanation of the
# intuition behind the infer package, and vignette("t_test")
# for more examples of t-tests using infer
Visualize the distribution of the simulation-based inferential statistics or the theoretical distribution (or both!).

Learn more in vignette("infer").
visualize(data, bins = 15, method = "simulation", dens_color = "black", ...)

visualise(data, bins = 15, method = "simulation", dens_color = "black", ...)
data | A distribution. For simulation-based inference, a data frame containing a distribution of calculate()d statistics or fit()ted coefficient estimates. For theory-based inference, the output of assume(). |
bins | The number of bins in the histogram. |
method | A string giving the method to display. Options are "simulation", "theoretical", or "both", with "both" corresponding to "simulation" and "theoretical". If data is the output of assume(), this argument will be ignored and default to "theoretical". |
dens_color | A character or hex string specifying the color of the theoretical density curve. |
... | Additional arguments passed along to functions in ggplot2. For method = "simulation", stat_bin(), and for method = "theoretical", geom_path(). |
In order to make the visualization workflow more straightforward and explicit, visualize() should now only be used to plot distributions of statistics directly. A number of arguments related to shading p-values and confidence intervals are deprecated in visualize() and should instead be passed to shade_p_value() and shade_confidence_interval(), respectively. visualize() will raise a warning if deprecated arguments are supplied.
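As a sketch of the migration, assuming null_dist and point_estimate as constructed in the examples below (the deprecated call is shown only as a comment):

# old, deprecated interface (shading arguments passed to visualize() itself):
# visualize(null_dist, obs_stat = point_estimate, direction = "two-sided")

# current interface: plot the distribution, then add a shading layer
visualize(null_dist) +
  shade_p_value(obs_stat = point_estimate, direction = "two-sided")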
For calculate()-based workflows, a ggplot showing the simulation-based distribution as a histogram or bar graph. Can also be used to display theoretical distributions.

For assume()-based workflows, a ggplot showing the theoretical distribution.

For fit()-based workflows, a patchwork object showing the simulation-based distributions as a histogram or bar graph. The interface to adjust plot options and themes is a bit different for patchwork plots than for ggplot2 plots. The examples highlight the biggest differences here, but see patchwork::plot_annotation() and patchwork::&.gg for more details.
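As a sketch of that difference, assuming null_fits is a fit()-based null distribution as in the examples below (the title text is illustrative):

# patchwork plots take annotations with `+` and apply
# themes across all panels with `&`
visualize(null_fits) +
  patchwork::plot_annotation(title = "Null coefficient distributions") &
  ggplot2::theme_minimal()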
shade_p_value(), shade_confidence_interval()
# generate a null distribution
null_dist <- gss %>%
  # we're interested in the number of hours worked per week
  specify(response = hours) %>%
  # hypothesizing that the mean is 40
  hypothesize(null = "point", mu = 40) %>%
  # generating data points for a null distribution
  generate(reps = 1000, type = "bootstrap") %>%
  # calculating a distribution of means
  calculate(stat = "mean")

# or a bootstrap distribution, omitting the hypothesize() step,
# for use in confidence intervals
boot_dist <- gss %>%
  specify(response = hours) %>%
  generate(reps = 1000, type = "bootstrap") %>%
  calculate(stat = "mean")

# we can easily plot the null distribution by piping into visualize
null_dist %>%
  visualize()

# we can add layers to the plot as in ggplot, as well...
# find the point estimate---mean number of hours worked per week
point_estimate <- gss %>%
  specify(response = hours) %>%
  calculate(stat = "mean")

# find a confidence interval around the point estimate
ci <- boot_dist %>%
  get_confidence_interval(point_estimate = point_estimate,
                          # at the 95% confidence level
                          level = .95,
                          # using the standard error method
                          type = "se")

# display a shading of the area beyond the p-value on the plot
null_dist %>%
  visualize() +
  shade_p_value(obs_stat = point_estimate, direction = "two-sided")

# ...or within the bounds of the confidence interval
null_dist %>%
  visualize() +
  shade_confidence_interval(ci)

# plot a theoretical sampling distribution by creating
# a theory-based distribution with `assume()`
sampling_dist <- gss %>%
  specify(response = hours) %>%
  assume(distribution = "t")

visualize(sampling_dist)

# you can shade confidence intervals on top of
# theoretical distributions, too---the theoretical
# distribution will be recentered and rescaled to
# align with the confidence interval
visualize(sampling_dist) +
  shade_confidence_interval(ci)

# to plot both a theory-based and simulation-based null distribution,
# use a theorized statistic (i.e. one of t, z, F, or Chisq)
# and supply the simulation-based null distribution
null_dist_t <- gss %>%
  specify(response = hours) %>%
  hypothesize(null = "point", mu = 40) %>%
  generate(reps = 1000, type = "bootstrap") %>%
  calculate(stat = "t")

obs_stat <- gss %>%
  specify(response = hours) %>%
  hypothesize(null = "point", mu = 40) %>%
  calculate(stat = "t")

visualize(null_dist_t, method = "both")

visualize(null_dist_t, method = "both") +
  shade_p_value(obs_stat, "both")

# to visualize distributions of coefficients for multiple
# explanatory variables, use a `fit()`-based workflow

# fit 1000 models with the `hours` variable permuted
null_fits <- gss %>%
  specify(hours ~ age + college) %>%
  hypothesize(null = "independence") %>%
  generate(reps = 1000, type = "permute") %>%
  fit()

null_fits

# visualize distributions of resulting coefficients
visualize(null_fits)

# the interface to add themes and other elements to patchwork
# plots (outputted by `visualize` when the inputted data
# is from the `fit()` function) is a bit different than adding
# them to ggplot2 plots.
library(ggplot2)

# to add a ggplot2 theme to a `calculate()`-based visualization, use `+`
null_dist %>% visualize() + theme_dark()

# to add a ggplot2 theme to a `fit()`-based visualization, use `&`
null_fits %>% visualize() & theme_dark()

# More in-depth explanation of how to use the infer package
## Not run:
vignette("infer")

## End(Not run)