Adjusted predictions from regression models
Source:R/data_frame_methods.R
, R/ggaverage.R
, R/ggeffect.R
, and 2 more
ggpredict.Rd
After fitting a model, it is useful generate model-based estimates (expected values, or adjusted predictions) of the response variable for different combinations of predictor values. Such estimates can be used to make inferences about relationships between variables.
The ggeffects package computes marginal means and adjusted predicted
values for the response, at the margin of specific values or levels from
certain model terms. The package is built around three core functions:
predict_response()
(understanding results), test_predictions()
(testing
results for statistically significant differences) and plot()
(communicate
results).
By default, adjusted predictions or marginal means are by returned on the
response scale, which is the easiest and most intuitive scale to interpret
the results. There are other options for specific models as well, e.g. with
zero-inflation component (see documentation of the type
-argument). The
result is returned as consistent data frame, which is nicely printed by
default. plot()
can be used to easily create figures.
The main function to calculate marginal means and adjusted predictions is
predict_response()
. In previous versions of ggeffects, the functions
ggpredict()
, ggemmeans()
, ggeffect()
and ggaverage()
were used to
calculate marginal means and adjusted predictions. These functions are still
available, but predict_response()
as a "wrapper" around these functions is
the preferred way to do this now.
ggpredict()
callsget_predictions()
(which in turn callsstats::predict()
)ggemmeans()
callsemmeans::emmeans()
ggaverage()
callsmarginaleffects::avg_predictions()
ggeffect()
callseffects::Effect()
Usage
# S3 method for class 'ggeffects'
as.data.frame(
x,
row.names = NULL,
optional = FALSE,
...,
stringsAsFactors = FALSE,
terms_to_colnames = FALSE
)
ggaverage(
model,
terms,
ci_level = 0.95,
type = "fixed",
typical = "mean",
condition = NULL,
back_transform = TRUE,
vcov = NULL,
vcov_args = NULL,
weights = NULL,
verbose = TRUE,
...
)
ggeffect(
model,
terms,
ci_level = 0.95,
bias_correction = FALSE,
verbose = TRUE,
...
)
ggemmeans(
model,
terms,
ci_level = 0.95,
type = "fixed",
typical = "mean",
condition = NULL,
interval = "confidence",
back_transform = TRUE,
vcov = NULL,
vcov_args = NULL,
bias_correction = FALSE,
weights = NULL,
verbose = TRUE,
...
)
ggpredict(
model,
terms,
ci_level = 0.95,
type = "fixed",
typical = "mean",
condition = NULL,
interval = "confidence",
back_transform = TRUE,
vcov = NULL,
vcov_args = NULL,
bias_correction = FALSE,
verbose = TRUE,
...
)
Arguments
- x
An object of class
ggeffects
, as returned bypredict_response()
,ggpredict()
,ggeffect()
,ggaverage()
orggemmeans()
.- row.names
NULL
or a character vector giving the row names for the data frame. Missing values are not allowed.- optional
logical. If
TRUE
, setting row names and converting column names (to syntactic names: seemake.names
) is optional. Note that all of R's base packageas.data.frame()
methods useoptional
only for column names treatment, basically with the meaning ofdata.frame(*, check.names = !optional)
. See also themake.names
argument of thematrix
method.- ...
Arguments are passed down to
ggpredict()
(further down topredict()
) orggemmeans()
(and thereby toemmeans::emmeans()
), Iftype = "simulate"
,...
may also be used to set the number of simulation, e.g.nsim = 500
. When callingggeffect()
, further arguments passed down toeffects::Effect()
.- stringsAsFactors
logical: should the character vector be converted to a factor?
- terms_to_colnames
Logical, if
TRUE
, standardized column names (like"x"
,"group"
or"facet"
) are replaced by the variable names of the focal predictors specified interms
.- model
A model object, or a list of model objects.
- terms
Names of those terms from
model
, for which predictions should be displayed (so called focal terms). Can be:A character vector, specifying the names of the focal terms. This is the preferred and probably most flexible way to specify focal terms, e.g.
terms = "x [40:60]"
, to calculate predictions for the values 40 to 60.A list, where each element is a named vector, specifying the focal terms and their values. This is the "classical" R way to specify focal terms, e.g.
list(x = 40:60)
.A formula, e.g.
terms = ~ x + z
, which is internally converted to a character vector. This is probably the least flexible way, as you cannot specify representative values for the focal terms.A data frame representing a "data grid" or "reference grid". Predictions are then made for all combinations of the variables in the data frame.
terms
at least requires one variable name. The maximum length is five terms, where the second to fifth term indicate the groups, i.e. predictions of the first term are grouped at meaningful values or levels of the remaining terms (seevalues_at()
). It is also possible to define specific values for focal terms, at which adjusted predictions should be calculated (see details below). All remaining covariates that are not specified interms
are "marginalized", see themargin
argument in?predict_response
. See also argumentcondition
to fix non-focal terms to specific values, and argumenttypical
forggpredict()
orggemmeans()
.- ci_level
Numeric, the level of the confidence intervals. Use
ci_level = NA
if confidence intervals should not be calculated (for instance, due to computation time). Typically, confidence intervals are based on the returned standard errors for the predictions, assuming a t- or normal distribution (based on the model and the available degrees of freedom, i.e. roughly+/- 1.96 * SE
). See introduction of this vignette for more details.- type
Character, indicating whether predictions should be conditioned on specific model components or not, or whether population or unit-level predictions are desired. Consequently, most options only apply for survival models, mixed effects models and/or models with zero-inflation (and their Bayesian counter-parts); only exception is
type = "simulate"
, which is available for some other model classes as well (which respond tosimulate()
).Note 1: For
brmsfit
-models with zero-inflation component, there is notype = "zero_inflated"
nortype = "zi_random"
; predicted values for these models always condition on the zero-inflation part of the model. The same is true forMixMod
-models from GLMMadaptive with zero-inflation component (see 'Details').Note 2: If
margin = "empirical"
, or when callingggaverage()
respectively, (i.e. counterfactual predictions), thetype
argument is handled differently. It is set to"response"
by default, but usually accepts all possible options from thetype
-argument of the model's respectivepredict()
method. E.g., passing aglm
object would allow the options"response"
,"link"
, and"terms"
. For models with zero-inflation component, the below mentioned options"fixed"
,"zero_inflated"
and"zi_prob"
can also be used and will be "translated" into the correspondingtype
option of the model's respectivepredict()
-method.Note 3: If
margin = "marginalmeans"
, or when callingggemmeans()
respectively,type = "random"
andtype = "zi_random"
are not available, i.e. no unit-level predictions are possible."fixed"
(or"count"
)Predicted values are conditioned on the fixed effects or conditional model only. For mixed models, predicted values are on the population-level, i.e.
re.form = NA
when callingpredict()
. For models with zero-inflation component, this type would return the predicted mean from the count component (without conditioning on the zero-inflation part)."random"
This only applies to mixed models, and
type = "random"
does not condition on the zero-inflation component of the model. Use this for unit-level predictions, i.e. predicted values for each level of the random effects groups. Add the name of the related random effect term to theterms
-argument (for more details, see this vignette)."zero_inflated"
(or"zi"
)Predicted values are conditioned on the fixed effects and the zero-inflation component, returning the expected value of the response (
mu*(1-p)
). For For mixed models with zero-inflation component (e.g. from package glmmTMB), this would return the expected responsemu*(1-p)
on the population-level. See 'Details'."zi_random"
(or"zero_inflated_random"
)This only applies to mixed models. Predicted values are conditioned on the fixed effects and the zero-inflation component. Use this for unit-level predictions, i.e. predicted values for each level of the random effects groups. Add the name of the related random effect term to the
terms
-argument (for more details, see this vignette)."zi_prob"
Returns the predicted zero-inflation probability, i.e. probability of a structural or "true" zero (see this vignette for a short introduction into zero-inflation models).
"simulate"
Predicted values and confidence resp. prediction intervals are based on simulations, i.e. calls to
simulate()
. This type of prediction takes all model uncertainty into account. Currently supported models are objects of classlm
,glm
,glmmTMB
,wbm
,MixMod
andmerMod
. Usensim
to set the number of simulated draws (see...
for details)."survival"
,"cumulative_hazard"
and"quantile"
"survival"
and"cumulative_hazard"
apply only tocoxph
-objects from the survial-package. These options calculate the survival probability or the cumulative hazard of an event.type = "quantile"
only applies tosurvreg
-objects from package survival, which returns the predicted quantiles. For this option, thep
argument is passed topredict()
, so that quantiles for different probabilities can be calculated, e.g.predict_response(..., type = "quantile", p = c(0.2, 0.5, 0.8))
.
When
margin = "empirical"
(or when callingggaverage()
), thetype
argument accepts all values from thetype
-argument of the model's respectivepredict()
-method.- typical
Character vector, naming the function to be applied to the covariates (non-focal terms) over which the effect is "averaged". The default is
"mean"
. Can be"mean"
, "weighted.mean
","median"
,"mode"
or"zero"
, which call the corresponding R functions (except"mode"
, which calls an internal function to compute the most common value);"zero"
simply returns 0. By default, if the covariate is a factor, only"mode"
is applicable; for all other values (including the default,"mean"
) the reference level is returned. For character vectors, only the mode is returned. You can use a named vector to apply different functions to integer, numeric and categorical covariates, e.g.typical = c(numeric = "median", factor = "mode")
. Iftypical
is"weighted.mean"
, weights from the model are used. If no weights are available, the function falls back to"mean"
. Note that this argument is ignored forpredict_response()
, because themargin
argument takes care of this.- condition
Named character vector, which indicates covariates that should be held constant at specific values. Unlike
typical
, which applies a function to the covariates to determine the value that is used to hold these covariates constant,condition
can be used to define exact values, for instancecondition = c(covariate1 = 20, covariate2 = 5)
. See 'Examples'.- back_transform
Logical, if
TRUE
(the default), predicted values for log-, log-log, exp, sqrt and similar transformed responses will be back-transformed to original response-scale. Seeinsight::find_transformation()
for more details.- vcov
Variance-covariance matrix used to compute uncertainty estimates (e.g., for confidence intervals based on robust standard errors). This argument accepts a covariance matrix, a function which returns a covariance matrix, or a string which identifies the function to be used to compute the covariance matrix.
A covariance matrix
A function which returns a covariance matrix (e.g.,
stats::vcov()
)A string which indicates the kind of uncertainty estimates to return.
Heteroskedasticity-consistent:
"HC"
,"HC0"
,"HC1"
,"HC2"
,"HC3"
,"HC4"
,"HC4m"
,"HC5"
. See?sandwich::vcovHC
Cluster-robust:
"vcovCR"
,"CR0"
,"CR1"
,"CR1p"
,"CR1S"
,"CR2"
,"CR3"
. See?clubSandwich::vcovCR
.Bootstrap:
"BS"
,"xy"
,"fractional"
,"jackknife"
,"residual"
,"wild"
,"mammen"
,"norm"
,"webb"
. See?sandwich::vcovBS
Other
sandwich
package functions:"HAC"
,"PC"
,"CL"
, or"PL"
.
If
NULL
, standard errors (and confidence intervals) for predictions are based on the standard errors as returned by thepredict()
-function. Note that probably not all model objects that work withpredict_response()
are also supported by the sandwich or clubSandwich packages.See details in this vignette.
- vcov_args
List of arguments to be passed to the function identified by the
vcov
argument. This function is typically supplied by the sandwich or clubSandwich packages. Please refer to their documentation (e.g.,?sandwich::vcovHAC
) to see the list of available arguments. If no estimation type (argumenttype
) is given, the default type for"HC"
equals the default from the sandwich package; for type"CR"
the default is set to"CR3"
. For other defaults, refer to the documentation in the sandwich or clubSandwich package.- weights
This argument is used in two different ways, depending on the
margin
argument.When
margin = "empirical"
(or when callingggaverag()
),weights
can either be a character vector, naming the weigthing variable in the data, or a vector of weights (of same length as the number of observations in the data). This variable will be used to weight adjusted predictions.When
margin = "marginalmeans"
(or when callingggemmeans()
),weights
must be a character vector and is passed toemmeans::emmeans()
, specifying weights to use in averaging non-focal categorical predictors. Options are"equal"
,"proportional"
,"outer"
,"cells"
,"flat"
, and"show.levels"
. See?emmeans::emmeans
for details.
- verbose
Toggle messages or warnings.
- bias_correction
Logical, if
TRUE
, adjusts for bias-correction when back-transforming the predicted values (to the response scale) for non-Gaussian mixed models. Back-transforming the the population-level predictions ignores the effect of the variation around the population mean, so the result on the original data scale is biased due to Jensen's inequality. That means, whentype = "fixed"
(the default) and population level predictions are returned, it is recommended to setbias_correction = TRUE
. To apply bias-correction, a valid value of sigma is required, which is extracted by default usinginsight::get_variance_residual()
. Optionally, to provide own estimates of uncertainty, use thesigma
argument. Note thatbias_correction
currently only applies to mixed models, where there are additive random components involved and where that bias-adjustment can be appropriate. Ifggemmeans()
is called, bias-correction can also be applied to GEE-models.- interval
Type of interval calculation, can either be
"confidence"
(default) or"prediction"
. May be abbreviated. Unlike confidence intervals, prediction intervals include the residual variance (sigma^2) to account for the uncertainty of predicted values. Note that prediction intervals are not available for all models, but only for models that work withinsight::get_sigma()
. For Bayesian models, wheninterval = "confidence"
, predictions are based on posterior draws of the linear predictorrstantools::posterior_epred()
. Ifinterval = "prediction"
,rstantools::posterior_predict()
is called.
Value
A data frame (with ggeffects
class attribute) with consistent data columns:
"x"
: the values of the first term interms
, used as x-position in plots."predicted"
: the predicted values of the response, used as y-position in plots."std.error"
: the standard error of the predictions. Note that the standard errors are always on the link-scale, and not back-transformed for non-Gaussian models!"conf.low"
: the lower bound of the confidence interval for the predicted values."conf.high"
: the upper bound of the confidence interval for the predicted values."group"
: the grouping level from the second term interms
, used as grouping-aesthetics in plots."facet"
: the grouping level from the third term interms
, used to indicate facets in plots.The estimated marginal means (or predicted values) are always on the response scale!
For proportional odds logistic regression (see
?MASS::polr
) resp. cumulative link models (e.g., see?ordinal::clm
), an additional column"response.level"
is returned, which indicates the grouping of predictions based on the level of the model's response.Note that for convenience reasons, the columns for the intervals are always named
"conf.low"
and"conf.high"
, even though for Bayesian models credible or highest posterior density intervals are returned.There is an
as.data.frame()
method for objects of classggeffects
, which has anterms_to_colnames
argument, to use the term names as column names instead of the standardized names"x"
etc.
Details
Please see ?predict_response
for details and examples.