Bayesian Augmentation by Chained Equations (BACE) is an R package for imputing missing data in comparative (species-level) datasets using Bayesian chained equations with a phylogenetic random effect. Each per-variable imputation model is a phylogenetic mixed model fit with MCMCglmm, so closely related taxa inform each other's imputations, something general-purpose multiple-imputation (MI) tools (e.g. mice, Amelia) do not do. The package is sometimes also referred to as Phylo-BACE.
BACE handles mixed variable types in a single workflow:
- Continuous (Gaussian)
- Count (Poisson)
- Binary (modelled on the threshold / probit scale)
- Multinomial categorical
- Ordinal (threshold models with ≥ 3 ordered levels)
Variable types are detected automatically from column classes; users can also pass per-variable formulas when they want explicit control over which predictors enter each imputation model.
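As a rough illustration of how column classes might line up with the variable types listed above, the sketch below builds a small data frame in base R. The mapping in the comments (numeric to Gaussian, integer to Poisson, two-level factor to binary, unordered factor to multinomial, ordered factor to ordinal) is an assumption about the detection rules rather than something taken from the package documentation, and all column names are hypothetical. Check ?bace for the actual behaviour.

```r
# Toy comparative dataset; all column names are hypothetical.
set.seed(1)
n <- 12
my_data <- data.frame(
  species   = paste0("sp", seq_len(n)),
  body_mass = rnorm(n, mean = 5, sd = 1),                       # numeric: assumed Gaussian
  clutch    = rpois(n, lambda = 3),                             # integer: assumed Poisson
  migratory = factor(sample(c("no", "yes"), n, replace = TRUE),
                     levels = c("no", "yes")),                  # 2-level factor: assumed binary
  diet      = factor(sample(c("herb", "omni", "carn"), n, replace = TRUE),
                     levels = c("herb", "omni", "carn")),       # unordered factor: assumed multinomial
  activity  = factor(sample(c("low", "mid", "high"), n, replace = TRUE),
                     levels = c("low", "mid", "high"),
                     ordered = TRUE),                           # ordered factor: assumed ordinal
  stringsAsFactors = FALSE
)

# Inspect the first class of each column, i.e. what automatic detection would see.
col_classes <- vapply(my_data, function(col) class(col)[1], character(1))
print(col_classes)
```

Setting the classes explicitly before imputation (for example with factor(..., ordered = TRUE)) avoids relying on how a CSV reader happened to parse each column.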
⚠️ BACE is under active development and the API may still change. It is usable for research workflows today, but please read the vignette carefully and sanity-check both the per-imputation MCMC convergence and the chained-equations convergence before relying on results for publication.
You can install the development version of BACE from GitHub with:
# install.packages("devtools")
devtools::install_github("daniel1noble/BACE")

library(BACE)
# Full workflow: initial chained imputation -> convergence check -> final
# independent imputations -> pooled posterior.
result <- bace(
  fixformula     = "y ~ x1 + x2",
  ran_phylo_form = "~ 1 | species",
  phylo          = tree,
  data           = my_data,
  runs           = 15,
  n_final        = 10,
  nitt           = 100000,
  burnin         = 10000,
  thin           = 10
)
# Pooled posterior for the analysis model
summary(get_pooled_model(result, "y"))
# Imputed datasets (one per final run) for downstream analyses
imputed_list <- get_imputed_data(result)

See ?bace, ?bace_imp, ?bace_final_imp, ?pool_posteriors, ?assess_convergence, ?get_pooled_model and ?get_imputed_data for details.
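The workflow above assumes that tree and my_data already exist. A minimal sketch of toy inputs follows, assuming the ape package is installed. The species column name is taken from ran_phylo_form above; the requirement that it matches the tree's tip labels is an assumption here, and the simulated values are placeholders with no phylogenetic signal.

```r
library(ape)

set.seed(42)
n_species <- 30

# Random coalescent tree; tip labels are generated automatically.
tree <- rcoal(n_species)

# One row per species, with a species column that matches the tip labels.
my_data <- data.frame(
  species = tree$tip.label,
  y  = rnorm(n_species),
  x1 = rnorm(n_species),
  x2 = rnorm(n_species),
  stringsAsFactors = FALSE
)

# Knock out some values so there is something to impute.
my_data$y[sample(n_species, 6)]  <- NA
my_data$x1[sample(n_species, 4)] <- NA

# Sanity checks before passing the objects to an imputation routine.
stopifnot(inherits(tree, "phylo"))
stopifnot(setequal(my_data$species, tree$tip.label))
print(colSums(is.na(my_data)))
```

With real data, the same two checks (a phylo object, and species names that agree with the tip labels) are worth running before any long MCMC job.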
A full tutorial — including when Phylo-BACE is appropriate relative to other MI tools, the four-step workflow, interpreting pooled output, and the distinction between within-chain MCMC convergence and chained-equations convergence — is at https://daniel1noble.github.io/BACE/.
- Chained-equations imputation for continuous, count, binary, multinomial and ordinal variables
- Phylogenetic random effect in every per-variable model via MCMCglmm
- Optional decomposition of phylogenetic and non-phylogenetic species effects (species = TRUE) for datasets with replicated observations per species
- Per-variable MCMC settings: nitt, burnin and thin can be supplied as lists to use different values for different response models
- One-vs-rest (OVR) categorical imputation: unordered categorical variables are modelled as J independent binary threshold models (one per level) rather than a single multinomial probit. OVR chains mix more reliably, so OVR is the default (ovr_categorical = TRUE); the multinomial probit remains available via ovr_categorical = FALSE
- nitt_cat_mult parameter: an integer multiplier applied to nitt and burnin for categorical and ordinal variables only, for cases where harder-to-mix models need longer chains
- Gelman / pseudo-Gelman prior options for categorical models to improve mixing
- Convergence diagnostics across the chained-equations loop (assess_convergence()), with summary, Wasserstein and energy-distance methods
- Auto-restart of initial imputation when chained-equations convergence is not reached (bace() with max_attempts)
- Final independent imputations with posterior-predictive sampling (bace_final_imp()), so each of the n_final imputed datasets is a proper draw rather than a repeated point estimate: continuous variables are drawn from the posterior predictive via the fixed-effect design matrix, and categorical/ordinal variables are sampled from the posterior probability distribution
- Posterior pooling across imputations by stacking per-imputation MCMC chains, a Monte Carlo approximation to the marginal posterior integrating over the imputation distribution; see ?pool_posteriors for the references and assumptions
- Parallel execution of final imputation runs (bace_final_imp(n_cores = …)) via parallel::mclapply
- Accessor helpers get_pooled_model() and get_imputed_data() for extracting pooled models and imputed datasets from bace_complete / bace_pooled / bace_final objects
- Simulation engine sim_bace() for generating phylogenetically-structured comparative datasets with arbitrary response types, interaction terms and random slopes, useful for method validation
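The stacked-chain pooling mentioned in the feature list can be illustrated in base R, independently of the package. The sketch below simulates a posterior sample for one coefficient from each of several imputed datasets, stacks the draws, and summarises the pooled sample. The simulated chains, their means and every object name are hypothetical stand-ins for per-imputation MCMC output; this is not the code behind pool_posteriors().

```r
set.seed(123)
n_final <- 10     # number of imputed datasets
n_draws <- 1000   # retained MCMC draws per imputation

# Each imputation gives a slightly different posterior for the same coefficient,
# reflecting uncertainty about the missing values.
imp_means <- rnorm(n_final, mean = 0.5, sd = 0.05)
chains <- lapply(imp_means, function(m) rnorm(n_draws, mean = m, sd = 0.1))

# Pool by stacking: concatenate the draws from all imputations.
pooled <- unlist(chains)

pooled_summary <- c(
  mean  = mean(pooled),
  lower = unname(quantile(pooled, 0.025)),
  upper = unname(quantile(pooled, 0.975))
)
print(round(pooled_summary, 3))

# The pooled variance reflects both within- and between-imputation spread,
# so it is typically larger than the average within-chain variance.
within  <- mean(vapply(chains, var, numeric(1)))
between <- var(vapply(chains, mean, numeric(1)))
print(c(pooled_var = var(pooled), within = within, between = between))
```

Because the stacked sample already carries the between-imputation spread, credible intervals can be read directly from its quantiles, without a separate variance-combination step.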
- Random slopes and more flexible random-effect structures per response variable (accept a list of random-effect formulas so each imputation model can have its own random-effect specification)
- Parallelisation within a single chained-equations iteration (currently only the n_final independent runs are parallelised)
- A Rubin's-rules combiner as an alternative scalar-summary path alongside the current stacked-chain combiner
- Tests for pool_posteriors(), assess_convergence() and the parallel branch of bace_final_imp()
- Tighter CRAN-ready API surface (fewer exported internals) and a NEWS.md
Bug reports, feature requests and worked examples are very welcome at https://github.com/daniel1noble/BACE/issues.