An R Package for Simulation-Based Causal Estimator Benchmarking
Benchmarking causal estimators requires knowing the true treatment effect, a condition that observational data never satisfy. causalsim is an R package that makes such benchmarking possible by letting researchers define structural causal models with known ground truth, simulate data from them, and measure estimator performance across many replications and parameter settings. The package provides a four-function API designed for reproducible simulation studies, causal methods research, and pedagogical use.
Evaluating causal estimators is fundamentally different from evaluating predictive models. A predictive model can be scored on held-out data; a causal estimator cannot, because the counterfactual outcomes needed to compute the true treatment effect are never observed. Researchers typically rely on asymptotic theory, but that says little about behavior at realistic sample sizes, under specific confounding structures, or when identification assumptions are partially violated.
Simulation offers a principled alternative. By specifying a structural causal model with a known average treatment effect, researchers can generate data under exactly the conditions they want to study and observe directly how well any estimator recovers the truth. This approach is standard in the causal inference methods literature, but it has no unified tooling in R. Researchers write bespoke simulation scripts that are hard to reproduce, compare, and extend.
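The idea can be illustrated with a few lines of base R. The following is a minimal sketch that does not use causalsim; the coefficients, the variable names (`x`, `a`, `y`), and the logistic propensity model are assumptions chosen for illustration only.

```r
# Minimal base R sketch (not the causalsim API): one confounder, known ATE.
set.seed(1)
n <- 5000
true_ate <- 1                            # ground truth, fixed by construction

x <- rnorm(n)                            # standard-normal confounder
a <- rbinom(n, 1, plogis(0.8 * x))       # treatment probability depends on x
y <- true_ate * a + 0.5 * x + rnorm(n)   # outcome depends on treatment and x

# Naive estimator: difference in group means, ignores the confounder.
naive <- mean(y[a == 1]) - mean(y[a == 0])

# Adjusted estimator: OLS coefficient on treatment, controlling for x.
adjusted <- unname(coef(lm(y ~ a + x))["a"])

c(naive = naive, adjusted = adjusted, truth = true_ate)
```

Because the truth is fixed in advance, the error of each estimator can be read off directly, which is not possible with observational data.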
causalsim addresses this gap. The package provides a four-function API for defining data generating processes, simulating data from them, evaluating estimator performance over repeated replications, and sweeping evaluation across a grid of DGP parameters. The goal is to make simulation-based evaluation as easy to run and share as any other analysis.
causalsim generates synthetic data from user-defined structural causal models following the standard potential outcomes framework. Covariates can be standard normal, binary, or uniform, and are assigned causal roles (confounder, instrument, effect modifier, noise) that determine how they enter the treatment and outcome models. Treatment propensity and outcome baseline can be scalars, preset confounding levels, or arbitrary functions of the covariates. The true ATE is computed exactly for scalar effects and approximated via 10,000-draw Monte Carlo for heterogeneous effects.
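The Monte Carlo approximation of the true ATE under a heterogeneous effect can be sketched in base R as follows. The effect function `tau()` and the binary effect modifier are hypothetical, and the code illustrates the general technique rather than the package's internal implementation; only the draw count of 10,000 is taken from the text.

```r
# Sketch of a Monte Carlo ATE approximation (not causalsim internals).
set.seed(42)
n_draws <- 10000                       # draw count stated in the text

# Hypothetical heterogeneous effect: larger when the binary modifier is 1.
tau <- function(x) 1 + 0.5 * x

x <- rbinom(n_draws, 1, 0.3)           # binary effect modifier, P(x = 1) = 0.3
effects <- tau(x)                      # unit-level treatment effects

ate_mc    <- mean(effects)                  # Monte Carlo approximation
mc_se     <- sd(effects) / sqrt(n_draws)    # its Monte Carlo standard error
ate_exact <- 1 + 0.5 * 0.3                  # closed form for this toy case

c(ate_mc = ate_mc, mc_se = mc_se, ate_exact = ate_exact)
```

In this toy case the closed form is available, so the approximation error can be checked against its standard error; for arbitrary effect functions only the Monte Carlo value would be available.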
No external data are used. The package is a simulation framework, not an analysis of any particular dataset.
The package is organized as a four-layer API. causalsim_dgp() defines and validates a DGP at construction time, computing the true ATE and checking that all user-supplied functions reference valid covariate names before any data are generated. causalsim_draw() samples one dataset from a DGP. causalsim_eval() runs a user-specified number of replications, applies an estimator to each dataset, and returns bias, RMSE, coverage, and power with Monte Carlo standard errors. causalsim_grid() evaluates an estimator over the Cartesian product of any DGP parameters, returning a tidy data frame of metrics for each cell.
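The evaluation and grid layers correspond to a replication loop wrapped in a parameter sweep. The base R sketch below shows that pattern under stated assumptions: `draw_data()`, `ols_estimator()`, and `evaluate()` are hypothetical helpers written for this example, not causalsim functions, and the `lower` and `upper` fields returned by the estimator are an assumed convention for computing coverage and power.

```r
# Base R sketch of a replication loop and grid sweep (hypothetical helpers,
# not the causalsim API).
draw_data <- function(n, true_ate = 1, confounding = 0.8) {
  x <- rnorm(n)
  a <- rbinom(n, 1, plogis(confounding * x))
  y <- true_ate * a + 0.5 * x + rnorm(n)
  data.frame(y = y, a = a, x = x)
}

# Estimator: takes a data frame, returns a named numeric vector.
ols_estimator <- function(data) {
  fit <- lm(y ~ a + x, data = data)
  ci <- unname(confint(fit, "a", level = 0.95)[1, ])
  c(estimate = unname(coef(fit)["a"]), lower = ci[1], upper = ci[2])
}

evaluate <- function(estimator, n, reps = 200, true_ate = 1, confounding = 0.8) {
  # One row per replication, columns: estimate, lower, upper.
  res <- t(replicate(reps, estimator(draw_data(n, true_ate, confounding))))
  err <- res[, "estimate"] - true_ate
  covered  <- res[, "lower"] <= true_ate & res[, "upper"] >= true_ate
  rejected <- res[, "lower"] > 0 | res[, "upper"] < 0   # CI excludes zero
  data.frame(
    n = n,
    confounding = confounding,
    bias = mean(err),
    bias_mcse = sd(err) / sqrt(reps),
    rmse = sqrt(mean(err^2)),
    coverage = mean(covered),
    coverage_mcse = sqrt(mean(covered) * (1 - mean(covered)) / reps),
    power = mean(rejected)
  )
}

# Cartesian product of atomic-valued parameters, one result row per cell.
set.seed(123)
grid <- expand.grid(n = c(100, 400), confounding = c(0.4, 1.2))
results <- do.call(rbind, lapply(seq_len(nrow(grid)), function(i) {
  evaluate(ols_estimator, n = grid$n[i], confounding = grid$confounding[i])
}))
results
```

Returning one data frame row per grid cell keeps the output tidy, so the results can be filtered or plotted without reshaping.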
The estimator interface is designed for minimal friction: any function that accepts a data frame and returns a named numeric vector containing an element named estimate is a valid estimator. The package passes R CMD check with 0 errors, 0 warnings, and 0 notes, and is implemented in base R with no non-standard dependencies.
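An estimator that satisfies this contract can be very short. In the sketch below, the column names `y` and `a`, the extra `std_error` element, and the helper `is_valid_estimate()` are assumptions made for illustration; only the requirement of a named numeric vector with an `estimate` element comes from the description above.

```r
# Difference-in-means estimator: data frame in, named numeric vector out.
# Assumes hypothetical columns `y` (outcome) and `a` (binary treatment).
diff_in_means <- function(data) {
  treated <- data$y[data$a == 1]
  control <- data$y[data$a == 0]
  est <- mean(treated) - mean(control)
  se  <- sqrt(var(treated) / length(treated) + var(control) / length(control))
  c(estimate = est, std_error = se)
}

# Hypothetical helper (not part of causalsim) that checks the stated contract.
is_valid_estimate <- function(out) {
  is.numeric(out) && !is.null(names(out)) && "estimate" %in% names(out)
}

toy <- data.frame(y = c(2, 3, 4, 1, 1, 2), a = c(1, 1, 1, 0, 0, 0))
out <- diff_in_means(toy)
is_valid_estimate(out)
```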
The package vignette demonstrates a complete simulation study comparing a naive unadjusted estimator with a covariate-adjusted OLS estimator under confounding. With one standard-normal confounder and moderate propensity confounding, the naive estimator shows bias of approximately 0.25 at all sample sizes, while the bias of the adjusted estimator is indistinguishable from zero. RMSE for the adjusted estimator follows the expected root-n convergence rate, roughly halving as sample size quadruples. Varying confounding strength from low to high shows naive bias growing in proportion to confounding strength, while the bias of the adjusted estimator remains near zero across all levels.
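Both patterns are what standard theory predicts. As a sketch, assume a linear outcome model with constant effect $\tau$ and a single confounder $X$ with coefficient $\beta$ (the vignette's exact outcome model is not stated here, so this form is an assumption). The naive contrast then carries a bias term that does not depend on $n$, while an estimator with negligible bias and variance of order $1/n$ has RMSE proportional to $1/\sqrt{n}$ for some constant $c$:

```latex
\begin{align*}
Y &= \tau A + \beta X + \varepsilon, \qquad \mathbb{E}[\varepsilon \mid A, X] = 0 \\
\mathbb{E}[Y \mid A = 1] - \mathbb{E}[Y \mid A = 0]
  &= \tau + \beta \bigl( \mathbb{E}[X \mid A = 1] - \mathbb{E}[X \mid A = 0] \bigr) \\
\mathrm{RMSE}(n) \approx \frac{c}{\sqrt{n}}
  \;&\Longrightarrow\; \frac{\mathrm{RMSE}(4n)}{\mathrm{RMSE}(n)} \approx \frac{1}{2}
\end{align*}
```

The second line shows why the naive bias stays constant across sample sizes and grows as the covariate imbalance between treatment groups increases.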
The built-in covariate types in causalsim are drawn independently of one another; correlated covariate structures require a custom covariates list. The grid interface accepts only atomic-valued parameters, so varying function-valued DGP arguments requires calling causalsim_eval() directly. The package does not yet support longitudinal data structures, clustered treatment assignment, or survival outcomes.
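The Monte Carlo ATE approximation for heterogeneous effects is subject to simulation variance, which is small at the default of 10,000 draws: the standard error of a 10,000-draw mean is one hundredth of the standard deviation of the quantity being averaged.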
causalsim fills a practical gap in the causal inference toolkit by providing a clean, reproducible interface for simulation-based estimator evaluation. The structural causal model framework is general enough to represent a wide range of applied settings, and the layered API keeps simple use cases simple while supporting more complex configurations.
Planned extensions include multi-estimator comparison in a single call, plot methods for grid results, and support for longitudinal DGPs. The package is designed with CRAN submission and long-term maintenance in mind.