dualbounds.gen_data.gen_regression_data

dualbounds.gen_data.gen_regression_data(n: int, p: int, lmda_dist: str = 'constant', eps_dist: str = 'gaussian', heterosked: str = 'constant', tauv: float = 1, r2: float = 0.95, sparsity: float = 0, interactions: bool = True, tau: float = 3, betaW_norm: float = 0, covmethod: str = 'identity', dgp_seed: int = 1, sample_seed: int | None = None)

Samples a synthetic regression dataset.

Parameters:
n : int

Number of observations.

p : int

Number of covariates.

lmda_dist : str

str specifying the distribution of lmda_i, where X_i = lmda_i * N(0, Sigma), so the covariates are elliptically distributed.

eps_dist : str

str specifying the distribution of the residuals. See utilities.parse_dist for the list of options.

heterosked : str

str specifying the type of heteroskedasticity. Defaults to constant.

tauv : float

The ratio Var(Y(1) | X) / Var(Y(0) | X).

r2 : float

Population r^2, defined as 1 - E[Var(Y | X)] / Var(Y).

sparsity : float

Proportion of covariates with zero coefficients. Defaults to zero (no sparsity).

interactions : bool

If True (default), Y = X beta + W * X * beta_int + epsilon. Otherwise, the interactions between the treatment and the covariates are omitted.

tau : float

Average treatment effect.

betaW_norm : float

E[W | X] = logistic(X @ betaW). This parameter controls the norm of betaW and thus the variance of the propensity scores.

covmethod : str

str identifier for how to generate the covariance matrix.

dgp_seed : int

Random seed for the data-generating parameters.

sample_seed : int | None

Random seed for the randomness that comes from sampling the data.
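To make the parameter descriptions concrete, the following is a minimal pure-Python sketch of a data-generating process consistent with them. It is not the library's implementation. It assumes an identity covariance, lmda_i = 1, Gaussian homoskedastic residuals, and tauv = 1. The helper names (logistic, dot, noise_var_for_r2, sketch_dgp) are hypothetical, and both the way tau enters the outcome and the way the noise level is calibrated to r2 are assumptions made for illustration.

```python
import math
import random


def logistic(t):
    """Logistic function 1 / (1 + exp(-t))."""
    return 1.0 / (1.0 + math.exp(-t))


def dot(u, v):
    """Inner product of two equal-length sequences."""
    return sum(a * b for a, b in zip(u, v))


def noise_var_for_r2(signal_var, r2):
    """Solve r2 = 1 - sigma2 / (signal_var + sigma2) for sigma2."""
    return signal_var * (1.0 - r2) / r2


def sketch_dgp(n, p, tau=3.0, r2=0.95, betaW_norm=0.0, dgp_seed=1, sample_seed=None):
    """Hypothetical stand-in for the documented model (see assumptions above)."""
    # The coefficients depend only on dgp_seed.
    dgp_rng = random.Random(dgp_seed)
    beta = [dgp_rng.gauss(0, 1) for _ in range(p)]
    beta_int = [dgp_rng.gauss(0, 1) for _ in range(p)]
    raw = [dgp_rng.gauss(0, 1) for _ in range(p)]
    raw_norm = math.sqrt(dot(raw, raw))
    # Rescale so that the Euclidean norm of betaW equals betaW_norm.
    betaW = [betaW_norm * b / raw_norm for b in raw]

    # Assumption: the noise is calibrated on the control outcome X @ beta + eps,
    # whose signal variance is ||beta||^2 when Cov(X) is the identity.
    sigma = math.sqrt(noise_var_for_r2(dot(beta, beta), r2))

    # The observations depend on sample_seed (None draws a fresh seed).
    sample_rng = random.Random(sample_seed)
    X, y, W, pis = [], [], [], []
    for _ in range(n):
        x = [sample_rng.gauss(0, 1) for _ in range(p)]
        pi = logistic(dot(x, betaW))  # E[W | X] = logistic(X @ betaW)
        w = 1 if sample_rng.random() < pi else 0
        eps = sample_rng.gauss(0, sigma)
        # Assumption: tau enters as an additive shift for treated units.
        y.append(dot(x, beta) + w * (tau + dot(x, beta_int)) + eps)
        X.append(x)
        W.append(w)
        pis.append(pi)
    return {"X": X, "y": y, "W": W, "pis": pis,
            "beta": beta, "beta_int": beta_int, "betaW": betaW}
```

In this sketch, betaW_norm = 0 makes every propensity score exactly 0.5, and two calls that share dgp_seed but use different values of sample_seed share the same coefficients while drawing different observations.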

Returns:

data – Dictionary with keys X (covariates), y (response), W (treatment), pis (true propensity scores), beta, beta_int, betaW, and more.

Return type:

dict
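As an illustration of how the returned dictionary might be consumed, the sketch below summarizes the documented keys y, W, and pis. The helper summarize_data is hypothetical and not part of dualbounds. It assumes that these three entries are one-dimensional sequences of length n and that both treatment arms are non-empty.

```python
from statistics import fmean


def summarize_data(data):
    """Summarize a dict with keys 'y', 'W', and 'pis' (hypothetical helper).

    Works on any iterables of numbers, so plain lists are fine for testing.
    Raises statistics.StatisticsError if either treatment arm is empty.
    """
    y = [float(v) for v in data["y"]]
    W = [int(w) for w in data["W"]]
    pis = [float(v) for v in data["pis"]]
    treated = [yi for yi, wi in zip(y, W) if wi == 1]
    control = [yi for yi, wi in zip(y, W) if wi == 0]
    return {
        "n": len(y),
        "treated_fraction": sum(W) / len(W),
        "mean_propensity": fmean(pis),
        # Naive contrast; only a sensible estimate of the average treatment
        # effect when the propensity scores do not vary with X.
        "diff_in_means": fmean(treated) - fmean(control),
    }


# Toy input with the documented keys. In practice the dict would come from
# gen_regression_data(...), which requires dualbounds to be installed.
toy = {"y": [1.0, 2.0, 5.0, 6.0], "W": [0, 0, 1, 1], "pis": [0.5, 0.5, 0.5, 0.5]}
summary = summarize_data(toy)
```

With the default betaW_norm = 0 the propensity scores should not depend on X, so the difference in means is a reasonable first check against tau. For a nonzero betaW_norm it will generally be confounded.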