dualbounds.gen_data.gen_regression_data

dualbounds.gen_data.gen_regression_data(n: int, p: int, lmda_dist: str = 'constant', eps_dist: str = 'gaussian', heterosked: str = 'constant', tauv: float = 1, r2: float = 0.95, sparsity: float = 0, interactions: bool = True, tau: float = 3, betaW_norm: float = 0, covmethod: str = 'identity', dgp_seed: int = 1, sample_seed: int | None = None)

Samples a synthetic regression dataset.

Parameters:

n : int: Number of observations.
p : int: Number of covariates.
lmda_dist : str: Distribution of the scale factors lmda_i, where X_i = lmda_i * N(0, Sigma), so the covariates are elliptically distributed.
eps_dist : str: Distribution of the residuals. See utilities.parse_dist for the list of options.
heterosked : str: Type of heteroskedasticity. Defaults to 'constant'.
tauv : float: Ratio Var(Y(1) | X) / Var(Y(0) | X).
r2 : float: Population R^2, defined as 1 - E[Var(Y | X)] / Var(Y).
sparsity : float: Proportion of covariates with zero coefficients. Defaults to 0 (no sparsity).
interactions : bool: If True (default), Y = X beta + W * X * beta_int + epsilon. Otherwise, the interactions between the treatment and the covariates are omitted.
tau : float: Average treatment effect.
betaW_norm : float: E[W | X] = logistic(X @ betaW). This parameter sets the norm of betaW and thus controls the variance of the propensity scores.
covmethod : str: Identifier for how to generate the covariance matrix.
dgp_seed : int: Random seed for the data-generating parameters.
sample_seed : int | None: Random seed for the sampling randomness.
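
To make the roles of these parameters concrete, here is a minimal pure-Python sketch of a data-generating process of this general shape. It is not the dualbounds implementation. The function name `sketch_regression_data` is hypothetical, and the sketch assumes an identity covariance, constant lmda_i = 1, homoskedastic Gaussian noise, no treatment-covariate interactions, and an additive treatment effect tau. It also assumes that r2 applies to each potential outcome, so that with a unit-norm beta the noise variance is (1 - r2) / r2.

```python
import math
import random


def sketch_regression_data(n, p, r2=0.95, tau=3.0, betaW_norm=0.0,
                           dgp_seed=1, sample_seed=None):
    """Illustrative stand-in only; NOT the dualbounds implementation."""
    # Separate generators: one for the parameters, one for the sample.
    dgp_rng = random.Random(dgp_seed)
    sample_rng = random.Random(sample_seed)

    def scaled_direction(norm):
        # Random direction in R^p, rescaled to the requested Euclidean norm.
        v = [dgp_rng.gauss(0, 1) for _ in range(p)]
        length = math.sqrt(sum(c * c for c in v))
        return [norm * c / length for c in v]

    beta = scaled_direction(1.0)          # Var(X @ beta) = 1 under identity covariance
    betaW = scaled_direction(betaW_norm)  # all zeros when betaW_norm == 0

    # Assumption: r2 = 1 / (1 + sigma^2), hence sigma^2 = (1 - r2) / r2.
    sigma = math.sqrt((1 - r2) / r2)

    X, W, y, pis = [], [], [], []
    for _ in range(n):
        x = [sample_rng.gauss(0, 1) for _ in range(p)]
        # Propensity score: logistic(x @ betaW).
        pi = 1 / (1 + math.exp(-sum(a * b for a, b in zip(x, betaW))))
        w = 1 if sample_rng.random() < pi else 0
        mu = sum(a * b for a, b in zip(x, beta))
        X.append(x)
        W.append(w)
        pis.append(pi)
        y.append(mu + tau * w + sample_rng.gauss(0, sigma))
    return {"X": X, "y": y, "W": W, "pis": pis, "beta": beta, "betaW": betaW}


# Same dgp_seed, different sample_seed: same coefficients, different sample.
a = sketch_regression_data(50, 3, dgp_seed=1, sample_seed=10)
b = sketch_regression_data(50, 3, dgp_seed=1, sample_seed=11)
```

In this sketch, fixing dgp_seed while varying sample_seed redraws the observations but keeps beta and betaW fixed, which mirrors the split between the two seed parameters described above.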

Returns:

data – Dictionary with keys X (covariates), y (response), W (treatment), pis (true propensity scores), beta, beta_int, betaW, and more.

Return type:

dict
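
As a rough illustration of how the returned dictionary might be consumed, the sketch below computes two simple estimates of the average treatment effect from the documented keys y, W, and pis. The helper `ate_estimates` is hypothetical and not part of dualbounds. It assumes that y, W, and pis are one-dimensional sequences of equal length, and the `toy` dictionary is a hand-made stand-in rather than real output of gen_regression_data.

```python
def ate_estimates(data):
    """Return (difference in means, inverse-propensity-weighted estimate)."""
    y, W, pis = data["y"], data["W"], data["pis"]
    n = len(y)
    treated = [yi for yi, wi in zip(y, W) if wi == 1]
    control = [yi for yi, wi in zip(y, W) if wi == 0]
    diff_in_means = sum(treated) / len(treated) - sum(control) / len(control)
    # IPW: reweight each outcome by the inverse of its true propensity score.
    ipw = sum(
        wi * yi / pi - (1 - wi) * yi / (1 - pi)
        for yi, wi, pi in zip(y, W, pis)
    ) / n
    return diff_in_means, ipw


# Hand-made stand-in for the returned dictionary (hypothetical values).
toy = {"y": [5.0, 1.0, 4.0, 2.0], "W": [1, 0, 1, 0], "pis": [0.5, 0.5, 0.5, 0.5]}
dm, ipw = ate_estimates(toy)
print(dm, ipw)  # prints "3.0 3.0"
```

Because pis holds the true propensity scores, the weighted estimate can be compared against the tau used to generate the data, which is one reason a synthetic dataset exposes these quantities.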