Python API

Overview of classes

ModifiedCausalForest([var_d_name, ...])

Estimation of treatment effects with the Modified Causal Forest.

OptimalPolicy([dc_check_perfectcorr, ...])

Optimal policy learning

McfOptPolReport([mcf, mcf_sense, optpol, ...])

Provides reports about the main specification choices and most

Modified Causal Forest

class mcf_main.ModifiedCausalForest(var_d_name=None, var_id_name=None, var_iv_name=None, var_w_name=None, var_x_name_always_in_ord=None, var_x_name_always_in_unord=None, var_x_name_balance_test_ord=None, var_x_name_balance_test_unord=None, var_x_name_remain_ord=None, var_x_name_remain_unord=None, var_x_name_ord=None, var_x_name_unord=None, var_y_name=None, var_y_tree_name=None, var_z_name_cont=None, var_z_name_ord=None, var_z_name_unord=None, cf_alpha_reg_grid=1, cf_alpha_reg_max=0.15, cf_alpha_reg_min=0.05, cf_boot=1000, cf_chunks_maxsize=None, cf_compare_only_to_zero=False, cf_n_min_grid=1, cf_n_min_max=None, cf_n_min_min=None, cf_n_min_treat=None, cf_nn_main_diag_only=False, cf_m_grid=1, cf_m_random_poisson=True, cf_m_share_max=0.6, cf_m_share_min=0.1, cf_match_nn_prog_score=True, cf_mce_vart=1, cf_random_thresholds=None, cf_p_diff_penalty=None, cf_penalty_type='mse_d', cf_subsample_factor_eval=None, cf_subsample_factor_forest=1, cf_tune_all=False, cf_vi_oob_yes=False, cs_adjust_limits=None, cs_detect_const_vars_stop=True, cs_max_del_train=0.5, cs_min_p=0.01, cs_quantil=1, cs_type=1, ct_grid_dr=100, ct_grid_nn=10, ct_grid_w=10, dc_check_perfectcorr=True, dc_clean_data=True, dc_min_dummy_obs=10, dc_screen_covariates=True, fs_rf_threshold=1, fs_other_sample=True, fs_other_sample_share=0.33, fs_yes=False, gen_d_type='discrete', gen_iate_eff=False, gen_panel_data=False, gen_mp_parallel=None, gen_outfiletext=None, gen_outpath=None, gen_output_type=2, gen_panel_in_rf=True, gen_weighted=False, lc_cs_cv=True, lc_cs_cv_k=None, lc_cs_share=0.25, lc_estimator='RandomForest', lc_yes=True, lc_uncenter_po=True, p_ate_no_se_only=False, p_atet=False, p_bgate=False, p_bgate_sample_share=None, p_bt_yes=True, p_cbgate=False, p_choice_based_sampling=False, p_choice_based_probs=None, p_ci_level=0.95, p_cluster_std=False, p_cond_var=True, p_gates_minus_previous=False, p_gates_smooth=True, p_gates_smooth_bandwidth=1, p_gates_smooth_no_evalu_points=50, p_gates_no_evalu_points=50, 
p_gatet=False, p_iate=True, p_iate_se=False, p_iate_m_ate=False, p_iv_aggregation_method=('local', 'global'), p_knn=True, p_knn_const=1, p_knn_min_k=10, p_nw_bandw=1, p_nw_kern=1, p_max_cats_z_vars=None, p_max_weight_share=0.05, p_qiate=False, p_qiate_se=False, p_qiate_m_mqiate=False, p_qiate_m_opp=False, p_qiate_no_of_quantiles=99, p_qiate_smooth=True, p_qiate_smooth_bandwidth=1, p_qiate_bias_adjust=True, p_se_boot_ate=None, p_se_boot_gate=None, p_se_boot_iate=None, p_se_boot_qiate=None, var_x_name_balance_bgate=None, var_cluster_name=None, post_bin_corr_threshold=0.1, post_bin_corr_yes=True, post_est_stats=True, post_kmeans_no_of_groups=None, post_kmeans_max_tries=1000, post_kmeans_min_size_share=None, post_kmeans_replications=10, post_kmeans_single=False, post_kmeans_yes=True, post_random_forest_vi=True, post_relative_to_first_group_only=True, post_plots=True, post_tree=True, _int_cuda=False, _int_del_forest=False, _int_descriptive_stats=True, _int_dpi=500, _int_fontsize=2, _int_iate_chunk_size=None, _int_keep_w0=False, _int_no_filled_plot=20, _int_max_cats_cont_vars=None, _int_max_save_values=50, _int_max_obs_training=inf, _int_max_obs_prediction=250000, _int_max_obs_kmeans=200000, _int_max_obs_post_rel_graphs=50000, _int_mp_ray_del=('refs',), _int_mp_ray_objstore_multiplier=1, _int_mp_ray_shutdown=None, _int_mp_vim_type=None, _int_mp_weights_tree_batch=None, _int_mp_weights_type=1, _int_obs_bigdata=1000000, _int_output_no_new_dir=False, _int_red_largest_group_train=False, _int_replication=False, _int_report=True, _int_return_iate_sp=False, _int_seed_sample_split=67567885, _int_share_forest_sample=0.5, _int_show_plots=True, _int_verbose=True, _int_weight_as_sparse=True, _int_weight_as_sparse_splits=None, _int_with_output=True)#

Estimation of treatment effects with the Modified Causal Forest.

var_y_nameString or List of strings (or None), optional

Name of outcome variables. If several variables are specified, either var_y_tree_name is used for tree building, or (if var_y_tree_name is None), the 1st variable in the list is used. Only necessary for train() method. Default is None.

var_d_nameString or List of string (or None), optional

Name of treatment variable. Must be provided to use the train() method. Can be provided for the predict() method.

var_x_name_ordString or List of strings (or None), optional

Name of ordered features (including dummy variables). Either ordered or unordered features must be provided. Default is None.

var_x_name_unordString or List of strings (or None), optional

Name of unordered features. Either ordered or unordered features must be provided. Default is None.

var_x_name_balance_bgateString or List of strings (or None), optional

Variables to balance the GATEs on. Only relevant if p_bgate is True. The distribution of these variables is kept constant when a BGATE is computed. None: Use the other heterogeneity variables (var_z_…) (if there are any) for balancing. Default is None.

var_cluster_nameString or List of string (or None), optional

Name of variable defining clusters. Only relevant if p_cluster_std is True. Default is None.

var_id_nameString or List of string (or None), optional

Name of identifier. None: Identifier will be added to the data. Default is None.

var_iv_nameString or List of string (or None), optional

Name of binary instrumental variable. Only relevant if train_iv method is used. Default is None.

var_x_name_balance_test_ordString or List of strings (or None), optional

Name of ordered variables to be used in balancing tests. Only relevant if p_bt_yes is True. Default is None.

var_x_name_balance_test_unordString or List of strings (or None), optional

Name of unordered variables to be used in balancing tests. Treatment specific descriptive statistics are only printed for those variables. Default is None.

var_x_name_always_in_ordString or List of strings (or None), optional

Name of ordered variables that are always checked on when deciding on the next split during tree building. Only relevant for train() method. Default is None.

var_x_name_always_in_unordString or List of strings (or None), optional

Name of unordered variables that are always checked on when deciding on the next split during tree building. Only relevant for train() method. Default is None.

var_x_name_remain_ordString or List of strings (or None), optional

Name of ordered variables that cannot be removed by feature selection. Only relevant for train() method. Default is None.

var_x_name_remain_unordString or List of strings (or None), optional

Name of unordered variables that cannot be removed by feature selection. Only relevant for train() method. Default is None.

var_w_nameString or List of string (or None), optional

Name of weight. Only relevant if gen_weighted is True. Default is None.

var_z_name_contString or List of strings (or None), optional

Names of ordered variables with many values to define causal heterogeneity. They will be discretized and (depending on p_gates_smooth) also treated as continuous. If not already included in var_x_name_ord, they will be added to the list of features. Default is None.

var_z_name_ordString or List of strings (or None), optional

Names of ordered variables with not so many values to define causal heterogeneity. If not already included in var_x_name_ord, they will be added to the list of features. Default is None.

var_z_name_unordString or List of strings (or None), optional

Names of unordered variables with not so many values to define causal heterogeneity. If not already included in var_x_name_unord, they will be added to the list of features. Default is None.

var_y_tree_nameString or List of string (or None), optional

Name of outcome variables to be used to build trees. Only relevant if multiple outcome variables are specified in var_y_name. Only relevant for train() method. Default is None.
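
The variable-name keywords above can be collected in a plain dictionary before they are passed to the class. The sketch below only illustrates the stated requirements (outcome and treatment are needed for train(), and at least one of the ordered or unordered feature lists must be given). The helper check_train_inputs and all column names are hypothetical and not part of the package.

```python
def check_train_inputs(params: dict) -> list:
    """Return a list of problems that would prevent calling train().

    Hypothetical helper: it mirrors the requirements stated in the
    parameter descriptions, not the package's own validation.
    """
    problems = []
    if not params.get("var_y_name"):
        problems.append("var_y_name is needed for train()")
    if not params.get("var_d_name"):
        problems.append("var_d_name is needed for train()")
    if not (params.get("var_x_name_ord") or params.get("var_x_name_unord")):
        problems.append("provide var_x_name_ord and/or var_x_name_unord")
    return problems


# Column names below are made up for illustration.
params = {
    "var_y_name": "outcome",
    "var_d_name": "treat",
    "var_x_name_ord": ["age", "income"],
    "var_x_name_unord": ["region"],
    "var_z_name_ord": ["age"],
}
problems = check_train_inputs(params)
```

A dictionary like params could then be unpacked into the constructor as keyword arguments.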

cf_alpha_reg_gridInteger (or None), optional

Minimum remaining share when splitting leaf: Number of grid values. If grid is used, optimal value is determined by out-of-bag estimation of objective function. Default (or None) is 1.

cf_alpha_reg_maxFloat (or None), optional

Minimum remaining share when splitting leaf: Largest value of grid (keep it below 0.2). Default (or None) is 0.15.

cf_alpha_reg_minFloat (or None), optional

Minimum remaining share when splitting leaf: Smallest value of grid (keep it below 0.2). Default (or None) is 0.05.

cf_bootInteger (or None), optional

Number of Causal Trees. Default (or None) is 1000.

cf_chunks_maxsizeInteger (or None), optional

For large samples, randomly split the training data into equally sized chunks, train a forest in each chunk, and estimate effects for each forest. Final effect estimates are obtained by averaging effects obtained for each forest. This procedure improves scalability by reducing computation time (at the possible price of a somewhat larger finite sample bias). If cf_chunks_maxsize is larger than the sample size, there is no random splitting. The default (None) is dependent on the size of the training data: If there are fewer than 90’000 training observations: No splitting. Otherwise:

\[\text{cf_chunks_maxsize} = 90000 + \frac{{(\text{number of observations} - 90000)^{0.8}}}{{(\text{# of treatments} - 1)}}\]

Default is None.
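
As a rough illustration of the default rule above, the following sketch evaluates the formula for a given sample size. The function name and the rounding to an integer are assumptions made for this example only.

```python
from typing import Optional


def default_chunks_maxsize(n_obs: int, n_treatments: int) -> Optional[int]:
    """Evaluate the documented default for cf_chunks_maxsize.

    Returns None when the sample is small enough that no splitting
    takes place. Rounding to an integer is an assumption.
    """
    if n_treatments < 2:
        raise ValueError("at least two treatments are required")
    if n_obs < 90_000:
        return None  # fewer than 90'000 observations: no splitting
    extra = (n_obs - 90_000) ** 0.8 / (n_treatments - 1)
    return round(90_000 + extra)
```

With more treatments the default chunk size shrinks towards 90'000, so larger multi-treatment samples are split into more chunks.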

cf_compare_only_to_zeroBoolean (or None), optional

If True, the computation of the MCE ignores all elements not related to the first treatment (which usually is the control group). This speeds up computation, should give better effect estimates, and may be attractive when interest is only in the comparisons of each treatment to the control group and not among each other. This may also be attractive for optimal policy analysis based on using estimated potential outcomes normalized by the estimated potential outcome of the control group (i.e., IATEs of treatments vs. control group). Default (or None) is False.

cf_n_min_gridInteger (or None), optional

Minimum leaf size: Number of grid values. If grid is used, optimal value is determined by out-of-bag estimation of objective function. Default (or None) is 1.

cf_n_min_maxInteger (or None), optional

Minimum leaf size: Largest minimum leaf size. If None :

\[\text{A} = \frac{\sqrt{\text{number of observations in the smallest treatment group}}}{10}, \text{at least 2}\]

\(\text{cf_n_min_max} = \text{round}(A \times \text{number of treatments})\) Default is None.

cf_n_min_minInteger (or None), optional

Minimum leaf size: Smallest minimum leaf size. If None:

\[\text{A} = \text{number of observations in smallest treatment group}^{0.4} / 10, \text{at least 1.5}\]

\(\text{cf_n_min_min} = \text{round}(A \times \text{number of treatments})\) Default is None.

cf_n_min_treatInteger (or None), optional

Minimum number of observations per treatment in leaf. A higher value reduces the risk that a leaf cannot be filled with outcomes from all treatment arms in the evaluation subsample. There is no grid based tuning for this parameter. This parameter impacts the minimum leaf size, which will be at least \(\text{n_min_treat} \times \text{number of treatments}\). None :

\[\frac{\frac{{\text{n_min_min}} + {\text{n_min_max}}}{2}}{\text{number of treatments} \times 10}, \text{at least 1}\]

Default is None.
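
The three leaf-size defaults depend on each other, which a short calculation makes easier to follow. This is a sketch under two assumptions: the numerator of A for cf_n_min_max is read as the square root of the group size (otherwise the largest minimum leaf size would fall below the smallest one in larger samples), and cf_n_min_treat is rounded to an integer. The function name is hypothetical.

```python
import math


def default_leaf_sizes(n_smallest_group: int, n_treatments: int) -> dict:
    """Evaluate the documented defaults for the minimum leaf size keywords."""
    # Largest minimum leaf size (assumption: sqrt(n) / 10, at least 2).
    a_max = max(math.sqrt(n_smallest_group) / 10, 2)
    # Smallest minimum leaf size: n ** 0.4 / 10, at least 1.5.
    a_min = max(n_smallest_group ** 0.4 / 10, 1.5)
    n_min_max = round(a_max * n_treatments)
    n_min_min = round(a_min * n_treatments)
    # Minimum observations per treatment in a leaf, at least 1
    # (rounding is an assumption).
    raw = (n_min_min + n_min_max) / 2 / (n_treatments * 10)
    n_min_treat = max(round(raw), 1)
    return {
        "cf_n_min_min": n_min_min,
        "cf_n_min_max": n_min_max,
        "cf_n_min_treat": n_min_treat,
    }
```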

cf_match_nn_prog_scoreBoolean (or None), optional

Choice of method of nearest neighbour matching. True : Prognostic scores. False: Inverse of covariance matrix of features. Default (or None) is True.

cf_nn_main_diag_onlyBoolean (or None), optional

Nearest neighbour matching: Use main diagonal of covariance matrix only. Only relevant if cf_match_nn_prog_score is False. Default (or None) is False.

cf_m_gridInteger (or None), optional

Number of variables used at each new split of tree: Number of grid values. If grid is used, optimal value is determined by out-of-bag estimation of objective function. Default (or None) is 1.

cf_m_random_poissonBoolean (or None), optional

Number of variables used at each new split of tree: True : Number of randomly selected variables is stochastic for each split, drawn from a Poisson distribution. Grid gives the mean value of 1 + Poisson distribution (m-1) (m is determined by cf_m_share parameters). False : No additional randomisation. Default (or None) is True.
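
The stochastic choice described above can be pictured as drawing 1 + Poisson(m - 1) variables for each split, so that the mean is m. The sketch below uses Knuth's algorithm for the Poisson draw to stay within the standard library. Capping the result at the number of available features is an assumption, and the function names are hypothetical.

```python
import math
import random


def poisson_draw(mean: float, rng: random.Random) -> int:
    """Draw from a Poisson distribution with Knuth's multiplication method."""
    if mean <= 0:
        return 0
    limit = math.exp(-mean)
    count = 0
    product = rng.random()
    while product > limit:
        count += 1
        product *= rng.random()
    return count


def n_split_variables(m: int, n_features: int, rng: random.Random) -> int:
    """Number of variables considered at one split: 1 + Poisson(m - 1).

    The cap at n_features is an assumption made for this sketch.
    """
    return min(1 + poisson_draw(m - 1, rng), n_features)


rng = random.Random(12345)
draws = [n_split_variables(4, 100, rng) for _ in range(20000)]
```

Adding 1 guarantees that every split considers at least one variable.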

cf_m_share_maxFloat (or None), optional

Share of variables used at each new split of tree: Maximum. Default (or None) is 0.6. If variables randomly selected for splitting do not show any variation in leaf considered for splitting, then all variables will be used for that split.

cf_m_share_minFloat (or None), optional

Share of variables used at each new split of tree: Minimum. Default (or None) is 0.1. If variables randomly selected for splitting do not show any variation in leaf considered for splitting, then all variables will be used for that split.

cf_mce_vartInteger (or None), optional

Splitting rule for tree building: 0 : mse’s of regression only considered. 1 : mse+mce criterion (default). 2 : -var(effect): heterogeneity maximising splitting rule of Wager & Athey (2018). 3 : randomly switching between outcome-mse+mce criterion & penalty functions. Default (or None) is 1.

cf_p_diff_penaltyInteger (or None), optional

Penalty function (depends on the value of mce_vart).

mce_vart == 0

Irrelevant (no penalty).

mce_vart == 1

Multiplier of penalty (in terms of var(y)). 0 : No penalty. None :

\[\frac{2 \times (\text{n} \times \text{subsam_share})^{0.9}}{\text{n} \times \text{subsam_share}} \times \sqrt{\frac{\text{no_of_treatments} \times (\text{no_of_treatments} - 1)}{2}}\]
mce_vart == 2

Multiplier of penalty (in terms of MSE(y) value function without splits) for penalty. 0 : No penalty. None :

\[\frac{100 \times 4 \times (n \times \text{f_c.subsam_share})^{0.8}}{n \times \text{f_c.subsam_share}}\]
mce_vart == 3

Probability of using p-score (0-1). None : 0.5. Increase value if balancing tests indicate problems. Default is None.
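
The case distinction for the default penalty can be written as a single function. This is only a sketch of the formulas as stated above: the function name is hypothetical, returning 0 for mce_vart == 0 stands in for "no penalty", and subsam_share is taken as a plain input.

```python
import math


def default_p_diff_penalty(mce_vart: int, n_obs: int, subsam_share: float,
                           n_treatments: int) -> float:
    """Evaluate the documented default of cf_p_diff_penalty (value None)."""
    n_sub = n_obs * subsam_share
    if mce_vart == 0:
        return 0.0  # penalty is irrelevant for this splitting rule
    if mce_vart == 1:
        pairs = n_treatments * (n_treatments - 1) / 2
        return 2 * n_sub ** 0.9 / n_sub * math.sqrt(pairs)
    if mce_vart == 2:
        return 100 * 4 * n_sub ** 0.8 / n_sub
    if mce_vart == 3:
        return 0.5  # probability of using the p-score
    raise ValueError("mce_vart must be 0, 1, 2, or 3")
```

For mce_vart equal to 1 or 2 the multiplier decreases as the subsample grows, so the penalty matters most in small samples.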

cf_penalty_typeString (or None), optional

Type of penalty function. ‘mse_d’: MSE of treatment prediction in daughter leaf (new in 0.7.0). ‘diff_d’: Penalty as squared leaf difference (as in Lechner, 2018). Note that an important advantage of ‘mse_d’ is that it can also be used for tuning (due to its computation, this is not possible for ‘diff_d’). Default (or None) is ‘mse_d’.

cf_random_thresholdsInteger (or None), optional

Use only a random selection of values for splitting (continuous feature only; re-randomize for each splitting decision; fewer thresholds speed up the programme but may lead to less accurate results). 0 : No random thresholds. > 0 : Number of random thresholds used for ordered variables. None : \(4 + \text{number of training observations}^{0.2}\) Default is None.
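
To make the idea concrete, the sketch below computes the default number of thresholds and draws a random subset of candidate split values for one splitting decision. Both function names are hypothetical, the rounding is an assumption, and the sampling is a simplified stand-in for whatever the forest does internally.

```python
import random


def default_n_thresholds(n_train: int) -> int:
    """Documented default: 4 + n ** 0.2 (rounding is an assumption)."""
    return round(4 + n_train ** 0.2)


def candidate_thresholds(values, n_thresholds: int, rng: random.Random) -> list:
    """Return the split values to evaluate for one ordered feature.

    n_thresholds == 0 means that all distinct values are used.
    """
    unique = sorted(set(values))
    if n_thresholds == 0 or len(unique) <= n_thresholds:
        return unique
    return sorted(rng.sample(unique, n_thresholds))


rng = random.Random(0)
subset = candidate_thresholds(range(100), default_n_thresholds(32), rng)
```

Calling candidate_thresholds again for the next split draws a fresh subset, which corresponds to the re-randomisation mentioned above.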

cf_subsample_factor_forestFloat (or None), optional

Multiplier of default size of subsampling sample (S) used to build tree.

\[S = \max\left(n^{0.5}, \min\left(0.67n, \frac{2 \times n^{0.85}}{n}\right)\right), \text{n: # of training observations}\]

\(S \times \text{cf_subsample_factor_forest}\) is not larger than 80%. Default (or None) is 1.

cf_subsample_factor_evalFloat or Boolean (or None), optional

Size of subsampling sample used to populate tree. False: No subsampling in evaluation subsample. True or None: \(2 \times \text{subsample size}\) used for tree building (to avoid too many empty leaves). Float (>0): Multiplier of subsample size used for tree building. In particular for larger samples, using subsampling in evaluation will speed up computations and reduce demand on memory. Tree-specific subsampling in evaluation sample increases the speed at which the asymptotic bias disappears (at the expense of a slower disappearance of the variance; however, simulations so far show no relevant impact). Default is None.

cf_tune_allBoolean (or None), optional

Tune all parameters. If True, all *_grid keywords will be set to 3. User specified values are respected if larger than 3. Default (or None) is False.
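
The interaction between cf_tune_all and the individual *_grid keywords can be summarised in a few lines. This is a sketch of the rule as described; the function name is hypothetical.

```python
from typing import Optional


def resolve_grid(user_value: Optional[int], tune_all: bool) -> int:
    """Resolve one *_grid keyword given the cf_tune_all setting."""
    value = 1 if user_value is None else user_value  # documented default is 1
    if tune_all:
        # Grids are set to 3, but larger user values are respected.
        return max(value, 3)
    return value
```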

cf_vi_oob_yesBoolean (or None), optional

Variable importance for causal forest computed by permuting single variables and comparing share of increase in objective function of mcf (computed with out-of-bag data). Default (or None) is False.

cs_typeInteger (or None), optional

Common support adjustment: Method. 0 : No common support adjustment. 1,2 : Support check based on estimated classification forests. 1 : Min-max rules for probabilities in treatment subsamples. 2 : Enforce minimum and maximum probabilities for all observations and all but one probability. Observations off support are removed. Out-of-bag predictions are used to avoid overfitting (which would lead to a too large reduction in the number of observations). Default (or None) is 1.

cs_adjust_limitsFloat (or None), optional

Common support adjustment: Accounting for multiple treatments. None : \((\text{number of treatments} - 2) \times 0.05\). If cs_type > 0: the upper limit is multiplied by \(1 + \text{cs_adjust_limits}\) and the lower limit is multiplied by \(1 - \text{cs_adjust_limits}\). The restrictiveness of the common support criterion increases with the number of treatments. This parameter allows this restrictiveness to be reduced. Default is None.
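
A small numerical sketch may help to see the effect of this adjustment on the support bounds. The function name is hypothetical; the lower and upper limits are assumed to be probabilities that were already determined by the chosen cs_type rule.

```python
from typing import Optional, Tuple


def adjust_support_limits(lower: float, upper: float, n_treatments: int,
                          adjust: Optional[float] = None) -> Tuple[float, float]:
    """Widen the common support bounds as described for cs_adjust_limits."""
    if adjust is None:
        adjust = (n_treatments - 2) * 0.05  # documented default
    return lower * (1 - adjust), upper * (1 + adjust)
```

With two treatments the default adjustment is zero, so the bounds are unchanged; with four treatments each bound moves outwards by 10%.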

cs_max_del_trainFloat (or None), optional

Common support adjustment: If share of observations in training data used that are off support is larger than cs_max_del_train (0-1), an exception is raised. In this case, user should change input data. Default (or None) is 0.5.

cs_min_pFloat (or None), optional

Common support adjustment: If cs_type == 2, observations are deleted if \(p(d=m|x)\) is less than or equal to cs_min_p for at least one treatment. Default (or None) is 0.01.

cs_quantilFloat (or None), optional

Common support adjustment: How to determine upper and lower bounds. If cs_type == 1: 1 or None : Min-max rule. < 1 : Respective quantile. Default (or None) is 1.

ct_grid_drInteger (or None), optional

Number of grid points for discretization of continuous treatment (with 0 mass point; grid is defined in terms of quantiles of continuous part of treatment) for dose response function. Default (or None) is 100.

ct_grid_nnInteger (or None), optional

Number of grid points for discretization of continuous treatment (with 0 mass point; grid is defined in terms of quantiles of continuous part of treatment) for neighbourhood matching. Default (or None) is 10.

ct_grid_wInteger (or None), optional

Number of grid points for discretization of continuous treatment (with 0 mass point; grid is defined in terms of quantiles of continuous part of treatment) for weights. Default (or None) is 10.

dc_clean_dataBoolean (or None), optional

Clean covariates. Remove all rows with missing observations and unnecessary variables from DataFrame. Default (or None) is True.

dc_check_perfectcorrBoolean (or None), optional

Screen and clean covariates: Variables that are perfectly correlated with each other will be deleted. Default (or None) is True.

dc_min_dummy_obsInteger (or None), optional

Screen covariates: If > 0, dummy variables with fewer than dc_min_dummy_obs observations in one category will be deleted. Default (or None) is 10.

dc_screen_covariatesBoolean (or None), optional

Screen and clean covariates. Default (or None) is True.

fs_yesBoolean (or None), optional

Feature selection before building causal forest: A feature is deleted if it is irrelevant in the reduced forms for the treatment AND the outcome. Reduced forms are computed with random forest classifiers or random forest regression, depending on the type of variable. Irrelevance is measured by variable importance measures based on randomly permuting a single variable and checking its reduction in either accuracy (classification) or R2 (regression) compared to the test set prediction based on the full model. Exceptions: (i) If the correlation of two variables to be deleted is larger than 0.5, one of the two variables is kept. (ii) Variables used to compute GATEs, BGATEs, CBGATEs, variables contained in ‘var_x_name_remain_ord’ or ‘var_x_name_remain_unord’, and variables that are needed otherwise are not removed. If the number of variables is very large (and the space of relevant features is much sparser), then using feature selection is likely to improve computational and statistical properties of the mcf estimator. Default (or None) is False.

fs_rf_thresholdInteger or Float (or None), optional

Feature selection: Threshold in terms of relative loss of variable importance in %. Default (or None) is 1.

fs_other_sampleBoolean (or None), optional

True : Random sample from training data used. These observations will not be used for causal forest. False : Use the same sample as used for causal forest estimation. Default (or None) is True.

fs_other_sample_shareFloat (or None), optional

Feature selection: Share of sample used for feature selection (only relevant if fs_other_sample is True). Default (or None) is 0.33.

gen_d_typeString (or None), optional

Type of treatment. ‘discrete’: Discrete treatment. ‘continuous’: Continuous treatment. Default (or None) is ‘discrete’.

gen_iate_effBoolean (or None), optional

Additionally, compute more efficient IATE (IATE are estimated twice and averaged where role of tree_building and tree_filling sample is exchanged; X-fitting). No inference is attempted for these parameters. Default (or None) is False.

gen_mp_parallelInteger (or None), optional

Number of parallel processes (using ray on CPU). The smaller this value is, the slower the programme, the smaller its demands on RAM. None : 80% of logical cores. Default is None.

gen_outfiletextString (or None), optional

File for text output. (.txt) file extension will be added. None : ‘txtFileWithOutput’. Default is None.

gen_outpathString or Pathlib object (or None), optional

Path where the output is written to (text, estimated effects, etc.). If the specified directory does not exist, it will be created. None : An (…/out) directory below the current directory is used. Default is None.

gen_output_typeInteger (or None), optional

Destination of text output. 0: Terminal. 1: File. 2: Terminal and file. Default (or None) is 2.

gen_panel_dataBoolean (or None), optional

Panel data used. p_cluster_std is set to True. Default (or None) is False.

gen_panel_in_rfBoolean (or None), optional

Panel data used: Use panel structure also when building the random samples within the forest procedure. Default (or None) is True.

gen_weightedBoolean (or None), optional

Use of sampling weights to be provided in var_w_name. Default (or None) is False.

lc_yesBoolean (or None), optional

Local centering. The predicted value of the outcome from a regression with all features (but without the treatment) is subtracted from the observed outcomes (using 5-fold cross-fitting). The best method for the regression is selected among scikit-learn’s Random Forest, Support Vector Machines, and AdaBoost Regression based on their out-of-sample mean squared error. The method selection is either performed on the subsample used to build the forest ((1-lc_cs_share) for training, lc_cs_share for test). Default (or None) is True.

lc_estimatorString (or None), optional

The estimator used for local centering. Possible choices are scikit-learn’s regression methods ‘RandomForest’, ‘RandomForestNminl5’, ‘RandomForestNminls5’, ‘SupportVectorMachine’, ‘SupportVectorMachineC2’, ‘SupportVectorMachineC4’, ‘AdaBoost’, ‘AdaBoost100’, ‘AdaBoost200’, ‘GradBoost’, ‘GradBoostDepth6’, ‘GradBoostDepth12’, ‘LASSO’, ‘NeuralNet’, ‘NeuralNetLarge’, ‘NeuralNetLarger’, ‘Mean’. If set to ‘automatic’, the estimator with the lowest out-of-sample mean squared error (MSE) is selected. Whether this selection is based on cross-validation or a test sample is governed by the keyword lc_cs_cv. ‘Mean’ is included for the cases when none of the methods have explanatory power. Default (or None) is ‘RandomForest’.

lc_uncenter_poBoolean (or None), optional

Predicted potential outcomes that are re-adjusted for local centering are added to the data output (iate and iate_eff in results dictionary). Default (or None) is True.

lc_cs_cvBoolean (or None), optional

Data to be used for local centering & common support adjustment. True : Cross-validation. False : Random sample not to be used for forest building. Default (or None) is True.

lc_cs_cv_kInteger (or None), optional

Data to be used for local centering & common support adjustment: Number of folds in cross-validation (if lc_cs_cv is True). Default (or None) depends on the size of the training sample (N): N < 100’000: 5; 100’000 <= N < 250’000: 4; 250’000 <= N < 500’000: 3; 500’000 <= N: 2.
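
The sample-size rule for the default number of folds translates directly into a lookup. The function name is hypothetical; the thresholds are those stated above.

```python
def default_cv_folds(n_train: int) -> int:
    """Default number of cross-validation folds for lc_cs_cv_k."""
    if n_train < 100_000:
        return 5
    if n_train < 250_000:
        return 4
    if n_train < 500_000:
        return 3
    return 2
```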

lc_cs_shareFloat (or None), optional

Data to be used for local centering & common support adjustment: Share of training data (if lc_cs_cv is False). Default (or None) is 0.25.

p_atetBoolean (or None), optional

Compute effects for specific treatment groups. Only possible if treatment is included in prediction data. Default (or None) is False.

p_gates_minus_previousBoolean (or None), optional

Estimate increase of difference of GATEs, CBGATEs, BGATEs when evaluated at next larger observed value. Default (or None) is False.

p_gates_no_evalu_pointsInteger (or None), optional

Number of evaluation points for discretized variables in (CB)(B)GATE estimation. Default (or None) is 50.

p_gates_smoothBoolean (or None), optional

Alternative way to estimate GATEs for continuous features. Instead of discretizing variable, its GATE is evaluated at p_gates_smooth_no_evalu_points. Since there are likely to be no observations, a local neighbourhood around the evaluation points is considered. Default (or None) is True.

p_gates_smooth_bandwidthFloat (or None), optional

Multiplier for bandwidth used in (C)BGATE estimation with smooth variables. Default (or None) is 1.

p_gates_smooth_no_evalu_pointsInteger (or None), optional

Number of evaluation points for discretized variables in GATE estimation. Default (or None) is 50.

p_gatetBoolean (or None), optional

Compute effects for specific treatment groups. Only possible if treatment is included in prediction data. Default (or None) is False.

p_bgateBoolean (or None), optional

Estimate a GATE that is balanced in selected features (as specified in var_x_name_balance_bgate). Default (or None) is False.

p_cbgateBoolean (or None), optional

Estimate a GATE that is balanced in all other features. Default (or None) is False.

p_bgate_sample_shareFloat (or None), optional

Implementation of (C)BGATE estimation is very CPU intensive. Therefore, random samples are used to speed up the programme if number of observations / number of evaluation points > 10. None : If the number of observations in the prediction data (n) < 1000: 1. If n >= 1000:

\[1000 + \frac{{(n - 1000)^{\frac{3}{4}}}}{{\text{evaluation points}}}\]

Default is None.

p_max_cats_z_varsInteger (or None), optional

Maximum number of categories for discretizing continuous z variables. None : \(\text{Number of observations}^{0.3}\) Default is None.

p_iateBoolean (or None), optional

IATEs will be estimated. Default (or None) is True.

p_iate_seBoolean (or None), optional

Standard errors of IATEs will be estimated. Default (or None) is False.

p_iate_m_ateBoolean (or None), optional

IATEs minus ATE will be estimated. Default (or None) is False.

p_qiateBoolean (or None), optional

QIATEs will be estimated. Default (or None) is False.

p_qiate_seBoolean (or None), optional

Standard errors of QIATEs will be estimated. Default (or None) is False.

p_qiate_m_mqiateBoolean (or None), optional

QIATEs minus median of QIATEs will be estimated. Default (or None) is False.

p_qiate_m_oppBoolean (or None), optional

QIATE(x, q) - QIATE(x, 1-q) will be estimated (q denotes quantile level, q < 0.5). Default is False.

p_qiate_no_of_quantilesInteger (or None), optional

Number of quantiles used for QIATE. Default (or None) is 99.

p_qiate_smoothBoolean (or None), optional

Smooth estimated QIATEs using kernel smoothing. Default is True.

p_qiate_smooth_bandwidthInteger or Float (or None), optional

Multiplier applied to default bandwidth used for kernel smoothing of QIATE. Default (or None) is 1.

p_qiate_bias_adjustBoolean (or None), optional

Bias correction procedure for QIATEs based on simulations. Default is True.

If p_qiate_bias_adjust is True, p_iate_se is set to True as well.

p_qiate_bias_adjust_drawsInteger or Float (or None), optional

Number of random draws used in computing the bias adjustment. Default is 1000.

p_ci_levelFloat (or None), optional

Confidence level for bounds used in plots. Default (or None) is 0.95.

p_cond_varBoolean (or None), optional

True : Conditional mean & variances are used. False : Variance estimation uses \(wy_i = w_i \times y_i\) directly. Default (or None) is True.

p_knnBoolean (or None), optional

True : k-NN estimation. False: Nadaraya-Watson estimation. Nadaraya-Watson estimation gives a better approximation of the variance, but k-NN is much faster, in particular for larger datasets. Default (or None) is True.

p_knn_min_kInteger (or None), optional

Minimum number of neighbours in k-NN estimation. Default (or None) is 10.

p_nw_bandwFloat (or None), optional

Bandwidth for nw estimation: Multiplier of Silverman’s optimal bandwidth. Default (or None) is 1.

p_nw_kernInteger (or None), optional

Kernel for Nadaraya-Watson estimation. 1 : Epanechnikov. 2 : Normal pdf. Default (or None) is 1.

p_max_weight_shareFloat (or None), optional

Truncation of extreme weights. Maximum share of any weight (larger than 0, at most 1). Enforced by trimming excess weights and renormalisation for each (BG,G,I,CBG)ATE separately. Because of renormalisation, the final weights could be somewhat above this threshold. Default (or None) is 0.05.
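
The remark that renormalised weights can end up above the threshold is easiest to see in a small example. The sketch below is a simplified single-pass version that assumes non-negative weights; the function name is hypothetical and the actual implementation may differ.

```python
def truncate_weights(weights, max_share: float = 0.05) -> list:
    """Trim weight shares at max_share and renormalise them to sum to one.

    Simplified sketch: assumes non-negative weights with a positive sum.
    """
    total = sum(weights)
    shares = [w / total for w in weights]
    trimmed = [min(share, max_share) for share in shares]
    norm = sum(trimmed)
    return [share / norm for share in trimmed]


# One dominant weight among fifty small ones.
result = truncate_weights([50.0] + [1.0] * 50, max_share=0.05)
```

In this example the dominant observation starts with a share of 0.5, is trimmed to 0.05, and ends slightly above 0.05 after renormalisation.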

p_cluster_stdBoolean (or None), optional

Clustered standard errors. Always True if gen_panel_data is True. Default (or None) is False.

p_se_boot_ateInteger or Boolean (or None), optional

Bootstrap of standard errors for ATE. Specify either a Boolean (if True, number of bootstrap replications will be set to 199) or an integer corresponding to the number of bootstrap replications (this implies True). None : 199 replications if p_cluster_std is True, and False otherwise. Default is None.

p_se_boot_gateInteger or Boolean (or None), optional

Bootstrap of standard errors for GATE. Specify either a Boolean (if True, number of bootstrap replications will be set to 199) or an integer corresponding to the number of bootstrap replications (this implies True). None : 199 replications if p_cluster_std is True, and False otherwise. Default is None.

p_se_boot_iateInteger or Boolean (or None), optional

Bootstrap of standard errors for IATE. Specify either a Boolean (if True, number of bootstrap replications will be set to 199) or an integer corresponding to the number of bootstrap replications (this implies True). None : 199 replications if p_cluster_std is True, and False otherwise. Default is None.

p_se_boot_qiateInteger or Boolean (or None), optional

Bootstrap of standard errors for QIATE. Specify either a Boolean (if True, number of bootstrap replications will be set to 199) or an integer corresponding to the number of bootstrap replications (this implies True). None : 199 replications if p_cluster_std is True, and False otherwise. Default is None.
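
The four p_se_boot_* keywords share the same resolution rule, which can be sketched as follows. The function name is hypothetical; it returns either False (no bootstrap) or the number of replications.

```python
def resolve_se_boot(value, cluster_std: bool):
    """Resolve a p_se_boot_* setting to False or a number of replications."""
    if value is None:
        # Default: bootstrap only when clustered standard errors are used.
        return 199 if cluster_std else False
    if value is True:
        return 199
    if value is False:
        return False
    return int(value)  # an integer implies True
```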

p_bt_yesBoolean (or None), optional

ATE based balancing test based on weights. Relevance of this test in its current implementation is not fully clear. Default (or None) is True.

p_choice_based_samplingBoolean (or None), optional

Choice based sampling to speed up programme if treatment groups have very different sizes. Default (or None) is False.

p_choice_based_probsList of Floats (or None), optional

Choice based sampling: Sampling probabilities to be specified. These weights are used for (G,B,CB)ATEs only. Treatment information must be available in the prediction data. Default is None.

p_ate_no_se_onlyBoolean (or None), optional

Computes only the ATE without standard errors. Default (or None) is False.

post_est_statsBoolean (or None), optional

Descriptive Analyses of IATEs (p_iate must be True). Default (or None) is True.

post_relative_to_first_group_onlyBoolean (or None), optional

Descriptive Analyses of IATEs: Use only effects relative to treatment with lowest treatment value. Default (or None) is True.

post_bin_corr_yesBoolean (or None), optional

Descriptive Analyses of IATEs: Checking the binary correlations of predictions with features. Default (or None) is True.

post_bin_corr_thresholdFloat, optional

Descriptive Analyses of IATEs: Minimum threshold of absolute correlation to be displayed. Default (or None) is 0.1.

post_kmeans_yesBoolean (or None), optional

Descriptive Analyses of IATEs: Using k-means clustering to analyse patterns in the estimated effects. Default (or None) is True.

post_kmeans_singleBoolean (or None), optional

If True (and post_kmeans_yes is True), clustering is also with respect to all single effects. If False (and post_kmeans_yes is True), clustering is only with respect to all relevant IATEs jointly. Default (or None) is False.

post_kmeans_no_of_groupsInteger or List or Tuple (or None), optional

Descriptive Analyses of IATEs: Number of clusters to be built in k-means. None : List of 5 values: [a, b, c, d, e]; c = 5 to 10; depending on number of observations; c<7: a=c-2, b=c-1, d=c+1, e=c+2, else a=c-4, b=c-2, d=c+2, e=c+4. Default is None.
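The default list of cluster numbers can be written out as a small sketch. How the centre value c is derived from the number of observations is not specified here, so the hypothetical function `kmeans_group_grid` takes c as an input and only applies the documented rule for a, b, d and e.

```python
def kmeans_group_grid(c):
    """Return [a, b, c, d, e] for a centre value c between 5 and 10.

    Hypothetical helper following the documented rule:
    c < 7 : a=c-2, b=c-1, d=c+1, e=c+2
    else  : a=c-4, b=c-2, d=c+2, e=c+4
    """
    if not 5 <= c <= 10:
        raise ValueError("c is documented to lie between 5 and 10")
    if c < 7:
        return [c - 2, c - 1, c, c + 1, c + 2]
    return [c - 4, c - 2, c, c + 2, c + 4]


print(kmeans_group_grid(5))
print(kmeans_group_grid(8))
```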

post_kmeans_max_triesInteger (or None), optional

Descriptive Analyses of IATEs: Maximum number of iterations of k-means to achieve convergence. Default (or None) is 1000.

post_kmeans_replicationsInteger (or None), optional

Descriptive Analyses of IATEs: Number of replications with random start centers to avoid local extrema. Default (or None) is 10.

post_kmeans_min_size_shareFloat (or None).

Smallest cluster size allowed, as a share of observations in % (0-33). Default (or None) is 1 (%).

post_random_forest_viBoolean (or None), optional

Descriptive Analyses of IATEs: Variable importance measure of random forest used to learn factors influencing IATEs. Default (or None) is True.

post_plotsBoolean (or None), optional

Descriptive Analyses of IATEs: Plots of estimated treatment effects. Default (or None) is True.

post_treeBoolean (or None), optional

Regression trees (honest and standard) of depth 2 to 5 are estimated to describe IATEs(x). Default (or None) is True.

p_knn_constBoolean (or None), optional

Multiplier of default number of observations used in moving average of analyse() method. Default (or None) is 1.

_int_cudaBoolean (or None), optional

Use CUDA based GPU if CUDA-compatible GPU is available on hardware (experimental). Default (or None) is False.

_int_descriptive_statsBoolean (or None), optional

Print descriptive stats if _int_with_output is True. Default (or None) is True. Internal variable, change default only if you know what you do.

_int_show_plotsBoolean (or None), optional

Execute show() command if _int_with_output is True. Default (or None) is True. Internal variable, change default only if you know what you do.

_int_dpiInteger (or None), optional

dpi in plots. Default (or None) is 500. Internal variable, change default only if you know what you do.

_int_fontsizeInteger (or None), optional

Font for legends, from 1 (very small) to 7 (very large). Default (or None) is 2. Internal variable, change default only if you know what you do.

_int_no_filled_plotInteger (or None), optional

Use filled plot if more than _int_no_filled_plot different values. Default (or None) is 20. Internal variable, change default only if you know what you do.

_int_max_cats_cont_varsInteger (or None), optional

Discretise continuous variables: _int_max_cats_cont_vars is maximum number of categories for continuous variables. This speeds up the programme but may introduce some bias. None: No use of discretisation to speed up programme. Default is None. Internal variable, change default only if you know what you do.

_int_max_save_valuesInteger (or None), optional

Save value of features in table only if less than _int_max_save_values different values. Default (or None) is 50. Internal variable, change default only if you know what you do.

_int_max_obs_trainingInteger (or None), optional

Upper limit for sample size. If actual number is larger than this number, then the respective data will be randomly reduced to the specified upper limit. Training method: Reducing observations for training increases MSE and thus should be avoided. Default is infinity. Internal variable, change default only if you know what you do.

_int_max_obs_predictionInteger (or None), optional

Upper limit for sample size. If actual number is larger than this number, then the respective data will be randomly reduced to the specified upper limit. Prediction method: Reducing observations for prediction does not much affect MSE. It may reduce detectable heterogeneity, but may also dramatically reduce computation time. Default is 250’000. Internal variable, change default only if you know what you do.

_int_max_obs_kmeansInteger (or None), optional

Upper limit for sample size. If actual number is larger than this number, then the respective data will be randomly reduced to the specified upper limit. kmeans in analyse method: Reducing observations may reduce detectable heterogeneity, but also reduces computation time. Default is 200’000. Internal variable, change default only if you know what you do.

_int_max_obs_post_rel_graphsInteger (or None), optional

Upper limit for sample size. If actual number is larger than this number, then the respective data will be randomly reduced to the specified upper limit. Figures show the relation of IATEs and features (note that the built-in non-parametric regression is computationally intensive). Default is 50’000. Internal variable, change default only if you know what you do.
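The four _int_max_obs_* limits share the same idea: data are left untouched when they fit under the limit and are randomly reduced otherwise. The sketch below illustrates this with the standard library only. The names `DEFAULT_MAX_OBS` and `reduce_sample`, the seed argument, and the treatment of None as "no limit" are assumptions of this example, not part of the mcf API.

```python
import math
import random

# Upper limits as documented above (math.inf stands in for "infinity").
DEFAULT_MAX_OBS = {
    "training": math.inf,
    "prediction": 250_000,
    "kmeans": 200_000,
    "post_rel_graphs": 50_000,
}


def reduce_sample(rows, max_obs, seed=12345):
    """Randomly reduce rows to at most max_obs elements (hypothetical helper).

    In this sketch, None is treated as 'no limit'.
    """
    rows = list(rows)
    if max_obs is None or len(rows) <= max_obs:
        return rows                      # nothing to reduce
    rng = random.Random(seed)            # local generator, reproducible draw
    return rng.sample(rows, max_obs)     # sampling without replacement


reduced = reduce_sample(range(1_000), max_obs=100)
print(len(reduced))
```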

_int_mp_ray_delTuple of strings (or None), optional

‘refs’ : Delete references to object store. ‘rest’ : Delete all other objects of Ray task. ‘none’ : Delete no objects. These 3 options can be combined. Default (or None) is (‘refs’,). Internal variable, change default only if you know what you do.

_int_mp_ray_objstore_multiplierFloat (or None), optional

Changes internal default values for size of Ray object store. Change to 1 if programme crashes because object store is full. Only relevant if _int_mp_ray_shutdown is True. Default (or None) is 1. Internal variable, change default only if you know what you do.

_int_mp_ray_shutdownBoolean (or None), optional

When computing the mcf repeatedly like in Monte Carlo studies, setting _int_mp_ray_shutdown to True may be a good idea. None: False if obs < 100000, True otherwise. Default is None. Internal variable, change default only if you know what you do.

_int_mp_vim_typeInteger (or None), optional

Type of multiprocessing when computing variable importance statistics: 1 : Variable based (fast, lots of memory). 2 : Bootstrap based (slower, less memory). None: 1 if obs < 20000, 2 otherwise. Default is None. Internal variable, change default only if you know what you do.
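The None defaults of _int_mp_ray_shutdown and _int_mp_vim_type depend on the number of observations. A minimal sketch of these two rules, using hypothetical helper names and assuming that the observation count is known when the defaults are resolved:

```python
def resolve_ray_shutdown(value, n_obs):
    """None -> False if n_obs < 100000, True otherwise (hypothetical helper)."""
    if value is None:
        return n_obs >= 100_000
    return bool(value)


def resolve_vim_type(value, n_obs):
    """None -> 1 if n_obs < 20000, 2 otherwise (hypothetical helper)."""
    if value is None:
        return 1 if n_obs < 20_000 else 2
    if value not in (1, 2):
        raise ValueError("documented values are 1 (variable based) or 2 (bootstrap based)")
    return value


print(resolve_ray_shutdown(None, n_obs=50_000), resolve_vim_type(None, n_obs=50_000))
```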

_int_iate_chunk_sizeInteger or None, optional

Number of IATEs that are estimated in a single ray worker. Default is number of prediction observations / workers. If programme crashes in second part of IATE because of excess memory consumption, reduce _int_iate_chunk_size.

_int_mp_weights_tree_batchInteger (or None), optional

Number of batches to split data in weight computation for variable importance statistics: The smaller the number of batches, the faster the programme and the more memory is needed. None : Automatically determined. Default is None. Internal variable, change default only if you know what you do.

_int_mp_weights_typeInteger (or None), optional

Type of multiprocessing when computing weights. 1 : Groups of observations based (fast, lots of memory). 2 : Tree based (takes forever, less memory). Value of 2 will be internally changed to 1 if multiprocessing. Default (or None) is 1. Internal variable, change default only if you know what you do.

_int_obs_bigdataInteger or None, optional

If number of training observations is larger than this number, the following happens during training: (i) Number of workers is halved in local centering. (ii) Ray is explicitly shut down. (iii) The number of workers used is reduced to 75% of default. (iv) The data type for some numpy arrays is reduced from float64 to float32. Default is 1’000’000.

_int_output_no_new_dirBoolean (or None), optional

Do not create a new directory when the path already exists. Default (or None) is False.

_int_reportBoolean (or None), optional

Provide information for McfOptPolReports to construct informative reports. Default (or None) is True.

_int_return_iate_spBoolean (or None), optional

Return all data with predictions despite _int_with_output is False (useful for cross-validation and simulations). Default (or None) is False. Internal variable, change default only if you know what you do.

_int_replicationBoolean (or None), optional

If True, all scikit-learn based computations will NOT use multiprocessing. Default (or None) is False.

_int_seed_sample_splitInteger (or None), optional

Seeding is redone when building forest. Default (or None) is 67567885. Internal variable, change default only if you know what you do.

_int_share_forest_sampleFloat (or None), optional

Share of sample used to build the forest. Default (or None) is 0.5. Internal variable, change default only if you know what you do.

_int_verboseBoolean (or None), optional

Additional output about running of mcf if _int_with_output is True. Default (or None) is True. Internal variable, change default only if you know what you do.

_int_weight_as_sparseBoolean (or None), optional

Save weights matrix as sparse matrix. Default (or None) is True. Internal variable, change default only if you know what you do.

_int_weight_as_sparse_splitsInteger (or None), optional

Compute sparse weight matrix in several chunks. None : (Rows of prediction data * rows of Fill_y data) / (number of training splits * 25’000 * 25’000). Default is None. Internal variable, change default only if you know what you do.

_int_with_outputBoolean (or None), optional

Print output on txt file and/or console. Default (or None) is True. Internal variable, change default only if you know what you do.

_int_del_forestBoolean (or None), optional

Delete forests from instance. If True, less memory is needed, but the trained instance of the class cannot be reused when calling predict() with the same instance again, i.e. the forest has to be retrained when applied again. Default (or None) is False.

_int_keep_w0Boolean (or None), optional

Keep all zero weights when computing standard errors (slows down computation and may lead to undesirable behaviour). Default is False.

Methods#

train(data_df)

Build the modified causal forest on the training data.

train_iv(data_df)

Train the IV modified causal forest on the training data.

predict(data_df)

Compute all effects.

predict_different_allocations(data_df[, ...])

Predict average potential outcomes for different allocations.

predict_iv(data_df)

Compute all effects for instrument mcf (possibly in 2 different ways).

analyse(results)

Analyse estimated IATEs with various descriptive tools.

sensitivity(train_df[, predict_df, results, ...])

Compute simulation-based sensitivity indicators.

Optimal Policy#

class optpolicy_main.OptimalPolicy(dc_check_perfectcorr=True, dc_clean_data=True, dc_min_dummy_obs=10, dc_screen_covariates=True, estrisk_value=1, fair_adjust_target='xvariables', fair_consistency_test=False, fair_cont_min_values=20, fair_material_disc_method='Kmeans', fair_material_max_groups=5, fair_regression_method='RandomForest', fair_protected_disc_method='Kmeans', fair_protected_max_groups=5, fair_type='Quantiled', gen_method='best_policy_score', gen_mp_parallel='None', gen_outfiletext='txtFileWithOutput', gen_outpath=None, gen_output_type=2, gen_variable_importance=True, other_costs_of_treat=None, other_costs_of_treat_mult=None, other_max_shares=None, pt_depth_tree_1=3, pt_depth_tree_2=1, pt_enforce_restriction=False, pt_eva_cat_mult=1, pt_no_of_evalupoints=100, pt_min_leaf_size=None, pt_select_values_cat=False, rnd_shares=None, var_bb_restrict_name=None, var_d_name=None, var_effect_vs_0=None, var_effect_vs_0_se=None, var_id_name=None, var_material_name_ord=None, var_material_name_unord=None, var_polscore_desc_name=None, var_polscore_name=None, var_polscore_se_name=None, var_protected_name_ord=None, var_protected_name_unord=None, var_vi_x_name=None, var_vi_to_dummy_name=None, var_x_name_ord=None, var_x_name_unord=None, _int_dpi=500, _int_fontsize=2, _int_output_no_new_dir=False, _int_report=True, _int_with_numba=True, _int_with_output=True, _int_xtr_parallel=True)#

Optimal policy learning

Parameters
  • dc_screen_covariates (Boolean (or None), optional) – Check features. Default (or None) is True.

  • dc_check_perfectcorr (Boolean (or None), optional) – Features that are perfectly correlated are deleted (1 of them). Only relevant if dc_screen_covariates is True. Default (or None) is True.

  • dc_min_dummy_obs (Integer (or None), optional) – Delete dummy variables that have less than dc_min_dummy_obs in one of their categories. Only relevant if dc_screen_covariates is True. Default (or None) is 10.

  • dc_clean_data (Boolean (or None), optional) – Remove all missing & unnecessary variables. Default (or None) is True.

  • estrisk_value (Float or integer (or None), optional) – This is k in the formula ‘policy_score - k * standard_error’ used to adjust the scores for estimation risk. Default (or None) is 1.

  • fair_adjust_target (String (or None), optional) – Target for the fairness adjustment. 'scores' : Adjust policy scores. 'xvariables' : Adjust decision variables. 'scores_xvariables' : Adjust both decision variables and score. Default (or None) is ‘xvariables’.

  • fair_consistency_test (Boolean (or None), optional) – Test for internal consistency of the fairness correction. When 'fair_adjust_target' is 'scores' or 'scores_xvariables', the fairness corrections are applied independently to every policy score (which usually is a potential outcome or an IATE(x) for each treatment relative to some base treatment, i.e. comparing 1-0, 2-0, 3-0, etc.). Thus the IATE for the 2-1 comparison can be computed as IATE(2-0)-IATE(1-0). This test compares two ways to compute a fair score for the 2-1 (and all other) comparisons, which should give similar results: a) Difference of two fair (!) scores. b) Difference of corresponding scores, subsequently made fair. Note: Depending on the number of treatments, this test may be computationally more expensive than the original fairness corrections. Default (or None) is False.

  • fair_cont_min_values (Integer or float (or None), optional) – The method used for fairness corrections depends on whether the variable is considered as continuous or discrete. All unordered variables are considered as being discrete, and all ordered variables with more than fair_cont_min_values are considered as being discrete as well. The default (or None) is 20.

  • fair_material_disc_method (String (or None), optional) – Method used to discretize materially relevant features. 'NoDiscretization' : Variables are not changed. If one of the features has more different values than fair_material_max_groups, all materially relevant features will formally be treated as continuous. The latter may become unreliable if their dimension is not very small. 'EqualCell' : Attempts to create equal cells for each variable. May be useful for a very small number of variables with few different values. 'Kmeans' : Use Kmeans clustering algorithm to form homogeneous cells. Default (or None) is ‘Kmeans’.

  • fair_protected_disc_method (String (or None), optional) – Method used to discretize protected features. 'NoDiscretization' : Variables are not changed. If one of the features has more different values than fair_protected_max_groups, all protected features will formally be treated as continuous. The latter may become unreliable if their dimension is not very small. 'EqualCell' : Attempts to create equal cells for each variable. May be useful for a very small number of variables with few different values. 'Kmeans' : Use Kmeans clustering algorithm to form homogeneous cells. Default (or None) is 'Kmeans'.

  • fair_material_max_groups (Integer (or None), optional) – Level of discretization of materially relevant variables (only if needed). Number of groups of materially relevant features for cases when materially relevant variables are needed in discretized form. This is currently only necessary for 'Quantiled'. Its meaning depends on fair_material_disc_method: If 'EqualCell' : If more than 1 variable is included among the protected variables, this restriction is applied to each variable. If 'Kmeans' : This is the number of clusters used by Kmeans. Default (or None) is 5.

  • fair_protected_max_groups (Integer (or None), optional) – Level of discretization of protected variables (only if needed). Number of groups of protected features for cases when protected variables are needed in discretized form. This is currently only necessary for 'Quantiled'. Its meaning depends on fair_protected_disc_method: If 'EqualCell' : If more than 1 variable is included among the protected variables, this restriction is applied to each variable. If 'Kmeans' : This is the number of clusters used by Kmeans. Default (or None) is 5.

  • fair_regression_method (String (or None), optional) – Method choice when predictions from machine learning are needed for fairness corrections (fair_type in ('Mean', 'MeanVar')). Available methods are 'RandomForest', 'RandomForestNminl5', 'RandomForestNminls5', 'SupportVectorMachine', 'SupportVectorMachineC2', 'SupportVectorMachineC4', 'AdaBoost', 'AdaBoost100', 'AdaBoost200', 'GradBoost', 'GradBoostDepth6', 'GradBoostDepth12', 'LASSO', 'NeuralNet', 'NeuralNetLarge', 'NeuralNetLarger', 'Mean'. If 'automatic', an optimal method will be chosen based on 5-fold cross-validation in the training data. If a method is specified, it will be used for all scores and all adjustments. If ‘automatic’, every policy score might be adjusted with a different method. ‘Mean’ is included for cases in which regression methods have no explanatory power. Default (or None) is 'RandomForest'.

  • fair_type (String (or None), optional) – Method to choose the type of correction for the policy scores. 'Mean' : Mean dependence of the policy score on protected variables is removed by residualisation. 'MeanVar' : Mean dependence and heteroscedasticity are removed by residualisation and rescaling. 'Quantiled' : Removing dependence via (an empirical version of) the approach by Strack and Yang (2024) using quantiles. 'Mean' and 'MeanVar' are only available for adjusting the score (not the decision variables). See the paper by Bearth, Lechner, Mareckova, Muny (2024) for details on these methods. Default (or None) is ‘Quantiled’.

  • gen_method (String (or None), optional) – Method to compute the assignment algorithm. Available methods: 'best_policy_score', 'bps_classifier', 'policy tree'. 'best_policy_score' conducts Black-Box allocations, which are obtained by using the scores directly (potentially subject to restrictions). When the Black-Box allocations are used for allocation of data not used for training, the respective scores must be available. 'bps_classifier' uses the allocations obtained by 'best_policy_score' and trains classifiers. The output will be a decision rule that depends on features only and does not require knowledge of the policy scores. The actual classifier used is selected among four different classifiers offered by scikit-learn, namely a simple neural network, two classification random forests with minimum leaf sizes of 2 and 5, and AdaBoost. The selection is made according to the out-of-sample performance on scikit-learn's accuracy score. The implemented policy trees ('policy tree') are optimal trees, i.e. all possible trees are checked if they lead to a better performance. If restrictions are specified, then they are incorporated into treatment specific cost parameters. Many ideas of the implementation follow Zhou, Athey, Wager (2022). If the provided policy scores fulfil their conditions (i.e., they use a doubly robust double machine learning like score), then they also provide attractive theoretical properties. Default (or None) is 'best_policy_score'.

  • gen_mp_parallel (Integer (or None), optional) – Number of parallel processes (using ray on CPU). The smaller this value is, the slower the programme, the smaller its demands on RAM. None : 80% of logical cores. Default is None.

  • gen_outfiletext (String (or None), optional) – File for text output. (.txt) file extension will be automatically added. Default (or None) is ‘txtFileWithOutput’.

  • gen_outpath (String or Pathlib object (or None), optional) – Directory in which to put text output and figures. If it does not exist, it will be created. None : Directory just below the directory where the programme is run. Default is None.

  • gen_output_type (Integer (or None), optional) – Destination of the output. 0 : Terminal. 1 : File. 2 : File and terminal. Default (or None) is 2.

  • gen_variable_importance (Boolean) – Compute variable importance statistics based on random forest classifiers. Default (or None) is True.

  • other_costs_of_treat (List of floats (or None), optional) – Treatment specific costs. These costs are directly subtracted from the policy scores. Therefore, they should be measured in the same units as the scores. Default value (or None) without constraints: It defaults to 0. Default value (or None) with constraints: Costs will be automatically determined so as to enforce the constraints in the training data by finding cost values that lead to an allocation (‘best_policy_score’) that fulfils the restrictions in other_max_shares. Default (or None) is None.

  • other_costs_of_treat_mult (Float or tuple of floats (with as many elements as treatments) (or None), optional) – Multiplier of automatically determined cost values. Use only when automatic costs violate the constraints given by other_max_shares. This allows increasing (>1) or decreasing (<1) the share of treated in a particular treatment. None: (1, …, 1). Default (or None) is None.

  • other_max_shares (Tuple of floats (with as many elements as treatments) (or None), optional) – Maximum share allowed for each treatment. Default (or None) is None.

  • pt_depth_tree_1 (Integer (or None), optional) – Depth of 1st optimal tree. Default is 3. Note that tree depth is defined such that a depth of 1 implies 2 leaves, a depth of 2 implies 4 leaves, a depth of 3 implies 8 leaves, etc.

  • pt_depth_tree_2 (Integer (or None), optional) – Depth of 2nd optimal tree. This set is built within the strata obtained from the leaves of the first tree. If set to 0, a second tree is not built. Default is 1 (together with the default for pt_depth_tree_1 this leads to a (not optimal) total tree of level 4). Note that tree depth is defined such that a depth of 1 implies 2 leaves, a depth of 2 implies 4 leaves, a depth of 3 implies 8 leaves, etc.

  • pt_enforce_restriction (Boolean (or None), optional) – Enforces the imposed restriction (to some extent) during the computation of the policy tree. This increases the quality of trees concerning obeying the restrictions, but can be very time consuming. It will be automatically set to False if more than 1 policy tree is estimated. Default (or None) is False.

  • pt_eva_cat_mult (Integer (or None), optional) – Changes the number of the evaluation points (pt_no_of_evalupoints) for the unordered (categorical) variables to: \(\text{pt_eva_cat_mult} \times \text{pt_no_of_evalupoints}\) (available only for the method ‘policy tree’). Default (or None) is 2.

  • pt_no_of_evalupoints (Integer (or None), optional) – Number of evaluation points for continuous variables. The lower this value, the faster the algorithm, but it may also deviate more from the optimal splitting rule. This parameter is closely related to the approximation parameter of Zhou, Athey, Wager (2022)(A) with \(\text{pt_no_of_evalupoints} = \text{number of observations} / \text{A}\). Only relevant if gen_method is ‘policy tree’. Default (or None) is 100.

  • pt_min_leaf_size (Integer (or None), optional) –

    Minimum leaf size. Leaves that are smaller than pt_min_leaf_size in the training data will not be considered. A larger number reduces computation time and avoids some overfitting. None :

    \[0.1 \times \frac{\text{Number of training observations}}{{\text{Number of leaves}}}\]

    (if treatment shares are restricted this is multiplied by the smallest share allowed). Only relevant if gen_method is ‘policy tree’. Default is None.

  • pt_select_values_cat (Boolean (or None), optional) – Approximation method for larger categorical variables. Since we search among optimal trees, for categorical variables we need to check all possible combinations of the different values that lead to binary splits. This number could indeed be huge. Therefore, we compare only \(\text{pt_no_of_evalupoints} \times \text{pt_eva_cat_mult}\) different combinations. Method 1 (pt_select_values_cat == True) does this by randomly drawing values from the particular categorical variable and forming groups only using those values. Method 2 (pt_select_values_cat == False) sorts the values of the categorical variables according to the values of the policy score as one would do for a standard random forest. If this set is still too large, a random sample of the entailed combinations is drawn. Method 1 is only available for the method ‘policy tree’.

  • rnd_shares (Tuple of floats (or None), optional) – Share of treatments of a stochastic assignment as computed by the evaluate() method. The elements must sum to 1. This is only used as a comparison in the evaluation of other allocations. None: Shares of treatments in the allocation under investigation. Default is None.

  • var_bb_restrict_name (String (or None), optional) – Name of variable related to a restriction in case of capacity constraints. If there is a capacity constraint, preference will be given to observations with highest values of this variable. Only relevant if gen_method is ‘best_policy_score’. Default is None.

  • var_d_name (String (or None), optional) – Name of (discrete) treatment. Needed in training data only if ‘changers’ (different treatment in allocation than observed treatment) are analysed and if allocation is compared to observed allocation (in evaluate() method). Default is None.

  • var_effect_vs_0 (List/tuple of strings (or None), optional) – Name of variables of effects of treatment relative to first treatment. Dimension is equal to the number of treatments minus 1. Default is None.

  • var_effect_vs_0_se (List/tuple of strings (or None), optional) – Name of variables of standard errors of the effects of treatment relative to first treatment. Dimension is equal to the number of treatments minus 1. Default is None.

  • var_id_name ((or None), optional) – Name of identifier in data. Default is None.

  • var_polscore_desc_name (List/tuple of tuples of strings (or None), optional) – Each tuple of dimension equal to the different treatments contains treatment specific variables that are used to evaluate the effect of the allocation with respect to those variables. These could be, for example, policy scores not used in training, but which are relevant nevertheless. Default is None.

  • var_polscore_name (List or tuple of strings (or None), optional) – Names of treatment specific variables to measure the value of individual treatments. This is usually the estimated potential outcome or any other related score. This is required for the solve() method. Default is None.

  • var_material_name_ord (List or tuple of strings (or None), optional) – Materially relevant ordered variables: An effect of the protected variables on the scores is allowed, if captured by these variables (only). These variables may (or may not) be included among the decision variables. These variables must (!) not be included among the protected variables. Default is None.

  • var_material_name_unord (List or tuple of strings (or None), optional) – Materially relevant unordered variables: An effect of the protected variables on the scores is allowed, if captured by these variables (only). These variables may (or may not) be included among the decision variables. These variables must (!) not be included among the protected variables. Default is None.

  • var_protected_name_ord (List or tuple of strings (or None), optional) – Names of protected ordered variables. Their influence on the policy scores will be removed (conditional on the ‘materially important’ variables). These variables should NOT be contained in decision variables, i.e., var_x_name_ord. If they are included, they will be removed and var_x_name_ord will be adjusted accordingly. Default is None.

  • var_protected_name_unord (List or tuple of strings (or None), optional) – Names of protected unordered variables. Their influence on the policy scores will be removed (conditional on the ‘materially important’ variables). These variables should NOT be contained in decision variables, i.e., var_x_name_unord. If they are included, they will be removed and var_x_name_unord will be adjusted accordingly. Default is None.

  • var_vi_x_name (List or tuple of strings or None, optional) – Names of variables for which variable importance is computed. Default is None.

  • var_vi_to_dummy_name (List or tuple of strings or None, optional) – Names of variables for which variable importance is computed. These variables will be broken up into dummies. Default is None.

  • var_x_name_ord (Tuple of strings (or None), optional) – Name of ordered variables (including dummy variables) used to build policy tree and classifier. They are also used to characterise the allocation. Default is None.

  • var_x_name_unord (Tuple of strings (or None), optional) – Name of unordered variables used to build policy tree and classifier. They are also used to characterise the allocation. Default is None.

  • _int_dpi (Integer (or None), optional) – dpi in plots. Default (or None) is 500. Internal variable, change default only if you know what you do.

  • _int_fontsize (Integer (or None), optional) – Font for legends, from 1 (very small) to 7 (very large). Default (or None) is 2. Internal variable, change default only if you know what you do.

  • _int_output_no_new_dir (Boolean) – Do not create a new directory when the path already exists. Default (or None) is False.

  • _int_report (Boolean, optional) – Provide information for McfOptPolReports to construct informative reports. Default (or None) is True.

  • _int_with_numba (Boolean (or None), optional) – Use Numba to speed up computations. Default (or None) is True.

  • _int_with_output (Boolean (or None), optional) – Print output on file and/or screen. Default (or None) is True.

  • _int_xtr_parallel (Boolean (or None), optional) – Parallelize to a larger degree to make sure all CPUs are busy for most of the time. Default (or None) is True. Only used for ‘policy tree’ and only used if _int_parallel_processing > 1 (or None).
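The tree depth convention used by pt_depth_tree_1 and pt_depth_tree_2, and the None default of pt_min_leaf_size, can be written out as a short sketch. The helper names are hypothetical, and which leaf count enters the formula as well as any rounding applied by the package are not specified in the parameter list, so the sketch takes the depth as an input and returns an unrounded value.

```python
def leaves_from_depth(depth):
    """Number of leaves of a full binary tree: depth 1 -> 2, 2 -> 4, 3 -> 8."""
    return 2 ** depth


def default_min_leaf_size(n_train, depth, max_shares=None):
    """Unrounded default: 0.1 * n_train / number of leaves.

    Hypothetical helper. If treatment shares are restricted (max_shares
    given), the value is multiplied by the smallest share allowed.
    """
    size = 0.1 * n_train / leaves_from_depth(depth)
    if max_shares is not None:
        size *= min(max_shares)
    return size


print(leaves_from_depth(3))
print(default_min_leaf_size(1_000, depth=3))
print(default_min_leaf_size(1_000, depth=3, max_shares=(1.0, 0.5, 0.2)))
```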

Methods#

allocate(data_df[, data_title, ...])

Allocate observations to treatment state.

evaluate(allocation_df, data_df[, ...])

Evaluate allocation with potential outcome data.

evaluate_multiple(allocations_dic, data_df)

Evaluate several allocations simultaneously.

estrisk_adjust(data_df[, data_title])

Adjust policy score for estimation risk.

solvefair(data_df[, data_title])

Solve for optimal allocation rule with fairness adjustments.

solve(data_df[, data_title])

Solve for optimal allocation rule.

print_time_strings_all_steps([title])

Print an overview of the time needed in all steps of the programme.

winners_losers(data_df, welfare_df[, ...])

Compare the winners and losers.

Reporting#

class reporting.McfOptPolReport(mcf=None, mcf_sense=None, optpol=None, outputpath=None, outputfile=None)#

Provides reports about the main specification choices and most important results of the ModifiedCausalForest and OptimalPolicy estimations.

Parameters
  • mcf (Instance of the ModifiedCausalForest class or None, optional) – Contains all information needed for reports. The default is None.

  • mcf_sense (Instance of the ModifiedCausalForest class or None, optional) – Contains all information from sensitivity analysis needed for reports. The default is None.

  • optpol (Instance of the OptimalPolicy class or None, optional) – Contains all information from the optimal policy analysis needed for reports. The default is None.

  • outputpath (String, Pathlib object, or None, optional) – Path to write the pdf file that is created with the report() method. If None, then an ‘/out’ subdirectory of the current working directory is used. If the latter does not exist, it is created.

  • outputfile (String or None, optional) – Name of the pdf file that is created by the report() method. If None, ‘Reporting’ is used as name. A string containing the day and time (measured when the programme ends) is always appended to the name.

Methods#

report()

Create a PDF report and save it to a user-provided location.
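The defaults described for outputpath and outputfile can be illustrated with a minimal sketch. It is not the library's implementation: resolve_report_file is a hypothetical helper, and the timestamp format and the underscore separator are assumptions, since the text above only states that day and time are appended.

```python
from datetime import datetime
from pathlib import Path


def resolve_report_file(outputpath=None, outputfile=None, now=None):
    """Hypothetical helper mirroring the documented defaults (not mcf code)."""
    # Documented: None -> 'out' subdirectory of the current working directory.
    out_dir = Path.cwd() / 'out' if outputpath is None else Path(outputpath)
    # Documented: None -> 'Reporting' is used as the file name.
    stem = 'Reporting' if outputfile is None else outputfile
    # Documented: day and time are appended; the exact format is an assumption.
    stamp = (now or datetime.now()).strftime('%Y%m%d_%H%M%S')
    return out_dir / f'{stem}_{stamp}.pdf'


fixed_time = datetime(2024, 1, 2, 3, 4, 5)
default_file = resolve_report_file(now=fixed_time)
custom_file = resolve_report_file('results', 'mcf_report', now=fixed_time)

print(default_file.name)  # Reporting_20240102_030405.pdf
print(custom_file)
```

The sketch only computes the path. According to the parameter description, a missing default output directory is created when the report is written.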

Example Data function#

example_data_functions.example_data(obs_y_d_x_iate: int = 1000, obs_x_iate: int = 1000, no_features: int = 20, no_treatments: int = 3, type_of_heterogeneity: str = 'WagerAthey', seed: int = 12345, descr_stats: bool = True, strength_iv: int = 1, correlation_x: str = 'middle', no_effect=False)#

Create example data to be used with mcf estimation and optimal policy.

Parameters
  • obs_y_d_x_iate (Integer, optional) – Number of observations for training data. The default is 1000.

  • obs_x_iate (Integer, optional) – Number of observations for prediction data. The default is 1000.

  • no_features (Integer, optional) – Number of features of different type. The default is 20.

  • no_treatments (Integer, optional) – Number of treatments (all non-zero treatments have same IATEs). The default is 3.

  • type_of_heterogeneity (String, optional) – Different types of heterogeneity broadly (but not exactly) following the specifications used in the simulations of Lechner and Mareckova (Comprehensive Causal Machine Learning, arXiv, 2024). Possible types are ‘linear’, ‘nonlinear’, ‘quadratic’, ‘WagerAthey’. The default is ‘WagerAthey’.

  • seed (Integer, optional) – Seed of numpy random number generator object. The default is 12345.

  • descr_stats (Boolean, optional) – Show descriptive statistics. The default is True.

  • strength_iv (Integer or Float, optional) – The larger this number is, the stronger the instrument will be. Default is 1.

  • correlation_x (String, optional) – Allows three different levels of dependence between features (‘low’, ‘middle’, ‘high’). Default is ‘middle’.

  • no_effect (Boolean, optional) – All IATEs are set to 0 if True. The default is False.

Returns

  • train_df (DataFrame) – Contains outcome, treatment, features, potential outcomes, IATEs, ITEs, and a zero column (for convenient use with OptimalPolicy).

  • pred_df (DataFrame) – Contains features, potential outcomes, IATEs, ITEs.

  • name_dict (Dictionary) – Contains the names of the variable groups.
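A short usage sketch for the function documented above. The import path mcf.example_data_functions is an assumption about how the package is installed, and example_data_kwargs is a hypothetical helper that only checks the string options listed in the parameter descriptions before the call is made. The call itself is skipped if the package cannot be imported.

```python
ALLOWED_HETEROGENEITY = ('linear', 'nonlinear', 'quadratic', 'WagerAthey')
ALLOWED_CORRELATION = ('low', 'middle', 'high')


def example_data_kwargs(obs_y_d_x_iate=1000, obs_x_iate=1000,
                        type_of_heterogeneity='WagerAthey',
                        correlation_x='middle', seed=12345):
    """Hypothetical helper: validate the documented string options."""
    if type_of_heterogeneity not in ALLOWED_HETEROGENEITY:
        raise ValueError(f'type_of_heterogeneity must be one of '
                         f'{ALLOWED_HETEROGENEITY}')
    if correlation_x not in ALLOWED_CORRELATION:
        raise ValueError(f'correlation_x must be one of {ALLOWED_CORRELATION}')
    return {'obs_y_d_x_iate': obs_y_d_x_iate,
            'obs_x_iate': obs_x_iate,
            'type_of_heterogeneity': type_of_heterogeneity,
            'correlation_x': correlation_x,
            'seed': seed,
            'descr_stats': False}  # Suppress descriptive statistics.


kwargs = example_data_kwargs(obs_y_d_x_iate=500, obs_x_iate=200,
                             type_of_heterogeneity='linear',
                             correlation_x='low')

try:
    # Assumed import path; adjust it to your installation if needed.
    from mcf.example_data_functions import example_data
except ImportError:
    example_data = None

if example_data is not None:
    # Three return values, as listed under Returns.
    train_df, pred_df, name_dict = example_data(**kwargs)
    print(train_df.shape, pred_df.shape, sorted(name_dict))
```

Keeping the seed fixed makes the simulated data reproducible across runs, which is convenient when comparing different estimator settings on the same sample.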

example_data([obs_y_d_x_iate, obs_x_iate, ...])

Create example data to be used with mcf estimation and optimal policy.