Python API

Overview of classes

ModifiedCausalForest([var_d_name, ...])

Estimation of treatment effects with the Modified Causal Forest

OptimalPolicy([dc_check_perfectcorr, ...])

Optimal policy learning

McfOptPolReport([mcf, mcf_sense, optpol, ...])

New in version 0.7.0.

Modified Causal Forest

class mcf_functions.ModifiedCausalForest(var_d_name=None, var_id_name=None, var_iv_name=None, var_w_name=None, var_x_name_always_in_ord=None, var_x_name_always_in_unord=None, var_x_name_balance_test_ord=None, var_x_name_balance_test_unord=None, var_x_name_remain_ord=None, var_x_name_remain_unord=None, var_x_name_ord=None, var_x_name_unord=None, var_y_name=None, var_y_tree_name=None, var_z_name_list=None, var_z_name_ord=None, var_z_name_unord=None, cf_alpha_reg_grid=1, cf_alpha_reg_max=0.15, cf_alpha_reg_min=0.05, cf_boot=1000, cf_chunks_maxsize=None, cf_compare_only_to_zero=False, cf_n_min_grid=1, cf_n_min_max=None, cf_n_min_min=None, cf_n_min_treat=None, cf_nn_main_diag_only=False, cf_m_grid=1, cf_m_random_poisson=True, cf_m_share_max=0.6, cf_m_share_min=0.1, cf_match_nn_prog_score=True, cf_mce_vart=1, cf_random_thresholds=None, cf_p_diff_penalty=None, cf_penalty_type='mse_d', cf_subsample_factor_eval=None, cf_subsample_factor_forest=1, cf_tune_all=False, cf_vi_oob_yes=False, cs_adjust_limits=None, cs_max_del_train=0.5, cs_min_p=0.01, cs_quantil=1, cs_type=1, ct_grid_dr=100, ct_grid_nn=10, ct_grid_w=10, dc_check_perfectcorr=True, dc_clean_data=True, dc_min_dummy_obs=10, dc_screen_covariates=True, fs_rf_threshold=1, fs_other_sample=True, fs_other_sample_share=0.33, fs_yes=False, gen_d_type='discrete', gen_iate_eff=False, gen_panel_data=False, gen_mp_parallel=None, gen_outfiletext=None, gen_outpath=None, gen_output_type=2, gen_panel_in_rf=True, gen_weighted=False, lc_cs_cv=True, lc_cs_cv_k=None, lc_cs_share=0.25, lc_estimator='RandomForest', lc_yes=True, lc_uncenter_po=True, p_atet=False, p_gatet=False, p_bgate=False, p_cbgate=False, p_ate_no_se_only=False, p_bt_yes=True, p_choice_based_sampling=False, p_choice_based_probs=None, p_ci_level=0.95, p_cluster_std=False, p_cond_var=True, p_gates_minus_previous=False, p_gates_smooth=True, p_gates_smooth_bandwidth=1, p_gates_smooth_no_evalu_points=50, p_gates_no_evalu_points=50, p_bgate_sample_share=None, p_iate=True, 
p_iate_se=False, p_iate_m_ate=False, p_knn=True, p_knn_const=1, p_knn_min_k=10, p_nw_bandw=1, p_nw_kern=1, p_max_cats_z_vars=None, p_max_weight_share=0.05, p_qiate=False, p_qiate_se=False, p_qiate_m_mqiate=False, p_qiate_m_opp=False, p_qiate_no_of_quantiles=99, p_qiate_smooth=True, p_qiate_smooth_bandwidth=1, p_qiate_bias_adjust=True, p_qiate_bias_adjust_draws=1000, p_se_boot_ate=None, p_se_boot_gate=None, p_se_boot_iate=None, p_se_boot_qiate=None, var_x_name_balance_bgate=None, var_cluster_name=None, post_bin_corr_threshold=0.1, post_bin_corr_yes=True, post_est_stats=True, post_kmeans_no_of_groups=None, post_kmeans_max_tries=1000, post_kmeans_min_size_share=None, post_kmeans_replications=10, post_kmeans_single=False, post_kmeans_yes=True, post_random_forest_vi=True, post_relative_to_first_group_only=True, post_plots=True, post_tree=True, _int_cuda=False, _int_del_forest=False, _int_descriptive_stats=True, _int_dpi=500, _int_fontsize=2, _int_iate_chunk_size=None, _int_keep_w0=False, _int_no_filled_plot=20, _int_max_cats_cont_vars=None, _int_max_save_values=50, _int_max_obs_training=inf, _int_max_obs_prediction=250000, _int_max_obs_kmeans=200000, _int_max_obs_post_rel_graphs=50000, _int_mp_ray_del=('refs',), _int_mp_ray_objstore_multiplier=1, _int_mp_ray_shutdown=None, _int_mp_vim_type=None, _int_mp_weights_tree_batch=None, _int_mp_weights_type=1, _int_obs_bigdata=1000000, _int_output_no_new_dir=False, _int_red_largest_group_train=False, _int_replication=False, _int_report=True, _int_return_iate_sp=False, _int_seed_sample_split=67567885, _int_share_forest_sample=0.5, _int_show_plots=True, _int_verbose=True, _int_weight_as_sparse=True, _int_weight_as_sparse_splits=None, _int_with_output=True)#

Estimation of treatment effects with the Modified Causal Forest

Parameters

var_y_name (String or List of strings (or None), optional) – Name of outcome variables. If several variables are specified, either var_y_tree_name is used for tree building, or (if var_y_tree_name is None), the 1st variable in the list is used. Only necessary for train() method. Default is None.
var_d_name (String or List of string (or None), optional) – Name of treatment variable. Must be provided to use the train() method. Can be provided for the predict() method.
var_x_name_ord (String or List of strings (or None), optional) – Name of ordered features (including dummy variables). Either ordered or unordered features must be provided. Default is None.
var_x_name_unord (String or List of strings (or None), optional) – Name of unordered features. Either ordered or unordered features must be provided. Default is None.
var_x_name_balance_bgate (String or List of strings (or None), optional) – Variables to balance the GATEs on. Only relevant if p_bgate is True. The distribution of these variables is kept constant when a BGATE is computed. None: Use the other heterogeneity variables (var_z_…) (if there are any) for balancing. Default is None.
var_cluster_name (String or List of string (or None)) – Name of variable defining clusters. Only relevant if p_cluster_std is True. Default is None.
var_id_name (String or List of string (or None), optional) – Name of identifier. None: Identifier will be added to the data. Default is None.
var_iv_name (String or List of string (or None), optional) – Name of binary instrumental variable. Only relevant if train_iv method is used. Default is None.
var_x_name_balance_test_ord (String or List of strings (or None), optional) – Name of ordered variables to be used in balancing tests. Only relevant if p_bt_yes is True. Default is None.
var_x_name_balance_test_unord (String or List of strings (or None), optional) – Name of unordered variables to be used in balancing tests. Treatment specific descriptive statistics are only printed for those variables. Default is None.
var_x_name_always_in_ord (String or List of strings (or None), optional) – Name of ordered variables that are always considered when deciding on the next split during tree building. Only relevant for train() method. Default is None.
var_x_name_always_in_unord (String or List of strings (or None), optional) – Name of unordered variables that are always considered when deciding on the next split during tree building. Only relevant for train() method. Default is None.
var_x_name_remain_ord (String or List of strings (or None), optional) – Name of ordered variables that cannot be removed by feature selection. Only relevant for train() method. Default is None.
var_x_name_remain_unord (String or List of strings (or None), optional) – Name of unordered variables that cannot be removed by feature selection. Only relevant for train() method. Default is None.
var_w_name (String or List of string (or None), optional) – Name of weight. Only relevant if gen_weighted is True. Default is None.
var_z_name_list (String or List of strings (or None), optional) – Names of ordered variables with many values to define causal heterogeneity. They will be discretized and (depending on p_gates_smooth) also treated as continuous. If not already included in var_x_name_ord, they will be added to the list of features. Default is None.
var_z_name_ord (String or List of strings (or None), optional) – Names of ordered variables with not so many values to define causal heterogeneity. If not already included in var_x_name_ord, they will be added to the list of features. Default is None.
var_z_name_unord (String or List of strings (or None), optional) – Names of unordered variables with not so many values to define causal heterogeneity. If not already included in var_x_name_unord, they will be added to the list of features. Default is None.
var_y_tree_name (String or List of string (or None), optional) – Name of outcome variables to be used to build trees. Only relevant if multiple outcome variables are specified in var_y_name. Only relevant for train() method. Default is None.
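The variable-name keywords above can be collected in a plain dictionary before they are handed to the class. The sketch below does this and checks the requirements stated in the descriptions (treatment and outcome names for train(), and at least one ordered or unordered feature list). The column names and the helper `check_var_names` are hypothetical, and passing the dictionary as `ModifiedCausalForest(**var_names)` is an assumption based on all entries being documented keyword arguments.

```python
# Hypothetical column names; replace them with the names in your own data.
var_names = {
    'var_y_name': 'outcome',
    'var_d_name': 'treat',
    'var_x_name_ord': ['age', 'income'],
    'var_x_name_unord': ['region'],
    'var_z_name_ord': ['age'],
}


def check_var_names(params):
    """Return a list of problems relative to the documented requirements."""
    problems = []
    if not params.get('var_d_name'):
        problems.append('var_d_name is required for train()')
    if not params.get('var_y_name'):
        problems.append('var_y_name is required for train()')
    if not (params.get('var_x_name_ord') or params.get('var_x_name_unord')):
        problems.append('provide ordered or unordered features')
    return problems


# An empty list means the documented requirements are met.
print(check_var_names(var_names))
```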
cf_alpha_reg_grid (Integer (or None), optional) – Minimum remaining share when splitting leaf: Number of grid values. If grid is used, optimal value is determined by out-of-bag estimation of objective function. Default (or None) is 1.
cf_alpha_reg_max (Float (or None), optional) – Minimum remaining share when splitting leaf: Largest value of grid (keep it below 0.2). Default (or None) is 0.15.
cf_alpha_reg_min (Float (or None), optional) – Minimum remaining share when splitting leaf: Smallest value of grid (keep it below 0.2). Default (or None) is 0.05.
cf_boot (Integer (or None), optional) – Number of Causal Trees. Default (or None) is 1000.
cf_chunks_maxsize (Integer (or None), optional) –
For large samples, randomly split the training data into equally sized chunks, train a forest in each chunk, and estimate effects for each forest. Final effect estimates are obtained by averaging effects obtained for each forest. This procedure improves scalability by reducing computation time (at the possible price of a somewhat larger finite sample bias). If cf_chunks_maxsize is larger than the sample size, there is no random splitting. The default (None) depends on the size of the training data: If there are fewer than 90’000 training observations: No splitting. Otherwise:

\[\text{cf_chunks_maxsize} = 90000 + \frac{{(\text{number of observations} - 90000)^{0.8}}}{{(\text{# of treatments} - 1)}}\]

Default is None.
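To get a feeling for the default chunk size, the rule above can be evaluated directly. This is a minimal sketch of the stated formula, not the package's code; the function name is hypothetical, and returning None to signal "no splitting" is a choice made for this illustration.

```python
def default_chunks_maxsize(n_obs, n_treatments):
    """Default chunk size according to the formula stated above.

    Returns None when the rule implies that the data are not split.
    """
    if n_obs < 90_000:
        return None  # fewer than 90'000 training observations: no splitting
    return 90_000 + (n_obs - 90_000) ** 0.8 / (n_treatments - 1)


for n_obs in (50_000, 1_090_000):
    print(n_obs, default_chunks_maxsize(n_obs, n_treatments=2))
```

With more treatments the second term shrinks, so the chunks get smaller for the same number of observations.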
cf_compare_only_to_zero (Boolean (or None), optional) – If True, the computation of the MCE ignores all elements not related to the first treatment (which usually is the control group). This speeds up computation, should give better effect estimates, and may be attractive when interest is only in the comparisons of each treatment to the control group and not among each other. This may also be attractive for optimal policy analysis based on using estimated potential outcomes normalized by the estimated potential outcome of the control group (i.e., IATEs of treatments vs. control group). Default (or None) is False.
cf_n_min_grid (Integer (or None), optional) – Minimum leaf size: Number of grid values. If grid is used, optimal value is determined by out-of-bag estimation of objective function. Default (or None) is 1.
cf_n_min_max (Integer (or None), optional) –
Minimum leaf size: Largest minimum leaf size. If None:

\[\text{A} = \frac{\text{number of observations in the smallest treatment group}^{0.5}}{10}, \text{at least 2}\]

\(\text{cf_n_min_max} = \text{round}(A \times \text{number of treatments})\) Default is None.
cf_n_min_min (Integer (or None), optional) –
Minimum leaf size: Smallest minimum leaf size. If None:

\[\text{A} = \text{number of observations in smallest treatment group}^{0.4} / 10, \text{at least 1.5}\]

\(\text{cf_n_min_min} = \text{round}(A \times \text{number of treatments})\) Default is None.
cf_n_min_treat (Integer (or None), optional) –
Minimum number of observations per treatment in leaf. A higher value reduces the risk that a leaf cannot be filled with outcomes from all treatment arms in the evaluation subsample. There is no grid-based tuning for this parameter. This parameter impacts the minimum leaf size, which will be at least \(\text{n_min_treat} \times \text{number of treatments}\). None:

\[\frac{\frac{{\text{n_min_min}} + {\text{n_min_max}}}{2}}{\text{number of treatments} \times 10}, \text{at least 1}\]

Default is None.
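The default rules for the three leaf-size parameters can be combined in a few lines. This is a minimal sketch, not the package's code: the function name is hypothetical, the rule for the largest minimum leaf size is read with an exponent of 0.5 (an assumption), Python's built-in round() is assumed for the rounding step, and n_min_treat is left unrounded because the text does not say how it is rounded.

```python
def default_leaf_sizes(n_smallest_group, n_treatments):
    """Return (cf_n_min_min, cf_n_min_max, cf_n_min_treat) defaults."""
    # Smallest minimum leaf size: exponent 0.4, factor at least 1.5
    a_min = max(n_smallest_group ** 0.4 / 10, 1.5)
    # Largest minimum leaf size: exponent 0.5 (assumed), factor at least 2
    a_max = max(n_smallest_group ** 0.5 / 10, 2)
    n_min_min = round(a_min * n_treatments)
    n_min_max = round(a_max * n_treatments)
    # Minimum number of observations per treatment in a leaf, at least 1
    n_min_treat = max((n_min_min + n_min_max) / 2 / (n_treatments * 10), 1)
    return n_min_min, n_min_max, n_min_treat


print(default_leaf_sizes(n_smallest_group=10_000, n_treatments=2))
```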
cf_match_nn_prog_score (Boolean (or None), optional) – Choice of method of nearest neighbour matching. True : Prognostic scores. False: Inverse of covariance matrix of features. Default (or None) is True.
cf_nn_main_diag_only (Boolean (or None), optional) – Nearest neighbour matching: Use main diagonal of covariance matrix only. Only relevant if cf_match_nn_prog_score is False. Default (or None) is False.
cf_m_grid (Integer (or None), optional) – Number of variables used at each new split of tree: Number of grid values. If grid is used, optimal value is determined by out-of-bag estimation of objective function. Default (or None) is 1.
cf_m_random_poisson (Boolean (or None), optional) – Number of variables used at each new split of tree: True : Number of randomly selected variables is stochastic for each split, drawn as 1 + Poisson(m-1), so that the grid value m (determined by the cf_m_share parameters) is its mean. False : No additional randomisation. Default (or None) is True.
cf_m_share_max (Float (or None), optional) – Share of variables used at each new split of tree: Maximum. Default (or None) is 0.6. If variables randomly selected for splitting do not show any variation in leaf considered for splitting, then all variables will be used for that split.
cf_m_share_min (Float (or None), optional) – Share of variables used at each new split of tree: Minimum. Default (or None) is 0.1. If variables randomly selected for splitting do not show any variation in leaf considered for splitting, then all variables will be used for that split.
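The stochastic choice of the number of split variables can be illustrated with a small simulation. The sketch below draws 1 + Poisson(m - 1) using Knuth's multiplication method from the standard library only. The function names are hypothetical, and two details are assumptions of this illustration rather than documented behaviour: m is obtained by rounding share × number of features, and the draw is capped at the number of features.

```python
import math
import random


def draw_poisson(lam, rng):
    """Poisson draw via Knuth's multiplication method (fine for small lam)."""
    threshold = math.exp(-lam)
    k, prod = 0, rng.random()
    while prod > threshold:
        k += 1
        prod *= rng.random()
    return k


def n_split_variables(n_features, m_share, rng, random_poisson=True):
    """Number of variables considered at one split."""
    m = max(1, round(m_share * n_features))  # rounding is an assumption
    if not random_poisson:
        return m
    # 1 + Poisson(m - 1) has mean m; capping at n_features is an assumption
    return min(n_features, 1 + draw_poisson(m - 1, rng))


rng = random.Random(0)
draws = [n_split_variables(20, 0.3, rng) for _ in range(2000)]
print(sum(draws) / len(draws))  # close to m = 6 on average
```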
cf_mce_vart (Integer (or None), optional) – Splitting rule for tree building: 0 : mse’s of regression only considered. 1 : mse+mce criterion (default). 2 : -var(effect): heterogeneity maximising splitting rule of Wager & Athey (2018). 3 : randomly switching between outcome-mse+mce criterion & penalty functions. Default (or None) is 1.
cf_p_diff_penalty (Integer (or None), optional) –
Penalty function (depends on the value of mce_vart).

mce_vart == 0
Irrelevant (no penalty).

mce_vart == 1
Multiplier of penalty (in terms of var(y)). 0 : No penalty. None :

\[\frac{2 \times (\text{n} \times \text{subsam_share})^{0.9}}{\text{n} \times \text{subsam_share}} \times \sqrt{\frac{\text{no_of_treatments} \times (\text{no_of_treatments} - 1)}{2}}\]

mce_vart == 2
Multiplier of penalty (in terms of MSE(y) value function without splits) for penalty. 0 : No penalty. None :

\[\frac{100 \times 4 \times (n \times \text{f_c.subsam_share})^{0.8}}{n \times \text{f_c.subsam_share}}\]
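The two default multipliers can be evaluated as follows. This is a minimal sketch of the formulas as stated above; the function and argument names are hypothetical, and None is returned for settings for which no formula is given here.

```python
from math import sqrt


def default_penalty_multiplier(mce_vart, n_obs, subsam_share, n_treatments):
    """Default penalty multiplier for the given splitting rule."""
    n_sub = n_obs * subsam_share  # size of the subsample used for a tree
    if mce_vart == 1:
        pairs = n_treatments * (n_treatments - 1) / 2
        return 2 * n_sub ** 0.9 / n_sub * sqrt(pairs)
    if mce_vart == 2:
        return 100 * 4 * n_sub ** 0.8 / n_sub
    return None  # mce_vart == 0: no penalty; other values are not covered


for rule in (1, 2):
    print(rule, default_penalty_multiplier(rule, 10_000, 0.5, 2))
```

Both multipliers shrink as the subsample grows, since the numerator grows more slowly than the denominator.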
cf_penalty_type (String (or None), optional) – Type of penalty function. ‘mse_d’: MSE of treatment prediction in daughter leaf (new in 0.7.0). ‘diff_d’: Penalty as squared leaf difference (as in Lechner, 2018). Note that an important advantage of ‘mse_d’ is that it can also be used for tuning (due to its computation, this is not possible for ‘diff_d’). Default (or None) is ‘mse_d’.
cf_random_thresholds (Integer (or None), optional) – Use only a random selection of values for splitting (continuous features only; re-randomized for each splitting decision; fewer thresholds speed up the programme but may lead to less accurate results). 0 : No random thresholds. > 0 : Number of random thresholds used for ordered variables. None : \(4 + \text{number of training observations}^{0.2}\) Default is None.
cf_subsample_factor_forest (Float (or None), optional) –
Multiplier of default size of subsampling sample (S) used to build tree.

\[S = \max\left(n^{0.5}, \min\left(0.67 n, \frac{2 \times n^{0.85}}{n}\right)\right), \text{n: # of training observations}\]

\(S \times \text{cf_subsample_factor_forest}\) is not larger than 80%. Default (or None) is 1.
cf_subsample_factor_eval (Float or Boolean (or None), optional) – Size of subsampling sample used to populate tree. False: No subsampling in evaluation subsample. True or None: \(2 \times \text{subsample size}\) used for tree building (to avoid too many empty leaves). Float (>0): Multiplier of subsample size used for tree building. In particular for larger samples, using subsampling in evaluation will speed up computations and reduce demand on memory. Tree-specific subsampling in the evaluation sample increases the speed at which the asymptotic bias disappears (at the expense of a slower disappearance of the variance; however, simulations so far show no relevant impact). Default is None.
cf_tune_all (Boolean (or None), optional) – Tune all parameters. If True, all *_grid keywords will be set to 3. User specified values are respected if larger than 3. Default (or None) is False.
cf_vi_oob_yes (Boolean (or None), optional) – Variable importance for causal forest computed by permuting single variables and comparing share of increase in objective function of mcf (computed with out-of-bag data). Default (or None) is False.
cs_type (Integer (or None), optional) – Common support adjustment: Method. 0 : No common support adjustment. 1,2 : Support check based on estimated classification forests. 1 : Min-max rules for probabilities in treatment subsamples. 2 : Enforce minimum and maximum probabilities for all observations, for all but one probability. Observations off support are removed. Out-of-bag predictions are used to avoid overfitting (which would lead to a too large reduction in the number of observations). Default (or None) is 1.
cs_adjust_limits (Float (or None), optional) – Common support adjustment: Accounting for multiple treatments. None : \((\text{number of treatments} - 2) \times 0.05\). If cs_type > 0, the upper limit is multiplied by \(1 + \text{cs_adjust_limits}\) and the lower limit by \(1 - \text{cs_adjust_limits}\). The restrictiveness of the common support criterion increases with the number of treatments. This parameter allows reducing this restrictiveness. Default is None.
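The effect of this adjustment on a pair of support limits can be sketched as follows. The function name and the example limits are hypothetical; the code only restates the rule described above (default adjustment of (number of treatments - 2) × 0.05, applied multiplicatively to both limits) and is not the package's implementation.

```python
def adjusted_support_limits(lower, upper, n_treatments, cs_adjust_limits=None):
    """Widen common support limits to account for multiple treatments."""
    if cs_adjust_limits is None:
        cs_adjust_limits = (n_treatments - 2) * 0.05
    return lower * (1 - cs_adjust_limits), upper * (1 + cs_adjust_limits)


# With two treatments the limits are unchanged; with four they are widened.
print(adjusted_support_limits(0.1, 0.9, n_treatments=2))
print(adjusted_support_limits(0.1, 0.9, n_treatments=4))
```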
cs_max_del_train (Float (or None), optional) – Common support adjustment: If share of observations in training data used that are off support is larger than cs_max_del_train (0-1), an exception is raised. In this case, user should change input data. Default (or None) is 0.5.
cs_min_p (Float (or None), optional) – Common support adjustment: If cs_type == 2, observations are deleted if \(p(d=m|x)\) is less than or equal to cs_min_p for at least one treatment. Default (or None) is 0.01.
cs_quantil (Float (or None), optional) – Common support adjustment: How to determine upper and lower bounds. If cs_type == 1: 1 or None : Min-max rule. < 1 : Respective quantile. Default (or None) is 1.
ct_grid_dr (Integer (or None), optional) – Number of grid points for discretization of continuous treatment (with 0 mass point; grid is defined in terms of quantiles of continuous part of treatment) for dose response function. Default (or None) is 100.
ct_grid_nn (Integer (or None), optional) – Number of grid points for discretization of continuous treatment (with 0 mass point; grid is defined in terms of quantiles of continuous part of treatment) for neighbourhood matching. Default (or None) is 10.
ct_grid_w (Integer (or None), optional) – Number of grid points for discretization of continuous treatment (with 0 mass point; grid is defined in terms of quantiles of continuous part of treatment) for weights. Default (or None) is 10.
dc_clean_data (Boolean (or None), optional) – Clean covariates. Remove all rows with missing observations and unnecessary variables from DataFrame. Default (or None) is True.
dc_check_perfectcorr (Boolean (or None), optional) – Screen and clean covariates: Variables that are perfectly correlated with each other will be deleted. Default (or None) is True.
dc_min_dummy_obs (Integer (or None), optional) – Screen covariates: If > 0, dummy variables with fewer than dc_min_dummy_obs observations in one category will be deleted. Default (or None) is 10.
dc_screen_covariates (Boolean (or None), optional) – Screen and clean covariates. Default (or None) is True.
fs_yes (Boolean (or None), optional) – Feature selection before building causal forest: A feature is deleted if it is irrelevant in the reduced forms for the treatment AND the outcome. Reduced forms are computed with random forest classifiers or random forest regression, depending on the type of variable. Irrelevance is measured by variable importance measures based on randomly permuting a single variable and checking its reduction in either accuracy (classification) or R2 (regression) compared to the test set prediction based on the full model. Exceptions: (i) If the correlation of two variables to be deleted is larger than 0.5, one of the two variables is kept. (ii) Variables used to compute GATEs, BGATEs, or CBGATEs, variables contained in ‘var_x_name_remain_ord’ or ‘var_x_name_remain_unord’, and variables that are needed otherwise are not removed. If the number of variables is very large (and the space of relevant features is much sparser), then using feature selection is likely to improve computational and statistical properties of the mcf estimator. Default (or None) is False.
fs_rf_threshold (Integer or Float (or None), optional) – Feature selection: Threshold in terms of relative loss of variable importance in %. Default (or None) is 1.
fs_other_sample (Boolean (or None), optional) – True : Random sample from training data used. These observations will not be used for causal forest. False : Use the same sample as used for causal forest estimation. Default (or None) is True.
fs_other_sample_share (Float (or None), optional) – Feature selection: Share of sample used for feature selection (only relevant if fs_other_sample is True). Default (or None) is 0.33.
gen_d_type (String (or None), optional) – Type of treatment. ‘discrete’: Discrete treatment. ‘continuous’: Continuous treatment. Default (or None) is ‘discrete’.
gen_iate_eff (Boolean (or None), optional) – Additionally, compute more efficient IATE (IATE are estimated twice and averaged where role of tree_building and tree_filling sample is exchanged; X-fitting). No inference is attempted for these parameters. Default (or None) is False.
gen_mp_parallel (Integer (or None), optional) – Number of parallel processes (using ray on CPU). The smaller this value is, the slower the programme, the smaller its demands on RAM. None : 80% of logical cores. Default is None.
gen_outfiletext (String (or None), optional) – File for text output. (*.txt) file extension will be added. None : ‘txtFileWithOutput’. Default is None.
gen_outpath (String or Pathlib object (or None), optional) – Path where the output is written to (text, estimated effects, etc.). If the specified directory does not exist, it will be created. None : An (…/out) directory below the current directory is used. Default is None.
gen_output_type (Integer (or None), optional) – Destination of text output. 0: Terminal. 1: File. 2: Terminal and file. Default (or None) is 2.
gen_panel_data (Boolean (or None), optional) – Panel data used. p_cluster_std is set to True. Default (or None) is False.
gen_panel_in_rf (Boolean (or None), optional) – Panel data used: Use panel structure also when building the random samples within the forest procedure. Default (or None) is True.
gen_weighted (Boolean (or None), optional) – Use of sampling weights to be provided in var_w_name. Default (or None) is False.
lc_yes (Boolean (or None), optional) – Local centering. The predicted value of the outcome from a regression with all features (but without the treatment) is subtracted from the observed outcomes (using 5-fold cross-fitting). The best method for the regression is selected among scikit-learn’s Random Forest, Support Vector Machines, and AdaBoost Regression based on their out-of-sample mean squared error. The method selection is performed on the subsample used to build the forest ((1-lc_cs_share) for training, lc_cs_share for test). Default (or None) is True.
lc_estimator (String (or None), optional) – The estimator used for local centering. Possible choices are scikit-learn’s regression methods ‘RandomForest’, ‘RandomForestNminl5’, ‘RandomForestNminls5’, ‘SupportVectorMachine’, ‘SupportVectorMachineC2’, ‘SupportVectorMachineC4’, ‘AdaBoost’, ‘AdaBoost100’, ‘AdaBoost200’, ‘GradBoost’, ‘GradBoostDepth6’, ‘GradBoostDepth12’, ‘LASSO’, ‘NeuralNet’, ‘NeuralNetLarge’, ‘NeuralNetLarger’, ‘Mean’. If set to ‘automatic’, the estimator with the lowest out-of-sample mean squared error (MSE) is selected. Whether this selection is based on cross-validation or a test sample is governed by the keyword lc_cs_cv. ‘Mean’ is included for the cases when none of the methods have explanatory power. Default (or None) is ‘RandomForest’.
lc_uncenter_po (Boolean (or None), optional) – Predicted potential outcomes that are re-adjusted for local centering are added to the data output (iate and iate_eff in results dictionary). Default (or None) is True.
lc_cs_cv (Boolean (or None), optional) – Data to be used for local centering & common support adjustment. True : Cross-validation. False : Random sample not to be used for forest building. Default (or None) is True.
lc_cs_cv_k (Integer (or None), optional) – Data to be used for local centering & common support adjustment: Number of folds in cross-validation (if lc_cs_cv is True). Default (or None) depends on the size of the training sample (N): N < 100’000: 5; 100’000 <= N < 250’000: 4; 250’000 <= N < 500’000: 3; 500’000 <= N: 2.
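The sample-size dependent default for the number of folds translates into a simple step function. The function name below is hypothetical; the thresholds are the ones listed in the description.

```python
def default_cv_folds(n_train):
    """Default number of cross-validation folds for a training sample size."""
    if n_train < 100_000:
        return 5
    if n_train < 250_000:
        return 4
    if n_train < 500_000:
        return 3
    return 2


for n_train in (50_000, 100_000, 300_000, 2_000_000):
    print(n_train, default_cv_folds(n_train))
```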
lc_cs_share (Float (or None), optional) – Data to be used for local centering & common support adjustment: Share of training data (if lc_cs_cv is False). Default (or None) is 0.25.
p_atet (Boolean (or None), optional) – Compute effects for specific treatment groups. Only possible if treatment is included in prediction data. Default (or None) is False.
p_gates_minus_previous (Boolean (or None), optional) – Estimate increase of difference of GATEs, CBGATEs, BGATEs when evaluated at next larger observed value. Default (or None) is False.
p_gates_no_evalu_points (Integer (or None), optional) – Number of evaluation points for discretized variables in (CB)(B)GATE estimation. Default (or None) is 50.
p_gates_smooth (Boolean (or None), optional) – Alternative way to estimate GATEs for continuous features. Instead of discretizing the variable, its GATE is evaluated at p_gates_smooth_no_evalu_points evaluation points. Since there are likely to be no observations exactly at these points, a local neighbourhood around the evaluation points is considered. Default (or None) is True.
p_gates_smooth_bandwidth (Float (or None), optional) – Multiplier for bandwidth used in (C)BGATE estimation with smooth variables. Default (or None) is 1.
p_gates_smooth_no_evalu_points (Integer (or None), optional) – Number of evaluation points for discretized variables in GATE estimation. Default (or None) is 50.
p_gatet (Boolean (or None), optional) – Compute effects for specific treatment groups. Only possible if treatment is included in prediction data. Default (or None) is False.
p_bgate (Boolean (or None), optional) – Estimate a GATE that is balanced in selected features (as specified in var_x_name_balance_bgate). Default (or None) is False.
p_cbgate (Boolean (or None), optional) – Estimate a GATE that is balanced in all other features. Default (or None) is False.
p_bgate_sample_share (Float (or None), optional) –
Implementation of (C)BGATE estimation is very CPU intensive. Therefore, random samples are used to speed up the programme if number of observations / number of evaluation points > 10. None : If the number of observations in the prediction data (n) < 1000: 1. If n >= 1000:

\[1000 + \frac{{(n - 1000)^{\frac{3}{4}}}}{{\text{evaluation points}}}\]

Default is None.
p_max_cats_z_vars (Integer (or None), optional) – Maximum number of categories for discretizing continuous z variables. None : \(\text{Number of observations}^{0.3}\) Default is None.
p_iate (Boolean (or None), optional) – IATEs will be estimated. Default (or None) is True.
p_iate_se (Boolean (or None), optional) – Standard errors of IATEs will be estimated. Default (or None) is False.
p_iate_m_ate (Boolean (or None), optional) – IATEs minus ATE will be estimated. Default (or None) is False.
p_qiate (Boolean (or None), optional) – QIATEs will be estimated. Default (or None) is False.
p_qiate_se (Boolean (or None), optional) – Standard errors of QIATEs will be estimated. Default (or None) is False.
p_qiate_m_mqiate (Boolean (or None), optional) – QIATEs minus median of QIATEs will be estimated. Default (or None) is False.
p_qiate_m_opp (Boolean (or None), optional) – QIATE(x, q) - QIATE(x, 1-q) will be estimated (q denotes the quantile level, q < 0.5). Default is False.
p_qiate_no_of_quantiles (Integer (or None), optional) – Number of quantiles used for QIATE. Default (or None) is 99.
p_qiate_smooth (Boolean (or None), optional) – Smooth estimated QIATEs using kernel smoothing. Default is True.
p_qiate_smooth_bandwidth (Integer or Float (or None), optional) – Multiplier applied to default bandwidth used for kernel smoothing of QIATE. Default (or None) is 1.
p_qiate_bias_adjust (Boolean (or None), optional) – Bias correction procedure for QIATEs based on simulations. Default is True. If p_qiate_bias_adjust is True, p_iate_se is set to True as well.
p_qiate_bias_adjust_draws (Integer or Float (or None), optional) – Number of random draws used in computing the bias adjustment. Default is 1000.
p_ci_level (Float (or None), optional) – Confidence level for bounds used in plots. Default (or None) is 0.95.
p_cond_var (Boolean (or None), optional) – True : Conditional mean & variances are used. False : Variance estimation uses \(wy_i = w_i \times y_i\) directly. Default (or None) is True.
p_knn (Boolean (or None), optional) – True : k-NN estimation. False: Nadaraya-Watson estimation. Nadaraya-Watson estimation gives a better approximation of the variance, but k-NN is much faster, in particular for larger datasets. Default (or None) is True.
p_knn_min_k (Integer (or None), optional) – Minimum number of neighbours in k-NN estimation. Default (or None) is 10.
p_nw_bandw (Float (or None), optional) – Bandwidth for Nadaraya-Watson estimation: Multiplier of Silverman’s optimal bandwidth. Default (or None) is 1.
p_nw_kern (Integer (or None), optional) – Kernel for Nadaraya-Watson estimation. 1 : Epanechnikov. 2 : Normal pdf. Default (or None) is 1.
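For readers unfamiliar with the two kernel choices, the following sketch shows the textbook Epanechnikov and normal kernels together with a generic Nadaraya-Watson estimator. It is not the package's implementation: the function names are hypothetical, and the bandwidth is passed in directly rather than derived from Silverman's rule.

```python
import math


def epanechnikov(u):
    """Epanechnikov kernel: 0.75 * (1 - u^2) on [-1, 1], zero elsewhere."""
    return 0.75 * (1 - u * u) if abs(u) <= 1 else 0.0


def normal_pdf(u):
    """Standard normal density used as a kernel."""
    return math.exp(-0.5 * u * u) / math.sqrt(2 * math.pi)


def nadaraya_watson(x0, x, y, bandwidth, kernel=epanechnikov):
    """Kernel-weighted mean of y at the evaluation point x0."""
    weights = [kernel((xi - x0) / bandwidth) for xi in x]
    total = sum(weights)
    if total == 0:
        return None  # no observation receives positive weight
    return sum(w * yi for w, yi in zip(weights, y)) / total


x = [-0.5, 0.5, 3.0]
y = [1.0, 3.0, 100.0]
# The Epanechnikov kernel ignores the distant point, the normal kernel does not.
print(nadaraya_watson(0.0, x, y, bandwidth=1.0))
print(nadaraya_watson(0.0, x, y, bandwidth=1.0, kernel=normal_pdf))
```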
p_max_weight_share (Float (or None), optional) – Truncation of extreme weights. Maximum share of any weight, 0 < p_max_weight_share <= 1. Enforced by trimming excess weights and renormalisation for each (BG,G,I,CBG)ATE separately. Because of renormalisation, the final weights could be somewhat above this threshold. Default (or None) is 0.05.
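A small numerical example shows why renormalisation can push weights slightly above the threshold. The sketch below trims each weight share at the maximum and rescales the result to sum to one. The function name is hypothetical, and a single trim-then-rescale pass is an assumption of this illustration, not a description of the package's code.

```python
def trim_and_renormalise(weights, max_share):
    """Trim weight shares at max_share, then rescale them to sum to one."""
    total = sum(weights)
    shares = [w / total for w in weights]
    trimmed = [min(s, max_share) for s in shares]
    trimmed_total = sum(trimmed)
    return [s / trimmed_total for s in trimmed]


new_weights = trim_and_renormalise([0.5, 0.3, 0.2], max_share=0.4)
# The largest weight was trimmed to 0.4, but rescaling lifts it above 0.4 again.
print(new_weights)
```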
p_cluster_std (Boolean (or None), optional) – Clustered standard errors. Always True if gen_panel_data is True. Default (or None) is False.
p_se_boot_ate (Integer or Boolean (or None), optional) – Bootstrap of standard errors for ATE. Specify either a Boolean (if True, number of bootstrap replications will be set to 199) or an integer corresponding to the number of bootstrap replications (this implies True). None : 199 replications if p_cluster_std is True, and False otherwise. Default is None.
p_se_boot_gate (Integer or Boolean (or None), optional) – Bootstrap of standard errors for GATE. Specify either a Boolean (if True, number of bootstrap replications will be set to 199) or an integer corresponding to the number of bootstrap replications (this implies True). None : 199 replications if p_cluster_std is True, and False otherwise. Default is None.
p_se_boot_iate (Integer or Boolean (or None), optional) – Bootstrap of standard errors for IATE. Specify either a Boolean (if True, number of bootstrap replications will be set to 199) or an integer corresponding to the number of bootstrap replications (this implies True). None : 199 replications if p_cluster_std is True, and False otherwise. Default is None.
p_se_boot_qiate (Integer or Boolean (or None), optional) – Bootstrap of standard errors for QIATE. Specify either a Boolean (if True, number of bootstrap replications will be set to 199) or an integer corresponding to the number of bootstrap replications (this implies True). None : 199 replications if p_cluster_std is True, and False otherwise. Default is None.
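The four p_se_boot_* keywords share the same rule for mapping the user input to a number of replications. A minimal sketch of that rule, with a hypothetical function name, could look as follows. Booleans are checked before integers because bool is a subclass of int in Python.

```python
def resolve_se_boot(value, cluster_std):
    """Map a p_se_boot_* setting to False or a number of replications."""
    if value is None:
        return 199 if cluster_std else False
    if value is True:
        return 199
    if value is False:
        return False
    return int(value)  # an integer implies bootstrapping with that many draws


print(resolve_se_boot(None, cluster_std=True))
print(resolve_se_boot(None, cluster_std=False))
print(resolve_se_boot(499, cluster_std=False))
```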
p_bt_yes (Boolean (or None), optional) – ATE based balancing test based on weights. Relevance of this test in its current implementation is not fully clear. Default (or None) is True.
p_choice_based_sampling (Boolean (or None), optional) – Choice based sampling to speed up programme if treatment groups have very different sizes. Default (or None) is False.
p_choice_based_probs (List of Floats (or None), optional) – Choice based sampling: Sampling probabilities to be specified. These weights are used for (G,B,CB)ATEs only. Treatment information must be available in the prediction data. Default is None.
p_ate_no_se_only (Boolean (or None), optional) – Computes only the ATE without standard errors. Default (or None) is False.
post_est_stats (Boolean (or None), optional) – Descriptive Analyses of IATEs (p_iate must be True). Default (or None) is True.
post_relative_to_first_group_only (Boolean (or None), optional) – Descriptive Analyses of IATEs: Use only effects relative to treatment with lowest treatment value. Default (or None) is True.
post_bin_corr_yes (Boolean (or None), optional) – Descriptive Analyses of IATEs: Checking the binary correlations of predictions with features. Default (or None) is True.
post_bin_corr_threshold (Float (or None), optional) – Descriptive Analyses of IATEs: Minimum threshold of absolute correlation to be displayed. Default (or None) is 0.1.
post_kmeans_yes (Boolean (or None), optional) – Descriptive Analyses of IATEs: Using k-means clustering to analyse patterns in the estimated effects. Default (or None) is True.
post_kmeans_single (Boolean (or None), optional) – If True (and post_kmeans_yes is True), clustering is also with respect to all single effects. If False (and post_kmeans_yes is True), clustering is only with respect to all relevant IATEs jointly. Default (or None) is False.
post_kmeans_no_of_groups (Integer or List or Tuple (or None), optional) – Descriptive Analyses of IATEs: Number of clusters to be built in k-means. None : List of 5 values: [a, b, c, d, e]; c = 5 to 10; depending on number of observations; c<7: a=c-2, b=c-1, d=c+1, e=c+2, else a=c-4, b=c-2, d=c+2, e=c+4. Default is None.
post_kmeans_max_tries (Integer (or None), optional) – Descriptive Analyses of IATEs: Maximum number of iterations of k-means to achieve convergence. Default (or None) is 1000.
post_kmeans_replications (Integer (or None), optional) – Descriptive Analyses of IATEs: Number of replications with random start centers to avoid local extrema. Default (or None) is 10.
post_kmeans_min_size_share (Float (or None), optional) – Descriptive Analyses of IATEs: Smallest share of observations allowed for a cluster, in % (0-33). Default (or None) is 1 (%).
post_random_forest_vi (Boolean (or None), optional) – Descriptive Analyses of IATEs: Variable importance measure of random forest used to learn factors influencing IATEs. Default (or None) is True.
post_plots (Boolean (or None), optional) – Descriptive Analyses of IATEs: Plots of estimated treatment effects. Default (or None) is True.
post_tree (Boolean (or None), optional) – Regression trees (honest and standard) of depth 2 to 5 are estimated to describe IATE(x). Default (or None) is True.
p_knn_const (Integer or Float (or None), optional) – Multiplier of the default number of observations used in the moving average of the analyse() method. Default (or None) is 1.
_int_cuda (Boolean (or None), optional) – Use CUDA based GPU if CUDA-compatible GPU is available on hardware (experimental). Default (or None) is False.
_int_descriptive_stats (Boolean (or None), optional) – Print descriptive stats if _int_with_output is True. Default (or None) is True. Internal variable, change default only if you know what you do.
_int_show_plots (Boolean (or None), optional) – Execute show() command if _int_with_output is True. Default (or None) is True. Internal variable, change default only if you know what you do.
_int_dpi (Integer (or None), optional) – Dpi in plots. Default (or None) is 500. Internal variable, change default only if you know what you do.
_int_fontsize (Integer (or None), optional) – Font for legends, from 1 (very small) to 7 (very large). Default (or None) is 2. Internal variable, change default only if you know what you do.
_int_no_filled_plot (Integer (or None), optional) – Use filled plot if more than _int_no_filled_plot different values. Default (or None) is 20. Internal variable, change default only if you know what you do.
_int_max_cats_cont_vars (Integer (or None), optional) – Discretise continuous variables: _int_max_cats_cont_vars is maximum number of categories for continuous variables. This speeds up the programme but may introduce some bias. None: No use of discretisation to speed up programme. Default is None. Internal variable, change default only if you know what you do.
_int_max_save_values (Integer (or None), optional) – Save value of features in table only if less than _int_max_save_values different values. Default (or None) is 50. Internal variable, change default only if you know what you do.
_int_max_obs_training (Integer (or None), optional) – Upper limit for sample size. If actual number is larger than this number, then the respective data will be randomly reduced to the specified upper limit. Training method: Reducing observations for training increases MSE and thus should be avoided. Default is infinity. Internal variable, change default only if you know what you do.
_int_max_obs_prediction (Integer (or None), optional) – Upper limit for sample size. If actual number is larger than this number, then the respective data will be randomly reduced to the specified upper limit. Prediction method: Reducing observations for prediction does not much affect MSE. It may reduce detectable heterogeneity, but may also dramatically reduce computation time. Default is 250’000. Internal variable, change default only if you know what you do.
_int_max_obs_kmeans (Integer (or None), optional) – Upper limit for sample size. If actual number is larger than this number, then the respective data will be randomly reduced to the specified upper limit. kmeans in analyse method: Reducing observations may reduce detectable heterogeneity, but also reduces computation time. Default is 200’000. Internal variable, change default only if you know what you do.
_int_max_obs_post_rel_graphs (Integer (or None), optional) – Upper limit for sample size. If actual number is larger than this number, then the respective data will be randomly reduced to the specified upper limit. Figures show the relation of IATEs and features (note that the built-in non-parametric regression is computationally intensive). Default is 50’000. Internal variable, change default only if you know what you do.
_int_mp_ray_del (Tuple of strings (or None), optional) – ‘refs’ : Delete references to object store. ‘rest’ : Delete all other objects of Ray task. ‘none’ : Delete no objects. These 3 options can be combined. Default (or None) is (‘refs’,). Internal variable, change default only if you know what you do.
_int_mp_ray_objstore_multiplier (Float (or None), optional) – Changes internal default values for Ray object store. Change above 1 if programme crashes because object store is full. Only relevant if _int_mp_ray_shutdown is True. Default (or None) is 1. Internal variable, change default only if you know what you do.
_int_mp_ray_shutdown (Boolean (or None), optional) – When computing the mcf repeatedly like in Monte Carlo studies, setting _int_mp_ray_shutdown to True may be a good idea. None: False if obs < 100000, True otherwise. Default is None. Internal variable, change default only if you know what you do.
_int_mp_vim_type (Integer (or None), optional) – Type of multiprocessing when computing variable importance statistics: 1 : Variable based (fast, lots of memory). 2 : Bootstrap based (slower, less memory). None: 1 if obs < 20000, 2 otherwise. Default is None. Internal variable, change default only if you know what you do.
_int_iate_chunk_size (Integer or None, optional) – Number of IATEs that are estimated in a single ray worker. Default is number of prediction observations / workers. If programme crashes in second part of IATE because of excess memory consumption, reduce _int_iate_chunk_size.
_int_mp_weights_tree_batch (Integer (or None), optional) – Number of batches to split data in weight computation for variable importance statistics. The smaller the number of batches, the faster the program and the more memory is needed. None: Automatically determined. Default is None. Internal variable, change default only if you know what you do.
_int_mp_weights_type (Integer (or None), optional) – Type of multiprocessing when computing weights: 1 : Groups-of-obs based (fast, lots of memory). 2 : Tree based (takes forever, less memory). Value of 2 will be internally changed to 1 if multiprocessing. Default (or None) is 1. Internal variable, change default only if you know what you do.
_int_obs_bigdata (Integer (or None), optional) – If number of training observations is larger than this number, the following happens during training: (i) Number of workers is halved in local centering. (ii) Ray is explicitly shut down. (iii) The number of workers used is reduced to 75% of default. (iv) The data type for some numpy arrays is reduced from float64 to float32. Default is 1’000’000.
_int_output_no_new_dir (Boolean (or None), optional) – Do not create a new directory when the path already exists. Default (or None) is False.
_int_report (Boolean (or None), optional) – Provide information for McfOptPolReports to construct informative reports. Default (or None) is True.
_int_return_iate_sp (Boolean (or None), optional) – Return all data with predictions even if _int_with_output is False (useful for cross-validation and simulations). Default (or None) is False. Internal variable, change default only if you know what you do.
_int_replication (Boolean (or None), optional) – If True, all scikit-learn based computations will NOT use multiprocessing. Default (or None) is False.
_int_seed_sample_split (Integer (or None), optional) – Seeding is redone when building forest. Default (or None) is 67567885. Internal variable, change default only if you know what you do.
_int_share_forest_sample (Float (or None), optional) – Share of sample used to build the forest. Default (or None) is 0.5. Internal variable, change default only if you know what you do.
_int_verbose (Boolean (or None), optional) – Additional output about running of mcf if _int_with_output is True. Default (or None) is True. Internal variable, change default only if you know what you do.
_int_weight_as_sparse (Boolean (or None), optional) – Save weights matrix as sparse matrix. Default (or None) is True. Internal variable, change default only if you know what you do.
_int_weight_as_sparse_splits (Integer (or None), optional) – Compute sparse weight matrix in several chunks. None: Automatically determined as: (Rows of prediction data * Rows of Fill_y data) /(Number of training splits * 25,000 * 25,000) Default is None. Internal variable, change the default only if you know what you are doing.
_int_with_output (Boolean (or None), optional) – Print output on txt file and/or console. Default (or None) is True. Internal variable, change default only if you know what you do.
_int_del_forest (Boolean (or None), optional) – Delete forests from instance. If True, less memory is needed, but the trained instance of the class cannot be reused when calling predict() with the same instance again, i.e. the forest has to be retrained when applied again. Default (or None) is False.
_int_keep_w0 (Boolean (or None), optional) – Keep all zero weights when computing standard errors (slows down computation and may lead to undesirable behaviour). Default is False.
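
The truncation described for p_max_weight_share can be illustrated with a short sketch. It is not the mcf implementation: the helper name `truncate_weights` is hypothetical, and a single trimming pass followed by one renormalisation is an assumption made for illustration.

```python
def truncate_weights(weights, max_share=0.05):
    """Cap each weight at max_share of the total, then renormalise to sum to 1.

    Hypothetical helper for illustration only; not part of the mcf API.
    """
    if not 0 < max_share <= 1:
        raise ValueError("max_share must satisfy 0 < max_share <= 1")
    total = sum(weights)
    shares = [w / total for w in weights]          # weights expressed as shares
    trimmed = [min(s, max_share) for s in shares]  # trim the excess weight
    trimmed_total = sum(trimmed)
    return [s / trimmed_total for s in trimmed]    # renormalise to sum to 1


# One dominant weight among 50 observations (made-up numbers).
weights = [0.5] + [0.5 / 49] * 49
new_weights = truncate_weights(weights, max_share=0.05)
print(max(new_weights))
```

In this made-up example the largest share after renormalisation is about 0.09, which is above the cap of 0.05. This is the effect the parameter description refers to when it says that the final weights can end up somewhat above the threshold.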

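Several parameters above derive their value from other settings when they are left at None. The sketch below spells out two of these documented rules, the one for the p_se_boot_* parameters and the one for post_kmeans_no_of_groups. The function names are hypothetical, representing "no bootstrap" as 0 is an assumption, and the way the central value c is derived from the number of observations is not documented here, so c is taken as an input.

```python
def resolve_se_boot(value, cluster_std):
    """Number of bootstrap replications implied by a p_se_boot_* setting.

    Returns 0 when no bootstrap is requested (an assumption of this sketch).
    """
    if value is None:
        return 199 if cluster_std else 0
    if isinstance(value, bool):   # test bool first: bool is a subclass of int
        return 199 if value else 0
    return int(value)             # an integer implies True


def default_kmeans_groups(c):
    """Default list [a, b, c, d, e] of cluster numbers around a central value c."""
    if c < 7:
        return [c - 2, c - 1, c, c + 1, c + 2]
    return [c - 4, c - 2, c, c + 2, c + 4]


print(resolve_se_boot(None, cluster_std=True))
print(resolve_se_boot(500, cluster_std=False))
print(default_kmeans_groups(5))
print(default_kmeans_groups(8))
```
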
version#

Version of mcf module used to create the instance.

Type: String

Methods#

`train`(data_df)	Build the modified causal forest on the training data.
`predict`(data_df)	Compute all effects given a causal forest estimated with `train()` method.
`analyse`(results)	Analyse estimated IATE with various descriptive tools.
`sensitivity`(train_df[, predict_df, results, ...])	Compute simulation based sensitivity indicators.

Optimal Policy#

class optpolicy_functions.OptimalPolicy(dc_check_perfectcorr=True, dc_clean_data=True, dc_min_dummy_obs=10, dc_screen_covariates=True, fair_type='MeanVar', fair_consistency_test=False, fair_material_disc_method='Kmeans', fair_protected_disc_method='Kmeans', fair_material_max_groups=5, fair_regression_method='RandomForest', fair_protected_max_groups=5, gen_method='best_policy_score', gen_mp_parallel='None', gen_outfiletext='txtFileWithOutput', gen_outpath=None, gen_output_type=2, gen_variable_importance=True, other_costs_of_treat=None, other_costs_of_treat_mult=None, other_max_shares=None, pt_depth_tree_1=3, pt_depth_tree_2=1, pt_enforce_restriction=False, pt_eva_cat_mult=1, pt_no_of_evalupoints=100, pt_min_leaf_size=None, pt_select_values_cat=False, rnd_shares=None, var_bb_restrict_name=None, var_d_name=None, var_effect_vs_0=None, var_effect_vs_0_se=None, var_id_name=None, var_material_name_ord=None, var_material_name_unord=None, var_polscore_desc_name=None, var_polscore_name=None, var_protected_name_ord=None, var_protected_name_unord=None, var_vi_x_name=None, var_vi_to_dummy_name=None, var_x_name_ord=None, var_x_name_unord=None, _int_dpi=500, _int_fontsize=2, _int_how_many_parallel=None, _int_output_no_new_dir=False, _int_parallel_processing=True, _int_report=True, _int_with_numba=True, _int_with_output=True, _int_xtr_parallel=True)#

Optimal policy learning

Parameters

dc_screen_covariates (Boolean (or None), optional) – Check features. Default (or None) is True.
dc_check_perfectcorr (Boolean (or None), optional) – Features that are perfectly correlated are deleted (1 of them). Only relevant if dc_screen_covariates is True. Default (or None) is True.
dc_min_dummy_obs (Integer (or None), optional) – Delete dummy variables that have less than dc_min_dummy_obs in one of their categories. Only relevant if dc_screen_covariates is True. Default (or None) is 10.
dc_clean_data (Boolean (or None), optional) – Remove all missing & unnecessary variables. Default (or None) is True.
fair_consistency_test (Boolean (or None), optional) – Test for internal consistency of the fairness correction. The fairness corrections are applied independently to every policy score (which usually is a potential outcome or an IATE(x) for each treatment relative to some base treatment, i.e. comparing 1-0, 2-0, 3-0, etc.). Thus the IATE for the 2-1 comparison can be computed as IATE(2-0)-IATE(1-0). This test compares two ways to compute a fair score for the 2-1 comparison (and all other comparisons), which should give similar results: a) Difference of two fair (!) scores. b) Difference of the corresponding scores, subsequently made fair. Note: Depending on the number of treatments, this test may be computationally more expensive than the original fairness corrections. Fairness adjustments are experimental. Default (or None) is False.
fair_material_disc_method (String (or None), optional) – Method on how to perform the discretization for materially relevant features. 'NoDiscretization' : Variables are not changed. If one of the features has more different values than fair_material_max_groups, all materially relevant features will formally be treated as continuous. The latter may become unreliable if their dimension is not very small. 'EqualCell' : Attempts to create equal cells for each variable. May be useful for a very small number of variables with few different values. 'Kmeans' : Use Kmeans clustering algorithm to form homogeneous cells. Fairness adjustments are experimental. Default (or None) is ‘Kmeans’.
fair_protected_disc_method (String (or None), optional) – Method on how to perform the discretization for protected features. 'NoDiscretization' : Variables are not changed. If one of the features has more different values than fair_protected_max_groups, all protected features will formally be treated as continuous. The latter may become unreliable if their dimension is not very small. 'EqualCell' : Attempts to create equal cells for each variable. May be useful for a very small number of variables with few different values. 'Kmeans' : Use Kmeans clustering algorithm to form homogeneous cells. Fairness adjustments are experimental. Default (or None) is 'Kmeans'.
fair_material_max_groups (Integer (or None), optional) – Level of discretization of materially relevant variables (only if needed). Number of groups of materially relevant features for cases when materially relevant variables are needed in discretized form. This is currently only necessary for 'Quantiled'. Its meaning depends on fair_material_disc_method: If 'EqualCell' : If more than 1 variable is included among the materially relevant variables, this restriction is applied to each variable. If 'Kmeans' : This is the number of clusters used by Kmeans. Fairness adjustments are experimental. Default (or None) is 5.
fair_protected_max_groups (Integer (or None), optional) – Level of discretization of protected variables (only if needed). Number of groups of protected features for cases when protected variables are needed in discretized form. This is currently only necessary for 'Quantiled'. Its meaning depends on fair_protected_disc_method: If 'EqualCell' : If more than 1 variable is included among the protected variables, this restriction is applied to each variable. If 'Kmeans' : This is the number of clusters used by Kmeans. Fairness adjustments are experimental. Default (or None) is 5.
fair_regression_method (String (or None), optional) – Method choice when predictions from machine learning are needed for fairness corrections (fair_type in ('Mean', 'MeanVar')). Available methods are 'RandomForest', 'RandomForestNminl5', 'RandomForestNminls5', 'SupportVectorMachine', 'SupportVectorMachineC2', 'SupportVectorMachineC4', 'AdaBoost', 'AdaBoost100', 'AdaBoost200', 'GradBoost', 'GradBoostDepth6', 'GradBoostDepth12', 'LASSO', 'NeuralNet', 'NeuralNetLarge', 'NeuralNetLarger', 'Mean'. If 'automatic', an optimal method will be chosen based on 5-fold cross-validation in the training data. If a method is specified it will be used for all scores and all adjustments. If ‘automatic’, every policy score might be adjusted with a different method. ‘Mean’ is included for cases in which regression methods have no explanatory power. Fairness adjustments are experimental. Default (or None) is 'RandomForest'.
fair_type (String (or None), optional) – Method to choose the type of correction for the policy scores. 'Mean' : Mean dependence of the policy score on protected variables is removed by residualisation. 'MeanVar' : Mean dependence and heteroscedasticity are removed by residualisation and rescaling. 'Quantiled' : Removing dependence via (an empirical version of) the approach by Strack and Yang (2024) using quantiles. See the paper by Bearth, Lechner, Mareckova, Muny (2024) for details on these methods. Fairness adjustments are experimental. Default (or None) is ‘Quantiled’.
gen_method (String (or None), optional) – Method to compute the assignment algorithm. Available methods: 'best_policy_score', 'bps_classifier', 'policy tree'. 'best_policy_score' conducts Black-Box allocations, which are obtained by using the scores directly (potentially subject to restrictions). When the Black-Box allocations are used for allocation of data not used for training, the respective scores must be available. 'bps_classifier' uses the allocations obtained by 'best_policy_score' and trains classifiers. The output will be a decision rule that depends on features only and does not require knowledge of the policy scores. The actual classifier used is selected among four different classifiers offered by scikit-learn, namely a simple neural network, two classification random forests with minimum leaf sizes of 2 and 5, and AdaBoost. The selection is made according to the out-of-sample performance on scikit-learn's accuracy score. The policy trees implemented by 'policy tree' are optimal trees, i.e. all possible trees are checked if they lead to a better performance. If restrictions are specified, then this is incorporated into treatment specific cost parameters. Many ideas of the implementation follow Zhou, Athey, Wager (2022). If the provided policy scores fulfil their conditions (i.e., they use a doubly robust double machine learning like score), then they also provide attractive theoretical properties. Default (or None) is 'best_policy_score'.
gen_mp_parallel (Integer (or None), optional) – Number of parallel processes (using ray on CPU). The smaller this value is, the slower the programme and the smaller its demands on RAM. None : 80% of logical cores. Default is None.
gen_outfiletext (String (or None), optional) – File for text output. (.txt) file extension will be automatically added. Default (or None) is ‘txtFileWithOutput’.
gen_outpath (String, Pathlib object (or None), optional) – Directory in which to put text output and figures. If it does not exist, it will be created. None : (*.out) directory just below the directory where the programme is run. Default is None.
gen_output_type (Integer (or None), optional) – Destination of the output. 0 : Terminal. 1 : File. 2 : File and terminal. Default (or None) is 2.
gen_variable_importance (Boolean) – Compute variable importance statistics based on random forest classifiers. Default (or None) is True.
other_costs_of_treat (List of floats (or None), optional) – Treatment specific costs. These costs are directly subtracted from the policy scores. Therefore, they should be measured in the same units as the scores. None without constraints: Costs default to 0. None with constraints: Costs will be automatically determined such as to enforce the constraints in the training data by finding cost values that lead to an allocation (‘best_policy_score’) that fulfils the restrictions in other_max_shares. Default is None.
other_costs_of_treat_mult (Float or tuple of floats (with as many elements as treatments) (or None), optional) – Multiplier of automatically determined cost values. Use only when automatic costs violate the constraints given by other_max_shares. This makes it possible to increase (>1) or decrease (<1) the share of treated in a particular treatment. None: (1, …, 1). Default (or None) is None.
other_max_shares (Tuple of floats (with as many elements as treatments) (or None), optional) – Maximum share allowed for each treatment. Default (or None) is None.
pt_depth_tree_1 (Integer (or None), optional) – Depth of 1st optimal tree. Default is 3. Note that tree depth is defined such that a depth of 1 implies 2 leaves, a depth of 2 implies 4 leaves, a depth of 3 implies 8 leaves, etc.
pt_depth_tree_2 (Integer (or None), optional) – Depth of 2nd optimal tree. This tree is built within the strata obtained from the leaves of the first tree. If set to 0, a second tree is not built. Default is 1 (together with the default for pt_depth_tree_1 this leads to a (not optimal) total tree of depth 4). Note that tree depth is defined such that a depth of 1 implies 2 leaves, a depth of 2 implies 4 leaves, a depth of 3 implies 8 leaves, etc.
pt_enforce_restriction (Boolean (or None), optional) – Enforces the imposed restriction (to some extent) during the computation of the policy tree. This increases the quality of trees concerning obeying the restrictions, but can be very time consuming. It will be automatically set to False if more than 1 policy tree is estimated. Default (or None) is False.
pt_eva_cat_mult (Integer (or None), optional) – Changes the number of the evaluation points (pt_no_of_evalupoints) for the unordered (categorical) variables to: \(\text{pt_eva_cat_mult} \times \text{pt_no_of_evalupoints}\) (available only for the method ‘policy tree’). Default (or None) is 1.
pt_no_of_evalupoints (Integer (or None), optional) – Number of evaluation points for continuous variables. The lower this value, the faster the algorithm, but it may also deviate more from the optimal splitting rule. This parameter is closely related to the approximation parameter A of Zhou, Athey, Wager (2022), with \(\text{pt_no_of_evalupoints} = \text{number of observations} / \text{A}\). Only relevant if gen_method is ‘policy tree’. Default (or None) is 100.
pt_min_leaf_size (Integer (or None), optional) –
Minimum leaf size. Leaves that are smaller than pt_min_leaf_size in the training data will not be considered. A larger number reduces computation time and avoids some overfitting. None :

\[0.1 \times \frac{\text{Number of training observations}}{{\text{Number of leaves}}}\]

(if treatment shares are restricted this is multiplied by the smallest share allowed). Only relevant if gen_method is ‘policy tree’. Default is None.
pt_select_values_cat (Boolean (or None), optional) – Approximation method for larger categorical variables. Since we search among optimal trees, for categorical variables we need to check all possible combinations of the different values that lead to binary splits. This number could indeed be huge. Therefore, we compare only \(\text{pt_no_of_evalupoints} \times \text{pt_eva_cat_mult}\) different combinations. Method 1 (pt_select_values_cat == True) does this by randomly drawing values from the particular categorical variable and forming groups only using those values. Method 2 (pt_select_values_cat == False) sorts the values of the categorical variables according to the values of the policy score, as one would do for a standard random forest. If this set is still too large, a random sample of the entailed combinations is drawn. Method 1 is only available for the method ‘policy tree’. Default (or None) is False.
rnd_shares (Tuple of floats (or None), optional) – Share of treatments of a stochastic assignment as computed by the evaluate() method. Sum of all elements must add to 1. This is only used as a comparison in the evaluation of other allocations. None: Shares of treatments in the allocation under investigation. Default is None.
var_bb_restrict_name (String (or None), optional) – Name of variable related to a restriction in case of capacity constraints. If there is a capacity constraint, preference will be given to observations with highest values of this variable. Only relevant if gen_method is ‘best_policy_score’. Default is None.
var_d_name (String (or None), optional) – Name of (discrete) treatment. Needed in training data only if ‘changers’ (different treatment in allocation than observed treatment) are analysed and if allocation is compared to observed allocation (in evaluate() method). Default is None.
var_effect_vs_0 (List/tuple of strings (or None), optional) – Name of variables of effects of treatment relative to first treatment. Dimension is equal to the number of treatments minus 1. Default is None.
var_effect_vs_0_se (List/tuple of strings (or None), optional) – Name of variables of standard errors of the effects of treatment relative to first treatment. Dimension is equal to the number of treatments minus 1. Default is None.
var_id_name (String (or None), optional) – Name of identifier in data. Default is None.
var_polscore_desc_name (List/tuple of tuples of strings (or None), optional) – Each tuple of dimension equal to the number of different treatments contains treatment specific variables that are used to evaluate the effect of the allocation with respect to those variables. These could be, for example, policy scores not used in training, but which are relevant nevertheless. Default is None.
var_polscore_name (List or tuple of strings (or None), optional) – Names of treatment specific variables to measure the value of individual treatments. This is usually the estimated potential outcome or any other related score. This is required for the solve() method. Default is None.
var_material_name_ord (List or tuple of strings (or None), optional) – Materially relevant ordered variables: An effect of the protected variables on the scores is allowed, if captured by these variables (only). These variables may (or may not) be included among the decision variables. These variables must (!) not be included among the protected variables. Default is None.
var_material_name_unord (List or tuple of strings (or None), optional) – Materially relevant unordered variables: An effect of the protected variables on the scores is allowed, if captured by these variables (only). These variables may (or may not) be included among the decision variables. These variables must (!) not be included among the protected variables. Default is None.
var_protected_name_ord (List or tuple of strings (or None), optional) – Names of protected ordered variables. Their influence on the policy scores will be removed (conditional on the ‘materially important’ variables). These variables should NOT be contained in decision variables, i.e., var_x_name_ord. If they are included, they will be removed and var_x_name_ord will be adjusted accordingly. Default is None.
var_protected_name_unord (List or tuple of strings (or None), optional) – Names of protected unordered variables. Their influence on the policy scores will be removed (conditional on the ‘materially important’ variables). These variables should NOT be contained in decision variables, i.e., var_x_name_unord. If they are included, they will be removed and var_x_name_unord will be adjusted accordingly. Default is None.
var_vi_x_name (List or tuple of strings or None, optional) – Names of variables for which variable importance is computed. Default is None.
var_vi_to_dummy_name (List or tuple of strings or None, optional) – Names of variables for which variable importance is computed. These variables will be broken up into dummies. Default is None.
var_x_name_ord (Tuple of strings (or None), optional) – Name of ordered variables (including dummy variables) used to build policy tree and classifier. They are also used to characterise the allocation. Default is None.
var_x_name_unord (Tuple of strings (or None), optional) – Name of unordered variables used to build policy tree and classifier. They are also used to characterise the allocation. Default is None.
_int_dpi (Integer (or None), optional) – dpi in plots. Default (or None) is 500. Internal variable, change default only if you know what you do.
_int_fontsize (Integer (or None), optional) – Font for legends, from 1 (very small) to 7 (very large). Default (or None) is 2. Internal variable, change default only if you know what you do.
_int_how_many_parallel (Integer (or None), optional) – Number of parallel processes. None : 80% of logical cores, if this can be effectively implemented. Default is None.
_int_output_no_new_dir (Boolean) – Do not create a new directory when the path already exists. Default (or None) is False.
_int_parallel_processing (Boolean (or None), optional) – Multiprocessing. Default (or None) is True.
_int_report (Boolean, optional) – Provide information for McfOptPolReports to construct informative reports. Default (or None) is True.
_int_with_numba (Boolean (or None), optional) – Use Numba to speed up computations. Default (or None) is True.
_int_with_output (Boolean (or None), optional) – Print output on file and/or screen. Default (or None) is True.
_int_xtr_parallel (Boolean (or None), optional) – Parallelize to a larger degree to make sure all CPUs are busy for most of the time. Default (or None) is True. Only used for ‘policy tree’ and only used if _int_parallel_processing > 1 (or None).
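
The size rules documented for the policy tree parameters (leaves implied by a tree depth, the None rule for pt_min_leaf_size, and the number of evaluation points for categorical variables) can be written out as plain arithmetic. This is a sketch with hypothetical function names. Which leaf count enters the pt_min_leaf_size rule and how the result is rounded are not stated here, so the leaf count is passed in explicitly and no rounding is applied.

```python
def n_leaves(depth):
    """Leaves implied by a tree depth: depth 1 -> 2, depth 2 -> 4, depth 3 -> 8."""
    return 2 ** depth


def default_min_leaf_size(n_train, leaves, smallest_max_share=None):
    """0.1 * (training observations / leaves), scaled by the smallest allowed
    treatment share if treatment shares are restricted."""
    size = 0.1 * n_train / leaves
    if smallest_max_share is not None:
        size *= smallest_max_share
    return size


def categorical_eval_points(pt_no_of_evalupoints, pt_eva_cat_mult):
    """Number of combinations compared for unordered (categorical) variables."""
    return pt_eva_cat_mult * pt_no_of_evalupoints


# Made-up example values.
leaves = n_leaves(3)
print(leaves)
print(default_min_leaf_size(1000, leaves))
print(default_min_leaf_size(1000, leaves, smallest_max_share=0.5))
print(categorical_eval_points(100, 2))
```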

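The description of other_costs_of_treat states that costs are subtracted directly from the policy scores. The sketch below shows what that means for a score-based allocation: each observation receives the treatment with the highest score net of costs. It is only an illustration with hypothetical function names and made-up numbers. It does not reproduce the automatic cost search or the handling of other_max_shares in OptimalPolicy.

```python
def allocate_best_score(scores, costs=None):
    """Assign each observation to the treatment with the highest net score.

    scores: list of rows, one policy score per treatment.
    costs:  one cost per treatment, in the same units as the scores.
    """
    n_treat = len(scores[0])
    if costs is None:
        costs = [0.0] * n_treat
    allocation = []
    for row in scores:
        net = [s - c for s, c in zip(row, costs)]   # costs are subtracted
        allocation.append(max(range(n_treat), key=net.__getitem__))
    return allocation


def treatment_shares(allocation, n_treat):
    """Share of observations allocated to each treatment."""
    return [allocation.count(t) / len(allocation) for t in range(n_treat)]


scores = [
    [1.0, 1.5, 1.2],
    [0.8, 0.9, 2.0],
    [1.1, 1.3, 1.0],
    [0.5, 0.4, 0.45],
]
no_costs = allocate_best_score(scores)
with_costs = allocate_best_score(scores, costs=[0.0, 0.6, 0.3])
print(no_costs, treatment_shares(no_costs, 3))
print(with_costs, treatment_shares(with_costs, 3))
```

Raising the cost of a treatment lowers its share in the allocation, which is the mechanism the automatically determined costs rely on to meet the maximum shares.
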
version#

Version of mcf module used to create the instance.

Type: String

Methods#

`fairscores`(data_df[, data_title])	Make scores independent of protected variables.
`solve`(data_df[, data_title])	Solve for optimal allocation rule.
`allocate`(data_df[, data_title])	Allocate observations to treatment state.
`evaluate`(allocation_df, data_df[, ...])	Evaluate allocation with potential outcome data.
`evaluate_multiple`(allocations_dic, data_df)	Evaluate several allocations simultaneously.
`print_time_strings_all_steps`()	Print an overview over the time needed in all steps of programme.

Reporting#

class reporting.McfOptPolReport(mcf=None, mcf_sense=None, optpol=None, outputpath=None, outputfile=None)#

New in version 0.7.0: Provides reports about the main specification choices and most important results of the ModifiedCausalForest and OptimalPolicy estimations.

Parameters

mcf (Instance of the ModifiedCausalForest class or None, optional) – Contains all information needed for reports. The default is None.
mcf_sense (Instance of the ModifiedCausalForest class or None, optional) – Contains all information from sensitivity analysis needed for reports. The default is None.
optpol (Instance of the OptimalPolicy class or None, optional) – Contains all information from the optimal policy analysis needed for reports. The default is None.
outputpath (String, Pathlib object, or None, optional) – Path to write the pdf file that is created with the report() method. If None, then an ‘/out’ subdirectory of the current working directory is used. If the latter does not exist, it is created.
outputfile (String or None, optional) – Name of the pdf file that is created by the report() method. If None, ‘Reporting’ is used as the name. A string containing the day and time (measured when the programme ends) is always appended to the name.
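The defaults for outputpath and outputfile described above can be pictured with a short sketch. It assumes a timestamp format of `YYYYMMDD_HHMMSS` and an underscore separator, neither of which is specified here; the helper `resolve_report_file` is hypothetical and does not reproduce the package's internal logic.

```python
from datetime import datetime
from pathlib import Path


def resolve_report_file(outputpath=None, outputfile=None, now=None,
                        create_dir=False):
    """Build the path of the report pdf from the two optional arguments.

    Hypothetical helper for illustration only; not part of the mcf package.
    """
    # Default directory: an 'out' subdirectory of the current working directory.
    out_dir = Path.cwd() / "out" if outputpath is None else Path(outputpath)
    if create_dir:
        # Create the directory if it does not exist yet.
        out_dir.mkdir(parents=True, exist_ok=True)
    # Default file name stem.
    stem = "Reporting" if outputfile is None else outputfile
    # Assumption: the appended day-and-time string uses this format.
    stamp = (now or datetime.now()).strftime("%Y%m%d_%H%M%S")
    return out_dir / f"{stem}_{stamp}.pdf"


print(resolve_report_file())
```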

Methods#

report()

Create a PDF report using instances of the ModifiedCausalForest and OptimalPolicy classes and save the file to a user-provided location.
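A minimal usage sketch for the class and its report() method follows. It assumes that the package is importable as `mcf`, so that the class lives at `mcf.reporting.McfOptPolReport`, and that `mcf` and `optpol` are instances on which the relevant estimation steps have already been run. The wrapper `write_report` and its `report_cls` argument are hypothetical; the latter only exists so that the sketch can be exercised without the package installed.

```python
def write_report(mcf=None, optpol=None, outputpath=None, outputfile=None,
                 report_cls=None):
    """Instantiate the report class and create the pdf report.

    Hypothetical wrapper for illustration only.
    """
    if report_cls is None:
        # Assumed import path, based on the module name shown in this section.
        from mcf.reporting import McfOptPolReport
        report_cls = McfOptPolReport
    # Keyword names follow the class signature documented above.
    report = report_cls(mcf=mcf, optpol=optpol,
                        outputpath=outputpath, outputfile=outputfile)
    # Writes the pdf to outputpath (or the default location if None).
    report.report()
    return report
```

With the package installed, `write_report(mcf=my_mcf, optpol=my_optpol)` would be a typical call, where `my_mcf` and `my_optpol` are placeholder names for the user's own instances.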

Example Data function#

example_data_functions.example_data(obs_y_d_x_iate=1000, obs_x_iate=1000, no_features=20, no_treatments=3, type_of_heterogeneity='WagerAthey', seed=12345, descr_stats=True, strength_iv=1)#

Create example data to be used with mcf estimation and optimal policy.

Parameters

obs_y_d_x_iate (Integer, optional) – Number of observations for training data. The default is 1000.
obs_x_iate (Integer, optional) – Number of observations for prediction data. The default is 1000.
no_features (Integer, optional) – Number of features of different type. The default is 20.
no_treatments (Integer, optional) – Number of treatments (all non-zero treatments have same IATEs). The default is 3.
type_of_heterogeneity (String, optional) – Different types of heterogeneity broadly (but not exactly) following the specifications used in the simulations of Lechner and Mareckova (Comprehensive Causal Machine Learning, arXiv, 2024). Possible types are ‘linear’, ‘nonlinear’, ‘quadratic’, ‘WagerAthey’. The default is ‘WagerAthey’.
seed (Integer, optional) – Seed of numpy random number generator object. The default is 12345.
descr_stats (Boolean, optional) – Show descriptive statistics. The default is True.
strength_iv (Integer or Float, optional) – The larger this number is, the stronger the instrument will be. The default is 1.

Returns

train_df (DataFrame) – Contains outcome, treatment, features, potential outcomes, IATEs, ITEs, and zero column (for convenience to be used with OptimalPolicy).
pred_df (DataFrame) – Contains features, potential outcomes, IATEs, ITEs.
name_dict (Dictionary) – Contains the names of the variable groups.
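The parameters and return values above can be put together as follows. This is a sketch, assuming the package is importable as `mcf` so that the function is available at `mcf.example_data_functions.example_data`. The wrapper `make_example_data` and the constant `ALLOWED_HETEROGENEITY` are hypothetical; the up-front check only mirrors the four types listed above and is not taken from the package.

```python
ALLOWED_HETEROGENEITY = ("linear", "nonlinear", "quadratic", "WagerAthey")


def make_example_data(n_train=1000, n_pred=1000,
                      type_of_heterogeneity="WagerAthey", seed=12345):
    """Create training and prediction data with the documented arguments.

    Hypothetical wrapper for illustration only.
    """
    # Check the heterogeneity type against the values listed in this section.
    if type_of_heterogeneity not in ALLOWED_HETEROGENEITY:
        raise ValueError(
            f"type_of_heterogeneity must be one of {ALLOWED_HETEROGENEITY}, "
            f"got {type_of_heterogeneity!r}")
    # Assumed import path, based on the module name shown in this section.
    from mcf.example_data_functions import example_data
    train_df, pred_df, name_dict = example_data(
        obs_y_d_x_iate=n_train,      # observations in the training data
        obs_x_iate=n_pred,           # observations in the prediction data
        no_features=20,
        no_treatments=3,
        type_of_heterogeneity=type_of_heterogeneity,
        seed=seed,
        descr_stats=False,           # suppress the descriptive statistics
    )
    return train_df, pred_df, name_dict
```

A fixed seed keeps the generated data reproducible across runs, which helps when comparing estimation settings.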
