1. Estimation of treatment effects

1.1. Different types of treatment effects

The Modified Causal Forest estimates three types of treatment effects, which differ in their aggregation level and are discussed in depth by Lechner (2018). These effects are the average treatment effect (\(\textrm{ATE}\)), the group average treatment effect (\(\textrm{GATE}\)), and the individualized average treatment effect (\(\textrm{IATE}\)). [1]

Let us consider a discrete, multi-valued treatment \(D\). The potential outcome of treatment state \(d\) is denoted by \(Y^d\). The covariates that are needed to correct for selection bias are denoted by \(X\). \(Z \subset X\) is a vector of features that defines the effect heterogeneity of interest. \(Z\) can contain continuous and discrete variables. Often these are variables with relatively “few values” that define population groups (e.g. age, gender, etc.). The effects of interest are then defined as:

\[
\begin{aligned}
\textrm{ATE}(m,l;\Delta) &:= \mathbb{E} \big[ Y^m-Y^l \big\vert D\in \Delta \big] \\
\textrm{GATE}(m,l;z,\Delta) &:= \mathbb{E} \big[ Y^m-Y^l \big\vert Z=z, D\in \Delta \big] \\
\textrm{IATE}(m,l;x) &:= \mathbb{E} \big[ Y^m-Y^l \big\vert X=x \big]
\end{aligned}
\]

If \(\Delta = \{m\}\) then \(\textrm{ATE}(m,l;\Delta)\) is better known as the average treatment effect on the treated (\(\textrm{ATET}\)) for the individuals that received treatment \(m\). \(\textrm{ATE's}\) measure the average impact of treatment \(m\) compared to treatment \(l\) either for the entire population, or in case of an \(\textrm{ATET}\), for the units that actually received a specific treatment.

Whereas \(\textrm{ATE's}\) are population averages, \(\textrm{IATE's}\) are average effects at the finest possible aggregation level. They measure the average impact of treatment \(m\) compared to treatment \(l\) for units with features \(X = x\). \(\textrm{GATE's}\) lie between these two extremes. They measure the average impact of treatment \(m\) compared to treatment \(l\) for units in group \(Z = z\). \(\textrm{GATE's}\) and \(\textrm{IATE's}\) are special cases of the so-called conditional average treatment effects (\(\textrm{CATE's}\)).
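To make the aggregation levels concrete, here is a minimal simulation sketch that does not use the mcf package. It assumes a hypothetical population in which the true \(\textrm{IATE}\) is known for every unit, which is never the case with real data. The names `x_cont`, `group` and `treated` are illustrative only. Because units with a larger `x_cont` are both more likely to be treated and have larger effects, the \(\textrm{ATET}\) exceeds the \(\textrm{ATE}\).

```python
import random
from statistics import mean

random.seed(0)

# Hypothetical population with a known effect function (illustration only).
population = []
for _ in range(10_000):
    x_cont = random.uniform(0.0, 1.0)   # continuous feature in X
    group = random.randint(0, 1)        # discrete feature in X, also used as Z
    # True IATE(1, 0; x): depends on all features in X
    iate = 1.0 + 0.5 * group + x_cont
    # Selection into treatment: units with larger x_cont are treated more often
    treated = random.random() < x_cont
    population.append(
        {"x_cont": x_cont, "group": group, "treated": treated, "iate": iate}
    )

# ATE: average of the IATE's over the whole population
ate = mean(u["iate"] for u in population)
# ATET: average over the units that actually received the treatment
atet = mean(u["iate"] for u in population if u["treated"])
# GATE: average within each group Z = z
gate = {z: mean(u["iate"] for u in population if u["group"] == z) for z in (0, 1)}

print(f"ATE:  {ate:.2f}")
print(f"ATET: {atet:.2f}")
print(f"GATE: {gate[0]:.2f} (group 0), {gate[1]:.2f} (group 1)")
```

In this sketch the \(\textrm{GATE's}\) differ by roughly 0.5, the group coefficient in the assumed effect function, while the \(\textrm{ATE}\) averages over both groups.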

The following sections will show you how to estimate these different types of treatment effects with the mcf package.


[1] A recent paper by Bearth & Lechner (2024) introduced the Balanced Group Average Treatment Effect (\(\textrm{BGATE}\)). The mcf documentation explains how to estimate \(\textrm{BGATE's}\) with the Modified Causal Forest.

1.2. Estimating ATE’s / IATE’s

The \(\textrm{ATE's}\) as well as the \(\textrm{IATE's}\) are estimated by default through the predict() method of the class ModifiedCausalForest. See Getting started for a quick example on how to access these estimates.

Another way to access the estimated \(\textrm{ATE's}\) is through the output folder that the mcf package generates once a Modified Causal Forest is initialized. You can find the location of this folder by accessing the “outpath” entry of the gen_dict attribute of your Modified Causal Forest:

from mcf.example_data_functions import example_data
from mcf.mcf_functions import ModifiedCausalForest

# Generate example data using the built-in function `example_data()`
training_df, prediction_df, name_dict = example_data()

my_mcf = ModifiedCausalForest(
    var_y_name="outcome",
    var_d_name="treat",
    var_x_name_ord=["x_cont0", "x_cont1"]
)
my_mcf.gen_dict["outpath"]

You can also specify this path through the gen_outpath parameter of the class ModifiedCausalForest(). The output folder will contain csv-files with the estimated \(\textrm{ATE's}\) in the subfolder ate_iate.

You can control whether \(\textrm{IATE's}\) and their standard errors are estimated by setting the parameters p_iate and p_iate_se of the class ModifiedCausalForest to True or False:

| Parameter | Description |
|---|---|
| p_iate | If True, IATE’s will be estimated. Default: True. |
| p_iate_se | If True, standard errors of IATE’s will be estimated. Default: False. |

1.2.1. Example

from mcf.example_data_functions import example_data
from mcf.mcf_functions import ModifiedCausalForest

# Generate example data using the built-in function `example_data()`
training_df, prediction_df, name_dict = example_data()

my_mcf = ModifiedCausalForest(
    var_y_name="outcome",
    var_d_name="treat",
    var_x_name_ord=["x_cont0", "x_cont1"],
    # Estimate IATE's but not their standard errors
    p_iate = True,
    p_iate_se = False
)

1.3. Estimating ATET’s

The average treatment effects for the treated are estimated by the predict() method if the parameter p_atet of the class ModifiedCausalForest is set to True:

from mcf.example_data_functions import example_data
from mcf.mcf_functions import ModifiedCausalForest

# Generate example data using the built-in function `example_data()`
training_df, prediction_df, name_dict = example_data()

my_mcf = ModifiedCausalForest(
    var_y_name="outcome",
    var_d_name="treat",
    var_x_name_ord=["x_cont0", "x_cont1"],
    # Estimating ATET's
    p_atet = True
)

my_mcf.train(training_df)
results, _ = my_mcf.predict(prediction_df)

Like the \(\textrm{ATE's}\), the \(\textrm{ATET's}\) are stored in the “ate” entry of the dictionary returned by the predict() method. With p_atet set to True, this entry contains both the estimated \(\textrm{ATE's}\) and the \(\textrm{ATET's}\). The output printed to the console during prediction includes a table with all estimated \(\textrm{ATE's}\) and \(\textrm{ATET's}\), which gives a good idea of the structure of the “ate” entry in the result dictionary.

results["ate"]

The standard errors of the estimates are stored in the “ate_se” entry of the same dictionary. The structure of the “ate_se” entry is analogous to the “ate” entry.

results["ate_se"]
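A point estimate and its standard error are often combined into a confidence interval. The sketch below assumes the estimator is approximately normally distributed and uses made-up numbers rather than actual mcf output. The helper `normal_ci` is hypothetical and not part of the mcf package. Since the exact layout of the “ate” and “ate_se” entries is not described here, the function takes plain floats.

```python
from statistics import NormalDist


def normal_ci(estimate, std_error, level=0.95):
    """Return a symmetric confidence interval under a normal approximation."""
    # Two-sided critical value, about 1.96 for a 95% interval
    z = NormalDist().inv_cdf(0.5 + level / 2)
    return estimate - z * std_error, estimate + z * std_error


# Made-up effect and standard error, for illustration only
lower, upper = normal_ci(estimate=0.8, std_error=0.2)
print(f"95% CI: [{lower:.3f}, {upper:.3f}]")
```

An interval that excludes zero suggests, under these assumptions, that the effect differs from zero at the chosen level.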

Another way to access the estimated \(\textrm{ATET's}\) is through the output folder that the mcf package generates once a Modified Causal Forest is initialized. You can find the location of this folder by accessing the “outpath” entry of the gen_dict attribute of your Modified Causal Forest:

my_mcf.gen_dict["outpath"]

You can also specify this path through the gen_outpath parameter of the class ModifiedCausalForest(). The output folder will contain csv-files with the estimated \(\textrm{ATET's}\) in the subfolder ate_iate.

1.4. Estimating GATE’s

Group average treatment effects are estimated by the predict() method if you define heterogeneity variables through the parameters var_z_name_list, var_z_name_ord or var_z_name_unord in your ModifiedCausalForest. A separate \(\textrm{GATE}\) is estimated for every feature in the vector of heterogeneity variables \(Z\). Please refer to the table further below or the API for more details on how to specify your heterogeneity variables with these parameters.

from mcf.example_data_functions import example_data
from mcf.mcf_functions import ModifiedCausalForest

# Generate example data using the built-in function `example_data()`
training_df, prediction_df, name_dict = example_data()

my_mcf = ModifiedCausalForest(
    var_y_name="outcome",
    var_d_name="treat",
    # Ordered (here: continuous) features
    var_x_name_ord=["x_cont0", "x_cont1"],
    # Specify the unordered heterogeneity variable 'x_unord0' for GATE estimation
    var_z_name_unord=["x_unord0"]
)
my_mcf.train(training_df)
results, _ = my_mcf.predict(training_df)

You can access the estimated \(\textrm{GATE's}\) and their standard errors through their corresponding entries in the dictionary that is returned by the predict() method:

results["gate_names_values"] # Describes the structure of the 'gate' entry
results["gate"] # Estimated GATE's
results["gate_se"] # Standard errors of the estimated GATE's

A simpler way to inspect the estimated \(\textrm{GATE's}\) is through the output folder that the mcf package generates once a Modified Causal Forest is initialized. You can find the location of this folder by accessing the “outpath” entry of the gen_dict attribute of your Modified Causal Forest:

my_mcf.gen_dict["outpath"]

You can also specify this path through the gen_outpath parameter of the class ModifiedCausalForest(). The output folder will contain both csv-files with the results as well as plots of the estimated \(\textrm{GATE's}\) in the subfolder gate.

To estimate the \(\textrm{GATE's}\) for subpopulations defined by treatment status (\(\textrm{GATET's}\)), you can set the parameter p_gatet of the class ModifiedCausalForest to True. These estimates can be accessed in the same manner as regular \(\textrm{GATE's}\).

my_mcf = ModifiedCausalForest(
    var_y_name="outcome",
    var_d_name="treat",
    var_x_name_ord=["x_cont0", "x_cont1"],
    var_z_name_unord=["x_unord0"],
    # Estimate the GATE's for var_z_name_unord by treatment status
    p_gatet = True
)

For a continuous heterogeneity variable, the Modified Causal Forest will by default smooth the distribution of the variable. The smoothing procedure evaluates the effects at a local neighborhood around a pre-defined number of evaluation points. The number of evaluation points can be specified through the parameter p_gates_smooth_no_evalu_points of the class ModifiedCausalForest. The local neighborhood is based on an Epanechnikov kernel estimation using Silverman’s bandwidth rule. The multiplier for Silverman’s bandwidth rule can be chosen through the parameter p_gates_smooth_bandwidth.

my_mcf = ModifiedCausalForest(
    var_y_name="outcome",
    var_d_name="treat",
    var_x_name_ord=["x_cont0", "x_cont1"],
    # Specify the continuous heterogeneity variable for GATE estimation
    var_z_name_list=["x_ord0"],
    # Smoothing the distribution of the continuous variable for GATE estimation
    p_gates_smooth = True,
    # The number of evaluation points is set to 40
    p_gates_smooth_no_evalu_points = 40
)
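The smoothing idea can be sketched in plain Python, independent of the mcf package. The sketch assumes a normal-reference rule-of-thumb bandwidth (1.06 times the standard deviation times \(N^{-1/5}\)) and evaluation points placed at equally spaced quantiles. The constants and placement that mcf actually uses may differ, and all function names below are hypothetical. The data are simulated so that the true effect at \(Z = z\) is \(2z\).

```python
import random
from statistics import stdev


def epanechnikov(u):
    """Epanechnikov kernel: 0.75 * (1 - u^2) on (-1, 1), zero elsewhere."""
    return 0.75 * (1.0 - u * u) if abs(u) < 1.0 else 0.0


def rule_of_thumb_bandwidth(values, multiplier=1.0):
    # Assumption: normal-reference rule of thumb; mcf may use other constants
    return multiplier * 1.06 * stdev(values) * len(values) ** (-0.2)


def smoothed_effects(z_values, effects, n_eval_points, multiplier=1.0):
    """Kernel-weighted average of the effects around each evaluation point."""
    h = rule_of_thumb_bandwidth(z_values, multiplier)
    z_sorted = sorted(z_values)
    n = len(z_sorted)
    # Assumption: evaluation points at equally spaced quantiles of Z
    eval_points = [
        z_sorted[int((i + 0.5) * n / n_eval_points)] for i in range(n_eval_points)
    ]
    smoothed = []
    for z0 in eval_points:
        weights = [epanechnikov((z - z0) / h) for z in z_values]
        total = sum(weights)  # positive, because z0 is itself a data point
        smoothed.append(sum(w * e for w, e in zip(weights, effects)) / total)
    return eval_points, smoothed


random.seed(1)
z = [random.uniform(0.0, 1.0) for _ in range(2000)]
# Simulated noisy unit-level effects with true mean 2 * z
effects = [2.0 * zi + random.gauss(0.0, 0.5) for zi in z]

points, smoothed = smoothed_effects(z, effects, n_eval_points=5)
for z0, est in zip(points, smoothed):
    print(f"z = {z0:.2f}: smoothed effect = {est:.2f}")
```

A larger multiplier widens the neighborhood, which gives smoother but potentially more biased estimates. A smaller multiplier does the opposite.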

Instead of smoothing continuous heterogeneity variables, you can also discretize them and estimate GATE’s for the resulting categories. This can be done by setting the parameter p_gates_smooth of the class ModifiedCausalForest to False. The maximum number of categories for discretizing continuous variables can be specified through the parameter p_max_cats_z_vars.

my_mcf = ModifiedCausalForest(
    var_y_name="outcome",
    var_d_name="treat",
    var_x_name_ord=["x_cont0", "x_cont1"],
    # Specify the continuous heterogeneity variable for GATE estimation
    var_z_name_list=["x_ord0"],
    # Discretizing the continuous variable for GATE estimation
    p_gates_smooth = False,
    # The maximum number of categories for discretizing is set to 5
    p_max_cats_z_vars = 5
)
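Discretizing a continuous variable into a limited number of categories can be sketched as follows. The sketch assumes quantile-based bins of roughly equal size. Whether mcf forms its categories this way is not stated here, and the function `discretize` is hypothetical.

```python
import random
from bisect import bisect_right
from collections import Counter


def discretize(values, max_categories):
    """Assign each value to one of at most max_categories quantile-based bins."""
    ordered = sorted(values)
    n = len(ordered)
    # Interior cut points at equally spaced quantiles; the set removes duplicates,
    # so fewer categories can result if many values are tied
    cut_points = sorted(
        {ordered[(k * n) // max_categories] for k in range(1, max_categories)}
    )
    # Category index = number of cut points that are <= the value
    return [bisect_right(cut_points, v) for v in values]


random.seed(2)
z = [random.gauss(0.0, 1.0) for _ in range(1000)]
categories = discretize(z, max_categories=5)

counts = Counter(categories)
print(sorted(counts.items()))
```

A \(\textrm{GATE}\) would then be an average effect within each resulting category, in the same way as for a discrete heterogeneity variable.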

Below you find a list of the discussed parameters that are relevant for the estimation of \(\textrm{GATE's}\). Please consult the API for more details or additional parameters on \(\textrm{GATE}\) estimation.

Commonly used parameters to estimate \(\textrm{GATE's}\)

| Parameter | Description |
|---|---|
| var_z_name_list | Ordered feature(s) with many values used for \(\textrm{GATE}\) estimation. |
| var_z_name_ord | Ordered feature(s) with few values used for \(\textrm{GATE}\) estimation. |
| var_z_name_unord | Unordered feature(s) used for \(\textrm{GATE}\) estimation. |
| p_gatet | If True, \(\textrm{GATE's}\) are also computed by treatment status (\(\textrm{GATET's}\)). Default: False. |
| p_gates_smooth | If True, a smoothing procedure is applied to estimate \(\textrm{GATE's}\) for continuous variables in \(Z\). Default: True. |
| p_gates_smooth_no_evalu_points | If p_gates_smooth is True, this defines the number of evaluation points. Default: 50. |
| p_gates_smooth_bandwidth | If p_gates_smooth is True, this defines the multiplier for Silverman’s bandwidth rule. Default: 1. |
| p_max_cats_z_vars | If p_gates_smooth is False, this defines the maximum number of categories when discretizing continuous heterogeneity variables in \(Z\). Default: \(N^{0.3}\). |

1.5. Stabilizing estimates by truncating weights

The Modified Causal Forest uses weighted averages to estimate treatment effects. If the weights of some observations are very large, they can lead to unstable estimates. To obtain more stable estimates, the mcf package provides the option to truncate forest weights to an upper threshold through the parameter p_max_weight_share of the class ModifiedCausalForest. By default, p_max_weight_share is set to 0.05. After truncation, the program renormalizes the weights for estimation. Because of the renormalization step, the final weights can be slightly above the threshold defined in p_max_weight_share.

1.5.1. Example

my_mcf = ModifiedCausalForest(
    var_y_name="outcome",
    var_d_name="treat",
    var_x_name_ord=["x_cont0", "x_cont1"],
    # Truncate weights to an upper threshold of 0.01
    p_max_weight_share = 0.01
)
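The following sketch illustrates why renormalized weights can end up above the threshold. It assumes that truncation caps each weight at the threshold and that renormalization divides by the sum of the capped weights. The function `truncate_and_renormalize` is hypothetical and only mirrors the description above, not the internal implementation of mcf. The example is deliberately tiny and extreme so that the effect is easy to see.

```python
def truncate_and_renormalize(weights, max_share):
    """Cap each weight at max_share, then rescale so the weights sum to one."""
    capped = [min(w, max_share) for w in weights]
    total = sum(capped)
    return [w / total for w in capped]


# Made-up weights that sum to one; two observations dominate
weights = [0.40, 0.20] + [0.05] * 8
final = truncate_and_renormalize(weights, max_share=0.10)

print(f"largest weight before: {max(weights):.3f}")
print(f"largest weight after:  {max(final):.3f}")
print(f"sum of final weights:  {sum(final):.3f}")
```

Capping removes weight mass, so rescaling to a sum of one pushes every weight up again, including those sitting at the cap. With many observations and only a few truncated weights, the overshoot is much smaller than in this example.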