3. Local centering

3.1. Method

Local centering is a form of residualization that can improve the performance of forest estimators. The improvement is achieved by regressing out the influence of the features on the outcome.

Formally, the conditionally centered outcome \(\tilde{Y}_i\) can be defined as:

\[\tilde{Y}_i = Y_i - \hat{y}_{-i}(X_i)\]

where:

  • \(\tilde{Y}_i\) is the conditionally centered outcome.

  • \(Y_i\) is the outcome for observation \(i\).

  • \(\hat{y}_{-i}(X_i)\) is an estimate of the conditional outcome expectation \(E[Y_i | X_i = x_i]\), given the observed values \(x_i\) of the feature vector \(X_i\), computed without using observation \(i\).

3.2. Implementation

The local centering procedure applies the RandomForestRegressor method of the sklearn.ensemble module to compute the predicted outcomes \(\hat{y}_{-i}(X_i)\) for each observation \(i\) non-parametrically. To turn the procedure off, overrule the default of lc_yes by setting it to False. The predicted outcomes are computed on distinct subsets of the data by cross-validation, where the number of folds can be specified in lc_cs_cv_k. Finally, the centered outcomes are obtained by subtracting the predicted from the observed outcomes.
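The snippet below is a minimal sketch of this cross-fitting step, built directly with scikit-learn rather than taken from the mcf internals; the synthetic data and forest settings are illustrative assumptions. The out-of-fold predictions returned by cross_val_predict play the role of \(\hat{y}_{-i}(X_i)\):

from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_predict

# Illustrative synthetic data (not part of the mcf package)
X, y = make_regression(n_samples=500, n_features=5, noise=1.0, random_state=0)

# Five folds mirror the default lc_cs_cv_k = 5: each observation is
# predicted by a forest trained on the other folds only
y_hat = cross_val_predict(RandomForestRegressor(random_state=0), X, y, cv=5)

# Centered outcome: observed minus cross-fitted prediction
y_tilde = y - y_hat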

Alternatively, the local centering procedure can be run on two separate data sets by setting lc_cs_cv to False. In this case, the size of the first data set is defined by lc_cs_share; it is used to train a Random Forest, again by applying the RandomForestRegressor method. The predicted and centered outcomes, \(\hat{y}_{-i}(X_i)\) and \(\tilde{Y}_i\) respectively, are then computed in the second data set. Finally, this second data set is divided into mutually exclusive data sets for (optional) feature selection, tree building, and effect estimation.
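Under the same illustrative assumptions as the previous sketch, the two-sample variant might look as follows; the 25% split mirrors the default lc_cs_share = 0.25 and is, again, not the mcf implementation itself:

from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

# Illustrative synthetic data (not part of the mcf package)
X, y = make_regression(n_samples=500, n_features=5, noise=1.0, random_state=0)

# The first (smaller) sample only trains the forest; the second sample
# receives the predicted and centered outcomes and is kept for the
# subsequent steps
X_lc, X_main, y_lc, y_main = train_test_split(
    X, y, train_size=0.25, random_state=0
)

forest = RandomForestRegressor(random_state=0).fit(X_lc, y_lc)
y_tilde_main = y_main - forest.predict(X_main)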

Below, you find a table with a brief description of the relevant keyword arguments for local centering:

| Argument | Description |
| --- | --- |
| lc_yes | Activates local centering. Default is True. |
| lc_cs_cv | Data for local centering & common support adjustment. True: cross-validation. False: random sample not used for forest building. Default is True. |
| lc_cs_share | Data for local centering & common support adjustment. Share of training data (if lc_cs_cv is False). Default is 0.25. |
| lc_cs_cv_k | Number of folds in cross-validation (if lc_cs_cv is True). Default is 5. |

3.2.1. Example

from mcf.example_data_functions import example_data
from mcf.mcf_functions import ModifiedCausalForest

# Generate example data using the built-in function `example_data()`
training_df, prediction_df, name_dict = example_data()


my_mcf = ModifiedCausalForest(
    var_y_name="outcome",
    var_d_name="treat",
    var_x_name_ord=["x_cont0", "x_cont1", "x_ord1"],
    # Activate local centering (the default)
    lc_yes=True,
    # Use cross-validation for local centering & common support adjustment
    lc_cs_cv=True,
    # Number of folds in the cross-validation
    lc_cs_cv_k=5
)

my_mcf.train(training_df)
results, _ = my_mcf.predict(prediction_df)