3. Local centering
3.1. Method
Local centering is a form of residualization that can improve the performance of forest estimators by regressing out the impact of the features on the outcome. Let us define the conditionally centered outcome \(\tilde{Y}_i\) as:

\[\tilde{Y}_i = Y_i - \hat{y}_{-i}(X_i)\]

where:

- \(Y_i\) is the outcome for observation \(i\).
- \(\hat{y}_{-i}(X_i)\) is an estimate of the conditional outcome expectation \(E[Y_i | X_i = x]\), evaluated at the realised value \(x = X_i\) of the feature vector and computed without using observation \(i\).
3.2. Implementation
Centered outcomes are obtained by subtracting the predicted from the observed outcomes.
The local centering procedure in the mcf uses the `RandomForestRegressor` class from the `sklearn.ensemble` module to compute the predicted outcomes \(\hat{y}_{-i}(X_i)\) for each observation \(i\) non-parametrically. The predicted outcomes are computed in distinct subsets by cross-validation, with the number of folds specified by `lc_cs_cv_k`.
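Conceptually, the cross-validated centering step can be sketched as follows. This is a minimal illustration using scikit-learn's `cross_val_predict` on simulated data, not necessarily how the mcf implements it internally:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_predict

# Simulated data: outcome depends on the features plus noise.
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 3))
y = X[:, 0] + 0.5 * X[:, 1] ** 2 + rng.normal(size=500)

# Out-of-fold predictions: each y_hat[i] comes from a forest trained
# on folds that do not contain observation i (the "-i" in the notation).
forest = RandomForestRegressor(n_estimators=200, random_state=0)
y_hat = cross_val_predict(forest, X, y, cv=5)

# Centered outcome: subtract the cross-fitted conditional mean.
y_centered = y - y_hat
```

Because the forest absorbs the part of the outcome explained by the features, the centered outcome has lower variance than the raw outcome.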
By default, `lc_yes` is set to `True`, which runs the described local centering procedure. To disable it, set `lc_yes` to `False`.
As an alternative, two separate data sets can be generated for the local centering procedure by setting `lc_cs_cv` to `False`. In this case, the first data set is used for training a Random Forest, again using the `RandomForestRegressor` class. The size of this first data set can be defined with `lc_cs_share`. The second data set is used to compute the predicted and centered outcomes \(\hat{y}_{-i}(X_i)\) and \(\tilde{Y}_i\). Furthermore, this second data set is divided into mutually exclusive data sets for feature selection (optionally), tree building, and effect estimation.
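The sample-splitting variant can be illustrated with a manual split. This is a hedged sketch on simulated data: the 25% training share mirrors the default of `lc_cs_share` described below, and the further subdivision of the second data set is omitted:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(1)
X = rng.normal(size=(1000, 3))
y = X[:, 0] + rng.normal(size=1000)

# First data set (25% of the sample): train the Random Forest.
X_lc, X_rest, y_lc, y_rest = train_test_split(
    X, y, train_size=0.25, random_state=1
)
forest = RandomForestRegressor(n_estimators=200, random_state=1)
forest.fit(X_lc, y_lc)

# Second data set: compute predicted and centered outcomes. In the mcf,
# this part is then split further for feature selection, tree building,
# and effect estimation.
y_hat = forest.predict(X_rest)
y_centered = y_rest - y_hat
```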
The table below provides a brief description of the relevant keyword arguments for local centering:

| Argument | Description |
| --- | --- |
| `lc_yes` | Activates local centering. Default is True. |
| `lc_cs_cv` | Data for local centering & common support adjustment. True: cross-validation. False: random sample not used for forest building. Default is True. |
| `lc_cs_share` | Data for local centering & common support adjustment. Share of training data (if `lc_cs_cv` is False). Default is 0.25. |
| `lc_cs_cv_k` | Number of folds in cross-validation (if `lc_cs_cv` is True). This is dependent on the size of the training sample and ranges from 2 to 5. |
3.2.1. Example
```python
from mcf.example_data_functions import example_data
from mcf.mcf_functions import ModifiedCausalForest

# Generate example data using the built-in function `example_data()`
training_df, prediction_df, name_dict = example_data()

my_mcf = ModifiedCausalForest(
    var_y_name="outcome",
    var_d_name="treat",
    var_x_name_ord=["x_cont0", "x_cont1", "x_ord1"],
    # Activate local centering
    lc_yes=True,
    # Data for local centering & common support adjustment by cross-validation
    lc_cs_cv=True,
    # Number of folds in cross-validation
    lc_cs_cv_k=5,
)

my_mcf.train(training_df)
results, _ = my_mcf.predict(prediction_df)
```