5. Feature selection

The estimation quality of a random forest deteriorates with the number of irrelevant features, because the probability of picking a split based on an irrelevant feature increases. For this reason, it makes sense to remove such features prior to estimation. A bonus of feature selection is that the computational speed increases as a result of a smaller feature space.

The class ModifiedCausalForest lets you switch on feature selection by setting the parameter fs_yes to True. Loosely speaking, the program then estimates reduced forms for the treatment and the outcome using random forests and drops features that have little power to predict the treatment and the outcome.
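The decision rule described above can be sketched in a few lines. This is only an illustration of one reading of the text, namely that a feature becomes a candidate for removal when it is unimportant in both reduced forms. The helper `flag_irrelevant` and the importance scores are hypothetical and are not part of the mcf API.

```python
def flag_irrelevant(importance_treatment, importance_outcome, threshold=1.0):
    """Return features whose importance is below `threshold` in BOTH reduced forms.

    Hypothetical helper, not part of the mcf API. The importance scores are
    assumed to be percentage increases in loss, keyed by feature name.
    """
    return [
        name
        for name in importance_treatment
        if importance_treatment[name] < threshold
        and importance_outcome[name] < threshold
    ]


# Made-up importance scores, for illustration only.
importance_d = {"x_cont0": 4.2, "x_cont1": 0.3, "x_ord1": 0.1}
importance_y = {"x_cont0": 0.2, "x_cont1": 0.4, "x_ord1": 6.5}

candidates = flag_irrelevant(importance_d, importance_y)
print(candidates)  # ['x_cont1']
```

In this toy input, x_cont0 matters for the treatment and x_ord1 matters for the outcome, so only x_cont1 is flagged.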

Note that an irrelevant feature is never dropped if

  • the variable is required for the estimation of GATEs, BGATEs or CBGATEs

  • the variable is specified in the parameters var_x_name_remain_ord or var_x_name_remain_unord of your ModifiedCausalForest

  • the correlation between two variables to be deleted is bigger than 0.5. In this case, one of the two variables is kept.
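The exceptions listed above can be sketched as a filter applied to the features flagged as irrelevant. This is a minimal illustration, not the package's implementation: the function names are hypothetical, the comparison uses the absolute Pearson correlation, and the first variable of a correlated pair is the one that is kept. All three choices are assumptions.

```python
import math


def pearson(a, b):
    """Pearson correlation of two equally long lists of numbers."""
    mean_a = sum(a) / len(a)
    mean_b = sum(b) / len(b)
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    if var_a == 0 or var_b == 0:
        return 0.0  # treat a constant column as uncorrelated
    return cov / math.sqrt(var_a * var_b)


def features_to_drop(candidates, protected, data, max_corr=0.5):
    """Apply the exceptions to the list of features flagged as irrelevant.

    candidates: feature names flagged as irrelevant
    protected:  names that must stay (heterogeneity variables, *_remain_* lists)
    data:       dict mapping feature name -> list of values
    """
    drop = [name for name in candidates if name not in protected]
    rescued = set()
    for i, first in enumerate(drop):
        if first in rescued:
            continue
        for second in drop[i + 1:]:
            if second in rescued:
                continue
            # Assumption: absolute correlation, and the first of the pair is kept.
            if abs(pearson(data[first], data[second])) > max_corr:
                rescued.add(first)
                break
    return [name for name in drop if name not in rescued]


# Toy data: "a" and "b" are perfectly correlated, "c" is unrelated to both.
data = {
    "a": [1.0, 2.0, 3.0, 4.0],
    "b": [2.0, 4.0, 6.0, 8.0],
    "c": [1.0, -1.0, -1.0, 1.0],
    "z": [0.0, 1.0, 0.0, 1.0],
}
dropped = features_to_drop(["a", "b", "c", "z"], protected={"z"}, data=data)
print(dropped)  # ['b', 'c']
```

Here "z" survives because it is protected, and "a" survives because it is highly correlated with "b", which is also due for deletion.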

5.1. Parameter overview

The following table summarizes the parameters related to feature selection in the class ModifiedCausalForest:

| Parameter | Description |
|---|---|
| fs_yes | If True, feature selection is performed. Default: False. |
| fs_other_sample | If True, a random sample from the training data is used to perform feature selection. This sample will subsequently not be used to train the Modified Causal Forest. If False, the same data is used for feature selection and to estimate the Modified Causal Forest. Default: True. Only relevant if fs_yes is set to True. |
| fs_other_sample_share | If fs_other_sample is set to True, this determines the sample share used for feature selection. Default: 0.33. Only relevant if fs_yes is set to True. |
| fs_rf_threshold | Defines the threshold for a feature to be considered “irrelevant”. This is measured as the percentage increase of the loss function when the feature is randomly permuted. Default: 1. Only relevant if fs_yes is set to True. |

Please consult the API for more details.
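To make the meaning of fs_rf_threshold more concrete, the following sketch computes the percentage increase in loss after permuting one feature. It is an illustration only and uses several assumptions: a fixed function `predict` stands in for a fitted random forest, the loss is the mean squared error, and a feature counts as irrelevant when the increase is strictly below the threshold. None of the names below come from the mcf API.

```python
import random


def mse(y_true, y_pred):
    """Mean squared error."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def pct_loss_increase(predict, rows, y, feature, seed=0):
    """Percentage increase in MSE after randomly permuting one feature."""
    rng = random.Random(seed)
    base_loss = mse(y, [predict(row) for row in rows])
    shuffled = [row[feature] for row in rows]
    rng.shuffle(shuffled)
    # Copy each row, replacing only the permuted feature.
    permuted_rows = [dict(row, **{feature: v}) for row, v in zip(rows, shuffled)]
    permuted_loss = mse(y, [predict(row) for row in permuted_rows])
    return 100.0 * (permuted_loss - base_loss) / base_loss


# Toy data: the outcome depends on "x_rel" only; "x_noise" plays no role.
rows = [{"x_rel": float(i), "x_noise": float((7 * i) % 5)} for i in range(20)]
y = [2.0 * row["x_rel"] + (0.5 if i % 2 else -0.5) for i, row in enumerate(rows)]


def predict(row):
    """Stand-in for a fitted forest (assumption: the model ignores x_noise)."""
    return 2.0 * row["x_rel"]


threshold = 1.0  # same scale as the documented default of fs_rf_threshold
irrelevant = [
    name
    for name in ("x_rel", "x_noise")
    if pct_loss_increase(predict, rows, y, name) < threshold
]
print(irrelevant)  # ['x_noise']
```

Permuting x_noise leaves the predictions unchanged, so its loss increase is 0 percent and it falls below the threshold, whereas permuting x_rel raises the loss substantially.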

5.2. Example

from mcf.example_data_functions import example_data
from mcf.mcf_functions import ModifiedCausalForest

# Generate example data using the built-in function `example_data()`
training_df, prediction_df, name_dict = example_data()

my_mcf = ModifiedCausalForest(
    var_y_name="outcome",
    var_d_name="treat",
    var_x_name_ord=["x_cont0", "x_cont1", "x_ord1"],
    # Parameters for feature selection:
    fs_yes=True,
    fs_other_sample=True,
    fs_other_sample_share=0.1,
    fs_rf_threshold=0.5
)
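With fs_other_sample=True and fs_other_sample_share=0.1 as in the example above, about 10% of the training observations are set aside for feature selection and are not used to train the forest. The sketch below illustrates such a split on row indices. The helper name, the rounding rule, and the seed are assumptions for illustration; the package may split the data differently.

```python
import random


def split_for_feature_selection(n_obs, share=0.33, seed=42):
    """Split row indices into a feature-selection part and a training part.

    Hypothetical helper, not part of the mcf API.
    """
    rng = random.Random(seed)
    indices = list(range(n_obs))
    rng.shuffle(indices)
    n_fs = round(n_obs * share)  # assumption: share is rounded to whole rows
    fs_idx = sorted(indices[:n_fs])
    train_idx = sorted(indices[n_fs:])
    return fs_idx, train_idx


# Hypothetical sample size; the size of the example data is not assumed here.
fs_idx, train_idx = split_for_feature_selection(n_obs=1000, share=0.1)
print(len(fs_idx), len(train_idx))  # 100 900
```

The two index sets are disjoint, which mirrors the statement that the feature-selection sample is not reused for training.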