5. Feature selection#
The estimation quality of a random forest deteriorates with the number of irrelevant features, because the probability of picking a split based on an irrelevant feature increases. For this reason, it makes sense to remove such features prior to estimation. A bonus of feature selection is that the computational speed increases as a result of a smaller feature space.
The class ModifiedCausalForest provides you with the option to perform feature selection through the parameter fs_yes. If set to True, feature selection is performed. Loosely speaking, the program estimates reduced forms for the treatment and the outcome using random forests and then drops features that have little power to predict the treatment and the outcome.
Note that an irrelevant feature is never dropped if

- the variable is required for the estimation of \(\textrm{GATEs}\), \(\textrm{BGATEs}\), or \(\textrm{CBGATEs}\),
- the variable is specified in the parameters var_x_name_remain_ord or var_x_name_remain_unord of your ModifiedCausalForest (see the sketch below), or
- the correlation between two variables to be deleted is larger than 0.5; in this case, one of the two variables is kept.
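If you want specific features to survive feature selection in any case, list them in var_x_name_remain_ord (or var_x_name_remain_unord). The following minimal sketch reuses the variable names from the example in Section 5.2; the choice of "x_cont0" as the protected feature is purely illustrative.

from mcf.mcf_functions import ModifiedCausalForest

# Feature selection is switched on, but "x_cont0" is never dropped
# because it is listed in var_x_name_remain_ord (illustrative choice).
my_mcf_keep = ModifiedCausalForest(
    var_y_name="outcome",
    var_d_name="treat",
    var_x_name_ord=["x_cont0", "x_cont1", "x_ord1"],
    var_x_name_remain_ord=["x_cont0"],
    fs_yes=True,
)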
5.1. Parameter overview#
The following table summarizes the parameters related to feature selection in the class ModifiedCausalForest:
| Parameter | Description |
|---|---|
| fs_yes | If True, feature selection is performed. Default: False. |
| fs_other_sample | If True, a random sample from the training data is used to perform feature selection. This sample is subsequently not used to train the Modified Causal Forest. If False, the same data is used for feature selection and to estimate the Modified Causal Forest. Default: True. Only relevant if fs_yes is True. |
| fs_other_sample_share | If fs_other_sample is True, defines the share of the training data that is set aside for feature selection. Only relevant if fs_yes and fs_other_sample are True. |
| fs_rf_threshold | Defines the threshold for a feature to be considered “irrelevant”, measured as the percentage increase of the loss function when the feature is randomly permuted. Default: 1. Only relevant if fs_yes is True. |
Please consult the API for more details.
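To build intuition for the permutation-based criterion behind fs_rf_threshold, the sketch below illustrates the general idea on synthetic data with scikit-learn. This is only a conceptual illustration, not the internal implementation of the mcf package: reduced-form random forests are fit for the treatment and the outcome, and a feature is kept only if permuting it noticeably worsens at least one of the two predictions. The 0.01 cut-off is a hypothetical stand-in for the loss-increase threshold.

import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor
from sklearn.inspection import permutation_importance

rng = np.random.default_rng(0)
n = 1000
X = pd.DataFrame({"x_relevant": rng.normal(size=n),
                  "x_irrelevant": rng.normal(size=n)})
d = (X["x_relevant"] + rng.normal(size=n) > 0).astype(int)  # treatment depends only on x_relevant
y = X["x_relevant"] + d + rng.normal(size=n)                # outcome depends on x_relevant and d

# Reduced forms: random forests predicting the treatment and the outcome,
# with permutation importance of each feature in both models
imp_d = permutation_importance(RandomForestClassifier(random_state=0).fit(X, d),
                               X, d, n_repeats=5, random_state=0).importances_mean
imp_y = permutation_importance(RandomForestRegressor(random_state=0).fit(X, y),
                               X, y, n_repeats=5, random_state=0).importances_mean

# Keep a feature if it matters for the treatment or the outcome prediction;
# the 0.01 cut-off is a hypothetical analogue of fs_rf_threshold.
keep = [col for col, i_d, i_y in zip(X.columns, imp_d, imp_y)
        if max(i_d, i_y) > 0.01]
print(keep)  # "x_irrelevant" will typically be dropped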
5.2. Example#
from mcf.example_data_functions import example_data
from mcf.mcf_functions import ModifiedCausalForest
# Generate example data using the built-in function `example_data()`
training_df, prediction_df, name_dict = example_data()
my_mcf = ModifiedCausalForest(
    var_y_name="outcome",
    var_d_name="treat",
    var_x_name_ord=["x_cont0", "x_cont1", "x_ord1"],
    # Parameters for feature selection:
    fs_yes=True,
    fs_other_sample=True,
    fs_other_sample_share=0.1,
    fs_rf_threshold=0.5,
)
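After the object is created, the usual mcf workflow applies: call train on the training data and predict on the prediction data; with fs_yes=True, the feature selection step is performed as part of training. The call pattern below is only a sketch, and the exact return values of train and predict depend on your mcf version, so please consult the API.

# Train the forest (feature selection runs here when fs_yes=True),
# then predict effects for the prediction data.
my_mcf.train(training_df)
results = my_mcf.predict(prediction_df)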