6. Post-estimation diagnostics
The class ModifiedCausalForest provides several diagnostic tools to analyse the estimated \(\text{IATEs}\). They cover:
- descriptive statistics
- a correlation analysis
- \(k\)-means clustering
- a feature importance analysis
To conduct any post-estimation diagnostics, the parameter post_est_stats of the class ModifiedCausalForest needs to be set to True. Once you have estimated your \(\text{IATEs}\) using the predict() method, you can conduct the post-estimation diagnostics with the analyse() method:
from mcf.example_data_functions import example_data
from mcf.mcf_functions import ModifiedCausalForest
from mcf import McfOptPolReport

# Generate example data using the built-in function `example_data()`
training_df, prediction_df, name_dict = example_data()

my_mcf = ModifiedCausalForest(
    var_y_name="outcome",
    var_d_name="treat",
    var_x_name_ord=["x_cont0", "x_cont1", "x_ord1"],
    var_x_name_unord=["x_unord0"],
    # Enable post-estimation diagnostics
    post_est_stats=True
)

my_mcf.train(training_df)
results, _ = my_mcf.predict(prediction_df)
post_estimation_diagnostics, _ = my_mcf.analyse(results)
The easiest way to inspect the results of the post-estimation diagnostics is to read the PDF report that can be generated using the class McfOptPolReport:
mcf_report = McfOptPolReport(mcf=my_mcf, outputfile='Modified-Causal-Forest_Report')
mcf_report.report()
You can additionally specify the reference group for the \(\text{IATEs}\) with the parameter post_relative_to_first_group_only. If post_relative_to_first_group_only is True (the default), the comparison group is the first treatment state. If False, all possible treatment combinations are compared with each other. The confidence level used in the post-estimation diagnostics is specified through the parameter p_ci_level.
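To make the two settings concrete, the following sketch enumerates which treatment pairs would be compared for three treatment states. It is not part of the mcf package: the helper `comparison_pairs` and the treatment labels 0, 1, 2 are hypothetical and only illustrate the logic described above.

```python
from itertools import combinations


def comparison_pairs(treatments, relative_to_first_group_only=True):
    """Return (treatment, reference) pairs for which effects are compared.

    Hypothetical helper for illustration only; not an mcf function.
    """
    ordered = sorted(treatments)
    if relative_to_first_group_only:
        # Every treatment is compared with the first treatment state only.
        reference = ordered[0]
        return [(t, reference) for t in ordered[1:]]
    # Otherwise all possible treatment combinations are compared.
    return [(high, low) for low, high in combinations(ordered, 2)]


first_only = comparison_pairs([0, 1, 2], relative_to_first_group_only=True)
all_pairs = comparison_pairs([0, 1, 2], relative_to_first_group_only=False)
print(first_only)
print(all_pairs)
```

With three treatment states, the first setting yields two comparisons and the second yields three, so the number of diagnostics grows quickly with the number of treatments when the parameter is set to False.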
6.1. Descriptive statistics
With post_est_stats set to True, the distribution of the estimated \(\text{IATEs}\) is presented. The plots are also saved in the output folder created by the mcf package. You can find the location of this folder by accessing the “outpath” entry of the gen_dict attribute of your Modified Causal Forest:
my_mcf.gen_dict["outpath"]
You can also specify this path through the gen_outpath parameter of the class ModifiedCausalForest. The output folder will contain the jpeg/pdf files of the plots as well as csv files of the underlying data in the subfolder ate_iate.
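If you want to check which files were written, a small helper along the following lines may be useful. This is a minimal sketch, not an mcf function: the name `list_output_files`, the exact file suffixes, and the file names in the demo are assumptions. Only the subfolder name ate_iate is taken from the description above.

```python
import tempfile
from pathlib import Path


def list_output_files(outpath, subfolder="ate_iate",
                      suffixes=(".csv", ".jpeg", ".pdf")):
    """List plot and data files in a subfolder of the output folder.

    Hypothetical helper for illustration only; not an mcf function.
    """
    folder = Path(outpath) / subfolder
    if not folder.is_dir():
        return []
    return sorted(p.name for p in folder.iterdir()
                  if p.suffix.lower() in suffixes)


# Demo on a temporary folder that mimics the layout described above.
# The file names are placeholders, not actual mcf output names.
with tempfile.TemporaryDirectory() as tmp:
    demo_dir = Path(tmp) / "ate_iate"
    demo_dir.mkdir()
    for name in ("iate_hist.pdf", "iate_data.csv", "notes.txt"):
        (demo_dir / name).touch()
    found = list_output_files(tmp)

print(found)
```

In practice you would pass the path of your own estimation instead of the temporary folder, for example list_output_files(my_mcf.gen_dict["outpath"]).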
6.2. Correlation analysis
The correlation analysis estimates the dependencies between the different \(\text{IATEs}\), between the \(\text{IATEs}\) and the potential outcomes, and between the \(\text{IATEs}\) and the features. You can activate the correlation analysis by setting the parameter post_bin_corr_yes to True. Note that a correlation coefficient is only displayed if its absolute value exceeds the threshold specified by the parameter post_bin_corr_threshold.
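The effect of the threshold can be sketched with a few lines of plain Python. The example below is an illustration under stated assumptions and not the mcf implementation: it uses the Pearson correlation coefficient (the text does not say which coefficient mcf computes), the helper names are hypothetical, and the data are made up.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


def correlations_to_display(target, features, threshold=0.1):
    """Keep only correlations whose absolute value exceeds the threshold."""
    all_corr = {name: pearson(target, values)
                for name, values in features.items()}
    return {name: r for name, r in all_corr.items() if abs(r) > threshold}


# Toy data (made up): IATEs and two features.
iate = [0.2, 0.4, 0.6, 0.8, 1.0, 1.2]
features = {
    "x_cont0": [1.0, 2.0, 3.0, 5.0, 4.0, 6.0],    # strongly related to the IATEs
    "x_cont1": [1.0, -1.0, 0.0, 0.0, -1.0, 1.0],  # unrelated to the IATEs
}

shown = correlations_to_display(iate, features, threshold=0.1)
print(shown)
```

Only x_cont0 survives the filter here, because the correlation of x_cont1 with the toy IATEs is essentially zero and therefore falls below the threshold.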
6.3. \(k\)-means clustering
To analyse heterogeneity in different groups (clusters), you can conduct \(k\)-means clustering by setting the parameter post_kmeans_yes to True. The mcf package uses the k-means++ algorithm from scikit-learn to build clusters based on the \(\text{IATEs}\).
from mcf.example_data_functions import example_data
from mcf.mcf_functions import ModifiedCausalForest
from mcf import McfOptPolReport

# Generate example data using the built-in function `example_data()`
training_df, prediction_df, name_dict = example_data()

my_mcf = ModifiedCausalForest(
    var_y_name="outcome",
    var_d_name="treat",
    var_x_name_ord=["x_cont0", "x_cont1", "x_ord1"],
    var_x_name_unord=["x_unord0"],
    post_est_stats=True,
    # Perform k-means clustering
    post_kmeans_yes=True
)

my_mcf.train(training_df)
results, _ = my_mcf.predict(prediction_df)
post_estimation_diagnostics, _ = my_mcf.analyse(results)
The report obtained through the class McfOptPolReport will contain descriptive statistics of the \(\text{IATEs}\), the potential outcomes and the features for each cluster.
mcf_report = McfOptPolReport(mcf=my_mcf, outputfile='Modified-Causal-Forest_Report')
mcf_report.report()
If you wish to analyse the clusters yourself, you can access the cluster membership of each observation through the “iate_data_df” entry of the dictionary returned by the analyse() method. The cluster membership is stored in the column IATE_Cluster of the DataFrame.
post_estimation_diagnostics["iate_data_df"]
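Once you have this DataFrame, standard pandas operations are enough for a first look at the clusters. The sketch below uses a toy stand-in for the DataFrame: only the column name IATE_Cluster comes from the description above, while the other column names, the integer cluster labels, and all values are made up for illustration.

```python
import pandas as pd

# Toy stand-in for post_estimation_diagnostics["iate_data_df"].
# Only "IATE_Cluster" is taken from the text; everything else is hypothetical.
iate_data_df = pd.DataFrame({
    "IATE_Cluster": [0, 0, 1, 1, 1, 2],
    "iate": [0.1, 0.3, 1.0, 1.2, 1.1, 2.5],
    "x_cont0": [0.5, 0.7, 1.5, 1.4, 1.6, 3.0],
})

# Number of observations per cluster.
cluster_sizes = iate_data_df["IATE_Cluster"].value_counts().sort_index()

# Mean of every other column within each cluster.
cluster_means = iate_data_df.groupby("IATE_Cluster").mean()

print(cluster_sizes)
print(cluster_means)
```

The same two calls should work on the real DataFrame as long as you adapt the column names to those that your estimation actually produced.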
You can define a range for the number of clusters through the parameter post_kmeans_no_of_groups. The final number of clusters is chosen via silhouette analysis. To guard against getting stuck at local extrema, the number of replications with different random start centers can be defined through the parameter post_kmeans_replications. The parameter post_kmeans_max_tries sets the maximum number of iterations in each replication to achieve convergence.
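The selection procedure described above can be sketched directly with scikit-learn. This is a minimal illustration, not the mcf implementation: the data are simulated, and the mapping of the mcf parameters to the scikit-learn arguments n_init and max_iter (noted in the comments) is an assumption made for this example.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

rng = np.random.default_rng(0)
# Toy IATEs (made up): three well-separated groups of effects.
iates = np.concatenate([
    rng.normal(0.0, 0.2, 50),
    rng.normal(5.0, 0.2, 50),
    rng.normal(10.0, 0.2, 50),
]).reshape(-1, 1)

candidate_k = [2, 3, 4, 5]  # plays the role of post_kmeans_no_of_groups
scores = {}
for k in candidate_k:
    model = KMeans(
        n_clusters=k,
        init="k-means++",
        n_init=10,      # assumed analogue of post_kmeans_replications
        max_iter=1000,  # assumed analogue of post_kmeans_max_tries
        random_state=0,
    )
    labels = model.fit_predict(iates)
    # Average silhouette width: higher means better separated clusters.
    scores[k] = silhouette_score(iates, labels)

# Choose the number of clusters with the highest silhouette score.
best_k = max(scores, key=scores.get)
print(best_k, scores)
```

Because the simulated effects fall into three clearly separated groups, the silhouette score peaks at three clusters. With real IATEs the maximum is usually less pronounced, which is why a range of candidate values is worth checking.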
6.4. Feature importance
If you are interested in learning which of your features have strong predictive power for the estimated \(\text{IATEs}\), you can activate the feature importance procedure by setting the parameter post_random_forest_vi to True. This procedure builds a predictive random forest to determine which features influence the \(\text{IATEs}\) most. The feature importance statistics are presented in percentage points of the coefficient of determination, \(R^2\), that are lost when the respective feature is randomly permuted. The \(R^2\) statistics are obtained through the RandomForestRegressor provided by scikit-learn.
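The idea of measuring importance as the loss in \(R^2\) under permutation can be sketched as follows. This is an illustration under stated assumptions, not the mcf implementation: the data are simulated, the feature names are hypothetical, and evaluating \(R^2\) on a held-out part of the sample is a choice made for this example.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import r2_score

rng = np.random.default_rng(0)
n = 500
# Toy data (made up): the IATE depends on x0 only; x1 is pure noise.
X = rng.normal(size=(n, 2))
iate = 2.0 * X[:, 0] + rng.normal(scale=0.1, size=n)

# Fit the predictive forest on one part, evaluate on the held-out part.
train, test = slice(0, 350), slice(350, n)
forest = RandomForestRegressor(n_estimators=100, random_state=0)
forest.fit(X[train], iate[train])
baseline_r2 = r2_score(iate[test], forest.predict(X[test]))

importance = {}
for j, name in enumerate(["x0", "x1"]):
    X_perm = X[test].copy()
    # Permuting a column breaks its link to the IATEs.
    X_perm[:, j] = rng.permutation(X_perm[:, j])
    permuted_r2 = r2_score(iate[test], forest.predict(X_perm))
    # Loss in R^2, expressed in percentage points.
    importance[name] = 100.0 * (baseline_r2 - permuted_r2)

print(round(baseline_r2, 3), importance)
```

Permuting x0 destroys almost all predictive power, so its loss in \(R^2\) is large, whereas permuting the noise feature x1 changes \(R^2\) very little.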
6.5. Parameter overview
Below is an overview of the above-mentioned parameters related to post-estimation diagnostics in the class ModifiedCausalForest:
Parameter | Description |
---|---|
post_est_stats | If True, post-estimation diagnostics are conducted. Default: True. |
post_relative_to_first_group_only | If True, post-estimation diagnostics will only be conducted for \(\text{IATEs}\) relative to the first treatment state. If False, the diagnostics cover the \(\text{IATEs}\) of all possible treatment combinations. Default: True. |
p_ci_level | Confidence level for plots, including the post-estimation diagnostic plots. Default: 0.9. |
post_bin_corr_yes | If True, the binary correlation analysis is conducted. Default: True. |
post_bin_corr_threshold | If |
post_kmeans_yes | If True, \(k\)-means clustering is conducted to build clusters based on the \(\text{IATEs}\). Default: True. |
post_kmeans_no_of_groups | Only relevant if |
post_kmeans_max_tries | Only relevant if |
post_kmeans_replications | Only relevant if |
post_kmeans_min_size_share | Smallest share of cluster size allowed in %. Default (None) is 1. |
post_random_forest_vi | If True, the feature importance analysis is conducted. Default: True. |
post_plots | If True, post-estimation diagnostic plots are printed during runtime. Default: True. |
post_tree | Regression trees (honest and standard) of depth 2 to 5 are estimated to describe IATEs(x). Default (or None) is True. |
Please consult the API for more details.
6.6. Example
from mcf.example_data_functions import example_data
from mcf.mcf_functions import ModifiedCausalForest
from mcf.reporting import McfOptPolReport

# Generate example data using the built-in function `example_data()`
training_df, prediction_df, name_dict = example_data()

my_mcf = ModifiedCausalForest(
    var_y_name="outcome",
    var_d_name="treat",
    var_x_name_ord=["x_cont0", "x_cont1", "x_ord1"],
    var_x_name_unord=["x_unord0"],
    p_ci_level=0.95,
    # Parameters for post-estimation diagnostics
    post_est_stats=True,
    post_relative_to_first_group_only=True,
    post_bin_corr_yes=True,
    post_bin_corr_threshold=0.1,
    post_kmeans_yes=True,
    post_kmeans_no_of_groups=[3, 4, 5, 6, 7],
    post_kmeans_max_tries=1000,
    post_kmeans_replications=10,
    post_random_forest_vi=True,
    post_plots=True,
    post_kmeans_min_size_share=1,
    post_tree=True
)

my_mcf.train(training_df)
results, _ = my_mcf.predict(prediction_df)

# Compute the post-estimation diagnostics
post_estimation_diagnostics, _ = my_mcf.analyse(results)

# Access cluster memberships (column 'IATE_Cluster')
post_estimation_diagnostics["iate_data_df"]

# Produce a PDF-report with the results, including post-estimation diagnostics
mcf_report = McfOptPolReport(mcf=my_mcf, outputfile='Modified-Causal-Forest_Report')
mcf_report.report()