6. Post-estimation diagnostics

The class ModifiedCausalForest provides you with several diagnostic tools to analyse the estimated \(\text{IATE's}\). They cover

descriptive statistics
a correlation analysis
\(k\)-means clustering
a feature importance analysis

To conduct any post-estimation diagnostics, the parameter post_est_stats of the class ModifiedCausalForest needs to be set to True. Once you have estimated your \(\text{IATE's}\) using the predict() method, you can conduct the post-estimation diagnostics with the analyse() method:

from mcf.example_data_functions import example_data
from mcf.mcf_functions import ModifiedCausalForest
from mcf import McfOptPolReport

# Generate example data using the built-in function `example_data()`
training_df, prediction_df, name_dict = example_data()


my_mcf = ModifiedCausalForest(
        var_y_name="outcome",
        var_d_name="treat",
        var_x_name_ord=["x_cont0", "x_cont1", "x_ord1"],
        var_x_name_unord=["x_unord0"],
        # Enable post-estimation diagnostics
        post_est_stats=True
    )

my_mcf.train(training_df)
results, _ = my_mcf.predict(prediction_df)

post_estimation_diagnostics, _ = my_mcf.analyse(results)

The easiest way to inspect the results of the post-estimation diagnostics is to read the PDF report, which can be generated with the class McfOptPolReport:

mcf_report = McfOptPolReport(mcf=my_mcf, outputfile='Modified-Causal-Forest_Report')
mcf_report.report()

You can additionally specify the reference group for the \(\text{IATE's}\) with the parameter post_relative_to_first_group_only. If post_relative_to_first_group_only is True, the comparison group will be the first treatment state. This is the default. If False, all possible treatment combinations are compared with each other. The confidence level in the post-estimation diagnostics is specified through the parameter p_ci_level.
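To make the two settings of post_relative_to_first_group_only concrete, the following sketch enumerates the treatment comparisons implied by each. It is an illustration only, not the package's internal code: the helper `iate_comparisons` is hypothetical, and it assumes that the "first treatment state" is the smallest treatment value.

```python
from itertools import combinations


def iate_comparisons(treatments, relative_to_first_group_only=True):
    """Return (treatment, reference) pairs that the IATEs would compare.

    Hypothetical helper for illustration; assumes the first treatment
    state is the smallest treatment value.
    """
    ordered = sorted(treatments)
    if relative_to_first_group_only:
        reference = ordered[0]
        return [(d, reference) for d in ordered[1:]]
    # combinations yields (low, high); flip so each pair reads (treatment, reference)
    return [(high, low) for low, high in combinations(ordered, 2)]


print(iate_comparisons([0, 1, 2]))         # [(1, 0), (2, 0)]
print(iate_comparisons([0, 1, 2], False))  # [(1, 0), (2, 0), (2, 1)]
```

With three treatment states, the default therefore yields two comparisons, while setting the parameter to False adds the comparison between the two non-reference treatments.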

6.1. Descriptive statistics

With post_est_stats set to True, the distribution of the estimated \(\text{IATE's}\) is presented. The plots are also saved in the output folder created by the mcf package. You can find the location of this folder in the “outpath” entry of the gen_dict attribute of your Modified Causal Forest:

my_mcf.gen_dict["outpath"]

You can also specify this path through the gen_outpath parameter of the class ModifiedCausalForest(). The output folder will contain the jpeg/pdf-files of the plots as well as csv-files of the underlying data in the subfolder ate_iate.
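If you want to check which files were written, a small helper along the following lines may be convenient. This is a sketch using only the standard library; the function `list_iate_outputs` is hypothetical and not part of the mcf package, and it simply assumes that the files sit somewhere below the ate_iate subfolder of the output path.

```python
from pathlib import Path


def list_iate_outputs(outpath, subfolder="ate_iate"):
    """Group the file names found below outpath/subfolder by file extension."""
    folder = Path(outpath) / subfolder
    files_by_suffix = {}
    for path in sorted(folder.rglob("*")):
        if path.is_file():
            files_by_suffix.setdefault(path.suffix.lower(), []).append(path.name)
    return files_by_suffix


# Usage with an estimated forest (not run here):
# list_iate_outputs(my_mcf.gen_dict["outpath"])
```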

6.2. Correlation analysis

The correlation analysis estimates the dependencies between the different \(\text{IATE's}\), between the \(\text{IATE's}\) and the potential outcomes, and between the \(\text{IATE's}\) and the features. You can activate the correlation analysis by setting the parameter post_bin_corr_yes to True. Note that the correlation coefficients are only displayed if their absolute values exceed the threshold specified by the parameter post_bin_corr_threshold.
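The effect of the threshold can be illustrated with a few lines of plain Python. The sketch below is not the package's implementation: the data are made up, the helpers `pearson` and `displayed_correlations` are hypothetical, and the threshold is assumed to be inclusive.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


def displayed_correlations(iate, features, threshold=0.1):
    """Keep only correlations whose absolute value reaches the threshold."""
    corr = {name: pearson(iate, values) for name, values in features.items()}
    return {name: r for name, r in corr.items() if abs(r) >= threshold}


# Made-up IATEs and features (toy data, not mcf output)
iate = [0.1, 0.4, 0.5, 0.9, 1.2, 1.4]
features = {
    "x_cont0": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0],     # moves with the IATE
    "x_cont1": [1.0, -1.0, -1.0, 1.0, 1.0, -1.0],  # nearly unrelated
}

# Only x_cont0 passes the default threshold of 0.1
print(displayed_correlations(iate, features, threshold=0.1))
```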

6.3. \(k\)-means clustering

To analyse heterogeneity across groups (clusters), you can conduct \(k\)-means clustering by setting the parameter post_kmeans_yes to True. The mcf package uses the k-means++ algorithm from scikit-learn to build clusters based on the \(\text{IATE's}\).

from mcf.example_data_functions import example_data
from mcf.mcf_functions import ModifiedCausalForest
from mcf import McfOptPolReport

# Generate example data using the built-in function `example_data()`
training_df, prediction_df, name_dict = example_data()

my_mcf = ModifiedCausalForest(
        var_y_name="outcome",
        var_d_name="treat",
        var_x_name_ord=["x_cont0", "x_cont1", "x_ord1"],
        var_x_name_unord=["x_unord0"],
        post_est_stats=True,
        # Perform k-means clustering
        post_kmeans_yes=True
    )

my_mcf.train(training_df)
results, _ = my_mcf.predict(prediction_df)

post_estimation_diagnostics, _ = my_mcf.analyse(results)

The report obtained through the class McfOptPolReport will contain descriptive statistics of the \(\text{IATE's}\), the potential outcomes and the features for each cluster.

mcf_report = McfOptPolReport(mcf=my_mcf, outputfile='Modified-Causal-Forest_Report')
mcf_report.report()

If you wish to analyse the clusters yourself, you can access the cluster membership of each observation through the “iate_data_df” entry of the dictionary returned by the analyse() method. The cluster membership is stored in the column IATE_Cluster of the DataFrame.

post_estimation_diagnostics["iate_data_df"]
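Once you have this DataFrame, standard pandas operations are enough for a first look at the clusters. The sketch below uses a small stand-in DataFrame so that it runs on its own; only the column IATE_Cluster is taken from the text above, while the other column names and all values are hypothetical.

```python
import pandas as pd

# Toy stand-in for post_estimation_diagnostics["iate_data_df"].
# Only 'IATE_Cluster' is named in the documentation; the other
# columns and all values are hypothetical.
iate_data_df = pd.DataFrame({
    "IATE_Cluster": [0, 0, 1, 1, 2, 2],
    "iate_1vs0": [0.1, 0.3, 1.0, 1.2, 2.1, 2.3],
    "x_cont0": [0.5, 0.7, 1.5, 1.1, 2.0, 2.4],
})

# Number of observations per cluster
cluster_sizes = iate_data_df["IATE_Cluster"].value_counts().sort_index()

# Mean of every remaining column within each cluster
cluster_means = iate_data_df.groupby("IATE_Cluster").mean()

print(cluster_sizes)
print(cluster_means)
```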

You can define a range for the number of clusters through the parameter post_kmeans_no_of_groups. The final number of clusters is chosen via silhouette analysis. To guard against getting stuck at local extrema, the number of replications with different random start centers can be defined through the parameter post_kmeans_replications. The parameter post_kmeans_max_tries sets the maximum number of iterations in each replication to achieve convergence.
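The selection logic described above can be sketched directly with scikit-learn. This is a minimal illustration under stated assumptions, not the package's implementation: the data are simulated, the mapping of the mcf parameters to the arguments `n_init` and `max_iter` is an assumed analogy, and the minimum cluster size restriction is ignored.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

rng = np.random.default_rng(0)

# Simulated IATEs scattered around three well-separated levels (made-up data)
iate = np.concatenate([rng.normal(loc, 0.3, size=50) for loc in (-5.0, 0.0, 5.0)])
X = iate.reshape(-1, 1)

candidate_ks = [2, 3, 4, 5]  # plays the role of post_kmeans_no_of_groups
scores = {}
for k in candidate_ks:
    kmeans = KMeans(
        n_clusters=k,
        init="k-means++",
        n_init=10,       # assumed analogue of post_kmeans_replications
        max_iter=1000,   # assumed analogue of post_kmeans_max_tries
        random_state=0,
    )
    labels = kmeans.fit_predict(X)
    scores[k] = silhouette_score(X, labels)

# Choose the number of clusters with the highest average silhouette width
best_k = max(scores, key=scores.get)
print(best_k)
```

In this simulated setting the silhouette criterion favours three clusters, matching the three levels used to generate the data.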

6.4. Feature importance

If you want to learn which of your features have the most predictive power for the estimated \(\text{IATE's}\), you can activate the feature importance procedure by setting the parameter post_random_forest_vi to True. This procedure builds a predictive random forest to determine which features influence the \(\text{IATE's}\) most. The feature importance statistics are presented in percentage points of the coefficient of determination, \(R^2\), that is lost when the respective feature is randomly permuted. The \(R^2\) statistics are obtained through the RandomForestRegressor provided by scikit-learn.
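The idea of measuring importance as lost \(R^2\) can be sketched as follows. This is an illustration with simulated data, not the package's implementation: the feature names are placeholders, and evaluating the \(R^2\) on a held-out split is an assumption made here for simplicity.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n = 600
names = ["x0", "x1", "x2"]  # placeholder feature names
X = rng.normal(size=(n, 3))
# Simulated IATEs that depend on the first feature only
iate = 2.0 * X[:, 0] + 0.1 * rng.normal(size=n)

X_tr, X_te, y_tr, y_te = train_test_split(X, iate, test_size=0.3, random_state=0)
forest = RandomForestRegressor(n_estimators=100, random_state=0).fit(X_tr, y_tr)

# R^2 with all features intact (held-out data; an assumption of this sketch)
r2_full = forest.score(X_te, y_te)

loss_pp = {}
for j, name in enumerate(names):
    X_perm = X_te.copy()
    X_perm[:, j] = rng.permutation(X_perm[:, j])  # break the link to the IATE
    # Loss of R^2 in percentage points when feature j is permuted
    loss_pp[name] = 100 * (r2_full - forest.score(X_perm, y_te))

print(loss_pp)
```

Permuting the only relevant feature removes almost all explanatory power, whereas permuting the irrelevant features leaves the \(R^2\) nearly unchanged.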

6.5. Parameter overview

Below is an overview of the above-mentioned parameters related to post-estimation diagnostics in the class ModifiedCausalForest:

Parameter	Description
`post_est_stats`	If True, post-estimation diagnostics are conducted. Default: True.
`post_relative_to_first_group_only`	If True, post-estimation diagnostics will only be conducted for \(\text{IATE's}\) relative to the first treatment state. If False, the diagnostics cover the \(\text{IATE's}\) of all possible treatment combinations. Default: True.
`p_ci_level`	Confidence level for plots, including the post-estimation diagnostic plots. Default: 0.9.
`post_bin_corr_yes`	If True, the binary correlation analysis is conducted. Default: True.
`post_bin_corr_threshold`	If `post_bin_corr_yes` is True, correlations are only displayed if their absolute value is at least `post_bin_corr_threshold`. Default: 0.1.
`post_kmeans_yes`	If True, \(k\)-means clustering is conducted to build clusters based on the \(\text{IATE's}\). Default: True.
`post_kmeans_no_of_groups`	Only relevant if `post_kmeans_yes` is True. Determines the number of clusters for \(k\)-means clustering. Should be specified as a list of values. Default: See the `API`.
`post_kmeans_max_tries`	Only relevant if `post_kmeans_yes` is True. Determines the maximum number of iterations to achieve convergence in each \(k\)-means clustering replication. Default: 1000.
`post_kmeans_replications`	Only relevant if `post_kmeans_yes` is True. Determines the number of replications for \(k\)-means clustering. Default: 10.
`post_kmeans_min_size_share`	Only relevant if `post_kmeans_yes` is True. Smallest cluster size allowed, as a share in %. Default (or None): 1.
`post_random_forest_vi`	If True, the feature importance analysis is conducted. Default: True.
`post_plots`	If True, post-estimation diagnostic plots are printed during runtime. Default: True.
`post_tree`	If True, regression trees (honest and standard) of depth 2 to 5 are estimated to describe the \(\text{IATE's}\) as a function of the features. Default (or None): True.

Please consult the API for more details.

6.6. Example

from mcf.example_data_functions import example_data
from mcf.mcf_functions import ModifiedCausalForest
from mcf.reporting import McfOptPolReport

# Generate example data using the built-in function `example_data()`
training_df, prediction_df, name_dict = example_data()

my_mcf = ModifiedCausalForest(
        var_y_name="outcome",
        var_d_name="treat",
        var_x_name_ord=["x_cont0", "x_cont1", "x_ord1"],
        var_x_name_unord=["x_unord0"],
        p_ci_level=0.95,
        # Parameters for post-estimation diagnostics
        post_est_stats=True,
        post_relative_to_first_group_only=True,
        post_bin_corr_yes=True,
        post_bin_corr_threshold=0.1,
        post_kmeans_yes=True,
        post_kmeans_no_of_groups=[3, 4, 5, 6, 7],
        post_kmeans_max_tries=1000,
        post_kmeans_replications=10,
        post_random_forest_vi=True,
        post_plots=True,
        post_kmeans_min_size_share=1,
        post_tree=True
    )

my_mcf.train(training_df)
results, _ = my_mcf.predict(prediction_df)

# Compute the post-estimation diagnostics
post_estimation_diagnostics, _ = my_mcf.analyse(results)

# Access cluster memberships (column 'IATE_Cluster')
post_estimation_diagnostics["iate_data_df"]

# Produce a PDF-report with the results, including post-estimation diagnostics
mcf_report = McfOptPolReport(mcf=my_mcf, outputfile='Modified-Causal-Forest_Report')
mcf_report.report()