8. Computational Speed and Resources for Effect Estimation

This section provides key considerations regarding computation and resource management. It includes speed- and resource-related information necessary for tuning the forest via grid search, setting parameter values to optimize runtime, and reducing RAM consumption.

8.2. Minimization of RAM usage

When datasets are large, the computational burden (including the demand on RAM) can increase rapidly. First of all, it is important to remember that the mcf estimation consists of two steps:

  1. Train the forest with the training data (outcome, treatment, features);

  2. Predict the effects with the prediction data (needs features only, or treatment and features if, e.g., treatment effects on the treated are estimated).

The precision of the results is (almost) entirely determined by the training data, while the prediction data (mainly) defines the population for which the ATE and other effects are computed.
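These two steps correspond to the train and predict methods of the ModifiedCausalForest class. The following minimal sketch assumes that the class can be imported directly from the mcf package (check the installation guide for the exact import path) and that training_df and prediction_df are placeholder pandas DataFrames holding the training and prediction data:

from mcf import ModifiedCausalForest  # adjust the import path to your mcf version

my_mcf = ModifiedCausalForest(
    var_y_name="outcome",
    var_d_name="treat",
    var_x_name_ord=["x_cont0", "x_cont1", "x_ord1"],
)
# Step 1: train the forest (needs outcome, treatment, and features).
my_mcf.train(training_df)
# Step 2: predict the effects (needs features only, or treatment and features
# if, e.g., effects on the treated are estimated).
results = my_mcf.predict(prediction_df)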

The mcf deals with large training data as follows: when the training data becomes larger than cf_chunks_maxsize, the data is randomly split into chunks and a separate forest is estimated for each chunk. In the prediction part, effects are estimated for each forest and subsequently averaged.
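If the available RAM is limited, the chunk size can also be set explicitly. A minimal sketch, using a purely illustrative value of 100,000 observations per chunk (by default, the mcf sets this value automatically):

my_mcf = ModifiedCausalForest(
    var_y_name="outcome",
    var_d_name="treat",
    var_x_name_ord=["x_cont0", "x_cont1", "x_ord1"],
    # Estimate a separate forest for every random chunk of at most
    # 100,000 training observations (illustrative value).
    cf_chunks_maxsize=100_000,
)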

The mcf deals with large prediction data as follows: the critical part when computing the effects is the weight matrix. Its size is \(N_{Tf} \times N_{P}\), where \(N_{P}\) is the number of observations in the prediction data and \(N_{Tf}\) is the number of observations used for forest estimation. The weight matrix is estimated for each forest (to save RAM, it is deleted from memory and stored on disk). Although the weight matrix uses a sparse data format by default, it may still be very large and can be very time consuming to compute.
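To get a sense of the orders of magnitude involved, consider a purely illustrative calculation: for \(N_{Tf} = 250{,}000\) and \(N_{P} = 1{,}000{,}000\), a dense double-precision weight matrix would occupy \(250{,}000 \times 1{,}000{,}000 \times 8\) bytes \(= 2\) TB of memory, which is why sparse storage and writing the matrix to disk are essential.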

Reducing computation and the demand on memory without much performance loss: Tests with very large data (1 million observations and more) have shown that the prediction part indeed becomes the bottleneck, while the training part computes reasonably fast. Therefore, one way to speed up the mcf and reduce the demand on RAM is to reduce the size of the prediction data (e.g., take an x% random sample). Tests have shown that, with this approach, effect estimates and standard errors remain very similar whether 1 million or only 100,000 prediction observations are used, even with 1 million training observations.
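A minimal sketch of such a subsample, assuming the prediction data is held in a pandas DataFrame called prediction_df (placeholder name) and an already trained ModifiedCausalForest instance my_mcf as above; the sample size and seed are illustrative:

# Draw a random 100,000-observation subsample before predicting.
prediction_sample_df = prediction_df.sample(n=100_000, random_state=42)
results = my_mcf.predict(prediction_sample_df)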

The keywords _int_max_obs_training, _int_max_obs_prediction, _int_max_obs_kmeans, and _int_max_obs_post_rel_graphs set upper bounds on the number of observations used for training, prediction, and the post-estimation analyses (k-means clustering and relevance graphs), respectively.

Please refer to the API for a detailed description of these and other options.

Adjusting these options can significantly reduce computation time, but it may also affect the accuracy of the results. It is therefore recommended to understand the implications of each option before adjusting it. Below you find a coding example listing the discussed parameters that are relevant for parameter tuning and computational speed.

Note that the mcf runs faster when binary features, such as gender, are defined as ordered, using var_x_name_ord instead of var_x_name_unord.
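A minimal sketch, assuming a binary dummy called female (placeholder name):

my_mcf = ModifiedCausalForest(
    var_y_name="outcome",
    var_d_name="treat",
    # Binary dummies can safely be listed as ordered features, which is faster.
    var_x_name_ord=["x_cont0", "x_cont1", "female"],
    # Avoid listing binary dummies under var_x_name_unord; that invokes the
    # slower machinery for unordered categorical variables.
)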

8.3. Example

my_mcf = ModifiedCausalForest(
    var_y_name="outcome",
    var_d_name="treat",
    var_x_name_ord=["x_cont0", "x_cont1", "x_ord1"],
    # Number of trees
    cf_boot=500,
    # Maximum share of variables used at each new split of tree
    cf_m_share_max=0.6,
    # Minimum share of variables used at each new split of tree
    cf_m_share_min=0.15,
    # Number of grid values for the number of variables used at each split
    cf_m_grid=2,
    # Smallest minimum leaf size
    cf_n_min_min=5,
    # Largest minimum leaf size
    cf_n_min_max=None,
    # Number of parallel processes
    gen_mp_parallel=None,
    # Tune all parameters
    cf_tune_all=True,
    # Number of IATEs computed per chunk during parallel processing
    _int_iate_chunk_size=None,
    # Number of pieces the sparse weight matrix is split into (reduces RAM)
    _int_weight_as_sparse_splits=None,
    # Maximum number of observations used for training
    _int_max_obs_training=None,
    # Maximum number of observations used for prediction
    _int_max_obs_prediction=None,
    # Maximum number of observations used for k-means clustering (post-estimation)
    _int_max_obs_kmeans=None,
    # Maximum number of observations used for the post-estimation relevance graphs
    _int_max_obs_post_rel_graphs=None,
)