TabularPredictor.leaderboard

TabularPredictor.leaderboard(data: str | TabularDataset | DataFrame | None = None, extra_info: bool = False, extra_metrics: list | None = None, decision_threshold: float | None = None, score_format: str = 'score', only_pareto_frontier: bool = False, skip_score: bool = False, refit_full: bool | None = None, set_refit_score_to_parent: bool = False, display: bool = False, **kwargs) → DataFrame

Output summary of information about models produced during fit() as a pd.DataFrame. Includes information on test and validation scores for all models, model training times, inference times, and stack levels. A minimal usage sketch follows the column descriptions below. Output DataFrame columns include:

‘model’: The name of the model.

‘score_val’: The validation score of the model on the ‘eval_metric’.

NOTE: Metric scores are always shown in higher-is-better form. This means that metrics such as log_loss and root_mean_squared_error have their signs FLIPPED, so their values are negative. This is necessary so that the user does not need to know the metric to tell whether higher is better when reading the leaderboard.

‘eval_metric’: The evaluation metric name used to calculate the scores.

This should be identical to predictor.eval_metric.name.

‘pred_time_val’: The inference time required to compute predictions on the validation data end-to-end.

Equivalent to the sum of all ‘pred_time_val_marginal’ values for the model and all of its base models.

‘fit_time’: The fit time required to train the model end-to-end (including base models if the model is a stack ensemble).

Equivalent to the sum of all ‘fit_time_marginal’ values for the model and all of its base models.

‘pred_time_val_marginal’: The inference time required to compute predictions on the validation data (ignoring inference times for base models).

Note that this ignores the time required to load the model into memory when bagging is disabled.

‘fit_time_marginal’: The fit time required to train the model (ignoring base models).

‘stack_level’: The stack level of the model.

A model with stack level N can take any set of models with stack level less than N as input, with stack level 1 models having no model inputs.

‘can_infer’: Whether the model is able to perform inference on new data. If False, then the model either was not saved, was deleted, or an ancestor of the model cannot infer.

can_infer is often False when save_bag_folds=False was specified in initial fit().

‘fit_order’: The order in which models were fit. The first model fit has fit_order=1, and the Nth model fit has fit_order=N. The order corresponds to the first child model fit in the case of bagged ensembles.
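
For orientation, here is a minimal usage sketch. The file path and the ‘class’ label column are hypothetical placeholders; substitute your own data.

    from autogluon.tabular import TabularDataset, TabularPredictor

    # Hypothetical training file and label column name.
    train_data = TabularDataset("train.csv")
    predictor = TabularPredictor(label="class").fit(train_data)

    # Summarize every model produced during fit().
    lb = predictor.leaderboard()
    print(lb[["model", "score_val", "eval_metric", "pred_time_val",
              "fit_time", "stack_level", "can_infer", "fit_order"]])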

Parameters:
  • data (str or TabularDataset or pd.DataFrame (optional)) –

    This dataset must contain the label column, with the same column name as specified during fit(). If extra_metrics=None and skip_score=True, then the label column is not required. If specified, the returned leaderboard will contain the additional columns ‘score_test’, ‘pred_time_test’, and ‘pred_time_test_marginal’.

    ‘score_test’: The score of the model on the ‘eval_metric’ for the data provided.

    NOTE: Metric scores are always shown in higher-is-better form, as described in the note under ‘score_val’ above.

    ‘pred_time_test’: The true end-to-end wall-clock inference time of the model for the data provided.

    Equivalent to the sum of all ‘pred_time_test_marginal’ values for the model and all of its base models.

    ‘pred_time_test_marginal’: The inference time of the model for the data provided, minus the inference time for the model’s base models, if it has any.

    Note that this ignores the time required to load the model into memory when bagging is disabled.

    If str is passed, data will be loaded using the str value as the file path.
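
    For example, continuing the sketch above with a hypothetical held-out file that contains the label column:

        # score_test and the test timing columns appear because data was provided.
        lb = predictor.leaderboard(data="test.csv")
        print(lb[["model", "score_test", "pred_time_test", "pred_time_test_marginal"]])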

  • extra_info (bool, default = False) –

    If True, will return extra columns with advanced info. This requires additional computation as advanced info data is calculated on demand. Additional output columns when extra_info=True include the following (see the sketch after this list):

    ‘num_features’: Number of input features used by the model.

    Some models may ignore certain features in the preprocessed data.

    ‘num_models’: Number of models that actually make up this “model” object.

    For non-bagged models, this is 1. For bagged models, this is equal to the number of child models (models trained on bagged folds) the bagged ensemble contains.

    ‘num_models_w_ancestors’: Equivalent to the sum of ‘num_models’ values for the model and its ancestors (see below).

    ‘memory_size’: The amount of memory in bytes the model requires when persisted in memory. This is not equivalent to the amount of memory the model may use during inference.

    For bagged models, this is the sum of the ‘memory_size’ of all child models.

    ‘memory_size_w_ancestors’: Equivalent to the sum of ‘memory_size’ values for the model and its ancestors.

    This is the amount of memory required to keep the model and all of its ancestors persisted in memory between inference calls, which is critical for online inference. The machine performing online inference should have more than twice this amount of memory so that models can remain persisted rather than being loaded on every inference call.

    ‘memory_size_min’: The amount of memory in bytes the model minimally requires to perform inference.

    For non-bagged models, this is equivalent to ‘memory_size’. For bagged models, this is equivalent to the largest child model’s ‘memory_size_min’. To minimize memory usage, child models can be loaded and un-persisted one by one to infer. This is the default behavior if a bagged model was not already persisted in memory prior to inference.

    ‘memory_size_min_w_ancestors’: Equivalent to the max of the ‘memory_size_min’ values for the model and its ancestors.

    This is the minimum memory required to infer with the model by loading only one model at a time, since each of its ancestors must also be loaded at some point during inference. For offline inference, where latency is not a concern, this should be used to determine the required memory for a machine if ‘memory_size_w_ancestors’ is too large.

    ‘num_ancestors’: Number of ancestor models for the given model.

    ‘num_descendants’: Number of descendant models for the given model.

    ‘model_type’: The type of the given model.

    If the model is an ensemble type, ‘child_model_type’ will indicate the inner model type. A stack ensemble of bagged LightGBM models would have ‘StackerEnsembleModel’ as its model type.

    ‘child_model_type’: The child model type. None if the model is not an ensemble. A stack ensemble of bagged LightGBM models would have ‘LGBModel’ as its child type.

    Child models are models which are used as a group to generate a given bagged ensemble model’s predictions. These are the models trained on each fold of a bagged ensemble. For 10-fold bagging, the bagged ensemble model would have 10 child models. For 10-fold bagging with 3 repeats, the bagged ensemble model would have 30 child models. Note that child models are distinct from ancestors and descendants.

    ‘hyperparameters’: The hyperparameter values specified for the model.

    All hyperparameters that do not appear in this dict remain at their default values.

    ‘hyperparameters_fit’: The hyperparameters set by the model during fit.

    This overrides the ‘hyperparameters’ value for a particular key, if present in ‘hyperparameters_fit’, to determine the fit model’s final hyperparameters. It is most commonly set for hyperparameters that indicate model training iterations or epochs, since early stopping can settle on a different value than ‘hyperparameters’ indicated. In these cases, the value in ‘hyperparameters’ acts as a maximum for the model, which may still stop early at a smaller value during training to achieve a better validation score or to satisfy time constraints. For example, if a NN model was given epochs=500 as a hyperparameter but found during training that epochs=60 produced the optimal validation score, it would use epochs=60, and hyperparameters_fit={‘epochs’: 60} would be set.

    ‘ag_args_fit’: Special AutoGluon arguments that influence model fit.

    See the documentation of the hyperparameters argument in TabularPredictor.fit() for more information.

    ‘features’: List of feature names used by the model.

    ‘child_hyperparameters’: Equivalent to ‘hyperparameters’, but for the model’s children.

    ‘child_hyperparameters_fit’: Equivalent to ‘hyperparameters_fit’, but for the model’s children.

    ‘child_ag_args_fit’: Equivalent to ‘ag_args_fit’, but for the model’s children.

    ‘ancestors’: The model’s ancestors. Ancestor models are the models which are required to make predictions during the construction of the model’s input features.

    If A is an ancestor of B, then B is a descendant of A. If a model’s ancestor is deleted, the model is no longer able to infer on new data, and its ‘can_infer’ value will be False. A model can only have ancestor models whose ‘stack_level’ is lower than its own. ‘stack_level’=1 models have no ancestors.

    ‘descendants’: The model’s descendants. Descendant models are the models which require this model to make predictions during the construction of their input features.

    If A is a descendant of B, then B is an ancestor of A. If this model is deleted, then all descendant models will no longer be able to infer on new data, and their ‘can_infer’ values will be False. A model can only have descendant models whose ‘stack_level’ is higher than its own.
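
    Continuing the sketch above, a brief example of requesting and inspecting a few of these on-demand columns:

        lb = predictor.leaderboard(extra_info=True)
        # These columns are computed on demand when extra_info=True.
        print(lb[["model", "num_features", "num_models", "memory_size",
                  "child_model_type", "num_ancestors"]])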

  • extra_metrics (list, default = None) –

    A list of metrics to calculate scores for and include in the output DataFrame. Only valid when data is specified. The scores refer to the scores on data (the same data used to calculate the score_test column). This list can contain any values which would also be valid for eval_metric in predictor init. For example, extra_metrics=[‘accuracy’, ‘roc_auc’, ‘log_loss’] would be valid in binary classification, and would add 3 columns to the output DataFrame whose names match the names of the metrics. Passing extra_metrics=[predictor.eval_metric] would return an extra column, named after the eval metric, with values identical to score_test. This also works with custom metrics; if an object is passed instead of a string, the column name will be equal to the object’s .name attribute.

    NOTE: Metric scores are always shown in higher-is-better form, as described in the note under ‘score_val’ above.
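
    For example, in binary classification (continuing the sketch above):

        # Each metric becomes an additional column named after the metric.
        lb = predictor.leaderboard(data="test.csv",
                                   extra_metrics=["accuracy", "roc_auc", "log_loss"])
        print(lb[["model", "score_test", "accuracy", "roc_auc", "log_loss"]])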

  • decision_threshold (float, default = None) –

    The decision threshold to use when converting prediction probabilities to predictions. This will impact the scores of metrics such as f1 and accuracy. If None, defaults to predictor.decision_threshold. Ignored unless problem_type=‘binary’. Refer to the predictor.decision_threshold docstring for more information. NOTE: score_val will not be impacted by this value in v0.8.

    score_val will always show the validation scores achieved with a decision threshold of 0.5. Only test scores will be properly updated.
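
    A sketch for binary classification; the threshold value is illustrative:

        # Threshold-dependent metrics (e.g. f1, accuracy) in score_test will change;
        # score_val still reflects the default 0.5 threshold.
        lb = predictor.leaderboard(data="test.csv", decision_threshold=0.3)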

  • score_format ({'score', 'error'}) –

    If “score”, the leaderboard is returned as normal. If “error”, the column “score_val” is converted to “metric_error_val”, and “score_test” is converted to “metric_error_test”.

    “metric_error” is calculated by taking predictor.eval_metric.convert_score_to_error(score). This results in errors where 0 is perfect and lower is better.
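
    For example:

        lb = predictor.leaderboard(data="test.csv", score_format="error")
        # Errors are >= 0 with 0 being perfect, so ascending sort ranks best first.
        print(lb[["model", "metric_error_val", "metric_error_test"]]
              .sort_values("metric_error_test"))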

  • only_pareto_frontier (bool, default = False) – If True, only return model information of models in the Pareto frontier of the accuracy/latency trade-off (models which achieve the highest score within their end-to-end inference time). At minimum this will include the model with the highest score and the model with the lowest inference time. This is useful when deciding which model to use during inference if inference time is a consideration. Models filtered out by this process would never be optimal choices for a user that only cares about model inference time and score.
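
    For example:

        # Keep only models that are optimal for some score/latency trade-off.
        lb = predictor.leaderboard(data="test.csv", only_pareto_frontier=True)
        print(lb[["model", "score_test", "pred_time_test"]])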

  • skip_score (bool, default = False) – [Advanced, primarily for developers] If True, will skip computing score_test if data is specified. score_test will be set to NaN for all models. pred_time_test and related columns will still be computed.
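
    For example, to benchmark inference times without paying the cost of scoring:

        # score_test will be NaN; pred_time_test and related columns are still computed.
        lb = predictor.leaderboard(data="test.csv", skip_score=True)
        print(lb[["model", "score_test", "pred_time_test"]])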

  • refit_full (bool, default = None) – If True, will return only models that have been refit (e.g. models with _FULL in the name). If False, will return only models that have not been refit. If None, will return all models.

  • set_refit_score_to_parent (bool, default = False) – If True, the score_val of refit models will be set to the score_val of their parent. While this does not represent the genuine validation score of the refit model, it is a reasonable proxy.
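
    For example, a sketch that trains refit variants and then lists only them, using their parents’ validation scores as a proxy:

        predictor.refit_full()  # train the _FULL variants of existing models
        lb = predictor.leaderboard(refit_full=True, set_refit_score_to_parent=True)
        print(lb[["model", "score_val"]])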

  • display (bool, default = False) – If True, the output DataFrame is printed to stdout.

Returns:

pd.DataFrame of model performance summary information.