TabularPredictor.predict_proba_oof

TabularPredictor.predict_proba_oof(model: str | None = None, *, transformed=False, as_multiclass=True, train_data=None, internal_oof=False, can_infer=None) → DataFrame | Series

Note: This is advanced functionality not intended for normal usage.

Returns the out-of-fold (OOF) predicted class probabilities for every row in the training data. OOF prediction probabilities may provide unbiased estimates of generalization accuracy (reflecting how predictions will behave on new data). The prediction for each row is made only by models that were fit to a subset of the data in which that row was held out.
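To make the held-out idea concrete, here is a minimal pure-Python sketch of out-of-fold prediction. It is not AutoGluon's implementation: the toy "model" (class frequencies of the rows it was fit on), the function names, and the round-robin fold assignment are all assumptions chosen for illustration.

```python
from collections import Counter


def fit_class_frequencies(labels):
    """Toy 'model' (illustrative only): class frequencies of the labels it was fit on."""
    counts = Counter(labels)
    total = len(labels)
    return {cls: n / total for cls, n in counts.items()}


def out_of_fold_probabilities(labels, n_folds=3):
    """Return one probability dict per row, each from a model that never saw that row."""
    classes = sorted(set(labels))
    oof = [None] * len(labels)
    for fold in range(n_folds):
        # Row i is held out in fold i % n_folds (a simple round-robin split).
        held_out = [i for i in range(len(labels)) if i % n_folds == fold]
        train_labels = [labels[i] for i in range(len(labels)) if i % n_folds != fold]
        model = fit_class_frequencies(train_labels)
        for i in held_out:
            # Classes absent from this fold's training rows get probability 0.
            oof[i] = {cls: model.get(cls, 0.0) for cls in classes}
    return oof


labels = ["a", "b", "a", "a", "b", "a"]
oof = out_of_fold_probabilities(labels, n_folds=3)
```

Every row ends up with exactly one prediction, and that prediction never depends on the row's own label, which is the property that makes OOF probabilities useful for estimating performance on new data.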

Warning: This method will raise an exception if called on a model that is not a bagged ensemble. Only bagged models (such as stacker models) can produce OOF predictions.

This also means that refit_full models and distilled models will raise an exception.

Warning: If intending to join the output of this method with the original training data, be aware that a rare edge-case issue exists:

Multiclass problems with rare classes, combined with the use of the ‘log_loss’ eval_metric, may have forced AutoGluon to duplicate rows in the training data to satisfy minimum class counts. If this has occurred, the indices and row counts of the object returned by this method may not align with the training data. In this case, consider fetching the processed training data using predictor.load_data_internal() instead of using the original training data. A more benign version of this issue occurs when ‘log_loss’ wasn’t specified as the eval_metric but rare classes were dropped by AutoGluon. In this case, not all of the original training data rows will have an OOF prediction. It is recommended either to drop these rows during the join or to get direct predictions on the missing rows via TabularPredictor.predict_proba().
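The second recommendation (filling rows that have no OOF prediction with direct predictions) can be sketched as follows. The helper name oof_with_fallback is hypothetical and not part of AutoGluon; it only assumes an object exposing predict_proba_oof() and predict_proba() that return pandas objects indexed like the training data. It deliberately refuses to handle the duplicated-row case described above.

```python
import pandas as pd


def oof_with_fallback(predictor, train_data):
    """Hypothetical helper: OOF probabilities aligned to train_data.index.

    Rows without an OOF prediction (for example, rows dropped internally
    during fit) are filled with direct predict_proba() output instead.
    """
    oof = predictor.predict_proba_oof()
    if oof.index.has_duplicates:
        # Duplicated rows cannot be aligned safely by index; see the warning above.
        raise ValueError("OOF output has duplicated index values; join on the internal data instead.")

    # Training rows that received no OOF prediction.
    missing = train_data.index.difference(oof.index)
    if len(missing) > 0:
        direct = predictor.predict_proba(train_data.loc[missing])
        oof = pd.concat([oof, direct])

    # Restore the original row order of the training data.
    return oof.reindex(train_data.index)
```

Usage would look like aligned = oof_with_fallback(predictor, train_data). Note that the filled rows hold ordinary predictions rather than out-of-fold ones, so it may be worth tracking which rows were filled if the distinction matters downstream.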

Parameters:
  • model (str (optional)) – The name of the model to get out-of-fold predictions from. Defaults to None, which uses the highest scoring model on the validation set. Valid model names can be listed by calling predictor.model_names().

  • transformed (bool, default = False) – Whether the output values should be of the original label representation (False) or the internal label representation (True). The internal representation for binary and multiclass classification is a set of integers numbering the k possible classes from 0 to k-1, while the original representation is identical to the label classes provided during fit. Most users will want the original representation and should keep transformed=False.

  • as_multiclass (bool, default = True) –

    Whether to return binary classification probabilities as if they were for multiclass classification.

    Output will contain two columns, and if transformed=False, the column names will correspond to the binary class labels. The columns will be in the same order as predictor.class_labels.

    If False, output will contain only 1 column for the positive class (get positive_class name via predictor.positive_class). Only impacts output for binary classification problems.

  • train_data (pd.DataFrame, default = None) – Specify the original train_data to ensure that any training rows that were dropped internally are properly handled. If None, the output will not contain all rows if training rows were dropped internally during fit. If train_data is specified, the model is unable to predict, and rows were dropped internally, an exception will be raised.

  • internal_oof (bool, default = False) – [Advanced Option] Return the internal OOF preds rather than the externally facing OOF preds. Internal OOF preds may have more or fewer rows than were provided in train_data, and are incompatible with external data. If you don’t know what this does, keep it as False.

  • can_infer (bool, default = None) – Only used if model is not specified. This is used to determine if the best model must be one that is able to predict on new data (True). If None, the best model does not need to be able to infer on new data.
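As a rough illustration of the output shapes described for as_multiclass and transformed, the sketch below works on a stand-in DataFrame rather than a fitted predictor. The class labels "no" and "yes", the index values, and the probabilities are made up, and taking the most probable column as the predicted class is an assumption for illustration, not necessarily how AutoGluon derives predictions.

```python
import pandas as pd

# Stand-in for predictor.predict_proba_oof() on a binary problem with
# as_multiclass=True and transformed=False: one column per class label.
oof_proba = pd.DataFrame(
    {"no": [0.9, 0.2, 0.4], "yes": [0.1, 0.8, 0.6]},
    index=[10, 11, 12],
)
# Stand-in for the label column of the training data (same index).
labels = pd.Series(["no", "yes", "no"], index=[10, 11, 12])

# Most probable class per row, then the share of rows where it matches the label.
oof_pred = oof_proba.idxmax(axis=1)
oof_accuracy = (oof_pred == labels).mean()

# The single positive-class column, which is what as_multiclass=False
# is described as returning for binary problems.
positive_class = "yes"  # assumed value of predictor.positive_class
positive_proba = oof_proba[positive_class]
```

Because the column names match the original labels when transformed=False, the two-column output can be compared with the label column directly, without any mapping back from the internal integer classes.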

Return type:

pd.Series or pd.DataFrame object of the out-of-fold training prediction probabilities of the model.