Add BinaryClassifierPrecisionEfficacy metric #711

Open

frances-h opened this issue Jan 22, 2025 · 0 comments

Labels: feature request (Request for a new feature)

Problem Description

There are many ways and reasons to perform data augmentation with synthetic data for the purpose of building ML models. While we have some ML Efficacy metrics in beta, we'd like to create a suite of metrics that more effectively covers this use case. The BinaryClassifierPrecisionEfficacy metric will specifically measure whether synthetic data improves the precision of a binary classifier.

Expected behavior

This metric should be defined in the data_augmentation sub-module inside single_table.

from sdmetrics.single_table.data_augmentation import BinaryClassifierPrecisionEfficacy

BinaryClassifierPrecisionEfficacy.compute_breakdown(
  real_training_data=real_df,
  synthetic_data=synthetic_df,
  real_validation_data=real_holdout_df,
  metadata=single_table_metadata_dict,
  prediction_column_name='covid_status',
  minority_class_label=1,
  classifier='XGBoost',
  fixed_recall_value=0.9
)

compute_breakdown

API

  • Args
    • real_training_data (pd.DataFrame) - A dataframe containing the real data that was used for training the synthesizer. The metric will use this data for training a Binary Classification model.
    • synthetic_data (pd.DataFrame) - A dataframe containing the synthetic data sampled from the synthesizer. The metric will use this data for training a Binary Classification model.
    • real_validation_data (pd.DataFrame) - A dataframe containing a holdout set of real data. This data should not have been used to train the synthesizer. This data will be used to evaluate a Binary Classification model.
    • metadata (dict) - The metadata dictionary describing the table of data.
    • prediction_column_name (str) - The name of the column to be predicted. The column should be a categorical or boolean column.
    • minority_class_label [str/int/float] - The value in the prediction column that should be considered a positive result, from the perspective of Binary Classification. All other values in the column will be considered negative results.
    • classifier [str, optional] - The ML algorithm to use when building the Binary Classification model. The only supported option is 'XGBoost'. Defaults to 'XGBoost'.
      • Note: as an MVP, we will only support XGBoost. Future feature requests may add support for additional algorithms.
    • fixed_recall_value [float, optional] - A float in the range (0, 1.0) describing the value to fix for the recall when building the Binary Classification model. Defaults to 0.9.
  • Returns
    • A dictionary of the breakdown of the score, with the following information:
      • The score for the metric. This is the improvement in precision score from the baseline to the augmented data: score = MAX(0, augmented_precision_score - baseline_precision_score). See the sketch after the example output below.
      • The parameters used to run the metric
      • For each of the augmented data and the real data baseline:
        • The recall score achieved during training. This should be at least the requested recall value input as a parameter, but may not be exactly equal.
        • The actual recall score achieved on the validation (holdout) set.
        • The precision score achieved on the validation set.
        • The prediction counts achieved on the validation set (true positive, false positive, true negative, and false negative).
    • Expected dictionary output:
    {
      'score': 0.86,
      'augmented_data': {
        'recall_score_training': 0.950,
        'recall_score_validation': 0.912,
        'precision_score_validation': 0.84,
        'prediction_counts_validation': {
          'true_positive': 21,
          'false_positive': 4,
          'true_negative': 73,
          'false_negative': 3
        },
      },
      'real_data_baseline': {
        # keys are the same as the 'augmented_data' dictionary
      },
      'parameters': {
        'prediction_column_name': 'covid_status',
        'minority_class_label': 1,
        'classifier': 'XGBoost',
        'fixed_recall_value': 0.9
      }
    }
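
As an illustration of how the pieces above fit together, here is a hedged sketch of assembling the breakdown from the two evaluation runs. The helper name _build_breakdown and the shape of its inputs are hypothetical, not part of the spec:

def _build_breakdown(augmented_results, baseline_results, parameters):
    """Assemble the breakdown dict; negative improvements are clipped to 0."""
    score = max(
        0.0,
        augmented_results['precision_score_validation']
        - baseline_results['precision_score_validation'],
    )
    return {
        'score': score,
        'augmented_data': augmented_results,
        'real_data_baseline': baseline_results,
        'parameters': parameters,
    }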

Algorithm

  1. Concatenate the real_training_data and synthetic_data together
  2. Train a binary classification model on the data, using the classifier algorithm selected (default: XGBoost)
    a) Need to pre-process the data to turn discrete columns into continuous columns (note that we cannot use RDT, and should use scikit-learn methods instead)
    b) Pre-process the data to convert the prediction_column into a boolean column with the correct positive/negative values
    • If multi-class, consider only the minority_class_label as positive values. All other values will be considered negative.
  3. Based on the parameters, fix the recall for the minority class
    a) This requires finding the threshold that achieves a recall as close as possible to the fixed value. The classifier returns a continuous score for each data point in the training data, and we must find the threshold that yields the recall closest to the fixed rate. Note that we should always choose a threshold that gets as close as possible to the requested recall value but never below it; that is, ensure that the training set recall is >= the requested recall value. (A minimal sketch of steps 2-4 follows this list.)
    b) Save this threshold to use on the validation data. This threshold is now a learnt parameter alongside the classifier.
  4. Take the classifier and apply it on the real_validation_data. Compute the statistics that we want to return.
  5. Calculate the baseline. Repeat steps 1-4 but this time, only use the real_training_data (do not concatenate synthetic_data).
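
The threshold selection in step 3 is the subtle part, so here is a minimal end-to-end sketch of steps 2-4, assuming scikit-learn and xgboost are available. All helper names (_find_threshold, _train_and_evaluate) are hypothetical illustrations, not the final API:

import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder
from xgboost import XGBClassifier


def _find_threshold(probabilities, labels, fixed_recall_value):
    """Return the highest threshold whose training recall is >= fixed_recall_value."""
    # Recall only grows as the threshold drops, so scan candidate thresholds
    # from high to low and stop at the first one that reaches the target.
    for threshold in np.sort(np.unique(probabilities))[::-1]:
        predictions = probabilities >= threshold
        recall = np.sum(predictions & labels) / labels.sum()
        if recall >= fixed_recall_value:
            return threshold
    return 0.0  # Predicting everything positive always reaches recall 1.0.


def _train_and_evaluate(train_df, validation_df, prediction_column_name,
                        minority_class_label, fixed_recall_value):
    # Step 2b: binarize the prediction column (minority class -> positive).
    y_train = (train_df[prediction_column_name] == minority_class_label).to_numpy()
    y_val = (validation_df[prediction_column_name] == minority_class_label).to_numpy()

    # Step 2a: encode discrete columns with scikit-learn (not RDT).
    features = train_df.drop(columns=[prediction_column_name])
    discrete = features.select_dtypes(include=['object', 'category', 'bool']).columns
    encoder = ColumnTransformer(
        [('ohe', OneHotEncoder(handle_unknown='ignore'), list(discrete))],
        remainder='passthrough')
    X_train = encoder.fit_transform(features)
    X_val = encoder.transform(validation_df.drop(columns=[prediction_column_name]))

    # Step 2: train the binary classifier on the (possibly augmented) data.
    model = XGBClassifier().fit(X_train, y_train.astype(int))

    # Step 3: fix the recall on the training set and save the threshold.
    train_probabilities = model.predict_proba(X_train)[:, 1]
    threshold = _find_threshold(train_probabilities, y_train, fixed_recall_value)

    # Step 4: apply the learnt threshold to the holdout data.
    val_predictions = model.predict_proba(X_val)[:, 1] >= threshold
    tp = int(np.sum(val_predictions & y_val))
    fp = int(np.sum(val_predictions & ~y_val))
    fn = int(np.sum(~val_predictions & y_val))
    tn = int(np.sum(~val_predictions & ~y_val))
    return {
        'recall_score_training': float(
            np.sum((train_probabilities >= threshold) & y_train) / y_train.sum()),
        'recall_score_validation': tp / (tp + fn) if (tp + fn) else 0.0,
        'precision_score_validation': tp / (tp + fp) if (tp + fp) else 0.0,
        'prediction_counts_validation': {
            'true_positive': tp, 'false_positive': fp,
            'true_negative': tn, 'false_negative': fn},
    }

Steps 1 and 5 then reduce to calling this twice: once with pd.concat([real_training_data, synthetic_data], ignore_index=True) and once with real_training_data alone, comparing the two validation precision scores.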

compute

The compute method should take the same arguments as the compute_breakdown method.

The compute method should return just the overall score value calculated by compute_breakdown.
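
In other words, compute is a thin wrapper over compute_breakdown. A minimal sketch of that relationship, written as the class method would likely look (the classmethod framing is an assumption):

@classmethod
def compute(cls, real_training_data, synthetic_data, real_validation_data,
            metadata, prediction_column_name, minority_class_label,
            classifier='XGBoost', fixed_recall_value=0.9):
    """Return only the overall 'score' entry from compute_breakdown."""
    breakdown = cls.compute_breakdown(
        real_training_data=real_training_data,
        synthetic_data=synthetic_data,
        real_validation_data=real_validation_data,
        metadata=metadata,
        prediction_column_name=prediction_column_name,
        minority_class_label=minority_class_label,
        classifier=classifier,
        fixed_recall_value=fixed_recall_value)
    return breakdown['score']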

Additional context

See this doc

There will be significant overlap of required pre-processing/helper functions between data augmentation metrics. When possible, general functionality should be abstracted into utility functions that can be reused across many metrics.
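
For instance, a hypothetical shared module (sdmetrics/single_table/data_augmentation/utils.py, name assumed) could collect those helpers along these lines:

# Hypothetical shared helpers for data augmentation metrics.

def process_data_with_metadata(data, metadata, prediction_column_name,
                               minority_class_label):
    """Encode discrete columns (scikit-learn) and binarize the prediction column."""


def fit_classifier_with_fixed_recall(X, y, classifier, fixed_recall_value):
    """Train the chosen classifier and learn the recall-fixing threshold."""


def evaluate_on_validation(model, threshold, X, y):
    """Compute precision, recall, and prediction counts on the holdout set."""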
