Add BinaryClassifierPrecisionEfficacy metric #711

Open

frances-h opened this issue Jan 22, 2025 · 0 comments

Labels: feature request (Request for a new feature)

Problem Description

There are many ways and reasons to perform data augmentation with synthetic data for the purpose of building ML models. While we have some ML Efficacy metrics in beta, we'd like to create a suite of metrics that more effectively covers this use case. The BinaryClassifierPrecisionEfficacy metric will specifically measure whether synthetic data improves the precision of a binary classifier.

Expected behavior

This metric should be defined in the data_augmentation sub-module inside single_table.

from sdmetrics.single_table.data_augmentation import BinaryClassifierPrecisionEfficacy

BinaryClassifierPrecisionEfficacy.compute_breakdown(
  real_training_data=real_df,
  synthetic_data=synthetic_df,
  real_validation_data=real_holdout_df,
  metadata=single_table_metadata_dict,
  prediction_column_name='covid_status',
  minority_class_label=1,
  classifier='XGBoost',
  fixed_recall_value=0.9
)

compute_breakdown

API

  • Args
    • real_training_data (pd.DataFrame) - A dataframe containing the real data that was used for training the synthesizer. The metric will use this data for training a Binary Classification model.
    • synthetic_data (pd.DataFrame) - A dataframe containing the synthetic data sampled from the synthesizer. The metric will use this data for training a Binary Classification model.
    • real_validation_data (pd.DataFrame) - A dataframe containing a holdout set of real data. This data should not have been used to train the synthesizer. This data will be used to evaluate a Binary Classification model.
    • metadata (dict) - The metadata dictionary describing the table of data.
    • prediction_column_name (str) - The name of the column to be predicted. The column should be a categorical or boolean column.
    • minority_class_label [str/int/float] - The value in the prediction column that should be considered a positive result, from the perspective of Binary Classification. All other values in the column will be considered negative results.
    • classifier [str, optional] - The ML algorithm to use when building the Binary Classification model. The only supported option is 'XGBoost'. Defaults to 'XGBoost'.
      • Note: as an MVP, we will only support XGBoost. Future feature requests may add support for additional algorithms.
    • fixed_recall_value [float, optional] - A float in the range (0, 1.0) describing the value to fix for the recall when building the Binary Classification model. Defaults to 0.9.
  • Returns
    • A dictionary of the breakdown of the score, with the following information:
      • The score for the metric. This is the improvement in precision score from the baseline to the augmented data: score = MAX(0, augmented_precision_score - baseline_precision_score). See the sketch after the example output below.
      • The parameters used to run the metric
      • For each of the augmented data and the real data baseline:
        • The recall score achieved during training. This should be at least the requested recall value input as a parameter, but may not be exactly equal.
        • The actual recall score achieved on the validation (holdout) set.
        • The precision score achieved on the validation set.
        • The prediction counts achieved on the validation set (true positive, false positive, true negative, and false negative).
    • Expected dictionary output:
    {
      'score': 0.86,
      'augmented_data': {
        'recall_score_training': 0.950,
        'recall_score_validation': 0.912,
        'precision_score_validation': 0.84,
        'prediction_counts_validation': {
          'true_positive': 21,
          'false_positive': 4,
          'true_negative': 73,
          'false_negative': 3
        },
      },
      'real_data_baseline': {
        # keys are the same as the 'augmented_data' dictionary
      },
      'parameters': {
        'prediction_column_name': 'covid_status',
        'minority_class_label': 1,
        'classifier': 'XGBoost',
        'fixed_recall_value': 0.9
      }
    }
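
As an illustration of how the pieces above fit together, here is a hedged sketch of assembling the breakdown from the two evaluation runs. The helper name _build_breakdown and the shape of its inputs are hypothetical, not part of the spec:

def _build_breakdown(augmented_results, baseline_results, parameters):
    """Assemble the breakdown dict; negative improvements are clipped to 0."""
    score = max(
        0.0,
        augmented_results['precision_score_validation']
        - baseline_results['precision_score_validation'],
    )
    return {
        'score': score,
        'augmented_data': augmented_results,
        'real_data_baseline': baseline_results,
        'parameters': parameters,
    }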

Algorithm

  1. Concatenate the real_training_data and synthetic_data together
  2. Train a binary classification model on the data, using the classifier algorithm selected (default: XGBoost)
    a) Need to pre-process the data to turn discrete columns into continuous columns (note that we cannot use RDT, and should use scikit-learn methods instead)
    b) Pre-process the data to convert the prediction_column into a boolean column with the correct positive/negative values
    • If multi-class, consider only the minority_class_label as positive values. All other values will be considered negative.
  3. Based on the parameters, fix the recall for the minority class
    a) This requires finding the threshold that achieves a recall as close as possible to the fixed value. The classifier returns a continuous score for each data point in the training data, and we must find the threshold that yields the recall closest to the fixed rate. Note that we should always choose a threshold that gets as close as possible to the requested recall value but never below it; that is, ensure that the training set recall is >= the requested recall value. (A minimal sketch of steps 2-4 follows this list.)
    b) Save this threshold to use on the validation data. This threshold is now a learnt parameter alongside the classifier.
  4. Take the classifier and apply it on the real_validation_data. Compute the statistics that we want to return.
  5. Calculate the baseline. Repeat steps 1-4 but this time, only use the real_training_data (do not concatenate synthetic_data).
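
The threshold selection in step 3 is the subtle part, so here is a minimal end-to-end sketch of steps 2-4, assuming scikit-learn and xgboost are available. All helper names (_find_threshold, _train_and_evaluate) are hypothetical illustrations, not the final API:

import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder
from xgboost import XGBClassifier


def _find_threshold(probabilities, labels, fixed_recall_value):
    """Return the highest threshold whose training recall is >= fixed_recall_value."""
    # Recall only grows as the threshold drops, so scan candidate thresholds
    # from high to low and stop at the first one that reaches the target.
    for threshold in np.sort(np.unique(probabilities))[::-1]:
        predictions = probabilities >= threshold
        recall = np.sum(predictions & labels) / labels.sum()
        if recall >= fixed_recall_value:
            return threshold
    return 0.0  # Predicting everything positive always reaches recall 1.0.


def _train_and_evaluate(train_df, validation_df, prediction_column_name,
                        minority_class_label, fixed_recall_value):
    # Step 2b: binarize the prediction column (minority class -> positive).
    y_train = (train_df[prediction_column_name] == minority_class_label).to_numpy()
    y_val = (validation_df[prediction_column_name] == minority_class_label).to_numpy()

    # Step 2a: encode discrete columns with scikit-learn (not RDT).
    features = train_df.drop(columns=[prediction_column_name])
    discrete = features.select_dtypes(include=['object', 'category', 'bool']).columns
    encoder = ColumnTransformer(
        [('ohe', OneHotEncoder(handle_unknown='ignore'), list(discrete))],
        remainder='passthrough')
    X_train = encoder.fit_transform(features)
    X_val = encoder.transform(validation_df.drop(columns=[prediction_column_name]))

    # Step 2: train the binary classifier on the (possibly augmented) data.
    model = XGBClassifier().fit(X_train, y_train.astype(int))

    # Step 3: fix the recall on the training set and save the threshold.
    train_probabilities = model.predict_proba(X_train)[:, 1]
    threshold = _find_threshold(train_probabilities, y_train, fixed_recall_value)

    # Step 4: apply the learnt threshold to the holdout data.
    val_predictions = model.predict_proba(X_val)[:, 1] >= threshold
    tp = int(np.sum(val_predictions & y_val))
    fp = int(np.sum(val_predictions & ~y_val))
    fn = int(np.sum(~val_predictions & y_val))
    tn = int(np.sum(~val_predictions & ~y_val))
    return {
        'recall_score_training': float(
            np.sum((train_probabilities >= threshold) & y_train) / y_train.sum()),
        'recall_score_validation': tp / (tp + fn) if (tp + fn) else 0.0,
        'precision_score_validation': tp / (tp + fp) if (tp + fp) else 0.0,
        'prediction_counts_validation': {
            'true_positive': tp, 'false_positive': fp,
            'true_negative': tn, 'false_negative': fn},
    }

Steps 1 and 5 then reduce to calling this twice: once with pd.concat([real_training_data, synthetic_data], ignore_index=True) and once with real_training_data alone, comparing the two validation precision scores.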

compute

The compute method should take the same arguments as the compute_breakdown method.

The compute method should return just the overall score value calculated by compute_breakdown.
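
In other words, compute is a thin wrapper over compute_breakdown. A minimal sketch of that relationship, written as the class method would likely look (the classmethod framing is an assumption):

@classmethod
def compute(cls, real_training_data, synthetic_data, real_validation_data,
            metadata, prediction_column_name, minority_class_label,
            classifier='XGBoost', fixed_recall_value=0.9):
    """Return only the overall 'score' entry from compute_breakdown."""
    breakdown = cls.compute_breakdown(
        real_training_data=real_training_data,
        synthetic_data=synthetic_data,
        real_validation_data=real_validation_data,
        metadata=metadata,
        prediction_column_name=prediction_column_name,
        minority_class_label=minority_class_label,
        classifier=classifier,
        fixed_recall_value=fixed_recall_value)
    return breakdown['score']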

Additional context

See this doc

There will be significant overlap of required pre-processing/helper functions between data augmentation metrics. When possible, general functionality should be abstracted into utility functions that can be reused across many metrics.
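
For instance, a hypothetical shared module (sdmetrics/single_table/data_augmentation/utils.py, name assumed) could collect those helpers along these lines:

# Hypothetical shared helpers for data augmentation metrics.

def process_data_with_metadata(data, metadata, prediction_column_name,
                               minority_class_label):
    """Encode discrete columns (scikit-learn) and binarize the prediction column."""


def fit_classifier_with_fixed_recall(X, y, classifier, fixed_recall_value):
    """Train the chosen classifier and learn the recall-fixing threshold."""


def evaluate_on_validation(model, threshold, X, y):
    """Compute precision, recall, and prediction counts on the holdout set."""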
