Problem Description
There are many ways and reasons to perform data augmentation with synthetic data for the purpose of building ML models. While we have some ML Efficacy metrics in beta, we'd like to create a suite of metrics that more effectively cover the use case. The BinaryClassifierPrecisionEfficacy metric will specifically measure whether synthetic data improves the precision of a binary classifier.
Expected behavior
This metric should be defined in the data_augmentation sub-module inside single_table.
compute_breakdown
API
real_training_data (pd.DataFrame) - A dataframe containing the real data that was used for training the synthesizer. The metric will use this data for training a Binary Classification model.
synthetic_data (pd.DataFrame) - A dataframe containing the synthetic data sampled from the synthesizer. The metric will use this data for training a Binary Classification model.
real_validation_data (pd.DataFrame) - A dataframe containing a holdout set of real data. This data should not have been used to train the synthesizer. This data will be used to evaluate a Binary Classification model.
metadata (dict) - The metadata dictionary describing the table of data.
prediction_column_name (str) - The name of the column to be predicted. The column should be a categorical or boolean column.
minority_class_label [str/int/float] - The value in the prediction column that should be considered a positive result, from the perspective of Binary Classification. All other values in the column will be considered negative results.
classifier [str, optional] - The ML algorithm to use when building the Binary Classification model. Supported options are 'XGBoost'. Defaults to 'XGBoost'.
Note: as an MVP, we will only support XGBoost. Future feature requests may add support for additional algorithms.
fixed_recall_value [float, optional] - A float in the range (0, 1.0) describing the value to fix for the recall when building the Binary Classification model. Defaults to 0.9.
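For illustration, here is a hypothetical sketch of calling the metric; the sdmetrics import path and the classmethod-style API are assumptions based on the proposed sub-module layout, not a confirmed design.

from sdmetrics.single_table.data_augmentation import BinaryClassifierPrecisionEfficacy  # assumed path

breakdown = BinaryClassifierPrecisionEfficacy.compute_breakdown(
    real_training_data=real_training_data,      # pd.DataFrame used to train the synthesizer
    synthetic_data=synthetic_data,              # pd.DataFrame sampled from the synthesizer
    real_validation_data=real_validation_data,  # real holdout pd.DataFrame
    metadata=metadata,                          # dict describing the table
    prediction_column_name='covid_status',
    minority_class_label=1,
    classifier='XGBoost',
    fixed_recall_value=0.9,
)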
Returns
A dictionary of the breakdown of the score, with the following information:
The score for the metric. This is the improvement in precision (from baseline -> augmented data) in percentage points: score = max(0, augmented_precision_score - baseline_precision_score).
The parameters used to run the metric.
For each of the augmented data and the real data baseline:
The recall score achieved during training. This should be at least the requested recall value input as a parameter, but may not be exactly equal.
The actual recall score achieved on the validation (holdout) set.
The precision score achieved on the validation set.
The prediction counts achieved on the validation set (true positive, false positive, true negative, and false negative).
Expected dictionary output:
{
'score': 0.86,
'augmented_data': {
'recall_score_training': 0.950,
'recall_score_validation': 0.912,
'precision_score_validation': 0.84,
'prediction_counts_validation': {
'true_positive': 21,
'false_positive': 4,
'true_negative': 73,
'false_negative': 3
},
},
'real_data_baseline': {
# keys are the same as the 'augmented_data' dictionary
},
'parameters': {
'prediction_column_name': 'covid_status',
'minority_class_label': 1,
'classifier': 'XGBoost',
'fixed_recall_value': 0.9
}
}
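For reference, the validation precision in the example follows directly from the prediction counts: precision = true_positive / (true_positive + false_positive) = 21 / (21 + 4) = 0.84.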
Algorithm
1. Concatenate the real_training_data and synthetic_data together.
2. Train a binary classification model on the data, using the classifier algorithm selected (default: XGBoost).
a) Pre-process the data to turn discrete columns into continuous columns (note that we cannot use RDT, and should use scikit-learn methods instead). See the pre-processing sketch after this list.
b) Pre-process the prediction_column into a boolean column with the correct positive/negative values. If multi-class, consider only the minority_class_label as positive values. All other values will be considered negative.
3. Based on the parameters, fix the recall for the minority class. See the threshold sketch after this list.
a) This requires finding the threshold that achieves a recall as close as possible to the fixed value specified. The classifier returns a continuous score for each data point in the training data, and we have to find the threshold that achieves the recall closest to the fixed rate. Note that we should always choose a threshold that is as close as possible to the requested recall value but never less than it; that is, ensure that the training set recall is >= the requested recall value.
b) Save this threshold to use on the validation data. This threshold is now a learned parameter alongside the classifier.
4. Take the classifier and apply it to the real_validation_data. Compute the statistics that we want to return.
5. Calculate the baseline. Repeat steps 1-4 but this time, only use the real_training_data (do not concatenate synthetic_data).
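A minimal sketch of the pre-processing in steps 2a and 2b, assuming scikit-learn's OrdinalEncoder; the helper name, its signature, and the encoder choice are illustrative assumptions, since any scikit-learn method producing continuous columns would satisfy the requirement.

from sklearn.preprocessing import OrdinalEncoder

def preprocess(train_data, validation_data, discrete_columns,
               prediction_column_name, minority_class_label):
    """Encode discrete feature columns as continuous values (step 2a) and
    binarize the prediction column (step 2b), fitting on training data only."""
    encoder = OrdinalEncoder(handle_unknown='use_encoded_value', unknown_value=-1)
    train_data, validation_data = train_data.copy(), validation_data.copy()
    train_data[discrete_columns] = encoder.fit_transform(train_data[discrete_columns])
    # Reuse the fitted encoder so unseen categories in the holdout set map to -1.
    validation_data[discrete_columns] = encoder.transform(validation_data[discrete_columns])
    # Step 2b: only the minority_class_label counts as a positive result.
    y_train = train_data.pop(prediction_column_name) == minority_class_label
    y_validation = validation_data.pop(prediction_column_name) == minority_class_label
    return train_data, y_train, validation_data, y_validation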
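And a minimal sketch of the threshold search in step 3a, using scikit-learn's precision_recall_curve; the function name is hypothetical, and it assumes the classifier exposes predicted probabilities (e.g. predict_proba).

from sklearn.metrics import precision_recall_curve

def find_fixed_recall_threshold(y_train, train_scores, fixed_recall=0.9):
    """Return the largest score threshold whose training recall is still >= fixed_recall."""
    _, recalls, thresholds = precision_recall_curve(y_train, train_scores)
    # recalls[i] is the recall obtained by predicting positive when the score
    # is >= thresholds[i]; recall only decreases as the threshold grows, so the
    # largest valid threshold gives the recall closest to (but never below)
    # the requested value.
    valid = thresholds[recalls[:-1] >= fixed_recall]
    return valid.max() if len(valid) else thresholds.min()

# The saved threshold (step 3b) is then reused on the holdout set:
# validation_predictions = validation_scores >= threshold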
compute
The compute method should take the same arguments as the compute_breakdown method.
The compute method should return just the overall score parameter calculated by compute_breakdown.
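A minimal sketch of that relationship, assuming the metric is implemented as a class with classmethods (the class structure is an assumption about the final design):

class BinaryClassifierPrecisionEfficacy:
    # ... compute_breakdown defined as specified above ...

    @classmethod
    def compute(cls, real_training_data, synthetic_data, real_validation_data,
                metadata, prediction_column_name, minority_class_label,
                classifier='XGBoost', fixed_recall_value=0.9):
        """Return only the overall 'score' from the full breakdown."""
        breakdown = cls.compute_breakdown(
            real_training_data, synthetic_data, real_validation_data, metadata,
            prediction_column_name, minority_class_label, classifier,
            fixed_recall_value)
        return breakdown['score']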
Additional context
See this doc
There will be significant overlap of required pre-processing/helper functions between data augmentation metrics. When possible, general functionality should be abstracted into utility functions that can be reused across many metrics.