docs(steps): explain how to create a custom Step
#116
@@ -0,0 +1,267 @@
---
title: "How to create your own transformer"
description: |
  This tutorial provides step-by-step guidance for creating your own ibisML transformer in Python.
---

ibisML comes with a variety of built-in transformation steps like `OneHotEncode`, `ImputeMean`, `DiscretizeKBins`, and many [others](https://ibis-project.github.io/ibis-ml/reference/steps-outlier.html). However, there are times when you might need to create your own custom preprocessing transformations. This guide will walk you through how to define a custom transformation in ibisML.

Review comment: ibisML -> IbisML

Review comment: Why does "others" link to

## Install and Import Necessary Modules

```{python}
# install ibis and ibisML
# pip install 'ibis-framework[duckdb]' ibis-ml

import ibis
import ibis.expr.types as ir
import ibis_ml as ml
from ibis_ml.core import Metadata, Step
from ibis_ml.select import SelectionType, selector
from typing import Iterable, Any
```

## Implementation Outlines

Review comment: Use sentence case for headings

Creating a custom transformer in ibisML involves defining a class that inherits from the `Step` class. This class needs to implement specific methods like `fit_table` and `transform_table` to handle data processing. If you're seeking good examples of existing steps, we recommend examining the code for [impute missing value](https://github.com/ibis-project/ibis-ml/blob/main/ibis_ml/steps/_impute.py) or [ExpandDateTime](https://github.com/ibis-project/ibis-ml/blob/main/ibis_ml/steps/_temporal.py#L14) as starting points. When you need information about Ibis, you can find it [here](https://ibis-project.org/).

Here’s a step-by-step guide to creating such a transformer:

#### Step 1: Define the Constructor
In the constructor (`__init__` method), you initialize any parameters or configurations needed for the transformer.

#### Step 2: Implement `fit_table`
The `fit_table` method is used to fit the transformer to the data. This could involve calculating statistics or other parameters from the input data that will be used during transformation.

#### Step 3: Implement `transform_table`
The `transform_table` method is used to apply the transformation to the data based on the parameters or configurations set during `fit_table`.

#### Step 4: Test the Transformer
Testing ensures that your custom transformer works as expected. You can create sample data to fit and transform, checking the output to verify correctness.

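To make the division of labor between the four steps concrete, here is a minimal sketch in plain Python with no Ibis involved. `ToyCenter` and its `fit`/`transform` methods are hypothetical names chosen for illustration; they are not part of IbisML and only mirror the constructor / fit / transform / test split described above.

```python
from statistics import mean


class ToyCenter:
    """Hypothetical, Ibis-free stand-in that mirrors the constructor/fit/transform split."""

    def __init__(self, columns: list[str]):
        # Step 1: store configuration only; nothing is computed yet
        self.columns = columns

    def fit(self, data: dict[str, list[float]]) -> None:
        # Step 2: learn one statistic per configured column from the training data
        self.means_ = {name: mean(data[name]) for name in self.columns}

    def transform(self, data: dict[str, list[float]]) -> dict[str, list[float]]:
        # Step 3: apply the learned statistics; unconfigured columns pass through
        return {
            name: [v - self.means_[name] for v in values] if name in self.means_ else values
            for name, values in data.items()
        }


# Step 4: fit on training data, then transform new data with the stored statistics
step = ToyCenter(["a"])
step.fit({"a": [1.0, 2.0, 3.0], "b": [10.0, 20.0, 30.0]})
result = step.transform({"a": [2.0, 5.0], "b": [7.0, 8.0]})
print(result)  # {'a': [0.0, 3.0], 'b': [7.0, 8.0]}
```

The point of the sketch is that the statistics learned during fitting are stored on the instance and reused unchanged when transforming any later data.
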
## Example Implementation - `CustomRobustScale`
Here’s a step-by-step guide to creating a custom transformation step that scales features the way scikit-learn's [RobustScaler](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.RobustScaler.html) does.

The RobustScaler in scikit-learn scales features using statistics that are robust to outliers. Instead of using the mean and variance, it uses the median and the interquartile range (IQR). The formula for scaling a feature value $x$ is:

$$
\text{scaled\_x} = \frac{x - \text{median}(X)}{\text{IQR}(X)}
$$

where:

- $\text{scaled\_x}$ is the scaled feature value.
- $x$ is the individual feature value.
- $\text{median}(X)$ is the median of the feature values.
- $\text{IQR}(X)$ is the interquartile range of the feature values, defined as the difference between the 75th percentile (Q3) and the 25th percentile (Q1).

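As a quick worked instance of the formula, the sketch below applies it to five values, one of which is a deliberate outlier. It uses only Python's standard library and assumes quartiles are computed with linear interpolation (`method="inclusive"`); the sample values are made up for illustration.

```python
from statistics import quantiles

values = [1, 2, 3, 4, 100]  # 100 is a deliberate outlier

# method="inclusive" interpolates linearly between data points
q1, median, q3 = quantiles(values, n=4, method="inclusive")
iqr = q3 - q1

# Apply (x - median) / IQR to every value
scaled = [(x - median) / iqr for x in values]

print(q1, median, q3, iqr)  # 2.0 3.0 4.0 2.0
print(scaled)  # [-1.0, -0.5, 0.0, 0.5, 48.5]
```

The outlier barely influences the median and IQR, so the four typical values stay in a narrow range while the outlier remains clearly separated.
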
The following snippet outlines the structure of the `CustomRobustScale` class, including its constructor and methods. It serves as the starting point for the steps below.

```{python}
class CustomRobustScale(Step):
    def __init__(self, inputs: SelectionType):
        pass

    def fit_table(self, table: ir.Table, metadata: Metadata) -> None:
        pass  # Implement fitting logic here

    def transform_table(self, table: ir.Table) -> ir.Table:
        pass  # Implement transformation logic here
```

### Step 1: Define the Constructor

To construct our `CustomRobustScale` transformation, we need to specify which columns will be scaled. IbisML provides a rich set of [Selectors](https://ibis-project.github.io/ibis-ml/reference/selectors.html), allowing you to select columns by data type, names, and other patterns.

Here's how to begin defining the `__init__` method with these considerations:

```{python}
def __init__(self, inputs: SelectionType):
    # Select the columns that will be involved in the transformation
    self.inputs = selector(inputs)
```

Review comment: Do all these blocks need to get executed? You can use ... Also, when I quickly read through this, I read all of the individual steps, and then I read it again in the final implementation. Is there a better way to present this?

### Step 2: Implement `fit_table`

The next step is to implement the `fit_table()` method, which will be used to learn from the input data. This method typically fits the transformation to the data, storing any necessary statistics or parameters for later use in the transformation process. It has two parameters:

- `table`: An Ibis table expression containing the data to be used for fitting the transformation.
- `metadata`: Contains additional information about the data, such as labels, necessary for the transformation process.

In this specific example, the `fit_table` method calculates the median and interquartile range (IQR) for selected columns. These statistics are necessary for scaling the data using the RobustScaler approach. We will save the statistics for each column in a dictionary.

Here is an outline of the `fit_table` method:

- Get the column names using the `Selector`'s built-in method `select_columns`.
- For each column, calculate the `median` and `IQR` (`p75` - `p25`) by building an Ibis expression, which can be lazily evaluated on your chosen Ibis-supported backend.
- Save the statistics in a dictionary, which will be used during the transformation process.

```{python}
def fit_table(self, table: ir.Table, metadata: Metadata) -> None:
    # Step 1: Get the column names that match the selector
    columns = self.inputs.select_columns(table, metadata)

    # Step 2: Initialize a dictionary to store statistics
    stats = {}
    # Step 3: If there are columns selected, calculate statistics for each column
    if columns:
        # Create a list to hold Ibis aggregation expressions
        aggs = []
        # Step 4: Iterate over each selected column
        for name in columns:
            # Get the column from the table
            c = table[name]
            # Build Ibis expressions for median, 25th percentile, and 75th percentile
            aggs.append(c.median().name(f"{name}_median"))
            aggs.append(c.quantile(0.25).name(f"{name}_25"))
            aggs.append(c.quantile(0.75).name(f"{name}_75"))

        # Step 5: Evaluate the Ibis expressions in one run
        results = table.aggregate(aggs).execute().to_dict("records")[0]
        # Step 6: Save the statistics in the dictionary
        for name in columns:
            stats[name] = (
                results[f"{name}_median"],
                results[f"{name}_25"],
                results[f"{name}_75"],
            )
    # Step 7: Store the statistics in an instance variable
    self.stats_ = stats
```

### Step 3: Implement `transform_table`

The `transform_table` method is used to apply the learned transformation to the input data. This method takes the input table and transforms it based on the previously calculated statistics. Here's how to implement `transform_table`:

```{python}
def transform_table(self, table):
    # Apply the transformation to each column
    return table.mutate(
        [
            # Apply the transformation formula: (x - median) / (p75 - p25)
            ((table[c] - median) / (p75 - p25)).name(c)  # type: ignore
            for c, (median, p25, p75) in self.stats_.items()
        ]
    )
```

### Step 4: Test
Now let's put the code together and perform some simple tests to verify the results.

```{python}
class CustomRobustScale(Step):

    def __init__(self, inputs: SelectionType):
        # Select the columns that will be involved in the transformation
        self.inputs = selector(inputs)

    def fit_table(self, table: ir.Table, metadata: Metadata) -> None:
        # Step 1: Get the column names that match the selector
        columns = self.inputs.select_columns(table, metadata)

        # Step 2: Initialize a dictionary to store statistics
        stats = {}
        # Step 3: If there are columns selected, calculate statistics for each column
        if columns:
            # Create a list to hold Ibis aggregation expressions
            aggs = []
            # Step 4: Iterate over each selected column
            for name in columns:
                # Get the column from the table
                c = table[name]
                # Build Ibis expressions for median, 25th percentile, and 75th percentile
                aggs.append(c.median().name(f"{name}_median"))
                aggs.append(c.quantile(0.25).name(f"{name}_25"))
                aggs.append(c.quantile(0.75).name(f"{name}_75"))

            # Step 5: Evaluate the Ibis expressions in one run
            results = table.aggregate(aggs).execute().to_dict("records")[0]
            # Step 6: Save the statistics in the dictionary
            for name in columns:
                stats[name] = (
                    results[f"{name}_median"],
                    results[f"{name}_25"],
                    results[f"{name}_75"],
                )
        # Step 7: Store the statistics in an instance variable
        self.stats_ = stats

    def transform_table(self, table):
        # Apply the transformation to each column
        return table.mutate(
            [
                # Apply the transformation formula: (x - median) / (p75 - p25)
                ((table[c] - median) / (p75 - p25)).name(c)  # type: ignore
                for c, (median, p25, p75) in self.stats_.items()
            ]
        )
```

This code creates sample data for four columns: "string_col", "int_col", "floating_col", and "target_col", each containing 10 rows of data. The `train_table` variable holds the resulting Ibis in-memory table.

```{python}
import numpy as np

# Enable interactive mode for Ibis
ibis.options.interactive = True

train_size = 10
data = {
    "string_col": np.array(["a"] * train_size, dtype="str"),
    "int_col": np.arange(train_size, dtype="int64"),
    "floating_col": np.arange(train_size, dtype="float64"),
    "target_col": np.arange(train_size, dtype="int8"),
}
train_table = ibis.memtable(data)
train_table
```

This code initializes a transformer instance of `CustomRobustScale` with the specified columns to scale. Then, it creates a `Metadata` object with target columns. The transformer is fitted to the training data and metadata using the `fit_table` method. Finally, the `transform_table` method is used to transform the training table with the fitted transformer.

```{python}
# Instantiate CustomRobustScale transformer with the specified columns to scale
# # Select only one column: "int_col"
robust_scale = CustomRobustScale(["int_col"])
# # Select all numeric columns
# robust_scale = CustomRobustScale(ml.numeric())

# Create Metadata object with target columns
metadata = Metadata(targets=("target_col",))

# Fit the transformer to the training data and metadata
robust_scale.fit_table(train_table, metadata)

# Transform the training table using the fitted transformer
transformed_train_table = robust_scale.transform_table(train_table)

transformed_train_table
```

Review comment: Double comments

Access the calculated statistics for each column:

```{python}
robust_scale.stats_
```

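One way to sanity-check the fitted statistics is to recompute them by hand for the "int_col" values 0 through 9. The sketch below uses only the standard library and assumes the backend interpolates linearly between data points when computing quantiles; backends that use a different quantile definition may report slightly different quartiles.

```python
from statistics import quantiles

int_col = list(range(10))  # same values as "int_col" in the sample table

# Assumption: linear interpolation between order statistics
p25, median, p75 = quantiles(int_col, n=4, method="inclusive")

# Same ordering as the tuples stored in stats_: (median, p25, p75)
expected_stats = (median, p25, p75)
expected_scaled = [(x - median) / (p75 - p25) for x in int_col]

print(expected_stats)  # (4.5, 2.25, 6.75)
print(expected_scaled[0], expected_scaled[-1])  # -1.0 1.0
```

Under that assumption, the scaled "int_col" should run from -1.0 to 1.0 in equal steps.
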
### Additional Considerations

Certainly! Here are additional checks and considerations to ensure the transformer handles unexpected data types or conditions gracefully:

Review comment: This reads like ChatGPT

- Check for numeric olumns: Ensure that selected columns are numeric before calculating statistics. This prevents errors when trying to calculate statistics on non-numeric data.

Review comment: olumns -> columns

- Check for zero Interquartile Range (IQR): Verify that the IQR (the difference between the 75th and 25th percentiles) is not zero. A zero IQR means the 25th and 75th percentiles coincide (for example, when all values in the column are the same), so the scaling formula would divide by zero.
- Backend compatibility: Validate if [operators](https://ibis-project.org/backends/support/matrix) used by ibisML are supported by your chosen backend. This ensures seamless integration and execution of transformations across different environments.

Review comment: Is this a consideration for people creating a step?

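The zero-IQR check can be sketched as a small helper that inspects the fitted statistics. `find_zero_iqr` is a hypothetical function, not part of IbisML, and the statistics dictionary below is made-up data that follows the `(median, p25, p75)` layout of `stats_`.

```python
def find_zero_iqr(stats: dict[str, tuple[float, float, float]]) -> list[str]:
    """Return names of columns whose IQR is zero.

    `stats` maps column name -> (median, p25, p75), the layout used by `stats_`.
    """
    return [name for name, (_median, p25, p75) in stats.items() if p75 == p25]


# Hypothetical fitted statistics: "constant_col" has identical quartiles
stats = {"int_col": (4.5, 2.25, 6.75), "constant_col": (1.0, 1.0, 1.0)}
zero_iqr_columns = find_zero_iqr(stats)
print(zero_iqr_columns)  # ['constant_col']
```

What to do with the flagged columns is a design choice: a step could raise an error at the end of `fit_table`, or leave those columns unscaled in `transform_table`.
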
## Conclusion

Custom transformers offer a high degree of flexibility and control over data preprocessing tasks. They excel at encapsulating specific steps within the data processing pipeline, which greatly enhances code manageability. If you haven't already, I highly recommend exploring their capabilities and integrating them into your workflow. They can be a valuable asset in streamlining and optimizing your data preprocessing processes.

Comment on lines +260 to +262

Review comment: Most FAQs don't have conclusions

## <span style="color:red">🚀 Contribution Welcome!</span>

Review comment: Why is this randomly red with an emoji

Feel free to contribute to our transformations by implementing your own custom transformers or suggesting ones that you find essential. You can do so by checking our transformation [priorities](https://github.com/ibis-project/ibis-ml/issues/32), discussing ideas through creating [issues](https://github.com/ibis-project/ibis-ml/issues), or submitting pull requests (PRs) with your implementations. We welcome collaboration and value input from all contributors. Your ideas and implementations can enrich our library of transformations, making it more comprehensive and useful for everyone involved in data preprocessing tasks. Let's collaborate to enhance the efficiency and effectiveness of our data processing workflows together.

Review comment: Also, docs should usually have a separate page (or pages) on contribution guidelines, not as part of a different FAQ. It should also include all the information like how to set up your development environment, etc. at some point. This may not be an immediate priority (can check).

Review comment: Use hyphens instead; also, don't capitalize "How"