How about adding a feature to pass the key when performing map on DatasetDict? #7356
Comments
@lhoestq
Hi! Can you give an example of what it would look like to use this new feature? Note that currently you can already do:
    ds["train"] = ds["train"].map(process_train)
    ds["test"] = ds["test"].map(process_test)
@lhoestq Currently, we need to write separate processing functions or call .map() separately:
    # Current approach
    def process_train(example):
        # Training-specific processing
        return example

    def process_valid(example):
        # Validation-specific processing
        return example

    ds["train"] = ds["train"].map(process_train)
    ds["valid"] = ds["valid"].map(process_valid)
What I'm proposing is to have a single processing function that knows which split it's processing:
    # Proposed feature
    def process(example, split_key):
        if split_key == "train":
            pass  # Training-specific processing
        elif split_key == "valid":
            pass  # Validation-specific processing
        return example

    # Using with_key=True to pass the split information
    ds = ds.map(process, with_key=True)
This becomes particularly useful when:
So I wanted to request this feature to achieve this kind of functionality.
I see! I think it makes sense, and it's more readable than doing something like this:
    from functools import partial
    ds = DatasetDict({key: ds[key].map(partial(process, split_key=key)) for key in ds})
PS: you named the argument
@lhoestq I agree.
Sure!
Feature request
Add a feature to pass the key of the DatasetDict when performing map
Motivation
I often preprocess using map on DatasetDict.
Sometimes, I need to preprocess train and valid data differently depending on the task.
So, I thought it would be nice to pass the key (like train, valid) when performing map on DatasetDict.
What do you think?
Your contribution
I can submit a pull request to add the feature to pass the key of the DatasetDict when performing map.
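Until such an option exists, the `partial`-based workaround from the thread can be wrapped in a small helper. The following is only a minimal sketch, not the library's API: `map_with_split_key` is a hypothetical name, and `ListSplit` is a stand-in class used purely so the example runs without any dataset installed. The helper assumes only that each split exposes a `.map(function)` method and that the mapping supports `.items()`, which a `DatasetDict` does.

```python
from functools import partial


def map_with_split_key(splits, function, **map_kwargs):
    """Apply `function` to every split, passing the split name as `split_key`.

    `splits` is any mapping from split name to an object with a
    `.map(function, **kwargs)` method. Returns a plain dict; wrap the
    result in `DatasetDict(...)` if that type is needed.
    (Hypothetical helper, not part of the `datasets` library.)
    """
    return {
        key: split.map(partial(function, split_key=key), **map_kwargs)
        for key, split in splits.items()
    }


class ListSplit:
    """Stand-in for a dataset split, used only to make this sketch runnable."""

    def __init__(self, rows):
        self.rows = rows

    def map(self, function):
        # Copy each row so the original split is left untouched.
        return ListSplit([function(dict(row)) for row in self.rows])


def process(example, split_key):
    # Training-specific processing: upper-case the text for "train" only.
    if split_key == "train":
        example["text"] = example["text"].upper()
    # Record which split the example came from.
    example["split"] = split_key
    return example


ds = {
    "train": ListSplit([{"text": "a"}]),
    "valid": ListSplit([{"text": "b"}]),
}
out = map_with_split_key(ds, process)
```

With a real `DatasetDict`, the same call would presumably be `DatasetDict(map_with_split_key(ds, process))`; a built-in option such as the proposed `with_key=True` would remove the need for the wrapper.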