How about adding a feature to pass the key when performing map on DatasetDict? #7356

jp1924 · 2025-01-06T08:13:52Z

Feature request

Add a feature to pass the key of the DatasetDict when performing map

Motivation

I often preprocess using map on DatasetDict.
Sometimes, I need to preprocess train and valid data differently depending on the task.
So, I thought it would be nice to pass the key (like train, valid) when performing map on DatasetDict.

What do you think?

Your contribution

I can submit a pull request to add the feature to pass the key of the DatasetDict when performing map.

jp1924 · 2025-01-13T07:51:20Z

@lhoestq
If it's okay with you, can I work on this?

lhoestq · 2025-01-13T11:30:01Z

Hi ! Can you give an example of what it would look like to use this new feature ?

Note that currently you can already do

ds["train"] = ds["train"].map(process_train)
ds["test"] = ds["test"].map(process_test)

jp1924 · 2025-01-13T12:23:57Z

@lhoestq
Thanks for the response!
Let me clarify what I'm looking for with an example:

Currently, we need to write separate processing functions or call .map() separately:

# Current approach
def process_train(example):
    # Training-specific processing
    return example

def process_valid(example):
    # Validation-specific processing
    return example

ds["train"] = ds["train"].map(process_train)
ds["valid"] = ds["valid"].map(process_valid)

What I'm proposing is to have a single processing function that knows which split it's processing:

# Proposed feature
def process(example, split_key):
    if split_key == "train":
        pass  # Training-specific processing
    elif split_key == "valid":
        pass  # Validation-specific processing
    return example

# Using with_key=True to pass the split information
ds = ds.map(process, with_key=True)

This becomes particularly useful when:

- The processing logic is heavily shared between splits but needs minor adjustments
- You want to maintain the processing logic in one place for better maintainability
- The processing function is complex and you want to avoid duplicating code
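To make the first point concrete, here is a minimal sketch of a function whose logic is mostly shared, with one train-only step. The `text` field, the `MAX_TRAIN_CHARS` constant, and the truncation rule are assumptions chosen for illustration, not part of the proposal:

```python
# Hypothetical cap, used only for this illustration.
MAX_TRAIN_CHARS = 16


def process(example: dict, split_key: str) -> dict:
    """Shared preprocessing with a single split-specific step."""
    # Shared step: collapse runs of whitespace for every split.
    text = " ".join(example["text"].split())
    if split_key == "train":
        # Train-only step (assumed): cap the length of training examples.
        text = text[:MAX_TRAIN_CHARS]
    # Return a new dict instead of mutating the input example.
    return {**example, "text": text}
```

With the proposed flag, a function like this would be passed once to `ds.map(...)` rather than being split into two near-identical functions.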

So I wanted to request this feature to achieve this kind of functionality.
I've created a draft PR implementing this: https://github.com/huggingface/datasets/pull/7240/files

lhoestq · 2025-01-13T13:26:50Z

I see ! I think it makes sense, and it's more readable than doing something like this:

from functools import partial
ds = DatasetDict({key: ds[key].map(partial(process, split_key=key)) for key in ds})

PS: you named the argument with_key in your example, but it might be even clearer if it's named with_split, no?
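Until such a flag exists, the `partial`-based workaround above could be wrapped in a small helper. This is only a sketch: `map_with_split` is a hypothetical name, not a `datasets` API, and it assumes each split exposes a `.map(function)` method and that the function accepts a `split_key` keyword argument.

```python
from functools import partial
from typing import Any, Callable, Mapping


def map_with_split(splits: Mapping[str, Any], function: Callable[..., Any]) -> dict:
    """Call `.map` on every split, passing the split name as `split_key`."""
    return {
        # partial binds the current split name at creation time,
        # so each split gets its own key.
        name: split.map(partial(function, split_key=name))
        for name, split in splits.items()
    }
```

Since the helper returns a plain dict, with a real `DatasetDict` the result would presumably be wrapped back, for example `ds = DatasetDict(map_with_split(ds, process))`.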

jp1924 · 2025-01-13T14:16:51Z

@lhoestq I agree.
It seems better to use with_split.
So can I open a PR with this change?

lhoestq · 2025-01-13T14:30:46Z

Sure !

jp1924 added the enhancement New feature or request label Jan 6, 2025

jp1924 mentioned this issue Jan 13, 2025: Add with_split to DatasetDict.map #7368 (Open)
