Refactor ContrastiveDataset and ContrastiveDistillationDataset #579

DemirTonchev · 2024-12-25T11:13:24Z

Refactors ContrastiveDataset and ContrastiveDistillationDataset to generate pairs lazily. The trainer code is also updated to create the dataset from an iterator. This changes how pairs are generated; see #578 for more motivation.

dataset = Dataset.from_generator(data_sampler.__iter__): this change makes it possible to work with arbitrarily large datasets (the trade-off is the on-disk cache managed by the Arrow dataset).
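To make the lazy-generation idea concrete, here is a minimal sketch using only the standard library. `LazyPairSampler` and its field names are hypothetical stand-ins for illustration, not SetFit's actual classes; the point is only that `__iter__` yields one pair at a time instead of building the full list of pairs in memory.

```python
from itertools import combinations
from typing import Dict, Iterator, List, Union

Pair = Dict[str, Union[str, float]]


class LazyPairSampler:
    """Hypothetical pair sampler for illustration (not SetFit's class)."""

    def __init__(self, sentences: List[str], labels: List[int]) -> None:
        self.sentences = sentences
        self.labels = labels

    def __iter__(self) -> Iterator[Pair]:
        # Pairs are produced one at a time; the full O(n^2) list is never stored.
        for i, j in combinations(range(len(self.sentences)), 2):
            yield {
                "sentence_1": self.sentences[i],
                "sentence_2": self.sentences[j],
                # 1.0 marks a positive pair (same label), 0.0 a negative pair.
                "label": 1.0 if self.labels[i] == self.labels[j] else 0.0,
            }


sampler = LazyPairSampler(["good", "great", "bad"], [1, 1, 0])
pairs = list(sampler)  # materialized here only to inspect the output
```

With the `datasets` library installed, a bound method such as `sampler.__iter__` could be handed to `Dataset.from_generator` as in the line above, which would write the generated pairs to the Arrow cache on disk rather than holding them in RAM.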

This also addresses the ContrastiveDistillationDataset bug described in #578.

fixes: #578

HuggingFaceDocBuilderDev · 2025-01-10T12:54:55Z

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

tomaarsen · 2025-01-13T10:24:16Z

Hello!

I ran a few more tests and I'm running into an issue: if a dataset has no positive pairs at all, the generator loops indefinitely, because it keeps going until it finds, e.g., 3 positive pairs.

This was also causing infinite loops via the test_raise_when_metric_value_is_invalid tests.

Tom Aarsen

DemirTonchev · 2025-01-13T14:57:26Z

Fixed.
This occurred only when using num_iterations.
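The failure mode and one possible guard can be sketched as follows. This is an illustrative assumption about the shape of the fix, not the code from this PR; `take_positive_pairs` and `wanted` are hypothetical names. The generator keeps cycling over the data until it has produced the requested number of positive pairs, but stops as soon as a full pass finds none.

```python
from itertools import combinations
from typing import Iterator, List, Tuple


def take_positive_pairs(labels: List[int], wanted: int) -> Iterator[Tuple[int, int]]:
    """Lazily yield `wanted` same-label index pairs, repeating passes if needed."""
    produced = 0
    while produced < wanted:
        found_any = False
        for i, j in combinations(range(len(labels)), 2):
            if labels[i] == labels[j]:
                found_any = True
                yield (i, j)
                produced += 1
                if produced == wanted:
                    return
        if not found_any:
            # A full pass found no positive pair: stop instead of looping forever.
            return


no_positives = list(take_positive_pairs([0, 1, 2], wanted=3))
one_positive = list(take_positive_pairs([0, 0, 1], wanted=3))
```

Without the `found_any` check, the first call would never terminate, which matches the infinite loop reported above.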

DemirTonchev · 2025-01-13T19:44:53Z

Fixed the same problem when using sampling strategies.
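For the sampling strategies, a comparable safeguard might look like the sketch below. The strategy names mirror SetFit's `sampling_strategy` option values, but the function `pairs_to_draw` and the exact policy for empty sides are assumptions made for illustration, not necessarily what this PR implements.

```python
from typing import Tuple


def pairs_to_draw(n_pos: int, n_neg: int, strategy: str) -> Tuple[int, int]:
    """Return how many (positive, negative) pairs to draw for a strategy."""
    if strategy == "unique":
        # Every distinct pair once; the result may be unbalanced.
        return n_pos, n_neg
    if strategy == "undersampling":
        smaller = min(n_pos, n_neg)
        return smaller, smaller
    if strategy == "oversampling":
        if n_pos == 0 or n_neg == 0:
            # Assumed safeguard: an empty side has nothing to repeat,
            # so do not try to balance against it.
            return n_pos, n_neg
        larger = max(n_pos, n_neg)
        return larger, larger
    raise ValueError(f"unknown sampling strategy: {strategy!r}")
```

The key design point is that oversampling repeats pairs from the smaller side, so it needs at least one pair on each side before a "draw until balanced" loop can terminate.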

DemirTonchev added 10 commits December 20, 2024 15:03

fixed to work with processing_class instead tokenizer after transformers 4.45.2

cb4e803

refactor attempt for ContrastiveDataset so that it does not blow RAM with bigger dataset

6a69303

added Samplit strategy enum

09869e1

improved logic and fixed iterator pattern

82474dc

fix for negative samples formula

d01dc36

added multilalbel support as in the original implementation

53eaace

ContrastiveDataset iterator refactor

3e3fa5f

trainer fixed to work with ContrastiveDataset iter method

2d5e29b

ContrastiveDistillationDataset iter refactor

1c905b1

typing fix

9cafa02

tomaarsen and others added 8 commits January 10, 2025 15:57

Merge branch 'main' into pr-579

421b1e5

TypeAlias will be deprecated in 3.12 again, so let's avoid it

ea088ff

It also doesn't exist in 3.9 yet. https://docs.python.org/3/library/typing.html#typing.TypeAlias

Remove args.sampling_strategy from ContrastiveDistillationDataset init

467c507

Run formatting

9784485

Add 'save_strategy="no"' in tests to counteract transformers v4.48.0 bug

3f67190

fix some autocomplete mistake

509ee68

Merge branch 'refactor-contrds' of github.com:DemirTonchev/setfit into refactor-contrds

8fdbbb3

cleanup leftovers attributes from old logic

7d9119b

tomaarsen mentioned this pull request Jan 13, 2025

[tests] Add 'save_strategy="no"' in tests to counteract transformers v4.48.0 bug #582

Merged

DemirTonchev added 2 commits January 13, 2025 16:53

docs

fe30eaa

fix to correctly specify positives and negatives when passing num_iterations

e7f9cf0

DemirTonchev added 3 commits January 13, 2025 18:34

fix for negative or positive examples only(not really possible)

d989f22

removed unnecessary np uses and casting

d229add

safeguard for oversampling strategy, when DS is negatives only or single pos pair

ea88e9b
