
Add NLLB (M2M100) support #769

Open

wants to merge 1 commit into base: main
Conversation

@vrmer commented Dec 17, 2024

I implemented AdapterHub support for the Facebook NLLB model and its underlying M2M100 architecture. I've run all the relevant tests, auto-formatting, and quality checks.

The code passes 124 tests, skips 7, and fails 11. The 11 failing tests all relate to Parallel composition blocks and flex heads, neither of which I implemented. Since this is a machine translation model, it does not need classification heads on top, but I couldn't figure out how to disable the irrelevant head_types in the ADAPTER_MODEL_MAPPING dictionary so that these tests are skipped.

Any advice on this is greatly appreciated!

Key addition:

  • A new M2M100AdapterModel class, with the relevant WithAdapters and AdaptersMixin classes implemented (see the usage sketch below).
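For context, once merged the new class should be usable like the library's other AdapterModel classes. A minimal sketch, assuming M2M100AdapterModel is exported from the adapters package in the usual way; the checkpoint, adapter name, and config string are illustrative choices, not taken from this PR:

from adapters import M2M100AdapterModel  # added by this PR

# Load NLLB weights into the new flex-head adapter model class.
model = M2M100AdapterModel.from_pretrained("facebook/nllb-200-distilled-600M")

# Add a bottleneck adapter and a seq2seq LM head, then train only the adapter weights.
model.add_adapter("nllb_adapter", config="seq_bn")
model.add_seq2seq_lm_head("nllb_adapter")
model.train_adapter("nllb_adapter")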

@calpt (Member) left a comment:

Thanks a lot for working on this! It already looks pretty good overall; I left some comments regarding the open issues that will hopefully be helpful.

Once that's done, please also add this new model type to the docs as described in the contributing guide, thanks!

Comment on lines +25 to +30
head_types = [
    "classification",
    "multilabel_classification",
    "question_answering",
    "seq2seq_lm",
]
@calpt (Member):
This defines the range of supported heads. Since I believe we'd only want to support sequence generation, you can remove everything except for "seq2seq_lm" here.
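With only generation supported, the list above would shrink to a single entry, i.e. roughly:

head_types = [
    "seq2seq_lm",
]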

Comment on lines +55 to +56
    ParallelAdapterInferenceTestMixin,
    ParallelTrainingMixin,
@calpt (Member):
In case we don't want to support Parallel composition (which is totally fine), please remove these two mixins to disable the tests.

Otherwise, you can declare it as supported by adding the model type to the SUPPORTED_MODELS = { ... } dictionary (since, looking at your code, I believe the implementation is already mostly there).
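For the first option (not supporting Parallel composition), the change amounts to dropping the two mixins from the test class's bases. A sketch only; the class name is assumed and the remaining mixins are placeholders, since they are not shown in this excerpt:

import unittest


class M2M100AdapterTest(  # name assumed; use the PR's actual test class
    # ... the other adapter test mixins stay unchanged ...
    # ParallelAdapterInferenceTestMixin,  <- removed
    # ParallelTrainingMixin,              <- removed
    M2M100AdapterTestBase,
    unittest.TestCase,
):
    pass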

from .test_adapter_heads import PredictionHeadModelTestMixin


class M2M100AdapterTestBase(AdapterTestBase):
@calpt (Member):

Since this model doesn't support text classification tasks (and we test text classification training by default in the test runs), we'd have to override the add_head() and dataset() methods here. That would roughly look like this:

# (assumes the test module already imports AutoTokenizer from transformers and load_dataset from datasets)

def add_head(self, model, name, **kwargs):
    model.add_masked_lm_head(name)  # taken from another model's test; for M2M100 a seq2seq LM head would presumably fit better
    return self.default_input_samples_shape[-1]

def dataset(self, tokenizer=None):
    # setup tokenizer
    if tokenizer is None:
        tokenizer = AutoTokenizer.from_pretrained(self.tokenizer_name, use_fast=False)
        if tokenizer.pad_token is None:
            tokenizer.pad_token = tokenizer.eos_token

    def preprocess_function(examples):
        inputs = examples["document"]
        targets = examples["summary"]
        inputs = ["Summarize: " + inp for inp in inputs]
        model_inputs = tokenizer(inputs, padding="max_length", truncation=True, max_length=128)

        # Setup the tokenizer for targets
        with tokenizer.as_target_tokenizer():
            labels = tokenizer(targets, padding="max_length", truncation=True, max_length=128)

        # If we are padding here, replace all tokenizer.pad_token_id in the labels by -100 when we want to ignore
        # padding in the loss.
        labels["input_ids"] = [
            [(l if l != tokenizer.pad_token_id else -100) for l in label] for label in labels["input_ids"]
        ]

        model_inputs["labels"] = labels["input_ids"]
        return model_inputs

    data_args = {
        "task_name": "xsum",
        "path": "./tests/fixtures/samples/xsum/sample.json",
    }
    dataset = load_dataset("json", data_files=data_args["path"])
    train_dataset = dataset["train"]
    train_dataset = train_dataset.map(
        preprocess_function,
        batched=True,
        desc="Running tokenizer on train dataset",
    )
    return train_dataset
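One note on the snippet above: padded label positions are replaced with -100 because that is the default ignore_index of PyTorch's cross-entropy loss, so padding tokens don't contribute to the training loss. The xsum sample file referenced in data_args is presumably the same fixture the existing seq2seq model tests already use.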
