
Add NLLB (M2M100) support #769

Open

wants to merge 1 commit into base: main
Conversation

@vrmer commented Dec 17, 2024

I implemented AdapterHub support for the Facebook NLLB model and its underlying M2M100 architecture. I've run all the relevant tests, auto-formatting, and quality checks.

The code passes 124 tests, skips 7, and fails 11. The 11 failing tests all relate to Parallel composition blocks and flex heads, neither of which I implemented. Since this is a machine translation model, it does not need classification heads on top, but I couldn't figure out how to disable the irrelevant head_types in the ADAPTER_MODEL_MAPPING dictionary so that these tests are skipped.

Any advice on this is greatly appreciated!

Key addition:

  • A new M2M100AdapterModel class, with the relevant WithAdapters and AdaptersMixin classes implemented (see the usage sketch below).
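For context, once merged the new class should be usable like the library's other AdapterModel classes. A minimal sketch, assuming M2M100AdapterModel is exported from the adapters package in the usual way; the checkpoint, adapter name, and config string are illustrative choices, not taken from this PR:

from adapters import M2M100AdapterModel  # added by this PR

# Load NLLB weights into the new flex-head adapter model class.
model = M2M100AdapterModel.from_pretrained("facebook/nllb-200-distilled-600M")

# Add a bottleneck adapter and a seq2seq LM head, then train only the adapter weights.
model.add_adapter("nllb_adapter", config="seq_bn")
model.add_seq2seq_lm_head("nllb_adapter")
model.train_adapter("nllb_adapter")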

@calpt (Member) left a comment:

Thanks a lot for working on this! It already looks pretty good overall; I left some comments regarding the open issues that will hopefully be helpful.

Once that's done, please also add this new model type to the docs as described in the contributing guide, thanks!

Comment on lines +25 to +30
head_types = [
    "classification",
    "multilabel_classification",
    "question_answering",
    "seq2seq_lm",
]
@calpt (Member):
This defines the range of supported heads. Since I believe we'd only want to support sequence generation, you can remove everything except for "seq2seq_lm" here.
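With only generation supported, the list above would shrink to a single entry, i.e. roughly:

head_types = [
    "seq2seq_lm",
]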

Comment on lines +55 to +56
    ParallelAdapterInferenceTestMixin,
    ParallelTrainingMixin,
@calpt (Member):
In case we don't want to support Parallel composition (which is totally fine), please remove these two mixins to disable the tests.

Otherwise, you can declare it as supported by adding the model type to the SUPPORTED_MODELS = { ... } dictionary (since, looking at your code, I believe the implementation is already mostly there).
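For the first option (not supporting Parallel composition), the change amounts to dropping the two mixins from the test class's bases. A sketch only; the class name is assumed and the remaining mixins are placeholders, since they are not shown in this excerpt:

import unittest


class M2M100AdapterTest(  # name assumed; use the PR's actual test class
    # ... the other adapter test mixins stay unchanged ...
    # ParallelAdapterInferenceTestMixin,  <- removed
    # ParallelTrainingMixin,              <- removed
    M2M100AdapterTestBase,
    unittest.TestCase,
):
    pass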

from .test_adapter_heads import PredictionHeadModelTestMixin


class M2M100AdapterTestBase(AdapterTestBase):
@calpt (Member):

Since this model doesn't support text classification tasks (and we test text classification training by default in the test runs), we'd have to override the add_head() and dataset() methods here. That would roughly look like this:

# (assumes the test module already imports AutoTokenizer from transformers and load_dataset from datasets)

def add_head(self, model, name, **kwargs):
    model.add_masked_lm_head(name)  # taken from another model's test; for M2M100 a seq2seq LM head would presumably fit better
    return self.default_input_samples_shape[-1]

def dataset(self, tokenizer=None):
    # setup tokenizer
    if tokenizer is None:
        tokenizer = AutoTokenizer.from_pretrained(self.tokenizer_name, use_fast=False)
        if tokenizer.pad_token is None:
            tokenizer.pad_token = tokenizer.eos_token

    def preprocess_function(examples):
        inputs = examples["document"]
        targets = examples["summary"]
        inputs = ["Summarize: " + inp for inp in inputs]
        model_inputs = tokenizer(inputs, padding="max_length", truncation=True, max_length=128)

        # Setup the tokenizer for targets
        with tokenizer.as_target_tokenizer():
            labels = tokenizer(targets, padding="max_length", truncation=True, max_length=128)

        # If we are padding here, replace all tokenizer.pad_token_id in the labels by -100 when we want to ignore
        # padding in the loss.
        labels["input_ids"] = [
            [(l if l != tokenizer.pad_token_id else -100) for l in label] for label in labels["input_ids"]
        ]

        model_inputs["labels"] = labels["input_ids"]
        return model_inputs

    data_args = {
        "task_name": "xsum",
        "path": "./tests/fixtures/samples/xsum/sample.json",
    }
    dataset = load_dataset("json", data_files=data_args["path"])
    train_dataset = dataset["train"]
    train_dataset = train_dataset.map(
        preprocess_function,
        batched=True,
        desc="Running tokenizer on train dataset",
    )
    return train_dataset
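One note on the snippet above: padded label positions are replaced with -100 because that is the default ignore_index of PyTorch's cross-entropy loss, so padding tokens don't contribute to the training loss. The xsum sample file referenced in data_args is presumably the same fixture the existing seq2seq model tests already use.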
