From Masked Language Modeling to Translation: Non-English Auxiliary Tasks Improve Zero-shot Spoken Language Understanding
This repository contains all code to reproduce the results in the paper. If you are just here for the data, see the data/ folder (AND DON'T FORGET TO CITE THE PAPERS LISTED BELOW).
NOTE: Thanks to Evgeniia Razumovskaia, we have found some errors in the original data. We have kept the original data in the repo for reproducibility reasons; the fixed version can be found in data/xSID-0.2. The main differences are in Arabic, where we forgot to map the intent labels.
NOTE 02-11-2021: Thanks to Noëmi Aepli, we have fixed the sentences in the English dev/test files to match the other languages. And thanks to Milan Grita we have corrected mismatches in words in the comments and the annotation (mainly in the English files).
NOTE 29-03-2022: There was an error in the nluEval.py code for calculating the loose overlap (thanks to Mike Zhang for finding and fixing it). The results in the paper for loose f1 were unfortunately overly optimistic.
UPDATE 05-2023: version 0.4 is released, including Neapolitan and Swiss German
UPDATE 04-2024: version 0.5 is released, including Bavarian and Lithuanian. Please also see https://github.com/mainlp/NaLiBaSID/tree/main for alternatives for these languages (including native queries!).
To reproduce all results in the paper, you have to run ./scripts/runAll.sh. However, in practice this would take a very long time (especially when rerunning nmt-transfer, ./scripts/1.*), which is why we suggest inspecting ./scripts/runAll.sh, deciding which parts are relevant for you, and manually running the required commands in parallel. Note that some of the scripts may fail, because we used the MultiAtis dataset, which is not publicly available. To run the experiments on those, obtain the data, convert it with scripts/tsv2conll.py in data/multiAtis. Alternatively, one could remove the MultiAtis key from the datasets dictionary in scripts/myutils.py. All tables and graphs of the paper can be reproduced by running scripts/genAll.sh.
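As a minimal sketch of the two workarounds described above (not part of the repository): run_parallel runs a hand-picked list of shell commands concurrently, and without_dataset shows what leaving out the MultiAtis key amounts to. Both function names are hypothetical, the commands in the usage part are placeholders for lines you select from ./scripts/runAll.sh, and the actual contents of the datasets dictionary in scripts/myutils.py are not assumed here.

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor


def run_parallel(commands, max_workers=4):
    """Run shell commands concurrently; return {command: exit code}."""
    def run_one(cmd):
        # shell=True because each entry is a full shell line; only pass
        # commands that you have inspected yourself.
        return cmd, subprocess.run(cmd, shell=True).returncode

    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return dict(pool.map(run_one, commands))


def without_dataset(datasets, name="MultiAtis"):
    """Return a copy of a datasets dictionary with one key left out."""
    return {key: value for key, value in datasets.items() if key != name}


if __name__ == "__main__":
    # Hypothetical placeholders: substitute the lines you selected from
    # ./scripts/runAll.sh.
    selected = ["echo first", "echo second"]
    for cmd, code in run_parallel(selected, max_workers=2).items():
        print(code, cmd)
```

A non-zero exit code in the returned dictionary points at the commands that failed, for example those that depend on the MultiAtis data.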
The experiments in the scripts folder are divided by numeric prefix:
- 0.* Data preparation and setup
- 1.* Translate the training data for the nmt-transfer model
- 2.* Train MaChAmp for all languages and predict on all files
- 3.* Generate the main table in the paper
- 4.* Run on the test data
- 5.* All additional experiments for the analysis section in the paper
- 6.* Generate all tables for the appendix
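To decide which stages to run, it can help to list the scripts per numeric prefix. The following is a rough sketch, not part of the repository: group_by_stage is a hypothetical helper, it assumes only that script names start with a number followed by a dot (as in the list above), and the stage descriptions are copied from that list.

```python
import re
from collections import defaultdict
from pathlib import Path

# Stage descriptions, taken from the list in this README.
STAGES = {
    "0": "data preparation and setup",
    "1": "translate training data for nmt-transfer",
    "2": "train MaChAmp and predict",
    "3": "generate the main table",
    "4": "run on the test data",
    "5": "additional experiments for the analysis section",
    "6": "generate tables for the appendix",
}


def group_by_stage(filenames):
    """Map each numeric prefix to the sorted file names that start with it."""
    groups = defaultdict(list)
    for name in filenames:
        match = re.match(r"(\d+)\.", name)
        if match:  # names without a numeric prefix (e.g. runAll.sh) are skipped
            groups[match.group(1)].append(name)
    return {stage: sorted(names) for stage, names in sorted(groups.items())}


if __name__ == "__main__":
    scripts = Path("scripts")
    if scripts.is_dir():
        names = (path.name for path in scripts.iterdir())
        for stage, files in group_by_stage(names).items():
            print(stage, STAGES.get(stage, "unknown stage"), files)
```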
For many parts we also included the outputs so that it is not always necessary to re-run. The automatically translated training data as well as the automatically converted English training data can be found in /data/xSID/. The output predictions and scores of all models are also included in predictions/.
This code is largely based on MaChAmp; we include a copy of the exact version of MaChAmp that was used.
@inproceedings{van-der-goot-etal-2021-masked,
title = "From Masked Language Modeling to Translation: Non-{E}nglish Auxiliary Tasks Improve Zero-shot Spoken Language Understanding",
author = {van der Goot, Rob and
Sharaf, Ibrahim and
Imankulova, Aizhan and
{\"U}st{\"u}n, Ahmet and
Stepanovi{\'c}, Marija and
Ramponi, Alan and
Khairunnisa, Siti Oryza and
Komachi, Mamoru and
Plank, Barbara},
editor = "Toutanova, Kristina and
Rumshisky, Anna and
Zettlemoyer, Luke and
Hakkani-Tur, Dilek and
Beltagy, Iz and
Bethard, Steven and
Cotterell, Ryan and
Chakraborty, Tanmoy and
Zhou, Yichao",
booktitle = "Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies",
month = jun,
year = "2021",
address = "Online",
publisher = "Association for Computational Linguistics",
url = "https://aclanthology.org/2021.naacl-main.197",
doi = "10.18653/v1/2021.naacl-main.197",
pages = "2479--2497",
}
If you use version >= 0.4 (which includes Neapolitan and Swiss German), please also cite:
@inproceedings{aepli-etal-2023-findings,
title = "Findings of the {V}ar{D}ial Evaluation Campaign 2023",
author = {Aepli, No{\"e}mi and
{\c{C}}{\"o}ltekin, {\c{C}}a{\u{g}}r{\i} and
Van Der Goot, Rob and
Jauhiainen, Tommi and
Kazzaz, Mourhaf and
Ljube{\v{s}}i{\'c}, Nikola and
North, Kai and
Plank, Barbara and
Scherrer, Yves and
Zampieri, Marcos},
editor = {Scherrer, Yves and
Jauhiainen, Tommi and
Ljube{\v{s}}i{\'c}, Nikola and
Nakov, Preslav and
Tiedemann, J{\"o}rg and
Zampieri, Marcos},
booktitle = "Tenth Workshop on NLP for Similar Languages, Varieties and Dialects (VarDial 2023)",
month = may,
year = "2023",
address = "Dubrovnik, Croatia",
publisher = "Association for Computational Linguistics",
url = "https://aclanthology.org/2023.vardial-1.25",
doi = "10.18653/v1/2023.vardial-1.25",
pages = "251--261",
}
If you use version >= 0.5 (which includes Bavarian and Lithuanian), please also cite:
@inproceedings{winkler-etal-2024-slot,
title = "Slot and Intent Detection Resources for {B}avarian and {L}ithuanian: Assessing Translations vs Natural Queries to Digital Assistants",
author = "Winkler, Miriam and
Juozapaityte, Virginija and
van der Goot, Rob and
Plank, Barbara",
editor = "Calzolari, Nicoletta and
Kan, Min-Yen and
Hoste, Veronique and
Lenci, Alessandro and
Sakti, Sakriani and
Xue, Nianwen",
booktitle = "Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)",
month = may,
year = "2024",
address = "Torino, Italia",
publisher = "ELRA and ICCL",
url = "https://aclanthology.org/2024.lrec-main.1297",
pages = "14898--14915",
}