This is a framework for translating text from code-mixed languages such as Hinglish and Bengalish into English. The repository includes datasets and pretrained models for training and prediction. The models can translate all-inclusive text, i.e. an input mixing Devanagari and Romanized Hindi, and are robust enough to handle spelling mistakes effectively.
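The spelling-mistake robustness mentioned above is typically obtained by training on synthetically noised text. As a rough illustration only (this is not the repository's actual noising code; the function name and the choice of drop/duplicate/swap operations are assumptions), a minimal character-level noiser might look like:

```python
import random

def add_spelling_noise(sentence: str, noise_prob: float = 0.1, seed: int = 0) -> str:
    """Randomly drop, duplicate, or swap characters to simulate spelling mistakes.

    Illustrative sketch: the actual noising used to train the RCMT models
    may differ.
    """
    rng = random.Random(seed)
    chars = list(sentence)
    out = []
    i = 0
    while i < len(chars):
        op = rng.random()
        if op < noise_prob / 3:                      # drop this character
            i += 1
            continue
        if op < 2 * noise_prob / 3:                  # duplicate this character
            out.append(chars[i])
            out.append(chars[i])
            i += 1
            continue
        if op < noise_prob and i + 1 < len(chars):   # swap with next character
            out.append(chars[i + 1])
            out.append(chars[i])
            i += 2
            continue
        out.append(chars[i])                         # keep unchanged
        i += 1
    return "".join(out)

print(add_spelling_noise("mujhe khana chahiye"))
```

With a fixed seed the noising is deterministic, which makes the noised training data reproducible.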
Link to datasets on huggingface:
The following models are available:
- rcmt1
- rcmt2
- zcmt
The following languages are available:
- Hindi (`hi`)
- Bengali (`bn`)
- English (`en`)
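All-inclusive inputs can mix scripts within a single sentence. A small helper (hypothetical, not part of this repository) that checks whether a string contains Devanagari codepoints:

```python
def contains_devanagari(text: str) -> bool:
    """True if any character falls in the Devanagari Unicode block (U+0900-U+097F)."""
    return any("\u0900" <= ch <= "\u097f" for ch in text)

# A codemix sentence may mix Romanized and Devanagari Hindi with English:
print(contains_devanagari("mujhe एक chai chahiye please"))  # → True
print(contains_devanagari("mujhe ek chai chahiye please"))  # → False
```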
- Download the repository and unzip.

```shell
pip install -r requirements.txt
cd jamt
pip install .
```
To train a model from scratch, the raw data needs to be preprocessed, tokenized, binarized, and then used to train a multilingual model. We propose two robust and one zero-shot codemix translation models: RCMT1, RCMT2, and ZCMT. Change the dataset name according to your requirements: calign/ctrans.
Preprocessing:
This involves cleaning the raw data and training a SentencePiece unigram model on the train splits (noisy, romanized, codemix) of both languages (Hindi, Bengali). The SentencePiece model is then used to tokenize all train, valid, and test datasets.
```shell
cd examples/translation/
bash prepare_calign.sh
```
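The cleaning stage performed by prepare_calign.sh amounts, in rough outline, to normalizing and whitespace-collapsing each raw line before tokenization. A stdlib-only sketch (the script's actual normalization choices are assumptions here):

```python
import re
import unicodedata

def clean_line(line: str) -> str:
    """Unicode-normalize and collapse whitespace in one raw sentence."""
    line = unicodedata.normalize("NFKC", line)   # unify visually identical codepoints
    line = re.sub(r"\s+", " ", line)             # collapse runs of whitespace/tabs/newlines
    return line.strip()

print(clean_line("  mujhe\tkhana   chahiye \n"))  # → "mujhe khana chahiye"
```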
Binarization:
As the training data is very large (~4.5 million pairs), the raw tokenized data needs to be binarized for faster loading. From the jamt/examples/translation directory run:
```shell
cd ../../
bash calign_binarize.sh
```
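Binarization replaces each token with an integer index from the vocabulary and stores the result in a compact binary file, so training never has to re-tokenize or parse text. A stdlib-only sketch of the idea (fairseq's actual .bin/.idx format is more involved; the vocabulary here is made up for illustration):

```python
import struct

def binarize(tokens, vocab):
    """Pack token ids into 4-byte little-endian unsigned integers."""
    ids = [vocab[t] for t in tokens]
    return struct.pack(f"<{len(ids)}I", *ids)

def debinarize(blob, inv_vocab):
    """Recover the token sequence from the packed bytes."""
    ids = struct.unpack(f"<{len(blob) // 4}I", blob)
    return [inv_vocab[i] for i in ids]

vocab = {"▁mujhe": 0, "▁khana": 1, "▁chahiye": 2}
inv_vocab = {v: k for k, v in vocab.items()}
blob = binarize(["▁mujhe", "▁khana", "▁chahiye"], vocab)
print(len(blob))                  # → 12 (bytes: 3 tokens x 4 bytes each)
print(debinarize(blob, inv_vocab))
```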
Training:
The binarized data can now be used to train a model. The default model is RCMT1. To change the model, uncomment the corresponding section in the calign_train.sh file and run:
```shell
bash calign_train.sh
```
Prediction:
For prediction on the provided test set, run:

```shell
bash calign_test.sh
```
After preprocessing and training, the file structure would look like this:
```
├── jamt
|   ├── examples
|   |   ├── translation
|   |   |   ├── calign
|   ├── checkpoints
|   |   ├── hinmix_calign_rcmt1
|   |   |   ├── checkpoint.pt
|   ├── data-bin
|   |   ├── hinmix_calign_rcmt1
|   |   |   ├── dict.lang.txt
|   |   |   ├── fairseq.vocab
```
Requirements:
- Python<=3.6 (for torch 1.4)
- torch==1.4.0
- tqdm
- numpy
- sentencepiece