Noise robust machine translation for codemix text in Hindi and Bengali

Kartikaggarwal98/Robust_Codemix_MT

Robust Machine Translation for Codemix Languages

This is a framework for translating text from codemix languages such as Hinglish and Bengalish into English. The repository includes datasets and pretrained models for training and prediction. The models can translate mixed-script text, i.e. input that mixes Devanagari and Romanized Hindi. The models are also robust to noise, which enables them to handle spelling mistakes effectively.

Link to datasets on Hugging Face:

The following models are available:

  1. rcmt1
  2. rcmt2
  3. zcmt

The following languages are available:

  1. Hindi (hi)
  2. Bengali (bn)
  3. English (en)

Installation

  1. Download the repository and unzip.
  2. pip install -r requirements.txt
  3. cd jamt
  4. pip install .

Training your own model from scratch

To train a model from scratch, the raw data must be preprocessed, tokenized, and binarized before it is used to train a multilingual model. We propose two robust codemix translation models (RCMT1, RCMT2) and one zero-shot model (ZCMT). The commands below use the calign dataset; change the dataset name (calign or ctrans) as required.

Preprocessing:

This step cleans the raw data and trains a sentencepiece unigram model on the train versions (noisy, romanized, codemix) of each language (Hindi, Bengali). The sentencepiece model is then used to tokenize the train, valid, and test datasets.

  1. cd examples/translation/
  2. bash prepare_calign.sh
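As a rough illustration of the sentencepiece part of this step, the sketch below trains a unigram model and tokenizes a few sentences. It is not the content of prepare_calign.sh: the function names, file names, vocabulary size, and character coverage are all hypothetical, and it assumes a sentencepiece release whose Python API accepts keyword arguments.

```python
from typing import Dict, List


def unigram_train_args(train_files: List[str], model_prefix: str,
                       vocab_size: int = 32000) -> Dict[str, object]:
    """Build keyword arguments for sentencepiece's trainer (unigram model)."""
    return {
        "input": ",".join(train_files),  # comma-separated raw text files
        "model_prefix": model_prefix,    # writes <prefix>.model and <prefix>.vocab
        "vocab_size": vocab_size,        # hypothetical value, not taken from the repository
        "model_type": "unigram",
        "character_coverage": 1.0,       # assumption: keep all Devanagari, Bengali, and Latin characters
    }


def train_and_tokenize(train_files: List[str], model_prefix: str,
                       sentences: List[str]) -> List[List[str]]:
    """Train the model on the given files, then split sentences into subword pieces."""
    import sentencepiece as spm  # imported lazily so the helper above works without it

    spm.SentencePieceTrainer.train(**unigram_train_args(train_files, model_prefix))
    sp = spm.SentencePieceProcessor(model_file=model_prefix + ".model")
    return [sp.encode(sentence, out_type=str) for sentence in sentences]
```

Training a single model on all train versions together means noisy, romanized, and codemix inputs share one subword vocabulary.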

Binarization:

Because the training data is very large (~4.5 million pairs), the raw tokenized data needs to be binarized for faster loading. From the jamt/examples/translation directory, run:

  1. cd ../../
  2. bash calign_binarize.sh

Training:

The binarized data can now be used to train a model. The default model is RCMT1. To train a different model, uncomment the corresponding section of the calign_train.sh file and run:

  1. bash calign_train.sh

Prediction:

To generate predictions for the provided test set, run:

  1. bash calign_test.sh
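The four stages above can be chained from a small driver. The following is a minimal sketch rather than part of the repository: `pipeline_steps` and `run_pipeline` are hypothetical helpers, the script names for a dataset other than calign are assumed to follow the same naming pattern, and the train and test scripts are assumed to run from the jamt directory like the binarization script.

```python
import subprocess
from pathlib import Path
from typing import List, Tuple


def pipeline_steps(jamt_root: str, dataset: str = "calign") -> List[Tuple[Path, List[str]]]:
    """Return (working directory, command) pairs in the order described above."""
    root = Path(jamt_root)
    return [
        # Preprocessing runs from examples/translation
        (root / "examples" / "translation", ["bash", f"prepare_{dataset}.sh"]),
        # The remaining scripts are assumed to run from the jamt root
        (root, ["bash", f"{dataset}_binarize.sh"]),
        (root, ["bash", f"{dataset}_train.sh"]),
        (root, ["bash", f"{dataset}_test.sh"]),
    ]


def run_pipeline(jamt_root: str, dataset: str = "calign") -> None:
    """Run every stage in order and stop at the first failing script."""
    for workdir, command in pipeline_steps(jamt_root, dataset):
        print(f"[{workdir}] {' '.join(command)}")
        subprocess.run(command, cwd=str(workdir), check=True)
```

Calling `run_pipeline("jamt")` from the directory that contains jamt would run preprocessing, binarization, training, and prediction back to back; `check=True` raises an error as soon as one script exits with a non-zero status.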

Examples

After preprocessing and training, the file structure would look like this:

├──    jamt
|   ├──    examples
|   |   ├──    translation
|   |   |    ├──    calign
|   ├──    checkpoints
|   |   ├──    hinmix_calign_rcmt1
|   |   |    ├──    checkpoint.pt
|   ├──    data-bin
|   |   ├──    hinmix_calign_rcmt1
|   |   |   ├──    dict.lang.txt
|   |   |   ├──    fairseq.vocab
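To confirm that preprocessing and training produced the layout above, a quick check along the following lines may help. This is a sketch: `missing_artifacts` is a hypothetical helper, and the path list simply mirrors the tree shown here for the default RCMT1 model on calign.

```python
from pathlib import Path
from typing import List

# Paths relative to the jamt directory, copied from the tree above
EXPECTED_ARTIFACTS = [
    "examples/translation/calign",
    "checkpoints/hinmix_calign_rcmt1/checkpoint.pt",
    "data-bin/hinmix_calign_rcmt1/dict.lang.txt",
    "data-bin/hinmix_calign_rcmt1/fairseq.vocab",
]


def missing_artifacts(jamt_root: str, expected: List[str] = EXPECTED_ARTIFACTS) -> List[str]:
    """Return the expected paths that do not exist under jamt_root."""
    root = Path(jamt_root)
    return [relative for relative in expected if not (root / relative).exists()]


if __name__ == "__main__":
    for path in missing_artifacts("jamt"):
        print("missing:", path)
```

An empty result means every listed file and directory is present; for another model or dataset, the directory names in the list would need to be adjusted.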

Requirements

  1. Python<=3.6 (for torch 1.4)
  2. torch==1.4.0
  3. tqdm
  4. numpy
  5. sentencepiece
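The requirements can be checked before installation with a short script. This is an illustrative sketch, not a script shipped with the repository: `missing_modules` and `python_ok` are hypothetical helpers, and the check only tests whether each package is importable, not whether torch is exactly version 1.4.0.

```python
import importlib.util
import sys
from typing import List, Optional, Tuple

# Import names of the packages listed above
REQUIRED_MODULES = ["torch", "tqdm", "numpy", "sentencepiece"]


def missing_modules(names: List[str] = REQUIRED_MODULES) -> List[str]:
    """Return the module names that cannot be found in the current environment."""
    return [name for name in names if importlib.util.find_spec(name) is None]


def python_ok(version: Optional[Tuple[int, int]] = None) -> bool:
    """Check the (major, minor) Python version against the stated Python<=3.6 bound."""
    current = sys.version_info[:2] if version is None else tuple(version)
    return current <= (3, 6)


if __name__ == "__main__":
    print("Python version OK:", python_ok())
    print("Missing modules:", missing_modules())
```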
