PrabhakarTayenjam/transliteration

A generic transliteration trainer using an Encoder-Decoder transformer model.

Directory organisation

  1. src: Contains the source code for pre-processing and training.
  2. dataset: Contains the datasets used for training.
  3. records: Contains the records of all training runs.
  4. records/active: Contains the records of the current training run.
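
As an illustration, after preparing a dataset for a hypothetical language code xx (the code xx is just a placeholder) and archiving one training run, the tree would look roughly like this:

  transliteration/
  ├── src/                 # pre-processing and training scripts
  ├── dataset/
  │   └── xx/              # created by clean_dataset.py
  │       ├── xx.csv       # cleaned dataset
  │       ├── xx.tokens    # token file
  │       ├── xx-train.csv
  │       ├── xx-val.csv
  │       └── xx-test.csv
  └── records/
      ├── active/          # records of the current training run
      └── record_1/        # an archived run (moved by mv_active.py)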

Dataset preparation and pre-processing

All the tools required for data pre-processing are included in src. Consider transliteration from lang1 to lang2.

Steps for data preparation (a complete worked example follows these steps):

  1. Prepare a CSV file in the following format:
    lang1_word1, lang2_word1
    lang1_word2, lang2_word2

  2. Clean the dataset:
    py src/clean_dataset.py -f path_to_csv_file -l lang-code

    This will create a directory named lang-code in the dataset directory, and a file lang-code.csv in the lang-code directory which contains the cleaned and filtered dataset.

  3. Make token file:
    py src/mk_token_file.py -f path_to_clean_csv_file -l lang-code

    This will create a lang-code.tokens file in the lang-code directory. Be sure to use the cleaned CSV file, or else unwanted tokens might end up in the token file. This token file will be used to initialise a Tokeniser, which will be used to tokenise the datasets.

  4. Shuffle and split:
    ./src/split_data.sh -f path_to_clean_csv_file -l lang-code -r val:test

    e.g. To split file.csv into training, validation, and testing datasets with 10% validation and 10% testing (the remainder becomes the training set):

    ./src/split_data.sh -f ./file.csv -l lang-code -r 10:10

    This will create three files in the lang-code directory:

    1. lang-code-train.csv: training dataset
    2. lang-code-val.csv: validation dataset
    3. lang-code-test.csv: testing dataset
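
Putting the steps together for a hypothetical language code xx and input file pairs.csv (both names are placeholders):

py src/clean_dataset.py -f pairs.csv -l xx
py src/mk_token_file.py -f dataset/xx/xx.csv -l xx
./src/split_data.sh -f dataset/xx/xx.csv -l xx -r 10:10

The token file drives tokenisation during training. As a rough illustration of the idea, a character-level tokeniser built from such a file might look like the sketch below. The class shape, the one-token-per-line file format, and the special ids are assumptions for illustration, not the repository's actual implementation (see src for that).

class Tokeniser:
    def __init__(self, token_file):
        # Assumed format: one token (character) per line in the .tokens file.
        with open(token_file, encoding="utf-8") as f:
            tokens = [line.rstrip("\n") for line in f if line.strip()]
        # Reserve low ids for padding and sequence delimiters (an assumption).
        self.pad, self.start, self.end = 0, 1, 2
        self.id_of = {tok: i + 3 for i, tok in enumerate(tokens)}
        self.tok_of = {i: tok for tok, i in self.id_of.items()}

    def encode(self, word):
        # Map each character to its id, framed by start/end markers.
        return [self.start] + [self.id_of[c] for c in word] + [self.end]

    def decode(self, ids):
        # Drop special ids and map the rest back to characters.
        return "".join(self.tok_of[i] for i in ids if i in self.tok_of)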

Training

Now that the training dataset is ready, we can start training. Training is done by train.py, which accepts the following arguments:
  • e: Number of epochs to train.
  • l: Language code.
  • i: Epoch interval at which to save checkpoints. Default is 5.
  • R: Restart the training. Previous progress will be lost. It will ask for confirmation before proceeding.
  • V: If provided, no validation will be performed during training. Training will be faster, but there will be no validation graph.
  • r: If provided, training will be done in reverse (from lang2 to lang1).
  • S: Short training. It will train on only 10 samples. Use for quick testing of the model.

Examples:
To train for lang-code for 100 epochs:
py src/train.py -e 100 -l lang-code

To train for lang-code for 100 epochs in reverse (lang2 to lang1):
py src/train.py -e 100 -l lang-code -r

To restart the training:
py src/train.py -e 100 -l lang-code -R
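
Assuming the flags combine as usual, a faster run that saves checkpoints every 10 epochs and skips validation would be:
py src/train.py -e 100 -l lang-code -i 10 -V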

Evaluation

Use evaluate.py for testing. It will use the dataset/lang-code/lang-code-test.csv file. Valid arguments:
  1. l: Language code.
  2. r: Reverse evaluation. Use this if the model was trained in reverse.
  3. S: Short evaluation. Use for quick testing of the model.

py src/evaluate.py -l lang-code
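
If the model was trained with -r, the evaluation presumably needs the matching flag:
py src/evaluate.py -l lang-code -r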

When training and testing are done, move the training records using:

py src/mv_active.py

This will move the records/active directory to records/record_n. A new training session can then be started.

Graphs

Training records all the metrics in the records/active/metric file. Graphs are drawn from this file using draw_graph.py and saved in the records directory. It can take the following arguments:
  1. d: Record directory. Default is records/active.
  2. p: If provided, the graph will be shown on screen.
  3. s: If provided, the graph will not be saved to disk.

py src/draw_graph.py -d records/record_1
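
To preview a graph on screen without writing anything to disk (combining the flags above):
py src/draw_graph.py -d records/record_1 -p -s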
