- src: Contains the source code for pre-processing and training.
- dataset: Contains the datasets used for training.
- records: Contains the details of all training runs.
- records/active: Contains the details of the current training run.
All the tools required for data preprocessing are available. The steps below assume transliteration from lang1 to lang2.
-
Prepare a CSV file in the following format:
lang1_word1, lang2_word1
lang1_word2, lang2_word2
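For instance, such a file could be produced from a list of word pairs with a few lines of Python (the pairs and the output filename here are placeholders):

import csv

# Hypothetical word pairs for a Latin-to-Devanagari transliteration task.
pairs = [
    ("namaste", "नमस्ते"),
    ("dhanyavad", "धन्यवाद"),
]

with open("pairs.csv", "w", newline="", encoding="utf-8") as f:
    csv.writer(f).writerows(pairs)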
-
Clean the dataset:
py src/clean_dataset.py -f path_to_csv_file -l lang-code
This will create a directory named lang-code in the dataset directory, and a file lang-code.csv inside it containing the cleaned and filtered dataset.
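The exact rules live in src/clean_dataset.py; purely as an illustration of the kind of cleaning involved (assumed logic, not the script's), this drops malformed, empty, and duplicate pairs:

import csv

seen = set()
with open("raw.csv", encoding="utf-8") as fin, \
     open("clean.csv", "w", newline="", encoding="utf-8") as fout:
    writer = csv.writer(fout)
    for row in csv.reader(fin):
        if len(row) != 2:
            continue                       # drop malformed rows
        pair = tuple(col.strip() for col in row)
        if not all(pair) or pair in seen:  # drop empty or duplicate pairs
            continue
        seen.add(pair)
        writer.writerow(pair)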
-
Make token file:
py src/mk_token_file.py -f path_to_clean_csv_file -l lang-code
This will create a lang-code.tokens file in the lang-code directory. Be sure to use the cleaned CSV file, or unwanted tokens may end up in the token file. The token file is used to initialise a Tokeniser, which in turn tokenises the datasets.
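Assuming the Tokeniser is character-level (an assumption; src/mk_token_file.py holds the real logic), the token file can be pictured as the sorted set of characters seen in the cleaned data:

import csv

tokens = set()
with open("dataset/lang-code/lang-code.csv", encoding="utf-8") as f:
    for lang1_word, lang2_word in csv.reader(f):
        tokens.update(lang1_word)   # each character becomes a candidate token
        tokens.update(lang2_word)

with open("dataset/lang-code/lang-code.tokens", "w", encoding="utf-8") as f:
    f.write("\n".join(sorted(tokens)))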
-
Shuffle and split:
./src/split_data.sh -f path_to_clean_csv_file -l lang-code -r val:test
e.g. to split file.csv into 10% validation, 10% testing, and the remaining 80% training:
./src/split_data.sh -f ./file.csv -l lang-code -r 10:10
This will create three files in the lang-code directory:
- lang-code-train.csv: training dataset
- lang-code-val.csv: validation dataset
- lang-code-test.csv: testing dataset
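Internally, a 10:10 split boils down to shuffling and slicing; the sketch below illustrates the idea in Python (a sketch only, not split_data.sh itself):

import csv, random

with open("dataset/lang-code/lang-code.csv", encoding="utf-8") as f:
    rows = list(csv.reader(f))

random.shuffle(rows)                      # shuffle before splitting
n_val = n_test = len(rows) * 10 // 100    # 10% each for validation and testing

splits = {
    "val": rows[:n_val],
    "test": rows[n_val:n_val + n_test],
    "train": rows[n_val + n_test:],       # remainder is the training set
}
for name, split in splits.items():
    path = f"dataset/lang-code/lang-code-{name}.csv"
    with open(path, "w", newline="", encoding="utf-8") as f:
        csv.writer(f).writerows(split)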
-
Train the model with py src/train.py. It can take the following arguments:
- e: Number of epochs to train.
- l: Language code.
- i: Epoch interval at which to save checkpoints. Default is 5.
- R: Restart the training. Previous progress will be lost; it will ask for confirmation before proceeding.
- V: If provided, no validation will be performed during training. Training will be faster, but there will be no validation graph.
- r: If provided, the training will be done in reverse (from lang2 to lang1).
- S: Short training. It will train on only 10 samples. Use it for quick testing of the model.
Examples:
To train for lang-code for 100 epochs:
py src/train.py -e 100 -l lang-code
To train for lang-code for 100 epochs in reverse (lang2 to lang1):
py src/train.py -e 100 -l lang-code -r
To restart the training:
py src/train.py -e 100 -l lang-code -R
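To make the -e/-i interaction concrete, checkpointing at a fixed epoch interval amounts to the following schematic (a sketch, not train.py's actual code):

EPOCHS = 100              # corresponds to -e
CHECKPOINT_INTERVAL = 5   # corresponds to -i (default 5)

for epoch in range(1, EPOCHS + 1):
    # one epoch of training (and validation, unless -V was given) runs here
    if epoch % CHECKPOINT_INTERVAL == 0:
        print(f"epoch {epoch}: checkpoint would be saved to records/active")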
-
Evaluate the model with py src/evaluate.py. It can take the following arguments:
- l: Language code.
- r: Reverse evaluation. Use this if the model was trained in reverse.
- S: Short evaluation. Use it for quick testing of the model.
Example:
py src/evaluate.py -l lang-code
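The metric evaluate.py reports is not spelled out here; a minimal sketch of word-level exact-match accuracy on the test split (an assumed metric, with transliterate as a hypothetical stand-in for model inference) would be:

import csv

def transliterate(word):
    # Hypothetical stand-in for the trained model's prediction.
    return word

correct = total = 0
with open("dataset/lang-code/lang-code-test.csv", encoding="utf-8") as f:
    for lang1_word, lang2_word in csv.reader(f):
        correct += transliterate(lang1_word) == lang2_word
        total += 1
print(f"exact-match accuracy: {correct / total:.2%}")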
-
When training and testing are done, move the training records using:
py src/mv_active.py
This will move the records/active directory to records/record_n. A new training session can then be started.
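In spirit, the move is just a rename to the next free index (a sketch; whether mv_active.py also recreates records/active is an assumption):

import shutil
from pathlib import Path

records = Path("records")
n = 1
while (records / f"record_{n}").exists():   # find the next free record_n
    n += 1
shutil.move(str(records / "active"), str(records / f"record_{n}"))
(records / "active").mkdir()                # fresh directory for the next session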
-
The training records all its metrics in the records/active/metric file. The graph is drawn from this file by draw_graph.py and saved in the records directory. It can take the following arguments:
- d: Record directory. Default is records/active.
- p: If provided, the graph will be shown on screen.
- s: If provided, the graph will not be saved to disk.
py src/draw_graph.py -d records/record_1
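The metric file's exact layout is not documented here; assuming one epoch,train_loss,val_loss line per epoch (an assumption, as is the output filename), a matplotlib sketch in the spirit of draw_graph.py would be:

import matplotlib.pyplot as plt

epochs, train_loss, val_loss = [], [], []
with open("records/record_1/metric", encoding="utf-8") as f:
    for line in f:
        e, tr, va = line.strip().split(",")
        epochs.append(int(e))
        train_loss.append(float(tr))
        val_loss.append(float(va))

plt.plot(epochs, train_loss, label="train loss")
plt.plot(epochs, val_loss, label="val loss")
plt.xlabel("epoch")
plt.legend()
plt.savefig("records/graph.png")   # skipped when -s is given
plt.show()                         # shown on screen when -p is given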