- src: Contains the source code for pre-processing and training.
- dataset: Contains the datasets used for training.
- records: Contains the details of all training runs.
- records/active: Contains the details of the current training run.
All the tools required for data preprocessing are available. The steps below assume transliteration from lang1 to lang2.
-
Prepare a CSV file in the following format:
lang1_word1, lang2_word1
lang1_word2, lang2_word2
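For instance, such a file could be produced from a list of word pairs with a few lines of Python (the pairs and the output filename here are placeholders):

import csv

# Hypothetical word pairs for a Latin-to-Devanagari transliteration task.
pairs = [
    ("namaste", "नमस्ते"),
    ("dhanyavad", "धन्यवाद"),
]

with open("pairs.csv", "w", newline="", encoding="utf-8") as f:
    csv.writer(f).writerows(pairs)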
-
Clean the dataset:
py src/clean_dataset.py -f path_to_csv_file -l lang-code
This will create a directory named lang-code in the dataset directory, and a file lang-code.csv inside it containing the cleaned and filtered dataset.
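The exact rules live in src/clean_dataset.py; purely as an illustration of the kind of cleaning involved (assumed logic, not the script's), this drops malformed, empty, and duplicate pairs:

import csv

seen = set()
with open("raw.csv", encoding="utf-8") as fin, \
     open("clean.csv", "w", newline="", encoding="utf-8") as fout:
    writer = csv.writer(fout)
    for row in csv.reader(fin):
        if len(row) != 2:
            continue                       # drop malformed rows
        pair = tuple(col.strip() for col in row)
        if not all(pair) or pair in seen:  # drop empty or duplicate pairs
            continue
        seen.add(pair)
        writer.writerow(pair)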
-
Make token file:
py src/mk_token_file.py -f path_to_clean_csv_file -l lang-code
This will create a lang-code.tokens file in the lang-code directory. Be sure to use the cleaned CSV file, or unwanted tokens may end up in the token file. The token file is used to initialise a Tokeniser, which in turn tokenises the datasets.
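Assuming the Tokeniser is character-level (an assumption; src/mk_token_file.py holds the real logic), the token file can be pictured as the sorted set of characters seen in the cleaned data:

import csv

tokens = set()
with open("dataset/lang-code/lang-code.csv", encoding="utf-8") as f:
    for lang1_word, lang2_word in csv.reader(f):
        tokens.update(lang1_word)   # each character becomes a candidate token
        tokens.update(lang2_word)

with open("dataset/lang-code/lang-code.tokens", "w", encoding="utf-8") as f:
    f.write("\n".join(sorted(tokens)))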
-
Shuffle and split:
./src/split_data.sh -f path_to_clean_csv_file -l lang-code -r val:test
e.g. to split file.csv into 10% validation, 10% testing, and the remaining 80% training:
./src/split_data.sh -f ./file.csv -l lang-code -r 10:10
This will create three files in the lang-code directory:
- lang-code-train.csv: training dataset
- lang-code-val.csv: validation dataset
- lang-code-test.csv: testing dataset
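Internally, a 10:10 split boils down to shuffling and slicing; the sketch below illustrates the idea in Python (a sketch only, not split_data.sh itself):

import csv, random

with open("dataset/lang-code/lang-code.csv", encoding="utf-8") as f:
    rows = list(csv.reader(f))

random.shuffle(rows)                      # shuffle before splitting
n_val = n_test = len(rows) * 10 // 100    # 10% each for validation and testing

splits = {
    "val": rows[:n_val],
    "test": rows[n_val:n_val + n_test],
    "train": rows[n_val + n_test:],       # remainder is the training set
}
for name, split in splits.items():
    path = f"dataset/lang-code/lang-code-{name}.csv"
    with open(path, "w", newline="", encoding="utf-8") as f:
        csv.writer(f).writerows(split)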
-
Train the model with py src/train.py. It can take the following arguments:
- e: Number of epochs to train.
- l: Language code.
- i: Epoch interval at which to save checkpoints. Default is 5.
- R: Restart the training. Previous progress will be lost; it will ask for confirmation before proceeding.
- V: If provided, no validation will be performed during training. Training will be faster, but there will be no validation graph.
- r: If provided, the training will be done in reverse (from lang2 to lang1).
- S: Short training. It will train on only 10 samples. Use it for quick testing of the model.
Examples:
To train for lang-code for 100 epochs:
py src/train.py -e 100 -l lang-code
To train for lang-code for 100 epochs in reverse (lang2 to lang1):
py src/train.py -e 100 -l lang-code -r
To restart the training:
py src/train.py -e 100 -l lang-code -R
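To make the -e/-i interaction concrete, checkpointing at a fixed epoch interval amounts to the following schematic (a sketch, not train.py's actual code):

EPOCHS = 100              # corresponds to -e
CHECKPOINT_INTERVAL = 5   # corresponds to -i (default 5)

for epoch in range(1, EPOCHS + 1):
    # one epoch of training (and validation, unless -V was given) runs here
    if epoch % CHECKPOINT_INTERVAL == 0:
        print(f"epoch {epoch}: checkpoint would be saved to records/active")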
-
Evaluate the model with py src/evaluate.py. It can take the following arguments:
- l: Language code.
- r: Reverse evaluation. Use this if the model was trained in reverse.
- S: Short evaluation. Use it for quick testing of the model.
Example:
py src/evaluate.py -l lang-code
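The metric evaluate.py reports is not spelled out here; a minimal sketch of word-level exact-match accuracy on the test split (an assumed metric, with transliterate as a hypothetical stand-in for model inference) would be:

import csv

def transliterate(word):
    # Hypothetical stand-in for the trained model's prediction.
    return word

correct = total = 0
with open("dataset/lang-code/lang-code-test.csv", encoding="utf-8") as f:
    for lang1_word, lang2_word in csv.reader(f):
        correct += transliterate(lang1_word) == lang2_word
        total += 1
print(f"exact-match accuracy: {correct / total:.2%}")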
-
When training and testing are done, move the training records using:
py src/mv_active.py
This will move the records/active directory to records/record_n. A new training session can then be started.
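In spirit, the move is just a rename to the next free index (a sketch; whether mv_active.py also recreates records/active is an assumption):

import shutil
from pathlib import Path

records = Path("records")
n = 1
while (records / f"record_{n}").exists():   # find the next free record_n
    n += 1
shutil.move(str(records / "active"), str(records / f"record_{n}"))
(records / "active").mkdir()                # fresh directory for the next session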
-
The training records all its metrics in the records/active/metric file. The graph is drawn from this file by draw_graph.py and saved in the records directory. It can take the following arguments:
- d: Record directory. Default is records/active.
- p: If provided, the graph will be shown on screen.
- s: If provided, the graph will not be saved to disk.
py src/draw_graph.py -d records/record_1
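The metric file's exact layout is not documented here; assuming one epoch,train_loss,val_loss line per epoch (an assumption, as is the output filename), a matplotlib sketch in the spirit of draw_graph.py would be:

import matplotlib.pyplot as plt

epochs, train_loss, val_loss = [], [], []
with open("records/record_1/metric", encoding="utf-8") as f:
    for line in f:
        e, tr, va = line.strip().split(",")
        epochs.append(int(e))
        train_loss.append(float(tr))
        val_loss.append(float(va))

plt.plot(epochs, train_loss, label="train loss")
plt.plot(epochs, val_loss, label="val loss")
plt.xlabel("epoch")
plt.legend()
plt.savefig("records/graph.png")   # skipped when -s is given
plt.show()                         # shown on screen when -p is given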