
Checkpointing: Continue model training at epoch x after saving intermediate model #36

lukashaenjes opened this issue Mar 16, 2021 · 2 comments



lukashaenjes commented Mar 16, 2021

Hi, first of all, many thanks for this outstanding package.

I have a question about model checkpointing: I have a fairly large corpus (~70M words) and train word embeddings with embed_wordspace for 10 epochs. I run this on a remote server, and it can take up to two days for all 10 epochs to finish.

As a fault-tolerance measure, I figured it would be a good idea to checkpoint the model after every epoch, so that if something crashes I can load the last saved epoch and continue training from there. To do this, I set saveEveryEpoch = TRUE. Since I only want to keep the last successful epoch, I leave saveTempModel = FALSE.
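For context, my call looks roughly like this (simplified sketch; txt stands for my tokenised corpus, and any parameters beyond the ones mentioned above are omitted):

library(ruimtehol)
model <- embed_wordspace(txt,
                         model = "wordspace.bin",  # checkpoint file on disk
                         epoch = 10,
                         saveEveryEpoch = TRUE,    # write a checkpoint after every epoch
                         saveTempModel = FALSE)    # overwrite the same file rather than keeping one per epoch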

My question now is: how can I continue training from this checkpoint after something goes wrong? I tried passing initModel = "wordspace.bin" in the existing embed_wordspace call, which gives:

Start to load a trained starspace model.
STARSPACE-2017-2
Model loaded.

But it then runs the model with the parameters specified in the overall call to embed_wordspace, starting at epoch 1 and seemingly ignoring the loaded model. Also, when I read in the intermediate wordspace.bin.tsv, I'm left with the default parameters, not the ones I passed to the function. For instance, x$args$param$epoch gives 5 (the default), while I originally passed epoch = 10:

x <- starspace_load_model("wordspace.bin.tsv", method = "tsv-data.table")
x$args$param$epoch
#> [1] 5

Could this be the cause of the problem?

Am I approaching this correctly? If not, what would be an alternative way to achieve this? I'm thinking of something similar to the ModelCheckpoint callback in TensorFlow/Keras.

Many thanks in advance!

@jwijffels (Contributor)

I've never done this myself, but I think you can just set saveEveryEpoch = TRUE. Then, the next time you want to train, load the saved model and extract the embeddings:

x <- starspace_load_model("wordspace.bin.tsv", method = "tsv-data.table")
embeddings <- as.matrix(x)

and then pass those embeddings to embed_wordspace(..., embeddings = embeddings), or directly to starspace(..., embeddings = embeddings).
Transfer learning is shown in section 5 of the package vignette: https://cran.r-project.org/web/packages/ruimtehol/vignettes/ground-control-to-ruimtehol.pdf
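Putting that together, resuming would look roughly like this (untested sketch; txt is your original corpus, and the epoch count is whatever you still want to run):

x <- starspace_load_model("wordspace.bin.tsv", method = "tsv-data.table")
embeddings <- as.matrix(x)
## warm start: initialise training from the checkpointed embeddings
model <- embed_wordspace(txt,
                         model = "wordspace.bin",
                         epoch = 5,                # e.g. the epochs still left to run
                         embeddings = embeddings)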

@lukashaenjes (Author)

Thanks a lot for your fast response! I'll give this a try.
