
Checkpointing: Continue model training at epoch x after saving intermediate model #36

lukashaenjes opened this issue Mar 16, 2021 · 2 comments



lukashaenjes commented Mar 16, 2021

Hi, first of all, many thanks for this outstanding package.

I have a question about model checkpointing: I have a fairly large corpus (~70M words) and train word embeddings with embed_wordspace for 10 epochs. I run this on a remote server, and it can take up to two days for all 10 epochs to finish.

As a fault-tolerance measure, I figured it would be a good idea to checkpoint the model after every epoch, so that if something crashes I can load the last saved epoch and continue training from there. To do this, I set saveEveryEpoch = TRUE. Since I only want to keep the last successful epoch, I leave saveTempModel = FALSE.
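For context, my call looks roughly like this (simplified sketch; txt stands for my tokenised corpus, and any parameters beyond the ones mentioned above are omitted):

library(ruimtehol)
model <- embed_wordspace(txt,
                         model = "wordspace.bin",  # checkpoint file on disk
                         epoch = 10,
                         saveEveryEpoch = TRUE,    # write a checkpoint after every epoch
                         saveTempModel = FALSE)    # overwrite the same file rather than keeping one per epoch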

My question now is: how can I continue training from this checkpoint after something goes wrong? I tried passing initModel = "wordspace.bin" in the existing embed_wordspace call, which gives:

Start to load a trained starspace model.
STARSPACE-2017-2
Model loaded.

But it then runs the model with the parameters specified in the overall call to embed_wordspace, starting at epoch 1 and seemingly ignoring the loaded model. Also, when I read in the intermediate wordspace.bin.tsv, I'm left with the default parameters, not the ones I passed to the function. For instance, x$args$param$epoch gives 5 (the default), while I originally passed epoch = 10:

x <- starspace_load_model("wordspace.bin.tsv", method = "tsv-data.table")
x$args$param$epoch
#> [1] 5

Could this be the cause of the problem?

Am I approaching this correctly? If not, what would be an alternative way to achieve this? I'm thinking of something similar to the ModelCheckpoint callback in TensorFlow/Keras.

Many thanks in advance!

@jwijffels (Contributor)

I've never done this myself, but I think you can just set saveEveryEpoch = TRUE. Then, the next time you want to train, load the saved model and extract the embeddings:

x <- starspace_load_model("wordspace.bin.tsv", method = "tsv-data.table")
embeddings <- as.matrix(x)

and then pass those embeddings to embed_wordspace(..., embeddings = embeddings), or directly to starspace(..., embeddings = embeddings).
Transfer learning is shown in section 5 of the package vignette: https://cran.r-project.org/web/packages/ruimtehol/vignettes/ground-control-to-ruimtehol.pdf
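Putting that together, resuming would look roughly like this (untested sketch; txt is your original corpus, and the epoch count is whatever you still want to run):

x <- starspace_load_model("wordspace.bin.tsv", method = "tsv-data.table")
embeddings <- as.matrix(x)
## warm start: initialise training from the checkpointed embeddings
model <- embed_wordspace(txt,
                         model = "wordspace.bin",
                         epoch = 5,                # e.g. the epochs still left to run
                         embeddings = embeddings)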

@lukashaenjes (Author)

Thanks a lot for your fast response! I'll give this a try.
