Hi, first of all, many thanks for this outstanding package.
I have a question concerning model checkpointing: I have a fairly large corpus (~70M words) and train a word-embedding model (with `embed_wordspace`) for 10 epochs. I run this on a remote server, and it can take up to 2 days for all 10 epochs to finish.
As a fault-tolerance measure, I figured it would be a good idea to checkpoint the model after every epoch, so that if something crashes I can load the last saved epoch and continue training from there. For this, I set `saveEveryEpoch = TRUE`. Since I only want to save the last successful epoch, I keep `saveTempModel = FALSE`.
My question now is: how can I continue training from this checkpoint after something went wrong? I tried to pass `initModel = "wordspace.bin"` in the existing `embed_wordspace` call, which gives:

```
Start to load a trained starspace model.
STARSPACE-2017-2
Model loaded.
```
But then it continues to run the model with the parameters specified in the overall call to `embed_wordspace`, starting at epoch 1 and seemingly ignoring the passed model. Also, when reading in the intermediate `wordspace.bin.tsv`, I'm left with the default parameters, not the ones I passed to the function. For instance, `x$args$param$epoch` gives 5 (the default), while I originally passed `epoch = 10`:

```r
x <- starspace_load_model("wordspace.bin.tsv", method = "tsv-data.table")
x$args$param$epoch
#> [1] 5
```
Could this be the cause of the problem?
Am I approaching this correctly? What would be an alternative way to achieve my desired goal? I'm thinking of something similar to the ModelCheckpoint functionality in TensorFlow.
Many thanks in advance!
I never did this myself, but I think you can just set `saveEveryEpoch = TRUE`.
The next time you want to train again, load the saved model and get the embeddings:
```r
x <- starspace_load_model("wordspace.bin.tsv", method = "tsv-data.table")
embeddings <- as.matrix(x)
```
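Putting the pieces together, a minimal sketch of the checkpoint-and-resume workflow might look like the following (untested; `corpus` is a placeholder for your training text, and the epoch arithmetic assumes `initModel` acts as a warm start while StarSpace restarts its epoch counter at 1, as observed above):

```r
library(ruimtehol)

## Initial training run with per-epoch checkpointing,
## using the same arguments as in the original call.
model <- embed_wordspace(corpus,                 # `corpus`: your text (placeholder name)
                         model          = "wordspace.bin",
                         epoch          = 10,
                         saveEveryEpoch = TRUE,
                         saveTempModel  = FALSE)

## After a crash: pass the last checkpoint as initModel and train only the
## remaining epochs (e.g. 4 epochs completed -> 6 left), since the run
## restarts counting at epoch 1 rather than resuming where it stopped.
model <- embed_wordspace(corpus,
                         model     = "wordspace.bin",
                         epoch     = 6,
                         initModel = "wordspace.bin")

## If no further training is needed, just recover the embeddings:
x          <- starspace_load_model("wordspace.bin.tsv", method = "tsv-data.table")
embeddings <- as.matrix(x)
```

The `x$args$param$epoch` value of 5 reported above is likely a red herring: the `.tsv` export stores only the embedding table plus default metadata, so the reloaded object does not retain the training arguments of the original call.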