load pretrained model assert failed #106
+1
The following change temporarily fixes the assert failure. @wanghaisheng
I don't know whether there are other side effects.
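The actual diff was not captured in this thread, so purely as a hypothetical illustration of the kind of quick workaround being discussed, one could imagine relaxing the check in `Codec::encode` so unknown characters are dropped instead of tripping the assert (names follow clstm.cc):

```cpp
// HYPOTHETICAL sketch only -- not the patch actually posted in this thread.
// clstm.cc's Codec::encode asserts that every character has a codec entry:
//   assert(encoder->count(c) > 0);
// A quick workaround could silently skip unknown characters instead:
void Codec::encode(Classes &classes, const std::wstring &s) {
  classes.clear();
  for (wchar_t c : s) {
    if (encoder->count(c) == 0) continue;  // drop chars missing from the codec
    classes.push_back((*encoder)[c]);
  }
}
```

As the later comments point out, any workaround along these lines only hides the mismatch between the codec and the training data.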
I looked into the source code you referred to. It seems that after loading an existing model, we should first get the codec vector from that model and the codec from the training set, then combine these two vectors into one and apply the result with Codec::set (https://github.com/tmbdev/clstm/blob/master/clstm.cc).
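A rough sketch of that merge, assuming the `Codec` interface from clstm.h (a public `vector<int> codec` plus `Codec::set`); untested:

```cpp
// Sketch: combine the pretrained model's codec with the training set's codec,
// then install the union via Codec::set. Existing entries keep their order,
// because a symbol's position in the codec is its output class index.
#include <set>
#include <vector>

void merge_codecs(Codec &model_codec, const Codec &dataset_codec) {
  std::vector<int> combined = model_codec.codec;      // preserve old class order
  std::set<int> seen(combined.begin(), combined.end());
  for (int c : dataset_codec.codec)
    if (seen.insert(c).second) combined.push_back(c); // append only new symbols
  model_codec.set(combined);                          // rebuilds the encoder map
}
```

Note that merging codecs alone does not resize the network's output layer, which is the deeper problem discussed below.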
Thanks for reporting. I don't think it's a good solution to read the whole dataset on each load.
Update: it's not a good idea, see the comments below.
@amitdo In my experience, the more Unicode code points you load into the codec, the slower the training process will be.
I'm not suggesting to actually do training on all those chars...
OK, looking forward to your solution.
In the meantime, don't use your temporary solution. I believe it will mess up your model.
I don't think this will be easy. The codec determines the size of the network's layers, i.e. there will be weights/connections in the network for each of the characters in the codec. To add new characters not in the original training data during re-training, you would have to modify the structure of the network before training, which is pretty complicated: you'd have to add extra dimensions to a lot of the weight/bias matrices. Is this what you're suggesting, @amitdo?
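For intuition, here is a conceptual sketch in plain Eigen (not clstm's actual internals) of what adding output dimensions involves; in a real LSTM network every matrix feeding the output layer would need the same surgery:

```cpp
// Conceptual sketch: grow an output layer from its current class count to
// n_new classes by enlarging the weight matrix and bias vector, keeping the
// trained rows and giving the new rows a small random initialization.
#include <Eigen/Dense>
#include <random>

void grow_output_layer(Eigen::MatrixXf &W, Eigen::VectorXf &b, int n_new) {
  const int n_old = W.rows();
  if (n_new <= n_old) return;                    // nothing to grow
  std::mt19937 rng(42);
  std::normal_distribution<float> init(0.0f, 0.01f);

  Eigen::MatrixXf W2(n_new, W.cols());
  W2.topRows(n_old) = W;                         // keep already-trained weights
  for (int i = n_old; i < n_new; ++i)
    for (int j = 0; j < W.cols(); ++j) W2(i, j) = init(rng);
  W = std::move(W2);

  Eigen::VectorXf b2 = Eigen::VectorXf::Zero(n_new);
  b2.head(n_old) = b;                            // keep trained biases
  b = std::move(b2);
}
```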
Is there a problem with registering chars in the model's codec at build time (first time only), even if some of them won't be trained? For example, for Chinese: registering 6,000-10,000 symbols.
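A sketch of that idea, assuming the `Codec::set` interface from clstm.h; the exact character ranges here are just an example:

```cpp
// Sketch: pre-register a fixed repertoire (printable ASCII plus the CJK
// Unified Ideographs block) before the network is created, so retraining
// never encounters a character missing from the codec.
#include <vector>

std::vector<int> preload_repertoire() {
  std::vector<int> chars;
  chars.push_back(0);  // class 0 is reserved (epsilon/blank), as in Codec::build
  for (int c = 0x20; c <= 0x7E; ++c) chars.push_back(c);      // printable ASCII
  for (int c = 0x4E00; c <= 0x9FFF; ++c) chars.push_back(c);  // CJK ideographs
  return chars;
}

// usage: codec.set(preload_repertoire()); then create the network with
// noutput == codec.size() -- here roughly 21,000 output classes.
```

The cost is an output layer of roughly 21,000 classes even if the training data only ever uses a few thousand of them.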
I missed that sentence. My answer: Certainly not!
My suggested solution: an external codec file prepared in advance. What do you think about that?
I also do not think that there is a sensible approach to extending a trained model for symbols the network was not originally aware of. It is possible to adapt the data structures (e.g. just adding new code points to the codec), but it will result in an inconsistent model unless you fully retrain, which is what you do not want, obviously.
This seems a straightforward approach, depending on how much providing all possible chars in the codec degrades training performance.
Is it such a performance hit to have a large codec even if the training data contains only a subset of those characters? Implementing some form of "pre-loading" of e.g. full Unicode code pages instead of building the codec from the training set (as @amitdo suggests) is doable, but I'm at a loss about the consequences wrt performance and network consistency. If the number and frequency of new chars is small (e.g. a few new variants of letters), it will take a long time to accurately predict them, but it seems plausible. If it's a completely independent training set (like extending a Japanese model with Chinese training data), wouldn't that effectively require un-learning the old model and creating a new one? Also, enabling such pre-loading would require retraining from scratch with the extended codec, which can be very time-consuming, depending on the actual number of chars in the training set.
The issue is mostly with Chinese and Japanese.
Training both Chinese and Japanese in the same model is not a good idea.
Chinese has so many characters that we usually train only the commonly used ones.
I think your external codec file solution is good. We can prepare codecs in advance for future use. @amitdo
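One way such an external codec file could work, sketched with a made-up one-code-point-per-line format (the file name and format are illustrative, not an existing clstm feature):

```cpp
// Sketch: read code points from an external codec file (one hex code point
// per line; '#' starts a comment) and install them before training.
// Assumes the Codec::set interface from clstm.h.
#include <fstream>
#include <string>
#include <vector>

std::vector<int> load_codec_file(const std::string &path) {
  std::vector<int> chars;
  chars.push_back(0);  // reserved class 0, as in codecs built by clstm
  std::ifstream in(path);
  std::string line;
  while (std::getline(in, line)) {
    if (line.empty() || line[0] == '#') continue;   // skip blanks and comments
    chars.push_back(std::stoi(line, nullptr, 16));  // e.g. "4E2D" -> U+4E2D
  }
  return chars;
}

// usage: codec.set(load_codec_file("zh-common-6000.codec"));
```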
We often come across multilingual documents such as English-Chinese or Japanese-Chinese; characters from both scripts are valuable for our use case.
Resizing the output layer of the network after training is generally not possible, although it would be possible to precreate unused nodes and make up codec entries for them afterwards. On the other hand, this is a less than smart idea, as the performance impact is rather high even for rather small scripts and their combinations, e.g. Greek and Latin (codec size < 300). IMHO just retrain your models and invest some time to streamline the process. It's something you should be doing anyway, and it is quite a bit more straightforward than trying to repurpose already existing models. Finally, with Unihan it is actually quite a neat idea to train combined CJK models, as it shouldn't increase output layer size for the vast majority of glyphs in either Hanzi script. On the other hand, finding a network configuration that works for this multi-font model may take some hyperparameter exploration.
@striversist, I decided not to implement what I suggested before. It seems not to be such a good idea.
Loading a pretrained model to retrain on new samples causes an assert failure in `Codec::encode`, but when training from scratch this problem does not happen. See related issue #83.

After digging into the code a little, I found this clue, from `clstmocrtrain.cc` `main1`:

If training from scratch, `load_name` is empty, so execution goes to `trainingset.getCodec(codec);`. In this function, the chain `codec.build(gtnames, charsep);` -> `Codec::set` is executed, so the codes for all characters in the training samples are inserted into the encoder map.

If loading a pretrained model to retrain on new samples, `load_name` is not empty, and `clstm.load(load_name);` loads the pretrained codec into the encoder map. Then, in `Codec::encode`, if a new sample string contains a character that is not in the pretrained encoder map, `assert(encoder->count(c) > 0);` fails.

Hope the contributors fix this problem ASAP. A simplified sketch of the two paths follows below.
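A simplified paraphrase of the branching described above (condensed for illustration, not the verbatim clstmocrtrain.cc source):

```cpp
// Paraphrase of the two code paths in clstmocrtrain.cc main1 (simplified).
std::string load_name = getsenv("load", "");  // model file to resume from
CLSTMOCR clstm;

if (!load_name.empty()) {
  // Retraining path: the codec comes from the saved model file. Any character
  // in the new training data that is missing from this codec later trips
  // assert(encoder->count(c) > 0) inside Codec::encode.
  clstm.load(load_name);
} else {
  // From-scratch path: the codec is built from the training set itself via
  // codec.build(gtnames, charsep) -> Codec::set, so every character in the
  // training samples has an entry in the encoder map.
  Codec codec;
  trainingset.getCodec(codec);
  clstm.createBidi(codec.codec, nhidden);
}
```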