Optimized hyperparameters on "small" model, achieves 87 perplexity in 1 hour. #5
Conversation
That's nice! How long did it take to run the Bayesian optimization and find the ideal hyperparameters?
It took overnight. I'm still letting it run, and it's still exploring.
That is really cool. Just on a single GPU, or are you running parallel jobs?
Running 10 g2.2xlarges in parallel. The job suggestions come in via a pull from the Whetlab server.
Is it possible to compare Bayesian optimization vs. a simple logistic regression or binary search? It is unclear how to quantify that. Also, can you tell us how much tuning of the Bayesian optimizer's own hyperparameters is required...
There's a bit of an explanation here: https://www.whetlab.com/technology/

Also, here is an earlier paper going into a bit more depth on the original approach: http://papers.nips.cc/paper/4522-practical-bayesian-optimization-of-machine-learning-algorithms

The last link has comparisons to other hyperparameter optimization approaches, like the Tree Parzen method and random grid search, which we outperform by a large margin. I'm not sure how you would use logistic regression or binary search in this setting, since (a) gradients aren't readily available for the hyperparameters and (b) some parameters are categorical, some integer, and some floating point. We handle all of these cases.
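To make the last point concrete, here is a hypothetical declaration of such a mixed search space in Python. The dictionary format below is an illustrative assumption, not Whetlab's actual parameter schema:

```python
# Illustrative only: one way to describe a mixed hyperparameter space
# (categorical + integer + floating point), the kind of space a Bayesian
# optimizer can search but gradient-based or binary-search methods cannot.
search_space = {
    "optimizer":     {"type": "enum",  "options": ["sgd", "adagrad", "rmsprop"]},  # categorical
    "num_layers":    {"type": "int",   "min": 1,    "max": 4},                     # integer
    "rnn_size":      {"type": "int",   "min": 64,   "max": 512},                   # integer
    "batch_size":    {"type": "int",   "min": 16,   "max": 128},                   # integer
    "learning_rate": {"type": "float", "min": 1e-3, "max": 1.0},                   # floating point
    "dropout":       {"type": "float", "min": 0.0,  "max": 0.8},                   # floating point
}
```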
Wow, that's impressive! Maybe if you have a few processors free you could try parameter-tuning the Adam optimizer for a test problem?
Really nice stuff, thanks @alexbw
The modifications required to automatically tune the hyperparameters are minimal; pasting them here in case anyone wants to replicate the tuning process. You'll need a beta key, which I'd be glad to provide. If you're interested, there are two small changes required:
We treat a return value of NaN (0/0) as a constraint, meaning we learn to avoid similar jobs that would produce a failure. For deep nets, this usually means a memory error or a segfault (which can occur, e.g., for unusual batch sizes). We also treat a training time too far above an hour as a constraint in this example, because we want to train the best "fast" net. You could similarly train the best "small" net by providing constraints for models that won't fit on a smartphone or won't provide real-time classification.
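The code snippet originally pasted with this comment isn't preserved in this copy of the thread. As a rough sketch only, a tuning loop of the kind described above might look like the following; the `whetlab.Experiment` / `suggest` / `update` calls, the `access_token` argument, and the `train_small_lstm` helper are all illustrative assumptions rather than the verified (and now retired) client API, while the NaN-as-constraint convention is the one described in the comment:

```python
# Sketch of a Whetlab-style tuning loop for the "small" LSTM model.
# All API details here are assumptions for illustration; the real client
# may have differed. What it shows is the shape of the loop: ask the
# optimizer for hyperparameters, train, and report the outcome back,
# using NaN to flag constraint violations (crashes or over-long runs).
import math
import time

import whetlab  # hypothetical import of the (retired) Whetlab Python client


def train_small_lstm(**hyperparameters):
    """Hypothetical wrapper: run the Torch training script with the given
    hyperparameters and return validation perplexity, or float('nan') if
    the run fails (out-of-memory error, segfault, etc.)."""
    raise NotImplementedError("wire this up to the actual training script")


parameters = {  # a small mixed search space, as sketched earlier in the thread
    "learning_rate": {"type": "float", "min": 1e-3, "max": 1.0},
    "rnn_size":      {"type": "int",   "min": 64,   "max": 512},
    "optimizer":     {"type": "enum",  "options": ["sgd", "rmsprop"]},
}

experiment = whetlab.Experiment(              # assumed constructor
    name="small LSTM perplexity",
    description="Tune the 'small' model to minimize validation perplexity",
    parameters=parameters,
    outcome={"name": "negative perplexity"},  # assume the service maximizes its outcome
    access_token="<beta key>",                # the beta key mentioned above
)

TIME_BUDGET_SECONDS = 3600  # "too far above an hour" counts as a constraint

while True:  # each of the parallel workers runs a loop like this
    job = experiment.suggest()            # hyperparameters proposed by the optimizer

    start = time.time()
    perplexity = train_small_lstm(**job)
    elapsed = time.time() - start

    if math.isnan(perplexity) or elapsed > 1.5 * TIME_BUDGET_SECONDS:
        # Failed or far-too-slow run: report NaN so the optimizer learns
        # to avoid this region of the search space.
        experiment.update(job, float("nan"))
    else:
        experiment.update(job, -perplexity)  # report negative perplexity to maximize
```

The 1.5x slack on the hour budget is an arbitrary stand-in for "too far above an hour"; in practice you would pick whatever threshold matches the constraint you actually care about.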
👍 |
Applied Bayesian optimization (whetlab.com) to lower the perplexity of the "small" model (13 full epochs).