Optimized hyperparameters on "small" model, achieves 87 perplexity in 1 hour. #5

Open

alexbw wants to merge 3 commits into master
Conversation

@alexbw (Contributor) commented Feb 26, 2015

Applied Bayesian optimization (whetlab.com) to lower the perplexity of the "small" model (13 full epochs).

…message to explain why this merge is necessary,
# especially if it merges an updated upstream into a topic branch.
#
# Lines starting with '#' will be ignored, and an empty message aborts
# the commit.
@soumith (Collaborator) commented Feb 26, 2015

That's nice! How long did it take to run the Bayesian optimization and find the ideal hyperparameters?

@alexbw (Contributor, Author) commented Feb 26, 2015

It took overnight. I'm still letting it run; it's still exploring, so this number could get better.


@soumith (Collaborator) commented Feb 26, 2015

That is really cool. Just on a single GPU, or are you running parallel jobs?

@alexbw (Contributor, Author) commented Feb 26, 2015

Running 10 g2.2xlarges in parallel. The job suggestions come in via a pull
model over a REST API, so it is trivial to parallelize (no extra code
required at all). 10 is on the small side of what we usually do when trying
to break records.
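To make the pull model concrete, here is a minimal sketch of what one such worker loop could look like. Everything here is hypothetical: the endpoint names, payload fields, and `train_model` function are illustrative stand-ins, not the actual Whetlab beta API.

```python
# Hypothetical pull-model worker; endpoints and payloads are
# illustrative, not the real Whetlab REST API.
import requests

SERVER = "https://api.example-tuner.com/v1"   # placeholder URL
HEADERS = {"Authorization": "YOUR-BETA-KEY"}  # placeholder credential

def worker(train_model):
    """Run forever: pull a suggested setting, evaluate it, report back."""
    while True:
        # Each machine independently asks the server for its next job,
        # so parallelizing is just starting more copies of this loop.
        job = requests.get(SERVER + "/suggest", headers=HEADERS).json()
        perplexity = train_model(**job["params"])
        requests.post(SERVER + "/update", headers=HEADERS,
                      json={"job_id": job["id"], "outcome": perplexity})
```

Because each worker only ever pulls its own suggestion and pushes back one result, ten g2.2xlarge instances need no coordination logic beyond running ten copies of the loop.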


@soumith (Collaborator) commented Feb 26, 2015

Is it possible to compare Bayesian optimization to a simple logistic regression or binary search? It is unclear how to quantify that. Also, can you tell us how much tuning of the Bayesian optimizer's own hyperparameters is required?

@alexbw (Contributor, Author) commented Feb 26, 2015

There's a bit of an explanation here: https://www.whetlab.com/technology/
The core engine is based on research you can read about here: http://arxiv.org/abs/1502.05700

Also, here is an earlier paper going into a bit more depth on the original approach: http://papers.nips.cc/paper/4522-practical-bayesian-optimization-of-machine-learning-algorithms

The last link has comparisons to other hyperparameter optimization approaches, like the Tree Parzen method and random grid search, which we outperform by a large margin.

I'm not sure how you would use logistic regression or binary search in this setting, since (a) gradients aren't readily available for the hyperparameters, and (b) some parameters are categorical, some integer, and some floating point. We handle all of these cases.
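For illustration, a mixed search space of the kind just described might be declared along these lines (the schema below is a hypothetical sketch, not the exact Whetlab format):

```python
# Hypothetical mixed hyperparameter space: one categorical, one
# integer, and one floating-point parameter, as described above.
parameters = {
    "optimizer":     {"type": "enum",    "options": ["sgd", "adagrad", "rmsprop"]},
    "rnn_size":      {"type": "integer", "min": 50,   "max": 1000},
    "learning_rate": {"type": "float",   "min": 1e-4, "max": 1.0},
}
```

A gradient-based or bisection-style search has no natural way to move along the categorical `optimizer` axis, which is why a model-based optimizer is used instead.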

@ghost commented Feb 26, 2015

Wow that's impressive!

Maybe, if you have a few processors free, you could try tuning the Adam optimizer's parameters on a test problem?

[Feature request] A suitable test problem for Adam

@ajtulloch

Really nice stuff, thanks @alexbw

@alexbw (Contributor, Author) commented Feb 26, 2015

The modifications required to automatically tune the hyperparameters are minimal. Pasting here in case anyone wants to replicate the tuning process. You'll need a beta key, which I'd be glad to provide you.

If you're interested, there are two small changes required (see the sketch after this list):

1. We treat a return value of NaN (0/0) as a constraint, meaning the optimizer learns to avoid similar jobs that would produce a failure. For deep nets, this usually means a memory error or a segfault (which can occur, e.g., with odd batch sizes).
2. We treat a training time too far above an hour as a constraint in this example, because we want to train the best "fast" net. You could similarly train the best "small" net by adding constraints for models that won't fit on a smartphone or won't classify in real time.
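As a rough illustration of those two changes, the wrapper below reports NaN for both failure modes. This is a hedged sketch of the convention described above, not the code originally pasted; `train_model` and the exact time budget are assumptions.

```python
import time

TIME_BUDGET = 1.2 * 3600  # seconds; assumed margin for "too far above an hour"

def evaluate(params, train_model):
    """Return validation perplexity, or NaN to flag the setting as infeasible."""
    start = time.time()
    try:
        perplexity = train_model(**params)
    except (MemoryError, RuntimeError):
        # A memory error or crash becomes a constraint violation, so the
        # optimizer learns to avoid similar jobs.
        return float("nan")
    if time.time() - start > TIME_BUDGET:
        # Trained too slowly for the "fast net" goal: also infeasible.
        return float("nan")
    return perplexity
```

Swapping the time check for, say, a parameter-count check would give the "best small net" variant mentioned above.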

@ajtulloch

👍
