Optimized hyperparameters on "small" model, achieves 87 perplexity in 1 hour. #5

Open

alexbw wants to merge 3 commits into master
Conversation

@alexbw (Contributor) commented Feb 26, 2015

Applied Bayesian optimization (whetlab.com) to lower the perplexity of the "small" model (13 full epochs).

…message to explain why this merge is necessary,
# especially if it merges an updated upstream into a topic branch.
#
# Lines starting with '#' will be ignored, and an empty message aborts
# the commit.
@soumith (Collaborator) commented Feb 26, 2015

That's nice! How long did it take to run the Bayesian optimization and find the ideal hyperparameters?

@alexbw (Contributor, Author) commented Feb 26, 2015

It took overnight. I'm still letting it run; it's still exploring, so this number could get better.


@soumith (Collaborator) commented Feb 26, 2015

That is really cool. Just on a single GPU, or are you running parallel jobs?

@alexbw (Contributor, Author) commented Feb 26, 2015

Running 10 g2.2xlarges in parallel. The job suggestions come in via a pull
model over a REST API, so it is trivial to parallelize (no extra code
required at all). 10 is on the small side of what we usually do when trying
to break records.
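To make the pull model concrete, here is a minimal sketch of what one such worker loop could look like. Everything here is hypothetical: the endpoint names, payload fields, and `train_model` function are illustrative stand-ins, not the actual Whetlab beta API.

```python
# Hypothetical pull-model worker; endpoints and payloads are
# illustrative, not the real Whetlab REST API.
import requests

SERVER = "https://api.example-tuner.com/v1"   # placeholder URL
HEADERS = {"Authorization": "YOUR-BETA-KEY"}  # placeholder credential

def worker(train_model):
    """Run forever: pull a suggested setting, evaluate it, report back."""
    while True:
        # Each machine independently asks the server for its next job,
        # so parallelizing is just starting more copies of this loop.
        job = requests.get(SERVER + "/suggest", headers=HEADERS).json()
        perplexity = train_model(**job["params"])
        requests.post(SERVER + "/update", headers=HEADERS,
                      json={"job_id": job["id"], "outcome": perplexity})
```

Because each worker only ever pulls its own suggestion and pushes back one result, ten g2.2xlarge instances need no coordination logic beyond running ten copies of the loop.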


@soumith (Collaborator) commented Feb 26, 2015

Is it possible to compare Bayesian optimization to a simple logistic regression or binary search? It is unclear how to quantify that. Also, can you tell us how much tuning of the Bayesian optimizer's own hyperparameters is required?

@alexbw (Contributor, Author) commented Feb 26, 2015

There's a bit of an explanation here: https://www.whetlab.com/technology/
The core engine is based on research you can read about here: http://arxiv.org/abs/1502.05700

Also, here is an earlier paper going into a bit more depth on the original approach: http://papers.nips.cc/paper/4522-practical-bayesian-optimization-of-machine-learning-algorithms

The last link has comparisons to other hyperparameter optimization approaches, like the Tree Parzen method and random grid search, which we outperform by a large margin.

I'm not sure how you would use logistic regression or binary search in this setting, since (a) gradients aren't readily available for the hyperparameters, and (b) some parameters are categorical, some integer, and some floating point. We handle all of these cases.
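For illustration, a mixed search space of the kind just described might be declared along these lines (the schema below is a hypothetical sketch, not the exact Whetlab format):

```python
# Hypothetical mixed hyperparameter space: one categorical, one
# integer, and one floating-point parameter, as described above.
parameters = {
    "optimizer":     {"type": "enum",    "options": ["sgd", "adagrad", "rmsprop"]},
    "rnn_size":      {"type": "integer", "min": 50,   "max": 1000},
    "learning_rate": {"type": "float",   "min": 1e-4, "max": 1.0},
}
```

A gradient-based or bisection-style search has no natural way to move along the categorical `optimizer` axis, which is why a model-based optimizer is used instead.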

@ghost commented Feb 26, 2015

Wow that's impressive!

Maybe, if you have a few processors free, you could try tuning the Adam optimizer's parameters on a test problem?

[Feature request] A suitable test problem for Adam

@ajtulloch

Really nice stuff, thanks @alexbw

@alexbw (Contributor, Author) commented Feb 26, 2015

The modifications required to automatically tune the hyperparameters are minimal. Pasting here in case anyone wants to replicate the tuning process. You'll need a beta key, which I'd be glad to provide you.

If you're interested, there are two small changes required (see the sketch after this list):

1. We treat a return value of NaN (0/0) as a constraint, meaning the optimizer learns to avoid similar jobs that would produce a failure. For deep nets, this usually means a memory error or a segfault (which can occur, e.g., with odd batch sizes).
2. We treat a training time too far above an hour as a constraint in this example, because we want to train the best "fast" net. You could similarly train the best "small" net by adding constraints for models that won't fit on a smartphone or won't classify in real time.
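As a rough illustration of those two changes, the wrapper below reports NaN for both failure modes. This is a hedged sketch of the convention described above, not the code originally pasted; `train_model` and the exact time budget are assumptions.

```python
import time

TIME_BUDGET = 1.2 * 3600  # seconds; assumed margin for "too far above an hour"

def evaluate(params, train_model):
    """Return validation perplexity, or NaN to flag the setting as infeasible."""
    start = time.time()
    try:
        perplexity = train_model(**params)
    except (MemoryError, RuntimeError):
        # A memory error or crash becomes a constraint violation, so the
        # optimizer learns to avoid similar jobs.
        return float("nan")
    if time.time() - start > TIME_BUDGET:
        # Trained too slowly for the "fast net" goal: also infeasible.
        return float("nan")
    return perplexity
```

Swapping the time check for, say, a parameter-count check would give the "best small net" variant mentioned above.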

@ajtulloch

👍
