initial skorch hyperparam opt implementation #149
cc @stsievert
Thanks for the PR @ToddMorrill! I can see where I need to improve some of the documentation, and have fixed some other issues:
fairly cryptic error message.
Resolved in dask/dask-ml#670. Looks like you have a typo; I left a comment below on the appropriate line. For completeness, here's the full traceback:
  File "/Users/scott/anaconda3/envs/dask-ml-docs/lib/python3.6/site-packages/distributed/utils.py", line 665, in log_errors
    yield
  File "/Users/scott/Developer/stsievert/dask-ml/dask_ml/model_selection/_incremental.py", line 115, in _create_model
    model = clone(model).set_params(**params)
  File "/Users/scott/anaconda3/envs/dask-ml-docs/lib/python3.6/site-packages/skorch/net.py", line 1424, in set_params
    self.initialize_module()
  File "/Users/scott/anaconda3/envs/dask-ml-docs/lib/python3.6/site-packages/skorch/net.py", line 467, in initialize_module
    module = module(**kwargs)
TypeError: __init__() got an unexpected keyword argument 'filter_size'
"source": [ | ||
"# takes some time to numericalize the whole dataset\n", | ||
"\n", | ||
"# also notice that skorch and dask expect numpy arrays, which isn't ideal since it ties you to the cpu.\n", |
I know skorch accepts some other formats: https://skorch.readthedocs.io/en/stable/user/FAQ.html#faq-how-do-i-use-a-pytorch-dataset-with-skorch. Why doesn't this work here?
The challenge is solving for variable length features. With images or tabular datasets, your feature set size is fixed. With text, your feature length varies by batch due to the varying sequence lengths in your batch.
For Skorch, I need to see if I can use a collate_fn somewhere. I want to pad to the longest sequence length in the batch instead of padding to the longest sequence length in the dataset, which would save significant compute time.
Here's another potential solution that I need to take a closer look at.
As for Dask, it's not clear to me how to get around the fixed shape of Dask arrays. Maybe instead of using the numericalized representation (from torchtext), you could work with raw text in the Dask array and then try to preprocess it somewhere else.
In any event, these approaches involve some tinkering, whereas torchtext has solved these problems.
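To make the batch-level padding idea above concrete, here is a minimal pure-Python sketch. The toy token ids, the helper names (`pad_to`, `pad_batch`, `count_cells`), and the pad id of 0 are all assumptions for illustration, not code from the notebook. It counts how many token slots a model would process under each padding strategy.

```python
def pad_to(seqs, length, pad_id=0):
    """Right-pad every sequence in `seqs` to `length` with `pad_id`."""
    return [list(s) + [pad_id] * (length - len(s)) for s in seqs]


def pad_batch(batch, pad_id=0):
    """Collate-style helper: pad only to the longest sequence in this batch."""
    longest = max(len(s) for s in batch)
    return pad_to(batch, longest, pad_id)


def count_cells(dataset, batch_size, per_batch):
    """Total token slots (real + pad) processed over one pass of the data."""
    global_max = max(len(s) for s in dataset)
    total = 0
    for i in range(0, len(dataset), batch_size):
        batch = dataset[i:i + batch_size]
        padded = pad_batch(batch) if per_batch else pad_to(batch, global_max)
        total += sum(len(row) for row in padded)
    return total


# Hypothetical token-id sequences; one long outlier dominates the dataset.
dataset = [[1, 2], [3, 4, 5], [6], [7, 8], [9] * 10, [1, 2, 3]]
batch_cells = count_cells(dataset, batch_size=2, per_batch=True)
dataset_cells = count_cells(dataset, batch_size=2, per_batch=False)
print(batch_cells, dataset_cells)
```

In PyTorch, a function shaped like `pad_batch` (returning tensors rather than lists) is what would be passed as the `collate_fn` argument of `DataLoader`.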
Variable-length features sound like something that won't work out of the box. It sounds like you've got a handle on it with collate_fn and the Skorch issue. I'm not sure how the validation/chunk splitting works with Hyperband/etc. though. It'd be great to have some practical use!
In the past, I've run into variable-length features with a bag-of-words count using Scikit-learn's CountVectorizer. To resolve this, I used the HashingVectorizer, which is an approximate version of CountVectorizer. I'm not sure if that's relevant.
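The reason a hashing approach sidesteps variable-length features can be shown with a small sketch of the hashing trick. This is only the underlying idea, written with the standard library; it is not scikit-learn's HashingVectorizer, which uses its own hash function and options. The function name `hash_features` and the choice of 16 buckets are assumptions for illustration.

```python
import hashlib


def hash_features(tokens, n_features=16):
    """Map any number of tokens to a fixed-length count vector."""
    vec = [0] * n_features
    for tok in tokens:
        # A stable hash, so the same token always lands in the same bucket.
        digest = hashlib.md5(tok.encode("utf-8")).hexdigest()
        vec[int(digest, 16) % n_features] += 1
    return vec


short = hash_features("a good movie".split())
long = hash_features(
    "a very long review with many previously unseen words".split()
)
print(len(short), len(long))
```

Both documents produce vectors of the same length, and no vocabulary has to be built or shared between workers. The cost is that distinct tokens can collide in the same bucket.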
It sounds like you've got a handle on it with collate_fn and the Skorch issue
This is working now in Skorch.
I'm not sure how the validation/chunk splitting works with Hyperband/etc though
That's actually a question I have in my most recent commit. I'm not sure if my training collate_fn is being handled differently from my validation collate_fn.
I used the HashingVectorizer, which is an approximate version of CountVectorizer. I'm not sure if that's relevant.
It's totally relevant because it solves the same issue! I'm just trying to find a solution that follows the typical workflow of a deep learning practitioner (i.e. padding at the batch level).
HashingVectorizer ... same issue
Does PyTorch have an equivalent to HashingVectorizer, or can it work with Scikit-learn's HashingVectorizer? I agree, it's useful to highlight the use of DataLoader but it'd also be nice to see an alternative approach that's better for distributed computation.
It sounds like you've got a handle on it with collate_fn and the Skorch issue
This is working now in Skorch.
👍
"\n", | ||
"# it's not immediately obvious to beginners how all these parameters interact with each other\n", | ||
"max_iter = n_params\n", | ||
"chunk_size = n_examples // n_params" |
Would it help to add the following to the "Notes" section of the HyperbandSearchCV docstring?
One feature of Hyperband and the underlying mathematics is that the iteration count max_iter determines the number of parameters that need to be sampled.
- add to docs
Yes. That would be helpful. My frame of reference is sklearn's RandomizedSearchCV, which uses both n_iter and cv parameters. Separately, with Skorch + RandomizedSearchCV, I specify how many epochs to train for. With n_iter, cv, and epochs, it's clear to me how much computation will take place.
When I started looking at Hyperband, I was struggling to map those parameters above to Hyperband. My intuition was that n_params and n_iter were equivalent and that if they were both the same value, you would get an apples-to-apples comparison between RandomizedSearchCV and Hyperband on time-to-compute and accuracy of the model found.
Just so I'm clear, when we set n_params (which then flows into max_iter), it's only loosely related to n_iter in RandomizedSearchCV, is that right?
I also may need to go reread your paper to develop some more intuition.
n_params and n_iter were equivalent and that if they were both the same value, you would get an apples-to-apples comparison between RandomizedSearchCV and Hyperband on time-to-compute and accuracy of the model found.
n_params and n_iter have the same meaning: they both mean "sample (approximately) this many parameters/initialize this many models."
If n_params == n_iter, HyperbandSearchCV will find the same score as RandomizedSearchCV with high probability. However, HyperbandSearchCV will do a lot less work.
If RandomizedSearchCV and HyperbandSearchCV do the same amount of work, HyperbandSearchCV will find scores that are a lot higher.
Numbers/graphs behind these statements are in the Dask-ML docs at "Hyper Parameter Search > Hyperband performance." These are the same figures shown in the paper.
- I think it'd help to rename n_models to n_params_actual in search.metadata. Is that accurate?
That's all very helpful, thanks @stsievert. Whenever I see something like n_params = 299 # sample about 300 parameters I'm not sure what to make of it. Is there a reason you chose 299?
I wonder if n_params_searched (though I don't like that it's past tense) would be better, to convey that you're searching for that many unique hyperparameter configurations. Also, I'm not sure if there is an attribute that could show users which n hyperparameter configurations are actually planned for the search (or are these hyperparameters adaptively selected?).
to convey that you're searching for that many unique hyperparameter configurations
Maybe the best option is to add a note in the rule of thumb saying n_params is approximate, and the true value can be found in search.metadata["n_params"].
Is there a reason you chose 299?
I chose 299 to make the Dask array chunk evenly. I think with 300 there was one chunk with few examples. With 299 all chunks were the same size.
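A small sketch of why one choice of n_params can chunk evenly while a neighbouring one does not, given chunk_size = n_examples // n_params as in the notebook. The example count of 59,800 is a made-up number chosen so the arithmetic is easy to follow, and `chunk_sizes` is a hypothetical helper that mimics fixed-size chunking with a smaller final remainder chunk.

```python
def chunk_sizes(n_examples, n_params):
    """Sizes of the chunks produced when splitting into fixed-size pieces."""
    chunk_size = n_examples // n_params
    full, remainder = divmod(n_examples, chunk_size)
    sizes = [chunk_size] * full
    if remainder:
        # Leftover examples end up in one extra, smaller chunk.
        sizes.append(remainder)
    return sizes


n_examples = 59_800  # hypothetical dataset size
even = chunk_sizes(n_examples, 299)
uneven = chunk_sizes(n_examples, 300)
print(len(even), set(even))
print(len(uneven), uneven[-1])
```

With 299 every chunk has 200 examples; with 300 there are 300 chunks of 199 examples plus one trailing chunk of 100.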
"EPOCHS = 5\n", | ||
"NUM_TRAINING_EXAMPLES = len(train)*.8\n", | ||
"n_examples = EPOCHS * NUM_TRAINING_EXAMPLES\n", | ||
"n_params = 8\n", |
I'm not sure Hyperband is relevant when n_params = 8. Hyperband is an early stopping scheme, and there's not much early stopping to be done when max_iter = n_params = 8.
Actually I take that back. Hyperband is still (somewhat) relevant. It'll get more relevant if n_params is higher.
search.metadata["partial_fit_calls"] is the total number of calls to partial_fit, not the number of calls to any one model. No model will see more than max_iter = n_params = 8 calls to partial_fit.
With Hyperband, n_models = 5 parameters are sampled to see if they're the best parameters (n_models is in search.metadata). If RandomizedSearchCV were used instead with the same amount of work, only 2 parameters can be sampled to be considered the best.
>>> # Setup as in the notebook
>>> assert len(X_train) == 25000
>>> n_examples = 5 * len(X_train)
>>> n_params = 8
>>> chunk_size = n_examples // n_params
>>>
>>> # How much data will be fed to the models with Hyperband?
>>> hyperband_eg = 26 * chunk_size
>>> # How many models could we fit if we used randomized search? Randomized search gives all models an equal number of examples.
>>> hyperband_eg / n_examples
2.6
- TODO: add this example to the docs, or in search.metadata.
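For intuition about where n_models = 5 comes from when max_iter = 8, the bracket schedule from the Hyperband paper can be sketched as below. This is a sketch under assumptions: `hyperband_brackets` is a hypothetical helper, eta=3 is an assumed reduction factor, and Dask-ML's implementation may round differently, so search.metadata remains the authoritative source.

```python
def hyperband_brackets(max_iter, eta=3):
    """Return (n_models, first_rung_iters) for each Hyperband bracket."""
    # Largest s with eta**s <= max_iter, found with integer arithmetic to
    # avoid floating-point error in log().
    s_max = 0
    while eta ** (s_max + 1) <= max_iter:
        s_max += 1

    brackets = []
    for s in range(s_max, -1, -1):
        # n = ceil((s_max + 1) * eta**s / (s + 1)), via integer ceiling.
        n_models = -(-(s_max + 1) * eta ** s // (s + 1))
        # Iterations each model receives before the first round of stopping.
        first_rung_iters = max_iter / eta ** s
        brackets.append((n_models, first_rung_iters))
    return brackets


brackets = hyperband_brackets(8)
total_models = sum(n for n, _ in brackets)
print(brackets, total_models)
```

For max_iter = 8 this gives two brackets: an aggressive one that starts 3 models on a small budget, and a conservative one that trains 2 models for the full 8 iterations, for 5 models in total.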
I think this is a really helpful comparison for people - to see how you're able to sample many more parameters with Hyperband.
I can increase n_params but my frame of reference was that n_params is functionally equivalent to n_iter in RandomizedSearchCV, where higher numbers just mean more compute. If n_params can be increased without increasing the compute cost, then fantastic! If that's the case, then this would be useful to include in the docs.
I need to understand how more params impact GPU memory utilization as well. If models are only initialized one at a time (assuming 1 single GPU) then this is likely fine. The key thing is that those models need to be unpersisted somehow, otherwise you will fill up GPU memory.
"outputs": [], | ||
"source": [ | ||
"# define parameter grid\n", | ||
"params = {'module__filter_size': [(1,2,3), (2, 3, 4), (3, 4, 5)], \n", |
Typo:
- {'module__filter_size':
+ {'module__filter_sizes':
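The traceback earlier in the thread follows from how double-underscore parameters are forwarded: everything after `module__` becomes a keyword argument to the module's constructor. The sketch below mimics that routing in plain Python. It is not skorch's actual implementation, and `TextCNN` and `build_module` are hypothetical stand-ins.

```python
class TextCNN:
    """Hypothetical stand-in for the notebook's PyTorch module."""

    def __init__(self, filter_sizes=(3, 4, 5)):
        self.filter_sizes = filter_sizes


def build_module(module_cls, **params):
    """Forward every `module__<name>` parameter to the module constructor."""
    prefix = "module__"
    kwargs = {
        key[len(prefix):]: value
        for key, value in params.items()
        if key.startswith(prefix)
    }
    return module_cls(**kwargs)


ok = build_module(TextCNN, module__filter_sizes=(2, 3, 4))

try:
    # Missing trailing "s": the constructor has no `filter_size` argument.
    build_module(TextCNN, module__filter_size=(2, 3, 4))
except TypeError as exc:
    error = str(exc)
    print(error)
```

Because the name is only checked when the constructor is called, the typo surfaces deep inside the search as a TypeError instead of when the parameter grid is defined.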
Thank you!!
I fixed the typo you pointed out (thanks!) and noticed a couple other things along the way.
It turns out that I was running out of GPU memory because of the way that I was passing in pretrained embeddings (I was using all of GloVe instead of a 25k subset of vectors). So that's fixed now and memory utilization is staying much lower.
Another observation is that GPU memory utilization is monotonically increasing and I haven't been able to reduce it by deleting PyTorch or Skorch objects, which should garbage collect those objects and free memory. I'm wondering if this has something to do with working in a distributed environment, where deleting the object in the Jupyter Notebook doesn't delete references to GPU memory on the workers. When I keyboard interrupted the process, the workers got restarted and memory utilization dropped down to zero.
I'm also getting an error because my [...]
The script now runs but it takes a long time on a single small-ish GPU (~60 minutes). I'm hoping to try this out on a GPU cluster soon. I suspect for big hyperparameter optimization jobs you'd want a fairly large cluster of GPUs (e.g. 4+) to get through these jobs in a reasonable amount of time, which does put up a bit of a barrier to entry for an example demo and any practitioners that can't afford that. I could probably reduce the dataset size and the model would still converge. I'm just trying to create as "real" an example as possible.
Thanks for this use case. I've got some of the related fixes in dask/dask-ml#671 (which will remain a draft until this PR is merged). Please comment in that PR with your questions and/or suggestions.
I'd be careful with excessive computation. These examples run on Binder, which has pretty serious limits on computation. GPUs are definitely out of scope, and I hesitate to do any computation that takes more than ~10 seconds. I've included cells like this before:

# Make sure the computation isn't too excessive for this simple example.
max_iter = 9
# max_iter = 243  # uncomment this line for more realistic usage

I typically don't look for performance comparisons in examples like this. I tend to leave that for papers/documentation. Instead, I tend to run these examples to figure out how to use the tool. To me, the most salient questions this example answers are the following:
The last question might warrant another PR. The GPU usage is interesting and good to see. I'd definitely add a note saying this works on GPUs (and probably some code to put it on the GPU if available). Importantly, I don't think a GPU should be required.
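One way the "use the GPU if available, but don't require it" suggestion could look is sketched below. `pick_device` is a hypothetical helper, and it also falls back to the CPU when PyTorch is not installed, which is an extra assumption made so the snippet runs anywhere.

```python
def pick_device():
    """Return "cuda" when PyTorch can see a GPU, otherwise "cpu"."""
    try:
        import torch
    except ImportError:
        # No PyTorch in this environment; nothing to put on a GPU.
        return "cpu"
    return "cuda" if torch.cuda.is_available() else "cpu"


device = pick_device()
print(device)

# With skorch, the result can be passed as the `device` argument, e.g.
# NeuralNetClassifier(MyModule, device=device), where MyModule is a
# placeholder for the notebook's module.
```

On Binder this resolves to "cpu", so the same notebook would run unchanged with or without a GPU.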
Thanks for all the feedback so far. The example is coming together. I implemented the [...] The custom [...]
The crux of the issue is that Hyperband doesn't appear to be handling the validation data correctly (though frankly, I can't tell if it's handling my training data correctly either - not sure if this would be better in a .py where I might see more log output). It looks like Hyperband is passing my validation data through my [...]
Do you have any thoughts on why the handling of the validation data in Hyperband would differ from the training data? Separately, is there any reason to think that a [...]
EDIT: do you think this has anything to do with skorch-dev/skorch#641?
Sorry for the delayed response. I'll have more time to respond a week from now.
Could the issue be with out-of-vocabulary words? There might be a word in the validation set that's not in the training set. That's the first idea that comes to mind, especially because you're passing in a list of strings. If that's the issue, use of HashingVectorizer would resolve it.
From what I can tell, you're handling the test data correctly: it appears you're only running the test data through the model once at the very end. How are you confused?
Yes, especially with text data. Do the train, validation and test sets all have the same vocabulary? If not, you could probably get around it with something like:

def pad_batch(batch, TEXT, LABEL):
    text, label = list(zip(*batch))
    text = [word for word in text if word in TRAIN_VOCAB]
    # ... rest of function untouched

for some appropriately defined [...]
I suspect this is coming into play with these lines:

train_dataloader = DataLoader(..., collate_fn=pad_batch_partial)  # In[20]
test_dataloader = DataLoader(..., collate_fn=pad_batch_partial)  # In[32]

I wouldn't use a grid search with Hyperband. I prefer random searches.
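The out-of-vocabulary concern raised in this thread can be illustrated with a common alternative to dropping unknown words: mapping them to a reserved token. This is a minimal sketch with toy data, not the notebook's torchtext pipeline; `UNK`, `build_vocab`, and `numericalize` are hypothetical names.

```python
UNK = "<unk>"


def build_vocab(train_docs):
    """Assign an integer id to every token seen in the training documents."""
    vocab = {UNK: 0}
    for doc in train_docs:
        for tok in doc.split():
            vocab.setdefault(tok, len(vocab))
    return vocab


def numericalize(doc, vocab):
    """Convert a document to ids, sending unseen tokens to the UNK id."""
    return [vocab.get(tok, vocab[UNK]) for tok in doc.split()]


train_docs = ["a great film", "a dull film"]
vocab = build_vocab(train_docs)

# "soundtrack" never appeared in training, so it maps to id 0.
ids = numericalize("a great soundtrack", vocab)
print(ids)
```

Either approach keeps validation and test batches from producing ids the embedding layer has never seen. Mapping to a reserved token keeps sequence lengths unchanged, whereas filtering shortens them.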
@stsievert, my apologies for the delay in picking this back up. The good news: we’re up and running! After working with [...] But hindsight is 20/20. Most problems appear to stem from the need to use dask arrays. In [...] Here are two options (as I see it) for using [...]
Here’s how I’m thinking about the decision to use or not use [...] This analysis doesn’t rigorously address the time it takes for a model to converge under standard batching semantics (i.e. pad at the batch level) vs. padding to the longest example in the dataset. The model may take longer to converge and/or may not be able to achieve peak performance (e.g. accuracy, f1, etc.) by padding to the longest example in the dataset. Sadly, I don’t know enough about [...] I’m happy to discuss further and answer any questions that you have. I’m also happy to run any edits on this example to polish it up but in terms of design patterns, I think we’ve explored most of the obvious options.
I have one more basic question:
I'd expect the model to converge at the same rate regardless of batching semantics. I'd expect the model to have identical output for identical inputs, regardless of whether the input is padded to the longest example in the batch or in the dataset. Why isn't that the case, or am I misunderstanding?
I like this solution best because arrays of strings are passed between workers. Why is this implementation bad practice?

import numpy as np
import skorch
import torch

class PreprocessInputs(skorch.NeuralNetClassifier):
    def __init__(self, preprocessing, **kwargs):
        self.preprocessing = preprocessing
        super().__init__(**kwargs)

    def partial_fit(self, X: np.ndarray, y=None):
        X_processed = self.preprocess(torch.from_numpy(X))
        return super().partial_fit(X_processed, y=y)

    def preprocess(self, X):
        # Apply the user-supplied preprocessing without tracking gradients.
        with torch.no_grad():
            return self.preprocessing(X)

This implementation is more usable because the model is one atomic unit. That is, no outside knowledge is needed on specific methods to preprocess or normalize the input. We could fold this implementation into Dask-ML, but it'd basically be doing the same thing as this implementation.
It's a bit hand wavy but I've observed very different loss metrics when training on data padded at the batch level (lower loss scores) vs. data padded at the dataset level (higher loss scores). Yes, when I say convergence, I mean that the model is actually training effectively and improving with each iteration. I'd have to spend some more time running experiments to determine if a model trained on data padded to the longest example in the dataset would be able to achieve the same level of performance (e.g. accuracy, f1, etc.) as one trained with batch level padding.
Suppose you've trained a classification deep learning model. Further, let's suppose you prepare one single example that you want a prediction for. If you pad that example, you will get one predicted probability distribution. If you don't pad that example, you will get a different probability distribution. Without more experimentation, the question still remains, how big is that difference? I agree with you, padding should not play a huge role, in general, especially when you might only see a dozen or so pad tokens in a typical batch. However, in our case, we're talking about 1000s of unnecessary pad tokens. It probably will have some sort of impact on the predicted probability distributions.
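A toy illustration of how padding can shift a model's output: if a layer averages over every position, pad tokens dilute the result unless they are masked out. This is a deliberately simplified stand-in, not the PR's model, and whether that model masks padding is not stated in this thread. `mean_pool` and the numbers are assumptions.

```python
def mean_pool(values, pad_value=0.0, mask_padding=False):
    """Average per-token activations, optionally ignoring pad positions."""
    if mask_padding:
        values = [v for v in values if v != pad_value]
    return sum(values) / len(values)


tokens = [2.0, 4.0, 6.0]      # activations for three real tokens
padded = tokens + [0.0] * 9   # the same example padded to length 12

unpadded_score = mean_pool(tokens)
naive_score = mean_pool(padded)
masked_score = mean_pool(padded, mask_padding=True)
print(unpadded_score, naive_score, masked_score)
```

The unmasked average drops from 4.0 to 1.0 once nine pad positions are added, while the masked average is unchanged. With thousands of pad tokens per example, an unmasked model would see a correspondingly larger shift.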
There's nothing inherently wrong about it. It's just not the typical design pattern that you'll see in the wild, where people tend to experiment with model architectures much more frequently than they experiment with their preprocessing setup. For example, I implemented parts of that approach in a previous commit. We can pursue it further - I just worry about adoption of this pattern.
It's been a while since there was any activity here. @ToddMorrill is there still any interest in getting this into a mergeable state?
Hey guys, I want to get the conversation started on this. I have a v1 implementation of an example using PyTorch + Skorch for a text classification problem. I'm then using Dask's Hyperband grid search algo to find the best hyperparameters. It ran successfully once and then I made some more changes and it's now failing with a fairly cryptic error message.
If you have some pointers, I can run edits. Meanwhile, I'll keep looking at it for potential bugs.
Separately, I'd love to get your thoughts on how to make better use of torchtext in the current pipeline. The way I'm preparing training data is causing a lot of extra compute and totally breaks the batching semantics of torchtext and deep learning models in general.