Remove assumption that all epochs have same number of batches #411

Closed
mrcslws wants to merge 3 commits from the num-batches-assumption branch

Conversation

mrcslws
Contributor

@mrcslws mrcslws commented Nov 13, 2020

Mixins and experiments that vary the batch size (e.g. the VaryBatchSize mixin) or the samplers (e.g. ContinualLearningExperiment) will now be compatible with code that relies on step counts (e.g. OneCycleLR).
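For context, a minimal sketch of the step-count dependency in question (hypothetical numbers, not from the codebase): OneCycleLR needs total_steps up front, and that total can no longer be computed as epochs * len(loader) once epochs differ in length.

import torch
from torch.optim.lr_scheduler import OneCycleLR

# Hypothetical per-epoch batch counts, e.g. with a VaryBatchSize-style schedule.
batches_per_epoch = [100, 50, 25]

model = torch.nn.Linear(10, 2)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

# The old assumption was total_steps = num_epochs * len(train_loader), which breaks
# as soon as epochs have different numbers of batches. Sum the per-epoch counts instead.
total_steps = sum(batches_per_epoch)  # 175, not 3 * 100
scheduler = OneCycleLR(optimizer, max_lr=1.0, total_steps=total_steps)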

@mrcslws
Contributor Author

mrcslws commented Nov 13, 2020

FYI I'm currently trying to make this more functional. The PR as-is mutates a central dataloader. It would be easier to understand and more robust if the dataloader was created in a functional way, rather than being reused, though its internals like the dataset and task indices should be reused. (That was my original plan, and now I'm trying it again.)

mrcslws force-pushed the num-batches-assumption branch from fc57ad3 to 7a321fc on November 16, 2020 at 16:10
@mrcslws
Contributor Author

mrcslws commented Nov 16, 2020

This PR depends on #414 (to avoid breaking the meta_cl_experiment). The regression tests pass, and there's no measurable slowdown in CPU profiles of the imagenet experiment (pre_epoch is negligible).

Pasting second commit message:

Functional programming approach to dataloaders
Rather than mutating a central dataloader over time, create new dataloaders on-demand (while storing internal immutable state like datasets). This reduces potential for bugs, and it also makes the create_train_dataloader method fully capable even when called as a classmethod.
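As a rough illustration of that shape (names here are made up, not the actual experiment API): the dataset is built once and stored, while the loader is recreated for each epoch.

import torch
from torch.utils.data import DataLoader, TensorDataset

class FunctionalLoaderSketch:
    """Hypothetical sketch of the functional approach to dataloaders."""

    def setup_experiment(self, config):
        # Immutable state (the dataset) is created once and reused.
        self.config = config
        self.train_dataset = TensorDataset(torch.randn(64, 10), torch.randint(0, 2, (64,)))

    def create_train_dataloader(self, config, epoch):
        # A fresh loader is created on demand for each epoch, so per-epoch settings
        # (batch size, sampler) never require mutating a long-lived loader.
        batch_size = config.get("batch_size", 8) * (2 ** epoch)  # illustrative VaryBatchSize-style schedule
        return DataLoader(self.train_dataset, batch_size=batch_size)

    def pre_epoch(self, epoch):
        self.train_loader = self.create_train_dataloader(self.config, epoch)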

mrcslws force-pushed the num-batches-assumption branch 2 times, most recently from 74d8c76 to d7f6f36 on November 16, 2020 at 16:26
@mvacaporale
Contributor

What are the instances where we vary the dataloader per epoch? Of course, VaryBatchSize is one of them, but are there others?

I'm not sure how I feel about such a fundamental change for only one class of experiments. I'm sure you thought extensively about what features you're trying to support, so I'm curious what you have in mind with these changes.

@mvacaporale
Contributor

I also was a fan of calling self.loader.sampler.set_active_tasks(). It really makes it explicit what's happening in the code. As far as I understand, we could still create the samplers only once as we do with the datasets, correct? Do you see any issues with that? Even though the dataloaders may change every epoch, we don't necessarily have to change the samplers at every epoch.

@mrcslws
Contributor Author

mrcslws commented Nov 19, 2020

Any ContinualLearningExperiment may also vary the number of batches per epoch, even if unintentionally. More generally, this opens us up to running different tasks or groups of tasks in different epochs without breaking built-in assumptions (like self.total_batches). That second use case was enough to convince me that we want this flexibility.

(Plus, VaryBatchSize isn't that obscure of a case. We happen to implement it as a mixin, but we could have just as easily built it in.)

@mrcslws
Contributor Author

mrcslws commented Nov 19, 2020

Is it any less explicit here? We're still calling set_active_tasks. When we create the sampler, we create it for a specific epoch number. It's different, but still explicit. And now we don't have to store a bunch of state in our heads to understand and debug the code.
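Concretely, something along these lines (an illustrative stand-in for the real task sampler, not the project's actual class): the set_active_tasks call is still explicit, it just happens when the sampler is created for a given epoch.

# Illustrative sketch only; names are hypothetical.
class TaskSamplerSketch:
    def __init__(self, indices_by_task):
        self.indices_by_task = indices_by_task
        self.active_indices = []

    def set_active_tasks(self, task):
        self.active_indices = list(self.indices_by_task[task])

    def __iter__(self):
        return iter(self.active_indices)

    def __len__(self):
        return len(self.active_indices)


def create_train_sampler(config, dataset, epoch):
    # dataset is unused in this toy sketch; kept for signature parity.
    sampler = TaskSamplerSketch(config["indices_by_task"])
    sampler.set_active_tasks(epoch % config["num_tasks"])  # still an explicit, per-epoch call
    return sampler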

@mvacaporale
Contributor

mvacaporale commented Nov 19, 2020

I’m curious if there are other ways of dealing with VaryBatchSize and OneCycleLR.

I'm wary of creating functionality in anticipation of a need that hasn't fully materialized yet. As of now, we haven't had a desire to train with OneCycleLR and VaryBatchSize with samplers that change every epoch. In the past, I've often fallen victim to overgeneralizing: the code becomes harder to understand and harder to maintain in exchange for a feature that may not be used as much as originally thought.

I just want to be sure there’s a strong need given the changes. Here are the downsides that I see:

  • The ContinualLearningExperiment and SupervisedExperiment now differ more in their API from the MetaContinualLearningExperiment, making our codebase potentially harder to understand
  • The create*loaders functionality is now more complex
  • The task setting for ContinualLearningExperiment is not as obvious imo. It takes some understanding of the class inheritance to see how the call to create_train_sampler gets routed through the pre_epoch function of SupervisedExperiment.
  • The ContinualLearningExperiment does not iterate over epochs, but over tasks. Here we would require it to implement compute_steps_in_epoch, whose name would not be apt in this case.

To avoid these issues, I’d like to minimize the changes needed to accommodate the varying steps per epoch. Here’s a general route we may take: keep the samplers as they were, have a single create_train_dataloader (removing _create… as proposed) that takes an epoch and, optionally, a pre-instantiated sampler and dataset, and add a total_steps property that uses this create... method to calculate the total number of steps as appropriate.

Here's a rough draft of what that may look like



import torch
from torch.utils.data import DataLoader


class SupervisedExperiment:

    def setup_experiment(self, config):

        ...

        self.train_dataset = self.load_dataset(config, train=True)
        self.val_dataset = self.load_dataset(config, train=False)

        # Create samplers with no specification for epoch.
        self.train_sampler = self.create_train_sampler(config, self.train_dataset)
        self.val_sampler = self.create_validation_sampler(config, self.val_dataset)

        # It could be good to set a default loader, other than None. Up to you.
        self.train_loader = self.create_train_dataloader(
            config,
            dataset=self.train_dataset,
            sampler=self.train_sampler,
            epoch=0
        )
        self.val_loader = self.create_validation_dataloader(
            config,
            dataset=self.val_dataset,
            sampler=self.val_sampler,
            epoch=0
        )

        ...

    @classmethod
    def create_train_dataloader(cls, config, dataset=None, sampler=None, epoch=0):
        """
        We still enable dataset and samplers to be None so the loaders may be created
        solely via the config.
        """

        if dataset is None:
            dataset = cls.load_dataset(config, train=True)
        if sampler is None:
            sampler = cls.create_train_sampler(config, dataset, epoch)

        return DataLoader(
            dataset=dataset,
            batch_size=config.get("batch_size", 1),
            shuffle=sampler is None,
            num_workers=config.get("workers", 0),
            sampler=sampler,
            pin_memory=torch.cuda.is_available(),
            drop_last=config.get("train_loader_drop_last", True),
        )

    @classmethod
    def create_validation_dataloader(cls, config, dataset=None, sampler=None, epoch=0):
        """
        Something similar to `create_train_dataloader`
        """
        ...

    def pre_epoch(self):
        # In the distributed case, one would also call self.train_sampler.set_epoch
        self.train_loader = self.create_train_dataloader(
            self.config,
            dataset=self.train_dataset,
            sampler=self.train_sampler,
            epoch=self.current_epoch
        )

    # This gets removed
    # def compute_steps_in_epoch(self, epoch):

    @property
    def total_steps(self):
        total_steps = 0
        for epoch in range(self.epochs):
            train_loader = self.create_train_dataloader(
                self.config,
                dataset=self.train_dataset,
                sampler=self.train_sampler,
                epoch=epoch
            )
            total_steps += len(train_loader)
        return total_steps


class ContinualLearningExperiment:

    def setup_experiment(self, config):

        ...
        # Adjusted to handle class indices, but otherwise the same
        ...

    # The create_*_loader methods can be reused from the parent class.
    # Nothing to change, I believe. If anything, the argument `epoch`
    # can be renamed to `task`

    def run_task(self):
        # Set the task as we did before.
        self.train_sampler.set_active_tasks(self.current_task)
        ...

    @property
    def total_steps(self):
        total_steps = 0
        for task in range(self.num_tasks):
            # Set the active task before measuring the loader's length.
            self.train_sampler.set_active_tasks(task)
            train_loader = self.create_train_dataloader(
                self.config,
                dataset=self.train_dataset,
                sampler=self.train_sampler,
                epoch=task,  # <- this is a little weird here
            )
            total_steps += len(train_loader)
        return total_steps

It may also be possible to eliminate the need for the create_*_loader methods to take epoch as a parameter. We could potentially make VaryBatchSize knowledgeable enough to compute the total_steps on its own (depending on whether it’s being used with SupervisedExperiment or ContinualLearningExperiment).
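For instance (purely hypothetical helper and schedule, just to show the arithmetic), the mixin could derive the total from its batch-size schedule without building any loaders:

import math

def total_steps_from_schedule(num_samples, batch_sizes_by_epoch, drop_last=True):
    """Hypothetical helper: total optimizer steps implied by a batch-size schedule."""
    steps = 0
    for batch_size in batch_sizes_by_epoch:
        if drop_last:
            steps += num_samples // batch_size
        else:
            steps += math.ceil(num_samples / batch_size)
    return steps

# 1000 samples with the batch size doubling each epoch: 31 + 15 + 7 = 53 steps.
assert total_steps_from_schedule(1000, [32, 64, 128]) == 53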

@mrcslws
Contributor Author

mrcslws commented Nov 19, 2020

I'll spend some more time thinking about that, but here are a couple initial thoughts:

  • I wanted to avoid needing a different VaryBatchSize mixin for different experiments. (Or, equivalently, to avoid type-checking in the mixin.)
  • The plan is to change the MetaCLExperiment to also use this approach. The PR was already big, so I didn't want to make that change here.
  • This is compatible with the ContinualLearningExperiment. You can do sum(self.compute_steps_in_epoch(epoch) for epoch in range(self.num_tasks * self.epochs))
  • A total_steps method would not cover all of our use cases. In some places in our code, we want to know how many steps have elapsed up to some point (e.g. at the end of the 5th epoch); see the sketch after this list.
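To make that last point concrete, a small sketch (compute_steps_in_epoch stubbed with made-up numbers): the same per-epoch method gives both the grand total and the count of steps elapsed before a given epoch.

# Stub: in the real experiment this would come from the per-epoch dataloader's length.
def compute_steps_in_epoch(epoch, batches_per_epoch=(100, 100, 50, 50, 25)):
    return batches_per_epoch[epoch]

def total_steps(num_epochs):
    return sum(compute_steps_in_epoch(epoch) for epoch in range(num_epochs))

def steps_before_epoch(epoch):
    # Steps elapsed by the start of `epoch`, e.g. for positioning an LR schedule.
    return sum(compute_steps_in_epoch(e) for e in range(epoch))

assert total_steps(5) == 325
assert steps_before_epoch(2) == 200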

@mvacaporale
Contributor

Gotcha. I'm glad we'll adjust the MetaCL experiment for consistency.

And I can appreciate not wanting to have experiment-specific code in VaryBatchSize. When I proposed the opposite, I thought your option 5 from here would be well suited for that, maybe for later.

For this PR right now, I would still like to keep things as simple as possible. I believe something similar to what I wrote above would work, but instead of a total_steps property you'd have compute_steps_in_epoch, for the reasons you mention.

@mrcslws
Contributor Author

mrcslws commented Mar 29, 2021

We'll keep this idea in our back pocket.

@mrcslws mrcslws closed this Mar 29, 2021