Remove assumption that all epochs have same number of batches #411

Closed
mrcslws wants to merge 3 commits from the num-batches-assumption branch

Conversation

mrcslws
Contributor

@mrcslws mrcslws commented Nov 13, 2020

Mixins and experiments that vary the batch size (e.g. the VaryBatchSize mixin) or the samplers (e.g. ContinualLearningExperiment) will now be compatible with code that relies on step counts (e.g. OneCycleLR).
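For context, a minimal sketch of the step-count dependency in question (hypothetical numbers, not from the codebase): OneCycleLR needs total_steps up front, and that total can no longer be computed as epochs * len(loader) once epochs differ in length.

import torch
from torch.optim.lr_scheduler import OneCycleLR

# Hypothetical per-epoch batch counts, e.g. with a VaryBatchSize-style schedule.
batches_per_epoch = [100, 50, 25]

model = torch.nn.Linear(10, 2)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

# The old assumption was total_steps = num_epochs * len(train_loader), which breaks
# as soon as epochs have different numbers of batches. Sum the per-epoch counts instead.
total_steps = sum(batches_per_epoch)  # 175, not 3 * 100
scheduler = OneCycleLR(optimizer, max_lr=1.0, total_steps=total_steps)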

@mrcslws
Contributor Author

mrcslws commented Nov 13, 2020

FYI I'm currently trying to make this more functional. The PR as-is mutates a central dataloader. It would be easier to understand and more robust if the dataloader was created in a functional way, rather than being reused, though its internals like the dataset and task indices should be reused. (That was my original plan, and now I'm trying it again.)

mrcslws force-pushed the num-batches-assumption branch from fc57ad3 to 7a321fc on November 16, 2020 at 16:10
@mrcslws
Contributor Author

mrcslws commented Nov 16, 2020

This PR depends on #414 (to avoid breaking the meta_cl_experiment). The regression tests pass, and there's no measurable slowdown in CPU profiles of the imagenet experiment (pre_epoch is negligible).

Pasting second commit message:

Functional programming approach to dataloaders
Rather than mutating a central dataloader over time, create new dataloaders on-demand (while storing internal immutable state like datasets). This reduces potential for bugs, and it also makes the create_train_dataloader method fully capable even when called as a classmethod.
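As a rough illustration of that shape (names here are made up, not the actual experiment API): the dataset is built once and stored, while the loader is recreated for each epoch.

import torch
from torch.utils.data import DataLoader, TensorDataset

class FunctionalLoaderSketch:
    """Hypothetical sketch of the functional approach to dataloaders."""

    def setup_experiment(self, config):
        # Immutable state (the dataset) is created once and reused.
        self.config = config
        self.train_dataset = TensorDataset(torch.randn(64, 10), torch.randint(0, 2, (64,)))

    def create_train_dataloader(self, config, epoch):
        # A fresh loader is created on demand for each epoch, so per-epoch settings
        # (batch size, sampler) never require mutating a long-lived loader.
        batch_size = config.get("batch_size", 8) * (2 ** epoch)  # illustrative VaryBatchSize-style schedule
        return DataLoader(self.train_dataset, batch_size=batch_size)

    def pre_epoch(self, epoch):
        self.train_loader = self.create_train_dataloader(self.config, epoch)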

mrcslws force-pushed the num-batches-assumption branch 2 times, most recently from 74d8c76 to d7f6f36 on November 16, 2020 at 16:26
@mvacaporale
Contributor

What are the instances where we vary the dataloader per epoch? Of course, VaryBatchSize is one of them, but are there others?

I'm not sure how I feel about such a fundamental change for only one class of experiments. I'm sure you thought extensively about what features you're trying to support, so I'm curious what you have in mind with these changes.

@mvacaporale
Contributor

I also was a fan of calling self.loader.sampler.set_active_tasks(). It really makes it explicit what's happening in the code. As far as I understand, we could still create the samplers only once as we do with the datasets, correct? Do you see any issues with that? Even though the dataloaders may change every epoch, we don't necessarily have to change the samplers at every epoch.

@mrcslws
Contributor Author

mrcslws commented Nov 19, 2020

Any ContinualLearningExperiment may also vary the number of batches per epoch, even if unintentionally. More generally, this opens us up to running different tasks or groups of tasks in different epochs without breaking built-in assumptions (like self.total_batches). That second use case was enough to convince me that we want this flexibility.

(Plus, VaryBatchSize isn't that obscure of a case. We happen to implement it as a mixin, but we could have just as easily built it in.)

@mrcslws
Contributor Author

mrcslws commented Nov 19, 2020

Is it any less explicit here? We're still calling set_active_tasks. When we create the sampler, we create it for a specific epoch number. It's different, but still explicit. And now we don't have to store a bunch of state in our heads to understand and debug the code.
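Concretely, something along these lines (an illustrative stand-in for the real task sampler, not the project's actual class): the set_active_tasks call is still explicit, it just happens when the sampler is created for a given epoch.

# Illustrative sketch only; names are hypothetical.
class TaskSamplerSketch:
    def __init__(self, indices_by_task):
        self.indices_by_task = indices_by_task
        self.active_indices = []

    def set_active_tasks(self, task):
        self.active_indices = list(self.indices_by_task[task])

    def __iter__(self):
        return iter(self.active_indices)

    def __len__(self):
        return len(self.active_indices)


def create_train_sampler(config, dataset, epoch):
    # dataset is unused in this toy sketch; kept for signature parity.
    sampler = TaskSamplerSketch(config["indices_by_task"])
    sampler.set_active_tasks(epoch % config["num_tasks"])  # still an explicit, per-epoch call
    return sampler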

@mvacaporale
Contributor

mvacaporale commented Nov 19, 2020

I’m curious if there are other ways of dealing with VaryBatchSize and OneCycleLR.

I'm wary of creating functionality in anticipation of a need that hasn't fully materialized yet. As of now, we haven't had a desire to train with OneCycleLR and VaryBatchSize with samplers that change every epoch. In the past, I've often fallen victim to overgeneralizing: the code becomes harder to understand and harder to maintain in exchange for a feature that may not be used as much as originally thought.

I just want to be sure there’s a strong need given the changes. Here are the downsides that I see:

  • The ContinualLearningExperiment and SupervisedExperiment now differ more in their API from the MetaContinualLearningExperiment, making our codebase potentially harder to understand
  • The create*loaders functionality is now more complex
  • The task setting for ContinualLearningExperiment is not as obvious imo. It takes some understanding of the class inheritance to see how the call to create_train_sampler gets routed through the pre_epoch function of SupervisedExperiment.
  • The ContinualLearningExperiment does not iterate over epochs, but over tasks. Here we would require it to implement compute_steps_in_epoch, whose name would not be apt in this case.

To avoid these issues, I’d like to minimize the changes needed to accommodate the varying steps per epoch. Here’s a general route we may take: keep the samplers as they were, have a single create_train_dataloader (removing _create… as proposed) that takes an epoch and, optionally, a pre-instantiated sampler and dataset, and add a total_steps property that uses this create... method to calculate the total number of steps as appropriate.

Here's a rough draft of what that may look like



import torch
from torch.utils.data import DataLoader


class SupervisedExperiment:

    def setup_experiment(self, config):

        ...

        self.train_dataset = self.load_dataset(config, train=True)
        self.val_dataset = self.load_dataset(config, train=False)

        # Create samplers with no specification for epoch.
        self.train_sampler = self.create_train_sampler(config, self.train_dataset)
        self.val_sampler = self.create_validation_sampler(config, self.val_dataset)

        # It could be good to set a default loader, other than None. Up to you.
        self.train_loader = self.create_train_dataloader(
            config,
            dataset=self.train_dataset,
            sampler=self.train_sampler,
            epoch=0
        )
        self.val_loader = self.create_validation_dataloader(
            config,
            dataset=self.val_dataset,
            sampler=self.val_sampler,
            epoch=0
        )

        ...

    @classmethod
    def create_train_dataloader(cls, config, dataset=None, sampler=None, epoch=0):
        """
        We still enable dataset and samplers to be None so the loaders may be created
        solely via the config.
        """

        if dataset is None:
            dataset = cls.load_dataset(config, train=True)
        if sampler is None:
            sampler = cls.create_train_sampler(config, dataset, epoch)

        return DataLoader(
            dataset=dataset,
            batch_size=config.get("batch_size", 1),
            shuffle=sampler is None,
            num_workers=config.get("workers", 0),
            sampler=sampler,
            pin_memory=torch.cuda.is_available(),
            drop_last=config.get("train_loader_drop_last", True),
        )

    @classmethod
    def create_validation_dataloader(cls, config, dataset=None, sampler=None, epoch=0):
        """
        Something similar to `create_train_dataloader`
        """
        ...

    def pre_epoch(self):
        # In the distributed case, one would also call self.train_sampler.set_epoch
        self.train_loader = self.create_train_dataloader(
            self.config,
            dataset=self.train_dataset,
            sampler=self.train_sampler,
            epoch=self.current_epoch
        )

    # This gets removed
    # def compute_steps_in_epoch(self, epoch):

    @property
    def total_steps(self):
        total_steps = 0
        for epoch in range(self.epochs):
            train_loader = self.create_train_dataloader(
                self.config,
                dataset=self.train_dataset,
                sampler=self.train_sampler,
                epoch=epoch
            )
            total_steps += len(train_loader)
        return total_steps


class ContinualLearningExperiment:

    def setup_experiment(self, config):

        ...
        # Adjusted to handle class indices, but otherwise the same
        ...

    # The create_*_loader methods can be reused from the parent class.
    # Nothing to change, I believe. If anything, the argument `epoch`
    # can be renamed to `task`

    def run_task(self):
        # Set the task as we did before.
        self.train_sampler.set_active_tasks(self.current_task)
        ...

    @property
    def total_steps(self):
        total_steps = 0
        for task in range(self.num_tasks):
            # Set the active task before measuring the loader's length.
            self.train_sampler.set_active_tasks(task)
            train_loader = self.create_train_dataloader(
                self.config,
                dataset=self.train_dataset,
                sampler=self.train_sampler,
                epoch=task,  # <- this is a little weird here
            )
            total_steps += len(train_loader)
        return total_steps

It may also be possible to eliminate the need for the create_*_loader methods to take epoch as a parameter. We could potentially make VaryBatchSize knowledgeable enough to compute the total_steps on its own (depending on whether it’s being used with SupervisedExperiment or ContinualLearningExperiment).
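For instance (purely hypothetical helper and schedule, just to show the arithmetic), the mixin could derive the total from its batch-size schedule without building any loaders:

import math

def total_steps_from_schedule(num_samples, batch_sizes_by_epoch, drop_last=True):
    """Hypothetical helper: total optimizer steps implied by a batch-size schedule."""
    steps = 0
    for batch_size in batch_sizes_by_epoch:
        if drop_last:
            steps += num_samples // batch_size
        else:
            steps += math.ceil(num_samples / batch_size)
    return steps

# 1000 samples with the batch size doubling each epoch: 31 + 15 + 7 = 53 steps.
assert total_steps_from_schedule(1000, [32, 64, 128]) == 53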

@mrcslws
Contributor Author

mrcslws commented Nov 19, 2020

I'll spend some more time thinking about that, but here are a couple initial thoughts:

  • I wanted to avoid needing a different VaryBatchSize mixin for different experiments. (Or, equivalently, to avoid type-checking in the mixin.)
  • The plan is to change the MetaCLExperiment to also use this approach. The PR was already big, so I didn't want to make that change here.
  • This is compatible with the ContinualLearningExperiment. You can do sum(self.compute_steps_in_epoch(epoch) for epoch in range(self.num_tasks * self.epochs))
  • A total_steps method would not cover all of our use cases. In some places in our code, we want to know how many steps have elapsed up to some point (e.g. at the end of the 5th epoch); see the sketch after this list.
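To make that last point concrete, a small sketch (compute_steps_in_epoch stubbed with made-up numbers): the same per-epoch method gives both the grand total and the count of steps elapsed before a given epoch.

# Stub: in the real experiment this would come from the per-epoch dataloader's length.
def compute_steps_in_epoch(epoch, batches_per_epoch=(100, 100, 50, 50, 25)):
    return batches_per_epoch[epoch]

def total_steps(num_epochs):
    return sum(compute_steps_in_epoch(epoch) for epoch in range(num_epochs))

def steps_before_epoch(epoch):
    # Steps elapsed by the start of `epoch`, e.g. for positioning an LR schedule.
    return sum(compute_steps_in_epoch(e) for e in range(epoch))

assert total_steps(5) == 325
assert steps_before_epoch(2) == 200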

@mvacaporale
Contributor

Gotcha. I'm glad we'll adjust the MetaCL experiment for consistency.

And I can appreciate not wanting to have experiment-specific code in VaryBatchSize. When I proposed the opposite, I thought your option 5 from here would be well suited for that, maybe for later.

For this PR right now, I would still like to keep things as simple as possible. I believe something similar to what I wrote above would work, but instead of a total_steps property you'd have compute_steps_in_epoch, for the reasons you mention.

@mrcslws
Contributor Author

mrcslws commented Mar 29, 2021

We'll keep this idea in our back pocket.

@mrcslws mrcslws closed this Mar 29, 2021