Original paper by Cuong V. Nguyen, Yingzhen Li, Thang D. Bui and Richard E. Turner
- Continual Learning
- Data arrive continuously, possibly in a non-i.i.d. way
- Tasks may change over time (e.g. new classes may be discovered)
- Entirely new tasks can emerge (Schlimmer & Fisher 1986; Sutton & Whitehead, 1993; Ring, 1997)
- Challenge for Continual Learning
- Balance between adapting to new data and retaining existing knowledge
- Too much plasticity → catastrophic forgetting (McCloskey & Cohen, 1989; Ratcliff, 1990; Goodfellow et al., 2014a)
- Too much stability → inability to adapt
- Approach 1: train individual models on each task → carry out additional training to combine them (Lee et al., 2017)
- Approach 2: maintain a single model and use regularized training that prevents drastic changes in the influential parameters but allows other parameters to change more freely (Li & Hoiem, 2016; Kirkpatrick et al., 2017; Zenke et al., 2017)
- Variational Continual Learning
- Merge online VI (Ghahramani & Attias, 2000; Sato, 2001; Broderick et al., 2013)
- with Monte Carlo VI for NN (Blundell et al., 2015)
- and include a small episodic memory (Bachem et al., 2015; Huggins et al., 2016)
- Online updating, derived from Bayes' rule
- The posterior after the T-th dataset is proportional to the posterior after the (T−1)-th dataset multiplied by the likelihood of the T-th dataset (see the recursion below)
- Projection Operation: approximation for intractable posterior (recursive)
- This paper will use Online VI as it outperforms other methods for complex models in the static setting (Bui et al., 2016)
- Projection Operation: KL Divergence Minimization
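In symbols (following the paper, with $\mathcal{D}_t$ the $t$-th dataset, $Z_t$ a normalizing constant, and $q_0(\theta)$ set to the prior $p(\theta)$), the recursion and its KL projection are:

$$
p(\theta \mid \mathcal{D}_{1:T}) \propto p(\theta \mid \mathcal{D}_{1:T-1})\, p(\mathcal{D}_T \mid \theta),
\qquad
q_t(\theta) = \operatorname*{arg\,min}_{q \in \mathcal{Q}} \operatorname{KL}\!\Big( q(\theta) \,\Big\|\, \tfrac{1}{Z_t}\, q_{t-1}(\theta)\, p(\mathcal{D}_t \mid \theta) \Big)
$$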
- Potential Problems
- Errors from repeated approximation → forget old tasks
- Minimization at each step is also approximate → information loss
- Solution: Coreset
- Coreset: small representative set of data from previously observed tasks
- Analogous to episodic memory (Lopez-Paz & Ranzato, 2017)
- Coreset VCL: equivalent to a message-passing implementation of VI in which the coreset data point updates are scheduled after updating the other data
- $C_t$: updated using $C_{t-1}$ and selected data points from $\mathcal{D}_t$ (e.g. random selection, K-center algorithm, ...)
- K-center algorithm: return K data points that are spread throughout the input space (Gonzalez, 1985)
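An illustrative sketch (not the authors' code) of the greedy K-center selection rule (Gonzalez, 1985) that could populate the coreset; the function name and NumPy implementation are assumptions for illustration:

```python
# Greedy K-center coreset selection: repeatedly add the data point farthest from the
# points chosen so far, so the selection spreads out over the input space.
import numpy as np

def k_center_coreset(X, k, first_idx=0):
    """Return indices of k rows of X (n x d) chosen by the greedy K-center rule."""
    chosen = [first_idx]
    # Distance from every point to its nearest chosen center so far.
    dists = np.linalg.norm(X - X[first_idx], axis=1)
    for _ in range(k - 1):
        next_idx = int(np.argmax(dists))  # farthest point from the current centers
        chosen.append(next_idx)
        dists = np.minimum(dists, np.linalg.norm(X - X[next_idx], axis=1))
    return chosen
```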
- Variational Recursion
- Algorithm
- Step 5: Perform prediction
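A hedged sketch of this prediction step: the test label is obtained by marginalizing over the final variational posterior (written here as $\tilde q_t$, the distribution after the coreset refinement):

$$
p(y^* \mid x^*, \mathcal{D}_{1:t}) \approx \int \tilde q_t(\theta)\, p(y^* \mid \theta, x^*)\, d\theta
$$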
- Multi-head Networks
- Standard architecture used for multi-task learning (Bakker & Heskes, 2003)
- Share parameters close to the inputs / separate heads for each output (see the sketch below)
- More advanced model structures:
- for continual learning (Rusu et al., 2016)
- for multi-task learning in general (Swietojanski & Renals, 2014; Rebuffi et al., 2017)
- automatic continual model building: adding new structure as new tasks are encountered
- This paper assumes that the model structure is known a priori
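A minimal point-estimate sketch of the multi-head layout (shared trunk near the inputs, one head per task); in VCL these layers carry weight distributions rather than point weights, and the class and parameter names here are illustrative assumptions:

```python
# Sketch of a multi-head network: shared trunk + per-task output heads (PyTorch).
import torch.nn as nn

class MultiHeadNet(nn.Module):
    def __init__(self, in_dim, hidden_dim, head_dims):
        super().__init__()
        # Parameters shared across all tasks (close to the inputs).
        self.trunk = nn.Sequential(
            nn.Linear(in_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, hidden_dim), nn.ReLU(),
        )
        # One separate output head per task.
        self.heads = nn.ModuleList(nn.Linear(hidden_dim, d) for d in head_dims)

    def forward(self, x, task_id):
        return self.heads[task_id](self.trunk(x))
```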
- Formulation
- Network Training
- $q_t(\theta)$: set as a mean-field multivariate Gaussian, so the term $\mathrm{KL}(q_t(\theta)\,\|\,q_{t-1}(\theta))$ is tractable in closed form (Graves, 2011; Blundell et al., 2015)
- $\mathbb{E}_{q_t(\theta)}[\log p(\mathcal{D}_t \mid \theta)]$: intractable → approximate by simple Monte Carlo, using the local reparameterization trick to compute the gradients (Salimans & Knowles, 2013; Kingma & Welling, 2014; Kingma et al., 2015)
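A minimal sketch of the per-task training objective under these approximations, assuming PyTorch and a single Bayesian linear layer; it uses plain weight reparameterization for brevity (the paper uses the local reparameterization trick), and the function and variable names are illustrative assumptions:

```python
import torch
import torch.nn.functional as F

def gaussian_kl(mu_q, log_var_q, mu_p, log_var_p):
    """Closed-form KL( N(mu_q, var_q) || N(mu_p, var_p) ), summed over all weights."""
    var_q, var_p = log_var_q.exp(), log_var_p.exp()
    return 0.5 * torch.sum(log_var_p - log_var_q + (var_q + (mu_q - mu_p) ** 2) / var_p - 1.0)

def vcl_objective(x, y, mu, log_var, prev_mu, prev_log_var, n_samples=5):
    """Negative per-task ELBO: -E_q[log p(D_t | w)] + KL(q_t || q_{t-1})."""
    nll = 0.0
    for _ in range(n_samples):
        # Reparameterization: w = mu + sigma * eps with eps ~ N(0, I).
        w = mu + (0.5 * log_var).exp() * torch.randn_like(mu)
        logits = x @ w                      # simple linear classifier for illustration
        nll = nll + F.cross_entropy(logits, y, reduction="sum")
    nll = nll / n_samples
    return nll + gaussian_kl(mu, log_var, prev_mu, prev_log_var)
```

After training on task $t$, the learned $(\mu, \log\sigma^2)$ are frozen and serve as (`prev_mu`, `prev_log_var`) when task $t+1$ arrives.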
- Deep Generative Models
- Can generate realistic images, sounds, and video sequences (Chung et al., 2015; Kingma et al., 2016; Vondrick et al., 2016)
- Standard batch learning assumes observations are i.i.d. and all available at the same time
- This paper applies the VCL framework to variational autoencoders (Kingma & Welling, 2014; Rezende et al., 2014)
- Formulation - VAE approach (batch learning)
- No parameter uncertainty estimates (which are needed to weight the information learned from old data)
- Formulation - VCL approach (continual learning; objective sketched below)
- Model Architecture
- Architecture 1: shared bottom network - suitable when data are composed of a common set of structural primitives (e.g. strokes)
- Architecture 2: shared head network - information tends to be entirely encoded in the bottom network
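Roughly (notation follows the paper; indexing details here are mine), the VCL objective for the generative model maximizes the full variational lower bound over both the weight posterior $q_t(\theta)$ and the encoder parameters $\phi$:

$$
\mathcal{L}^{t}_{\mathrm{VAE}}(q_t(\theta), \phi) =
\mathbb{E}_{q_t(\theta)} \left[ \sum_{n=1}^{N_t} \mathbb{E}_{q_\phi(z_n \mid x_n)}
\left[ \log \frac{p(x_n \mid z_n, \theta)\, p(z_n)}{q_\phi(z_n \mid x_n)} \right] \right]
- \operatorname{KL}\!\big( q_t(\theta) \,\|\, q_{t-1}(\theta) \big)
$$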
- Continual Learning for Deep Discriminative Models (regularized MLE)
- Laplace Propagation (LP) (Smola et al., 2004) - recursion for $q_t(\theta)$ using Laplace's approximation at each step
- Diagonal LP: retain only the diagonal terms of the precision matrices to avoid computing the full Hessian
- Elastic Weight Consolidation (EWC) (Kirkpatrick et al., 2017) - modified diagonal LP
- Approximate the average Hessian of the likelihoods using Fisher information
- Regularization term: introduces a hyperparameter, removes the prior, and regularizes toward intermediate (per-task) parameter estimates
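As a rough illustration (notation here is mine), EWC's objective for task $T$ penalizes movement away from the per-task estimates $\theta^*_t$, weighted by the diagonal Fisher information $F_t$ and the hyperparameter $\lambda$:

$$
\mathcal{L}_{\mathrm{EWC}}(\theta) = \log p(\mathcal{D}_T \mid \theta)
- \frac{\lambda}{2} \sum_{t<T} (\theta - \theta^*_t)^{\top} \operatorname{diag}(F_t)\, (\theta - \theta^*_t)
$$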
- Synaptic Intelligence (SI) (Zenke et al., 2017) - compute the diagonal regularization terms using a measure of the importance of each parameter to each task
- Approximate Bayesian Training of NN (focused on approximating the posterior over the network weights)
| Approach | References |
|---|---|
| extended Kalman filtering | Singhal & Wu, 1989 |
| Laplace's approximation | MacKay, 1992 |
| variational inference | Hinton & Van Camp, 1993; Barber & Bishop, 1998; Graves, 2011; Blundell et al., 2015; Gal & Ghahramani, 2016 |
| sequential Monte Carlo | de Freitas et al., 2000 |
| expectation propagation | Hernández-Lobato & Adams, 2015 |
| approximate power EP | Hernández-Lobato et al., 2016 |
- Continual Learning for Deep Generative Models
- Naïve approach: apply the VAE to $\mathcal{D}_t$ with parameters initialized at $\theta_{t-1}$ → catastrophic forgetting
- Alternative: add EWC regularization term to VAE objective & approximate marginal likelihood by variational lower bound
- A similar approximation can be used for the Hessian matrices required by LP and for the parameter-importance measures in SI (marginal likelihood estimated via importance sampling: Burda et al., 2016)