Language-Guided Continual Learning

Apr 23, 2021

What is Continual Learning?

Machine learning classifiers are typically trained to recognize a set of predefined classes; nevertheless, it would be useful to have the flexibility to learn additional concepts from limited data, without re-training on the full training set.

Continual learning addresses this need: a model learns a large number of tasks sequentially without forgetting the knowledge obtained from preceding tasks, even though the data from old tasks is no longer available while training on new ones.

Our objective, then, is to let the system learn novel categories from only a few training examples without forgetting the base categories on which it was trained.

We will examine the role of semantics in the form of attributes, labels, or natural language descriptions in boosting continual learning.

There are several ways to achieve our goal:

Dynamic Few-Shot Visual Learning without Forgetting
Incremental Few-Shot Learning with Attention Attractor Networks
XtarNet: Learning to Extract Task-Adaptive Representation for Incremental Few-Shot Learning
Continual Learning by Asymmetric Loss Approximation with Single-Side Overestimation.

For each method we will try to discuss:

* Main ideas and the important definitions

* Overview of the framework.

* Some observations and results.

Dynamic Few-Shot Visual Learning without Forgetting:

Research on this subject is usually termed few-shot learning.

What is few-shot learning?

As the name implies, few-shot learning refers to the practice of training a learning model with a very small amount of training data, contrary to the normal practice of using a large amount of data. For example, if the problem is categorizing bird species from photos, some rare species may lack enough pictures to serve as training images.

Consequently, if we have a classifier for bird images but too few examples for some species, we treat it as a few-shot or low-shot machine learning problem.

Tackling the problem of few-shot learning was motivated by two requirements that most earlier methods neglect:

a - the learning of the novel categories needs to be fast,

b - recognition accuracy on the initial categories that the ConvNet was trained on must not be sacrificed.

This work aims to develop an object recognition system that not only recognizes the base categories but also learns to dynamically recognize novel categories from only a few training examples, without forgetting the base ones or needing to be re-trained on them.

To achieve this, the authors make the following contributions:

Extend an object recognition system with an attention-based few-shot classification weight generator.
Redesign the classifier of a ConvNet model as the cosine similarity function between feature representations and classification weight vectors. This unifies the recognition of both novel and base categories and also leads to feature representations that generalize better on “unseen” categories. (The big idea behind CNNs is that a local understanding of an image is good enough: instead of a fully connected network of weights from each pixel, a CNN has just enough weights to look at a small patch of the image. The practical benefit is that having fewer parameters greatly reduces both the training time and the amount of data required to train the model.)
Evaluate the approach on Mini-ImageNet, where they improve on the prior state of the art in few-shot recognition.
Finally, apply the approach to the recently introduced few-shot benchmark of Hariharan and Girshick, where they also achieve state-of-the-art results.

Overview of the framework:

It consists of two main components: a ConvNet-based recognition model that is able to recognize both base and novel categories, and a few-shot classification weight generator that dynamically generates classification weight vectors for the novel categories at test time. Both are trained on a set of base categories for which a large set of training data is available. At test time, the weight generator takes as input a few training examples of a novel category and the classification weight vectors of the base categories (green rectangle inside the classifier box) and generates a classification weight vector for this novel category (blue rectangle inside the classifier box). This allows the ConvNet to recognize both base and novel categories.
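
To make the role of the weight generator concrete, here is a minimal NumPy sketch of how a novel-class weight vector could be built from a few support features and the base weight vectors. It is an illustration under stated assumptions, not the authors' implementation: the function name `generate_novel_weight`, the fixed `temperature`, and the scalar `mix` are hypothetical stand-ins for components that would be learned in the real model.

```python
import numpy as np

def l2_normalize(x, axis=-1, eps=1e-8):
    """Scale vectors to unit length along the given axis."""
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)

def softmax(x, axis=-1):
    """Numerically stable softmax."""
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def generate_novel_weight(support_feats, base_weights, temperature=10.0, mix=0.5):
    """Hypothetical weight generator.

    support_feats: (n_shots, d) features of the few novel-class examples.
    base_weights:  (n_base, d) classification weight vectors of base classes.
    Returns a (d,) classification weight vector for the novel class.
    """
    z = l2_normalize(support_feats)
    w_base = l2_normalize(base_weights)

    # Feature averaging: the mean of the normalized support features.
    w_avg = z.mean(axis=0)

    # Attention: each support example attends to the base weight vectors
    # it is most similar to (cosine similarity used as the attention score).
    attn = softmax(temperature * (z @ w_base.T), axis=1)   # (n_shots, n_base)
    w_att = (attn @ w_base).mean(axis=0)                   # (d,)

    # Assumption: a fixed scalar mix instead of learned combination weights.
    return mix * w_avg + (1.0 - mix) * w_att
```

The attention branch is what lets the generator reuse knowledge stored in the base classifier: the novel weight is pulled toward the base weight vectors that look most like the support examples.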

Conclusions:

There are two types of few-shot classification weight generator: feature-averaging-based weight inference and attention-based weight inference. The authors chose attention-based weight inference because the base classification weight vectors learn to be representative feature vectors of their categories, and so they also encode visual similarity. Hence, the classification weight vector of a novel category can be composed as a linear combination of those base classification weight vectors that are most similar to the few training examples of that category. This allows the few-shot weight generator to explicitly exploit the acquired knowledge about the visual world in order to improve few-shot recognition performance.

They use a cosine similarity function instead of the dot product. Comparing the cosine-similarity-based ConvNet models with the dot-product-based models, we observe that the former drastically improve few-shot object recognition performance, which means that the feature extractor learned with the cosine-similarity classifier generalizes significantly better on “unseen” categories than the feature extractor learned with the dot-product classifier.
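
The difference between the two scoring functions can be seen in a toy example. The sketch below is illustrative only: the function names, the `scale` factor, and the numbers are assumptions chosen to show how a dot-product score can be dominated by the magnitude of a weight vector, while a cosine score depends only on direction.

```python
import numpy as np

def dot_product_scores(features, weights):
    """Raw dot-product classification scores, shape (n, n_classes)."""
    return features @ weights.T

def cosine_scores(features, weights, scale=10.0, eps=1e-8):
    """Cosine-similarity scores: both sides are L2-normalized first.

    `scale` is an assumed temperature-like factor applied to the cosine.
    """
    f = features / (np.linalg.norm(features, axis=1, keepdims=True) + eps)
    w = weights / (np.linalg.norm(weights, axis=1, keepdims=True) + eps)
    return scale * (f @ w.T)

# One feature vector and two class weight vectors (toy values).
features = np.array([[1.0, 1.0]])
weights = np.array([
    [10.0, 0.0],  # base class: large magnitude, poorly aligned with the feature
    [0.7, 0.7],   # novel class: small magnitude, well aligned with the feature
])

dot_pred = int(np.argmax(dot_product_scores(features, weights)))
cos_pred = int(np.argmax(cosine_scores(features, weights)))
```

Here the dot product favors the large-magnitude base weight, whereas the cosine score picks the better-aligned novel weight. That is the sense in which a cosine classifier puts base weights learned by gradient descent and generated novel weights on a common scale.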

For more details, see the article Dynamic Few-Shot Visual Learning without Forgetting.

Incremental Few-Shot Learning with Attention Attractor Networks:

This paper addresses the problem of incremental few-shot learning, where a regular classification network has already been trained to recognize a set of base classes, and several extra novel classes are then considered, each with only a few labeled examples. After learning the novel classes, the model is evaluated on its overall classification performance on both base and novel classes. To this end, the authors propose a meta-learning model, the Attention Attractor Network, which regularizes the learning of novel classes. In each episode, they train a set of new weights to recognize novel classes until they converge, and they show that the technique of recurrent back-propagation can back-propagate through the optimization process and facilitate the learning of these parameters. They demonstrate that the learned attractor network can help recognize novel classes while remembering old classes without the need to review the original training set, outperforming various baselines.

Overview of the framework:

The framework is the proposed attention attractor network for incremental few-shot learning. During pretraining, we learn the base class weights and the feature extractor CNN backbone. In the meta-learning stage, a few-shot episode is presented. The support set contains only novel classes, whereas the query set contains both base and novel classes. We learn an episodic classifier network through an iterative solver to minimize cross-entropy plus an additional regularization term, which the attention attractor network predicts by attending to the base classes. The attention attractor network is meta-learned to minimize the expected query loss. During testing, an episodic classifier is learned in the same way.
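
The episode structure described above (support set with novel classes only, query set with both base and novel classes) can be sketched with the standard library. The function `sample_episode`, the pool layout (class label mapped to a list of example ids), and the default sizes are assumptions made for illustration.

```python
import random

def sample_episode(base_pool, novel_pool, n_way=5, k_shot=1, n_query=2, seed=None):
    """Sample one incremental few-shot episode.

    base_pool / novel_pool: dict mapping class label -> list of example ids.
    Returns (support, query), each a list of (example_id, label) pairs.
    """
    rng = random.Random(seed)
    support, query = [], []

    # Pick n_way novel classes; each contributes k_shot support examples
    # and n_query held-out query examples.
    for label in rng.sample(sorted(novel_pool), n_way):
        picked = rng.sample(novel_pool[label], k_shot + n_query)
        support += [(x, label) for x in picked[:k_shot]]
        query += [(x, label) for x in picked[k_shot:]]

    # Base classes appear only in the query set, so the query loss
    # measures performance on base and novel classes jointly.
    for label in sorted(base_pool):
        query += [(x, label) for x in rng.sample(base_pool[label], n_query)]

    return support, query
```

Keeping base classes out of the support set mirrors the setting in which the original training data is not revisited while the novel classes are learned.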

To set up incremental few-shot learning, the authors propose the following stages:

Pretraining Stage: We learn a base model for the regular supervised classification task on the pretraining dataset. The purpose of this stage is to learn both a good base classifier and a good representation. The parameters of the base classifier are learned in this stage and are fixed after pretraining.
Incremental Few-Shot Episodes: The few-shot dataset can come from the same data source as the pretraining dataset, but it is sampled episodically into few-shot learning episodes. In each episode, we learn a classifier on the support set whose learnable parameters are called the fast weights, as they are only used during this episode.
Meta-Learning Stage: In meta-training, we iteratively sample few-shot episodes and try to learn the meta-parameters in order to minimize the joint prediction loss. In their model, the meta-parameters are encapsulated in the attention attractor network, which produces regularizers for the fast weights in the few-shot learning objective.
Joint Prediction on Base and Novel Classes: First, we construct an episodic classifier, which takes the learned image features as inputs and classifies them according to the few-shot classes. During training on the support set, we learn the fast weights by minimizing the regularized cross-entropy objective, which we call the episodic objective.
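
As a rough sketch of the episodic objective in the stages above, the code below fits fast weights on a support set by plain gradient descent on cross-entropy plus a quadratic regularizer. Everything here is an assumption for illustration: a linear softmax classifier, `attractors` and `precision` passed in as fixed arrays (in the paper they come from a meta-learned network), and a fixed number of gradient steps in place of the iterative solver used by the authors.

```python
import numpy as np

def softmax(z):
    """Row-wise numerically stable softmax."""
    e = np.exp(z - z.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

def episodic_objective(W_fast, feats, labels, attractors, precision):
    """Cross-entropy on the support set plus a weighted squared distance
    between each fast-weight column and its attractor."""
    probs = softmax(feats @ W_fast)
    ce = -np.mean(np.log(probs[np.arange(len(labels)), labels] + 1e-12))
    reg = np.sum(precision[:, None] * (W_fast - attractors) ** 2)
    return ce + reg

def fit_fast_weights(feats, labels, attractors, precision, lr=0.1, steps=200):
    """Learn fast weights W (d, k) for k novel classes by gradient descent."""
    n, _ = feats.shape
    k = attractors.shape[1]
    W = np.zeros_like(attractors, dtype=float)
    onehot = np.eye(k)[labels]
    for _ in range(steps):
        probs = softmax(feats @ W)
        grad_ce = feats.T @ (probs - onehot) / n          # softmax cross-entropy gradient
        grad_reg = 2.0 * precision[:, None] * (W - attractors)
        W -= lr * (grad_ce + grad_reg)
    return W
```

The regularizer is the only place where information about the base classes can enter this objective, which is why the quality of the attractors matters for avoiding forgetting.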

Attention Attractor Networks:

Directly learning the few-shot episode can cause catastrophic forgetting on the base classes, because the parameters trained to maximize the correct novel class probability can dominate the base classes in the joint prediction. The Attention Attractor Network helps to solve this problem.

The key feature of our attractor network is the following regularization term:
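
Under assumed notation (the symbols are chosen here for illustration: $W_{b,k'}$ is the fast-weight column for novel class $k'$, $u_{k'}$ is its attractor, $K'$ is the number of novel classes, $\gamma$ is a learned per-dimension log-scale, and $\theta_E$ collects the meta-learned parameters), a sum of squared Mahalanobis distances with a diagonal metric can be written as:

```latex
R(W_b, \theta_E) = \sum_{k'=1}^{K'} (W_{b,k'} - u_{k'})^{\top} \, \mathrm{diag}(\exp(\gamma)) \, (W_{b,k'} - u_{k'})
```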

This sum of squared Mahalanobis distances from the attractors adds a bias to the learning signal arriving solely from novel classes. For a classifier such as an MLP, one can extend this regularization term in a layer-wise manner: specifically, one can have separate attractors per layer, and the number of attractors equals the number of output dimensions of that layer.

To ensure that the model performs well on base classes, the attractors must contain some information about examples from base classes. We use the slow weights to encode such information.

For more details, see the article Incremental Few-Shot Learning with Attention Attractor Networks.

XtarNet: Learning to Extract Task-Adaptive Representation for Incremental Few-Shot Learning:

To address the same incremental few-shot learning problem, the authors of this paper propose XtarNet, which learns to extract a task-adaptive representation (TAR) for facilitating incremental few-shot learning. The method utilizes a backbone network pretrained on a set of base categories while also employing additional modules that are meta-trained across episodes. Given a new task, the novel feature extracted from the meta-trained modules is mixed with the base feature obtained from the pretrained model. The process of combining the two features provides the TAR and is itself controlled by meta-trained modules. The TAR contains effective information for classifying both novel and base categories, and the base and novel classifiers quickly adapt to a given task by utilizing it. Experiments on standard image datasets indicate that XtarNet achieves state-of-the-art incremental few-shot learning performance. The concept of TAR can also be used in conjunction with existing incremental few-shot learning methods; extensive simulation results in fact show that applying TAR enhances the known methods significantly.
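
The mixing step can be pictured as a gated, element-wise combination of the two feature vectors. The sketch below is a simplified illustration and not the XtarNet architecture: `mixing_weights`, the sigmoid gate, and `gate_matrix` are hypothetical stand-ins for the meta-trained modules that control the combination.

```python
import numpy as np

def mixing_weights(support_base, support_novel, gate_matrix):
    """Derive per-dimension mixing weights from the support set of a task.

    support_base, support_novel: (n_support, d) features from the pretrained
    backbone and from the meta-trained module, respectively.
    gate_matrix: (2d, 2d) parameters of an assumed linear gating layer.
    """
    d = support_base.shape[1]
    # Task summary: mean base and novel features over the support set.
    summary = np.concatenate([support_base.mean(axis=0), support_novel.mean(axis=0)])
    gates = 1.0 / (1.0 + np.exp(-(gate_matrix @ summary)))  # sigmoid, values in (0, 1)
    return gates[:d], gates[d:]

def task_adaptive_representation(base_feat, novel_feat, w_base, w_novel):
    """Element-wise weighted sum of the base and novel features."""
    return w_base * base_feat + w_novel * novel_feat
```

Because the weights are computed from the support set, the same input image can receive a different combined representation in different tasks, which is the task-adaptive part of the idea.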

Overview of the framework:

XtarNet processing of a given episode: our modules can be plugged in with existing incremental few-shot learners, e.g., Imprint and LwoF. Although TapNet was not proposed as an incremental few-shot learner, it can be used as an effective novel classifier for our purposes, in which case projection M is computed such that per-class average of the combined features and novel classifier weights coincide in M.

For more details about the methodology and experimental results, see the article XtarNet: Learning to Extract Task-Adaptive Representation for Incremental Few-Shot Learning.

Continual Learning by Asymmetric Loss Approximation with Single-Side Overestimation:

This paper proposes a novel approach to continual learning that approximates a true loss function using an asymmetric quadratic function with one of its sides overestimated. The algorithm is motivated by the empirical observation that network parameter updates affect the target loss functions asymmetrically. In the proposed continual learning framework, an asymmetric loss function is estimated for the tasks considered in the past through a proper overestimation of its unobserved sides when training new tasks, while the accurate model parameter is derived for the observable sides. In contrast to existing approaches, the method is free from the side effects and achieves state-of-the-art accuracy that is even close to the upper-bound performance on several challenging benchmark datasets.

Overview of the framework:

Conceptual diagram to illustrate why the loss approximation is required. Symmetric loss approximations, such as the quadratic approximation used in SI (yellow), are prone to underestimate the unobserved sides of potentially asymmetric real surrogate loss functions (black). The authors claim that properly introduced asymmetry in the loss approximation (blue) prevents this problem. Note that the other sides are observed during the optimization of the network when training the previous tasks, so a more accurate parameter can be derived for the quadratic optimization of the surrogate function.

This paper presents a novel continual learning framework based on asymmetric loss approximation with single-side overestimation (ALASSO), which effectively adapts to a large number of tasks. ALASSO approximates the true loss functions corresponding to the previously considered tasks asymmetrically by overestimating their unobserved sides in the parameter space while deriving the accurate quadratic approximation on the observed sides. The figure illustrates the main concept of the approach: it computes the optimal parameter through a quadratic approximation on the observed side (left) while using a steep surrogate quadratic function on the unobservable side (right). The proposed algorithm also decouples the hyperparameters for the current surrogate loss approximation and the surrogate loss change of the previous tasks. This approach is motivated by the observation that updating the model parameters of deep neural networks affects target losses asymmetrically and that using the overestimated loss functions is relatively safe for optimization without the training data of the previous tasks.

Proposed Algorithm:

Overview: This algorithm overestimates the unobserved sides of the approximate loss function and allows the models to learn under a harsher condition. It also derives the accurate parameter estimation of the approximate quadratic loss functions on their observed sides. To accelerate the optimization procedure and handle the conflicts between the loss computation of the current task and the loss approximation of the previous tasks, the authors introduce a hyperparameter decoupling technique, although the values of the decoupled hyperparameters should conceptually be identical.
Asymmetric Loss Approximation: The key to a better structural regularizer in continual learning is an asymmetric loss approximation of the previous tasks, because a symmetric regularizer as in SI oversimplifies the true loss functions and has a critical limitation in maintaining the knowledge obtained from the previous tasks. The approximate quadratic loss functions may be sufficiently accurate on the sides where the true losses are observable along the model parameter updates during training. However, they may incur substantial error on their unobserved sides, so it is dangerous to assume that the true loss functions are symmetric.
Accurate Quadratic Approximation: Deriving an accurate quadratic approximation on the observed sides leads to a remarkable performance improvement. See the article referenced at the end of this section for the equations.
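
The contrast between a symmetric and an asymmetric quadratic penalty can be sketched per parameter as follows. This is a conceptual illustration under stated assumptions, not the ALASSO equations: the rule used here to decide which side is unobserved (moving past the previous optimum `theta_star` in the direction training had been travelling from `theta_start`) and the overestimation factor `a` are simplifications introduced for the example.

```python
import numpy as np

def symmetric_penalty(theta, theta_star, omega):
    """SI-style quadratic penalty: the same curvature on both sides."""
    return np.sum(omega * (theta - theta_star) ** 2)

def asymmetric_penalty(theta, theta_star, theta_start, omega, a=3.0):
    """Quadratic penalty that is steeper on the assumed unobserved side.

    theta_start: parameters before training the previous task.
    theta_star:  parameters after training the previous task.
    omega:       per-parameter importance (curvature) estimates.
    a:           overestimation factor (> 1) for the unobserved side.
    """
    direction = theta_star - theta_start   # path observed during training
    offset = theta - theta_star
    # Assumption: the loss was only observed between theta_start and
    # theta_star, so continuing past theta_star is the unobserved side.
    unobserved = offset * direction > 0
    scale = np.where(unobserved, a, 1.0)
    return np.sum(scale * omega * offset ** 2)
```

With `theta_start = 0`, `theta_star = 1`, and `omega = 1`, a parameter at 0.5 receives the same penalty from both functions, while a parameter at 1.5 is penalized `a` times more by the asymmetric version, which is the harsher condition described in the overview.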

Analysis of results from ALASSO in comparison to SI, which is one of the state-of-the-art methods. The figure presents the per-task test accuracy of selected tasks on the permuted MNIST dataset over time. In the graph, the x-axis denotes the index of the most recently trained task, and the y-axis is accuracy. With both algorithms, all the visualized tasks achieve very high accuracy when they are first trained, but their accuracy degrades as newer tasks come in. This forgetting problem is far more severe in SI than in ALASSO; SI is not effective at maintaining the accuracy of previously learned tasks as newer tasks are trained.

For more details about the algorithm and the experiments, see the article Continual Learning by Asymmetric Loss Approximation with Single-Side Overestimation.

Additional links related to the same topic:

Language Models as Fact Checkers?

Uncertainty-guided Continual Learning with Bayesian Neural Networks

Compositional Language Continual Learning

Thanks for reading this blog, and I hope that it will be useful.