Multi-Task Learning for Dense Prediction Tasks: A Perspective

Arun
Nov 28, 2021

Machine learning and deep learning systems have traditionally been designed to tackle one task at a time. Biological systems, however, have been observed to handle multiple tasks concurrently; the human brain, for example, coordinates many tasks at once. This gave scientists the impetus to research generalized learning, where multiple tasks can be learned concurrently by the same system.

In general, deep learning has had a significant impact through its ability to learn abstractions and patterns from one task and apply them to a different but related problem, which is the fundamental idea of transfer learning. At a higher level of complexity, multi-task learning (MTL) aims to solve a generalized set of problems concurrently rather than a single task in isolation. This form of learning has been gaining popularity because many problems today are both multi-modal, i.e., they involve different kinds of data describing the same system, and naturally decompose into a series of tasks. For example, multi-task learning can help an intelligent advertising system determine the presence of a person, estimate their demographics, and track their eye movement to show a relevant advertisement, all different tasks learned by the same system.

With the witty yet aptly titled paper "One Model To Learn Them All", Kaiser et al. presented a single model that used building blocks from multiple domains to learn a series of tasks, from image captioning to English parsing. The paper showed that multi-task learning improved results in situations where less data was available, without significant performance degradation on large tasks. Many proposals since then have addressed MTL in different domains such as NLP, speech recognition, and bioinformatics.

The focus of this article is the application of MTL in computer vision, particularly to dense prediction tasks. Dense prediction, in general, refers to producing dense outputs for an input unit. In the case of image data, it refers to labeling data at the pixel level, unlike classification tasks, which produce results at the image level. The article follows an interesting paper by Vandenhende et al., "Multi-Task Learning for Dense Prediction Tasks: A Survey", that I had the opportunity to review.

What is dense prediction?

Dense prediction in computer vision is the task of predicting output values at the pixel level, and some use cases require information at exactly this granularity. Consider images and videos from video surveillance, which often suffer from haze, distortion, and extreme lighting, all of which are detrimental to proper predictions by vision algorithms. In such scenarios, estimating depth, creating super-resolution images, and similar tasks require predictions at the pixel level. These are typical dense prediction tasks.
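To make the distinction concrete, here is a minimal PyTorch sketch contrasting a per-pixel prediction head with an image-level classifier. The channel and class counts are arbitrary placeholders of my own, not values from the paper.

```python
import torch
import torch.nn as nn

features = torch.randn(1, 64, 128, 128)  # backbone features (B, C, H, W)

# Dense prediction: a fully-convolutional head keeps the spatial dimensions,
# producing one prediction per pixel (e.g., a class label for segmentation).
dense_head = nn.Conv2d(in_channels=64, out_channels=21, kernel_size=1)
per_pixel_logits = dense_head(features)   # shape (1, 21, 128, 128)

# Image-level classification: spatial information is pooled away,
# producing a single prediction for the whole image.
classifier = nn.Sequential(nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(64, 21))
image_level_logits = classifier(features)  # shape (1, 21)
```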

The rest of this article covers MTL architectures and some of the optimization techniques. The researchers conducted multiple experiments to compare the different architectures discussed; while I talk about the techniques, I only discuss the results of those experiments, not the entire setup.

Deep Multi-Task Architectures

Traditional Approach

Traditional models for MTL tried to capture the common information among the different tasks, which was believed to improve generalization performance. These models made assumptions that confined the task parameters to a very constrained space, for example, that they belong to a common probabilistic prior or that they reside in a low-dimensional subspace. But this causes an alternate problem: performance degrades if information is shared between unrelated tasks, an effect called negative transfer. To mitigate this, clustering techniques were used to ensure that only tasks with a common prior are associated with each other.

Deep Learning Approach

In deep learning based MTL, shared representations are learned from multi-task supervisory signals. Historically, the sharing techniques were divided into soft and hard parameter sharing.

Soft and Hard Parameter Sharing

Soft parameter sharing refers to designs where each task is assigned its own set of parameters and a feature sharing mechanism handles the cross-talk between different tasks. Cross-stitch networks are an example of soft parameter sharing: the model uses a linear combination of the activations in every layer of the task-specific networks as a means for soft feature fusion. The problem with soft parameter sharing is scalability, because the size of the multi-task network grows linearly with the number of tasks.

Hard parameter sharing refers to designs where the parameter set is divided into shared and task-specific parameters. The UberNet model featured a multi-head design across different network layers and scales, but the most characteristic hard parameter sharing design consists of a shared encoder that branches out into task-specific decoding heads. In hard parameter sharing, the branching points in the network are determined in an ad-hoc manner, which may lead to suboptimal task groupings. Remedies have been proposed; for example, stochastic filter groups repurpose each convolutional kernel to support either shared or task-specific behavior.
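As a minimal sketch of the characteristic design, here is what a shared encoder with task-specific heads might look like in PyTorch. The layer sizes and the choice of tasks are my own illustrative assumptions, not from the survey.

```python
import torch
import torch.nn as nn

class HardSharingNet(nn.Module):
    """Hard parameter sharing: one shared encoder, one head per task."""

    def __init__(self, num_seg_classes=21):
        super().__init__()
        # Shared parameters: every task trains this encoder.
        self.shared_encoder = nn.Sequential(
            nn.Conv2d(3, 64, 3, padding=1), nn.ReLU(),
            nn.Conv2d(64, 64, 3, padding=1), nn.ReLU(),
        )
        # Task-specific parameters: each task branches into its own head.
        self.seg_head = nn.Conv2d(64, num_seg_classes, 1)  # semantic segmentation
        self.depth_head = nn.Conv2d(64, 1, 1)              # monocular depth

    def forward(self, x):
        shared = self.shared_encoder(x)
        return {"segmentation": self.seg_head(shared),
                "depth": self.depth_head(shared)}

net = HardSharingNet()
outputs = net(torch.randn(1, 3, 128, 128))  # per-task dense predictions
```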

It should be noted that the latest approaches to distilling task predictions have changed. In the past, all task outputs were predicted directly from the same input in one processing cycle, whereas recent works use a multi-task network to make initial task predictions and then use features from those predictions to further improve each task output recursively.

Encoder-focused and Decoder-focused Models

Given that the latest research has drawn on both historical paradigms to generate better results, a new taxonomy is used for classifying MTL models. Rather than using parameter sharing as the primary classification criterion, the location of information sharing is used. Networks where information sharing is limited to the encoder, with decoding happening in task-specific heads, are called encoder-focused, while networks where the decoders also exchange information during the decoding stage are called decoder-focused models.

Encoder-focused networks share the task features in the encoding stage, while further processing is handled by independent decoding heads. The encoder learns a generic representation of the data, and the task-specific heads make the predictions for every task. Some examples of encoder-focused networks are:

Cross-stitch networks share the activations amongst all single-task networks in the encoder. One way to achieve this is to take a learnable linear combination of the activation maps and feed the transformed result to the next layer of each single-task network. The goal is to learn the weights that determine the degree to which features are shared between different tasks. As noted above, the size of the network increases with the number of tasks; sluice networks avoid this through selective sharing of skip connections and subspaces.
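Here is a minimal sketch of a cross-stitch unit for two tasks. The near-identity initialization is a common choice that I am assuming for illustration; it keeps the two networks mostly task-specific at the start of training.

```python
import torch
import torch.nn as nn

class CrossStitchUnit(nn.Module):
    """Fuses activations of two task networks via a learnable 2x2 combination."""

    def __init__(self):
        super().__init__()
        # alpha[i][j] = how much task j's activations flow into task i.
        self.alpha = nn.Parameter(torch.tensor([[0.9, 0.1],
                                                [0.1, 0.9]]))

    def forward(self, x_a, x_b):
        fused_a = self.alpha[0, 0] * x_a + self.alpha[0, 1] * x_b
        fused_b = self.alpha[1, 0] * x_a + self.alpha[1, 1] * x_b
        return fused_a, fused_b  # fed to the next layer of each task network
```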

Neural Discriminative Dimensionality Reduction CNNs (NDDR-CNNs) are similar to cross-stitch networks but apply a dimensionality reduction mechanism to fuse the activations from all the single-task networks: the activations are concatenated channel-wise, and a 1x1 convolutional layer then fuses them across all the channels.
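A rough sketch of the NDDR fusion step for two tasks follows; the original design also involves batch normalization and careful initialization, which I omit here for brevity.

```python
import torch
import torch.nn as nn

class NDDRLayer(nn.Module):
    """Concatenate task activations, then reduce back to each task's width."""

    def __init__(self, channels=64):
        super().__init__()
        # 1x1 convolutions perform the discriminative dimensionality reduction.
        self.reduce_a = nn.Conv2d(2 * channels, channels, kernel_size=1)
        self.reduce_b = nn.Conv2d(2 * channels, channels, kernel_size=1)

    def forward(self, x_a, x_b):
        stacked = torch.cat([x_a, x_b], dim=1)        # (B, 2C, H, W)
        return self.reduce_a(stacked), self.reduce_b(stacked)
```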

Multi-Task Attention Networks (MTAN) use a shared backbone network in conjunction with task-specific attention modules in the encoder, which allow each task to select its own features from the shared representation.

Fully-Adaptive Feature Sharing (FAFS) starts from a thin network and dynamically widens it during training, greedily grouping similar tasks together into branches.
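As a rough illustration of the attention idea behind MTAN, a task-specific module can compute a soft mask over the shared features. The module sizes here are assumed, and the actual architecture applies such modules at multiple stages of the backbone.

```python
import torch
import torch.nn as nn

class TaskAttentionModule(nn.Module):
    """Soft, element-wise gate that selects task-relevant shared features."""

    def __init__(self, channels=64):
        super().__init__()
        self.mask = nn.Sequential(
            nn.Conv2d(channels, channels, kernel_size=1), nn.ReLU(),
            nn.Conv2d(channels, channels, kernel_size=1), nn.Sigmoid(),
        )

    def forward(self, shared_features):
        # Mask values lie in (0, 1): 1 keeps a feature, 0 suppresses it.
        return self.mask(shared_features) * shared_features
```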

The problem with encoder-focused networks is that they predict all task outputs directly from the same input in one processing cycle. This means they fail to capture the commonalities and differences among tasks that are likely to improve cross-task learning, which has resulted in only moderate performance among encoder-focused models.

Decoder-focused networks alleviate this problem by utilizing a recursive improvement approach that refines the predictions obtained from a multi-task network.

PAD-Net was one of the first decoder-focused architectures. The input image is first processed by an off-the-shelf backbone network, and the backbone features are then processed by a set of task-specific heads that produce an initial prediction for every task. These initial predictions are subsequently distilled across tasks in a multi-modal distillation module before the final task outputs are produced.
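Below is a simplified sketch of what multi-modal distillation can look like: the features behind each task's initial prediction receive messages from the other tasks, gated by a learned spatial attention. This is an illustrative reduction with assumed module shapes, not the paper's exact formulation.

```python
import torch
import torch.nn as nn

class MultiModalDistillation(nn.Module):
    """Refine each task's features using attention-gated messages from the others."""

    def __init__(self, tasks, channels=64):
        super().__init__()
        self.tasks = tasks
        self.message = nn.ModuleDict(
            {t: nn.Conv2d(channels, channels, 3, padding=1) for t in tasks})
        self.attention = nn.ModuleDict(
            {t: nn.Sequential(nn.Conv2d(channels, 1, 1), nn.Sigmoid()) for t in tasks})

    def forward(self, feats):  # feats: {task_name: tensor of shape (B, C, H, W)}
        refined = {}
        for t in self.tasks:
            out = feats[t]
            for s in self.tasks:
                if s != t:
                    # Message from task s, gated by its spatial attention map.
                    out = out + self.attention[s](feats[s]) * self.message[s](feats[s])
            refined[t] = out
        return refined
```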

Joint Task-Recursive Learning (JTRL) recursively predicts two tasks at increasingly higher scales in order to gradually refine the results based on past states.

Optimization — Task Balancing

The paper discusses the optimization required in MTL to ensure that learning is not dominated by one of the tasks. Several task balancing strategies and their impact are covered, such as weighting each task's loss by its homoscedastic uncertainty, or normalizing gradient magnitudes across tasks as in GradNorm.
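As one concrete example covered in the survey, uncertainty weighting (Kendall et al.) learns a per-task uncertainty that scales the corresponding loss, so that noisier tasks are automatically down-weighted. Here is a minimal sketch of that idea, using the commonly used simplified log-variance formulation.

```python
import torch
import torch.nn as nn

class UncertaintyWeighting(nn.Module):
    """Combine task losses, weighting each by a learned uncertainty."""

    def __init__(self, num_tasks):
        super().__init__()
        # log(sigma^2) per task, learned jointly with the network weights.
        self.log_vars = nn.Parameter(torch.zeros(num_tasks))

    def forward(self, task_losses):
        total = 0.0
        for i, loss in enumerate(task_losses):
            precision = torch.exp(-self.log_vars[i])  # 1 / sigma^2
            # The log-variance term keeps the uncertainty from growing unboundedly.
            total = total + precision * loss + self.log_vars[i]
        return total
```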

Conclusion

This article primarily focused on the different multi-task learning methodologies for dense prediction in computer vision, covering both encoder-focused and decoder-focused architectures as well as task balancing during optimization.

Interesting Links

The authors have also published the code corresponding to each of the models, which can be found here:

https://github.com/SimonVandenhende/Multi-Task-Learning-PyTorch

The following talk provides an interesting insight into designing with MTL at Tesla:

https://slideslive.com/38917690/multitask-learning-in-the-wilderness
