A simple and surprisingly effective family of conditioning mechanisms.
Many real-world problems require integrating multiple sources of information. Sometimes these problems involve multiple, distinct modalities of information — vision, language, audio, etc. — as is required to understand a scene in a movie or answer a question about an image. Other times, these problems involve multiple sources of the same kind of input, e.g., when summarizing several documents or drawing one image in the style of another.
When approaching such problems, it often makes sense to process one source of information in the context of another; for instance, in the right example above, one can extract meaning from the image in the context of the question. In machine learning, we often refer to this context-based processing as conditioning: the computation carried out by a model is conditioned or modulated by information extracted from an auxiliary input.
Finding an effective way to condition on or fuse sources of information is an open research problem, and in this article, we concentrate on a specific family of approaches we call feature-wise transformations. We will examine the use of feature-wise transformations in many neural network architectures to solve a surprisingly large and diverse set of problems; their success, we will argue, is due to being flexible enough to learn an effective representation of the conditioning input in varied settings. In the language of multi-task learning, where the conditioning signal is taken to be a task description, feature-wise transformations learn a task representation which allows them to capture and leverage the relationship between multiple sources of information, even in remarkably different problem settings.
To motivate feature-wise transformations, we start with a basic example, where the two inputs are images and category labels, respectively. For the purpose of this example, we are interested in building a generative model of images of various classes (puppy, boat, airplane, etc.). The model takes as input a class and a source of random noise (e.g., a vector sampled from a normal distribution) and outputs an image sample for the requested class.
Our first instinct might be to build a separate model for each class. For a small number of classes this approach is not too bad a solution, but for thousands of classes, we quickly run into scaling issues, as the number of parameters to store and train grows with the number of classes. We are also missing out on the opportunity to leverage commonalities between classes; for instance, different types of dogs (puppy, terrier, dalmatian, etc.) share visual traits and are likely to share computation when mapping from the abstract noise vector to the output image.
Now let’s imagine that, in addition to the various classes, we also need to model attributes like size or color. In this case, we can’t reasonably expect to train a separate network for each possible conditioning combination! Let’s examine a few simple options.
A quick fix would be to concatenate a representation of the conditioning information to the noise vector and treat the result as the model’s input. This solution is quite parameter-efficient, as we only need to increase the size of the first layer’s weight matrix. However, this approach makes the implicit assumption that the input is where the model needs to use the conditioning information. Maybe this assumption is correct, or maybe it’s not; perhaps the model does not need to incorporate the conditioning information until late into the generation process (e.g., right before generating the final pixel output when conditioning on texture). In this case, we would be forcing the model to carry this information around unaltered for many layers.
Because this operation is cheap, we might as well avoid making any such assumptions and concatenate the conditioning representation to the input of all layers in the network. Let’s call this approach concatenation-based conditioning.
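To make this concrete, here is a minimal PyTorch sketch of concatenation-based conditioning (module and variable names are ours, not from any particular implementation): a toy generator concatenates the conditioning representation to the input of every fully-connected layer.

```python
import torch
import torch.nn as nn

class ConcatConditionedMLP(nn.Module):
    """Toy generator that concatenates the conditioning vector to every layer's input."""
    def __init__(self, noise_dim, cond_dim, hidden_dim, out_dim):
        super().__init__()
        # Every layer sees its usual input plus the conditioning representation.
        self.fc1 = nn.Linear(noise_dim + cond_dim, hidden_dim)
        self.fc2 = nn.Linear(hidden_dim + cond_dim, hidden_dim)
        self.fc3 = nn.Linear(hidden_dim + cond_dim, out_dim)

    def forward(self, noise, cond):
        h = torch.relu(self.fc1(torch.cat([noise, cond], dim=-1)))
        h = torch.relu(self.fc2(torch.cat([h, cond], dim=-1)))
        return self.fc3(torch.cat([h, cond], dim=-1))

# Example: condition on a one-hot class label.
model = ConcatConditionedMLP(noise_dim=32, cond_dim=10, hidden_dim=64, out_dim=784)
noise = torch.randn(8, 32)
labels = torch.eye(10)[torch.randint(0, 10, (8,))]  # one-hot class vectors
samples = model(noise, labels)                       # shape: (8, 784)
```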
Another efficient way to integrate conditioning information into the network is via conditional biasing, namely, by adding a bias to the hidden layers based on the conditioning representation.
Interestingly, conditional biasing can be thought of as another way to implement concatenation-based conditioning. Consider a fully-connected linear layer applied to the concatenation of an input $\mathbf{x}$ and a conditioning representation $\mathbf{z}$:
$$W \begin{bmatrix} \mathbf{x} \\ \mathbf{z} \end{bmatrix} = \begin{bmatrix} W_x & W_z \end{bmatrix} \begin{bmatrix} \mathbf{x} \\ \mathbf{z} \end{bmatrix} = W_x \mathbf{x} + W_z \mathbf{z}.$$
The same result can be obtained by applying $W_x$ to $\mathbf{x}$ and adding the conditioning-dependent bias $W_z \mathbf{z}$.
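We can check this identity numerically; the following is a small PyTorch sketch (variable names are ours) comparing the two computations.

```python
import torch

torch.manual_seed(0)
n_x, n_z, n_out = 5, 3, 4
x = torch.randn(n_x)
z = torch.randn(n_z)

# One weight matrix applied to the concatenation [x; z].
W = torch.randn(n_out, n_x + n_z)
concat_out = W @ torch.cat([x, z])

# Split W into the block acting on x and the block acting on z.
W_x, W_z = W[:, :n_x], W[:, n_x:]
bias_out = W_x @ x + W_z @ z  # a linear layer on x plus a conditional bias W_z z

print(torch.allclose(concat_out, bias_out))  # expected: True
```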
Yet another efficient way to integrate class information into the network is via conditional scaling, i.e., scaling hidden layers based on the conditioning representation.
A special instance of conditional scaling is feature-wise sigmoidal gating: we scale each feature by a value between 0 and 1 (enforced by applying the logistic function), as a function of the conditioning representation. Intuitively, this gating allows the conditioning information to select which features are passed forward and which are zeroed out.
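For concreteness, here is a hedged PyTorch sketch of both variants (names are ours): a small linear map predicts per-feature scaling factors from the conditioning representation, optionally squashed through a sigmoid so that they act as gates.

```python
import torch
import torch.nn as nn

class ConditionalScaling(nn.Module):
    """Scale each feature of `h` by a factor predicted from the conditioning input."""
    def __init__(self, cond_dim, num_features, gated=False):
        super().__init__()
        self.to_scale = nn.Linear(cond_dim, num_features)
        self.gated = gated  # if True, use sigmoidal gating (scales in (0, 1))

    def forward(self, h, cond):
        scale = self.to_scale(cond)
        if self.gated:
            scale = torch.sigmoid(scale)  # gate: softly select which features pass through
        return h * scale

h = torch.randn(8, 64)      # hidden features
cond = torch.randn(8, 10)   # conditioning representation
gating = ConditionalScaling(cond_dim=10, num_features=64, gated=True)
h_gated = gating(h, cond)   # features are softly selected per example
```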
Given that both additive and multiplicative interactions seem natural and
intuitive, which approach should we pick? One argument in favor of
multiplicative interactions is that they are useful in learning
relationships between inputs, as these interactions naturally identify
“matches”: multiplying elements that agree in sign yields larger values than
multiplying elements that disagree. This property is why dot products are
often used to determine how similar two vectors are.
In the spirit of making as few assumptions about the problem as possible,
we may as well combine both into a
conditional affine transformation.
All methods outlined above share the common trait that they act at the feature level; in other words, they leverage feature-wise interactions between the conditioning representation and the conditioned network. It is certainly possible to use more complex interactions, but feature-wise interactions often strike a happy compromise between effectiveness and efficiency: the number of scaling and/or shifting coefficients to predict scales linearly with the number of features in the network. Also, in practice, feature-wise transformations (often compounded across multiple layers) frequently have enough capacity to model complex phenomena in various settings.
Lastly, these transformations only enforce a limited inductive bias and remain domain-agnostic. This quality can be a downside, as some problems may be easier to solve with a stronger inductive bias. However, it is this characteristic which also enables these transformations to be so widely effective across problem domains, as we will later review.
To continue the discussion on feature-wise transformations we need to abstract away the distinction between multiplicative and additive interactions. Without loss of generality, let’s focus on feature-wise affine transformations, and let’s adopt the nomenclature of Perez et al., who coin the term FiLM, for Feature-wise Linear Modulation, to refer to them.
We say that a neural network is modulated using FiLM, or FiLM-ed, after inserting FiLM layers into its architecture. These layers are parametrized by some form of conditioning information, and the mapping from conditioning information to FiLM parameters (i.e., the shifting and scaling coefficients) is called the FiLM generator. In other words, the FiLM generator predicts the parameters of the FiLM layers based on some auxiliary input. Note that the FiLM parameters are parameters in one network but predictions from another network, so they aren’t learnable parameters in the traditional sense. For simplicity, you can assume that the FiLM generator outputs the concatenation of all FiLM parameters for the network architecture.
As the name implies, a FiLM layer applies a feature-wise affine transformation to its input: given scaling and shifting coefficients $\boldsymbol{\gamma}$ and $\boldsymbol{\beta}$ produced by the FiLM generator, it computes $\mathrm{FiLM}(\mathbf{x}) = \boldsymbol{\gamma} \odot \mathbf{x} + \boldsymbol{\beta}$. By feature-wise, we mean that scaling and shifting are applied element-wise, or in the case of convolutional networks, feature map-wise.
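The following is a minimal PyTorch sketch of a FiLM layer for a convolutional network, together with a toy FiLM generator. It assumes the generator is a single linear map from the conditioning vector to one (γ, β) pair per feature map, which is one simple way to realize the description above.

```python
import torch
import torch.nn as nn

class FiLM(nn.Module):
    """Feature-map-wise affine modulation: out = gamma * x + beta."""
    def forward(self, x, gamma, beta):
        # x: (batch, channels, height, width); gamma, beta: (batch, channels)
        return gamma[:, :, None, None] * x + beta[:, :, None, None]

class FiLMGenerator(nn.Module):
    """Maps the conditioning input to a (gamma, beta) pair per feature map."""
    def __init__(self, cond_dim, num_channels):
        super().__init__()
        self.proj = nn.Linear(cond_dim, 2 * num_channels)

    def forward(self, cond):
        gamma, beta = self.proj(cond).chunk(2, dim=-1)
        return gamma, beta

cond = torch.randn(4, 16)          # e.g., a question or style embedding
feats = torch.randn(4, 32, 8, 8)   # feature maps from the FiLM-ed network
gamma, beta = FiLMGenerator(cond_dim=16, num_channels=32)(cond)
modulated = FiLM()(feats, gamma, beta)
```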
In addition to being a good abstraction of conditional feature-wise transformations, the FiLM nomenclature lends itself well to the notion of a task representation. From the perspective of multi-task learning, we can view the conditioning signal as the task description. More specifically, we can view the concatenation of all FiLM scaling and shifting coefficients as both an instruction on how to modulate the conditioned network and a representation of the task at hand. We will explore and illustrate this idea later on.
Feature-wise transformations find their way into methods applied to many problem settings, but because of their simplicity, their effectiveness is seldom highlighted; it tends to be overshadowed by other, more novel research contributions. Below are a few notable examples of feature-wise transformations in the literature, grouped by application domain. The diversity of these applications underscores the flexible, general-purpose ability of feature-wise interactions to learn effective task representations.
Perez et al. use FiLM layers to build a visual reasoning model that answers compositional, multi-step questions about synthetic images from the CLEVR dataset.
The model’s linguistic pipeline is a FiLM generator which
extracts a question representation that is linearly mapped to
FiLM parameter values. Using these values, FiLM layers inserted within each
residual block condition the visual pipeline. The model is trained
end-to-end on image-question-answer triples. Strub et al. later extend this approach with a multi-hop FiLM generator, which attends to the linguistic input repeatedly and predicts the FiLM parameters layer by layer.
de Vries et al. take a related approach to visual question answering on real-world images, modulating a pre-trained visual pipeline with linguistic information via conditional batch normalization.
The visual pipeline consists of a pre-trained residual network that is fixed throughout training. The linguistic pipeline manipulates the visual pipeline by perturbing the residual network’s batch normalization parameters, which re-scale and re-shift feature maps after activations have been normalized to have zero mean and unit variance. As hinted earlier, conditional batch normalization can be viewed as an instance of FiLM where the post-normalization feature-wise affine transformation is replaced with a FiLM layer.
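As a rough sketch, conditional batch normalization can be implemented by normalizing as usual and then applying conditioning-dependent scale and shift. The parameterization below, which predicts offsets around an identity scaling with a single linear map, is one common way to set this up and is close in spirit to, though not necessarily identical to, the setup described above; all names are ours.

```python
import torch
import torch.nn as nn

class ConditionalBatchNorm2d(nn.Module):
    """BatchNorm whose post-normalization scale and shift depend on a conditioning input."""
    def __init__(self, num_channels, cond_dim):
        super().__init__()
        # Normalize without learned affine parameters; conditioning provides them instead.
        self.bn = nn.BatchNorm2d(num_channels, affine=False)
        self.to_gamma_beta = nn.Linear(cond_dim, 2 * num_channels)

    def forward(self, x, cond):
        normalized = self.bn(x)  # zero mean, unit variance per channel
        gamma, beta = self.to_gamma_beta(cond).chunk(2, dim=-1)
        # Predict offsets around an identity scaling (assumption, not the only option).
        return (1 + gamma)[:, :, None, None] * normalized + beta[:, :, None, None]

x = torch.randn(4, 32, 8, 8)
question_embedding = torch.randn(4, 16)
cbn = ConditionalBatchNorm2d(num_channels=32, cond_dim=16)
out = cbn(x, question_embedding)
```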
In the domain of style transfer, Dumoulin et al. condition a single style transfer network on a choice among several artistic styles using conditional instance normalization, a feature-wise affine transformation applied after instance normalization, learning one set of scaling and shifting coefficients per style.
Yang et al.
So far, the models we covered have two sub-networks: a primary
network in which feature-wise transformations are applied and a secondary
network which outputs parameters for these transformations. However, this
distinction between FiLM-ed network and FiLM generator is not strictly necessary. As an example, Huang and Belongie address arbitrary style transfer with adaptive instance normalization, which derives the scaling and shifting coefficients from statistics of the style image's features.
Adaptive instance normalization can be interpreted as inserting a FiLM layer midway through the model. However, rather than relying on a secondary network to predict the FiLM parameters from the style image, the main network itself is used to extract the style features used to compute FiLM parameters. Therefore, the model can be seen as both the FiLM-ed network and the FiLM generator.
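Here is a compact sketch of that operation (function and variable names are ours): the content features are normalized per channel, then re-scaled and re-shifted using the per-channel statistics of the style features.

```python
import torch

def adaptive_instance_norm(content_feats, style_feats, eps=1e-5):
    """Normalize content features per channel, then apply the style's channel statistics.

    Both inputs have shape (batch, channels, height, width). The style's per-channel
    standard deviation acts as the FiLM scaling and its mean as the FiLM shift.
    """
    c_mean = content_feats.mean(dim=(2, 3), keepdim=True)
    c_std = content_feats.std(dim=(2, 3), keepdim=True) + eps
    s_mean = style_feats.mean(dim=(2, 3), keepdim=True)
    s_std = style_feats.std(dim=(2, 3), keepdim=True) + eps
    return s_std * (content_feats - c_mean) / c_std + s_mean

content = torch.randn(1, 64, 32, 32)
style = torch.randn(1, 64, 32, 32)
stylized_feats = adaptive_instance_norm(content, style)
```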
As discussed in previous subsections, there is nothing preventing us from considering a neural network’s activations themselves as conditioning information. This idea gives rise to self-conditioned models.
Highway Networks apply feature-wise sigmoidal gating computed from a layer's own input to interpolate between that layer's transformation and the identity mapping, which eases the training of very deep networks.
The ImageNet 2017 winning model, the Squeeze-and-Excitation network, re-scales each feature map of a convolutional block using a gating value predicted from the block's own globally pooled features.
For statistical language modeling (i.e., predicting the next word in a sentence), the LSTM relies heavily on feature-wise sigmoidal gating: its input, forget, and output gates, computed from the current input and the previous hidden state, modulate what is written to, kept in, and read from the cell state.
Also in the domain of language modeling, Dauphin et al. use gated linear units, in which half of a convolutional layer's output feature maps are multiplied by the sigmoid of the other half.
The Gated-Attention Reader performs reading comprehension by multiplying document token representations element-wise with query-specific representations over multiple hops.
The Gated-Attention architecture grounds natural language instructions for an agent by scaling the feature maps of its visual representation with sigmoidal gating values computed from the instruction.
Bahdanau et al. likewise use feature-wise modulation to condition an instruction-following agent on linguistic goal specifications.
Outside instruction-following, Kirkpatrick et al. train a single agent to play multiple Atari games by conditioning the network with game-specific scaling and biasing parameters.
The conditional variant of DCGAN, a well-established generative adversarial network architecture, conditions image generation on class information using concatenation-based conditioning.
For convolutional layers, concatenation-based conditioning requires the network to learn redundant convolutional parameters to interpret each constant, conditioning feature map; as a result, directly applying a conditional bias is more parameter efficient, but the two approaches are still mathematically equivalent.
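We can verify this equivalence numerically. The sketch below (names are ours) uses a convolution without padding, since zero-padding would truncate the constant conditioning maps at the borders; under that assumption, concatenating spatially constant conditioning channels to a convolution's input matches dropping those channels and adding a per-output-channel bias computed from the conditioning vector.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
in_ch, cond_dim, out_ch, k = 3, 4, 8, 3
x = torch.randn(1, in_ch, 16, 16)
z = torch.randn(cond_dim)

# Convolution over the input concatenated with constant conditioning feature maps.
conv_concat = nn.Conv2d(in_ch + cond_dim, out_ch, k, padding=0)
cond_maps = z[None, :, None, None].expand(1, cond_dim, 16, 16)
out_concat = conv_concat(torch.cat([x, cond_maps], dim=1))

# Equivalent: the same convolution restricted to the image channels, plus a conditional bias.
conv_plain = nn.Conv2d(in_ch, out_ch, k, padding=0)
conv_plain.weight.data = conv_concat.weight.data[:, :in_ch].clone()
conv_plain.bias.data = conv_concat.bias.data.clone()
cond_bias = conv_concat.weight.data[:, in_ch:].sum(dim=(2, 3)) @ z  # shape: (out_ch,)
out_bias = conv_plain(x) + cond_bias[None, :, None, None]

print(torch.allclose(out_concat, out_bias, atol=1e-5))  # expected: True
```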
PixelCNN, an autoregressive generative model of images, uses conditional biasing: a representation of the conditioning information (e.g., a class label) is added to the activations of each of its gated convolutional layers.
WaveNet describes two ways in which conditional biasing allows external information to modulate the audio or speech generation process based on a conditioning input: global conditioning, where a single conditioning representation (such as a speaker embedding) biases the generated sequence at every time step, and local conditioning, where a time-varying conditioning signal (such as linguistic features in text-to-speech) provides a different bias at each time step.
As in PixelCNN, conditioning in WaveNet can be viewed as inserting FiLM layers after each convolutional layer. The main difference lies in how the FiLM-generating network is defined: global conditioning expresses the FiLM-generating network as an embedding lookup which is broadcasted to the whole time series, whereas local conditioning expresses it as a mapping from an input sequence of conditioning information to an output sequence of FiLM parameters.
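Here is a hedged sketch of the two flavors; the names and shapes are ours, and this is heavily simplified relative to WaveNet itself, which applies these biases inside its gated activation units. A global conditioning bias comes from an embedding broadcast across all time steps, while a local conditioning bias comes from a learned mapping applied at each time step of a conditioning sequence.

```python
import torch
import torch.nn as nn

num_channels, cond_feat_dim, num_speakers, T = 32, 13, 10, 100
h = torch.randn(1, num_channels, T)  # activations of one convolutional layer over time

# Global conditioning: one embedding per speaker, broadcast over the whole sequence.
speaker_embed = nn.Embedding(num_speakers, num_channels)
speaker_id = torch.tensor([3])
global_bias = speaker_embed(speaker_id)[:, :, None]  # (1, channels, 1)
h_global = h + global_bias                           # same bias at every time step

# Local conditioning: a time-varying bias predicted from a conditioning sequence
# (e.g., upsampled linguistic features), one value per channel per time step.
cond_seq = torch.randn(1, cond_feat_dim, T)
to_local_bias = nn.Conv1d(cond_feat_dim, num_channels, kernel_size=1)
h_local = h + to_local_bias(cond_seq)                # bias varies across time
```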
Kim et al. apply a related feature-wise scaling and shifting mechanism to adapt acoustic models for speech recognition.
The key difference here is that the conditioning signal does not come from an external source but rather from utterance summarization feature vectors extracted in each layer to adapt the model.
For domain adaptation, Li et al. find it effective to adapt a trained network to a new domain simply by re-computing its batch normalization statistics on data from that domain.
For few-shot learning, Oreshkin et al. condition a feature extractor on the few-shot episode at hand, predicting feature-wise scaling and shifting coefficients from a representation of the task.
Aside from methods which make direct use of feature-wise transformations, the FiLM framework connects more broadly with the following methods and concepts.
The idea of learning a task representation shares a strong connection with
zero-shot learning approaches. In zero-shot learning, semantic task
embeddings may be learned from external information and then leveraged to
make predictions about classes without training examples. For instance, to
generalize to unseen object categories for image classification, one may
construct semantic task embeddings from text-only descriptions and exploit
objects’ text-based relationships to make predictions for unseen image
categories. Frome et al., for instance, map images into a pre-trained word embedding space, which allows their model to make predictions for object categories with no training images.
The notion of a secondary network predicting the parameters of a primary network is also well exemplified by HyperNetworks, which predict a primary network's weight matrices directly. From this perspective, a FiLM generator is a restricted kind of hypernetwork that predicts only feature-wise scaling and shifting coefficients, which is far more parameter-efficient.
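To make the contrast concrete, here is a toy PyTorch sketch of our own construction, not any published architecture: a hypernetwork predicts an entire weight matrix of the primary network, whereas a FiLM generator only predicts a scaling and a shifting vector.

```python
import torch
import torch.nn as nn

in_dim, out_dim, cond_dim = 16, 32, 8
x = torch.randn(1, in_dim)
cond = torch.randn(1, cond_dim)

# HyperNetwork-style: the secondary network outputs all in_dim * out_dim layer weights.
hyper = nn.Linear(cond_dim, in_dim * out_dim)
W = hyper(cond).view(out_dim, in_dim)
y_hyper = x @ W.t()

# FiLM-style: the secondary network outputs only 2 * out_dim modulation coefficients.
primary = nn.Linear(in_dim, out_dim)        # weights are ordinary learned parameters
film_gen = nn.Linear(cond_dim, 2 * out_dim)
gamma, beta = film_gen(cond).chunk(2, dim=-1)
y_film = gamma * primary(x) + beta
```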
Some parallels can be drawn between attention and FiLM, but the two operate in different ways which are important to disambiguate. Attention typically computes a normalized set of coefficients over spatial locations or time steps and uses them to pool information, whereas FiLM applies unnormalized scaling and shifting coefficients over features, irrespective of location.
This difference stems from distinct intuitions underlying attention and FiLM: the former assumes that specific spatial locations or time steps contain the most useful information, whereas the latter assumes that specific features or feature maps contain the most useful information.
With a little bit of stretching, FiLM can be seen as a special case of a bilinear transformation, which maps two input vectors $\mathbf{x}$ and $\mathbf{z}$ to an output $\mathbf{y}$ whose $k$-th feature is $y_k = \mathbf{x}^\top W_k \mathbf{z}$ for a learned matrix $W_k$. If we view $\mathbf{z}$ as the concatenation of the scaling and shifting vectors $\boldsymbol{\gamma}$ and $\boldsymbol{\beta}$ and if we augment the input $\mathbf{x}$ with a 1-valued feature, then a bilinear transformation with appropriately constrained weight matrices computes exactly $\boldsymbol{\gamma} \odot \mathbf{x} + \boldsymbol{\beta}$.
For some applications of bilinear transformations, see the Bibliographic Notes.
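As a sanity check of this construction (a sketch with our own variable names), we can build the constrained weight matrices explicitly and confirm that the bilinear transformation reproduces the FiLM output.

```python
import torch

torch.manual_seed(0)
d = 5
x = torch.randn(d)
gamma, beta = torch.randn(d), torch.randn(d)

# Bilinear form: y_k = x_aug^T W_k z, with x_aug = [x; 1] and z = [gamma; beta].
x_aug = torch.cat([x, torch.ones(1)])   # (d + 1,)
z = torch.cat([gamma, beta])            # (2d,)

W = torch.zeros(d, d + 1, 2 * d)        # one (d+1) x 2d matrix per output feature
for k in range(d):
    W[k, k, k] = 1.0       # picks x_k * gamma_k
    W[k, d, d + k] = 1.0   # picks 1 * beta_k

y_bilinear = torch.einsum('i,kij,j->k', x_aug, W, z)
y_film = gamma * x + beta
print(torch.allclose(y_bilinear, y_film))  # expected: True
```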
As hinted earlier, in adopting the FiLM perspective we implicitly introduce a notion of task representation: each task — be it a question about an image or a painting style to imitate — elicits a different set of FiLM parameters via the FiLM generator which can be understood as its representation in terms of how to modulate the FiLM-ed network. To help better understand the properties of this representation, let’s focus on two FiLM-ed models used in fairly different problem settings: the visual reasoning model of Perez et al. and the style transfer model of Ghiasi et al.
As a starting point, can we discern any pattern in the FiLM parameters as a function of the task description? One way to visualize the FiLM parameter space is to plot $\gamma$ against $\beta$, with each point corresponding to a specific task description and a specific feature map. If we color-code each point according to the feature map it belongs to we observe the following:
The plots above allow us to make several interesting observations. First, FiLM parameters cluster by feature map in parameter space, and the cluster locations are not uniform across feature maps. The orientation of these clusters is also not uniform across feature maps: the main axis of variation can be $\gamma$-aligned, $\beta$-aligned, or diagonal at varying angles. These findings suggest that the affine transformation in FiLM layers is not modulated in a single, consistent way, i.e., using $\gamma$ only, $\beta$ only, or $\gamma$ and $\beta$ together in some specific way. Maybe this is due to the affine transformation being overspecified, or maybe this shows that FiLM layers can be used to perform modulation operations in several distinct ways.
Nevertheless, the fact that these parameter clusters are often somewhat “dense” may help explain why the style transfer model of Ghiasi et al. can interpolate between artistic styles simply by interpolating between their FiLM parameters: convex combinations within these dense regions are likely to remain meaningful to the FiLM-ed network.
To some extent, the notion of interpolating between tasks using FiLM
parameters can be applied even in the visual question-answering setting.
Using the model trained in Perez et al., we can interpolate between the FiLM parameters generated for two different questions and observe how the model’s behavior changes along the way.
The network seems to be softly switching where in the image it is looking, based on the task description. It is quite interesting that these semantically meaningful interpolation behaviors emerge, as the network has not been trained to act this way.
Despite these similarities across problem settings, we also observe qualitative differences in the way in which FiLM parameters cluster as a function of the task description. Unlike the style transfer model, the visual reasoning model sometimes exhibits several FiLM parameter sub-clusters for a given feature map.
At the very least, this may indicate that FiLM learns to operate in ways that are problem-specific, and that we should not expect to find a unified and problem-independent explanation for FiLM’s success in modulating FiLM-ed networks. Perhaps the compositional or discrete nature of visual reasoning requires the model to implement several well-defined modes of operation which are less necessary for style transfer.
Focusing on individual feature maps which exhibit sub-clusters, we can try to infer how questions regroup by color-coding the scatter plots by question type.
Sometimes a clear pattern emerges, as in the right plot, where color-related questions concentrate in the top-right cluster — we observe that questions either are of type Query color or Equal color, or contain concepts related to color. Sometimes it is harder to draw a conclusion, as in the left plot, where question types are scattered across the three clusters.
In cases where question types alone cannot explain the clustering of the FiLM parameters, we can turn to the conditioning content itself to gain an understanding of the mechanism at play. Let’s take a look at two more plots: one for feature map 26 as in the previous figure, and another for a different feature map, also exhibiting several subclusters. This time we regroup points by the words which appear in their associated question.
In the left plot, the left subcluster corresponds to questions involving objects positioned in front of other objects, while the right subcluster corresponds to questions involving objects positioned behind other objects. In the right plot we see some evidence of separation based on object material: the left subcluster corresponds to questions involving matte and rubber objects, while the right subcluster contains questions about shiny or metallic objects.
The presence of sub-clusters in the visual reasoning model also suggests
that question interpolations may not always work reliably, but these
sub-clusters don’t preclude one from performing arithmetic on the question
representations, as Perez et al. demonstrate.
Perez et al. also report that their model adapts to the more varied, human-posed questions of CLEVR-Humans by fine-tuning only the FiLM generator, while keeping the FiLM-ed network itself fixed.
This points to a separation between the various computational primitives learned by the FiLM-ed network and the “numerical recipes” learned by the FiLM generator: the model’s ability to generalize depends both on its ability to parse new forms of task descriptions and on it having learned the required computational primitives to solve those tasks. We note that this multi-faceted notion of generalization is inherited directly from the multi-task point of view adopted by the FiLM framework.
Let’s now turn our attention back to the overall structural properties of FiLM parameters observed thus far. The existence of this structure has already been explored, albeit more indirectly, by Ghiasi et al., who train a style prediction network to map an arbitrary style image directly to the FiLM parameters of a style transfer network and rely on the structure of this parameter space to generalize to unseen styles.
The projection on the left is inspired by a similar projection done by Perez
et al.
To summarize, the way neural networks learn to use FiLM layers seems to vary from problem to problem, input to input, and even from feature to feature; there does not seem to be a single mechanism by which the network uses FiLM to condition computation. This flexibility may explain why FiLM-related methods have been successful across such a wide variety of domains.
Looking forward, there are still many unanswered questions.
Do these experimental observations on FiLM-based architectures generalize to
other related conditioning mechanisms, such as conditional biasing, sigmoidal
gating, HyperNetworks, and bilinear transformations? When do feature-wise
transformations outperform methods with stronger inductive biases and vice
versa? Recent work combines feature-wise transformations with stronger
inductive bias methods.
Finally, the fact that changes on the feature level alone are able to compound into large and meaningful modulations of the FiLM-ed network is still very surprising to us, and hopefully future work will uncover deeper explanations. For now, though, it is a question that evokes the even grander mystery of how neural networks in general compound simple operations like matrix multiplications and element-wise non-linearities into semantically meaningful transformations.
Multiplicative interactions have succeeded on various tasks, ever since
they were introduced in vision as “mapping units”.
Many models lie on the spectrum between FiLM and HyperNetworks. Tenenbaum and Freeman use bilinear models to separate style and content, an early application of the bilinear transformations discussed above.
This article would be nowhere near where it is today without the honest and constructive feedback we received from various people across several organizations. We would like to thank Chris Olah and Shan Carter from the Distill editorial team as well as Ludwig Schubert from the Google Brain team for being so generous with their time and advice. We would also like to thank Archy de Berker, Xavier Snelgrove, Pedro Oliveira Pinheiro, Alexei Nordell-Markovits, Masha Krol, and Minh Dao from Element AI; Roland Memisevic from TwentyBN; Dzmitry Bahdanau from MILA; Ameesh Shah and Will Levine from Rice University; Dhanush Radhakrishnan from Roivant Sciences; Raymond Cano from Plaid; Eleni Triantafillou from Toronto University; Olivier Pietquin and Jon Shlens from Google Brain; and Jérémie Mary from Criteo.
Review 1 - Anonymous
Review 2 - Anonymous
Review 3 - Chris Olah
If you see mistakes or want to suggest changes, please create an issue on GitHub.
Diagrams and text are licensed under Creative Commons Attribution CC-BY 4.0 with the source available on GitHub, unless noted otherwise. The figures that have been reused from other sources don’t fall under this license and can be recognized by a note in their caption: “Figure from …”.
For attribution in academic contexts, please cite this work as
Dumoulin, et al., "Feature-wise transformations", Distill, 2018.
BibTeX citation
@article{dumoulin2018feature-wise,
  author  = {Dumoulin, Vincent and Perez, Ethan and Schucher, Nathan and Strub, Florian and Vries, Harm de and Courville, Aaron and Bengio, Yoshua},
  title   = {Feature-wise transformations},
  journal = {Distill},
  year    = {2018},
  note    = {https://distill.pub/2018/feature-wise-transformations},
  doi     = {10.23915/distill.00011}
}