Compositional Generalization in Semantic Parsing: A Selection of Methods

Denis Lukovnikov
13 min read · Feb 16, 2021

In this post, we’ll discuss some methods that have recently been proposed to improve compositional generalization in semantic parsing.

Factorizing Syntax and Semantics

(Russin et al., 2020)

Paper: https://cogsci.mindmodeling.org/2020/papers/0027/0027.pdf

Cognitive science suggests that the human ability to generalize beyond the training data may come from a separation of syntax and semantics. Inspired by this, the authors try to implement this intuition in a modified attention mechanism.

(Formulas from the paper: (1) the semantic factor and (2) the syntactic factor.)

The proposed method, Syntactic Attention, assumes that the target action probability can be decomposed into (1) a semantic factor and (2) a syntactic factor. The semantic factor models the probability of producing some output action given some word in the input; importantly, the word representation used here is independent of its context. The syntactic factor, on the other hand, models the alignment between positions in the input and output sequences.
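Roughly, the factorization has the following form (my reconstruction from the paper’s description, so the notation is only approximate):

    p(y_i \mid x) = \sum_j \alpha_{ij} \, p(y_i \mid x_j)

Here p(y_i | x_j) is the semantic factor, computed from a context-independent representation of input word x_j, and α_ij is the syntactic factor: an attention weight that aligns output position i with input position j.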

The encoder consists of two “streams”. The semantic stream simply takes the word embeddings and passes them through a projection, without mixing in any context. The other stream is a regular bidirectional LSTM encoder.

The decoder produces output tokens from an attention-weighted summary of the “semantic” encodings, whereas the attention weights (the alphas) are computed from the “syntactic” encodings: the s are the states of the decoder LSTM and the h are the encoding vectors from the syntactic encoder.
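To make the two-stream idea concrete, here is a minimal PyTorch-style sketch of one decoding step. The module structure and names are mine, and the scoring function and decoder-state update are simplified guesses rather than the paper’s exact equations:

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class SyntacticAttentionStep(nn.Module):
        # one (simplified) decoding step of a Syntactic-Attention-style model
        def __init__(self, vocab_in, vocab_out, emb_dim, hid_dim):
            super().__init__()
            self.emb = nn.Embedding(vocab_in, emb_dim)
            self.sem_proj = nn.Linear(emb_dim, hid_dim)      # "semantic" stream: no context mixing
            self.syn_enc = nn.LSTM(emb_dim, hid_dim // 2,    # "syntactic" stream: contextual BiLSTM
                                   bidirectional=True, batch_first=True)
            self.dec_cell = nn.LSTMCell(hid_dim, hid_dim)
            self.out = nn.Linear(hid_dim, vocab_out)

        def forward(self, x, s_prev, c_prev):
            # x: (batch, seq_len) token ids; (s_prev, c_prev): previous decoder LSTM state
            e = self.emb(x)
            m = self.sem_proj(e)                                    # semantic encodings
            h, _ = self.syn_enc(e)                                  # syntactic encodings
            scores = torch.bmm(h, s_prev.unsqueeze(2)).squeeze(2)   # alignment scores
            alpha = F.softmax(scores, dim=1)                        # syntactic factor
            # the output distribution comes from the attention-weighted *semantic* summary
            sem_summary = torch.bmm(alpha.unsqueeze(1), m).squeeze(1)
            p_y = F.softmax(self.out(sem_summary), dim=1)
            # updating the decoder state from the attended *syntactic* summary is an assumption
            syn_summary = torch.bmm(alpha.unsqueeze(1), h).squeeze(1)
            s_next, c_next = self.dec_cell(syn_summary, (s_prev, c_prev))
            return p_y, s_next, c_next

Note how the output distribution only ever sees the context-independent semantic encodings; all contextual information has to flow through the attention weights.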

Results

This extremely simple approach improves generalization in the Add Jump scenario, as well as in the Turn Left scenario. This is expected, since the proposed attention factorization directly targets these scenarios.

However, it does not improve generalization to longer sequences.

Compositional Generalization for Primitive Substitutions

(Li et al., EMNLP 2019)

Paper: https://arxiv.org/pdf/1910.02612.pdf

This paper follows essentially the same approach as the concurrent work of Russin et al. that we just discussed. However, it differs in several small aspects, which appear to have a significant impact on the results.

The first is what the authors call entropy regularization: an L2 penalty is applied to the word embeddings, and noise is subsequently added to them. One set of such noisy word embeddings is then used in the “syntactic” seq2seq to compute the attention weights; another set of noisy word embeddings is used to generate the output probabilities given the alignment.

Another change from a normal seq2seq is that the previous output token is not fed back into the decoder. This prevents unwanted correlations between semantics and syntax in the decoder.
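A minimal sketch of the noisy-embedding part; the hyperparameter names and values here are mine, not the paper’s:

    import torch
    import torch.nn as nn

    embedding = nn.Embedding(num_embeddings=20, embedding_dim=64)   # toy sizes
    sigma, l2_weight = 0.1, 1e-3                                    # hypothetical hyperparameters

    def noisy(e):
        # add Gaussian noise to the embeddings (training time only)
        return e + sigma * torch.randn_like(e)

    x = torch.randint(0, 20, (8, 12))     # a dummy batch of token ids
    syn_emb = noisy(embedding(x))         # fed to the "syntactic" seq2seq that computes attention
    sem_emb = noisy(embedding(x))         # a second noisy copy, used for the output probabilities

    # the L2 penalty on the embedding table is simply added to the task loss
    reg = l2_weight * embedding.weight.pow(2).sum()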

Results

(Results table from the paper: test accuracy on primitive generalization tasks in SCAN.)

The results indicate that the method essentially solves the primitive generalization challenge.

However, generalization to longer sequences remains bad.

Compositional generalization through meta sequence-to-sequence learning

(Lake, NeurIPS 2019)

Paper: https://papers.nips.cc/paper/2019/file/f4d0e2e7fc057a58f7ca4a391f01940a-Paper.pdf

This work presents a memory-augmented meta-learning approach. I found the explanation rather sparse on a first read, so I might have missed some important details.

The model is an RNN-based seq2seq model, but with some modifications and a different training method: it uses an external memory that encodes the support set.

The external memory is a key-value mapping that contains encodings of the support inputs as keys and encodings of the support outputs as values. The support inputs are encoded with the same encoder that encodes the actual query; only the final RNN state is retained as the representation. In addition, the model introduces an output encoder for the support outputs. The support set is a set of examples, which here are mappings from language to commands (e.g. “run twice” → RUN RUN). Every support set example is thus encoded by the input and output encoders for its input and output parts, respectively, and the final states of these encoders become the key and value vectors used in the external memory.

This memory construction is illustrated in the corresponding part of the figure in the paper.

The decoding process starts when the model receives an input and proceeds as follows:

  1. The model first encodes the input using the input encoder (the same one used for the keys). This forms the query matrix for the memory attention step that follows.
  2. In the memory attention step, the representations produced by the input encoder are compared to the keys stored in the memory. This is implemented as an attention mechanism. The attention scores are used to create a weighted summary of the value vectors stored in memory.
  3. The produced value summaries for every token in the input are concatenated to produce the context C (which is a sequence of vectors, one for each input token).
  4. Finally, the actual decoder proceeds as usual, but uses C rather than the input encoding. In fact, C is simply a memory-”enriched” input encoding.
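A minimal sketch of the memory attention (steps 2 and 3 above); the function signature is mine, and batch dimensions are omitted for clarity:

    import torch
    import torch.nn.functional as F

    def memory_context(query_states, support_keys, support_values):
        # query_states:   (q_len, d)      encodings of the query input tokens
        # support_keys:   (n_support, d)  final encoder states of the support inputs
        # support_values: (n_support, d)  final encoder states of the support outputs
        scores = query_states @ support_keys.t()   # (q_len, n_support) attention scores
        attn = F.softmax(scores, dim=1)
        # weighted summaries of the value vectors, one per query token: this is the context C
        return attn @ support_values               # (q_len, d)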

Note that this model can be trained using backpropagation as usual; the real question is what ends up in the support sets that get encoded into the memory.

Results

The results show good performance on the primitive generalization tasks.

But the model also fails on the length generalization task.

… All this is well and good, but from the very beginning of this post, I could hear you thinking:

“…BUT WHAT IF WE JUST THROW A TRANSFORMER AT IT?”


That’s exactly what this paper does:

Compositional Generalization in Semantic Parsing: Pretraining vs. Specialized Architectures.

(Furrer et al. 2020)

Paper: https://arxiv.org/pdf/2007.08970.pdf

Furrer and colleagues analyze how the use of pre-trained transformer language models affects compositional generalization.

The paper also contains a nice overview of existing work, along with a clear comparison table.

(white = previously reported numbers; grey = results computed in this paper; * = models that use additional knowledge and thus might not be directly applicable to new datasets)

As we can see here, simply fine-tuning a large enough pre-trained transformer appears to work well for primitive generalization (Add Jump and Add Turn Left). It still fails on length generalization though, as well as on the SCAN MCD splits, and it doesn’t perform that well on the CFQ MCD splits (a reminder: a normal random split on CFQ gives near-100% accuracy for standard baseline models).

It is important to note that T5 is not your average BERT: it is a pre-trained text-to-text model that uses relative position representations.

From this table, we can also see that LANE achieves a spectacular 100% on all evaluated SCAN settings, and does so without using any additional knowledge! So let’s look into this one next.

Compositional Generalization by Learning Analytical Expressions (LANE)

(Liu et al., NeurIPS 2020)

Paper: https://arxiv.org/pdf/2006.10627.pdf

This is an interesting paper because it achieves 100% on both the primitive and the length generalization challenges of SCAN. None of the methods we discussed so far were able to solve length generalization, and according to the table from the previous paper, only one other method manages it, and only by using additional knowledge.

The proposed method relies on a memory-augmented neural network that consists of a Composer and a Solver. However, the approach requires reinforcement learning since it involves discrete latent decisions, which can complicate implementation and use. The authors also report that a single training run takes “20~25 hours” on a Tesla P100. To put this in context, SCAN contains about 21k examples, and a normal model can be trained in around 10–20 minutes on a mid-range gaming PC with 2019 hardware.

Decoding paradigm

First of all, it’s important to realize that LANE decodes its output very differently from a standard left-to-right decoder.

You might call it a memory-augmented hierarchical decoder that generates a sequence by iteratively consuming spans of the input and recombining previously generated output sub-sequences. Let’s see this with an example:

The input sentence is

  • run opposite left after walk twice (this is SCAN-speak for “Turn around to the left and run, after you walked twice”)

In the first step, the decoder might look at the first word, “run” and realize that it’s an action and can be replaced by a variable $x. In fact, it can also be translated to its action, RUN. The variable and its value are stored in a memory. So the sequence and the memory become:

  • $x opposite left after walk twice, [$x=RUN]

Then, the decoder might look at the word “left” and replace it with a variable too:

  • $x opposite $y after walk twice, [$x=RUN, $y=LTURN]

The remaining steps proceed similarly: at each step, the span that gets solved is replaced by a fresh variable whose value is stored in the memory.

In the next step, the decoder will look at “$x opposite $y” and recognize that this means it should turn around using $y and then do $x. This means taking $x and $y out of the memory and instead inserting $z with the new action sequence.

  • $x opposite $y after walk twice → $z after walk twice, [$z=LTURN LTURN RUN]
  • $z after walk twice → $z after $x twice, [$z=LTURN LTURN RUN, $x=WALK]
  • $z after $x twice → $z after $y, [$z=LTURN LTURN RUN, $y=WALK WALK]
  • $z after $y → $x, [$x=WALK WALK LTURN LTURN RUN]

By this point, there is only one memory entry, which contains the entire command sequence, and the input sequence has been reduced to a single variable ($x), whose value is our output.
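The whole rewriting paradigm can be mimicked with a toy, hand-written rewriter. In LANE, the Composer and the Solver are learned; the lexicon and the reduction rules below are hard-coded purely to replay the example above:

    # toy lexicon and reduction rules, hard-coded only to replay the example above
    LEX = {"run": ["RUN"], "walk": ["WALK"], "left": ["LTURN"]}

    def rewrite(tokens):
        memory, fresh = {}, iter("xyzuvw")

        def new_var(val):
            v = "$" + next(fresh)
            memory[v] = val
            return v

        # step 1: replace primitive words by variables bound to their actions
        tokens = [new_var(LEX[t]) if t in LEX else t for t in tokens]

        # step 2: repeatedly reduce known span patterns until one variable remains;
        # precedence: "opposite" before "twice" before "after"
        def reduce_once(toks):
            for i in range(len(toks) - 2):
                a, op, b = toks[i:i + 3]
                if op == "opposite" and a in memory and b in memory:
                    val = memory.pop(b) * 2 + memory.pop(a)     # e.g. LTURN LTURN RUN
                    return toks[:i] + [new_var(val)] + toks[i + 3:]
            for i in range(len(toks) - 1):
                a, op = toks[i:i + 2]
                if op == "twice" and a in memory:
                    val = memory.pop(a) * 2
                    return toks[:i] + [new_var(val)] + toks[i + 2:]
            for i in range(len(toks) - 2):
                a, op, b = toks[i:i + 3]
                if op == "after" and a in memory and b in memory:
                    val = memory.pop(b) + memory.pop(a)         # do the second part first
                    return toks[:i] + [new_var(val)] + toks[i + 3:]
            return toks

        while len(tokens) > 1:
            tokens = reduce_once(tokens)
        return memory[tokens[0]]

    print(rewrite("run opposite left after walk twice".split()))
    # -> ['WALK', 'WALK', 'LTURN', 'LTURN', 'RUN']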

The Model

There are two parts to the model: the Composer and the Solver. The Composer identifies which spans in the input can be translated and triggers the Solver to do so.

The Composer is essentially an iterative tagger and “merger”: at every iteration, it scores the current items in the sequence and checks whether a span should be “solved”. If not, it merges neighbouring spans and repeats. In the paper’s running example, “$y twice” is scored as solvable, so the merging stops there and the Solver is triggered.

The Solver is a seq2seq model with attention that takes the solvable span and decodes an expression for it. This expression may contain variables. To obtain the full expression, the values of those variables have to be substituted in from the memory. The full expression is then stored in memory under a new variable, after all the used variables have been popped from the memory.

Training

Since the two components are intertwined and there is no direct supervision for either, they have to be trained together; and since their latent decisions are discrete, reinforcement learning has to be used.

Essentially, the algorithm used is REINFORCE (with a baseline), and the reward is proportional to the length of the longest common substring between the final decoded expression and the intended one. This alleviates the sparse-reward problem of a plain 0/1 reward scheme, where a 1 would only be obtained if the expression were decoded 100% correctly.
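For illustration, here is one way such a reward could be computed. The normalization by the gold length is my assumption; the paper only says the reward is proportional to the longest common substring:

    def longest_common_substring(a, b):
        # length of the longest common contiguous subsequence of two token lists
        best = 0
        prev = [0] * (len(b) + 1)    # prev[j]: length of common suffix ending at a[i-1], b[j-1]
        for i in range(1, len(a) + 1):
            cur = [0] * (len(b) + 1)
            for j in range(1, len(b) + 1):
                if a[i - 1] == b[j - 1]:
                    cur[j] = prev[j - 1] + 1
                    best = max(best, cur[j])
            prev = cur
        return best

    def reward(pred, gold):
        return longest_common_substring(pred, gold) / max(len(gold), 1)

    print(reward(["WALK", "WALK", "RUN"], ["WALK", "WALK", "LTURN", "RUN"]))   # 0.5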

Results

As mentioned above, the authors achieve 100% on all SCAN tasks (🧠🚀), even on the length generalization task of SCAN, which none of the approaches discussed so far could solve.

It also achieves 100% on the MCD splits of SCAN.

Discussion

While the results are definitely impressive, it is important to keep in mind that SCAN has an extremely small vocabulary (<50 tokens). Despite this, LANE takes “20~25 hours” per training run. It is not clear how well this approach would work, or how long it would take, with a more realistic vocabulary on both the input and output sides (tens of thousands of unique input tokens and at least thousands of output tokens). With a larger vocabulary, the exploration space for any RL algorithm becomes much larger: while the action space of the Composer remains the same, there are many more possible inputs and combinations thereof, and the input and output spaces of the Solver both grow. Nevertheless, LANE or some modification of it might work on CFQ.

Learning Compositional Rules via Neural Program Synthesis

(Nye et al., NeurIPS 2020)

Paper: https://arxiv.org/pdf/2003.05562.pdf

This is another NeurIPS 2020 paper on the SCAN dataset and, like the previous paper (LANE) we discussed, takes a radically different approach to decoding the sequence of actions.

Before reading further, consider this:

If I tell you that “dax fep” means “RED RED RED” (RED repeated three times in a row) and “lug fep” means “BLUE BLUE BLUE”, what do you think “dax”, “lug” and “fep” mean?

Answer: (probably) “dax” = RED, “lug” = BLUE, and “fep” = repeat the meaning of the previous word three times.

A normal seq2seq decoder simply encodes the input and decodes the output, internally performing some black-box operations. In this standard approach, there is nothing that forces the model to learn a particular kind of rule, and it is difficult or impossible to inspect the learned rules and correct them.

Instead, the approach followed in this paper learns a set of translation rules, a grammar G, from a small set of examples called the support set. For example, given the two support examples “dax fep” → RED RED RED and “lug fep” → BLUE BLUE BLUE, the model has to deduce that “dax” → RED and “lug” → BLUE.

The process is illustrated in a figure in the paper: given a support set of examples, the model decodes the rules that solve those examples, and these rules are then used to solve the actual, more complicated examples.

The model itself is quite simple: it’s an LSTM-based encoder-decoder with attention that operates over sets of input sequences and decodes a grammar token by token. See the paper for more information.

Search

At test time, candidate programs are sampled until one is found that satisfies all support set examples; otherwise, the best candidate found is returned. This best set of rules is then applied to the actual input to produce a prediction.
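A sketch of that search loop; sample_grammar, apply, and the way the fallback candidate is scored (here: the number of satisfied support examples) are my assumptions, not the authors’ actual interface:

    def synthesize(model, support_set, max_samples=100):
        # sample candidate grammars until one is consistent with every support example
        best, best_hits = None, -1
        for _ in range(max_samples):
            grammar = model.sample_grammar(support_set)     # hypothetical interface
            hits = sum(1 for inp, out in support_set if grammar.apply(inp) == out)
            if hits == len(support_set):
                return grammar                              # consistent with all support examples
            if hits > best_hits:
                best, best_hits = grammar, hits
        return best                                         # fall back to the best candidate found

    # the returned grammar is then applied to the actual query input:
    # prediction = synthesize(model, support_set).apply(query)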

Training

What’s even more interesting is training. How do we train a model to produce these rules when the original data does not provide this information explicitly?

The paper is very concise in its description of training, and I’m not sure I completely understood it, but here we go.

“During each training episode, we randomly sample an interpretation grammar G from a distribution over interpretation grammars, or ‘meta-grammar’ M. We then sample a set of input sequences consistent with the sampled interpretation grammar, and apply the interpretation grammar to each input sequence to produce the corresponding output sequence, giving us a support set of input-output examples X_G.”

Further inspection of the specific training settings and examples in the Appendix significantly improved my understanding here.

It appears that the grammars sampled during training may include conflicting rules. For example, in Fig. A.2 in the Appendix, one of the sampled grammars contains “wif” → PINK while another contains “wif” → BLACK. The presence of support examples backing one of these two interpretations determines which rule is correct in a given episode. A reasonable question for such an approach is how many samples would be needed once the vocabularies grow to realistic sizes.

Results

Like LANE, this approach achieves 100% over all settings (🧠🚀), including the length generalization one.

Good-Enough Compositional Data Augmentation (GECA)

(Andreas, ACL 2020)

Paper: https://www.aclweb.org/anthology/2020.acl-main.676.pdf

Let’s take a look at an entirely different kind of method that proposes data augmentation to solve the problem instead of developing specialized models and training methods.

The paper addresses the primitive generalization case and argues that this case “amounts to an inference about syntactic categories”. The proposed solution amounts to assigning non-negligible probability to combinations that were not observed in the original training data.

The approach used is extremely simple and is illustrated as follows:

(figure from paper)

Suppose the three sentences shown in the figure are present in the training data. From the first pair of sentences, we can see that the patterns used with “She … the wug … in …” are interchangeable, meaning that whenever “… picks … up …” can be used, “… puts … down …” can also be used. Such pattern equivalences can then be exploited to generate new examples: if we have “Pat picks cats up”, and we know that “… picks … up …” can be replaced with “… puts … down …”, we can add the sentence “Pat puts cats down.” to our training set.
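A toy sketch of the substitution step, with the fragment equivalence hard-coded; in GECA, such equivalences are themselves discovered from fragments that occur in the same environments:

    import re

    def swap_fragments(corpus):
        # hard-coded fragment pair for illustration; GECA discovers such pairs from data
        frag_a = ("picks", "up")
        frag_b = ("puts", "down")
        new_sentences = []
        for sent in corpus:
            m = re.match(r"(\w+) {} (\w+) {}".format(*frag_a), sent)
            if m:
                subj, obj = m.groups()
                new_sentences.append("{} {} {} {}".format(subj, frag_b[0], obj, frag_b[1]))
        return new_sentences

    print(swap_fragments(["Pat picks cats up"]))   # -> ['Pat puts cats down']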

The paper also provides a linguistic discussion that is worth checking out.

Results

This almost embarrassingly simple method performs pretty well on primitive generalization SCAN tasks.

However, no results on length generalization in SCAN are reported, so we can assume there’s no improvement there.

GECA is also tested on some standard semantic parsing datasets, GeoQuery and Scholar.

The results on GeoQuery show some improvement over the baseline and over a different data augmentation method. Note that the “Question” column corresponds to a standard split, while the “Query” column corresponds to a “compositional” split.

The paper also doesn’t shy away from negative results.

On the Scholar dataset for semantic parsing, the method doesn’t provide any improvement on either the standard (“Question”) or the “compositional” (“Query”) split.

This was only a selection of recently proposed methods for compositional generalization in semantic parsing. While there are more, this selection covers most of the prevalent types of approaches.
