Compositional Generalization in Semantic Parsing: More Approaches

Denis Lukovnikov
10 min read · Apr 29, 2021


Hierarchical Poset Decoding for Compositional Generalization in Language

(Guo et al., NeurIPS 2020)

Paper: https://arxiv.org/pdf/2010.07792.pdf

In this paper, the authors start from the observation that in semantic parsing, the relative order of certain parts of the logical forms does not matter; this means there is a certain degree of partial permutational invariance, which is also formalized and explored in other work (Gordon et al. 2020).

This can be observed in different query languages. In SPARQL queries, the order of the triple patterns in the WHERE clause doesn’t matter. For example, the query

  • SELECT count(*) WHERE {?x ns:director ns:Nolan . ?x ns:writer ns:Nolan}

is the same as:

  • SELECT count(*) WHERE {?x ns:writer ns:Nolan . ?x ns:director ns:Nolan}

In other languages, one can also find that the conditions are interchangeable.

Nevertheless, standard seq2seq models used in NMT and applied to semantic parsing assume a certain ordering during training. This ordering can be systematic but will always be artificial.

The paper argues that “imposing additional ordering constraints increases learning complexity which makes the learning likely to overfit to bias ordering information in training distribution”.

Rather than decoding in a particular order, the authors propose to model the decoding such that the permutational invariances are properly exploited during decoding. This intuition can be illustrated as follows: in (c) in the figure, the two conditions are decoded in parallel rather than sequentially like in (a).

(Figure 1 from paper)

Section 3 in the paper defines posets, or partially ordered sets, and discusses their properties. Partially ordered sets are sets where an ordering relation exists for some pairs of elements, but where a comparison might not exist for other pairs. Thus, posets generalize totally ordered sets. Posets naturally lend themselves to our use case because the query contains branches that can be decoded in any order and are thus not comparable. However, within some of those branches, for example within a single triple pattern, an order must exist to convey which entity or variable is the subject or the object and what the relation is. Totally ordered sets then correspond to the standard decoding of a fixed linearization of a query.
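As a quick refresher (this is just the textbook definition, not something specific to the paper): a poset is a set S with a relation ≤ that satisfies, for all a, b, c in S:

    a \leq a \quad (\text{reflexivity}), \qquad
    a \leq b \,\wedge\, b \leq a \;\Rightarrow\; a = b \quad (\text{antisymmetry}), \qquad
    a \leq b \,\wedge\, b \leq c \;\Rightarrow\; a \leq c \quad (\text{transitivity}).

In a total order, any two elements are additionally comparable; dropping that requirement is exactly what lets two query branches stay unordered relative to each other.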

The authors then define semantic posets, which are used for semantic parsing in the proposed approach. These posets can be represented as DAGs in a diagram. An example of a semantic poset is given in a figure in the paper; the corresponding question is “Who edited and executive produced Night of Dark Shadows’ sequel and edited Corpse Bride?”

In Section 4, the authors describe a basic poset decoder and in Section 5, they propose their hierarchical poset decoder.

Poset decoding is different in that it decodes different traversal paths of the poset in parallel, which are then merged at variables and entities to produce the poset that represents the WHERE clause of the query. If I understood this correctly, for the example above, the poset decoder first produces the sequences [x_0, PERSON], [x_0, EDIT, Corpse_Bride], [x_0, EDIT, x_1, SEQUEL, Night_of_Dark_Shadows] and [x_0, EXEC_PRODUCE, x_1, SEQUEL, Night_of_Dark_Shadows]. Note that these are decoded in parallel and are thus less likely to pick up spurious ordering patterns. In the next step, all variables (x_0, x_1) and entities (Corpse_Bride, Night_of_Dark_Shadows) are unified across paths, which merges the paths into a single poset. (For more details on the model and training, see the paper.)
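To make the merging step a bit more concrete, here is a minimal sketch of how the decoded traversal paths could be merged into a single DAG by unifying nodes that carry the same variable or entity name. This is my own illustration (the "is_a" marker for the type constraint is my invention), not the authors' code:

    def merge_paths(paths):
        """paths: lists alternating node, edge label, node, edge label, node, ..."""
        edges = set()
        for path in paths:
            nodes = path[0::2]   # positions 0, 2, 4, ... hold variables/entities
            labels = path[1::2]  # positions 1, 3, 5, ... hold predicates
            for src, label, dst in zip(nodes, labels, nodes[1:]):
                edges.add((src, label, dst))
        return edges             # the union of labelled edges is the merged DAG

    paths = [
        ["x_0", "is_a", "PERSON"],
        ["x_0", "EDIT", "Corpse_Bride"],
        ["x_0", "EDIT", "x_1", "SEQUEL", "Night_of_Dark_Shadows"],
        ["x_0", "EXEC_PRODUCE", "x_1", "SEQUEL", "Night_of_Dark_Shadows"],
    ]
    print(merge_paths(paths))

Because x_0, x_1 and the entities are shared between paths, simply taking the union of their labelled edges already yields the merged poset.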

The hierarchical poset decoder consists of three parts: (1) sketch prediction, (2) primitive prediction and (3) traversal path prediction.

During sketch prediction, following other works on hierarchical decoding, the general structure of the query is decoded first. However, different from previous work, here, the sketch prediction model relies on poset decoding rather than sequence or tree decoding.

A sketch is formed by replacing entities with “E” and predicates with “P”. An example of a sketch:

(figure from paper)
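As a toy illustration of what the sketching operation does (my own code, with hypothetical ENTITIES/PREDICATES lookup sets, not from the paper), it essentially masks out the concrete primitives:

    ENTITIES = {"Corpse_Bride", "Night_of_Dark_Shadows"}
    PREDICATES = {"EDIT", "EXEC_PRODUCE", "SEQUEL"}

    def to_sketch(path):
        """Replace entities with "E" and predicates with "P"; keep variables as-is."""
        return ["E" if tok in ENTITIES else
                "P" if tok in PREDICATES else tok
                for tok in path]

    print(to_sketch(["x_0", "EDIT", "x_1", "SEQUEL", "Night_of_Dark_Shadows"]))
    # -> ['x_0', 'P', 'x_1', 'P', 'E']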

The purpose of primitive prediction is to predict which elements are likely to be used in the final query. Essentially, primitives are the entities and predicates that can be filled in the “P” and “E” slots in the sketch. Note that more primitives can be predicted than are eventually used. The authors “implement the primitive prediction module by adopting phrase table induction from statistical machine translation (Koehn, 2009).”

For our example, the output of primitive prediction is shown in a figure in the paper.

The third phase is traversal path prediction. In this final step, all possible assignments of the predicted candidate primitives to the predicted sketch are used to obtain a set of all possible traversal paths, which are subsequently scored using an ESIM network (Chen et al. 2016) (any sequence-pair classification model would do). The output poset DAG can then be constructed from the set of predicted traversal paths, as illustrated in a figure in the paper.
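Roughly, the enumeration-and-scoring step could look like the sketch below. The scorer stands in for the ESIM sequence-pair model, the 0.5 threshold is an arbitrary choice of mine, and none of this is the released code:

    from itertools import product

    def candidate_paths(sketch_path, predicates, entities):
        """Fill every "P"/"E" slot of a sketch path with every candidate primitive."""
        slots = [predicates if tok == "P" else entities if tok == "E" else [tok]
                 for tok in sketch_path]
        for filling in product(*slots):
            yield list(filling)

    def predict_paths(question, sketch_paths, predicates, entities, scorer):
        kept = []
        for sketch_path in sketch_paths:
            for path in candidate_paths(sketch_path, predicates, entities):
                if scorer(question, path) > 0.5:   # scorer ~ ESIM in the paper
                    kept.append(path)
        return kept    # the kept paths are then merged into the output poset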

Results

The proposed method is only evaluated on CFQ. This is probably because in SCAN, there is no partial order to the generated sequences and thus the poset idea would not bring any benefit.

The results on CFQ look really good:

(Accuracies on CFQ)

From the ablation results, it appears that normal models, such as seq2seq and seq2tree, can also be significantly improved by first performing sketch decoding. Still, while sketch-based decoding brings most of the improvement, the results are further improved by using the proposed poset decoder.

Span-based Semantic Parsing for Compositional Generalization

(Herzig and Berant, 2020/09)

Paper: https://arxiv.org/pdf/2009.06040.pdf

In this paper, the authors propose an approach that parses disjoint spans of the input utterance independently and composes them into the final semantic parse. This is accomplished using a span-tagging model that predicts categories for spans; these categories determine how the parse is constructed.

An example of how the parse of an utterance is generated is given in the following figure:

Here, for example, the span “capital” is annotated with the predicate capital(), and “borders” with next_to_1(). The join label indicates that the meaning of a span is composed from the meanings of its subspans, and the ϕ category indicates that a span has no meaning.

Model

The model has to encode the input utterance and make predictions over the different spans (note that spans can cover multiple words). The authors simply use a BERT-base encoder and represent every span as a concatenation of the encoder vectors at its start and end positions. The category probabilities for every possible span are then computed using a softmax, as usual, IIUC.
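Under that reading, the span tagger could be sketched roughly as follows (my interpretation in PyTorch, not the authors' code; dimensions and names are my own):

    import torch
    import torch.nn as nn
    from transformers import BertModel

    class SpanTagger(nn.Module):
        def __init__(self, num_categories):
            super().__init__()
            self.encoder = BertModel.from_pretrained("bert-base-uncased")
            hidden = self.encoder.config.hidden_size
            self.classifier = nn.Linear(2 * hidden, num_categories)

        def forward(self, input_ids, attention_mask):
            h = self.encoder(input_ids, attention_mask=attention_mask).last_hidden_state
            n = h.size(1)
            start = h.unsqueeze(2).expand(-1, n, n, -1)   # token i for span (i, j)
            end = h.unsqueeze(1).expand(-1, n, n, -1)     # token j for span (i, j)
            span_repr = torch.cat([start, end], dim=-1)   # concat start/end states
            # scores for every (i, j) cell; only cells with i <= j are meaningful
            return self.classifier(span_repr).log_softmax(-1)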

Inference

The model is trained (more on this next) to label the spans correctly, but by itself it only assigns a score to each possible span. To actually find the most probable span tree, the CKY (Cocke–Younger–Kasami) algorithm is used.
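To convey the idea, here is a very stripped-down CKY-style chart sketch: each cell holds the best score of a span tree over that span, combining the span's own best category score with the best binary split. The real inference is richer (it also tracks categories and the actual tree), so treat this only as an illustration:

    def cky_best_score(span_score, n):
        """span_score[(i, j)]: best category log-prob for the span of tokens [i, j)."""
        chart = {}
        for length in range(1, n + 1):
            for i in range(n - length + 1):
                j = i + length
                best = span_score[(i, j)]          # span kept as a single unit
                for k in range(i + 1, j):          # ... or split into two subspans
                    best = max(best, span_score[(i, j)] + chart[(i, k)] + chart[(k, j)])
                chart[(i, j)] = best
        return chart[(0, n)]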

Training

The model needs labels that specify which category to assign to every span. However, this supervision is not available, and thus the authors use a hard expectation-maximization approach where the best tree under the current model is obtained using a constrained version of the CKY inference algorithm. The algorithm is constrained such that it can only generate trees that lead to the correct output. While this restricts the search space considerably, there could still exist many different trees that generate the same logical form. The hard EM approach simply takes the highest-scoring candidate (under the current model) and uses it to extract spans and tags to train the span-tagging model.
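In pseudocode, one hard-EM epoch might look like the sketch below, where constrained_cky, spans_and_tags and train_step are hypothetical helpers standing in for the constrained inference, label extraction and gradient step described above:

    def hard_em_epoch(model, data):
        for utterance, gold_logical_form in data:
            # E-step: best tree (under the current model) among the trees
            # whose composed meaning equals the gold logical form
            tree = constrained_cky(model, utterance, gold_logical_form)
            # M-step: treat that tree's span labels as supervision for the tagger
            spans, tags = spans_and_tags(tree)
            train_step(model, utterance, spans, tags)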

Results

The approach was evaluated on (1) a semantic parsing variant of SCAN that consists of logical form trees rather than simply a sequence of commands, (2) CLEVR (a visual question answering benchmark) and (3) GeoQuery (standard NL → query dataset).

The results indicate that under challenging splits, the proposed method outperforms several baselines.

Revisiting Iterative Back-Translation from the Perspective of Compositional Generalization

(Guo et al., AAAI 2021)

Paper: https://arxiv.org/pdf/2012.04276.pdf

In this paper, iterative back-translation is investigated in the context of compositional generalization. The authors find that iterative back-translation improves the results a lot, which is especially interesting since it is a relatively simple approach compared to some other works on compositional generalization and does not require using a novel model, training with discrete latent variables, or specific prior knowledge. The approach does assume, however, that some monolingual data is available in addition to the paired training data.

Iterative back-translation improves compositional generalization

First, the authors experiment with simple iterative back-translation (IBT). Training with IBT uses two translation models: a natural language to formal language (N2F) model, which is the one we are actually interested in, and a formal language to natural language (F2N) model for translating queries/action sequences back to natural language. Moreover, the IBT approach assumes that some of the data are given as paired sequences (x, y) and some of the data are “monolingual”. In our case, that means that some data are given as natural language inputs only (x’) and some data as formal language outputs only (y’).

Training with IBT is a two-phase process.

First, the N2F and F2N models are trained on the given parallel data (x, y). This ensures that the models have learned mappings in both directions.

Then, training proceeds with the parallel data as well as artificial pairs obtained by automatically translating the monolingual data:

  1. some monolingual NL data from x’ are taken and translated to y’* using the current N2F model.
  2. some monolingual FL data from y’ are taken and translated to x’* using the current F2N model.
  3. the N2F model is trained on some of the original (x, y) data, as well as the “fake” (x’*, y’) data that we just generated using the F2N model.
  4. the F2N model is trained on some of the original (x, y) data, as well as the “fake” (x’, y’*) data that we just generated using the N2F model.

Note that the “fake” data are generated again at every epoch.
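Putting the four steps together, one IBT epoch could be sketched as below (translate and train_step are assumed helpers; in practice this is done with sampled mini-batches rather than a plain zip, so treat this purely as an outline):

    def ibt_epoch(n2f, f2n, parallel, mono_nl, mono_fl):
        for (x, y), x_mono, y_mono in zip(parallel, mono_nl, mono_fl):
            y_fake = translate(n2f, x_mono)    # step 1: NL-only data -> fake FL
            x_fake = translate(f2n, y_mono)    # step 2: FL-only data -> fake NL
            train_step(n2f, x, y)              # step 3: N2F on real pairs ...
            train_step(n2f, x_fake, y_mono)    # ... and on back-translated (x'*, y')
            train_step(f2n, y, x)              # step 4: F2N on real pairs ...
            train_step(f2n, y_fake, x_mono)    # ... and on back-translated (x', y'*)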

This simple approach results in really remarkable improvements on both SCAN and CFQ:

(The result table for SCAN, copied from the paper)
(The result table for CFQ, copied from the paper)

The “mono30”, “mono100” and “transductive” settings are the settings in which the authors used the IBT approach.

For mono30 and mono100, 30% and 100% of the development set, respectively, were taken as monolingual data, whereas the transductive setting uses the entire test set as monolingual data. The mono100 and transductive settings are thus less realistic. For the SCAN splits that don’t have validation sets of their own, the authors split the test set 50/50 into a validation and a test set.

In Section 4, additional experiments are conducted to identify why iterative back-translation helps.

Curriculum IBT

Then, an IBT variant is proposed that uses a curriculum. This means that the IBT training procedure described above now proceeds in multiple stages, where the monolingual data in every stage are of increasing difficulty. Only CFQ is used in these experiments, and the difficulty of an example is measured by its number of entities.
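A possible way to organize this (my own sketch; count_entities and ibt_epoch are assumed helpers, and whether the monolingual pool grows cumulatively per stage is my guess) is to sort the monolingual data by difficulty and grow the pool stage by stage:

    def curriculum_ibt(n2f, f2n, parallel, mono_nl, mono_fl,
                       n_stages=3, epochs_per_stage=5):
        mono_nl = sorted(mono_nl, key=count_entities)   # easy -> hard
        mono_fl = sorted(mono_fl, key=count_entities)
        for s in range(1, n_stages + 1):
            # stage s uses the easiest s/n_stages fraction of the monolingual data
            nl_pool = mono_nl[: s * len(mono_nl) // n_stages]
            fl_pool = mono_fl[: s * len(mono_fl) // n_stages]
            for _ in range(epochs_per_stage):
                ibt_epoch(n2f, f2n, parallel, nl_pool, fl_pool)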

The results show some improvement over the vanilla IBT procedure:

Compositional Generalization via Semantic Tagging

(Zheng and Lapata, 2020)

Paper: https://arxiv.org/pdf/2010.11818.pdf

In this paper, Zheng and Lapata propose a two-step approach that first (1) tags the sequence with logical form elements and (2) uses a seq2seq model to construct the logical form.

More formally, the conditional probability of output sequence y given input sequence x is decomposed as follows:
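If I read the setup correctly, this is the usual latent-variable factorization:

    p(y \mid x) \;=\; \sum_{z} p(y \mid x, z)\, p(z \mid x)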

where z are tags assigned to every element of the input sequence x.

The entire process is also illustrated in the following diagram:

(figure from paper)

The bottom part marked p(z|x) is the tagger that assigns some logical form tag (or null if none apply) to every token in the input. The tagger is implemented as a bidirectional LSTM model.

Then p(y|x, z) is modeled as a sequence-to-sequence model. Any seq2seq model can be used, and the authors experiment with LSTM-based and transformer-based ones. To use the generated tags, they simply embed them and concatenate them with the word embeddings.
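Feeding the tags into the seq2seq model could then look roughly like this (my own sketch in PyTorch; the embedding sizes and names are arbitrary, not from the paper):

    import torch
    import torch.nn as nn

    class TaggedEmbedding(nn.Module):
        def __init__(self, vocab_size, num_tags, word_dim=300, tag_dim=50):
            super().__init__()
            self.word_emb = nn.Embedding(vocab_size, word_dim)
            self.tag_emb = nn.Embedding(num_tags, tag_dim)

        def forward(self, word_ids, tag_ids):
            # (batch, seq_len, word_dim + tag_dim), fed to the seq2seq encoder
            return torch.cat([self.word_emb(word_ids), self.tag_emb(tag_ids)], dim=-1)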

Where things get more interesting is during training. Typical semantic parsing supervision only provides the output logical forms (and in weakly supervised settings just execution results 😲). However, here we need to train a tagging model, which models some discrete latent variables. Because we are not given alignment information, we can’t say with 100% certainty which word is actually semantically meaningful w.r.t. the query and what its meaning should be. The authors use both a rule-based procedure and a more agnostic procedure based on expectation maximization to learn the discrete latent variables modeled by the tagger.

Results

The approach was evaluated on the semantic parsing datasets GeoQuery and ATIS, on both the lambda-expression and SQL versions. The datasets were split according to the method of Finegan-Dollak et al. (2018), which ensures that the query templates in training and test are different.

The results show significant improvement:

The authors also evaluated on WikiSQL, showing some improvements:
