Typical machine translation is done on the sentence level. This means that sentences are translated in isolation, without considering the flow of the surrounding context. From a human perspective, this is less than satisfactory: a natural text needs to be well connected, and the story it tells should be coherent and fluent. Any human translator will use plenty of connectives and referential structures to improve the readability of a translation, just as they would when writing an original text.
[Illustration: "He saw children playing with binoculars."]
Document-level (or “discourse-aware”) machine translation has been studied for some years already, albeit without a real breakthrough. Researchers such as Christian Hardmeier and Sharid Loáiciga have looked at various discourse phenomena and how they influence translation.
Current Incentives for Research
A dedicated series of workshops on the topic (DiscoMT) has been started and continues with a new iteration this year. In the early years, the focus was on referential pronouns, one of the most explicit phenomena that can be observed on the discourse level. The tricky part is that such pronouns need to agree with their antecedents, the items they refer to. Why is that a problem? If, for example, gender is marked differently in the two languages, then translating a pronoun in isolation, without looking back at the item it refers to, can easily produce the wrong form. Let’s look at the following examples:
| Input | The funeral of the Queen Mother will take place on Friday. It will be broadcast live. |
|---|---|
| (a) | Les funérailles de la reine-mère auront lieu vendredi. Elles seront retransmises en direct. |
| (b) | L’enterrement de la reine-mère aura lieu vendredi. Il sera retransmis en direct. |
| Input | Der Strafgerichtshof in Truro erfuhr, dass er seine Stieftochter Stephanie Randle regelmässig fesselte, als sie zwischen fünf und sieben Jahre alt war. |
|---|---|
| Reference translation | Truro Crown Court heard he regularly tied up his step-daughter Stephanie Randle, when she was aged between five and seven. |
| MT output | The Criminal Court in Truro was told it was his Stieftochter Stephanie Randle tied as they regularly between five and seven years. |
Interest in document-level models is growing, and various approaches have recently been proposed to extend the context considered in neural machine translation. A simple extension is to run a sliding window over the data and to concatenate a small number of consecutive sentences for translation. This can be done on the source side, on the target side, or on both. In Tiedemann & Scherrer (2017), we proposed such a strategy and showed that modern attention-based NMT is capable of learning connections between sentences from the extended context.
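To make the idea concrete, here is a minimal Python sketch of such a sliding-window concatenation on the source side. The function name and the `<BREAK>` separator token are illustrative choices for this example only, not the exact preprocessing used in Tiedemann & Scherrer (2017).

```python
# Minimal sketch: build sliding-window training examples where each source
# segment is the concatenation of the previous and the current sentence.
# The "<BREAK>" separator token is an assumption for this illustration.

def concat_source_sentences(documents, window=2, sep=" <BREAK> "):
    """documents: list of documents, each a list of source sentences.
    Returns one extended source segment per original sentence."""
    extended = []
    for doc in documents:
        for i in range(len(doc)):
            # Take up to `window` sentences ending at the current one;
            # the first sentence(s) of a document simply get less context.
            start = max(0, i - window + 1)
            extended.append(sep.join(doc[start:i + 1]))
    return extended


if __name__ == "__main__":
    doc = [["The funeral will take place on Friday.",
            "It will be broadcast live.",
            "Millions are expected to watch."]]
    for segment in concat_source_sentences(doc, window=2):
        print(segment)
```

In the 2+2 setting discussed below, the same windowing is applied to the target side as well, so the model learns to translate two-sentence blocks at once.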
Simple concatenation has its limits and other approaches propose some kind of knowledge aggregation that can be passed to the single-sentence translation engine. Hierarchical and selective attention mechanisms are used to aggregate relevant information for the translation of coherent text.
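The details differ from model to model, but the underlying idea can be sketched in a few lines of NumPy: context sentences are first summarized into vectors, an attention distribution selects the ones that are relevant for the current sentence, and the aggregated context vector is gated into the sentence representation. This is only a schematic illustration with random weights, not the actual architecture of any of the published models.

```python
import numpy as np

def softmax(x):
    x = x - x.max()
    e = np.exp(x)
    return e / e.sum()

def aggregate_context(current, context_sentences, W):
    """Schematic sentence-level attention over context summaries.
    current:           vector summarizing the sentence being translated, shape (d,)
    context_sentences: matrix of context sentence summaries, shape (n, d)
    W:                 bilinear scoring matrix, shape (d, d) -- random here
    Returns a single context-aware vector to be fed to the translation model."""
    scores = context_sentences @ (W @ current)    # one relevance score per context sentence
    weights = softmax(scores)                     # attention distribution over context sentences
    context_vector = weights @ context_sentences  # weighted sum of the summaries
    # A simple gate decides how much context to mix into the representation.
    gate = 1 / (1 + np.exp(-(current @ context_vector)))
    return gate * context_vector + (1 - gate) * current

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    d, n = 8, 3
    out = aggregate_context(rng.normal(size=d),
                            rng.normal(size=(n, d)),
                            rng.normal(size=(d, d)))
    print(out.shape)  # (8,)
```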
This year, WMT featured a task on document-level machine translation. Teams were encouraged to work on document-level models and were promised a dedicated evaluation to decide whether such models lead to improved coherence and translation quality. We participated in that task with short news documents translated from English to German. We tested concatenation models and hierarchical attention models and evaluated the results on a shuffled and a natural (coherent) test set. In the shuffled version, the sentences within each document are randomly reordered, which should confuse any document-level model. Below, we show the results of our concatenation models (in BLEU scores) on both versions of the same news test set:
| System | Shuffled | Coherent |
|---|---|---|
| Sentence-level MT | 38.96 | 38.96 |
| Concatenate 2 source sentences | 36.62 | 37.17 |
| Concatenate 3 source sentences | 34.14 | 34.39 |
| 2 source + 2 target sentences | 38.53 | 39.08 |
The first line represents the baseline without any document-level information (translation of single sentences). The second and third models extend the context on the input side (source). The last model translates blocks of 2 input sentences (source) into 2 output sentences (target). This is done in a sliding-window fashion, and we evaluate only the second sentence in each block (with a special treatment of the first sentence of a document).
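For completeness, the shuffled control condition is straightforward to create: the sentences of each document are randomly reordered before translation, so any cross-sentence context the models see becomes uninformative. A minimal sketch follows; the function name is ours, and the commented scoring lines assume the sacrebleu package is available.

```python
import random

def shuffle_within_documents(documents, seed=42):
    """documents: list of documents, each a list of sentences.
    Returns a copy in which the sentence order of every document is shuffled,
    destroying discourse coherence while keeping the same set of sentences."""
    rng = random.Random(seed)
    shuffled = []
    for doc in documents:
        doc = list(doc)  # copy so the original order is preserved
        rng.shuffle(doc)
        shuffled.append(doc)
    return shuffled

# Scoring then proceeds as usual, e.g. with sacrebleu (if installed):
#   import sacrebleu
#   print(sacrebleu.corpus_bleu(hypotheses, [references]).score)
```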
What can we learn from this?
The positive side of the experiment is that all document-level models seem to pick up valuable information, which shows in the improved scores on the coherent test set compared with the shuffled one. However, none of the document-level models is very successful compared to the single-sentence baseline. The only, but insignificant, improvement was achieved with the 2+2 model. This is certainly disappointing, and a trend that we have seen in most of the work on document-level machine translation. We also omit the results of the hierarchical attention models here, as they compare even worse.
First of all, document-level machine translation is difficult and requires further attention. Secondly, standard evaluation metrics such as BLEU are not well suited for measuring discourse-level improvements, as has already been pointed out in much of the previous literature. It will be interesting to see the outcome of the manual evaluation at WMT in August, as well as the new developments that will be presented at DiscoMT and related events in the future.
References
Christian Hardmeier: Discourse in Statistical Machine Translation. PhD thesis, Uppsala University, 2014.
Sameen Maruf, André F. T. Martins and Gholamreza Haffari: Selective Attention for Context-aware Neural Machine Translation. NAACL-HLT 2019.
Lesly Miculicich, Dhananjay Ram, Nikolaos Pappas and James Henderson: Document-Level Neural Machine Translation with Hierarchical Attention Networks. EMNLP 2018.
Jörg Tiedemann and Yves Scherrer: Neural Machine Translation with Extended Context. DiscoMT 2017.