Statistical Machine Translation of Languages in Artificial Intelligence

  • Last Updated : 28 Feb, 2022

Given how difficult translation can be, it should come as no surprise that the most effective machine translation systems are built by training a probabilistic model on statistics acquired from a vast corpus of text. This approach does not need a complicated ontology of interlingua concepts, handcrafted grammars for the source and target languages, or a hand-labeled treebank. All it needs is data: example translations from which a translation model can be learned. We determine the string of words f^{*} that maximizes

f^{*}=\underset{f}{\operatorname{argmax}} P(f \mid e)=\underset{f}{\operatorname{argmax}} P(e \mid f) P(f)

to translate a sentence from English (e) into French (f).

The target language model for French is P(f), which says how likely a given sentence is to occur in French. The translation model, P(e \mid f), says how likely an English sentence e is to be a translation of a given French sentence. Similarly, P(f \mid e) is an English-to-French translation model.
Should we work on P(f \mid e) directly, or should we apply Bayes’ rule and work on P(e \mid f) P(f)? In diagnostic applications like medicine, it is easier to model the domain in the causal direction: P(\text { symptoms } \mid \text { disease }) rather than P(\text { disease } \mid \text { symptoms }). In translation, however, both directions are equally easy to model. The earliest researchers in statistical machine translation used Bayes’ rule, in part because they had a good language model, P(f), and wanted to make use of it, and in part because they came from a background in speech recognition, which is a diagnostic problem. We follow their lead here, although we note that recent work in statistical machine translation often optimizes P(f \mid e) directly, using a more elaborate model that incorporates many of the language model’s properties.
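As a toy illustration of the Bayes-rule decision rule above, the snippet below scores two hypothetical French candidates with invented P(e \mid f) and P(f) tables and returns the argmax. A real system would search over a vast candidate space rather than two hand-entered strings.

```python
# Toy noisy-channel decoder: pick the French candidate f maximizing
# P(e | f) * P(f). Both tables are invented for illustration only.
translation_model = {      # P(e | f): how likely e is as a translation of f
    "le chien dort": 0.30,
    "le chat dort": 0.05,
}
language_model = {         # P(f): how fluent each French candidate is
    "le chien dort": 0.010,
    "le chat dort": 0.012,
}

def decode(candidates):
    """Return argmax over f of P(e | f) * P(f)."""
    return max(candidates, key=lambda f: translation_model[f] * language_model[f])

print(decode(list(translation_model)))  # le chien dort
```

Note that the higher translation probability of “le chien dort” outweighs the slightly better language-model score of the other candidate.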

The language model, P(f), could address any level(s) on the right-hand side of the figure above, but the simplest and most common technique, as we have seen before, is to build an n-gram model from a French corpus. This captures only a partial, local sense of French sentences, but that is often sufficient for rough translation.
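A minimal sketch of such an n-gram model, here a bigram model trained on an invented three-sentence “corpus” (real systems use far larger corpora and smoothing for unseen bigrams):

```python
from collections import Counter

# Count bigrams and their left contexts over a tiny invented French corpus.
corpus = [
    "il y a un wumpus",
    "il y a un chien",
    "le wumpus dort",
]

unigrams, bigrams = Counter(), Counter()
for sentence in corpus:
    words = ["<s>"] + sentence.split() + ["</s>"]
    unigrams.update(words[:-1])               # contexts (everything but </s>)
    bigrams.update(zip(words[:-1], words[1:]))

def sentence_prob(sentence):
    """P(f) as a product of bigram probabilities; unseen bigrams give 0."""
    words = ["<s>"] + sentence.split() + ["</s>"]
    p = 1.0
    for w1, w2 in zip(words[:-1], words[1:]):
        p *= bigrams[(w1, w2)] / unigrams[w1]
    return p

print(sentence_prob("il y a un wumpus"))  # 1/6: a common prefix, then two 50/50 choices
```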

A bilingual corpus

A collection of parallel texts, each an English/French pair, is used to train the translation model. If we had an infinitely large corpus, translating a sentence would just be a lookup task: we would have seen the English sentence in the corpus before, so we could simply return the corresponding French sentence. But our resources are finite, and most of the sentences we are asked to translate will be novel. However, they will be composed of phrases we have seen before (even if some phrases are as short as one word). For example, “in this exercise we will,” “size of the state space,” “as a function of,” and “notes at the end of the chapter” are all common phrases in this book. We should be able to break the novel sentence “In this exercise we will compute the size of the state space as a function of the number of actions.” into phrases, find each phrase in the English side of the corpus, look up the corresponding French phrase in the French translation, and then reassemble the French phrases into an order that makes sense in French. In other words, given an English sentence e, finding a French translation f is a three-step process:

  1. Divide the English sentence into phrases e_{1}, \ldots, e_{n}.
  2. Choose a French phrase f_{i} for each phrase e_{i}. We use the notation P\left(f_{i} \mid e_{i}\right) for the phrasal probability that f_{i} is a translation of e_{i}.
  3. Choose a permutation of the phrases f_{1}, \ldots, f_{n}. We specify this permutation in a style that seems complicated but is designed to have a simple probability distribution: for each f_{i} we choose a distortion d_{i}, which is the number of words that phrase f_{i} has moved with respect to f_{i-1}; positive for moving to the right, negative for moving to the left, and zero if f_{i} immediately follows f_{i-1}.
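The first two steps above can be sketched as follows, using a hypothetical phrase table whose entries and probabilities are invented (segmentation is assumed already done; step 3, the permutation, is treated separately below):

```python
# Steps 1-2: given a segmentation of the English sentence into known phrases,
# pick the most probable French phrase for each, independently.
phrase_table = {                      # P(f_i | e_i), invented numbers
    "there is": [("il y a", 0.8), ("il existe", 0.2)],
    "a wumpus": [("un wumpus", 0.9)],
    "sleeping": [("qui dort", 0.7), ("endormi", 0.3)],
}

def translate_phrases(english_phrases):
    """Choose argmax_f P(f_i | e_i) independently for each phrase."""
    return [max(phrase_table[e], key=lambda pair: pair[1])[0]
            for e in english_phrases]

french = translate_phrases(["there is", "a wumpus", "sleeping"])
print(" ".join(french))  # il y a un wumpus qui dort
```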

The figure above shows an example of this process. The sentence “There is a stinky wumpus sleeping in 2 2” is broken into five phrases, e_{1}, \ldots, e_{5}. Each is translated into a phrase f_{i}, and these are then permuted into the order f_{1}, f_{3}, f_{4}, f_{2}, f_{5}. The permutation is defined by

d_{i}=\operatorname{START}\left(f_{i}\right)-\operatorname{END}\left(f_{i-1}\right)-1

where \operatorname{START}\left(f_{i}\right) is the ordinal number of the first word of phrase f_{i} in the French sentence and \operatorname{END}\left(f_{i-1}\right) is the ordinal number of the last word of phrase f_{i-1}. The figure above shows that f_{5}, “à 2 2,” comes right after f_{4}, “qui dort,” hence d_{5}=0. Phrase f_{2} has shifted one word to the right of f_{1}, so d_{2}=1. As a special case we have d_{1}=0, because f_{1} begins at position 1 and \operatorname{END}\left(f_{0}\right) is defined to be 0 (even though f_{0} does not exist).
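The distortion formula can be checked mechanically. The function below applies d_{i}=\operatorname{START}\left(f_{i}\right)-\operatorname{END}\left(f_{i-1}\right)-1 to a list of hypothetical phrase spans (1-based word positions), including the \operatorname{END}\left(f_{0}\right)=0 special case:

```python
# Compute d_i = START(f_i) - END(f_{i-1}) - 1 from phrase positions.
# A phrase that starts the sentence, or directly follows its predecessor,
# gets d = 0. The spans below are hypothetical.
def distortions(spans):
    """spans: (start, end) word positions of f_1..f_n in the French sentence."""
    ds, prev_end = [], 0              # END(f_0) = 0 by definition
    for start, end in spans:
        ds.append(start - prev_end - 1)
        prev_end = end
    return ds

# f_1 = words 1-3; f_2 = word 5 (skipped one word right); f_3 = word 4 (moved left)
print(distortions([(1, 3), (5, 5), (4, 4)]))  # [0, 1, -2]
```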

Now that we have defined the distortion d_{i}, we can define the probability distribution over distortions, \mathbf{P}\left(d_{i}\right). Because we have \left|d_{i}\right| \leq n for sentences of length n, the full distribution \mathbf{P}\left(d_{i}\right) has only 2n + 1 elements, far fewer numbers to learn than the number of permutations, n!. That is why the permutation was defined in this roundabout way. Of course, this is a fairly crude model of distortion. It does not say that when translating from English to French, adjectives usually end up after the noun; that fact is captured in the French language model, P(f). The distortion probability is completely independent of the words in the phrases, depending only on the integer value d_{i}; the distribution merely summarizes the overall volatility of the permutation, for example, how common a distortion of d = +2 is compared with d = 0.
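The size advantage of this parameterization is easy to verify numerically:

```python
import math

# A distortion model for a sentence of n phrases stores one probability per
# value in {-n, ..., +n}, i.e. 2n + 1 numbers; a distribution over full
# permutations would need n! numbers.
for n in (5, 10, 20):
    print(f"n={n}: distortion table {2 * n + 1}, permutations {math.factorial(n)}")
```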

Now it’s time to put it all together: the probability that the sequence of phrases f with distortions d is a translation of the sequence of phrases e can be written P(f, d \mid e). We assume that each phrase translation and each distortion is independent of the others, and so the expression factors as

P(f, d \mid e)=\prod_{i} P\left(f_{i} \mid e_{i}\right) P\left(d_{i}\right)

This allows us to compute the probability P(f, d \mid e) for a given candidate translation f and distortion d. But to find the best f and d we cannot just enumerate sentences; with roughly 100 French phrases for each English phrase in the corpus, there are 100^{5} different 5-phrase translations, and 5! reorderings for each of them. We will have to search for a good solution. A local beam search with a heuristic that estimates probability has proved effective at finding a nearly-most-probable translation.
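Scoring a single candidate under this factored model is just a product of table lookups; all numbers below are invented for illustration, and a real decoder would evaluate this inside a beam search rather than enumerating candidates:

```python
# Score one candidate under P(f, d | e) = prod_i P(f_i | e_i) * P(d_i).
phrase_probs = [0.8, 0.9, 0.7]               # P(f_i | e_i) for three phrases
distortion_probs = {0: 0.6, 1: 0.2, -1: 0.2}  # P(d), invented

def score(phrase_probs, ds):
    """Product of phrasal and distortion probabilities, one factor per phrase."""
    p = 1.0
    for pp, d in zip(phrase_probs, ds):
        p *= pp * distortion_probs[d]
    return p

print(score(phrase_probs, [0, 1, -1]))  # 0.8*0.6 * 0.9*0.2 * 0.7*0.2
```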

Only one thing remains: learning the phrasal and distortion probabilities. We sketch the procedure here; for further details, see the notes at the end of the chapter.

  1. Find parallel texts: To begin, gather a bilingual parallel corpus. A Hansard, for example, is a record of parliamentary debate. Bilingual Hansards are produced in Canada, Hong Kong, and other jurisdictions, the European Union publishes its official documents in 11 languages, and the United Nations publishes multilingual documents. Bilingual text is also available online; some Web sites publish parallel content with parallel URLs, such as /en/ for the English page and /fr/ for the French page. Leading statistical translation systems train on hundreds of millions of words of parallel text and billions of words of monolingual text.
  2. Segment into sentences: Because the unit of translation is a sentence, we must divide the corpus into sentences. Periods are strong markers of the end of a sentence, yet in the line “Dr. J. R. Smith of Rodeo Dr. paid $29.99 on September 9, 2009,” only the final period ends a sentence. One way to decide whether a period ends a sentence is to train a model that takes the surrounding words and their parts of speech as features. This approach achieves about 98 percent accuracy.
  3. Align sentences: For each sentence in the English version, determine which sentence(s) in the French version it corresponds to. Usually, the next English sentence corresponds to the next French sentence in a 1:1 match, but sometimes there is variation: a sentence in one language may be split in two, producing a 2:1 match, or the order of two sentences may be swapped, producing a 2:2 match. By looking at the sentence lengths alone (i.e., short sentences should align with short sentences), it is possible to align them (1:1, 1:2, 2:2, etc.) with accuracy in the 90 to 99 percent range, using a variation of the Viterbi algorithm. Even better alignment can be achieved by using landmarks common to both languages, such as numbers, dates, proper names, or words with an unambiguous translation from a bilingual dictionary. For example, if the third English and fourth French sentences both contain the string “1989” while the neighboring sentences do not, that is good evidence the two should be aligned together.
  4. Align phrases within a sentence: A procedure similar to sentence alignment can be used to align phrases within a sentence, but iterative improvement is required. When we start, we have no way of knowing that “qui dort” aligns with “sleeping,” but we can arrive at that alignment by accumulating evidence: across all our example sentences, “qui dort” co-occurs frequently with “sleeping,” and no other phrase in the pair of aligned sentences co-occurs as frequently with “sleeping” in other sentences. A complete phrase alignment over our corpus gives us the phrasal probabilities (after appropriate smoothing).
  5. Define distortion probability: Once we have a phrase alignment, we can define distortion probabilities: simply count how often each distortion distance d=0, \pm 1, \pm 2, \ldots occurs in the corpus, and apply smoothing.
  6. Use EM to improve estimates: Use expectation–maximization to improve the estimates of the P(f \mid e) and P(d) values. In the E step we compute the best alignments with the current parameter values; in the M step we update the estimates; the procedure iterates until convergence.
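Step 3 above can be sketched as a toy length-based aligner: a dynamic program over 1:1, 2:1, and 1:2 matches that simply minimizes the character-length mismatch of each aligned group. This is a heavy simplification of real length-based methods, and the sentence lengths below are invented.

```python
# Toy length-based sentence aligner: dynamic programming over 1:1, 2:1,
# and 1:2 beads, scoring a bead by |total English length - total French length|.
def align(en_lens, fr_lens):
    INF = float("inf")
    n, m = len(en_lens), len(fr_lens)
    cost = [[INF] * (m + 1) for _ in range(n + 1)]
    back = [[None] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0
    for i in range(n + 1):            # every move increases i, so this order is safe
        for j in range(m + 1):
            if cost[i][j] == INF:
                continue
            moves = []
            if i < n and j < m:       # 1:1
                moves.append((i + 1, j + 1, abs(en_lens[i] - fr_lens[j])))
            if i + 1 < n and j < m:   # 2:1 (two English sentences -> one French)
                moves.append((i + 2, j + 1, abs(en_lens[i] + en_lens[i + 1] - fr_lens[j])))
            if i < n and j + 1 < m:   # 1:2 (one English sentence -> two French)
                moves.append((i + 1, j + 2, abs(en_lens[i] - fr_lens[j] - fr_lens[j + 1])))
            for ni, nj, c in moves:
                if cost[i][j] + c < cost[ni][nj]:
                    cost[ni][nj] = cost[i][j] + c
                    back[ni][nj] = (i, j)
    # Walk the back-pointers to recover (english_count, french_count) beads.
    beads, i, j = [], n, m
    while (i, j) != (0, 0):
        pi, pj = back[i][j]
        beads.append((i - pi, j - pj))
        i, j = pi, pj
    return beads[::-1]

print(align([10, 12, 25], [11, 13, 12, 14]))  # [(1, 1), (1, 1), (1, 2)]
```

Real aligners replace the absolute-difference cost with a probabilistic one and add 2:2 and 0:1 beads, but the dynamic-programming skeleton is the same.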
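Step 5 above is essentially counting plus smoothing. The sketch below estimates P(d) from an invented list of observed distortions, using add-one smoothing over an assumed range of distances:

```python
from collections import Counter

# Estimate P(d) by counting each observed distortion distance, then applying
# add-one smoothing over the range -max_d..+max_d. The observations are invented.
observed = [0, 0, 0, 1, 0, -1, 2, 0, 1, 0]
max_d = 3
counts = Counter(observed)
total = len(observed) + (2 * max_d + 1)    # add one pseudo-count per outcome

P = {d: (counts[d] + 1) / total for d in range(-max_d, max_d + 1)}
print(P[0], P[3])  # frequent d=0 dominates; unseen d=3 still gets some mass
```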
