Why Google’s English Translation is Better
October 30, 2008
(This is an English version of my original Hebrew post that appeared here).
Google Translate was launched for 11 additional languages in September, including Hebrew. If you play with it a little, you will soon notice that translating from Hebrew to English yields much (much, much) better results than attempting the reverse direction. If you read some Hebrew, there’s a pretty typical example here, but you can produce any number of examples simply by trying to translate virtually anything.
So what’s going on? Why is the translation into English legible, even usable, while the translation into Hebrew amounts to complete and total rubbish? (This is especially surprising because the Hebrew writing system is problematic and ambiguous, so intuitively I would expect translating from Hebrew to be the harder direction.) Here’s what I think the answer is.
Google’s translator “learns” to translate using two kinds of sources. The first is a pool of translated texts, that is, texts that were written in one language and translated into the other. The second is a pool of texts in the target language.
The translations pool is used like a bilingual dictionary, only better. Just as you would look up a word in the dictionary in order to translate it, you can search for the word in the pool of translations and see how it was translated before. Plus, you have the advantage of being able to use the context to choose the best translation for the current case.
For example, in Hebrew one word is usually used for “search”, “look for” and “seek” (לחפש – lehapes). If you had to translate “lehapes” from Hebrew to English using a dictionary, you would find all these options. But if you search for “lehapes be-Google” in the translations pool, you would find that “search” is the word used in this context.
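To make the “dictionary, only better” idea concrete, here is a toy sketch in Python. Everything in it is invented for illustration: the tiny pool, the transliterated phrases, and the `translate_in_context` function are assumptions of mine, and a real system would work with millions of aligned sentences and proper statistics rather than simple counting.

```python
from collections import Counter

# Toy "translations pool": (source phrase, English translation) pairs.
# All entries are made up for illustration.
TRANSLATION_POOL = [
    ("lehapes be-Google", "search on Google"),
    ("lehapes be-Google", "search Google"),
    ("lehapes avoda", "look for a job"),
    ("lehapes avoda", "look for work"),
    ("lehapes et ha-emet", "seek the truth"),
]

# What a plain dictionary would offer for "lehapes".
CANDIDATES = ["search", "look for", "seek"]


def translate_in_context(word, context_word):
    """Pick the candidate most often used when `word` appears with `context_word`."""
    counts = Counter()
    for source, target in TRANSLATION_POOL:
        tokens = source.split()
        if word in tokens and context_word in tokens:
            for candidate in CANDIDATES:
                if target.startswith(candidate):
                    counts[candidate] += 1
    # Return None when the pool has no evidence for this context.
    return counts.most_common(1)[0][0] if counts else None


print(translate_in_context("lehapes", "be-Google"))
print(translate_in_context("lehapes", "avoda"))
```

With this pool, the first call picks “search” and the second picks “look for”, which is exactly the kind of choice a dictionary alone cannot make.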
The target language pool has two functions: one is to help in selecting the correct translation, and the other is to assist in constructing a reasonable target language sentence: changing word order, matching the gender and number of the subject and the verb, etc. Basically, the idea is to translate the source text in any conceivable way, and test which translation best matches what we see in the target language pool. For example, if we come up with “generality of elections” and “general elections”, we can assume that in most cases the second would be better.
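The “test which translation best matches the target language pool” step can be sketched with simple word-pair counts. This is only a rough illustration under my own assumptions: the three-sentence pool is invented, `fluency_score` and `pick_best` are hypothetical names, and counting adjacent word pairs is a crude stand-in for whatever Google actually does.

```python
from collections import Counter

# Toy target-language pool (invented sentences, lowercased for simplicity).
TARGET_POOL = [
    "the general elections will be held in february",
    "turnout in the general elections was high",
    "the generality of the claim is doubtful",
]


def bigram_counts(sentences):
    """Count every pair of adjacent words in the pool."""
    counts = Counter()
    for sentence in sentences:
        words = sentence.split()
        counts.update(zip(words, words[1:]))
    return counts


BIGRAMS = bigram_counts(TARGET_POOL)


def fluency_score(candidate):
    """Sum how often each adjacent word pair of the candidate was seen in the pool."""
    words = candidate.split()
    return sum(BIGRAMS[pair] for pair in zip(words, words[1:]))


def pick_best(candidates):
    """Choose the candidate translation that looks most like the pool."""
    return max(candidates, key=fluency_score)


print(pick_best(["generality of elections", "general elections"]))
```

Here “general elections” wins because that word pair appears twice in the pool, while “of elections” never appears at all. Note that a raw sum like this favors longer candidates, so a real scorer would need some form of normalization.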
Translations are hard to find, and translation pools are hard to build. The number of texts that were translated between any two languages is much, much smaller than the number of texts written in either of these languages. We prefer to avoid archaic language, and we prefer texts similar to the ones the computer will later have to translate. So in software designed to translate web sites, we don’t really want to use Bible translations, or even Harry Potter. In short, this is a challenge, even for an information giant like Google.
It turns out, though, that if you have a really good pool of target language texts, it can compensate for a small translations pool. I read somewhere about the following experiment: people were asked to evaluate some (human-made, in that case) translations. There were two groups of evaluators: bilinguals, who evaluated the translations having also read the source; and monolinguals, who were asked to evaluate the translation quality based on reading the translation only. The monolinguals’ evaluations, it turned out, were very similar to those of the bilinguals. (My scientific conscience troubles me about not giving a citation for this, but not enough for me to start rummaging through papers and files. Do ask though, if you need it).
What we learn here is that a lot of a translation’s quality has to do with how much sense the product makes in the target language. A good target language pool and a good way of using it to test our translations may improve translation quality dramatically.
And that, I guess, is where the difference between translation into Hebrew and into English lies. Google’s pool of Hebrew texts is presumably far smaller than its English one. Moreover, I think it’s safe to assume that the way they use it to verify translation correctness isn’t quite as sophisticated. In other words, in the limited sense that software “knows” a language, Google’s software “knows” far less Hebrew than it does English. Which is not really surprising, after all.