Thursday, July 01, 2010

Ugaritic as a test case for computer language-decipherment

TECHNOLOGY WATCH: Progress in computer decipherment of ancient languages, with Ugaritic as a test case.
Computer program deciphers a dead language that mystified linguists

(io9)

The lost language of Ugaritic was last spoken 3,500 years ago. It survives on just a few tablets, and linguists could only translate it with years of hard work and plenty of luck. A computer deciphered it in hours.

[...]
That last sentence, of course, is irresponsible hogwash. What the program did manage to do was (1) to identify almost perfectly (29 of 30) the letter correspondences between Ugaritic and Hebrew and (2) to identify the correct Hebrew cognate for about 60% of the Ugaritic words that have one. This is a remarkable and important achievement and it's a pity that the media coverage it has received so far is so sensationalist.

The limits of the system are that (1) a cognate language (Hebrew in this case) must already have been identified and (2) both languages need to be written in an alphabetic script.

Anyone who has worked with Ugaritic will know how far this is from a decipherment. Decipherment of Ugaritic is still at a fairly primitive level and is likely to remain that way unless we recover a lot more texts. (Rumor had it in my postgraduate days that a famous Semitist who shall remain nameless liked to say that he thought we were reading the tablets upside-down!) We understand Ugaritic as well as we do primarily because of all the cognates it shares with Hebrew, but a cognate word or root does no more than narrow the range of guesswork for understanding that word or root. Usage and context are also extremely important, and neither is addressed by this computer program. Indeed, the program doesn't do any translating at all. What it does do, remarkably efficiently, is to lay the groundwork for a human being to decipher an ancient language, potentially making the task much easier.
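
To make concrete what a table of letter correspondences does and does not provide, here is a toy sketch in Python. It is not the statistical model from the paper: the mapping is simply written in by hand, whereas the program has to learn it. The ASCII stand-ins (T for ṯ, S for š, X for ḫ, H for ḥ, c for ʿayin), the function names, and the four-word lexicon are all assumptions made for illustration.

```python
# Toy illustration only; NOT the Bayesian model described in the paper.
# ASCII stand-ins (an assumption of this sketch):
#   T = ṯ, S = š, X = ḫ, H = ḥ, c = ʿayin

# Hand-written letter correspondences, Ugaritic -> Hebrew (partial).
UGARITIC_TO_HEBREW = {
    "m": "m", "l": "l", "k": "k", "b": "b", "c": "c",
    "T": "S",  # Ugaritic ṯ corresponds to Hebrew š
    "X": "H",  # Ugaritic ḫ and ḥ both correspond to Hebrew ḥ
    "H": "H",
    "S": "S",
}

# Tiny hypothetical Hebrew lexicon: consonantal form -> rough gloss.
HEBREW_LEXICON = {
    "mlk": "king",
    "bcl": "lord, owner",
    "SlS": "three",
    "HmS": "five",
}


def candidate_cognate(ugaritic_word):
    """Map a word letter by letter; return None if any letter is unmapped."""
    try:
        return "".join(UGARITIC_TO_HEBREW[ch] for ch in ugaritic_word)
    except KeyError:
        return None


def lookup(ugaritic_word):
    """Return (Hebrew form, gloss); either may be None if nothing is found."""
    hebrew = candidate_cognate(ugaritic_word)
    return hebrew, HEBREW_LEXICON.get(hebrew)


if __name__ == "__main__":
    for word in ["mlk", "bcl", "TlT", "XmS"]:
        print(word, "->", lookup(word))
```

The output is a Hebrew form and a gloss for the Hebrew word, nothing more. Whether the Ugaritic word actually means that in a given line is exactly the question of usage and context that remains for the human reader.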

You can download a pdf file of the paper "A Statistical Model for Lost Language Decipherment," by Benjamin Snyder, Regina Barzilay, and Kevin Knight, by clicking on the link. This is the abstract:
In this paper we propose a method for the automatic decipherment of lost languages. Given a non-parallel corpus in a known related language, our model produces both alphabetic mappings and translations of words into their corresponding cognates. We employ a non-parametric Bayesian framework to simultaneously capture both low-level character mappings and high-level morphemic correspondences. This formulation enables us to encode some of the linguistic intuitions that have guided human decipherers. When applied to the ancient Semitic language Ugaritic, the model correctly maps 29 of 30 letters to their Hebrew counterparts, and deduces the correct Hebrew cognate for 60% of the Ugaritic words which have cognates in Hebrew.