Automatic Transliteration of Hieroglyphs
Hey folks! We quietly launched a new service last week and haven’t written about it yet. We’re going to make up for that now.
Do you know the famous article “Automated Transliteration of Late Egyptian Using Neural Networks: An Experiment in ‘Deep Learning’” by Serge Rosmorduc, published in the journal Lingua Aegyptia? In it, a neural network automatically generates a transliteration from a hieroglyphic text. The software is available here. Great stuff! Play around with it! This inspired us, of course! We wondered if we could create a web service for automatic transliteration without relying on a neural network. This blog post describes our work.
Transliteration
What is transliteration? Transliteration renders the phonetic values of Egyptian hieroglyphs in Latin letters. Recently, there have been attempts to create a unified transcription system. Read our review! It is important to note that information is lost in the process of transliteration. Determinatives, for example, are not rendered in transliteration because they have no phonetic value. In other words, you cannot reconstruct the original from the transliteration. The transliteration ꜥnḫ could stand for 𓋹 or for 𓋹𓏤𓆰𓏥 or for 𓋹𓀁 or for 𓋹𓈖𓐍𓏛 or for 𓋹𓈖𓐍.
Let’s stay with the last example, 𓋹𓈖𓐍. This string consists of three hieroglyphs, each with its own phonetic value: 𓋹 has the phonetic value ꜥnḫ, 𓈖 has the phonetic value n, and 𓐍 has the phonetic value ḫ. However, the transliteration is not simply the phonetic value of the first hieroglyph + the phonetic value of the second hieroglyph + the phonetic value of the third hieroglyph, which would give ꜥnḫnḫ. The last two hieroglyphs act as complements. They are reading aids for the first hieroglyph and are not transliterated again. Such reading aids are especially important when a hieroglyph has multiple phonetic values. The string 𓂧𓍯𓇼 is transliterated as dwꜣ. The complements 𓂧 and 𓍯 indicate that 𓇼 here has the phonetic value dwꜣ and not, say, sbꜣ.
As a consequence, it is not possible to simply work through a large list of hieroglyphs and their phonetic values. Instead, you need to capture the string as a whole. This also has the advantage of being able to capture the so-called honorific transposition. The group 𓊹𓍛 has the transliteration ḥm-nṯr, although the first sign has the phonetic value nṯr and the second sign has the phonetic value ḥm. Thus, the order of the hieroglyphs does not correspond to the order of the transliteration. This often happens with the names of deities, which is why it is called honorific transposition.
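
To make the difference between reading sign by sign and reading the word as a whole concrete, here is a small Python sketch. It is only an illustration and not the service's actual code: the tables `SIGN_VALUES` and `WORD_VALUES` and both function names are made up for this example and contain nothing but the values mentioned above.

```python
# Phonetic values of single signs, taken from the examples in the text.
SIGN_VALUES = {
    "𓋹": "ꜥnḫ",
    "𓈖": "n",
    "𓐍": "ḫ",
    "𓊹": "nṯr",
    "𓍛": "ḥm",
}

# Whole word forms with their transliteration (hypothetical mini-lexicon).
WORD_VALUES = {
    "𓋹𓈖𓐍": "ꜥnḫ",
    "𓊹𓍛": "ḥm-nṯr",
}


def naive_transliteration(spelling: str) -> str:
    """Concatenate the phonetic value of every sign, in written order."""
    return "".join(SIGN_VALUES[sign] for sign in spelling)


def word_transliteration(spelling: str) -> str:
    """Look up the spelling as a whole word form."""
    return WORD_VALUES[spelling]


for spelling in WORD_VALUES:
    print(spelling, naive_transliteration(spelling), word_transliteration(spelling))
```

The naive reading repeats the complements (ꜥnḫnḫ instead of ꜥnḫ) and ignores the honorific transposition (nṯrḥm instead of ḥm-nṯr), while the whole-word lookup gets both right.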
Our approach
From these preliminary considerations, we drew the following conclusions. A rule-based approach seems too complex to us. Instead, we take a lexicon-based approach whose basis is not the individual signs but the word forms. This avoids the problems of complements and honorific transpositions. It also means that we split automatic transliteration, which is a single unit in Rosmorduc’s approach, into two modules. First, an Egyptian text must be tokenized. Then, the individual tokens can be transliterated automatically. In other words, our automatic transliteration needs a tokenized text as input. Instead of “𓐍𓅱𓆑𓅱𓊹𓍛𓅓𓂋𓉐𓏤𓇓𓏏𓈖𓃂𓇓𓏏𓂋𓐍𓊪𓏏𓎛𓅢𓄤𓆑𓂋” we need “𓐍𓅱𓆑𓅱𓊹𓍛 𓅓𓂋𓉐𓏤 𓇓𓏏𓈖𓃂 𓇓𓏏𓂋𓐍 𓊪𓏏𓎛𓅢𓄤𓆑𓂋”, with the words separated by a space. Of the two sub-problems, tokenization and transliteration, only the second is currently addressed by our service. The hieroglyphs themselves are encoded in Unicode and not, as in Rosmorduc, in MdC (Manuel de Codage). See our blog post that explains the reasons for this.
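
The two-module idea can be sketched as follows. This is a minimal illustration under stated assumptions, not the service's actual code: `split_words`, `transliterate_tokens` and the tiny `LEXICON` are hypothetical names, and the lexicon only contains examples from this post.

```python
from typing import Dict, List, Optional


def split_words(text: str) -> List[str]:
    """Split a text whose words are ALREADY separated by spaces.

    This does not solve tokenization; it only reads the spaces
    that a human (or another tool) has put in.
    """
    return text.split()


def transliterate_tokens(
    tokens: List[str], lexicon: Dict[str, str]
) -> List[Optional[str]]:
    """Look up each token as a whole; None marks a spelling missing from the lexicon."""
    return [lexicon.get(token) for token in tokens]


# Hypothetical mini-lexicon built from the examples in this post.
LEXICON = {"𓋹𓈖𓐍": "ꜥnḫ", "𓊹𓍛": "ḥm-nṯr", "𓂧𓍯𓇼": "dwꜣ"}

tokens = split_words("𓊹𓍛 𓋹𓈖𓐍 𓂧𓍯𓇼")
print(transliterate_tokens(tokens, LEXICON))

# Without spaces the whole text is one unknown "word".
print(transliterate_tokens(split_words("𓊹𓍛𓋹𓈖𓐍𓂧𓍯𓇼"), LEXICON))
```

The second call shows why the tokenized input matters: without spaces, the lookup sees a single long string that is not a word form at all.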
Lexicon-based approach
For a lexicon-based approach, one needs a mapping from hieroglyphic spellings to their transliterations. Our mapping is based on our corpus data, which provides the transliteration of each token and, to a large extent, its hieroglyphic spelling. The corpus is not small, so the mapping contains more than 40,000 different entries. Even so, it does not cover every hieroglyphic spelling. The complements alone allow for a very large number of possible combinations. Our mapping captures 𓋹𓈖𓐍, but not the possible spelling 𓂝𓈖𓐍𓋹. This is the weakness of any lexicon-based approach: what is not included in the mapping cannot be processed. Our solution is to compare a spelling that is not in the mapping with all existing spellings. The most similar spelling is then used for automatic transliteration.
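
Here is a rough sketch of this lookup with fallback. Please read it as an illustration under assumptions, not as our implementation: the names `similarity`, `lookup` and `MAPPING` are hypothetical, the mapping holds only three example entries, and the similarity measure is Python's standard `difflib.SequenceMatcher`, chosen as a stand-in for this sketch only.

```python
from difflib import SequenceMatcher
from typing import Dict, Tuple


def similarity(a: str, b: str) -> float:
    """Stand-in similarity between 0 and 1 (an assumption of this sketch)."""
    return SequenceMatcher(None, a, b).ratio()


def lookup(spelling: str, mapping: Dict[str, str]) -> Tuple[str, bool]:
    """Return (transliteration, exact).

    exact is True for a direct hit and False when the transliteration
    was borrowed from the most similar known spelling.
    """
    if spelling in mapping:
        return mapping[spelling], True
    closest = max(mapping, key=lambda known: similarity(spelling, known))
    return mapping[closest], False


# Hypothetical mini-mapping; the real one has more than 40,000 entries.
MAPPING = {"𓋹𓈖𓐍": "ꜥnḫ", "𓊹𓍛": "ḥm-nṯr", "𓂧𓍯𓇼": "dwꜣ"}

print(lookup("𓋹𓈖𓐍", MAPPING))    # spelling is in the mapping
print(lookup("𓂝𓈖𓐍𓋹", MAPPING))  # spelling is missing, closest entry is used
```

Returning the `exact` flag alongside the transliteration keeps track of which results are direct hits and which are only guesses, so that guesses can be treated differently later on.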
Levenshtein and other distance measures
So how do you determine the most similar spelling? One popular measure is the Levenshtein distance. It counts the minimum number of single-character edits (insertions, deletions, substitutions) needed to get from one string to another. But Levenshtein does not give good results for hieroglyphic spellings. Coralie Collignon describes exactly this kind of weakness in her very readable article on string distances:
For us, humans, the “beauties”/”beautiful” pair is much more similar than the “foo”/”bar” pair. But the Levenshtein distance is the same. (https://medium.com/@appaloosastore/string-similarity-algorithms-compared-3f7b4d12f0ff)
Instead, we use a different measure, namely the one provided by the JavaScript library String Score, and we are quite satisfied with the results.
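
If you want to check the quoted example yourself, here is a small, self-contained Levenshtein implementation in Python (the textbook dynamic-programming version, written for this post; the function name is our own choice). It is only meant to reproduce the point from the quote and says nothing about how String Score works internally.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of insertions, deletions and substitutions turning a into b."""
    # previous[j] holds the distance between the processed prefix of a and b[:j].
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(
                min(
                    previous[j] + 1,         # delete char_a
                    current[j - 1] + 1,      # insert char_b
                    previous[j - 1] + cost,  # substitute, or keep a match
                )
            )
        previous = current
    return previous[-1]


print(levenshtein("beauties", "beautiful"))
print(levenshtein("foo", "bar"))
```

Both calls give the same distance, although the first pair shares six leading letters and the second pair shares none.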
Evaluation of the distance
Okay, so there are two cases: either a hieroglyphic spelling is in our mapping or it is not. We color the result only in the second case, i.e. exactly when the transliteration of a hieroglyphic spelling was determined by a similarity measure. We evaluate the generated transliteration by considering the phonetic values of the hieroglyphs. Suppose the hieroglyphic spelling contains a 𓆑. Then the transliteration should also contain an “f”. Understandable evaluation, isn’t it? If the evaluation is successful, the transliteration is shown in green. If not, the transliteration of the word is shown in red.
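
In code, such a check could look roughly like this. It is a sketch under assumptions: the table `UNILITERALS` and the function `evaluate` are hypothetical, the table is limited to three signs mentioned in this post, and restricting the check to one-consonant signs is a simplification made for this example.

```python
# Signs with a single consonant value, limited to signs named in this post.
UNILITERALS = {"𓆑": "f", "𓈖": "n", "𓐍": "ḫ"}


def evaluate(spelling: str, transliteration: str) -> str:
    """Return "green" if every known one-consonant sign is reflected, else "red"."""
    for sign in spelling:
        value = UNILITERALS.get(sign)
        # Signs outside the table (e.g. determinatives) are simply skipped.
        if value is not None and value not in transliteration:
            return "red"
    return "green"


# The unknown spelling from above, transliterated via its closest match.
print(evaluate("𓂝𓈖𓐍𓋹", "ꜥnḫ"))
# A made-up spelling containing 𓆑 whose guessed transliteration lacks an "f".
print(evaluate("𓋹𓆑", "ꜥnḫ"))
```

The first guess passes because both n and ḫ show up in the transliteration; the second fails because the 𓆑 in the spelling has no counterpart.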
Service
Okay, enough said! Have a look at our service at https://oraec.github.io/corpus/hiero_to_transliteration.html and try it out. You can enter your hieroglyphs in the upper box. Remember to separate the words with a space. Suffixes are also considered independent words and must be separated. Then click the button and the site will provide the transliteration. The quality of the service is quite good, we think. But of course it can be improved.
New pairs
Our automatic transliteration stands and falls with the mapping. The more hieroglyphic spellings with their transliterations the mapping contains, the better the quality of our service. This is where you come in. You can tell us about missing pairs of hieroglyphic spelling and transliteration. There is a link on our website to https://github.com/oraec/corpus/issues/new?assignees=oraec&labels=enhancement&projects=&template=add-a-new-hieroglyphs-transliteration-pair.md&title=%5BNEW+PAIR%5D. This is an issue tracker on GitHub where you can submit new pairs. We will then review the pairs and integrate them into our mapping. Cool, right? Anyway, we are curious what you think of our new service.
This work is marked with CC0 1.0 Universal