Missing Unicode characters and further mapping problems

Well, have you been waiting impatiently for our next blog post? We wanted to convert more texts encoded in Manuel de Codage to Unicode, so we took Serge Rosmorduc’s great repository and looked at the freely licensed texts. There are 112 of them.

OK, so what took so long? A mapping already existed, but it was not enough. It had 1807 pairs so far, and working through these texts added another 157. That’s an increase of almost 9%!
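For anyone who wants to check the percentage, here is a quick sketch of the arithmetic. The variable names are ours and purely illustrative; only the two counts come from the text above.

```python
# Counts taken from the post; the variable names are illustrative only.
existing_pairs = 1807
new_pairs = 157

# Relative growth of the mapping, in percent.
growth_percent = new_pairs / existing_pairs * 100
total_pairs = existing_pairs + new_pairs

print(f"{growth_percent:.1f}% more pairs, {total_pairs} in total")
```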

But we did it and converted the texts. In this post we report on the difficulties we ran into along the way.

First, we found twelve signs for which there is no corresponding Unicode character. That is, no existing character had the required set of features, and the sign could not be understood as a glyph variant of an existing character either.

Missing Unicode characters

A158

phonetic value: nḥp
source: Leitz, Christian: Quellentexte zur Ägyptischen Religion I. Die Tempelinschriften der griechisch-römischen Zeit. 2004, p. 154.
source: https://thotsignlist.org/mysign?id=223

A457

semantic value: Iunmutef priest
source: Urk. IV, 157.11
source: Rummel, Ute: Pfeiler seiner Mutter - Beistand seines Vaters : Untersuchungen zum Gott Iunmutef vom Alten Reich bis zum Ende des Neuen Reiches. Volume I. 2003. https://nbn-resolving.org/urn%3Anbn%3Ade%3Agbv%3A18-34441, p. 4 with n. 34.

B24

semantic value: woman-shaped statue
source: Polis, Stéphane & Rosmorduc, Serge. The Hieroglyphic Sign Functions. Suggestions for a Revised Taxonomy. In: Fuzzy Boundaries: Festschrift für Antonio Loprieno. 2015, p. 161.
source: http://tla-temp.hieroglyphic-texts.net/Belege_fuer_Hieroglyphe_B24_satzweise_Ansicht_.html

C165

semantic value: Mehit
source: Leitz, Christian: Quellentexte zur Ägyptischen Religion I. Die Tempelinschriften der griechisch-römischen Zeit. 2004, p. 157.

D153

phonetic value: r
source: Leitz, Christian: Quellentexte zur Ägyptischen Religion I. Die Tempelinschriften der griechisch-römischen Zeit. 2004, p. 157.
source: https://thotsignlist.org/mysign?id=1943

F132

phonetic value: mꜣṯ
source: Leitz, Christian: Quellentexte zur Ägyptischen Religion I. Die Tempelinschriften der griechisch-römischen Zeit. 2004, p. 162.

M163

phonetic value: sm
source: De Meulenaere, Herman. Un titre memphite méconnu. In: Mélanges Mariette. BdE 32. 1961, p. 285-290.

R88

phonetic value: mks
source: Leitz, Christian: Quellentexte zur Ägyptischen Religion I. Die Tempelinschriften der griechisch-römischen Zeit. 2004, p. 170.

S116

measure of linen, ligature of S114 4 times -> S114 is the representation of the missing Unicode character
source: Scheele, Katrin: Die Stofflisten des Alten Reiches. Lexikographie, Entwicklung und Gebrauch. MENES 2. 2005, p. 56-58.
meaning not listed in https://thotsignlist.org/mysign?id=5651

T92

semantic value: net/trap
source: https://thotsignlist.org/mysign?id=6023 (T90)

U105

semantic value: saw
source: http://tla-temp.hieroglyphic-texts.net/Belege_fuer_Hieroglyphe_U105_satzweise_Ansicht_.html

<S

opening of the serekh, cf. opening of ḥwt enclosure 𓉘; opening of fortified wall cartouche 𓊆; opening of square fortified wall cartouche 𓊈; opening of cartouche 𓍹

A ⯑ was used as a placeholder for the mapping. Hopefully, these characters will be added in the future.
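To make the placeholder idea concrete, here is a minimal sketch of how such entries could sit in a mapping table. It assumes the mapping is a plain dictionary from Manuel de Codage code to Unicode string; that structure and the helper name `unmapped_signs` are our own illustration, not necessarily how the real mapping is stored.

```python
# Placeholder used in the post for signs without a Unicode character.
MISSING = "⯑"

# The twelve codes listed above as having no Unicode counterpart.
MISSING_CODES = [
    "A158", "A457", "B24", "C165", "D153", "F132",
    "M163", "R88", "S116", "T92", "U105", "<S",
]

# Assumed structure: MdC code -> Unicode string.
mapping = {code: MISSING for code in MISSING_CODES}


def unmapped_signs(codes):
    """Return the codes that currently resolve to the placeholder."""
    return [code for code in codes if mapping.get(code) == MISSING]
```

Keeping the placeholder in one constant means the affected entries can be found and replaced easily once the characters are encoded.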

Glyphs instead of characters and other problems

However, most of the 157 new cases are variants of existing characters: N104 is just a variant of 𓈞, and D3A is just a variant of 𓁸.

Remember the quote from Mark-Jan Nederhof that we cited in a previous post? No? Here it is again:

In Egyptology however, there seem to be tendencies to remain true to the original manuscript while encoding a text, often to the extent of encoding glyphs rather than characters. (Nederhof, Mark-Jan. The Manuel de Codage encoding of hieroglyphs impedes development of corpora. In: Texts, Languages & Information Technology in Egyptology. Edited by Jean Winand & Stéphane Polis. 2013. 104.)

This statement is confirmed by the many variants we found in the texts. Often the Manuel de Codage encoding does not try to be a code at all, but merely reproduces the glyphs as faithfully as possible. Here is an impressive example:

R13282**D263126**D23449**W100-W10:mw-!!! R15442**D263126**D23449**W100-W10:mw-!!

This snippet comes from the encoding of the stela Louvre C 26. It represents the icons to the left and right of the udjat eyes. The code suggests that D263 and D234 are used as characters there, but this is not true at all! They appear only because their shapes help to reproduce the icon as a picture.

Another example: occasionally the code contains 1*1*1, which is usually not the number 1 but the plural determinative 𓏥. So be careful when reusing our converted data; we cannot guarantee its quality.
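Here is a small sketch of how the 1*1*1 case could be handled. It is a deliberate simplification: it splits a line only on "-", whereas real Manuel de Codage has more separators, and the function name is hypothetical. Because the group is only usually the plural determinative, the function also returns the positions it touched so that a human can review them.

```python
PLURAL = "𓏥"  # plural determinative quoted in the post


def replace_plural_strokes(mdc_line):
    """Replace '1*1*1' groups and report where replacements happened.

    Simplification: groups are assumed to be separated by '-' only.
    """
    groups = mdc_line.split("-")
    hits = [i for i, group in enumerate(groups) if group == "1*1*1"]
    for i in hits:
        groups[i] = PLURAL
    return "-".join(groups), hits


converted, review_positions = replace_plural_strokes("nTr-1*1*1-mw")
```

Returning the positions instead of replacing silently keeps the uncertain cases visible.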

If Egyptologists mainly encode glyphs, it is no surprise that some characters are chosen only for their visual similarity. As a result, a single code can end up with very different functions. Z30 is an impressive example of this confusion in coding practice. We had to map such codes to �.

Undocumented codes such as US85Aa1001XT were also mapped to �.

Finally, some codes do not stand for a character at all. Space fillers have their own codes in JSesh, but they are not converted to Unicode.
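The three outcomes described in this section (mapped, unknown, skipped) can be sketched as follows. This is an illustration under assumptions, not our actual converter: the function name is made up, the mapping excerpt contains only the two variants mentioned above, and the space-filler codes "." and ".." are an assumption about which codes count as fillers.

```python
REPLACEMENT = "\uFFFD"  # the replacement character shown in the post

# Assumption: these codes stand for space fillers and carry no character.
SPACE_FILLERS = {".", ".."}

# Tiny excerpt of a mapping; the real table has far more entries.
MAPPING = {"N104": "𓈞", "D3A": "𓁸"}


def convert_codes(codes):
    """Convert a list of MdC codes to a Unicode string."""
    out = []
    for code in codes:
        if code in SPACE_FILLERS:
            continue  # fillers are dropped, not converted
        # Unknown or undocumented codes fall back to the replacement character.
        out.append(MAPPING.get(code, REPLACEMENT))
    return "".join(out)


result = convert_codes(["N104", "..", "US85Aa1001XT", "D3A"])
```

Falling back to a visible replacement character rather than dropping unknown codes makes the gaps easy to find in the converted text.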
