Recommendations for encoding Egyptian hieroglyphs in Unicode

To coincide with the release of our recommendations for encoding Egyptian hieroglyphs in Unicode (the repo: https://github.com/oraec/recommendations-encoding-hieroglyphs, the release: https://github.com/oraec/recommendations-encoding-hieroglyphs/releases/tag/v1.0), we are publishing a related blog post here:

Digitization means representing the non-digital in a digital environment. The non-digital is transformed (sorry for this buzzword) into the digital. You can either digitize a handwritten letter as an image or encode the writing as script. The second option requires standards that carry the individual letters, words and sentences over into the digital world. The standard for encoding all scripts is called Unicode.

You don’t know Unicode? No problem; here’s an introduction: https://blog.daftcode.pl/a-brief-introduction-to-unicode-for-everybody-bc26f5761548

Unicode aims to make not only Latin script, but also cuneiform script and Egyptian hieroglyphs encodable. For a few years now, Unicode has contained 1071 Egyptian hieroglyphs. With this set, Egyptology can encode Egyptian texts. All good, then? Unfortunately not. Let’s take a look at what Mark-Jan Nederhof, a specialist in the digital representation of Egyptian, writes:

Following the terminology of Unicode, a character is the smallest component of written language and a glyph is a shape that a character can have when it is rendered or displayed. In Egyptology however, there seem to be tendencies to remain true to the original manuscript while encoding a text, often to the extent of encoding glyphs rather than characters. (Nederhof, Mark-Jan. The Manuel de Codage encoding of hieroglyphs impedes development of corpora. In: Texts, Languages & Information Technology in Egyptology. Edited by Jean Winand & Stéphane Polis. 2013. 104.)

What’s the problem? Well, let’s look at an example: Unicode has two characters for a phonogram with the phonetic value ms: 𓄠 and 𓄟. If one encoder uses 𓄠 and another uses 𓄟, a search no longer finds every occurrence of a word. Obviously, if you search for 𓄠, you won’t find 𓄟. This is also annoying for font developers, who have to create two characters instead of one.
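To make the search problem concrete, here is a minimal Python sketch. The two-text corpus and the `search` helper are made up for illustration; only the two ms signs come from the example above.

```python
# The two signs from the example above; both carry the phonetic value ms.
MS_A = "𓄟"
MS_B = "𓄠"

# Hypothetical mini corpus: two encoders chose different code points
# for what is, graphemically, the same sign.
corpus = {
    "text_1": "𓄟𓋴",
    "text_2": "𓄠𓋴",
}


def search(sign):
    """Return the ids of all texts that contain exactly this code point."""
    return [text_id for text_id, text in corpus.items() if sign in text]


print(search(MS_A))   # ['text_1']
print(search(MS_B))   # ['text_2']
print(MS_A == MS_B)   # False
```

Each query finds only half of the corpus, because a plain string search compares code points and knows nothing about the two shapes belonging together.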

But if the Egyptologist says, “Hey, I need the two characters because the original also has two different shapes,” then that’s exactly the problem Mark-Jan Nederhof describes.

The Egyptologist needs two glyphs because he wants to represent the different character forms. However, both glyphs can belong to one character. Glyph vs. character is a very important distinction that Unicode makes. We have to keep glyph and character cleanly apart.

A character, a grapheme so to speak, is what Unicode offers. A font, on the other hand, controls exactly what a character looks like. Different fonts can represent different glyphs.

So we can answer the Egyptologist: “Use different fonts if you want to represent different glyphs. 𓄠 and 𓄟 are still only one character. Stick to the rule of encoding only characters, not glyphs!” So, if you want to search or reuse digital hieroglyphs, you need font-independent characters, not glyphs.

Back to Unicode and its 1071 hieroglyphs. Many of these hieroglyphs are not characters. Besides 𓄠 and 𓄟, there are many other variants, e.g. 𓏴 and 𓏵, or 𓎼 and 𓎽, or 𓁰 and 𓁱. Variants thus occur in both phonograms and logograms.
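If a text already contains variants, one way to get back to characters is to fold each variant onto its character before searching. The following Python sketch covers only the four pairs named in this post. The direction of each mapping is an assumption read off the list of recommended characters further down (the sign that appears there is treated as the character), and the names `VARIANT_TO_CHARACTER` and `normalize` are made up for illustration.

```python
# Variant -> character, for the four pairs mentioned in the post.
# Assumption: the member of each pair that appears in the recommended
# list is the character; the other one is the variant.
VARIANT_TO_CHARACTER = {
    "𓄠": "𓄟",
    "𓏵": "𓏴",
    "𓎽": "𓎼",
    "𓁱": "𓁰",
}


def normalize(text):
    """Replace every known variant in `text` with its character."""
    return "".join(VARIANT_TO_CHARACTER.get(sign, sign) for sign in text)


# After normalization, both spellings compare equal and can be searched alike.
print(normalize("𓄠") == normalize("𓄟"))   # True
```

A complete table would have to cover all variants identified in the recommendations; the four entries here are only an excerpt.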

Next are the ligatures. Ligatures combine glyphs to form a unit in the typeface. Unicode therefore considers ligatures to be a matter of the font and not the characters. Accordingly, a 𓅲 is only a glyph and not a character.

Finally, there are the substitutes. 𓏲 is the hieroglyphic representation of the abbreviated hieratic form of 𓅱. 𓅱 and 𓏲 - according to Nederhof (Nederhof, Mark-Jan. The Manuel de Codage encoding of hieroglyphs impedes development of corpora. In: Texts, Languages & Information Technology in Egyptology. Edited by Jean Winand & Stéphane Polis. 2013. 104.) - “could be argued to be different shapes representing the same character.” The digital paleography of hieratic and cursive hieroglyphs also groups the two shapes under one grapheme. Thus 𓅱 and 𓏲 should be only one character.

Of the 1071 hieroglyphs in this set, 692 are characters, 252 are variants, 121 are ligatures, and 6 are substitutes. Our recommendations for encoding Egyptian hieroglyphs in Unicode break this down exactly. We recommend using only the real characters: 𓀀, 𓀁, 𓀂, 𓀄, 𓀉, 𓀊, 𓀋, 𓀌, 𓀍, 𓀎, 𓀏, 𓀐, 𓀒, 𓀓, 𓀔, 𓀖, 𓀗, 𓀘, 𓀙, 𓀚, 𓀛, 𓀝, 𓀞, 𓀟, 𓀠, 𓀡, 𓀢, 𓀣, 𓀤, 𓀦, 𓀧, 𓀨, 𓀩, 𓀫, 𓀭, 𓀯, 𓀲, 𓀵, 𓀸, 𓀹, 𓀺, 𓀻, 𓀾, 𓀿, 𓁀, 𓁂, 𓁄, 𓁅, 𓁆, 𓁊, 𓁋, 𓁌, 𓁐, 𓁑, 𓁒, 𓁔, 𓁖, 𓁗, 𓁘, 𓁙, 𓁚, 𓁟, 𓁠, 𓁢, 𓁣, 𓁤, 𓁥, 𓁦, 𓁨, 𓁩, 𓁫, 𓁭, 𓁮, 𓁯, 𓁰, 𓁲, 𓁳, 𓁴, 𓁵, 𓁶, 𓁷, 𓁸, 𓁹, 𓁺, 𓁼, 𓁽, 𓁿, 𓂀, 𓂁, 𓂂, 𓂃, 𓂄, 𓂅, 𓂆, 𓂇, 𓂈, 𓂉, 𓂋, 𓂌, 𓂍, 𓂎, 𓂏, 𓂐, 𓂑, 𓂓, 𓂕, 𓂘, 𓂙, 𓂚, 𓂜, 𓂝, 𓂞, 𓂠, 𓂡, 𓂢, 𓂣, 𓂤, 𓂥, 𓂦, 𓂧, 𓂨, 𓂩, 𓂪, 𓂫, 𓂬, 𓂭, 𓂷, 𓂸, 𓂺, 𓂻, 𓂽, 𓂾, 𓂿, 𓃀, 𓃂, 𓃃, 𓃇, 𓃈, 𓃒, 𓃓, 𓃔, 𓃕, 𓃖, 𓃗, 𓃘, 𓃙, 𓃛, 𓃜, 𓃝, 𓃟, 𓃠, 𓃡, 𓃢, 𓃥, 𓃦, 𓃧, 𓃩, 𓃫, 𓃬, 𓃭, 𓃮, 𓃯, 𓃰, 𓃱, 𓃲, 𓃴, 𓃵, 𓃶, 𓃷, 𓃸, 𓃹, 𓃻, 𓃼, 𓃿, 𓄀, 𓄁, 𓄂, 𓄃, 𓄅, 𓄇, 𓄈, 𓄊, 𓄋, 𓄏, 𓄑, 𓄒, 𓄓, 𓄔, 𓄖, 𓄗, 𓄙, 𓄚, 𓄛, 𓄜, 𓄝, 𓄞, 𓄟, 𓄡, 𓄢, 𓄣, 𓄤, 𓄥, 𓄦, 𓄪, 𓄫, 𓄬, 𓄭, 𓄮, 𓄯, 𓄰, 𓄲, 𓄹, 𓄼, 𓄽, 𓄾, 𓄿, 𓅂, 𓅃, 𓅄, 𓅆, 𓅇, 𓅉, 𓅊, 𓅋, 𓅌, 𓅏, 𓅐, 𓅑, 𓅒, 𓅓, 𓅕, 𓅘, 𓅙, 𓅚, 𓅜, 𓅝, 𓅟, 𓅠, 𓅡, 𓅢, 𓅣, 𓅤, 𓅥, 𓅦, 𓅧, 𓅨, 𓅪, 𓅬, 𓅭, 𓅮, 𓅯, 𓅰, 𓅱, 𓅷, 𓅹, 𓅺, 𓅻, 𓅼, 𓅽, 𓅾, 𓅿, 𓆀, 𓆁, 𓆂, 𓆃, 𓆄, 𓆆, 𓆇, 𓆈, 𓆉, 𓆊, 𓆋, 𓆌, 𓆎, 𓆏, 𓆐, 𓆑, 𓆒, 𓆓, 𓆔, 𓆗, 𓆘, 𓆙, 𓆛, 𓆜, 𓆝, 𓆞, 𓆟, 𓆠, 𓆡, 𓆢, 𓆣, 𓆤, 𓆦, 𓆧, 𓆨, 𓆩, 𓆫, 𓆬, 𓆭, 𓆯, 𓆰, 𓆱, 𓆳, 𓆴, 𓆷, 𓆸, 𓆹, 𓆻, 𓆼, 𓇅, 𓇇, 𓇉, 𓇋, 𓇌, 𓇍, 𓇎, 𓇏, 𓇐, 𓇑, 𓇒, 𓇓, 𓇔, 𓇕, 𓇗, 𓇚, 𓇛, 𓇜, 𓇝, 𓇠, 𓇣, 𓇤, 𓇥, 𓇧, 𓇨, 𓇩, 𓇫, 𓇬, 𓇭, 𓇮, 𓇯, 𓇰, 𓇲, 𓇳, 𓇶, 𓇷, 𓇹, 𓇻, 𓇼, 𓇽, 𓇾, 𓈀, 𓈁, 𓈂, 𓈄, 𓈅, 𓈇, 𓈈, 𓈉, 𓈋, 𓈌, 𓈍, 𓈎, 𓈏, 𓈐, 𓈑, 𓈒, 𓈔, 𓈖, 𓈗, 𓈘, 𓈙, 𓈝, 𓈞, 𓈠, 𓈡, 𓈢, 𓈣, 𓈤, 𓈦, 𓈧, 𓈨, 𓈩, 𓈪, 𓈫, 𓈬, 𓈭, 𓈮, 𓈯, 𓈰, 𓈱, 𓈳, 𓈴, 𓈵, 𓈶, 𓈷, 𓈸, 𓈹, 𓈺, 𓈻, 𓈼, 𓈽, 𓈾, 𓈿, 𓉁, 𓉃, 𓉄, 𓉅, 𓉆, 𓉇, 𓉈, 𓉉, 𓉋, 𓉌, 𓉍, 𓉎, 𓉐, 𓉔, 𓉕, 𓉗, 𓉘, 𓉜, 𓉠, 𓉡, 𓉥, 𓉧, 𓉩, 𓉪, 𓉬, 𓉭, 𓉯, 𓉱, 𓉲, 𓉳, 𓉴, 𓉵, 𓉶, 𓉸, 𓉹, 𓉺, 𓉻, 𓉽, 𓉿, 𓊀, 𓊁, 𓊂, 𓊃, 𓊄, 𓊅, 𓊆, 𓊇, 𓊈, 𓊉, 𓊊, 𓊋, 𓊌, 𓊍, 𓊎, 𓊏, 𓊑, 𓊒, 𓊔, 𓊖, 𓊗, 𓊚, 𓊛, 𓊜, 𓊝, 𓊞, 𓊠, 𓊡, 𓊢, 𓊤, 𓊦, 𓊧, 𓊨, 𓊪, 𓊫, 𓊬, 𓊭, 𓊮, 𓊯, 𓊲, 𓊵, 𓊶, 𓊸, 𓊹, 𓊺, 𓊻, 𓊽, 𓊾, 𓊿, 𓋀, 𓋁, 𓋂, 𓋃, 𓋄, 𓋆, 𓋇, 𓋉, 𓋋, 𓋍, 𓋏, 𓋐, 𓋑, 𓋓, 𓋔, 𓋖, 𓋘, 𓋙, 𓋚, 𓋛, 𓋜, 𓋝, 𓋞, 𓋣, 𓋦, 𓋧, 𓋨, 𓋪, 𓋫, 𓋬, 𓋭, 𓋮, 𓋯, 𓋰, 𓋲, 𓋳, 𓋴, 𓋷, 𓋸, 𓋹, 𓋺, 𓋽, 𓋾, 𓋿, 𓌀, 𓌁, 𓌂, 𓌃, 𓌄, 𓌅, 𓌆, 𓌇, 𓌈, 𓌉, 𓌎, 𓌏, 𓌐, 𓌑, 𓌒, 𓌔, 𓌕, 𓌗, 𓌘, 𓌙, 𓌛, 𓌝, 𓌞, 𓌟, 𓌡, 𓌢, 𓌤, 𓌥, 𓌦, 𓌨, 𓌩, 𓌪, 𓌫, 𓌰, 𓌲, 𓌳, 𓌸, 𓌼, 𓌽, 𓌾, 𓍁, 𓍂, 𓍃, 𓍄, 𓍅, 𓍇, 
𓍉, 𓍊, 𓍋, 𓍍, 𓍏, 𓍑, 𓍔, 𓍕, 𓍖, 𓍘, 𓍙, 𓍛, 𓍜, 𓍝, 𓍞, 𓍠, 𓍡, 𓍢, 𓍬, 𓍯, 𓍰, 𓍱, 𓍲, 𓍶, 𓍸, 𓍹, 𓍺, 𓍼, 𓍿, 𓎁, 𓎂, 𓎃, 𓎅, 𓎆, 𓎔, 𓎗, 𓎙, 𓎛, 𓎝, 𓎟, 𓎡, 𓎣, 𓎤, 𓎥, 𓎨, 𓎩, 𓎫, 𓎬, 𓎯, 𓎰, 𓎱, 𓎳, 𓎵, 𓎶, 𓎷, 𓎸, 𓎺, 𓎻, 𓎼, 𓎿, 𓏁, 𓏃, 𓏇, 𓏈, 𓏉, 𓏊, 𓏌, 𓏎, 𓏏, 𓏐, 𓏒, 𓏖, 𓏘, 𓏙, 𓏛, 𓏞, 𓏠, 𓏡, 𓏢, 𓏣, 𓏤, 𓏥, 𓏭, 𓏳, 𓏴, 𓏶, 𓏸, 𓐍, 𓐎, 𓐏, 𓐑, 𓐒, 𓐓, 𓐖, 𓐗, 𓐘, 𓐙, 𓐛, 𓐞, 𓐟, 𓐡, 𓐢, 𓐣, 𓐥, 𓐧, 𓐨, 𓐩, 𓐪, 𓐬, 𓐮
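With such a breakdown at hand, an encoded text can be checked against the recommendation automatically. Below is a small Python sketch of that idea. The `CATEGORY` table is only a six-sign excerpt built from the examples in this post (the full classification lives in the recommendations themselves), and the names `audit` and `is_recommended` are hypothetical.

```python
from collections import Counter

# Excerpt of a sign classification, taken from the examples in this post.
# A real table would hold all 1071 signs.
CATEGORY = {
    "𓄟": "character",
    "𓅱": "character",
    "𓏏": "character",
    "𓄠": "variant",
    "𓅲": "ligature",
    "𓏲": "substitute",
}


def audit(text):
    """Count the signs of `text` per category; unknown signs are flagged."""
    return Counter(CATEGORY.get(sign, "unclassified") for sign in text)


def is_recommended(text):
    """True if every sign of `text` is classified as a real character."""
    return all(CATEGORY.get(sign) == "character" for sign in text)


# Sanity check on the numbers given above: the four groups add up to the set.
counts = {"character": 692, "variant": 252, "ligature": 121, "substitute": 6}
assert sum(counts.values()) == 1071

print(is_recommended("𓅱𓏏"))   # True
print(is_recommended("𓅲"))     # False
```

The check is deliberately strict: a sign missing from the table counts as not recommended, so gaps in the table show up instead of slipping through.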

By the way, what about the nine control characters Unicode added in 2019? These characters join individual glyphs into groups. The above-mentioned 𓅲 can be encoded as 𓅱 and 𓏏 plus a control character that inserts the 𓏏. As you can see, the control character builds a ligature. Therefore, we do not recommend these control characters, because ligatures belong to the font level. There are some fonts that can build ligatures of Egyptian hieroglyphs, e.g. Aegyptus or EgyptianHiero. So, if you want to encode ligatures, please use these fonts instead of the control characters.
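If you want to check whether a text uses these control characters, a simple range test is enough. The nine controls sit at U+13430 to U+13438, directly after the hieroglyphs themselves. The Python sketch below is only an illustration: the function names are made up, and the choice of U+13433 (one of the insertion controls) for the sample sequence is an assumption, since this post does not say which control fits the group.

```python
# The nine format controls added in 2019 occupy U+13430..U+13438.
FIRST_CONTROL = 0x13430
LAST_CONTROL = 0x13438


def is_format_control(sign):
    """True if `sign` is one of the nine hieroglyph format controls."""
    return FIRST_CONTROL <= ord(sign) <= LAST_CONTROL


def find_format_controls(text):
    """Return (position, code point) for every format control in `text`."""
    return [
        (index, f"U+{ord(sign):04X}")
        for index, sign in enumerate(text)
        if is_format_control(sign)
    ]


# Assumption: U+13433 stands in for "a control character inserting the 𓏏".
INSERT_CONTROL = chr(0x13433)
with_control = "𓅱" + INSERT_CONTROL + "𓏏"

print(find_format_controls(with_control))   # [(1, 'U+13433')]
print(find_format_controls("𓅱𓏏"))          # []
```

Python strings are indexed by code point, so each hieroglyph and each control counts as one position, even though they lie outside the Basic Multilingual Plane.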

In the future, we will present texts here that are encoded according to our recommendations. Stay tuned!

Happy birthday, Egyptology!
