We have converted freely licensed annotations of hieroglyphic texts into Unicode. All of this is now in our repository https://github.com/oraec/formerly-mdc-now_unicode. Anyone can look at the encodings and read the texts. But that is not the real added value. Because the data is freely licensed, anyone can reuse it and base their own research on it. In this blog post we want to offer an example of what you can do with this data.

The Egyptian texts consist of hieroglyphs. O.k., what an insight! Some hieroglyphs are rare, some are used more often. This is the case in all scripts: some characters appear more often than others. In English texts you have more n than q. How does it look in Egyptian? To find out, we can use the data from the repository mentioned above. Notice! This is an example of reuse. Our plan was simply to count the hieroglyphs. But the data also contains other characters that were introduced by the conversion to Unicode. Read about it in earlier blog posts! We are talking about ⯑ and �, which mark either that Unicode lacks a character for a particular encoding or that the original source does not allow a unique mapping to a Unicode character. We have disregarded these characters.
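The counting step can be sketched in a few lines of Python. This is only a minimal sketch, not the script we actually ran: the function name `count_signs` and the sample string are made up for illustration, and we assume the text is available as a plain Unicode string in which whitespace does not count as a token.

```python
from collections import Counter

# The two placeholder marks mentioned above: ⯑ and � (U+FFFD).
PLACEHOLDERS = {"⯑", "\ufffd"}

def count_signs(text: str) -> Counter:
    """Count every character except whitespace and the placeholder marks."""
    return Counter(
        ch for ch in text
        if not ch.isspace() and ch not in PLACEHOLDERS
    )

# Invented sample string, NOT taken from the repository.
sample = "𓏏𓈖𓏏 ⯑𓂋𓏏\ufffd𓈖"
counts = count_signs(sample)

# most_common() sorts by frequency, most frequent sign first.
for sign, n in counts.most_common():
    print(sign, n)
```

Python iterates over strings by code point, so each hieroglyph is counted as one character even though it lies outside the Basic Multilingual Plane.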

The result of counting is available as a file. It lists the hieroglyphs together with the number of times each is used in the repository. The file is sorted by frequency, with the most frequently used hieroglyph at the beginning. The 50 most common hieroglyphs are: 𓏏, 𓈖, 𓏤, 𓂋, 𓇋, 𓅱, 𓏥, 𓅓, 𓆑, 𓋴, 𓂝, 𓄿, 𓏛, 𓎡, 𓐍, 𓀀, 𓊪, 𓎛, 𓅆, 𓂧, 𓏭, 𓇳, 𓁷, 𓊃, 𓂻, 𓃀, 𓉐, 𓏌, 𓎟, 𓊹, 𓁹, 𓂡, 𓀁, 𓆓, 𓇓, 𓇾, 𓈙, 𓈒, 𓈎, 𓄹, 𓀗, 𓅯, 𓐛, 𓋹, 𓎆, 𓄤, 𓅪, 𓊖, 𓏠, 𓂜. Mostly these are phonograms, especially uniliteral phonograms. That seems logical, doesn’t it? But the fact that 𓏤 is in third position is a bit surprising.

The frequencies of the individual characters show how unevenly the usage is distributed. Of the 617 different characters used in the repository, the most frequent, 𓏏, occurs 73804 times. In contrast, 16 hieroglyphs (𓉜, 𓂍, 𓃰, 𓈻, 𓃦, 𓌏, 𓇌, 𓁂, 𓃇, 𓌲, 𓋙, 𓄮, 𓈹, 𓄀, 𓁴, 𓄾) each occur only once. Or, in percentages: our repository contains 910101 hieroglyphs in total, so 8.1% of them are a 𓏏. 𓋙, on the other hand, has a share of only 0.0001%. This inequality is made clear by the fact that the 11 most common characters account for more than half of the 910101 instances. The following table shows which characters are needed to write 50%, 80%, 90%, and 95% of all 910101 instances. In other words: if you know these characters, you will recognize more than 95% of the hieroglyphs written in these texts!

Proportion of written hieroglyphs: characters needed (each row adds to the rows above)
50 % tokens with: 𓏏,𓈖,𓏤,𓂋,𓇋,𓅱,𓏥,𓅓,𓆑,𓋴,𓂝
80 % tokens with: 𓄿,𓏛,𓎡,𓐍,𓀀,𓊪,𓎛,𓅆,𓂧,𓏭,𓇳,𓁷,𓊃,𓂻,𓃀,𓉐,𓏌,𓎟,𓊹,𓁹,𓂡,𓀁,𓆓,𓇓,𓇾,𓈙,𓈒,𓈎,𓄹,𓀗,𓅯,𓐛,𓋹,𓎆,𓄤,𓅪,𓊖,𓏠,𓂜,𓌳,𓉻,𓈇
90 % tokens with: 𓊨,𓉔,𓁐,𓃹,𓍿,𓂞,𓈗,𓅭,𓊵,𓅨,𓌸,𓄣,𓆱,𓏇,𓏴,𓆼,𓏊,𓍑,𓐎,𓎼,𓁶,𓍛,𓂓,𓀏,𓐙,𓀭,𓆇,𓆣,𓆰,𓈉,𓊤,𓍘,𓄟,𓏙,𓊮,𓈐,𓇯,𓌨,𓊢,𓈞,𓆄,𓇼,𓌡,𓌙,𓄡,𓏶,𓌉,𓄂,𓅡,𓊗,𓍯,𓅮,𓆤,𓇑,𓏞,𓈘,𓅃
95 % tokens with: 𓂾,𓎺,𓇍,𓈋,𓌃,𓏐,𓌢,𓄔,·,𓐟,𓈌,𓍱,𓂉,𓏎,𓎔,𓄑,𓇥,𓂂,𓂸,𓆭,𓍋,𓀻,𓃒,𓏃,𓄪,𓀔,𓋀,𓈅,𓉗,𓂭,𓆳,𓌪,𓌫,𓄓,𓅂,𓍢,𓊌,𓂺,𓅜,𓇉,𓐩,𓆷,𓎗,𓏒,𓊛,𓂢,𓁺,𓃂,𓄋,𓅷,𓌂,𓍼,𓍲,𓇛,𓀯,𓎿,𓍔,𓈍,𓄖,𓇅,𓎱,𓋞,𓅠,𓍃
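How such a table can be derived from the counts is sketched below. Again, this is only an illustration under assumptions: the function name `signs_needed` and the toy counts are invented, and only the two figures 73804 and 910101 come from our data.

```python
from collections import Counter

def signs_needed(counts, percents=(50, 80, 90, 95)):
    """Return {percent: how many top-ranked signs are needed to cover it}."""
    total = sum(counts.values())
    targets = sorted(percents)
    needed = {}
    running = 0
    i = 0
    for rank, (_sign, n) in enumerate(counts.most_common(), start=1):
        running += n
        # Integer comparison avoids floating-point edge cases.
        while i < len(targets) and running * 100 >= targets[i] * total:
            needed[targets[i]] = rank
            i += 1
    return needed

# Invented toy counts, NOT the repository figures.
toy = Counter({"𓏏": 50, "𓈖": 30, "𓏤": 10, "𓂋": 5, "𓇋": 5})
coverage = signs_needed(toy)
print(coverage)

# Figures quoted in the post: share of the most frequent sign in percent.
top_share = round(73804 / 910101 * 100, 1)
print(top_share)
```

Applied to the full counts, the value for 50 would be the 11 mentioned above.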

Is it all just a gimmick? The counts are helpful in two main areas:

1) Everyone who learns hieroglyphic writing faces a huge number of new characters at the beginning. It is hard to tell which characters are particularly important and which are not. Sheer frequency, however, is a good indicator of importance. The more frequent a character is, the more important it is, and the sooner you should learn it!

2) In any kind of computer-aided character recognition, it is important to know how likely each character is to occur. If you had to guess which flat, elongated character comes next in a hieroglyphic text, you should pick 𓈖 or 𓂋, because those are the two most commonly used flat, elongated characters. The sheer count gives a first estimate of a character's probability of occurrence. Character recognition software typically builds on this kind of information.
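The guessing game from point 2 can be written down as a tiny sketch. Everything here is hypothetical: the function name `prior_over_candidates`, the candidate list (we added 𓊃 as a third flat, elongated sign for illustration), and the counts, which are invented and not the repository figures. A real recognizer would combine such a frequency-based guess with what it actually sees in the image.

```python
def prior_over_candidates(counts, candidates):
    """Turn raw counts into probabilities, restricted to the candidate signs."""
    total = sum(counts.get(c, 0) for c in candidates)
    if total == 0:
        return {}
    return {c: counts.get(c, 0) / total for c in candidates}

# Invented counts, NOT the repository figures.
toy_counts = {"𓈖": 600, "𓂋": 300, "𓊃": 100, "𓏏": 900}

# Hypothetical set of flat, elongated signs the recognizer cannot tell apart.
flat_signs = ["𓈖", "𓂋", "𓊃"]

prior = prior_over_candidates(toy_counts, flat_signs)
best_guess = max(prior, key=prior.get)
print(best_guess, prior[best_guess])
```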

The presentation so far is based on an evaluation that treats all texts as one pool. In addition, however, we have counted each individual text separately and published the results in a separate file. Only by counting the texts separately can one recognize chronological developments or regional differences. The file offers the following values per text: text length, number of different characters, information about the inequality (like the table above), a list of the characters used with their absolute and percentage frequencies, and finally the cumulative percentage frequencies with rank position. Folks, get started with your script-statistical analyses! The possibilities are immense.
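For a single text, values of this kind could be computed roughly as follows. This is a sketch under assumptions, not the layout of our file: the function name `text_profile` and the dictionary keys are made up, and the text is assumed to be already cleaned of whitespace and placeholder marks.

```python
from collections import Counter

def text_profile(text):
    """Length, number of distinct signs, and a ranked frequency table."""
    counts = Counter(text)
    length = sum(counts.values())
    rows = []
    cumulative = 0
    for rank, (sign, n) in enumerate(counts.most_common(), start=1):
        cumulative += n
        rows.append({
            "rank": rank,
            "sign": sign,
            "count": n,                                    # absolute frequency
            "percent": 100 * n / length,                   # percentage frequency
            "cumulative_percent": 100 * cumulative / length,
        })
    return {"length": length, "distinct": len(counts), "rows": rows}

# Invented four-sign text, NOT taken from the repository.
profile = text_profile("𓏏𓏏𓈖𓂋")
for row in profile["rows"]:
    print(row)
```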

We just did the following little thing, more of a gimmick: as seen above, 𓏏 is ranked #1 in the frequency table of all hieroglyphs, 𓈖 is ranked #2, and 𓏤 is ranked #3. But is the order the same in all the different texts? A file investigates this for the ten most frequent characters. For each character, it lists the texts in which that character is the most frequent, the texts in which it is the second most frequent, and so on. Not all texts were included in the investigation, only those that are more than 1000 tokens long. According to this, 𓏏, the most frequent character overall, is ranked first in most texts, and 𓈖, the second most frequent character overall, is ranked second in most texts. 𓏤, however, the third most frequent character overall, is ranked first in 28 texts but third in only 23. What might that mean? We don’t have a clue. Guys, think about it or better: do your own (meaningful) research!
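If you want to try this kind of rank comparison yourself, a minimal sketch could look like this. The function name `rank_tally`, the text identifiers, and the toy texts are all invented. We assume the texts are already cleaned strings, and ties in frequency are simply broken by first appearance in the text, which may differ from how our file handles them.

```python
from collections import Counter

def rank_tally(texts, signs, min_tokens=1000):
    """For each sign, count in how many texts it holds each frequency rank."""
    tally = {s: Counter() for s in signs}
    for text in texts.values():
        counts = Counter(text)
        # Only texts that are MORE than min_tokens long are considered.
        if sum(counts.values()) <= min_tokens:
            continue
        ranking = [s for s, _ in counts.most_common()]
        for s in signs:
            if s in counts:
                tally[s][ranking.index(s) + 1] += 1
    return tally

# Invented toy texts, NOT taken from the repository.
toy_texts = {
    "text_a": "𓏏" * 600 + "𓈖" * 300 + "𓏤" * 200,   # 1100 tokens
    "text_b": "𓏤" * 700 + "𓏏" * 400 + "𓈖" * 100,   # 1200 tokens
    "text_c": "𓏏" * 10,                              # too short, skipped
}
tally = rank_tally(toy_texts, ["𓏏", "𓈖", "𓏤"])
for sign, ranks in tally.items():
    print(sign, dict(ranks))
```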

This work is marked with CC0 1.0 Universal

