Hooray, our text corpus is online.

In the last blog post we introduced the stable IDs for Egyptian texts. The web pages with these stable IDs include metadata like date or findspot. Now we have also put the texts themselves online, so that you can read through them and do research in them.

Originally, we intended to publish the texts in hieroglyphic script. However, only 21% of the texts to which we have assigned a stable ID have hieroglyphic annotations. Therefore, we used the Egyptological transcription instead. All 13026 texts can now be read from front to back in transcription. Every word of the transcription is linked. For example, oraec11793 consists only of this sentence: ḥm-nṯr-Ḫwi̯≡f-wj (j)m(,j)-rʾ-pr-ḥw,t-ꜥꜣ(,t) ḥm-nṯr-Sꜣḥ,w-Rꜥw ḥm-nṯr-Nfr-jr-kꜣ-Rꜥw jmꜣḫ,w-ḫr-nṯr-ꜥꜣ wꜥb-nswt (j)r(,j-j)ḫ(,t)-nswt Jj-mry. Clicking ḥm-nṯr-Ḫwi̯≡f-wj gives all the occurrences for ḥm-nṯr-Ḫwi̯=f-wj in our corpus. With (j)m(,j)-rʾ-pr-ḥw,t-ꜥꜣ(,t), you get all the occurrences for jm.j-rʾ-pr-ḥw.t-ꜥꜣ.t. Cool, right?

You ask how we did it? Simple! We took the data from AES, which consists of individual Egyptian sentences. We connected the sentences that belong together to form a text and extracted the transcription.
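For the curious, the joining step can be sketched in a few lines of Python. This is only an illustration, not our actual script: the field names text_id, position and written_form as well as the sample records are made up for the example and do not reflect the real AES data layout.

```python
from collections import defaultdict


def build_texts(sentences):
    """Group sentence records by text and join their transcriptions in order.

    Assumption: every record is a dict with the hypothetical keys
    "text_id", "position" (order within the text) and "tokens".
    """
    by_text = defaultdict(list)
    for sentence in sentences:
        by_text[sentence["text_id"]].append(sentence)

    texts = {}
    for text_id, group in by_text.items():
        # Put the sentences of one text back into reading order.
        group.sort(key=lambda s: s["position"])
        words = [tok["written_form"] for s in group for tok in s["tokens"]]
        texts[text_id] = " ".join(words)
    return texts


# Made-up sample records, deliberately out of order.
sample_sentences = [
    {"text_id": "demo-text", "position": 2,
     "tokens": [{"written_form": "Jj-mry"}]},
    {"text_id": "demo-text", "position": 1,
     "tokens": [{"written_form": "wꜥb-nswt"},
                {"written_form": "(j)r(,j-j)ḫ(,t)-nswt"}]},
]

print(build_texts(sample_sentences))
```

Sorting by position before joining is the important bit: the sentences of a text do not necessarily arrive in reading order.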

O.k., you are probably asking now: “And what about the hieroglyphic spellings? You started with the hieroglyphs in Unicode. Have you forgotten them now?” Good point! It’s not so easy to display two levels at the same time, namely transcription and hieroglyphs. The reader must be able to recognize which hieroglyphs and which transcription belong together. We have not been able to come up with a slim and user-friendly format that does that. Sorry!

Instead, we present an additional view that includes all the information that AES provides. This additional view does not show the entire text, but only one sentence. To be able to reference these sentences, we did not reuse the AES IDs. You may remember that AES has such terribly long IDs that no one can memorize them. We simply numbered the sentences of each text: oraec1-5 stands for the fifth sentence of the text oraec1.

Back to our additional view: there is a rather large table per sentence. The different kinds of information, like hieroglyphic writing or German translation, are in the rows. Every single word has its own column. Let’s go through the rows one by one: In token are the IDs of the words. Here we have also introduced new IDs. We number the words of a sentence: oraec1-5-2 is the second word of the fifth sentence of the text oraec1. In written_form are the transcriptions of the words. The hieroglyphic spellings are in hiero. We use the font EgyptianHiero mentioned in a blog post a month ago because this font uses ligatures. (Some of them are a little bit buggy, we know.) line count tells you in which line the word in question is located. translation has a German translation of the word. This was all very obvious until now. But what is lemma? This is the term for a lexical entry in a dictionary, i.e. it tells you where you can find the word in question in a dictionary. AED ID gives the ID of this lemma in the Ancient Egyptian Dictionary. part of speech stands for, well, what do you think? Sure: for part of speech… The following rows can specify grammatical features of the word. There are name, number, voice, genus, pronoun, numerus, epitheton, morphology, inflection, adjective, particle, adverb, verbal class and status.
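If you want to work with these IDs in your own scripts, they are easy to take apart. Here is a small Python sketch based only on the scheme described above (text, then sentence number, then word number, separated by hyphens). The names OraecRef, parse_oraec_id and format_oraec_id are invented for this example and are not part of any ORAEC tooling.

```python
import re
from typing import NamedTuple, Optional

# oraec<text>[-<sentence>[-<word>]], e.g. oraec1, oraec1-5, oraec1-5-2
ID_PATTERN = re.compile(r"oraec(\d+)(?:-(\d+))?(?:-(\d+))?")


class OraecRef(NamedTuple):
    text: int
    sentence: Optional[int] = None
    token: Optional[int] = None


def parse_oraec_id(identifier: str) -> OraecRef:
    """Split an ID into text, sentence and word numbers."""
    match = ID_PATTERN.fullmatch(identifier)
    if match is None:
        raise ValueError(f"not an ORAEC-style ID: {identifier!r}")
    text, sentence, token = match.groups()
    return OraecRef(
        int(text),
        int(sentence) if sentence is not None else None,
        int(token) if token is not None else None,
    )


def format_oraec_id(ref: OraecRef) -> str:
    """Build the ID string back from its parts."""
    parts = [f"oraec{ref.text}"]
    if ref.sentence is not None:
        parts.append(str(ref.sentence))
        # A word number only makes sense inside a sentence.
        if ref.token is not None:
            parts.append(str(ref.token))
    return "-".join(parts)


print(parse_oraec_id("oraec1-5-2"))
print(format_oraec_id(OraecRef(1, 5)))
```

The nice thing about simply numbering sentences and words is that the ID of a word already contains the ID of its sentence and of its text.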

The entries in this table are also linked. If you click on a hieroglyphic writing, you get all occurrences for this writing. If you click on a part of speech, you get all occurrences for this part of speech. If you click on a verbal class, you get all the occurrences for … o.k. you know what for.
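How do you get from a table entry to all its occurrences? One common way is an inverted index: for every value of a row, collect the IDs of the words that have it. The following Python sketch shows the idea. It is an assumption about how such link targets could be built, not a description of our code, and the token records and feature values in it are invented.

```python
from collections import defaultdict


def build_occurrence_index(tokens, field):
    """Map every value of `field` to the IDs of the words that carry it."""
    index = defaultdict(list)
    for token in tokens:
        value = token.get(field)
        if value:  # skip words that have no entry in this row
            index[value].append(token["id"])
    return dict(index)


# Invented sample records; the feature values are placeholders.
sample_tokens = [
    {"id": "oraec1-5-1", "part_of_speech": "substantive"},
    {"id": "oraec1-5-2", "part_of_speech": "verb", "verbal_class": "verb_3-lit"},
    {"id": "oraec2-1-1", "part_of_speech": "substantive"},
]

print(build_occurrence_index(sample_tokens, "part_of_speech"))
print(build_occurrence_index(sample_tokens, "verbal_class"))
```

One index per row (hiero, part of speech, verbal class and so on) is enough to generate an occurrence page for every linked entry.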

Additionally, there are overview pages: one for all grammatical features like nisba and one for all hieroglyphic writings in ORAEC. Just search for the desired writing via CTRL + F and get all occurrences for this writing in ORAEC!

We now have a simple view that provides the entire transcription of a text, and a table view that provides a heck of a lot of information per sentence. O.k., the table may have you puzzled by now. When every word gets its own column and a sentence consists of, let’s say, ten words, that’s - to put it nicely - a challenge for a smartphone. We admit it, it’s really not ideal, but we’ve tried to make the best of it. ZURB has a great idea for implementing large tables in a responsive design: the first column is pinned and the other columns are scrollable. We find the result quite appealing. What do you think? Or do you have a cool idea for handling such huge tables on cell phones?

What’s next? We will publish the texts in JSON so that everyone can reuse them. That will be easier than scraping the HTML files. And we want to deal with the collocations of words.

Stay tuned and have fun with the corpus!
