Let’s continue with the statistics. This time it’s tf-idf, a classic measure from text retrieval. Suppose you have a large collection of texts that you want to search for a given term. If the term appears in several texts, the question is how to rank the hit list: which texts are more relevant than others? This is a typical problem for search engines. Google uses a sophisticated algorithm in which various parameters play a role; among them are tf and idf, the two central ingredients of the classic measure tf-idf. tf is the term frequency: it measures how often a term occurs in a document. idf is the inverse document frequency: it counts in how many documents the term occurs and takes the reciprocal of that number, so the fewer documents a term appears in, the higher its idf. The measure tf-idf combines the two. The reason lies in the high frequency words: they have a high tf value but a low idf value, so they are not automatically preferred over other words.
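To make the definitions concrete, here is a minimal sketch in Python. It is our illustration, not the script behind the repository, and it uses the plain reciprocal idf described above; many implementations take a logarithm of it instead.

```python
from collections import Counter

def tf_idf(documents):
    """documents: a list of token lists.
    Returns one dict per document, mapping term -> tf-idf value."""
    n_docs = len(documents)
    df = Counter()            # document frequency: in how many documents does a term occur?
    for doc in documents:
        df.update(set(doc))
    scores = []
    for doc in documents:
        tf = Counter(doc)     # raw term frequency within the document
        # reciprocal idf (n_docs / df); log(n_docs / df) is a common alternative
        scores.append({t: c * n_docs / df[t] for t, c in tf.items()})
    return scores

docs = ["the king spoke to the king".split(),
        "the god was pleased".split()]
print(tf_idf(docs)[0])
# 'the' occurs in both documents, so its idf is only 1;
# 'king' occurs twice in a single document and scores highest
```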
We computed this classic value for the ORAEC data and published it in our repository alongside the other statistical data: https://github.com/oraec/corpus_raw_data/blob/main/statistics/type_token_etc/tf_idf_lemma.json Each lemma, represented by its ID, has its own table with many pairs: each text in which the lemma occurs is assigned its tf-idf value. The table is sorted by the tf-idf value, so the most relevant text for a lemma comes first.
This is the classic use of the tf-idf value: you use it to pick out the matching texts for a given word. We also turned this around so that you can find the typical words for a given text. To do this, we simply asked which words have the largest tf-idf values within a text. We published a JSON for this as well: https://github.com/oraec/corpus_raw_data/blob/main/statistics/type_token_etc/tf_idf_text.json
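If you want to work with these files, a loading sketch could look like this. The layout is assumed from the description above, namely an outer mapping from lemma ID (or text ID) to a table of pairs already sorted by descending tf-idf; adjust the code if the real files differ.

```python
import json

# Local copies of the two JSONs linked above; assumed layout:
# {lemma_id: {text_id: tf_idf, ...}} and {text_id: {lemma_id: tf_idf, ...}},
# with the inner tables already sorted by descending tf-idf.
with open("tf_idf_lemma.json", encoding="utf-8") as f:
    by_lemma = json.load(f)
with open("tf_idf_text.json", encoding="utf-8") as f:
    by_text = json.load(f)

def top_entries(table, n=5):
    # The tables are sorted, so the first n pairs are the most relevant ones.
    return list(table.items())[:n]

print(top_entries(by_text["oraec1"]))  # the five most typical lemmata of oraec1
```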
Let’s take a look at which words the tf-idf measure declares typical for the first ORAEC texts:
At first glance, this is not very illuminating, since the top 5 mainly contain high frequency words. The other words, however, are indeed relevant for the respective text, because the old TLA offers exactly these words as keywords for these texts. ḥm, “Majestät” (“majesty”), the word with the largest tf-idf value in oraec1, is the fourth most relevant keyword according to chi square (cf. https://aaew.bbaw.de/tla/servlet/s0?f=0&l=0&ff=10&ex=1&db=0&oc=20673&l1=0&wt=0&sr=0&mf=5&md=2&ss=0&mw=25) and the most relevant keyword according to log-likelihood (cf. https://aaew.bbaw.de/tla/servlet/s0?f=0&l=0&ff=10&ex=1&db=0&oc=20673&l1=0&wt=0&sr=0&mf=5&md=2&ss=1&mw=25). Ꜥꜣpp, “Apophis (Schlangengott, Götterfeind)” (“Apophis, serpent god and enemy of the gods”), which is among the top 5 of oraec3, is the most relevant keyword according to both chi square and log-likelihood (cf. https://aaew.bbaw.de/tla/servlet/s0?f=0&l=0&ff=10&ex=1&db=0&oc=19935&l1=0&wt=0&sr=0&mf=5&md=2&ss=0&mw=25 and https://aaew.bbaw.de/tla/servlet/s0?f=0&l=0&ff=10&ex=1&db=0&oc=19935&l1=0&wt=0&sr=0&mf=5&md=2&ss=1&mw=25). Likewise, nn, “[Negationspartikel]” (“negation particle”), and tw, “du; dich [Enkl. Pron. sg.2.m.]; du; dich [Enkl. Pron. sg.2.f.]” (“you, enclitic pronoun 2nd person sg. m./f.”), are in the top 5 both of the tf-idf values of oraec3 and of the keywords according to log-likelihood (cf. https://aaew.bbaw.de/tla/servlet/s0?f=0&l=0&ff=10&ex=1&db=0&oc=19935&l1=0&wt=0&sr=0&mf=5&md=2&ss=1&mw=25). The two words from the top 5 of oraec5 that are not high frequency words, namely sbḫ.t, “Portikus; Pforte; Palast; Krypte” (“portico; gate; palace; crypt”), and 2…10000.nw, “[Ordinalzahl in Ziffernschreibung mit Bildungselement -nw]” (“ordinal number written in figures with the element -nw”), are the two most relevant keywords according to log-likelihood (cf. https://aaew.bbaw.de/tla/servlet/s0?f=0&l=0&ff=10&ex=1&db=0&oc=19922&l1=0&wt=0&sr=0&mf=5&md=2&ss=1&mw=25).
O.k., is it possible to adjust the measure so that the high frequency words are no longer preferred? Wikipedia mentions different ways to determine the term frequency, among them one designed “to prevent a bias towards longer documents”, which could be a reason for the preference of the high frequency words. This variant, double normalization 0.5, is defined as follows:
\[tf(t,d) = 0.5 + 0.5 \cdot \frac{f_{t,d}}{\max_{t' \in d} f_{t',d}}\]
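Translated directly into Python, the modified term frequency could look like this (again a sketch of ours, not the repository script); the 0.5 offset guarantees that every attested term scores at least half as much as the most frequent term of its document:

```python
from collections import Counter

def double_norm_tf(doc):
    """Double normalization 0.5: raw counts are compressed into
    the range [0.5, 1.0], relative to the most frequent term of the document."""
    counts = Counter(doc)
    max_count = max(counts.values())
    return {t: 0.5 + 0.5 * c / max_count for t, c in counts.items()}
```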
With this modified tf, different tf-idf values result, of course. These can be found, sorted by word and by text, here: https://github.com/oraec/corpus_raw_data/blob/main/statistics/type_token_etc/double_normalization_idf_lemma.json and https://github.com/oraec/corpus_raw_data/blob/main/statistics/type_token_etc/double_normalization_idf_text.json. This results in the following top 5 for the first five ORAEC texts:
(Tables: top 5 words by double normalization 0.5 for oraec1, oraec2, oraec3, oraec4 and oraec5.)
Due to the normalization, the high frequency words have disappeared; instead, only words with few occurrences appear. Moreover, each of these words is attested only in the text in whose top 5 it appears. The TLA also records some of these words as keywords: wr-n-Mꜥ, “Großer der Ma” (“great one of the Ma”), Tꜣy=f-nḫt.t, “Tefnacht”, Wḫ, “Uch (Gott von Kusae)” (“Uch, god of Cusae”), tꜣ-n-nḥḥ, “Land der Ewigkeit (Bez. des Totenreiches)” (“land of eternity, a designation of the realm of the dead”) and pr-ꜥꜣ, “der Pharao (verschiedene Götter)” (“the pharaoh, a designation of various gods”). Wḫ, “Uch (Gott von Kusae)”, is even the most relevant keyword of oraec4 in the TLA according to chi square (cf. https://aaew.bbaw.de/tla/servlet/s0?f=0&l=0&ff=10&ex=1&db=0&oc=19961&l1=0&wt=0&sr=0&mf=5&md=2&ss=0&mw=25). While the normal tf-idf corresponded rather with the keywords according to log-likelihood, double normalization 0.5 corresponds rather with those according to chi square.
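The shift is easy to reproduce on toy data. The following sketch (our own illustration with invented mini-documents, not the ORAEC data, and again with the reciprocal idf) prints the top word of each document under both measures:

```python
from collections import Counter

def double_norm_tf(doc):
    counts = Counter(doc)
    max_count = max(counts.values())
    return {t: 0.5 + 0.5 * c / max_count for t, c in counts.items()}

docs = {
    "docA": ["the"] * 10 + ["king"] * 4 + ["apophis"],
    "docB": ["the"] * 8 + ["god"] * 3,
}
n_docs = len(docs)
df = Counter()                          # document frequency of each term
for doc in docs.values():
    df.update(set(doc))

for name, doc in docs.items():
    classic = {t: c * n_docs / df[t] for t, c in Counter(doc).items()}
    dn = {t: v * n_docs / df[t] for t, v in double_norm_tf(doc).items()}
    print(name, "classic:", max(classic, key=classic.get),
          "| double normalization 0.5:", max(dn, key=dn.get))
# docA classic: the | double normalization 0.5: king
# docB classic: the | double normalization 0.5: god
```

The ubiquitous “the” wins under the classic measure, while the corpus-rare words overtake it once the term frequency is compressed into [0.5, 1.0].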
The normalization seems to overshoot a bit, though. This is a nice task for you guys! Can you find the right degree of normalization? We’ve only shown the two extremes here: normal tf-idf favors the high frequency words, double normalization 0.5 the rare ones. This sounds like a nice topic for a thesis: finding a modification of tf-idf that adequately represents the ORAEC data. Guys, go ahead and have fun!