Let’s continue with the statistics. This time it’s tf-idf, a classic measure from text retrieval. Suppose you have a large collection of texts that you want to search for a given term. If the term appears in several texts, the question is how to rank the hit list: which texts are more relevant than others? This is a typical problem for search engines. Google uses a sophisticated algorithm in which various parameters play a role; among them are tf and idf, the two central ingredients of the classic measure tf-idf. tf is the term frequency: it measures how often a term occurs in a document. idf is the inverse document frequency: it counts in how many documents the term occurs and takes the reciprocal of that number, so the fewer documents a term appears in, the higher its idf. The measure tf-idf combines the two. The reason lies in the high frequency words: they have a high tf value but a low idf value, so they are not automatically preferred over other words.
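To make the definitions concrete, here is a minimal sketch in Python. It is our illustration, not the script behind the repository, and it uses the plain reciprocal idf described above; many implementations take a logarithm of it instead.

```python
from collections import Counter

def tf_idf(documents):
    """documents: a list of token lists.
    Returns one dict per document, mapping term -> tf-idf value."""
    n_docs = len(documents)
    df = Counter()            # document frequency: in how many documents does a term occur?
    for doc in documents:
        df.update(set(doc))
    scores = []
    for doc in documents:
        tf = Counter(doc)     # raw term frequency within the document
        # reciprocal idf (n_docs / df); log(n_docs / df) is a common alternative
        scores.append({t: c * n_docs / df[t] for t, c in tf.items()})
    return scores

docs = ["the king spoke to the king".split(),
        "the god was pleased".split()]
print(tf_idf(docs)[0])
# 'the' occurs in both documents, so its idf is only 1;
# 'king' occurs twice in a single document and scores highest
```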
We computed this classic value for the ORAEC data and published it in our repository alongside the other statistical data: https://github.com/oraec/corpus_raw_data/blob/main/statistics/type_token_etc/tf_idf_lemma.json Each lemma, represented by its ID, has its own table with many pairs: each text in which the lemma occurs is assigned its tf-idf value. The table is sorted by the tf-idf value, so the most relevant text for a lemma comes first.
This is the classic use of the tf-idf value: you use it to pick out the matching texts for a given word. We also turned this around so that you can find the typical words for a given text. To do this, we simply asked which words have the largest tf-idf values within a text. We published a JSON for this as well: https://github.com/oraec/corpus_raw_data/blob/main/statistics/type_token_etc/tf_idf_text.json
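If you want to work with these files, a loading sketch could look like this. The layout is assumed from the description above, namely an outer mapping from lemma ID (or text ID) to a table of pairs already sorted by descending tf-idf; adjust the code if the real files differ.

```python
import json

# Local copies of the two JSONs linked above; assumed layout:
# {lemma_id: {text_id: tf_idf, ...}} and {text_id: {lemma_id: tf_idf, ...}},
# with the inner tables already sorted by descending tf-idf.
with open("tf_idf_lemma.json", encoding="utf-8") as f:
    by_lemma = json.load(f)
with open("tf_idf_text.json", encoding="utf-8") as f:
    by_text = json.load(f)

def top_entries(table, n=5):
    # The tables are sorted, so the first n pairs are the most relevant ones.
    return list(table.items())[:n]

print(top_entries(by_text["oraec1"]))  # the five most typical lemmata of oraec1
```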
Let’s take a look at which words the tf-idf measure declares typical for the first ORAEC texts:
At first glance, this is not very illuminating, since the top 5 mainly contain high frequency words. The other words, however, are indeed relevant for the respective text, because the old TLA offers exactly these words as keywords for these texts. ḥm, “Majestät” (“majesty”), the word with the largest tf-idf value in oraec1, is the fourth most relevant keyword according to chi square (cf. https://aaew.bbaw.de/tla/servlet/s0?f=0&l=0&ff=10&ex=1&db=0&oc=20673&l1=0&wt=0&sr=0&mf=5&md=2&ss=0&mw=25) and the most relevant keyword according to log-likelihood (cf. https://aaew.bbaw.de/tla/servlet/s0?f=0&l=0&ff=10&ex=1&db=0&oc=20673&l1=0&wt=0&sr=0&mf=5&md=2&ss=1&mw=25). Ꜥꜣpp, “Apophis (Schlangengott, Götterfeind)” (“Apophis, serpent god and enemy of the gods”), which is among the top 5 of oraec3, is the most relevant keyword according to both chi square and log-likelihood (cf. https://aaew.bbaw.de/tla/servlet/s0?f=0&l=0&ff=10&ex=1&db=0&oc=19935&l1=0&wt=0&sr=0&mf=5&md=2&ss=0&mw=25 and https://aaew.bbaw.de/tla/servlet/s0?f=0&l=0&ff=10&ex=1&db=0&oc=19935&l1=0&wt=0&sr=0&mf=5&md=2&ss=1&mw=25). Likewise, nn, “[Negationspartikel]” (“negation particle”), and tw, “du; dich [Enkl. Pron. sg.2.m.]; du; dich [Enkl. Pron. sg.2.f.]” (“you, enclitic pronoun 2nd person sg. m./f.”), are in the top 5 both of the tf-idf values of oraec3 and of the keywords according to log-likelihood (cf. https://aaew.bbaw.de/tla/servlet/s0?f=0&l=0&ff=10&ex=1&db=0&oc=19935&l1=0&wt=0&sr=0&mf=5&md=2&ss=1&mw=25). The two words from the top 5 of oraec5 that are not high frequency words, namely sbḫ.t, “Portikus; Pforte; Palast; Krypte” (“portico; gate; palace; crypt”), and 2…10000.nw, “[Ordinalzahl in Ziffernschreibung mit Bildungselement -nw]” (“ordinal number written in figures with the element -nw”), are the two most relevant keywords according to log-likelihood (cf. https://aaew.bbaw.de/tla/servlet/s0?f=0&l=0&ff=10&ex=1&db=0&oc=19922&l1=0&wt=0&sr=0&mf=5&md=2&ss=1&mw=25).
O.k., is it possible to adjust the measure so that the high frequency words are no longer preferred? Wikipedia mentions different ways to determine the term frequency, among them one designed “to prevent a bias towards longer documents”, which could be a reason for the preference of the high frequency words. This variant, double normalization 0.5, is defined as follows:
\[tf(t,d) = 0.5 + 0.5 \cdot \frac{f_{t,d}}{\max_{t' \in d} f_{t',d}}\]
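Translated directly into Python, the modified term frequency could look like this (again a sketch of ours, not the repository script); the 0.5 offset guarantees that every attested term scores at least half as much as the most frequent term of its document:

```python
from collections import Counter

def double_norm_tf(doc):
    """Double normalization 0.5: raw counts are compressed into
    the range [0.5, 1.0], relative to the most frequent term of the document."""
    counts = Counter(doc)
    max_count = max(counts.values())
    return {t: 0.5 + 0.5 * c / max_count for t, c in counts.items()}
```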
With this modified tf, different tf-idf values result, of course. These can be found, sorted by word and by text, here: https://github.com/oraec/corpus_raw_data/blob/main/statistics/type_token_etc/double_normalization_idf_lemma.json and https://github.com/oraec/corpus_raw_data/blob/main/statistics/type_token_etc/double_normalization_idf_text.json. This results in the following top 5 for the first five ORAEC texts:
(Tables: top 5 words by double normalization 0.5 for oraec1, oraec2, oraec3, oraec4 and oraec5.)
Due to the normalization, the high frequency words have disappeared; instead, only words with few occurrences appear. Moreover, each of these words is attested only in the text in whose top 5 it appears. The TLA also records some of these words as keywords: wr-n-Mꜥ, “Großer der Ma” (“great one of the Ma”), Tꜣy=f-nḫt.t, “Tefnacht”, Wḫ, “Uch (Gott von Kusae)” (“Uch, god of Cusae”), tꜣ-n-nḥḥ, “Land der Ewigkeit (Bez. des Totenreiches)” (“land of eternity, a designation of the realm of the dead”) and pr-ꜥꜣ, “der Pharao (verschiedene Götter)” (“the pharaoh, a designation of various gods”). Wḫ, “Uch (Gott von Kusae)”, is even the most relevant keyword of oraec4 in the TLA according to chi square (cf. https://aaew.bbaw.de/tla/servlet/s0?f=0&l=0&ff=10&ex=1&db=0&oc=19961&l1=0&wt=0&sr=0&mf=5&md=2&ss=0&mw=25). While the normal tf-idf corresponded rather with the keywords according to log-likelihood, double normalization 0.5 corresponds rather with those according to chi square.
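The shift is easy to reproduce on toy data. The following sketch (our own illustration with invented mini-documents, not the ORAEC data, and again with the reciprocal idf) prints the top word of each document under both measures:

```python
from collections import Counter

def double_norm_tf(doc):
    counts = Counter(doc)
    max_count = max(counts.values())
    return {t: 0.5 + 0.5 * c / max_count for t, c in counts.items()}

docs = {
    "docA": ["the"] * 10 + ["king"] * 4 + ["apophis"],
    "docB": ["the"] * 8 + ["god"] * 3,
}
n_docs = len(docs)
df = Counter()                          # document frequency of each term
for doc in docs.values():
    df.update(set(doc))

for name, doc in docs.items():
    classic = {t: c * n_docs / df[t] for t, c in Counter(doc).items()}
    dn = {t: v * n_docs / df[t] for t, v in double_norm_tf(doc).items()}
    print(name, "classic:", max(classic, key=classic.get),
          "| double normalization 0.5:", max(dn, key=dn.get))
# docA classic: the | double normalization 0.5: king
# docB classic: the | double normalization 0.5: god
```

The ubiquitous “the” wins under the classic measure, while the corpus-rare words overtake it once the term frequency is compressed into [0.5, 1.0].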
The normalization seems to overshoot a bit, though. This is a nice task for you guys! Can you find the right degree of normalization? We’ve only shown the two extremes here: normal tf-idf favors the high frequency words, double normalization 0.5 the rare ones. This sounds like a nice topic for a thesis: finding a modification of tf-idf that adequately represents the ORAEC data. Guys, go ahead and have fun!