eFontes. Part of Speech Tagging and Lemmatization of Medieval Latin Texts.
A Cross-Genre Survey

Krzysztof Nowak
Institute of Polish Language (Polish Academy of Sciences)
krzysztof.nowak@ijp.pan.pl

Jędrzej Ziębura
AGH University / Enelpol
ziebura.jedrzej@gmail.com

Krzysztof Wróbel
Jagiellonian University / Enelpol
krzysztof@wrobel.pro

Aleksander Smywiński-Pohl
AGH University / Enelpol
apohllo@o2.pl
Abstract

This study introduces the eFontes models for automatic linguistic annotation of Medieval Latin texts, focusing on lemmatization, part-of-speech tagging, and morphological feature determination. Using the Transformers library, these models were trained on Universal Dependencies (UD) corpora and the newly developed eFontes corpus of Polish Medieval Latin. The research evaluates the models’ performance, addressing challenges such as orthographic variations and the integration of Latinized vernacular terms. The models achieved high accuracy rates: lemmatization at 92.60%, part-of-speech tagging at 83.29%, and morphological feature determination at 88.57%. The findings underscore the importance of high-quality annotated corpora and propose future enhancements, including extending the models to Named Entity Recognition.

1 Introduction

After the decline of the Roman Empire, Latin continued to be widely utilized throughout Europe for over ten centuries. Latin served as the primary mode of communication across diverse social and cultural contexts, ranging from church and secular administrative records to scholarly writings in burgeoning universities, historical narratives, literary works, religious poetry, and liturgical texts. Given the critical role of Latin writings in understanding European culture and history during the Middle Ages, their sheer volume necessitates the application of distant reading techniques for their analysis and better comprehension.

Developing a pipeline for the automatic processing of Latin texts is far from straightforward due to the language’s diverse applications and extensive history. In recent years, numerous studies have focused on developing methods for processing historical languages, including Latin. However, the majority of these studies have concentrated on Classical and Late Latin (see section 2 below).

The specificity of Medieval Latin (Stotz, 1996-2004) may pose substantial challenges for processing tools. Beyond its vast range of uses, this is mainly due to its pan-European nature as a lingua franca throughout almost all of medieval Europe. The standardization of Latin across a region spanning from Sweden to Italy and from Poland to Portugal occurred as the language was disseminated through education and utilized predominantly in formal contexts. Despite the overarching influence of the church or secular authorities in unifying the language, significant regional variations also emerged. Additionally, Latin interacted closely with vernacular languages, impacting many aspects of their pronunciation, grammar, vocabulary and syntax. Furthermore, Medieval Latin was frequently intertwined with vernacular languages in manuscripts. Consequently, the documents and court records produced during the Middle Ages often contain a significant number of words in medieval Polish, German, or English (Goyens and Verbeke, 2003).

This paper introduces a family of eFontes models for automatic annotation of Medieval Latin texts. Based on the Transformers library (Wolf et al., 2020), they include models for context-independent lemmatization, part-of-speech tagging and morphological features tagging. The models were trained on a set of publicly available Universal Dependencies corpora (henceforth: UD corpora) and a new language resource, namely the eFontes corpus of Polish Medieval Latin. In section 2, we briefly summarize the research on automatic annotation of Latin texts. In sections 3 and 4, we present the datasets used in training and evaluation and our system. In section 5, we evaluate the performance of the models and provide a thorough error analysis for the lemmatization and PoS tagging tasks. In section 6, we provide a concise summary of the results, propose areas for improving the models, and outline plans for extending the models to Named Entity Recognition.

2 Previous Work

In their recent survey, Sommerschield et al. (2023) observe that the research on the automatic processing of ancient languages has significantly accelerated in recent years. Out-of-the-box models for user-friendly processing of Latin texts are integrated into popular frameworks like Classical Language Toolkit (Johnson et al., 2021) and SpaCy. The latter now offers several packages with pre-trained models for Latin, such as spacy-udpipe (https://spacy.io/universe/project/spacy-udpipe) and LatinCy (https://spacy.io/universe/project/latincy).

As far as custom solutions are concerned, the biennial LT4HALA workshops (Language Technology for Historical and Ancient Languages; Sprugnoli et al., 2020, 2022) have proved to be an important stimulus and a platform for presenting progress in the field. At the EvaLatin competition hosted during the 2022 edition of the workshop, Wróbel and Nowak (2022) presented the Cracovia system, which outperformed other contributions in every tagging task on both the cross-genre and the cross-time corpus. The present solution builds on this work, although new systems have continued to appear since.

Recently, Riemenschneider and Frank (2023) conducted a thorough evaluation of existing systems across various natural language processing tasks, such as part-of-speech tagging and lemmatization. Their custom models for Latin outperformed the Cracovia tagger, except for the cross-genre and cross-time subtasks, where the open modality system of Wróbel and Nowak (2022) exhibited better performance.

The challenge of applying tools built for Classical Latin in tagging medieval texts was recognized early on. In the Omnia project, for example, TreeTagger was trained on a manually annotated corpus of medieval Latin on top of previously available Classical Latin parameters (Bon, 2011).

Eger et al. (2016) compared the performance of the solutions existing at the time, in lemmatization and part-of-speech tagging of medieval capitularies, and showed how the performance of the system may be enhanced by incorporating word embeddings and lexicon rules into the picture. Their study revealed a notable decrease in accuracy (below 90%) when a tagger trained on Classical or Late Latin corpus was applied to medieval texts. Building upon this research, Kestemont and De Gussem (2016) introduced a joint learning solution for the integrated approach to lemmatization and part-of-speech tagging. Their study showed, among other insights, the impact of normalizing orthographic variations in medieval texts on the accuracy of the system.

Importantly, significant advancements have also been made in processing medieval varieties of Romance languages, such as Old French (Camps et al., 2021).

3 Latin Datasets Used

As demonstrated in section 4 below, the models presented in this paper were trained and evaluated using a set of publicly available UD corpora and the eFontes corpus of Polish Medieval Latin. Since we aimed to assess the impact of using data from different periods and areas of Latin language use, we provide a more detailed presentation of the corpora in the following section, focusing on the features relevant to this study.

3.1 UD Corpora

The UD-conformant corpora utilized in training and evaluation of the models include:

  • The Perseus corpus contains works composed between the 1st century BCE and the 2nd century CE, with the exception of Jerome’s translation of the Vulgate from the 4th–5th century CE. The dataset includes fragments of a historical work by Tacitus, Cicero’s speeches, Phaedrus’ Fables and poetical works of Propertius.

  • The PROIEL corpus (Haug and Jøhndal, 2008) consists of selected books of Jerome’s Vulgate, and selected fragments of the historical work of Caesar, Cicero’s letters, as well as scholarly treatises by Palladius (on agriculture from the 4th century CE) and Cicero (de officiis written in the 1st century BCE).

  • The LLCT (Late Latin Charter Treebank) (Cecchini et al., 2020b) is a large collection of medieval charters written between the 8th and the 9th century in Tuscany.

  • The ITTB (Index Thomisticus Treebank) consists of the annotated works of Thomas Aquinas written in the 13th century (Cecchini et al., 2018).

  • Finally, the UDante Treebank (Cecchini et al., 2020a) includes both prosaic (treatises on linguistics, politics, and a selection of letters) and poetical (Eclogues) works of Dante Alighieri.

Corpus    Time (centuries)   Place          Text type               Tokens    Sentences   Avg
PROIEL    1 BCE - 5 CE       Roman Empire   various                 177 558   16 196      10.96
Perseus   1 - 5 CE           Roman Empire   various                 18 425    1 334       13.81
LLCT      8 - 9 CE           Italy          charters                390 819   7 289       26.64
ITTB      13 CE              Italy          theological treatises   390 819   22 775      17.16
UDante    14 CE              Italy          various                 30 566    926         33.01
Table 1: Number of tokens, sentences and average number of tokens in a sentence in Universal Dependencies corpora used in the scenarios involving UD data.

Regarding their linguistic features (see Table 1), the Perseus and the PROIEL corpora represent mainly Classical, Post-Classical and Late Latin, whereas the other three corpora consist of texts written during the Middle Ages. The Perseus and PROIEL corpora exhibit relatively heterogeneous content, encompassing both prose and poetry across various genres (e.g., speeches, treatises, Bible translations) and covering a wide array of topics ranging from history to philosophy to agriculture. The LLCT treebank exclusively focuses on one genre, namely charters, while both UDante and ITTB contain texts attributed to a single author. The pre-medieval texts were composed within the boundaries of the Roman Empire, whereas medieval charters and the works by Thomas Aquinas and Dante were penned in what is now Italy.

Therefore, at first glance, the language represented in the UD treebanks appears to be relatively different from that of Polish Medieval Latin texts from the representative eFontes corpus, even if we acknowledge that medieval Latin tended to be more conservative and less prone to change due to the stabilizing impact of writing.

3.2 The eFontes Corpus

The eFontes corpus has been compiled since 2013 and is expected to contain over 15 million tokens by mid-2024. It comprises texts composed between 1000 and 1550 on the territory of the Kingdom of Poland. The corpus’s representativeness is carefully monitored with regard to time, place, and text types.

The corpus is planned to be further expanded in the coming years, facilitated by new critical editions and the broader adoption of Handwritten Text Recognition technology. This, coupled with its potential for linguistic and historical research, underscores the importance of automatic annotation.

Genre         Tokens   Sentences   Avg
Annals        895      33          27.12
Biography     8994     298         30.18
Normative     3142     115         27.32
Proceedings   7189     389         16.48
Science       1990     106         18.74
Table 2: Number of tokens, sentences and average number of tokens in sentence in data used in cross-validation.

For training and evaluation purposes, a small manually annotated gold corpus was prepared based on texts from the eFontes corpus. The composition of the dataset reflects the most prominent text types of the corpus and its domain and register variation (see Table 2). The gold corpus comprises the following genres:

  • Annals The genre is represented by the Annals of the Cistercian Order of Henrykow from the 13th/14th century.

  • Biography The subcorpus includes samples of popular hagiographic works:

    • the Life of Anne, Duchess of Silesia (Lat. Vita Annae ducissae Silesiae) from the second half of the 13th century,

    • the Miracles of Saint Adalbert (Lat. Miracula Sancti Adalberti) from the end of the 13th century, and

    • the Life of Saint Kinga (Lat. Vita Sanctae Kyngae) from the first half of the 14th century.

  • Normative The group consists of statutory texts concerning ecclesiastical law; in particular, it includes the so-called synodal statutes of Gniezno and Kraków from the beginning of the 15th century.

  • Science Scientific writing is represented by Vitello’s technical treatise on optics completed by the end of the 13th century (Lat. Perspectivae liber primus).

  • Proceedings The sub-corpus comprises a selection of records of courts of law and city councils and includes:

    • the book of the city of Lviv from the end of the 14th century;

    • the books of the court of Kraków from the end of the 14th century;

    • the book of a small village in Southern Poland from the second half of the 15th century.

The text samples included in the gold corpus were manually annotated by two highly qualified philologists with expertise in medieval Latin linguistics and history. The part of speech, lemma and morphosyntactic features were annotated based on guidelines which followed the UD model (the guidelines are to be published at https://scriptores.pl/efontes). As mentioned earlier, the dataset was designed to reflect the diatopic, diachronic, and diastratic variation of Latin written production in Poland. It includes texts with features that pose systematic challenges for automatic processing tools, such as a large number of Latinized or non-Latinized Polish personal names, scientific and legal terminology, numerous medieval place names, and vernacular insertions ranging from single words to multi-word phrases. Moreover, the orthography of the texts is anything but consistent, with variations stemming from medieval scribal practices and modern editorial policies.

4 System Description

4.1 Training scenarios

In order to assess the influence of different training data on the results, and specifically the necessity of using a custom Medieval Latin corpus, we designed several training scenarios.

In the first scenario (baseline), the foundation models are fine-tuned only on the data from the eFontes corpus. The training procedure follows a cross-validation scheme, where each eFontes sub-corpus in turn is treated as the testing set and the remaining sub-corpora are used for training and validation. Since there are 3 tasks and 5 sub-corpora, the procedure yields 15 separate models, each of which is evaluated on its own held-out test set.
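The cross-validation scheme of the baseline scenario can be sketched as follows. This is only an illustration of the leave-one-sub-corpus-out idea: the task labels are borrowed from Table 5, the function and variable names are hypothetical, and the way the remaining data is divided between training and validation is not specified in the text, so it is left as a single pool here.

```python
from itertools import product

# Sub-corpora of the gold corpus (Table 2) and task labels (Table 5).
SUBCORPORA = ["Annals", "Biography", "Normative", "Proceedings", "Science"]
TASKS = ["UPOS", "UFeats", "Lemma"]


def leave_one_out_folds(subcorpora):
    """Yield (held_out, remaining) pairs, one per sub-corpus."""
    for held_out in subcorpora:
        remaining = [name for name in subcorpora if name != held_out]
        yield held_out, remaining


# One fine-tuning run per (task, held-out sub-corpus) pair.
# How "remaining" is split into training and validation data is an open
# assumption; the pool is kept together in this sketch.
runs = [
    {"task": task, "test": held_out, "train_and_validation": remaining}
    for task, (held_out, remaining) in product(TASKS, leave_one_out_folds(SUBCORPORA))
]

for run in runs[:3]:
    print(run["task"], "tested on", run["test"], "trained on", run["train_and_validation"])
print(len(runs), "models in total")
```

Holding out an entire genre, rather than random sentences, means that every reported score reflects performance on a text type the model has not seen in the eFontes data.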

In the second scenario (UD all), the foundation models are fine-tuned on all the data from the UD Latin corpora (no normalization of spelling is applied) and tested for all tasks on all the sub-corpora. This scenario yields 3 models: one for each task. The scenario is designed to answer whether it is necessary to fine-tune the model on the eFontes corpus, or if training solely on previously available data would be sufficient.

The third scenario involves using all the UD Latin corpora in the initial fine-tuning step. Subsequently, the model undergoes further fine-tuning on a specific UD subcorpus only (referred to as UD + specific UD corpus name). With 5 available UD corpora and 3 tasks, this scenario yields 15 models, which are then evaluated on the individual eFontes sub-corpora. This results in 75 evaluation outcomes. The scenario was designed to determine whether and which of the UD corpora exhibit the greatest similarities with the eFontes domain-based sub-corpora, thereby making them more valuable for a specific set of documents.

The last scenario (UD + eFontes) involves using all the data from the UD Latin corpora in the initial fine-tuning step. Then, similar to the baseline scenario, the model undergoes further fine-tuning. The key difference between the baseline scenario and this one is that the previous one uses the original foundation model, while this scenario employs a model that has already been fine-tuned on the UD Latin dataset. This scenario is designed to determine the impact of additional training data on the linguistic analysis tasks. Similar to the baseline scenario, it yields 15 individual models.
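As a quick check of the bookkeeping across the four scenarios, the number of trained models and evaluation outcomes can be tallied as below. The counts for models follow the text; the evaluation counts for the baseline, UD all and UD + eFontes scenarios are derived here from the description (each model scored on its held-out sub-corpus, or on all five sub-corpora) and are not stated explicitly in the paper.

```python
N_TASKS = 3    # UPOS, UFeats, lemmatization
N_EFONTES = 5  # eFontes sub-corpora (Table 2)
N_UD = 5       # UD corpora (Table 1)

scenarios = {
    # One model per task and held-out sub-corpus, scored on that sub-corpus.
    "baseline": {
        "models": N_TASKS * N_EFONTES,
        "evaluations": N_TASKS * N_EFONTES,
    },
    # One model per task, scored on every eFontes sub-corpus.
    "UD all": {
        "models": N_TASKS,
        "evaluations": N_TASKS * N_EFONTES,
    },
    # One model per task and UD corpus, scored on every eFontes sub-corpus.
    "UD + specific UD corpus": {
        "models": N_TASKS * N_UD,
        "evaluations": N_TASKS * N_UD * N_EFONTES,
    },
    # Same cross-validation layout as the baseline, different starting model.
    "UD + eFontes": {
        "models": N_TASKS * N_EFONTES,
        "evaluations": N_TASKS * N_EFONTES,
    },
}

for name, counts in scenarios.items():
    print(f"{name}: {counts['models']} models, {counts['evaluations']} evaluations")
```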

4.2 Model architecture

The architecture of all trained models is based on the transformer, as this family of models yields state-of-the-art results in part-of-speech tagging, morphological feature determination and lemmatization (Van Nguyen et al., 2021). The work builds on the morphosyntactic tagger KFTT (Wróbel, 2020), which won the PolEval 2020 task 1 contest (Morphosyntactic tagging of Middle, New and Modern Polish), as well as on the results presented during the EvaLatin 2022 competition by Wróbel and Nowak (2022).

4.2.1 POS and Morphological Features Tagging

Part-of-speech and morphological features tagging tasks are addressed with a transformer encoder-only model with a token classification head on top.

Parameter         Value
batch size        12
epochs            10
learning rate     2e-5
sequence length   256
Table 3: The training parameters for Part of Speech Tagging and Morphological Features Determination tasks.

First, the transformer returns a contextual embedding of each token; then a linear layer with a softmax activation returns normalized scores for each tag present in the training corpus. During training, both the pre-trained model and the classification head are updated, so the training uses a full backpropagation procedure. During pre-training, XLM-R uses only masked language modelling (MLM) as a training task, so the model initially has no knowledge regarding the possible parts of speech or morphological tags (beyond knowledge that is extracted from the MLM task itself). For the baseline model, which utilized only the eFontes corpus, the classification head is initialized randomly. For all the remaining scenarios, the model is already fine-tuned for the same task, but using a different corpus, as described in the previous section. The important parameters of the training are given in Table 3. The remaining parameters were the defaults from the Hugging Face library.
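To make the role of the classification head concrete, the following toy sketch applies a randomly initialized linear layer and a softmax to a single token embedding. It is written in plain Python for illustration only and is not the implementation used in the experiments: the embedding stands in for the encoder output, and the hidden size, tag list and class name are hypothetical.

```python
import math
import random


def softmax(scores):
    """Turn raw scores into probabilities that sum to 1."""
    shift = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


class TokenClassificationHead:
    """Linear layer + softmax over the tags seen in the training corpus."""

    def __init__(self, hidden_size, tags, seed=0):
        rng = random.Random(seed)
        self.tags = list(tags)
        # Random initialization, as for the baseline scenario.
        self.weights = [
            [rng.gauss(0.0, 0.02) for _ in range(hidden_size)] for _ in self.tags
        ]
        self.bias = [0.0] * len(self.tags)

    def forward(self, embedding):
        """Return one normalized score per tag for a token embedding."""
        scores = [
            sum(w * x for w, x in zip(row, embedding)) + b
            for row, b in zip(self.weights, self.bias)
        ]
        return softmax(scores)

    def predict(self, embedding):
        """Return the tag with the highest normalized score."""
        probs = self.forward(embedding)
        best = max(range(len(probs)), key=probs.__getitem__)
        return self.tags[best]


# Toy usage: a 4-dimensional vector stands in for the contextual embedding.
head = TokenClassificationHead(hidden_size=4, tags=["NOUN", "VERB", "ADJ", "SYM"])
token_embedding = [0.3, -1.2, 0.7, 0.05]
print(head.forward(token_embedding))
print(head.predict(token_embedding))
```

Before fine-tuning, the scores of such a head are close to uniform, which is why the baseline depends entirely on the eFontes data to learn the tag inventory.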

Parameter                Value
batch size               128
epochs                   5
input sequence length    48
output sequence length   24
learning rate            0.001
Table 4: The training parameters for Lemmatization tasks.

4.2.2 Lemmatization

Lemmatization is a different task with respect to the model architecture: in general, it requires the model to produce an arbitrary sequence of letters, depending on the input given. Taking that into account, we used the ByT5 small model (Xue et al., 2022), whose input consists of the individual bytes of a text. It is a text-to-text (or bytes-to-bytes) model, so it is well-suited for the task. We performed some additional experiments with sub-word models such as mT5 (Xue et al., 2021), but they clearly showed inferior performance.

The model receives as input the individual word to be lemmatized together with the predicted Part of Speech. The input is framed as the inflected word and the PoS tag separated by a colon, e.g. adducam:VERB. The model does not receive the morphological features nor any information about the context of the lemmatized word. The context is only indirectly reflected in the PoS tag provided.
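The input framing described above can be sketched in a few lines. The helper names are hypothetical, and treating the input sequence length of 48 from Table 4 as a limit on UTF-8 bytes is an assumption; the actual tokenizer also adds special tokens, which this sketch ignores.

```python
MAX_INPUT_BYTES = 48  # input sequence length from Table 4 (assumed to count bytes)


def build_lemmatizer_input(form, upos):
    """Join the inflected word and its predicted PoS tag with a colon."""
    return f"{form}:{upos}"


def to_byte_sequence(text, max_length=MAX_INPUT_BYTES):
    """Encode the text as UTF-8 bytes, truncated to the maximum length."""
    return list(text.encode("utf-8"))[:max_length]


example = build_lemmatizer_input("adducam", "VERB")
print(example)
print(to_byte_sequence(example))
```

Because the model sees only the word form and its tag, two occurrences of the same form with the same PoS tag always receive the same lemma, regardless of the sentence in which they appear.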

The important parameters of the model training are given in Table 4. The remaining parameters were the defaults from the Hugging Face library. The best models obtained in our experiment will be published at https://huggingface.co/efontes.

5 Results

5.1 System Performance

                 Biography                  Normative
                 UPOS    UFeats  Lemma      UPOS    UFeats  Lemma
baseline         95.43   82.06   80.75      95.81   81.53   85.05
UD all           90.19   52.85   86.14      92.04   50.32   85.81
UD + ITTB        89.58   32.02   87.48      91.66   67.96   85.85
UD + LLCT        89.87   69.84   86.65      91.66   67.96   86.26
UD + Perseus     90.34   72.68   84.42      92.01   69.87   83.74
UD + PROIEL      77.34   75.78   85.10      79.46   71.95   84.73
UD + UDante      90.20   33.56   86.30      91.69   34.44   85.08
UD + eFontes     96.10   84.86   88.37

                 Proceedings                Science                    Annals
                 UPOS    UFeats  Lemma      UPOS    UFeats  Lemma      UPOS    UFeats  Lemma
baseline         91.63   86.67   81.69      79.21   77.48   80.53      95.98   80.78   87.04
UD all           93.18   67.12   79.82      75.35   51.88   96.45      86.15   59.11   82.79
UD + ITTB        91.06   70.62   79.03      74.65   48.78   95.44      87.15   31.40   85.14
UD + LLCT        91.06   70.62   78.95      75.15   68.51   95.94      86.93   70.50   83.46
UD + Perseus     93.35   73.26   78.99      75.30   61.26   94.73      86.48   71.06   81.68
UD + PROIEL      80.04   74.23   79.87      61.56   87.37   94.12      75.08   76.09   81.79
UD + UDante      93.42   37.67   80.28      74.65   51.83   95.23      88.60   32.18   83.91
UD + eFontes     94.97   87.67   83.17      79.61   78.60   96.35      96.20   81.45   86.82
Table 5: The results of the various training scenarios described in Section 4. Green color indicates the best result, while red indicates the worst result for a given subcorpus-task combination. The results involving eFontes data (baseline and UD + eFontes) are reported for the cross-validation scheme, i.e. the evaluated model was trained on the data excluding the testing subcorpus. For the other scenarios, only one model was trained for each task, and the results show its performance on the different subcorpora.

The performance of the models for part of speech, morphological features, and lemma tagging tasks was assessed according to the scenarios described in Section 4. The performance for each eFontes sub-corpus was evaluated separately, as shown in Table 5.

Overall, the best results were obtained in the last scenario, where the model fine-tuned on the UD Latin data was further fine-tuned on the eFontes sub-corpora. In that scenario, lemmatization accuracy ranged from 83.17 (Proceedings) to 96.35 (Science). Morphological features tagging ranged from 78.60 for the Science sub-corpus to 87.67 for the Proceedings genre. Regarding POS tagging, the system achieved the highest accuracy in the Annals genre (96.20) and the lowest in the Science sub-corpus (79.61). The character of the errors and possible reasons for the poor performance of the models in some of the sub-corpora are discussed in Section 5.2.

The comparison with the baseline models demonstrates that additional training on the UD data typically yields a moderate improvement (1-3 percentage points). The most significant difference is observed for the lemmatization task for the Science sub-corpus, with an improvement of almost 16 percentage points.
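The differences between the baseline and the UD + eFontes scenario can be recomputed directly from Table 5. The sketch below does so for the three sub-corpora whose UD + eFontes row is complete in the table (Proceedings, Science and Annals); the function and variable names are hypothetical.

```python
# Accuracy values copied from Table 5.
BASELINE = {
    "Proceedings": {"UPOS": 91.63, "UFeats": 86.67, "Lemma": 81.69},
    "Science": {"UPOS": 79.21, "UFeats": 77.48, "Lemma": 80.53},
    "Annals": {"UPOS": 95.98, "UFeats": 80.78, "Lemma": 87.04},
}
UD_EFONTES = {
    "Proceedings": {"UPOS": 94.97, "UFeats": 87.67, "Lemma": 83.17},
    "Science": {"UPOS": 79.61, "UFeats": 78.60, "Lemma": 96.35},
    "Annals": {"UPOS": 96.20, "UFeats": 81.45, "Lemma": 86.82},
}


def improvement(baseline, tuned):
    """Return the difference in percentage points for each sub-corpus and task."""
    return {
        sub: {
            task: round(tuned[sub][task] - baseline[sub][task], 2)
            for task in baseline[sub]
        }
        for sub in baseline
    }


deltas = improvement(BASELINE, UD_EFONTES)
for sub, by_task in deltas.items():
    print(sub, by_task)
```

For these three sub-corpora, the gain is largest for Science lemmatization, while Annals lemmatization is the only cell in which the baseline remains slightly ahead.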

When comparing the results between the scenarios referred to as UD all and UD + UD corpus, it is interesting to note that in most cases, using all the available data is not the optimal choice. Instead, for many task–sub-corpus combinations, a model further trained on a specific UD corpus yields better results. This suggests significant variability between the data in the collected corpora and indicates that a more specific sub-corpus is better suited for achieving the best results. It is also possible that combining a specific UD corpus with eFontes data could yield better results for some of the eFontes sub-corpora. However, such a scenario would involve too many experimental combinations, so it was excluded from the setup.

In the Science sub-corpus, however, models trained without eFontes data came out ahead in two tasks. In the UFeats task, the model further fine-tuned on the PROIEL corpus achieved a score of 87.37, surpassing the model for the last scenario (78.60) by more than 8 percentage points. For lemmatization, the model fine-tuned on all UD corpora (96.45) showed only a minimal advantage of 0.10 percentage points over the last scenario (96.35).

In conclusion, the results highlight the significance of the availability of high-quality annotated corpora for improving the accuracy of models. While the advantages of training with eFontes data may not be immediately obvious for some genres and tasks (e.g., lemmatization for the Science and Annals genres), the difference is significant for the majority of them. In the next section, we discuss the main results of the qualitative error analysis for the lemmatization and PoS tagging tasks.

5.2 Qualitative Analysis

5.2.1 Lemmatization

Following the example set by Wróbel and Nowak (2022), we conducted an in-depth qualitative analysis to investigate the nature of tagging errors, identify their sources, and determine ways to improve the results.

Overall, the analysis revealed that a significant number of lemmatization and POS tagging errors could have been easily reduced by simplifying the task and harmonizing the training datasets. Trivial errors included, for example, frequent mislabeling of the SYM tokens: over 10% of the total number of lemmatization errors were found to be due to the way the model handled mathematical notation in the Science sub-corpus, which explains its low accuracy (see Figure 1).

In the gold corpus, tokens that represent elements of mathematical notation, such as points, lines, angles, and geometric shapes (e.g., A or CD), as well as fractions (e.g., 1amXI), were uniformly labeled as SYM, with their lemma set to an underscore (_). This decision was made to differentiate occurrences of symbolic tokens from those of "meaningful" words, particularly in instances where homonymy could occur, such as with the adposition ab. Another practical reason for this approach was to offer a simplified lemma representation for tokens whose interpretation might not be immediately clear. However, our model would frequently misclassify such tokens as nouns and assign them "meaningful" lemmas, such as resp., a, or cd. (In a somewhat similar manner, nearly 50 lemmatization errors resulted from the system replacing an editorial symbol with a plus sign (+).) In future versions of the model, it appears reasonable to consider restricting the lemmatization of symbolic units. For clarity, similar errors have been excluded from the discussion in the subsequent analysis.

Figure 1: Lemmatization task: genre distribution of errors.

Beyond these trivial instances, further examination revealed that lemmatization errors primarily arose from the handling of orthographic variation in medieval Latin texts and a significant presence of Latinized Polish terms and proper names in the annotated texts, for which the training corpora lacked sufficient data.

Initial             Middle              Final
pattern   count     pattern   count     pattern   count
u:v       289       u:v       281       a:us      107
k:c       95        t:c       260       z:us      100
i:j       66        a:        55        s:m       76
h:        18        k:c       47        o:us      59
a:        17        :h        30        m:s       46
Table 6: 5 most frequent lemma confusion patterns.

The most common group of errors (see Table 6) stems from the spelling of the Latin bilabial v. While the gold corpus renders it as u for both the consonant and the vowel, the model would yield v for the consonantal variant, so that the preferred uideo, ciuitas, and uiuo were replaced with video, civitas, and vivo.

The second prevalent group of errors is related to the spelling of the Latin CV group -ti-, which was often spelled as -ci- in medieval texts. To minimize spelling variance, in the eFontes corpus the group is always represented as -ti-. However, the model often opted to output -ci-: instead of laurentius, gratia, or pretiosus, it produced laurencius, gracia, or preciosus.

Third, the models substituted the letter k with c, despite the former being standard in medieval documents. This error particularly affects the “Proceedings” and “Biography” sub-corpora, where it occurs in the spelling of Polish proper names, such as kinga → cinga, thokarz → thocarz, and stassek → stassec.

The error types discussed so far account for nearly 40% of the lemmatization errors produced by the system. Other notable categories include:

  • substitution of the diphthongs ae and oe with e (daemon → demon, aequidisto → equidisto, or aeuum → evum; and dioecesanus → diocesanus, uesperae → vespera);

  • addition or omission of the h consonant despite the standardized spelling adopted in the gold corpus, for example, tomco → thomcus, platea → plathea, iohannes → joannes, and hungaria → ungaria.
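Confusion patterns of the kind listed in Table 6 can be obtained by aligning each gold lemma with the predicted one. The sketch below uses Python's difflib for the alignment. It is not necessarily the procedure used to produce Table 6: the gold:predicted reading of the patterns and the rule for assigning a position (initial if the difference starts at the first character, final if it reaches the last one, middle otherwise) are assumptions made for illustration.

```python
from collections import Counter
from difflib import SequenceMatcher


def confusion_patterns(gold, predicted):
    """Return (position, 'gold:predicted') pairs for each differing span."""
    patterns = []
    matcher = SequenceMatcher(None, gold, predicted, autojunk=False)
    for op, g1, g2, p1, p2 in matcher.get_opcodes():
        if op == "equal":
            continue
        # Position is decided on the gold lemma (assumed convention).
        if g1 == 0:
            position = "initial"
        elif g2 == len(gold):
            position = "final"
        else:
            position = "middle"
        patterns.append((position, f"{gold[g1:g2]}:{predicted[p1:p2]}"))
    return patterns


# (gold lemma, predicted lemma) pairs taken from the examples in the text.
pairs = [
    ("uideo", "video"),
    ("gratia", "gracia"),
    ("kinga", "cinga"),
    ("stassek", "stassec"),
    ("platea", "plathea"),
]

counts = Counter()
for gold, predicted in pairs:
    counts.update(confusion_patterns(gold, predicted))

for (position, pattern), count in counts.most_common():
    print(position, pattern, count)
```

An empty side of the pattern, as in :h, marks an insertion or omission rather than a substitution.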

The preference for specific spelling cannot be always traced back to the form of the token but should rather be attributed to the structure of the training datasets, which were not normalized or harmonized beforehand. For example, concerning the u:v alternation, the LLCT and UDante corpora exclusively use u for lemma forms, whereas the Perseus and PROIEL corpora employ both v and u to distinguish between consonantal and vowel variations. In the ITTB corpus, the letter v does not appear in either surface forms or lemmas.
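One simple form of harmonization would be to map the lemmas of all training corpora onto a single convention before fine-tuning. The following sketch, which was not part of the experiments reported here, rewrites v as u and j as i, assuming that this matches the convention of the eFontes gold corpus (as the u:v and i:j patterns in Table 6 suggest). The function name is hypothetical.

```python
# Character mapping towards the assumed gold-corpus convention.
_TO_GOLD_CONVENTION = str.maketrans({"v": "u", "j": "i", "V": "U", "J": "I"})


def harmonize_lemma(lemma):
    """Rewrite v as u and j as i, leaving all other characters untouched."""
    return lemma.translate(_TO_GOLD_CONVENTION)


for lemma in ["video", "civitas", "vivo", "joannes"]:
    print(lemma, "->", harmonize_lemma(lemma))
```

A mapping of this kind covers only the purely graphical alternations; variation such as -ti-:-ci- or ae:e cannot be resolved character by character and would require lexical information.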

Regarding the -ti-:-ci- alternation, it is observed exclusively in the LLCT corpus. Over 30 occurrences include proper names, such as Baruncio → Barontius and Laurencii → Laurentius, as well as common nouns, for example, presencia → praesentia and palacio → palatium. However, within the same corpus, these lemmas are more often spelled in a standardized manner, with forms like praesentia, Laurentii, and so forth.

The LLCT corpus, along with the eFontes corpus, is unique in providing data on the k:c alternation. Its attestations, however, are limited to occurrences of a single word only, namely the noun karitas : caritas. (The corpus includes numerous examples of the k spelling for both surface form and lemma, largely due to the prevalence of terms like kalendae. Notably, it includes 5 examples of the spelling variation calendae : kalendae.)

The model’s inability to correctly lemmatize words that, according to classical norm, should be spelled with a diphthong ae or oe is noteworthy, as the UD corpora seem to offer ample evidence for such normalization. Specifically, the LLCT corpus contains over 200 instances of the e:oe alternation, although the range of words concerned is limited primarily to forms of poena and oboedientia. Additionally, the corpus contains more than 1000 instances of the e:ae variance. Both types of alternation also appear, though less frequently, in the ITTB, UDante, and PROIEL corpora, but are virtually absent in the Perseus corpus.

The situation becomes even less clear regarding the spelling of the h consonant, since the UD corpora provide evidence of its usage in post-consonantal and intervocalic positions at the beginning and in the middle of a word.

5.2.2 Part of Speech tagging

Setting aside the errors arising from the misinterpretation of SYM tokens, the parts of speech most frequently mislabeled were adjectives, nouns, verbs, pronouns, adverbs, proper nouns, particles, and determiners (Figure 2). Many of these errors tend to recur in PoS tagging tasks, as demonstrated by Wróbel and Nowak (2022). They can often be traced back to the derivational relationships between words and seem to arise as a result of insufficient context, which does not allow for choosing between multiple interpretations.

Figure 2: The UPOS task: break-down of error types.

This phenomenon includes the frequent mislabeling of adjectives as participial forms of verbs. In the dataset analyzed, a common error involved the high-frequency phrase iudicium bannitum, a technical term denoting a category of court trials in medieval Poland. Although the adjective bannitus originates from the verb bannio, its adjectival meaning has become fully lexicalized, leading to its classification as an adjective in Medieval Latin dictionaries. A similar situation occurs with the adjective aequidistans ’equidistant’. The verb aequidisto, to which it is linked, is even less frequently attested in medieval data than the verb bannio discussed above.

Nouns were most frequently misidentified as adjectives, accounting for 47% of such errors in the analyzed data. This error impacted both deadjectival nouns attested already in Classical Latin, such as sanctus ’a saint’ and bonum ’good behaviour, deed etc.’, and medieval terms like grossus ’type of currency’, which are found in the UD corpora only as adjectives.

In the reverse scenario, participial forms of verbs, such as debitus or contractus, were incorrectly annotated as adjectives (accounting for 40% of errors) or as nouns (28% of errors).
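The percentages reported in this section read as shares of the errors made on a given gold tag that go to each wrong tag. Assuming that interpretation, a breakdown of this kind could be computed from aligned gold and predicted UPOS sequences as sketched below. The function name and the toy tag sequences are hypothetical and do not reproduce the data behind Figure 2.

```python
from collections import Counter, defaultdict


def error_breakdown(gold_tags, predicted_tags):
    """For each gold tag, return the share of its errors per wrong tag."""
    if len(gold_tags) != len(predicted_tags):
        raise ValueError("gold and predicted sequences must be aligned")

    # confusions[gold][predicted] counts mislabeled tokens only.
    confusions = defaultdict(Counter)
    for gold, predicted in zip(gold_tags, predicted_tags):
        if gold != predicted:
            confusions[gold][predicted] += 1

    breakdown = {}
    for gold, wrong in confusions.items():
        total = sum(wrong.values())
        # most_common() orders the wrong tags from most to least frequent.
        breakdown[gold] = {tag: n / total for tag, n in wrong.most_common()}
    return breakdown


# Toy sequences invented for the example.
gold = ["NOUN", "NOUN", "NOUN", "NOUN", "ADJ", "ADJ"]
pred = ["ADJ", "ADJ", "VERB", "NOUN", "VERB", "ADJ"]

for gold_tag, shares in error_breakdown(gold, pred).items():
    for wrong_tag, share in shares.items():
        print(f"{gold_tag} -> {wrong_tag}: {share:.0%} of {gold_tag} errors")
```

Correctly tagged tokens are left out of the denominator, so each gold tag's shares sum to 1 regardless of how accurate the tagger is on that tag overall.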

6 Conclusions

The models presented in this study were designed for the automatic annotation of medieval texts across a wide range of domains, varying levels of formality, genres, and communicative contexts. Although they have achieved satisfactory results in most tasks, further research is certainly needed to conduct a rigorous comparison with existing solutions.

An examination of the tagging errors revealed that a considerable portion of them were minor in nature, suggesting they could be easily remedied in future versions of the system, for example by simplifying the annotation of symbol units such as fractions and editorial marks. Other errors, in turn, stemmed from inconsistent spelling of Latin words in the training data or from insufficient evidence for the preferred normalized spelling in the eFontes corpus.

The study further indicated that training on manually annotated corpora, like eFontes, considerably improves the accuracy of tagging where significant domain- or genre-specific variation of data may be observed.

Future research will focus on the challenges of benchmarking against existing systems, including GPT models, which are presently viewed as competitors to the custom models discussed in this paper. Additionally, plans include expanding the datasets to cover historically significant, yet unexplored, medieval genres such as poetry or medieval Latin documents.

Finally, the authors are currently working on an automated solution for tagging named entities in medieval Latin texts from Polish sources. A high-performing NER tagger, while significant in its own right, should help in mitigating some of the issues associated with processing vernacular proper names discussed in this paper.

7 Acknowledgments

This work was supported by the project eFontes. The Electronic Corpus of Polish Medieval Latin (11H 17 0116 85), funded by the Polish Ministry of Science, and by a grant from the PLGrid Infrastructure.

References

  • Bon (2011) Bruno Bon. 2011. Omnia : outils et méthodes numériques pour l’interrogation et l’analyse des textes médiolatins (3). BUCEMA. Bulletin du centre d’études médiévales d’Auxerre, (15):251–252.
  • Camps et al. (2021) Jean-Baptiste Camps, Thibault Clérice, Frédéric Duval, Lucence Ing, Naomi Kanaoka, and Ariane Pinche. 2021. Corpus and Models for Lemmatisation and POS-tagging of Old French. Preprint, arXiv:2109.11442.
  • Cecchini et al. (2020a) Flavio M. Cecchini, Rachele Sprugnoli, Giovanni Moretti, and Marco Passarotti. 2020a. UDante: First Steps Towards the Universal Dependencies Treebank of Dante’s Latin Works. In Felice Dell’Orletta, Johanna Monti, and Fabio Tamburini, editors, Proceedings of the Seventh Italian Conference on Computational Linguistics CLiC-it 2020 : Bologna, Italy, March 1-3, 2021, Collana Dell’Associazione Italiana Di Linguistica Computazionale, pages 99–105. Accademia University Press.
  • Cecchini et al. (2020b) Flavio Massimiliano Cecchini, Timo Korkiakangas, and Marco Passarotti. 2020b. A new Latin treebank for Universal Dependencies: Charters between Ancient Latin and Romance languages. In Proceedings of the Twelfth Language Resources and Evaluation Conference, pages 933–942, Marseille, France. European Language Resources Association.
  • Cecchini et al. (2018) Flavio Massimiliano Cecchini, Marco Passarotti, Paola Marongiu, and Daniel Zeman. 2018. Challenges in converting the Index Thomisticus treebank into universal dependencies. Proceedings of the Universal Dependencies Workshop 2018 (UDW 2018).
  • Eger et al. (2016) Steffen Eger, Rüdiger Gleim, and Alexander Mehler. 2016. Lemmatization and Morphological Tagging in German and Latin: A Comparison and a Survey of the State-of-the-art. In Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC’16), pages 1507–1513, Portorož, Slovenia. European Language Resources Association (ELRA).
  • Goyens and Verbeke (2003) Michèle Goyens and Werner Verbeke, editors. 2003. The dawn of the written vernacular in Western Europe. Leuven University Press.
  • Haug and Jøhndal (2008) Dag Trygve Truslew Haug and Marius L. Jøhndal. 2008. Creating a parallel treebank of the Old Indo-European Bible translations. In Proceedings of the Second Workshop on Language Technology for Cultural Heritage Data (LaTeCH 2008), pages 27–34.
  • Johnson et al. (2021) Kyle P. Johnson, Patrick J. Burns, John Stewart, Todd Cook, Clément Besnier, and William J. B. Mattingly. 2021. The Classical Language Toolkit: An NLP framework for pre-modern languages. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing: System Demonstrations, pages 20–29, Online. Association for Computational Linguistics.
  • Kestemont and De Gussem (2016) Mike Kestemont and Jeroen De Gussem. 2016. Integrated Sequence Tagging for Medieval Latin Using Deep Representation Learning. Preprint, arXiv:1603.01597.
  • Riemenschneider and Frank (2023) Frederick Riemenschneider and Anette Frank. 2023. Exploring Large Language Models for Classical Philology. Preprint, arXiv:2305.13698.
  • Sommerschield et al. (2023) Thea Sommerschield, Yannis Assael, John Pavlopoulos, Vanessa Stefanak, Andrew Senior, Chris Dyer, John Bodel, Jonathan Prag, Ion Androutsopoulos, and Nando de Freitas. 2023. Machine Learning for Ancient Languages: A Survey. Computational Linguistics, 49(3):703–747.
  • Sprugnoli et al. (2022) Rachele Sprugnoli, Marco Passarotti, Flavio Massimiliano Cecchini, Margherita Fantoli, and Giovanni Moretti. 2022. Overview of the EvaLatin 2022 Evaluation Campaign. In Proceedings of the Second Workshop on Language Technologies for Historical and Ancient Languages, pages 183–188. European Language Resources Association.
  • Sprugnoli et al. (2020) Rachele Sprugnoli, Marco Passarotti, Flavio Massimiliano Cecchini, and Matteo Pellegrini. 2020. Overview of the EvaLatin 2020 Evaluation Campaign. In Proceedings of LT4HALA 2020 - 1st Workshop on Language Technologies for Historical and Ancient Languages, pages 105–110. European Language Resources Association (ELRA).
  • Stotz (1996-2004) Peter Stotz. 1996-2004. Handbuch zur lateinischen Sprache des Mittelalters, volume 1–5. C. H. Beck.
  • Van Nguyen et al. (2021) Minh Van Nguyen, Viet Dac Lai, Amir Pouran Ben Veyseh, and Thien Huu Nguyen. 2021. Trankit: A light-weight transformer-based toolkit for multilingual natural language processing. Preprint, arXiv:2101.03289.
  • Wolf et al. (2020) Thomas Wolf, Lysandre Debut, Victor Sanh, Julien Chaumond, Clement Delangue, Anthony Moi, Pierric Cistac, Tim Rault, Remi Louf, Morgan Funtowicz, Joe Davison, Sam Shleifer, Patrick von Platen, Clara Ma, Yacine Jernite, Julien Plu, Canwen Xu, Teven Le Scao, Sylvain Gugger, Mariama Drame, Quentin Lhoest, and Alexander Rush. 2020. Transformers: State-of-the-art natural language processing. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pages 38–45, Online. Association for Computational Linguistics.
  • Wróbel and Nowak (2022) Krzysztof Wróbel and Krzysztof Nowak. 2022. Transformer-based part-of-speech tagging and lemmatization for Latin. In Proceedings of the Second Workshop on Language Technologies for Historical and Ancient Languages, pages 193–197, Marseille, France. European Language Resources Association.
  • Wróbel (2020) Krzysztof Wróbel. 2020. KFTT: Polish full neural morphosyntactic tagger. In Maciej Ogrodniczuk and Łukasz Kobyliński, editors, Proceedings of the PolEval 2020 Workshop, pages 47–53. Institute of Computer Sciences, Polish Academy of Sciences, Warszawa.
  • Xue et al. (2022) Linting Xue, Aditya Barua, Noah Constant, Rami Al-Rfou, Sharan Narang, Mihir Kale, Adam Roberts, and Colin Raffel. 2022. ByT5: Towards a Token-Free Future with Pre-trained Byte-to-Byte Models. Transactions of the Association for Computational Linguistics, 10:291–306.
  • Xue et al. (2021) Linting Xue, Noah Constant, Adam Roberts, Mihir Kale, Rami Al-Rfou, Aditya Siddhant, Aditya Barua, and Colin Raffel. 2021. mT5: A massively multilingual pre-trained text-to-text transformer. In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 483–498, Online. Association for Computational Linguistics.