The 2022-23 edition of the LEL seminar begins tomorrow (4th October) with a talk by Lauren Fonteyn (Leiden University). The talk will take place at 4pm in Simon 1.34. The title and abstract are below.
Featured photo is from Lauren's website
MacBERTh & GysBERT: using machine learning to automate grammatical and semantic data annotation in historical corpora
Lauren Fonteyn
Leiden University
In this talk, I will demonstrate how contextualized embeddings – which are a type of compressed token-based semantic vectors – can be used as annotation and research tools. More specifically, I will focus on the use of the Bidirectional Encoder Representations from Transformers model, also known as 'BERT' (Devlin et al. 2019).
Originally, BERT was set up for Present-day English, having been pre-trained on 3.2 billion words of Present-day English Wikipedia and Google books data. Yet researchers who interpret and analyse historical textual material are well aware that textual/linguistic material from the past should not be interpreted from a present-day point of view. Hence, NLP models pre-trained on present-day language data are less than ideal candidates for the job. For the case study presented in this paper, we use two variants of BERT: MacBERTh (Manjavacas and Fonteyn, 2021, 2022), which has been pre-trained on approximately 3.9B (tokenized) words of historical English (time span: 1450-1950), and GysBERT, which has been pre-trained on 7.1B (tokenized) words of historical Dutch (time span: 1500-1950).
These models will be put into action in two different but thematically related case studies on individual-level language variation. The first case study, which focusses on variation and change in the use of English ing-forms by Early Modern English individuals, demonstrates how the models can be used to automate grammatical annotation. The second case study demonstrates how contextualized embeddings can be integrated into lexical diversity measures, allowing us to consider not only the 'vocabulary richness' but also the 'semantic richness' of texts produced by different authors.
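To give a rough idea of what combining embeddings with a lexical diversity measure could look like, here is a minimal Python sketch. It is not the method used in the talk: the measure (mean pairwise cosine distance between token vectors), the function names, and the two-dimensional toy vectors are all assumptions made for illustration. In practice the vectors would come from a model such as MacBERTh rather than being written by hand.

```python
import math
from itertools import combinations


def type_token_ratio(tokens):
    """Classic 'vocabulary richness': distinct word forms divided by total tokens."""
    return len(set(tokens)) / len(tokens)


def cosine_distance(u, v):
    """1 minus the cosine similarity of two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


def mean_pairwise_distance(vectors):
    """Hypothetical 'semantic richness' score: average cosine distance
    over all pairs of token vectors (one vector per token in context)."""
    pairs = list(combinations(vectors, 2))
    return sum(cosine_distance(u, v) for u, v in pairs) / len(pairs)


# Toy data (invented): both authors use exactly the same word forms.
tokens_a = ["bank", "bank", "bank", "loan"]
tokens_b = ["bank", "bank", "bank", "loan"]

# Hand-written stand-ins for contextualized embeddings.
# Author A uses 'bank' in one sense only, so the vectors cluster together.
vectors_a = [[1.0, 0.0], [0.9, 0.1], [0.95, 0.05], [0.8, 0.2]]
# Author B uses 'bank' in two senses, so the vectors are more spread out.
vectors_b = [[1.0, 0.0], [0.0, 1.0], [0.1, 0.9], [0.8, 0.2]]

ttr_a = type_token_ratio(tokens_a)
ttr_b = type_token_ratio(tokens_b)
semantic_a = mean_pairwise_distance(vectors_a)
semantic_b = mean_pairwise_distance(vectors_b)

print("TTR:", ttr_a, ttr_b)
print("Semantic spread:", semantic_a, semantic_b)
```

In this toy setup the two authors have identical type-token ratios, yet author B's vectors are more widely dispersed, which is the kind of difference a form-based measure alone cannot detect.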