Noémi Vadász is a PhD student and junior research fellow who works at the Research Institute for Linguistics. As a computational linguist with a formal background in syntax and semantics, she collaborates with HUN-CLARIN in the e-magyar project. The interview was conducted via e-mail.
1. Please describe your academic background and your current research position.
I am a junior research fellow at the Research Group of Language Technologies of the Research Institute for Linguistics, Hungarian Academy of Sciences. After my BA in Hungarian Literature and Linguistics, I completed two MA programmes: Theoretical Linguistics and Computational Linguistics. I then moved on to the Doctoral School for Linguistics at Pázmány Péter Catholic University, where I am currently working on my PhD thesis.
2. What is the topic of your PhD and why did you decide to focus on this problem? How are you approaching it and what do you hope to achieve with it once it is completed, both in terms of scientific results and for your research community? What are you currently busy with?
The topic of my PhD is coreference resolution, a problem widely researched in computational linguistics. Still, I believe I can show something new, because my approach differs slightly from the classical computational-linguistic view. My path to computational linguistics led through classic humanities and theoretical linguistics; therefore, I investigate the topic more as a theoretician, while keeping applicability in mind.
Currently I am building a coreference corpus which – beyond the usual analysis layers such as tokenization, part-of-speech tagging, morphological analysis and dependency parsing – will contain anaphoric and coreference relationships. In the example ‘I called my mother. She was really tired.’, the personal pronoun ‘she’ refers back to its antecedent ‘my mother’; this relationship is called anaphora. In contrast, coreference occurs when two expressions have the same referent, and this relationship takes numerous forms (e.g. repetition, name variants, synonymy, part–whole relationship, etc.). In the example ‘I bought a bicycle. Tomorrow I will ride my new bike home.’, the basis of the coreference relationship between ‘bicycle’ and ‘bike’ is synonymy.
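To make the distinction concrete, the two kinds of relationship from the examples above can be pictured with a minimal data structure. This is only an illustrative sketch, not the actual annotation scheme of the corpus: mentions are token spans, anaphora is a directed link from a pronoun to its antecedent, and a coreference chain simply groups mentions that share a referent.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Mention:
    sentence: int   # sentence index
    start: int      # first token index (inclusive)
    end: int        # last token index (exclusive)
    text: str

# "I called my mother. She was really tired."
mother = Mention(0, 3, 4, "mother")
she = Mention(1, 0, 1, "She")

# Anaphoric link: the pronoun points back to its antecedent.
anaphora = {she: mother}

# Coreference chain: all mentions with the same referent, whatever the
# surface relation (repetition, name variant, synonymy, ...).
chain = [mother, she]

assert anaphora[she].text == "mother"
```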
Anaphora and coreference show similar behaviour across languages. However, in contrast with English, Hungarian is a so-called pro-drop language, which means that some pronouns (namely the personal and possessive pronouns) can be dropped from the sentence following fairly subtle rules. In these cases, the person and number of the subject and the object can be calculated from the inflection of the finite verb, and the person and number of the possessor from the inflection of the possessed noun; therefore, the use of zero pronouns can be handled in a simple rule-based manner. As a zero pronoun can also refer back to an antecedent, it needs to be indicated in the coreference corpus. I have created an application that inserts the dropped pronouns into the texts, so that these pronouns can also play a role in anaphora resolution.
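The rule-based idea can be sketched in a few lines. This is a deliberately simplified illustration, not the author's actual tool: it only restores dropped subject pronouns, reading person and number from UD-style morphological features on the finite verb.

```python
# Hungarian nominative personal pronouns by (person, number).
NOMINATIVE_PRONOUNS = {
    ("1", "Sing"): "én", ("2", "Sing"): "te", ("3", "Sing"): "ő",
    ("1", "Plur"): "mi", ("2", "Plur"): "ti", ("3", "Plur"): "ők",
}

def insert_zero_subject(tokens, has_overt_subject):
    """Restore the dropped subject pronoun before the finite verb, if any.

    `tokens` is a list of (form, features) pairs, where `features` is a dict
    with UD-style keys such as "Person", "Number" and "VerbForm".
    """
    if has_overt_subject:
        return [form for form, _ in tokens]
    out = []
    for form, feats in tokens:
        if feats.get("VerbForm") == "Fin":
            key = (feats.get("Person"), feats.get("Number"))
            if key in NOMINATIVE_PRONOUNS:
                out.append(NOMINATIVE_PRONOUNS[key])  # restored zero pronoun
        out.append(form)
    return out

# "Hazamegyek." ('[I] go home.') -- the subject is dropped; the verb's
# inflection encodes first person singular.
tokens = [("Hazamegyek", {"VerbForm": "Fin", "Person": "1", "Number": "Sing"}),
          (".", {})]
print(insert_zero_subject(tokens, has_overt_subject=False))
# -> ['én', 'Hazamegyek', '.']
```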
The corpus could serve as a resource for further research on this topic, be it answering theoretical questions or technical application for a certain purpose.
Building a corpus of gold standard quality is definitely complicated and time-consuming. Still, the process of corpus building allows one to study the phenomena of anaphora and coreference very meticulously. The feedback from my annotators also provides lessons to be learned. Therefore, together with the corpus, I build up my own knowledge of the phenomena. At the end of the pilot phase I will possess the know-how that allows further enlargement of the resource.
3. How did you get involved with HUN-CLARIN and what is your experience with it?
My department has multiple connections with HUN-CLARIN. Firstly, the Old Hungarian Corpus was produced in my institute. Initially, I was involved in this project as an annotator: I manually corrected the output of optical character recognition on Old Hungarian texts. Later, to speed up the work, I developed a small script to help manual normalization (the standardization of old or non-standard texts). It turned out that manual work could be cut down considerably with the help of this pre-normalization tool.
Secondly, I am involved in the e-magyar project, a text processing pipeline for Hungarian, which is also connected to HUN-CLARIN. Last year I developed two small but useful modules for e-magyar, both of which convert between certain formats. One of them converts from the e-magyar tagset to the international standard part-of-speech tagset of Universal Dependencies (UD). This converter is needed for intermodular communication inside the pipeline, but it could also serve as a useful output formalism given the widespread adoption of UD. The other converter translates between the internal format of e-magyar and the CoNLL-U format, a widely used international standard. The conversion between these two formats allows further work, annotation or visualization of the output with other tools that support the CoNLL-U format. Both converters were originally needed for my own corpus building project, but it soon turned out that the covered formalisms could be useful for other users as well, and the converters have now been integrated into the e-magyar framework.
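The shape of such a conversion can be sketched as follows. The real e-magyar converter handles the full emMorph tagset; the two-entry mapping below is a hypothetical excerpt for illustration only. The ten tab-separated columns, however, are the genuine CoNLL-U layout.

```python
# Hypothetical source tags -> UD part-of-speech tags (illustrative only).
XPOS_TO_UPOS = {
    "[/N][Nom]": "NOUN",
    "[/V][Prs.NDef.3Sg]": "VERB",
}

def to_conllu(sentence):
    """Emit one sentence in CoNLL-U.

    `sentence` is a list of (form, lemma, xpos, head, deprel) tuples.
    """
    lines = []
    for i, (form, lemma, xpos, head, deprel) in enumerate(sentence, start=1):
        upos = XPOS_TO_UPOS.get(xpos, "X")  # "X" = unmapped tag
        # CoNLL-U columns: ID FORM LEMMA UPOS XPOS FEATS HEAD DEPREL DEPS MISC
        lines.append("\t".join([str(i), form, lemma, upos, xpos, "_",
                                str(head), deprel, "_", "_"]))
    return "\n".join(lines) + "\n\n"  # a blank line terminates the sentence
```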
4. In addition to contributing to the development of e-magyar, you also have extensive experience in using it in practice. Could you briefly describe this dual role of yours and the advantages it brings?
I have a dual relationship with e-magyar, as I am both an everyday user and a member of the developer team. This duality brings benefits: on the one hand, my needs are fulfilled thanks to the work of my colleagues, and on the other hand, my everyday experience with e-magyar serves as useful feedback, which is important for the maintenance and further development of e-magyar.
I use e-magyar principally in my corpus building project. Initially, the selected texts are analyzed with the tokenizer, morphological analyzer and part-of-speech tagger modules of e-magyar. The output of these modules must then be corrected manually, because the quality of the later annotation layers is strongly influenced by this initial step. Next, the texts with the corrected annotation layers are further analyzed by the sentence parser module of e-magyar, which produces the dependency trees of the sentences. This layer needs manual correction as well. At this point in the workflow, the texts with the corrected annotations are ready for further, higher-level analysis such as anaphora resolution.
I am working on three applications in connection with my PhD. The first is responsible for inserting zero pronouns, the second resolves anaphora and the third resolves coreference. Admittedly, the output of these applications also needs manual correction, but in the end, besides a high-quality gold standard corpus, I obtain valuable observations about the quality of my applications. I hope that these three applications can also be added to the e-magyar chain as modules in the future.
5. You have recently also been involved in the development of Normo, a tool for the normalisation of historical Hungarian. How is historical Hungarian different from contemporary Hungarian and why is such a tool needed? How does it work and who is it intended for? Has it been used on any text collection that is important for Hungarian humanities researchers?
Normalizing old texts is an important step in the workflow because of the heterogeneous orthography of historical texts. Normalization makes the text readable both for humans and for computers. There are multiple approaches to normalization – our project aimed to preserve the structures of the old language variety, keeping them investigable for historical linguists; therefore, normalization here mainly means the standardization of spelling (thus covering the differences between the Middle and Modern Hungarian alphabets).
Since manual normalization is time-consuming and requires highly skilled, delicate work, automatic methods can help a lot. According to our measurements and the feedback from our annotators, Normo, our pre-normalization tool, eases and shortens the manual normalization work.
Normo consists of two main modules. The first one is a memory-based module with a relatively small dictionary of the most frequent words of the New Testament and their modern equivalents. Based on this dictionary, the most frequent words can simply be replaced with their modern forms. The second one is a rule-based module which works with manually defined rewrite rules. These rules come from two sources: some of them were defined on the basis of known changes in the history of Hungarian, others were defined through corpus-based observations. While the character-level rules are applied inside a word (e.g. replacement rules for handling characters that are not used in Modern Hungarian), the so-called token-level rules operate across word boundaries for splitting or joining words according to the rules of the modern orthography.
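The two-stage design described above can be sketched as follows. This is only an illustration of the architecture, not Normo itself: the dictionary entry and the rewrite rules below are made-up examples (the `cz` → `c` rewrite does reflect a known change in Hungarian spelling, but Normo's actual rule set is defined by its developers). Token-level splitting and joining rules are omitted for brevity.

```python
import re

# Memory-based module: a small dictionary of frequent old forms and their
# modern equivalents (hypothetical entry).
FREQUENT_WORDS = {
    "eggy": "egy",
}

# Rule-based module: character-level rewrite rules applied inside a word.
CHAR_RULES = [
    (re.compile(r"ſ"), "s"),    # long s, absent from the Modern Hungarian alphabet
    (re.compile(r"cz"), "c"),   # spelling change known from the history of Hungarian
]

def normalize_token(token):
    """Pre-normalize one token: dictionary lookup first, then rewrite rules."""
    if token in FREQUENT_WORDS:
        return FREQUENT_WORDS[token]
    for pattern, replacement in CHAR_RULES:
        token = pattern.sub(replacement, token)
    return token

print(normalize_token("cziprus"))  # -> "ciprus"
```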
Normo has been used in the project of building the Old Hungarian Corpus and has been applied to our five Middle Hungarian Bible translations.
6. What are your plans and dreams for the future?
My biggest plan for the future is to continue working on my coreference corpus and to make it available to others. It will then be all set for seeking answers to other exciting questions. I also have to write up my dissertation. Apart from my PhD work, I have recently been working a lot on some other topics; for instance, I became interested in morphological tagsets. I believe I can exploit my theoretical–computational hybrid attitude in this field as well. Lastly, there are some favourite topics I have already worked on (e.g. authorship attribution) that I would like to return to later.
7. How can research infrastructures such as HUN-CLARIN best serve early-stage researchers and how can a new generation of researchers best contribute to the research infrastructure?
Recently, for example, I attended a CLARIN workshop on NLP tools for historical data, which was a great opportunity for me. On the one hand, the event gave me a chance to get to know other researchers in a specific field. On the other hand, it is essential for beginners to gain self-confidence among their colleagues, which comes gradually through presenting one's research often, not to mention through the fluent use of English. Additionally, CLARIN conferences and workshops serve as a good platform to share new ideas with more experienced colleagues and get useful feedback. The world of conferences, workshops and networking is of course only one aspect of the CLARIN infrastructure's benefits, but according to my recent experience, it is one really worth mentioning.