Written by Diana G. Maynard, edited by Darja Fišer and Jakob Lenardič
GATE is a widely used, established open-source infrastructure that provides a framework and numerous essential components (plugins) for all kinds of NLP and text processing tasks. Developed at the University of Sheffield, which is a partner in CLARIN-UK, it is now 20 years old and has a research team of 16 people, as well as a vibrant community of users, ensuring its continuous development and usage in a wide variety of scenarios and domains. The components, some of which are available in a variety of languages, include:
- pre-processing tools (e.g. tokenisation, lemmatisation, normalisation);
- language processing tools (e.g. part of speech tagging, parsing, chunking, morphological analysis);
- domain- and task-specific NLP tools (e.g. named entity recognition and linking, gene tagging, recognition of legal terminology, biomedical processing, social media analysis);
- NLP development tools (machine learning algorithms; a linguistic pattern-matching rule engine; performance evaluation tools).
Beyond making these NLP tools openly available, GATE also provides:
- GATE Developer – a graphical interface for developing and testing new NLP tools and applications;
- GATE Cloud – a cloud-based NLP platform-as-a-service, for seamless service-based deployment of GATE NLP tools and applications;
- GATE Mimir – a highly scalable semantic indexing and search platform;
- GATE Teamware – a collaborative, web-based document annotation tool.
With these NLP tools and services, even researchers without coding experience can easily use, adapt, or build an NLP system to analyse text. Thanks to its open-source nature, GATE users also benefit from tools and applications that are provided by third-party GATE users (typically other academic researchers adapting GATE tools to other languages and tasks) and shared via public repositories such as GitHub. In particular, the GATE Cloud platform offers 69 different services covering many languages and domains. It gives people an easy way to try out a number of applications on sample text and to run them as a web service over more substantial datasets without having to deal with any integration issues, because it provides a common interface across disparate services. Many of these services have already been integrated in D4Science and are currently being integrated in the European Language Grid and RISIS platforms, as well as the CLARIN Language Resources Switchboard.
Figure 1: Ontology viewer in GATE Developer, highlighting named entity relations within an annotated English text.
The GATE development team dedicates significant resources to supporting and growing the GATE user community through regular and bespoke training courses, open access training materials/documentation, and an open user mailing list, as well as offering consulting services to help the development of new NLP applications in a wide variety of sectors. At the CLARIN-PLUS Workshop “Creation and Use of Social Media Resources”, Diana Maynard, who is one of the developers of GATE at the University of Sheffield, held a demo session in which she showed how open-source GATE tools, such as the TwitIE named entity recognizer and a Twitter-based sentiment analyzer, can be applied to analyse social media texts in various languages.
Large-scale text analysis can be carried out with GATE tools to gain valuable quantitative insights from large volumes of social media content, helping to answer important open questions. For example, one important strand of work has looked at how social media can be harnessed more effectively during crises and natural disasters, while another has looked at political debates. The past few years have heralded the age of ubiquitous disinformation (aka fake news), which poses serious questions over the role of social media and the internet in modern democratic societies. Topics and examples abound, ranging from the UK “Brexit” referendum and the 2016 US presidential election to medical misinformation (e.g. miraculous cancer cures). Questions that such analysis can help to address include the following. How do political parties, candidates, and voters engage online in the run-up to elections and referenda? How polarised are these discussions and how prevalent are abusive comments? What is the role of disinformation and bots during elections? Can they influence the outcome?
In addition, the explosion of free text in healthcare (such as electronic health records and research papers) creates important opportunities that can benefit from NLP and text mining in the biomedical domain. Examples include extracting patients’ background information (e.g. occupation, HIV status, prescriptions) from their records, or labelling proteins, DNA/RNA, and cell types in biomedical literature. GATE tools have been successfully applied to the following tasks in biomedicine (amongst others):
- the extraction of medical terms from text and their linking to UMLS concepts;
- automatic extraction of drug names and dosages from prescriptions;
- the expansion, annotation and co-reference of biomedical abbreviations and acronyms;
- the recognition of organisms in biomedical literature.
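To give a flavour of what one of the tasks listed above involves, the following is a minimal Python sketch of rule-based extraction of drug names and dosages. It is a toy illustration only and does not use GATE or reflect how the GATE biomedical tools are implemented. The `Dosage` type, the `extract_dosages` function, the list of units, and the assumption that a drug name is a single capitalised word directly followed by a number and a unit are all hypothetical simplifications.

```python
import re
from typing import NamedTuple


class Dosage(NamedTuple):
    """One extracted mention (hypothetical structure for this sketch)."""
    drug: str
    amount: float
    unit: str


# Assumption: a drug name is one capitalised word, immediately followed by
# a number and one of a small, hard-coded set of units.
DOSAGE_PATTERN = re.compile(
    r"\b(?P<drug>[A-Z][a-z]+)\s+"
    r"(?P<amount>\d+(?:\.\d+)?)\s*"
    r"(?P<unit>mg|g|ml|mcg)\b"
)


def extract_dosages(text: str) -> list[Dosage]:
    """Return every drug/dosage mention that matches the toy pattern."""
    return [
        Dosage(m["drug"], float(m["amount"]), m["unit"])
        for m in DOSAGE_PATTERN.finditer(text)
    ]


if __name__ == "__main__":
    sample = "Take Amoxicillin 500 mg three times daily and Ibuprofen 0.2g as needed."
    for dosage in extract_dosages(sample):
        print(dosage)
```

A realistic system would rely on curated drug lexicons, handle multi-word names and dosage ranges, and be evaluated against annotated data, which is the kind of work that the NLP development tools described earlier are intended to support.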
Cunningham, Hamish, Diana Maynard, Kalina Bontcheva, et al. 2011. Developing Language Processing Components with GATE Version 6 (a User Guide).