Frequently Asked Questions

On the following pages, CLARIN answers some of the questions we are asked most frequently. The FAQ should give interested readers a quick entry point to a number of essential issues that have been discussed within CLARIN.


About CLARIN

CLARIN is an acronym for Common Language Resources and Technology Infrastructure.

CLARIN aims at uniting existing digital archives in Europe that contain language-based material into a federation that will give the social sciences and humanities research communities unified access to the content. It wants to make the wealth of language and speech processing tools that have been developed in recent years available to interested researchers, with a view to opening up new research avenues. Another goal of CLARIN is to provide web-based services that will allow non-expert users (especially humanities and social sciences researchers without a technological background) to perform complex tasks on the materials contained in the archives, such as ‘Summarize Le Monde of March 17 2008 – in Polish’.

How do we unite existing archives from all over Europe into a single federation? How can we manage the wide variability of conventions for describing resources, in many different languages? How might we chain together existing tools and applications, with their specific expectations in terms of input and output formats? Can we help to provide a minimum level of technological coverage for all languages, irrespective of the number of speakers? How do we protect the intellectual property rights of those who have provided data or tools? How do we ensure that whatever infrastructure facilities we manage to build will be sustainable?
These questions should be answered by the end of the CLARIN preparatory phase, at the end of 2010.

Not yet. CLARIN is presently approaching the end of the preparatory phase (2008-2010), in which the technological and organizational specifications are being finalized. The infrastructure will be set up during the construction phase (2011 - 2015) and will be in exploitation from 2016 on. However, in the meantime, we do aim at creating some prototype services.

At the end of the preparatory phase, CLARIN intends to make the transition towards a more permanent structure. CLARIN is now investigating the possibility that this be an ERIC - European Research Infrastructure Consortium - a legal entity based on EU law (Article 171 of the EC Treaty). It was designed to facilitate the joint establishment and operation of research facilities of European interest (see the European Commission site for more details).

CLARIN aims at bringing together producers and consumers of language resources and technologies. A producer is a contributor of linguistic data, tools or services. Mainly these are research organisations, such as a university or an industrial company that is collecting and annotating textual or speech corpora, or is inventing and implementing natural language processing technologies or applications. A consumer is a researcher or a group in need of linguistic data or processing technologies. In CLARIN, this person is seen mainly as a scholar in the humanities or social sciences. Although the consumer may have no computational linguistics background, CLARIN will aim to help to address research problems involving processing of linguistic data.

In the preparatory phase (2008-2010) CLARIN is funded by the EU through the 7th Framework ESFRI programme. One of the objectives of the preparatory phase is to come up with cost estimates for the construction and exploitation phases. The main funders will then be the national governments, with a possible minor contribution from the EU for some generic costs of the infrastructure. The cost estimate will include Europe-wide aspects such as comprehensive training and education programs.

  • access to upload/use data;
  • access to upload/use processing tools;
  • access to offer/use services;
  • guides for using standards and best practices, as well as access to these standards, including formal specifications;
  • know-how - building help-desk services, displaying examples, envisioning solutions, directing users towards experts, offering training (starting with the construction phase);
  • dissemination - information on events, the CLARIN Newsletter, the CLARIN NewsFlash, catalogs, information on sites, groups, etc.

CLARIN publishes a newsletter every 3 months and a newsflash every month. Have a look at the newsletter and at the newsflash (you can also subscribe to them, if you would like to receive these updates in your email).

Joining CLARIN

In the preparatory phase (until the end of 2010) CLARIN is still an EU project, so the number and names of consortium partners are strictly limited by the contract. But you can become a CLARIN Member by filling in the form here. You also need recommendations from two existing CLARIN partners or members. If you are the first institution in your country to join CLARIN, you also need to appoint a national contact person for your country. For further details, please consult the document related to CLARIN membership criteria and procedures. You can also contact the CLARIN Office (clarin@clarin.eu).
Starting with the construction phase (from 2011), CLARIN will become a self-supporting organization with its own legal entity (the CLARIN-ERIC), and the members will be the governments of the countries involved.

The preparatory phase of CLARIN is funded by the European Union, and the number and names of consortium partners are defined in the contract signed with the EU. Still, you can apply for CLARIN membership. Becoming a CLARIN member is free of charge and means that you support its goals. Membership requests are subject to acceptance by the national representative and the executive board of CLARIN. For further information on becoming a member in the preparatory phase, please have a look at the previous question, "How can I join CLARIN?". You can also contact the CLARIN Office (clarin@clarin.eu) for additional information.
In the construction phase, country membership will be tied to a yearly fee. The level of this fee has yet to be decided; it will be set before the construction phase begins and laid down in the CLARIN-ERIC (the legal entity). The majority of CLARIN's activities will be funded through these contributions from the national governments.

You should ask your government for a support letter to join the CLARIN Consortium, and to appoint an institute or a person as the national contact. You can find an overview of what is expected in terms of financial and other types of support here. For more suggestions and help you can also contact the CLARIN Office (clarin@clarin.eu).

Only countries and institutions can be CLARIN members. Private persons are not allowed.

If you are an institution in Latvia, or in any other CLARIN member country, you should contact your national representative - the CLARIN national contact person. If you don't know who the national contact person for your country is, please have a look at the list here.

Yes, this is possible. Institutions from non-member countries, including countries outside Europe, can become associated members in the preparatory phase. Technical details on how to apply for CLARIN membership in this phase can be found under the "How can I join CLARIN?" section. The procedure will change after 2010, in the construction phase, when the rights and duties of associated members will be laid down in the statutes of the new legal entity (CLARIN-ERIC).

The actual way to access the infrastructure will be established during the construction phase.

CLARIN resources

CLARIN is not the direct owner of any resource. It only provides access to a series of repositories where information about existing resources can be found. There are three main categories of resources: data, services and tools.

Several ways to access these resources are available:

  • browse them using the Virtual Language World or the CLARIN Language Resource/Language Tool inventory (you can find them both inside the Virtual Language Observatory);

Another source of information regarding existing resources is LT World. LT World is also a valuable source of information about linguistic terms and types of resources from a theoretical point of view.
Note that this is not the entry point into the CLARIN world of information. Once CLARIN centres are in place, an extensive search service will provide access to all types of resources.

All kinds of relevant linguistic data: textual and speech corpora, either raw or accompanied by metadata (annotated), lexica, grammars, video (sign-language recordings) and multimedia (text and speech, video, speech and subtitles), etc.

Yes, although not in CLARIN, but through CLARIN... You can use the facetted search tool for data to find a specific type of linguistic data.

If you are a member, you can add data by filling in the form here.

The procedure to update existing (meta)data is described here.

Tools are programs doing specific transformations over linguistic data.

Yes, you can use the facetted search interface for tools to search for various NLP tools. You can filter these tools by type, language, platform, license and author organization.

If you are a member, you can add tools by filling in the form here.

Services are the on-line equivalent of tools. The difference between a tool and a service is that a tool needs to be run locally (where the data is), while a service runs remotely. When using a web service the input data and the program that does the processing can reside on different machines. The data is transferred via protocols to the remote server, and the output results are transferred back when the processing is complete.

If you are a member, you can add tools by filling in the form here. You need to specify the type "Web service".

The Virtual Language World (VLW) is a way to browse the resources and tools in the CLARIN repository using the Google Earth interface. You can find it here (note that you need to have Google Earth installed in order to use it). The VLW contains data from multiple sources: from the CLARIN LRT inventory, from OLAC providers, from IMDI and from the DFKI software registry.

Technical infrastructure

Standards

CLARIN does not create linguistic resources; its purpose is to offer rapid access to existing resources and to facilitate their reuse in new contexts. When resources and tools are produced for individual use, interoperability, and therefore the need to adhere to standards or best practices, is of little relevance. The problem of interoperability only emerges when linguists are ready to offer their resources and tools to other researchers. One of the requirements of interoperability is being able to connect different resources to the same tool. This can be achieved using standards, but it would imply that all resources are standardized (an ideal situation that cannot always be achieved in reality). When needed, a standard can also play the role of a pivot format (resources are converted to the standard before they are used).

An open list follows:

  • character encoding: ISO 10646 (Unicode), UTF-8
  • country codes: ISO 3166
  • language codes: ISO 639-1/2/3
  • codes for the representation of names of scripts: ISO 15924
  • text format: XML
  • feature structure representation: ISO 24610-1:2006
  • representation of primary sources: TEI (Text Encoding Initiative)
  • knowledge engineering: RDF, RDF-S, SKOS, OWL
  • audio/speech: PCM (Pulse Code Modulation) for digitizing sound waves, the Alphabet of the International Phonetic Association for phonetic transcriptions
  • video/multimodality: MJPEG2000 lossless as backend format, MPEG2 or H.264 for handling and processing
  • data categories: ISO DCR and ISOcat
  • annotation of temporal entities: TimeML (part of TC 37/SC 4)
  • morpho-syntactic annotation: MAF (Morpho-syntactic Annotation Framework), ISO/DIS 24611
  • syntactic annotation: SynAF (Syntactic Annotation Framework), ISO/CD 24615
  • lexical annotation: LMF (Lexical Markup Framework), ISO 24613:2008
  • linguistic annotation: LAF (Linguistic Annotation Framework), ISO/DIS 24612

For more information on each of these standards, please take a look at the CLARIN Standardization Action Plan.

CLARIN actively tracks a number of ongoing standardisation activities at two major levels: linguistic structures/formats and linguistic encoding. As an infrastructure project, CLARIN has the duty to evaluate, test and comment on these proposals in close cooperation with the relevant standardisation bodies. CLARIN may also take the lead in initiating new standardisation activities when a clear gap in coverage is identified. For more information see the CLARIN Standardization Action Plan.

No linguist should be required to read long documents about standards; it is primarily the task of the tool, service and converter developers to provide frameworks that help the researcher and that hide complex formalisms as much as possible.

Very good question. A researcher will always face this issue. Research moves a field forward, and in no-man's-land there are no standards (yet); standards lag behind research. Industry always stays on firm ground, and therefore on well-established conventions. Although it may appear that research cannot make use of standards, it should build on well-established foundations, which should be expressed in standardized form whenever possible. Only for the tip of the arrow, the really fresh things that have just been invented, should the researcher look for his or her own ad-hoc conventions. Applied to linguistic data, this means that an annotated corpus, for example, will contain a mixture of standard and invented markings. CLARIN can be used for the part of the processing that involves existing tools and resources that have been converted to a standard format.

ISO Data Category Registry

The ISO Data Category Registry is an attempt to take a step towards interoperability at the level of linguistic encoding (tag sets). The basic idea is to register all widely used concepts/terminology so that everyone can refer to them or relate their own categories to them. All of this is based on the ISO 12620 standard, which is a generic model not restricted to linguistics. If you want to read more, please go to the ISOcat web-site (www.isocat.org).

ISOcat

Use the ISOcat forum - see http://www.isocat.org/forum/viewtopic.php?f=3&t=4&p=6#p6 for details.
In origin you can mention the source that inspired the creation of the data category. If you do not know what to enter, please enter CLARIN.

For the source of the language section, please enter CLARIN.
No, unless it is required by certain language rules (e.g. nouns in German), the name of a data category should not contain capitals.
  • Register at isocat.org
  • Send a mail with your name to dieter.vanuytvanck@mpi.nl
  • You will be added to the CLARIN group
  • My Workspace > button “create new data category”
  • My Workspace > Private > CLARIN > MD > button “edit this data category selection”
  • My Workspace > Private > + (add this data category to selection)
  • Click on the icon for “save the selected data categories”
  • After inspection the new data categories can be moved to the Metadata thematic view
  • (Finally, and optionally, the data categories in the Metadata thematic view can be submitted to the Thematic Domain Group for official approval)

ISOcat is the software and database that implements the ISO 12620 standard and data model. It is currently being filled with many categories from, for example, the EAGLES project, various metadata initiatives and, hopefully, other sub-disciplines and initiatives. There are bodies made up of linguists that make sure the content of the ISOcat registry is not too fragmented and meets a number of criteria.

No - the data model was set up with the explicit intention not to include relations, since in most cases these depend on theories and practical intentions. Instead, a framework will be offered that allows users to easily manipulate and share relations according to their needs. CLARIN intends to offer at least one set of relations with a large coverage, which users may want to use or manipulate.

No - it is just a start at offering a reference, so that users creating new resources can use the registered categories, and schemas describing legacy data can refer to them. But we will find that not all tag sets in use for various purposes can easily be mapped onto one another. Much will also depend on the intended usage. For searching, an imperfect mapping may result in less precision, but for a researcher this may not be a problem.

There is much debate about this and other questions, and there is no good universal answer yet. However, we need to start using the ISOcat registry to find out how the definitions can be improved, which categories are missing and which granularity should be chosen for metadata, morphology and semantic annotation, to mention just a few examples.

Metadata

It is data about data: information describing properties of linguistic resources. Think of the size of a corpus, the recording date of a speech file, the purpose for which annotations were created.

A fixed set of elements for the description of resources. Think of the traditional filing cards in the library, specifying the writer and title of each book.

Quite a few, examples are: Dublin Core, OLAC (which is an enriched version of Dublin Core), IMDI, the TEI header, ...

Good question. In fact there is no such thing as a single CLARIN metadata scheme. Practice has shown that using one particular scheme for a large community (e.g. the humanities) often results in a mismatch between the chosen elements and the needs of the user.

CLARIN proposes a component-based approach: you can combine several metadata components (sets of metadata elements) into a self-defined scheme that suits your particular needs. Of course you can share your profile with others (in fact we strongly advise that). If sharing the full profile is not an option, you still can use common components, e.g. a component to describe a sound recording. In case that still does not address your needs, it is even possible to create components yourself.
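The component idea can be illustrated with a minimal Python sketch. It is not the CLARIN metadata format itself: the component names (`General`, `SoundRecording`), their element names and the helper functions are all hypothetical, chosen only to show how reusable components combine into a self-defined profile against which a description can be checked.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Component:
    """A reusable set of metadata elements (names are hypothetical)."""
    name: str
    elements: tuple


def make_profile(*components):
    """Combine components into a profile: component name -> element names."""
    return {c.name: set(c.elements) for c in components}


def missing_elements(profile, description):
    """List 'component.element' entries the description leaves unfilled."""
    return sorted(
        f"{comp}.{elem}"
        for comp, elems in profile.items()
        for elem in elems
        if elem not in description.get(comp, {})
    )


# Two shared components combined into one self-defined profile.
general = Component("General", ("title", "language"))
recording = Component("SoundRecording", ("duration", "sample_rate"))
profile = make_profile(general, recording)

# A description of one resource, written against that profile.
description = {
    "General": {"title": "Interview 12", "language": "nl"},
    "SoundRecording": {"duration": "00:14:05"},
}
gaps = missing_elements(profile, description)
```

Because `SoundRecording` is a separate component, another archive could reuse it in a different profile without adopting the `General` component.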

Check out the CLARIN specification document: http://www.clarin.eu/files/wg2-4-metadata-doc-v5.pdf

Not yet - bringing all metadata descriptions together ("harvesting"), making them searchable ("indexing") and citeable ("creating persistent identifiers") is an important part of the infrastructure that CLARIN is building. As with all infrastructure these things require a solid base to build on. That base is currently being constructed, but this also means that there is currently no simple-to-use method for accessing the new CLARIN metadata.

Until the infrastructure for the component-based metadata is fully in place you can use one of the following metadata schemes:

OLAC - see http://linguistlist.org/olac/olac.html

IMDI - see http://www.mpi.nl/imdi/

Data described in either format will be made available and searchable via the CLARIN catalog and the Virtual Language Observatory. In addition, we will ensure that these metadata descriptions are converted to CLARIN component-based metadata.

Send a mail to dieter.vanuytvanck@mpi.nl and we will incorporate your metadata as soon as possible.

No, we are not. It is a term used for gathering metadata descriptions from several locations and storing them in a central database. You can find the results of such a harvesting process at http://catalog.clarin.eu/ (click on OLAC data providers).

Yes - have a look at http://www.clarin.eu/node/3026

There are indeed issues with searching if people aren't using matching descriptions. Think of someone calling a collection of texts a "text archive", while someone else might be searching for a "text corpus". Or think of all the variants that people can use for one and the same country: the Netherlands, Netherlands, Holland, etc. The same goes for linguistic annotations: "noun" and "substantive" can both be used to describe the same part-of-speech tag. To counter these problems, the metadata components contain links to a kind of database that contains atomic concepts (say "country" or "resource type"). We call them data categories. Smart software will later on be able to "see" that a user who searches for nouns might also be interested in substantives.
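A minimal Python sketch of that idea, under stated assumptions: the table below is a tiny hand-made stand-in for a data category registry, and the identifiers (`DC-noun`, `DC-country-NL`) are hypothetical, not real ISOcat entries. It shows how linking different labels to one data category lets software treat them as the same concept.

```python
# Hypothetical label -> data category table; a real system would look
# these links up in a data category registry instead.
LABEL_TO_CATEGORY = {
    "noun": "DC-noun",
    "substantive": "DC-noun",
    "the netherlands": "DC-country-NL",
    "netherlands": "DC-country-NL",
    "holland": "DC-country-NL",
}


def category_of(label):
    """Return the data category a label is linked to, or None."""
    return LABEL_TO_CATEGORY.get(label.strip().lower())


def same_concept(label_a, label_b):
    """True if both labels are linked to the same known data category."""
    cat_a = category_of(label_a)
    return cat_a is not None and cat_a == category_of(label_b)


def expand_query(label):
    """Return every known label that shares the query's data category."""
    cat = category_of(label)
    if cat is None:
        return {label.strip().lower()}  # unknown label: search as typed
    return {known for known, c in LABEL_TO_CATEGORY.items() if c == cat}


noun_variants = expand_query("noun")
```

A search front end could run the user's query once per label in `expand_query(...)`, so that a search for "noun" also finds resources tagged "substantive".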

In a data category registry - a server that can be reached via the internet, both by human users and computer programs.

Yes it does - you can have a look at it via http://www.isocat.org/interface/

Yes, go to the site mentioned above and click through to Public > Thematic views > Metadata > Metadata

Centres

Centres will form the backbone of the persistent and stable CLARIN service infrastructure. For further information we refer to the Requirements Specification Document and to the Short Guides.

Researchers will only use resources and tools/services offered via the web when they are sure that they will still be able to access them over a longer period. Currently, researchers mostly download resources to their own computer first to make sure they remain accessible, but in the cyberinfrastructure scenario, with its many large resources and collections, this approach is no longer suitable. So availability and accessibility have to be guaranteed by institutions with a clear service-oriented attitude that impose as few restrictions as possible on usage by researchers. Only this new type of centre can give such guarantees.

Can everyone act as a CLARIN centre in the emerging network? No - centres need to fulfill a number of criteria which are described in the documents mentioned above. In particular, these centres need to make a commitment statement that they will provide their services for a defined period of time at a certain service level, which depends on the type of service. Setting up centres that adhere to these requirements will cost money, so centres obviously need to have a clear funding basis.

The answer to this question depends on the country, although some services are provided at a European level. For detailed questions about the centres, see www.clarin.eu/centres

Persistent Identifiers (PIDs)

Persistent identifiers are increasingly seen as a core component for the many references we are creating at various levels - these range from references between metadata descriptions and their resources up to references between semantic assertions made using RDF (Resource Description Framework). For more information please read the requirements specification document or the short guide.

In the emerging cyberinfrastructure we are creating more and more references between resources, resource fragments and services. Creating these references is very costly, and they are often essential for the interpretation of a resource. Therefore we need proper mechanisms to ensure that these references survive all the changes that happen, for example, in repositories. It is known that URLs are not appropriate - they are not persistent even when we believe that they are proper URIs. This is where special PIDs come into play: they identify an object and are maintained by reliable institutions.

Handling PIDs is very simple. First you need to register a PID for a resource or service. You do this by providing the required information to the PID service site, in particular the path to access the resource, such as a URL. You will receive back a PID, which you can enter into the metadata description, for example, so that everyone can use it for referencing. When a user finds such a PID in a resource, he/she can click on this reference, and the service will resolve the PID and give access to (one of the copies of) the resource. Normally, as a user, you don't see the intermediate transactions.
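The register/resolve cycle can be sketched in a few lines of Python. This is a toy in-memory model, not the API of the Handle System or of any real PID service: the class name, the prefix string and the example URLs are all hypothetical. It only illustrates why a reference through a PID survives when the resource moves.

```python
import uuid


class PidRegistry:
    """Toy in-memory stand-in for a PID service (all names hypothetical)."""

    def __init__(self, prefix="hypothetical-prefix"):
        self.prefix = prefix
        self._locations = {}  # PID -> current URL of the resource

    def register(self, url):
        """Issue a new PID for a resource reachable at `url`."""
        pid = f"{self.prefix}/{uuid.uuid4()}"
        self._locations[pid] = url
        return pid

    def move(self, pid, new_url):
        """Record that the resource behind `pid` now lives at `new_url`."""
        if pid not in self._locations:
            raise KeyError(f"unknown PID: {pid}")
        self._locations[pid] = new_url

    def resolve(self, pid):
        """Return the current location of the resource behind `pid`."""
        return self._locations[pid]


registry = PidRegistry()
pid = registry.register("http://archive.example.org/corpus/v1")

# The repository is reorganised; only the registry entry changes,
# while the PID stored in metadata descriptions stays the same.
registry.move(pid, "http://archive.example.org/collections/corpus")
current_url = registry.resolve(pid)
```

The metadata description only ever stores `pid`, so it does not need to be touched when the resource is relocated.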

If PIDs cannot be resolved at a certain moment, one simply cannot access a resource. Think of a situation where hundreds of users are waiting for a PID to be resolved and nothing happens - a nightmare for any cyberinfrastructure scenario! Since this would not be acceptable, we need to make sure that the PID service is based (a) on very robust and reliable software offering sufficient functionality, and (b) on a proper service run by redundant centres with high availability and a persistency guarantee.

CLARIN has an arrangement with the EPIC consortium that CLARIN members will be able to register PIDs and of course resolve them. This consortium groups a number of reliable European service providers that want to participate in providing a redundant service for the research world, i.e. we are speaking about millions of PIDs and a service at very low costs. The service is based on the Handle System which according to our investigations is the only robust system meeting all requirements. No one is obliged to register Handles, but of course CLARIN centres will need to demonstrate that their PIDs can be resolved in a robust manner and offer the required functionality.

As said, PIDs are unique and persistent identifiers of objects that are made available by proper repositories. For many resources there are additional characteristics, such as multiple copies for preservation reasons, a checksum string (such as MD5) that can be used to check authenticity, simple metadata for citation purposes, a reference to the access permission record, etc. A proper PID system should offer such information immediately when resolving a PID. PURLs cannot offer this functionality, and for URNs we do not know of a well-proven and robust resolver, although the big libraries have agreed on using URNs for their publications. DOIs are also based on the proven Handle System, and DOI is certainly a proper service, used in particular by the big publishing companies. However, DOI also comes with a business model that will not be acceptable for many research organizations.
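As an illustration of the checksum mentioned above, here is a minimal Python sketch using the standard `hashlib` module. The layout of the `record` dictionary is an assumption made for this example, not the format any PID service actually returns. Note that an MD5 comparison of this kind detects accidental corruption or a mismatched copy; it is not a defence against deliberate tampering.

```python
import hashlib


def md5_hex(data: bytes) -> str:
    """Return the MD5 checksum of `data` as a lowercase hex string."""
    return hashlib.md5(data).hexdigest()


def matches_record(data: bytes, record: dict) -> bool:
    """Check retrieved bytes against the checksum stored with the PID."""
    return md5_hex(data) == record["md5"].lower()


# Hypothetical information returned when a PID is resolved.
content = b"This is a sentence."
record = {
    "locations": ["http://archive.example.org/corpus/v1"],
    "md5": md5_hex(content),
}

intact = matches_record(content, record)
altered = matches_record(content + b" (edited)", record)
```

A client would download the resource from one of the `locations`, recompute the checksum and refuse to use the copy if `matches_record` returns `False`.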

(Answer taken from the ISO citer draft, p. 11)

This International Standard supports different levels of granularity. The following recommendations are designed to encourage efficiency and promote interoperability with other naming schemes:

1) If there is an existing identifier scheme for a type of resources, for instance, ISBN, this level of granularity should be retained, which is to say that no new PIDs should be issued without very good reasons, such as for chapters. Chapters would preferably be addressed using part identifiers in conjunction with the PID of the book.

2) If the resource is associated with the complete content of a digital file, an individual PID should probably be assigned for this resource.

3) If the resource is autonomous and exists outside a larger context, an individual PID should probably be assigned for this resource.

4) If a resource should be citable apart from any containing resource, an individual PID should probably be assigned for this resource.

These recommendations are, however, subject to the needs of resource creators with respect to the level of granularity they deem suitable to the specific resource environment.

Intellectual Property Rights and Business Models

CLARIN has already sketched out three types of license that could be used when an individual or an institution contributes resources to the CLARIN Infrastructure:

  • publicly available resource
  • resource available for academic use
  • resource with restricted use

Additionally, there can be other conditions imposed, like non-commercial use, usage report (the end user must report the published articles which use the resources) and grant back (the end user must license back the modified versions of the resources).
Additional details will continue to be worked out during the preparatory phase. 
The official licensing model will be made available during the construction phase.

To distribute tools and resources to the research community through CLARIN, a Content Provider and a CLARIN Service Provider have to sign a Deposition Licensing Agreement. An End User interested in the tools and resources distributed through CLARIN has to sign a Terms of Service agreement. In addition, the user may need to sign one or more End-User License Agreements to obtain access to individual tools or resources, and a Data User Agreement in case there are additional ethical restrictions on the data.

You do not have to be authenticated to use publicly available resources, even if they are distributed through the CLARIN Infrastructure.

No. In the construction phase, CLARIN will only provide access to the resources. Each resource will be entrusted to CLARIN together with its specific EULA (End-User License Agreement), which you will need to agree to.

Yes, depending on the nature of the end use of the resource, you will have access to resources licensed for academic or commercial purposes.

CLARIN maintains a catalogue of resources. Providers of these resources may be either CLARIN Members or outsiders. However, only resources submitted by CLARIN Members will be integrated into the infrastructure, since a certain amount of effort must be put into the integration. Resources provided by non-members will only be listed in the catalogue; they will not be available to the processing infrastructure.

CLARIN for the Humanities and Social Sciences

Mainly because CLARIN is an infrastructure of language technologies, and the primary users of this technology are the humanities and social sciences (HSS). Also because one of the goals of CLARIN is to make NLP technology easily available, without researchers having to download, install and get acquainted with all the tools that might be needed for their research. And finally because CLARIN will facilitate the wide distribution of the language resources created by HSS researchers.

Yes. At the beginning of 2009 CLARIN organised a call for collaboration with HSS projects. Three HSS projects were selected as a result of this call. CLARIN will collaborate with these projects during their development and offer language resources and tools, as well as advice on how to use the provided language technology to enhance the work of the projects.

It depends on what you want to do.
If you are searching for a specific corpus, you can find it using the VLW or the facetted search tool. The same happens if you are looking for a specific kind of tool.
If you know the processing steps that would solve your problem, you can use the CLARIN repository to assemble yourself a solution (either browsing through the VLW or searching using the facetted search tool).
CLARIN is currently looking for more advanced solutions that would help even HSS researchers with no technical background (for instance, by guiding them through the process of building a solution).

Basic language resource kit - terminology

Annotation

Generally - a word, but also an abbreviation. In some agglutinative languages it could be just a morpheme, part of a compound.
XML markup used:

<tok tid="...">word</tok> or <w wid="...">word</w>.

Parts of speech (POS) seem to occur in every natural language. The usual categories are: noun, verb, article, adjective, pronoun, preposition, adverb, conjunction, etc. Sometimes POS is also taken to include morphological and syntactic classes.

The canonical form of a word. It represents all the various forms of a morphological paradigm. It is marked on tokens with a specialized attribute, for instance:

<tok id="w10" pos="det" lemma="the">the</tok>

<tok id="w11" pos="n" lemma="wind">winds</tok>

<tok id="w12" pos="prep" lemma="of">of</tok>

<tok id="w13" pos="n" lemma="change">change</tok>
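
A minimal Python sketch of how such markup can be read, using the standard library's `xml.etree.ElementTree`. Two assumptions are made here: the tokens are wrapped in a hypothetical `<s>` root element so that the fragment is well-formed XML, and the attribute values are written with plain ASCII quotes.

```python
import xml.etree.ElementTree as ET

# The <s> wrapper is an assumption of this sketch, added only so that
# the token sequence forms a single well-formed XML document.
SAMPLE = """<s>
<tok id="w10" pos="det" lemma="the">the</tok>
<tok id="w11" pos="n" lemma="wind">winds</tok>
<tok id="w12" pos="prep" lemma="of">of</tok>
<tok id="w13" pos="n" lemma="change">change</tok>
</s>"""


def forms_by_lemma(xml_text):
    """Map each lemma to the surface forms annotated with it."""
    result = {}
    for tok in ET.fromstring(xml_text).iter("tok"):
        result.setdefault(tok.get("lemma"), []).append(tok.text)
    return result


lemmas = forms_by_lemma(SAMPLE)
nouns = [t.text for t in ET.fromstring(SAMPLE).iter("tok") if t.get("pos") == "n"]
```

Grouping by the `lemma` attribute is what lets a search for "wind" also find the inflected form "winds".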

 

A group of words acting as a unit surrounding at least one noun. One noun of the group acts as representative - the head. It gives the morpho-syntactic properties of the group. 
Common XML notation: 

<np np_id="np1" head_id="t3"><tok tid="t1">the</tok><tok tid="t2">black</tok><tok tid="t3">cat</tok></np>.

A complex tag delimiting sentence boundaries.
For instance: 

<seg sid="...">This is a sentence.</seg>.

Syntactic descriptions constitute a huge chapter of computational linguistics. A very general classification, however, sees a syntactic description notating either constituents or functional dependencies. In an FDG type of notation, for instance, one possibility is to mark on each word its parent and the type of the syntactic relation to its parent: 

<tok id="w10" pos="det" lem="the" link="w11" linktype="det">the</tok>

<tok id="w11" pos="n" lem="wind" link="…" linktype="">winds</tok>

<tok id="w12" pos="prep" lem="of" link="w11" linktype="mod">of</tok>

<tok id="w13" pos="n" lem="change" link="w12" linktype="pcomp">change</tok>
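
The dependency links in such a notation can be followed with a short Python sketch. As assumptions of this example only, the tokens are wrapped in a hypothetical `<s>` root element, ASCII quotes are used, and the parent of "winds", which is left open in the notation above, is written as an empty `link` and treated as "no parent within this fragment".

```python
import xml.etree.ElementTree as ET

# Assumptions: <s> wrapper for well-formedness; empty link on w11
# because its parent is not given in this fragment.
SAMPLE = """<s>
<tok id="w10" pos="det" lem="the" link="w11" linktype="det">the</tok>
<tok id="w11" pos="n" lem="wind" link="" linktype="">winds</tok>
<tok id="w12" pos="prep" lem="of" link="w11" linktype="mod">of</tok>
<tok id="w13" pos="n" lem="change" link="w12" linktype="pcomp">change</tok>
</s>"""


def dependency_edges(xml_text):
    """Return (dependent, relation, parent) triples for resolvable links."""
    toks = list(ET.fromstring(xml_text).iter("tok"))
    words = {t.get("id"): t.text for t in toks}
    edges = []
    for t in toks:
        parent_id = t.get("link")
        if parent_id in words:  # skip tokens whose parent is not in the fragment
            edges.append((t.text, t.get("linktype"), words[parent_id]))
    return edges


edges = dependency_edges(SAMPLE)
```

Each triple reads "dependent, relation type, parent", so the fragment encodes that "of" modifies "winds" and "change" is the complement of "of".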

 

As words are often polysemous, only the context can define their senses. Accordingly, notations used to disambiguate word senses (WS) indicate senses in context, as given by specialized repositories (for instance WordNet).
A common XML notation for WS uses an attribute (for instance, 'wsd' or 'sense') complementing a token marking:

<tok tid="..." pos="v" lemma="run" wsd="s1">running</tok> or
<tok tid="..." pos="v" lemma="run" sense="s1">running</tok>.

We say that an anaphor and an antecedent are coreferential if both are text spans (usually NPs) referring to the same discourse entity.
Below is an example of the notation, using an attribute (coref) that points to the 'np' element of the antecedent:
 

<np np_id="np1"><tok tid="t1">John</tok></np>
<tok tid="t2">hit</tok>
<np np_id="np2"><tok tid="t3" coref="np1">himself</tok></np>
 

 

Processes

The process of segmenting texts into word tokens. In all modern languages that use the Latin, Cyrillic or Greek writing systems, word tokens are recognized by the delimiting blanks or punctuation. Numbers, alphanumerics and special-format expressions (dates, measures, abbreviations) are also recognized as tokens, traditionally by using regular expressions. Tokenization in non-segmented languages, such as many Oriental languages, requires more sophisticated algorithms (lexical look-up of longest matching sequences, hidden Markov models, n-gram methods and other statistical techniques).
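
A minimal sketch of regular-expression tokenization for a language with whitespace-delimited words, written in Python with the standard `re` module. The pattern is an illustrative assumption, not a tokenizer used by CLARIN: it recognises one simple date format, numbers, words and single punctuation marks, and it makes no attempt to handle abbreviations or non-segmented languages.

```python
import re

# Alternatives are tried left to right, so the more specific patterns
# (dates, numbers) come before the general word pattern.
TOKEN_RE = re.compile(r"""
    \d{1,2}/\d{1,2}/\d{2,4}    # dates written like 17/03/2008
  | \d+(?:[.,]\d+)*            # numbers, with optional . or , separators
  | \w+(?:[-']\w+)*            # words, allowing inner hyphens/apostrophes
  | [^\w\s]                    # any single punctuation mark
""", re.VERBOSE)


def tokenize(text):
    """Split `text` into tokens; whitespace only separates tokens."""
    return TOKEN_RE.findall(text)


tokens = tokenize("Le Monde, 17/03/2008: 3.5 million readers.")
```

The ordering of the alternatives matters: if the word pattern came first, the date would be split into separate number and punctuation tokens.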