Tour de CLARIN: CLARIN-PL presents WebSty, an open web-based system for stylometric analysis

Submitted by karolina@clarin.eu on 15 January 2018

WebSty is a powerful web-based system for stylometric, semantic and comparative analysis of texts. In its current implementation, the system is suited for the quantitative analysis of German, Polish, English, Hungarian, Russian and Spanish texts. It is presented as an easy-to-use web interface in which researchers can simply drag and drop the documents they want to analyse or provide links to uploaded .zip files containing the documents (figure 1). WebSty is also integrated with the Polish DSpace-based repository provided by CLARIN-PL. To ensure fast processing of the documents, WebSty is designed as service-oriented software in which each language tool runs as a separate process with pre-loaded data models. Currently, the English version of WebSty makes use of the following tools:

spaCy, a suite that prepares texts for deep learning and features advanced annotation such as named entity recognition,
Fextor, a tool for the extraction of features from text collections,
CLUTO, a tool for the clustering of datasets, and
D3.js and D3-tip, which are the visualisation components.

After uploading the file to be analysed, researchers can use the Choice of features tab (figure 2) to specify which linguistic features WebSty takes into account when performing the analysis. Among others, these include various grammatical classes and a host of features related to named entities. The results of the clustering are primarily visualised as a dynamic dendrogram (figure 3), which is generated with the D3.js library as an interactive binary tree in which each subtree can be collapsed. In addition, WebSty allows researchers to download the results in .xlsx format and to visualise them with other user-friendly methods, such as a heat map, a radar chart and multidimensional scaling.
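The general idea behind this workflow (select features, compare documents, cluster them into a binary tree) can be illustrated with a minimal Python sketch. It is not WebSty's implementation: WebSty relies on Fextor and CLUTO, whereas the sketch below uses only the Python standard library. The feature list, the function names and the toy texts are all hypothetical, and average-linkage clustering over cosine distances is just one common choice assumed here for illustration.

```python
import math
from collections import Counter

# Hypothetical feature set: a handful of English function words.
# WebSty lets the user pick features; this list is only an assumption.
FEATURES = ["the", "and", "of", "to", "a", "in", "i", "was"]


def feature_vector(text):
    """Return the relative frequency of each feature word in the text."""
    tokens = [t.strip(".,;:!?\"'()").lower() for t in text.split()]
    tokens = [t for t in tokens if t]
    counts = Counter(tokens)
    total = len(tokens) or 1  # avoid division by zero for empty texts
    return [counts[w] / total for w in FEATURES]


def cosine_distance(u, v):
    """1 - cosine similarity; 1.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / norm if norm else 1.0


def cluster(docs):
    """Average-linkage agglomerative clustering.

    docs maps a document name to its text. The result is a binary tree
    of nested 2-tuples whose leaves are document names, i.e. the shape
    of data a dendrogram is drawn from.
    """
    vectors = {name: feature_vector(text) for name, text in docs.items()}
    # Each cluster is a pair: (subtree, list of member document names).
    clusters = [(name, [name]) for name in docs]

    def linkage(a, b):
        # Mean pairwise distance between the members of two clusters.
        pairs = [(x, y) for x in a[1] for y in b[1]]
        return sum(cosine_distance(vectors[x], vectors[y])
                   for x, y in pairs) / len(pairs)

    while len(clusters) > 1:
        # Find the two closest clusters and merge them into one subtree.
        i, j = min(
            ((i, j) for i in range(len(clusters))
             for j in range(i + 1, len(clusters))),
            key=lambda ij: linkage(clusters[ij[0]], clusters[ij[1]]),
        )
        merged = ((clusters[i][0], clusters[j][0]),
                  clusters[i][1] + clusters[j][1])
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(merged)
    return clusters[0][0]


# Toy input: two texts with the same function-word profile and one outlier.
docs = {
    "a1": "the cat and the dog and the bird",
    "a2": "the fox and the hen and the owl",
    "b1": "i was in a room of a house",
}
tree = cluster(docs)
print(tree)
```

In this toy example the two texts with identical function-word profiles are merged first, and the outlier joins them only at the root. A nested structure of this kind is what a collapsible dendrogram visualises, with each inner tuple corresponding to one subtree.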

Since WebSty does not require in-depth computational knowledge, it is a crucial tool for the social sciences and digital humanities: it allows researchers to conduct massive-scale analyses of numerous resources, revealing characteristics that have been overlooked by traditional approaches. As an example of a successful application in literary studies, Dr Maciej Maryl, Deputy Director at the Institute of Literary Research of the Polish Academy of Sciences, has used WebSty to analyse a large collection of blogs with anonymous authorship and, on the basis of the provided clustering options, detected subtle similarities between documents. As a successful application in sociology, Dr Marek Troszyński from Collegium Civitas has used the tool in a project for monitoring and documenting manifestations of discrimination against the Ukrainian minority in Poland. For languages other than Polish, WebSty has been used successfully by Dr Palkó Gábor from the Petőfi Museum of Literature to analyse texts in Hungarian (figure 4). Through cooperation with partners from the Petőfi Museum of Literature, a new version of WebSty will be created with a dedicated interface in Hungarian.

Figure 1: Uploading datasets in WebSty. In the English version, researchers can either upload their own local documents or provide links to online resources. The Polish version is also integrated with the DSpace repository provided by CLARIN-PL.

Figure 2: Choosing linguistic features for analysis

Figure 3: Visualising the results through a dendrogram, showing that Anne Brontë's Agnes Grey (1847) and The Tenant of Wildfell Hall (1848) have more features in common than the older novel Emma (1815) by Jane Austen

Figure 4: Using WebSty to analyse Hungarian text. The visualisation shows clusters of similar texts scaled to 2D space.
