CLARIN-PL Presents CompCorp - A Tool for Comparing the Linguistic Features of Corpora

CompCorp is a user-friendly tool for uploading any two corpora, each packed as a zip archive, and comparing them with regard to selected linguistic features. These features include specific multiword units, grammatical tags (according to the NKJP tagset), vocabulary specific to each corpus, vocabulary that differs across the corpora, proper names, morphosyntactic features of verbs, and statistical features of the corpora. CompCorp can thus be used to detect the linguistic characteristics that any two sets of texts share and those in which they differ.


Many research questions in the Digital Humanities concern text resources, and they can often be reduced to the analysis of vocabulary. Of primary interest is the comparison of the vocabulary occurring in different collections of texts. Such an analysis may require comparing two of the researcher's own text resources, or one text resource with a reference corpus. CompCorp makes it easy to carry out such comparative analyses. It has proven useful and has helped many scholars answer their own research questions.
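
To make the idea of comparing the vocabulary of two text collections concrete, the following is a minimal Python sketch. It is not CompCorp's implementation: the function names `tokenize` and `compare_vocabulary`, the simple regular-expression tokenizer, and the toy Polish sentences are all assumptions made for illustration. It only shows the basic principle of finding vocabulary shared by two corpora and vocabulary specific to each of them.

```python
import re
from collections import Counter


def tokenize(text):
    # Hypothetical tokenizer: lowercase the text and extract runs of
    # word characters (Unicode-aware, so Polish diacritics are kept).
    return re.findall(r"\w+", text.lower())


def compare_vocabulary(texts_a, texts_b):
    # Count word forms in each collection of texts.
    counts_a = Counter(tok for text in texts_a for tok in tokenize(text))
    counts_b = Counter(tok for text in texts_b for tok in tokenize(text))
    vocab_a, vocab_b = set(counts_a), set(counts_b)
    return {
        "shared": vocab_a & vocab_b,   # word forms found in both corpora
        "only_a": vocab_a - vocab_b,   # word forms specific to corpus A
        "only_b": vocab_b - vocab_a,   # word forms specific to corpus B
        "counts_a": counts_a,
        "counts_b": counts_b,
    }


# Toy data standing in for two small corpora (illustrative only).
result = compare_vocabulary(["Ala ma kota.", "Kot ma Alę."], ["Ala ma psa."])
print(sorted(result["shared"]))
print(sorted(result["only_a"]))
print(sorted(result["only_b"]))
```

The sketch works on raw word forms, so inflected variants such as "kot" and "kota" count as different items. A comparison of the kind described above would normally operate on lemmatized and tagged text, and one of the two collections could equally be a reference corpus.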