
#DHH21 theme: Exploration of Society Through the Lens of Labour Market Related Documentation


At the end of May, the 2021 edition of the Helsinki Digital Humanities Hackathon (#DHH21) took place, a summer school supported by CLARIN, DARIAH and the H2020 project SSHOC. This blog post focuses on the CBAQuest team of #DHH21, which built on the work carried out by the WageIndicator Foundation in the context of the SSHOC project. The team explored society through the lens of labour market related documentation, comparing the coverage, style and subjects discussed in collective labour agreements from more than 50 countries.

The WageIndicator Foundation provides access to the WageIndicator Collective Agreements Database that includes 1600 Collective Bargaining Agreements (CBAs) in 28 languages from more than 50 countries. This database, which was enriched within the SSHOC project, gives researchers the ability to read and compare the original texts of agreements by topic, at national and international levels.

Using text mining to discover the secrets of CBAs

The CBAQuest team of the Helsinki Digital Humanities Hackathon 2021, led by Daniela Ceccon and Stefano Ceccon, explored the feasibility of assessing the ‘worker-friendliness’ of CBAs in order to find new ways of understanding agreements and to contribute to improving global labour market transparency. To this end, the team produced a prototype of a digital tool that offers visualized information about CBAs to anyone interested in the documents governing the lives of workers.

The worker friendliness of collective labour agreements was rated by considering the following measurements:

  • Equality is evaluated through the presence of clauses addressing 4 indicators that act as equality triggers: gender equality, discrimination, sexual harassment and grievance procedure.
  • Overtime and annual leave are evaluated by checking whether there are regulations on overtime, whether a travel allowance is provided, and whether the number of days of annual leave after 1 year of work is above the international standard of 15 working days.
  • Text accessibility is evaluated through 3 indicators: concreteness, readability and lexical density, which together indicate how easy it is for workers to understand the contract.
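The equality measurement above comes down to checking whether certain clauses are present in a CBA. A minimal sketch of such a presence check is given below. The keyword patterns, the function names and the English-only matching are assumptions made for illustration; the team's actual lexicon is not given here, and the CBAs themselves span 28 languages.

```python
import re

# Hypothetical keyword patterns, one list per equality indicator.
# These are illustrative assumptions, not the team's actual lexicon.
EQUALITY_PATTERNS = {
    "gender_equality": [r"\bgender equality\b", r"\bequal pay\b"],
    "discrimination": [r"\bdiscriminat\w*"],
    "sexual_harassment": [r"\bsexual harassment\b"],
    "grievance_procedure": [
        r"\bgrievance procedures?\b",
        r"\bcomplaints? procedures?\b",
    ],
}


def equality_indicators(text: str) -> dict[str, bool]:
    """Return, for each indicator, whether any of its patterns occurs in the text."""
    lowered = text.lower()
    return {
        name: any(re.search(pattern, lowered) for pattern in patterns)
        for name, patterns in EQUALITY_PATTERNS.items()
    }


def equality_score(text: str) -> int:
    """Count how many of the four indicators are present (0 to 4)."""
    return sum(equality_indicators(text).values())


# Made-up sample clause, used only to exercise the functions.
sample = (
    "The employer shall not tolerate sexual harassment. "
    "Any discrimination on grounds of sex is prohibited."
)
print(equality_indicators(sample))
print(equality_score(sample))
```

A presence check like this only says that a topic is mentioned, not how well the clause protects workers.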


The figure below shows the gap and relationship between the total number of CBAs, the number of CBAs with equality related indicators, and the number of CBAs that mention both equality related triggers and procedures. The analysed dataset contains 1247 CBAs in total. Of those, 584 contain at least one of the equality related triggers. Among these 584 equality related CBAs, 101 also mention procedure related terms.
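To make the size of these gaps concrete, the three counts can be turned into shares. The short sketch below uses only the figures reported above; the variable and function names are illustrative.

```python
# Counts reported in the text above.
total_cbas = 1247
with_equality_trigger = 584
with_procedure_terms = 101


def share(part: int, whole: int) -> float:
    """Percentage of `whole` represented by `part`, rounded to one decimal."""
    return round(100 * part / whole, 1)


trigger_share = share(with_equality_trigger, total_cbas)
procedure_share_of_trigger = share(with_procedure_terms, with_equality_trigger)
procedure_share_of_total = share(with_procedure_terms, total_cbas)

print(trigger_share)               # 46.8
print(procedure_share_of_trigger)  # 17.3
print(procedure_share_of_total)    # 8.1
```

In other words, a little under half of the analysed CBAs contain an equality trigger, and roughly one in twelve of all CBAs also mention a related procedure.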

The next figure is a stacked bar chart that illustrates the contribution of each equality-measuring indicator to the overall score. Among other things, it shows that sexual harassment is more widely addressed in CBAs than the other three indicators: grievance procedure, gender equality and discrimination. The figure also shows how the overall scores differ between the 20 countries with the highest total score, where a higher score means that a country's CBAs include more clauses addressing the four equality-measuring indicators. For example, Romanian CBAs are more equality oriented than Slovak CBAs.

The map below shows the number of CBAs in the current dataset for each country. The map clearly indicates that many countries have only a small number of CBAs, so the current scores may be somewhat unrepresentative for countries with less data. This problem, however, could be addressed by the continued expansion of the database in the future.

Overtime and Annual Leave

The formula to calculate the score for this measurement was applied to all available CBAs in the dataset. The bar chart below shows the average worker-friendliness scores of CBAs by country, from the perspective of annual leave and overtime work, for the 20 countries with the highest score.
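The exact formula is not reproduced in this post. As a rough illustration of how such a score could be computed and averaged per country, the sketch below assumes an unweighted mean of three binary indicators (overtime rules present, travel allowance present, annual leave above 15 days). The record fields, function names and sample records are hypothetical.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Optional

ANNUAL_LEAVE_STANDARD_DAYS = 15  # international standard named in the text


@dataclass
class CbaRecord:
    """Hypothetical per-CBA record; the field names are assumptions."""
    country: str
    has_overtime_rules: bool
    has_travel_allowance: bool
    annual_leave_days: Optional[int]  # None when the CBA does not state it


def overtime_leave_score(cba: CbaRecord) -> float:
    """Assumed formula: unweighted mean of three binary indicators (0.0 to 1.0)."""
    leave_above_standard = (
        cba.annual_leave_days is not None
        and cba.annual_leave_days > ANNUAL_LEAVE_STANDARD_DAYS
    )
    indicators = [
        cba.has_overtime_rules,
        cba.has_travel_allowance,
        leave_above_standard,
    ]
    return sum(indicators) / len(indicators)


def average_by_country(cbas: list[CbaRecord]) -> dict[str, float]:
    """Average the per-CBA scores within each country."""
    scores = defaultdict(list)
    for cba in cbas:
        scores[cba.country].append(overtime_leave_score(cba))
    return {country: sum(v) / len(v) for country, v in scores.items()}


def top_countries(cbas: list[CbaRecord], n: int = 20) -> list[tuple[str, float]]:
    """Return the n countries with the highest average score, best first."""
    averages = average_by_country(cbas)
    return sorted(averages.items(), key=lambda item: item[1], reverse=True)[:n]


# Made-up records with placeholder country names, for illustration only.
records = [
    CbaRecord("A", True, True, 20),
    CbaRecord("A", True, False, 15),   # exactly 15 days is not above the standard
    CbaRecord("B", False, False, None),
]
print(top_countries(records, n=2))
```

Under this assumed formula, a CBA that is silent on a topic scores the same as one that rules against workers.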

The current results clearly do not give a fully accurate picture. As shown in the graph, some countries such as the UK and Belgium get lower average scores even though they are often perceived as countries with sophisticated systems protecting workers. There are several possible reasons for this observation.

Firstly, the absence of clauses relevant to overtime or annual leave does not always imply low worker friendliness. Some countries may already have clear laws and regulations on these aspects, so CBAs there have no need for sections dedicated to them. The assumption that absence means worker unfriendliness is therefore not always correct. It is important to realize that the score indicates the worker friendliness of the CBAs themselves, without considering their context.

Secondly, the availability of CBAs in the dataset varies across countries. It can be difficult to collect CBAs from countries such as the UK for reasons such as privacy concerns, and a small and biased sample may also affect the accuracy of the final score.

Thirdly, binary indicators have their own limitations. In the current scoring system, the indicators mostly concern the existence of certain clauses, but the existence of a clause says nothing about its quality and does not necessarily mean that the clause is enough to protect workers' rights.

Text Accessibility

This indicator is based on three different measures, namely concreteness, readability and lexical density, which were used to observe how easy it is for workers to understand CBAs. Concreteness refers to the proportion of abstract versus concrete words used in a text. Texts with relatively more concrete words are more accessible than texts with relatively more abstract words. While concreteness measures how much a text refers to concrete objects, readability measures the years of education one would need to understand a piece of text. This is usually evaluated through the number of long words and long sentences. Finally, lexical density looks at language from a semantic perspective and is a measure of the number of different words that are used. The hypothesis is that for a text to be considered readable, it should require fewer years of education and should have low lexical density. The figure below shows the text accessibility scores for various languages, and it can be seen that Slovak CBAs are considerably more reader-friendly than Indonesian CBAs.
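Two of these measures can be approximated with very little code. The sketch below is an illustration under stated assumptions, not the team's implementation. For readability it uses the Flesch-Kincaid grade level, a standard formula based on sentence and word length that the post does not name. For lexical density it follows the description above and uses the share of distinct words (a type-token ratio). The syllable counter is a crude English-only heuristic. Concreteness is left out because it needs an external list of rated words.

```python
import re


def tokenize(text: str) -> list[str]:
    """Lower-case the text and keep runs of ASCII letters (English-only assumption)."""
    return re.findall(r"[a-z]+", text.lower())


def count_sentences(text: str) -> int:
    """Count sentence-ending punctuation runs, with a minimum of one sentence."""
    return max(1, len(re.findall(r"[.!?]+", text)))


def count_syllables(word: str) -> int:
    """Crude heuristic: one syllable per group of consecutive vowels."""
    return max(1, len(re.findall(r"[aeiouy]+", word)))


def flesch_kincaid_grade(text: str) -> float:
    """Approximate years of schooling needed to read the text (Flesch-Kincaid)."""
    words = tokenize(text)
    if not words:
        raise ValueError("text contains no words")
    syllables = sum(count_syllables(word) for word in words)
    words_per_sentence = len(words) / count_sentences(text)
    syllables_per_word = syllables / len(words)
    return 0.39 * words_per_sentence + 11.8 * syllables_per_word - 15.59


def type_token_ratio(text: str) -> float:
    """Share of distinct words among all words (0.0 to 1.0)."""
    words = tokenize(text)
    if not words:
        raise ValueError("text contains no words")
    return len(set(words)) / len(words)


# Made-up sentences, used only to compare a plain and a dense style.
plain = "We pay you. You get leave."
dense = (
    "Remuneration entitlements notwithstanding, supplementary "
    "compensation arrangements necessitate authorisation."
)
for label, text in (("plain", plain), ("dense", dense)):
    print(label, round(flesch_kincaid_grade(text), 1), round(type_token_ratio(text), 2))
```

Because both measures depend on word and sentence length, comparing scores across languages needs care: word formation differs between languages, and the syllable heuristic here only makes sense for English.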

Final Thoughts

In summary, the project identified a number of ways that the ‘worker-friendliness’ of agreements might be measured, making use of text mining methods to analyse and score agreements on various indicators. By using and visualising these scores, the team has been able to find new methods of evaluating agreements at a glance, in ways that might facilitate understanding of these agreements for labour market researchers and workers in general. However, a number of challenges and limitations have been identified that invite further research into the secrets of collective bargaining agreements.


This blog post is an adaptation of the post-event report originally made available on the SSHOC website at this link.