
What the IMPACT K-Centre can do for you


Blog post written by Mikel Iruskieta (member of the CLARIN Knowledge Sharing Infrastructure Committee) describing a sample research case that was successfully conducted thanks to the services offered by the IMPACT Centre of Competence, the CLARIN K-centre in digitisation.


IMPACT Centre of Competence

Note: The work described in this blog post was undertaken thanks to the collaboration with the IMPACT Centre of Competence (www.digitisation.eu).

Digitizing archival and historical material can be problematic for researchers for a number of reasons. One of them is the presence of gaps and empty spaces around and between blocks of text. This became apparent in my recent analysis of the most frequent words in Pulgarcito, a Cuban illustrated literary journal for children written between 1919 and 1920. Pulgarcito has been digitized and is available online: http://imagenes.sld.cu/download/pulgarcito/volumen-2.pdf.

The journal contains rich material: drawings, photographs, fairy tales, comic strips, legends, poems, fables, anecdotes and paintings for children. It is particularly interesting because the text throughout the publication is variously typed, handwritten and drawn, which makes the digitized PDF quite challenging to process.

The aim of the task was to apply NLP tools to the text of an image-based digitized book that also includes handwritten text.

A commercial OCR product gave very poor results, so I approached the IMPACT Centre of Competence (www.digitisation.eu) for support. I needed an OCR system that would deliver good-quality text recognition in a machine-readable format (e.g. XML, TXT). They answered my enquiry promptly, and just over a week later I received the journal in plain text and in XML (both with some OCR errors).

The example below illustrates the conversion of handwritten and typed text from the PDF into plain text. Note how many different forms of the letter “A” appear on this page:

CUANDO UN N1NO
«5Í POEAA?
M$&ECE, UMBETRAfO
conminas y cía

Most of the lines were recognized with some errors: for example, the word “NIÑO” (child) was identified as “N1NO”, and the words “UN RETRATO” (a picture) were not split and came out as “UMBETRAfO”. Finally, the last line was not detected at all.

As expected, the results were better where the text was typed, as the example below illustrates.


                                  
NoBin_pul-074.txt:

vos pajaritos! Tienen, a su modo, las mismas atenciones, cariños y
cuidados que tiene el hombre con sus hijos.
Sienten a su modo lo mismo que vuestros padres sienten por us-
tedes; por eso es tan inhumano destruir esos nidos o encerrar a cual-
quier pájaro en una jaula que por ser muy dorada, no dejará de ser
una prisión para él, nacido para cantar libremente ccmo un poeta
del ensueño que volase entre el cielo y la tierra. Al contrario. Fa-
bricad vosotros mismos nidos, e instalad pequeñas fuentes en vues-
tro jardín. Tendréis así todos los pájaros y todos los cantos. Y cuan-
do llegue la época de las crías, regad motitas de algodón, como ha-
cen en los grandes parques los niños de otras ciudades. No olvidéis
que estos amigos alados tienen, como vosotros, su hogar, sus hijos,
la dulce encantadora libertad por la cual han venido luchando to-
dos los hombres desde que la tierra recibió; allá, en la noche de los
tiempos, el primer beso del sol.

Bin_pul-074.txt:

vos pajaritos! Tienen, a su modo, las mismas atenciones, cariños y
4
cuidados que tiene el hombre con sus hijos.
Sienten a su modo lo mismo que vuestros padres sienten por us-
tedes; por eso es tan inhumano destruir esos nidos o encerrar a cual-
quier pájaro en una jaula que por ser muy dorada, no dejará de ser
una prisión para él, nacido para cantar libremente ccmo un poeta
del ensueño que volase entre el cielo y la tierra. Al contrario. Fa-
bricad vosotros mismos nidos, e instalad pequeñas fuentes en. vues-
tro jardín. Tendréis así todos los pájaros y todos los cantos. Y cuan-
do llegue la época de las crías, regad motitas de algodón, como ha-
cen en los grandes parques los niños de otras ciudades. No olvidéis
que estos amigos alados tienen, como vosotros, su hogar, sus hijos,
la dulce encantadora libertad por la ‘cual han venido luchando to-
dos los hombres desde que la tierra recibió; allá, en la noche de los
tiempos, el primer beso del sol.
...a veces discuten acaloradamente entre sí...
O-O-O-O'O'O-O - $-0.0-0.0-0-0 -0*0
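The two outputs above differ in only a handful of places, which is easy to miss by eye. A minimal sketch using Python's standard difflib module can list the lines that differ; the function name is illustrative, and the sample lines are copied from the two outputs.

```python
import difflib


def differing_lines(lines_a, lines_b):
    """Return ndiff-style lines that appear in only one of the two inputs."""
    diff = difflib.ndiff(lines_a, lines_b)
    # ndiff prefixes: "- " only in a, "+ " only in b, "? " hint lines (skipped).
    return [d for d in diff if d.startswith(("- ", "+ "))]


no_bin = ["bricad vosotros mismos nidos, e instalad pequeñas fuentes en vues-",
          "tro jardín. Tendréis así todos los pájaros y todos los cantos. Y cuan-"]
with_bin = ["bricad vosotros mismos nidos, e instalad pequeñas fuentes en. vues-",
            "tro jardín. Tendréis así todos los pájaros y todos los cantos. Y cuan-"]

for line in differing_lines(no_bin, with_bin):
    print(line)
```

Only the first pair of lines is reported, because the second pair is identical in both outputs.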

IMPACT obtained the results above using the following methods:

  • All images were extracted from the PDF with the pdfimages tool on Linux.
  • The digitization was done with the FineReader 11 SDK.
  • OCR was run with the FineReader 11 SDK using the Spanish language setting and both normal and handprinted text types, with output in ALTO XML and plain text (Text Unicode Defaults).
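The first step in the list can be scripted. The following is a minimal sketch, assuming the pdfimages command-line tool from Poppler is installed; the output prefix "pul", the PNG output option and the function names are illustrative choices, not settings reported by IMPACT. The FineReader SDK step is not shown.

```python
import subprocess
from pathlib import Path


def build_pdfimages_command(pdf_path, out_dir, prefix="pul"):
    """Return the argument list for Poppler's pdfimages.

    The prefix "pul" and PNG output are assumptions made for illustration.
    """
    image_root = Path(out_dir) / prefix
    return ["pdfimages", "-png", str(pdf_path), str(image_root)]


def extract_images(pdf_path, out_dir, prefix="pul"):
    """Run pdfimages and return the extracted image paths, sorted by name."""
    Path(out_dir).mkdir(parents=True, exist_ok=True)
    subprocess.run(build_pdfimages_command(pdf_path, out_dir, prefix), check=True)
    # pdfimages names its output <prefix>-NNN.<ext>.
    return sorted(Path(out_dir).glob(f"{prefix}-*.png"))
```

Calling extract_images("volumen-2.pdf", "pages") would write one image file per embedded image into the pages directory; each file can then be passed to the OCR engine.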

Once we had the image-based digitized book in TXT format, we used ANALHITZA (Otegi et al. 2017), a tool created in collaboration with the Spanish CLARIN K-centre, to extract words and their frequencies, identify proper nouns (NERC) and extract word sequences (n-grams), among other things.

The text analysis results were as follows:

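| Freq. | Nouns | Freq. | Adjectives |
|---|---|---|---|
| 255 | niño | 160 | bueno |
| 194 | año | 124 | gran |
| 159 | hombre | 99 | grande |
| 154 | día | 75 | nuevo |
| 148 | padre | 62 | viejo |
| 148 | rey | 57 | blanco |
| 134 | hijo | 51 | pobre |
| 131 | vez | 48 | mayor |
| 114 | libro | 48 | largo |
| 106 | casa | 45 | azul |
| 103 | tiempo | 44 | mejor |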
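
Counts like these can be approximated with a few lines of Python. This is a minimal sketch and not ANALHITZA's method: it counts surface word forms only, whereas the table lists words by part of speech, which requires a tagger. The function name and the sample sentence are illustrative.

```python
import re
from collections import Counter

# Runs of letters, including accented Spanish characters; digits are excluded.
WORD_RE = re.compile(r"[^\W\d_]+")


def word_frequencies(text):
    """Count lowercased word forms in text."""
    return Counter(word.lower() for word in WORD_RE.findall(text))


sample = "La niña y el niño. El niño lee un libro."
for word, freq in word_frequencies(sample).most_common(3):
    print(freq, word)
```

Because OCR errors such as “N1NO” are split or dropped by the letter-only pattern, counts computed this way on uncorrected text will understate the true frequencies.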

A sample of the NERC results (LOC stands for “location”, PER for “person”):


| Freq. | W1 | Type |
|---|---|---|
| 8 | alemania | LOC |
| 2 | dinamarca | LOC |
| 2 | alejandro | PER |
| 1 | 16 de mayo de 1703 | DATE |
| 1 | cataluña | LOC |

We then extracted the most frequent bigrams (“P” stands for preposition, “D” for determiner, “C” for connector, “V” for verb, “N” for noun):


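| Freq. | w1 | Cat | w2 | Cat |
|---|---|---|---|---|
| 846 | de | P | el | D |
| 692 | en | P | el | D |
| 565 | a | P | el | D |
| 388 | y | C | el | D |
| 245 | de | P | su | D |
| 229 | por | P | el | D |
| 226 | el | D | que | Q |
| 224 | todo | D | el | D |
| 206 | con | P | el | D |
| 204 | a | P | su | D |
| 202 | que | C | el | D |
| 201 | de | P | uno | D |
| 165 | ser | V | el | D |
| 151 | el | D | niño | N |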
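
Bigram counting itself is simple to sketch. The example below is an assumption-laden illustration rather than ANALHITZA's implementation: it pairs adjacent surface forms, ignores sentence boundaries, and assigns no categories, whereas the table works with lemmas (note “ser”, “uno”) and part-of-speech labels.

```python
import re
from collections import Counter

# Runs of letters, including accented Spanish characters; digits are excluded.
WORD_RE = re.compile(r"[^\W\d_]+")


def bigram_frequencies(text):
    """Count adjacent pairs of lowercased word forms."""
    words = [w.lower() for w in WORD_RE.findall(text)]
    # zip pairs each word with its successor; the last word has none.
    return Counter(zip(words, words[1:]))


sample = "El niño lee. El niño canta y el pájaro canta."
for (w1, w2), freq in bigram_frequencies(sample).most_common(2):
    print(freq, w1, w2)
```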

We then used Voyant Tools (Sinclair and Rockwell, 2016) to visualize the data in a more user-friendly way. The result was a word cloud of the entire book.

A further analysis of the word "niña" (girl) as a Key Word in Context (KWIC), extracted with Voyant Tools, can be used to show how girls were characterized in 1920, or to study the gender agreement (feminine) between the article and the noun:

| Left | Term | Right |
|---|---|---|
| tenía, a su vez, una | niña | , que era dulce y bon |
| las excelentes cualidades de aquella | niña | . La encomendó las tareas más |
| pies a cabeza. La pobre | niña | todo lo sufría con paciencia |
| g n ■w- canzaria. La | niña | perdió uno de sus zapatos |
| meses regalaremos al niño o | niña | que mayor número de ellas |
| ha pensado mucho en la | niña | ! El dice que siempre que |
| y escribe mejor- Y la | niña | se va, se va despacio |
| tropieza con todo! Pero la | niña | no se ha des- pertado |
| de olor: y es una | niña | de sombrero colorado, que trae |
| hoy en casa por mi | niña | ”, le dijo su padre, “y |
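
A KWIC listing of this kind can be sketched in plain Python. This is not how Voyant Tools works internally; it is a minimal illustration that splits on whitespace, so punctuation stays attached to its word, and the function name and default context width are assumptions.

```python
def kwic(text, term, width=5):
    """Return (left, term, right) tuples with up to `width` tokens of context."""
    tokens = text.split()
    rows = []
    for i, token in enumerate(tokens):
        # Strip surrounding punctuation before comparing with the search term.
        if token.strip(".,;:!?¡¿\"'“”()").lower() == term.lower():
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            rows.append((left, token, right))
    return rows


sample = "pies a cabeza. La pobre niña todo lo sufría con paciencia"
for left, hit, right in kwic(sample, "niña"):
    print(f"{left} | {hit} | {right}")
```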

The analysis described above shows that the extracted text still contains many errors and should be carefully checked and corrected to obtain more reliable data. Even so, the overall task was fast and efficient, and it made it possible to ask interesting research questions. The next step is to follow the Programming Historian lesson on the subject and see whether the text can be cleaned of OCR errors using regular expressions (Turner-O'Hara 2013):
https://programminghistorian.org/en/lessons/cleaning-ocrd-text-with-regular-expressions
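
As a first idea of what such cleaning might look like, here is a minimal sketch with three conservative rules. It is not taken from the Programming Historian lesson; the function name and the rules are assumptions chosen to match the error types seen in the samples above (words hyphenated across line breaks, stray lines without letters, irregular spacing).

```python
import re


def clean_ocr_text(text):
    """Apply a few conservative regex fixes to OCR output."""
    # Rejoin words hyphenated across a line break: "us-\ntedes" -> "ustedes".
    text = re.sub(r"(\w)-\s*\n\s*(\w)", r"\1\2", text)
    # Drop lines that contain no letters at all (stray page numbers, rules).
    lines = [ln for ln in text.splitlines() if re.search(r"[^\W\d_]", ln)]
    # Collapse runs of spaces and tabs inside each remaining line.
    return "\n".join(re.sub(r"[ \t]+", " ", ln).strip() for ln in lines)


sample = "padres sienten por us-\ntedes; por eso\n4\n  cuidados   que tiene"
print(clean_ocr_text(sample))
```

The first rule also joins genuine hyphenated compounds that happen to fall at a line end, and none of the rules repairs character confusions such as “N1NO”, so the output still needs manual checking.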

REFERENCES

Otegi, A., Imaz, O., Díaz de Ilarraza, A., Iruskieta, M., Uria, L. 2017. ANALHITZA: a tool to extract linguistic information from large corpora in Humanities research. Procesamiento del Lenguaje Natural 58: 77-84.

Pulgarcito Volumen No 2 - No 1 - 1920. URL: http://iiif.sld.cu/coleccion/07/06/2017/pulgarcito-volumen-no-2-no-1-1920 [January 10, 2019]

Sinclair, S., Rockwell, G. 2016. Voyant Tools. URL: http://voyant-tools.org/ [September 5, 2016]

Turner-O'Hara, L. 2013. Cleaning OCR’d text with Regular Expressions. URL: https://programminghistorian.org/en/lessons/cleaning-ocrd-text-with-regular-expressions [January 10, 2019]