It was in 2008 when we started to think about web services for the eHumanities. At that time, many doubts were expressed: Can web service technology deal with large amounts of data? How can we build asynchronous workflows? Today, most of these questions have been answered one way or another.
Surprisingly, it became clear that web service technology is rock solid and copes with the largest amounts of data we have in the humanities. We used the Baden-Württemberg Grid as well as the Computing Centre of the Max Planck Society in Garching for annotating big text corpora. In both cases, the web services ran for approximately three weeks without any failure. In the scope of the project, we tested the WebLicht web services again with large amounts of artificial data; in the end, only the hardware and the bandwidth limited the amount of data we could process. One reason for these scaling capabilities is the streaming character of the REST-style web services. With these results in mind, the CLARIN-D project started to integrate the first web services for processing and analyzing audio data into the WebLicht web service environment. The MPI Nijmegen, BAS Munich, HZSK Hamburg and the University of Tübingen are involved in implementing web services for automatic segmentation (WebMaus), conversion and visualization (sorry: playing) of audio data.
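To illustrate the streaming point, here is a minimal sketch of what such a call can look like from the client side. The endpoint URL, file names and parameters are purely illustrative assumptions, not the actual WebLicht interface; the point is only that neither the request body nor the response ever has to fit into memory at once.

```python
import requests

# Hypothetical WebLicht-style annotation endpoint -- URL is an assumption
# for illustration, not the real service address.
SERVICE_URL = "https://example.org/weblicht/tokenizer"

def annotate_corpus(in_path: str, out_path: str) -> None:
    """Stream a large corpus through a REST-style annotation service.

    Passing a file object as the request body makes `requests` stream the
    upload instead of loading the whole corpus into memory; `stream=True`
    does the same for the response.
    """
    with open(in_path, "rb") as src:
        with requests.post(SERVICE_URL, data=src, stream=True, timeout=3600) as resp:
            resp.raise_for_status()
            with open(out_path, "wb") as dst:
                # Write the annotated result chunk by chunk (1 MiB at a time).
                for chunk in resp.iter_content(chunk_size=1 << 20):
                    dst.write(chunk)

if __name__ == "__main__":
    annotate_corpus("corpus.tcf.xml", "corpus.annotated.tcf.xml")
```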
Another point of attention in the last years has been the use of workflow engines to combine individual web services into chains or workflows. If you google around, you will find a lot of workflow engines that are suitable for scientific research. Again, it became clear that most of the generic workflow engines are very flexible: integrating our WebLicht web services into them was a matter of minutes (we tried out Kepler, VisTrails and Taverna). As a result, we can say that interoperability in the SOA world is approached on the level of web services, not on the level of workflow engines.
Interoperability in a SOA's World
To establish interoperability on the level of web services, two things are necessary:
A common metadata format for describing web services
There are some common approaches for describing a web service: WSDL files in the SOAP world and WADL files in the REST-style world are the most widespread ones. In addition, the Web Service Core Schema from Menzo Windhouwer can be used to describe a web service independently of its technology with a CMDI file. In fact, you only need such a CMDI file: put it in a CLARIN repository and make it harvestable via OAI-PMH, and ten minutes later your web service is an integral part of the WebLicht environment. And there is nothing that keeps you from using several formats for describing a web service at the same time.
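As a rough illustration of the harvesting side, the sketch below fetches service descriptions from a repository via OAI-PMH. The repository URL and the metadataPrefix value are assumptions made for the example; the actual values depend on the repository in question.

```python
import requests
import xml.etree.ElementTree as ET

# Hypothetical CLARIN repository endpoint -- the URL and the exact
# metadataPrefix are assumptions; check the repository's documentation.
OAI_ENDPOINT = "https://repo.example.org/oai"

def harvest_service_descriptions(metadata_prefix: str = "cmdi"):
    """List CMDI records (e.g. web service descriptions) via OAI-PMH."""
    params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    resp = requests.get(OAI_ENDPOINT, params=params, timeout=60)
    resp.raise_for_status()
    root = ET.fromstring(resp.content)
    ns = {"oai": "http://www.openarchives.org/OAI/2.0/"}
    for record in root.findall(".//oai:record", ns):
        identifier = record.findtext(".//oai:identifier", namespaces=ns)
        print("harvested description:", identifier)

if __name__ == "__main__":
    harvest_service_descriptions()
```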
A processing format for sending data from one web service to the next one
If individual web services are chained together into a workflow, the output of one web service must be usable as the input of the next web service in the chain. Therefore, a common data format for sending data from one web service to the next one is needed. When we started to develop WebLicht in 2008, none of the existing data formats were suitable for this purpose. Therefore, we developed the "Text Corpus Format" (TCF) to send data from one WebLicht web service to the next one (a minimal sketch of such a chain follows the list below). Let me emphasize: the TCF format was never meant to be a new *standard* for annotated text corpora. Right from the beginning, it was designed to be used inside a SOA environment. For this reason, the development of the TCF format follows some rules:
- Already existing layers will never be changed again. This ensures backward compatibility.
- But if necessary, new layers are added on the fly.
- Only one layer per type is allowed inside a TCF file.
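The following sketch shows the chaining idea in code. The two service URLs, the content type and the stripped-down TCF-like document are simplifications made for illustration; real TCF uses the D-Spin namespaces and richer attributes, and the actual WebLicht endpoints differ.

```python
import requests

# Hypothetical service endpoints -- not the actual WebLicht URLs.
TOKENIZER_URL = "https://example.org/services/tokenizer"
TAGGER_URL = "https://example.org/services/postagger"

# A deliberately simplified TCF-like input document with a single text layer.
TCF_INPUT = """<?xml version="1.0" encoding="UTF-8"?>
<TextCorpus lang="de">
  <text>Karin fliegt nach New York.</text>
</TextCorpus>
"""

def run_chain(tcf: str) -> str:
    """Send a TCF document through a two-step annotation chain.

    Each service reads the layers it needs and returns the document with
    exactly one new layer added (tokens, then POS tags), so the output of
    one step is directly usable as input to the next.
    """
    headers = {"Content-Type": "text/xml"}  # simplified; actual media type may differ
    for url in (TOKENIZER_URL, TAGGER_URL):
        resp = requests.post(url, data=tcf.encode("utf-8"),
                             headers=headers, timeout=600)
        resp.raise_for_status()
        tcf = resp.text  # the result of this step feeds the next one
    return tcf

if __name__ == "__main__":
    print(run_chain(TCF_INPUT))
```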
In the past five years, many new layers representing linguistic annotations were added to the TCF schema. Especially in collaboration with the expert working group F-AG 7 "Computational Linguistics" of CLARIN-D, missing layers were developed and integrated.
Challenges in the Future
In the past, several national CLARIN projects have established web-service-based research environments. It is now time to evaluate how these environments can be combined together so that each one can benefit from the other. Experience shows that this is not trivial, but it is also not rocket science. Second, more work has to be done on the graphical user interfaces. Generic workflow engines like the ones mentioned above are complicated to use, and they do not scale well to complex workflows. For casual or inexperienced users, specialized interfaces are easier to use.
Another field for new developments is the integration of web services into other (web) applications. The modularized approach of a SOA could prevent developers from reinventing the wheel again and again.
And last but not least, there are some interesting new technologies on the horizon. Especially WebSockets, which establish a true bidirectional channel between server and client, could play a big role in the next generation of server-side tools.
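To make the WebSocket idea concrete, here is a minimal client sketch using the third-party `websockets` package. The endpoint and message format are purely illustrative assumptions; no such WebLicht service exists yet.

```python
import asyncio
import websockets  # third-party package: pip install websockets

# Purely illustrative endpoint -- an assumption, not an existing service.
WS_URL = "wss://example.org/weblicht/live-annotation"

async def live_annotate(sentences):
    """Push sentences to a server-side tool and receive annotations as they
    become available, over a single bidirectional channel."""
    async with websockets.connect(WS_URL) as ws:
        for sentence in sentences:
            await ws.send(sentence)
            annotation = await ws.recv()
            print(annotation)

if __name__ == "__main__":
    asyncio.run(live_annotate(["Karin fliegt nach New York."]))
```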
A workshop to exchange information and to make plans?
Dear Thomas,
Thanks a lot for the post. I agree that “It is now time to evaluate how these environments can be combined together so that each one can benefit from the other.”
At UPF (Barcelona), we will be delighted to share what we have done in terms of web services and workflows and to evaluate how to combine it with what other groups have done. I guess the Prague meeting next October could be a good opportunity to have a workshop on that point, don’t you think?