This day-long workshop will teach researchers in the social sciences and humanities basic natural language processing techniques for analysing large volumes of online text (from websites, blogs, forums, social media, etc.). It will form the final day of a three-day conference entitled ‘Digital Youth in East Asia: Theoretical, Methodological, and Technical Issues’, organised by the East Asian Studies research unit at the Université Libre de Bruxelles. English will be the predominant language analysed, so that the tools and methods can be demonstrated via a lingua franca, although resources for Chinese, Korean, and Japanese will also be introduced. A dataset of YouTube comments on popular Korean pop music videos has been collected for the analysis.
Online text is qualitatively different from offline text, and many traditional corpus methods do not translate directly to online material. Moreover, the concept of ‘big data’ is closely connected with the internet, and the sheer scale of online material makes quantitative approaches all the more necessary. How can we find patterns in online text? What are the opportunities, and what are the main challenges and constraints? Such patterns can relate to sentiment, topic, or simple frequencies, all of which the workshop will cover using the Python programming language. The workshop will also introduce participants to CLARIN resources and tools that are relevant to a) the analysis of large volumes of digital discourse and b) East Asian languages.
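As an illustration of the simplest of these patterns, word frequencies, the following is a minimal sketch using only the Python standard library. It is not taken from the workshop materials: the sample comments are invented, and the `tokenize` and `word_frequencies` helpers are hypothetical names introduced here. The regular expression assumes English text; Chinese and Japanese are not written with spaces between words and would need a dedicated word segmenter instead.

```python
import re
from collections import Counter

# Invented sample comments, for illustration only (not the workshop dataset).
comments = [
    "I love this song so much",
    "This song is on repeat all day",
    "Love the choreography in this video",
]


def tokenize(text):
    """Lowercase the text and return its runs of ASCII letters and apostrophes."""
    return re.findall(r"[a-z']+", text.lower())


def word_frequencies(texts):
    """Count how often each token occurs across all of the given texts."""
    counts = Counter()
    for text in texts:
        counts.update(tokenize(text))
    return counts


frequencies = word_frequencies(comments)

# Print the three most frequent tokens with their counts.
for word, count in frequencies.most_common(3):
    print(word, count)
```

On a real corpus, very common function words such as "this" or "the" tend to dominate raw counts, so a stop-word list or a comparison against a reference corpus is usually needed before the frequencies become informative.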
Videos of the workshop can be found on the CLARIN Videolectures channel.
Yin Yin Lu, Oxford Internet Institute, University of Oxford
Martin Wynne, CLARIN ERIC and Bodleian Libraries, University of Oxford
Folgert Karsdorp, Meertens Institute, Royal Netherlands Academy of Arts and Sciences
Mike Thelwall, Statistical Cybermetrics Research Group, University of Wolverhampton
Chico Camargo, Department of Physics, University of Oxford