Cäcilia Zirn, Eike Mark Rinke, Charlotte Löb, Hartmut Wessler
Big data research as interaction between topic models and expert data: A new approach to capturing national online debates

3rd GESIS Computational Social Science Winter Symposium 2016, Cologne, Germany, November 30th to December 1st, 2016

Topic models have proven to be one of the most promising suites of techniques in the computational social sciences across a variety of disciplines such as political science, sociology, and communication and journalism research (Günther & Quandt, 2016). In these fields, they have proven particularly useful for the automated estimation of thematic structures in mediated debates on different social levels and on different media platforms. One problem in such applications that has received little treatment thus far is that the thematic structures identified via topic models (i.e., the “cluster solutions”) in many cases depend on arbitrary decisions of the researchers: a human interpretation of the resulting cluster solutions is inevitable (Chang et al., 2009). This means that any inference about thematic content structures based on topic models needs additional validation. But the primary researchers often lack the social domain knowledge necessary for accurate validation, especially in unfamiliar or obscure geographic and thematic contexts. In this contribution, we present a practical approach to solving this basic problem in the application of topic models in the social sciences. Specifically, we aim to develop a valid method of capturing the thematic structure of mediated debates on different internet platforms and in different national contexts. The proposed approach combines crawler-generated big data with qualitative data generated in semi-structured expert surveys. Central to this approach is the use of topic models for capturing thematic structures of national media publics that may be more or less integrated. Our approach should ultimately allow us to produce valid cross-national comparisons of such publics with the aim of estimating the degree of (non-)overlap between “national” online debates.
We illustrate the approach with a case study in which we combine national data from surveys of media and journalism experts from six countries (Australia, Germany, Lebanon, Switzerland, Turkey, and the U.S.) with a massive dataset of all text published by the 101 nationally preeminent political websites (news websites and blogs) within a full calendar year (1 August 2015 to 31 July 2016, N = 1.6 million articles). In the case study, we show how our new approach may be used to identify, from a diverse, non-preselected raw text dataset, the topics that appeared on different online platforms and in different national contexts and that belong to the “master topic” of the public role of religion in the societal life of the respective countries. Besides illustrating the general approach to solving the problem of identifying “optimal” topic solutions (Grimmer & King, 2011), we also illustrate an approach to the more specific problem of capturing the themes of online text data that is more flexible and scalable than existing approaches in the social sciences, which draw on keyword lists (Schwotzer, 2014) or sampling procedures (Maier et al., 2014). In our contribution, we plan to discuss the pros and cons of two general approaches to combining topic models with expert knowledge: (a) an ex-ante approach, in which the parameters of the prior probability distribution are manipulated such that the topic solutions are skewed towards the expert-generated topic keywords (similar to SeededLDA, e.g., Andrzejewski et al., 2011); (b) an ex-post approach, in which topic models are first built purely from the analyzed text data, after which topics that include expert-provided keywords are identified through weighting procedures (see, e.g., Arun et al., 2010).
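The two ways of combining topic models with expert knowledge can be made concrete with a small sketch. It is not the implementation used in the case study: the function names, the toy vocabulary, the pseudo-count values, and the 0.5 relevance threshold are all assumptions chosen for illustration. The first function builds an asymmetric topic-by-word prior that favours expert keywords (ex-ante); the second scores already fitted topics by the probability mass they place on those keywords (ex-post).

```python
from typing import Dict, List, Sequence


def seeded_topic_word_prior(
    vocab: Sequence[str],
    seed_words: Dict[int, List[str]],
    num_topics: int,
    base: float = 0.01,
    boost: float = 1.0,
) -> List[List[float]]:
    """Ex-ante: build an asymmetric Dirichlet prior over words per topic.

    Every word receives the small pseudo-count `base`; an expert keyword
    receives `base + boost` in the topic it is assigned to (assumed values).
    """
    index = {word: i for i, word in enumerate(vocab)}
    prior = [[base] * len(vocab) for _ in range(num_topics)]
    for topic, words in seed_words.items():
        for word in words:
            if word in index:  # skip keywords that never occur in the corpus
                prior[topic][index[word]] = base + boost
    return prior


def keyword_mass(
    topic_word: Sequence[Sequence[float]],
    vocab: Sequence[str],
    keywords: Sequence[str],
) -> List[float]:
    """Ex-post: probability mass each fitted topic places on expert keywords."""
    wanted = set(keywords)
    cols = [i for i, word in enumerate(vocab) if word in wanted]
    return [sum(row[i] for i in cols) for row in topic_word]


# Toy data (hypothetical): five-word vocabulary, two expert keywords.
vocab = ["church", "mosque", "budget", "tax", "election"]
expert_keywords = ["church", "mosque"]

# Ex-ante: topic 0 is nudged towards the expert keywords before fitting.
prior = seeded_topic_word_prior(vocab, {0: expert_keywords}, num_topics=2)

# Ex-post: hypothetical topic-word distributions from an unseeded model.
topic_word = [
    [0.40, 0.35, 0.05, 0.05, 0.15],
    [0.02, 0.03, 0.45, 0.40, 0.10],
]
scores = keyword_mass(topic_word, vocab, expert_keywords)
relevant = [k for k, s in enumerate(scores) if s >= 0.5]  # assumed threshold
```

In the ex-ante variant, the resulting matrix would be handed to an LDA implementation that accepts a topic-by-word prior, so that expert knowledge shapes the solution itself. In the ex-post variant, the model is fitted without expert input and the scores only decide which topics are retained for the “master topic”; a plain sum is the simplest possible weighting and stands in here for more refined procedures.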