How can social scientists collect and analyze web logs – records of individuals’ browsing behavior – for their own research? In this Methods Bites Instructional Blog Post, Ruben Bach summarizes some key insights from his talk at the MZES Social Sciences Data Lab in December 2019. The blog post discusses how to obtain and extract information from web logs and related data, shows how they can be used for social research, and concludes with a short discussion of how to handle big data extracted from web logs.
How to handle “big data”
Finally, a note on useful tools and prerequisites for analyzing web logs and records of smartphone use. First, the amount of data can quickly exceed the computational capacity of a standard desktop computer. The four months of web log data used in Bach et al. (2019), for example, contained about 38 million observations. With data of this size, researchers may have to consider remote computing services such as DigitalOcean, Amazon Web Services (AWS), or Microsoft Azure, which rent out computational resources at low cost through virtual servers.
Second, understanding the content of a single URL by inspecting it is straightforward. Analyzing thousands of URLs, however, requires text mining and natural language processing (NLP) techniques, for example to select only those URLs that point to news articles. Moreover, in addition to the title of a news article (which can often be read off the URL alone), one might also want to analyze the full content of the article. In such cases, besides automatically extracting the topic of an article through NLP techniques, knowing how to scrape website contents will likely also be helpful. Some useful materials are linked below.
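To make the URL-filtering step more concrete, here is a minimal sketch in Python that uses only the standard library. It is not the procedure used in Bach et al. (2019). The set `NEWS_DOMAINS` and the function names `is_news_url`, `title_words_from_url`, and `news_titles` are hypothetical, and the domains are placeholders; a real project would need a curated list of news outlets and, most likely, proper NLP tooling on top. The sketch keeps only URLs whose host is on the list and then reads candidate title words off the last path segment, which is where many news sites place a hyphen-separated headline.

```python
import re
from urllib.parse import urlsplit

# Hypothetical placeholder domains; replace with a curated list of news outlets.
NEWS_DOMAINS = {"example-news.com", "news.example.org"}


def host_without_www(url):
    """Return the lowercased host of a URL with a leading 'www.' removed."""
    host = urlsplit(url).hostname or ""  # hostname is None if no scheme/host is present
    return host[4:] if host.startswith("www.") else host


def is_news_url(url, news_domains=NEWS_DOMAINS):
    """True if the URL's host is one of the listed news domains."""
    return host_without_www(url) in news_domains


def title_words_from_url(url):
    """Guess title words from the last path segment (the 'slug') of a URL."""
    segments = [s for s in urlsplit(url).path.split("/") if s]
    if not segments:
        return []
    # Drop a common file extension, then split the slug on hyphens and underscores.
    slug = re.sub(r"\.(html?|php)$", "", segments[-1], flags=re.IGNORECASE)
    return [w.lower() for w in re.split(r"[-_]+", slug) if w.isalpha()]


def news_titles(urls):
    """Lazily yield (url, title_words) for news URLs only.

    Because this is a generator, it can be fed a file or database cursor
    and never needs to hold millions of log entries in memory at once.
    """
    for url in urls:
        if is_news_url(url):
            yield url, title_words_from_url(url)
```

Calling `list(news_titles(urls))` on a handful of logged URLs is a quick way to check whether the slug heuristic fits the sites in a given sample. Outlets that use numeric article IDs instead of readable slugs will return empty word lists, and for those the article text itself would have to be scraped.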
About the presenter
Ruben Bach is a postdoctoral researcher at the University of Mannheim, focusing on social science quantitative research methods. His interests include topics related to big data in the social sciences, machine learning, causal inference, and survey research.
References
Bach, R. L., C. Kern, A. Amaya, F. Keusch, F. Kreuter, J. Heinemann, and J. Hecht. 2019. “Predicting Voting Behavior Using Digital Trace Data.” Social Science Computer Review. https://doi.org/10.1177/0894439319882896.
Chancellor, S., and S. Counts. 2018. “Measuring Employment Demand Using Internet Search Data.” In Proceedings of the 2018 CHI Conference on Human Factors in Computing Systems, 1–14. CHI ’18. New York, NY, USA: ACM.
Cornesse, C., A. G. Blom, D. Dutwin, J. A. Krosnick, E. D. De Leeuw, S. Legleye, J. Pasek, et al. 2020. “A Review of Conceptual Approaches and Empirical Evidence on Probability and Nonprobability Sample Survey Research.” Journal of Survey Statistics and Methodology. https://doi.org/10.1093/jssam/smz041.
Dvir-Gvirsman, S. 2017. “Media Audience Homophily: Partisan Websites, Audience Identity and Polarization Processes.” New Media & Society 19 (7): 1072–91.
Flaxman, Seth, Sharad Goel, and Justin M Rao. 2016. “Filter Bubbles, Echo Chambers, and Online News Consumption.” Public Opinion Quarterly 80 (S1): 298–320.
Fourney, Adam, Ryen W. White, and Eric Horvitz. 2015. “Exploring Time-Dependent Concerns About Pregnancy and Childbirth from Search Logs.” In Proceedings of the 33rd Annual ACM Conference on Human Factors in Computing Systems, 737–46. CHI ’15. New York, NY, USA: ACM. https://doi.org/10.1145/2702123.2702427.
Ginsberg, Jeremy, Matthew H Mohebbi, Rajan S Patel, Lynnette Brammer, Mark S Smolinski, and Larry Brilliant. 2009. “Detecting Influenza Epidemics Using Search Engine Query Data.” Nature 457 (7232): 1012–4.
Guess, Andrew M, Brendan Nyhan, and Jason Reifler. 2020. “Exposure to Untrustworthy Websites in the 2016 US Election.” Nature Human Behaviour, 1–9.
Hinds, J., and A. N. Joinson. 2018. “What Demographic Attributes Do Our Digital Footprints Reveal? A Systematic Review.” PLoS One 13: 1–40.
Kreuter, Frauke, Georg-Christoph Haas, Florian Keusch, Sebastian Bähr, and Mark Trappmann. 2019. “Collecting Survey and Smartphone Sensor Data with an App: Opportunities and Challenges Around Privacy and Informed Consent.” Social Science Computer Review. https://doi.org/10.1177/0894439318816389.
Lazer, David, Ryan Kennedy, Gary King, and Alessandro Vespignani. 2014. “The Parable of Google Flu: Traps in Big Data Analysis.” Science 343 (6176): 1203–5.
Mercer, Andrew W., Frauke Kreuter, Scott Keeter, and Elizabeth A. Stuart. 2017. “Theory and Practice in Nonprobability Surveys: Parallels between Causal Inference and Survey Inference.” Public Opinion Quarterly 81 (S1): 250–71. https://doi.org/10.1093/poq/nfw060.
Möller, Judith, Robbert Nicolai van de Velde, Lisa Merten, and Cornelius Puschmann. 2019. “Explaining Online News Engagement Based on Browsing Behavior: Creatures of Habit?” Social Science Computer Review, 0894439319828012. https://doi.org/10.1177/0894439319828012.
Peterson, Erik, Sharad Goel, and Shanto Iyengar. 2018. “Echo Chambers and Partisan Polarization: Evidence from the 2016 Presidential Campaign.”
Revilla, Melanie, Carlos Ochoa, and Germán Loewe. 2017. “Using Passive Data from a Meter to Complement Survey Data in Order to Study Online Behavior.” Social Science Computer Review 35 (4): 521–36.
Stephens-Davidowitz, Seth. 2014. “The Cost of Racial Animus on a Black Candidate: Evidence Using Google Search Data.” Journal of Public Economics 118: 26–40.
White, Ryen W., and Eric Horvitz. 2009. “Cyberchondria: Studies of the Escalation of Medical Concerns in Web Search.” ACM Trans. Inf. Syst. 27 (4): 23:1–23:37. https://doi.org/10.1145/1629096.1629101.