Session 13

13. Looking Beyond Twitter Data: Methodical Challenges of Unconventional Large Digital Datasets

Digitization gives rise to ever new large digital datasets measuring some form of human activity. Such datasets can comprise social media posts, hold information about transactions on digital platforms, show details of users’ interactions with websites or map the GPS traces of mobile digital devices. Social scientists increasingly use such new data sources to investigate phenomena like public opinion, political discourse, economic changes or mobility patterns. While the systematic measurement of human activities has traditionally been the remit of scientific bodies or public agencies, these new large-scale datasets are predominantly created by private organizations. They thus combine the well-known challenges of secondary data (1) with new ones resulting from the fact that the data are private property (2).

(1) The purposes of data collection and the structure of the data themselves differ from the purposes of the research projects in which they are used. Researchers interested in social phenomena have to identify existing datasets that might fit their purpose, assess their qualities with respect to the research questions and often devise custom transformations to prepare the data for the intended form of analysis.

(2) These datasets are usually owned by private companies that tightly control access to them. Additionally, information about data structure is regularly limited and cannot be verified. Social scientists interested in using these datasets for academic research thus have to negotiate with these companies and find auxiliary sources of information about important characteristics, like the sample structure or the details of measurement.

A few of these new types of large digital data, notably those provided by Twitter, are becoming established as sources of social research. Researchers using these established datasets can rely on best practices for data handling and find answers to their methodical questions in an ongoing scientific debate about their chosen form of data. Other large digital datasets are only used sporadically for research, for example datasets created by programmatic advertising platforms or APIs provided by lesser-known digital platforms. Researchers interested in using such unconventional datasets have to answer basic methodical questions mostly on their own. Common questions include:

(a) Which large digital datasets can help us answer the specific research question?

(b) Where can we find information about data structure? Which assumptions about users, their relations and their activities lie behind the data?

(c) How can we get access to the data? How can we retrieve it? Which transformations are necessary to prepare the data for the intended analysis?

(d) How can we evaluate data quality?

(e) Which aspects of research ethics have to be considered?

This session invites social scientists working with new and unconventional large digital datasets to discuss these questions and possible answers.

The discussion will focus on the identification, access, evaluation and transformation of large digital datasets. Contributions should preferably use specific examples to introduce methodical challenges and discuss the solutions found for them in individual research projects.