Big Data in social science research. What matters about the size?

From the picture of the black hole to social profiling, from gene engineering to fast fashion sales, from security and forensic architecture to Cambridge Analytica and global election meddling: both the most promising and the most vicious global advances today tend to be based on big data.

Think about your regular day. Woken by the alarm of a sleep-tracking application, the first thing you probably do is check your email and read the news. You then see what the weather is like and maybe use Google Maps to plan the way to a meeting you ought to attend. On the way, you swipe your bank card to buy a coffee and check in to your local transport network. During the day you might chat with family and friends on social media, take a picture which automatically uploads to the cloud, or even use cloud-based software for work. Or you could be the lucky one who owns an Apple Watch or a fitness tracker. It probably comes as no surprise that all these actions generate data. Now multiply it by the 4.6 billion people who own a smartphone and, voilà, you get the gist!

I am probably not the only one who has noticed that the term ‘Big Data’ is increasingly becoming a flashy expression in social science research. In this article, I will try to distill what Big Data actually is and how it fits into our understanding of empirical research methods.

Firstly, there is no exact definition of big data that goes beyond the plain meanings the words ‘big’ and ‘data’ already have. In fact, the first appearance of Big Data as a subject of discussion dates as far back as 1997, when NASA scientists hit a bottleneck: their data set was so large that their computers could not handle the analysis, and they had to request more resources. Fast-forward 20 years and the understanding of big data is little changed, with popular knowledge sources (aka Wikipedia) defining big data as “an all-encompassing term for any collection of data sets so large and complex that it becomes difficult to process using on-hand data management tools or traditional data processing applications.” The largeness and complexity of big data were further broken down by Doug Laney of Gartner into the 3Vs: volume, variety, and velocity. For lack of a better tool, I believe the 3Vs can be handy in explaining big data from a research methods perspective, especially when comparing the quantitative (positivist) and qualitative (interpretivist) research traditions.

Volume

By definition, and let’s again reaffirm the obvious, big data is characterized by volume. In quantitative research, not having large enough datasets is usually a problem: reliable statistical results require large samples. While small samples can certainly be used, statistical models based on them often produce less conclusive results. Small samples reduce the certainty with which we can make predictions by increasing margins of error and widening confidence intervals. In simpler words, in smaller samples there may be a bigger gap between what we statistically deduce and what is actually the case. Plus, the range of values within which you can be 95% confident the population average lies will be rather wide. As such, small samples can make you think that something is false when in reality it is true (a type II statistical error).
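To make the link between sample size and margin of error concrete, here is a minimal Python sketch. It assumes a normal approximation (z = 1.96 for 95% confidence) and a hypothetical standard deviation of 15; the function name `margin_of_error` and all numbers are illustrative and do not come from any particular study.

```python
import math


def margin_of_error(sd: float, n: int, z: float = 1.96) -> float:
    """Half-width of an approximate 95% confidence interval for a mean."""
    return z * sd / math.sqrt(n)


# Hypothetical population spread, chosen only for illustration.
ASSUMED_SD = 15.0

for n in (10, 100, 1_000, 10_000):
    print(f"n={n:>6}: sample mean +/- {margin_of_error(ASSUMED_SD, n):.2f}")
```

The margin shrinks with the square root of the sample size, so a hundredfold increase in data buys only a tenfold gain in precision. For very small samples the normal approximation is itself optimistic, and a t-based interval would be wider still.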

Another problem with small samples is that the number of characteristics we can use to predict an event will be limited. Many characteristics affect whether a person will enjoy, say, riding bikes: the weather, their age, the country they live in, personality traits such as risk aversion, whether they ever fell off a bike, their BMI, and so on. If we have data from 1,000 people, we can obviously account for more of the factors that predict whether a person enjoys riding bikes. However, if we only ask 10 people, a large number of predictors would result in statistical noise. Hence, larger samples allow us to build better models that account for the complexity of the world.
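The point about statistical noise can be sketched with a small simulation. The code below is an illustration under stated assumptions, not an analysis of real data: the outcome and all 20 "predictors" are random numbers with no relationship to each other, and the helper names `pearson` and `best_noise_correlation` are made up for this example.

```python
import math
import random


def pearson(xs, ys):
    """Plain Pearson correlation coefficient between two equal-length lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def best_noise_correlation(n_people, n_predictors, seed=0):
    """Strongest |correlation| between a random outcome and pure-noise predictors."""
    rng = random.Random(seed)
    outcome = [rng.gauss(0, 1) for _ in range(n_people)]
    best = 0.0
    for _ in range(n_predictors):
        noise = [rng.gauss(0, 1) for _ in range(n_people)]
        best = max(best, abs(pearson(noise, outcome)))
    return best


for n in (10, 1_000):
    r = best_noise_correlation(n, 20)
    print(f"{n:>5} people, 20 noise predictors: best |r| = {r:.2f}")
```

With only 10 people, at least one meaningless predictor will typically look strongly related to the outcome purely by chance; with 1,000 people the same noise stays close to zero.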

Assuming that big data as defined by volume equals a large sample size, it will definitely let us make more precise inferences. But does larger always mean better? The answer is no. First of all, quantity should never come at the cost of quality. In other words, we do not want to fall into the ‘garbage in, garbage out’ trap. Although no sample size is immune to poor-quality data, smaller samples that are selected more thoughtfully and systematically can yield better and more relevant results. Take the Amazon Mechanical Turk platform as an example. If one wants to research doctors’ perceptions of their patients, it would be much more telling to get limited data from real doctors and patients than to analyze extensive amounts of data from lay individuals who are paid to fill in the survey and may not even care to provide an actual answer. In fact, we might not find self-selecting doctors on Amazon Mechanical Turk at all.

In addition to raising data quality concerns, voluminous data require more sophisticated systems, more processing power and most probably some programming skills. As such, once we account for costs, there is only so much that big data can add in terms of extra insights from the pool of zeroes and ones.

Variety

The benefit from big data will most probably be greatest in unobtrusive research, or research which collects publicly available data, be it personal, situational or textual. The data collected from individuals as described at the beginning of this post is of high variety. Besides providing information about individual characteristics, it also captures the context in which a person acts in a natural environment.

In other words, big data can add an account of subjectivity to quantitative research. Although quantitative research is said to be more objective, it has its own problems, from subject bias within an experimental design to limited generalizability. It is also less nuanced and often does not take into account such important aspects as cultural values (e.g. in medical research) or well-being (e.g. in economics). Since much quantitative research is over-reliant on significance levels and p-values, we sometimes get such curiosities as a significant correlation between divorce rates and margarine consumption. Hence, the higher variety in big data gives quantitative researchers a chance to account for vaguer but important contextual factors and so make more accurate inferences about the world. Shout-out to the anthropologists among our readers: big data might also be thick!
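Curiosities of this kind are what one would expect when many relationships are screened against a significance threshold. The following sketch is a simulation under assumptions of my own choosing (1,000 pairs of unrelated random series, 50 observations each, and an approximate test based on the Fisher z-transform); the function names are hypothetical and it is not a reanalysis of the divorce and margarine figures.

```python
import math
import random


def is_significant(xs, ys, z_crit=1.96):
    """Approximate two-sided test of zero correlation at the 5% level.

    Uses the Fisher z-transform, which is an approximation.
    """
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    r = cov / math.sqrt(var_x * var_y)
    return abs(math.atanh(r)) * math.sqrt(n - 3) > z_crit


def false_positive_rate(n_tests=1_000, n_obs=50, seed=1):
    """Share of unrelated random pairs that still pass the significance test."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_tests):
        a = [rng.gauss(0, 1) for _ in range(n_obs)]
        b = [rng.gauss(0, 1) for _ in range(n_obs)]
        hits += is_significant(a, b)
    return hits / n_tests


print(f"Unrelated pairs flagged as significant: {false_positive_rate():.3f}")
```

Roughly one pair in twenty passes the test even though nothing connects the series, which is why a significant p-value on its own says little without context about what else was tested.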

If we look from the perspective of qualitative research, it can be just as beneficial for objectivity and subjectivity to converge. In fact, big data can be used not only for statistical analysis but also for discovery and exploration. Recall that, theoretically, people think in two different but interrelated ways, automatic and conscious (System 1 and System 2). What big data can contribute is making sure that subjective accounts do not become too justificatory while, at the same time, capturing the environmental cues that may affect these subjective actions. In other words, when we acquire real data (e.g. forum discussions, Twitter chatter, online search behavior), we can not only observe what people are thinking but also put it into context (with whom, at what time, etc.).

With large amounts of data that would usually be considered qualitative, researchers can apply quantitative statistical techniques to analyze it with more precision. Moreover, with big data, researchers can apply methods such as automated linguistic and discourse analysis, or machine learning, which makes it possible to forecast, to optimize, and to discover real and more objective patterns and processes.

This sounds almost too good to be true, and of course there are caveats. Online environments may not be representative of the real world: what happens online may be affected by self-presentation or social desirability bias (unless one is using surveillance, which is a topic for a separate discussion). Another problem that arises with variety is selection bias. What to include or exclude is left to the researchers’ discretion, and making sense of data of such high variety can be problematic. Hence, both transparency and clear intentions matter in big-data-driven research, so that it does not end up as a futile data-mining exercise.

Velocity

Lastly, the third V, velocity, is an important bonus of big data. In short, we can record and follow phenomena over time, even at the small time scale of individual interactions, with better precision. The dynamic nature of the digital world, together with historical traceability, can aid in undertaking causal research and performing quasi-experiments relatively cheaply. Additionally, with machine learning, researchers can test discovered patterns against an ever-changing reality and adjust their understanding of phenomena as they naturally evolve. We should, however, be careful that the patterns discovered do not perpetuate inequalities. At the end of the day, inputs that capture already existing inequalities will inevitably reproduce structures that are inherently discriminatory.

And, to add my last two cents on big-data-driven science: most big data is based on user-generated content. As such, let us be cautious about what we feed the algorithms.


WRITTEN BY

Alina Pavlova

Cultural Sociologist. Researcher, yogi, soul-singer.

I write stories about people, places and the nature of things.
