As we’ve done more work analyzing the spatio-temporal dimensions of social media, a few questions have come up repeatedly.  Leading the charge is the bias of social media: Twitter is just a collection of pre-teens and bots discussing Justin Bieber.  The implication is that social media does not provide a representative sample of the population.  For example, there is likely a lower percentage of grandmothers between the ages of 65 and 85 on Twitter than in the total population.  Further, the geographic penetration of social media is uneven.  In some geographies social media is more popular than in others, and in some locations one social media service is more popular than another: Orkut is huge in Brazil but not popular in the USA.  All of these are accurately stated shortcomings of using social media for analysis.  They also miss the point regarding the power of social media as a data source.

Traditionally, data collection strove to be unbiased by taking random samples of the population.  Part of the underpinning of modern statistics is premised on random sampling and the normal distribution.  It is what we all learn in intro to stats.  Even if you never took stats, we all have the daily news survey ingrained in our heads: “85% of Americans have lost faith in the political system,” followed by the small print stating that the survey is accurate within +/- 5%.  The survey’s participants are most often selected randomly from the phone book, and a few thousand people are contacted.  Each person is treated as equal, and the results are extrapolated to the entire country or globe.
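As a rough illustration of where that small print comes from, here is a minimal sketch of the textbook margin-of-error calculation for a simple random sample. The function names, the 95% confidence level (z = 1.96), and the sample size of 2,000 are assumptions chosen for the example, not details of any particular survey.

```python
import math


def margin_of_error(p: float, n: int, z: float = 1.96) -> float:
    """Half-width of the normal-approximation interval for a proportion.

    p is the observed proportion, n the number of respondents, and z the
    critical value (1.96 is the usual choice for 95% confidence).
    """
    return z * math.sqrt(p * (1.0 - p) / n)


def sample_size_for(moe: float, p: float = 0.5, z: float = 1.96) -> int:
    """Smallest n whose margin of error is at most `moe`.

    p = 0.5 is the worst case, so it is the conservative default.
    """
    return math.ceil(z * z * p * (1.0 - p) / (moe * moe))


if __name__ == "__main__":
    # Hypothetical survey: 85% agreement among 2,000 random respondents.
    print(f"margin of error: +/- {margin_of_error(0.85, 2000):.1%}")
    # Respondents needed for a +/- 5% margin in the worst case.
    print(f"respondents needed for +/- 5%: {sample_size_for(0.05)}")
```

The calculation only holds when every respondent is an independent, equally weighted draw from the population, which is exactly the assumption social media data does not satisfy.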

The problem is not all people are equal, especially in the digital age.  This is the bias in social media: it looks at people as an interconnected network instead of randomly selected isolated individuals.    It is the interconnected bias of social media that makes it such a powerful data collection tool.  Only by embracing this bias can we effectively use social media to collect meaningful data.

Surveys measure the end result of information propagation, but don’t really look at how information is propagated.  Social networks give us the ability to understand how information flows across a population, and why some memes die quickly while others go global.  The difference is largely due to the fact that not all people are equal and the social network is biased.  This also reflects much of reality.  Some people have more influence than others and some voices are louder than others.  Outside of elections there are few times we all have equal and anonymous votes.  There are many days I wish this was not the case, but that is how the cookie crumbles.

Two useful characteristics of data generated by social media are 1) you can often retrieve the entire population of data (e.g. the Twitter firehose) and 2) you can rank users by how connected they are in the social network (e.g. number of followers, number of friends, etc.).  In short, the bias in a social network can be quantified, although the richness of demographic information varies from network to network.
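A minimal sketch of what quantifying that bias could look like: rank users by follower count and measure how much of the total the best-connected users hold. The user names and follower counts below are made up for illustration, and `top_share` is a hypothetical helper, not part of any Twitter API.

```python
def top_share(counts, top_fraction=0.2):
    """Fraction of the total held by the top `top_fraction` of entries."""
    ranked = sorted(counts, reverse=True)
    if not ranked:
        return 0.0
    k = max(1, round(len(ranked) * top_fraction))
    total = sum(ranked)
    return sum(ranked[:k]) / total if total else 0.0


# Hypothetical follower counts; real values would come from the network's data.
followers = {
    "user_a": 12000, "user_b": 3500, "user_c": 400, "user_d": 150,
    "user_e": 90, "user_f": 60, "user_g": 40, "user_h": 25,
    "user_i": 10, "user_j": 5,
}

if __name__ == "__main__":
    # Rank users from most to least connected.
    ranking = sorted(followers.items(), key=lambda kv: kv[1], reverse=True)
    for name, count in ranking[:3]:
        print(name, count)
    share = top_share(followers.values())
    print(f"top 20% of users hold {share:.0%} of all followers")
```

A single number like this is only one possible summary; the same ranking could feed other concentration measures if a finer picture is needed.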

When you quantify the bias in a network, one of the things you recognize quickly is that each social network can be different, but networks typically are not random.  Physicists and other researchers have spent a lot of time over the last decade investigating why networks are not random.  Much as in statistics, the traditional technique for simulating a network was to connect nodes together at random.  Researchers (Barabási et al.) found that when new nodes enter a network they are much more likely to connect to a node that already has a lot of connections than to nodes with only a few.  The concept was labeled preferential attachment, and it results in very skewed, non-random connectivity distributions.  Physicists call these power laws; economists often call the same pattern a Pareto distribution, known colloquially as the 80/20 rule: 80% of the connections in a network are accounted for by 20% of the nodes.  We see almost this exact distribution with the number of followers on Twitter:

(source: http://www.annouckwelhuis.nl/twitter-and-the-pareto-principle-2/)

The same rule holds true for information generation as well as propagation:

(source: http://www.annouckwelhuis.nl/twitter-and-the-pareto-principle-2/)

This means a small subset of users both generates and propagates the vast majority of information across Twitter.  So, if I want to get the pulse of what is shaping opinion, not all users are equal and some opinions are a lot more valuable than others.  This inherent bias in Twitter is also what makes it exceedingly valuable as a monitoring tool.  Often I don’t want to know what everyone thinks, or even worse, a homogenized sample that extrapolates what everyone thinks.  I want to know what the most influential people think: who is shaping the opinions of the entire population.  It is ok that I don’t have a representative sample of grandmothers aged 65-85 because they are not who I want to analyze.  Use the bias of the network to your advantage.  The right tool for the right job, as they say.
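To make the preferential attachment idea concrete, here is a minimal simulation sketch that grows a network one node at a time, once by attaching uniformly at random and once by attaching in proportion to existing connections. It is a toy model under stated assumptions (each new node makes exactly one connection, and the names `grow_network` and `top_share` are invented for this example), not the procedure used in the research cited above, and it should not be expected to reproduce an exact 80/20 split.

```python
import random


def grow_network(n_nodes, preferential, seed=0):
    """Grow a network where each new node links to one existing node.

    Returns the list of node degrees. With preferential=True the target is
    picked with probability proportional to its current degree; otherwise
    it is picked uniformly at random.
    """
    rng = random.Random(seed)
    degree = [1, 1]      # start with two nodes joined by one edge
    endpoints = [0, 1]   # each node appears once per edge it touches
    for new in range(2, n_nodes):
        if preferential:
            # Sampling an edge endpoint picks nodes in proportion to degree.
            target = rng.choice(endpoints)
        else:
            target = rng.randrange(new)
        degree.append(1)
        degree[target] += 1
        endpoints.extend((new, target))
    return degree


def top_share(degrees, top_fraction=0.2):
    """Fraction of all connections held by the best-connected nodes."""
    ranked = sorted(degrees, reverse=True)
    k = max(1, round(len(ranked) * top_fraction))
    return sum(ranked[:k]) / sum(ranked)


if __name__ == "__main__":
    for label, pref in (("random", False), ("preferential", True)):
        deg = grow_network(5000, preferential=pref)
        print(f"{label:>12}: biggest hub has {max(deg)} links, "
              f"top 20% hold {top_share(deg):.0%} of connections")
```

Comparing the two runs shows the qualitative effect: the preferential version produces far larger hubs and a more concentrated distribution than the random one.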

Whether it is in the context of monitoring ferment during the Arab Spring or the response to a new advertising campaign, knowing how information is propagating, and which memes will go global and which will quickly sputter out, is massively valuable.  Realizing this potential means accepting the fact that the network itself is biased and self-reflective.  This, though, is exactly what allows us to answer deep questions that traditional polling and surveys will never answer for us.  The two are definitely not mutually exclusive, but objecting to using social media as a data source because it is biased misses the point of where the value in the data is.  Next time around we’ll talk about the geographic bias of social media, and the challenges of working with it.

 

6 Responses to The Benefits of the Bias in Social Media

  1. Muki Haklay says:

    As much as it is interesting to notice the ones who shout loudest, the bias is more important than what you imply in terms of the analysis. Let’s examine just a few issues:
    1. You assume that opinion formers are very influential on the rest of the population, but social media doesn’t tell you what the silent majority really think about the information that is coming their way. You don’t have information about that. Even the people that ‘like’ things or ‘retweet’ are a (very) small minority, and you can’t say what is going on in the heads of the people that are following. We can all reflect on some people we follow who we think are absolute idiots most of the time, but who produce useful links often enough to justify not dropping them. The assumption that the influence increases as the links increase is reasonable, but you can’t know it without going to the massive silent majority…
    2. An example of the problem with the bias is now being played out in Egypt, where the actual people with power and organisation are not the people who made the noise on Twitter or Facebook. Power and influence in one medium don’t translate so simply to other places (which is why you still need the unbiased analysis to go hand in hand with your biased stuff).
    3. I would argue that what you are looking at are the outliers. Put your 20% of people on an imaginary normal distribution (and I know that this is not the case, but just for the example) and you will realise that you are biased towards the higher end of the distribution, and because the bias is even larger (e.g. the top 1% are doing 5% of the job, in some cases even more) you are really not looking at anything like the general population.
    4. You are arguing that you are looking at the whole population, but that is not true. The noisy top is so noisy that it distorts what is going on amongst the bottom, where maybe more normal things are going on and the pattern can be very different. Also, because consumption is silent in your information, you have to infer consumption, but that is a strong assumption (see point 1).

    So what all that says is that if you are interested in this very specific group and have good reasons to look at what it talks about, and you are acutely aware of the bias throughout the analysis and conclusions, then you can use it. Otherwise, you are making some silly generalisations about a population from outliers…

  2. seagor says:

    Hi Muki,

    Thanks for the feedback, lots of very good points throughout. As in the post, I agree that traditional methods and social media analysis are not mutually exclusive and that one does not replace the other. Specifically, point by point:

    1) There will always be a silent majority that either is not on the social network at all or does not supply any content to the network to evaluate. My point is that there is still value in those that do provide data even if it is biased by the lack of the silent majority. It depends on the question you want to answer, but many times knowing the people that are willing to have a public opinion on a topic has inherent value itself. We need to recognize that bias, but also recognize there is value in the bias when used appropriately.

    2) Agree there is a disconnect between power on a social network like Twitter and the real world. If I wanted to monitor the post-revolution mood of the Egyptian dissenters who used social media to start the revolution, then monitoring Twitter could be a good source of data. There is a bias toward dissenters using Twitter, which makes Twitter a good mechanism for generating data on them. It would not be a good mechanism to monitor the thoughts of the military regime in control though. Although I doubt they are answering surveys either ;-)

    3) Exactly correct that we are monitoring outliers, because in some cases those outliers create the most interesting information. Embracing the fact that they are outliers, and understanding them, helps us better understand the network.

    4) The whole population refers to the whole population of information on the social network not the human population writ large. The researcher needs to understand what the whole population means for each social network, but it does not change the fact the data is available and collected. In the case of Twitter you can have the whole population of Tweets generated, but as you correctly point out not how they are consumed. It all depends on what you are measuring. Regardless, the fact an entire corpus of data can be tapped is a big change from doing small samples.

    My hopeful point in all this is that there is a middle ground between “social media is the panacea for everything” and “it is all a bunch of worthless drivel to be ignored.” I think there is potential for real science with the appropriate assumptions, while also embracing that it is not likely to look like the old science. In my mind that is what makes it exciting as a new frontier to try and understand. I definitely don’t have all the answers, but it is great to get the dialog going.

  3. [...] about the nature of social information that is available on the Web that I partially articulated in a response to a post on GeoIQ blog . When Mark and Matt asked for an abstract, I have provided the following: The understanding of the [...]

