As we’ve done more work analyzing the spatio-temporal dimensions of social media, a few questions have come up repeatedly. Leading the charge is the bias of social media: Twitter, the objection goes, is just a collection of pre-teens and bots discussing Justin Bieber. The implication is that social media does not provide a representative sample of the population. For example, grandmothers between the ages of 65 and 85 likely make up a smaller share of Twitter users than of the total population. Further, the geographic penetration of social media is uneven. Social media is more popular in some geographies than in others, and in some locations one service is more popular than another. Orkut is huge in Brazil but not popular in the USA. All of these are accurately stated shortcomings of using social media for analysis. They also miss the point with regard to the power of social media as a data source.
Traditionally, data collection strove to be unbiased by taking random samples of the population. Much of modern statistics is premised on drawing random samples and applying them to a normal distribution. It is what we all learn in intro to stats. Even if you never took stats, we all have the daily news survey ingrained in our heads: “85% of Americans have lost faith in the political system,” followed by the small print stating that the survey is accurate within +/- 5%. The survey’s participants are most often selected randomly from the phone book, and a few thousand people are contacted. Each person counts equally, and the results are extrapolated to the entire country or globe.
The problem is not all people are equal, especially in the digital age. This is the bias in social media: it looks at people as an interconnected network instead of randomly selected isolated individuals. It is the interconnected bias of social media that makes it such a powerful data collection tool. Only by embracing this bias can we effectively use social media to collect meaningful data.
Surveys measure the end result of information propagation, but they don’t really look at how information is propagated. Social networks give us the ability to understand how information flows across a population, and why some memes die quickly while others go global. The difference arises largely because not all people are equal and the social network is biased. This also reflects much of reality. Some people have more influence than others, and some voices are louder than others. Outside of elections there are few times we all have equal and anonymous votes. There are many days I wish this were not the case, but that is how the cookie crumbles.
Two useful characteristics of data generated by social media are that 1) you can often retrieve the entire population of data (e.g. the Twitter firehose) and 2) you can rank users by how connected they are in the social network (e.g. number of followers, number of friends, etc.). In short, the bias in a social network can be quantified, although the richness of demographic information varies from network to network.
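As a rough illustration of what quantifying that bias could look like, here is a minimal Python sketch. The follower counts are invented for the example and the helper name top_share is our own; nothing here comes from real Twitter data. It ranks users by connectivity and measures how much of the total the best-connected 20% of users hold.

```python
def top_share(counts, fraction=0.2):
    """Share of all connections held by the top `fraction` of users."""
    ranked = sorted(counts, reverse=True)
    k = max(1, round(len(ranked) * fraction))  # size of the top group
    total = sum(ranked)
    return sum(ranked[:k]) / total if total else 0.0


# Hypothetical follower counts, made up purely for illustration.
followers = {
    "a": 12000, "b": 850, "c": 40, "d": 310, "e": 15,
    "f": 95, "g": 60, "h": 5, "i": 2200, "j": 25,
}

# Rank users from most to least connected.
ranked_users = sorted(followers, key=followers.get, reverse=True)

# Fraction of all followers held by the top 20% of users.
share = top_share(followers.values())
print(ranked_users[:3], round(share, 3))
```

A share close to 0.2 would mean connectivity is spread evenly across users; the further above 0.2 it climbs, the more concentrated the network is.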
When you quantify the bias in a network, one of the things you recognize quickly is that social networks can differ from one another, but they are typically not random. Physicists and other researchers have spent a lot of time over the last decade investigating why networks are not random. Much like in statistics, the traditional simulation technique for a network was to randomly connect nodes together. Researchers (Barabási et al.) found that when new nodes enter a network they are much more likely to connect to a node that already has a lot of connections than to nodes with only a few connections. The concept was labeled preferential attachment, and it results in very skewed, non-random connectivity distributions. Physicists call them power laws, economists often call them Pareto distributions, and colloquially they are known as the 80/20 rule: 80% of the connections in a network are accounted for by 20% of the nodes. We see almost this exact distribution with the number of followers on Twitter.
The same rule holds true for information generation as well as propagation.
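To make the contrast concrete, here is a small Python sketch of the two ways of growing a network. It is a simplified toy model under our own assumptions (each new node makes exactly one connection, and the function names are ours), not the exact model from the research. In one version a new node picks its target uniformly at random; in the other it picks a target in proportion to the connections that target already has.

```python
import random


def grow_preferential(n_nodes, seed=0):
    """Each new node links to one existing node chosen in proportion to its degree."""
    rng = random.Random(seed)
    degrees = [1, 1]   # start from two connected nodes
    endpoints = [0, 1]  # node i appears once per connection it has
    for new in range(2, n_nodes):
        # Uniform choice over endpoints == degree-proportional choice over nodes.
        target = rng.choice(endpoints)
        degrees[target] += 1
        degrees.append(1)
        endpoints.extend([target, new])
    return degrees


def grow_random(n_nodes, seed=0):
    """Each new node links to one existing node chosen uniformly at random."""
    rng = random.Random(seed)
    degrees = [1, 1]
    for new in range(2, n_nodes):
        target = rng.randrange(new)
        degrees[target] += 1
        degrees.append(1)
    return degrees


def top_share(degrees, fraction=0.2):
    """Share of all connections held by the top `fraction` of nodes."""
    ranked = sorted(degrees, reverse=True)
    k = max(1, round(len(ranked) * fraction))
    return sum(ranked[:k]) / sum(ranked)


pa = grow_preferential(5000)
ra = grow_random(5000)
print("max connections:", max(pa), "vs", max(ra))
print("top 20% share:", round(top_share(pa), 2), "vs", round(top_share(ra), 2))
```

In this toy version the preferential network should produce far larger hubs and a more concentrated top 20% than the random one. The exact split depends on the model details, so it should not be expected to land on 80/20 precisely.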
This means a small subset of users both generates and propagates the vast majority of information across Twitter. So if I want to get the pulse of what is shaping opinion, not all users are equal, and some opinions are a lot more valuable than others. This inherent bias in Twitter is also what makes it exceedingly valuable as a monitoring tool. Often I don’t want to know what everyone thinks, or even worse, a homogenized sample that extrapolates out what everyone thinks. I want to know what the most influential people think. Who is shaping the opinions of the entire population? It is OK that I don’t have a representative sample of grandmothers aged 65-85, because they are not who I want to analyze. Use the bias of the network to your advantage. The right tool for the right job, as they say.
Whether this is in the context of monitoring unrest during the Arab Spring or the response to a new advertising campaign, knowing how information is propagating, and which memes will go global and which will quickly sputter out, is massively valuable. Realizing this potential means accepting the fact that the network itself is biased and self-reflective. This, though, is exactly what allows us to answer deep questions that traditional polling and surveys will never answer for us. The two are definitely not mutually exclusive, but objecting to using social media as a data source because it is biased misses the point of where the value in the data is. Next time around we’ll talk about the geographic bias of social media, and the challenges of working with it.