Friday, December 12, 2014

Anthropology on the Long Tail

Small Big Data?
Of the many hyperbolic predictions in bestselling books devoted to big data, none is more astounding than Mayer-Schönberger and Cukier’s claim that big data will eliminate the need for sampling (why sample when you’ve got all the data?). But here’s the thing. We don’t have all of the data. Let’s look at Twitter. First, people who tweet are not a representative sample of the population. Second, like most commercial platforms, Twitter has moved towards more proprietary policies on the data it has mined from us. Most of us can only access up to 1% of relevant tweets for a given query. That can still be a lot of tweets, and that data is, for the moment, free. But is that big data? Not really: it is a sample, and one with sampling bias. If you can detect that bias, though, you can correct for it: Morstatter et al. recommend bootstrapping the data in order to correct for the biased sample.
But it may not be so easy with some of the work we do. For example, the authors note the difficulties that researchers may have with the long tail of tweets: the 99 percent of hashtags that are not trending. Are these biased? And can that bias be corrected? Research so far has been on the popular terms, such as #Ferguson and #Obamacare. But for the most part, anthropologists study the long tail: the lives and perspectives of people engaged in quotidian action on a relatively small scale. Heck, we are the long tail: even if we engage in public anthropologies, those anthropologies (and their publics) rarely register a blip in the winner-take-all logic of power-law social media.
On the other hand, our fieldwork is rarely about achieving a certain sample size. It’s about collecting a range of experiences and practices and then contextualizing those results. With social media, we should take the same approach. We may not have big data, but we might use the same tools. And they can still be helpful, but not as a substitute for our painstaking field research.
The following are two quick examples of using social network analysis for qualitative research, drawn from a webinar I did for AAA in November (eventually to be posted on AAA’s YouTube channel). Both examples use NodeXL, a free and open source application for Microsoft Excel, which has the advantage of familiarity and also comes with built-in queries for multiple SNS APIs: Twitter, YouTube, Flickr, and, with a little work, Facebook and hyperlink analysis. Finally, all of the complexities of graph theory are already built into the application.
By way of introduction: you’ve got nodes (dots representing people, concepts, organizations, etc.) and edges (lines that represent some kind of relationship between the nodes). And although there are many ways we might analyze these relationships, for these examples I only use one measure of centrality, that is, of the relative importance of a node. Betweenness centrality ranks the importance of a node based on the number of times it lies on the shortest paths between all the other pairs of nodes in the graph.
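To make the idea of betweenness concrete, here is a minimal sketch in plain Python. It is not NodeXL's code; it uses Brandes' well-known algorithm on an unweighted, undirected graph, and the toy network and its node names ("ana", "hub", and so on) are invented for illustration.

```python
from collections import deque

def betweenness_centrality(adjacency):
    """Unnormalized betweenness for an undirected, unweighted graph.

    `adjacency` maps each node to the set of its neighbors.
    """
    scores = {node: 0.0 for node in adjacency}
    for source in adjacency:
        # Breadth-first search from `source`, counting shortest paths.
        order = []  # nodes in the order they are reached
        predecessors = {node: [] for node in adjacency}
        path_count = {node: 0 for node in adjacency}
        distance = {node: -1 for node in adjacency}
        path_count[source] = 1
        distance[source] = 0
        queue = deque([source])
        while queue:
            node = queue.popleft()
            order.append(node)
            for neighbor in adjacency[node]:
                if distance[neighbor] < 0:  # first time we see this node
                    distance[neighbor] = distance[node] + 1
                    queue.append(neighbor)
                if distance[neighbor] == distance[node] + 1:
                    path_count[neighbor] += path_count[node]
                    predecessors[neighbor].append(node)
        # Walk back from the farthest nodes, crediting nodes on the paths.
        dependency = {node: 0.0 for node in adjacency}
        for node in reversed(order):
            for pred in predecessors[node]:
                share = path_count[pred] / path_count[node]
                dependency[pred] += share * (1.0 + dependency[node])
            if node != source:
                scores[node] += dependency[node]
    # Each undirected pair was counted from both ends, so halve the totals.
    return {node: score / 2.0 for node, score in scores.items()}

# Hypothetical toy network: "hub" bridges two otherwise separate pairs.
toy_graph = {
    "ana": {"ben", "hub"},
    "ben": {"ana", "hub"},
    "hub": {"ana", "ben", "cho", "dae"},
    "cho": {"hub", "dae"},
    "dae": {"hub", "cho"},
}
toy_scores = betweenness_centrality(toy_graph)
ranking = sorted(toy_scores, key=toy_scores.get, reverse=True)
```

In this toy graph every shortest path between the two pairs runs through "hub", so it ranks first, while the other four nodes sit on no one else's shortest paths. That is the sense in which a high-betweenness account is a broker between groups.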
Who are my interlocutors?
I’ve been researching the intersection of place and social network platforms in Seoul, and one of my favorite places has been Gwanghwamun Plaza. But it’s a crowded field of social action, with events overlapping each other every day, a complexity reflected in the tweets containing Gwanghwamun. First, I use the drop-down menu on NodeXL to query the Twitter API.
Screenshot of NodeXL
Choosing the Twitter Search Network, I enter the search term 광화문 (Gwanghwamun) and set the parameters for my search. It returns 1528 vertices (dots) representing Twitter accounts, connected by 1880 edges (lines) representing relationships between users who have used the term, or users who were replied to or mentioned in one of the tweets with that term. It’s pretty messy, but NodeXL gives us some options for ordering this chaos. After running metrics for the data, I have it group the nodes together into separate boxes by connected components.
Screenshot of Twitter users. Note that Twitter IDs have been cropped off
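Grouping by connected components simply means collecting the accounts that can reach one another through some chain of replies or mentions. The following is a rough sketch of that step in plain Python, not NodeXL's implementation. The account names are placeholders, and treating a tweet that mentions no one as a self-loop is an assumption made for this sketch.

```python
from collections import deque

def connected_components(edges):
    """Group nodes into components, largest first.

    `edges` is a list of (account, account) pairs; direction is ignored.
    """
    adjacency = {}
    for a, b in edges:
        adjacency.setdefault(a, set()).add(b)
        adjacency.setdefault(b, set()).add(a)

    seen = set()
    components = []
    for start in adjacency:
        if start in seen:
            continue
        # Breadth-first search collects everything reachable from `start`.
        seen.add(start)
        queue = deque([start])
        members = []
        while queue:
            node = queue.popleft()
            members.append(node)
            for neighbor in adjacency[node]:
                if neighbor not in seen:
                    seen.add(neighbor)
                    queue.append(neighbor)
        components.append(sorted(members))
    components.sort(key=len, reverse=True)
    return components

# Hypothetical edge list; ("user_f", "user_f") stands for an account that
# used the search term without replying to or mentioning anyone.
toy_edges = [
    ("user_a", "user_b"),
    ("user_b", "user_c"),
    ("user_d", "user_e"),
    ("user_f", "user_f"),
]
groups = connected_components(toy_edges)
```

Each list in `groups` corresponds to one of the boxes in the layout: a cluster of accounts talking to (or about) each other, separate from the rest.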
Now I’ve got something more manageable: a series of groups that share some thematic similarities. This can give me a sense of the demonstrations, counter-demonstrations, unconnected events and encounters that make up the social practice of this space. Moreover, I can rank the nodes by centrality to find the most important Twitter accounts. So now I have a sense of this field in a way that both is and is not co-extensive with the physical fieldsite, but without obscuring the role of physical place: it does matter, after all, that the protest is happening here in Gwanghwamun and not in Second Life.

What does my event mean?
Honfest is a highly commercialized neighborhood festival concocted by a neighborhood entrepreneur in order to brand the neighborhood for commodified consumption. As such, it is a flashpoint of contention, a social drama that reveals the divisions around gentrification, race and class in this formerly working-class neighborhood in north Baltimore.
Twitter users grouped into boxes by connected component and ranked by betweenness centrality
We have sent students into the festival every year in order to document these negotiations. But we are not the only ones. Every year, there are hundreds of people posting their media on social networking sites like Instagram and Flickr. These images are important clues to the meaning of this event, and analyzing these data can tell us much about the different ways people categorize space and place: in effect, photo elicitation and photovoice applied to social media platforms.
Going back to the pull-down import menu on NodeXL, I download the related tag network on Flickr for Honfest.
This graph shows the relationships between tagged terms and, like the preceding example, it may prove too messy for analysis. So we can again run metrics and remove terms that occur less frequently in these photos.
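The pruning step can be pictured with a small sketch. NodeXL downloads the related tag network from Flickr directly, so this is not how NodeXL builds its graph. It only illustrates the general idea of dropping tags that appear in few photos and then linking the tags that co-occur. The photo list is invented toy data and `min_photos` is a hypothetical threshold.

```python
from collections import Counter
from itertools import combinations

def tag_cooccurrence(photo_tags, min_photos=2):
    """Keep tags used on at least `min_photos` photos and count co-occurrences.

    `photo_tags` is a list of tag lists, one list per photo.
    Returns (kept_tags, edge_weights), where edge_weights maps an
    alphabetically ordered tag pair to the number of photos sharing both tags.
    """
    # Count each tag once per photo, even if a photo repeats it.
    tag_counts = Counter(tag for tags in photo_tags for tag in set(tags))
    kept = {tag for tag, count in tag_counts.items() if count >= min_photos}

    edge_weights = Counter()
    for tags in photo_tags:
        frequent = sorted(set(tags) & kept)
        for pair in combinations(frequent, 2):
            edge_weights[pair] += 1
    return kept, edge_weights

# Invented toy data, not real Flickr results.
toy_photos = [
    ["honfest", "beehive", "hairspray"],
    ["honfest", "beehive", "retro"],
    ["honfest", "hairspray", "baltimore"],
    ["honfest", "retro", "flamingo"],
]
kept_tags, weights = tag_cooccurrence(toy_photos, min_photos=2)
```

Here the two tags that appear on only one photo each fall away, leaving a smaller graph in which "honfest" is linked to every remaining term.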
Now we have honfest (at the center of the graph) surrounded by a constellation of terms that co-occur with it. As in the preceding example, I can rank these terms by betweenness centrality, and we find predictable terms: hairspray (the film and the hair product), beehive (the hairstyle), retro, etc.
Concluding thoughts
Both of these examples represent ways that we might use socially networked data (rather than big data) to open up our ethnographic work to other meanings, interlocutors and social relations. In neither case is the critical need for face-to-face ethnography eliminated. In fact, just the opposite. In the first example, Twitter helps us to identify issues and people that might be salient to fieldwork, while in the second, tags suggest (but only suggest) different discourses swirling around an urban festival. Both are only first steps in different phases of a sustained ethnographic project. In other words: in the absence of big data, we still have the tools (if not the truth claims) of big data. We can use them to complement and enrich our small-scale, place-bound ethnographic research.