Part 1: Building the “Tweets of a Native Son” Archive (twarc and twarc-report)

November 05, 2016

So my project “Tweets of a Native Son” examines the way that Twitter conversations about Ferguson and the #BlackLivesMater movement invoke the literary author James Baldwin. What’s my archive? How did I build it?

Tweets can be ephemeral little stinkers. Since Twitter’s Search API only allows you to collect tweets from the last 1-2 weeks, foresight is huge when building a Twitter archive. Thankfully, Ed Summers had the foresight to start collecting tweets that mentioned “Ferguson” (upper or lowercase; with or without hashtag) in the weeks and months following Michael Brown’s shooting and, just as important, the generosity to share them openly. Though Twitter’s Terms of Service does not allow bulk distribution of Twitter data, it does allow the distribution of tweet “ids,” which are unique identifiers that get assigned to every tweet and can be used to retroactively access the full tweet metadata, a process called “hydrating” that can be performed with a command line Python tool like twarc, which was also created by Ed Summers. Yeah, Ed rules.

So the first step in building my Baldwin Tweet archive was hydrating the Ferguson tweet collections from both August and November, which you can perform by running the command --hydrate:

mwalsh@ada:~/ferguson-tweet-ids/data$ gunzip ids.txt.gz
mwalsh@ada:~/ferguson-tweet-ids/data$ nohup --hydrate ids.txt > tweets.json &

Since Twitter’s API rate limit only allows requests for up to 72,000 tweets per hour, this process took approximately 8 and 9 days, respectively. The utility “twarc-report,” created by Peter Binkley, allows me to generate a helpful, summarizing overview of these collections (# of tweets and users, top hashtags, top URLs, top images, etc.), which you can perform by running with the flag -o text:

mwalsh@ada:~/ferguson-tweet-ids/data$ ~/twarc-report/ -o text tweets.json
mwalsh@ada:~/ferguson-indictment-tweet-ids/data$ ~/twarc-report/ -o text tweets.indictment.json

Here’s (some of) what the report spits out…

August Ferguson Tweet Collection

Tweets: 10,441,785
Users: 1,596,104
Has Hashtag: 6,898,701 (66.07%)
Hashtags: 99,695
Has URL: 3,317,445 (31.77%)
URLs: 664,691
Has Image URL: 1,593,403 (15.26%)
Image URLs: 134,431
Retweets: 7,411,799 (70.98%)
Geo: 74,678 (0.72%)
Earliest Tweet: 2014-08-10 22:44:33 UTC
Latest Tweet: 2014-08-27 15:15:50 UTC
Total Duration: 16 days, 16:31:07
Top Hashtags:  
1. ferguson 6,422,131
2. mikebrown 583,141
3. michaelbrown 158,552
4. justiceformikebrown 84,061
5. tcot 70,702
6. gaza 70,338
7. stl 58,662
8. handsupdontshoot 47,126
9. darrenwilson 40,091
10. fergusonshooting 31,643
Top URLs:  
1. 37,917
2. 12,906
3. 10,773
4. 9,841
5. 8,276
6. 7,535
7. 7,514
8. 6,771
9. 6,366
10. 5,909
Top Image URLs:  
1. alt text 21,899
2. alt text 10,486
3. alt text 10,319
4. alt text 9,029
5. alt text 7,956
6. alt text 7,586
7. alt text 6,842
8. alt text 6,223
9. alt text 5,998
10. alt text 5,195

November Ferguson Tweet Collection

Tweets: 7,868,540
Users: 1,761,950
Has Hashtag: 4,567,256 (58.04%)
Hashtags: 110,097
Has URL: 3,514,869 (44.67%)
URLs: 927,040
Has Image URL: 1,465,857 (18.63%)
Image URLs: 191,684
Retweets: 5,004,081 (63.60%)
Geo: 51,007 (0.65%)
Earliest Tweet: 2014-11-11 22:17:06 UTC
Latest Tweet: 2014-12-10 05:15:31 UTC
Total Duration: 28 days, 6:58:25
Top Hashtags:  
1. ferguson 3,969,513
2. mikebrown 188,976
3. ericgarner 178,139
4. blacklivesmatter 151,749
5. fergusondecision 118,405
6. michaelbrown 105,607
7. tcot 99,477
8. darrenwilson 89,154
9. icantbreathe 89,000
10. shutitdown 45,296
Top URLs:  
1. 11,592
2. 10,644
3. 9,848
4. 8,609
5. 8,535
6. 8,097
7. 7,460
8. 7,396
9. 7,038
10. 7,028
Top Image URLs:  
1. alt text 11,453
2. alt text 10,907
3. alt text 10,681
4. alt text 10,004
5. alt text 9,299
6. alt text 9,190
7. alt text 8,484
8. alt text 7,870
9. alt text 7,310
10. alt text 5,824

This is a pretty cool macro view of over 17 million tweets, and it can already tell us a lot about what’s inside these datasets. For instance, from these summaries alone, we can see #BlackLivesMatter emerging into the mainstream conversation about Ferguson. From August 10 to August 27, the hashtag #BlackLivesMatter doesn’t even make the Top 10. But by November, it’s rocketed all the way to No. 4. These four months were crucial for the rise and circulation of the hashtag #BlackLivesMatter, and this data helps tell part of that story.

But of course this is not the whole story, and in fact it’s not even the whole Twitter story. After all, from the 13,480,000 tweet ids in the August collection and the 15,080,078 ids in the November collection, I was only able to hydrate 10,441,785 tweets and 7,868,540 tweets, respectively. That’s a lot of missing tweets. If you’re wondering where the heck all those tweets went, that’s an excellent question. If a Twitter user deletes his or her tweets before the time of hydration, that tweet gets lost forever. Poof! Ed Summers thinks this is a lowkey ethical move on Twitter’s part, which allows users the right to be forgotten. Spam also gets routinely and retroactively deleted from Twitter, which also might account for some of the missing tweets. Even so, this is still a rather large gap, and I hope to theorize it more fully in later posts.

For now, however, I’ll conclude Part 1 by saying that twarc is an awesome tool for creating Twitter archives if you know even a little bit about Python/the command line (though I should mention that Ed Summers is currently leading a project called “Documenting the Now” which aims to develop an even easier-to-use tool that requires no programming skills whatsoever) and twarc-report is a small but powerful utility for getting a sense of the contours of one’s dataset.

In Part 2, I will discuss how I narrowed these collections to only the tweets that mention James Baldwin (by using jq and regular expressions) and show what the twarc-reports of these smaller James Baldwin collections reveal.