Part 2: Building the “Tweets of a Native Son” Archive (jq and regular expressions)

November 07, 2016

In Part 1, I explained how I “hydrated” 17 million tweets that mentioned “Ferguson” from August and November 2014 by using twarc, and how I generated summaries for these collections (number of tweets and users, top hashtags, top URLs, top image URLs, etc.) using twarc-report. But how and where does James Baldwin fit into the picture?

In this post, I’m going to talk about what Twitter metadata looks like (spoiler alert: it’s pretty kooky) and how I reshaped the metadata to find only tweets that mentioned James Baldwin by using jq, a command-line utility for filtering JSON data.

Twitter metadata is wild. Every single tweet carries with it an enormous amount of information. Even a tweet as seemingly meaningless as “:poop:” carries with it not just the “text” (:poop:) of the tweet but also many other metadata fields such as:

  • when the tweet was created
  • how many times the tweet has been favorited or retweeted
  • what language the tweet was written in
  • which user wrote the tweet
  • when the user created his or her account
  • the user’s profile description and profile picture and profile background color
  • the user’s location
  • how many people follow the user
  • how many people the user follows
  • how many tweets the user has ever favorited
  • how many tweets the user has ever sent

…and much, much more.[1]

Here’s just a glimpse of a single tweet from the August archive, which goes on for 96 lines:

[image: JSON metadata of a single tweet from the August archive]

As a tweet gets more complex—a reply to another user with a hashtag and a URL or a retweet with an image—the metadata gets even bigger and richer.

This richness makes the metadata wonderful for research purposes, allowing us to investigate all kinds of questions, but it also makes the data difficult to work with. If we want to extract just the tweets that mention James Baldwin, the nested structure of the JSON data (which you can learn more about here) makes that a tricky task.
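To make the nesting concrete, here is a minimal Python sketch of a heavily trimmed tweet. The field names follow Twitter's v1.1 JSON format, but every value is invented for illustration and is not drawn from the Ferguson collections.

```python
import json

# Hypothetical, heavily trimmed tweet: real tweets carry many more fields,
# and all of the values below are made up for this example.
raw = '''
{
  "created_at": "Mon Aug 11 12:00:00 +0000 2014",
  "text": "Reading James Baldwin tonight #Ferguson",
  "lang": "en",
  "retweet_count": 3,
  "user": {
    "screen_name": "example_user",
    "followers_count": 120,
    "location": "St. Louis, MO"
  },
  "entities": {
    "hashtags": [{"text": "Ferguson"}],
    "urls": []
  }
}
'''

tweet = json.loads(raw)

# Top-level fields are a single lookup away...
text = tweet["text"]

# ...but user details and hashtags sit one or two levels deeper.
followers = tweet["user"]["followers_count"]
hashtags = [tag["text"] for tag in tweet["entities"]["hashtags"]]

print(text)
print(followers, hashtags)
```

Anything about the author of a tweet lives under "user", and hashtags and URLs live under "entities", so a filter has to know where in the tree to look.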

This is where jq comes in: a command-line utility for reshaping JSON data. With jq, you can extract information from particular fields (“text,” “coordinates,” “retweet_count,” etc.) even when those fields are nested within other fields. You can even output the JSON data to a flat CSV file.[2] For my purposes, I extracted only the tweets that mentioned James Baldwin by applying the filter select() to the .text field of each tweet, combined with a regular-expression match for “James Baldwin,” test("James Baldwin"):

jq -c "select(.text | test(\"James Baldwin\"))" tweets.json > jamesbaldwin.json
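For readers who do not have jq installed, the same select-and-test logic can be sketched in Python. This is only an approximation of the jq command above, not the method used to build the archive. It assumes that tweets.json holds one JSON object per line; the function name select_by_text and the two sample lines are invented for illustration.

```python
import json
import re

def select_by_text(lines, pattern, flags=0):
    """Yield tweets (as dicts) whose "text" field matches the pattern."""
    regex = re.compile(pattern, flags)
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        tweet = json.loads(line)
        # Like jq's select(.text | test(...)): keep the tweet only on a match.
        # Unlike jq, a missing or null "text" is treated as an empty string.
        if regex.search(tweet.get("text") or ""):
            yield tweet

# Invented sample lines standing in for tweets.json (one JSON object per line).
sample = [
    '{"id": 1, "text": "James Baldwin saw this coming #Ferguson"}',
    '{"id": 2, "text": "Thinking about #Ferguson tonight"}',
]

matches = list(select_by_text(sample, r"James Baldwin"))
for tweet in matches:
    print(json.dumps(tweet))  # one object per line, as with jq -c
```

To run it over a real file, pass an open file handle instead of the sample list. The counts reported in this post come from the jq command itself, not from this sketch.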

The exact-phrase filter returned 1,697 tweets that mentioned “James Baldwin” in the August Ferguson tweet collection. But that count covers only tweets in which “James Baldwin” was spelled correctly, with standard capitalization and spacing. So I expanded the search to collect other variations of his name, ignoring capitalization and spacing, allowing just a first initial, and even permitting misspellings that drop the “d” from his name (#JamesBaldwin, J BALDWIN, james baldwin, James Balwin, etc.):

jq -c "select(.text | test(\"J(ames|) ?Bald?win\";\"i\"))" tweets.json > jamesbaldwin.json

This expanded search returned 1,839 tweets from the August collection and 1,393 tweets from the November collection. I also experimented with an even more expansive search for simply “Baldwin,” which returned 3,410 tweets, but too many of them referred to a “Baldwin” other than the literary James of interest, picking up instead on users’ last names, the actor Alec Baldwin, etc.
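The expanded pattern can be checked against the name variants listed above. The sketch below uses Python's re module rather than jq, on the assumption that the two regex engines agree for a pattern this simple. The example strings are invented, not tweets from the archive.

```python
import re

# Same pattern and case-insensitive flag as the expanded jq filter.
pattern = re.compile(r"J(ames|) ?Bald?win", re.IGNORECASE)

# Invented example strings covering the variants discussed in the post.
examples = [
    "#JamesBaldwin",
    "J BALDWIN",
    "james baldwin",
    "James Balwin",
    "J. Baldwin",
    "Alec Baldwin",
]

results = {s: bool(pattern.search(s)) for s in examples}
for s, matched in results.items():
    print(f"{s!r}: {matched}")
```

Two limits of the pattern show up here. An initial followed by a period (“J. Baldwin”) is not matched, because nothing in the pattern allows for the period. And since the pattern is not anchored to a word boundary, any text with a “J” directly before “Baldwin” (for example “DJ Baldwin”) would match.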

These 3,232 tweets from August and November quantitatively confirm the claim made by scholars, among them Eddie S. Glaude and William J. Maxwell, that Baldwin is and was a leading literary voice in the burgeoning #BlackLivesMatter movement, since they show that James Baldwin was being invoked and talked about on Twitter in the aftermath of Ferguson.

Yet the James Baldwin conversation represents less than one percent of the total Twitter conversation in either dataset, a tempering reminder that the scholarly sense of what is dominant in the larger cultural conversation may be skewed by scholarly newsfeeds. These numbers may even seem small enough to undercut the case for Baldwin’s prominence on Twitter. But compared with other prominent black writers, Baldwin is far and away the most invoked. The words “James Baldwin” appear more often in the August collection than “Claudia Rankine” (416), “Langston Hughes” (281), “Assata Shakur” (130), “Ta-Nehisi Coates” (129), “Toni Morrison” (72), “Teju Cole” (55), “Richard Wright” (50), “Ralph Ellison” (49), and “Amiri Baraka” (10) combined.
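The “combined” comparison is easy to verify with a few lines of Python. The counts for the other writers are the ones quoted above; the Baldwin figure is an assumption on my part that the comparison uses the 1,697 exact-phrase matches reported earlier for the August collection.

```python
# August counts for the other writers, as quoted in the post.
others = {
    "Claudia Rankine": 416,
    "Langston Hughes": 281,
    "Assata Shakur": 130,
    "Ta-Nehisi Coates": 129,
    "Toni Morrison": 72,
    "Teju Cole": 55,
    "Richard Wright": 50,
    "Ralph Ellison": 49,
    "Amiri Baraka": 10,
}

combined = sum(others.values())

# Assumed to be the exact-phrase "James Baldwin" count for August.
baldwin_exact = 1697

print(combined)                  # 1192
print(baldwin_exact > combined)  # True
```

The nine other names total 1,192 mentions, so the claim holds whether the Baldwin count is taken as the 1,697 exact-phrase matches or the 1,839 expanded matches.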

But why? Why does Baldwin appear so much more frequently than other black writers? What about his style, insights, or legacy resonates in this particular historical moment and on this particular platform?

These are the literary questions that I will seek to answer in future posts through close-reading and further investigation of the data. But for now, I’ll just share an overview of the Ferguson-Baldwin data produced using some twarc utilities and twarc-report. Check out the most popular retweets, hashtags, URLs, and image URLs below.

August Baldwin Tweet Archive

Tweets: 1,839    
Users: 1,753    
Has Hashtag: 1,169 (63.57%)    
Hashtags: 115    
Has URL: 609 (33.12%)    
URLs: 166    
Has Image URL: 47 (2.56%)    
Image URLs: 18    
Retweets: 1,327 (72.16%)    
Geo: 11 (0.60%)    
Earliest Tweet: 2014-08-11 04:16:44 UTC      
Latest Tweet: 2014-08-27 12:46:48 UTC
Total Duration: 16 days, 8:30:04

Top 10 Retweets:

Top Hashtags:      
1. ferguson 1,093    
2. jamesbaldwin 122    
3. vmas 84    
4. lookdifferent 61    
5. mikebrown 53    
6. nmos14 28    
7. vmawards 17    
8. books 15    
9. vmas2014 15    
10. moralmonday 15    
Top URLs:      
1. 73    
2. 61    
3. 59    
4. 49    
5. 33    
6. 27    
7. 23    
8. 15    
9. 15    
10. 11    
Top Image URLs:      
1. [image] 15
2. [image] 6
3. [image] 5
4. [image] 4
5. [image] 2
6. [image] 2
7. [image] 2
8. [image] 1
9. [image] 1
10. [image] 1

November Baldwin Tweet Archive

Tweets: 1,393    
Users: 1,231    
Has Hashtag: 952 (68.34%)    
Hashtags: 113    
Has URL: 388 (27.85%)    
URLs: 203    
Has Image URL: 213 (15.29%)    
Image URLs: 47    
Retweets: 987 (70.85%)    
Geo: 11 (0.79%)    
Earliest Tweet: 2014-11-12 05:38:56 UTC    
Latest Tweet (Retweet): 2014-12-10 04:08:08 UTC
Total Duration: 27 days, 22:29:12

Top 10 Retweets:

Top Hashtags:      
1. ferguson 836    
2. mikebrown 196    
3. blacklivesmatter 174    
4. jamesbaldwin 104    
5. noblackjusticenoblackfriday 48    
6. fergusiondecision 45    
7. icantbreathe 34    
8. fergusontheroot 23    
9. hiphoped 22    
10. standup 20    
Top URLs:      
1. 54    
2. 33    
3. 23    
4. 18    
5. 18    
6. 16    
7. 11    
8. 11    
9. 10    
10. 9    
Top Image URLs:      
1. [image] 45
2. [image] 44
3. [image] 23
4. [image] 19
5. [image] 9
6. [image] 8
7. [image] 7
8. [image] 5
9. [image] 4
10. [image] 4

  1. If you’re interested in finding out more about the structure of Twitter metadata, check out Twitter’s documentation.

  2. For a full jq tutorial, I recommend Matthew Lincoln’s very helpful “Reshaping JSON with jq.”