Tree Rings by Tracy O
This post is really about exploring historical datasets with version control systems.

Mark Graham posed a question to me earlier today asking what we know about the Twitter accounts of the members of Congress, specifically whether they have been removed after they left office. The hypothesis was that some members of the House and Senate may decide to delete their account on leaving DC.

I was immediately reminded of the excellent congress-legislators project which collects all kinds of information about House and Senate members including their social media accounts into YAML files that are versioned in a GitHub repository. GitHub is a great place to curate a dataset like this because it allows anyone with a GitHub account to contribute to editing the data, and to share utilities to automate checks and modifications.

Unfortunately the file that tracks social media accounts is only for current members. Once they leave office they are removed from the file. The project does track other historical information for legislators. But the social media data isn’t pulled in when this transition happens, or so it seems.

Luckily Git doesn’t forget. Since the project is using a version control system all of the previously known social media links are in the history of the repository! So I wrote a small program that uses gitpython to walk the legislators-social-media.yaml file backwards in time through each commit, parse the YAML at that previous state, and merge that information into a union of all the current and past legislator information. You can see the resulting program and output in us-legislators-social.

There’s a little bit of a wrinkle in that not everything in the version history should be carried forward because errors were corrected and bugs were fixed. Without digging into the diffs and analyzing them more it’s hard to say whether a commit was a bug fix or if it was simply adding new or deleting old information. If the YAML doesn’t parse at a particular state that’s easy to ignore.

It also looks like the maintainers split out account ids from account usernames at one point. Derek Willis helpfully pointed out to me that Twitter don’t care about the capitalization of usernames in URLs, so these needed to be normalized when merging the data. The same is true of Facebook, Instagram and YouTube. I guarded against these cases but if you notice other problems let me know.

With the resulting merged historical data it’s not too hard to write a program to read in the data, identify the politicians who left office after the 116th Congress, and examine their Twitter accounts to see that they are live. It turned out to be a little bit harder than I expected because it’s not as easy as you might think to check if a Twitter account is live or not.

Twitter’s web servers return a HTTP 200 OK message even when responding to requests for URLs of non-existent accounts. To complicate things further the error message that displays indicating it is not an account only displays when the page is rendered in a browser. So a simple web scraping job that looks at the HTML is not sufficient.

And finally just because a Twitter username no longer seems to work, it’s possible that the user has changed it to a new screen_name. Fortunately the unitedstates project also tracks the Twitter User ID (sometimes). If the user account is still there you can use the Twitter API to look up their current screen_name and see if it is different.

After putting all this together it’s possible to generate a simple table of legislators who left office at the end of the 116th Congress, and their Twitter account information.

name url url_ok user_id new_url
Lamar Alexander True 76649729
Michael B. Enzi True 291756142
Pat Roberts True 75364211
Tom Udall True 60828944
Justin Amash True 233842454
Rob Bishop True 148006729
K. Michael Conaway True 295685416
Susan A. Davis False 432771620
Eliot L. Engel True 164007407
Bill Flores False 237312687
Cory Gardner True 235217558
Peter T. King True 18277655
Steve King True 48117116
Daniel Lipinski True 1009269193
David Loebsack True 510516465
Nita M. Lowey True 221792092
Kenny Marchant True 23976316
Pete Olson True 20053279
Martha Roby False 224294785
David P. Roe True 52503751
F. James Sensenbrenner, Jr. False 851621377
José E. Serrano True 33563161
John Shimkus True 15600527
Mac Thornberry True 377534571
Scott R. Tipton True 242873057
Peter J. Visclosky True 193872188
Greg Walden True 32010840
Rob Woodall True 2382685057
Ted S. Yoho True 1071900114
Doug Collins True 1060487274
Tulsi Gabbard True 1064206014
Susan W. Brooks True 1074101017
Joseph P. Kennedy III False 1055907624
George Holding True 1058460818
Denny Heck False 1068499286
Bradley Byrne True 2253968388
Ralph Lee Abraham True 2962891515
Will Hurd True 2963445730
David Perdue True 2863210809
Mark Walker True 2966205003
Francis Rooney True 816111677917851649
Paul Mitchell True 811632636598910976
Doug Jones True 941080085121175552
TJ Cox True 1080875913926139910
Gilbert Ray Cisneros, Jr. True 1080986167003230208
Harley Rouda True 1075080722241736704
Ross Spano True 1090328229548826627
Debbie Mucarsel-Powell True 1080941062028447744
Donna E. Shalala False 1060584809095925762
Abby Finkenauer True 1081256295469068288
Steve Watkins False 1080307235350241280
Xochitl Torres Small True 1080830346915209216
Max Rose True 1078692057940742144
Anthony Brindisi True 1080978331535896576
Kendra S. Horn False 1083019402046513152
Joe Cunningham True 1080198683713507335
Ben McAdams False 196362083
Denver Riggleman True 1080504024695222273

In most cases where the account has been updated the individual simply changed their Twitter username, sometimes remove “Rep” from it–like RepJoeKennedy to JoeKennedy. As an aside I’m kind of surprised that Twitter username wasn’t taken to be honest. Maybe that’s a perk of having a verified account or of being a politician? But if you look closely you can see there were a few that seemed to have deleted their account altogether:

name url url_ok user_id
Susan A. Davis False 432771620
Bill Flores False 237312687
F. James Sensenbrenner, Jr. False 851621377
Donna E. Shalala False 1060584809095925762
Steve Watkins False 1080307235350241280

There are two notable exceptions to this. The first is Vice President Kamala Harris. My logic for determining if a person was leaving Congress was to see if they served in a term ending on 2021-01-03, and weren’t serving in a term starting then. But Harris is different because her term as a Senator is listed as ending on 2021-01-18. Her old account @senkamalaharris is no longer available, but her Twitter User ID is still active and is now attached to the account at @VP. The other of course is Joe Biden, who stopped being a senator in order to become the President. His Twitter account remains the same at @joebiden.

It’s worth highlighting here how there seems to be no uniform approach to handling this process. In one case @senkamalaharris is temporarily blessed as the VP, with a unified account history underneath. In the other there is a separation between @joebiden and @POTUS. It seems like Twitter has some work to do on managing identities, or maybe the Congress needs to prescribe a set of procedures? Or maybe I’m missing part of the picture, and that just as @RepJoeKennedy somehow changed back to @JoeKennedy there is some namespace management going on behind the scenes?

If you are interested in other social media platforms like Facebook, Instagram and YouTube the unitedstates project tracks information for those platforms too. I merged that information into the legislators.yaml file I discussed here if you want to try to check them. I think that one thing this experiment shows is that if the platform allows for usernames to be changed it is critical to track the user id as well. I didn’t do the work to check that those accounts exist. But that’s a project for another day.

I’m not sure this list of five deleted accounts is terribly interesting at the end of all this. Possibly? But on the plus side I did learn how to interact with Git better from Python, which is something I can imagine returning to in the future. It’s not every day that you have to think of the versions of a dataset as an important feature of the data, outside of serving as a backup that can be reverted to if necessary. But of course data changes in time, and if seeing that data over time is useful, then the revision history takes on a new significance. It’s nothing new to see version control systems as critical data provenance technologies, but it felt new to actually use one that way to answer a question. Thanks Mark!