Recently I needed to find a way of quickly grabbing a list of usernames of Tumblr blogs who had liked and reblogged posts with a particular tag. Tumblr’s been in the news lately with a dashboard overhaul and suggestions that the platform is at least as popular as Facebook among teenagers – if not more so. Especially those who like to share instagrammed photographs of questionable ‘vintage’ fashions (is this the new MySpace photo?!). It would certainly be churlish to ignore this blogging platform as an opportunity for outreach.
The problem is that there is no Followerwonk for Tumblr. There’s no way to easily compare users, compile lists, or search by interest: really, all they give you to go on is tag URLs (and their Spotlight explore feature, but that’s pretty useless because it lacks any of the above).
So, after a brief foray into Python territory and a failed attempt to do everything in the terminal, I created a new spreadsheet in Google Docs and got to work.
For those of you who don’t care about how it works, you can access a read-only copy of the project and start using it here. Make a copy first by navigating to File > Make a copy, then choosing a name and clicking OK. This will save an editable version to your own Google Drive account. Once you’ve done this, you can enter the tag you wish to scrape in the input tab, and then select the final topusers tab to see a list of all the users who have reblogged and liked the ~18 most recent posts with that tag.
This isn’t a finished thing, however; I’m sharing it because I hope that someone else might be able to build on it and make it better. You’ll probably have to mess with it a bit to make it do what you want, but that’s half the fun, right? Right?
For anyone who is curious, or would like to better understand some basic – and very useful – spreadsheet automation, I’ll outline my process below, starting with the allurls tab:
This is really simple: just an IMPORTXML function with two arguments – the URL we want to scrape (in this case a Tumblr tag page, with a reference to the tag entered in the input tab), and a bit of XPath which tells the spreadsheet to scrape all of the links on that page: "//a/@href".
(There’s a very excellent guide to using importxml over at Distilled so I won’t go into any more detail here.)
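If you’d rather see the scraping step outside a spreadsheet, here’s a rough Python analogue. Google Sheets fetches the page itself; the HTML snippet below is an invented stand-in for a Tumblr tag page (lxml would support the literal "//a/@href" XPath – the stdlib version grabs the same hrefs with ElementTree):

```python
# Rough Python equivalent of the IMPORTXML step. The snippet is a
# made-up stand-in for a Tumblr tag page, not real Tumblr markup.
import xml.etree.ElementTree as ET

page = """<html><body>
  <a href="http://example.tumblr.com/post/123/cool-photo">a post</a>
  <a href="http://example.tumblr.com/archive">archive</a>
  <a href="http://example.tumblr.com/post/456/another">another post</a>
</body></html>"""

tree = ET.fromstring(page)
# Equivalent of the "//a/@href" XPath: every link's href on the page.
links = [a.get("href") for a in tree.findall(".//a")]
print(links)
```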
Ideally, I only wanted the URLs of the posts themselves, but my XPath expressions weren’t grabbing any data when I thought they should, and it was taking too long to make them work – so I did the filtering within the spreadsheet instead.
So to filter only the URLs of posts, I’ve used column A to display TRUE if the URL in column B includes "post". This is done via =ISNUMBER(FIND("post",B2)), where the nested FIND function returns the position at which "post" appears in cell B2 (the URL) – or an error if it doesn’t appear at all. ISNUMBER then picks up whether that result is a number, and spits out TRUE or FALSE.
The allposts sheet then uses this true/false column to filter out just the stuff we want, with =filter(allurls!A2:B;allurls!A2:A).
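The ISNUMBER(FIND(...)) plus FILTER pair boils down to a simple substring test. As a sketch, with some invented URLs:

```python
# The spreadsheet's TRUE/FALSE column + FILTER, as a substring test.
# These URLs are invented examples.
urls = [
    "http://example.tumblr.com/post/123/cool-photo",
    "http://example.tumblr.com/archive",
    "http://example.tumblr.com/post/456/another",
    "http://www.tumblr.com/tagged/vintage",
]

# Keep only the URLs that contain "post", like the allposts tab does.
posts = [u for u in urls if "post" in u]
print(posts)
```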
Allnotes starts to get a little more complicated. For the column headers, I used a TRANSPOSE function to display the previous tab’s column of post URLs as the first row, to create column headers. Each column would include scraped data from that header URL via another importxml function. This time, we’re looking at the notes: these are the records of every time a post is liked or reblogged. Again we need to scrape for URLs, but this time it was possible to be more specific.
The formula =IMPORTXML(CONCATENATE(A1), "//a[@class='avatar_frame']/@href") returns all of the links with the avatar_frame class for the URL in cell A1. Using Chrome’s inspect element view, I could see that this avatar_frame class is applied to all of the links to users who have reblogged/liked the post. Perfect!
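The class-restricted XPath works the same way outside the spreadsheet. Here’s a sketch using a made-up fragment of a notes list (the real Tumblr markup may differ – the point is the attribute predicate):

```python
# Sketch of the notes-scraping step: keep only links whose class is
# avatar_frame. The markup is an invented fragment, not real Tumblr HTML.
import xml.etree.ElementTree as ET

notes = """<ol>
  <li><a class="avatar_frame" href="http://alice.tumblr.com/">alice</a> liked this</li>
  <li><a class="avatar_frame" href="http://bob.tumblr.com/">bob</a> reblogged this</li>
  <li><a class="more_notes" href="#">Show more notes</a></li>
</ol>"""

tree = ET.fromstring(notes)
# Equivalent of //a[@class='avatar_frame']/@href
users = [a.get("href") for a in tree.findall(".//a[@class='avatar_frame']")]
print(users)
```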
So now you’ve got this big table of blog URLs for users who have liked and reblogged posts with your tag. But it’s a bit messy, and totally impossible to read.
The first thing we need to do now is clean up the data a little: for me, this meant getting rid of errors, so that the formulas I used on the data later wouldn’t produce more errors. This is as easy as using =IFERROR(allnotes!A3, ""). As I’ve left the space between the quotation marks empty, this converts any error cells to blank cells.
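In other words, IFERROR just swaps error values for blanks. A tiny sketch, with "#N/A"-style strings standing in for spreadsheet error cells:

```python
# IFERROR(..., "") replaces error values with blanks so later formulas
# don't choke. Error strings here stand in for spreadsheet error cells.
cells = ["http://alice.tumblr.com/", "#N/A", "http://bob.tumblr.com/", "#REF!"]
ERRORS = {"#N/A", "#REF!", "#VALUE!", "#DIV/0!"}

cleaned = ["" if c in ERRORS else c for c in cells]
print(cleaned)
```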
Next, we need to count how many times each of these users appears. Then, we can rank them, and determine which ones to look at in more detail. But before we can do that, we need to get this whole array into a one-column list in the usercount tab:
=ArrayFormula(UNIQUE(TRANSPOSE(SPLIT(CONCATENATE(allusers!A1:T102&" ")," "))))
This is a bit trickier. Reading from the inside out, the CONCATENATE formula first smooshes that whole array into one long string, with each cell separated by a space. SPLIT then splits these up again by the space, and TRANSPOSE arranges them into a column. UNIQUE removes any duplicates, and ArrayFormula tells the whole lot that it’s dealing with an array input.
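All that chain really does is flatten a 2D range into one de-duplicated column. Here’s the same idea in Python, over a small invented grid of blog URLs:

```python
# Flatten a 2-D range into one de-duplicated list, like the
# CONCATENATE/SPLIT/TRANSPOSE/UNIQUE chain. The grid is invented data.
grid = [
    ["http://alice.tumblr.com/", "http://bob.tumblr.com/"],
    ["http://carol.tumblr.com/", "http://alice.tumblr.com/"],
    ["http://bob.tumblr.com/", ""],
]

seen = []
for row in grid:
    for cell in row:
        if cell and cell not in seen:  # skip blanks, keep first occurrence
            seen.append(cell)
print(seen)
```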
It’s then just a case of using a COUNTIF function to count each time that cell appears in the array on the allusers tab: =COUNTIF(allusers!A$1:T$102,B2), where the first argument is the array we’re searching in, and the second is the cell we’re counting instances of. This is nested inside an IF function which checks whether or not the cell is blank first – so if there’s no URL there, it doesn’t do any counting: =IF(ISBLANK(B2),"",COUNTIF(allusers!A$1:T$102,B2))
You can then sort this column yourself, or do as I did and use another TRUE/FALSE column to filter counts >1 into the next sheet, topusers, via =IF(C2>1,TRUE,FALSE) and then another FILTER function.
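The counting, sorting, and counts-greater-than-one filtering together are a one-liner job in Python – collections.Counter does what COUNTIF does across the whole range. Again with invented data:

```python
# COUNTIF across the whole range, then keep users seen more than once -
# the same job the usercount and topusers tabs do. Invented data.
from collections import Counter

flat = [
    "http://alice.tumblr.com/", "http://bob.tumblr.com/",
    "http://alice.tumblr.com/", "http://carol.tumblr.com/",
    "http://alice.tumblr.com/", "http://bob.tumblr.com/",
]

counts = Counter(flat)
# Sort by count, highest first, and keep only counts > 1.
top_users = [(u, n) for u, n in counts.most_common() if n > 1]
print(top_users)
```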
There are probably more efficient ways of doing most of these steps, but it works as a quick and dirty solution. Do let me know if you found this useful, or find a better implementation!