Using Google Docs to Scrape Tumblr Hashtags


By Han | 8th Feb 2013 | Posted in Data and Tools

Recently I needed to find a way of quickly grabbing a list of usernames of Tumblr blogs that had liked and reblogged posts with a particular tag. Tumblr’s been in the news lately with a dashboard overhaul and suggestions that the platform is at least as popular as Facebook – if not more so – among teenagers, especially those who like to share Instagrammed photographs of questionable ‘vintage’ fashions (is this the new MySpace photo?!). It would certainly be churlish to ignore this blogging platform as an opportunity for outreach.

[Image: Fairly standard Tumblr fare, via subwaywhore.tumblr.com]

The problem is that there is no Followerwonk for Tumblr: no way to easily compare users, compile lists, or search by interest. Really, all they give you to go on is tag URLs (and their Spotlight explore feature, but this is pretty useless because it offers none of the above.)

So, after a brief foray into Python territory and a failed attempt to do everything in the terminal, I created a new spreadsheet in Google Docs and got to work.

For those of you who don’t care about how it works, you can access a read-only copy of the project and start using it here. Make a copy first by navigating to File > Make a copy, then choosing a name and clicking OK. This will save an editable version to your own Google Drive account. Once you’ve done this, you can enter the hashtag you wish to scrape in the input tab, then select the final topusers tab to see a list of all the users who have reblogged and liked the ~18 most recent posts with that tag.

This isn’t a finished thing, however; I’m sharing it because I hope that someone else might be able to build on it and make it better. You’ll probably have to mess with it a bit to make it do what you want, but that’s half the fun, right? Right?

A Quick Rundown

Here is what each tab does:

  • input: The only tab you should need to edit to see results. Put the tag you want to scrape in the blue cell.
  • allurls: This is a list of every link contained on the Tumblr tag page for the tag entered.
  • allposts: This is just the URLs to individual posts, filtered from the previous list of all URLs on the tag page.
  • allnotes: All of these post URLs are now the column headers, under which are links to all users who have liked/reblogged that post.
  • allusers: All of the user links from the previous tab, without the post URLs, and with any errors cleaned up.
  • usercount: All user links arranged in one column, duplicates removed, with a count of the number of times they appear.
  • topusers: The previous tab’s list, filtered to show only users who have liked/reblogged more than once.

Getting the Data

For anyone who is curious, or would like to better understand some basic – and very useful – spreadsheet automation, I’ll outline my process below, starting with the allurls tab:

This is really simple: just an importxml function with two arguments – the URL we want to scrape, in this case a Tumblr tag page with a reference to the tag entered in the input tab, and a bit of xpath that tells the spreadsheet to scrape all of the links on that page: "//a/@href".
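For reference, the formula in the allurls tab looks something along these lines (exact cell references will depend on your copy; here input!B2 is assumed to be the blue tag cell):

=importxml(CONCATENATE("http://www.tumblr.com/tagged/", input!B2), "//a/@href")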

(There’s an excellent guide to using importxml over at Distilled, so I won’t go into any more detail here.)

Ideally, I only wanted the URLs of the posts themselves, but my xpath expressions weren’t grabbing any data when I thought they should, and it was taking too long to make them work, so I decided to do the filtering within the spreadsheet instead.

So to filter only the URLs of posts, I’ve used column A to display TRUE if the URL in column B includes “post”. This is done via =ISNUMBER(FIND("post",B2)), where the nested FIND function returns the position at which “post” appears within cell B2 (the URL), or an error if it doesn’t appear at all. ISNUMBER then checks whether that result is a number, and spits out TRUE or FALSE accordingly.

The allposts sheet then uses this true/false column to filter out just the stuff we want, with =filter(allurls!A2:B, allurls!A2:A).

Allnotes starts to get a little more complicated. For the column headers, I used a TRANSPOSE function to display the previous tab’s column of post URLs as the first row, to create column headers. Each column would include scraped data from that header URL via another importxml function. This time, we’re looking at the notes: these are the records of every time a post is liked or reblogged. Again we need to scrape for URLs, but this time it was possible to be more specific.
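The header row itself is nothing fancy, just a TRANSPOSE of the post URL column, something like this (the exact range is an assumption; point it at wherever the post URLs land in your copy of allposts):

=TRANSPOSE(allposts!B1:B20)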

The formula =importxml(CONCATENATE(A1), "//a[@class='avatar_frame']/@href") returns all of the links with the avatar_frame class for the URL in cell A1. Using Chrome’s inspect element view, I could see that this avatar_frame class is applied to all of the links to users who have reblogged/liked the post. Perfect!

So now you’ve got this big table of blog URLs for users who have liked and reblogged posts with your hashtag. But it’s a bit messy, and totally impossible to read.

Cleaning Up

The first thing we need to do now is clean up the data a little: for me, this meant getting rid of errors, so that the formulas I used on the data afterwards wouldn’t throw more errors. This is as easy as using =IFERROR(allnotes!A3, ""). As I’ve left the space between the quotation marks blank, this converts any error cells to blank cells.

Next, we need to count how many times each of these users appears. Then, we can rank them, and determine which ones to look at in more detail. But before we can do that, we need to get this whole array into a one-column list in the usercount tab:

=ArrayFormula(unique(transpose(split(concatenate(allusers!A1:T102&" ")," "))))

This is a bit trickier. Working from the inside out, the CONCATENATE formula first smooshes that whole array into one long string, with each cell separated by a space. SPLIT then splits these up again by the space, and TRANSPOSE arranges them into a column. UNIQUE removes any duplicates, and ArrayFormula tells the whole lot that it’s dealing with an array input.
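To make that concrete with a made-up example: if allusers contained just two rows reading user1, user2 and then user1, user3, CONCATENATE would glue them into one string, “user1 user2 user1 user3 ”, SPLIT would break that back into four separate values, TRANSPOSE would stand them up in a column, and UNIQUE would leave just user1, user2 and user3.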

It’s then just a case of using a COUNTIF function to count each time that cell appears in the array on the allusers tab: =COUNTIF(allusers!A$1:T$102,B2), where the first argument is the array we’re searching in, and the second is the cell we’re counting instances of. This is nested inside an IF function which checks whether or not the cell is blank first – so if there’s no URL there, it doesn’t do any counting: =IF(ISBLANK(B2),"",COUNTIF(allusers!A$1:T$102,B2))

You can then sort this column yourself, or do as I did and use another TRUE/FALSE column to filter counts >1 into the next sheet, topusers, via =IF(C2>1,TRUE,FALSE) and then another FILTER function.
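For completeness, the FILTER in topusers looks something like the one below (the column letters are assumptions; match them to wherever your TRUE/FALSE helper column and the user/count pair actually sit in usercount):

=FILTER(usercount!B2:C, usercount!D2:D)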

There are probably more efficient ways of doing most of these steps, but it works as a quick and dirty solution. Do let me know if you found this useful, or find a better implementation!




