Scraping Reddit for insightful data and PR hooks – the quick and dirty way!

It would be pointless to pontificate on the wealth of data insight social media can provide – at this point that knowledge is inescapable.

However, awareness of something’s value doesn’t always translate to knowledge of how to harness it. Given that ‘data’ is simply information, and information is everything, it can be challenging to know where to start.

One useful tool for social media insight is web scraping. Scraping is collecting a range of data from a website and exporting it into an interpretable format. There are several methods to scrape web data, ranging from free software offering standard templates to python libraries.

Scraping a social media website can provide you with data that can be turned into insightful hooks with 10 minutes and Excel.

Virtually any website can be scraped (though you may wish to check the terms of use for blanket bans!) but in this blog I will be looking at Reddit.

You can easily use Reddit’s API to collect data, but I’m going to use Parsehub, a free tool that lets the user select and link the data items to be scraped.

Parsehub

This blog isn’t going to go into huge amounts of detail about Parsehub, because it’s intuitive software with an in-depth tutorial system, as well as countless guides online.

For scraping, it’s better to use old Reddit (old.reddit.com), so make sure to use the correct URL when setting up your scrape.

I’m going to look at /r/unitedkingdom and select the post title, date, URL, number of comments, karma, username and flair (if present).
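For anyone who prefers the API route mentioned earlier, here is a rough Python sketch of pulling the same fields. It is not the Parsehub workflow used in this blog. It assumes old Reddit's public JSON listing is reachable at the URL shown and that each post carries the field names used below; the `fetch_listing` and `listing_to_rows` helpers and the User-Agent string are made up for illustration.

```python
import json
import urllib.request
from datetime import datetime, timezone

# Assumed endpoint: old Reddit's public JSON listing for the subreddit.
LISTING_URL = "https://old.reddit.com/r/unitedkingdom/new.json?limit=100"


def fetch_listing(url=LISTING_URL):
    """Download one page of a subreddit listing as a dict (needs network access)."""
    # A descriptive User-Agent is assumed to be needed; this one is a placeholder.
    request = urllib.request.Request(url, headers={"User-Agent": "pr-hooks-sketch/0.1"})
    with urllib.request.urlopen(request) as response:
        return json.load(response)


def listing_to_rows(listing):
    """Flatten a listing dict into one row per post with the fields listed above."""
    rows = []
    for child in listing["data"]["children"]:
        post = child["data"]
        created = datetime.fromtimestamp(post["created_utc"], tz=timezone.utc)
        rows.append({
            "title": post["title"],
            "date": created.date().isoformat(),
            "url": post["url"],
            "comments": post["num_comments"],
            "karma": post["score"],
            "username": post["author"],
            "flair": post.get("link_flair_text"),  # None when no flair is set
        })
    return rows
```

Calling `listing_to_rows(fetch_listing())` would give a list of dictionaries that can be written out to CSV and opened in Excel, much like a Parsehub export.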

Analysis

There’s a lot you can do with this data once you have it – I’m going to do a simple, quick and dirty analysis in Excel.

/r/unitedkingdom is primarily made up of news-related posts, so even something as simple as the URLs posted can be insightful.

The above shows the most popular websites linked to by posts on the subreddit over the last few days (or the first 100 pages of the subreddit). The numbers show that the BBC News and Guardian websites are posted significantly more often than the next most posted domains.
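The domain count was done in Excel, but the same idea can be sketched in a few lines of Python. This is only an illustration: the `count_domains` helper is hypothetical, and the URLs are made-up examples rather than the scraped data.

```python
from collections import Counter
from urllib.parse import urlparse


def count_domains(urls):
    """Count how often each domain appears in a list of post URLs."""
    counts = Counter()
    for url in urls:
        domain = urlparse(url).netloc.lower()
        # Treat www.example.com and example.com as the same site.
        if domain.startswith("www."):
            domain = domain[4:]
        if domain:
            counts[domain] += 1
    return counts


# Hypothetical URLs for illustration only, not the scraped data.
urls = [
    "https://www.bbc.co.uk/news/uk-00000001",
    "https://www.theguardian.com/politics/example",
    "https://bbc.co.uk/news/uk-00000002",
]
print(count_domains(urls).most_common(2))
```

Self posts with no external link would show up under Reddit's own domain, so you may want to filter those out before charting.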

It could be interesting to do further analysis on the popularity of the Guardian: are posts from this domain generally better received than those from right-leaning publications, or are the topics they cover simply more likely to ‘click-bait’ Reddit users?

Reddit has a very useful popularity metric which they call karma, which I’ve used for the rest of the analysis.

One thing to keep in mind is that the principle behind Reddit is that people ‘upvote’ posts that are interesting, not only those they agree with. However, like any social media, Reddit is vulnerable to dissenting views being suppressed by downvotes, political ‘brigading’ (where users flock to downvote posts about particular issues they feel strongly about) and karma farming (where accounts spam posts about topics that are performing well, such as Matt Hancock, in order to reap the karma). This doesn’t really impact the usefulness of analysing a subreddit like /r/unitedkingdom but may be more significant for smaller or more controversial communities.

With that in mind…

Keyword search

A simple method of analysis is a keyword search. If you already know your topic, you can search for specific words and analyse how those compare to the average.

Using the search term “Euro”, it’s possible to identify 48 posts out of 1,481 as likely being about the Euros. Expanding the keywords to include “Euros” and “Football” increases the number of posts likely to be about the Euros to 58.

We can see in the chart above that Euros 2021 posts have had a significantly higher average karma than the overall average over the last few days.

Searching for keywords in this way can get an interesting hook in less than 10 minutes, with just a scraper and Excel.
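As a rough Python equivalent of the Excel keyword search, the sketch below flags posts whose title contains any of the keywords and compares their average karma with the overall average. The `matches_any` and `compare_karma` helpers and the sample posts are hypothetical, and the karma values are invented for illustration.

```python
from statistics import mean


def matches_any(title, keywords):
    """Case-insensitive substring match of any keyword against a title."""
    lowered = title.lower()
    return any(keyword.lower() in lowered for keyword in keywords)


def compare_karma(posts, keywords):
    """Return (average karma of matching posts, average karma of all posts)."""
    matched = [p["karma"] for p in posts if matches_any(p["title"], keywords)]
    overall = [p["karma"] for p in posts]
    # mean() raises on an empty list, so report None when there is nothing to average.
    return (
        mean(matched) if matched else None,
        mean(overall) if overall else None,
    )


# Hypothetical posts for illustration only, not the scraped data.
posts = [
    {"title": "England through to the Euros final", "karma": 300},
    {"title": "Football fans gather in Trafalgar Square", "karma": 100},
    {"title": "Council approves new cycle lanes", "karma": 20},
]
keyword_avg, overall_avg = compare_karma(posts, ["Euro", "Football"])
print(keyword_avg, overall_avg)
```

Substring matching is deliberately crude: “Euro” will also catch titles containing “European”, which is why the matches are only *likely* to be about the tournament.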

Popular terms

Conversely, the same quick and dirty approach can be taken to find popular words. By splitting the post titles into individual strings and then counting the occurrences, it’s possible to pick out keywords. After excluding the filler words, the below came out top for the last few days:

| Term | Occurrences |
| --- | --- |
| UK | 203 |
| COVID | 167 |
| HANCOCK | 74 |
| ENGLAND | 73 |
| NEWS | 67 |
| GOVERNMENT | 66 |
| BREXIT | 52 |
| LABOUR | 52 |
| JOHNSON | 49 |
| POLICE | 48 |

Unsurprisingly, “Covid” comes out near the top, but Matt Hancock is beating Boris for the big topic of conversation on Reddit, after being all over the news cycle last week.

This is echoed in Google search trends, which show an increase in searches for Matt Hancock in the same period as he featured heavily on /r/unitedkingdom.
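The split-and-count step behind the popular terms can also be sketched in Python rather than Excel. The filler-word list here is an assumption (the words actually excluded in the analysis aren't listed), the `top_terms` helper is hypothetical, and the sample titles are invented.

```python
import re
from collections import Counter

# Assumed filler-word list; the words excluded in the original analysis are not known.
FILLER_WORDS = {"the", "a", "an", "of", "to", "in", "for", "and", "on", "is", "as", "after"}


def top_terms(titles, n=10, filler=FILLER_WORDS):
    """Return the n most common non-filler words across all titles."""
    counts = Counter()
    for title in titles:
        # Pull out runs of letters (and apostrophes) as individual words.
        words = re.findall(r"[A-Za-z']+", title)
        # Upper-case so that "Covid" and "COVID" are counted together.
        counts.update(w.upper() for w in words if w.lower() not in filler)
    return counts.most_common(n)


# Hypothetical titles for illustration only, not the scraped data.
titles = [
    "Hancock resigns after Covid rule breach",
    "Covid cases rise in England",
    "Hancock and the Covid rules",
]
print(top_terms(titles, n=2))
```

Note that this counts exact words, so “rule” and “rules” are treated as different terms; for a quick hook that is usually good enough.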

Once you’ve identified trending topics you can proceed with more in-depth analysis, such as sentiment analysis or comparison with other social media sites or the media.

Reddit’s karma system is a solid metric for how popular a post is – it’s calculated essentially based on the number of ‘upvotes’ compared to ‘downvotes’ (Reddit’s version of like and dislike).

Here we can see that posts about Matt Hancock attract more comments than the average, and their post karma is massively higher than the average. This indicates that over the last few days, the same time as the Hancock scandal broke, posts about him performed particularly well.

Other possibilities

The possibilities are essentially limitless when it comes to Reddit scraping.

Comments on posts are another wealth of potential insights. Having identified Matt Hancock as a popular topic on /r/unitedkingdom, you could scrape the comments under the relevant posts to run a more in-depth sentiment analysis or run through the same process of keyword identification.

Another avenue is a comparison between generic subreddits like /r/unitedkingdom and subreddits dedicated to a specific topic. So, you could compare the sentiment of posts about Matt Hancock on /r/unitedkingdom with that on a UK politics subreddit, or go even deeper with a Conservative/Liberal subreddit.

Love/hate/love to hate topics like Love Island can yield interesting results with this method. Posts about Love Island on /r/unitedkingdom averaged a karma score of 0.2 (very low) compared with an average closer to 200 on the dedicated subreddit. A sentiment analysis would likely reveal even more interesting results!

Conclusion

So, armed with only a scraper and Excel, it’s possible to pull data hooks in a quick session. While data analysis can go significantly deeper than what’s been mentioned thus far, these simple techniques can produce interesting insights to be spun into a campaign or researched further.

Written by

Simran Gill
