Simran Gill
30 April 2019
A technical SEO audit can be an extensive and often repetitive task – using Python to automate the essentials can be a massive help for any SEO team.
Fair warning – this is a bit of a technical blog post. If you have questions about the process, we’re always happy to talk about the details of technical SEO – so get in touch 🙂
Python is a programming language that was first released in 1991 and (helpfully) has a massive standard library of functions. The ‘core philosophy’ of Python – favouring readability and simplicity – is a natural fit with how we like to carry out SEO at Blueclaw.
As programming languages go, Python is straightforward to learn and has some immediate benefits when it comes to fetching, extracting and optimising HTML to supercharge your SEO. We recommend reading a few of the basic tutorials to get to grips with Python, but for the purposes of this blog post – let’s dive right in!
The project fetches the page using the Requests library and then extracts the data using BeautifulSoup4. These fantastic packages are included with Anaconda for Python 3.6, but if you don’t have them installed, do so using pip:
```shell
$ pip install --upgrade beautifulsoup4
$ pip install --upgrade requests
```
Now we need to fetch the HTML of a page and then parse it using BeautifulSoup’s HTML parser.
```python
import requests
from bs4 import BeautifulSoup

URL = 'https://www.blueclaw.co.uk'
response = requests.get(URL)
soup = BeautifulSoup(response.text, 'html.parser')
```
This works well, and we can already extract SEO data from this object (for example, the title can be found using `soup.title`), but we will make this much easier to expand on by using classes.
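For instance, extracting the title directly looks like this – a quick sketch on a static HTML snippet, so it runs without a network request:

```python
from bs4 import BeautifulSoup

# A static snippet stands in for a fetched page here.
html = '<html><head><title> Example Title </title></head><body></body></html>'
soup = BeautifulSoup(html, 'html.parser')

print(soup.title)                     # the full tag: <title> Example Title </title>
print(soup.title.get_text().strip())  # just the text: Example Title
```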
```python
class SeoAudit:
    def __init__(self, url):
        # If the url isn't provided with a scheme, prepend `https://`.
        self.url = requests.utils.prepend_scheme_if_needed(url, 'https')
        self.domain = requests.utils.urlparse(self.url).netloc
        response = requests.get(self.url)
        self.soup = BeautifulSoup(response.text, 'html.parser')

    def get_title(self):
        # The title comes still wrapped in <title> tags.
        title_tag = self.soup.title
        if title_tag is None:
            return title_tag
        # Use `get_text` and `strip` to remove the <title> tag and any
        # leading or trailing whitespace.
        return title_tag.get_text().strip()
```
By creating an instance of the SeoAudit class, initialised with our desired URL, we are able to work with the BeautifulSoup object.
```python
page = SeoAudit('https://www.blueclaw.co.uk')
print(page.get_title())

# Expected output:
# Award-Winning UK SEO Company, Blueclaw Search Agency, Leeds
```
Now let’s start to expand our class! We will write methods to pull out the h1, meta description and any links that are on the page, using Python’s built-in `re` module to check whether the links on the page contain our domain.
```python
import re


class SeoAudit:
    # ...

    def get_first_h1(self):
        h1_tag = self.soup.h1
        if h1_tag is None:
            return h1_tag
        return h1_tag.get_text().strip()

    def get_meta_description(self):
        meta_tag = self.soup.find('meta', attrs={
            'name': re.compile(r'(?i)description')
        })
        if meta_tag is None:
            return meta_tag
        return meta_tag.get('content')

    # Don't get too bogged down by the following regexes - just know
    # that they are being used to classify the links!
    def find_links(self, link_type='all'):
        if link_type not in ['all', 'internal', 'external']:
            return []
        if link_type == 'all':
            # Don't extract scroll, telephone or email links.
            href_ex = re.compile(r'^(?!#|tel:|mailto:)')
        elif link_type == 'internal':
            # Only extract links which match the domain name or use
            # relative paths.
            href_ex = re.compile(
                r'((https?:)?//(.+\.)?%s|^/[^/]*)' % re.escape(self.domain)
            )
        elif link_type == 'external':
            # Uses the not_domain method below to only match absolute
            # paths which do not match the domain name.
            href_ex = self.not_domain
        a_tags = self.soup.find_all('a', attrs={'href': href_ex})
        return [tag.get('href') for tag in a_tags]

    def not_domain(self, href):
        # Determine whether the href is an absolute URL that does not
        # belong to the given domain.
        return href and (
            re.compile(r'^(https?:)?//').search(href)
            and not re.compile(re.escape(self.domain)).search(href)
        )
```
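Before wiring these methods into a full audit, it can help to sanity-check the internal-link regex on its own. Here’s a quick offline sketch using the same pattern as `find_links` – the sample URLs are purely illustrative:

```python
import re

# The internal-link pattern from find_links, built for an example domain.
domain = 'www.blueclaw.co.uk'
internal_ex = re.compile(r'((https?:)?//(.+\.)?%s|^/[^/]*)' % re.escape(domain))

# Absolute links on our domain and relative paths should match;
# absolute links to other domains should not.
print(bool(internal_ex.search('https://www.blueclaw.co.uk/blog/')))  # True
print(bool(internal_ex.search('/contact-us')))                       # True
print(bool(internal_ex.search('https://www.google.com/')))           # False
```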
Stripping the title and h1 tags is becoming repetitive. Let’s write a decorator to make the process easier in the future and to improve readability.
```python
def strip_tag(func):
    def func_wrapper(*args, **kwargs):
        rtn = func(*args, **kwargs)
        if rtn is None:
            return rtn
        return rtn.get_text().strip()
    return func_wrapper


class SeoAudit:
    # ...

    @strip_tag
    def get_title(self):
        return self.soup.title

    @strip_tag
    def get_first_h1(self):
        return self.soup.h1
```
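To convince ourselves the decorator works, here’s a quick offline check on a static snippet (`strip_tag` is redefined so the sketch is self-contained):

```python
from bs4 import BeautifulSoup

# The same decorator as above, repeated so this snippet runs on its own.
def strip_tag(func):
    def func_wrapper(*args, **kwargs):
        rtn = func(*args, **kwargs)
        if rtn is None:
            return rtn
        return rtn.get_text().strip()
    return func_wrapper

soup = BeautifulSoup('<h1>  Hello, SEO!  </h1>', 'html.parser')

@strip_tag
def get_h1():
    return soup.h1

print(get_h1())  # Hello, SEO!
```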
Lovely.
Finally, we will make our lives easier by writing another method to gather this useful data into a Python dictionary:
```python
class SeoAudit:
    # ...

    def get_seo_data(self):
        return {
            'title': self.get_title(),
            'metaDescription': self.get_meta_description(),
            'h1': self.get_first_h1(),
            'internalLinks': self.find_links('internal'),
            'internalLinksCount': len(self.find_links('internal')),
            'externalLinks': self.find_links('external'),
            'externalLinksCount': len(self.find_links('external')),
        }
```
In this demonstration, I will use a simple input to provide the script with a URL – there are many better ways of doing this! I will also write the results to a JSON file named seoData.json.
```python
import json

url = input('Enter a URL to analyze: ')
page = SeoAudit(url)
out_obj = page.get_seo_data()

with open('seoData.json', 'w') as f:
    json.dump(out_obj, f, indent=2)
We can configure this in any way we want – for example, passing the URL as a command-line argument, looping through a CSV or JSON file, or any of a myriad of other approaches! Plus, by using classes, we have made it easy to add new methods to extract even more insight from the page.
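As an example, here’s a minimal sketch of the command-line-argument approach using Python’s argparse module. The explicit argument list stands in for `sys.argv` so the snippet runs anywhere – in a real script you would call `parse_args()` with no arguments and the commented line would hand the URL to SeoAudit:

```python
import argparse

# Take the URL as a command-line argument instead of using input().
parser = argparse.ArgumentParser(description='Run a basic SEO audit on a URL.')
parser.add_argument('url', help='The URL to audit')

# In a real run this would be parser.parse_args(), reading from sys.argv.
args = parser.parse_args(['https://www.blueclaw.co.uk'])

print(args.url)  # https://www.blueclaw.co.uk
# page = SeoAudit(args.url)
# ...
```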
This project is intended to be a starter for a more complete technical SEO tool. Feel free to build on it to suit your needs – Happy coding!