Technical SEO Audit Automation with Python

Technical SEO audits can be an extensive and often repetitive task – using Python to automate the essentials can be a massive help for any SEO team.

Fair warning – this is a bit of a technical blog post. If you have questions about the process, we’re always happy to talk about the details of technical SEO – so get in touch 🙂

Python is a programming language that was first released in 1991 and (helpfully) has a massive standard library of functions. The ‘core philosophy’ of Python is a natural fit with how we like to carry out SEO at Blueclaw:

  • Beautiful is better than ugly
  • Explicit is better than implicit
  • Simple is better than complex
  • Complex is better than complicated
  • Readability counts

As programming languages go, Python is straightforward to learn and has some immediate benefits when it comes to fetching, extracting and optimising HTML to supercharge your SEO. We recommend reading a few of the basic tutorials to get to grips with Python, but for the purposes of this blog post – let’s dive right in!

Fetching the HTML

The project fetches the page using the Requests library and then extracts the data using BeautifulSoup4. These fantastic packages are included with Anaconda for Python 3.6, but if you don’t have them installed, do so using pip.

$ pip install --upgrade beautifulsoup4
$ pip install --upgrade requests

Now we need to fetch the HTML of a page and then parse it using BeautifulSoup’s HTML parser.

import requests
from bs4 import BeautifulSoup

URL = 'https://www.blueclaw.co.uk'
response = requests.get(URL)
soup = BeautifulSoup(response.text, 'html.parser')

This works well, and we can already extract SEO data from this object (for example, the title can be found using `soup.title`), but we will make this much easier to expand on by using classes.
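For a quick sanity check, we can print a couple of values straight from the soup object (the exact output will depend on the live page, so treat these as illustrative):

print(soup.title)             # the full <title> element, tags included
print(soup.title.get_text())  # just the text inside the tag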

Extracting the important tags

class SeoAudit:
    def __init__(self, url):
        '''
        If the url isn't provided with a scheme, prepend with `https://`
        '''
        self.url = requests.utils.prepend_scheme_if_needed(url, 'https')
        self.domain = requests.utils.urlparse(self.url).netloc
        response = requests.get(self.url)
        self.soup = BeautifulSoup(response.text, 'html.parser')

    def get_title(self):
        '''
        `soup.title` returns the whole <title> element, so we extract
        and clean the text before returning it.
        '''
        title_tag = self.soup.title
        if title_tag is None:
            return None
        # get_text() removes the <title> tags and strip() removes any
        # leading or trailing whitespace.
        return title_tag.get_text().strip()

By creating an instance of the SeoAudit class, initialised with our desired URL, we are able to work with the BeautifulSoup object.

page = SeoAudit('https://www.blueclaw.co.uk')
print(page.get_title())

# Expected output:
# Award-Winning UK SEO Company, Blueclaw Search Agency, Leeds

Now let’s start to expand our class! We will write methods to pull out the h1, meta description and any links on the page, and we will use Python’s built-in re library to check whether the links contain our domain.

import re

class SeoAudit:
    # ...
    def get_first_h1(self):
        h1_tag = self.soup.h1
        if h1_tag is None:
            return None
        return h1_tag.get_text().strip()

    def get_meta_description(self):
        meta_tag = self.soup.find('meta', attrs={
            'name': re.compile(r'(?i)description')
        })
        if meta_tag is None:
            return None
        return meta_tag.get('content')

    # Don’t get too bogged down by the following regexes - just know
    # that they are being used to classify the links!

    def find_links(self, link_type='all'):
        if link_type not in ['all', 'internal', 'external']:
            return []

        if link_type == 'all':
            # Skip in-page scroll (#), telephone and email links.
            href_ex = re.compile(r'^(?!#|tel:|mailto:)')
        elif link_type == 'internal':
            # Only match links on our domain, or relative paths.
            href_ex = re.compile(
                r'((https?:)?//(.+\.)?%s|^/[^/]*)' % re.escape(self.domain)
            )
        elif link_type == 'external':
            # Use the not_domain method below to match only absolute
            # URLs that do not point at our domain.
            href_ex = self.not_domain

        a_tags = self.soup.find_all('a', attrs={'href': href_ex})

        return [tag.get('href') for tag in a_tags]

    def not_domain(self, href):
        '''
        Check whether the href is an absolute URL that does not
        belong to our domain.
        '''
        return href and (
            re.search(r'^(https?:)?//', href)
            and not re.search(re.escape(self.domain), href)
        )
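As a quick check, we can compare the different link types (the counts below are purely illustrative – they will depend on the page at the time of the audit):

page = SeoAudit('https://www.blueclaw.co.uk')
print(len(page.find_links('all')))       # e.g. 120
print(len(page.find_links('internal')))  # e.g. 95
print(len(page.find_links('external')))  # e.g. 25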

Stripping the title and h1 tags is becoming repetitive. Let’s write a decorator to make the process easier in the future and to improve readability.

def strip_tag(func):
    def func_wrapper(*args, **kwargs):
        rtn = func(*args, **kwargs)
        if rtn is None:
            return rtn
        return rtn.get_text().strip()

    return func_wrapper

class SeoAudit:
    # ...
    @strip_tag
    def get_title(self):
        return self.soup.title

    @strip_tag
    def get_first_h1(self):
        return self.soup.h1

Lovely.
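One optional refinement (not part of the original snippet, but a common Python idiom): wrapping func_wrapper with functools.wraps preserves the decorated method’s name and docstring, which helps with debugging.

from functools import wraps

def strip_tag(func):
    @wraps(func)  # keep the original function's name and docstring
    def func_wrapper(*args, **kwargs):
        rtn = func(*args, **kwargs)
        if rtn is None:
            return rtn
        return rtn.get_text().strip()

    return func_wrapper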

Finally, we will make our lives easier by writing another method that gathers all of this useful data into a Python dictionary.

class SeoAudit:
    # ...
    def get_seo_data(self):
        # Call find_links once per link type rather than twice.
        internal_links = self.find_links('internal')
        external_links = self.find_links('external')
        return {
            'title': self.get_title(),
            'metaDescription': self.get_meta_description(),
            'h1': self.get_first_h1(),
            'internalLinks': internal_links,
            'internalLinksCount': len(internal_links),
            'externalLinks': external_links,
            'externalLinksCount': len(external_links),
        }
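With that in place, a single method call returns the full audit for a page (output illustrative):

page = SeoAudit('https://www.blueclaw.co.uk')
print(page.get_seo_data()['title'])
# Award-Winning UK SEO Company, Blueclaw Search Agency, Leeds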


Making it user-friendly

In this demonstration, I will use a simple input to provide the script with a URL. There are many better ways of doing this! I will also write the results to a JSON file named seoData.json.

import json

url = input('Enter a URL to analyze: ')
page = SeoAudit(url)
out_obj = page.get_seo_data()

with open('seoData.json', 'w') as f:
    json.dump(out_obj, f, indent=2)

We can configure this in any way we want: for example, passing the URL as a command-line argument, looping through a CSV or JSON file, or any number of other approaches! Plus, by using classes, we have made it easy to add new methods to extract even more insight from the page.
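As one illustration (a minimal sketch using Python’s built-in argparse, not part of the original script), the command-line-argument approach might look like this:

import argparse
import json

parser = argparse.ArgumentParser(description='Run a quick SEO audit on a URL.')
parser.add_argument('url', help='the URL to audit')
args = parser.parse_args()

page = SeoAudit(args.url)
with open('seoData.json', 'w') as f:
    json.dump(page.get_seo_data(), f, indent=2)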

This project is intended to be a starter for a more complete technical SEO tool. Feel free to build on it to suit your needs – Happy coding!

Written by Simran Gill
