5 Python Scripts to Automate SEO Tasks Efficiently

Are you looking to streamline your SEO workflow and save time? Python, a powerful programming language, has gained significant popularity among SEO professionals for its simplicity and efficiency. It offers a versatile toolkit to automate repetitive SEO tasks, making your process faster and more effective.

In this article, we’ll explore five Python scripts that can help you boost your SEO efforts. Whether you’re new to Python or have some coding experience, these scripts are designed to save time on tedious tasks, giving you more room to focus on strategy and growth.

👉 Check out the full project and get started with automation here: SEO Automation Project

Why Python is a Game-Changer for SEO

Python’s popularity in the SEO industry has skyrocketed due to its:

Easy-to-learn syntax, making it accessible to those without a technical background.
Wide range of libraries that offer specialized tools for scraping, text processing, and data analysis.
Efficiency in handling large data sets, which is often a necessity in SEO.

Whether you’re working on large-scale redirect maps or need to analyze keywords, Python can do the heavy lifting for you.

Get Started with Google Colab

If you’re new to Python, Google Colab is a great place to start. It’s a free, web-based platform that allows you to run Python code directly in your browser without complicated setups. With pre-installed libraries and the ability to upload files easily, it’s perfect for experimenting with these scripts.
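The scripts in this article read their input from plain text files with one entry per line (for example urls.txt or keywords.txt). As a minimal sketch, assuming you would rather create such a file from a notebook cell than upload one, something like the following could work. The file name and URLs are placeholders, and write_lines and read_lines are hypothetical helper names that do not appear in the scripts themselves.

```python
from pathlib import Path

def write_lines(file_name, lines):
    """Write one entry per line, skipping blank entries (hypothetical helper)."""
    cleaned = [line.strip() for line in lines if line.strip()]
    Path(file_name).write_text("\n".join(cleaned) + "\n", encoding="utf-8")
    return len(cleaned)

def read_lines(file_name):
    """Read the entries back, one per non-empty line (hypothetical helper)."""
    text = Path(file_name).read_text(encoding="utf-8")
    return [line.strip() for line in text.splitlines() if line.strip()]

# Placeholder URLs for a first test run
sample_urls = [
    "https://example.com/old-page-1",
    "https://example.com/old-page-2",
]
write_lines("sample_urls.txt", sample_urls)
print(read_lines("sample_urls.txt"))
```

Skipping blank lines on both write and read keeps stray empty entries out of the later scraping steps.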

Script 1: Automate a Redirect Map

Manually creating redirect maps for large websites can be a daunting task. This script simplifies the process by matching content between old and new URLs, giving you a suggested target for each old URL along with a similarity score to review.

How It Works:

It scrapes content from two lists of URLs (old and new).
It uses the BeautifulSoup library to extract the paragraph text from each page, which leaves out most header and footer markup.
Then, it uses the PolyFuzz library to match the content and give each pair a similarity score between 0 and 1, which can then be reviewed.
Finally, the script generates a CSV file with the similarity data, allowing you to quickly assess and refine your redirects.

#import libraries
from bs4 import BeautifulSoup, SoupStrainer
from polyfuzz import PolyFuzz
import concurrent.futures
import multiprocessing
import csv
import pandas as pd
import requests

# Function to read URLs from files
def read_urls(file_name):
    with open(file_name, "r") as file:
        return [line.strip() for line in file]

# Enhanced content scraper function via BeautifulSoup
def get_content(url):
    try:
        response = requests.get(url, timeout=10)  # added timeout for better error handling
        if response.status_code == 200:
            page_source = response.text
            strainer = SoupStrainer('p')
            soup = BeautifulSoup(page_source, 'lxml', parse_only=strainer)
            paragraph_list = [element.text for element in soup.find_all(strainer)]
            content = " ".join(paragraph_list)
            return content
        else:
            return ""
    except requests.RequestException as e:
        return ""  # return empty if request fails

# Function to extract content from many URLs concurrently using a thread pool
def extract_content_in_parallel(urls, max_workers=None):
    with concurrent.futures.ThreadPoolExecutor(max_workers=max_workers) as executor:
        content_list = list(executor.map(get_content, urls))
    return content_list

# Function to map content to URLs
def get_key_from_value(content_dictionary, value):
    return next((key for key, val in content_dictionary.items() if val == value), None)

# Main function to perform the entire matching process
def match_content_and_export(source_urls, target_urls, max_workers=None):
    # Scrape content from both lists in parallel
    content_list_a = extract_content_in_parallel(source_urls, max_workers)
    content_list_b = extract_content_in_parallel(target_urls, max_workers)

    # Create content dictionary for target URLs
    content_dictionary = dict(zip(target_urls, content_list_b))

    # Use PolyFuzz to find similarities
    model = PolyFuzz("TF-IDF")
    model.match(content_list_a, content_list_b)
    data = model.get_matches()

    # Map the matched content back to the target URLs
    # (a plain loop is enough here; pool.map would pass each tuple as a single argument and fail)
    result = [get_key_from_value(content_dictionary, to_val) for to_val in data["To"]]

    # Prepare final results and save to CSV
    df = pd.DataFrame(list(zip(source_urls, result, data["Similarity"])))
    df.columns = ["From URL", "To URL", "Similarity"]
    df.to_csv("redirect_map.csv", index=False)

# Perform content matching and export the results
if __name__ == "__main__":
    # Read source and target URLs from files (one URL per line)
    source_urls = read_urls("source_urls.txt")
    target_urls = read_urls("target_urls.txt")

    # Scraping is network-bound, so run one worker thread per CPU core
    max_workers = multiprocessing.cpu_count()
    match_content_and_export(source_urls, target_urls, max_workers=max_workers)
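Once redirect_map.csv exists, the review step can be partly automated. The following is a minimal sketch, not part of the original script: it splits rows into confident matches and ones that need a manual check. The 0.8 threshold and the helper names (to_score, split_by_threshold, load_redirect_map) are assumptions chosen for illustration, and the example rows are made up.

```python
import csv

def to_score(value):
    """Convert a CSV cell to a float, treating blanks or bad values as 0.0."""
    try:
        return float(value)
    except (TypeError, ValueError):
        return 0.0

def split_by_threshold(rows, threshold=0.8):
    """Split (from_url, to_url, similarity) rows into confident and review lists."""
    confident, needs_review = [], []
    for from_url, to_url, similarity in rows:
        score = to_score(similarity)
        # Rows with no matched target or a low score go to manual review
        if to_url and score >= threshold:
            confident.append((from_url, to_url, score))
        else:
            needs_review.append((from_url, to_url, score))
    return confident, needs_review

def load_redirect_map(file_name="redirect_map.csv"):
    """Read a three-column redirect CSV, skipping its header row."""
    with open(file_name, newline="", encoding="utf-8") as f:
        reader = csv.reader(f)
        next(reader, None)  # skip the header
        return [tuple(row) for row in reader if len(row) == 3]

# Illustrative rows only; real rows would come from load_redirect_map()
example_rows = [
    ("https://example.com/old-a", "https://example.com/new-a", "0.93"),
    ("https://example.com/old-b", "https://example.com/new-b", "0.41"),
]
confident, needs_review = split_by_threshold(example_rows)
print(len(confident), "confident,", len(needs_review), "to review")
```

The right threshold depends on how similar your old and new pages are, so treat it as a starting point and spot-check both lists.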

Script 2: Bulk Meta Descriptions Generator

Meta descriptions might not directly affect rankings, but a good one can improve click-through rates (CTR). Writing them manually for thousands of pages is time-consuming. This script can automatically generate meta descriptions for your URLs.

How It Works:

It pulls the content from a list of URLs.
The script then uses the LSA summarizer from the Sumy library to condense each page into a short description (three sentences in this script).
If the description exceeds 155 characters, the script cuts it to 152 characters and appends an ellipsis.

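# Install necessary packages (run in the terminal or in a notebook environment)
# !pip install sumy
# Sumy's tokenizer relies on NLTK data; if you see a LookupError, download the
# resource named in the error message with nltk.download(...)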

from sumy.parsers.html import HtmlParser
from sumy.nlp.tokenizers import Tokenizer
from sumy.nlp.stemmers import Stemmer
from sumy.utils import get_stop_words
from sumy.summarizers.lsa import LsaSummarizer
import concurrent.futures
import csv
import requests
from requests.exceptions import RequestException

# Function to get meta description from URL content
def generate_meta_description(url):
    try:
        # Parse the content from the URL
        parser = HtmlParser.from_url(url, Tokenizer("english"))
        stemmer = Stemmer("english")
        summarizer = LsaSummarizer(stemmer)
        summarizer.stop_words = get_stop_words("english")
        
        # Summarize the content to generate a meta description
        description = summarizer(parser.document, 3)
        description = " ".join(str(sentence) for sentence in description)
        
        # Ensure the meta description is less than or equal to 155 characters
        if len(description) > 155:
            description = description[:152] + '...'
            
        return {'url': url, 'description': description}
    
    except (RequestException, ValueError) as e:
        # Handle any request errors or parsing issues
        return {'url': url, 'description': 'Error fetching content'}

# Function to read URLs from file
def read_urls(file_name):
    with open(file_name, 'r') as f:
        return [line.strip() for line in f]

# Main function to process URLs in parallel
def process_urls_in_parallel(urls, max_workers=None):
    results = []
    
    # Use ThreadPoolExecutor for parallel processing of URLs
    with concurrent.futures.ThreadPoolExecutor(max_workers=max_workers) as executor:
        # Submit tasks to the thread pool
        future_to_url = {executor.submit(generate_meta_description, url): url for url in urls}
        
        # Collect the results as they are completed
        for future in concurrent.futures.as_completed(future_to_url):
            try:
                result = future.result()
                results.append(result)
            except Exception as e:
                # Handle exceptions during execution
                url = future_to_url[future]
                results.append({'url': url, 'description': 'Error processing URL'})
    
    return results

# Function to write the results to a CSV file
def write_results_to_csv(results, output_file='results.csv'):
    with open(output_file, 'w', newline='') as f:
        writer = csv.DictWriter(f, fieldnames=['url', 'description'])
        writer.writeheader()
        writer.writerows(results)

# Main script execution
if __name__ == "__main__":
    # Read URLs from the file
    urls = read_urls('urls.txt')
    
    # Let ThreadPoolExecutor choose its default number of worker threads
    # (concurrent.futures has no cpu_count() function)
    max_workers = None
    
    # Process the URLs in parallel and generate meta descriptions
    results = process_urls_in_parallel(urls, max_workers=max_workers)
    
    # Write the results to a CSV file
    write_results_to_csv(results)

    print(f"Processed {len(results)} URLs and exported to 'results.csv'.")
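The script above cuts long descriptions at a fixed character position, which can split a word in half. As a hedged alternative, here is a minimal sketch of trimming at the last word boundary instead. The trim_description helper and its parameters are hypothetical and not part of Sumy or the original script.

```python
def trim_description(text, limit=155, ellipsis="..."):
    """Trim text to at most `limit` characters, cutting at a word boundary."""
    # Collapse runs of whitespace so the length check is meaningful
    text = " ".join(text.split())
    if len(text) <= limit:
        return text

    # Leave room for the ellipsis, then back up to the last complete word
    cut = text[: limit - len(ellipsis)]
    if " " in cut:
        cut = cut.rsplit(" ", 1)[0]

    # Drop trailing punctuation so the ellipsis reads cleanly
    return cut.rstrip(" ,;:.") + ellipsis

long_text = "word " * 60
trimmed = trim_description(long_text)
print(len(trimmed), trimmed[-12:])
```

To use it, you could call trim_description(description) in place of the fixed-position slice inside generate_meta_description.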

Script 3: Keyword Analysis with N-Grams

Understanding keyword themes across large datasets is critical in SEO. This script breaks down keywords into unigrams, bigrams, and trigrams, helping you identify common patterns in your keyword strategy.

How It Works:

It reads a list of keywords from a file.
Using regular expressions and Counter, the script calculates the most frequent unigrams, bigrams, and trigrams.
The ten most frequent unigrams, bigrams, and trigrams are exported to a text file for easy analysis.

# Import necessary libraries
import re
from collections import Counter
import concurrent.futures

# Function to clean a word (lowercase it and remove non-alphabetic characters)
def clean_word(word):
    return re.sub(r'[^a-z]', '', word.lower())

# Function to count unigrams, bigrams, and trigrams within each keyword
def count_ngrams(keywords):
    unigrams = Counter()
    bigrams = Counter()
    trigrams = Counter()

    for keyword in keywords:
        # Clean the words of this keyword and drop any that end up empty
        words = [cleaned for cleaned in map(clean_word, keyword.split()) if cleaned]

        for i in range(len(words)):
            # Unigrams
            unigrams[words[i]] += 1

            # Bigrams
            if i < len(words) - 1:
                bigram = words[i] + ' ' + words[i + 1]
                bigrams[bigram] += 1

            # Trigrams
            if i < len(words) - 2:
                trigram = words[i] + ' ' + words[i + 1] + ' ' + words[i + 2]
                trigrams[trigram] += 1

    return unigrams, bigrams, trigrams

# Function to process keywords in chunks using a thread pool
def process_in_parallel(keywords, chunk_size=10000):
    chunks = [keywords[i:i + chunk_size] for i in range(0, len(keywords), chunk_size)]

    unigrams = Counter()
    bigrams = Counter()
    trigrams = Counter()

    # Count each chunk on a worker thread, then merge the partial counts
    with concurrent.futures.ThreadPoolExecutor() as executor:
        futures = [executor.submit(count_ngrams, chunk) for chunk in chunks]

        for future in concurrent.futures.as_completed(futures):
            uni, bi, tri = future.result()
            unigrams.update(uni)
            bigrams.update(bi)
            trigrams.update(tri)

    return unigrams, bigrams, trigrams

# Main function
if __name__ == "__main__":
    # Read one keyword per line so that n-grams never span two keywords
    with open('keywords.txt', 'r') as f:
        keywords = [line.strip() for line in f if line.strip()]

    # Count the n-grams chunk by chunk
    unigrams, bigrams, trigrams = process_in_parallel(keywords)

    # Sort the dictionaries by the number of occurrences
    sorted_unigrams = sorted(unigrams.items(), key=lambda x: x[1], reverse=True)
    sorted_bigrams = sorted(bigrams.items(), key=lambda x: x[1], reverse=True)
    sorted_trigrams = sorted(trigrams.items(), key=lambda x: x[1], reverse=True)

    # Write the results to a text file
    with open('results.txt', 'w') as f:
        f.write("Most common unigrams:\n")
        for unigram, count in sorted_unigrams[:10]:
            f.write(f"{unigram}: {count}\n")
        f.write("\nMost common bigrams:\n")
        for bigram, count in sorted_bigrams[:10]:
            f.write(f"{bigram}: {count}\n")
        f.write("\nMost common trigrams:\n")
        for trigram, count in sorted_trigrams[:10]:
            f.write(f"{trigram}: {count}\n")
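To see what the counts look like on a small input, here is a minimal, self-contained sketch that counts bigrams within each keyword separately. The sample keywords and the ngrams helper are illustrative and not part of the script above.

```python
from collections import Counter

def ngrams(words, n):
    """Return the list of n-word sequences found in a list of words."""
    return [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]

# Illustrative keywords, one search phrase each
sample_keywords = [
    "buy running shoes",
    "best running shoes",
    "running shoes sale",
]

# Count bigrams per keyword so no bigram spans two different keywords
bigram_counts = Counter(
    gram for keyword in sample_keywords for gram in ngrams(keyword.split(), 2)
)
print(bigram_counts.most_common(1))  # [('running shoes', 3)]
```

Changing the second argument of ngrams to 1 or 3 gives unigram or trigram counts for the same list.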

Script 4: Group Keywords into Topic Clusters

Keyword clustering is essential for content planning and SEO mapping. This script automatically groups your keywords into clusters, helping you identify content themes and keyword overlap.

How It Works:

It analyzes a list of keywords using TfidfVectorizer and clusters them using AffinityPropagation.
The script assigns each keyword to a cluster, making it easier to map out content topics.

import csv
import numpy as np
from sklearn.cluster import AffinityPropagation
from sklearn.feature_extraction.text import TfidfVectorizer
import concurrent.futures

# Read keywords from the text file
def read_keywords(file_name):
    with open(file_name, "r") as f:
        return f.read().splitlines()

# Function to cluster keywords using AffinityPropagation
def cluster_keywords(keywords):
    # Create a Tf-idf representation of the keywords
    vectorizer = TfidfVectorizer()
    X = vectorizer.fit_transform(keywords)

    # Perform Affinity Propagation clustering
    af = AffinityPropagation().fit(X)
    labels = af.labels_
    n_clusters = len(af.cluster_centers_indices_)

    return labels, n_clusters

# Function to write clusters to a CSV file
def write_clusters_to_csv(keywords, labels, n_clusters, output_file="clusters.csv"):
    with open(output_file, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["Cluster", "Keyword"])
        # Assign keywords to their respective clusters
        for i in range(n_clusters):
            # Collect the keywords that belong to cluster i
            members = [keywords[j] for j in range(len(labels)) if labels[j] == i]
            for keyword in members:
                writer.writerow([i, keyword])

# Run clustering and the CSV export on a worker thread.
# The two steps run one after the other, because the export needs the cluster labels.
def process_clustering_in_parallel(keywords, max_workers=None):
    with concurrent.futures.ThreadPoolExecutor(max_workers=max_workers) as executor:
        # Cluster the keywords
        future_cluster = executor.submit(cluster_keywords, keywords)
        labels, n_clusters = future_cluster.result()

        # Write the results to CSV once clustering has finished
        future_write = executor.submit(write_clusters_to_csv, keywords, labels, n_clusters)
        future_write.result()

# Main script execution
if __name__ == "__main__":
    # Load keywords from the text file
    keywords = read_keywords("keywords.txt")

    # Let ThreadPoolExecutor choose its default number of worker threads
    # (concurrent.futures has no cpu_count() function)
    max_workers = None

    # Cluster the keywords and write the results
    process_clustering_in_parallel(keywords, max_workers=max_workers)

    print("Keyword clustering completed and exported to 'clusters.csv'.")
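Cluster numbers alone are hard to read when mapping content topics. As a rough sketch, assuming (cluster, keyword) rows like those written to clusters.csv, each cluster can be given a provisional name based on its most frequent word. The label_clusters helper and the example rows are hypothetical, and this naming rule is a simple heuristic rather than anything scikit-learn provides.

```python
from collections import Counter, defaultdict

def label_clusters(rows):
    """Group (cluster_id, keyword) rows and name each cluster by its most common word."""
    groups = defaultdict(list)
    for cluster_id, keyword in rows:
        groups[cluster_id].append(keyword)

    labelled = {}
    for cluster_id, members in groups.items():
        # Count every word across the cluster's keywords
        word_counts = Counter(word for kw in members for word in kw.lower().split())
        label = word_counts.most_common(1)[0][0] if word_counts else ""
        labelled[cluster_id] = {"label": label, "keywords": members}
    return labelled

# Illustrative rows only; real rows would be read from clusters.csv
example_rows = [
    (0, "running shoes"),
    (0, "trail running"),
    (0, "running socks"),
    (1, "seo audit"),
    (1, "seo checklist"),
]
for cluster_id, info in label_clusters(example_rows).items():
    print(cluster_id, info["label"], info["keywords"])
```

Very common words can dominate the labels, so you may want to filter stop words before counting.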

Script 5: Match Keywords to Predefined Topics

If you already have predefined topics and want to map a keyword list against them, this script can match keywords to the closest topic, making it ideal for large-scale content categorization.

How It Works:

The script reads two lists: one of keywords and one of predefined topics.
It uses spaCy to tokenize each keyword and topic, drops stop words and punctuation, and counts the words they share.
If a keyword shares no words with any topic, it gets categorized as “Other.”

import pandas as pd
import spacy
from spacy.lang.en.stop_words import STOP_WORDS
import concurrent.futures

# Load the Spacy English language model
nlp = spacy.load("en_core_web_sm")

# Define the batch size for keyword analysis
BATCH_SIZE = 1000

# Load the keywords and topics files as Pandas dataframes
keywords_df = pd.read_csv("keywords.txt", header=None, names=["keyword"])
topics_df = pd.read_csv("topics.txt", header=None, names=["topic"])

# Define a function to categorize a keyword based on the closest related topic
def categorize_keyword(keyword):
    tokens = nlp(keyword.lower())
    tokens = [token.text for token in tokens if not token.is_stop and not token.is_punct]
    max_overlap = 0
    best_topic = "Other"
    
    for topic in topics_df["topic"]:
        topic_tokens = nlp(topic.lower())
        topic_tokens = [token.text for token in topic_tokens if not token.is_stop and not token.is_punct]
        overlap = len(set(tokens).intersection(set(topic_tokens)))
        
        if overlap > max_overlap:
            max_overlap = overlap
            best_topic = topic
            
    return {"keyword": keyword, "category": best_topic}

# Function to process batches in parallel
def process_keyword_batch_in_parallel(keyword_batch, max_workers=None):
    results = []
    
    # Use ThreadPoolExecutor to process the keyword matching concurrently
    with concurrent.futures.ThreadPoolExecutor(max_workers=max_workers) as executor:
        future_to_keyword = {executor.submit(categorize_keyword, keyword): keyword for keyword in keyword_batch}
        
        for future in concurrent.futures.as_completed(future_to_keyword):
            try:
                result = future.result()
                results.append(result)
            except Exception as e:
                # Handle any potential errors during execution
                keyword = future_to_keyword[future]
                results.append({"keyword": keyword, "category": "Error"})
    
    return pd.DataFrame(results)

# Collect the results of each batch
batch_results = []

# Let ThreadPoolExecutor choose its default number of worker threads
# (concurrent.futures has no cpu_count() function)
max_workers = None

# Process the keywords in batches
for i in range(0, len(keywords_df), BATCH_SIZE):
    keyword_batch = keywords_df.iloc[i:i+BATCH_SIZE]["keyword"].tolist()

    # Process the keyword batch in parallel
    batch_results.append(process_keyword_batch_in_parallel(keyword_batch, max_workers=max_workers))

# Combine the batch results into the final results dataframe
if batch_results:
    results_df = pd.concat(batch_results, ignore_index=True)
else:
    results_df = pd.DataFrame(columns=["keyword", "category"])

# Export the results to a CSV file
results_df.to_csv("results.csv", index=False)

print("Keyword to topic matching completed and exported to 'results.csv'.")
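The matching rule itself is simple: pick the topic that shares the most words with the keyword, and fall back to "Other" when nothing overlaps. Here is a minimal sketch of that rule in plain Python, without spaCy. The tiny stop-word set, the whitespace tokenizer, and the helper names are simplifying assumptions for illustration; spaCy's tokenization and stop-word list are more thorough.

```python
# A tiny illustrative stop-word set (an assumption; spaCy's list is far larger)
STOP_WORDS = {"a", "an", "the", "for", "to", "of", "in", "and", "how"}

def tokenize(text):
    """Lowercase, split on whitespace, strip punctuation, drop stop words."""
    words = (word.strip(".,!?;:\"'()") for word in text.lower().split())
    return {word for word in words if word and word not in STOP_WORDS}

def closest_topic(keyword, topics):
    """Return the topic sharing the most words with the keyword, else 'Other'."""
    keyword_tokens = tokenize(keyword)
    best_topic, max_overlap = "Other", 0
    for topic in topics:
        overlap = len(keyword_tokens & tokenize(topic))
        # Only a strictly larger overlap replaces the current best topic
        if overlap > max_overlap:
            best_topic, max_overlap = topic, overlap
    return best_topic

example_topics = ["link building", "technical seo", "content marketing"]
print(closest_topic("how to do link building outreach", example_topics))
print(closest_topic("python tutorial", example_topics))
```

Because the rule compares exact words, "links" will not match "link"; comparing lemmas instead of raw tokens is one way to loosen that.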

Why Python is Essential for SEO Automation

By integrating Python into your SEO workflows, you can automate time-consuming tasks, gain valuable insights from data, and enhance your SEO strategies. Whether you’re working on large-scale audits or optimizing content, Python offers the tools to help you succeed.

Give these scripts a try, and see how Python can transform your approach to SEO!

Pro Tip: Keep the libraries these scripts depend on up to date, and re-test the scripts after each upgrade, since library APIs can change between releases.

This guide provides a foundational look at how Python can revolutionize your SEO processes. Feel free to experiment and modify these scripts based on your project needs. Happy coding!

By leveraging Python, you’re not just automating tasks; you’re opening up a world of new possibilities in SEO.

SEO4ONE – Simplifying SEO with Python
