Parse and Delete Orphaned Images in Django

How to parse HTML content containing images and then delete any stored images not found in that content, using BeautifulSoup in Django.

Problem

You're editing HTML content in Django using a WYSIWYG editor. After inserting an image, you realize it doesn't quite fit, so you swap it for another. Getting the image into your system was easy enough: an API tied into your editor accepts the image file, stores the object, and returns a file path that the editor embeds as the src attribute of an img tag in the content. But now you need to undo this process for the image that was replaced.

The issue is your backend isn't aware of this change that occurred in the frontend. If no additional logic is implemented, both the updated image as well as the original image will persist in the system. That's an issue because retaining resources that aren't being used creates unnecessary costs, storage bloat, maintenance overhead and so on.

Potential Approach

One approach to address this issue is to make the backend aware of frontend changes in real time. For instance, the editor could notify the backend via an API when an image is removed, triggering the deletion of the corresponding database record and file.

While feasible, this approach introduces significant challenges. Monitoring the WYSIWYG editor for image changes requires complex JavaScript logic, which imposes substantial computational overhead on the browser, particularly for large or complex HTML content. This can degrade the user experience by slowing down the editor.

Moreover, since frontend changes remain in a draft state until the form is submitted, real-time notifications may be unreliable, risking incomplete or erroneous deletions. Why not batch this work on the backend?

Solution

Since the backend isn't aware of the change, unless we explicitly make it aware as discussed above, we need to compare the images that exist in the content against the images stored in our system. In this article I'll provide a Python function that parses the images in HTML content stored in a Django TextField, checks for any orphaned images, and deletes those orphans.

Assumes Django 5.2 and beautifulsoup4 4.13.4.

Code

# utils.html.parse_content_images.py

from urllib.parse import urlparse
from django.db.models import QuerySet

from bs4 import BeautifulSoup


def parse_content_images(html_content, imgs_qset, img_field='image'):
    """
    Parses html content and checks for orphaned images resulting from
        content updates. Found orphans are deleted and a list is returned
        containing the relative paths of the deleted objects.

    Important: only object.delete() is being called here, not
        object.image.delete(), meaning there still needs to be implementation
        (model method tie in, signals, etc.) to make sure the actual file
        is destroyed.
    """

    if not isinstance(html_content, str):
        raise TypeError(f"Expected string instead of {type(html_content)}.")

    if not isinstance(imgs_qset, QuerySet):
        raise TypeError(f"Expected QuerySet instead of {type(imgs_qset)}.")

    deletions = []
    # no point parsing if there's nothing to compare against
    if imgs_qset.count() > 0:
        if not hasattr(imgs_qset[0], img_field):
            raise AttributeError(
                f"Image field '{img_field}' not found in model."
            )

        soup = BeautifulSoup(html_content, 'html.parser')
        soup_imgs = soup.find_all('img')
        # extract and establish a list of all src attrs from imgs found
        img_srcs = [
            urlparse(img.get('src', '')).path
            for img in soup_imgs if img.get('src')
        ]
        # check if stored img exists in content
        # delete, if not
        for img in imgs_qset:
            img_path = getattr(img, img_field).name
            if not any(img_path in src for src in img_srcs):
                # image wasn't present, destroy:
                deletions.append(img_path)
                img.delete()
    return deletions

Explanation

The parse_content_images function is designed to identify and delete orphaned images by comparing those referenced in HTML content against those stored in the database. Below, I break down its arguments and validation logic, followed by its core functionality.

Arguments

The code contains a function parse_content_images that accepts three arguments, of which the first two are required.

The first argument, html_content, is a string containing the HTML data, typically stored in a Django TextField. The function will parse this string and extract img tags using the BeautifulSoup library.

Next we have imgs_qset. This argument is a Django QuerySet of the related objects that store information about the images. Remember, since content can contain many images, we need a related model that has a many-to-one relationship (a ForeignKey) with our content model.

Last, we have an optional argument, img_field, that accepts the name of the image field on the model instances within the QuerySet. This defaults to "image" since in most cases that's an appropriate name for an image field.

Validation

Since there's abstraction to make this function flexible enough for use with different models, I've added a bit of validation to make sure the function is properly implemented. Most importantly, we want to ensure we have a string of HTML to parse and an actual QuerySet, since we'll be calling a QuerySet method later on (QuerySet.count()). If either of these checks fails, a TypeError is raised.

Within the core logic to follow, we also have one more check to make sure the image field name specified by the img_field argument is actually a member of the related model. Because the QuerySet must have at least one object for this validation to be possible, I thought it better situated further down. If this check fails, an AttributeError is raised.

Core Logic

First, the function searches the HTML content and extracts every img tag using the BeautifulSoup HTML parser.

        soup = BeautifulSoup(html_content, 'html.parser')
        soup_imgs = soup.find_all('img')

Next, the src attributes are extracted from the img tags found within the content and processed. Because a src attribute may contain the entire URL pointing to the location of the stored image, we parse the URL and keep only its path component.

        img_srcs = [
            urlparse(img.get('src', '')).path
            for img in soup_imgs if img.get('src')
        ]

The Django ImageField doesn't store the absolute URL of the image location. Instead, the field stores three components as a single string: the directory prefix, the filename, and the file extension. The directory prefix is set by the upload_to attribute of the Django FileField which the ImageField inherits from. This attribute allows you to be able to specify a particular directory for file uploads on a model-by-model basis. This way, you can have a separate directory for content images and still another for cover images.
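To make the path extraction concrete, here is a small standalone sketch of what urlparse(...).path returns for a few src values. The URLs and the content_images directory are made up for illustration; your editor and storage backend may produce different shapes.

```python
from urllib.parse import urlparse

# Hypothetical src values, as an editor might embed them in the content.
srcs = [
    "https://cdn.example.com/media/content_images/cat.png",
    "https://cdn.example.com/media/content_images/dog.png?v=2",
    "/media/content_images/bird.png",
]

# .path keeps only the path component: the scheme, host, and query
# string are dropped, and an already-relative src passes through as-is.
paths = [urlparse(src).path for src in srcs]

for path in paths:
    print(path)
```

Each printed path still carries the media prefix (/media/ here), while an ImageField with upload_to='content_images/' would store something like content_images/cat.png.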

Next, we loop through each image passed to the function with the imgs_qset QuerySet object.

        for img in imgs_qset:
            img_path = getattr(img, img_field).name
            if not any(img_path in src for src in img_srcs):
                # image wasn't present, destroy:
                deletions.append(img_path)
                img.delete()

The path stored by the ImageField is compared against the paths in our img_srcs list. Note that rather than simply checking whether img_path is a member of img_srcs, I'm using any() to test whether img_path appears as a substring of any extracted src path. This way, if there's an issue with prefixing, especially when you start to encounter more complicated setups with external object stores, we still have a good chance of the logic executing as intended.
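The difference between the two checks can be seen in a short sketch. The values below are hypothetical stand-ins for an extracted src path and a stored ImageField name.

```python
# Hypothetical values: one path extracted from the content, and the
# name an ImageField might store for the same file.
img_srcs = ["/media/content_images/cat.png"]
img_path = "content_images/cat.png"

# Plain list membership requires an exact match, so the media prefix
# on the src path makes it miss.
exact_match = img_path in img_srcs

# Substring matching tolerates the prefix.
substring_match = any(img_path in src for src in img_srcs)

print(exact_match, substring_match)
```

With exact matching, this image would be treated as an orphan and deleted even though the content still references it. Substring matching can over-match in the other direction (content_images/cat.png is also a substring of /media/old_content_images/cat.png), but that mistake keeps a file rather than deleting one.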

Last, the function returns a list containing the relative path of each image deleted, should there be any at all. Mostly, I use this return value to construct meaningful notifications to the user following an operation.

Limitation

This logic won't cause the actual file to be deleted. Additional implementation is needed to delete files.

It's important to note that the physical file living on your file system or in object storage (S3, etc.) won't be destroyed by the logic discussed here. Instead it's up to you to implement the logic that deletes the actual files. This is a feature, not a bug, of Django. Due to the sensitive nature of operations relating to destroying files, Django wants you to deliberately implement this logic.

For ideas on how to accomplish this, check out my article Django Signals For Updating And Deleting Images.
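As a rough sketch of one option, a post_delete receiver can remove the file once the row is gone. This is an assumption about how you might wire it up, not the approach from the linked article; the field name image and the ContentImage model in the registration comment are placeholders.

```python
def delete_image_file(sender, instance, **kwargs):
    """post_delete receiver: remove the file behind a deleted row."""
    image = getattr(instance, "image", None)
    # An empty file field is falsy, so rows without a file are skipped.
    if image:
        # save=False because the row is already deleted; there is
        # nothing left to save back to the database.
        image.delete(save=False)


# Registration, assuming a hypothetical ContentImage model:
#
#   from django.db.models.signals import post_delete
#   post_delete.connect(delete_image_file, sender=ContentImage)
```

With a receiver like this connected, the img.delete() call inside parse_content_images would also trigger removal of the underlying file.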

Implementation

With a working function that parses HTML content and removes orphaned images, let's take a look at how to tie that in with your business logic. Out of the gate, you may be wondering where you should tie this function in. You could place it in a form, a view, or even override the model's save() method.

I generally implement this function in my views because I want to be able to display a notification to the user if objects are deleted. In order to display notifications, I need the request object.

from django.contrib import messages
from django.http import HttpResponseRedirect

from utils import parse_content_images


...
def form_valid(self, form):
    self.object = form.save()
    if 'html_content' in form.changed_data:
        deletions = parse_content_images(
            self.object.html_content,
            ContentImage.objects.filter(article=self.object.id)
        )
        if deletions:
            messages.info(
                self.request,
                f"Deleted {len(deletions)} orphaned images."
            )
    return HttpResponseRedirect(self.get_success_url())

First, I save the form. I need self.object.id for the filter, and a newly created object has no ID until it's saved. By calling save() first, I know my object is up to date and an ID is available.

I only want to run this function if the field containing the HTML data has changed (html_content in this case) to avoid potentially expensive and unnecessary computation. To handle this condition, I simply check whether my field is in the form's changed_data list.

If deletions occurred, I'll have a deletions object that will be truthy. I use the length of this object to construct my user notification utilizing the Django messages framework.

Last, I write the return line manually rather than calling super().form_valid(form), which in Django's model form views saves the form again, to avoid another database call.

Other Considerations

In-Line vs. Deferred Execution

The parse_content_images function, as implemented, operates in-line: when a form is saved, the HTML content stored in a Django TextField is parsed, and orphaned images are deleted immediately. This approach is straightforward but may introduce performance overhead in scenarios with frequent updates, as parsing HTML with BeautifulSoup4 and iterating over a large QuerySet can be computationally expensive.

Dirty Flag Approach

To optimize performance in high-frequency update scenarios, a "dirty" flag approach can defer the execution of parse_content_images. A dirty flag is a boolean or timestamp field in the database that marks an instance as needing cleanup due to an update in its HTML content field. Instead of running the function on every save, the system sets this flag and schedules the cleanup for a later time, reducing immediate overhead.

For example, a Content model could include a boolean field is_dirty or a timestamp field last_updated. When the HTML content is modified (e.g., via a WYSIWYG editor), a signal or form save logic sets is_dirty = True or updates last_updated. A background task, such as a Celery task or a cron job, periodically checks for flagged instances and calls parse_content_images with the relevant html_content and imgs_qset.
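The flow of such a background job can be sketched without any framework. Everything here is a stand-in: the Content dataclass plays the role of a model with an is_dirty field, and cleanup is a callable that would wrap parse_content_images in a real project.

```python
from dataclasses import dataclass


@dataclass
class Content:
    """Stand-in for a Django model with a hypothetical is_dirty flag."""
    html_content: str
    is_dirty: bool = False


def clean_dirty_content(contents, cleanup):
    """Run cleanup on flagged instances only, then clear each flag.

    Returns the number of instances that were processed.
    """
    processed = 0
    for content in contents:
        if not content.is_dirty:
            continue
        cleanup(content)
        content.is_dirty = False
        processed += 1
    return processed


# Usage: only the flagged instance is handed to the cleanup callable.
items = [
    Content("<p>unchanged</p>"),
    Content('<img src="/media/content_images/cat.png">', is_dirty=True),
]
seen = []
count = clean_dirty_content(items, seen.append)
print(count)
```

In a Django project the loop would more likely iterate over a filtered QuerySet of flagged rows and save the cleared flag back to the database, with the function itself invoked from a Celery task or a management command run by cron.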

Final Thoughts

This article looks at how to handle a common scenario where image files become orphaned due to frontend and backend services not working perfectly in tandem. We have an interest in avoiding these situations because unneeded files place a burden on our systems, add to maintenance overhead, and cost money to store. Here we've looked at a function that eliminates orphaned image records while balancing performance and user experience. Last, we explored how to further optimize the approach with dirty flags and task scheduling.
