Problem
You're editing HTML content in Django using a WYSIWYG editor. After inserting an image, you realize it doesn't quite fit, so you swap it for another. Getting the image into your system was easy enough: an API tied into your editor accepts the image file, stores the object, and returns a file path that the editor embeds as the src attribute of an img tag in the content. But now you need to undo this process for the image that was replaced.
The issue is that your backend isn't aware of this change, which occurred in the frontend. If no additional logic is implemented, both the new image and the original image will persist in the system. That's a problem because retaining unused resources creates unnecessary costs, storage bloat, maintenance overhead, and so on.
Potential Approach
One approach to address this issue is to make the backend aware of frontend changes in real time. For instance, the editor could notify the backend via an API when an image is removed, triggering the deletion of the corresponding database record and file.
While feasible, this approach introduces significant challenges. Monitoring the WYSIWYG editor for image changes requires complex JavaScript logic, which imposes substantial computational overhead on the browser, particularly for large or complex HTML content. This can degrade the user experience by slowing down the editor.
Moreover, since frontend changes remain in a draft state until the form is submitted, real-time notifications may be unreliable, risking incomplete or erroneous deletions. Why not batch this work on the backend?
Solution
Since the backend isn't aware of these changes unless we explicitly make it aware, as discussed above, we need to compare the images that exist in the content against the images stored in our system. In this article I'll provide a Python function that parses the images in HTML content stored in a Django TextField, checks for any orphaned images, and deletes those orphans.
Assumes Django==5.2 and beautifulsoup4==4.13.4.
Code
# utils.html.parse_content_images.py
from urllib.parse import urlparse

from bs4 import BeautifulSoup
from django.db.models import QuerySet


def parse_content_images(html_content, imgs_qset, img_field='image'):
    """
    Parses html content and checks for orphaned images resulting from
    content updates. Found orphans are deleted and a list is returned
    containing the relative paths of the deleted objects.

    Important: only object.delete() is being called here, not
    object.image.delete(), meaning there still needs to be implementation
    (model method tie in, signals, etc.) to make sure the actual file
    is destroyed.
    """
    if not isinstance(html_content, str):
        raise TypeError(f"Expected string instead of {type(html_content)}.")
    if not isinstance(imgs_qset, QuerySet):
        raise TypeError(f"Expected QuerySet instead of {type(imgs_qset)}.")

    deletions = []

    # no point if there's nothing to compare against
    if imgs_qset.count() > 0:
        if not hasattr(imgs_qset[0], img_field):
            raise AttributeError("Image field not found in model.")

        soup = BeautifulSoup(html_content, 'html.parser')
        soup_imgs = soup.find_all('img')

        # extract and establish a list of all src attrs from imgs found
        img_srcs = [
            urlparse(img.get('src', '')).path
            for img in soup_imgs if img.get('src')
        ]

        # check if stored img exists in content
        # delete, if not
        for img in imgs_qset:
            img_path = getattr(img, img_field).name
            if not any(img_path in src for src in img_srcs):
                # image wasn't present, destroy:
                deletions.append(img_path)
                img.delete()

    return deletions
Explanation
The parse_content_images function is designed to identify and delete orphaned images by comparing those referenced in HTML content against those stored in the database. Below, I break down its arguments and validation logic, followed by its core functionality.
Arguments
The parse_content_images function accepts three arguments, of which the first two are required.
The first argument, html_content, is a string containing the HTML data, typically stored in a Django TextField. The function will parse this string and extract img tags using the BeautifulSoup library.
Next we have imgs_qset. This argument is a Django QuerySet of the related objects that store information about the images. Remember, since content can contain many images, we need a related model with a many-to-one relationship (a ForeignKey) to our content model.
Last, we have an optional argument, img_field, that accepts the name of the image field on the model instances within the QuerySet. This defaults to "image" since in most cases that's an appropriate name for an image field.
Validation
Since the function is abstracted to be flexible enough for use with different models, I've added a bit of validation to make sure it is called correctly. Most importantly, we want to ensure we have a string of HTML to parse and an actual QuerySet, since we'll be calling a QuerySet method later on (QuerySet.count()). If either of these checks fails, a TypeError is raised.
Within the core logic that follows, there is one more check to make sure the field name specified by the img_field argument is actually an attribute of the related model. Because the QuerySet must contain at least one object for this validation to be possible, I thought it better situated further down. If this check fails, an AttributeError is raised.
Core Logic
First, the function searches the HTML content and extracts every img tag using the BeautifulSoup HTML parser.
soup = BeautifulSoup(html_content, 'html.parser')
soup_imgs = soup.find_all('img')
Next, the src
attributes are extracted from the img
tags found within the content and are processed. Because the src
attributes contain the entire URL pointing to the location of the stored image, we want to parse the URL and extract the relative path only.
img_srcs = [
    urlparse(img.get('src', '')).path
    for img in soup_imgs if img.get('src')
]
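To make the effect of urlparse concrete, here is a small standalone sketch that uses only the standard library. The URLs are made-up examples. The unquote step is an optional extra that the function above does not perform; it may help if stored file names contain characters that URLs percent-encode.

```python
from urllib.parse import unquote, urlparse

# Hypothetical src values as an editor might emit them.
srcs = [
    "https://cdn.example.com/media/content/diagram.png?v=2",
    "/media/content/photo.jpg",
    "/media/content/caf%C3%A9.jpg",
]

# .path drops the scheme, host, query string, and fragment.
paths = [urlparse(src).path for src in srcs]

# Optional: decode percent-escapes so "caf%C3%A9.jpg" can be compared
# against a stored name that contains the literal character.
decoded_paths = [unquote(path) for path in paths]
```

Note that urlparse leaves percent-escapes untouched, so without the decoding step an encoded src would not match a stored name containing the unencoded character.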
The Django ImageField doesn't store the absolute URL of the image location. Instead, the field stores three components as a single string: the directory prefix, the filename, and the file extension. The directory prefix is set by the upload_to attribute of the Django FileField, which ImageField inherits from. This attribute lets you specify a particular directory for file uploads on a model-by-model basis. This way, you can have one directory for content images and another for cover images.
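As a rough illustration of those three components, the sketch below takes a stored name apart with the standard library. The value of stored_name and the "/media/" prefix are assumptions standing in for an upload_to of "content_images/" and a typical MEDIA_URL; your project may differ.

```python
import posixpath

# Hypothetical value of instance.image.name for upload_to="content_images/".
stored_name = "content_images/diagram.png"

# Split the single stored string into its three parts.
directory, filename = posixpath.split(stored_name)
stem, extension = posixpath.splitext(filename)

# With local file storage, the path in an <img> src is usually the media
# prefix followed by the stored name (assuming MEDIA_URL = "/media/").
src_path = "/media/" + stored_name
```

This is why the stored name can be expected to appear inside the path extracted from the src attribute, even though the two strings are not equal.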
Next, we loop through each image in the imgs_qset QuerySet passed to the function.
for img in imgs_qset:
    img_path = getattr(img, img_field).name
    if not any(img_path in src for src in img_srcs):
        # image wasn't present, destroy:
        deletions.append(img_path)
        img.delete()
The path stored by the ImageField is compared against the entries in our img_srcs list. Note that rather than simply testing membership (img_path in img_srcs), I'm using any() to check whether the stored path is a substring of any entry in img_srcs. This way, if there's an issue with prefixing, especially when you start to encounter more complicated setups with external object stores, we still have a good chance of the logic executing as intended.
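The trade-off of the substring check can be seen in a small standalone sketch. The function names and sample paths below are hypothetical, and the stricter suffix variant is only one possible alternative, not part of the function above. It assumes the src path always ends with a slash followed by the stored name, which may not hold for every storage backend.

```python
def is_referenced_loose(stored_name, src_paths):
    """Substring check, mirroring the any() test in parse_content_images."""
    return any(stored_name in path for path in src_paths)


def is_referenced_strict(stored_name, src_paths):
    """Stricter variant: the path must end with '/' + the stored name."""
    suffix = "/" + stored_name.lstrip("/")
    return any(path.endswith(suffix) for path in src_paths)


# Made-up paths as they might come out of the src extraction step.
src_paths = ["/media/old-content/a.png", "/bucket/media/content/b.png"]

# "content/a.png" happens to be a substring of "/media/old-content/a.png",
# so the loose check treats it as still in use while the strict one does not.
loose_a = is_referenced_loose("content/a.png", src_paths)
strict_a = is_referenced_strict("content/a.png", src_paths)

# A correctly prefixed path passes both checks.
loose_b = is_referenced_loose("content/b.png", src_paths)
strict_b = is_referenced_strict("content/b.png", src_paths)
```

The loose check errs on the side of keeping an image, which is the safer failure mode for a function that deletes data; the strict check catches more orphans but depends on the prefix assumption stated above.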
Last, the function returns a list containing the relative path of each image deleted, should there be any at all. Mostly, I use this return value to construct meaningful notifications to the user following an operation.
Limitation
This logic won't cause the actual file to be deleted. Additional implementation is needed to delete files.
It's important to note that the physical file living on your file system or in object storage (S3, etc.) won't be destroyed by the logic discussed here. Instead it's up to you to implement the logic that deletes the actual files. This is a feature, not a bug, of Django. Due to the sensitive nature of operations relating to destroying files, Django wants you to deliberately implement this logic.
For ideas on how to accomplish this, check out my article Django Signals For Updating And Deleting Images.
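One common pattern, sketched here under assumptions, is a post_delete signal handler that removes the file once the database row is gone. The handler name delete_image_file and the field name "image" are hypothetical. The handler is written as a plain function so it can be read on its own, and the commented-out lines show how it might be connected in a project that has a ContentImage model.

```python
def delete_image_file(sender, instance, **kwargs):
    """Delete the stored file belonging to a just-deleted model instance."""
    field_file = getattr(instance, "image", None)
    # An empty file field is falsy, so rows without a file are skipped.
    if field_file:
        # save=False: the row is already gone, so don't try to re-save it.
        field_file.delete(save=False)


# Assumed wiring, e.g. in the app's signals module:
# from django.db.models.signals import post_delete
# post_delete.connect(delete_image_file, sender=ContentImage)
```

With something like this in place, the img.delete() call in parse_content_images would also trigger removal of the underlying file.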
Implementation
With a working function that parses HTML content and removes orphaned images, let's take a look at how to tie that in with your business logic. Out of the gate, you may be wondering where you should tie this function in. You could place it in a form, a view, or even override the model's save() method.
I generally implement this function in my views because I want to be able to display a notification to the user if objects are deleted. In order to display notifications, I need the request object.
from django.contrib import messages
from django.http import HttpResponseRedirect
from utils import parse_content_images
...
def form_valid(self, form):
    self.object = form.save()
    if 'html_content' in form.changed_data:
        deletions = parse_content_images(
            self.object.html_content,
            ContentImage.objects.filter(article=self.object.id)
        )
        if deletions:
            messages.info(
                self.request,
                f"Deleted {len(deletions)} orphaned images."
            )
    return HttpResponseRedirect(self.get_success_url())
First, I save the form. I need self.object.id for the image lookup, and a newly created object has no ID until it is saved. By calling save() first, I know my object is up to date and an ID is available.
I only want to run this function if the field containing the HTML data (html_content in this case) has changed, to avoid potentially expensive and unnecessary computation. To handle this condition, I simply check whether my field is in the form's changed_data list.
If deletions occurred, the deletions list will be non-empty and therefore truthy. I use its length to construct my user notification with the Django messages framework.
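Last, I write the return line manually rather than calling super().form_valid(form) to avoid another database call, since the parent implementation in Django's generic model form views saves the form again.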
Other Considerations
In-Line vs. Deferred Execution
The parse_content_images function, as implemented, operates in-line: when a form is saved, the HTML content stored in a Django TextField is parsed, and orphaned images are deleted immediately. This approach is straightforward but may introduce performance overhead in scenarios with frequent updates, as parsing HTML with BeautifulSoup4 and iterating over a large QuerySet can be computationally expensive.
Dirty Flag Approach
To optimize performance in high-frequency update scenarios, a "dirty" flag approach can defer the execution of parse_content_images. A dirty flag is a boolean or timestamp field in the database that marks an instance as needing cleanup due to an update in its HTML content field. Instead of running the function on every save, the system sets this flag and schedules the cleanup for a later time, reducing immediate overhead.
For example, a Content model could include a boolean field is_dirty or a timestamp field last_updated. When the HTML content is modified (e.g., via a WYSIWYG editor), a signal or form save logic sets is_dirty = True or updates last_updated. A background task, such as a Celery task or a cron job, periodically checks for flagged instances and calls parse_content_images with the relevant html_content and imgs_qset.
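The periodic job described above might look like the following sketch. It is deliberately framework-agnostic: process_dirty_content, dirty_items, get_images, and cleanup are hypothetical names standing in for a query of flagged Content instances, a lookup of the related image QuerySet, and parse_content_images itself. Only save(update_fields=...) is an actual Django model call.

```python
def process_dirty_content(dirty_items, get_images, cleanup):
    """Run the orphan cleanup for each flagged item, then clear its flag.

    Returns a dict mapping each item's primary key to the paths deleted.
    """
    results = {}
    for item in dirty_items:
        results[item.pk] = cleanup(item.html_content, get_images(item))
        # Clear the flag only after the cleanup ran without raising.
        item.is_dirty = False
        item.save(update_fields=["is_dirty"])
    return results
```

In a Celery task or management command, this could be called with something like Content.objects.filter(is_dirty=True) as dirty_items, a function returning ContentImage.objects.filter(article=item) as get_images, and parse_content_images as cleanup, assuming those model and field names match your project.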
Final Thoughts
This article looks at how to handle a common scenario where image files become orphaned because frontend and backend services don't work perfectly in tandem. We have an interest in avoiding these situations because unneeded files place a burden on our systems, add to maintenance overhead, and cost money to store. Here we've looked at a function that eliminates orphaned image records while balancing performance and user experience. Last, we explored how to further optimize the approach with dirty flags and task scheduling.