Background
You've built a content app with Django and you want to increase engagement by offering a "read next" suggestion at the bottom of each post. How do we decide which article to promote? Relevance is the goal, but that's easier said than done considering there are several factors that go into determining relevance.
We want to recommend content that's substantially similar. If a reader is interested in this then they'll be interested in that. But that may not be enough. And what if there isn't anything similar to choose from?
Recency also plays a role. We want to bias engagement towards the freshest content. But this will likely introduce tradeoffs. Let's take a look at how we can find a balance and implement a simple, working solution.
For purposes of this discussion, we'll use the following abbreviated Django model that's indexed across fields for category and tags:
# core.models
from django.db import models


class Article(models.Model):
    title = models.CharField(max_length=100)
    category = models.ForeignKey('article.Category', on_delete=models.PROTECT)
    tags = models.ManyToManyField('article.Tag')
    content = models.TextField(blank=True, null=True)
    published_at = models.DateTimeField(blank=True, null=True)
In practice, my category/topic field will also be a Many-to-Many relationship. A post might span topics or it might not fit neatly in any one topic. But for our example, I want to keep things simple. Also, I'll probably use a third-party package to manage tags like Jazzband's open source "django-taggit." No reason to reinvent the wheel.
Implementation
Before we can construct a recommendation engine, we need to establish some order of precedence for our constraints. We'll apply our constraints in a particular order and incrementally reduce the possible selection at each step. For my purposes, I'm electing for category > tag > publish date.
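Before wiring this into Django, the precedence itself can be pictured as a composite sort key: rank candidates first by category match, then by tag overlap, then by recency. A minimal sketch with plain tuples — the data and field layout here are invented for illustration, not part of the model above:

```python
# Sketch only: ranking candidates by category > tag > publish date.
# Each article is a tuple: (title, category, tags, published_day).
base = ("origin", "python", {"django", "orm"}, 10)

candidates = [
    ("a", "python", {"django", "orm"}, 5),   # same category, 2 shared tags
    ("b", "python", {"testing"}, 9),         # same category, 0 shared tags
    ("c", "devops", {"django", "orm"}, 12),  # different category
]

def rank_key(candidate):
    _, category, tags, published_day = candidate
    return (
        category == base[1],   # category match dominates
        len(tags & base[2]),   # then tag overlap
        published_day,         # then recency as the tiebreaker
    )

best = max(candidates, key=rank_key)
print(best[0])  # "a" wins on category + tag overlap despite being oldest
```

Python compares tuples element by element, which is exactly the "incrementally reduce the selection" behavior we want: a later element only matters when the earlier ones tie.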
The logic will be implemented in a custom Django manager instead of the model directly. Assume our base manager looks like the following.
# core.models
from django.db import models


class ArticleManager(models.Manager):
    def published(self):
        return self.filter(published_at__isnull=False).order_by('-published_at')
The manager then ties into our "Article" model through the "objects" attribute.
# core.models
class Article(models.Model):
    ...
    objects = ArticleManager()
Now let's frame out a "next_suggestion" method so that it exits early if there are no published articles.
# core.models (ArticleManager)
def next_suggestion(self):
    qset = self.published()
    # this is a query! an efficient one, but still a db hit
    if not qset.exists():
        # no published articles
        return None
It's important to keep in mind where and how we're creating queries. The more queries we have, the slower the page will load, and eventually we'll overload our database. Ideally, we only want one query, although that's not going to be possible with this basic implementation.
With that out of the way, let's dig into the logic.
Category
My highest-order constraint is "category." Since categories/topics are potentially non-overlapping in my project, I don't anticipate recommendations outside of the origin article's category to provide much value to the user. On the off chance that another post doesn't exist in the same category, the most recent post can be used as a fallback.
# core.models (ArticleManager)
from django.core.exceptions import ImproperlyConfigured


def next_suggestion(self, article):
    if not isinstance(article, Article):
        raise ImproperlyConfigured(
            "Expected 'Article' instance but instead "
            f"received type {type(article)}."
        )
    # not a query
    qset = self.published()
    # not a query
    qset = qset.filter(category=article.category).exclude(id=article.id)
    # switch to python to keep queries down
    # query!
    articles = list(qset)
    if articles:
        # return the most recent
        return articles[0]
    # no other articles with the same category
    # second query :(
    return self.published().first()
The "next_suggestion" method has been updated to filter by category. The explicit check for published articles is no longer needed; the no-results case is now handled implicitly by the last line, "return self.published().first()." We filter by category, exclude the current article, return the first instance (the most recent, since we order by publish date descending), and fall back to the most recent published article overall if the filtered query returns nothing.
This executes at most two queries, and hopefully just one in most circumstances, where at least one other article exists within the category. I convert the QuerySet to a python list object, which executes the query and allows us to work with the result in python without creating any more trips to the database. Instead of using a query to grab the most recent article, I just index into the list since I know it's already populated with one or more objects.
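That "evaluate once, then stay in python" behavior is easy to demonstrate outside the ORM. In the sketch below, a small class stands in for a lazy queryset and counts how many times the "database" is hit; the class and names are invented for illustration:

```python
# A tiny stand-in for a lazy queryset: iterating it counts as a "db hit".
class LazyQuerySet:
    def __init__(self, rows):
        self.rows = rows
        self.hits = 0  # how many times the "database" was queried

    def __iter__(self):
        self.hits += 1
        return iter(self.rows)


qset = LazyQuerySet(["newest", "older", "oldest"])

# list(qset) forces exactly one evaluation -- one hit
articles = list(qset)

# everything after this is plain python on the in-memory list: zero extra hits
most_recent = articles[0]
tail = articles[1:]

print(qset.hits)  # 1
```

A real Django QuerySet behaves similarly: evaluation is deferred until you iterate, and materializing it into a list means subsequent indexing and looping never touch the database again.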
Tags
Now we'll filter by tags to further refine our selection. This is where the fun begins. I'll continue with python to keep queries to a minimum. I'll loop through each article within the category and rank them depending on how many matching tags are found.
# core.models (ArticleManager)
from collections import defaultdict

from django.core.exceptions import ImproperlyConfigured


def next_suggestion(self, article):
    if not isinstance(article, Article):
        raise ImproperlyConfigured(
            "Expected 'Article' instance but instead "
            f"received type {type(article)}."
        )
    # not a query
    qset = self.published()
    # queues two queries: one for the entire qset and one for the prefetch
    qset = (qset
            .filter(category=article.category)
            .exclude(id=article.id)
            .prefetch_related('tags'))
    # switch to python to keep queries down
    # query! the two queued queries are now evaluated (queries #1 & #2)
    articles = list(qset)
    if not articles:
        # no other articles with the same category
        # potentially query #3 :(
        return self.published().first()
    # use a set so 'intersection' can be used later on (query #3)
    tag_ids = set(article.tags.all().values_list('id', flat=True))
    groups = defaultdict(list)
    for candidate in articles:
        # be sure to prefetch tags or each lap will hit the db
        # count how many tags from the candidate are in the base article
        # a set comprehension must be used here in place of values_list
        # otherwise a fresh query would be made for each article
        tgt_ids = {tag.id for tag in candidate.tags.all()}
        matches = len(tag_ids.intersection(tgt_ids))
        # add article to group based on matching tag count
        groups[matches].append(candidate)
    # get group with most matching tags
    best_match = groups[max(groups.keys())]
    return best_match[0]
First, I queue my base queries that will filter on category, excluding self, and prefetch tags for python-based work down the road. This will execute a total of two queries when "list()" is called on the "qset" object. If nothing is returned, I'll exit early with the most recent article as a fallback just like before.
Next, I fetch the tags associated with the base article and store the IDs in a set. Note, I want to use a python set here instead of a list so that I can use the "intersection" method that's available to sets later on. This will make the process of matching tags much easier.
From there I'll loop through the articles, collect tag IDs from the prefetched tag objects, and count the number of tags that intersect between the base article and the article from the "qset" object we're looping through. The "count" is used as a key with a python "defaultdict" object to store each article. This essentially "groups" articles by how many matching tags exist.
In the last step, I select the group with the most matching tags simply by taking the one with the largest key (count). Because the "defaultdict" was created with a "default_factory" of "list," each group can be worked with like any other list. Throughout each step, the original ordering on "published_at" is preserved, so I know the first article in the group is the most recently published of the available choices.
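Stripped of the ORM, the grouping step is just set intersection plus a defaultdict. Here's a self-contained sketch with plain data — the article titles and tag IDs are invented for illustration:

```python
from collections import defaultdict

# tag IDs of the origin article
tag_ids = {1, 2, 3}

# candidate articles, most recent first, with their prefetched tag IDs
candidates = [
    ("fresh-but-unrelated", {7, 8}),
    ("older-close-match", {1, 2, 9}),
    ("oldest-close-match", {2, 3, 4}),
]

groups = defaultdict(list)
for title, tgt_ids in candidates:
    matches = len(tag_ids & tgt_ids)  # same as tag_ids.intersection(tgt_ids)
    groups[matches].append(title)

# the group with the most shared tags; order within a group is preserved,
# so the first entry is still the most recently published
best_match = groups[max(groups)]
print(best_match)  # ['older-close-match', 'oldest-close-match']
```

Note that the fresh-but-unrelated article loses out despite being the most recent: tag overlap outranks recency, and recency only breaks ties within a group.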
Random
As an alternate implementation, we could choose an article that meets our category and tag criteria at random. To do so, we'll use the standard python library "random" instead of just taking the most recent article. To make this change, we just need to import the library and change the last line of our "next_suggestion" method.
# core.models (ArticleManager)
import random


def next_suggestion(self, article):
    ...
    # the most recent article
    # return best_match[0]
    # select an article at random from the "best_match" group
    return random.choice(best_match)
Personally, I like this approach better than just taking the most recent article. This will allow for a more dynamic experience and make sure different pieces of content are seeing the light of day.
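To see that rotation effect in isolation, here's a quick sketch with an invented "best_match" group. A seeded Random instance is used only so the example is reproducible; the real method would use the module-level random.choice as above:

```python
import random

best_match = ["part-2", "deep-dive", "case-study"]

# seeded only for reproducibility in this sketch
rng = random.Random(42)

# over many page loads, every article in the group gets surfaced
seen = {rng.choice(best_match) for _ in range(1000)}
print(sorted(seen))  # all three articles appear
```

With best_match[0] alone, only one article in the group would ever be shown; random.choice spreads the exposure across all equally-ranked candidates.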
Check Queries
With the suggestion engine built, let's take a look to make sure we're querying the database the number of times we expect. It's good practice with more complex database operations, especially ones that contain loops, to make sure everything is working as expected and there aren't any leaks causing unnecessary database trips.
From my comment annotations, you can see that I expect exactly three queries to the database no matter which logic branch is taken. Now let's verify. First we'll clear any upstream queries from the database connection (if any), and then we'll print out a count of how many occurred once we've run the logic.
# core.models (ArticleManager)
from django.db import connection


def next_suggestion(self, article):
    # note: queries are only logged when DEBUG=True
    connection.queries_log.clear()
    ...
    print(len(connection.queries_log))
    # the most recent article
    # return best_match[0]
    # select an article at random from the "best_match" group
    return random.choice(best_match)

# output
3
Great! Three queries total as expected. Not terrible but ideally we would prefer this to be a single query as mentioned before. That said, we'll save more advanced database operations for a later day.
In practice, this approach is still completely fine. There's no reason to complicate your code until there's a reason to do so. Your traffic volume, resource consumption and page load times will tell you when additional complexity is needed.
Caching
As traffic grows on your site, you may want to consider caching the result. The queries that generate the output will be the same for any user that accesses a particular article. You could save the ID of the article suggestion in your cache so that you're only performing a single query to retrieve the article based on its primary key instead of the multiple queries required to produce the suggestion as detailed above.
Explore the Django cache framework to see if this would be appropriate for your project.
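As a sketch of that idea, here's the cache-aside pattern with a plain dict standing in for Django's cache backend. In a real project you'd swap the dict for django.core.cache's cache.get/cache.set and a queryset lookup by primary key; the function names and data here are illustrative assumptions:

```python
# Stand-ins; a real implementation would use Django's cache and ORM.
cache = {}
articles_by_id = {1: "origin", 2: "suggested"}
queries = {"count": 0}

def compute_suggestion_id(article_id):
    # placeholder for the multi-query suggestion logic above
    queries["count"] += 3
    return 2

def get_by_pk(article_id):
    # placeholder for a single primary-key lookup
    queries["count"] += 1
    return articles_by_id[article_id]

def next_suggestion_cached(article_id):
    key = f"next_suggestion:{article_id}"
    suggestion_id = cache.get(key)
    if suggestion_id is None:
        # cache miss: pay the full cost once, then remember the answer
        suggestion_id = compute_suggestion_id(article_id)
        cache[key] = suggestion_id
    # cache hit or miss, fetching the article itself is one pk query
    return get_by_pk(suggestion_id)

first = next_suggestion_cached(1)   # cold: 3 + 1 queries
second = next_suggestion_cached(1)  # warm: 1 query
print(queries["count"])  # 5
```

Remember to invalidate or expire the cached ID when articles are published or unpublished, or the suggestion can go stale.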
Other Considerations
There may be situations where an evaluated suggestion isn't desirable and a fixed solution would be better. For example, you have a series of related articles, and you want "Article Part 2" to be the suggestion following "Article Part 1." To accomplish this, you could add a "next_suggestion" field; when present, the suggestion method would always return that article, and otherwise it would fall back to the programmatic result.
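A minimal sketch of that precedence, with a plain class standing in for the model (the field and function names are assumptions, not part of the code above; in Django the field would likely be a nullable self-referencing ForeignKey):

```python
class Article:
    def __init__(self, title, next_suggestion=None):
        self.title = title
        # stand-in for a nullable ForeignKey to another Article
        self.next_suggestion = next_suggestion

def suggest(article, compute):
    # a fixed suggestion, if present, always wins
    if article.next_suggestion is not None:
        return article.next_suggestion
    # otherwise fall back to the programmatic engine
    return compute(article)

def engine(article):
    # placeholder for the category/tag/recency logic
    return Article("computed")

part_2 = Article("Article Part 2")
part_1 = Article("Article Part 1", next_suggestion=part_2)
standalone = Article("One-off Post")

print(suggest(part_1, engine).title)      # "Article Part 2"
print(suggest(standalone, engine).title)  # "computed"
```

Passing the fallback in as a callable keeps the editorial override and the computed engine cleanly separated.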
Final Thoughts
Providing suggestions may increase engagement by keeping users on your site reading more articles. But the suggestions need to be relevant to the particular user. Since we don't know anything about the reader without taking additional steps, the context of the article they're coming from can be used to offer suggestions.
There are technical considerations to be careful of. Evaluate how to most efficiently filter objects based on your constraints. Consider how you'd like to order and prioritize results. Most importantly, be sure not to generate unnecessary queries.