Background
You've built a content app with Django and you want to increase engagement by offering a "read next" suggestion at the bottom of each post. How do we decide which article to promote? Relevance is the goal, but that's easier said than done considering there are several factors that go into determining relevance.
We want to recommend content that's substantially similar. If a reader is interested in this then they'll be interested in that. But that may not be enough. And what if there isn't anything similar to choose from?
Recency also plays a role. We want to bias engagement towards the freshest content. But this will likely introduce tradeoffs. Let's take a look at how we can find a balance and implement a simple, working solution.
For purposes of this discussion, we'll use the following abbreviated Django model that's indexed across fields for category and tags:
# core.models
from django.db import models


class Article(models.Model):
    title = models.CharField(max_length=100)
    category = models.ForeignKey('article.Category', on_delete=models.PROTECT)
    tags = models.ManyToManyField('article.Tag')
    content = models.TextField(blank=True, null=True)
    published_at = models.DateTimeField(blank=True, null=True)
In practice, my category/topic field will also be a Many-to-Many relationship. A post might span topics or it might not fit neatly in any one topic. But for our example, I want to keep things simple. Also, I'll probably use a third-party package to manage tags like Jazzband's open source "django-taggit." No reason to reinvent the wheel.
Implementation
Before we can construct a recommendation engine, we need to establish some order of precedence for our constraints. We'll apply our constraints in a particular order and incrementally reduce the possible selection at each step. For my purposes, I'm electing for category > tag > publish date.
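Before wiring this into Django, the precedence itself can be pictured as a composite sort key: rank candidates first by category match, then by tag overlap, then by recency. A minimal sketch with plain tuples — the data and field layout here are invented for illustration, not part of the model above:

```python
# Sketch only: ranking candidates by category > tag > publish date.
# Each article is a tuple: (title, category, tags, published_day).
base = ("origin", "python", {"django", "orm"}, 10)

candidates = [
    ("a", "python", {"django", "orm"}, 5),   # same category, 2 shared tags
    ("b", "python", {"testing"}, 9),         # same category, 0 shared tags
    ("c", "devops", {"django", "orm"}, 12),  # different category
]

def rank_key(candidate):
    _, category, tags, published_day = candidate
    return (
        category == base[1],   # category match dominates
        len(tags & base[2]),   # then tag overlap
        published_day,         # then recency as the tiebreaker
    )

best = max(candidates, key=rank_key)
print(best[0])  # "a" wins on category + tag overlap despite being oldest
```

Python compares tuples element by element, which is exactly the "incrementally reduce the selection" behavior we want: a later element only matters when the earlier ones tie.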
The logic will be implemented in a custom Django manager instead of the model directly. Assume our base manager looks like the following.
# core.models
from django.db import models


class ArticleManager(models.Manager):
    def published(self):
        return self.filter(published_at__isnull=False).order_by('-published_at')
The manager then ties into our "Article" model through the "objects" attribute.
# core.models
class Article(models.Model):
    ...
    objects = ArticleManager()
Now let's frame out a "next_suggestion" method so that it exits early if there are no published articles.
# core.models (ArticleManager)
def next_suggestion(self):
    qset = self.published()
    # this is a query! an efficient one, but still a db hit
    if not qset.exists():
        # no published articles
        return None
It's important to keep in mind where and how we're creating queries. The more queries we have, the slower the page will load, and eventually we'll overload our database. Ideally, we only want one query, although that's not going to be possible with this basic implementation.
With that out of the way, let's dig into the logic.
Category
My highest-order constraint is "category." Since categories/topics are potentially non-overlapping in my project, I don't anticipate recommendations outside of the origin article's category to provide much value to the user. On the off chance that another post doesn't exist in the same category, the most recent post can be used as a fallback.
# core.models (ArticleManager)
from django.core.exceptions import ImproperlyConfigured


def next_suggestion(self, article):
    if not isinstance(article, Article):
        raise ImproperlyConfigured(
            "Expected 'Article' instance but instead "
            f"received type {type(article)}."
        )
    # not a query
    qset = self.published()
    # not a query
    qset = qset.filter(category=article.category).exclude(id=article.id)
    # switch to python to keep queries down
    # query!
    articles = list(qset)
    if articles:
        # return the most recent
        return articles[0]
    # no other articles with the same category
    # second query :(
    return self.published().first()
The "next_suggestion" method has been updated to filter by category. The explicit check for published articles is no longer needed; the no-results case is now handled implicitly by the last line, "return self.published().first()." We filter by category, exclude the current article, return the first instance (the most recent, since we order by publish date descending), and fall back to the most recent published article overall if the filtered query returns nothing.
This executes at most two queries, and hopefully just one in most circumstances, where at least one other article exists within the category. I convert the QuerySet to a python list object, which executes the query and allows us to work with the result in python without creating any more trips to the database. Instead of using a query to grab the most recent article, I just index into the list since I know it's already populated with one or more objects.
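That "evaluate once, then stay in python" behavior is easy to demonstrate outside the ORM. In the sketch below, a small class stands in for a lazy queryset and counts how many times the "database" is hit; the class and names are invented for illustration:

```python
# A tiny stand-in for a lazy queryset: iterating it counts as a "db hit".
class LazyQuerySet:
    def __init__(self, rows):
        self.rows = rows
        self.hits = 0  # how many times the "database" was queried

    def __iter__(self):
        self.hits += 1
        return iter(self.rows)


qset = LazyQuerySet(["newest", "older", "oldest"])

# list(qset) forces exactly one evaluation -- one hit
articles = list(qset)

# everything after this is plain python on the in-memory list: zero extra hits
most_recent = articles[0]
tail = articles[1:]

print(qset.hits)  # 1
```

A real Django QuerySet behaves similarly: evaluation is deferred until you iterate, and materializing it into a list means subsequent indexing and looping never touch the database again.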
Tags
Now we'll filter by tags to further refine our selection. This is where the fun begins. I'll continue with python to keep queries to a minimum. I'll loop through each article within the category and rank them depending on how many matching tags are found.
# core.models (ArticleManager)
from collections import defaultdict

from django.core.exceptions import ImproperlyConfigured


def next_suggestion(self, article):
    if not isinstance(article, Article):
        raise ImproperlyConfigured(
            "Expected 'Article' instance but instead "
            f"received type {type(article)}."
        )
    # not a query
    qset = self.published()
    # queues two queries: one for the entire qset and one for the prefetch
    qset = (qset
            .filter(category=article.category)
            .exclude(id=article.id)
            .prefetch_related('tags'))
    # switch to python to keep queries down
    # query! the two queued queries are now evaluated (queries #1 & #2)
    articles = list(qset)
    if not articles:
        # no other articles with the same category
        # potentially query #3 :(
        return self.published().first()
    # use a set so 'intersection' can be used later on (query #3)
    tag_ids = set(article.tags.all().values_list('id', flat=True))
    groups = defaultdict(list)
    for candidate in articles:
        # be sure to prefetch tags or each lap will hit the db
        # count how many tags from the candidate are in the base article
        # a set comprehension must be used here in place of values_list
        # otherwise a fresh query would be made for each article
        tgt_ids = {tag.id for tag in candidate.tags.all()}
        matches = len(tag_ids.intersection(tgt_ids))
        # add article to group based on matching tag count
        groups[matches].append(candidate)
    # get group with most matching tags
    best_match = groups[max(groups.keys())]
    return best_match[0]
First, I queue my base queries that will filter on category, excluding self, and prefetch tags for python-based work down the road. This will execute a total of two queries when "list()" is called on the "qset" object. If nothing is returned, I'll exit early with the most recent article as a fallback just like before.
Next, I fetch the tags associated with the base article and store the IDs in a set. Note, I want to use a python set here instead of a list so that I can use the "intersection" method that's available to sets later on. This will make the process of matching tags much easier.
From there I'll loop through the articles, collect tag IDs from the prefetched tag objects, and count the number of tags that intersect between the base article and the article from the "qset" object we're looping through. The "count" is used as a key with a python "defaultdict" object to store each article. This essentially "groups" articles by how many matching tags exist.
In the last step, I select the group with the most matching tags simply by taking the one with the largest key (count). Because the "defaultdict" was created with a "default_factory" of "list," each group can be worked with like any other list. Throughout each step, the original ordering on "published_at" is preserved, so I know the first article in the group is the most recently published of the available choices.
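Stripped of the ORM, the grouping step is just set intersection plus a defaultdict. Here's a self-contained sketch with plain data — the article titles and tag IDs are invented for illustration:

```python
from collections import defaultdict

# tag IDs of the origin article
tag_ids = {1, 2, 3}

# candidate articles, most recent first, with their prefetched tag IDs
candidates = [
    ("fresh-but-unrelated", {7, 8}),
    ("older-close-match", {1, 2, 9}),
    ("oldest-close-match", {2, 3, 4}),
]

groups = defaultdict(list)
for title, tgt_ids in candidates:
    matches = len(tag_ids & tgt_ids)  # same as tag_ids.intersection(tgt_ids)
    groups[matches].append(title)

# the group with the most shared tags; order within a group is preserved,
# so the first entry is still the most recently published
best_match = groups[max(groups)]
print(best_match)  # ['older-close-match', 'oldest-close-match']
```

Note that the fresh-but-unrelated article loses out despite being the most recent: tag overlap outranks recency, and recency only breaks ties within a group.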
Random
As an alternate implementation, we could choose an article that meets our category and tag criteria at random. To do so, we'll use the standard python library "random" instead of just taking the most recent article. To make this change, we just need to import the library and change the last line of our "next_suggestion" method.
# core.models (ArticleManager)
import random


def next_suggestion(self, article):
    ...
    # the most recent article
    # return best_match[0]
    # select an article at random from the "best_match" group
    return random.choice(best_match)
Personally, I like this approach better than just taking the most recent article. This will allow for a more dynamic experience and make sure different pieces of content are seeing the light of day.
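To see that rotation effect in isolation, here's a quick sketch with an invented "best_match" group. A seeded Random instance is used only so the example is reproducible; the real method would use the module-level random.choice as above:

```python
import random

best_match = ["part-2", "deep-dive", "case-study"]

# seeded only for reproducibility in this sketch
rng = random.Random(42)

# over many page loads, every article in the group gets surfaced
seen = {rng.choice(best_match) for _ in range(1000)}
print(sorted(seen))  # all three articles appear
```

With best_match[0] alone, only one article in the group would ever be shown; random.choice spreads the exposure across all equally-ranked candidates.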
Check Queries
With the suggestion engine built, let's take a look to make sure we're querying the database the number of times we expect. It's good practice with more complex database operations, especially ones that contain loops, to make sure everything is working as expected and there aren't any leaks causing unnecessary database trips.
From my comment annotations, you can see that I expect exactly three queries to the database no matter which logic branch is taken. Now let's verify. First we'll clear any upstream queries from the database connection (if any), and then we'll print out a count of how many occurred once we've run the logic.
# core.models (ArticleManager)
from django.db import connection


def next_suggestion(self, article):
    # note: queries are only logged when DEBUG=True
    connection.queries_log.clear()
    ...
    print(len(connection.queries_log))
    # the most recent article
    # return best_match[0]
    # select an article at random from the "best_match" group
    return random.choice(best_match)

# output
3
Great! Three queries total as expected. Not terrible but ideally we would prefer this to be a single query as mentioned before. That said, we'll save more advanced database operations for a later day.
In practice, this approach is still completely fine. There's no reason to complicate your code until there's a reason to do so. Your traffic volume, resource consumption and page load times will tell you when additional complexity is needed.
Caching
As traffic grows on your site, you may want to consider caching the result. The queries that generate the output will be the same for any user that accesses a particular article. You could save the ID of the article suggestion in your cache so that you're only performing a single query to retrieve the article based on its primary key instead of the multiple queries required to produce the suggestion as detailed above.
Explore the Django cache framework to see if this would be appropriate for your project.
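As a sketch of that idea, here's the cache-aside pattern with a plain dict standing in for Django's cache backend. In a real project you'd swap the dict for django.core.cache's cache.get/cache.set and a queryset lookup by primary key; the function names and data here are illustrative assumptions:

```python
# Stand-ins; a real implementation would use Django's cache and ORM.
cache = {}
articles_by_id = {1: "origin", 2: "suggested"}
queries = {"count": 0}

def compute_suggestion_id(article_id):
    # placeholder for the multi-query suggestion logic above
    queries["count"] += 3
    return 2

def get_by_pk(article_id):
    # placeholder for a single primary-key lookup
    queries["count"] += 1
    return articles_by_id[article_id]

def next_suggestion_cached(article_id):
    key = f"next_suggestion:{article_id}"
    suggestion_id = cache.get(key)
    if suggestion_id is None:
        # cache miss: pay the full cost once, then remember the answer
        suggestion_id = compute_suggestion_id(article_id)
        cache[key] = suggestion_id
    # cache hit or miss, fetching the article itself is one pk query
    return get_by_pk(suggestion_id)

first = next_suggestion_cached(1)   # cold: 3 + 1 queries
second = next_suggestion_cached(1)  # warm: 1 query
print(queries["count"])  # 5
```

Remember to invalidate or expire the cached ID when articles are published or unpublished, or the suggestion can go stale.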
Other Considerations
There may be situations where an evaluated suggestion isn't desirable and a fixed solution would be better. For example, you have a series of related articles, and you want "Article Part 2" to be the suggestion following "Article Part 1." To accomplish this, you could add a "next_suggestion" field; when present, the suggestion method would always return that article, and otherwise it would fall back to the programmatic result.
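A minimal sketch of that precedence, with a plain class standing in for the model (the field and function names are assumptions, not part of the code above; in Django the field would likely be a nullable self-referencing ForeignKey):

```python
class Article:
    def __init__(self, title, next_suggestion=None):
        self.title = title
        # stand-in for a nullable ForeignKey to another Article
        self.next_suggestion = next_suggestion

def suggest(article, compute):
    # a fixed suggestion, if present, always wins
    if article.next_suggestion is not None:
        return article.next_suggestion
    # otherwise fall back to the programmatic engine
    return compute(article)

def engine(article):
    # placeholder for the category/tag/recency logic
    return Article("computed")

part_2 = Article("Article Part 2")
part_1 = Article("Article Part 1", next_suggestion=part_2)
standalone = Article("One-off Post")

print(suggest(part_1, engine).title)      # "Article Part 2"
print(suggest(standalone, engine).title)  # "computed"
```

Passing the fallback in as a callable keeps the editorial override and the computed engine cleanly separated.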
Final Thoughts
Providing suggestions may increase engagement by keeping users on your site reading more articles. But the suggestions need to be relevant to the particular user. Since we don't know anything about the reader without taking additional steps, the context of the article they're coming from can be used to offer suggestions.
There are technical considerations to be careful of. Evaluate how to most efficiently filter objects based on your constraints. Consider how you'd like to order and prioritize results. Most importantly, be sure not to generate unnecessary queries.