What are the three ways to build a recommender system when you don’t have audience data?


Even if you don't have data about your users, you can still build an effective recommender system to keep them engaged by showing them more great content.

TL;DR: Your first approach could be to build a content-based recommender which recommends other similar content items without requiring any user data. The features (i.e. the mathematical representations of the different aspects of the content items the recommender algorithm needs to perform its computations) would be constructed from information about the content items themselves, rather than user behavior. With written content you might use semantic technology to derive your features from the text.

With this system as a benchmark, you could try to enhance your recommendations by introducing additional features such as metadata extracted from the text. You could also take it a step closer to a personalized recommender, even without explicit user identifiers, by using a proxy for the user ID to provide a form of personalization. Finally, supposing your users look at several content items during each visit, you could build a localized, session-based recommender, which is one that bases recommendations on immediate trends within a user’s session.


“How do you build a recommender system without user data?” This is a question we’ve come across a few times, so I thought that I’d have a go at answering it. I should say up front that this piece 1) assumes a certain amount of fundamental knowledge about how recommender systems work, and 2) uses some jargon (where essential!). I’ll try to give context where it gets technical, and if you want to know more about the basics of recommendation that aren’t covered here, why not try this excellent practical tutorial.

Broadly, there are three possible ways to build a recommender system without user data. I’ve set them out below in increasing order of sophistication, and I’ve assumed that you’re taking advantage of the data that is available. Each method gets closer to leveraging user data, in the form of unique identifiers and user information, even if you don’t actually have it.


Build a content-based recommender system

Firstly, you can build a standard content-based recommender system using, for example, any tags or other content metadata as features. You could apply a form of TF-IDF scoring for your algorithm, where the tags represent individual words which are found in a pre-computed dictionary. (The dictionary is simply a data structure that contains the entire collection of individual words extracted from each content item.)

Specifically, if you take all of the tags and all of the other features and construct a dictionary out of them, that will enable you to build a so-called ‘feature vector’ for each content item. You then use these vectors as the basis for comparing different content items before making a recommendation. Following this recipe will enable you to build a rudimentary content-based recommender, which in my experience works quite well. All you’re doing is recommending content similar to your original piece, with ‘similar’ here meaning that the recommended content has similar tags and features to the original content.
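To make the recipe concrete, here is a minimal sketch in plain Python. The catalogue, the item IDs and the function names are all made up for illustration, and I've assumed cosine similarity as the way to compare feature vectors, which is a common choice but not the only one.

```python
import math
from collections import Counter

# Hypothetical catalogue: item ID -> list of tags (illustrative data only).
ITEMS = {
    "a": ["python", "machine-learning", "tutorial"],
    "b": ["python", "machine-learning", "news"],
    "c": ["cooking", "tutorial"],
    "d": ["cooking", "travel"],
}

def build_tfidf_vectors(items):
    """Return {item_id: {tag: tf-idf weight}}; the set of all tags is the dictionary."""
    n_items = len(items)
    # Document frequency: the number of items each tag appears in.
    df = Counter(tag for tags in items.values() for tag in set(tags))
    vectors = {}
    for item_id, tags in items.items():
        tf = Counter(tags)
        vectors[item_id] = {
            tag: (count / len(tags)) * math.log(n_items / df[tag])
            for tag, count in tf.items()
        }
    return vectors

def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(weight * v.get(tag, 0.0) for tag, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

def recommend(item_id, vectors, top_n=2):
    """Rank every other item by similarity to the given item."""
    scores = [
        (other, cosine(vectors[item_id], vec))
        for other, vec in vectors.items()
        if other != item_id
    ]
    scores.sort(key=lambda pair: pair[1], reverse=True)
    return scores[:top_n]

vectors = build_tfidf_vectors(ITEMS)
print(recommend("a", vectors))
```

With this toy data, item "a" is matched first with "b" (two shared tags) and then with "c" (one shared tag), while "d" shares nothing and scores zero. Note that a tag which appears on every item gets a weight of zero, which is exactly what TF-IDF is meant to do with uninformative features.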

The very first thing you can do to create a more sophisticated system is iterate and use your rudimentary content recommender as a benchmark to improve upon. There are other techniques you can use as well which I’ll describe below.


Improving your content-based recommender

The procedure described above draws upon a single dictionary consisting of your current tags and other features. The next level of sophistication is to have two or more separate dictionaries, one for each class of metadata. You produce TF-IDF scores against each dictionary and then combine them, per content item, as a weighted sum. You can optimize the parameters (i.e. the weights of the scores) according to the results of a subjective evaluation, to see which weighting gives the best quality recommendations.
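The weighted combination might look something like the sketch below. It assumes you have already computed a similarity score per candidate item for each metadata class; the class names ("tags", "authors"), the scores and the weights are invented for illustration.

```python
def combine_scores(per_class_scores, weights):
    """Blend per-class similarity scores into one ranking.

    per_class_scores: {class_name: {item_id: score}}
    weights:          {class_name: weight}, normalized here so they sum to 1
    """
    total_weight = sum(weights.values())
    combined = {}
    for class_name, scores in per_class_scores.items():
        share = weights.get(class_name, 0.0) / total_weight
        for item_id, score in scores.items():
            combined[item_id] = combined.get(item_id, 0.0) + share * score
    return sorted(combined.items(), key=lambda pair: pair[1], reverse=True)

# Hypothetical similarity scores for two candidates, "b" and "c".
scores = {
    "tags": {"b": 0.8, "c": 0.2},
    "authors": {"b": 0.1, "c": 0.9},
}

print(combine_scores(scores, {"tags": 0.7, "authors": 0.3}))  # "b" ranks first
print(combine_scores(scores, {"tags": 0.3, "authors": 0.7}))  # "c" ranks first
```

Changing the weights flips the ranking, which is why it's worth evaluating a handful of weightings against each other rather than guessing.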

If you have a class of metadata which doesn’t lend itself well to TF-IDF scoring, e.g. because it's not discrete, then I recommend you segment the data within that class into different categories. Segmenting gives you another set of tags (one for each category). Provided that you don’t create a huge number of features, the additional complexity shouldn’t impact you too much.
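As a small illustration of segmenting, suppose one of your metadata fields is reading time in minutes. The field, the boundaries and the tag names below are all assumptions made for the example; the point is only that a continuous value becomes one of a few discrete tags.

```python
from bisect import bisect_right

# Hypothetical bucket boundaries, in minutes, and the tag for each bucket.
BOUNDARIES = [3, 10]
LABELS = ["read:short", "read:medium", "read:long"]

def reading_time_tag(minutes):
    """Map a continuous reading time onto one of three discrete tags."""
    return LABELS[bisect_right(BOUNDARIES, minutes)]

print(reading_time_tag(2))     # read:short
print(reading_time_tag(7.5))   # read:medium
print(reading_time_tag(25))    # read:long
```

The resulting tag can then be added to the relevant dictionary and scored like any other tag.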

You could then enhance your recommendations further, or focus them, by introducing filters, e.g. for a particular tag. This is not part of the core algorithm; it’s the surrounding scaffold in which you embed the algorithm, and it enables users to specify criteria for the recommendations.
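A filter of this kind can sit entirely outside the scoring step. The sketch below assumes the recommender has already produced a ranked list of (item, score) pairs; the data and the function name are hypothetical.

```python
def filter_recommendations(ranked, item_tags, required_tag, top_n=3):
    """Keep only ranked items carrying the required tag, preserving order."""
    kept = [
        (item, score)
        for item, score in ranked
        if required_tag in item_tags.get(item, ())
    ]
    return kept[:top_n]

# Hypothetical output of the core recommender, best match first.
ranked = [("b", 0.9), ("c", 0.5), ("d", 0.1)]
item_tags = {"b": ["python"], "c": ["cooking", "tutorial"], "d": ["cooking"]}

print(filter_recommendations(ranked, item_tags, "cooking"))
```

Because the filter runs after ranking, you can add or change criteria without touching the scoring code.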


Build a recommender which draws upon a proxy for the user

The next level of sophistication is to look at data characteristics which can serve as a proxy for the user. Even though you don’t have unique user IDs, you may have the IP address, browser information or other kinds of information that you’ve harvested for each user session.

From this you can construct a notional user ID. It isn’t going to be bulletproof, but it’s a form of rudimentary fingerprinting. Once you are able to designate an ‘abstract’ user, you can start generating personalized recommendations for that user, for example by drawing upon a variety of collaborative filtering approaches. Again, I don’t think this is too complicated: there are lots of open source approaches out there. (A high-level Python package, for instance, can be found here.) The key thing here is that you now have one way of constructing the user ID based on the proxy information you have.
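One simple way to construct such an ID is to hash the proxy fields together. This is only a sketch: the choice of fields, the salt value and the function name are my assumptions, and you would pick whichever session attributes you actually collect.

```python
import hashlib

def notional_user_id(ip_address, user_agent, salt="example-salt"):
    """Derive a stable pseudo-ID from proxy fields (hypothetical scheme)."""
    # Normalize the inputs so trivial differences don't split one user in two.
    raw = "|".join([salt, ip_address.strip(), user_agent.strip().lower()])
    return hashlib.sha256(raw.encode("utf-8")).hexdigest()[:16]

uid_1 = notional_user_id("203.0.113.7", "Mozilla/5.0 (X11; Linux x86_64)")
uid_2 = notional_user_id("203.0.113.7", "mozilla/5.0 (x11; linux x86_64) ")
uid_3 = notional_user_id("203.0.113.7", "Mozilla/5.0 (Macintosh)")

print(uid_1 == uid_2)  # True: same proxy data after normalization
print(uid_1 == uid_3)  # False: a different browser gives a different ID
```

The same caveat applies as above: two people behind one IP address with the same browser will collapse into one notional user, and one person on two devices will appear as two.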

The other thing you need is click interaction data. You need to know what items have been clicked, otherwise you have no way of developing a notion of preference to optimize towards. Once you have both this data and a notional user ID, you can create a form of recommender that personalizes to the combination of an IP address and some browser information. This is not true personalization, but it’s at least on the road there.
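To show how the two pieces fit together, here is a minimal item-based collaborative filtering sketch in plain Python. It is one simple variant among many, not a recommendation of a specific algorithm; the click log, the user IDs and the function names are invented for illustration.

```python
import math
from collections import defaultdict

# Hypothetical click log: (notional user ID, item ID).
CLICKS = [
    ("u1", "a"), ("u1", "b"),
    ("u2", "a"), ("u2", "b"), ("u2", "c"),
    ("u3", "b"), ("u3", "c"),
    ("u4", "d"),
]

def item_similarities(clicks):
    """Cosine similarity between items, based on which users clicked them."""
    users_by_item = defaultdict(set)
    for user, item in clicks:
        users_by_item[item].add(user)
    sims = defaultdict(dict)
    for i, users_i in users_by_item.items():
        for j, users_j in users_by_item.items():
            if i == j:
                continue
            overlap = len(users_i & users_j)
            if overlap:
                sims[i][j] = overlap / math.sqrt(len(users_i) * len(users_j))
    return sims

def recommend_for_user(user, clicks, top_n=3):
    """Score unseen items by their similarity to the items the user clicked."""
    seen = {item for u, item in clicks if u == user}
    sims = item_similarities(clicks)
    scores = defaultdict(float)
    for item in seen:
        for other, sim in sims.get(item, {}).items():
            if other not in seen:
                scores[other] += sim
    return sorted(scores.items(), key=lambda pair: pair[1], reverse=True)[:top_n]

print(recommend_for_user("u1", CLICKS))  # suggests "c"
print(recommend_for_user("u4", CLICKS))  # [] because nobody else clicked "d"
```

The empty result for "u4" is a reminder that this approach only works once notional users overlap in what they click; until then, the content-based recommender remains the fallback.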


Build a session-based recommender

The final broad approach is to build a session-based recommender. It’s similar to the previous approach, but this time you’re focusing on data from a particular session, which you'll probably have even if you don’t have user information. If you have session IDs, then you can use those as a highly localized ‘user ID’ equivalent.

There are a variety of session-based recommenders. Some are rather sophisticated and based on recurrent neural networks (RNNs), such as the work of Hidasi and Karatzoglou. These systems have been shown to deliver very promising results.

A session-based recommender does assume that users are prepared to go on a journey. If they are and they accumulate a sufficient number of clicks, then the system gets better at recommending a compelling next item for them to view and consider.
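For a feel of how a session-based recommender behaves, here is a deliberately simple sketch that counts which item tends to follow which within past sessions. To be clear, this is a basic transition-count baseline and not the RNN approach; the session data, the recency weighting and the function names are assumptions made for the example.

```python
from collections import Counter, defaultdict

# Hypothetical past sessions: session ID -> ordered list of clicked items.
SESSIONS = {
    "s1": ["a", "b", "c"],
    "s2": ["a", "b", "d"],
    "s3": ["b", "c"],
}

def build_transitions(sessions):
    """Count how often each item was followed directly by each other item."""
    transitions = defaultdict(Counter)
    for clicks in sessions.values():
        for current, following in zip(clicks, clicks[1:]):
            transitions[current][following] += 1
    return transitions

def recommend_next(current_session, transitions, top_n=2):
    """Suggest next items, weighting the most recent clicks most heavily."""
    scores = Counter()
    for age, item in enumerate(reversed(current_session)):
        weight = 1.0 / (age + 1)  # last click counts 1, the one before 0.5, ...
        for following, count in transitions.get(item, {}).items():
            if following not in current_session:
                scores[following] += weight * count
    return [item for item, _ in scores.most_common(top_n)]

transitions = build_transitions(SESSIONS)
print(recommend_next(["a"], transitions))       # ['b']
print(recommend_next(["a", "b"], transitions))  # ['c', 'd']
```

Each extra click in the current session adds evidence, which is the sense in which the system improves as the user continues their journey.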

I hope that you've found this interesting! You can find more on this topic on my Quora profile. You might also find it interesting to look at my answer to a previous question: What are the best algorithms for building recommender systems?
