From v.1 and choosing which data to use, to incremental improvements, learn all about how Elsevier built their research article recommender system.
On 8th November 2018, Bibblio held the 7th RecSys London at Elsevier’s offices on London Wall, welcoming a great group of speakers and guests to explore the world of Recommender Systems. Our first talk, from Maya Hristakeva at Elsevier, was ‘Beyond Collaborative Filtering: Learning to Rank Research Articles’, and we’ve brought you a really detailed look at what she said, together with her slides.
Elsevier is a global information analytics company, with digital products like ScienceDirect, Mendeley, Scopus and Evise. Their users, mainly researchers, academics and health professionals, don’t just need to keep up to date with research: they need to find places to publish, they need to find funding, they need to find people to collaborate with, and their editors need to find people to review their work. Elsevier has developed products to serve these use cases, and the recommender systems team help with all of them, as well as with the more conventional content use cases.
Recommender systems at Elsevier
The recommender team at Elsevier has a mix of data scientists, data engineers and product people, and they work quite closely together to build the products. Maya is in charge of the data scientists. She provided us with an overview of the kind of recommender systems they build, and took a deeper dive into the related article recommender that they’ve built for ScienceDirect.
ScienceDirect has about 15 million articles – all of Elsevier’s content. Creating good tools to navigate such a large amount of content is crucial. As well as suggesting relevant content, Elsevier are also looking to create personalized experiences. Elsevier provide two different degrees of personalization, which to some extent are based on different types of data, e.g. a different balance of content and usage data. They have a lot of different data, and part of the decision-making process is deciding which of that data to use, and how, to make a more effective product.
As well as 15 million articles, ScienceDirect gets about 14 million monthly visitors. Not every user is required to log in – institutional credentials, for example, can grant the right to access the content. So, they have a combination of logged-in activity and anonymous logged-out activity, although for the logged-out activity they do have the concept of sessions, and they know what the user has done in that session.
Building the ScienceDirect article recommender
When they started working on the article recommender project, they had a huge amount of usage data for the platform, as well as a number of other data sources that they have from other Elsevier platforms. One thing that they’ve discovered generally from building recommender systems is that collaborative filtering works a lot better than content-based filtering when rich usage data is available. That’s unlikely to come as a surprise to anyone, as a lot has been published on the subject.
Also unsurprisingly, there can be a cold start problem with collaborative filtering algorithms – e.g. what do you do with new articles for which there is no usage data? The recommender team have fallback solutions built on content-based filtering, but they find that as little as a few weeks of usage data is enough for them to use collaborative filtering. Users also come to the platform from external sources such as Google or Twitter, so it’s not as if people are only finding articles via the recommender system.
The first iteration (v.1) of the recommender used data from the browsing logs, i.e. what the user has browsed in a session. They used item-to-item collaborative filtering, and then used the article content as business logic for filtering, e.g. based on recency or article type. People often request recent content, but one important aspect of research articles is that they don't really go out of fashion, so they need to stay discoverable. Maya said that she wouldn’t drill down too much on item-based collaborative filtering, as it’s a system that most people in the audience were likely to be familiar with.
For one of the aims of the system, to try and recommend articles that are similar to ones that users have already browsed, they use item-based nearest neighbor collaborative filtering. They’ve found that cosine similarity works the best as a metric for judging similarity. The nearest neighbor collaborative filter can be user-based for some of the more personalized setups, or item-based for the setups that are more focused on article similarity. The approach gives good results: it scales quite well, and it's relatively simple to implement. They do have their own in-house Spark Scala library for nearest neighbour collaborative filtering, and that scales well for them.
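As an illustration of the approach – not Elsevier’s Spark Scala library, just a minimal Python sketch assuming binary session/item data – item-based nearest neighbour collaborative filtering with cosine similarity can be computed like this:

```python
from collections import defaultdict
from itertools import combinations
from math import sqrt

def item_similarities(sessions):
    """Compute item-item cosine similarity from binary session data.

    sessions: list of sets of article IDs browsed together.
    Returns {(a, b): cosine} for item pairs that co-occur, with
    cosine = co-occurrences / sqrt(count(a) * count(b)).
    """
    item_counts = defaultdict(int)   # sessions containing each item
    pair_counts = defaultdict(int)   # sessions containing both items
    for session in sessions:
        for item in session:
            item_counts[item] += 1
        for a, b in combinations(sorted(session), 2):
            pair_counts[(a, b)] += 1
    return {
        pair: n / sqrt(item_counts[pair[0]] * item_counts[pair[1]])
        for pair, n in pair_counts.items()
    }

def recommend(query_item, sims, k=3):
    """Top-k nearest neighbours of query_item by cosine similarity."""
    scored = []
    for (a, b), s in sims.items():
        if a == query_item:
            scored.append((b, s))
        elif b == query_item:
            scored.append((a, s))
    return sorted(scored, key=lambda x: -x[1])[:k]
```

In a real system the similarity computation would be distributed (as in their Spark library), but the cosine-over-co-occurrence idea is the same.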
Maya said that they have the data available from all of Elsevier’s different products to train the algorithms, but all of the different products require a nuanced approach because the use case they're trying to solve, and the user experience they're trying to create, is always slightly different. For the ScienceDirect article recommender, they use a lot of the platform usage data, but they also bring in data from Scopus and from Mendeley. On the Mendeley platform, they use a lot of the usage data from Mendeley, but also bring in Scopus reputation data, and they sometimes enrich it with usage data from ScienceDirect as well. How they combine data really depends on the effect they’re trying to achieve for users, and users differ between platforms, even if there’s overlap.
When it comes to choosing the algorithms that they use for other platforms, the approach is always data-led. It really depends on the available data and the use case (or user need) they are trying to address. Sometimes they use content-based solutions. On Mendeley they use user-based collaborative filtering algorithms. They do have libraries though, e.g. a collaborative filtering library that scales. So, when they set up an offline evaluation they go through a couple of iterations of different algorithms and approaches to test and benchmark them, then narrow it down from there, and eventually A/B test. In terms of retraining the algorithms, they retrain the collaborative filtering ones about once a day, and they precompute the ranking models once a week.
They place quite a bit of emphasis on evaluation, and how they can know that what they're doing is working well enough. They use the browsing logs to help with this. They do time-split evaluation to split data into a training set and a test set. So they’re using the browsing logs, but using a related article to set it up as a session-based prediction task. For example, they generate item-to-item similarities for the recommendations, and then they can set it up so that they use the first article in the session for a query, and then try to predict what the user would browse next. They find that this works well, and correlates well with online A/B tests to evaluate potential improvements to the system. Evaluation of the recommendation doesn’t end with whether or not the user engages with the content: they're starting to look at how much time the users actually spend on it, but that’s not live on the platform yet.
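A minimal sketch of that session-based prediction task, assuming item-item similarities are stored as a dict keyed by item-ID pairs (a hypothetical representation, not Elsevier’s):

```python
def hit_rate_at_k(test_sessions, sims, k=5):
    """Session-based offline evaluation: use the first article in each
    held-out session as the query, and check whether the article the
    user actually browsed next appears in the top-k recommendations.

    test_sessions: list of ordered lists of article IDs.
    sims: {(item_a, item_b): similarity} from the training period.
    """
    hits, total = 0, 0
    for session in test_sessions:
        if len(session) < 2:
            continue
        query, actual_next = session[0], session[1]
        # rank all neighbours of the query article by similarity
        neighbours = sorted(
            ((b if a == query else a, s)
             for (a, b), s in sims.items() if query in (a, b)),
            key=lambda x: -x[1],
        )
        top_k = [item for item, _ in neighbours[:k]]
        hits += actual_next in top_k
        total += 1
    return hits / total if total else 0.0
```

The time split matters here: `sims` must be trained only on sessions that predate the test sessions, otherwise the evaluation leaks future information.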
Maya then discussed how they approach user inaction – i.e. how to treat recommendations that the user ignores. On ScienceDirect they don’t think about it yet, but on Mendeley Suggest they do something called impression discounting (see this paper by LinkedIn), where you build a model around the data from when an item has been shown ‘n’ times and the user hasn't engaged with it. You begin to discount it, and eventually it could fall off the list. They are looking at how to deal with user inaction, but it’s a tricky area, because you don't know whether a recommendation is being ignored because it’s bad, or because it’s good but not relevant right now.
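The general shape of impression discounting can be sketched as follows – the exponential form and the `gamma` parameter are illustrative assumptions, not the model from the LinkedIn paper:

```python
import math

def discounted_score(base_score, impressions, clicks, gamma=0.3):
    """Shrink a recommendation's score as it accumulates impressions
    without clicks. gamma (hypothetical) controls how aggressively
    unclicked impressions are penalised; clicked impressions do not
    count against the item."""
    unclicked = max(impressions - clicks, 0)
    return base_score * math.exp(-gamma * unclicked)
```

With this parametrisation, an item shown repeatedly without engagement decays towards zero and eventually drops out of the top of the list, which is the behaviour described above.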
For the candidate set, from which they select the recommendations, they have 100 to 150 items. Then they re-rank that list; they used to show the top three to users, but now they're showing a paginated view of the top six for ScienceDirect. People rarely go beyond the first 5-10 results even when they’re shown more, although they do show more than six recommendations on other products.
The team were looking for improvements to v.1 of the system, and found that adding significance weighting to the scores was quite effective. You take your cosine and then multiply it, or scale it down, depending on how many times the two items have co-occurred. ‘K’ controls the minimum number of sessions that two items need to co-occur in. They find that K = 5 works quite well for them (i.e. the pair needs to have co-occurred in at least five sessions for you to have full confidence in the cosine score). An alternative is to just have a hard threshold, i.e. the two things need to have co-occurred a minimum number of times in order to be considered at all, but that would impact your catalogue coverage. Significance weighting can work quite well in combination with some other kind of score thresholding.
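A minimal sketch of significance weighting as described – scale the cosine by co-occurrence count, capped at K:

```python
def significance_weighted(cosine, cooccurrences, k=5):
    """Scale a cosine score down when two items have co-occurred in
    fewer than k sessions; at k or more sessions the score is used
    at full confidence. k=5 is the value quoted in the talk."""
    return cosine * min(cooccurrences, k) / k
```

Unlike a hard threshold, this keeps rarely co-occurring pairs in the catalogue – they are merely down-weighted rather than discarded.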
Other things they wanted to improve were scalability, performance and recommendation quality. They found that these can really be boosted by having minimal/maximal filters for the numbers of articles per session or number of sessions per article, and this can also weed out some odd things happening in the data. Because they work with research articles there is a seasonality effect in the academic year: summer, versus Christmas, versus the terms. They find that using at least a full year of data works quite well to eliminate some of these effects whilst retaining good coverage. In order to boost recency the team sometimes apply time-decay functions. They find that an exponential time-decay function on the usage data actually works well.
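An exponential time-decay weight on usage events might look like the following – the half-life value is a hypothetical choice for illustration, not one quoted in the talk:

```python
def time_decayed_weight(event_age_days, half_life_days=90.0):
    """Exponential time decay on a usage event: an interaction loses
    half its weight every half_life_days. Applying this weight when
    aggregating session data biases the similarities towards recent
    behaviour without discarding older signal entirely."""
    return 0.5 ** (event_age_days / half_life_days)
```

A decay like this softens, rather than removes, the influence of last year’s usage, which fits the goal of boosting recency while keeping older articles discoverable.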
The team has other data that they can use to improve the algorithms too. On ScienceDirect, in addition to the browsing logs and the content data, they also have interaction data, and they know what was shown by the recommender. From the Scopus platform they also have the most accurate and complete citation and co-author data, as well as the reputation metrics. These get computed at the journal level, the article level and the author level as well.
Learning to Rank
Another challenge the team took on was improving the ranking function of the recommender. It’s quite a common tactic, but they took their candidate selection, enriched it with features and then trained a form of click prediction, using learning to rank models to try to generate a better ordering for the list of recommendations offered. They use the collaborative filtering similarity score as a feature for doing that, and they extract features from the citation network, the text and the topic as well, so that they can categorize articles into topics via different taxonomies.
They can also include temporal features. One of the main pieces of feedback from users is that they want recent content. The team’s issue is that if you always give users just recent content then you won’t necessarily get the highest engagement. The solution is a bias towards recent content rather than always selecting recent content. They’re also seeing how reputation and other metrics work as features as well.
When you set out to train the learning to rank model, the standard set up is that you have a query document and then you have the related documents, or the associated recommended documents, and then you have relevance judgment between them, essentially. You can then explore whether the pairs that have the higher relevance ranking are actually ranking higher, or not. There are different optimization objectives that you can have: pointwise, pairwise, listwise. There are a number of different algorithms out there to do this, and a lot of different libraries. The one they've used quite successfully is RankLib. It's a Java-based learning to rank package, and since most of their services are JVM-based, it plugs in nicely with that.
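RankLib consumes training data in the LETOR/SVM-rank style of text format: one line per (query, document) pair, with a graded relevance label, a query ID and numbered features. The feature indices and values below are purely illustrative, not Elsevier’s actual feature set:

```text
# <relevance> qid:<query article id> <feature>:<value> ... # comment
2 qid:1042 1:0.71 2:0.12 3:0.33 # candidate article A
0 qid:1042 1:0.05 2:0.40 3:0.10 # candidate article B
1 qid:1042 1:0.22 2:0.08 3:0.91 # candidate article C
```

In a setup like the one described, feature 1 could be the collaborative filtering similarity score, with the remaining features drawn from the citation network, text, topics or recency.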
They evaluated a number of the different algorithms for the learning to rank task, and found that LambdaMART gave them the best results. In effect it combines a listwise information retrieval metric such as NDCG and decision trees to solve a listwise ranking optimization problem. In order to train the algorithm, it requires data. They looked into the recommender logs’ user engagement data – i.e. what was shown to users and what did they actually engage with (whether that’s a click or a download). That's easier said than done, sometimes. They have to work with the product teams to make sure that that required feedback data is actually recorded and sent, and it's in a form that they can actually work with. This engagement data is personal, so based on a session ID, or on your profile if you’re logged in.
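As a refresher on the metric LambdaMART optimizes (a standard textbook formulation, not Elsevier’s code), NDCG for a ranked list of graded relevance labels can be computed as:

```python
import math

def dcg(relevances):
    """Discounted cumulative gain: gain (2^rel - 1) discounted by
    log2 of the 1-based position + 1."""
    return sum((2 ** rel - 1) / math.log2(i + 2)
               for i, rel in enumerate(relevances))

def ndcg(relevances):
    """Normalise by the DCG of the ideal (descending) ordering, so a
    perfectly ordered list scores 1.0."""
    ideal = dcg(sorted(relevances, reverse=True))
    return dcg(relevances) / ideal if ideal > 0 else 0.0
```

Because the discount is position-dependent, NDCG is a listwise metric: swapping a relevant item into a higher position changes the score, which is exactly the signal LambdaMART’s gradients are built from.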
So out of the recommendations that they show, they know which ones the user engaged with, with the timestamp. They also have the request ID, which basically allows them to group which content was shown together, so you can say that of the ‘n’ that were shown, the 3rd and the 5th were the ones that the user downloaded or engaged with. If a user goes on to actually read the content then they can record the further engagement.
To make the relevance judgement, they look at each query article, they take all of the recommended articles in all the user sessions, they count the impressions and clicks, and then they compute a graded relevance score for the label.
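One simple way to turn those aggregated counts into a graded label – the thresholds here are illustrative assumptions, not Elsevier’s actual grading scheme:

```python
def graded_relevance(impressions, clicks):
    """Map aggregated impression/click counts for a (query article,
    recommended article) pair to a graded relevance label 0-3,
    based on click-through rate. Thresholds are hypothetical."""
    if impressions == 0 or clicks == 0:
        return 0
    ctr = clicks / impressions
    if ctr >= 0.5:
        return 3
    if ctr >= 0.2:
        return 2
    return 1
```

Labels like these, grouped by query ID, are what a learning-to-rank trainer such as RankLib consumes.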
One of the problems that they ran into is that there's not much variance and variability in the list of possible recommendations. This makes it hard to actually learn which recommendations are the best, and also to get a good distribution throughout the corpus. So something that they’ve done, and it’s pretty lightweight to implement in an API, is dithering, something which Ted Dunning has talked about. You can read Elsevier’s blog about it here. It's not pure explore/exploit, the way you might think of a multi-arm bandit, but it’s a similar idea: you take your rank, add a bit of noise to it, and then you shuffle the list a little bit. Epsilon gives you the amount of shuffling you’re doing, so as you increase epsilon you shuffle the list more.
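A minimal sketch of dithering along the lines Ted Dunning describes – perturbing the log of the rank with Gaussian noise and re-sorting; the exact epsilon parametrisation here is an assumption:

```python
import math
import random

def dither(ranked_items, epsilon=1.5, rng=None):
    """Dither a ranked list: score each item as log(rank) plus
    Gaussian noise with standard deviation sqrt(log(epsilon)), then
    re-sort by the noisy score. epsilon -> 1 leaves the order intact;
    larger epsilon shuffles more, and the log(rank) term means
    neighbouring top positions swap more readily than distant ones."""
    rng = rng or random.Random()
    sd = math.sqrt(math.log(epsilon))
    noisy = [(math.log(rank) + rng.gauss(0, sd), item)
             for rank, item in enumerate(ranked_items, start=1)]
    return [item for _, item in sorted(noisy)]
```

Applied fresh on each request, this gives returning users a slightly different list each time while still keeping the best candidates near the top on average.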
The nice thing about this is it gives some freshness to the user, or the impression of freshness. It also allows the Elsevier team to do more exploration of the list, and it removes some of the bias from the feedback that they can then train these models with (although it doesn't unbias it completely). They also think about the filter bubble: it’s a big problem if everyone in the same field always sees the same articles, as it could e.g. bias the direction of research. It’s not something that they’ve been able to measure yet explicitly, though.
Selecting the parameter that controls the shuffling is a trade-off to some extent. Elsevier did an offline evaluation with the session-based recommendation task. The idea is to work out how much shuffling of the list causes a drop-off in quality, and by how much. They picked a point that was an appropriate trade-off between exploration and quality. It's a click prediction task. They did a lot of analysis around position bias, but there wasn't a significant drop between the first, second and third positions in the recommendations offered.
For the training model they do a time-split evaluation, which Maya remarked is quite standard, and they have the standard training, validation and test sets. They ended up testing a lot of the different ranking algorithms and different features, and did quite a bit of exploration of the space, e.g. how much data to use. In the end, the model that seemed to perform best delivered an 8-10 percent improvement in user engagement when it was A/B tested.
They also did some post hoc analysis of how learning to rank changes the recommendation list, and they noticed two particular things. Firstly, it increases journal diversity, so the recommendations in the list come from different journals, without the team having to enforce it as a specific criterion. Secondly, learning to rank promotes more recently published articles, which was something that users were asking for.
It was also nice for them to see how they can pull in a number of these data sets and data sources and put them into a ranking model for the team to use. One thing which still puzzles them a little bit is the relative performance of the collaborative filtering scores, the LambdaMART model and random. Random here essentially means taking the list of 100 candidates that they generate and completely shuffling it. On the click-prediction task, where you're trying to say "well, what is the user going to engage with?", the LambdaMART model really outperforms the collaborative filtering score ranker, with NDCG as the metric.
They did the same post hoc analysis on the second prediction task: they asked, how well does the algorithm do at predicting what the user is going to browse next? They were looking for the path that they could use to really tune and optimize the collaborative filtering candidates (what they now call candidate selection). There, it didn't really make that much of a difference, or at least they weren’t able to comment on the statistical significance of the different results. However, after a number of A/B tests using the click prediction task, they have managed to achieve better online performance. This is something that they're thinking about more seriously now when they set up final evaluations: what is the problem they’re actually trying to solve online? The offline evaluation task should be appropriate for the online goal. The right evaluation criteria aren’t always obvious when you’re working on these problems.
With a lot of the experiments in tuning the collaborative filtering item-based recommender, they did the session prediction and improvements in offline performance also resulted in improvement in online performance. When they worked on the click prediction task, which is taking the candidates and re-ranking them based on the recommendation click prediction, they saw that the models were improving upon collaborative filtering and random. A/B tests were seeing statistically significant improvements as well.
Looking to the future, they’re examining a number of different approaches. They have three separate graph structures – citation, co-author and social network – so they're looking into using random walks to explore the citation and co-author graphs for candidate generation. They’re also looking into using deep learning, either as neural embeddings for candidate selection or using hybrid systems for ranking models.
Maya said that one of the topics that she feels quite passionately about is evaluation. She believes that it's important that they spend a lot of time in the team talking about and exploring it. One element of that is correcting for bias, and another is algorithm confounding. Some other things that they're starting to look into are whether they should actually have a proper explore/exploit multi-arm bandit system, and whether they can use more counterfactual analysis. It may be that it’s not essential for their use case at the moment, but she thinks it would be good to be thinking slightly in advance of where they are.
With evaluation, there is a quantitative aspect but there is also a qualitative aspect. They do A/B testing, but then they also have teams that do user studies, and it all needs to come together. Dogfooding is a big thing for them, so they need to use their own products. It's a great way to get feedback. Maya said that there was a really nice tutorial at RecSys about mixed-methods evaluation from Spotify.
Elsevier build recommender systems for different products, and different products have different needs and different setups. One common challenge is how to actually identify the users. How do you detect when users’ interests are changing, and then adapt? How much should you adapt? Should you actually take into account what they've done in the past? There's also more and more emphasis on interdisciplinary research: how do you start to connect different fields? You want to know whether you can actually start connecting the dots for users – not just what's in their field, but what's in adjacent fields, and how to use that. One example here is covering the different user needs: are you in a more exploratory mode, where you've started a new topic and you're just gathering a lot of information, or are you already an expert who just wants incremental add-ons – basically keeping up to date?
Privacy and GDPR are also big topics. It's something that’s impacting the way they build the systems, and it's something that they're spending a lot of time thinking about.