How to properly design, evaluate and run the next generation of algorithms
At Bibblio Labs we recently organized our fourth RecSys London Meetup, hosted by the team at Deliveroo. The meetup took place at their office in Cannon Bridge House, a well-known part of London's skyline.
The presenters for the evening were Daoud Clarke (Data Scientist at Lucid and Founder at Hyperparameter), Mahbub Gani (my colleague and Lead Data Scientist at Bibblio) and Khurom Kiyani (Senior Data Scientist at Deliveroo).
They had some really interesting insights which you can find below, and I've included links to their presentation slides too:
"Little has been published about systems that can generate recommendations in response to changes in recommendable items and user behavior in a very short space of time - we wanted to change that." - Daoud Clark
Daoud founded Hyperparameter Limited in 2015, and since then has also been helping various organizations. In his talk he focused on a project he did at News UK in 2016 and 2017, improving the recommendations in newsletters for The Times and The Sun.
News UK were convinced that personalization could improve the daily email that went out to subscribers - the challenge was to do this well at scale. At the news organization they also needed to be able to quickly generate recommendations for news items as they were published.
Daoud found that the existing algorithms and systems weren't suitable and decided to create a new algorithm for updating collaborative filtering models incrementally.
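The general idea of incremental updates can be sketched with a toy example: instead of retraining the whole collaborative filtering model, perform a single stochastic-gradient step per incoming interaction, so newly published items become recommendable immediately. This is a minimal illustration of the concept only, not the algorithm Daoud designed; all names and hyperparameters here are invented.

```python
import numpy as np

class IncrementalMF:
    """Toy matrix-factorization recommender updated one event at a time."""

    def __init__(self, n_factors=16, lr=0.05, reg=0.01, seed=0):
        self.k = n_factors
        self.lr = lr
        self.reg = reg
        self.rng = np.random.default_rng(seed)
        self.users = {}  # user_id -> latent vector
        self.items = {}  # item_id -> latent vector

    def _vec(self, table, key):
        # Lazily create a factor vector the first time a user/item appears,
        # which is what lets brand-new items enter the model instantly.
        if key not in table:
            table[key] = self.rng.normal(scale=0.1, size=self.k)
        return table[key]

    def update(self, user, item, signal=1.0):
        """Single SGD step on one interaction (e.g. a click)."""
        u = self._vec(self.users, user)
        v = self._vec(self.items, item)
        err = signal - u @ v
        u_old = u.copy()
        u += self.lr * (err * v - self.reg * u)
        v += self.lr * (err * u_old - self.reg * v)

    def score(self, user, item):
        return self._vec(self.users, user) @ self._vec(self.items, item)
```

Because each event touches only two small vectors, updates are cheap enough to run inside the ingestion pipeline rather than in a nightly batch job.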
He and his collaborators, Dion Bailey & Tom Pajak (News UK) and Carlos Rodriguez (Kainano Ltd.), worked out which algorithms and architecture were needed for real-time recommendations at News UK. They even wrote a paper on the productionalized system which they presented at the Thirty-seventh SGAI International Conference on Artificial Intelligence last year.
There was too much to cover in depth during a 20 minute talk, but Daoud did leave us with some takeaways. While designing the algorithms and building the architecture that allowed recommendations to be generated on the fly at scale, they discovered that:
"Chatbots?", you might be thinking. It's a novel, future-oriented application Daoud envisages for the algorithms and architecture they designed. He's very excited about the progress he's making in that area and is keen to report on that in the near-future:
Journey to Personalization
"Much current ML research suffers from a growing detachment from real problems: subjective evaluation by humans is of the utmost importance." - Dr Mahbub Gani
Next up was Mahbub, my colleague and Lead Data Scientist at Bibblio. In his talk he gave an overview of our company's journey towards productionalizing a collaborative filtering algorithm.
This journey was characterized by dealing with data limitations around both user and session identifiers. This is a common challenge at a 'software-as-a-service' company, where the business has no direct connection to the audience of the recommendations. Mahbub suggested some ways to resolve this issue:
Whilst acknowledging their limitations, he went on to introduce the 'three kings' of collaborative filtering, which were all evaluated when deciding which algorithm would become the driver behind Bibblio's first personalization module:
The evaluation of these approaches was performed in a systematic manner, kicked off by selecting a sample of 10 personas with a history of at least C clicks. For each persona, Mahbub then generated N <= C recommendations with each of the three CF prototypes. He included two control recommenders: one which showed the top five most popular items across the catalog (Global Popularity) and one which was a list of five items chosen at random (Random).
The subjective tests took place by having each evaluator perform a blind evaluation of each recommendation set against accuracy and diversity criteria. Mahbub also had the evaluators assign an overall “business” score to each recommender. He then ranked the recommenders according to these three scores (not forgetting to check inter-annotator divergence too). And the big winners were...
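That ranking step is simple enough to sketch. The scores and recommender names below are entirely made up for illustration; the point is that ranking by the mean alone can hide disagreement, which is why the inter-annotator spread is worth printing alongside it.

```python
import statistics

# Hypothetical blind-evaluation scores (1-5) from three evaluators,
# one overall score per recommender; names are illustrative only.
scores = {
    "FM":         [4, 5, 4],
    "ALS":        [4, 4, 4],
    "ItemKNN":    [3, 3, 4],
    "Popularity": [2, 3, 2],
    "Random":     [1, 2, 1],
}

# Rank recommenders by mean evaluator score (highest first).
ranking = sorted(scores, key=lambda r: statistics.mean(scores[r]), reverse=True)
print("ranking:", ranking)

# Inter-annotator divergence: a large spread flags disagreement
# that a bare mean would hide.
for rec, vals in scores.items():
    print(f"{rec}: mean={statistics.mean(vals):.2f} "
          f"stdev={statistics.stdev(vals):.2f}")
```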
...the Factorization Machine (labeled B in the diagram above) and Alternating Least Squares (labeled C). Notice how Global Popularity ranks worst by far on accuracy, while Random, not surprisingly, scored best on diversity. Algorithms B and C had very similar mean overall scores, but differed considerably in implementation complexity and performance.
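For reference, alternating least squares itself fits in a few lines: fixing the item factors turns each user's update into a small ridge regression, and vice versa. This is the generic textbook form for explicit feedback, not Bibblio's implementation; the matrix, rank and regularization below are illustrative.

```python
import numpy as np

def als(R, mask, k=2, reg=0.1, iters=20, seed=0):
    """Alternating Least Squares on a small explicit-feedback matrix.
    R: ratings matrix; mask: 1 where a rating was observed, 0 elsewhere."""
    rng = np.random.default_rng(seed)
    n_users, n_items = R.shape
    U = rng.normal(scale=0.1, size=(n_users, k))
    V = rng.normal(scale=0.1, size=(n_items, k))
    I = reg * np.eye(k)
    for _ in range(iters):
        # Fix V, solve a ridge regression per user...
        for u in range(n_users):
            Vo = V[mask[u] == 1]
            U[u] = np.linalg.solve(Vo.T @ Vo + I, Vo.T @ R[u, mask[u] == 1])
        # ...then fix U and solve per item.
        for i in range(n_items):
            Uo = U[mask[:, i] == 1]
            V[i] = np.linalg.solve(Uo.T @ Uo + I, Uo.T @ R[mask[:, i] == 1, i])
    return U, V
```

Each sub-problem has a closed-form solution, and the per-user and per-item solves are independent, which is why ALS parallelizes so well compared with SGD-based training.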
Mahbub mentioned that the results had only come in on the day of the presentation, and the decision on which algorithm to productionalize would follow soon. Which algorithm would you go for - B or C? And why?
Which algorithm? Have a look at Mahbub's presentation here and decide for yourself
Model Evaluation and Testing in Ranking Experiments
"One of the biggest cons of offline evaluation is strong positional bias, which makes it crucial you asses the quality of different rankings by using online experiments such as interleaving." - Khurom H. Kiyani
Last but not least was Khurom, Senior Data Scientist at our host for the evening, Deliveroo. The online food ordering company's goal is to connect users with the right restaurant at the right time and have them buy a meal. Choices are abundant, so a recommender system comes in handy.
When developing a new algorithm focused on optimizing goals such as conversion rate and minimizing bounce rate, Khurom suggested the following life-cycle:
Khurom focused on the segments highlighted in blue: the offline and online evaluations. He listed the pros and cons of both, emphasizing the fact that offline evaluations are often "littered with biases that are hard to control".
Randomized controlled trials, or online A/B tests, are the de facto gold standard for hypothesis testing, Khurom explained, but they need careful design: otherwise they can end up too noisy, take too long, or even produce misleading results.
One of the ways to run a well-designed online test - while combating biases such as positional bias - is called interleaving. Khurom had a lot of admiration for Chapelle's 2012 paper and the Netflix team's 2017 blog post, both of which explain this 'improved' A/B test clearly:
By using interleaved experiments you can rapidly sift through many similar ranking algorithms, in what is also known as a repeated-measures design:
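One common variant, team-draft interleaving, can be sketched as follows. Each ranker alternately "drafts" its best not-yet-picked item into the list shown to the user, and clicks are credited to whichever ranker supplied the clicked slot - both rankers face the same users and the same positions, which is what removes the positional bias. This is a simplified illustration, not Deliveroo's implementation.

```python
import random

def team_draft_interleave(ranking_a, ranking_b, k=6, seed=42):
    """Team-draft interleaving of two rankings into one list of length k.
    Returns the interleaved list and which team supplied each slot."""
    rng = random.Random(seed)
    interleaved, teams = [], []
    while len(interleaved) < k:
        # Coin flip decides which team drafts first this round.
        order = ["A", "B"] if rng.random() < 0.5 else ["B", "A"]
        progressed = False
        for team in order:
            source = ranking_a if team == "A" else ranking_b
            pick = next((d for d in source if d not in interleaved), None)
            if pick is not None and len(interleaved) < k:
                interleaved.append(pick)
                teams.append(team)
                progressed = True
        if not progressed:  # both rankings exhausted
            break
    return interleaved, teams

def credit(teams, clicked_positions):
    """Attribute each click to the team that supplied the clicked slot."""
    wins = {"A": 0, "B": 0}
    for pos in clicked_positions:
        wins[teams[pos]] += 1
    return wins
```

Because every impression yields a within-user comparison, interleaving typically needs far less traffic than a conventional A/B split to separate two similar rankers.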
Khurom believes offline testing still holds value, but to increase the fidelity of this evaluation type we should be careful to control for biases within historical rankings. Context is very important in ranking - ignore it at your peril.
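One common way to control for positional bias in historical logs (not necessarily Deliveroo's approach) is inverse propensity scoring: weight each logged click by the inverse of the estimated probability that its position was examined at all. A toy illustration, with invented propensities and logs:

```python
# Hypothetical examination probabilities per display position, e.g.
# estimated from a small randomization experiment.
propensity = {1: 1.0, 2: 0.6, 3: 0.4, 4: 0.3}

# Hypothetical click logs: (position_shown, clicked).
logs = [(1, True), (2, False), (3, True), (4, False), (2, True)]

# Naive click-through rate ignores that low positions are rarely examined,
# so items shown near the bottom look worse than they are.
naive_ctr = sum(c for _, c in logs) / len(logs)

# Inverse propensity scoring re-weights each click by 1/P(examined at pos),
# up-weighting clicks that happened despite an unfavorable position.
ips_ctr = sum(c / propensity[p] for p, c in logs) / len(logs)

print(f"naive={naive_ctr:.2f} ips={ips_ctr:.2f}")
```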
He left us with a thought-provoking question: "If our offline ranking metrics showed that we are smashing the benchmark and our online experiments corroborated this, should we still roll out? For example, pushing more popular restaurants higher up the page." For the vast majority of attendees, and Khurom, the answer was "no".
Algorithmic development never takes place in isolation. To go back to Mahbub's earlier point, algorithms can help to tackle real world problems if created and used properly... but can also make things worse. Either way, in the case of Deliveroo they will have a real impact on the livelihoods of restaurant owners. Khurom was keen that less popular restaurants shouldn't be locked out of success by Deliveroo's algorithmic decisions.
Overall it was a great event at an impressive venue, and it was fun chatting to everyone over drinks, pizza and donuts. If you want to join us at the next free London meetup on recommender systems, then join the group here (278 members already)! Our next meetup will be in September 2018. Also, we're always on the lookout for speakers to light up our next meetups, so if you know (or are) somebody who'd like to share a story, big or small, then please contact me.