
Lessons from ECIR 2023

From the 2nd to the 6th of April, the European Conference on Information Retrieval (ECIR) was held in Dublin, Ireland.

During this five-day event, papers from various fields related to information retrieval were presented. These works showcase cutting-edge methods and theories in their niche fields, such as search, text representation, large language models and recommender systems.
The mix of fields presented at ECIR provided an opportunity to gather ideas from other areas that are applicable to our own field, recommender systems.

I was happy to be able to attend and present our experiments with punishing popular items in recommendations during the first poster session.
In this blog, I’ll present some of my takeaways from the conference.

Operationalising Fairness

From the keynotes by Dr. Biega and Dr. Cramer during the BIAS workshop, I realised that fairness is a broad concept, with many metrics that each measure something slightly different. For example, fair exposure is not the same as equal exposure: fair exposure takes into account the inherent quality or relevance of an item. Niche items need to be exposed, but not as often as globally relevant items. To make these metrics actionable and usable in production settings, we need to work on operationalisation. By creating tools and frameworks that tell us which metric to use when, we can automate the measurement of fairness in the system and take action when divergence is detected.
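
To make the distinction concrete, here is a minimal sketch (my own illustration, not from either keynote) comparing an equal-exposure target with a relevance-proportional exposure target for a toy catalogue. The relevance scores, exposure shares, and the simple L1 divergence are all assumptions for the example.

```python
import numpy as np

# Toy catalogue: per-item relevance (e.g. estimated quality or global appeal).
# Items and scores are hypothetical, purely for illustration.
relevance = np.array([0.9, 0.7, 0.2, 0.1])     # two globally relevant, two niche items
exposure = np.array([0.55, 0.30, 0.10, 0.05])  # observed share of impressions per item

# Equal exposure: every item should get the same share of impressions.
equal_target = np.full_like(relevance, 1.0 / len(relevance))

# Fair (relevance-proportional) exposure: impressions in proportion to relevance,
# so niche items are still shown, just less often than broadly relevant ones.
fair_target = relevance / relevance.sum()

def divergence(observed, target):
    """Simple L1 distance between observed and target exposure distributions."""
    return float(np.abs(observed - target).sum())

print("Divergence from equal exposure:", divergence(exposure, equal_target))
print("Divergence from fair exposure: ", divergence(exposure, fair_target))
```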

New Metrics and Measurements

Linked to the Quality of Service metric presented by Dr. Biega in her keynote, we noticed that several papers were looking at ways to measure performance not just for the whole population at once, but also for individual subgroups of users. Groups where performance was poor could then indicate relevant areas for algorithm improvement.

An example was the work by Zhang et al. [1], who divided users into three categories based on their mainstreaminess, computed as the average popularity of the items a user interacted with.
Their findings indicated that much of the performance captured in single-value metrics comes from the ‘mainstream’ group, and that performance for niche users is a lot harder to get right.
Other ways of splitting the user groups also exist, such as splitting by sensitive attribute (gender, age, …) to guarantee fair results between these user groups.
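
As a rough sketch of this kind of per-group analysis (not the authors' code), one could score each user's mainstreaminess as the mean popularity of the items they interacted with, split users at the tertiles into three groups, and report any evaluation metric per group. The interaction data and the random metric values below are placeholders.

```python
from collections import Counter, defaultdict
import numpy as np

# Hypothetical interaction data: user_id -> items the user interacted with.
interactions = {
    "u1": ["i1", "i1", "i2"],
    "u2": ["i3", "i4"],
    "u3": ["i1", "i2", "i2", "i5"],
    "u4": ["i1", "i1", "i1", "i2"],
}

# Item popularity: number of interactions per item across all users.
popularity = Counter(item for items in interactions.values() for item in items)

# Mainstreaminess per user: average popularity of the items they interacted with.
mainstreaminess = {
    user: float(np.mean([popularity[item] for item in items]))
    for user, items in interactions.items()
}

# Split users into three groups at the tertiles of the mainstreaminess scores.
low, high = np.quantile(list(mainstreaminess.values()), [1 / 3, 2 / 3])

def user_group(user):
    score = mainstreaminess[user]
    return "niche" if score <= low else "diverse" if score <= high else "mainstream"

# Report a metric per group instead of one number for the whole population.
# metric_per_user would come from your evaluation pipeline; random values here.
rng = np.random.default_rng(0)
metric_per_user = {user: rng.random() for user in interactions}

per_group = defaultdict(list)
for user, value in metric_per_user.items():
    per_group[user_group(user)].append(value)

for name, values in per_group.items():
    print(f"{name}: mean metric = {np.mean(values):.3f} over {len(values)} users")
```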

In the main track, Türkmen et al. [2] presented a metric to measure the difference between an algorithm’s recommendations and those of the baselines used in the study.
In the Information Retrieval (IR) domain, many of the leaderboards are dominated by BERT-based models, which give very similar results, with minor differences leading to slightly different scores.
The authors argue that this can lead to local “hill climbing”, rather than radical advances.
Their proposed metric aims to address this by giving high value to retrieved items that the baselines did not find.
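
The exact formulation is in the paper; as an illustrative stand-in (my own sketch, not the authors' metric), the snippet below only rewards relevant documents that none of the baseline runs retrieved, which captures the spirit of valuing genuinely new findings. The document lists and relevance judgements are made up.

```python
def novelty_reward(run, baselines, relevant, k=10):
    """
    Illustrative metric (not the paper's exact definition): the fraction of
    the top-k of `run` that is relevant and that no baseline retrieved in
    its own top-k. Rewards systems that surface relevant results the
    established baselines missed.
    """
    baseline_pool = set()
    for baseline in baselines:
        baseline_pool.update(baseline[:k])

    novel_hits = [
        doc for doc in run[:k]
        if doc in relevant and doc not in baseline_pool
    ]
    return len(novel_hits) / k


# Hypothetical ranked document lists and relevance judgements.
bert_like_baselines = [
    ["d1", "d2", "d3", "d4"],
    ["d2", "d1", "d3", "d5"],
]
candidate_run = ["d7", "d1", "d9", "d3"]
relevant_docs = {"d1", "d3", "d7", "d9"}

print(novelty_reward(candidate_run, bert_like_baselines, relevant_docs, k=4))
# d7 and d9 are relevant and unseen by the baselines -> 2/4 = 0.5
```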

Interfaces Matter

It may not come as a surprise, but the design of the interface through which a user interacts with a recommendation system has a huge impact on the results and the perceived utility for the user.
During the conference, multiple works looked at different ways in which the interface impacts user behaviour, and thus recommendation results.

In her BIAS workshop keynote, Asia Biega gave an example of how interfaces impact the fairness of recommendations. In the study, they used a website that allows people to sponsor schools in the US. One group of users was given access to a search bar in addition to a list of recommended schools; the second group only got the recommended list.
Analysing the experiment, they found that the first group was much more likely to back a school local to their area, and so schools in smaller communities received less money, even though they needed it more. A follow-up study with banners indicating whether a school was in high need of support raised fairness further, as users were nudged to support the schools that needed it, rather than schools they knew or that were local to them.

In her keynote, Mounia Lalmas indicated that Spotify takes care to align the UI with the recommendations, for example by keeping explorative recommendations (novel recommendations for the user) and exploitative recommendations (obvious classics the user has listened to repeatedly) clearly separated on the home page and in separate playlists.

A user study presented by Putkonen et al. [3] showed that users read news homepages differently depending on how they are structured and what information about each item is shown.
In their study, they compared a single-column homepage (often found on mobile) with a multi-column homepage (a grid, often found on desktop). They also varied which information was presented for each item (title, image, description). Their results showed that in the grid presentation, users considered items in a much more fragmented way, jumping from one item to another, whereas in the single-column setup users considered each item for longer and more attentively.
Interestingly, they mentioned during the presentation that they could not measure a clear difference in actual click-through rates between the two home pages.

Further work relating the interface to evaluation was presented by Olivier Jeunen during the Industry Day on Thursday. In experiments at ShareChat, they found that the traditional discount applied in nDCG (1/log₂(rank + 1)) did not correspond to the attention each article gets in their feed. Especially at lower positions in the (infinite-scroll) feed, the logarithmic discount was much higher than the exposure those positions received in practice. To solve this, they propose fitting an exposure distribution to collected data; in their case, a Yule-Simon distribution fit the data best.
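
As a rough sketch of this idea (my own illustration, not ShareChat's implementation), one could fit the Yule-Simon shape parameter to logged exposure positions by maximum likelihood and compare the fitted curve with the logarithmic discount. The exposure data below is synthetic, and scipy's yulesimon distribution is used for the probability mass function.

```python
import numpy as np
from scipy.stats import yulesimon

# Hypothetical observed data: for each impression, the feed position that was
# actually exposed (scrolled into view). In practice this would come from logs;
# here it is synthetic data drawn from a Yule-Simon distribution.
rng = np.random.default_rng(42)
observed_positions = yulesimon.rvs(2.0, size=10_000, random_state=rng)

# Fit the Yule-Simon shape parameter alpha with a simple grid-search MLE.
alphas = np.linspace(0.5, 5.0, 200)
log_liks = [yulesimon.logpmf(observed_positions, a).sum() for a in alphas]
alpha_hat = alphas[int(np.argmax(log_liks))]

# Compare the fitted exposure curve with the traditional nDCG-style discount.
ranks = np.arange(1, 21)
log_discount = 1.0 / np.log2(ranks + 1)
log_discount /= log_discount.sum()          # normalise to a distribution
fitted_exposure = yulesimon.pmf(ranks, alpha_hat)
fitted_exposure /= fitted_exposure.sum()

for r, d, e in zip(ranks, log_discount, fitted_exposure):
    print(f"rank {r:2d}: log discount {d:.3f} vs fitted exposure {e:.3f}")
```

With these synthetic numbers, the logarithmic discount decays far more slowly than the heavy-tailed exposure curve, which illustrates how deep positions can be over-credited by the standard nDCG discount.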

Conclusion

Throughout the conference, it was eye-opening to hear about problems in fields related to recommender systems. It is rewarding to look for ideas and solutions that are applicable in our field, even when they were not explicitly proposed for it.

Resources

  • [1] Zhang, K., Xie, M., Zhang, Y., Zhang, H. (2023). Implicit Feedback for User Mainstreaminess Estimation and Bias Detection in Recommender Systems. https://www.youtube.com/watch?v=K9Sssp2QaHY
  • [2] Türkmen, M.D., Lease, M., Kutlu, M. (2023). New Metrics to Encourage Innovation and Diversity in Information Retrieval Approaches. In: Advances in Information Retrieval. ECIR 2023.
  • [3] Putkonen, A., Nioche, A., Laine, M., Kuuramo, C., Oulasvirta, A. (2023). Fragmented Visual Attention in Web Browsing: Weibull Analysis of Item Visit Times. In: Advances in Information Retrieval. ECIR 2023.
