Beyond Engagement: Aligning Algorithmic Recommendations With Prosocial Goals
Much of the media we see online — whether from social media, news aggregators, or trending topics — is algorithmically selected and personalized. Content moderation addresses what should not appear on these platforms, such as misinformation and hate speech. But what should we see, out of the thousands or millions of items available? Content selection algorithms are at the core of our modern media infrastructure, so it is essential that we make principled choices about their goals.
The algorithms making these selections are known as “recommender systems.” On the Internet, they have a profound influence over what we read and watch, the companies and products we encounter, and even the job listings we see. These algorithms are also implicated in problems like addiction, depression, and polarization. In September 2020, Partnership on AI (PAI) brought together a diverse group of 40 interdisciplinary researchers, platform product managers, policy experts, journalists, and civil society representatives to discuss the present and future of recommender systems. This unique workshop on recommender-driven media covered three topics:
- How recommenders choose content today.
- What should be the goal of recommenders, if not audience engagement?
- Emerging technical methods for making recommenders support such goals.
Several promising directions for future recommender development emerged from the workshop’s presentations and subsequent discussions. These included: more understandable user controls, the development of survey-based measures to refine content selection, paying users for better data, recommending feeds rather than individual items, and creating a marketplace of feeds. The workshop also resulted in the first-ever bibliography of research articles on recommender alignment, contributed by workshop participants.
How recommenders choose content
Recommender systems first emerged in the mid-1990s to help users filter the increasing deluge of posts on Usenet, then the main discussion forum for the fledgling Internet. One of the very first systems, GroupLens, asked users for “single keystroke ratings” and tried to predict which items each user would rate highly, based on the ratings of similar users. Netflix’s early recommender systems similarly operated on user-contributed star ratings. But it proved difficult to get users to rate each post they read or movie they watched, so recommender designers began turning to signals like whether a user clicked a headline or bought a product. By the mid-2000s, systems like Google News relied on a user’s click history to select personalized information.
Today’s recommender systems use many different kinds of user behavior to determine what to show each user, from clicks to comments to watch time. These are usually combined in a scoring formula which weights each type of interaction according to how strong a signal of value it’s thought to be. The result is a measure of “engagement,” and the algorithmic core of most recommender systems is a machine learning model that tries to predict which items will get the most engagement.
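As a minimal illustration (the interaction types, weights, and item schema below are hypothetical, not any platform’s actual values), such a scoring rule might look like this:

```python
# Hypothetical engagement scoring rule: a weighted sum of predicted interaction
# probabilities produced by an ML model. Weights and interaction types are illustrative.
ENGAGEMENT_WEIGHTS = {"click": 1.0, "like": 2.0, "comment": 4.0, "share": 6.0}

def engagement_score(predicted_interactions: dict) -> float:
    """Combine per-interaction predictions into a single engagement score."""
    return sum(weight * predicted_interactions.get(signal, 0.0)
               for signal, weight in ENGAGEMENT_WEIGHTS.items())

def rank_by_engagement(candidates: list) -> list:
    """Rank candidate items (each carrying a 'predictions' dict) by predicted engagement."""
    return sorted(candidates,
                  key=lambda item: engagement_score(item["predictions"]),
                  reverse=True)
```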
Engagement is closely aligned to both product and business goals, because a system which produces no engagement is a system which no one uses. This is true regardless of the type of content on the platform (e.g. news, movies, social media posts) and regardless of business model (e.g. ads, subscriptions, philanthropy). The problem is that not everything that is engaging is good for us — an issue that has been recognized since the days of sensationalized yellow journalism. The potential harmful effects of optimizing for engagement, from the promotion of conspiracy theories to increased political polarization to addictive behavior, have been widely discussed, and the question of whether and how different platforms are contributing to these problems is complex.
Even so, engagement dominates practical recommendations, including at public-interest news organizations like the BBC. Sometimes high engagement means the system has shown the user something important or sparked a meaningful debate, but sensational or extreme content can also be engaging. Recommender systems need more nuanced goals, and better information about what users need and want.
Building better metrics
Most modern AI systems are based on optimization, and if engagement is not a healthy objective then we need to design better processes for measuring the things we care about. The challenge that recommender designers face is expressing high-level concepts and values in terms of low-level data such as clicks and comments.
There’s a huge gap between such high-level qualities and the low-level data actually available to these systems, which includes user clicks and likes, the digital representation of the content itself, and metadata such as user and item location. The concepts we care about have to be operationalized — that is, translated from abstract ideas to something that can be repeatedly measured — before they can be used to drive AI systems. As a simple example, the “timeliness” of an item can be defined so that posts within the last day or the last week are considered most timely and gradually age out. Recent PAI research analyzed how Facebook operationalized the much more complex concept of a “meaningful social interaction,” and how YouTube operationalized “user satisfaction” as something more than just the time spent watching videos.
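For instance, one hypothetical way to operationalize timeliness is an exponential decay on the item’s age, so that a day-old post still scores near the top while a week-old post has mostly aged out (the 48-hour half-life here is purely an illustrative choice):

```python
from datetime import datetime, timezone
from typing import Optional

def timeliness(published_at: datetime, now: Optional[datetime] = None,
               half_life_hours: float = 48.0) -> float:
    """Score in (0, 1]: 1.0 for a brand-new item, halving every `half_life_hours`.
    The half-life is an illustrative assumption, not a recommended value."""
    now = now or datetime.now(timezone.utc)
    age_hours = max((now - published_at).total_seconds() / 3600.0, 0.0)
    return 0.5 ** (age_hours / half_life_hours)
```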
More complex ideas like “credibility” or “diversity” have proven quite difficult to translate into algorithmic terms. The News Quality Initiative (NewsQ) has been working with panels of journalists and technologists to try to define appropriate goals for content selection. As one of the journalists put it, “any confusion that existed among journalists regarding principles, standards, definitions, and ethics has only travelled downstream to platforms.”
The NewsQ panel studying opinion journalism recommended that “opinion” content should be clearly labelled and separated from “news” content, but noted that these labels are not used consistently by publishers. The NewsQ analysis of local journalism counted the number of news outlets which appeared in the top five stories from each place, e.g. “in the Des Moines feed we reviewed, 85 of 100 [top five] articles were pulled from four outlets.” The report calls for increased outlet diversity, but does not specify what an acceptable number of outlets in the top five results would be. While there is a deep history of journalistic practice and standards that can guide the design of news recommenders, defining a consensus set of values and translating them into algorithmic terms remains a challenge.
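As a rough sketch of the kind of outlet-diversity audit NewsQ describes (the feed and item fields are an assumed schema), one could count how many distinct outlets supply the top-five slots across a sample of feeds:

```python
from collections import Counter

def outlet_concentration(feeds: list, top_n: int = 5) -> Counter:
    """Count how often each outlet appears in the top-N slots across sampled feeds.
    Each feed is a ranked list of items with an 'outlet' field (assumed schema)."""
    counts = Counter()
    for feed in feeds:
        for item in feed[:top_n]:
            counts[item["outlet"]] += 1
    return counts

# Example: outlet_concentration(sampled_feeds).most_common(5) shows whether a
# handful of outlets dominate the top results, as in the Des Moines example above.
```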
Even well-chosen metrics suffer from a number of problems. Using a metric as a goal or incentive changes its meaning, a very general problem sometimes known as Goodhart’s law. Metrics also break down when the world changes, just as a number of machine learning models stopped working when COVID reshaped the economy. And of course, qualitative research is essential: If you don’t know what is happening to your users, you can’t know that you should be measuring something new. Still, metrics are indispensable tools for grappling with scale.
As is true of AI in general, many of the problems with recommenders can be traced to mismatches between a theoretical concept and how it’s operationalized. For example, early news recommender systems operationalized “valuable to user” as “user clicked on the headline.” Clicks are indeed a signal of user interest, but what we now call “clickbait” lives entirely in the difference between user value and user clicks.
Controls and surveys
Many of the potential problems with recommenders might be alleviated by giving users more control. Yet few users actually use controls when offered: workshop participants who run recommenders noted that only one or two percent of their users actually adjust a control. It’s possible that this is because the controls that have been offered so far aren’t that well-designed or effective. For example, it’s not immediately obvious what will happen when you click on “see less often” on Twitter or “hide post” on Facebook. Better feedback on what such controls do might encourage their use, and one interesting idea is to show users a preview of how their feed will change.
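One hedged sketch of such a preview (the ranking function and item schema here are assumptions): re-rank the candidate pool under both the current and the proposed control settings, and show the user the difference.

```python
def preview_control_change(rank_fn, candidates: list, current_settings: dict,
                           proposed_settings: dict, feed_size: int = 20) -> dict:
    """Show which items would enter or leave the visible feed if the user applied
    a control change. `rank_fn(candidates, settings)` is an assumed ranking API."""
    current = {item["id"] for item in rank_fn(candidates, current_settings)[:feed_size]}
    proposed = {item["id"] for item in rank_fn(candidates, proposed_settings)[:feed_size]}
    return {"added": proposed - current, "removed": current - proposed}
```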
Providing better control over content selection is crucial because it gives users agency, and also a kind of transparency as controls reveal something of how the underlying algorithm works. Yet even if controls were ten times as popular, most users would still not be using them. This means that a recommender’s default settings need to account for goals other than engagement, using some other source of data.
Surveys can offer much richer data than engagement because a survey question can ask almost anything. Surveys can help clarify why people use products the way they do, and how they feel about the results. Facebook asks users whether particular posts are “worth your time,” while YouTube sometimes asks users how satisfied they are with their recommendations. These are much more nuanced concepts than “engagement.” This sort of survey response data is usually used to train a machine learning model to predict people’s survey responses, just as recommenders already try to predict whether a user will click on something. Predicted survey answers are a nuanced signal that can be added into standard recommender scoring rules and directly weighted against other predicted behavior such as clicks and likes.
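A minimal sketch of how that blending might work (feature names, labels, and weights are invented for illustration): fit a model on the small set of items that actually received survey answers, then use its predictions as one more weighted term in the scoring rule.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def fit_survey_model(X_surveyed: np.ndarray, y_surveyed: np.ndarray) -> LogisticRegression:
    """Train on the sparse survey data: item/user features X, and y = 1 if the
    respondent said the item was 'worth your time' (hypothetical label)."""
    return LogisticRegression(max_iter=1000).fit(X_surveyed, y_surveyed)

def blended_score(item_features: np.ndarray, p_click: float,
                  survey_model: LogisticRegression,
                  w_click: float = 1.0, w_survey: float = 5.0) -> float:
    """Weight the predicted survey answer against predicted behavior (illustrative weights)."""
    p_worth_time = survey_model.predict_proba(item_features.reshape(1, -1))[0, 1]
    return w_click * p_click + w_survey * p_worth_time
```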
Despite their flexibility, there are a number of problems with using surveys to control content selection. The biggest problem is that people don’t come to platforms to fill out surveys. Too many surveys cause survey fatigue, where people become less likely to respond to surveys in the future. This severely limits the amount of data that can be collected through surveys, which makes the models constructed from survey responses far less accurate. Also, certain types of people are more or less likely to respond to surveys, which introduces response bias. Opt-in surveys also don’t provide good data on the same individuals over time.
Many of these problems could be solved by paying a panel of users for more frequent feedback. It’s easy to tell if showing an article leads to a click or a share, but much harder to tell if it contributes to user well-being or healthy public discussion. To drive recommender behavior, we’ll need to know not just what users think about any particular item, but how their opinion changes over time as the algorithm adjusts to try to find a good mix of content.
Well-being metrics
AI developers are not the first people to think about the problem of capturing deep human values in metrics. Ever since GDP was introduced as a standardized measure of economic activity in 1934, it has come under criticism for being a narrow and myopic goal, and researchers have looked to replace it with more meaningful measures.
In the last decade, well-being measures have seen increasing use in policy-making. While the question “what is happiness?” is ancient, the 20th century saw the advent of positive psychology and systematic research into this question. Well-being is a multi-dimensional construct, and both subjective and objective measures are needed to get a clear picture. For example, the OECD Better Life Index includes both crime rates and surveys asking if people “feel safe walking alone at night.” To demonstrate how well-being metrics work, workshop participants took a survey which measured well-being across a variety of dimensions.
The IEEE has collected hundreds of existing well-being metrics that might be relevant to AI systems into a standard known as IEEE 7010. It also advocates a method to measure the well-being effect of a product change: take a well-being measurement before and after the change, and compare that to the difference in a control group of non-users. Facebook’s “meaningful social interaction” research is also framed in terms of well-being.
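That before/after comparison is essentially a difference-in-differences estimate; a worked sketch with invented numbers:

```python
def wellbeing_effect(users_before: float, users_after: float,
                     control_before: float, control_after: float) -> float:
    """Change among users of the product minus the change in a comparable
    non-user control group over the same period (difference-in-differences)."""
    return (users_after - users_before) - (control_after - control_before)

# Invented example: users move from 6.2 to 6.5 on a well-being scale while
# non-users move from 6.3 to 6.4, so the change is credited with a +0.2 effect.
effect = wellbeing_effect(6.2, 6.5, 6.3, 6.4)
```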
The well-being metrics used in policy-making don’t provide any sort of grand answer to the question of what an AI system like a recommender should be trying to achieve, and they’re not going to be specific enough for many AI domains. But they do have the advantage of representing a normative consensus that has developed over decades, and they provide guidance on the types of human experience worth considering when designing a metric.
Recommender recommendations
Despite widely differing perspectives, concerns, and types of recommenders, a number of themes emerged from the presentations and discussions. The practice of recommender alignment is in its infancy, but there are a number of places where progress can be made in the near future.
- Build better controls
While recommenders have offered user controls of one sort or another for many years, they still mostly aren’t used. That might change if the controls were better and not simply a button that says “I don’t want to see this item.” More expressive controls could adjust the proportions of different topics, or the degree to which internal signals like news source credibility estimates shape content ranking. It remains challenging to communicate to users what a control does and to give appropriate feedback. One possibility is showing which items would be added or removed as the user adjusts a control, and there is great opportunity for experimentation.
- Develop standardized survey-based metrics
Engagement isn’t going away, but we need better measures to capture what it misses. YouTube has begun incorporating data from user satisfaction surveys into recommendations, while Facebook uses several different types of questions. Carefully designed survey measures can provide critical feedback about how a recommender system is enacting high-level values. Standardized questions would provide guidance on what matters for recommenders, how to evaluate them, and allow comparison between different systems.
- Pay users for better data
Voluntary survey responses have not produced enough data to provide accurate, personalized signals of what should be recommended. The next step is to pay users for more detailed data, such as by asking them to answer a question daily. Such data streams could provide rich enough feedback to attempt more sophisticated techniques for content selection, such as reinforcement learning methods which try to figure out what will make the user answer the survey question more positively.
- Recommend feeds, not items
Almost all production recommender systems are based on scoring individual items, then showing the user the top-ranked content. User feedback is used to train this item scoring process, but these models can’t learn how to balance a mix of different types of content in the overall feed. For example, a naive news ranker might fill all top 10 slots with articles about Trump because they get high engagement. Even if each of these stories is individually worthwhile, the feed should likely include other topics. Existing recommender algorithms and controls are almost entirely about items, rather than taking a more holistic view of the feed; a simple feed-level re-ranking sketch appears after this list.
- Incentivize the creation of different feeds
Rather than trying to create one feed algorithm that fits everyone, we could have a variety of different feeds to serve different interests and values. If you trust the BBC, you might also trust a recommendation algorithm that they create, or their recommended settings for a particular platform. Similarly, you might trust a feed created by a doctor for health information. We normally think of media pluralism as being about a diversity of sources, but it may be important to have a diversity of feeds as well.
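As referenced under “Recommend feeds, not items” above, here is a minimal sketch of feed-level re-ranking (topic labels, scores, and the per-topic cap are all illustrative): greedily fill the feed in score order while capping how many slots any single topic can take.

```python
from collections import Counter

def diversify_feed(candidates: list, feed_size: int = 10, max_per_topic: int = 3) -> list:
    """Greedy re-rank: take items in score order, skipping items whose topic has
    hit its cap; backfill with skipped items if the feed would otherwise be short.
    Items are dicts with 'score' and 'topic' fields (assumed schema)."""
    ranked = sorted(candidates, key=lambda c: c["score"], reverse=True)
    feed, skipped, topic_counts = [], [], Counter()
    for item in ranked:
        if len(feed) >= feed_size:
            break
        if topic_counts[item["topic"]] < max_per_topic:
            feed.append(item)
            topic_counts[item["topic"]] += 1
        else:
            skipped.append(item)
    feed.extend(skipped[: feed_size - len(feed)])  # fill any remaining slots
    return feed
```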
Further reading
This list of research articles was solicited from workshop attendees. It’s by no means comprehensive, but is meant to serve as an introduction to the field of recommender alignment, including the intersection of recommenders with social issues such as addiction and polarization.
What is recommender alignment?
What are you optimizing for? Aligning Recommender Systems to Human Values
Jonathan Stray, Steven Adler, and Dylan Hadfield-Menell (2020)
Connects AI alignment to the practical engineering of recommender systems. Or see the paper video.
Science Communication Desperately Needs More Aligned Recommendation Algorithms
Lê Nguyên Hoang (2020)
If people are getting science information (e.g. about COVID-19) through recommenders, then those algorithms need to understand what quality science is.
Human Compatible: Artificial Intelligence and the Problem of Control
Stuart Russell (2019)
The definitive reference on AI alignment generally. Also see a five-minute video from Russell introducing the concept.
Values
Artificial Intelligence, Values, and Alignment
Iason Gabriel (2020)
A moral and political philosophy analysis arguing that “values” for AI can only come from social deliberation.
On the Democratic Role of News Recommenders
Natali Helberger (2019)
A discussion of the challenges and opportunities for algorithmic news distribution, from the view of political theory.
Recommender Systems and their Ethical Challenges
Silvia Milano, Mariarosaria Taddeo and Luciano Floridi (2019)
A systematic analysis of the ethical areas of concern regarding recommender systems.
Social choice ethics in artificial intelligence
Seth Baum (2020)
How social choice theory on voting intersects with AI systems, and what problems it can’t solve.
Alignment algorithms
Inverse Reward Design
Dylan Hadfield-Menell et al. (2017)
Introduces inverse reward design (IRD) “as the problem of inferring the true objective based on the designed reward and the training MDP.”
From Optimizing Engagement to Measuring Value
Smitha Milli, Luca Belli, Moritz Hardt (2020)
Determines how to weight different interactions on Twitter (e.g. retweet vs. fav) as signals of user “value” through latent variable inference from observed use of the “see less often” control.
Recommending what video to watch next: A Multitask Ranking System
Zhao et al. (2019)
A description of YouTube’s deep learning-based ranking framework, and how it balances “engagement objectives” with “user satisfaction objectives.”
Design and Controls
Beyond Optimizing for Clicks: Incorporating Editorial Values in News Recommendation
Feng Lu, Anca Dumitrache, David Graus (2020)
A news organization successfully used their recsys to increase the diversity of audience reading.
Designing for the better by taking users into account: A qualitative evaluation of user control mechanisms in (News) recommender systems
Harambam et al. (2019)
A user study that tested sliders to control topics, algorithms, etc. on a recsys mockup.
The Illusion of Control: Placebo Effects of Control Settings
Vaccaro et al. (2018)
Users appreciate recommender controls, even if they don’t work. Either way, people find creative workarounds to get what they want.
Recommender Audits
Auditing News Curation Systems: A Case Study Examining Algorithmic and Editorial Logic in Apple News
Jack Bandy, Nicholas Diakopoulos (2019)
An analysis of Apple News content in the human-curated vs. algorithmically curated sections.
Algorithmic extremism: Examining YouTube’s rabbit hole of radicalization
Mark Ledwich, Anna Zaitsev (2020)
Analysis of recommender traffic flows between different categories of more and less extreme content.
Evaluating the scale, growth, and origins of right-wing echo chambers on YouTube
Hosseinmardi et al. (2020)
Consumption of far-right content on YouTube is consistent with broader patterns across the web, which complicates the causal role of recommender systems.
Participatory Design and Multi-stakeholder Recommenders
WeBuildAI: Participatory Framework for Algorithmic Governance
Lee et al. (2019)
Researchers worked with multiple stakeholders in a donated food delivery non-profit, developing a quantitative ranking model from their input. Also a talk.
What If I Don’t Like Any Of The Choices? The Limits of Preference Elicitation for Participatory Algorithm Design
Samantha Robertson and Niloufar Salehi (2020)
Existing societal inequalities can constrain users’ ability to exploit algorithmically provided choices, for example due to a lack of information or the cost burden of choosing the “best” option.
Multistakeholder recommendation: Survey and research directions
Abdollahpouri et al. (2020)
A review of the field of multi-stakeholder recommendation algorithms, which attempt to simultaneously account for the needs of users, creators, platforms and the public.
Filter bubbles, polarization and conflict
Polarization and the Global Crisis of Democracy: Common Patterns, Dynamics, and Pernicious Consequences for Democratic Polities
Jennifer McCoy, Tahmina Rahman, and Murat Somer (2018)
One of the best reviews of why polarization is something we should care about: it can be a cause (not just a correlate) of democratic erosion.
Social Media, News Consumption, and Polarization: Evidence from a Field Experiment
Roee Levy (2020)
Getting people to follow counter-ideological news sources on Facebook for a month slightly reduced polarization.
Cross-Country Trends in Affective Polarization
Boxell, et al. (2020)
Measures polarization trends across nine OECD countries. Tends to disfavor the emergence of the internet and rising economic inequality as explanations.
Is the Internet Causing Political Polarization? Evidence from Demographics
Boxell et al. (2017)
In the US, growth in polarization in recent years is largest for the demographic groups least likely to use the internet and social media.
It’s not the technology, stupid: How the ‘Echo Chamber’ and ‘Filter Bubble’ metaphors have failed us
Axel Bruns (2019)
Reviews empirical investigations of filter bubbles etc. and argues that the concept is under-specified and the evidence for their existence is poor.
Fairness
FAccTRec 2020 workshop
The most recent papers at the intersection of fairness and recommenders.
Towards a fair marketplace: Counterfactual evaluation of the trade-off between relevance, fairness & satisfaction in recommendation systems
Mehrotra et al. (2018)
An analysis of artist exposure dynamics in Spotify’s music recommendations, and experiments with a popularity-based diversity metric.
Fairness in recommendation ranking through pairwise comparisons
Beutel et al. (2019)
A definition of recommender fairness in terms of pairwise rankings of items in different subgroups, and experiments in a production Google recommender.
Well-being
The Welfare Effects of Social Media
Allcott et al. (2020)
Paying people to deactivate Facebook for four weeks caused slight increases in well-being measures and slight decreases in polarization measures.
Social networking sites and addiction: Ten lessons learned
Daria Kuss, Mark Griffiths (2017)
A review of the definition of “social networking addiction,” and the evidence for its existence, including demographic and contextual variations.
A systematic review: the influence of social media on depression, anxiety and psychological distress in adolescents
Betul Keles, Niall McCrae and Annmarie Grealish (2020)
Social media use is correlated with depression and anxiety. However, there are considerable caveats on causal inference due to methodological limitations.
Policy
Why Am I Seeing This? How Video and E-Commerce Platforms Use Recommendation Systems to Shape User Experiences
Spandana Singh (2020)
A detailed report of how platforms use recommender systems and the public policy implications.
Regulating Recommending: Motivations, Considerations, and Principles
Jennifer Cobbe, Jatinder Singh (2019)
One of the most nuanced analyses of possible regulatory frameworks, focussing on “harmful” content and liability laws.
Metrics
Aligning AI Optimization to Community Well-being
Jonathan Stray (2020)
An analysis of recent Facebook and YouTube recommender changes, and how recommenders can be oriented toward well-being metrics.
What Are Meaningful Social Interactions in Today’s Media Landscape? A Cross-Cultural Survey
Litt et al. (2020)
Survey analysis, by Facebook researchers, of what makes a social interaction (online or off) “meaningful,” finding similarities across the U.S., India, and Japan.
Simple objectives work better
Delgado et al. (2019)
Case study of multi-objective optimization in a corporate setting: metrics used at Groupon to integrate expected transaction profit, likelihood of purchase, etc.