Latest LinkedIn AI Research on Using Bayesian Optimization for Balancing Metrics in Recommendation Systems
Most large-scale recommender systems, such as newsfeed ranking, people recommendations, and job recommendations, have many objectives that must be optimized simultaneously. User engagement, diversity, novelty, freshness, and fairness are examples of these objectives. These goals can sometimes conflict, so it is essential to strike a balance between them. Many LinkedIn products (such as the homepage feed) use multi-objective optimization (MOO) to balance different behaviors in our ecosystem.
MOO has two key components: training models that estimate the likelihood of a specific behavior (like looking for a job) and optimization algorithms that search for the best hyperparameters to balance the various possible objectives.
This work automates parameter tuning in the MOO machine learning model that recommends content on the newsfeed, presents a generic optimization methodology, and describes a unified platform designed to simplify onboarding new use cases.
The LinkedIn Notifications recommendation system, which informs members about various activities inside their network, is one case that is optimized for multiple objectives.
By keeping guardrail metrics like “disables” neutral, the goal is to boost click-through rate (CTR) and increase the number of sessions. Sending more notifications may improve the overall number of sessions, but it may hurt CTR because notification quality may suffer. Separate models optimize for CTR and sessions, and a linear combination of those models’ outputs is used to decide which notifications to send.
Assume we have n metrics m1, m2, …, mn for which we have built models M1, M2, …, Mn to optimize, and model Mx is the final production model:
Mx = M1 + x1*M2 + x2*M3 + … + xn-1*Mn
where x = (x1, x2, …, xn-1) is an adjustable combination parameter vector for balancing the objectives. Different choices of x trace out a Pareto front, on which it is impossible to improve one metric without hurting another.
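As a sketch, the linear blend above can be computed per item, treating each model output as a plain score. The function name and example numbers here are illustrative, not from the article:

```python
import numpy as np

def combined_score(model_scores, x):
    """Blend n model outputs into one ranking score.

    model_scores: [M1(item), M2(item), ..., Mn(item)]
    x: combination parameters (x1, ..., xn-1)
    Implements Mx = M1 + x1*M2 + ... + xn-1*Mn.
    """
    scores = np.asarray(model_scores, dtype=float)
    # The primary model gets an implicit weight of 1.
    weights = np.concatenate(([1.0], np.asarray(x, dtype=float)))
    return float(weights @ scores)

# Example: three models, two tuning parameters.
score = combined_score([0.8, 0.3, 0.5], [2.0, 0.5])  # 0.8 + 2.0*0.3 + 0.5*0.5
```

Tuning then reduces to searching over the vector x rather than retraining any of the underlying models.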
To find an appropriate x, one metric is identified as the primary metric (e.g., m1) and the others serve as guardrail metrics. Controlled A/B experiments launch the model Mx and collect the metrics m1(x), m2(x), …, mn(x) in order to solve the following constrained optimization problem:

maximize m1(x)  subject to  mi(x) >= ci, i = 2, …, n

where the thresholds c2, …, cn are the corresponding metrics of the control model in the A/B experiment. Random search and grid search are frequently used to test alternative combination parameters x.
In some instances, launching A/B tests can be difficult because:
- Large sample size is required for A/B tests.
- Multiple A/B tests are running in parallel in LinkedIn’s production environment.
- The available sample size for tuning the combination parameters is therefore restricted.
A/B testing also cannot adapt traffic on the fly. Reallocating traffic toward more promising variants as results come in reduces the risk of harmful variants running for an extended period, but doing so requires diverting traffic away from variants with poor metrics, which traditional A/B tests cannot accomplish.
Furthermore, setting up A/B tests can take time, and manually establishing and monitoring them is not the most efficient use of engineers’ time.
To overcome these obstacles, we use Bayesian optimization.
Bayesian optimization is a sequential approach for optimizing “opaque-box” functions that are expensive to evaluate. It searches for the optimal hyperparameters iteratively until convergence and is suited to objectives with unknown functional forms. A surrogate stands in for the objective function, and the uncertainty in the objective is quantified using Gaussian process regression. Bayesian optimization has two parts:
- A method for generating posterior distributions of unknown functions via function fitting;
- A candidate acquisition function that recommends the next candidate.
The online metrics m1(x), m2(x), …, mn(x) are noisy and nonlinear, so they are modeled using Gaussian processes. The unknown hyperparameters in the kernel functions are estimated by maximizing the marginal likelihood of the observed mi(x).
The RBF kernel and the Matérn kernel are two popular kernel choices. Once the kernel hyperparameters are replaced by their estimated values, the Gaussian process yields posterior distributions over the underlying metric functions.
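A minimal sketch of the Gaussian process fit, using a plain-NumPy RBF kernel. For brevity the kernel hyperparameters are fixed rather than optimized by marginal likelihood, and the observed metric values and length scale are made up for illustration:

```python
import numpy as np

def rbf_kernel(A, B, length_scale=0.5, variance=1.0):
    """Squared-exponential (RBF) kernel matrix between point sets A and B."""
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return variance * np.exp(-0.5 * d2 / length_scale**2)

def gp_posterior(X_obs, y_obs, X_new, noise=1e-2):
    """Posterior mean and covariance of a zero-mean GP at new points X_new."""
    K = rbf_kernel(X_obs, X_obs) + noise * np.eye(len(X_obs))
    Ks = rbf_kernel(X_obs, X_new)
    Kss = rbf_kernel(X_new, X_new)
    K_inv = np.linalg.inv(K)
    mean = Ks.T @ K_inv @ y_obs
    cov = Kss - Ks.T @ K_inv @ Ks
    return mean, cov

# Hypothetical noisy metric observations m(x) at five tried parameter values.
X_obs = np.array([[0.2], [0.5], [0.8], [1.1], [1.4]])
y_obs = np.array([0.31, 0.42, 0.46, 0.44, 0.37])

# Posterior mean and uncertainty at two candidate parameter values.
mean, cov = gp_posterior(X_obs, y_obs, np.array([[0.65], [0.95]]))
```

The posterior covariance is what lets Thompson sampling, described below, weigh uncertain candidates against ones that merely look good on average.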
Because the observed metrics contain inherent noise, the goal is reformulated as optimizing the underlying mean functions fi(x) fitted by the Gaussian processes. The constrained problem above becomes:

maximize f1(x)  subject to  fi(x) >= ci, i = 2, …, n
Then, by transforming the constraints into indicator functions, the constrained problem becomes an unconstrained one:

maximize U(x) = f1(x) + λ * Σi=2..n 1{ fi(x) >= ci }

where 1{ fi(x) >= ci } is an indicator function and λ is a large positive constant.
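The penalized objective can be sketched directly from that definition; the function name, λ value, and example numbers are illustrative:

```python
LAMBDA = 1e6  # the large positive constant from the text

def penalized_utility(f_values, thresholds, lam=LAMBDA):
    """Unconstrained objective U(x) = f1(x) + lam * sum_i 1{fi(x) >= ci}.

    f_values: [f1(x), f2(x), ..., fn(x)] -- fitted metric values at x
    thresholds: [c2, ..., cn] -- guardrail thresholds from the control model
    """
    primary = f_values[0]
    # Each satisfied guardrail contributes lam; each violation forfeits it.
    satisfied = sum(1 for f, c in zip(f_values[1:], thresholds) if f >= c)
    return primary + lam * satisfied
```

Because λ dwarfs any realistic metric value, a candidate that satisfies more guardrails always outranks one that satisfies fewer; among candidates satisfying all guardrails, the primary metric decides.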
The second step in Bayesian optimization is defining an acquisition function to recommend the next candidate. Thompson sampling was used because it delivers probabilistic recommendations. Assume we have N candidates x1, x2, …, xN. Thompson sampling chooses xi with probability pi = P(U(xi) is the greatest among U(x1), …, U(xN)).
When the probability of xi being optimal has no closed form, it can be approximated with Monte Carlo sampling: draw many posterior samples of U(x) and count how often each xi is the optimum.
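The Monte Carlo approximation can be sketched as follows, assuming joint posterior draws of U at the N candidates are already available (here simulated with independent normals purely for illustration):

```python
import numpy as np

def thompson_probabilities(posterior_samples):
    """Estimate p_i = P(candidate i maximizes U) by Monte Carlo.

    posterior_samples: array of shape (S, N) -- S posterior draws of
    U(x1), ..., U(xN) from the fitted Gaussian process.
    Returns a length-N probability vector (the discrete distribution Ft).
    """
    winners = np.argmax(posterior_samples, axis=1)  # best candidate per draw
    counts = np.bincount(winners, minlength=posterior_samples.shape[1])
    return counts / len(posterior_samples)

# Hypothetical: 10,000 posterior draws for 3 candidate parameter vectors.
rng = np.random.default_rng(0)
means = np.array([0.40, 0.45, 0.42])
samples = rng.normal(means, 0.05, size=(10_000, 3))
p = thompson_probabilities(samples)  # the middle candidate gets the largest share
```

Note how candidates with lower means still receive nonzero probability: their posterior uncertainty leaves some chance they are actually best, which is what keeps exploration alive.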
The discrete distribution Ft produced by one iteration of Bayesian optimization can be expressed as a list of tuples (x1, p1), …, (xN, pN). This result fits the online A/B test framework: the treatment group is randomly divided into N groups, and the group holding proportion pi of the subjects is served the model with combination parameter xi.
Using Bayesian optimization to identify the best hyperparameters for adjusting Notification models.
Notifications and emails are helpful tools in the LinkedIn app for keeping members up to speed on activities they may have missed. Figure 1 below shows a post shared by the Lyft co-founder discussing his vision of a driverless future; this could be a potential notification candidate for members interested in the self-driving field. These kinds of notifications are known as activity-based notifications.
The Notifications platform is a streaming system that reads activity events from a Kafka queue. Each activity event carries a content id ck and an actor id ai. Given the content id ck and actor id ai, candidate generation produces the set of n recipients rikj, 1 <= j <= n, who may be interested in being notified. For each tuple { ai, ck, rikj }, the system decides whether to send or drop the notification. The purpose of sending a notification is to connect members with interesting content. The effectiveness of activity-based notifications can be measured in various ways; the most evident is the number of times members click on them.
Members may be encouraged to visit the site to begin a session due to notifications. Separate ML models are used to model both of these elements.
The scores of the two models are combined for each notification candidate, and if the combined score exceeds the threshold, the notification is delivered:

pClick(ai, ck, rikj) + 𝛼 * ΔpVisit(rikj) > 𝜸

where pClick models the likelihood of a click by member rikj, and ΔpVisit is the difference in the likelihood of a visit (over a defined time horizon) between sending a notification now and not sending it; it can also be written as p(Visit | send) – p(Visit | not send). 𝛼 measures the relative importance of the two terms and 𝜸 is the applied threshold.
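A minimal sketch of the send decision under this scoring rule; the function name and example probabilities are hypothetical:

```python
def should_send(p_click, p_visit_if_send, p_visit_if_not, alpha, gamma):
    """Combine the two model outputs and apply the send threshold.

    score = pClick + alpha * dpVisit, where
    dpVisit = p(Visit | send) - p(Visit | not send).
    The notification for a tuple (actor, content, recipient) is sent
    iff score > gamma.
    """
    dp_visit = p_visit_if_send - p_visit_if_not
    score = p_click + alpha * dp_visit
    return score > gamma

# Example: click prob 0.12, visit lift 0.05, alpha = 2.0, threshold 0.2.
# Combined score is 0.12 + 2.0 * 0.05 = 0.22, which clears the threshold.
decision = should_send(0.12, 0.30, 0.25, alpha=2.0, gamma=0.2)
```

The tuning problem below is exactly the search for the pair (𝛼, 𝜸) passed into this rule.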
To identify x = { 𝛼, 𝜸 } for ramping the models, business metrics such as Sessions, Impressed CTR, and Notification Send Volume come into play.
The following optimization problem is solved:

maximize Sessions(x)  subject to  Impressed CTR(x) >= c1 and Send Volume(x) <= c2

The constants c1 and c2 ensure that the new model performs at least as well as the previous control model.
Bayesian optimization can be used to solve this problem as described earlier. First, Sessions(x), Impressed CTR(x), and Send Volume(x) are each fit with a Gaussian process.
Then the fitted functions replace the original metrics in the optimization problem.
λ is a large constant that ensures all constraints are met. The discrete distribution Ft is obtained via Thompson sampling and is represented as a list of (combination parameter, probability) tuples (x1, p1), …, (xN, pN).
Building the aforementioned Bayesian optimization methodology into a library that can be used by multiple LinkedIn teams.
The library is designed as a plugin that creates a Hadoop workflow template used offline. A client team of the library has an offline Spark workflow and an online component that uses the library’s output parameters while serving member requests. There are two main components.
The online component accepts member requests, while the offline component computes parameters and saves them in the Parameter Distribution Store.
When member mi accesses the LinkedIn platform, the values of the hyperparameters xi corresponding to the member mi are first resolved using the technique described in “Online parameter assignment” (below). Then the items are scored and displayed to the member via the UI. Every action made by the member mi triggers an event on the platform.
Before the Optimizer flow launches, the Utility Aggregator job gathers the data.
Each parameter set value has its own record in the final dataset. For example, if we are tuning a single hyperparameter x over seven distinct values, the Utility Aggregator job’s output will have one record per value, for a total of seven records.
As described in the “Notification application” section above, this flow takes an optimization problem and the output of the Utility Aggregator as inputs, and outputs a discrete distribution Ft produced by Thompson sampling. This output is written to the Parameter Distribution Store.
The distribution Ft stored in the Parameter Distribution Store is fetched when a member visits.
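The article does not spell out how a member is mapped to a bucket of Ft. One common approach, sketched here purely as an assumption, is deterministic hashing of the member id, so the same member keeps the same parameter between Ft refreshes:

```python
import bisect
import hashlib

def assign_parameter(member_id, distribution):
    """Resolve a member to a combination parameter from the stored Ft.

    distribution: list of (parameter, probability) tuples (x1, p1), ..., (xN, pN).
    A hash of the member id yields a stable value in [0, 1), so assignment
    is deterministic for a given member and a given Ft.
    """
    digest = hashlib.sha256(str(member_id).encode()).hexdigest()
    u = int(digest[:8], 16) / 16**8          # deterministic uniform in [0, 1)
    edges, cumulative = [], 0.0
    for _, p in distribution:
        cumulative += p
        edges.append(cumulative)
    idx = bisect.bisect_right(edges, u)
    idx = min(idx, len(distribution) - 1)    # guard against rounding at 1.0
    return distribution[idx][0]

# Hypothetical Ft with two candidate (alpha, gamma) settings.
Ft = [({"alpha": 1.5, "gamma": 0.2}, 0.6), ({"alpha": 2.0, "gamma": 0.25}, 0.4)]
params = assign_parameter(12345, Ft)
```

Stability matters here: if a member bounced between parameter settings on every request, the per-bucket metrics fed back to the Utility Aggregator would be diluted.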
The library is used by numerous LinkedIn teams, including Feed, Notifications, Ads, and People You May Know. Future work intends to extend it with a multitask Bayesian optimization strategy that combines observations from online A/B tests with a simple offline metrics simulator, rather than relying on online measurements alone. This can help improve modeling for metrics with high variance.
Reference: https://engineering.linkedin.com/blog/2022/using-bayesian-optimization-for-balancing-metrics-in-recommendat