Risk tolerance and model performance

When predicting user behavior, there is always some degree of uncertainty, which forces a trade-off: you must decide whether to include fewer users in a predicted group for higher overall accuracy, or to include more users at the cost of lower overall accuracy.

Risk tolerance levels

Firebase Predictions defines risk tolerance levels based on two metrics:

  • The true positive rate of a prediction is the proportion of users who performed an action and were correctly predicted to perform it (for example, the proportion of users who made a purchase that Firebase predicted would make a purchase).
  • The false positive rate of a prediction is the proportion of users who didn't perform an action but were incorrectly predicted to perform it (for example, the proportion of users who didn't make a purchase that Firebase predicted would make a purchase).
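
As a minimal sketch of these two definitions, both rates can be computed from a standard confusion matrix; the counts below are hypothetical:

    # Hypothetical confusion-matrix counts for a purchase prediction.
    true_positives = 900     # predicted to purchase, and did
    false_negatives = 300    # not predicted to purchase, but did
    false_positives = 200    # predicted to purchase, but didn't
    true_negatives = 1800    # not predicted to purchase, and didn't

    # True positive rate: the share of actual purchasers the model caught.
    tpr = true_positives / (true_positives + false_negatives)

    # False positive rate: the share of non-purchasers wrongly flagged.
    fpr = false_positives / (false_positives + true_negatives)

    print(f"TPR = {tpr:.1%}, FPR = {fpr:.1%}")  # TPR = 75.0%, FPR = 10.0%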

You tell Predictions how much uncertainty you are willing to tolerate when targeting users by choosing a risk tolerance level for a prediction. Each risk tolerance level guarantees that the false positive rate won't exceed some maximum threshold. Within that fixed false positive cap, Predictions targets as many users as possible, maximizing the true positive rate. If the maximum achievable true positive rate still falls below a minimum threshold, the risk profile is disabled and no users are targeted. In this way, risk profiles ensure that any targeting you apply meets a certainty threshold; if that threshold is unmet, targeting is disabled.
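
One way to picture this mechanism is the sketch below, which sweeps decision thresholds from loosest to strictest and keeps the first one that respects the false positive cap. The scores, data, and choose_threshold helper are illustrative assumptions, not Firebase internals:

    # Illustrative only: pick the loosest score threshold whose false
    # positive rate stays under the cap, then disable the profile if
    # the true positive rate at that threshold is below the floor.
    def choose_threshold(scored_users, max_fpr, min_tpr):
        # scored_users: list of (model_score, actually_performed_action)
        positives = [s for s, did in scored_users if did]
        negatives = [s for s, did in scored_users if not did]
        for t in sorted({s for s, _ in scored_users}):
            tpr = sum(s >= t for s in positives) / len(positives)
            fpr = sum(s >= t for s in negatives) / len(negatives)
            if fpr <= max_fpr:
                # Loosest compliant threshold: targets the most users.
                return t if tpr >= min_tpr else None  # None = disabled
        return None

    holdout = [(0.9, True), (0.8, True), (0.7, False), (0.6, True),
               (0.4, False), (0.3, True), (0.2, False), (0.1, False)]

    # High risk tolerance: at most 20% false positives, at least 45%
    # true positives (the high level's thresholds from the list below).
    print(choose_threshold(holdout, max_fpr=0.20, min_tpr=0.45))  # 0.8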

When you target users based on a prediction, you select a risk tolerance level. Depending on the type of prediction and the number of available Analytics events, one or more of the following levels are available to choose from:

High
  • Targets the most users, at the cost of prediction accuracy
  • Guarantees at most a 20% false positive rate
  • Inactive when the true positive rate falls below 45%
  • For every 10 users correctly targeted, at most 4.44 users (10 × 20% ÷ 45%) are incorrectly targeted*
Medium
  • Targets fewer users with higher accuracy
  • Guarantees at most a 10% false positive rate
  • Inactive when the true positive rate falls below 35%
  • For every 10 users correctly targeted, at most 2.86 users (10 × 10% ÷ 35%) are incorrectly targeted*
Low
  • Targets the fewest users, with the best accuracy
  • Guarantees at most a 5% false positive rate
  • Inactive when the true positive rate falls below 25%
  • For every 10 users correctly targeted, at most 2 users (10 × 5% ÷ 25%) are incorrectly targeted*

*Assuming an equal number of actual positive and negative cases among your users. If there are X times as many negative cases as positive cases, multiply the maximum number of false positives by X.
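
A quick check of the worst-case arithmetic for the high level, including the footnote's adjustment for class imbalance (the population counts here are made up):

    # Worst case for high risk tolerance: TPR exactly 45%, FPR exactly 20%.
    tpr, fpr = 0.45, 0.20

    # Equal numbers of actual positives and negatives:
    positives = negatives = 1_000
    correct = tpr * positives      # 450 correctly targeted
    incorrect = fpr * negatives    # 200 incorrectly targeted
    print(10 * incorrect / correct)  # 4.44..., as in the list above

    # With X = 3 times as many negatives as positives:
    print(10 * fpr * (3 * positives) / (tpr * positives))  # 13.33...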

Examples

Suppose you have an app with 35,000 users, and you want to predict which users will stop using your app (or churn) in the next few days so that you can do something to encourage them to keep using your app.

In the figures below, each face represents 1,000 of your users, with users who are satisfied and won't churn shown in green, and users who are dissatisfied and will churn shown in red.

High risk tolerance

With high risk tolerance, Predictions might create a group like the one in the figure below. This group includes 10,000 of the 13,000 users who are dissatisfied, so the true positive rate of this prediction is about 76.9%. If this value ever fell below 45% while high risk tolerance was selected, the prediction would become inactive until the true positive rate improved.

This group also includes 4,000 users who are actually satisfied with your app, and who you might not want to target in your re-engagement strategy. Because 4,000 of your 22,000 satisfied users were falsely predicted to churn, the false positive rate of this prediction is about 18.18%, which is below the 20% maximum false positive rate guaranteed by the high risk tolerance profile.

[Figure: a 5 × 7 grid of 35 faces, each representing 1,000 users: 22 satisfied (green) and 13 dissatisfied (red). The high risk tolerance group encloses 10 dissatisfied faces and 4 satisfied faces.]
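
The rates quoted above follow directly from the figure's counts; as a quick check:

    # High risk tolerance group: 10,000 of 13,000 dissatisfied users,
    # plus 4,000 of 22,000 satisfied users.
    print(10_000 / 13_000)  # 0.769... -> ~76.9% true positive rate
    print(4_000 / 22_000)   # 0.1818... -> ~18.18% false positive rate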

Low risk tolerance

On the other hand, the figure below shows what a group created with low risk tolerance might look like. This group contains fewer false positives—only 1,000 users—but also includes 4,000 fewer dissatisfied users than the high risk tolerance group. The true positive rate of this prediction is about 46.15% and the false positive rate is about 4.55%.

[Figure: the same 35 faces; the low risk tolerance group encloses 6 dissatisfied faces and 1 satisfied face.]
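
Again, the quoted rates follow from the counts in the figure:

    # Low risk tolerance group: 6,000 of 13,000 dissatisfied users,
    # plus 1,000 of 22,000 satisfied users.
    print(6_000 / 13_000)   # 0.4615... -> ~46.15% true positive rate
    print(1_000 / 22_000)   # 0.0454... -> ~4.55% false positive rate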

See how risk tolerance affects performance

You can use the cards on the Predictions page of the Firebase console to see how well your predictions perform with different risk tolerance levels.

The graph shows the prediction model's true positive rate for your data over the last two weeks. Each data point on the graph indicates how well that day's model performed on a holdout dataset (see How performance statistics are calculated). The graph is displayed in red on any day the true positive rate falls below the required threshold. On such days, Firebase deactivates user targeting based on the prediction.

If you see that your model has been inactive on some of the past 14 days, consider increasing your risk tolerance level to target more users and avoid inactive days, at the expense of potentially more false positives. You can see how different risk tolerance levels affect your model's performance by moving the Risk tolerance slider to different positions.

When you do so, the graph shows how well each day's model would have performed with the risk tolerance level you selected. For example, you might see that increasing the risk tolerance from medium to high keeps the model's true positive rate well above the 45% threshold for all of the past two weeks (but with greater tolerance for false positives).

When you find a risk tolerance level that achieves a balance between user reach and accuracy that you're comfortable with, select that risk tolerance level when you target users with Remote Config, A/B Testing, or the Notifications composer.

How performance statistics are calculated

Labeling

Like many machine learning tasks, training a Predictions model is a "supervised learning" task. This means all of the users used to train the model must be assigned labels such as "will churn", "will not spend", and so on. To label users, Predictions takes all 28-day active users of your app and removes the last 7 days of events from their data. This period is called the label window. Firebase Predictions uses events from the label window to assign a label to each user, and then uses the user's events prior to those 7 days (events from the training window) to train the model.

So, in essence, the model is always being trained on data that's 7 days old.
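
A rough sketch of the windowing for a single user follows. The 7-day label window comes from the description above; the event data, cutoff date, and churn rule are assumptions for illustration:

    from datetime import datetime, timedelta

    now = datetime(2019, 6, 15)
    label_window_start = now - timedelta(days=7)  # last 7 days

    # Hypothetical Analytics events for one 28-day active user.
    events = [
        (datetime(2019, 5, 25), "session_start"),
        (datetime(2019, 6, 1), "in_app_purchase"),
        (datetime(2019, 6, 12), "session_start"),  # inside label window
    ]

    # Events before the label window train the model; events inside
    # the label window only determine the user's label.
    training_events = [e for e in events if e[0] < label_window_start]
    label_events = [e for e in events if e[0] >= label_window_start]

    # Illustrative churn rule (not the actual Firebase definition):
    # no activity in the label window means the user churned.
    label = "will not churn" if label_events else "will churn"
    print(label)  # will not churn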

Holdout data and training data

Not all of the data is used directly for training. As is typical for supervised learning tasks, Predictions sets aside 20% of the data as holdout data and uses only the remaining 80% of the data to train the model. Then, to evaluate the model's performance, predictions are generated for the users in the holdout set, based on the data in the training window, and compared to the actual outcomes for each user, based on the labels generated from the label window.

All of the statistics presented in the Firebase console come from evaluating the model against the holdout data.
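
As a sketch of that evaluation loop, with a placeholder dataset and model (only the 80/20 split and the holdout-only evaluation come from the text above):

    import random

    # Hypothetical labeled users: (features, churned_in_label_window).
    users = [({"sessions": random.randint(0, 30)}, random.random() < 0.4)
             for _ in range(1_000)]

    random.shuffle(users)
    split = int(0.8 * len(users))                     # 80% for training...
    training, holdout = users[:split], users[split:]  # ...20% held out

    # Placeholder model: flag low-activity users as likely to churn.
    def predict_churn(features):
        return features["sessions"] < 5

    # Console statistics come only from the holdout set.
    true_pos = sum(predict_churn(f) for f, churned in holdout if churned)
    actual_pos = sum(churned for _, churned in holdout)
    print(f"holdout TPR = {true_pos / actual_pos:.1%}")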
