To help you maximize the relevance and usefulness of your test results, this page provides detailed information about how Firebase A/B Testing works.
During experiment creation, it is possible to change the default variant weights to place a larger percentage of experiment users into a variant.
Bayesian inference does not require the identification of a minimum sample size prior to starting an experiment. However, higher sample sizes reduce uncertainty in results and can lead to better decision-making. In general, you should pick the largest experiment exposure level that you feel comfortable with.
Several aspects of running experiments may be edited. To edit an experiment, select Edit running experiment from the experiment menu on the experiment results page. The experiment name, description, targeting conditions, and variant values can be edited. Changing the app’s behavior during a running experiment may impact results.
Remote Config variant assignment logic
Users who match all experiment targeting conditions, including the percentage exposure condition, are assigned to experiment variants according to a hash of the experiment ID and the user's Firebase installation ID.
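The assignment described above can be sketched as deterministic hash bucketing: hash the experiment ID together with the installation ID, map the hash to a point in [0, 1), and walk the cumulative variant weights. This is an illustrative model only; the actual hash function and bucketing used by Firebase are not published, so everything below is an assumption:

```python
import hashlib

def assign_variant(experiment_id: str, installation_id: str, weights: list) -> int:
    """Deterministically bucket a user into a variant index from a hash of
    the experiment ID and installation ID.
    (Illustrative sketch; not Firebase's actual algorithm.)"""
    digest = hashlib.sha256(f"{experiment_id}:{installation_id}".encode()).digest()
    # Map the first 8 bytes of the hash to a point in [0, 1).
    point = int.from_bytes(digest[:8], "big") / 2**64
    # Walk the cumulative (normalized) variant weights to find the bucket.
    total = sum(weights)
    cumulative = 0.0
    for index, weight in enumerate(weights):
        cumulative += weight / total
        if point < cumulative:
            return index
    return len(weights) - 1  # guard against floating-point rounding

# The same user always lands in the same variant for a given experiment:
v1 = assign_variant("exp_42", "fid_abc123", [0.5, 0.25, 0.25])
v2 = assign_variant("exp_42", "fid_abc123", [0.5, 0.25, 0.25])
```

Because the hash depends only on the experiment ID and the installation ID, assignment is stable across app launches without any server-side state.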
Google Analytics Audiences are subject to latency and are not immediately available when a user first meets the audience criteria. For time-sensitive targeting, consider using Google Analytics user properties or built-in targeting options such as country/region, language, and app version.
Once a user has entered an experiment, they are persistently assigned to their experiment variant and receive parameter values from the experiment as long as the experiment remains active, even if their user properties change and they no longer meet the experiment targeting criteria.
Experiment activation events limit experiment measurement to app users who trigger the activation event. The experiment activation event does not have any impact on the experiment parameters that are fetched by the app; all users who meet the experiment targeting criteria will receive experiment parameters. Consequently, it is important to choose an activation event that occurs after the experiment parameters have been fetched and activated, but before the experiment parameters have been used to modify the app’s behavior.
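The required ordering can be illustrated with a sketch of the app's startup flow. None of the function names below are real Firebase APIs; they are hypothetical stand-ins for the corresponding SDK calls:

```python
# Hypothetical sketch of the required ordering. These functions are
# stand-ins, not real Firebase APIs.

events_logged = []

def fetch_and_activate_remote_config(params):
    # Stands in for fetching and activating Remote Config values,
    # including any experiment parameter overrides.
    return dict(params)

def log_event(name):
    # Stands in for logging a Google Analytics event. If `name` is the
    # experiment's activation event, only users who reach this point
    # are counted in the experiment results.
    events_logged.append(name)

def apply_feature(config):
    # Stands in for using the fetched parameters to change app behavior.
    return config.get("button_color", "blue")

# Correct order: fetch/activate, then activation event, then behavior change.
config = fetch_and_activate_remote_config({"button_color": "red"})
log_event("onboarding_started")   # activation event fires after activation...
color = apply_feature(config)     # ...and before the parameters take effect
```

If the activation event were logged before fetch-and-activate completed, users could be counted into the experiment without ever receiving experiment values; if it were logged after the behavior change, users who saw the modified behavior could go unmeasured.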
Interpreting test results
Firebase A/B Testing results are powered by Google Optimize. Google Optimize uses Bayesian inference to generate insightful statistics from your experiment data. To learn more, see the Optimize resource hub.
Results are split into "observed data" and "modeled data." Observed data is calculated directly from analytics data, and modeled data comes from the application of our Bayesian model to the observed data.
For each metric, the following statistics are displayed:

Observed data:
- Total value: the sum of the metric for all users in the variant
- Average value: the average value of the metric for users in the variant
- % difference from baseline: calculated directly from the observed data

Modeled data:
- Probability to beat baseline: how likely it is that the metric is higher for this variant than for the baseline
- % difference from baseline: based on the median model estimates of the metric for the variant and the baseline
- Metric ranges: the ranges where the value of the metric is most likely to be found, with 50% and 95% certainty
Overall, the experiment results give us three important insights for each variant in the experiment:
- How much higher or lower each experiment metric is compared to the baseline, as directly measured (i.e., the actual observed data)
- How likely it is that each experiment metric is higher than the baseline / best overall, based on Bayesian inference (probability to be better / best respectively)
- The plausible ranges for each experiment metric based on Bayesian inference--"best case" and "worst case" scenarios (credible intervals)
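For a conversion-style metric, "probability to beat baseline" and the credible intervals above can be approximated by sampling from Beta posteriors over each variant's conversion rate. This is a simplified sketch of the kind of Bayesian computation involved, not Optimize's actual model:

```python
import random

def beat_probability_and_interval(conv_base, n_base, conv_var, n_var,
                                  draws=20000, seed=0):
    """Monte Carlo estimate of P(variant beats baseline) and a 95% credible
    interval for the variant's conversion rate, using
    Beta(1 + conversions, 1 + non-conversions) posteriors.
    (Simplified sketch; not Optimize's actual model.)"""
    rng = random.Random(seed)
    wins = 0
    variant_samples = []
    for _ in range(draws):
        rate_base = rng.betavariate(1 + conv_base, 1 + n_base - conv_base)
        rate_var = rng.betavariate(1 + conv_var, 1 + n_var - conv_var)
        variant_samples.append(rate_var)
        if rate_var > rate_base:
            wins += 1
    variant_samples.sort()
    lower = variant_samples[int(0.025 * draws)]
    upper = variant_samples[int(0.975 * draws)]
    return wins / draws, (lower, upper)

# Baseline: 100/1000 users converted; variant: 130/1000 users converted.
p_beat, interval = beat_probability_and_interval(100, 1000, 130, 1000)
```

With more users, the posteriors narrow, the credible interval tightens, and the probability to beat baseline moves toward 0 or 1, which is why larger exposure levels lead to more decisive results.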
A test must run for at least 7 days (Notifications) or 14 days (Remote Config) before Firebase will declare that a variant is a "clear leader."
Firebase will declare that a variant is a "clear leader" if it has greater than 95% chance of being better than the baseline variant on the primary metric. If multiple variants meet the "clear leader" criteria, only the best performing variant overall will be labeled as the "clear leader."
Since leader determination is based on the primary goal only, you should consider all relevant factors before deciding whether or not to roll out a leading variant. For example, you may wish to consider the expected upside of making the change, the downside risk (such as the lower end of the credible interval for improvement), and the impact to metrics other than the primary goal.
It is possible to roll out any variant, not just a leading variant, based on your overall evaluation of performance.
Firebase recommends that an experiment continue to run until two conditions are met:
- Minimum experiment duration has been met (7 days for Notifications and 14 days for Remote Config and Firebase In-App Messaging)
- The potential value of continuing the experiment drops below a minimum threshold--defined as a 5% chance of a >1% loss from incorrectly choosing the current leader
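The second condition can be sketched as a check on the posterior: estimate the chance that rolling out the current leader loses more than 1% relative to the alternative, and stop when that chance falls below 5%. The code below is a hedged illustration of that criterion for a conversion metric; the real computation inside Optimize may differ in detail:

```python
import random

def chance_of_meaningful_loss(conv_leader, n_leader, conv_other, n_other,
                              loss_threshold=0.01, draws=20000, seed=1):
    """Estimate the probability that choosing the current leader loses more
    than `loss_threshold` (relative) versus the other variant, using
    Beta(1 + conversions, 1 + non-conversions) posteriors.
    (Illustrative sketch of the stopping criterion, not the exact model.)"""
    rng = random.Random(seed)
    losses = 0
    for _ in range(draws):
        rate_leader = rng.betavariate(1 + conv_leader, 1 + n_leader - conv_leader)
        rate_other = rng.betavariate(1 + conv_other, 1 + n_other - conv_other)
        # Relative loss incurred if the leader is actually the worse choice.
        relative_loss = (rate_other - rate_leader) / rate_other
        if relative_loss > loss_threshold:
            losses += 1
    return losses / draws

# Leader: 150/1000 users converted; runner-up: 100/1000 users converted.
risk = chance_of_meaningful_loss(150, 1000, 100, 1000)
stop_experiment = risk < 0.05
```

When the leader is well ahead, the estimated risk of a meaningful loss is tiny and continuing the experiment adds little value.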
It is possible to roll out a variant before a leader has been declared if you are comfortable with the results. Most experiments that have been running for two weeks without declaring a leader will not meet the leader-finding criteria prior to expiration.
Experiment data is processed by Google Optimize for a maximum of 90 days after experiment start. After 90 days, the experiment will continue to set Remote Config parameters for users in the experiment, but experiment results will no longer be updated in the Firebase console.
A/B Testing does not have a separate BigQuery table--experiment and variant memberships are stored as user properties on every Google Analytics event in the standard Google Analytics event tables.
The user properties containing experiment information have keys of the form "firebase_exp_%", and userProperty.value.string_value contains the zero-based index of the experiment variant.
You can use these experiment user properties to identify the specific users who were in each experiment and variant. This gives you the power to slice your experiment results in many different ways.
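A sketch of pulling experiment membership out of exported event rows: the rows below are hand-made stand-ins shaped after the user-property structure described above, not real BigQuery export data, and the experiment key is a made-up example:

```python
# Hand-made stand-ins for rows from the Google Analytics BigQuery export.
# Experiment membership appears as a user property whose key starts with
# "firebase_exp_" and whose string value is the zero-based variant index.
events = [
    {"user_id": "u1", "user_properties": [
        {"key": "firebase_exp_7", "value": {"string_value": "0"}}]},
    {"user_id": "u2", "user_properties": [
        {"key": "firebase_exp_7", "value": {"string_value": "1"}}]},
    {"user_id": "u3", "user_properties": [
        {"key": "first_open_time", "value": {"string_value": "123"}}]},
]

def users_by_variant(events, experiment_key):
    """Group user IDs by variant index for a single experiment."""
    groups = {}
    for event in events:
        for prop in event["user_properties"]:
            if prop["key"] == experiment_key:
                variant = prop["value"]["string_value"]
                groups.setdefault(variant, set()).add(event["user_id"])
    return groups

groups = users_by_variant(events, "firebase_exp_7")
# → {"0": {"u1"}, "1": {"u2"}}
```

Once membership is grouped this way, any other Analytics dimension on the same events can be joined in to slice results by audience, app version, and so on.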
A/B Testing has a limit of 300 total draft, running, and completed experiments. Deleting an experiment can free up space to create a new one.
A/B Testing is limited to 24 simultaneous experiments. Stopping an experiment can free up space to start a new one.
An experiment can have a maximum of 8 variants including the baseline.