This article covers frequently asked questions about Amplitude Experiment's duration estimate.
How does the duration estimate work?
Amplitude Experiment uses the means, variances, and exposures of your control and variants to forecast expected behavior and calculate the number of days your experiment takes to reach statistical significance. As Amplitude Experiment receives more data over time, this prediction improves. If any of these inputs changes significantly during the experiment, the accuracy of the prediction is likely to decrease.
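As a rough illustration of how such a forecast can be computed, here is a minimal sketch based on a standard two-sample power calculation. It is not Amplitude's exact internal method; the function name, parameters, and significance/power defaults below are assumptions for the example. It derives the required sample size per variant from the observed means and variances, then converts that to days using the observed exposure rate:

```python
import math
from scipy.stats import norm

def estimate_days_to_significance(
    control_mean, control_var,      # sample statistics observed so far
    treatment_mean, treatment_var,
    exposures_per_day,              # exposures per variant per day
    alpha=0.05, power=0.8,
):
    """Rough duration forecast from a two-sample z-test power calculation.

    Illustrative only: it assumes the observed effect size, variances, and
    daily exposure rate stay constant for the rest of the experiment.
    """
    effect = abs(treatment_mean - control_mean)
    if effect == 0:
        return float("inf")  # no observed effect -> significance never reached

    z_alpha = norm.ppf(1 - alpha / 2)  # two-sided test
    z_beta = norm.ppf(power)

    # Required exposures per variant to detect the observed effect.
    n_per_variant = ((z_alpha + z_beta) ** 2 * (control_var + treatment_var)) / effect ** 2

    return math.ceil(n_per_variant / exposures_per_day)

# Example: a small lift on a noisy metric with 1,000 exposures per variant per day.
print(estimate_days_to_significance(5.0, 4.0, 5.2, 4.0, exposures_per_day=1000))
```

Because the inputs are re-estimated as data arrives, a forecast like this naturally gets more accurate over time, and changes in any input shift the estimate.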
What is the difference between the duration estimate and the duration estimator?
Why is the duration estimate not showing?
If the estimate isn't showing, it likely means that one or more of these criteria aren't met.
Is there a cap for the duration estimate?
How does Amplitude Experiment determine the number of exposures per day?
What types of errors are there?
Irreducible error
Irreducible error is error inherent to the estimation process; unfortunately, you can't correct for it. The time it takes for an experiment to reach statistical significance is itself a random variable: it depends on the p-value, which in turn depends on the data your experiment collects. In simulations, each run reaches statistical significance at a different time; this variation is the main reason to run multiple simulations. Even if you know the control mean, control standard deviation, treatment mean, and treatment standard deviation, and even if you assume everything is normally distributed and independent, Experiment still can't reduce the error all the way to zero. See this video on irreducible error and bias for more information.
Incorrect estimates
When Amplitude Experiment generates a duration estimate, it estimates the control population mean and control population standard deviation, among other things, from the sample statistics. These estimates are as good as they can be; that said, there is still potential for error here.
Drift
Drift occurs when the underlying statistics change over time. For example, if today the control mean equals 5, and ten days from now the control mean equals 15, there is drift in the control mean. A common example of drift is seasonality. If there is any drift in any of the statistics, the estimate does poorly, because it assumes no drift when doing hypothesis testing.
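To make the irreducible-error point concrete, here is a small Monte Carlo sketch. The metric parameters, daily exposure count, and the simple daily t-test check are assumptions for illustration, not Amplitude's implementation. Even with the true control and treatment distributions fixed, the day on which each simulated experiment first reaches statistical significance varies from run to run:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(seed=7)

def day_significance_reached(control_mean=5.0, treatment_mean=5.2, sd=2.0,
                             exposures_per_day=500, max_days=60, alpha=0.05):
    """Return the first day a two-sample t-test on the accumulated data is significant."""
    control, treatment = np.array([]), np.array([])
    for day in range(1, max_days + 1):
        # Accumulate another day of exposures for each group.
        control = np.append(control, rng.normal(control_mean, sd, exposures_per_day))
        treatment = np.append(treatment, rng.normal(treatment_mean, sd, exposures_per_day))
        _, p_value = stats.ttest_ind(control, treatment)
        if p_value < alpha:
            return day
    return None  # never reached significance within max_days

# Identical true parameters, yet the stopping day differs across simulations.
print([day_significance_reached() for _ in range(20)])
```

The spread in the printed stopping days is the irreducible error: even with every population parameter known and held fixed, when the experiment reaches significance is still random.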
What does 'Threshold reached' mean?
This isn't necessarily a bad result if your recommendation metric is a guardrail metric, since the effect size would be smaller than the allowed amount. Conversely, it's a bad sign if your recommendation metric is a success metric, because the effect size would be smaller than what you hoped for. It's recommended to end the experiment if this happens: even if you would eventually reach statistical significance, the lift would be smaller than what's practically significant, and you wouldn't have moved the metric the way you were hoping to.
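As a rough illustration of that reasoning (the metric values and the minimum practical lift below are hypothetical, and this is not Amplitude's decision logic), you can compare the observed lift against the smallest lift you considered practically significant when planning the experiment:

```python
def worth_continuing(control_mean, treatment_mean, practical_lift_threshold):
    """Compare the observed relative lift to the smallest lift you'd act on.

    If the observed lift is below the practically significant threshold, reaching
    statistical significance wouldn't change the decision, so ending the
    experiment is usually the better call.
    """
    observed_lift = (treatment_mean - control_mean) / control_mean
    return observed_lift >= practical_lift_threshold

# Hypothetical example: you planned for at least a 5% lift, but only see about 1%.
print(worth_continuing(control_mean=5.00, treatment_mean=5.05, practical_lift_threshold=0.05))  # False
```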
What does 'Statistical significance may never reach' mean?