The article is spot on. We at http://visualwebsiteoptimizer.com/ know that there are some biases (particularly 'multiple comparisons' and 'repeated peeking at the data') that lead to results that seem better than they actually are. That said, the current results are not wrong. They are directionally correct, and with most A/B tests even if 95% confidence is really a true confidence of 90% or less, the business will still do better implementing the variation (vs. not doing anything).
Of course, these are very important issues for A/B testing vendors like us to understand and fix, since users rely on our calculations to make their decisions. You will see us working to take care of such issues.
I'm afraid that's not quite right. A simple Python simulation will show you that a variant with a -5% (i.e. negative) uplift will still give a positive result around 10% of the time if you stop the test early.
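For the avoidance of doubt, here's a minimal sketch of the kind of simulation I mean. The 5% baseline conversion rate, 1,000-visitor batches, 50 peeks, and the run_peeking_test helper are all my own assumptions, not anyone's actual product logic:

    import numpy as np
    from scipy import stats

    def run_peeking_test(p_control=0.05, p_variant=0.0475, batch=1000,
                         max_peeks=50, alpha=0.05, rng=None):
        """One simulated A/B test with peeking: check significance after every
        batch and stop as soon as the variant crosses '95% confidence'."""
        rng = rng if rng is not None else np.random.default_rng()
        conv_c = conv_v = n = 0
        for _ in range(max_peeks):
            conv_c += rng.binomial(batch, p_control)
            conv_v += rng.binomial(batch, p_variant)
            n += batch
            # One-sided two-proportion z-test: does the variant look better?
            p_pool = (conv_c + conv_v) / (2 * n)
            se = np.sqrt(2 * p_pool * (1 - p_pool) / n)
            if se > 0 and stats.norm.sf((conv_v - conv_c) / (n * se)) < alpha:
                return True  # early stop: variant declared a "winner"
        return False

    rng = np.random.default_rng(42)
    trials = 2000
    wins = sum(run_peeking_test(rng=rng) for _ in range(trials))
    print(f"-5% variant declared a winner in {wins / trials:.1%} of tests")

Even though the variant is truly 5% worse, a meaningful fraction of runs stop early on a lucky streak and call it a winner. Run the same test to a fixed sample size with a single look at the end and the chance of a false "winner" falls well below the nominal 5%.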
To remove all doubt: your interpretation of the statistics is incorrect. In particular, this sentence is demonstrably false: "They are directionally correct, [...] the business will still do better implementing the variation (vs. not doing anything)."
> They are directionally correct, and with most A/B tests even if 95% confidence is really a true confidence of 90% or less, the business will still do better implementing the variation (vs. not doing anything).
What? That's not right at all! A confidence measure is how much you can trust that there's actually a difference. You can't say it'll improve things if your confidence is lower than your original threshold!
In addition to this, every time you change something you:
1) Might introduce bugs
2) Spend money
3) Spend time you could otherwise spend adding a new feature or acquiring a new customer
> What? That's not right at all! A confidence measure is how much you can trust that there's actually a difference. You can't say it'll improve things if your confidence is lower than your original threshold!
A 95% confidence doesn't magically translate into a binary decision of winner vs. no decision. A 90% confidence means that the variation is more likely to be better than control, just not as likely as if the confidence were 95%. The p-value cut-off is arbitrary. (A p-value of 0.055, i.e. 94.5% confidence, shouldn't make you throw away your results just because it misses the 0.05 mark.) Of course, in fields such as clinical trials you'd want to be very sure of your results and might not want to take chances, but on the web, when you're running many tests, you are usually OK with something that is probably better than the existing version.
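To make that concrete, confidence is a continuous measure of evidence, not a verdict. Here's a minimal sketch; the visitor and conversion counts are made up, and confidence_variant_beats_control is my own hypothetical helper, not anything from an actual tool:

    import numpy as np
    from scipy import stats

    def confidence_variant_beats_control(conv_c, n_c, conv_v, n_v):
        """One-sided confidence that the variant's true conversion rate exceeds
        the control's, via a pooled two-proportion z-test. Continuous in (0, 1)."""
        p_c, p_v = conv_c / n_c, conv_v / n_v
        p_pool = (conv_c + conv_v) / (n_c + n_v)
        se = np.sqrt(p_pool * (1 - p_pool) * (1 / n_c + 1 / n_v))
        return stats.norm.cdf((p_v - p_c) / se)

    # Made-up numbers: 10,000 visitors per arm, 500 vs. 551 conversions.
    # Prints roughly 0.947 -- just under the 95% cut-off, yet hardly weaker
    # evidence than a result that scrapes past it.
    print(confidence_variant_beats_control(500, 10000, 551, 10000))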
Of course, if it is a high-stakes A/B test on the web, you'd want to design it as carefully as a clinical trial. We're working towards making all those techniques available within the tool itself.