The article is spot on. We at http://visualwebsiteoptimizer.com/ know that there are some biases (particularly 'multiple comparisons' and 'repeated peeking at the data') that lead to results that seem better than they actually are. That said, the current results are not wrong. They are directionally correct, and with most A/B tests even if 95% confidence is really a true confidence of 90% or less, the business will still do better implementing the variation (vs. not doing anything).
Of course, these are very important issues for A/B testing vendors like us to understand and fix, since users rely on our calculations to make their decisions. You will see us working to take care of such issues.
I'm afraid that's not quite right. A simple Python simulation will show you that a variant with a -5% (i.e. negative) uplift will still give a positive result around 10% of the time if you stop the test early.
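For the avoidance of doubt, here's a minimal sketch of the kind of simulation I mean. The 5% baseline conversion rate, 1,000-visitor batches, 50 peeks, and the run_peeking_test helper are all my own assumptions, not anyone's actual product logic:

    import numpy as np
    from scipy import stats

    def run_peeking_test(p_control=0.05, p_variant=0.0475, batch=1000,
                         max_peeks=50, alpha=0.05, rng=None):
        """One simulated A/B test with peeking: check significance after every
        batch and stop as soon as the variant crosses '95% confidence'."""
        rng = rng if rng is not None else np.random.default_rng()
        conv_c = conv_v = n = 0
        for _ in range(max_peeks):
            conv_c += rng.binomial(batch, p_control)
            conv_v += rng.binomial(batch, p_variant)
            n += batch
            # One-sided two-proportion z-test: does the variant look better?
            p_pool = (conv_c + conv_v) / (2 * n)
            se = np.sqrt(2 * p_pool * (1 - p_pool) / n)
            if se > 0 and stats.norm.sf((conv_v - conv_c) / (n * se)) < alpha:
                return True  # early stop: variant declared a "winner"
        return False

    rng = np.random.default_rng(42)
    trials = 2000
    wins = sum(run_peeking_test(rng=rng) for _ in range(trials))
    print(f"-5% variant declared a winner in {wins / trials:.1%} of tests")

Even though the variant is truly 5% worse, a meaningful fraction of runs stop early on a lucky streak and call it a winner. Run the same test to a fixed sample size with a single look at the end and the chance of a false "winner" falls well below the nominal 5%.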
To remove all doubt: your interpretation of the statistics is incorrect. In particular, this sentence is demonstrably false: "They are directionally correct, [...] the business will still do better implementing the variation (vs. not doing anything)."
> They are directionally correct, and with most A/B tests even if 95% confidence is really a true confidence of 90% or less, the business will still do better implementing the variation (vs. not doing anything).
What? That's not right at all! A confidence measure is how much you can trust that there's actually a difference. You can't say it'll improve things if your confidence is lower than your original threshold!
In addition to this, every time you change something you:
1) Might introduce bugs
2) Spend money
3) Spend time you could otherwise spend adding a new feature or acquiring a new customer
> What? That's not right at all! A confidence measure is how much you can trust that there's actually a difference. You can't say it'll improve things if your confidence is lower than your original threshold!
A 95% confidence doesn't magically translate into a binary decision of winner vs. no decision. A 90% confidence means that the variation is more likely to be better than control, just not as likely as if the confidence were 95%. The p-value cut-off is arbitrary. (A p-value of 0.055, i.e. 94.5% confidence, shouldn't make you throw away your results just because it misses the 0.05 mark.) Of course, in fields such as clinical trials you'd want to be very sure of your results and might not want to take chances, but on the web, when you're running many tests, you are usually OK with something that is probably better than the existing version.
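To make that concrete, confidence is a continuous measure of evidence, not a verdict. Here's a minimal sketch; the visitor and conversion counts are made up, and confidence_variant_beats_control is my own hypothetical helper, not anything from an actual tool:

    import numpy as np
    from scipy import stats

    def confidence_variant_beats_control(conv_c, n_c, conv_v, n_v):
        """One-sided confidence that the variant's true conversion rate exceeds
        the control's, via a pooled two-proportion z-test. Continuous in (0, 1)."""
        p_c, p_v = conv_c / n_c, conv_v / n_v
        p_pool = (conv_c + conv_v) / (n_c + n_v)
        se = np.sqrt(p_pool * (1 - p_pool) * (1 / n_c + 1 / n_v))
        return stats.norm.cdf((p_v - p_c) / se)

    # Made-up numbers: 10,000 visitors per arm, 500 vs. 551 conversions.
    # Prints roughly 0.947 -- just under the 95% cut-off, yet hardly weaker
    # evidence than a result that scrapes past it.
    print(confidence_variant_beats_control(500, 10000, 551, 10000))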
Of course, if it is a high-stakes A/B test on the web, you'd want to design it as carefully as a clinical trial. We're working towards making all those techniques available within the tool itself.