A turning point in the war on orange
Mozilla now runs over a million tests on each checkin. We're consistently including tests with new features, and many old features now have tests as well. We're running tests on multiple versions of Windows. We've upped the ante by considering assertion failures and memory leaks to be test failures. We're testing things previously thought untestable, on every platform, on every checkin.
One cost of running so many tests is that a few tests that each fail 1% of the time can quickly add up to 3-5 intermittent failures per checkin. Historically, this has been a major source of pain for Mozilla developers, who are required to identify all oranges before and after checking in.
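To make the arithmetic concrete, here is a rough sketch in Python. It assumes each flaky test run fails independently with the same probability, and the count of 300 to 500 flaky runs per checkin is an illustrative assumption (for example, the same flaky tests running on several platforms and build types), not a measured Mozilla figure. The function names are made up for this example.

```python
def expected_failures(flaky_runs: int, failure_rate: float) -> float:
    """Expected number of intermittent failures in one checkin.

    Assumes every flaky run fails independently with the same rate.
    """
    return flaky_runs * failure_rate


def prob_any_failure(flaky_runs: int, failure_rate: float) -> float:
    """Probability that a checkin shows at least one intermittent failure."""
    return 1.0 - (1.0 - failure_rate) ** flaky_runs


if __name__ == "__main__":
    rate = 0.01  # each flaky run fails 1% of the time
    # Hypothetical counts of flaky test runs per checkin.
    for runs in (300, 400, 500):
        print(
            f"{runs} flaky runs: "
            f"expected failures = {expected_failures(runs, rate):.1f}, "
            f"P(at least one) = {prob_any_failure(runs, rate):.3f}"
        )
```

Under these assumptions, 3-5 oranges per checkin corresponds to a few hundred flaky runs, and nearly every checkin sees at least one failure that has nothing to do with the patch.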
Ehsan and I have pretty much eliminated the difficulty of starring intermittent failures on Tinderbox. Ehsan's assisted starring feature for TinderboxPushlog was a breakthrough and keeps getting better. The orange almost stars itself now. The public data fairy lives.
I'm only aware of two frequent oranges that are difficult to star, and we have fixes in hand for both.
But now that intermittent failures are easy to ignore, we should not forget that we still need to reduce their number. They're still an annoyance, and many of them are real bugs in Firefox.
What makes it hard to diagnose and fix intermittent failures in Firefox's automated tests? Let's identify the remaining unnecessary difficulties and fix them.