Mon, Apr 12, 2004
Here in Windows (and in other groups around the company) we have a system that is
generally called stress. Different groups have different implementations, but
the idea is always the same. You develop a test suite that hammers you system
and then have everyone install it and run it overnight. The idea is that
this way we can catch those hard to repro issues that only our customers see.
When you get one of these errors, what do you do with it? Generally there is
a team of people who have the thankless job of coming in early, finding all the machines
that have broken in to the debugger, logging the failures in a database and having
the system send mail to various people to investigate. The actual debugging
is done via the NT command line debugger (cdb)
redirected over the network via the remote.exe utility. It is up to this team
to figure out which issues have already been caught and fixed (but perhaps aren't
in the build yet) and what issues are new and need to be investigated. These
guys cover a lot of code so the generally don't have the knowledge to figure out what
developer should be looking at any particular problem. That falls to the rotating
list of people who are on "stress duty." The shell team calls it "dev of the
day." We've learned by hard experience that if you send these types of things
to a big mailing list, they either get ignored or the same sorry glutton for punishment
picks up all of the issues.
I happen to be on stress duty this week. My job is to handle, in as expedient
a fashion as possible, any incoming issues. I have to do my best to
exercise my debugging skills and try to figure out what is going on. BTW, Raymond
Chen is one of the undisputed masters of the cdb kung-fu. If I can figure
out what is going on myself, I can file a bug on the issue and let the person whose
machine I'm debugging reclaim their machine. If the problem is in code that
I don't know well it is my job to reassign the issue to someone who does know the
code well.
Beyond handling incoming stress breaks in the morning, I also have to be responsive
to all sorts of other issues that might get dropped on the floor. This includes
other random crashes people might have during the day along with helping out to resolve
any build breaks. (We try super hard to make sure that build breaks don't happen,
but with something the size of Windows it still sometimes happens.)
Do other companies have systems like this set up? I'd be interested to hear
what works for you.