Wednesday, March 04, 2009

This Has All Happened Before And Will All Happen Again

Disclaimers: 1. I'm writing this because I feel like writing it, not because I think you will be interested in reading it (ok, that's most of my posts, but this one has the potential to be extra boring for non-engineers). 2. All comments are based on a combination of experiences at several companies as well as stories from others and should not be taken as a comment on any particular company. 3. Sorry, but the post has nothing to do with BSG.

Ahhh the chip doesn't work at all. What do you think is wrong with the design?

That's what I've heard every time a chip I have worked on has come back (a chip "coming back" means the manufacturer has sent you the completed part). You might think I'm leaving out the part in the middle where they say what the problem is; sadly, that's typically not the case. While there have been issues with circuits I've designed, that first scare nearly always turns out to be nothing. And I'm always shocked by how few people have picked up on this pattern.

Why do I think chips always get this reaction? It takes a long time to design a chip and it costs a bunch of money to have it made so everyone is very anxious when it shows up. So when a problem occurs, instead of following the normal course of debugging, the person in the lab tells the group of managers looking over their shoulder about the problem that just occurred. With everyone on edge about the chip, the problem is instantly assumed to be a chip problem. So they go to the person who can find a problem in the chip and look to that person to find explanations based on the assumption it is a chip problem.

How should the designer respond? It is tempting to question the chip, partly because an engineer should always be questioning their work (don't trust anyone or anything, including yourself) and partly because the news/request normally comes from a boss or boss's boss or CEO (seriously). But the designer should take a deep breath, stay calm, and become 100% convinced that the design is perfect. The goal should be to have faith so unwavering a suicide bomber would be jealous. And it is amazing how much progress happens once the designer uses the magic words: "Let's go to the lab." An expert designer would have said the magic words before the frazzled-looking manager even had a chance to exclaim how screwed up the chip is (I hope to be that good one day).

Why? Well, the designer should stay calm because someone should stay calm and no one else seems to be doing it. The blind faith has a few reasons. Everyone else is looking to blame the chip, so someone needs to question things like: lab equipment (seen it), lab setup (seen it), is it the right chip (seen it), Excel equations (seen it), interpretation of data (seen it), random unexplainable set of bad data that can't be repeated (seen it)... Even under normal circumstances people are more motivated to find problems with others' work than their own, so the designer will be very motivated to track down non-design-related problems.

There's another reason for the temporary blind faith. One recurring theme on House is that if he thinks the patient has one of two diseases, and one is curable and one is not, he will assume it is the curable one. Fixing a chip typically means months of work, hundreds of thousands of dollars to fab (manufacture) the new one, and a month or so wait to get it back. A lab mistake typically takes a few hours to fix. A math mistake can take minutes. So burn a few days assuming it isn't a chip mistake, because if it is anything else everyone can get right back to making progress. If it is a chip mistake there's going to be months of delay anyway, so what's a few days wasted going down the wrong path?
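For the engineers who want that as arithmetic, here is a back-of-the-envelope sketch in Python. Every number in it is made up for illustration (the odds of a real chip bug, the length of a respin, how long a fruitless chip hunt drags on), and the function name is my own invention; plug in your own guesses.

```python
def expected_delay_days(p_chip, lab_check_days, respin_days, chip_hunt_days, lab_first):
    """Expected schedule delay, in days, for one debug strategy.

    All inputs are hypothetical estimates:
      p_chip         -- chance the problem really is in the chip (0.0 to 1.0)
      lab_check_days -- days spent ruling out equipment, setup, math, etc.
      respin_days    -- days to fix the design, re-fab, and get parts back
      chip_hunt_days -- days lost hunting a design bug that isn't there
      lab_first      -- True: check everything else first; False: blame the chip first
    """
    if lab_first:
        # The lab check is always paid; the respin is paid only if the chip is bad.
        return lab_check_days + p_chip * respin_days
    # Chip first: a real chip bug goes straight to the respin. Otherwise the
    # fruitless chip hunt is wasted before the lab check finds the real cause.
    return p_chip * respin_days + (1 - p_chip) * (chip_hunt_days + lab_check_days)


# Illustrative guesses only: 30% chance of a real chip bug, 3 days in the lab,
# a 120-day respin, and 2 weeks lost if the design gets blamed wrongly.
guesses = dict(p_chip=0.3, lab_check_days=3, respin_days=120, chip_hunt_days=14)
print("lab first: ", expected_delay_days(lab_first=True, **guesses))
print("chip first:", expected_delay_days(lab_first=False, **guesses))
```

With guesses like these the lab-first route comes out ahead, and in the worst case (it really was the chip) the lab check adds 3 days to a 120-day respin. If you think chip bugs are far more likely or lab checks far slower, the balance can tip the other way.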

But why is such absolute faith needed? Why not think about possible chip problems while checking everything else? Because if people get even a whiff of a possible problem with the chip, they will grasp onto it and expect the designer to track it down, distracting from the real task: blaming everyone and everything else.

However, it is important to be like one of those religious people others respect because they aren't forcing their beliefs on everyone else. That's because after a few days it is time for a 180, and the hunt for design screw-ups begins. 'Cause, honestly, there's a reasonable chance it is a chip problem. There are plenty of tiny mistakes that can ruin an entire chip and an incredible number of subtle issues in chip design.

2 comments:

Anonymous said...

Heh. Boy, do I know what you're talking about :). Of course, software is a bit easier to rebuild than hardware, but the problem of measuring the right thing or running the right code is very familiar.

Here are some debugging rules (courtesy of David Agans). I think you've hit most of them, though the last one is by far my favorite:
- Understand the system
- Make it fail
- Quit thinking and look
- Divide and conquer
- Change one thing at a time
- Keep an audit trail
- Check the plug
- Get a fresh view
- If you didn't fix it, it ain't fixed

The Owl Archimedes said...

i wonder what the average area of a dorito chip triangle is? and the average grams of ground cheese per square cm of chip? i thought of this other chip problem about 6 lines into your post. Back to the real chip problem...