Academic Insights
Researchers clear 'fog' around post-silicon debugging
By Richard Goering
01/27/08
Bringing automation to a process that has become increasingly slow, difficult, and costly, researchers at the University of Michigan (UM) have developed FogClear, a post-silicon debugging methodology that can trace bugs and repair functional and electrical errors. The approach promises to cut the costs of post-silicon debugging and silicon respins, while speeding time-to-market and producing chips with fewer defects.
The UM FogClear research was conducted by Igor Markov, associate professor of electrical engineering and computer science; Valeria Bertacco, assistant professor of electrical engineering and computer science; and Kai-hui Chang, a former student who wrote a PhD thesis about post-silicon debugging. Markov, Bertacco, and Chang published a paper at the last International Conference on Computer-Aided Design (ICCAD 2007) that describes FogClear, and the three have prepared a paper for this year's International Symposium on Physical Design (ISPD 2008) that outlines some follow-up research on the use of spare cells for post-silicon metal repair.
FogClear uses simulation traces to detect errors in the post-silicon prototype, and then runs a bug trace minimization algorithm to reduce the complexity of the trace. Currently, FogClear can detect the sources of functional errors, but electrical errors require manual diagnosis. However, FogClear provides automated tools for fixing both functional and electrical errors. It modifies layouts primarily by changing wires on metal layers and making use of spare standard cells, which is far less expensive than changing transistor layers.
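To give a feel for what reducing a failing trace means, here is a minimal Python sketch. It is not the researchers' algorithm: it uses a simple greedy, one-vector-at-a-time reduction, and the toy "design" (an 8-bit accumulator that should saturate but wraps instead), along with the names `reference`, `buggy`, `bug_shows`, and `minimize_trace`, are all assumptions made up for illustration.

```python
from typing import Callable, List, Sequence

def reference(trace: Sequence[int]) -> int:
    """Hypothetical intended behavior: an accumulator that saturates at 255."""
    total = 0
    for x in trace:
        total = min(total + x, 255)
    return total

def buggy(trace: Sequence[int]) -> int:
    """Hypothetical faulty prototype: the accumulator wraps instead of saturating."""
    total = 0
    for x in trace:
        total = (total + x) % 256
    return total

def bug_shows(trace: Sequence[int]) -> bool:
    """True if the prototype's final output differs from the reference."""
    return reference(trace) != buggy(trace)

def minimize_trace(trace: Sequence[int],
                   fails: Callable[[List[int]], bool]) -> List[int]:
    """Greedily drop input vectors while the failure still reproduces."""
    current = list(trace)
    if not fails(current):
        raise ValueError("the original trace does not reproduce the bug")
    i = 0
    while i < len(current):
        candidate = current[:i] + current[i + 1:]
        if fails(candidate):
            current = candidate   # vector i was not needed to expose the bug
        else:
            i += 1                # vector i is needed; keep it and move on
    return current

original = [10, 200, 3, 100, 7, 1]
minimized = minimize_trace(original, bug_shows)
print(len(original), "vectors reduced to", len(minimized), ":", minimized)
```

In this toy case the six-vector trace shrinks to the two vectors that actually trigger the overflow. A shorter trace gives an engineer, or a downstream tool, far less to examine.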
The result, said Bertacco, is a reduction in the cost of silicon respins. "Because you can produce a new variation of the design just by changing the metal masks, you save the money it requires to produce transistor masks," she said. "Silicon becomes much less expensive."
Additionally, Markov noted, FogClear can save engineering time by providing a way to find post-silicon defects and fix them. And, he said, it can help chipmakers avoid the biggest expense of all associated with respins – loss of revenue due to delayed market entry.
The inspiration for tackling post-silicon debugging, Bertacco said, came in part from a visit by Intel representatives. "They were telling us that post-silicon debug is the fastest-growing expenditure area at Intel. That led us to understand that if we can do something in that space, there's a lot of money to be saved," she said.
Post-silicon debugging today is a slow and tedious process, Bertacco said. Engineers typically run simulations of the intended design on a workstation and run the same tests on the silicon prototype, and if the results differ, they manually track down the cause. "It's much harder than pre-silicon verification because of the low visibility and accessibility," she said. "With a prototype, it's hard to read the values inside the chip."
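The compare-against-simulation flow Bertacco describes can be sketched in a few lines of Python. This is only an illustration under stated assumptions: `expected_response` stands in for a workstation simulation of the intended design, `silicon_response` stands in for measurements from the prototype, and the injected single-vector fault is invented for the example.

```python
from typing import Callable, Iterable, List, Tuple

def expected_response(vector: int) -> int:
    """Hypothetical golden model of the intended design (8-bit output)."""
    return (vector * 3 + 1) & 0xFF

def silicon_response(vector: int) -> int:
    """Hypothetical prototype: matches the model except on one input vector."""
    result = expected_response(vector)
    if vector == 0x80:      # invented fault that affects a single vector
        result ^= 0x01
    return result

def find_mismatches(vectors: Iterable[int],
                    model: Callable[[int], int],
                    chip: Callable[[int], int]) -> List[Tuple[int, int, int]]:
    """Run the same tests on both and report (vector, expected, observed)."""
    mismatches = []
    for v in vectors:
        want, got = model(v), chip(v)
        if want != got:
            mismatches.append((v, want, got))
    return mismatches

mismatches = find_mismatches(range(256), expected_response, silicon_response)
for vector, want, got in mismatches:
    print(f"vector {vector:#04x}: expected {want:#04x}, observed {got:#04x}")
```

The comparison only shows that the outputs differ for a given test. It does not say which internal signal went wrong, and that is the manual step that makes the process slow.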
Relatively few post-silicon debugging tools are available today, and they tend to be "limited in scope" and not very automated, said Markov. One EDA company that specializes in post-silicon debugging is startup DAFCA Inc., which offers on-chip instrumentation that provides visibility and control during post-silicon debugging. Markov said that FogClear, which does not use on-chip instrumentation, is complementary to DAFCA's ClearBlue product, which focuses on debugging rather than automated repair.
"The FogClear work is specifically aimed at being a systematic repair solution that exploits built-in fixed-function spares [cells] in the most efficient way," said Rob Rutenbar, professor of electrical and computer engineering at Carnegie Mellon University and a DAFCA advisor. "DAFCA is aiming at the more general problem of adding some reconfigurable fabric that can be programmed to deal with a broader range of repairs, as well as providing some observability and controllability for debug." On-chip instrumentation is a "big job" best handled by a startup, he said.
Bertacco said that the FogClear research project took shape only within the past year, although some elements of it, such as the Butramin algorithm for bug trace minimization, were developed earlier. Markov said the project was "generated on the fly. We didn't plan it from the start." As such, he said, it hasn't received any direct funding, although the University of Michigan received some support from the National Science Foundation (NSF) and the Gigascale Systems Research Center (GSRC).

University of Michigan researchers responsible for FogClear (left to right) are: Valeria Bertacco, Kai-hui Chang, and Igor Markov. Source: University of Michigan
Markov said his team has had discussions with DAFCA and Intel representatives, and that there's a possibility of future collaboration with both of these companies. Markov noted that industrial contacts would like to see more work with electrical errors. Thus far, the main focus of the work has been on functional errors.
Intel is not directly collaborating with the Michigan team today, but is following the research work through GSRC reviews, said Shekhar Borkar, Intel fellow and director of microprocessor research. "This is an important research area and I am glad that they are doing it," Borkar said. "It is premature to say how much of this is applicable as is, but the evaluation continues."
"Post-silicon debug is very tedious, since observability of any bugs is poor," Borkar said. "It's not like you can probe signals on a printed circuit board. You have much less visibility into a silicon chip."
Rutenbar observed that there have been few systematic solutions for the use of spare cells to repair IC designs. "The UM guys did a nice job of moving from ad-hoc ideas to something with more solid science to it," he said.
A tougher challenge
As the ICCAD FogClear paper notes, pre-silicon and post-silicon debugging differ in several significant ways. Post-silicon bugs are typically subtle errors that affect the responses to only a few input vectors, and they can often be fixed in ways that affect only a few gates. However, they're hard to find. Internal signals in a silicon die are hard to observe, fixes are difficult to verify, and it's important to minimize the layout impact with any repair. As a result, the paper notes, most debugging techniques used for pre-silicon verification cannot be applied post-silicon.
The "good side" of post-silicon debugging is that real chips run software "millions of times faster" than pre-silicon simulation, Bertacco said. "The downside is that observability is extremely limited, so debugging becomes an enormous challenge," she said.
