Academic Insights
DAC 'best paper' eases post-silicon debug
By Richard Goering
07/14/08
Research described in a Design Automation Conference (DAC 2008) "best paper" will make it much easier to locate bugs during post-silicon validation, according to Subhasish Mitra, assistant professor of electrical engineering and computer science at Stanford University. The paper describes Instruction Footprint Recording and Analysis (IFRA), a technology for post-silicon bug localization in processors.
IFRA research addresses a vexing problem – "localizing" a bug to determine where it is physically located, and determining a short instruction sequence that exposes the bug. By inserting low-cost hardware structures for recording instruction footprints into a processor, and by providing post-failure analysis software, IFRA promises to localize bugs with 96 percent accuracy without requiring a full system simulation and without having to reproduce bugs in a full system setup.
"Our vision is to increase productivity and to cut down the cost of validation," Mitra said. "The industry has made a big focus on the fact that post-silicon validation costs are dominating." In his DAC presentation, Mitra noted that 35 percent of chip development time and 25 percent of design resources can go into post-silicon validation.
Bug localization today is a difficult process, Mitra said. He noted that a bug can be caused by one signal on a chip that may have billions of gates and trillions of possible circuit paths, and that an interaction that slows down just one circuit path could bring down the entire system. "It's like finding a needle in a haystack," he said. Mitra also noted that post-silicon bugs can behave like car problems that show up on the freeway but can't be reproduced by the mechanic.
Typically, he noted, design validation teams will run applications until a bug shows up as a visible problem, perhaps causing a crash and a "blue screen." They will then use tools such as logic analyzers to try to isolate the bug. These bugs are very difficult to reproduce, however, especially electrical bugs, which show up only under certain operating conditions. Electrical bug localization is the primary focus of Mitra's IFRA research.
Mitra said that IFRA offers two big advantages over existing bug localization methods. First, it doesn't require design validation teams to reproduce bugs at the system level. Second, it doesn't require a full system-level simulation. Debug engineers still need to run simulation to find the cause of a bug, but it's a small, targeted simulation. IFRA requires only about a one percent increase in area, Mitra said, and has little impact on performance or power.
The IFRA research is part of Stanford's Robust Systems Group, which is looking at various aspects of system reliability. The IFRA paper was co-authored by Mitra and by his student Sung-Boem Park, who is now beginning an internship at Intel Labs. Mitra’s research group is funded by the Semiconductor Research Corp. (SRC), National Science Foundation (NSF), the Focus Center Research Program (FCRP), Gigascale Systems Research Center (GSRC), Center for Circuit and System Solutions (C2S2), Defense Threat Reduction Agency (DTRA), Stanford Center for Integrated Systems (CIS), and a host of companies – Cisco, IBM, Intel, NEC, Texas Instruments and Toshiba.
"This is absolutely significant research," said David Yeh, director of integrated circuit and system sciences at SRC. "As integrated circuits get more complex, the ability to find the root cause of a system failure is getting harder and harder. There may be a bug in hardware that is exposed to the user only after a million clock cycles. Adding footprint recording to the design helps make that job tractable."
What IFRA offers, Yeh said, is a monitoring structure and analysis methodology that can capture errors and help diagnose the cause without exactly duplicating the precise instruction and data streams. Electrical failures can be intermittent, so engineers today may need to "replay" the streams a number of times to get the same failure mode. IFRA, Yeh said, "will not find every last bug, but this approach certainly looks like it can find lots of bugs that are the most difficult to find right now."
Intel is supporting the IFRA research, noted Hong Wang, senior principal engineer and director of the Microarchitecture Research Lab at Intel. "Post-silicon validation has become a very expensive task in the modern microprocessor development process, where bug localization often dominates most of the validation efforts," he said. "IFRA is a breakthrough research idea that has the potential to tackle this challenge cost effectively."
The key idea of IFRA, Wang said, is to enhance a traditional microprocessor architecture with some low-cost monitoring hardware that dynamically takes "snapshots" of program execution behavior. When a failure occurs, the recorded instruction footprints can be used in a "post-mortem forensic analysis" to trace the bug to its location.
Validating an idea
Mitra said the IFRA research began in 2006. "I had this idea," he said, "and I told my student, Sung-Boem, about it. For a while we thought the idea was not going to work. So I told Sung-Boem, 'you should write a report about why this idea is not going to work.' He started writing the report, and that's when we figured out the idea would work."
IFRA has not been built in silicon yet. The DAC 2008 paper uses an Alpha 21264-like superscalar processor model to explain the IFRA recording infrastructure. In simulations, IFRA exactly pinpointed both the location and timing of over 75 percent of injected bugs. For 21 percent of injected bugs, it correctly identified location and time along with 2 to 6 other candidates. Only 4 percent of injected bugs were completely missed. "These numbers are very significant for anyone familiar with diagnostics," Mitra said.
IFRA employs hardware recorders called Footprint Recording Structures (FRS) to record semantic information about data and control flows of instructions passing through various design blocks of a processor. This information is recorded concurrently during the normal operation of a processor in a post-silicon validation setup. When a problem is detected, the recorded information is scanned out via a JTAG interface and analyzed. Program analysis techniques, along with the binary of the application that was executed, are used to help localize the bug.

Figure 1 – IFRA inserts recorders inside a chip, and uses post-analysis software to localize bugs. (Source: DAC 2008 presentation)
An FRS is basically a circular buffer, Mitra noted. From a hardware standpoint, it's similar to a traditional trace buffer, but IFRA plays some "tricks" as to what's stored in the buffer and when the recording is stopped and started. The end result is that engineers don't have to simulate the entire system to localize a bug. The FRSes require 50 Kbytes of distributed on-chip storage in total, "very small compared to a couple megabytes of on-chip L2 caches in state of the art processors," Mitra said.
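The circular-buffer behavior Mitra describes can be illustrated with a small software model. This is only a sketch under stated assumptions: the class name `FootprintRecorder`, its methods, and the `(instr_id, footprint)` entry format are hypothetical and are not taken from the DAC paper, which describes a hardware structure.

```python
from collections import deque


class FootprintRecorder:
    """Hypothetical software model of one FRS as a fixed-capacity circular buffer."""

    def __init__(self, capacity):
        # A deque with maxlen silently drops the oldest entry when full,
        # which mimics a circular buffer overwriting its oldest slot.
        self.entries = deque(maxlen=capacity)
        self.recording = True

    def record(self, instr_id, footprint):
        # Store a footprint only while recording is enabled.
        if self.recording:
            self.entries.append((instr_id, footprint))

    def scan_out(self):
        # Return the retained footprints, oldest first (stands in for a scan-out).
        return list(self.entries)


# Record six footprints into a buffer that holds only four.
frs = FootprintRecorder(capacity=4)
for i in range(6):
    frs.record(i, f"op{i}")
snapshot = frs.scan_out()
```

In this model only the four most recent footprints survive, so the analysis sees a bounded window of recent history rather than a full trace.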
"The key to the success of IFRA is its ability to tag the semantic information collected by FRSes using very short IDs -- typically, 10 bits for a complex processor," said Park. "A very special, yet simple, ID assignment rule plays a central role in enabling the information collection for complex processors with multiple clock domains, dynamic voltage and frequency scaling, speculative execution and pipeline flushes."
With cache sharing, Mitra said, the area impact for the FRSes could be even less than one percent. In validation mode, he said, the IFRA hardware will use a small amount of extra power, but not in operation mode. One caution: "There could be some leakage power if it's not designed properly," Mitra said.
When an application is running during post-silicon validation, soft or hard "post-triggers" can fire. A hard post-trigger fires when there is an evident sign of a failure, and it terminates both the recording and the processor operation. A soft post-trigger fires when there is an early symptom of a possible failure, and it pauses the FRS recording but allows the processor to keep running.
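The difference between the two post-trigger types can be sketched as a tiny state model. The names `Trigger` and `ValidationRun` are hypothetical, and the model makes a simplifying assumption: it does not show when recording resumes after a soft post-trigger, since the article does not say.

```python
from enum import Enum


class Trigger(Enum):
    SOFT = "soft"  # early symptom of a possible failure
    HARD = "hard"  # evident sign of a failure


class ValidationRun:
    """Hypothetical model of a processor under validation with one recorder."""

    def __init__(self):
        self.recording = True  # is the FRS capturing footprints?
        self.running = True    # is the processor still executing?
        self.log = []          # footprints captured so far

    def step(self, footprint):
        # The processor executes one instruction; record it only if allowed.
        if not self.running:
            return
        if self.recording:
            self.log.append(footprint)

    def fire(self, trigger):
        # Both trigger types stop the recording; only a hard one halts the processor.
        self.recording = False
        if trigger is Trigger.HARD:
            self.running = False


run = ValidationRun()
run.step("a")
run.fire(Trigger.SOFT)
run.step("b")  # the processor keeps running, but nothing is recorded
state_after_soft = (run.running, run.recording, list(run.log))
run.fire(Trigger.HARD)
state_after_hard = (run.running, run.recording)
```

Freezing the recorder on a soft post-trigger preserves the footprints leading up to the symptom, which would otherwise be overwritten as the processor keeps executing.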
The IFRA post-failure analysis software starts with the construction of a global control-data flow graph. This graph determines where each dynamic instruction was present at each time instance. After that, four high-level post-analyses are run. These include a data dependency analysis, program control flow analysis, load/store analysis, and decoding analysis.
The end result? "I'm going to tell you that this is the problem, and this is the set of instructions that were processing when the problem is caught," said Mitra. "I'm absolutely sure the problem came from this place, and this is what was going on in this part of the processor. The debug guys will take this information and run some simulation, but it will be a simple simulation, not a full system simulation."
The next challenge, Mitra said, is to expand the IFRA concept beyond single processors. "All the chips today are multiprocessors, and there are problems not just inside processor cores but in the interactions between processors and memories. That's where we want to take this thing."
The IFRA paper was one of two DAC 2008 "best paper" award winners. The other winning paper, from Texas A&M University, is entitled "WavePipe: Parallel transient simulation of analog and digital circuits on multicore shared memory machines."
