SEU Mitigation
Single-event upsets (SEUs) cause major problems in FPGAs because they affect not only the user memory on the chip (as is the case with ASICs) but also the hardware implemented on the chip. See FPGAs in Space for more information. These upsets must therefore be mitigated, meaning that their effects must be prevented or masked.
There are many ways to mitigate the effects of SEUs on FPGAs. All of the techniques involve some form of redundancy, whether in time, area, or information. First of all, the configuration bitstream of the FPGA, which determines the hardware function implemented on the chip, must be corrected when it is upset by an SEU. This can be accomplished by refreshing the configuration memory with a golden (correct) copy when a configuration bit is upset. Often this golden copy is kept off-chip in a radiation-hardened RAM. To save space, the reference used for detecting upsets may instead be a simple CRC or checksum.
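As a rough illustration of the checksum idea, the sketch below stores one CRC-32 per configuration frame and uses it to find an upset frame. The frame layout, the frame size, and the function names `frame_crcs` and `find_upset_frames` are assumptions made for this example and do not correspond to any vendor's bitstream format. Note that a CRC can only detect an upset; the data needed to repair the frame still has to come from somewhere else.

```python
import zlib

def frame_crcs(frames):
    """Return one CRC-32 per configuration frame (hypothetical frame layout)."""
    return [zlib.crc32(frame) for frame in frames]

def find_upset_frames(frames, golden_crcs):
    """Return indices of frames whose CRC no longer matches the stored value."""
    return [i for i, frame in enumerate(frames)
            if zlib.crc32(frame) != golden_crcs[i]]

# Toy configuration memory: 4 frames of 8 bytes each (sizes are arbitrary).
golden_frames = [bytes([i] * 8) for i in range(4)]
golden_crcs = frame_crcs(golden_frames)

# Simulate an SEU: flip bit 3 of byte 5 in frame 2.
live_frames = [bytearray(f) for f in golden_frames]
live_frames[2][5] ^= 1 << 3

upset = find_upset_frames(live_frames, golden_crcs)
print("frames needing repair:", upset)
```

Storing four bytes per frame instead of the full frame is what saves space in this sketch.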
Scrubbing and Readback with Compare
The periodic refresh of the FPGA configuration memory is called configuration scrubbing (or simply scrubbing). Continuous scrubbing ensures that a single upset is present for no longer than the time it takes to refresh the entire FPGA configuration memory. Alternatively, the configuration bitstream may be read back and compared to a golden copy, with the configuration refreshed only when an error in the bitstream is detected. This is the preferred method, as reading the configuration memory is faster than writing to it. This procedure is called readback with compare, and it is also often referred to as scrubbing.
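Readback with compare can be sketched in a few lines. This is a software model only: the configuration memory is represented as a list of byte frames, and `readback_and_compare` is a hypothetical name, not a real configuration-port API.

```python
def readback_and_compare(config_mem, golden):
    """Read every frame and rewrite only those that differ from the golden copy.

    Returns the indices of the frames that were repaired.
    """
    repaired = []
    for i, good in enumerate(golden):
        if config_mem[i] != good:            # readback and compare
            config_mem[i] = bytearray(good)  # rewrite just this frame
            repaired.append(i)
    return repaired

# Toy memory: 6 frames of 4 bytes each (sizes are arbitrary).
golden = [bytes([0xA5] * 4) for _ in range(6)]
config_mem = [bytearray(f) for f in golden]

config_mem[1][0] ^= 0x01   # simulated upset in frame 1
config_mem[4][3] ^= 0x80   # simulated upset in frame 4

first_pass = readback_and_compare(config_mem, golden)
second_pass = readback_and_compare(config_mem, golden)
print(first_pass, second_pass)
```

In this model the first pass writes only the two upset frames, and the second pass finds nothing to repair, so no writes occur while the memory is clean.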
Although scrubbing ensures that the configuration bitstream is free of errors, there is a period of time between the moment an upset occurs and the moment it is repaired during which the FPGA configuration is incorrect, and the design may not function correctly during that time. To completely mitigate the errors caused by SEUs, scrubbing must be used in conjunction with another form of mitigation that masks the faults in the bitstream. The most popular of these techniques is triple modular redundancy (TMR). TMR is a well-known technique that masks any single-bit fault in the configuration bitstream. Combined with scrubbing, which ensures that no more than one configuration bit is upset at any point in time, TMR can completely mask the effects of SEUs.
Triple Modular Redundancy
TMR is implemented by creating three identical copies of a module and feeding their outputs into a majority voter, which outputs the value on which at least two of the three copies agree. Thus, if one of the three modules fails and produces an incorrect result, the majority voter still outputs the correct result, since the other two modules' outputs agree. TMR will not fail unless two of the modules have failed. TMR can also protect against failures in the user-defined memory (flip-flops, etc.), which scrubbing alone is unable to do.
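The voting step can be modeled in software as a bitwise 2-of-3 majority function. The sketch below is illustrative only: `tmr`, `add_one`, and the `fault_mask` parameter (which corrupts the output of one copy) are names invented for this example.

```python
def majority(a, b, c):
    """Bitwise 2-of-3 majority vote over three integers."""
    return (a & b) | (a & c) | (b & c)

def tmr(module, x, fault_mask=0):
    """Run three copies of `module` and vote on their outputs.

    `fault_mask` is XORed into the output of copy 0 to simulate a fault.
    """
    out0 = module(x) ^ fault_mask
    out1 = module(x)
    out2 = module(x)
    return majority(out0, out1, out2)

def add_one(x):
    """Example module: an 8-bit incrementer."""
    return (x + 1) & 0xFF

clean = tmr(add_one, 41)                            # no fault
masked = tmr(add_one, 41, fault_mask=0b0001_0000)   # one faulty copy

# Two copies failing the same way defeats the voter.
double_fault = majority(42 ^ 0b0001_0000, 42 ^ 0b0001_0000, 42)
print(clean, masked, double_fault)
```

With one faulty copy the voted result matches the fault-free result; with two copies failing identically, the voter passes the wrong value through.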
To fully protect against errors, majority voters must also be inserted into all of the feedback loops in a design. The voter ensures that, even if the data in the feedback loop is incorrect at a point in time, which will occur if an SEU affects one of the three branches in a feedback loop, the other two branches will outweigh the incorrect third. In this way, the output of the erroneous branch will be corrected in the next clock cycle. In addition, since the voters on an FPGA are just as susceptible to SEUs as the original logic, the voters in the feedback loops should be triplicated to eliminate any single points of failure.
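The effect of voters in a feedback loop can be seen in a small cycle-level model of a triplicated counter. This is a sketch under simplifying assumptions: each clock cycle is one function call, the three voters are modeled as three separate evaluations of `majority`, and `step_voted` and `step_unvoted` are hypothetical names.

```python
def majority(a, b, c):
    """Bitwise 2-of-3 majority vote over three integers."""
    return (a & b) | (a & c) | (b & c)

def step_voted(regs):
    """One clock cycle of a triplicated 8-bit counter with voted feedback.

    Each domain gets its own voter (three voters in total), so no single
    voter is a single point of failure in this model.
    """
    voted = [majority(*regs) for _ in range(3)]   # one voter per domain
    return [(v + 1) & 0xFF for v in voted]

def step_unvoted(regs):
    """Same counter without voters in the feedback path."""
    return [(r + 1) & 0xFF for r in regs]

regs = [0, 0, 0]
for _ in range(5):
    regs = step_voted(regs)        # all three domains count together

regs[1] ^= 0b0100_0000             # SEU flips a bit in domain 1's register
upset_state = list(regs)

recovered = step_voted(upset_state)     # bad branch is outvoted
diverged = step_unvoted(upset_state)    # bad branch stays wrong
print(upset_state, recovered, diverged)
```

With voted feedback the three registers agree again one cycle after the upset. Without it, the corrupted domain never resynchronizes, and a later upset in a second domain would defeat the output voter.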
The obvious downside to TMR is the area cost. A design with full TMR applied will consume at least three times the area of the original circuit. In exchange, TMR offers complete SEU mitigation and is simple to implement. The implementation is simple enough to be automated in a tool that can convert any generic FPGA design into a triplicated one. We have developed such a tool at BYU and are currently researching improvements to it. My research is currently focused on this area.