External SDRAM memory in FPGA based design.
Key words: FPGA, soft-core, soft-errors, fault-tolerance, SDRAM
FPGA technology has advanced to the point where even smaller devices from low-cost families, such as the Xilinx Spartan-6 (the focus of this paper), have enough logic resources to implement complex functionality and to process relatively large amounts of data with sophisticated algorithms.
One of the limiting factors for FPGA-based systems is the need for memory. The second smallest device of the Spartan-6 family, with 1430 slices (1 slice = 8 flip-flops + 4 LUTs), is large enough to implement a soft-core processor but contains only 64 KB of block RAM. The amount of memory grows with device size, up to 512 KB in the largest device with 23 k slices. Applications, however, usually need memory and logic resources in a very different proportion. The required amount of memory is often so high that it simply cannot be satisfied by internal memory blocks. In these cases external memory is the only solution.
The Spartan-6 family integrates at least two (in larger devices, four) hard-wired embedded memory controllers (MC) in all devices except the smallest one, which saves a lot of resources and simplifies the design significantly. Each controller supports memories up to DDR3, clocked at up to 400 MHz, over a 16-bit bus (ECC memories are not supported). The peak data throughput per controller is therefore 1.6 GB/s. The maximum capacity of the memory chip connected to one controller is 4 Gb. This capability significantly widens the range of applications in which the Spartan-6 family can be deployed.
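The 1.6 GB/s figure follows directly from the bus parameters. A minimal sketch of the arithmetic in Python, assuming two transfers per clock cycle (double data rate) and 1 GB = 10^9 bytes; the function name is illustrative only:

```python
def peak_throughput_bytes_per_s(clock_hz, bus_width_bits, transfers_per_clock=2):
    """Peak throughput of a DDR interface in bytes per second.

    Assumption: data is transferred on both clock edges (transfers_per_clock=2).
    """
    return clock_hz * transfers_per_clock * bus_width_bits // 8


# 400 MHz clock, 16-bit bus, as stated in the text
peak = peak_throughput_bytes_per_s(400_000_000, 16)
print(peak / 1e9, "GB/s")
```

This is a theoretical peak; sustained throughput depends on the access pattern and is not estimated here.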
High reliability and safety integrity are often required in current applications. This paper focuses on the impact of using external SDRAM on the reliability of an FPGA design. Replacing a memory subsystem based on internal SRAM with one based on external SDRAM must be analyzed from three different perspectives. First of all, external memory brings an additional risk of hardware malfunction into the system. This risk is quantified in Section 2.
Radiation-induced effects that can alter the state of memory elements (SEE, Single Event Effects), also called soft errors, pose another source of risk. A good overview of the topic can be found in (Adell & Allen, 2008) or in Actel's documents (***, 2007). Soft errors can cause data corruption in internal or external memories, corruption of the FPGA's configuration, or even persistent malfunction. Soft errors in data memories are the subject of the second perspective of the comparison, which is discussed in Section 3.
The third perspective is the architecture of the memory subsystem. If high reliability or safety integrity is required, fault tolerance must be built into the memory subsystem. It must be able to cope with data corrupted by soft errors in data or configuration memories and, in the ideal case, with hardware errors of the external memory. These issues are discussed in Section 4.
2. HARDWARE RISKS
If new components are added to the system, they automatically become sources of failures and the overall failure rate necessarily increases.
The hardware failure rate of the SDRAM chip itself can be estimated at 70 FIT. The memory must be accompanied by approximately 40 resistors at 0.5 FIT each and 20 capacitors at 1 FIT each. Failure rates are estimated according to Siemens SN 29500-2005-1 for the Ground Fixed environment and an ambient temperature of 55 degrees.
A further increase in failure rate is caused by the additional PCB wiring. A rough estimate using the FIDES 2009 method is 20 FIT. It is assumed that the rest of the PCB does not have to be redesigned to a higher construction class and that no layers have to be added.
This estimate is rather pessimistic, because e.g. a capacitor malfunction is unlikely to cause memory failure, but it gives us rough starting data. Integrating one SDRAM chip adds roughly 130 FIT to the system failure rate.
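The 130 FIT total can be reproduced by summing the individual contributions given above. The sketch below does this in Python and, as an added illustration, converts the result to a mean time between failures using the definition 1 FIT = one failure per 10^9 device-hours. The dictionary and function names are illustrative only.

```python
# (count, FIT per item) for each contribution quoted in the text
FIT_BUDGET = {
    "sdram_chip": (1, 70.0),
    "resistors": (40, 0.5),
    "capacitors": (20, 1.0),
    "pcb_wiring": (1, 20.0),
}


def total_fit(budget):
    """Sum the failure rates of all items (series reliability model)."""
    return sum(count * fit for count, fit in budget.values())


def mtbf_hours(fit):
    """Convert a failure rate in FIT to mean time between failures in hours."""
    return 1e9 / fit


added_fit = total_fit(FIT_BUDGET)
print(added_fit, "FIT")
print(round(mtbf_hours(added_fit)), "hours MTBF for the added parts alone")
```

Summing the rates treats every part failure as a memory failure, which is the pessimistic assumption noted in the text.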
3. SOFT ERRORS
A soft error occurs when a radiation particle strikes a sensitive area of a memory element. Both SRAM and SDRAM may be affected. Events that affect memory are usually classified into three categories: single bit upset (SBU), when only one bit in a word is corrupted; multiple bit upset (MBU), when more bits in one data word are corrupted; and single event functional interrupt (SEFI), when the memory is rendered inoperable until a power cycle.
Relevant experimental data for these effects can be found in (Borucki et al., 2008; Borucki et al., 2007), where DDR and DDR2 SDRAM memories are studied. We can assume very similar results for DDR3 memory. According to these sources we can expect about 100 FIT/Gb for SBUs and, more importantly, approximately the same rate of MBUs per chip. MBUs are caused by upsets of the control logic, which manifest themselves as errors in thousands of bits. This kind of corruption is more dangerous, because its detection and correction require more redundancy.
The Xilinx quality report for Q2 2011 states that the soft error rate for block RAM (the internal SRAM memory) of the Spartan-6 family is 381 FIT/Mb. According to Xilinx, the MBU rate for Virtex-4 is approximately 3 % of the soft error rate (11 FIT/Mb). We can expect similar results for newer families. If we normalize the soft error rates to 1 Mb, we can see that SDRAMs are significantly less susceptible to SBUs, but their MBU rate is significantly higher.
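The normalization to 1 Mb can be written out explicitly. The following Python sketch uses only the rates quoted above and assumes 1 Gb = 1024 Mb; the variable names are illustrative. It covers the SBU comparison and the block RAM MBU share only, since the SDRAM MBU rate is given per chip rather than per bit.

```python
MBIT_PER_GBIT = 1024  # assumption: binary gigabit

# Figures quoted in the text
SDRAM_SBU_FIT_PER_GB = 100.0   # about 100 FIT/Gb for SDRAM SBUs
BRAM_SER_FIT_PER_MB = 381.0    # Spartan-6 block RAM soft error rate
BRAM_MBU_FRACTION = 0.03       # MBU share reported for Virtex-4

sdram_sbu_fit_per_mb = SDRAM_SBU_FIT_PER_GB / MBIT_PER_GBIT
bram_mbu_fit_per_mb = BRAM_SER_FIT_PER_MB * BRAM_MBU_FRACTION
sbu_ratio = BRAM_SER_FIT_PER_MB / sdram_sbu_fit_per_mb

print(f"SDRAM SBU rate:     {sdram_sbu_fit_per_mb:.3f} FIT/Mb")
print(f"Block RAM MBU rate: {bram_mbu_fit_per_mb:.2f} FIT/Mb")
print(f"Block RAM / SDRAM soft error ratio per Mb: {sbu_ratio:.0f}")
```

Under these assumptions the SDRAM SBU rate comes out at roughly 0.1 FIT/Mb, several thousand times lower than the block RAM figure.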
4. MEMORY SUBSYSTEM ARCHITECTURE
If reliability is a concern, the main goal of the memory subsystem is to mask as many errors as possible and, where that is not possible, at least to detect them and give the system a chance to react and recover. These fault-tolerant functions should be transparent to the rest of the application and should minimize resource overhead and performance impact.
Several failure modes must be taken into account. Non-persistent single and multiple bit corruptions are the most probable ones. Less frequent, but more serious, is failure of the memory as a whole, i.e. a HW failure (short circuit, SEFI, etc.).
A common memory subsystem for soft-core processors in highly reliable systems uses internal SRAM combined with information redundancy in the form of ECC (Error Correction Code); (Ichinomiya, 2010) is an example.
If we simply replace the internal memory with an MC and external memory, we get the scheme in Fig. 1a. SBU/MBU immunity is determined by the ECC strength. For an ECC able to correct 1-bit errors and detect 2-bit errors within one 128-bit word (SEC-DED), with 9 bits of ECC, the protected data can be 16 times bigger than the internal ECC memory. A long code word increases the MBU probability; if each 16-bit word is covered by ECC separately, the ratio decreases to 1:3. This may not be a problem, because in some cases not all data need to be protected (e.g. picture data in an image processing application). HW failure can only be detected.
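The 9 check bits quoted for a 128-bit word match the minimum for an extended Hamming (SEC-DED) code. The Python sketch below computes that minimum for a given word length; it assumes a Hamming-based code, which the paper does not name explicitly, and the function name is illustrative.

```python
def secded_check_bits(data_bits):
    """Minimum number of check bits for a Hamming SEC-DED code.

    Single-error correction needs the smallest r with 2**r >= data_bits + r + 1;
    one extra overall parity bit adds double-error detection.
    """
    r = 0
    while (1 << r) < data_bits + r + 1:
        r += 1
    return r + 1


for width in (16, 128):
    check = secded_check_bits(width)
    print(f"{width}-bit word: {check} check bits, "
          f"data/check ratio {width / check:.1f}")
```

The raw data-to-check ratios come out at about 14 for 128-bit words and below 3 for 16-bit words. These are close to the rounded ratios in the text, which presumably also reflect how check bits are packed into the ECC memory.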
In Fig. 1b the internal ECC memory is replaced by a block in the external memory dedicated to ECC. This removes the limit on the size of the protected area. The memory layout must be carefully designed together with the caching mechanism and the ECC length, because random accesses limit throughput. ECC placement and length, together with the requirement for system transparency, determine the necessary data volume overhead.
[FIGURE 1 OMITTED]
If the hardware failure rate is too high or stronger MBU protection is needed, another external memory must be used to form a duplex redundant system (Fig. 1c). This configuration makes it possible to detect a malfunction of one channel as well as all SBUs and MBUs. The probability of an undetected SBU/MBU is negligible, because data are stored twice in different chips. A malfunction of one channel can be tolerated only if the malfunctioning channel can be identified, e.g. from MC behavior. In this degraded state, however, SBUs and MBUs cannot even be detected. A further advantage of this scheme is doubled bandwidth for unprotected data.
All failure modes of one channel can be tolerated only if there is a diagnostic mechanism that can determine which channel is malfunctioning. This requires additional information redundancy.
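The limitation of plain duplication can be illustrated with a small software model. The Python sketch below is an assumption-laden illustration, not the hardware design: two byte arrays stand in for the two memory channels, and the class and exception names are hypothetical. A read compares both copies; a mismatch is detected, but nothing indicates which copy is correct.

```python
class ChannelMismatchError(Exception):
    """Raised when the two copies of a word differ."""


class DuplexMemory:
    """Toy model of two redundant channels holding identical data."""

    def __init__(self, size):
        self.chan_a = bytearray(size)
        self.chan_b = bytearray(size)

    def write(self, addr, value):
        # Every write goes to both channels
        self.chan_a[addr] = value
        self.chan_b[addr] = value

    def read(self, addr):
        a, b = self.chan_a[addr], self.chan_b[addr]
        if a != b:
            # Error detected, but the faulty channel cannot be identified
            raise ChannelMismatchError(f"mismatch at address {addr}")
        return a


mem = DuplexMemory(16)
mem.write(3, 0x5A)
mem.chan_b[3] ^= 0x01          # inject a single bit upset in channel B
try:
    mem.read(3)
except ChannelMismatchError as exc:
    print("detected:", exc)
```

Correcting the error would need extra information about which copy to trust, which is the additional redundancy referred to above.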
Fig. 1d shows one possible scheme that allows detection and correction of a HW malfunction of one of the channels, an SBU, an MBU of any number of bits, or an MC malfunction. In the degraded state, when one channel is not operational, data are still protected by ECC against SBUs or MBUs, depending on the ECC strength. Another advantage is that the ECC does not have to be checked during a read operation. Comparing the two copies of the data is sufficient to detect any error, and the ECC is needed only to correct it. Reading can be accelerated this way.
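The read path described here, compare first and consult the check code only on a mismatch, can be sketched in Python. This is a simplified illustration under stated assumptions: CRC-32 from the standard `zlib` module stands in for the ECC, so the check code is used only to select the intact copy rather than to correct bits, and the function names are hypothetical.

```python
import zlib


def store(data):
    """Return one stored copy: the data plus its check code (CRC-32 stand-in)."""
    return (bytes(data), zlib.crc32(data))


def read_protected(copy_a, copy_b):
    """Read from two redundant copies, checking codes only on mismatch."""
    data_a, check_a = copy_a
    data_b, check_b = copy_b
    if data_a == data_b:
        return data_a                      # fast path: no code check needed
    if zlib.crc32(data_a) == check_a:
        return data_a                      # copy A is intact, B is corrupted
    if zlib.crc32(data_b) == check_b:
        return data_b                      # copy B is intact, A is corrupted
    raise ValueError("both copies corrupted")


good = store(b"\x12\x34")
corrupted = (b"\x12\x35", good[1])         # one flipped bit, stale check code
print(read_protected(good, corrupted))     # returns the intact copy
```

With a real correcting code, the check bits would additionally protect data in the degraded single-channel state, which this stand-in does not model.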
The scheme in Fig. 1e stores the ECC in internal memory. Its only advantage over the previous scheme is maximum throughput. Error coverage is practically the same as in the previous case.
Several schemes of external memory subsystems have been proposed. Tab. 1 summarizes their properties from the reliability point of view.
Further research will evaluate the performance of the proposed schemes in terms of throughput.
This research has been kindly supported by ARTEMIS projects POLLUX (project no. 100205) and Internet of Energy for Electric Mobility (project no. 269374).
Adell, P. & Allen, G. (2008). Assessing and Mitigating Radiation Effects in Xilinx FPGAs, Available from: http://trsnew.jpl.nasa.gov Accessed: 2011-08-29
Borucki, L.; Schindlbeck, G. & Slayman, Ch. (2008). Comparison of Accelerated DRAM Soft Error Rates Measured at Component and System Level, Proceedings of Reliability Physics Symposium, 2008. IRPS 2008, pp. 482-487, IEEE International, 2008-May
Borucki, L.; Schindlbeck, G. & Slayman, Ch. (2007). Impact of DRAM process technology on neutron-induced soft errors, Proceedings of Integrated Reliability Workshop Final Report, 2007. IRW 2007, pp. 143-146, IEEE International, 2007-Oct
Ichinomiya, Y.; Tanoue, S.; Amagasaki, M.; Iida, M.; Kuga, M. & Sueyoshi, T. (2010). Improving the Robustness of a Softcore Processor against SEUs by Using TMR and Partial Reconfiguration, Proceedings of the 2010 18th IEEE AISFPCCM, pp. 482-487, IEEE Computer Society Washington, DC, USA, ISBN: 978-0-7695-4056-6
*** (2007). Actel Corporation, Single-Event Effects in FPGAs, Available from: http://www.actel.com Accessed: 2011-08-29
Tab. 1. Summary of proposed architectures (W = whole word; D = no. of detectable bit errors; C = correctable; T = tolerated; UD = undetectable; UC = uncorrectable; (1) = related to system without redundancy; (2) = Ichinomiya, 2010; (3) = considered mean time to repair 10 h)

Scheme   | Soft-error | HW-error | Overhead (1)            | Failure rate UD | Failure rate UC
         | D    C     | D    T   |                         |                 |
Int. (2) | 2    1     | na   na  | ECC M.                  | <11 FIT/Mb      | 11 FIT/Mb
a        | 2    1     | Y    N   | ECC M.                  | <0.1 FIT/Mb     | 0.1 FIT/Mb + 1x 130 FIT/Chip
b        | 2    1     | Y    N   | 1/3                     | <11 FIT/Mb      | 11 FIT/Mb + 1x 130 FIT/Chip
c        | W    0     | Y    Y   | 1/2 (One chip)          | Neg.            | 0.2 FIT/Mb + 2x 130 FIT/Chip
d        | W    W     | Y    Y   | 4/6 (One chip)          | Neg.            | Neg. (3)
e        | W    W     | Y    Y   | 1/2 (One chip) + ECC M. | Neg.            | Neg. (3)
|Author:||Kvas, Marek; Valach, Sobeslav; Cervinka, Ludek|
|Publication:||Annals of DAAAM & Proceedings|
|Date:||Jan 1, 2011|