
An efficient low power log based FPU design for FPGAs.


Many real-world problems are complex and demand sophisticated hardware, especially for floating point computations. The complexity of these computations grows with each stage of technological development, which calls for a dedicated hardware unit and led to the invention of the floating point unit (FPU). Since then, much work has been done to arrive at an optimal FPU model. Because field programmable gate arrays (FPGAs) are flexibly programmable hardware, they make it easy to evaluate and prototype FPUs (Dhandapani and Ramachandran 2014).

Although powerful FPGAs are now fabricated with embedded cores such as block random access memory (RAM) and digital signal processing (DSP) cores, their performance still falls short because they lack efficient floating point handling. As most scientific work uses floating point representation, FPGAs have to be customized for floating point operations. Examples include the optimized FPU for hybrid FPGAs suggested by Yu et al. (2012), the island-style architecture with embedded FPUs recommended by Beauchamp et al. (2006, 2008), and the configurable multimode FPU for FPGAs by Chong and Parameswaran (2011). Performance improvements and optimizations of these models have been studied and applied in each successive generation.

All of these FPUs follow a standard representation of floating point numbers and are dedicated mainly to floating point computation. A detailed study of existing FPUs shows that their operation units are essentially integer computation units, slightly modified to cope with floating point standards. The study further shows that the integer-style multipliers used in these FPUs consume more power because of their higher operation strength, and that they suffer from poor accuracy due to truncation. Various floating point multipliers have been suggested in the meantime, including the single precision multiplier of Even et al. (2000) and the quad precision multiplier of Akkas and Schulte (2003), but all of them suffer from these problems. A completely new FPU model was therefore proposed (Harish et al., 2013) using log look-up tables (LUTs), which applies the logarithmic principle to achieve good accuracy with reduced power consumption. That model, however, has serious drawbacks: increased delay and additional memory for handling the log LUT. Furthermore, although logarithmic number systems (LNS) are popular in low power circuits, adopting the scheme in the model introduces area overhead because addition is complex in LNS. These factors degrade both area and speed. In this paper we therefore propose a separate log coder in place of the LUTs and redesign the overall FPU for a tuned, partial use of LNS with a new switching circuit, to achieve optimized performance. Applying LNS to multiplication reduces the strength of the operation, and log coding also reduces the strength of the input operands, so the partial use of LNS directly reduces power by avoiding unwanted signal activity. From Paul et al. (2009), the log word of the mantissa is given as

$$\log_2(1 + m) \approx m + a + (b - a)\,\frac{n_1}{2^{k-t}}, \qquad 0 \le m < 1. \tag{1}$$

with approximated error of

$$\text{Error} \approx \log_2(1 + m) - m \tag{2}$$

Here m is the mantissa value, t is the number of bits taken after the leading one of the mantissa, a is the error value for the first t bits, b is the value adjacent to a, k is the total number of most significant mantissa bits used, and n1 is the decimal value of the remaining k - t bits.
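Equations (1) and (2) can be illustrated numerically. The sketch below is a minimal Python illustration, not the hardware of Paul et al. (2009). It assumes that the error table holds ideal values of log2(1 + m) - m sampled at the segment boundaries selected by the first t bits. The function names and the parameter choices t = 4 and k = 12 are hypothetical.

```python
import math


def mitchell_error(m):
    """Error of the plain approximation log2(1 + m) ~ m, as in equation (2)."""
    return math.log2(1.0 + m) - m


def build_error_table(t):
    """Assumed LUT: error values at the 2**t segment boundaries (plus the end point)."""
    return [mitchell_error(i / 2 ** t) for i in range(2 ** t + 1)]


def log2_interp(m, table, t, k):
    """Approximate log2(1 + m) for 0 <= m < 1 following equation (1)."""
    top = int(m * 2 ** k)                # the k most significant mantissa bits
    index = top >> (k - t)               # first t bits select the LUT entry
    n1 = top & ((1 << (k - t)) - 1)      # remaining k - t bits as an integer
    a = table[index]                     # error value for the first t bits
    b = table[index + 1]                 # adjacent error value
    return m + a + (b - a) * n1 / 2 ** (k - t)
```

With these assumed parameters the interpolated log word stays well within 0.002 of the true logarithm over the whole mantissa range, whereas the uncorrected approximation m is off by up to about 0.086.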

As the log word is obtained after rounding, the width of the generated log code has a major impact on the accuracy of the model. The width of the log code is therefore also taken into account in the design, and a log code of optimal length is integrated into the FPU. The proposed FPU also has to follow a standard for general use, since several floating point standards, including single precision and double precision, are in use. We mainly discuss the IEEE754 (2008) single precision floating point representation, which is adopted throughout this paper. In IEEE754 a real number X is represented in 32 bits with a sign bit (s) followed by eight exponent bits (E) and twenty-three mantissa bits (m), and is given by

$$X = (-1)^{s} \times 1.m \times 2^{E-127}, \qquad 0 \le m < 1 \tag{3}$$
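The field layout of equation (3) can be checked with a few lines of Python. This is a minimal sketch for normalized numbers only (zeros, subnormals, infinities and NaNs are not handled), and the function names segregate and value are hypothetical.

```python
import struct


def segregate(x):
    """Split a number into IEEE754 single precision fields (s, E, m)."""
    bits = struct.unpack(">I", struct.pack(">f", x))[0]
    s = bits >> 31               # 1 sign bit
    E = (bits >> 23) & 0xFF      # 8 exponent bits, biased by 127
    m = bits & 0x7FFFFF          # 23 mantissa bits
    return s, E, m


def value(s, E, m):
    """Rebuild the real number from its fields using equation (3)."""
    return (-1) ** s * (1 + m / 2 ** 23) * 2.0 ** (E - 127)
```

For example, -6.5 equals -1.625 x 2^2, so it splits into s = 1, E = 129 and a mantissa field of 0x500000, which is 0.625 scaled by 2^23.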

1. Related Work:

As floating point operations grow in importance, research on enhanced hardware for them grows as well. Many works propose FPU models, each bringing some enhancement to the design. This section gives a brief survey to make clear why the proposed model is needed. The survey runs from floating point representation standards to embedding FPU cores in FPGAs, and also covers studies of the hardware complexity of logarithmic circuits, since the proposed model uses LNS to improve performance.

Floating point numerals differ from ordinary integers, so the usual numerical representations cannot be used for floating point computation. Several representations were suggested for floating point numbers, which led to confusion. Some standard forms were therefore universally accepted for general understanding, and they are described in detail in the IEEE/ANSI 754 standard (2008). Although earlier researchers suggested and used many floating point forms, our work uses only the IEEE754 single precision standard because it is easy to use at the experimental level. As floating point computations are difficult to perform with ordinary arithmetic units, specially designed hardware is needed for dedicated floating point operations, which has prompted a great deal of research on efficient FPUs.

Previous works are briefly discussed here to explain the concept of FPUs and the shortcomings of existing models. Beauchamp et al. (2006) proposed a model, and later an updated version (Beauchamp et al., 2008), for an embedded FPU that used an island-style architecture to achieve speed and competed well with the earlier model of Even et al. (2000). Its shortcomings are large area consumption and poor accuracy, as a considerable number of bits are truncated by the integer-style Wallace tree multiplier structures. Later, Ho et al. (2007, 2009) designed a hybrid FPGA for floating point applications that addressed the high area utilization of floating point hardware. This model used 25 times less area than previous works, showed a good delay improvement and outperformed the model of Ye and Rose (2006), but the authors gave no details of the power consumed in the FPGA and did not discuss accuracy. Chong and Parameswaran (2009, 2011) suggested a configurable style for embedding FPUs in FPGAs. It achieved good improvements in area and delay but, like the other models, reported nothing on accuracy or power. Yu et al. (2012) then suggested an optimization of FPUs for hybrid FPGAs, which used coarse-grained floating point units with word blocks, look-up tables (LUTs) and registers for floating point operations and attained improved area, delay and throughput.

As accuracy is an important factor in numerical arithmetic, much research has also addressed the accuracy of floating point computation. Notably, Martinelli et al. (1976), Paliouras et al. (1999) and Hannington (1980) suggested various improvements in floating point accuracy. As the suggested model adds log coded values, the study was extended to the work of Chen (2009), which analyzed the error in LNS addition and subtraction and showed that the errors in LNS are very small compared with the errors reported in floating point arithmetic works.

As the suggested design uses LNS, log based hardware was also surveyed. Most binary logarithm models use the Mitchell (1962) approximation because of its simplicity, but its accuracy is limited to about 3.5 bits. Various works tried to reduce this error, most notably the LUT based method of Brubaker and Becker (1975), which was further enhanced and modified by Paul et al. (2009) in their hardware for logarithm and antilogarithm computation. They combined a LUT based approach with linear interpolation to achieve good accuracy with low area. The essence of that approach is adopted in our model, together with the rounding based decimal floating point (DFP) antilogarithm of Chen et al. (2012), to achieve good accuracy and to reduce area overhead. We compare our LNS based model with the above works to establish its novelty and performance. Although the previous works report area reductions and delay improvements, and some deal with arithmetic accuracy, none discusses the power consumption of FPUs. One of the preceding models (Harish et al., 2013) was therefore redesigned in the same environment for the power comparison.

2. Conventional Floating Point Computations:

Conventional floating point computations are performed with units similar to integer arithmetic units, with slight modifications to the computation logic, as shown in Fig. 1. As the floating point representation has three individual parts, each part needs its own computation procedure. Of the three, most of the arithmetic complexity lies in the mantissa operations, because the mantissa terms carry the actual arithmetic, which is performed with respect to the sign and exponent parts.

For example, floating point adders use an adder structure such as a carry look-ahead, ripple carry or carry save adder to add the mantissa terms, but the addition has to be done in accordance with the sign and exponent bits. Let X1(s1, e1, m1) and X2(s2, e2, m2) be two floating point numerals, and let X3(s3, e3, m3) be the result of adding X1 and X2. The computation is basically done in two steps.

$$s_3 = s_1 \oplus s_2 \tag{4}$$

$$e_3 = \max\{e_1, e_2\}; \quad d = |e_1 - e_2| \tag{5}$$

$$m_3 = \{m_1\}_d \; \{\pm\}_{s_3} \; \{m_2\}_d \tag{6}$$
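Equations (4) to (6) can be read as the following software sketch. It is a simplified illustration under stated assumptions, not the circuit of Fig. 1: operands are normalized, nonzero single precision values, results stay in the normal range, and bits shifted out during alignment are truncated rather than rounded. The xor of the signs selects addition or subtraction as in equation (6). Taking the sign of the result from the larger aligned significand is our assumption. The names unpack, pack and fp_add are hypothetical.

```python
import struct


def unpack(x):
    """Split a number into single precision fields (sign, exponent, mantissa)."""
    bits = struct.unpack(">I", struct.pack(">f", x))[0]
    return bits >> 31, (bits >> 23) & 0xFF, bits & 0x7FFFFF


def pack(s, e, m):
    """Concatenate the three fields back into a single precision value."""
    return struct.unpack(">f", struct.pack(">I", (s << 31) | (e << 23) | m))[0]


def fp_add(x1, x2):
    s1, e1, m1 = unpack(x1)
    s2, e2, m2 = unpack(x2)
    sig1 = (1 << 23) | m1            # restore the hidden leading one
    sig2 = (1 << 23) | m2

    # Exponent step, equation (5): keep the larger exponent, align by d.
    e3 = max(e1, e2)
    d = abs(e1 - e2)
    if e1 < e2:
        sig1 >>= d                   # truncating shift of the smaller operand
    else:
        sig2 >>= d

    # Mantissa step, equations (4) and (6): the sign xor selects add or subtract.
    if s1 ^ s2:
        if sig1 >= sig2:             # comparator picks the result sign (assumed)
            sig3, s3 = sig1 - sig2, s1
        else:
            sig3, s3 = sig2 - sig1, s2
    else:
        sig3, s3 = sig1 + sig2, s1

    if sig3 == 0:
        return 0.0

    # Normalize so that the leading one sits at bit 23 again.
    while sig3 >= (1 << 24):
        sig3 >>= 1
        e3 += 1
    while sig3 < (1 << 23):
        sig3 <<= 1
        e3 -= 1
    return pack(s3, e3, sig3 & 0x7FFFFF)
```

Even in this reduced form, the adder itself is one line, while the alignment, sign selection and normalization make up most of the logic.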

The sign bits choose the operation, whereas the exponents decide which operand bits of the mantissas are added. Apart from the addition itself, the process therefore needs extra logic, which increases the complexity. Multiplication is worse, as the multiplication patterns have to be applied to the mantissa parts with respect to their exponent parts. Let X4(s4, e4, m4) be the product of the floating point numbers X1 and X2; then the computation is done as

$$s_4 = s_1 \oplus s_2 \tag{7}$$

$$e_4 = e_1 + e_2 \tag{8}$$

$$m_4 = \{m_1\} \ll e_1 \times \{m_2\} \ll e_2 \tag{9}$$
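The multiplication steps can be sketched in the same style. This is a hedged illustration of a conventional integer-style significand multiplier, not the exact datapath of Fig. 1. It assumes normalized, nonzero operands and results in the normal range, takes the sign as the xor of the operand signs (the usual IEEE754 rule), subtracts the bias of 127 from the exponent sum so that the result stays biased as in equation (3), and truncates the low product bits. The names unpack, pack and fp_mul are hypothetical.

```python
import struct

BIAS = 127


def unpack(x):
    """Split a number into single precision fields (sign, exponent, mantissa)."""
    bits = struct.unpack(">I", struct.pack(">f", x))[0]
    return bits >> 31, (bits >> 23) & 0xFF, bits & 0x7FFFFF


def pack(s, e, m):
    """Concatenate the three fields back into a single precision value."""
    return struct.unpack(">f", struct.pack(">I", (s << 31) | (e << 23) | m))[0]


def fp_mul(x1, x2):
    s1, e1, m1 = unpack(x1)
    s2, e2, m2 = unpack(x2)

    s4 = s1 ^ s2                     # sign of the product
    e4 = e1 + e2 - BIAS              # exponents add, one bias is removed

    # 24 x 24 bit integer multiplication of the significands (48-bit result).
    product = ((1 << 23) | m1) * ((1 << 23) | m2)
    if product >> 47:                # product of significands lies in [2, 4)
        product >>= 1
        e4 += 1

    m4 = (product >> 23) & 0x7FFFFF  # the low 23 product bits are truncated
    return pack(s4, e4, m4)
```

The final shift discards the lower bits of the 48-bit product, which is the truncation that the text identifies as a source of accuracy loss in integer-style multipliers.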

Log Based FPU:

The proposed architecture for the power efficient FPU is shown in Fig. 2. The model was designed after a thorough analysis of existing FPU models, which showed that operations on floating point numerals closely resemble integer computations: they use similar multiplier and adder circuits, adapted to floating point standards. These integer-style computations result in poor efficiency and reduce the overall performance of the FPU. An alternative model for the numerical computation is therefore discussed in this paper. Moreover, floating point numbers are represented in the standard IEEE754 format, which divides the numeral into three parts and demands three individual computation procedures.

The proposed log based FPU is therefore designed with three individual computation entities to suit the IEEE754 standard. As the inputs are fed in standard format, they have to be separated into three individual data fields, and the separated fields have to be reunited at the output. Bit segregators and bit concatenators are used at the input and output ports to handle the standard floating point numerals. A xor-cum-comparator unit operates on the sign bits of the inputs. As discussed in the previous section, different computations have to be performed on the sign bits, and either the xor or the comparator module is activated by an operator switch. Similarly, the exponent bits are handled by either a bit-shifting module or an adder module, again activated by the operator switch. A log based arithmetic unit operates on the mantissa bits. The overall advantage of the proposed architecture comes from this unit, as it reduces the complexity of the mantissa computations, especially multiplication. This unit is described in detail below.

Log Based Arithmetic Unit:

The log based arithmetic unit of the proposed FPU is shown in Fig. 3. It uses the logarithmic number system (LNS), in which basic logarithmic relationships reduce the overall complexity of the arithmetic computations. As log transformation reduces the strength of both the operands and the operators, it yields a good power reduction. Moreover, all computations on the floating point numerals can then be realized with an adder circuit alone. Although these points favour adopting LNS in the proposed model, LNS has the severe disadvantage that addition becomes more complex. LNS is therefore adopted only partially: the multiplier alone is subjected to LNS, whereas the addition component follows the usual computation procedure. This separation is made by a specially designed operator switch, shown in Fig. 4 as a simplified NAND based toggle system.
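The partial use of LNS can be pictured with a small behavioural model. The sketch below is an assumption-laden illustration, not the circuit of Fig. 3: the log coder and antilog decoder are replaced by ideal floating point functions, mantissas are passed as fractions in [0, 1), and the result is left unnormalized. All names are hypothetical.

```python
import math


def log_coder(m):
    """Ideal stand-in for the log coder: log2 of the significand 1.m."""
    return math.log2(1.0 + m)


def antilog_decoder(log_word):
    """Ideal stand-in for the antilog decoder."""
    return 2.0 ** log_word


def mantissa_unit(op, m1, m2):
    """Route the mantissas through the log path or the direct path.

    op is "mul" or "add"; the returned significand is not normalized.
    """
    if op == "mul":
        # Multiplication becomes an addition of log words.
        return antilog_decoder(log_coder(m1) + log_coder(m2))
    if op == "add":
        # Addition bypasses LNS and uses the ordinary adder.
        return (1.0 + m1) + (1.0 + m2)
    raise ValueError("op must be 'mul' or 'add'")
```

In this model the multiplier path contains only one addition between the coder and the decoder, while the adder path never enters the log domain, so the costly LNS addition is avoided.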

This switching system feeds the computation network with one of two sets of data: the data in direct form for addition, or its logarithmic equivalent, generated by a log coder, for multiplication. As the input feeding system is built from NAND gates, which are immune to signal activity, the model is insensitive to unwanted signal glitches, which gives a good power reduction at the primary stage. Furthermore, as the model is symmetrical about its input and output parts, the same switching network can be used for the antilog decoding part, which saves further area in the design.
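One possible gate-level reading of such a switch is a two-input selector built only from NAND gates. The sketch below is a hypothetical realization for illustration; the actual circuit of Fig. 4 may differ. The select signal is assumed to be 0 for addition (direct data) and 1 for multiplication (log coded data), and the 23-bit word width is an assumption taken from the single precision mantissa.

```python
def nand(a, b):
    """Two-input NAND on single bits (0 or 1)."""
    return 1 - (a & b)


def nand_mux(direct_bit, log_bit, sel):
    """Select direct_bit when sel is 0 and log_bit when sel is 1, using NANDs only."""
    not_sel = nand(sel, sel)
    return nand(nand(direct_bit, not_sel), nand(log_bit, sel))


def switch_word(direct_word, log_word, sel, width=23):
    """Apply the bit selector to every bit of a mantissa-wide word."""
    out = 0
    for i in range(width):
        bit = nand_mux((direct_word >> i) & 1, (log_word >> i) & 1, sel)
        out |= bit << i
    return out
```

Because the selector is symmetric in its two data inputs, the same structure could route either the log coded or the direct result at the output side, which matches the reuse of the switching network described above.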

As multiplication can be realized with adders in LNS, log coders play an important role in the design. The design of the log and antilog coders is adopted from Paul et al. (2009), with slight modifications to the interpolator, as shown in Fig. 5, which shows a simple shift based bit coder network. The shift based log coder gives the model its overall simplicity, so the error in shifting the bits comes from the interpolation procedure. The model also has a log level checker, which stops generation once the log word reaches the required size. As the log word generated for the input directly affects the accuracy of the output, log coders of different levels were designed. They are classified into six levels, namely 6, 9, 12, 15, 18 and 21, based on the width of the generated log word. An optimum log coder is chosen from these by implementing and testing all levels for the best accuracy and minimum area. As the antilog decoder has a structure similar to the log coder, most of the area used by the log coder can be reconfigured for the antilog decoder, taking maximum pipelining efficiency into account. This in turn achieves a good area reduction and makes the proposed model well suited to embedded FPU cores for FPGAs.
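The effect of the log word width on accuracy can be explored with a small numerical experiment. The sketch below is only a software model under assumptions: the log word is an ideal logarithm rounded to the given number of fractional bits, the antilog is ideal, and the error is measured over a coarse grid of mantissa pairs. It does not reproduce the shift based coder of Fig. 5 or the measured results of this paper, and all names are hypothetical.

```python
import math

LEVELS = (6, 9, 12, 15, 18, 21)   # log word widths considered in the text


def log_code(m, width):
    """Log word of the significand 1.m, rounded to `width` fractional bits."""
    return round(math.log2(1.0 + m) * (1 << width))


def antilog(code, width):
    """Ideal antilog of a fixed point log word."""
    return 2.0 ** (code / (1 << width))


def max_product_error(width, steps=64):
    """Largest relative error of the log based product over a mantissa grid."""
    worst = 0.0
    for i in range(steps):
        for j in range(steps):
            m1, m2 = i / steps, j / steps
            exact = (1.0 + m1) * (1.0 + m2)
            approx = antilog(log_code(m1, width) + log_code(m2, width), width)
            worst = max(worst, abs(approx - exact) / exact)
    return worst


errors = {w: max_product_error(w) for w in LEVELS}
```

In this model each log word is off by at most half a unit in the last place, so the relative error of the product is bounded by 2^(2^-w) - 1 for a width of w bits. The bound shrinks by roughly a factor of eight per level, which is why the choice of level trades accuracy directly against the area of the coder.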

3. Experimental Results And Comparisons:

The experimental results of the proposed floating point unit are analyzed in this section. The analysis compares performance factors such as accuracy, resource utilization, speed and power. These factors are extracted from both the proposed logarithmic multiplier and a Wallace tree multiplier by implementing them on a standard Xilinx XC4VLX15-12SF363 device, with simulation and synthesis done in the Xilinx 12.2 synthesis tool. The results obtained are used to discuss the performance of the designed FPU. The results for the Wallace tree multiplier may differ from previously published results, this is 10


This paper presents a new approach to FPUs based on logarithmic principles. As signal activity and signal strength are the major factors in power consumption, applying LNS in the design suppresses both by using log codes, which are transformations of the inputs with less signal activity. For comparison, both the designed and the existing model were implemented in the same environment and to the same standards, to ensure a fair verification. As the log based method transforms the input into a new code, the accuracy of the output depends on that code, so the bit width of the log code has a great impact on the design. Log coders of different levels were therefore designed, and the one with the best balance of accuracy, speed and resource utilization was chosen. This optimal log coder was used in the model and implemented for the comparisons. The results clearly show that the proposed model occupies less area and gives more accurate output than the existing Wallace tree based multiplication. As the computation path of the designed model involves multiple blocks and reconfigurability among those blocks, its delay is only comparable with that of the existing design, but this is offset by the 71% accuracy achieved with 24.8% less area and a 23.5% reduction in power. The results show that the designed model is a strong competitor to existing integer-style FPUs. As the system is proposed for the first time, it can be enhanced in several ways, including modifying the log coders for still higher accuracy and fine tuning the reconfigurable architecture for lower delay. The most attractive direction for this work is the study of building log based FPU cores into FPGAs.


Article history:

Received 12 October 2014

Received in revised form 26 December 2014

Accepted 1 January 2015

Available online 25 February 2015


The authors would like to thank Harish Anand T. and S. Anith, who have completed their master's degrees in VLSI design at the Department of Electronics and Communication Engineering, College of Engineering Guindy, Anna University, Chennai, India, for their contribution towards this work.


Akkas, A. and M.J. Schulte, 2003. A quadruple precision and dual double precision floating-point multiplier, in Proceeding of the Euromicro Symposium on Digital System Design, 76-81.

Beauchamp, M.J., S. Hauck, K.D. Underwood and K.S. Hemmert, 2006. Embedded floating point units in FPGAs. in Proceeding of the ACM/SIGDA 14th international symposium on Field programmable gate arrays, Monterey, California, USA, 12-20.

Beauchamp, M.J., S. Hauck, K.D. Underwood and K.S. Hemmert, 2008. Architectural Modification to Enhance the Floating-Point Performance of FPGAs. IEEE Trans. on Very Large Scale Integr. (VLSI) systems, 16(2): 177-187.

Brubaker, T.A. and J.C. Becker, 1975. Multiplication using logarithms implemented with read-only memory. IEEE Trans. Computers, C-24(8): 761-766.

Chen, C., 2009. Error analysis of LNS addition/subtraction with direct-computation implementation. IET Comput. Digit. Tech., 3(4): 329-337.

Chen, D., L. Han and S.B. Ko, 2012. Decimal floating-point antilogarithmic converter based on selection by rounding: algorithm and architecture. IET Comput. Digit. Tech., 6(5): 277-289.

Chong, Y.J. and S. Parameswaran, 2009. Flexible multi-mode embedded floating-point unit for field programmable gate arrays. in Proceeding of the ACM/SIGDA international symposium on Field programmable gate arrays, Monterey, California, USA, 171-180.

Chong, Y.J. and S. Parameswaran, 2011. Configurable multimode embedded floating-point units for FPGAs. IEEE Trans. on Very Large Scale Integr. (VLSI) systems, 19(11): 2033-2044.

Dhandapani, V. and S. Ramachandran, 2014. Power-optimized log-based image processing system. EURASIP Journal on Image and Video Processing, 37: 1-15.

Even, G., S.M. Mueller and P.M. Seidel, 2000. A dual precision IEEE floating-point multiplier. Integration, the VLSI journal, 29(2): 167-180.

Hannington, G. 1980. Improves to binary floating-point digital differential analysers. IEE Electronics Letters, 16(9): 337-338.

Harish Anand, T., D. Vaithiyanathan and R. Seshasayanan, 2013. Optimized Architecture for Floating Point Computation Unit. in Proceeding of the Int. Conf. on Emerging Trends in VLSI, Embedded Sys., Nano Elec. and Tele. Sys., Thiruvannamalai, India, 1-5.

Ho, C.H., C.W. Yu, P.H.W. Leong, W. Luk and S.J.E. Wilton, 2009. Floating-Point FPGA: Architecture and Modeling. IEEE Trans. on Very Large Scale Integr. (VLSI) systems, 17(12): 1709-1718.

Ho, C.H., C.W. Yu, P.H.W. Leong, W. Luk and S.J.E. Wilton, 2007. Domain-specific hybrid FPGA: Architecture and floating point applications. in Proceeding of the Int. Conf. on Field Program. Logic Appl. (FPL), Amsterdam, The Netherlands, 196-201.

IEEE Standard for Binary Floating-Point Arithmetic, IEEE Std 754-2008, pp. 1-70.

Martinelli, G., G. Orlandi and M. Salerno, 1976. Adder errors versus multiplier errors in floating-point digital filters. in Proceeding of the IEE, 123(3): 207-211.

Mitchell, J.N., 1962. Computer multiplication and division using binary logarithms. IRE Trans. Electron. Computers, 11: 512-517.

Paliouras, V., K. Karagianni and T. Stouraitis, 1999. Error bounds for floating-point polynomial interpolators. IEE Electronics Letters, 35(3): 195-197.

Paul, S., N. Jayakumar and S.P. Khatri, 2009. A fast hardware approach for approximate, efficient logarithm and antilogarithm computations. IEEE Trans. on Very Large Scale Integr. (VLSI) systems, 17(2): 269-277.

Ye, A. and J. Rose, 2006. Using Bus-Based Connections to Improve Field Programmable Gate Array Density for Implementing Datapath Circuits. IEEE Trans. on Very Large Scale Integr. (VLSI) systems, 14(5): 462-473.

Yu, C.W., A.M. Smith, W. Luk, P.H.W. Leong and S.J.E. Wilton, 2012. Optimizing floating point units in Hybrid FPGAs. IEEE Trans. on Very Large Scale Integr. (VLSI) systems, 20(7): 45-65.

D. Vaithiyanathan and R. Seshasayanan

Department of Electronics and Communication Engineering, College of Engineering Guindy, Anna University, Chennai, Tamil Nadu 600025, India

Corresponding Author: D. Vaithiyanathan, Department of Electronics and Communication Engineering, College of Engineering Guindy, Anna University, Chennai, Tamil Nadu 600025, India.
COPYRIGHT 2015 American-Eurasian Network for Scientific Information
No portion of this article can be reproduced without the express written permission from the copyright holder.
Copyright 2015 Gale, Cengage Learning. All rights reserved.

Article Details
Title Annotation:floating point unit and field programmable gate arrays
Author:Vaithiyanathan, D.; Seshasayanan, R.
Publication:Advances in Natural and Applied Sciences
Article Type:Report
Date:Jun 1, 2015
