Printer Friendly

A new systolic array algorithm and architecture for the VLSI implementation of IDST based on a pseudo-band correlation structure.


The forward and inverse discrete sine transforms are important functions used in digital signal processing applications especially in image compression due to the fact that they behave very much like the statistically optimal transform known to be Karhunen-Loeve transform (KLT). It is well known [1] that the discrete sine transform (DST) [2] gives better results for low correlation images than discrete cosine transform (DCT) [3] being used in data compression applications as, for example, in JPEG compression [4].

It is well known that the forward and inverse DCT and DST are computational intensive. In many real time applications it is necessary to use hardware implementations of these transforms as the software ones are too slow. Thus, in real-time applications it is necessary to use VLSI implementations using FPGA [5] or ASIC technology [6-9]. Most of the algorithms developed for a software implementation can not be used to obtain efficient VLSI architectures. Thus, in order to obtain an efficient VLSI implementation it is necessary to restructure existing algorithms or develop new ones dedicated for an efficient hardware implementation.

In order to obtain an efficient VLSI algorithm is very important to take into consideration the way of data moving in the algorithm. This aspect represents a key element in designing an efficient algorithm with a high performance VLSI implementation. Thus, the use of regular and modular computational structures as cycle convolution and circular correlation having local and regular data communications can be very useful in obtaining a good VLSI implementation for DCT, IDCT [10-18], DST and IDST [20-23], discrete Fourier transform (DFT) [12] or discrete Hartley transform (DHT) [24-29] using both the systolic array architectural paradigm or distributed arithmetic [30-34].

In the literature there are several efficient VLSI architectures for direct and inverse DST using such computational structures [20-21], [35]. Some of them are using circular correlations and other cyclic convolutions.

In this paper we propose the use of another such modular and regular structure that has been called pseudo-band correlation. It will be shown that an efficient way to convert the inverse DST into such structures can lead to an optimal VLSI implementation using systolic arrays.

Systolic arrays [36-37] are a good architectural paradigm to be used in VLSI implementations for real-time applications. It is already well known that systolic arrays have some specific features as local and regular interconnections and a modular and regular structure being well suited for a VLSI implementation. In order to obtain efficient VLSI implementations using the systolic array architectural paradigm we have to use specific VLSI algorithms called systolic array algorithms. A systolic array algorithm has some distinctive features. Thus it uses regular and modular computational structures having a specific data dependence structure. All the data dependences in a systolic algorithm have a local and regular topology. This specific topology is very important in obtaining in the target VLSI architectures a local and regular interconnection structure specific to systolic arrays. In this paper will be shown that our proposed algorithm based on pseudo-band correlation structure can lead to an efficient VLSI algorithm that can be efficiently implemented in VLSI using systolic arrays.

The rest of the paper is organized as follows: in Section II it is presented the proposed VLSI algorithm based on a pseudo-band correlation structure. In section III it is presented an example that illustrates the proposed algorithm and in Section IV it is presented the systolic array architecture that implements the proposed algorithm. In Section V there are presented the performances of the proposed algorithm and some comparisons are made.


The 1-D N-point inverse discrete sine transform (IDST) is defined as follows:

x(k) = [N-1.summation.i = 0]Y(i) * sin[(2k + 1)i * [alpha]]; (1)

where k = 1,2,...,N and

[alpha] = [pi]/2N (2)

In order to reformulate relation (1) as a pseudo-band correlation form we introduce some auxiliary sequences and use the proprieties of the Galois Field of indexes to appropriate permute the input and output sequences.

The output auxiliary sequence {T(k):k = 1,2,...,N-1} can be computed as a pseudo-band correlation, if the transform length N is a prime number, as following:


where [< x >.sub.n] denotes the result of x modulo N.

We have used the properties of the Galois Field of indexes to convert the computation of the auxiliary output sequence {T'(k): k = 1,2,...,N-1} as a pseudo-band correlation.

The auxiliary input sequence {[[Y.sub.S](i):i = 1,2,...,N-1} is defined as following:

[Y.sub.s](i) = Y(i) * sin(i[alpha]) (4)

Finally, the output sequence can be recursively computed using the auxiliary output sequence {T (k): k = 1,2,..., N -1} as:

x'(0) = x(0) (5)

x'(k) = [(-1).sup.k] T(k) - x'(k -1), k = 1,2,..., N -1 (6)


x(0) = [N.summation over (i = 1)][Y.sub.S](i) (7)


[Y.sub.S](i) = Y(i) * sin(i[alpha]) (8)

Finally we compute the output sequence as follows:

x(k) = [(-1).sup.k] x'(k) (9)


To illustrate our approach, we will consider an example for 1-D IDST with the length N = 11 and the primitive root g = 2.

We can write (3) in matrix-vector product form as




In (11) we noted by c(k) as 2 x cos(2k[alpha]) and the sign of the items in relation (10) is given by the following matrix:


where the first bit designates the sign before the brackets and the second bit denotes the sign inside the brackets. The value of "1" indicates the minus sign (the first bit) and the subtraction operation (the second one)

As can be seen from equation (10), we have a computational structure with a special form that was calledpseudo-band correlation. The special form of this computational structure can be better seen in Fig. 1.

In Fig. 1 we can see that it is put in evidence a band structure where all the elements along the diagonal are the same. As in equation (10) there are some differences in sign we have called this structure a pseudo-band correlation structure. This special structure can be efficiently mapped on a systolic array as will be shown further on.


In Fig. 2 it is presented the linear systolic array that forms the hardware core of the proposed VLSI architecture used to implement the inverse DST.

The hardware core implements equation (10) that represents the pseudo-band correlation structure. In order to obtain the presented systolic array we have used the data dependence graph technique [38]. More details about the use of this design procedure can be found in [16]. The proposed systolic array for an inverse discrete sine transform of length N=11 consists of five processing elements, some pipeline registers and a control circuit that generate the control tag bits from above and from the right end of the systolic array. The function of the processing elements PE of the proposed systolic array is given in Fig.2(b). One processing element consists of one adder, one multiplier and some MUXs that select the appropriate input sequence and the appropriate operation, addition or subtraction, using the control bits denoted by sign. In order to control the loading of the input data into the local registers we have used another control bit denoted by tc. In equation (10) there are some differences in sign that are managed using a tag-control technique [39]. This control mechanism that is well adapted to a systolic array design procedure can be also used to control the content of the internal registers using input and output channels placed only at the two ends of the linear array.

The pre-processing and post-processing stages realize the appropriate reorder of the auxiliary input and output sequences. In the pre-processing stage we also compute the auxiliary input sequence {[Y.sub.S](i):i = 1,2,...,N-1} using equation (8). The structure of the pre-processing stage is presented in Fig. 3. It consists of a multiplier that computes the auxiliary input sequence, a permutation block that permute the input sequence in an appropriate order and two MUXs and a adder/ subtractor used to obtain the input sequences in the necessary form as shown in equation (10). The structure of the permutation block is shown in Fig. 4, where by SR we have noted a shift register on twelve bits. The permutation block consists of ten shift registers, that shift the input sequence to the left with one position on each clock, of ten latches that are used to store the shifted input sequence at the inputs of the two MUXs and a control circuit that generate the control bits necessary to load the input sequence in the latches and also to appropriate select the input data for the two MUXs with 5 inputs and one output. In the post-processing stage we recursively compute the output sequence using the equations (5) and (6) and we finally control the sign of the output sequence using equation (9). The structure of the preprocessing and post processing stage is very simple as compared with those in [20].


The average computation time is (N -1) x [T.sub.cycle]. The hardware complexity is reduced having only (N -1)/2 +1 multipliers and (N -1) / 2 + 2 adders. The hardware complexity was thus reduced as compared to [20]. Thus, low hardware and I/o costs can be obtained. We can easily improve the speed performances using a two-level pipelining mechanism with low hardware and I/o costs.

As compared with [30] the number of adders in the hardware kernel has been significantly reduced from 2N + 3 to (N -1) / 2 + 2 and also the number of multiplexers has been reduced. But the most significant improvement in the hardware complexity is done in the pre-processing and postprocessing stages. Thus, the hardware complexity of the post-processing stage has been reduced at a half and the structure of the pre-processing stage has been also reduced as compared with [20]. We have replaced also the permut&split stage with only a permutation stage and we are using only one multiplier in the preprocessing stage.

The structures proposed in [40-42] do not allow two level pipelining due to the datapath feedback. As compared with these solutions the throughput of the proposed solution is significantly increased using a two-level pipelining.

As compared with [35], where a DA based architecture is used, the throughput of our solution is also much increased when using a two-level pipelining. This is explained due to the presence of the feedback in RACs of [35].

The proposed structured has also a low I/O cost. As compared with [43] the I/O cost is significantly lower. The I/O cost can significantly limit the speed performances due to so called I/O bottleneck.

The proposed VLSI architecture for inverse discrete sine transform (IDST) has been implemented using the RTL Compiler from Cadence for an ASIC technology of 180 nm.

The resulted layout is presented in Fig.5. We have obtained an area of 481,87x481,87 [mu]m at a clock frequency of 314.96 MHz. The area figures for ASIC implementation are shown in Table I:

Type         Instances   Area        Area%

Sequential   817         62925.4     27.1
Inverter     1776        22491.7     9,7
Buffer       98          2181.715    0.9
Unresolved   1           0           0
Logic        6282        144605      62.3
Total        8974        232 204.5   100

We have also implemented the proposed architecture using a FPGA technology using ISE Design from Xilinx. We have used the FPGA family Kintex 7 and we have obtained a clock frequency of 180.54 MHz and the area figures presented in Table II and Table III.


In this paper a new VLSI array architecture for the VLSI implementation of 1-D inverse discrete sine transform is presented based on a new VLSI algorithm that uses a new computational structure called pseudo-band correlation structure that has never been used until now for a VLSI implementation of IDST. It has been shown that the proposed computational structure can be efficiently mapped on a linear systolic array that has some appealing features as a low hardware and I/O cost with high speed performances.

The obtained linear systolic array has a low hardware complexity and a high throughput being an appealing solution for the VLSI implementation of inverse discrete sine transform. It is also possible to further increase the speed performances using a two-level pipelining applied to the presented linear systolic array.

The proposed VLSI architecture is highly regular and modular and has local interconnections with a small number of I/O channels placed at the two extreme ends of the array with a reduced I/O bandwidth. Thus, the proposed VLSI architecture it is well suited for a VLSI implementation.

Digital Object Identifier 10.4316/AECE.2017.01011


[1] M. Narrroschke, "Coding efficiency of DCT and DST in hybrid video coding," IEEE Journal of Selected Topics in Signal Processing, Vol.7, No.6, pp.1062-1071, 2013. doi: 10.1109/JSTSP.2013.2272192

[2] F. Kamisli, "Lossless intra coding in HEVC with integer-to-integer DST," 24th European Signal Processing Conference (EUSIPCO 2016), pp. 2440-2444, 2016. doi: 10.1109/EUSIPCO.2016.7760687

[3] E. A. Baran, A. Kuzu, S. Bogosyan, M. Gokasan, A. Sabanovic, "Comparative analysis of a selected DCT-based compression scheme for haptic data transmission," IEEE Transaction on Industrial Informatics, Vol.12, No.3, pp.1146-1155, Dec. 2016. doi: 10.1109/ TII.2016.2555982

[4] G. K. Wallace, "The JPEG still picture compression standard", IEEE Transactions on Consumer Electronics, vol. 32, no.1, pp. 18-34, 1992. doi: 10.1109/30.125072

[5] M. Kazmi, A. Aziz, P. Akhtar, D.-S. Kundi, "FPGA Based Compact and Efficient Full Image Buffering for Neighborhood Operations," Advances in Electrical and Computer Engineering, vol.15, no.1, pp.95-104, 2015. doi:10.4316/AECE.2015.01014

[6] S.H. Bae, J. Kim, M. Kim, "HEVC-based perceptually adaptive video coding using a DCT-based local distortion detection probability model," IEEE Transaction on Image Processing, Vol.25, No.7, pp.3343-3357, July 2016. doi: 10.1109/TIP.2016.2568459

[7] R. Jia, R. Chen, C. Y. Lin, Z. Guo, H. Yang, "Low cost 1D DCT core for multimedia video codec," Chinese Journal of Electronics, Vol.25, No.6, pp. 1052-1057, June 2016, doi: 10.1049/cje.2016.10.026

[8] M. Jridi, A. Alfalou, P. K. Meher, "A generalized algorithm and reconfigurable architecture for efficient and scalable orthogonal approximation of DCT," IEEE Transactions on Circuits and Systems I; Regular Papers, Vol.62, No.2, pp.449-457, February 2015. doi: 10.1109/TCSI.2014.2360763

[9] J. Nan, N. Yu, W. Lu, D. Wang, "A DST hardware structure of HEVC," 2nd International Conf. on Information Science and Control Engineering, pp.546-549, 2015. doi: 10.1109/ICISCE.2015.127

[10] C.M. Rader, "Discrete Fourier transform when the number of data samples is prime," Proc. IEEE, vol.56, pp.1107-1108, June 1968. doi: 10.1109/PROC. 1968.6477

[11] J.I. Guo, C.M. Liu, and C.W. Jen, "A new array architecture for prime-length discrete cosine transform," IEEE Transactions on Signal Processing, vol. SP-41, no.1, pp. 436-442, 1993. doi: 10.1109/TSP.1993.193173

[12] J.I. Guo, C.M. Liu, and C.W. Jen, "The efficient memory-based VLSI array design for DFT and DCT," IEEE Transactions on Circuits and Systems II: Analog and Digital Signal Processing, vol.39, no.10, pp. 723-733, Oct. 1992. doi: 10.1109/82.199898

[13] D.F. Chiper, "A new systolic array algorithm for memory-based VLSI array implementation of DCT," Proc. Second IEEE Symp. on Computers and Communications, pp.297-301, 1997 doi: 10.1109/ISCC.1997.616015

[14] C. Cheng and K. K. Parhi, "Hardware efficient fast DCT based on novel cyclic convolution structures," IEEE Transaction on Signal Processing, vol. 54, no. 11, pp. 4419^434, Nov. 2006. doi: 10.1109/ TSP.2006.881269

[15] Y. H. Chan and W. C. Siu, "On the realization of discrete cosine transform using the distributed arithmetic," IEEE Transaction on Circuits and Systems I: Fundamental Theory and Applications, vol. 39, no. 9, pp. 705-712, Sept. 1992. doi: 10.1109/81.250161

[16] D.F. Chiper, M.N.S. Swamy, and M.O. Ahmad, "An efficient unified framework for the VLSI implementation of a prime-length DCT/IDCT with high throughput," IEEE Transactions on Signal Processing, Regular Papers, vol.55, no.6, pp. 2925-2936, June 2007. doi: 10.1109/TSP.2007.893746

[17] P.K. Meher, "Systolic designs for DCT using low-complexity concurrent convolutional formulation," IEEE Trans. on Circuits and Systems for Video Technology, vol.16(9), pp.1041-1050, Sept. 2006. doi: 10.1109/TCSVT.2006.880191

[18] D.F. Chiper and P. Ungureanu, "Novel VLSI algorithm and architecture with good quantization properties for a high-throughput area efficient systolic array implementation of DCT" EURASIP Journal on Advances in Signal Processing- Special issue on quantization of VLSI digital signal processing systems, Vol. 2011, Jan. 2011. doi: 10.1155/2011/639043

[19] J. Xie, P.K. Meher, J. He, "Hardware-efficient realization of prime length DCT based on distributed arithmetic," IEEE Transactions on Computers, Vol.62, No.6, pp.1170-1178, June 2013. doi: 10.1109/TC.2012.64

[20] D.F. Chiper, M.N.S. Swamy, M.O. Ahmad, and T. Stouraits, "Systolic algorithms and a memory-based design approach for a unified architecture for the computation of DCT/DST/IDCT/IDST," IEEE Transactions on Circuits and Systems I: Regular Papers, vol.52, no.6, pp. 1125-1137, June 2005. doi: 10.1109/ TCSI.2005.849109

[21] C.H. Chen, B.D. Liu and J.F. Yang, "Direct recursive structures for computing radix-r two-dimensional DCT/IDCT/DST/IDST," IEEE Transactions on Circuits and Systems I: Regular Papers, vol. 51, pp. 2017-2030, 2004. doi: 10.1109/TCSI. 2004. 835685

[22] P. K. Meher and M. N. S. Swamy, "New Systolic Algorithm and Array Architecture for Prime-Length Discrete Sine Transform," IEEE Transactions on Circuits and Systems II: Express Briefs, vol. 54, no. 3, pp. 262-266, March 2007. doi: 10.1109/TCSII.2006.889453

[23] I. Prots'ko, V. Teslyuk, "Computational structure of DST-II using convolvers," The Experience of Designing and Application of CAD System in Microelectronics, pp. 200-202, 2015, doi: 10.1109/ CADSM.2015.7230835

[24] D.F. Chiper, M. N. S. Swamy, and M. O. Ahmad, "An efficient systolic array algorithm for the VLSI implementation of a primelength DHT," in Proc. ISSCS 2005, vol. 1, pp. 167-169, 2005. doi: 10.1109/ISSCS.2005.1509879

[25] D.F.Chiper, "A Novel VLSI DHT Algorithm for a Highly Modular and Parallel Architecture," IEEE Transactions on Circuits and Systems-II, vol.60, no.5, pp. 282-286, 2013. doi: 10.1109/TCSII.2013.2251974

[26] D.F.Chiper, "Radix-2 Fast Algorithm for Computing Discrete Hartley Transform of Type III," IEEE Transactions on Circuits and SystemsII, vol.59, no.5, pp.297-301, 2012. doi: 10.1109/TCSII.2012.2190863

[27] D.F.Chiper, "Fast Radix-2 Algorithm for the Discrete Hartley Transform of Type II," IEEE Signal Processing Letters, vol.18, no.11, pp.687-689, 2011. doi: 10.1109/LSP.2011.2170166

[28] H. Z. Shu, J. S. Wu, C. F. Yang, L. Senhadji, "Fast Radix-3 Algorithm for the Generalized Discrete Hartley Transform of Type II", IEEE Signal Processing Letters., vol. 19, no. 6, pp. 348-351, Jun. 2012. doi: 10.1109/LSP. 2012.2194782

[29] L. Jiang, H. Shu, J. Wu, L. Wang and L. Senhadji, "A Novel SplitRadix Fast Algorithm for 2-D Discrete Hartley Transform," IEEE Transactions on Circuits and Systems I: Regular Papers, vol. 57, pp. 911-924, 2010. doi: 10.1109/TCSI. 2009.2028639

[30] S.A. White, "Applications of distributed arithmetic to digital signal processing: A tutorial review," IEEE ASSP Mag., Vol.6, no.3, pp.419, Jul.1989. doi: 10.1109/53.29648

[31] S. Yu and E. E. Swartzlander, "DCT implementation with distributed arithmetic," IEEE Transaction on Computers, vol. 50, no.9, pp. 985991, Sept. 2001. doi: 10.1109/ 12.954513

[32] M. T. Sun, T. C. Chen, and A. M. Gottlieb, "VLSI implementation of a 16 x 16 discrete cosine transform," IEEE Transaction on Circuits and Systems, vol. 36, no. 4, pp. 610-617, Apr. 1989. doi: 10.1109/31.92893

[33] Y. H. Chen, T. Y. Chang and C. Y. Li, "High Throughput DA-Based DCT With High Accuracy Error-Compensated Adder Tree," IEEE Trans. on Very Large Scale Integration (VLSI) Systems, vol. 19, no. 4, pp. 709-714, April 2011. doi: 10.1109/TVLSI.2009.2037968

[34] J. Xie, P. K. Meher and J. He, "Hardware-Efficient Realization of Prime-Length DCT Based on Distributed Arithmetic," IEEE Transactions on Computers, vol. 62, no. 6, pp. 1170-1178, June 2013. doi: 10.1109/TC.2012.64

[35] J.I. Guo, C. Chen and C-W. Jen, "Unified array architecture for DCT/DST and their inverses," Electronics Letters, vol. 31, no. 21, pp. 1811-1812, 1995.

[36] H.T. Kung, "Why systolic architectures?," IEEE Computer, vol.15, no.1, pp. 37-46. 1982. doi: 10.1109/MC.1982.1653825

[37] E. I. Milovanovic, M. K. Stojcev, I. Z. Milovanovic, T. R. Nikolic, "Design of Linear Systolic Arrays for Matrix Multiplication," Advances in Electrical and Computer Engineering, vol.14, no.1, pp.37-42, 2014. doi:10.4316/AECE.2014.01006

[38] J. V. McCanny, J. G. McWhirter, S. Y. Kung, "The use of data dependence graphs in the design of bit-level systolic arrays", IEEE Trans. Acoustics Speech and Signal Processing, vol. 38, no. 5, pp. 787-793, May 1990. doi: 10.1109/29.56023

[39] C.W. Jen and H.Y. Hsu, "The design of systolic arrays with tags input," Proceedings IEEE International Symposium on Circuits and Systems, Vol. 3, pp. 2263-2266, 7-9 June 1988. doi: 10.1109/ISCAS. 1988.15395

[40] J. F. Yang and C.P. Fang, "Compact recursive structures for discrete cosine transform," IEEE Transactions on Circuits and Systems II: Analog and Digital Signal Processing, vol. 47, pp. 314-321, Apr. 2000. doi: 10.1109/82.839667

[41] W. H. Fang and M. L. Wu, "An efficient unified systolic architecture for the computation of discrete trigonometric transforms," Proc. of IEEE International Symposium on Circuits and Systems, vol. 3, pp. 2092-2095, 1997. doi: 10.1109/ISCAS.1997.621569

[42] W. H. Fang and M.L. Wu, "Unified fully-pipelined implementations of one- and two-dimensional real discrete trigonometric transforms," IEICE Transactions on Fundamentals of Electronics, Communications and Computer Sciences, Vol. E82-A No.10 pp. 2219-2230, Oct. 1999.

[43] S.B. Pan and R.H. Park, "Unified systolic arrays for computation of DCT/DST/DHT," IEEE Transactions on Circuits and Systems for Video Technology, vol. 7, no. 2, pp. 413-419, Apr 1997. doi: 10.1109/76.564119

Doru Florin CHIPER, Arcadie CRACAN, Danut BURDIA

Technical University Gheorghe Asachi of Iasi, 700050, Romania

Caption: Figure 1. The structure of the computational structure called pseudo-band correlation

Caption: Figure 3. The structure of the pre-processing block

Caption: Figure 4. The structure of the permutation block. Note: SR--Shift Register

Caption: Figure 5. The obtained layout for the proposed VLSI architecture for inverse discrete sine transform.

Type                  Instances

RAMs                  1
MACs                  5
Multipliers           1
Adders/Subtractors    9
Registers             879
Multiplexers          158
FSMs                  1


Type                 Instances

BELS                 553
FlipFlops/Latches    412
Shift Registers      13
Clock Buffers        1
IO Buffers           25
DSPs                 6

Figure 2. (a) The VLSI array architecture of the hardware-core of
1D-IDST (b) The function of the processing elements Pes

[t.sub.c]\sign    00                  01

0                 y + [X.sub.i1 * c]  y + [X.sub.i2 * c]
1                 y + [X.sub.e1 * c]  y + [X.sub.e2 * c]

[t.sub.c]\sign    10                  11

0                 y - [X.sub.i1 * c]  y - [X.sub.i2 * c]
1                 y - [X.sub.e1 * c]  y - [X.sub.e2 * c]
COPYRIGHT 2017 Stefan cel Mare University of Suceava
No portion of this article can be reproduced without the express written permission from the copyright holder.
Copyright 2017 Gale, Cengage Learning. All rights reserved.

Article Details
Printer friendly Cite/link Email Feedback
Author:Chiper, Doru Florin; Cracan, Arcadie; Burdia, Danut
Publication:Advances in Electrical and Computer Engineering
Article Type:Report
Date:Feb 1, 2017
Previous Article:Design options for thermal shutdown circuitry with hysteresis width independent on the activation temperature.
Next Article:Proportional-integral-resonant AC current controller.

Terms of use | Privacy policy | Copyright © 2021 Farlex, Inc. | Feedback | For webmasters |