SlideShare ist ein Scribd-Unternehmen logo
1 von 5
Downloaden Sie, um offline zu lesen
Y Venkat Seshaiah, B.Chinna rao, P.M.Francis / International Journal of Engineering Research
                and Applications (IJERA) ISSN: 2248-9622 www.ijera.com
                    Vol. 2, Issue 5, September- October 2012, pp.1458-1462
   Verilog Implementation Of Da Based Dct With High Accuracy
                 Error-Compensated Adder Tree

          Y Venkat Seshaiah                         B.Chinna rao                    P.M.Francis
             M.tech (PG student)                     Prof. &Head, Dept.of ECE        Asst.Prof. In Dept. of ECE
              Gokul Eng College                      Gokul Eng College               Gokul Eng College



Abstract
         In brief, by operating the shifting and           area cost [8], [9]. Hence the NEDA architecture is
addition in parallel, an error-compensated adder-          the smallest architecture for DA-based DCT core
tree (ECAT) is proposed to deal with the                   designs, but speed limitations exist in the operations
truncation errors and to achieve low-error and             of serial shifting and addition after the DA-
high-throughput discrete cosine transform (DCT)            computation. The high-throughput shift-adder-tree
design. Instead of the 12 bits used in previous            (SAT) and adder-tree (AT), those unroll the number
works, 9-bit distributed arithmetic-precision is           of shifting and addition words in parallel for DA-
chosen for this work so as to meet peak-signal-to-         based computation, were introduced in [10] and
noise-ratio (PSNR) requirements. Thus, an area-            [11], respectively. However, a large truncation error
efficient DCT core is implemented to achieve 1             occurred. In order to reduce the truncation error
Gpels/s throughput rate with gate counts of 22.2           effect, several error compensation bias methods have
K for the PSNR requirements outlined in the                been presented [12]–[14] based on statistical
previous works.                                            analysis of the relationship between partial products
                                                           and multiplier-multiplicand. However, the elements
Index Terms—Distributed arithmetic (DA)-                   of the truncation part outlined in this work are
based, error-compensated adder-tree (ECAT), 2-             independent so that the previously described
D discrete cosine transform (DCT).                         compensation methods cannot be applied.

I. INTRODUCTION                                                      This brief addresses a DA-based DCT core
         Discrete cosine transform (DCT) is a              with an error-compensated adder-tree (ECAT). The
widely used tool in image and video compression            proposed ECAT operates shifting and addition in
applications [1]. Recently, the high-throughput DCT        parallel by unrolling all the words required to be
designs have been adopted to fit the requirements of       computed. Furthermore, the error-compensated
real-time applications [2]–[11].                           circuit alleviates the truncation error for high
                                                           accuracy design. Based on low-error ECAT, the DA-
         The multiplier-based DCTs were presented          precision n this work is chosen to be 9 bits instead of
and implemented in [2] and [3].To reduce                   the traditional 12 bits so as to achieve the peak-
area,ROM-based distributed arithmetic (DA)was              signal-to-noise-ratio (PSNR) [1] requirements.
applied      in     DCT        cores[4]–[6].Uramotoet      Therefore, the hardware cost is reduced, and the
.[4]implemented       the     DA-based      multipliers    speed is improved using the proposed ECAT.
usingROMsto produce partial products together with
adders that accumulated these partial products. In                   This brief is organized as follows. In
this way, instead of multipliers, the DA-based ROM         Section II, the mathematical derivation of the
can be applied in a DCT core design to reduce the          distributed arithmetic is given. The proposed ECAT
area required. In addition, the symmetrical properties     architecture is discussed in Section II. The proposed
of the DCT transform and parallel DA architecture          8 X 8 2-D DCT core is demonstrated in Section III.
can be used in reducing the ROM size in [5] and [6],       The comparisons and results are presented in Section
respectively. Recently, ROM-free DA architectures          IV, and conclusions are drawn in Section V.
         were presented [7]–[11]. Shams et al.
employed a bit-level sharing scheme to construct the       II. MATHEMATICAL DERIVATION OF
adder-based butterfly matrix called new DA                 DISTRIBUTED ARITHMETIC
(NEDA) [7]. Being compressed, the butterfly-adder-
matrix in [7]                                                       The inner product is an important tool in
         utilized 35 adders and 8 shift-addition           digital signal processing applications. It can be
elements to replace the ROM. Based on NEDA                 written as follows:
architecture, the recursive form and arithmetic logic
unit (ALU) were applied in DCT design to reduce



                                                                                                 1458 | P a g e
Y Venkat Seshaiah, B.Chinna rao, P.M.Francis / International Journal of Engineering Research
                and Applications (IJERA) ISSN: 2248-9622 www.ijera.com
                    Vol. 2, Issue 5, September- October 2012, pp.1458-1462


                                                                                                             (3)
                                              (1)
                                                            In general, the shifting and addition computation
Where Ai,Xi and L are ith fixed coefficient, ith input   uses a shift-and-add operator [7] in VLSI
data, and number of inputs, respectively. Assume         implementation in order to reduce hardware cost.
that coefficient 4i is Q-bit two’s complement binary     However, when the number of the shifting and
fraction number. Equation (1) can be expressed as        addition words increases, the computation time will
follows:                                                 also increase. Therefore, the shift-adder-tree (SAT)
                                                         presented in [10] operates shifting and addition in
                                                         parallel by unrolling all the words needed to be
                                                         computed for high-speed applications. However, a
                                                         large truncation error occurs in SAT, and an ECAT
                                                         architecture is proposed in this brief to compensate
                                                         for the truncation error in high-speed applications.

                                                              In Fig. 1, the Q P-bit words operate the shifting
                                                         and addition in parallel by unrolling all
                                                         computations. Furthermore, the operation in Fig. 1
                                                         can be divided into two parts: the main part (MP)
                                                         that includes _ most significant bits (MSBs) and the
                                                         truncation part (TP) that has        least significant
                                                         bits (LSBs). Then, the shifting and addition output
                                                         can be expressed as follows:


                                                (2)
                                                                                                            (4)

                                                         The output Y will obtain the P-bit MSBs using a
                                                         rounding operation called post truncation (Post-T),
                                                         which is used for high-accuracy applications.
         Note that y0 may be 0 or a negative number      However, hardware cost increases in the VLSI
due to two’s complement representation. In (2), y0       design. In general, the TP is usually truncated to
can be calculated by adding all Xi values when           reduce hardware costs in parallel shifting and
Ai,j=1 and then the transform output Y can be            addition operations, known as the direct truncation
obtained by shifting and adding all nonzero yi alues.    (Direct-T) method. Thus, a large truncation error
Thus the inner product computation in (1) can be         occurs due to the neglecting of carry propagation
implemented by using shifting and adders instead of      from the TP to MP. In order to alleviate the
multipliers. Therefore, low hardware cost can be         truncation error effect, several error compensation
achieved by using DA-based architecture.                 bias methods have been presented [12]–[14]. All
                                                         previous works were only applied in the design of a
                                                         fixed-width multiplier. Because the products in a
                                                         multiplier have a relationship between the input
                                                         multiplier and multiplicand, the compensation
                                                         methods usually use the correlation of inputs to
                                                         calculate a fixed [12] or an adaptive [13], [14]
                                                         compensation bias using simulation or statistical
                                                         analysis. Note that the addition elements yqp in the
                                                         TP in Fig. 1 (where 1<=Q<=(Q-1) and P-q-
                                                         1<=p<=(P-1)) are independent from each other.
Fig.1.Q P-bit words shifting and addition operations     Therefore, the previous compensation method
in      parallel.                                        cannot be applied in this work, and the proposed
                                                         ECAT is explained as follows.
                                                         A. Proposed Error-Compensated Scheme
III. ECAT ARCHITECTURE
                                                         From Fig. 1, (4) can be approximated as
        From (2), the shifting and addition
computation can be written as follows:
                                                                                                                   (5)




                                                                                              1459 | P a g e
Y Venkat Seshaiah, B.Chinna rao, P.M.Francis / International Journal of Engineering Research
                and Applications (IJERA) ISSN: 2248-9622 www.ijera.com
                    Vol. 2, Issue 5, September- October 2012, pp.1458-1462
where     is the compensated bias from the TP to the     Fig. 2. Proposed ECAT architecture of shifting and
MP as listed in (6)–(8)                                  addition operators for the (P,Q)=(12,6) example
                                                                   The proposed ECAT has the highest
                                                         accuracy with a moderate areadelay product. The
                                                         shift-and-add [7] method has the smallest area, but
                                                         the overall computation time is equal to 10.8ns that
                                                  (6)    is the longest. Similarly, the SAT [10], which
                                                         truncates the TP and computes in parallel, takes 3.72
                                                         ns to complete the computation and uses 406 gates,
                                                         which is the best area-delay product performance.
                                                         However, for system accuracy, the SAT is the worst
                                                         option shown in Table I. Therefore, the ECAT is
                                                         suitable for high-speed and low-error applications.
                                            (7)




                                                         TABLE 1: COMPARISONS OF THE PROPOSED
                                                         ECAT WITH OTHER ARCHITECTURES FOR A
                                               (8)       SIX 8-BIT WORDS EXAMPLE
B. Performance Simulation for an Error-
Compensated Circuit                                      IV. DISCUSSION AND COMPARISONS
         In this subsection, comparisons of the                   The test image “Lena” used to check
absolute average error ξ , the maximum error ξmax,       system accuracy is comprised of 512 X 512 pixels
and the mean square error ξm.se for the proposed         with each pixel being represented by 8-bit 256 gray
error-compensated circuit with Direct-T and Post-T       level data. After inputting the original test image
are listed. The ξ,, ξmax and ξm.se are defined as        pixels to the proposed 2-D DCT core, the transform
follows:                                                 output data is captured and fed into MATLAB to
                                                         compute the inverse DCT using 64-bit double-
                                                         precision operations. The PSNRs are close to 44 and
                                                         47 dB for test image and for random 8-bit 256 gray
                                                         level data inputs, receptively. Table II compares the
                                                         proposed 8 X 8 2-D DCT core with previous 2-D
                                                         DCT cores. In [3], a multiplier-based DCT core
                                                         based on pipeline radix-4square single delay
                                                         feedback path architecture to achieve high-speed
C. Proposed ECAT Architecture                            design. The ROM-based DCT core is presented in
          The proposed ECAT architecture is              [4] to reduce adders to reduce the chip area of DCT
illustrated in Fig. 2 for (P,Q)=(12,6) (case 3), where   core. Nevertheless, a speed limitation for shift-and-
block FA indicates a full-adder cell with three inputs   add is in NEDA design. In [10] and [11], the SAT
(a, b, and c) and two outputs, a sum (s) and a carry-    and AT architectures for DA-based DCTs improve
out (co). Also, block HA indicates half-adder cell       the throughput rate of the NEDA method. However,
with two inputs (a and b) and two outputs, a sum (s)     DA-precision must be chosen as 13 bits to meet the
and a carry-out (co). The comparisons of area, delay,    system accuracy with more area overhead. The
area-delay product, and accuracy for the proposed        proposed DCT core uses low-error ECAT to achieve
ECAT with other architectures are listed in Table II.    a high-speed design, and the DA-precision can be
The area and delay are synthesized using a Synopsys      chosen as 9 bits to meet the PSNR requirements for
Design Compiler with the Artisan TSMC 0.18- μm           reducing hardware costs. The proposed DCT core
Standard cell library.                                   has the highest hardware efficiency, defined as
                                                         follows (based on the accuracy required by the
                                                         presented standards).




                                                                                             1460 | P a g e
Y Venkat Seshaiah, B.Chinna rao, P.M.Francis / International Journal of Engineering Research
                and Applications (IJERA) ISSN: 2248-9622 www.ijera.com
                    Vol. 2, Issue 5, September- October 2012, pp.1458-1462




                                                       TABLE III: COMPARISONS OF 2-D DCT
                                                       ARCHITECTURES IN FPGAS

                                                       IV.RESULTS AND DISCUSSION
                                                                The 8-point 1-D DCT architecture and the
                                                       implementation were discussed in the previous
                                                       chapters. Now this chapter deals with the simulation
TABLE II :COMPARISONS OF DIFFERENT 2-D                 and synthesis results of the implemented 1-D 8-point
DCT ARCHITECTURES WITH THE PROPOSED                    DCT. Here ModelSim tool is used in order to
ARCHITECTURE                                           simulate the design and checks the functionality of
                                                       the design. Once the functional verification is done,
                                                       the design will be taken to the Xilinx tool for
                                                       Synthesis process and the net-list generation.

                                                       Simulation Result:




Fig. 3. Architecture of the proposed 1-D 8-point
DCT

                                                       Fig : Simulation results of Distributed Arithmatic
                                                       DCT

                                                       V. CONCLUSION
                                                                In this brief, a high-speed and low-error 8
                                                       X 8 2-D DCT design with ECAT is proposed to
                                                       improve the throughput rate significantly up to about
                                                       13 folds at high compression rates by operating the
                                                       shifting and addition in parallel. Furthermore, the
                                                       proposed error-compensated circuit alleviates the
                                                       truncation error in ECAT. In this way, the DA-
Fig. 4. Core layout and characteristics
                                                       precision can be chosen as 9 bits instead of 12 bits
                                                       so as to meet the PSNR requirements. Thus, the
         Furthermore, the proposed 2-D DCT core
                                                       proposed DCT core has the highest hardware
synthesized by using Xilinx ISE 9.1, and the Xilinx
                                                       efficiency than those in previous works for the same
XC2VP30 FPGA can achieve 792 mega pixels per
                                                       PSNR requirements. Finally, an area-efficient 2-D
second (M-pels/sec) throughput rate (up to about 7
                                                       DCT core is implemented using a TSMC 0.18- μm
folds of previous work [16]). Table III compares the
                                                       process, and the maximum throughput rate is 1
proposed 2-D DCT core with previous FPGA
                                                       Gpels/s. In summary, the proposed architecture is
implementations.



                                                                                           1461 | P a g e
Y Venkat Seshaiah, B.Chinna rao, P.M.Francis / International Journal of Engineering Research
                and Applications (IJERA) ISSN: 2248-9622 www.ijera.com
                    Vol. 2, Issue 5, September- October 2012, pp.1458-1462
suitable for high compression rate applications in               IEEE Trans. Circuits Syst. II, Exp. Briefs,
VLSI designs.                                                    vol. 43, no. 2, pp. 90–95, Feb. 1996.
                                                          [13]   K. J. Cho, K. C. Lee, J. G. Chung, and K.
REFERENCES                                                       K. Parhi, “Design of low-error fixed-width
  [1]    Y.Wang, J. Ostermann, and Y. Zhang,                     modified booth multiplier,” IEEE Trans.
         Video Processing and Communications, 1st                Very Large Scale Integr. (VLSI) Syst., vol.
         ed. Englewood Cliffs, NJ: Prentice-Hall,                12, no. 5, pp. 522–531, May 2004.
         2002.                                            [14]   L. D. Van and C. C. Yang, “Generalized
  [2]    Y. Chang and C.Wang, “New systolic array                low-error       area-efficient    fixedwidth
         implementation of the 2-D discrete cosine               multipliers,” IEEE Trans. Circuits Syst. I,
         transform and its inverse,” IEEE Trans.                 Reg. Papers, vol. 52, no. 8, pp. 1608–1619,
         Circuits Syst. Video Technol., vol. 5, no. 2,           Aug. 2005.
         pp. 150–157, Apr. 1995.                          [15]   C. C. Sun, P. Donner, and J. Gotze, “Low-
  [3]    C. T. Lin, Y. C. Yu, and L. D. Van, “Cost-              complexity multi-purpose IP core for
         effective     triple-mode      reconfigurable           quantized discrete cosine and integer
         pipeline FFT/IFFT/2-D DCT processor,”                   transform,” in Proc. IEEE Int. Symp.
         IEEE Trans. Very Large Scale Integr. Syst.,             Circuits Syst., 2009, pp. 3014–3017.
         vol. 16, no. 8, pp. 1058–1071, Aug. 2008.        [16]   A. Tumeo, M. Monchiero, G. Palermo, F.
  [4]    S. Uramoto, Y. Inoue, A. Takabatake, J.                 Ferrandi, and D. Sciuto, “A pipelined fast
         Takeda, Y. Yamashita, H. Yerane, and M.                 2D-DCT accelerator for FPGA-based
         Yoshimoto, “A 100-MHz 2-D discrete                      SoCs,” in Proc. IEEE Comput. Soc. Annu.
         cosine transform core processor,” IEEE J.               Symp. VLSI, 2007, pp. 331–336.
         Solid-State Circuits, vol. 27, no. 4, pp. 492–   [17]   S. Ghosh, S. Venigalla, and M. Bayoumi,
         499, Apr. 1992.                                         “Design and implementation of a 2D-DCT
  [5]    S. Yu and E. E. S. , Jr., “DCT                          architecture using coefficient distributed
         implementation          with       distributed          arithmetic,” in Proc. IEEE Comput. Soc.
         arithmetic,” IEEE Trans. Comput., vol. 50,              Ann. Symp. VLSI, 2005, pp. 162–166.
         no. 9, pp. 985–991, Sep. 2001.
  [6]    P. K. Meher, “Unified systolic-like
         architecture for DCT and DST using
         distributed arithmetic,” IEEE Trans.
         Circuits Syst. I, Reg. Papers, vol. 3, no. 12,
         pp. 2656–2663, Dec. 2006.
  [7]    A. M. Shams, A. Chidanandan, W. Pan, and
         M. A. Bayoumi, “NEDA: A low-power
         high-performance DCT architecture,” IEEE
         Trans. Signal Process., vol. 54, no. 3, pp.
         955–964, Mar. 2006.
  [8]    M. R. M. Rizk and M. Ammar, “Low
         power small area high performance 2D-
         DCT architecture,” in Proc. Int. Design
         Test Workshop, 2007, pp. 120–125.
  [9]    Y. Chen, X. Cao, Q. Xie, and C. Peng, “An
         area efficient high performance DCT
         distributed      architecture    for     video
         compression,” in Proc. Int. Conf. Adv.
         Comm. Technol., 2007, pp. 238–241.
  [10]   C. Peng, X. Cao, D. Yu, and X. Zhang, “A
         250 MHz optimized distributed architecture
         of 2D 8� DCT,” in Proc. Int. Conf. ASIC,
                    8
         2007, pp. 189–192.
  [11]   C. Y. Huang, L. F. Chen, and Y. K. Lai, “A
         high-speed 2-D transform architecture with
         unique kernel for multi-standard video
         applications,” in Proc. IEEE Int. Symp.
         Circuits Syst., 2008, pp. 21–24.
  [12]   S. S. Kidambi, F. E. Guibaly, and A.
         Antonious, “Area-efficient multipliers for
         digital signal processing applications,”




                                                                                            1462 | P a g e

Weitere ähnliche Inhalte

Was ist angesagt?

A Pipelined Fused Processing Unit for DSP Applications
A Pipelined Fused Processing Unit for DSP ApplicationsA Pipelined Fused Processing Unit for DSP Applications
A Pipelined Fused Processing Unit for DSP Applicationsijiert bestjournal
 
Modified approximate 8-point multiplier less DCT like transform
Modified approximate 8-point multiplier less DCT like transformModified approximate 8-point multiplier less DCT like transform
Modified approximate 8-point multiplier less DCT like transformIJERA Editor
 
An Area Efficient Mixed Decimation MDF Architecture for Radix 22 Parallel FFT
An Area Efficient Mixed Decimation MDF Architecture for Radix 22  Parallel FFTAn Area Efficient Mixed Decimation MDF Architecture for Radix 22  Parallel FFT
An Area Efficient Mixed Decimation MDF Architecture for Radix 22 Parallel FFTIRJET Journal
 
A Review of Different Methods for Booth Multiplier
A Review of Different Methods for Booth MultiplierA Review of Different Methods for Booth Multiplier
A Review of Different Methods for Booth MultiplierIJERA Editor
 
High Speed and Time Efficient 1-D DWT on Xilinx Virtex4 DWT Using 9/7 Filter ...
High Speed and Time Efficient 1-D DWT on Xilinx Virtex4 DWT Using 9/7 Filter ...High Speed and Time Efficient 1-D DWT on Xilinx Virtex4 DWT Using 9/7 Filter ...
High Speed and Time Efficient 1-D DWT on Xilinx Virtex4 DWT Using 9/7 Filter ...IOSR Journals
 
Optimization Of K-Means Clustering For DECT Using ACO
Optimization Of K-Means Clustering For DECT Using ACOOptimization Of K-Means Clustering For DECT Using ACO
Optimization Of K-Means Clustering For DECT Using ACOIRJET Journal
 
Design and Implementation of a Programmable Truncated Multiplier
Design and Implementation of a Programmable Truncated MultiplierDesign and Implementation of a Programmable Truncated Multiplier
Design and Implementation of a Programmable Truncated Multiplierijsrd.com
 
IRJET- MAC Unit by Efficient Grouping of Partial Products along with Circular...
IRJET- MAC Unit by Efficient Grouping of Partial Products along with Circular...IRJET- MAC Unit by Efficient Grouping of Partial Products along with Circular...
IRJET- MAC Unit by Efficient Grouping of Partial Products along with Circular...IRJET Journal
 
EFFICIENT ABSOLUTE DIFFERENCE CIRCUIT FOR SAD COMPUTATION ON FPGA
EFFICIENT ABSOLUTE DIFFERENCE CIRCUIT FOR SAD COMPUTATION ON FPGAEFFICIENT ABSOLUTE DIFFERENCE CIRCUIT FOR SAD COMPUTATION ON FPGA
EFFICIENT ABSOLUTE DIFFERENCE CIRCUIT FOR SAD COMPUTATION ON FPGAVLSICS Design
 
Paper id 37201520
Paper id 37201520Paper id 37201520
Paper id 37201520IJRAT
 
International Journal of Computational Engineering Research(IJCER)
International Journal of Computational Engineering Research(IJCER)International Journal of Computational Engineering Research(IJCER)
International Journal of Computational Engineering Research(IJCER)ijceronline
 
IRJET- Color Image Compression using Canonic Signed Digit and Block based...
IRJET-  	  Color Image Compression using Canonic Signed Digit and Block based...IRJET-  	  Color Image Compression using Canonic Signed Digit and Block based...
IRJET- Color Image Compression using Canonic Signed Digit and Block based...IRJET Journal
 

Was ist angesagt? (15)

A Pipelined Fused Processing Unit for DSP Applications
A Pipelined Fused Processing Unit for DSP ApplicationsA Pipelined Fused Processing Unit for DSP Applications
A Pipelined Fused Processing Unit for DSP Applications
 
Modified approximate 8-point multiplier less DCT like transform
Modified approximate 8-point multiplier less DCT like transformModified approximate 8-point multiplier less DCT like transform
Modified approximate 8-point multiplier less DCT like transform
 
An Area Efficient Mixed Decimation MDF Architecture for Radix 22 Parallel FFT
An Area Efficient Mixed Decimation MDF Architecture for Radix 22  Parallel FFTAn Area Efficient Mixed Decimation MDF Architecture for Radix 22  Parallel FFT
An Area Efficient Mixed Decimation MDF Architecture for Radix 22 Parallel FFT
 
A Review of Different Methods for Booth Multiplier
A Review of Different Methods for Booth MultiplierA Review of Different Methods for Booth Multiplier
A Review of Different Methods for Booth Multiplier
 
High Speed and Time Efficient 1-D DWT on Xilinx Virtex4 DWT Using 9/7 Filter ...
High Speed and Time Efficient 1-D DWT on Xilinx Virtex4 DWT Using 9/7 Filter ...High Speed and Time Efficient 1-D DWT on Xilinx Virtex4 DWT Using 9/7 Filter ...
High Speed and Time Efficient 1-D DWT on Xilinx Virtex4 DWT Using 9/7 Filter ...
 
05725150
0572515005725150
05725150
 
Optimization Of K-Means Clustering For DECT Using ACO
Optimization Of K-Means Clustering For DECT Using ACOOptimization Of K-Means Clustering For DECT Using ACO
Optimization Of K-Means Clustering For DECT Using ACO
 
F1074145
F1074145F1074145
F1074145
 
Design and Implementation of a Programmable Truncated Multiplier
Design and Implementation of a Programmable Truncated MultiplierDesign and Implementation of a Programmable Truncated Multiplier
Design and Implementation of a Programmable Truncated Multiplier
 
IRJET- MAC Unit by Efficient Grouping of Partial Products along with Circular...
IRJET- MAC Unit by Efficient Grouping of Partial Products along with Circular...IRJET- MAC Unit by Efficient Grouping of Partial Products along with Circular...
IRJET- MAC Unit by Efficient Grouping of Partial Products along with Circular...
 
EFFICIENT ABSOLUTE DIFFERENCE CIRCUIT FOR SAD COMPUTATION ON FPGA
EFFICIENT ABSOLUTE DIFFERENCE CIRCUIT FOR SAD COMPUTATION ON FPGAEFFICIENT ABSOLUTE DIFFERENCE CIRCUIT FOR SAD COMPUTATION ON FPGA
EFFICIENT ABSOLUTE DIFFERENCE CIRCUIT FOR SAD COMPUTATION ON FPGA
 
Paper id 37201520
Paper id 37201520Paper id 37201520
Paper id 37201520
 
International Journal of Computational Engineering Research(IJCER)
International Journal of Computational Engineering Research(IJCER)International Journal of Computational Engineering Research(IJCER)
International Journal of Computational Engineering Research(IJCER)
 
A05410105
A05410105A05410105
A05410105
 
IRJET- Color Image Compression using Canonic Signed Digit and Block based...
IRJET-  	  Color Image Compression using Canonic Signed Digit and Block based...IRJET-  	  Color Image Compression using Canonic Signed Digit and Block based...
IRJET- Color Image Compression using Canonic Signed Digit and Block based...
 

Andere mochten auch

Andere mochten auch (20)

Hf2513081311
Hf2513081311Hf2513081311
Hf2513081311
 
Hu2514081412
Hu2514081412Hu2514081412
Hu2514081412
 
Hf2513081311
Hf2513081311Hf2513081311
Hf2513081311
 
Kn2518431847
Kn2518431847Kn2518431847
Kn2518431847
 
Hp2513711375
Hp2513711375Hp2513711375
Hp2513711375
 
Hc2512901294
Hc2512901294Hc2512901294
Hc2512901294
 
Hs2513891402
Hs2513891402Hs2513891402
Hs2513891402
 
Hx2514211427
Hx2514211427Hx2514211427
Hx2514211427
 
Kj2518171824
Kj2518171824Kj2518171824
Kj2518171824
 
Kf2517971799
Kf2517971799Kf2517971799
Kf2517971799
 
Iv2616901694
Iv2616901694Iv2616901694
Iv2616901694
 
Ht2515231526
Ht2515231526Ht2515231526
Ht2515231526
 
W26142147
W26142147W26142147
W26142147
 
I26043047
I26043047I26043047
I26043047
 
Ig2616051609
Ig2616051609Ig2616051609
Ig2616051609
 
Ic2615781586
Ic2615781586Ic2615781586
Ic2615781586
 
Bp31457463
Bp31457463Bp31457463
Bp31457463
 
Programa completo job day 1º jornada de innovación en el empleo y talento en...
Programa completo job day 1º jornada de innovación en el empleo y talento en...Programa completo job day 1º jornada de innovación en el empleo y talento en...
Programa completo job day 1º jornada de innovación en el empleo y talento en...
 
Bérchules
BérchulesBérchules
Bérchules
 
Competencias docentes
Competencias docentesCompetencias docentes
Competencias docentes
 

Ähnlich wie Id2514581462

Paper id 25201467
Paper id 25201467Paper id 25201467
Paper id 25201467IJRAT
 
IRJET- A Review of Approximate Adders for Energy-Efficient Digital Signal Pro...
IRJET- A Review of Approximate Adders for Energy-Efficient Digital Signal Pro...IRJET- A Review of Approximate Adders for Energy-Efficient Digital Signal Pro...
IRJET- A Review of Approximate Adders for Energy-Efficient Digital Signal Pro...IRJET Journal
 
High Speed and Area Efficient Matrix Multiplication using Radix-4 Booth Multi...
High Speed and Area Efficient Matrix Multiplication using Radix-4 Booth Multi...High Speed and Area Efficient Matrix Multiplication using Radix-4 Booth Multi...
High Speed and Area Efficient Matrix Multiplication using Radix-4 Booth Multi...IRJET Journal
 
IIIRJET-Implementation of Image Compression Algorithm on FPGA
IIIRJET-Implementation of Image Compression Algorithm on FPGAIIIRJET-Implementation of Image Compression Algorithm on FPGA
IIIRJET-Implementation of Image Compression Algorithm on FPGAIRJET Journal
 
Bivariatealgebraic integerencoded arai algorithm for
Bivariatealgebraic integerencoded arai algorithm forBivariatealgebraic integerencoded arai algorithm for
Bivariatealgebraic integerencoded arai algorithm foreSAT Publishing House
 
EFFICIENT ABSOLUTE DIFFERENCE CIRCUIT FOR SAD COMPUTATION ON FPGA
EFFICIENT ABSOLUTE DIFFERENCE CIRCUIT FOR SAD COMPUTATION ON FPGAEFFICIENT ABSOLUTE DIFFERENCE CIRCUIT FOR SAD COMPUTATION ON FPGA
EFFICIENT ABSOLUTE DIFFERENCE CIRCUIT FOR SAD COMPUTATION ON FPGAVLSICS Design
 
Iaetsd fpga implementation of cordic algorithm for pipelined fft realization and
Iaetsd fpga implementation of cordic algorithm for pipelined fft realization andIaetsd fpga implementation of cordic algorithm for pipelined fft realization and
Iaetsd fpga implementation of cordic algorithm for pipelined fft realization andIaetsd Iaetsd
 
IRJET- Low Complexity Pipelined FFT Design for High Throughput and Low Densit...
IRJET- Low Complexity Pipelined FFT Design for High Throughput and Low Densit...IRJET- Low Complexity Pipelined FFT Design for High Throughput and Low Densit...
IRJET- Low Complexity Pipelined FFT Design for High Throughput and Low Densit...IRJET Journal
 
A Novel Efficient VLSI Architecture for IEEE 754 Floating point multiplier us...
A Novel Efficient VLSI Architecture for IEEE 754 Floating point multiplier us...A Novel Efficient VLSI Architecture for IEEE 754 Floating point multiplier us...
A Novel Efficient VLSI Architecture for IEEE 754 Floating point multiplier us...IJERA Editor
 
Reconfigurable CORDIC Low-Power Implementation of Complex Signal Processing f...
Reconfigurable CORDIC Low-Power Implementation of Complex Signal Processing f...Reconfigurable CORDIC Low-Power Implementation of Complex Signal Processing f...
Reconfigurable CORDIC Low-Power Implementation of Complex Signal Processing f...Editor IJMTER
 
Matlab Implementation of Baseline JPEG Image Compression Using Hardware Optim...
Matlab Implementation of Baseline JPEG Image Compression Using Hardware Optim...Matlab Implementation of Baseline JPEG Image Compression Using Hardware Optim...
Matlab Implementation of Baseline JPEG Image Compression Using Hardware Optim...inventionjournals
 
IRJET - Design and Implementation of FFT using Compressor with XOR Gate Topology
IRJET - Design and Implementation of FFT using Compressor with XOR Gate TopologyIRJET - Design and Implementation of FFT using Compressor with XOR Gate Topology
IRJET - Design and Implementation of FFT using Compressor with XOR Gate TopologyIRJET Journal
 
EFFICIENT ABSOLUTE DIFFERENCE CIRCUIT FOR SAD COMPUTATION ON FPGA
EFFICIENT ABSOLUTE DIFFERENCE CIRCUIT FOR SAD COMPUTATION ON FPGAEFFICIENT ABSOLUTE DIFFERENCE CIRCUIT FOR SAD COMPUTATION ON FPGA
EFFICIENT ABSOLUTE DIFFERENCE CIRCUIT FOR SAD COMPUTATION ON FPGAVLSICS Design
 
Design and implementation of Closed Loop Control of Three Phase Interleaved P...
Design and implementation of Closed Loop Control of Three Phase Interleaved P...Design and implementation of Closed Loop Control of Three Phase Interleaved P...
Design and implementation of Closed Loop Control of Three Phase Interleaved P...IJMTST Journal
 
IRJET - High Speed Inexact Speculative Adder using Carry Look Ahead Adder...
IRJET -  	  High Speed Inexact Speculative Adder using Carry Look Ahead Adder...IRJET -  	  High Speed Inexact Speculative Adder using Carry Look Ahead Adder...
IRJET - High Speed Inexact Speculative Adder using Carry Look Ahead Adder...IRJET Journal
 
Iaetsd pipelined parallel fft architecture through folding transformation
Iaetsd pipelined parallel fft architecture through folding transformationIaetsd pipelined parallel fft architecture through folding transformation
Iaetsd pipelined parallel fft architecture through folding transformationIaetsd Iaetsd
 

Ähnlich wie Id2514581462 (20)

Paper id 25201467
Paper id 25201467Paper id 25201467
Paper id 25201467
 
H010414651
H010414651H010414651
H010414651
 
IRJET- A Review of Approximate Adders for Energy-Efficient Digital Signal Pro...
IRJET- A Review of Approximate Adders for Energy-Efficient Digital Signal Pro...IRJET- A Review of Approximate Adders for Energy-Efficient Digital Signal Pro...
IRJET- A Review of Approximate Adders for Energy-Efficient Digital Signal Pro...
 
High Speed and Area Efficient Matrix Multiplication using Radix-4 Booth Multi...
High Speed and Area Efficient Matrix Multiplication using Radix-4 Booth Multi...High Speed and Area Efficient Matrix Multiplication using Radix-4 Booth Multi...
High Speed and Area Efficient Matrix Multiplication using Radix-4 Booth Multi...
 
IIIRJET-Implementation of Image Compression Algorithm on FPGA
IIIRJET-Implementation of Image Compression Algorithm on FPGAIIIRJET-Implementation of Image Compression Algorithm on FPGA
IIIRJET-Implementation of Image Compression Algorithm on FPGA
 
Bivariatealgebraic integerencoded arai algorithm for
Bivariatealgebraic integerencoded arai algorithm forBivariatealgebraic integerencoded arai algorithm for
Bivariatealgebraic integerencoded arai algorithm for
 
EFFICIENT ABSOLUTE DIFFERENCE CIRCUIT FOR SAD COMPUTATION ON FPGA
EFFICIENT ABSOLUTE DIFFERENCE CIRCUIT FOR SAD COMPUTATION ON FPGAEFFICIENT ABSOLUTE DIFFERENCE CIRCUIT FOR SAD COMPUTATION ON FPGA
EFFICIENT ABSOLUTE DIFFERENCE CIRCUIT FOR SAD COMPUTATION ON FPGA
 
Iaetsd fpga implementation of cordic algorithm for pipelined fft realization and
Iaetsd fpga implementation of cordic algorithm for pipelined fft realization andIaetsd fpga implementation of cordic algorithm for pipelined fft realization and
Iaetsd fpga implementation of cordic algorithm for pipelined fft realization and
 
Hv2514131415
Hv2514131415Hv2514131415
Hv2514131415
 
Hv2514131415
Hv2514131415Hv2514131415
Hv2514131415
 
IRJET- Low Complexity Pipelined FFT Design for High Throughput and Low Densit...
IRJET- Low Complexity Pipelined FFT Design for High Throughput and Low Densit...IRJET- Low Complexity Pipelined FFT Design for High Throughput and Low Densit...
IRJET- Low Complexity Pipelined FFT Design for High Throughput and Low Densit...
 
A Novel Efficient VLSI Architecture for IEEE 754 Floating point multiplier us...
A Novel Efficient VLSI Architecture for IEEE 754 Floating point multiplier us...A Novel Efficient VLSI Architecture for IEEE 754 Floating point multiplier us...
A Novel Efficient VLSI Architecture for IEEE 754 Floating point multiplier us...
 
Reconfigurable CORDIC Low-Power Implementation of Complex Signal Processing f...
Reconfigurable CORDIC Low-Power Implementation of Complex Signal Processing f...Reconfigurable CORDIC Low-Power Implementation of Complex Signal Processing f...
Reconfigurable CORDIC Low-Power Implementation of Complex Signal Processing f...
 
Matlab Implementation of Baseline JPEG Image Compression Using Hardware Optim...
Matlab Implementation of Baseline JPEG Image Compression Using Hardware Optim...Matlab Implementation of Baseline JPEG Image Compression Using Hardware Optim...
Matlab Implementation of Baseline JPEG Image Compression Using Hardware Optim...
 
Comparative study of single precision floating point division using differen...
Comparative study of single precision floating point division  using differen...Comparative study of single precision floating point division  using differen...
Comparative study of single precision floating point division using differen...
 
IRJET - Design and Implementation of FFT using Compressor with XOR Gate Topology
IRJET - Design and Implementation of FFT using Compressor with XOR Gate TopologyIRJET - Design and Implementation of FFT using Compressor with XOR Gate Topology
IRJET - Design and Implementation of FFT using Compressor with XOR Gate Topology
 
EFFICIENT ABSOLUTE DIFFERENCE CIRCUIT FOR SAD COMPUTATION ON FPGA
EFFICIENT ABSOLUTE DIFFERENCE CIRCUIT FOR SAD COMPUTATION ON FPGAEFFICIENT ABSOLUTE DIFFERENCE CIRCUIT FOR SAD COMPUTATION ON FPGA
EFFICIENT ABSOLUTE DIFFERENCE CIRCUIT FOR SAD COMPUTATION ON FPGA
 
Design and implementation of Closed Loop Control of Three Phase Interleaved P...
Design and implementation of Closed Loop Control of Three Phase Interleaved P...Design and implementation of Closed Loop Control of Three Phase Interleaved P...
Design and implementation of Closed Loop Control of Three Phase Interleaved P...
 
IRJET - High Speed Inexact Speculative Adder using Carry Look Ahead Adder...
IRJET -  	  High Speed Inexact Speculative Adder using Carry Look Ahead Adder...IRJET -  	  High Speed Inexact Speculative Adder using Carry Look Ahead Adder...
IRJET - High Speed Inexact Speculative Adder using Carry Look Ahead Adder...
 
Iaetsd pipelined parallel fft architecture through folding transformation
Iaetsd pipelined parallel fft architecture through folding transformationIaetsd pipelined parallel fft architecture through folding transformation
Iaetsd pipelined parallel fft architecture through folding transformation
 

Id2514581462

  • 1. Y Venkat Seshaiah, B.Chinna rao, P.M.Francis / International Journal of Engineering Research and Applications (IJERA) ISSN: 2248-9622 www.ijera.com Vol. 2, Issue 5, September- October 2012, pp.1458-1462 Verilog Implementation Of Da Based Dct With High Accuracy Error-Compensated Adder Tree Y Venkat Seshaiah B.Chinna rao P.M.Francis M.tech (PG student) Prof. &Head, Dept.of ECE Asst.Prof. In Dept. of ECE Gokul Eng College Gokul Eng College Gokul Eng College Abstract In brief, by operating the shifting and area cost [8], [9]. Hence the NEDA architecture is addition in parallel, an error-compensated adder- the smallest architecture for DA-based DCT core tree (ECAT) is proposed to deal with the designs, but speed limitations exist in the operations truncation errors and to achieve low-error and of serial shifting and addition after the DA- high-throughput discrete cosine transform (DCT) computation. The high-throughput shift-adder-tree design. Instead of the 12 bits used in previous (SAT) and adder-tree (AT), those unroll the number works, 9-bit distributed arithmetic-precision is of shifting and addition words in parallel for DA- chosen for this work so as to meet peak-signal-to- based computation, were introduced in [10] and noise-ratio (PSNR) requirements. Thus, an area- [11], respectively. However, a large truncation error efficient DCT core is implemented to achieve 1 occurred. In order to reduce the truncation error Gpels/s throughput rate with gate counts of 22.2 effect, several error compensation bias methods have K for the PSNR requirements outlined in the been presented [12]–[14] based on statistical previous works. analysis of the relationship between partial products and multiplier-multiplicand. However, the elements Index Terms—Distributed arithmetic (DA)- of the truncation part outlined in this work are based, error-compensated adder-tree (ECAT), 2- independent so that the previously described D discrete cosine transform (DCT). compensation methods cannot be applied. I. INTRODUCTION This brief addresses a DA-based DCT core Discrete cosine transform (DCT) is a with an error-compensated adder-tree (ECAT). The widely used tool in image and video compression proposed ECAT operates shifting and addition in applications [1]. Recently, the high-throughput DCT parallel by unrolling all the words required to be designs have been adopted to fit the requirements of computed. Furthermore, the error-compensated real-time applications [2]–[11]. circuit alleviates the truncation error for high accuracy design. Based on low-error ECAT, the DA- The multiplier-based DCTs were presented precision n this work is chosen to be 9 bits instead of and implemented in [2] and [3].To reduce the traditional 12 bits so as to achieve the peak- area,ROM-based distributed arithmetic (DA)was signal-to-noise-ratio (PSNR) [1] requirements. applied in DCT cores[4]–[6].Uramotoet Therefore, the hardware cost is reduced, and the .[4]implemented the DA-based multipliers speed is improved using the proposed ECAT. usingROMsto produce partial products together with adders that accumulated these partial products. In This brief is organized as follows. In this way, instead of multipliers, the DA-based ROM Section II, the mathematical derivation of the can be applied in a DCT core design to reduce the distributed arithmetic is given. The proposed ECAT area required. In addition, the symmetrical properties architecture is discussed in Section II. The proposed of the DCT transform and parallel DA architecture 8 X 8 2-D DCT core is demonstrated in Section III. can be used in reducing the ROM size in [5] and [6], The comparisons and results are presented in Section respectively. Recently, ROM-free DA architectures IV, and conclusions are drawn in Section V. were presented [7]–[11]. Shams et al. employed a bit-level sharing scheme to construct the II. MATHEMATICAL DERIVATION OF adder-based butterfly matrix called new DA DISTRIBUTED ARITHMETIC (NEDA) [7]. Being compressed, the butterfly-adder- matrix in [7] The inner product is an important tool in utilized 35 adders and 8 shift-addition digital signal processing applications. It can be elements to replace the ROM. Based on NEDA written as follows: architecture, the recursive form and arithmetic logic unit (ALU) were applied in DCT design to reduce 1458 | P a g e
  • 2. Y Venkat Seshaiah, B.Chinna rao, P.M.Francis / International Journal of Engineering Research and Applications (IJERA) ISSN: 2248-9622 www.ijera.com Vol. 2, Issue 5, September- October 2012, pp.1458-1462 (3) (1) In general, the shifting and addition computation Where Ai,Xi and L are ith fixed coefficient, ith input uses a shift-and-add operator [7] in VLSI data, and number of inputs, respectively. Assume implementation in order to reduce hardware cost. that coefficient 4i is Q-bit two’s complement binary However, when the number of the shifting and fraction number. Equation (1) can be expressed as addition words increases, the computation time will follows: also increase. Therefore, the shift-adder-tree (SAT) presented in [10] operates shifting and addition in parallel by unrolling all the words needed to be computed for high-speed applications. However, a large truncation error occurs in SAT, and an ECAT architecture is proposed in this brief to compensate for the truncation error in high-speed applications. In Fig. 1, the Q P-bit words operate the shifting and addition in parallel by unrolling all computations. Furthermore, the operation in Fig. 1 can be divided into two parts: the main part (MP) that includes _ most significant bits (MSBs) and the truncation part (TP) that has least significant bits (LSBs). Then, the shifting and addition output can be expressed as follows: (2) (4) The output Y will obtain the P-bit MSBs using a rounding operation called post truncation (Post-T), which is used for high-accuracy applications. Note that y0 may be 0 or a negative number However, hardware cost increases in the VLSI due to two’s complement representation. In (2), y0 design. In general, the TP is usually truncated to can be calculated by adding all Xi values when reduce hardware costs in parallel shifting and Ai,j=1 and then the transform output Y can be addition operations, known as the direct truncation obtained by shifting and adding all nonzero yi alues. (Direct-T) method. Thus, a large truncation error Thus the inner product computation in (1) can be occurs due to the neglecting of carry propagation implemented by using shifting and adders instead of from the TP to MP. In order to alleviate the multipliers. Therefore, low hardware cost can be truncation error effect, several error compensation achieved by using DA-based architecture. bias methods have been presented [12]–[14]. All previous works were only applied in the design of a fixed-width multiplier. Because the products in a multiplier have a relationship between the input multiplier and multiplicand, the compensation methods usually use the correlation of inputs to calculate a fixed [12] or an adaptive [13], [14] compensation bias using simulation or statistical analysis. Note that the addition elements yqp in the TP in Fig. 1 (where 1<=Q<=(Q-1) and P-q- 1<=p<=(P-1)) are independent from each other. Fig.1.Q P-bit words shifting and addition operations Therefore, the previous compensation method in parallel. cannot be applied in this work, and the proposed ECAT is explained as follows. A. Proposed Error-Compensated Scheme III. ECAT ARCHITECTURE From Fig. 1, (4) can be approximated as From (2), the shifting and addition computation can be written as follows: (5) 1459 | P a g e
  • 3. Y Venkat Seshaiah, B.Chinna rao, P.M.Francis / International Journal of Engineering Research and Applications (IJERA) ISSN: 2248-9622 www.ijera.com Vol. 2, Issue 5, September- October 2012, pp.1458-1462 where is the compensated bias from the TP to the Fig. 2. Proposed ECAT architecture of shifting and MP as listed in (6)–(8) addition operators for the (P,Q)=(12,6) example The proposed ECAT has the highest accuracy with a moderate areadelay product. The shift-and-add [7] method has the smallest area, but the overall computation time is equal to 10.8ns that (6) is the longest. Similarly, the SAT [10], which truncates the TP and computes in parallel, takes 3.72 ns to complete the computation and uses 406 gates, which is the best area-delay product performance. However, for system accuracy, the SAT is the worst option shown in Table I. Therefore, the ECAT is suitable for high-speed and low-error applications. (7) TABLE 1: COMPARISONS OF THE PROPOSED ECAT WITH OTHER ARCHITECTURES FOR A (8) SIX 8-BIT WORDS EXAMPLE B. Performance Simulation for an Error- Compensated Circuit IV. DISCUSSION AND COMPARISONS In this subsection, comparisons of the The test image “Lena” used to check absolute average error ξ , the maximum error ξmax, system accuracy is comprised of 512 X 512 pixels and the mean square error ξm.se for the proposed with each pixel being represented by 8-bit 256 gray error-compensated circuit with Direct-T and Post-T level data. After inputting the original test image are listed. The ξ,, ξmax and ξm.se are defined as pixels to the proposed 2-D DCT core, the transform follows: output data is captured and fed into MATLAB to compute the inverse DCT using 64-bit double- precision operations. The PSNRs are close to 44 and 47 dB for test image and for random 8-bit 256 gray level data inputs, receptively. Table II compares the proposed 8 X 8 2-D DCT core with previous 2-D DCT cores. In [3], a multiplier-based DCT core based on pipeline radix-4square single delay feedback path architecture to achieve high-speed C. Proposed ECAT Architecture design. The ROM-based DCT core is presented in The proposed ECAT architecture is [4] to reduce adders to reduce the chip area of DCT illustrated in Fig. 2 for (P,Q)=(12,6) (case 3), where core. Nevertheless, a speed limitation for shift-and- block FA indicates a full-adder cell with three inputs add is in NEDA design. In [10] and [11], the SAT (a, b, and c) and two outputs, a sum (s) and a carry- and AT architectures for DA-based DCTs improve out (co). Also, block HA indicates half-adder cell the throughput rate of the NEDA method. However, with two inputs (a and b) and two outputs, a sum (s) DA-precision must be chosen as 13 bits to meet the and a carry-out (co). The comparisons of area, delay, system accuracy with more area overhead. The area-delay product, and accuracy for the proposed proposed DCT core uses low-error ECAT to achieve ECAT with other architectures are listed in Table II. a high-speed design, and the DA-precision can be The area and delay are synthesized using a Synopsys chosen as 9 bits to meet the PSNR requirements for Design Compiler with the Artisan TSMC 0.18- μm reducing hardware costs. The proposed DCT core Standard cell library. has the highest hardware efficiency, defined as follows (based on the accuracy required by the presented standards). 1460 | P a g e
  • 4. Y Venkat Seshaiah, B.Chinna rao, P.M.Francis / International Journal of Engineering Research and Applications (IJERA) ISSN: 2248-9622 www.ijera.com Vol. 2, Issue 5, September- October 2012, pp.1458-1462 TABLE III: COMPARISONS OF 2-D DCT ARCHITECTURES IN FPGAS IV.RESULTS AND DISCUSSION The 8-point 1-D DCT architecture and the implementation were discussed in the previous chapters. Now this chapter deals with the simulation TABLE II :COMPARISONS OF DIFFERENT 2-D and synthesis results of the implemented 1-D 8-point DCT ARCHITECTURES WITH THE PROPOSED DCT. Here ModelSim tool is used in order to ARCHITECTURE simulate the design and checks the functionality of the design. Once the functional verification is done, the design will be taken to the Xilinx tool for Synthesis process and the net-list generation. Simulation Result: Fig. 3. Architecture of the proposed 1-D 8-point DCT Fig : Simulation results of Distributed Arithmatic DCT V. CONCLUSION In this brief, a high-speed and low-error 8 X 8 2-D DCT design with ECAT is proposed to improve the throughput rate significantly up to about 13 folds at high compression rates by operating the shifting and addition in parallel. Furthermore, the proposed error-compensated circuit alleviates the truncation error in ECAT. In this way, the DA- Fig. 4. Core layout and characteristics precision can be chosen as 9 bits instead of 12 bits so as to meet the PSNR requirements. Thus, the Furthermore, the proposed 2-D DCT core proposed DCT core has the highest hardware synthesized by using Xilinx ISE 9.1, and the Xilinx efficiency than those in previous works for the same XC2VP30 FPGA can achieve 792 mega pixels per PSNR requirements. Finally, an area-efficient 2-D second (M-pels/sec) throughput rate (up to about 7 DCT core is implemented using a TSMC 0.18- μm folds of previous work [16]). Table III compares the process, and the maximum throughput rate is 1 proposed 2-D DCT core with previous FPGA Gpels/s. In summary, the proposed architecture is implementations. 1461 | P a g e
  • 5. Y Venkat Seshaiah, B.Chinna rao, P.M.Francis / International Journal of Engineering Research and Applications (IJERA) ISSN: 2248-9622 www.ijera.com Vol. 2, Issue 5, September- October 2012, pp.1458-1462 suitable for high compression rate applications in IEEE Trans. Circuits Syst. II, Exp. Briefs, VLSI designs. vol. 43, no. 2, pp. 90–95, Feb. 1996. [13] K. J. Cho, K. C. Lee, J. G. Chung, and K. REFERENCES K. Parhi, “Design of low-error fixed-width [1] Y.Wang, J. Ostermann, and Y. Zhang, modified booth multiplier,” IEEE Trans. Video Processing and Communications, 1st Very Large Scale Integr. (VLSI) Syst., vol. ed. Englewood Cliffs, NJ: Prentice-Hall, 12, no. 5, pp. 522–531, May 2004. 2002. [14] L. D. Van and C. C. Yang, “Generalized [2] Y. Chang and C.Wang, “New systolic array low-error area-efficient fixedwidth implementation of the 2-D discrete cosine multipliers,” IEEE Trans. Circuits Syst. I, transform and its inverse,” IEEE Trans. Reg. Papers, vol. 52, no. 8, pp. 1608–1619, Circuits Syst. Video Technol., vol. 5, no. 2, Aug. 2005. pp. 150–157, Apr. 1995. [15] C. C. Sun, P. Donner, and J. Gotze, “Low- [3] C. T. Lin, Y. C. Yu, and L. D. Van, “Cost- complexity multi-purpose IP core for effective triple-mode reconfigurable quantized discrete cosine and integer pipeline FFT/IFFT/2-D DCT processor,” transform,” in Proc. IEEE Int. Symp. IEEE Trans. Very Large Scale Integr. Syst., Circuits Syst., 2009, pp. 3014–3017. vol. 16, no. 8, pp. 1058–1071, Aug. 2008. [16] A. Tumeo, M. Monchiero, G. Palermo, F. [4] S. Uramoto, Y. Inoue, A. Takabatake, J. Ferrandi, and D. Sciuto, “A pipelined fast Takeda, Y. Yamashita, H. Yerane, and M. 2D-DCT accelerator for FPGA-based Yoshimoto, “A 100-MHz 2-D discrete SoCs,” in Proc. IEEE Comput. Soc. Annu. cosine transform core processor,” IEEE J. Symp. VLSI, 2007, pp. 331–336. Solid-State Circuits, vol. 27, no. 4, pp. 492– [17] S. Ghosh, S. Venigalla, and M. Bayoumi, 499, Apr. 1992. “Design and implementation of a 2D-DCT [5] S. Yu and E. E. S. , Jr., “DCT architecture using coefficient distributed implementation with distributed arithmetic,” in Proc. IEEE Comput. Soc. arithmetic,” IEEE Trans. Comput., vol. 50, Ann. Symp. VLSI, 2005, pp. 162–166. no. 9, pp. 985–991, Sep. 2001. [6] P. K. Meher, “Unified systolic-like architecture for DCT and DST using distributed arithmetic,” IEEE Trans. Circuits Syst. I, Reg. Papers, vol. 3, no. 12, pp. 2656–2663, Dec. 2006. [7] A. M. Shams, A. Chidanandan, W. Pan, and M. A. Bayoumi, “NEDA: A low-power high-performance DCT architecture,” IEEE Trans. Signal Process., vol. 54, no. 3, pp. 955–964, Mar. 2006. [8] M. R. M. Rizk and M. Ammar, “Low power small area high performance 2D- DCT architecture,” in Proc. Int. Design Test Workshop, 2007, pp. 120–125. [9] Y. Chen, X. Cao, Q. Xie, and C. Peng, “An area efficient high performance DCT distributed architecture for video compression,” in Proc. Int. Conf. Adv. Comm. Technol., 2007, pp. 238–241. [10] C. Peng, X. Cao, D. Yu, and X. Zhang, “A 250 MHz optimized distributed architecture of 2D 8� DCT,” in Proc. Int. Conf. ASIC, 8 2007, pp. 189–192. [11] C. Y. Huang, L. F. Chen, and Y. K. Lai, “A high-speed 2-D transform architecture with unique kernel for multi-standard video applications,” in Proc. IEEE Int. Symp. Circuits Syst., 2008, pp. 21–24. [12] S. S. Kidambi, F. E. Guibaly, and A. Antonious, “Area-efficient multipliers for digital signal processing applications,” 1462 | P a g e