Modular Architecture for Ultrasound Beamforming with
FPGAs
J. F. Cruza, J. Camacho, J. Brizuela, J.M. Moreno, C. Fritsch
Consejo Superior de Investigaciones Científicas (CSIC)
UMEDIA Group, La Poveda (Arganda), 28500 Madrid, Spain
Abstract. Reception beamforming requires a high-resolution time base to keep low delay quantization lobes. Usually
signals acquired at the Nyquist sampling rate are up-sampled by a factor of 4 to 16 to achieve the required time-
resolution. Although this has been carried out for decades using fractional delay interpolating filters, the interpolation and
coherent addition processes use a large silicon area and consume power, mainly when implemented in FPGAs. These
drawbacks are reduced if the beamformer is implemented in ASICs, but this precludes any retro-fitting operation.
This work presents a beamfoming architecture to overcome these shortcomings, based on the DSP cells incorporated into
state-of-the art FPGAs. Since the manufacturing process is similar to that of ASIC, these cells consume low power and
processing speed is high. The proposed beamformer interleaves the interpolation and the coherent sum. Besides reducing
hardware resources, the design is modular and scalable, which allows to trade off among the number of channels per
device, time resolution and resource requirements. It can be easily tailored to very different applications, from the low-
channel count systems to the more demanding ones with multi-beamforming capabilities.
Keywords: Beamforming, dynamic focusing, interpolation, real-time imaging, parallel beamforming.
PACS: 43.58-e, 43.35.Yb, 43.60-Lq
INTRODUCTION
The conventional delay-and-sum beamformer with dynamic focus introduces time-varying delays to the signals
received by every array element to compensate the time-of-flight differences from every focus, usually located along
a steering direction. This operation produces the aperture data, which are added together (coherent sum) to provide
an output sample of the focused A-scan along the steering direction.
High timing resolution is required to reduce the level of the delay quantization lobes [1]. To this purpose, several
methods have been proposed, usually falling in one of the following categories:
a) Dynamic interpolation techniques, where signals are up-sampled by a factor large enough to achieve the
required time resolution and operate with a uniform sampling clock at the Nyquist rate. For this purpose, an
interpolating filter for every channel is required before the coherent sum [2-3]. Since the operation is linear, it
was suggested to use a single reconstruction filter at the beamformer output [4], but it fails for dynamic
focusing because this function is not time-invariant.
b) Oversampling ∆Σ techniques, which use low bit count sigma-delta A/Ds operated at a high sampling rate that
provide the required time resolution and avoids interpolating hardware [5-6]. However, the availability of
multi-channel, high-performance front-end analog processors in a single chip [7] makes this approach less
attractive, since they use a similar number of interfacing lines, are less sensitive to digital noise, provide high
resolution (up to 14 bits) and can operate with higher frequency transducers.
c) Selective sampling techniques, which acquire the signals at the instant they arrive to the array element. These
use a separate high resolution clock generator for every channel [8-10] and do not require any interpolation
process. However, to take advantage of the new analog front-end processors that share the same clock for
several channels, these techniques needs to be combined with an interpolation scheme [11].
So that, with the exception of ∆Σ beamformers, some interpolation process must be carried out. Many
approaches are based on fractional delay filters, which allow splitting the sampling period in an integer number of
sub-sampling intervals [12]. Here, a bank of L fractional delay filters allow to chose the equivalent sampling instant
at 1/L, 2/L, ..., 1 the sampling period. L is closely related to the delay resolution and usually ranges from 4 to 16. On
the other hand, every filter in the bank requires a number F of coefficients large enough to achieve the required
accuracy (F~L). A common variant is to switch the coefficient set to that required for a given fractional delay.
The interpolation and coherent sum processes require a large silicon area. For example, a parallel implementation
demands L·F multiply-and-add operations per channel. Thus, only a moderate number of channels can be integrated
into a single FPGA [13] and the power consumption is relatively high. The alternative is to switch to an ASIC
design [14], but this precludes ulterior hardware updates and greatly reduces system flexibility.
However, state-of-the art FPGAs include DSP cells intended to implement FIR filters. The DSP cells operate at
high speed (above 350 MHz) and include, basically, a multiplier and adder with several possible configurations. The
manufacturing process is similar to that used for ASICs, so their power consumption is rather low.
This work addresses the design of a beamformer based on the DSP cells currently available in FPGAs. This
yields the flexibility of upgrading and low cost of FPGAs together with the low power consumption of ASICs.
Furthermore, the design is modular and scalable. The proposed approach allows implementing a complete 128-
channel beamformer in a single, moderately sized FPGA.
A SYSTOLIC BEAMFORMING ARCHITECTURE
The beamforming process for an N-element array performs the following operation:
N
S (t ) = ∑ wi ri [t −τ i (t )] (1)
i =1
where ri(t) is the signal acquired by channel i, wi is a weighting function (apodization) and τi(t) is the dynamically
changing focusing delay for channel i. In the discrete time domain, the variable t is substituted by the index k to
samples with t=kTS, where TS is the sampling period:
N
S (k ) = ∑ wi ri [k − Ti (k )] (2)
i =1
The value Ti(k) represents the delay, in sampling periods, that must be applied to obtain the k-th output sample
S(k) from the sequence of input samples {ri}. Unless Ti(k) is an integer, the value of ri[k- Ti(k)] is not available from
the set {ri}, so that it must be estimated by an interpolation process. The interpolation can be performed for band-
limited signals by means some kind of FIR or IIR filter or other functions (splines, polynomials, etc.) [12].
To obtain a resolution L times higher than the sampling period, the delay Ti(k) is rounded to the nearest multiple
of 1/L, which yields an integer part Mi(k) and a fractional part mi(k)/L. The set of L possible fractional delays is {1/L,
2/L, ...(L-1)/L, 1}. The interpolated sample for a given delay Ti(k) = Mi(k)+mi(k) /L can be approximated by using a
(P-1) order FIR filter (P coefficients) as:
mi (k ) P
ri [k − Ti (k )] = ri [k − M i ( k ) −
L
]≈ ∑ hm i ,k
( j )ri [ k − M i (k ) − j ] (3)
j =1
The filter is applied to the P samples in the set {ri} ahead the nik = k-Mi(k) sample, which changes dynamically
and is handled by the focusing logic. Then,
P
ri [k − Ti (k )] ≈ ∑ hmi ,k ( j )ri (nik − j ) (4)
j =1
Substitution in (2) yields,
N P N P
S (k ) = ∑ wi ∑ hm i ,k
( j )ri (nik − j ) = ∑∑ hm i ,k
( j ) wi ri ( nik − j ) (5)
i =1 j =1 i =1 j =1
Changing the addition order and including the weight wi in the coefficients for channel i,
P N
S (k ) = ∑∑ hm i ,k
( j )ri (nik − j ) (6)
j =1 i =1
Note that the coherent sum can be performed simultaneously with the filtering function in different channels. In a
single cycle, a partial result for sample S(k) is obtained as the sum of partial results of the interpolation. After P
cycles, the full result for S(k) is obtained.
This mechanism allows the construction of an efficient systolic architecture for beamforming, based on the DSP
cells included into state-of-the-art FPGAs. Figure 1 shows schematically the processing element for a single channel
based on the DSP-48 cells of Xilinx FPGAs [15]. A small FIFO (1Kx18) adjusts the differences on the time of
arrival of echoes to the array elements. Writing samples to the FIFO is carried out at the sampling rate clock CKS
and is enabled once the integer delay Mi has elapsed. A/D converters of up to 18 bits can be used.
Si(k)
To channel i+1
z-1 48
Mi
TC WE 18
Q D si(k)
FRAC-FIR Q
CH(i) D FIFO RD(i)
18 RD F 48
CKS E(i) 48
EMPTY
F b
Qi
16 CE F
K-FOC ADR MEM-FOC F(i)
CKS
RD(i-1)·z-1 Si-1(k)
FIGURE 1. Processing element for a systolic beamformer
The focusing logic must provide the integer part nik as well as the fractional delay for every focus k. Here it is
based on the Progressive Focusing Correction (PFC) technique described in [10], where a single bit Qi(k) codes the
focusing delay: when Qi(k)=0, the fractional delay remains unchanged; when Qi(k) =1, the next fractional delay is
advanced by a quantum 1/L. From some minimum range that sets the delay ni1, the successive delays are defined
incrementally as Tk = Tk-1+ (1-Qk)TS, which provides implicitly the nik value.
The counter F of b=log2L bits selects the set of filter coefficients for the current fractional delay in the FRAC-
FIR filter. Data read from the FIFO pass through this filter, producing the aperture data si(k) that is added to the
partial result from the precedent processing element. Note that F is incremented every time Qi = 1 to select the next
fractional delay. This way, a round-robin sequence of fractional delays is produced.
It is worth to note that, when the fractional delay is ‘1’, the sample at the FIFO passes to the filter output without
interpolation. If, in this situation, a focusing code Qi=1 is produced, the next delay will be L-1/L, which is obtained
by interpolation of the preceding P samples in the same cycle. In this case, the FIFO read is disabled since the
required data is already available in the FIR filter. This exception is easily implemented (not shown).
The adders are 48-bits wide, which avoids saturation effects (up to thousand of channels). On the other hand,
new data samples are read on FIFO(i) when a read is performed in FIFO(i-1) after a pipeline delay and E(i)=0 (not
empty). If the first FIFO read is enabled once the maximum initial delay max(Mi) has elapsed, there is no need to
care about the FIFO empty flags, since all of them will have stored one or more samples.
An important aspect is the system modularity. An N-channel system is implemented by simply pipelining N
processing elements. No logic dependant on system size is required. Therefore, this approach allows using a larger
FPGA to integrate more channels or to break the design in several devices for lower cost or other reasons.
Figure 2 shows the overall beamformer architecture. In this case it uses a single DSP48 cell per channel, but the
actual implementation code allows using D cells per channel. Note that another cell is used as the final accumulator.
DSP48-A
Accumulator RS Sk
Ak
FIFO (N) A DSP48-A
Q rk rk-1 rk-2 rk-3
RD RDN
INH VSN
EMPTY EN h0 h1 h2 h3
FIFO (1) A DSP48-A
Q rk rk-1 rk-2 rk-3
RD RD1 VS1
INH
E1 B
EMPTY h0 h1 h2 h3 ‘0’
PRO
FIGURE 2. A modular, systolic beamforming architecture
In the example shown, for a given set of coefficients selected by the focusing logic ({h0, ..., h3}), partial results for
sample k are obtained in P steps (P=4 in this case) as:
N
Step j: cell i output= hi,j·ri, k-j, 1 ≤ i ≤ N → In the accumulator: Ak ( j ) = Ak ( j − 1) + ∑ hi, j ri,k − j (7)
i =1
Following the P-th step the value Ak is transferred to register RS that stores the final result Sk. Thus, P processing
cycles are required for every output sample. DSP cells operate with clocks above 320 MHz, while ultrasound signals
are acquired at a much lower rate, typically below 40 MHz. This 8:1 ratio R on speed can be used to get the output
samples at the same rate they are acquired using FIR filters with up to 8 coefficients (order 7 or below). Usually this
is enough to yield accurate results using, for instance, Lagrange interpolators with L≤8 [12]. If higher filter orders
were required, a number D>1 cells DSP-48 per channel should be used, but this is usually not required since the
delay resolution obtained with these figures will be better than 1/32 the signal fundamental period.
RESULTS
Current VHDL description allows setting, for every processing element, the number D of DSP-48 cells, the
number P of filter coefficients and the speed ratio R, where D=P/R. This way, the number of DSP-48 cells required
for an N-element implementation, taking into account the final accumulator, is:
P
NC = N + 1 (8)
R
Setting N=128 elements, P=8 coefficients and R=8, yields a total NC=129 DSP-48 cells. Other required
resources are BRAMs for FIFOs (128) and distributed memory for coefficient and data storage. All this logic fits in
a low-cost, moderately sized FPGA of the Xilinx Spartan-6 family XC6SLX75, which has 172 BRAMs and 132
cells DSP-48, leaving room for other processing functions.
CONCLUSION
A systolic beamformer has been described. It efficiently uses the new hardware Digital Signal Processing (DSP)
cells available in state-of-the-art FPGAs, which basically provide a multiplier and adder per cell and operate at high
speed. The beamformer interleaves the coherent summation with the sample interpolation process required to
achieve a high focusing delay resolution. This allows reducing the resource requirements; in particular, only one
accumulator is required for all the process. The high-speed of DSP cells are advantageously used to perform several
processing cycles in a single sampling period for real-time operation. Furthermore, every cell is time-multiplexed to
perform the fractional filtering function, thus reducing the number of required cells.
ACKNOWLEDGMENTS
Work supported by project DPI2010-17648 of the Spanish Ministry for Science and Innovation.
REFERENCES
1. D. K. Peterson and G. S. Kino, IEEE Trans. on Sonics and Ultrasonics, 31, 4, 337-351, (1984).
2. J.-Y. Lee, H. S. Kim, J. H. Song, J. Cho, T. K. Song, Proc. IEEE Ultrason. Symp., 2, 1396-1399, (2005).
3. C. H. Hu, X. C. Xu, J. M. Cannata, J. T. Yen, K. K. Shung, , IEEE Trans. UFFC, 53, 2, pp. 317-323, (2006).
4. R. A. Mucci, IEEE Trans. Acoust., Speech and Signal Proc., 32, 3, 548-558, (1984).
5. S. R. Freeman et al., IEEE Trans. UFFC, 46, 2, 320-332, (1999).
6. M. Kozak and M. Karaman, IEEE Trans. UFFC, 48, 4, 922–930, (2001).
7. B. Reeder, Analog Dialogue, 41-07, 2007
8. M. O’Donnell and M. G. Magrane, US. Patent No. 4.809.184, (28 Feb. 1989).
9. T. K. Song and J. F. Greenleaf, IEEE Trans. UFFC, 41, 3, 326-332, (1994).
10. C. Fritsch, M. Parrilla, A. Ibáñez, R. Giacchetta and O. Martínez, IEEE Trans. UFFC, 53, 10, 1820-1831, (2006)
11. J. Camacho, A. Ibáñez, M. Parrilla and C. Fritsch, Proc. IEEE Ultrason. Symp, 1631-1634, (2006).
12. T. I. Laakso, V. Valimaki, M. Karjalainen and U.K.Laine, IEEE Sig. Proc. Magazine, 30-58, Jan. 1996.
13. J. Camacho, M. Parrilla and C. Fritsch, ICU’2007, #1229, Vienna, April 9 - 13, (2007).
14. M. Karaman, A. Atalar and H. Köymen, IEEE Trans. Medical Imaging, 12, 4, 711-720, (1993).
15. Xilinx Inc., “XtremeDSP for Virtex-4 FPGAs”, UG 073, 2008.