Linköping Electronic Articles in
Computer and Information Science
Vol. 2 (1997): nr 07
Linköping University Electronic Press
Linköping, Sweden
http://www.ep.liu.se/ea/cis/1997/007/
Batched Range Searching on a
Mesh-Connected SIMD Computer
Per-Olof Fjällström
Department of Computer and Information Science
Linköping University
Linköping, Sweden
Published on July 7, 1997 by
Linköping University Electronic Press
581 83 Linköping, Sweden
Linköping Electronic Articles in
Computer and Information Science
ISSN 1401-9841
Series editor: Erik Sandewall
© 1997 Per-Olof Fjällström
Typeset by the author using LaTeX
Formatted using etendu style
Recommended citation:
<Author>. <Title>. Linköping Electronic Articles in
Computer and Information Science, Vol. 2 (1997): nr 07.
http://www.ep.liu.se/ea/cis/1997/007/. July 7, 1997.
This URL will also contain a link to the author's home page.
The publishers will keep this article on-line on the Internet
(or its possible replacement network in the future)
for a period of 25 years from the date of publication,
barring exceptional circumstances as described separately.
The on-line availability of the article implies
a permanent permission for anyone to read the article on-line,
and to print out single copies of it for personal use.
This permission cannot be revoked by subsequent
transfers of copyright. All other uses of the article,
including for making copies for classroom use,
are conditional on the consent of the copyright owner.
The publication of the article on the date stated above
also included the production of a limited number of copies
on paper, which were archived in Swedish university libraries
like all other written works published in Sweden.
The publisher has taken technical and administrative measures
to assure that the on-line version of the article will be
permanently accessible using the URL stated above,
unchanged, and permanently equal to the archived printed copies
at least until the expiration of the publication period.
For additional information about the Linkoping University
Electronic Press and its procedures for publication and for
assurance of document integrity, please refer to
its WWW home page: http://www.ep.liu.se/
or write by conventional mail to the address stated above.
Abstract
Given a set of n points and hyperrectangles in d-dimensional space, the batched range-searching problem is to determine which points each hyperrectangle contains. We present two parallel algorithms for this problem on a √n × √n mesh-connected parallel computer: one average-case efficient algorithm based on cell division, and one worst-case efficient divide-and-conquer algorithm. Besides the asymptotic analysis of their running times, we present an experimental evaluation of the algorithms.
Keywords: Parallel algorithms, mesh-connected parallel computers, range searching.
The work presented here is funded by CENIIT (the Center for
Industrial Information Technology) at Linköping University.
A shorter version of this report has been accepted for presentation at
the Ninth IASTED International Conference on Parallel and
Distributed Computing and Systems, October 13-16, 1997,
Washington D.C., USA.
1 Introduction
The batched range-searching problem is as follows. Given a set P of points and a set Q of hyperrectangles in d-dimensional space, report, for each hyperrectangle, which points it contains. (A hyperrectangle is the Cartesian product of intervals on distinct coordinate axes.) In on-line range searching, the hyperrectangles are given one at a time. Several sequential range-searching algorithms have been proposed [3, 10, 11]. Both on-line and batched range searching have several important applications, for example in statistics, geographic data processing, and computer-aided engineering. More specifically, we have identified batched range searching as an important subproblem in computer simulation of mechanical deformation processes such as vehicle collisions [4].
A two-dimensional mesh-connected parallel computer of size √n × √n consists of n identical processors organized in a rectangular array of √n rows and √n columns. A bidirectional communication link connects each pair of adjacent processors along the same row or column. Due to the regular interconnection pattern, mesh-connected computers are inexpensive to build, and several such computers are on the market. In an SIMD (Single Instruction, Multiple Data) computer, the processors are synchronized and operate under the control of a single program. Throughout this paper we refer to a mesh-connected SIMD computer as a mesh. Many algorithms have been designed for the mesh. For a survey of mesh algorithms for geometric problems, see Atallah [2].
In this paper we describe and analyze two mesh algorithms for batched range searching. One algorithm is based on an average-case efficient sequential algorithm, whereas the other is a worst-case efficient divide-and-conquer algorithm. We have implemented and experimentally evaluated both of the algorithms. Our algorithms are based on well-known techniques such as divide-and-conquer, but we are not aware of any other mesh algorithms for range searching. (Oh and Suk [9] present a mesh algorithm for the on-line version of the range-counting problem. That is, their algorithm gives the number of points contained in a hyperrectangle.)
In our development of range-searching algorithms for the mesh,
we assume that P and Q together have at most n elements, and that
each processor initially has at most one point or hyperrectangle in
its local memory. At the end of execution, the points contained in
a hyperrectangle must reside in the local memory of the processor
that initially contained the hyperrectangle. We assume also that the
number of points and the number of hyperrectangles are of the same
order of magnitude, and that the number of points contained in a
hyperrectangle is independent of n. These assumptions are valid in
many applications.
We organize the rest of the paper as follows. In the next section we give some additional information concerning the mesh, and describe some basic operations used by our algorithms. In Sections 3 and 4, we describe our mesh algorithms for batched range searching. In Section 5, we describe how we implemented the algorithms on a MasPar MP-1, and report some experimental results. Section 6 offers some concluding remarks.
2 Preliminaries

As mentioned in the previous section, a single program controls the mesh, that is, it is a Single Instruction, Multiple Data computer. In its most rigid form, SIMD requires that all processors execute the same instruction, and access data from the same address in their respective memories. We relax these requirements as follows. First, a processor may be either active or inactive, and an instruction is executed only by active processors. Moreover, to be able to carry out operations that require all processors to be active, we assume that activating all processors temporarily is possible. Second, each processor can do its own address computation. More specifically, we assume that processors can simultaneously execute an array indexing instruction such as "A[i] = b", where the value of i may differ between processors. These features are all available in modern SIMD computers such as the MasPar MP-1 computer.
Each processor is identified by its pair of row and column indexes, (i, j), where 0 ≤ i, j < √n. In addition, processors are often indexed by some one-to-one mapping from {0, 1, ..., √n − 1} × {0, 1, ..., √n − 1} to {0, 1, ..., n − 1}. Various indexing schemes are used, for example row-major, snake-like row-major, and shuffled row-major indexing. In this paper we use snake-like row-major and shuffled row-major indexing (see Figure 1). We assume that each processor knows its indexes. The local memory of each processor consists of a fixed number of memory cells (words). We assume that the size of a word is sufficiently large to contain a single coordinate value or processor index. The transfer of a word of data between adjacent processors and the standard arithmetic operations on the contents of a word can be done in O(1) time.
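For concreteness, the two indexing schemes used in this paper can be computed as follows (a sequential sketch; the function names are ours, and we assume the mesh side is a power of two for the shuffled scheme):

```python
def snake_index(i, j, side):
    """Snake-like row-major index of processor (i, j) on a side x side mesh:
    even rows run left to right, odd rows run right to left."""
    return i * side + (j if i % 2 == 0 else side - 1 - j)

def shuffled_index(i, j, side):
    """Shuffled row-major index of processor (i, j): interleave the bits of
    i and j, with each bit of i just above the corresponding bit of j."""
    k = side.bit_length() - 1          # side = 2**k
    idx = 0
    for b in range(k - 1, -1, -1):     # from most significant bit down
        idx = (idx << 1) | ((i >> b) & 1)
        idx = (idx << 1) | ((j >> b) & 1)
    return idx
```

On the 4 × 4 mesh of Figure 1, processor (1, 0) gets snake-like index 7 and shuffled index 2, matching the figure.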
Sorting is one of the most important operations in parallel computation. In many situations we need to rearrange a set of n keys, one in each processor, such that the i-th smallest key is moved to the processor with index i − 1, for all i = 1, 2, ..., n. Sorting can be done in O(√n) time [12, 7, 6].
Two other important data movement operations are concurrent read and concurrent write. In a concurrent read operation, denoted q = s(i):p, each processor i holds an index s(i) in its local memory. The operation copies the data in memory cell p in the local memory of processor s(i) to memory cell q in the local memory of processor i. In the concurrent write operation, denoted d(i):q = p, each processor i holds a unique index d(i) in its local memory. The operation copies the content of memory location p in processor i's local memory to location q in processor d(i)'s local memory. Concurrent read and write can be done in O(√n) time [8].

Figure 1: Mesh with n = 16. The first integer within each processor is the snake-like row-major index of the processor and the second integer is the shuffled row-major index.
Another fundamental operation is the global sum operation, for which the input consists of an array [a_0, a_1, ..., a_{n−1}], where a_i is contained in processor i. The output consists of the value of a_0 ⊕ a_1 ⊕ ··· ⊕ a_{n−1} (where ⊕ represents some associative binary operator such as +, maximum, etc.), stored in the local memory of each processor. Closely related to the global sum operation is the prefix sum operation. It has the same input as the global sum operation, but the output consists of the value of a_0 ⊕ a_1 ⊕ ··· ⊕ a_i stored in the local memory of processor i. We can compute both global and prefix sums in O(√n) time [1].
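Sequentially, the prefix sum operation computes the following (a sketch; on the mesh, processor i holds a_i and ends up holding out[i], and the global sum is simply the last prefix broadcast to every processor):

```python
from operator import add

def prefix_sums(a, op=add):
    """out[i] = a[0] op a[1] op ... op a[i] for an associative operator op.
    The global sum under the same operator is out[-1]."""
    out = []
    for x in a:
        out.append(x if not out else op(out[-1], x))
    return out
```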
3 An Average-Case Efficient Algorithm

The parallel algorithm presented in this section is based on the cell method. This is one of the simplest sequential methods for range searching. It consists of a preprocessing algorithm, the output of which is a data structure on the given point set P, and a query-processing algorithm, which determines which points are contained in a given hyperrectangle. The preprocessing algorithm is as follows. First, find the smallest hyperrectangular box that contains P. Partition this box into a number of identical hyperrectangular cells, and initialize a point list for each cell. Finally, for each point, determine which cell it is contained in and add it to the cell's point list. To determine which points are contained in a hyperrectangle q, determine which cells are intersected by q. For each intersected cell, find its point list and test each point in the list for inclusion in q.
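The preprocessing and query processing just described can be sketched sequentially in two dimensions as follows (our own illustration, not the parallel algorithm below; for the cell index we use the common clamped ⌊m·t⌋ mapping rather than the (m − 1)-based mapping the parallel algorithm uses):

```python
from collections import defaultdict

def cell_method_2d(points, rects, m):
    """Bucket points into an m x m grid over their bounding box, then, for
    each query rectangle (xlo, ylo, xhi, yhi), test only the points stored
    in the grid cells the rectangle intersects. Reports interior points."""
    xs = [p[0] for p in points]
    ys = [p[1] for p in points]
    x0, x1, y0, y1 = min(xs), max(xs), min(ys), max(ys)

    def cell_of(x, y):
        # clamp so that queries reaching outside the bounding box stay legal
        fx = (x - x0) / (x1 - x0) if x1 > x0 else 0.0
        fy = (y - y0) / (y1 - y0) if y1 > y0 else 0.0
        return (min(max(int(fx * m), 0), m - 1),
                min(max(int(fy * m), 0), m - 1))

    buckets = defaultdict(list)          # cell -> point list
    for (px, py) in points:
        buckets[cell_of(px, py)].append((px, py))

    result = {}
    for (xlo, ylo, xhi, yhi) in rects:
        (cx0, cy0), (cx1, cy1) = cell_of(xlo, ylo), cell_of(xhi, yhi)
        hits = [(px, py)
                for cx in range(cx0, cx1 + 1)
                for cy in range(cy0, cy1 + 1)
                for (px, py) in buckets.get((cx, cy), ())
                if xlo < px < xhi and ylo < py < yhi]
        result[(xlo, ylo, xhi, yhi)] = hits
    return result
```

The query cost is the number of intersected cells plus the number of inclusion tests, which motivates the analysis below.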
Although its worst-case performance is poor, the sequential cell method is quite efficient in practice. Intuitively, the reason for this is that the input points are often uniformly distributed over the smallest hyperrectangle containing the points. If, in addition, the hyperrectangles are "almost cubical", that is, not too long and thin, it can be shown that the number of point inclusion tests and the number of intersected cells are of the same order of magnitude as the number of points contained in a hyperrectangle [11].

Before we give our parallel version of the cell method, we introduce some notation used throughout this paper. The i-th coordinate, i = 1, 2, ..., d, of point p is denoted by x_i(p); the minimum and maximum coordinate values of hyperrectangle q in the i-th coordinate direction are denoted by x_i^l(q) and x_i^u(q).
Algorithm: The Parallel Cell Method
Input: A set P of d-dimensional points and a set Q of d-dimensional hyperrectangles are distributed on a √n × √n mesh, at most one point or one hyperrectangle per processor. We index the mesh in snake-like row-major order.
Output: For each q ∈ Q, we store the points lying in the interior of q in the processor containing q.

1. Compute B, the smallest hyperrectangle containing P. B is divided into m cells along each coordinate direction, where m = ⌊n_P^(1/d)⌋ and n_P = |P|. With each cell is associated a unique d-tuple [i_1, i_2, ..., i_d] such that 0 ≤ i_k < m, for k = 1, 2, ..., d. For each cell is also defined a unique processor index; the processor index of a cell with d-tuple [i_1, i_2, ..., i_d] is Σ_{k=1..d} i_k·m^(k−1). We illustrate the cell subdivision in Figure 2.
2. For each processor, initialize the local variables first and last such that first > last.
3. For each point p, determine first [i_1(p), i_2(p), ..., i_d(p)], the d-tuple of the cell that contains p. We have that

   i_k(p) = ⌊(m − 1)·(x_k(p) − x_k^l(B)) / (x_k^u(B) − x_k^l(B))⌋, for k = 1, 2, ..., d,

   where x_k^l(B) and x_k^u(B) are the minimum and maximum coordinate values of B in the k-th coordinate direction. Next, compute c(p), the processor index corresponding to [i_1(p), ..., i_d(p)].
4. For each point p, create the record G(p) = [x_1(p), ..., x_d(p), c(p)]. Sort the records into nondecreasing order with respect to their last component.

5. For each point p (from now on, "point" refers to the d first components of a G record), do
   (a) if c(p_p) ≠ c(p), then set c(p):first = i(p), and
   (b) if c(p_s) ≠ c(p), then set c(p):last = i(p),
   where p_p (p_s) denotes the point that precedes (succeeds) p in snake-like row-major order, and i(p) is the index of the processor containing p.
6. For each hyperrectangle q, do as follows.
   (a) Determine the two d-tuples [l_1(q), ..., l_d(q)] and [u_1(q), ..., u_d(q)], such that q intersects each cell [i_1, i_2, ..., i_d] for which l_k(q) ≤ i_k ≤ u_k(q), for all k = 1, 2, ..., d. Compute s(q) = Π_{k=1..d} s_k(q), where s_k(q) = u_k(q) − l_k(q) + 1.
   (b) If s(q) > 0, then do as follows.

       i(q) = 1;
       For k = 1, 2, ..., d, compute
         i_k(q) = ⌊(i(q) − 1) / Π_{l=1..k−1} s_l(q)⌋ mod s_k(q) + l_k(q);
       Compute c(q), the processor index of cell [i_1(q), ..., i_d(q)];
       j(q) = c(q):first;
       last(q) = c(q):last;
       L: if j(q) ≤ last(q) then
         If the point in processor j(q) is contained in q, then
           store a copy of it in the processor containing q;
         j(q) = j(q) + 1;
       if j(q) > last(q) and i(q) < s(q) then
         i(q) = i(q) + 1;
         For k = 1, 2, ..., d, compute
           i_k(q) = ⌊(i(q) − 1) / Π_{l=1..k−1} s_l(q)⌋ mod s_k(q) + l_k(q);
         Compute c(q), the processor index of cell [i_1(q), ..., i_d(q)];
         j(q) = c(q):first;
         last(q) = c(q):last;
       if j(q) ≤ last(q) then goto L;
Theorem 1. The parallel cell method takes O((d + r_max)·√n) time, where r_max = max{s(q) + d·t(q) : q ∈ Q}, s(q) is the number of cells intersected by the hyperrectangle q and t(q) is the number of points tested for inclusion in q.

Proof. Since we assume d to be much smaller than √n, we restrict our analysis to operations that require communication between processors. The first five steps of the algorithm correspond to the preprocessing algorithm of the sequential cell method. In Step 1, B and n_P can be determined by the global sum operation; this takes a total of O(d√n) time. In Step 4, we use "dummy" records, that is, we create a record for every processor. If a processor does not contain a point, the last component of the record is set to "+∞". The effect of this is that, after sorting, all "real" points are contained in the processors indexed 0 through n_P − 1. Sorting the records requires O(d√n) time. A point can compare its processor index with the processor index of its predecessor and successor in O(1) time. The first and last variables can be set in O(√n) time. Therefore, Step 5 takes O(√n) time. In Step 6(b), let a round be the activities taking place between two consecutive executions of the first if-statement. Clearly, the number of rounds cannot exceed max{s(q) + t(q) : q ∈ Q}. □

Figure 2: Two-dimensional example of a cell division for n_P = 16. Numbers within cells represent processor indexes, and dots represent points. Two hyperrectangles, q1 and q2, are included in the example.
Corollary 1. If the points are chosen uniformly and independently at random from the d-dimensional unit hypercube, and the hyperrectangles are cubical and fall completely within the unit hypercube, then the average-case time for the parallel cell method is O((r + 1)·3^d·√n), where r is the average number of points contained in the largest hyperrectangle.

Proof. We assume that B is equal to the d-dimensional unit hypercube. For sufficiently large values of n_P, this is a reasonable assumption. Let w denote the width of the largest hyperrectangle. Then,

s(q) ≤ (⌈w·m⌉ + 1)^d ≤ (⌈r^(1/d)⌉ + 1)^d,

where r = w^d·n_P is the average number of points contained in the largest hyperrectangle. For r^(1/d) ≤ 1, s(q) ≤ 2^d. Otherwise,

s(q) < (r^(1/d) + 2)^d ≤ r·3^d.

The average number of points per cell is O(1). □
4 A Worst-Case Efficient Algorithm

In this section, we present a worst-case efficient algorithm based on divide-and-conquer. Many mesh algorithms are based on divide-and-conquer. For example, Jeong and Lee [5] describe an algorithm to solve a two-dimensional multipoint location problem that is based on ideas similar to those used in our algorithm. Before we give the actual algorithm, let us briefly describe the algorithm for the two-dimensional case.
First, divide the input, P and Q, into two equal-sized parts, P1 and Q1, and P2 and Q2, such that each point in P1 and the lower horizontal boundary of each hyperrectangle in Q1 lies below every element in P2 and Q2. See Figure 3. Solve the corresponding subproblems recursively. We must now solve the problem for input P2 and Q1.

Figure 3: In this example, P1 = {p1}, Q1 = {q1, q2, q3}, P2 = {p2, p3, p4}, and Q2 = {q4}.

To this end, divide the input into two equal-sized parts, P1 and Q1, and P2 and Q2, such that each point in P2 and the upper horizontal boundary of each hyperrectangle in Q2 lies above every element in P1 and Q1. See Figure 4. Again, solve the corresponding subproblems recursively. It remains to solve the problem for input P1 and Q2. This problem is, however, of one dimension less than the original problem. If the problem is one-dimensional, solving it directly is easy; otherwise, we again apply divide-and-conquer.
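The two-dimensional recursion just described can be sketched sequentially as follows (our own illustration of the scheme, not the mesh procedures given below; the names dac and dac_star are ours, and for simplicity we assume all y-coordinates involved are distinct):

```python
def dac(points, rects, out):
    """Split by the points' y-coordinates and the rectangles' *lower* edges;
    recurse on both halves, then resolve the cross pairs with dac_star.
    Rectangles are (xlo, ylo, xhi, yhi); out maps rectangle -> hit list."""
    if not points or not rects:
        return
    items = sorted([(py, 'p', (px, py)) for (px, py) in points] +
                   [(r[1], 'q', r) for r in rects])
    mid = len(items) // 2
    low, high = items[:mid], items[mid:]
    P1 = [it[2] for it in low if it[1] == 'p']
    Q1 = [it[2] for it in low if it[1] == 'q']
    P2 = [it[2] for it in high if it[1] == 'p']
    Q2 = [it[2] for it in high if it[1] == 'q']
    dac(P1, Q1, out)
    dac(P2, Q2, out)
    dac_star(P2, Q1, out)   # lower edges of Q1 lie below every point of P2

def dac_star(points, rects, out):
    """Every rectangle's lower edge already lies below every point; split by
    the rectangles' *upper* edges. The residual subproblem is 1-D in x."""
    if not points or not rects:
        return
    items = sorted([(py, 'p', (px, py)) for (px, py) in points] +
                   [(r[3], 'q', r) for r in rects])
    mid = len(items) // 2
    low, high = items[:mid], items[mid:]
    P1 = [it[2] for it in low if it[1] == 'p']
    Q1 = [it[2] for it in low if it[1] == 'q']
    P2 = [it[2] for it in high if it[1] == 'p']
    Q2 = [it[2] for it in high if it[1] == 'q']
    dac_star(P1, Q1, out)
    dac_star(P2, Q2, out)
    # rectangles in Q2 span all of P1 in y, so only x matters:
    for (xlo, ylo, xhi, yhi) in Q2:
        for (px, py) in P1:
            if xlo < px < xhi:
                out[(xlo, ylo, xhi, yhi)].append((px, py))
```

Each point-rectangle pair is resolved exactly once: either inside one of the two recursive calls, or in the one-dimensional step at the bottom of dac_star.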
Algorithm: Parallel Divide-and-Conquer
Input: A set P of d-dimensional points and a set Q of d-dimensional hyperrectangles are distributed on a √n × √n mesh, at most one point or one hyperrectangle per processor. We index the mesh in shuffled row-major order, and we assume that √n = 2^k for some positive integer k.
Figure 4: In this example, P1 = {p2, p3}, Q1 = {q2}, P2 = {p4}, and Q2 = {q1, q3}.
Output: For each q ∈ Q, we store the points lying in the interior of q in the processor containing q.
1. Preprocessing:
   For each point p, create the record G_d(p) = [x_1(p), ..., x_d(p), a(p)], where a(p) is called the address of p, i.e., a(p) is equal to the index of the processor containing p. Next, for each hyperrectangle q, create the record

   G_d(q) = [x_1^l(q), x_1^u(q), x_2^l(q), x_2^u(q), ..., x_d^l(q), x_d^u(q), i_d(q)],

   where i_d(q) is the index of the processor containing q. Finally, sort all records into nondecreasing order with respect to x_d-coordinate (G_d(q) records are sorted with respect to their x_d^l(q)-coordinate).

2. Call range_search(√n, √n, d).
   Procedure range_search (together with procedure range_search*) does the main part of the computations. These procedures are given below. The output from this step is, for each hyperrectangle q, a list of the addresses of the points contained in q. We store this list in the processor that contains the corresponding G_d(q) record.
3. Postprocessing:
   For each processor containing a G_d(q) record, move the point addresses stored in the processor to the processor that contains q, that is, to the processor with index i_d(q). Then, for each hyperrectangle q, process its list of point addresses. That is, for each address in the list, copy the point stored at that address to the processor containing q.
procedure range_search(r, c, d)
for each submesh of size r × c do in parallel

1. if r = c = 1 then return;

2. if d = 1 then
   (a) For each processor, determine the index of the next processor (in shuffled row-major order) that contains a G_1(p) record. Store the index in the local variable successor. (If no such processor exists, then successor = NIL.)
   (b) For each record G_1(q) = [x_1^l(q), x_1^u(q), i_1(q)] do:

       k(q) = successor;
       while k(q) ≠ NIL do
         Let G_1(p) = [x_1(p), a(p)] be the record in processor k(q);
         if x_1^u(q) > x_1(p) then
           copy a(p) to the processor containing G_1(q);
           k(q) = k(q):successor;
         else exit the while-loop;

3. if d > 1 then
   (a) if r = c then call range_search(r/2, c, d).
   (b) if r = c/2 then call range_search(r, c/2, d).
   (c) For each G_d(p) record in the higher-indexed half of the submesh, create the record G*_d(p) = G_d(p). For each G_d(q) record in the lower-indexed half of the submesh, create the record

       G*_d(q) = [x_1^l(q), x_1^u(q), ..., x_{d−1}^l(q), x_{d−1}^u(q), x_d^u(q), j_d(q)],

       where j_d(q) is the index of the processor containing the record. Finally, sort the G*_d records into nondecreasing order with respect to x_d-coordinate (G*_d(q) records are sorted with respect to their x_d^u(q)-coordinate).
   (d) Call range_search*(r, c, d).
   (e) For each processor containing a G*_d(q) record, move the point addresses stored in the processor during the call to range_search* to the processor with index j_d(q).

return
procedure range_search*(r, c, d)
for each submesh of size r × c do in parallel

1. if r = c = 1 then return;

2. if r = c then call range_search*(r/2, c, d).

3. if r = c/2 then call range_search*(r, c/2, d).

4. For each G*_d(p) record in the lower-indexed half of the submesh, create the record

   G_{d−1}(p) = [x_1(p), x_2(p), ..., x_{d−1}(p), a(p)].

   For each G*_d(q) record in the higher-indexed half of the submesh, create the record

   G_{d−1}(q) = [x_1^l(q), x_1^u(q), ..., x_{d−1}^l(q), x_{d−1}^u(q), i_{d−1}(q)],

   where i_{d−1}(q) is the index of the processor containing the record. Finally, sort the G_{d−1} records into nondecreasing order with respect to x_{d−1}-coordinate (the G_{d−1}(q) records are sorted with respect to their x_{d−1}^l(q)-coordinate).

5. Call range_search(r, c, d−1).

6. For each processor containing a G_{d−1}(q) record, move the point addresses stored in the processor during the call to range_search to the processor with index i_{d−1}(q).

return
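Sequentially, the d = 1 base case of range_search amounts to the following (a sketch under our own naming; bisect_right plays the role of the successor computation in step 2(a)):

```python
from bisect import bisect_right

def base_case_1d(point_xs, intervals):
    """With the points sorted by x, each query (xlo, xhi) starts at the
    successor of its lower endpoint and scans while x_p < xhi, mirroring
    the while-loop of step 2(b)."""
    xs = sorted(point_xs)
    result = {}
    for (xlo, xhi) in intervals:
        hits = []
        k = bisect_right(xs, xlo)   # first point strictly greater than xlo
        while k < len(xs) and xs[k] < xhi:
            hits.append(xs[k])
            k += 1
        result[(xlo, xhi)] = hits
    return result
```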
Theorem 2. The parallel divide-and-conquer method takes O((r + 1)·16^d·√n) time, where r is the maximum number of points contained in any hyperrectangle.

Proof. It is sufficient to show that the number of routing steps (i.e., the transfer of one word of data between adjacent processors) is O((r + 1)·16^d·√n). In the Preprocessing step, records containing 2d + 1 words of data are sorted; the number of routing steps is thus O(d√n). The Postprocessing step requires r concurrent write operations and d·r concurrent read operations, which gives a total of O(d·r·√n) routing steps.
To bound the number of routing steps done in procedures range_search and range_search*, we first consider the number of routing steps required when r = 0. Let the number of routing steps done by procedures range_search and range_search* on a mesh of size 2^i × 2^j be denoted by R(i, j, d) and R*(i, j, d), respectively. We can easily see that R(k, k, 1) is O(2^k); Step 2(a) is a variant of the prefix sum operation, and in Step 2(b) at most one concurrent read operation is required. Suppose now that d ≥ 2 and that k > 0; we then have the recurrence relations

R(k, k, d) ≤ R(k−1, k, d) + 2d·Rs(k, k) + R*(k, k, d), and
R*(k, k, d) ≤ R*(k−1, k, d) + (2d − 1)·Rs(k, k) + R(k, k, d−1),

where Rs(i, j) denotes the number of routing steps required to sort numbers lying (one number per processor) in a mesh of size 2^i × 2^j. By expanding the first term on the right-hand side of each inequality, we get

R(k, k, d) ≤ 2d·Σ_{i=1..k} (Rs(i−1, i) + Rs(i, i)) + Σ_{i=1..k} (R*(i−1, i, d) + R*(i, i, d))
          ≤ 4d·Σ_{i=1..k} Rs(i, i) + 2·Σ_{i=1..k} R*(i, i, d),

and

R*(k, k, d) ≤ 4d·Σ_{i=1..k} Rs(i, i) + 2·Σ_{i=1..k} R(i, i, d−1).
By inserting the last inequality into the inequality for R(k, k, d), we obtain

R(k, k, d) ≤ 4d·Σ_{i=1..k} Rs(i, i) + 8d·Σ_{i=1..k} (k + 1 − i)·Rs(i, i) + 4·Σ_{i=1..k} (k + 1 − i)·R(i, i, d−1)
          = Rss(k, d) + 4·Σ_{i=1..k} (k + 1 − i)·R(i, i, d−1),
where we introduce Rss(k, d) to denote the value of the sums that involve Rs. Expansion of this inequality gives us

R(k, k, d) ≤ Rss(k, d) + 4·Σ_{i1=1..k} (k + 1 − i1)·Rss(i1, d−1)
  + 4^2·Σ_{i1=1..k} (k + 1 − i1)·Σ_{i2=1..i1} (i1 + 1 − i2)·Rss(i2, d−2) + ···
  + 4^(d−2)·Σ_{i1=1..k} (k + 1 − i1)·Σ_{i2=1..i1} (i1 + 1 − i2) ··· Σ_{i_{d−2}=1..i_{d−3}} (i_{d−3} + 1 − i_{d−2})·Rss(i_{d−2}, 2)
  + 4^(d−1)·Σ_{i1=1..k} (k + 1 − i1)·Σ_{i2=1..i1} (i1 + 1 − i2) ··· Σ_{i_{d−1}=1..i_{d−2}} (i_{d−2} + 1 − i_{d−1})·R(i_{d−1}, i_{d−1}, 1).
To evaluate the right-hand side of this inequality, we note the fact that

Σ_{i=1..k} (k + 1 − i)·2^i ≤ 4·2^k.

This implies that Rss(k, d) is O(d·2^k). Moreover, by repeated use of this fact we see that R(k, k, d) is O(2^k·Σ_{i=0..d−1} (d − i)·16^i), which is O(16^d·2^k).
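The stated fact is easy to verify numerically; in fact the sum evaluates exactly to 2^(k+2) − 2k − 4 (a quick check we add for the reader, not part of the original proof):

```python
def weighted_sum(k):
    """Compute sum_{i=1}^{k} (k + 1 - i) * 2^i."""
    return sum((k + 1 - i) * 2**i for i in range(1, k + 1))

# closed form and the bound 4 * 2^k, checked for small k
for k in range(1, 25):
    assert weighted_sum(k) == 2**(k + 2) - 2*k - 4
    assert weighted_sum(k) <= 4 * 2**k
```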
Let us finally consider how many additional routing steps are required when r > 0. Let the number of additional routing steps done by procedures range_search and range_search* on a mesh of size 2^i × 2^j be denoted by R+(i, j, d) and R*+(i, j, d), respectively. It is then easy to see that R+(k, k, 1) is O(r·2^k). For d ≥ 2 and k > 0, we have the recurrence relations

R+(k, k, d) ≤ R+(k−1, k, d) + R*+(k, k, d) + r·Rm(k, k), and
R*+(k, k, d) ≤ R*+(k−1, k, d) + R+(k, k, d−1) + r·Rm(k, k),

where Rm(i, j) denotes the number of routing steps required to move numbers (one number per processor) from one set of processors to another set of processors in a mesh of size 2^i × 2^j. These recurrence relations are similar to the previous ones. Together with the fact that Rm(k, k) is O(2^k), this implies that R+(k, k, d) is O(r·16^d·2^k). □
5 Experimental Evaluation
So far we have described and analyzed our algorithms at a theoretical
level. To understand better how the algorithms work in practice, we
have also implemented them on a MasPar MP-1 computer.
The MasPar MP-1 consists of an array control unit and a processor array. The array control unit controls the processor array, and the interaction between the front-end computer and the processor array. In addition, the array control unit performs operations on scalar data. On the machine that we have access to, the processor array consists of a total of 16,384 processors arranged in a two-dimensional array of 128 rows and columns. Each processor in the processor array is a 1.8-MIPS processor with forty 32-bit registers and 16 or 64 kilobytes of RAM. Communication between two processors in the processor array can be via X-net or Global Router, where X-net communications are restricted to be either horizontal, vertical, or diagonal. The Global Router allows communication between any pair of processors, but its efficiency is very data-dependent: if many processors want to communicate with the same processor, the performance deteriorates dramatically.
The MasPar can be programmed either in MPL or Fortran, where MPL is based on ANSI C with extensions for data parallelism. Since MPL allows direct control over the machine, we have used MPL. As already mentioned in Section 2, MPL also offers some degree of flexibility, such as addressing autonomy.

In our implementation we have used the library functions provided with MPL as much as possible. More specifically, library functions have been used for the global sum operations in the parallel cell method.
We have consistently used the Global Router for concurrent read/write operations. In only one case has this turned out to be problematic: in Step 6(b) of the parallel cell method it may happen that many processors first copy the first and last values from the same processor, and then continue to copy point coordinates from the same processors. We avoid this by using randomization. With each hyperrectangle q we associate a random number, r(q), such that 0 ≤ r(q) ≤ 1. This number is used to modify the order in which cells and points are processed by a hyperrectangle.
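One plausible realization of this randomization (our own sketch; the implementation's exact scheme is not spelled out above) is to turn r(q) into a starting offset and rotate the order in which each hyperrectangle visits its cells:

```python
def rotated_order(cells, r_q):
    """Start a hyperrectangle's scan at offset floor(r_q * len(cells))
    instead of at the first cell, so that concurrent reads issued by
    different queries are spread over different processors; 0 <= r_q <= 1."""
    if not cells:
        return []
    off = min(int(r_q * len(cells)), len(cells) - 1)
    return cells[off:] + cells[:off]
```

Each hyperrectangle q would draw r(q) once, e.g. with random.random(), and could apply the same idea to the order in which points within a cell are copied.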
In both algorithms we need to sort records of data. There is no library function for this, but in the parallel cell method we can sort by using a ranking function, and then the Global Router to move each record to its correct location. For each active processor, the ranking function computes the rank of the value of a local variable. This approach does not work in the parallel divide-and-conquer method, since we then need to sort records within submeshes simultaneously. We have instead implemented sorting routines based on bitonic sort [12].
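For reference, the comparator pattern of bitonic sort can be sketched sequentially as follows (our own illustration; on the mesh, each compare-exchange becomes a communication between the processors holding the two records):

```python
def bitonic_sort(a):
    """Bitonic sort for a power-of-two number of keys, executed here as a
    sequential loop over the comparator network's stages."""
    n = len(a)
    assert n & (n - 1) == 0, "length must be a power of two"
    a = list(a)
    k = 2
    while k <= n:                      # size of the bitonic sequences
        j = k // 2
        while j > 0:                   # compare-exchange distance
            for i in range(n):
                partner = i ^ j
                if partner > i:
                    ascending = (i & k) == 0
                    if (a[i] > a[partner]) == ascending:
                        a[i], a[partner] = a[partner], a[i]
            j //= 2
        k *= 2
    return a
```

Every processor is busy in every stage, which is what makes the network attractive on an SIMD machine.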
We have evaluated the algorithms for two kinds of two-dimensional input data. For both kinds of input there are 8192 points and 8192 equal-sized hypersquares. For the uniform kind of input, points and squares are chosen at random from the unit square. For the diagonal kind of input, points and squares are chosen at random along the diagonal of the unit square (i.e., the diagonal of each square coincides with the diagonal of the unit square). The width of the squares is in each case chosen such that each square contains four points on average.
The running times (in milliseconds (ms)) are as follows. The
running time of the parallel cell method is 55 ms for uniform input,
and 205 ms for diagonal input. For the parallel divide-and-conquer
algorithm the corresponding running times are 1291 ms and 1207 ms.
The parallel cell method is thus much faster than the parallel divide-and-conquer method for both kinds of input. Although the diagonal kind of input is not a worst-case input for the cell method, it is still fairly "bad": on average we must test each square against at least 90 points.
A substantial part of the running time of the parallel divide-and-
conquer algorithm is used for sorting records. The sorting algorithm
that we have implemented is asymptotically optimal, but it is likely
that a more careful implementation could make it run faster. It is
thus possible that we can improve the running time of the parallel
divide-and-conquer algorithm considerably.
To compare our algorithms with sequential algorithms for range searching, we have implemented the sequential cell method on our front-end machine, a DECstation 5000 (Model 200, 25 MHz). The parallel cell method is 13–15 times faster than the sequential cell method for both kinds of input. The speedup is thus not very impressive. Partly this is due to unavoidable communication costs. Another reason is that the front-end computer is more powerful in terms of floating-point operations: our measurements show that a point inclusion test (that is, to test if a point lies within a hyperrectangle) is about forty times faster on the front-end computer than on the MasPar MP-1.
6 Conclusions

We have presented two algorithms for batched range searching on a mesh: one algorithm based on cell division, and another algorithm based on divide-and-conquer. The divide-and-conquer algorithm takes O((r + 1)·16^d·√n) time, where r is the maximum number of points contained in any hyperrectangle. We can show that if some constant independent of n bounds r, and the points contained in a hyperrectangle must be stored in the processor that initially contained the hyperrectangle, then any algorithm must take Ω(d(r + 1)√n) time in the worst case. For a fixed dimension d, the divide-and-conquer algorithm is thus worst-case optimal (within a multiplicative constant). The cell method takes O((d + w_max)√n) time, where w_max = max{s(q) + d·t(q) : q ∈ Q}, s(q) is the number of cells intersected by the hyperrectangle q and t(q) is the number of points tested for inclusion in q. Thus, this method may take Ω(dn√n) time even when r = 0. However, as shown by Corollary 1 and our experimental results, the cell method may outperform the divide-and-conquer method in practice.
We require that both algorithms store copies of the points contained within a hyperrectangle in the processor that initially contained the hyperrectangle. An alternative would have been to design load-balanced algorithms, which store the output evenly distributed over the processors. However, such algorithms would be forced to spend time on various load-balancing activities. At least for the applications that we consider, it is likely that such algorithms would be slower than the cell method.
References

[1] S.G. Akl. The Design and Analysis of Parallel Algorithms. Prentice-Hall International, London, UK, first edition, 1989.

[2] M.J. Atallah. Parallel techniques for computational geometry. Proc. IEEE, 80(9):1435–1448, 1992.

[3] J.L. Bentley and J.H. Friedman. Data structures for range searching. Computing Surveys, 11:397–409, 1979.

[4] P-O. Fjällström, J. Petersson, L. Nilsson, and Z-H. Zhong. Evaluation of range searching methods for contact searching in mechanical engineering. To appear in Int. J. Computational Geometry & Applications.

[5] C.S. Jeong and D.T. Lee. Parallel geometric algorithms on a mesh-connected computer. Algorithmica, 5:155–177, 1990.

[6] M. Kumar and D.S. Hirschberg. An efficient implementation of Batcher's odd-even merge algorithm and its application in parallel sorting schemes. IEEE Transactions on Computers, C-32(3):254–264, March 1983.

[7] D. Nassimi and S. Sahni. Bitonic sort on a mesh-connected parallel computer. IEEE Transactions on Computers, C-28(1):2–7, January 1979.

[8] D. Nassimi and S. Sahni. Data broadcasting in SIMD computers. IEEE Transactions on Computers, C-30(2):101–107, February 1981.

[9] S-J. Oh and M. Suk. Parallel algorithms for geometric searching problems. In Proc. Supercomputing '89, pages 344–350, 1989.

[10] F.P. Preparata and M.I. Shamos. Computational Geometry: An Introduction. Springer-Verlag, New York, NY, second edition, 1985.

[11] R. Sedgewick. Algorithms. Addison-Wesley Publishing Company, Reading, MA, second edition, 1988.

[12] C.D. Thompson and H.T. Kung. Sorting on a mesh-connected parallel computer. Communications of the ACM, 20(4):263–271, 1977.