Paper No. 1068, Proc. 8th East Asia-Pacific Conference on
Structural Engineering and Construction (EASEC-8),
Singapore, December 5-7, 2001.
A PARALLEL IMPLEMENTATION OF THE ELEMENT-FREE
GALERKIN METHOD
W. Barry1 and T. Vacharasintopchai2
ABSTRACT : This work focuses on the application of parallel processing to element-free Galerkin method
analyses, particularly in the formulation of the stiffness matrix, the assembly of the system of discrete equations,
and the solution for nodal unknowns. The objective is to significantly reduce the analysis time while retaining
high efficiency and accuracy. Several relatively low-cost Intel Pentium-based personal computers are joined
together to form a parallel computer. The processors communicate via a local high-speed network using the
Message Passing Interface. Load balancing is achieved through the use of a dynamic queue server that assigns
tasks to available processors. Benchmark problems in 3D structural mechanics are analyzed to demonstrate that
the parallelized computer program can provide substantially shorter run time than its serial counterpart, without
loss of solution accuracy.
KEYWORDS : meshless method, parallel processing, element-free Galerkin method, EFGM, queue server,
Beowulf, solid mechanics
1. INTRODUCTION
In performing the finite element analysis of structural components, meshing, which is the process of
discretizing the problem domain into small sub-regions or elements with specific nodal connectivities,
can be a tedious and time-consuming task. Although relatively simple geometric configurations may
be meshed automatically, complex configurations often require manual preparation of the mesh. The
element-free Galerkin method (EFGM), one of the recently developed meshless
methods, avoids the need for meshing by employing a moving least-squares (MLS) approximation for
the field quantities of interest. With EFGM, the discrete model of the problem domain is completely
described by nodes and a description of the problem domain boundary. This is a particular advantage
for problems involving propagating cracks or large deformations since no remeshing is required at
each step of the analysis. Detailed formulations of the MLS approximation functions and the
application of EFGM to problems in solid mechanics may be found in [1].
However, the advantage of avoiding the requirement of a mesh does not come cheaply, as EFGM is
much more computationally expensive than the finite element method (FEM). The increased
computational cost is especially evident for three-dimensional and non-linear applications of the
EFGM, due to the usage of MLS shape functions, which are formulated by a least-squares procedure
at each integration point. This computational expense is the predominant drawback of the EFGM.
Parallel processing has long been an available technique to improve the performance of scientific
computing programs. Typically, a parallel computer program employs the ‘divide and conquer’
1 Asian Institute of Technology, Thailand, Assistant Professor
2 Asian Institute of Technology, Thailand, Graduate Student
paradigm [2], which involves the partitioning of a large task into several smaller tasks that are then
assigned to available computer processors. Efficient load balancing ensures that all processors are
busy working on assigned tasks as long as there are unfinished tasks. The most common approach
taken in computational mechanics is domain decomposition [3], a method of static load balancing in
which the tasks are identified prior to the analysis and assigned to each processor, along with any data
that may be required. Due to the complex nodal connectivities that arise in the EFGM, domain
decomposition may not be the most efficient approach, and thus a dynamic class of load balancing
based on the concept of a queue server is employed in this work.
2. THE AIT BEOWULF
The effort to deliver low-cost, high-performance computing platforms to scientific communities has
been on-going for many years. A network of personal computers is attractive for this type of use since
it has the same architecture as a distributed memory multi-computer system [4]. Many research groups
have assembled commodity off-the-shelf PC’s and fast LAN connections to build parallel computers.
Parallel computers of this type, termed Beowulf computers after the NASA project of the same name
[5], are suitable for coarse-grained applications that are not communication intensive because of the
high communication start-up time and the limited bandwidth associated with the underlying network
architectures [6].
The AIT Beowulf, a four-node Beowulf-class parallel computer, was assembled based on the
guidelines in [5] and [7]. Red Hat Linux 6.0, including both the server and workstation operating
system packages, was installed on each node. The AIT Beowulf is a message-passing multiple-
instruction, multiple-data (MIMD) machine, and thus a message-passing infrastructure is needed.
The mpich library [8], the most widely used free implementation of the Message Passing Interface,
was chosen for the AIT Beowulf. Meschach, a powerful matrix computation library [9], is employed
for serial matrix operations that are performed on each processor.
3. THE QUEUE SERVER
Load balancing has a crucial role in the performance of parallel software. If unbalanced workloads are
assigned to the processors, some may finish their work and be forced to wait for the other processors
to finish, leading to reduced efficiency and increased run-times. In this work, a dynamic load-
balancing agent named Qserv is developed within the framework of the EFGM. Qserv balances the
computational load among the processors in the AIT Beowulf during run-time by acting as a clerk that
directs the queued tasks to the available processors. When one processor finishes a task, it requests
another task from Qserv, which continues assigning the tasks to processors until no unfinished tasks
remain.
Figure 1 presents a flowchart of the queue server designed and implemented in the current work. To
separate the dynamic workload allocation from normal operations, the communication between Qserv
and the processors is done through the UNIX socket concept developed at the University of California
at Berkeley [4]. When the Qserv process is initiated, it creates a socket that allows the processors to
simultaneously connect. Initially, the number of total unprocessed subtasks known to Qserv is zero,
and one processor, usually the master processor, must inform Qserv of the actual value. This number
is stored in the max_num variable and can be altered by processors through the SET_MAX_NUM
request. A processor can ask Qserv, through the GET_NUM request, for a subtask to work on. It will be
assigned the numerical identifier of an unprocessed subtask, ranging from zero to max_num. When
the unprocessed subtasks are exhausted, an ALL_DONE signal will be sent to acknowledge the
requesting processor. During the execution of Qserv, a process can also reset the subtask identifier
counter by the RESET_COUNTER request. Qserv will continue serving tasks to processors until the
TERMINATE signal is received.
[Figure 1 depicts the main loop of the queue server. After initializing the socket, Qserv sets
runstate = READY, count = 0, and max_num = 0, then loops: it accepts client connection requests,
receives each client's request_msg, and processes it (TERMINATE sets runstate = TERMINATE;
RESET_COUNTER sets count = 0; SET_MAX_NUM stores the new max_num from the client; GET_NUM sends
'count' to the client and increments it when count <= max_num, otherwise sends ALL_DONE). Client
connections are closed and max_fd updated as clients depart, and the loop continues until
runstate = TERMINATE, whereupon the socket is closed. Legend: fd = current client identifier;
max_fd = number of client connections maintained; runstate = run state of the server program;
count = current counter value; max_num = maximum counter value; request_msg = current client's
request message.]
Figure 1: Flowchart of the Queue Server
4. SOFTWARE IMPLEMENTATION
When a parallel program is run, each parallel processor will have one copy of the executable program,
termed a process. One process is assigned as the master process while the remaining processes are
worker processes. The MPI default process identifier of the master is 0. In addition to performing the
basic tasks of a worker process, the master process performs additional work involved with
coordinating the tasks among all the workers. Therefore the master process is assigned to run on the
server node, which is the most powerful processor, in terms of both processor speed and core
memory, in the AIT Beowulf.
A flowchart of the main process computer code for both the master and worker nodes is
presented in Figure 2. The analysis procedures can be grouped into five phases, namely, the pre-
processing phase, the stiffness matrix formulation phase, the force vector formulation phase, the
solution phase, and the post-processing phase. A custom-made parallel Gaussian elimination equation
solver, developed based on the algorithm presented in [10], is employed in the solution phase, since
the available public-domain parallel equation solvers are typically efficient only for banded, sparse
matrices and are therefore ill-suited to the dense EFGM global stiffness matrix.
[Figure 2 shows the master and worker flowcharts side by side. The master process runs dd_input to
process the input file and broadcasts the processed input data, which the worker processes receive.
All processes connect to the queue server. ddefg_stiff forms the stiffness matrix on the master and
the workers, with the contributions gathered to the master, which also forms the concentrated load
vector. ddforce forms the distributed load vector on all processes (gathered), after which the
master assembles the global force vectors. master_ddsolve applies the boundary conditions and
collaborates with worker_ddsolve to solve the equations, and the master writes the nodal
displacements to the output file. ddpost post-processes for the desired displacements and stresses
on all processes (gathered), the master writes the post-processed results to the output file, and
all processes disconnect from the queue server.]
Figure 2: Flowcharts of the Master and Worker Modules
5. NUMERICAL RESULTS
Several 3D, elastostatic examples are solved to illustrate the performance and to verify the validity of
the parallel EFGM analysis code. The results obtained for each analysis closely matched the
analytical solutions [11], as shown in previous serial EFGM works [1]. Thus, the main focus of these
numerical examples is to investigate the run-time and efficiency of the parallel implementation of the
EFGM. Four test cases, with increasing numbers of degrees of freedom, are analyzed using parallel
processor counts ranging from one to four. The specific test cases are: 1) linear displacement
patch test (336 d.o.f.); 2) cantilever beam with end loading (825 d.o.f.); 3) pure bending of a
thick arch (975 d.o.f.); and 4) perforated tension strip (2850 d.o.f.). The speedup of the overall
solution process, of the computation and assembly of the global stiffness matrix, and of the
solution of the discrete system of equations are shown in Figures 3 to 5, respectively. When the
number of degrees of freedom is less than 1,000, Figure 4 shows that the speedup of the stiffness
matrix formulation phase gradually approaches the theoretical limit, which is equal to the number
of processors used in the analysis. However, the speedup begins to decrease when the number of
degrees of freedom exceeds 1,000, apparently due to the onset of memory page file swapping on each
processor. This may occur because the current implementation requires full storage of the global
stiffness matrix on each processor. Figure 5 shows that the optimal points, in terms of speedup,
for the parallel Gaussian elimination solver are near 350, 550, and 600 equations for two, three,
and four processors, respectively. When the number of equations is greater than 1,000, the speedup
of the solver also begins to decrease, possibly for the same reason as in the stiffness matrix
formulation phase, namely the onset of memory page file swapping. Hence, it can be concluded that
the current implementation is scalable up to 1,000 degrees of freedom.
[Figure 3: Overall Speedup of the EFGM Analysis Code (overall speedup versus degrees of freedom,
0 to 3000, for one to four processors, NP1 to NP4).]
[Figure 4: Speedup of the Stiffness Computation Module (stiffness speedup versus degrees of
freedom, 0 to 3000, for NP1 to NP4).]
[Figure 5: Speedup of the Gaussian Elimination Solver (solver speedup versus degrees of freedom,
0 to 3000, for NP1 to NP4).]
6. CONCLUSION
The AIT Beowulf, a high-performance yet low-cost parallel computer, was assembled from a network
of commodity personal computers. A parallel implementation of the element-free
Galerkin method was developed on this platform. Four desired properties of parallel software, which
are concurrency, scalability, locality, and modularity, were taken into account during the design of the
parallel version of the element-free Galerkin method. A dynamic load-balancing algorithm was
utilized for the computation of the structural stiffness matrix and external force vector and a parallel
Gaussian elimination algorithm was employed in the solution for the nodal unknowns
(displacements). Several numerical examples showed that the displacements and stresses obtained
from the parallel implementation closely matched the analytical solutions and exactly matched
solutions obtained by the sequential element-free Galerkin method software. With Qserv, a dynamic
load-balancing algorithm, high scalability was obtained for the three-dimensional structural
mechanics problems up to approximately 1,000 degrees of freedom. However, scalability was not
achieved for larger problems, due to the requirement of full stiffness matrix storage on each processor
while only 64 megabytes of memory was available on each worker node. The parallel Gaussian
elimination equation solver took less time to solve the system of equations than its sequential
counterpart. With larger systems of equations, the efficiency of the parallel equation solver tended to
increase because of the increased computation-to-communication ratio. Nevertheless, in the current
implementation of the parallel EFGM analysis code, when the number of equations was more than
1,000, high efficiency was not obtained. Refinement of the memory management algorithms is
recommended so that the parallel EFGM analysis code may be scalable for problem sizes much larger
than 1,000 degrees of freedom.
REFERENCES
[1] T. Belytschko, Y. Krongauz, D. Organ, M. Fleming, and P. Krysl, “Meshless methods: An
overview and recent developments”, Computer Methods in Applied Mechanics and
Engineering, Vol. 139, No. 1-4, pp. 3-47, 1996.
[2] H. Adeli and O. Kamal, Parallel Processing in Structural Engineering, Elsevier Science
Publishers Ltd., U.K., 1993.
[3] K. T. Danielson, S. Hao, W. K. Liu, A. Uras, and S. Li, “Parallel computation of meshless
methods for explicit dynamic analysis”, Accepted for publication in International Journal for
Numerical Methods in Engineering, 1999.
[4] C. Brown, UNIX Distributed Programming, Prentice Hall International (UK) Limited, UK, 1994.
[5] P. Merkey, “Beowulf: Introduction & overview”, Center of Excellence in Space Data and
Information Sciences, University Space Research Association, Goddard Space Flight Center,
Maryland, USA, September 1998, URL:http://www.beowulf.org/intro.html.
[6] M. Baker and R. Buyya, “Cluster computing: The commodity supercomputer”, Software—Practice
and Experience, Vol. 29, No. 6, pp. 551-576, 1999.
[7] J. Radajewski and D. Eadline, “Beowulf HOWTO”, November 1998,
URL:http://www.linux.org/help/ldp/howto/Beowulf-HOWTO.html.
[8] W. Gropp and E. Lusk, User's Guide for mpich, a Portable Implementation of MPI, Technical
Report ANL-96/6, Argonne National Laboratory, USA, 1996.
[9] D. E. Stewart and Z. Leyk, Meschach: Matrix Computations in C, Proceedings of the Center for
Mathematics and Its Applications, Vol. 32, Australian National University, 1994.
[10] V. Kumar, A. Grama, A. Gupta, and G. Karypis, Introduction to Parallel Computing: Design and
Analysis of Algorithms, The Benjamin/Cummings Publishing Company, Inc., USA, 1994.
[11] S. P. Timoshenko and J. N. Goodier, Theory of Elasticity, 3rd ed., McGraw-Hill, 1970.